📊 ArXiv Research Report (2026-04-16)

Generated: 2026-04-16 09:25:50 | Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Pass-score formula: 5.0 + 0.8 × total weight
Current pass score: 26.6
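
The pass score follows directly from the formula and the keyword weights. A minimal sketch of the arithmetic, assuming all 27 keywords carry weight 1.0 as listed in the table; the function name `pass_threshold` is illustrative and is not taken from the report generator:

```python
def pass_threshold(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass score = base + factor * total keyword weight (formula from the report)."""
    return base + factor * total_weight

# 27 keywords, each with weight 1.0, as listed in the configuration table.
weights = [1.0] * 27
threshold = round(pass_threshold(sum(weights)), 1)
print(threshold)  # 26.6
```

Because the threshold grows with the total weight, adding keywords raises the bar a paper must clear.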

📈 Paper statistics

Total fetched: 341 papers
Passing papers: 10 (2.9%)

⭐ Detailed analysis of passing papers

1. QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence

Authors: Zhichao Lin, Zhichao Liang, Gaoqiang Liu, Meng Xu, Baoyu Xiang, Jian Xu, Guanjun Jiang | Journal/Source: arxiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12867v1

Score: 75.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

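Scoring rationale: QuarkMedSearch studies deep search agents for the medical domain and builds on the foundation model Tongyi DeepResearch. The relevant keywords are: 1) “Large Language Models” (weight 1.0): the paper explicitly builds on "agentic foundation models", an application of large models, scored 10; 2) “Post-training” (weight 1.0): the two-stage SFT and RL training strategy is the core method, scored 10; 3) “RLHF” (weight 1.0): the paper mentions RL training, which is related to RLHF but not central, scored 5; 4) “Chain of Thought” (weight 1.0): deep search involves multi-step reasoning, scored 5; 5) “System 2 Thinking” (weight 1.0): deep search requires in-depth reasoning, scored 5; 6) “Self-Correction” (weight 1.0): the paper emphasizes reflection capabilities, scored 10; 7) “LLM Agents” (weight 1.0): the paper is agent research, scored 10; 8) “Tool Use” (weight 1.0): the paper involves tool invocation, scored 10; 9) “AI for Science” (weight 1.0): a medical AI application, scored 10. Other keywords such as MoE, quantization, and RAG do not appear in the abstract and are scored 0.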
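
As a cross-check, the total of 75.0 can be reproduced by summing weight × relevance over the keyword rows. A minimal sketch, assuming each per-keyword score is weight × relevance capped at the configured maximum of 15; the capping rule is an assumption, since the report only states the maximum, and `paper_score` is an illustrative name:

```python
MAX_PER_KEYWORD = 15.0  # "maximum score per keyword" from the scoring settings

def paper_score(rows):
    """Sum weight * relevance over (weight, relevance) rows, capping each term."""
    return sum(min(w * r, MAX_PER_KEYWORD) for w, r in rows)

# Non-zero rows for QuarkMedSearch, copied from the score table: (weight, relevance).
rows = [(1.0, 10.0), (1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 5.0),
        (1.0, 10.0), (1.0, 10.0), (1.0, 10.0), (1.0, 10.0)]
total = paper_score(rows)
print(total, total >= 26.6)  # 75.0 True
```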

!!! tip deepseek-chat TL;DR

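To address the scarcity of deep search training data and the difficulty of improving performance in the medical domain, the paper proposes the QuarkMedSearch agent. It constructs training data from a medical knowledge graph combined with online exploration, uses a two-stage SFT and RL training strategy to strengthen planning and reflection, and achieves state-of-the-art performance among open-source models of comparable scale on its self-built benchmark.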

Abstract

As agentic foundation models continue to evolve, how to further improve their performance in vertical domains has become an important challenge. To this end, building upon Tongyi DeepResearch, a powerful agentic foundation model, we focus on the Chinese medical deep search scenario and propose QuarkMedSearch, systematically exploring a full-pipeline approach spanning medical multi-hop data construction, training strategies, and evaluation benchmarks to further push and assess its performance upper bound in vertical domains. Specifically, for data synthesis, to address the scarcity of deep search training data in the medical domain, we combine a large-scale medical knowledge graph with real-time online exploration to construct long-horizon medical deep search training data; for post-training, we adopt a two-stage SFT and RL training strategy that progressively enhances the model’s planning, tool invocation, and reflection capabilities required for deep search, while maintaining search efficiency; for evaluation, we collaborate with medical experts to construct the QuarkMedSearch Benchmark through rigorous manual verification. Experimental results demonstrate that QuarkMedSearch achieves state-of-the-art performance among open-source models of comparable scale on the QuarkMedSearch Benchmark, while also maintaining strong competitiveness on general benchmarks.

Keywords: Medical Deep Search, Agentic Foundation Models, Two-stage SFT and RL Training, Long-horizon Search, Tool Invocation, Reflection Capabilities, Medical Knowledge Graph, QuarkMedSearch Benchmark

2. AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognit

Authors: Zeheng Wang, Zitong Yu, Yijie Zhu, Bo Zhao, Haochen Liang, Taorui Wang, Wei Xia, Jiayu Zhang, Zhishu Liu, Hui Ma, Fei Ma, Qi Tian | Journal/Source: arxiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12735v1

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

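Scoring rationale: The paper studies LLM-based multimodal emotion recognition and proposes the AffectAgent framework, which comprises three collaborating agents (query planner, evidence filter, emotion generator) optimized with Multi-Agent Proximal Policy Optimization (MAPPO). Highly relevant keywords: LLMs (the core foundation), MoE (the proposed MB-MoE method), RAG (a retrieval-augmented generation framework), and LLM Agents and Multi-agent Systems (a collaborative multi-agent framework). Moderately relevant keywords: Chain of Thought and System 2 Thinking (the analytical reasoning process) and Hallucination Mitigation (the paper addresses hallucination). Other keywords such as SLMs, Scaling Laws, and Pre-training are not covered by the paper.

!!! tip deepseek-chat TL;DR

To address hallucination and modal ambiguity in LLM-based multimodal emotion recognition, the paper proposes AffectAgent, a collaborative multi-agent retrieval-augmented generation framework. Through joint optimization of three specialized agents and the MB-MoE and RAAF modules, it achieves superior performance on the MER-UniBench benchmark.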

Abstract

LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. In this paper, given that single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end using Multi-Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF), where MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: https://github.com/Wz1h1NG/AffectAgent.

Keywords: AffectAgent, multi-agent reasoning, retrieval-augmented generation, multimodal emotion recognition, Mixture of Experts, LLM-based, collaborative decision-making, affective understanding

3. LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

Authors: Haocheng Xi, Harman Singh, Yuezhou Hu, Coleman Hooper, Rishabh Tiwari, Aditya Tomar, Minjae Lee, Wonjun Kang, Michael Mahoney, Chenfeng Xu, Kurt Keutzer, Amir Gholami | Journal/Source: arxiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.12056v1

Score: 46.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	10.0/10	10.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

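Scoring rationale: The paper proposes LoSA, which targets the memory bottleneck of block-wise diffusion language models (DLMs) in long-context scenarios. It exploits the stability of token hidden states to design a locality-aware sparse attention mechanism. The method centers on sparse models (MoE/Sparse Models) and attention optimization (KV Cache Compression/Linear Attention/FlashAttention), addresses efficiency in long-context processing (Long Context LLMs), and delivers inference acceleration (Inference Acceleration). Although the paper does not use the term "LLMs" directly, DLMs are a variant of large language models, so that keyword receives less than full relevance (8/10). Other keywords, such as small models, training methods, alignment, and agents, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper addresses the attention-induced memory bottleneck of block-wise diffusion language models in long-context scenarios with Locality-aware Sparse Attention (LoSA). LoSA reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens, which substantially reduces the number of KV indices that must be loaded and yields up to 4.14x attention speedup while preserving near-dense accuracy.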

Abstract

Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they still remain bottlenecked by memory-bound attention in long-context scenarios. Naive sparse attention fails on DLMs due to a KV Inflation problem, where different queries select different prefix positions, making the union of accessed KV pages large. To address this, we observe that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while the majority of stable tokens remain nearly constant. Based on this insight, we propose LOSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens. This substantially shrinks the number of KV indices that must be loaded, yielding both higher speedup and higher accuracy. Across multiple block-wise DLMs and benchmarks, LOSA preserves near-dense accuracy while significantly improving efficiency, achieving up to +9 points in average accuracy at aggressive sparsity levels while maintaining 1.54x lower attention density. It also achieves up to 4.14x attention speedup on RTX A6000 GPUs, demonstrating the effectiveness of the proposed method.

Keywords: Block-wise Diffusion Language Models, Sparse Attention, Long-context Scenarios, KV Inflation, Locality-aware Sparse Attention (LoSA), Attention Speedup, Memory-bound Attention, Inference Efficiency
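
The stable/active split described in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the relative-change threshold, the function names, and the per-query page selections are all hypothetical. It shows why restricting sparse attention to active tokens shrinks the union of KV pages that must be loaded.

```python
import math

def split_active_stable(prev_states, curr_states, rel_threshold=0.1):
    """Partition token indices by relative hidden-state change between two steps."""
    active, stable = [], []
    for i, (prev, curr) in enumerate(zip(prev_states, curr_states)):
        delta = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
        norm = math.sqrt(sum(p ** 2 for p in prev)) or 1.0
        (active if delta / norm > rel_threshold else stable).append(i)
    return active, stable

def kv_pages_to_load(selected_pages, query_indices):
    """Union of KV pages selected by the given queries (the source of KV inflation)."""
    pages = set()
    for i in query_indices:
        pages |= selected_pages[i]
    return pages

# Toy block of 4 tokens with 2-dimensional hidden states at two consecutive steps.
prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
curr = [[1.0, 0.01], [0.9, 0.2], [1.0, 1.0], [2.0, 0.05]]
# Hypothetical per-query page selections over the prefix.
selected = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]

active, stable = split_active_stable(prev, curr)
all_pages = kv_pages_to_load(selected, range(len(selected)))
active_pages = kv_pages_to_load(selected, active)
print(active, stable, len(all_pages), len(active_pages))  # [1] [0, 2, 3] 8 2
```

In this toy case only one token is active, so 2 pages are loaded instead of 8; the stable tokens would reuse their cached prefix-attention results.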

4. Latent-Condensed Transformer for Efficient Long Context Modeling

Authors: Zeng You, Yaofo Chen, Qiuwu Chen, Ying Sun, Shuhai Zhang, Yingjian Li, Yaowei Wang, Mingkui Tan | Journal/Source: arxiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12452v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	10.0/10	10.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

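Scoring rationale: The paper studies the efficiency bottlenecks LLMs face on long contexts (KV cache and computational complexity) and proposes Latent-Condensed Attention (LCA). Highly relevant keywords (10 points): “Large Language Models” (the paper explicitly targets LLMs), “Context Window Extension” (it addresses long-context modeling), and “KV Cache Compression” (the core contribution is reducing the KV cache). Moderately relevant keywords (5 points): “Mixture of Experts” (sparse methods are mentioned as background), “Quantization” (related to model efficiency optimization), and “Speculative Decoding” (falls under inference acceleration). The remaining keywords have no direct connection to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the linear growth of the KV cache and the quadratic complexity of self-attention when LLMs process long contexts, the paper proposes Latent-Condensed Attention, which achieves up to 2.5× prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.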

Abstract

Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA’s compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA’s latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Beyond MLA, LCA’s design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5× prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.

Keywords: Large Language Models, Long Context Modeling, KV Cache Reduction, Attention Mechanism, Efficiency Optimization, Latent-Condensed Attention, Computational Complexity, Inference Acceleration
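
The abstract mentions aggregating semantic latent vectors via query-aware pooling. One generic way to picture this is a softmax-weighted average within fixed-size groups, sketched below. This is an illustration only: the grouping scheme, the dot-product scoring, and the names `query_aware_pool` and `condense` are assumptions, and the paper's positional-key anchor selection is not modeled.

```python
import math

def query_aware_pool(query, latents):
    """Softmax-weighted average of latent vectors, weighted by similarity to the query."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in latents]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(latents[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, latents)) for d in range(dim)]
    return pooled, weights

def condense(query, latents, group_size):
    """Replace each group of `group_size` latent vectors with one pooled vector."""
    return [query_aware_pool(query, latents[i:i + group_size])[0]
            for i in range(0, len(latents), group_size)]

latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
condensed = condense([1.0, 0.0], latents, group_size=2)
print(len(latents), "->", len(condensed))  # 4 -> 2
```

Pooling with a group size of g stores one vector per g tokens, which is the kind of reduction that lets cache size and attention cost fall together.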

5. Agentic Discovery with Active Hypothesis Exploration for Visual Recognition

Authors: Jaywon Koo, Jefferson Hernandez, Ruozhen He, Hanjie Chen, Chen Wei, Vicente Ordonez | Journal/Source: arxiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12999v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

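Scoring rationale: The paper proposes the HypoExplore framework, which uses a large language model (LLM) as the core component for generating and evaluating neural architecture hypotheses, an application of LLMs to scientific discovery (AI for Science). The framework is essentially an LLM-based autonomous agent system (LLM Agents) in which multiple agents (Multi-agent Systems) collaboratively analyze experimental results, and it involves hypothesis-driven reasoning (Chain of Thought/System 2 Thinking) and a self-improvement mechanism (Self-Correction). The paper focuses on visual architecture discovery and does not cover the technical details behind other keywords such as MoE, training techniques, or inference optimization.

!!! tip deepseek-chat TL;DR

The paper proposes HypoExplore, an agentic framework that uses a large language model to discover and refine neural network architectures for visual recognition in a hypothesis-driven manner. On CIFAR-10 the best architecture evolves from an 18.91% baseline to 94.11% accuracy, and independent runs on MedMNIST show its applicability to medical imaging.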

Abstract

We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance. We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.

Keywords: agentic framework, large language model, neural architecture discovery, hypothesis-driven, visual recognition, evolutionary branching, autonomous agents, AI for science
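
The dual strategy in the abstract balances exploiting validated principles with resolving uncertain ones. A hypothetical scoring rule makes that trade-off concrete. The blend below, the `explore_weight` parameter, and the example hypotheses are all assumptions made for illustration; the abstract does not specify how HypoExplore selects a parent hypothesis.

```python
def selection_score(confidence, explore_weight=0.5):
    """Blend exploitation (high confidence) with exploration (confidence near 0.5)."""
    uncertainty = 1.0 - abs(2.0 * confidence - 1.0)  # 1 at 0.5, 0 at 0 or 1
    return (1.0 - explore_weight) * confidence + explore_weight * uncertainty

def pick_parent(memory_bank, explore_weight=0.5):
    """Return the hypothesis name with the highest blended score."""
    return max(memory_bank,
               key=lambda name: selection_score(memory_bank[name], explore_weight))

# Hypothetical memory bank: hypothesis name -> confidence in [0, 1].
bank = {"depthwise convs help": 0.9, "wider stem helps": 0.5, "no normalization": 0.1}
print(pick_parent(bank, explore_weight=0.0))  # depthwise convs help
print(pick_parent(bank, explore_weight=1.0))  # wider stem helps
```

With pure exploitation the most trusted hypothesis is extended; with pure exploration the most uncertain one is tested next.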

6. Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

Authors: Keshu Wu, Chenchen Kuai, Zihao Li, Jiwan Jiang, Shiyu Shen, Shian Wang, Chan-Wei Hu, Zhengzhong Tu, Yang Zhou | Journal/Source: arxiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12185v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	15.0/10	15.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

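Scoring rationale: The paper's core contribution is a new RAG method, so it is highly relevant to "Retrieval-Augmented Generation" (15 points), and it applies directly to large language models (10 points). Its focus on order-sensitive reasoning is related to "Chain of Thought" and "System 2 Thinking" (8 points each), since it involves multi-step and in-depth reasoning. Other keywords such as MoE, quantization, and alignment are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the limitation that existing RAG methods treat retrieved evidence as an unordered set, the paper proposes Order-Aware Knowledge Hypergraph RAG (OKH-RAG), which models interaction order to recover coherent interaction trajectories and consistently outperforms permutation-invariant baselines on order-sensitive question answering and explanation tasks.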

Abstract

Retrieval-augmented generation (RAG) enhances large language models by grounding outputs in retrieved knowledge. However, existing RAG methods including graph- and hypergraph-based approaches treat retrieved evidence as an unordered set, implicitly assuming permutation invariance. This assumption is misaligned with many real-world reasoning tasks, where outcomes depend not only on which interactions occur, but also on the order in which they unfold. We propose Order-Aware Knowledge Hypergraph RAG (OKH-RAG), which treats order as a first-class structural property. OKH-RAG represents knowledge as higher-order interactions within a hypergraph augmented with precedence structure, and reformulates retrieval as sequence inference over hyperedges. Instead of selecting independent facts, it recovers coherent interaction trajectories that reflect underlying reasoning processes. A learned transition model infers precedence directly from data without requiring explicit temporal supervision. We evaluate OKH-RAG on order-sensitive question answering and explanation tasks, including tropical cyclone and port operation scenarios. OKH-RAG consistently outperforms permutation-invariant baselines, and ablations show that these gains arise specifically from modeling interaction order. These results highlight a key limitation of set-based retrieval: effective reasoning requires not only retrieving relevant evidence, but organizing it into structured sequences.

Keywords: Retrieval-augmented generation, RAG, Order-aware, Hypergraph, Reasoning, Sequence inference, Interaction trajectories, Precedence structure
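
The contrast between set-based retrieval and sequence inference can be shown with a toy example. In this minimal sketch the relevance term is the same for every ordering (it is permutation-invariant), so only a transition score can separate orderings. The hyperedge names, the scores, and the brute-force search are hypothetical; the paper learns its transition model from data and its inference procedure is not described in the abstract.

```python
from itertools import permutations

def best_order(hyperedges, relevance, transition):
    """Return the ordering maximizing relevance plus pairwise transition scores."""
    def score(order):
        total = sum(relevance[e] for e in order)  # identical for every ordering
        total += sum(transition.get((a, b), 0.0) for a, b in zip(order, order[1:]))
        return total
    return max(permutations(hyperedges), key=score)

# Hypothetical retrieved hyperedges for a tropical-cyclone question.
edges = ["landfall", "storm_forms", "port_closes"]
relevance = {"landfall": 0.9, "storm_forms": 0.7, "port_closes": 0.8}
# Hypothetical precedence scores: (earlier, later) -> plausibility.
transition = {("storm_forms", "landfall"): 1.0, ("landfall", "port_closes"): 1.0}
print(best_order(edges, relevance, transition))
# ('storm_forms', 'landfall', 'port_closes')
```

Brute force over permutations is only workable for a handful of hyperedges; it is used here purely to keep the example short.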

7. Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models

Authors: Md Tanvirul Alam | Journal/Source: arxiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.12119v1

Score: 36.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

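Scoring rationale: The paper studies semantic fixation in large vision-language models (VLMs), a contribution to understanding how large models work. Core relevant keywords: 1) “Large Language Models” (8 points): the paper explicitly studies VLMs, which fall within the scope of large language models; 2) “Post-training” (10 points): the core experiments involve post-training and examine how training on different rules affects performance; 3) “Instruction Tuning” (5 points): related to prompt interventions and rule alignment; 4) “Hallucination Mitigation” (5 points): the paper studies model fixation and errors, which relates to factuality and hallucination mitigation; 5) “Mechanistic Interpretability” (8 points): the paper studies the error mechanism through activation steering, which falls under explainable AI. Other keywords such as MoE, quantization, and inference acceleration are not addressed.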
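
The total in the score table above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the per-keyword score is taken to be weight × relevance, which is consistent with every row shown but is not documented in the report. The 15-point cap and the pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) come from the report's configuration section. The function names are illustrative.

```python
from typing import Iterable, Tuple


def total_score(rows: Iterable[Tuple[float, float]], max_per_keyword: float = 15.0) -> float:
    """Sum per-keyword scores, assuming score = weight * relevance, capped per keyword."""
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


# (weight, relevance) for the five non-zero rows of the table above;
# the remaining 22 keywords contribute 0.
rows = [(1.0, 8.0), (1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 8.0)]
score = total_score(rows)
threshold = pass_mark(27.0)
print(score, round(threshold, 1), score >= threshold)
```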

!!! tip deepseek-chat TL;DR

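The paper studies semantic fixation in large vision-language models, that is, the tendency to rely on default semantic priors rather than follow an alternative rule specified in the prompt. Using the VLM-Fix benchmark, it finds a performance gap between standard and inverse rules and shows that post-training alignment and activation steering can partially mitigate these errors.

Abstract (Chinese translation)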

大型视觉语言模型（VLMs）通常依赖于熟悉的语义先验，但现有评估方法未能清晰区分感知失败与规则映射失败。我们将这种行为定义为语义固着：即使提示指定了另一种同样有效的映射关系，模型仍会保持默认解释。为分离这一效应，我们提出了VLM-Fix基准测试，该测试基于四款抽象策略游戏构建，通过在成对的标准规则与逆向规则表述下评估相同的终局棋盘状态来实施控制。在14个开源与闭源VLM上的实验表明，模型在标准规则下的准确率始终优于逆向规则，揭示出显著的语义固着差距。提示干预实验验证了该机制：使用中性别名提示能大幅缩小逆向规则差距，而语义负载的别名提示则会重新扩大差距。模型的后训练过程表现出强烈的规则对齐特性：针对单一规则的训练能提升同规则迁移性能，但损害反规则迁移能力；而联合规则训练则能提升更广泛的迁移性能。为在合成游戏之外检验外部有效性，我们在VLMBias数据集上评估了类似的去熟悉化干预措施，并观察到相同的定性模式。最后，通过对模型深层激活向量进行定向干预，能够部分恢复性能下降，表明语义固着错误至少在深层表征中具备可编辑性。项目页面、代码及数据集详见https://maveryn.github.io/vlm-fix/。

Abstract

Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule-mapping failures. We study this behavior as semantic fixation: preserving a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate this effect, we introduce VLM-Fix, a controlled benchmark over four abstract strategy games that evaluates identical terminal board states under paired standard and inverse rule formulations. Across 14 open and closed VLMs, accuracy consistently favors standard rules, revealing a robust semantic-fixation gap. Prompt interventions support this mechanism: neutral alias prompts substantially narrow the inverse-rule gap, while semantically loaded aliases reopen it. Post-training is strongly rule-aligned: training on one rule improves same-rule transfer but hurts opposite-rule transfer, while joint-rule training improves broader transfer. To test external validity beyond synthetic games, we evaluate analogous defamiliarization interventions on VLMBias and observe the same qualitative pattern. Finally, late-layer activation steering partially recovers degraded performance, indicating that semantic-fixation errors are at least partly editable in late representations. Project page, code, and dataset available at https://maveryn.github.io/vlm-fix/.

Keywords: Large Vision-Language Models, Semantic Fixation, Post-training Alignment, Rule Mapping, Activation Steering, VLM-Fix Benchmark, Interpretability, Model Evaluation

8. Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score

Authors: Manh Nguyen, Sunil Gupta, Hung Le | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12196v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

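Scoring rationale: The paper's core problem is selecting the best answer after an LLM generates multiple candidates, for which it proposes the Radial Consensus Score method, so it is highly relevant to "Large Language Models" (10 points). It involves multi-agent debate and reasoning tasks, which gives it some relevance to "LLM Agents" and "Multi-agent Systems" (5 points each). The method aims to improve answer reliability, which gives it some relevance to "Hallucination Mitigation" (5 points). It is tested on long-form reasoning tasks, which gives it some relevance to "Chain of Thought" (5 points). Other keywords such as MoE, quantization, and training methods are not covered and score 0.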

!!! tip deepseek-chat TL;DR

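To address the challenge of choosing the most reliable answer among multiple LLM-generated candidates, the paper proposes Radial Consensus Score, a training-free geometric consensus scoring method that outperforms existing methods on multiple benchmarks and can serve as an effective replacement for majority voting.

Abstract (Chinese translation)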

大型语言模型（LLM）针对给定提示频繁生成多个候选回答，但从中选择最可靠的答案仍具挑战性，尤其在正确性与表面多数一致发生偏离时。现有方法（如自洽性）依赖于离散投票，而基于概率的方法往往无法捕捉候选答案间的关联，或倾向于低估高质量但出现频率较低的回应，且未能充分利用答案表征的几何结构。为应对这些局限，我们提出径向共识分数（RCS），这是一种简单、高效且无需训练的最佳答案选择方法。RCS通过计算答案嵌入的加权弗雷歇均值（语义中心），并依据各候选答案到该中心的径向距离进行排序，从而建模语义共识。重要的是，RCS提供了一个通用框架，支持多种加权方案，包括均匀加权、基于频率和基于概率的变体，能够在完全适用于黑盒场景的同时，灵活整合一致信号与模型置信度。在涵盖短问答和长推理任务的七个基准测试及五个开源权重模型上的大量实验表明，RCS各变体均持续优于强基线方法，且随着采样预算增加，其优势更为显著。RCS还可作为多智能体辩论中多数投票的有效即插即用替代方案，并在黑盒场景中展现出强大的鲁棒性。总体而言，这些结果凸显了几何共识作为一种可扩展且广泛适用的原则，能够实现可靠的答案选择，其应用范围超越了多数投票，延伸至LLM推理中更具表达力与鲁棒性的聚合机制。

Abstract

Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.

Keywords: Large Language Models, Best-of-N Selection, Radial Consensus Score, Answer Selection, Semantic Consensus, Multi-agent Debate, Reasoning Tasks, Black-box Scenarios
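
The abstract above describes RCS as computing a weighted Fréchet mean of answer embeddings and ranking candidates by their radial distance to it. The sketch below illustrates that recipe in plain Python under one explicit assumption: distances are Euclidean, in which case the weighted Fréchet mean reduces to the weighted average. The paper may use a different geometry or embedding model. The function name and the toy 2-D vectors are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def radial_consensus_rank(
    embeddings: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[int]:
    """Return candidate indices sorted from closest to farthest from the weighted center."""
    n = len(embeddings)
    if n == 0:
        return []
    if weights is None:
        weights = [1.0] * n  # uniform weighting variant
    total = sum(weights)
    dim = len(embeddings[0])
    # Assumption: Euclidean space, so the weighted Frechet mean is the weighted average.
    center = [
        sum(w * e[d] for w, e in zip(weights, embeddings)) / total
        for d in range(dim)
    ]
    distances = [math.dist(e, center) for e in embeddings]
    return sorted(range(n), key=lambda i: distances[i])


# Toy 2-D "embeddings" (invented): three answers cluster together, one is an outlier.
embeddings = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [-1.0, 2.0]]
ranking = radial_consensus_rank(embeddings)
print(ranking)
```

Passing answer frequencies or sequence probabilities as `weights` would correspond to the frequency-based and probability-based variants named in the abstract. The first index in `ranking` is the selected answer.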

9. Beyond Factual Grounding: The Case for Opinion-Aware Retrieval-Augmented Generation

Authors: Aditya Agrawal, Alwarappan Nakkiran, Darshan Fofadiya, Alex Karlsson, Harsha Aduri | Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.12138v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	15.0/10	15.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

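Scoring rationale: The paper's core topic is the limitations of RAG systems in handling subjective content, and it proposes an Opinion-Aware RAG architecture. It is highly relevant to "Retrieval-Augmented Generation" (15 points) because the whole paper revolves around RAG systems. It is relevant to "Large Language Models" (10 points) because it uses LLMs for opinion extraction. It has some relevance to "Hallucination Mitigation" (5 points) because it discusses bias and underrepresentation, touching on factuality and transparency risks. Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference optimization, agent systems, and model compression are not covered and score 0.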

!!! tip deepseek-chat TL;DR

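The paper finds that current RAG systems exhibit a factual bias and underperform on subjective content. It proposes an Opinion-Aware RAG architecture that uses LLM-based opinion extraction and opinion-graph-enriched indexing, and it reports substantially higher retrieval diversity on e-commerce forum data.

Abstract (Chinese translation)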

检索增强生成（RAG）系统已经改变了大型语言模型（LLM）获取外部知识的方式，但我们发现，当前实现方案存在偏向事实性、客观内容的偏差，这一点在现有优先考虑客观检索的基准测试和数据集上得到了印证。这种事实性偏差——将观点和多元视角视为噪声而非待综合的信息——限制了RAG系统在涉及主观内容的现实场景（从社交媒体讨论到产品评论）中的应用。除了技术局限，这种偏差还对透明和负责任的人工智能构成了风险：可能引发回音室效应而放大主流观点，导致少数群体声音系统性代表不足，以及通过有偏差的信息综合进行潜在观点操纵。我们通过不确定性视角将这一局限形式化：事实性查询涉及可通过证据减少的认知不确定性，而观点性查询则涉及反映人类视角真实异质性的偶然不确定性。这一区别意味着，事实性RAG应最小化后验熵，而观点感知型RAG必须保留它。基于这一理论基础，我们提出了一种观点感知型RAG架构，其特点包括基于LLM的观点提取、实体关联的观点图谱以及观点增强的文档索引。我们在电子商务卖家论坛数据上评估了该方法，将观点增强型知识库与传统基线进行对比。实验表明，检索多样性得到显著提升：在实体匹配文档上，情感多样性提升+26.8%，实体匹配率提升+42.7%，作者人口统计覆盖度提升+31.6%。我们的结果为“将主观性视为一等公民可产生可测量的、更具代表性的检索”提供了实证证据，这是迈向观点感知型RAG的第一步。未来工作包括为分布保真度而联合优化检索与生成。

Abstract

RAG systems have transformed how LLMs access external knowledge, but we find that current implementations exhibit a bias toward factual, objective content, as evidenced by existing benchmarks and datasets that prioritize objective retrieval. This factual bias, which treats opinions and diverse perspectives as noise rather than information to be synthesized, limits RAG systems in real-world scenarios involving subjective content, from social media discussions to product reviews. Beyond technical limitations, this bias poses risks to transparent and accountable AI: echo chamber effects that amplify dominant viewpoints, systematic underrepresentation of minority voices, and potential opinion manipulation through biased information synthesis. We formalize this limitation through the lens of uncertainty: factual queries involve epistemic uncertainty reducible through evidence, while opinion queries involve aleatoric uncertainty reflecting genuine heterogeneity in human perspectives. This distinction implies that factual RAG should minimize posterior entropy, whereas opinion-aware RAG must preserve it. Building on this theoretical foundation, we present an Opinion-Aware RAG architecture featuring LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing. We evaluate our approach on e-commerce seller forum data, comparing an Opinion-Enriched knowledge base against a traditional baseline. Experiments demonstrate substantial improvements in retrieval diversity: +26.8% sentiment diversity, +42.7% entity match rate, and +31.6% author demographic coverage on entity-matched documents. Our results provide empirical evidence that treating subjectivity as a first-class citizen yields measurably more representative retrieval, a first step toward opinion-aware RAG. Future work includes joint optimization of retrieval and generation for distributional fidelity.

Keywords: Retrieval-Augmented Generation, RAG, Large Language Models, opinion-aware, subjectivity, retrieval diversity, factual bias, uncertainty
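
The abstract above contrasts minimizing posterior entropy (factual RAG) with preserving it (opinion-aware RAG). As a rough illustration of what "preserving entropy" means for a retrieved set, the sketch below computes the Shannon entropy of sentiment labels attached to retrieved documents. It is not the paper's sentiment-diversity metric, whose definition the abstract does not give; the function name and the label lists are invented.

```python
import math
from collections import Counter
from typing import Iterable


def label_entropy(labels: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1 / p) summed over labels; written this way to avoid negative zero.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Invented sentiment labels for two hypothetical retrieved sets.
one_sided = ["positive", "positive", "positive", "positive"]
mixed = ["positive", "negative", "neutral", "negative"]
print(label_entropy(one_sided), label_entropy(mixed))
```

A retriever tuned only for factual consensus would tend toward the first kind of set; an opinion-aware retriever would aim to keep the entropy of the retrieved opinions close to that of the underlying corpus.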

10. Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Rea

Author: Syed Rifat Raiyan | Source: arXiv | Published: 2026-04-13 | arXiv link: http://arxiv.org/abs/2604.12076v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

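Scoring rationale: The paper's core topic is the Identifiable Victim Effect (IVE) that LLMs exhibit in moral decision-making, evaluated systematically across 16 frontier models. Highly relevant keywords: 1) “Large Language Models” (the object of study); 2) “Instruction Tuning” OR “Alignment” (the paper finds that alignment training strongly modulates the IVE and that instruction-tuned models show extreme IVE); 3) “Chain of Thought” OR “CoT Reasoning” (the paper tests how standard CoT prompting and utilitarian CoT affect the IVE). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not covered and score 0.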

!!! tip deepseek-chat TL;DR

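The paper examines whether large language models inherit the human identifiable victim effect, an irrational tendency in moral decision-making. It finds that alignment training strongly modulates the effect: instruction-tuned models show an extreme effect while reasoning-specialized models invert it, and standard chain-of-thought prompting amplifies the effect whereas utilitarian chain-of-thought eliminates it.

Abstract (Chinese translation)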

可识别受害者效应（Identifiable Victim Effect, IVE）——即人们倾向于向一个具体、通过叙事描述的受害者分配比向一个面临同等困境的统计特征群体分配更多资源的倾向——是道德心理学和行为经济学中最稳健的发现之一。随着大语言模型（LLMs）在人道主义分级诊疗、自动化资助评估和内容审核等领域承担起重要角色，一个关键问题随之产生：这些系统是否继承了人类道德推理中存在的情感非理性？我们首次对大语言模型中的IVE进行了系统性、大规模的实证研究，涵盖来自九个机构谱系（谷歌、Anthropic、OpenAI、Meta、DeepSeek、xAI、阿里巴巴、IBM和Moonshot）的16个前沿模型，共计N=51,955次经过验证的API试验。通过采用十项实验——移植并扩展了Small等人（2007）以及Kogut和Ritov（2005）的经典范式——我们发现IVE普遍存在，但受到对齐训练的强烈调节。经过指令微调的模型表现出极端的IVE（科恩d值高达1.56），而专长于推理的模型则逆转了该效应（d值低至-0.85）。汇总效应（d=0.223，p=2e-6）约为Lee和Feeley（2016）报告的单受害者人类元分析基线（d≈0.10）的两倍——并且很可能以更大的幅度超过人类的整体汇总效应，因为人类对群体受害者的效应近乎为零。标准的思维链（Chain-of-Thought, CoT）提示——与其作为审慎纠正机制的角色相反——几乎使IVE效应量增加了两倍（从d=0.15增至d=0.41），而只有功利主义的CoT提示能可靠地消除该效应。我们进一步记录了心理物理麻木、完全的数量忽视以及微小的内群体/外群体文化偏见，这些发现对于人工智能在人道主义和伦理决策情境中的部署具有重要启示。

Abstract

The Identifiable Victim Effect (IVE), the tendency to allocate greater resources to a specific, narratively described victim than to a statistically characterized group facing equivalent hardship, is one of the most robust findings in moral psychology and behavioural economics. As large language models (LLMs) assume consequential roles in humanitarian triage, automated grant evaluation, and content moderation, a critical question arises: do these systems inherit the affective irrationalities present in human moral reasoning? We present the first systematic, large-scale empirical investigation of the IVE in LLMs, comprising N=51,955 validated API trials across 16 frontier models spanning nine organizational lineages (Google, Anthropic, OpenAI, Meta, DeepSeek, xAI, Alibaba, IBM, and Moonshot). Using a suite of ten experiments that port and extend canonical paradigms from Small et al. (2007) and Kogut and Ritov (2005), we find that the IVE is prevalent but strongly modulated by alignment training. Instruction-tuned models exhibit extreme IVE (Cohen’s d up to 1.56), while reasoning-specialized models invert the effect (down to d=-0.85). The pooled effect (d=0.223, p=2e-6) is approximately twice the single-victim human meta-analytic baseline (d≈0.10) reported by Lee and Feeley (2016), and likely exceeds the overall human pooled effect by a larger margin, given that the group-victim human effect is near zero. Standard Chain-of-Thought (CoT) prompting, contrary to its role as a deliberative corrective, nearly triples the IVE effect size (from d=0.15 to d=0.41), while only utilitarian CoT reliably eliminates it. We further document psychophysical numbing, perfect quantity neglect, and marginal in-group/out-group cultural bias, with implications for AI deployment in humanitarian and ethical decision-making contexts.

Keywords: Identifiable Victim Effect, Large Language Models, Moral Reasoning, Alignment Training, Chain-of-Thought Prompting, Humanitarian Decision-making, Behavioral Economics, AI Ethics

📋 All papers
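
The effect sizes in the abstract above are reported as Cohen's d. For readers unfamiliar with the statistic, here is a minimal sketch of the standard two-sample form with a pooled standard deviation. Whether the paper uses exactly this variant is an assumption, and the allocation values below are invented for illustration, not taken from the study.

```python
import math
import statistics
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d: difference of means divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Invented allocations: identified-victim condition vs. statistical-group condition.
identified = [60.0, 70.0, 80.0]
statistical = [50.0, 60.0, 70.0]
d = cohens_d(identified, statistical)
print(d)
```

A positive d means more resources went to the identified victim; a negative d, as reported for reasoning-specialized models, means the statistical group received more.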

1. ✅ QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence

Authors: Zhichao Lin, Zhichao Liang, Gaoqiang Liu, Meng Xu, Baoyu Xiang, Jian Xu, Guanjun Jiang | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12867v1

Score: 75.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

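To address the scarcity of deep search data and the challenge of improving performance in the medical domain, the paper proposes the QuarkMedSearch agent. It builds training data from a medical knowledge graph combined with online exploration, strengthens planning and reflection through a two-stage SFT and RL training strategy, and achieves state-of-the-art performance among open-source models of comparable scale on a self-built benchmark.

Abstract (Chinese translation)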

随着智能体基础模型的持续演进，如何进一步提升其在垂直领域中的性能已成为重要挑战。为此，我们在强大的智能体基础模型通义DeepResearch基础上，聚焦中文医疗深度搜索场景，提出QuarkMedSearch，系统性地探索了涵盖医疗多跳数据构建、训练策略与评估基准的全流程方法，以进一步推进并评估其在垂直领域的性能上限。具体而言，在数据合成方面，针对医疗领域深度搜索训练数据稀缺的问题，我们结合大规模医疗知识图谱与实时在线探索，构建了长周期的医疗深度搜索训练数据；在训练后优化阶段，我们采用两阶段的监督微调（SFT）与强化学习（RL）训练策略，逐步增强模型进行深度搜索所需的规划、工具调用与反思能力，同时保持搜索效率；在评估方面，我们与医学专家合作，通过严格的人工验证构建了QuarkMedSearch Benchmark。实验结果表明，QuarkMedSearch在QuarkMedSearch Benchmark上取得了同规模开源模型中的最优性能，同时在通用基准测试中也保持了强劲的竞争力。

Abstract

As agentic foundation models continue to evolve, how to further improve their performance in vertical domains has become an important challenge. To this end, building upon Tongyi DeepResearch, a powerful agentic foundation model, we focus on the Chinese medical deep search scenario and propose QuarkMedSearch, systematically exploring a full-pipeline approach spanning medical multi-hop data construction, training strategies, and evaluation benchmarks to further push and assess its performance upper bound in vertical domains. Specifically, for data synthesis, to address the scarcity of deep search training data in the medical domain, we combine a large-scale medical knowledge graph with real-time online exploration to construct long-horizon medical deep search training data; for post-training, we adopt a two-stage SFT and RL training strategy that progressively enhances the model’s planning, tool invocation, and reflection capabilities required for deep search, while maintaining search efficiency; for evaluation, we collaborate with medical experts to construct the QuarkMedSearch Benchmark through rigorous manual verification. Experimental results demonstrate that QuarkMedSearch achieves state-of-the-art performance among open-source models of comparable scale on the QuarkMedSearch Benchmark, while also maintaining strong competitiveness on general benchmarks.

2. ✅ AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

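To address hallucination and modal ambiguity in LLM-based multimodal emotion recognition, the paper proposes AffectAgent, a collaborative multi-agent retrieval-augmented generation framework. It jointly optimizes three specialized agents and introduces the MB-MoE and RAAF modules, achieving superior performance on the MER-UniBench benchmark.

Abstract (Chinese translation)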

基于大语言模型的多模态情感识别依赖静态参数化记忆，在解读细微情感状态时常产生幻觉。本文针对单轮检索增强生成方法易受模态模糊性影响、难以捕捉跨模态复杂情感依赖的问题，提出AffectAgent——一种面向情感的多智能体检索增强生成框架，该框架利用智能体间的协同决策实现细粒度情感理解。具体而言，AffectAgent包含三个联合优化的专用智能体：查询规划器、证据过滤器和情感生成器，它们通过协作式分析推理实现跨模态样本检索、证据评估与预测生成。这些智能体采用多智能体近端策略优化算法进行端到端优化，并通过共享情感奖励机制确保情感理解的一致性。此外，我们提出模态平衡专家混合模块与检索增强自适应融合模块：前者动态调节不同模态的贡献以缓解跨模态异质性导致的表征失配，后者通过引入检索到的视听嵌入向量增强缺失模态条件下的语义补全能力。在MER-UniBench基准上的大量实验表明，AffectAgent在复杂场景中均取得优越性能。代码发布于：https://github.com/Wz1h1NG/AffectAgent。

Abstract

LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. In this paper, given that single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end using Multi-Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF), where MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: https://github.com/Wz1h1NG/AffectAgent.

3. ✅ LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

Score: 46.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	10.0/10	10.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

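To address the memory bottleneck that attention creates for block-wise diffusion language models in long-context scenarios, the paper proposes Locality-aware Sparse Attention (LoSA). It reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens, which substantially reduces the number of KV indices loaded and yields up to 4.14x attention speedup while keeping accuracy close to that of dense attention.

Abstract (Chinese translation)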

分块扩散语言模型（DLMs）能够以任意顺序生成多个词元，为自回归解码流程提供了一种有前景的替代方案。然而，在长上下文场景中，它们仍然受限于内存约束的注意力机制。由于存在KV膨胀问题，朴素的稀疏注意力在DLMs上效果不佳：不同的查询会选择不同的前缀位置，导致被访问的KV页面集合过大。为解决此问题，我们观察到在连续的降噪步骤之间，只有一小部分活跃词元的隐藏状态发生显著变化，而大多数稳定词元的状态几乎保持不变。基于这一洞察，我们提出了局部感知稀疏注意力（Locality-aware Sparse Attention, LOSA）。该方法对稳定词元复用已缓存的前缀注意力结果，仅对活跃词元应用稀疏注意力。这大幅减少了必须加载的KV索引数量，从而实现了更高的加速比和更高的准确性。在多种分块DLMs和基准测试中，LOSA在保持接近稠密注意力准确性的同时，显著提升了效率：在激进的稀疏度水平下，平均准确率最高提升9个百分点，同时保持低1.54倍的注意力密度。在RTX A6000 GPU上，注意力计算速度最高提升4.14倍，证明了所提方法的有效性。

Abstract

Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they still remain bottlenecked by memory-bound attention in long-context scenarios. Naive sparse attention fails on DLMs due to a KV Inflation problem, where different queries select different prefix positions, making the union of accessed KV pages large. To address this, we observe that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while the majority of stable tokens remain nearly constant. Based on this insight, we propose LOSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens. This substantially shrinks the number of KV indices that must be loaded, yielding both higher speedup and higher accuracy. Across multiple block-wise DLMs and benchmarks, LOSA preserves near-dense accuracy while significantly improving efficiency, achieving up to +9 points in average accuracy at aggressive sparsity levels while maintaining 1.54x lower attention density. It also achieves up to 4.14x attention speedup on RTX A6000 GPUs, demonstrating the effectiveness of the proposed method.
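
The abstract above rests on separating "active" tokens, whose hidden states change noticeably between consecutive denoising steps, from "stable" ones whose cached prefix-attention results can be reused. The sketch below shows one way such a split could be computed. The relative L2 change criterion, the `threshold` value, and the function name are assumptions made for illustration; the abstract does not specify how LOSA identifies active tokens.

```python
import math
from typing import List, Sequence, Tuple


def split_active_stable(
    prev_hidden: Sequence[Sequence[float]],
    curr_hidden: Sequence[Sequence[float]],
    threshold: float = 0.1,
) -> Tuple[List[int], List[int]]:
    """Partition token indices by relative hidden-state change between two steps."""
    active: List[int] = []
    stable: List[int] = []
    for i, (prev, curr) in enumerate(zip(prev_hidden, curr_hidden)):
        change = math.dist(prev, curr)
        scale = math.hypot(*prev) or 1.0  # guard against zero-norm states
        if change / scale > threshold:
            active.append(i)  # would receive fresh sparse attention
        else:
            stable.append(i)  # would reuse cached prefix-attention output
    return active, stable


# Toy hidden states (invented) for three tokens at two consecutive denoising steps.
prev_hidden = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
curr_hidden = [[1.0, 0.01], [1.0, 2.0], [3.0, 4.0]]
active, stable = split_active_stable(prev_hidden, curr_hidden)
print(active, stable)
```

Only the indices in `active` would trigger new KV loads under this scheme, which is the mechanism the abstract credits for shrinking the set of KV indices that must be fetched.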

4. ✅ Latent-Condensed Transformer for Efficient Long Context Modeling

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	10.0/10	10.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

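This paper addresses two challenges that large language models face on long contexts, the linear growth of the KV cache and the quadratic complexity of self-attention. It proposes Latent-Condensed Attention, which achieves up to 2.5x prefilling speedup and 90% KV cache reduction while maintaining competitive performance.

Abstract (translated)

Large language models face significant challenges when processing long contexts, mainly because the key-value (KV) cache grows linearly and self-attention has quadratic computational complexity. Existing methods typically tackle these bottlenecks separately: Multi-head Latent Attention (MLA) shrinks the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention aims to reduce computation. Sparse methods, however, cannot run directly on MLA's compressed latent structure, so opportunities for joint optimization are missed. This paper proposes Latent-Condensed Attention (LCA), which condenses context directly within MLA's latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA aggregates the semantic vectors through query-aware pooling and preserves the positional keys through anchor selection. This design jointly reduces computational cost and KV cache size without adding parameters. Beyond MLA, LCA's design is architecture-agnostic and extends readily to other attention mechanisms such as grouped-query attention (GQA). Theoretically, we prove that its error bound is independent of length. Experiments show that at a 128K context length LCA achieves up to 2.5x prefilling speedup and 90% KV cache reduction while maintaining competitive performance.

Abstract (original)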

Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA’s compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA’s latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Beyond MLA, LCA’s design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5$\times$ prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.
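To make "query-aware pooling" and "anchor selection" more concrete, here is a minimal sketch under stated assumptions. The abstract does not give the pooling rule, the block structure, or the anchor criterion, so the fixed block size, the softmax weighting, and the "highest-weight position is the anchor" rule below are all hypothetical choices, as are the function names.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def condense(query, latents, pos_keys, block_size):
    """Condense each block of latent vectors into one pooled vector plus one anchor key.

    Assumed rule: pool with softmax(query . latent) weights and keep the
    positional key of the highest-weight position as the block's anchor.
    """
    pooled_latents, anchor_keys = [], []
    for start in range(0, len(latents), block_size):
        block = latents[start:start + block_size]
        weights = softmax([dot(query, c) for c in block])
        dim = len(block[0])
        pooled = [sum(w * c[d] for w, c in zip(weights, block)) for d in range(dim)]
        anchor = max(range(len(block)), key=lambda i: weights[i])
        pooled_latents.append(pooled)
        anchor_keys.append(pos_keys[start + anchor])
    return pooled_latents, anchor_keys
```

In this toy model a block size of 10 stores one entry per ten tokens, that is, 90% fewer cached entries. Whether the paper reaches its reported 90% reduction through a comparable ratio is not stated in the abstract.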

5. ✅ Agentic Discovery with Active Hypothesis Exploration for Visual Recognition

Authors: Jaywon Koo, Jefferson Hernandez, Ruozhen He, Hanjie Chen, Chen Wei, Vicente Ordonez Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12999v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

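This paper proposes HypoExplore, an agentic framework that uses a large language model to discover and refine neural network architectures for visual recognition in a hypothesis-driven way. On CIFAR-10 the best architecture evolves from an 18.91% baseline to 94.11% accuracy, and the paper also demonstrates applicability to medical imaging (MedMNIST).

Abstract (translated)

We propose HypoExplore, an agentic framework that frames neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a high-level research direction specified by a human, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are generated with a large language model by selecting a parent hypothesis to build on, guided by a dual strategy that balances exploiting validated design principles with resolving uncertain ones. The framework maintains a Trajectory Tree that records the lineage of every proposed architecture and a Hypothesis Memory Bank that actively tracks confidence scores based on experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into updates to hypothesis confidence. We test the framework on discovering lightweight vision architectures on CIFAR-10, where the best architecture evolves from a root baseline of 18.91% accuracy to 94.11%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain with independent architecture discovery runs on MedMNIST, which yield state-of-the-art performance. We show that hypothesis confidence scores become increasingly predictive as evidence accumulates and that the learned design principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures but can also help build a genuine understanding of the design space.

Abstract (original)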

We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance. We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.

Keywords: agentic framework, large language model, neural architecture discovery, hypothesis-driven, visual recognition, evolutionary branching, autonomous agents, AI for science
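The Hypothesis Memory Bank, Trajectory Tree lineage, and dual selection strategy can be sketched as a small data structure. The abstract does not describe how confidence is updated or how parents are chosen, so the Laplace-smoothed support fraction, the "closest to 0.5 is most uncertain" rule, and all class and function names here are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Hypothesis:
    text: str
    parent: Optional["Hypothesis"] = None
    confidence: float = 0.5                     # start maximally uncertain (assumption)
    evidence: List[float] = field(default_factory=list)

    def update(self, supported: bool) -> None:
        """Record one experimental outcome and refresh the confidence score."""
        self.evidence.append(1.0 if supported else 0.0)
        # Laplace-smoothed fraction of supporting experiments (assumed update rule)
        self.confidence = (sum(self.evidence) + 1.0) / (len(self.evidence) + 2.0)

def select_parent(bank, explore: bool):
    """Dual strategy: exploit the most validated hypothesis or probe the most uncertain."""
    if explore:
        return min(bank, key=lambda h: abs(h.confidence - 0.5))
    return max(bank, key=lambda h: h.confidence)

def lineage(h):
    """Walk parent links back to the root, as a trajectory tree would record them."""
    chain = []
    while h is not None:
        chain.append(h.text)
        h = h.parent
    return list(reversed(chain))
```

For example, a hypothesis supported in two of two experiments reaches confidence 0.75 under this rule, while one supported in one of two stays at 0.5 and is the one picked when `explore=True`.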

6. ✅ Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	15.0/10	15.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

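Existing RAG methods treat retrieved evidence as an unordered set. This paper proposes Order-Aware Knowledge Hypergraph RAG (OKH-RAG), which models interaction order to recover coherent interaction trajectories and consistently outperforms permutation-invariant baselines on order-sensitive question answering and explanation tasks.

Abstract (translated)

Retrieval-augmented generation (RAG) strengthens large language models by grounding their outputs in retrieved knowledge. However, existing RAG methods, including graph-based and hypergraph-based approaches, treat retrieved evidence as an unordered set and implicitly assume permutation invariance. This assumption is at odds with many real-world reasoning tasks, where outcomes depend not only on which interactions occur but also on the order in which they unfold. We propose Order-Aware Knowledge Hypergraph RAG (OKH-RAG), which treats order as a first-class structural property. OKH-RAG represents knowledge as higher-order interactions in a hypergraph augmented with precedence structure, and reformulates retrieval as sequence inference over hyperedges. Rather than selecting independent facts, it recovers coherent interaction trajectories that reflect the underlying reasoning process. A learned transition model infers precedence directly from data, without explicit temporal supervision. We evaluate OKH-RAG on order-sensitive question answering and explanation tasks, including tropical cyclone and port operation scenarios. OKH-RAG consistently outperforms permutation-invariant baselines, and ablations show that the gains come specifically from modeling interaction order. These results highlight a key limitation of set-based retrieval: effective reasoning requires not only retrieving relevant evidence but also organizing it into structured sequences.

Abstract (original)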

Retrieval-augmented generation (RAG) enhances large language models by grounding outputs in retrieved knowledge. However, existing RAG methods including graph- and hypergraph-based approaches treat retrieved evidence as an unordered set, implicitly assuming permutation invariance. This assumption is misaligned with many real-world reasoning tasks, where outcomes depend not only on which interactions occur, but also on the order in which they unfold. We propose Order-Aware Knowledge Hypergraph RAG (OKH-RAG), which treats order as a first-class structural property. OKH-RAG represents knowledge as higher-order interactions within a hypergraph augmented with precedence structure, and reformulates retrieval as sequence inference over hyperedges. Instead of selecting independent facts, it recovers coherent interaction trajectories that reflect underlying reasoning processes. A learned transition model infers precedence directly from data without requiring explicit temporal supervision. We evaluate OKH-RAG on order-sensitive question answering and explanation tasks, including tropical cyclone and port operation scenarios. OKH-RAG consistently outperforms permutation-invariant baselines, and ablations show that these gains arise specifically from modeling interaction order. These results highlight a key limitation of set-based retrieval: effective reasoning requires not only retrieving relevant evidence, but organizing it into structured sequences.

Keywords: Retrieval-augmented generation, RAG, Order-aware, Hypergraph, Reasoning, Sequence inference, Interaction trajectories, Precedence structure
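"Retrieval as sequence inference over hyperedges" can be pictured as scoring ordered sequences rather than sets. The sketch below is a toy stand-in: it uses brute-force search over permutations instead of whatever inference procedure the paper uses, and the hyperedge names, relevance scores, and transition scores are all made up for illustration.

```python
from itertools import permutations

def best_trajectory(relevance, transition, length):
    """Return the ordered sequence of hyperedge ids with the highest combined score.

    relevance:  hyperedge id -> query relevance (hypothetical scores)
    transition: (earlier id, later id) -> precedence score from an assumed
                learned transition model; missing pairs count as 0.0
    """
    def score(seq):
        rel = sum(relevance[e] for e in seq)
        order = sum(transition.get((a, b), 0.0) for a, b in zip(seq, seq[1:]))
        return rel + order
    # Exhaustive search is only viable for tiny candidate sets; it is used here for clarity.
    return max(permutations(relevance, length), key=score)

# Hypothetical candidates loosely themed on the cyclone and port scenarios
relevance = {"pressure_drop": 0.9, "landfall": 0.8, "port_closure": 0.7, "fuel_price": 0.1}
transition = {
    ("pressure_drop", "landfall"): 1.0,
    ("landfall", "port_closure"): 1.0,
    ("port_closure", "landfall"): -1.0,
}
trajectory = best_trajectory(relevance, transition, 3)
```

A set-based retriever would return the same three items in any order; here the transition term makes one ordering score strictly higher than the others, which is the distinction the abstract draws.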

7. ✅ Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models

Authors: Md Tanvirul Alam Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12119v1

Score: 36.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

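This paper studies semantic fixation in large vision-language models: the tendency to keep a default semantic prior instead of following an alternative rule specified in the prompt. Using the VLM-Fix benchmark, it finds a consistent accuracy gap between standard and inverse rules, and shows that joint-rule post-training and activation steering can partially mitigate these errors.

Abstract (translated)

Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule-mapping failures. We define this behavior as semantic fixation: the model keeps a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate the effect, we introduce VLM-Fix, a controlled benchmark built on four abstract strategy games that evaluates identical terminal board states under paired standard-rule and inverse-rule formulations. Experiments on 14 open and closed VLMs show that accuracy under standard rules is consistently higher than under inverse rules, revealing a robust semantic-fixation gap. Prompt interventions support this mechanism: neutral alias prompts substantially narrow the inverse-rule gap, while semantically loaded alias prompts reopen it. Post-training is strongly rule-aligned: training on a single rule improves same-rule transfer but hurts opposite-rule transfer, whereas joint-rule training improves broader transfer. To test external validity beyond synthetic games, we evaluate analogous defamiliarization interventions on the VLMBias dataset and observe the same qualitative pattern. Finally, late-layer activation steering partially recovers the degraded performance, indicating that semantic-fixation errors are at least partly editable in late representations. The project page, code, and dataset are available at https://maveryn.github.io/vlm-fix/.

Abstract (original)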

Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule-mapping failures. We study this behavior as semantic fixation: preserving a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate this effect, we introduce VLM-Fix, a controlled benchmark over four abstract strategy games that evaluates identical terminal board states under paired standard and inverse rule formulations. Across 14 open and closed VLMs, accuracy consistently favors standard rules, revealing a robust semantic-fixation gap. Prompt interventions support this mechanism: neutral alias prompts substantially narrow the inverse-rule gap, while semantically loaded aliases reopen it. Post-training is strongly rule-aligned: training on one rule improves same-rule transfer but hurts opposite-rule transfer, while joint-rule training improves broader transfer. To test external validity beyond synthetic games, we evaluate analogous defamiliarization interventions on VLMBias and observe the same qualitative pattern. Finally, late-layer activation steering partially recovers degraded performance, indicating that semantic-fixation errors are at least partly editable in late representations. Project page, code, and dataset available at https://maveryn.github.io/vlm-fix/.

Keywords: Large Vision-Language Models, Semantic Fixation, Post-training Alignment, Rule Mapping, Activation Steering, VLM-Fix Benchmark, Interpretability, Model Evaluation

8. ✅ Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score

Authors: Manh Nguyen, Sunil Gupta, Hung Le Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12196v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

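This paper addresses the problem of choosing the most reliable answer when a large language model generates multiple candidates. It proposes Radial Consensus Score, a training-free geometric consensus scoring method that outperforms strong baselines across multiple benchmarks and can serve as a drop-in replacement for majority voting.

Abstract (translated)

Large language models (LLMs) frequently generate multiple candidate responses to a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing methods such as self-consistency rely on discrete voting, while probability-based methods often fail to capture relationships among candidates or tend to underweight high-quality but less frequent responses, and they do not fully exploit the geometric structure of answer representations. To address these limitations, we propose Radial Consensus Score (RCS), a simple, efficient, training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (the semantic center) of the answer embeddings and ranking candidates by their radial distance to that center. Importantly, RCS is a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, so it can flexibly combine agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments on seven benchmarks covering short-form QA and long-form reasoning, with five open-weight models, show that RCS variants consistently outperform strong baselines, with larger gains as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and is robust in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.

Abstract (original)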

Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.

Keywords: Large Language Models, Best-of-N Selection, Radial Consensus Score, Answer Selection, Semantic Consensus, Multi-agent Debate, Reasoning Tasks, Black-box Scenarios
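The selection rule in the abstract (weighted Fréchet mean, then rank by radial distance) can be sketched in a few lines. This sketch assumes Euclidean embeddings, where the weighted Fréchet mean reduces to the weighted arithmetic mean; the abstract does not state which metric or embedding model the paper uses, and the function names and toy vectors are hypothetical.

```python
import math

def weighted_mean(embeddings, weights):
    """Weighted arithmetic mean, which is the weighted Fréchet mean under the Euclidean metric."""
    total = sum(weights)
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) / total for d in range(dim)]

def radial_consensus_select(embeddings, weights=None):
    """Return the index of the candidate closest to the semantic center, plus all distances.

    weights=None gives the uniform variant; frequency or probability weights
    can be passed instead (how the paper derives them is not shown here).
    """
    if weights is None:
        weights = [1.0] * len(embeddings)
    center = weighted_mean(embeddings, weights)
    dists = [math.dist(e, center) for e in embeddings]
    best = min(range(len(embeddings)), key=lambda i: dists[i])
    return best, dists

# Three similar candidate answers and one outlier (toy 2-D embeddings)
candidates = [[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [5.0, 5.0]]
best_index, distances = radial_consensus_select(candidates)
```

Unlike majority voting, this needs no exact-match grouping of answers: candidates only have to be embeddable, which is why the approach also applies to long-form outputs.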

9. ✅ Beyond Factual Grounding: The Case for Opinion-Aware Retrieval-Augmented Generation

Authors: Aditya Agrawal, Alwarappan Nakkiran, Darshan Fofadiya, Alex Karlsson, Harsha Aduri Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12138v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	15.0/10	15.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

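This paper finds that current RAG systems carry a factual bias and handle subjective content poorly. It proposes an Opinion-Aware RAG architecture that adds LLM-based opinion extraction and opinion-graph-enriched indexing, and reports substantially higher retrieval diversity on e-commerce seller forum data.

Abstract (translated)

Retrieval-augmented generation (RAG) systems have changed how large language models (LLMs) access external knowledge, but we find that current implementations are biased toward factual, objective content, as reflected in existing benchmarks and datasets that prioritize objective retrieval. This factual bias, which treats opinions and diverse perspectives as noise rather than as information to be synthesized, limits RAG systems in real-world scenarios involving subjective content, from social media discussions to product reviews. Beyond the technical limitation, the bias poses risks to transparent and accountable AI: echo-chamber effects that amplify dominant viewpoints, systematic underrepresentation of minority voices, and potential opinion manipulation through biased information synthesis. We formalize the limitation through the lens of uncertainty: factual queries involve epistemic uncertainty that evidence can reduce, whereas opinion queries involve aleatoric uncertainty that reflects genuine heterogeneity in human perspectives. This distinction implies that factual RAG should minimize posterior entropy, while opinion-aware RAG must preserve it. Building on this theoretical foundation, we propose an Opinion-Aware RAG architecture featuring LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing. We evaluate the approach on e-commerce seller forum data, comparing an opinion-enriched knowledge base against a traditional baseline. Experiments show substantial gains in retrieval diversity: +26.8% sentiment diversity, +42.7% entity match rate, and +31.6% author demographic coverage on entity-matched documents. Our results provide empirical evidence that treating subjectivity as a first-class citizen yields measurably more representative retrieval, a first step toward opinion-aware RAG. Future work includes jointly optimizing retrieval and generation for distributional fidelity.

Abstract (original)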

RAG systems have transformed how LLMs access external knowledge, but we find that current implementations exhibit a bias toward factual, objective content, as evidenced by existing benchmarks and datasets that prioritize objective retrieval. This factual bias - treating opinions and diverse perspectives as noise rather than information to be synthesized - limits RAG systems in real-world scenarios involving subjective content, from social media discussions to product reviews. Beyond technical limitations, this bias poses risks to transparent and accountable AI: echo chamber effects that amplify dominant viewpoints, systematic underrepresentation of minority voices, and potential opinion manipulation through biased information synthesis. We formalize this limitation through the lens of uncertainty: factual queries involve epistemic uncertainty reducible through evidence, while opinion queries involve aleatoric uncertainty reflecting genuine heterogeneity in human perspectives. This distinction implies that factual RAG should minimize posterior entropy, whereas opinion-aware RAG must preserve it. Building on this theoretical foundation, we present an Opinion-Aware RAG architecture featuring LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing. We evaluate our approach on e-commerce seller forum data, comparing an Opinion-Enriched knowledge base against a traditional baseline. Experiments demonstrate substantial improvements in retrieval diversity: +26.8% sentiment diversity, +42.7% entity match rate, and +31.6% author demographic coverage on entity-matched documents. Our results provide empirical evidence that treating subjectivity as a first-class citizen yields measurably more representative retrieval, a first step toward opinion-aware RAG. Future work includes joint optimization of retrieval and generation for distributional fidelity.

Keywords: Retrieval-Augmented Generation, RAG, Large Language Models, opinion-aware, subjectivity, retrieval diversity, factual bias, uncertainty
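The entropy framing in the abstract can be illustrated with Shannon entropy over the sentiment labels of a retrieved set. This is only a stand-in: the abstract does not define how its "sentiment diversity" metric is computed, and the labels and function name below are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical sentiment labels for two retrieved sets answering an opinion query
one_sided = ["pos", "pos", "pos", "pos"]   # a single viewpoint dominates
balanced = ["pos", "neg", "pos", "neg"]    # both viewpoints are represented

one_sided_bits = shannon_entropy(one_sided)
balanced_bits = shannon_entropy(balanced)
```

Under this reading, a fact-seeking retriever would prefer the low-entropy set, while an opinion-aware retriever would try to keep the retrieved distribution close to the entropy of the underlying opinions.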

10. ✅ Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

Authors: Syed Rifat Raiyan Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12076v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

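This paper examines whether large language models inherit the human Identifiable Victim Effect in moral decision-making. It finds that alignment training strongly modulates the effect: instruction-tuned models show an extreme effect while reasoning-specialized models invert it, and standard chain-of-thought prompting amplifies the effect whereas utilitarian chain-of-thought eliminates it.

Abstract (translated)

The Identifiable Victim Effect (IVE), the tendency to allocate more resources to a specific, narratively described victim than to a statistically characterized group facing equivalent hardship, is one of the most robust findings in moral psychology and behavioral economics. As large language models (LLMs) take on consequential roles in humanitarian triage, automated grant evaluation, and content moderation, a critical question arises: do these systems inherit the affective irrationalities present in human moral reasoning? We present the first systematic, large-scale empirical study of the IVE in LLMs, covering 16 frontier models from nine organizational lineages (Google, Anthropic, OpenAI, Meta, DeepSeek, xAI, Alibaba, IBM, and Moonshot) with N=51,955 validated API trials. Using ten experiments that port and extend the classic paradigms of Small et al. (2007) and Kogut and Ritov (2005), we find that the IVE is prevalent but strongly modulated by alignment training. Instruction-tuned models show extreme IVE (Cohen's d up to 1.56), while reasoning-specialized models invert the effect (d as low as -0.85). The pooled effect (d=0.223, p=2e-6) is roughly twice the single-victim human meta-analytic baseline (d≈0.10) reported by Lee and Feeley (2016), and it likely exceeds the overall human pooled effect by a larger margin, because the human effect for group victims is near zero. Standard Chain-of-Thought (CoT) prompting, contrary to its role as a deliberative corrective, nearly triples the IVE effect size (from d=0.15 to d=0.41), and only utilitarian CoT prompting reliably eliminates it. We further document psychophysical numbing, perfect quantity neglect, and marginal in-group/out-group cultural bias, findings with important implications for deploying AI in humanitarian and ethical decision-making contexts.

Abstract (original)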

The Identifiable Victim Effect (IVE), the tendency to allocate greater resources to a specific, narratively described victim than to a statistically characterized group facing equivalent hardship, is one of the most robust findings in moral psychology and behavioural economics. As large language models (LLMs) assume consequential roles in humanitarian triage, automated grant evaluation, and content moderation, a critical question arises: do these systems inherit the affective irrationalities present in human moral reasoning? We present the first systematic, large-scale empirical investigation of the IVE in LLMs, comprising N=51,955 validated API trials across 16 frontier models spanning nine organizational lineages (Google, Anthropic, OpenAI, Meta, DeepSeek, xAI, Alibaba, IBM, and Moonshot). Using a suite of ten experiments, porting and extending canonical paradigms from Small et al. (2007) and Kogut and Ritov (2005), we find that the IVE is prevalent but strongly modulated by alignment training. Instruction-tuned models exhibit extreme IVE (Cohen’s d up to 1.56), while reasoning-specialized models invert the effect (down to d=-0.85). The pooled effect (d=0.223, p=2e-6) is approximately twice the single-victim human meta-analytic baseline (d$\approx$0.10) reported by Lee and Feeley (2016), and likely exceeds the overall human pooled effect by a larger margin, given that the group-victim human effect is near zero. Standard Chain-of-Thought (CoT) prompting, contrary to its role as a deliberative corrective, nearly triples the IVE effect size (from d=0.15 to d=0.41), while only utilitarian CoT reliably eliminates it. We further document psychophysical numbing, perfect quantity neglect, and marginal in-group/out-group cultural bias, with implications for AI deployment in humanitarian and ethical decision-making contexts.

Keywords: Identifiable Victim Effect, Large Language Models, Moral Reasoning, Alignment Training, Chain-of-Thought Prompting, Humanitarian Decision-making, Behavioral Economics, AI Ethics
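The effect sizes quoted above are Cohen's d values. As a reminder of what that statistic measures, here is a minimal sketch of the standard pooled-standard-deviation form. Whether the paper uses exactly this variant is not stated in the abstract, and the allocation figures below are invented for illustration.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Hypothetical allocations (arbitrary units) under the two framings
identifiable = [6, 7, 8, 9, 10]   # single, narratively described victim
statistical = [4, 5, 6, 7, 8]     # statistically characterized group

d = cohens_d(identifiable, statistical)
```

With the argument order above, a positive d means more is allocated to the identifiable victim, and a negative d corresponds to the inverted effect the abstract reports for reasoning-specialized models.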

11. ❌ Agentic Control in Variational Language Models

Authors: Yves Ruffenach Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12513v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

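Scoring rationale: The paper studies agentic control in variational language models. Its core idea is to use uncertainty as an active control signal that regulates training, supports checkpoint retention, and guides inference-time intervention, yielding closed-loop internal control. It is highly relevant to "LLM Agents" (10 points) because it explicitly studies agentic control, agentic evaluation, and agentic routing. It is relevant to "Large Language Models" (8 points) because a variational language model is the foundation of the study. It is relevant to "Self-Correction" (8 points) because the control mechanism involves self-regulation and self-improvement. Other keywords such as MoE, SFT, and RAG are not covered by the paper and score 0.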
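To make the pass/fail arithmetic concrete, the sketch below recomputes this paper's total and compares it with the passing threshold. The threshold formula (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) comes from the report's configuration section. The function names are hypothetical, and the assumption that each row's score equals weight × relevance is inferred from the table rows, not stated by the report.

```python
def passing_threshold(total_weight, base=5.0, factor=0.8):
    """Passing score as configured in the report: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows):
    """Sum of per-keyword scores; each score is assumed to be weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero rows for this paper, all with weight 1.0:
# "Large Language Models" 8.0, "Self-Correction" 8.0, "LLM Agents" 10.0.
rows = [(1.0, 8.0), (1.0, 8.0), (1.0, 10.0)]
score = total_score(rows)
threshold = passing_threshold(total_weight=27.0)
passed = score >= threshold
```

With these inputs the total is 26.0 against a threshold of 26.6, which matches the failing mark shown for this entry.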

!!! tip deepseek-chat TL;DR

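The paper studies how to use the internal uncertainty of a variational language model as an active control signal within an agentic control framework. Experiments show that the variational backbone outperforms a matched deterministic reference on the language-modeling task, and that a calibrated controller built on top of it achieves a positive quality-cost trade-off.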

Abstract (Chinese translation)

本研究探讨变分语言模型是否能够基于其内部证据支持一种最小化且可测量的主体控制形式。我们的模型结合了局部变分隐式计算（EVE）、稳态潜在调节器、结构感知的检查点保留机制，以及在保留模型之上运行的校准化不确定性感知控制器。我们并非将不确定性视为预测后被动测量的诊断指标，而是将其作为一种可调节训练、支持检查点保留并指导推理时干预的操作性信号。由此构建的框架具有明确的聚焦性：它研究一种闭环式的内部控制形式，使结构性与预测性信号可转化为具体操作。实证结果表明，该变分主干模型在语言建模任务上优于匹配的确定性参照模型，同时展现出更丰富且更可用的不确定性特征。在此主干模型基础上，校准控制器持续保持活跃状态，在完整的主体评估中使用多种操作，并实现了质量与成本的正向权衡。这些结果支持一个精确论断：内部不确定性不仅可以作为变分语言模型的描述性属性，更可作为调节、检查点保留及最小化主体路由的实用控制接口。

摘要 (Abstract)

We study whether a variational language model can support a minimal and measurable form of agentic control grounded in its own internal evidence. Our model combines local variational hidden computation (EVE), a homeostatic latent regulator, structurally aware checkpoint retention and a calibrated uncertainty-aware controller operating on top of the retained model. Rather than treating uncertainty as a passive diagnostic measured after prediction, we treat it as an operational signal that can regulate training, support checkpoint retention and guide inference-time intervention. The resulting framework is deliberately focused. It studies a closed-loop form of internal control in which structural and predictive signals become actionable. Empirically, the variational backbone improves over a matched deterministic reference on the language-modeling task while also exhibiting a richer and more usable uncertainty profile. On top of this backbone, the calibrated controller remains active, uses multiple actions under a full agentic evaluation and yields a positive quality-cost trade-off. These results support a precise claim: internal uncertainty can serve not only as a descriptive property of a variational language model, but also as a practical control interface for regulation, checkpoint retention and minimal agentic routing.

Keywords: variational language model, agentic control, uncertainty-aware controller, internal evidence, checkpoint retention, closed-loop control, calibrated controller, quality-cost trade-off
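The abstract describes an uncertainty-aware controller that picks among several actions at inference time, but it does not spell out the mechanism. The sketch below is one hypothetical way such routing could look: it measures predictive entropy and maps it to an action through two thresholds. The action names, the thresholds, and the choice of entropy as the signal are all assumptions made for illustration, not the paper's design.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_action(probs, low=0.5, high=1.0):
    """Map uncertainty to a hypothetical action: cheap when confident, costly when not."""
    h = predictive_entropy(probs)
    if h < low:
        return "accept"    # confident: keep the prediction at no extra cost
    if h < high:
        return "rerank"    # moderately uncertain: spend a little extra compute
    return "escalate"      # very uncertain: use the most expensive intervention


confident = choose_action([1.0, 0.0, 0.0])
moderate = choose_action([0.7, 0.2, 0.1])
uncertain = choose_action([0.25, 0.25, 0.25, 0.25])
```

Raising or lowering the thresholds shifts how often the costly actions fire, which is one simple way to picture a quality-cost trade-off.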

12. ❌ Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents

Authors: Wanchun Ni, Jiugeng Sun, Yixian Liu, Mennatallah El-Assady Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12545v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

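Scoring rationale: The paper's core topic is the use of LLM agents to simulate citizens' emotional responses to bureaucratic red tape, together with an evaluation of how well those responses align with culture-specific human responses. It is therefore highly relevant to "Large Language Models" and "LLM Agents" (10 points each). Because the paper evaluates the "alignment" between model and human emotional responses, it has some relevance to "Instruction Tuning" OR "Alignment" (5 points). The remaining keywords (MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, Context Window, KV Cache, Reasoning, Self-Correction, Tool Use, Multi-agent, Quantization, Speculative Decoding, Hallucination, Interpretability, World Models, Model Merging, In-context Learning, AI for Science, and so on) are neither mentioned in nor relevant to the paper, so they score 0.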

!!! tip deepseek-chat TL;DR

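The study evaluates how well LLM agents can simulate citizens' emotional responses to bureaucratic red tape across cultural contexts. All models show limited alignment with human emotional responses, with notably weaker performance in Eastern cultures. The authors also introduce RAMO, an interactive interface for simulating these responses and collecting human data to improve models.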

Abstract (Chinese translation)

提升政策制定效能是公共行政领域的核心关切。既有的人类主体研究表明，在政策执行过程中，公民对繁文缛节的情绪反应存在显著的跨文化差异。尽管大型语言模型（LLM）智能体为模拟类人反应和降低实验成本提供了机遇，但其能否针对繁文缛节产生符合文化背景的情绪反应仍未得到验证。为填补这一空白，我们提出了一个评估框架，用以评估LLM在不同文化背景下对繁文缛节的情绪反应。作为一项试点研究，我们将此框架应用于一个单一的繁文缛节场景。结果显示，所有模型与人类情绪反应的契合度均有限，在东方文化背景下的表现尤为薄弱。文化提示策略在提升契合度方面基本无效。我们进一步介绍了 RAMO，这是一个用于模拟公民对繁文缛节情绪反应并收集人类数据以改进模型的交互式界面。该界面已公开于 https://ramo-chi.ivia.ch。

摘要 (Abstract)

Improving policymaking is a central concern in public administration. Prior human subject studies reveal substantial cross-cultural differences in citizens’ emotional responses to red tape during policy implementation. While LLM agents offer opportunities to simulate human-like responses and reduce experimental costs, their ability to generate culturally appropriate emotional responses to red tape remains unverified. To address this gap, we propose an evaluation framework for assessing LLMs’ emotional responses to red tape across diverse cultural contexts. As a pilot study, we apply this framework to a single red-tape scenario. Our results show that all models exhibit limited alignment with human emotional responses, with notably weaker performance in Eastern cultures. Cultural prompting strategies prove largely ineffective in improving alignment. We further introduce RAMO, an interactive interface for simulating citizens’ emotional responses to red tape and for collecting human data to improve models. The interface is publicly available at https://ramo-chi.ivia.ch.

Keywords: LLM agents, emotional responses, cross-cultural, red tape, bureaucracy, alignment, simulation, policy making

13. ❌ The role of System 1 and System 2 semantic memory structure in human and LLM biases

Authors: Katherine Abramski, Giulio Rossetti, Massimo Stella Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12816v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

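Scoring rationale: The paper's core topic is the relationship between semantic memory structure and implicit bias in humans and LLMs under System 1 and System 2 thinking. It directly involves the keywords "Large Language Models" and "System 2 Thinking", so both receive 10 points. Because the paper explores the roots of bias by comparing human and LLM cognitive mechanisms, it has some relevance to "Mechanistic Interpretability" (5 points). Other keywords such as MoE, SFT, and RAG concern specific technical methods that the paper does not address, so they score 0.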

!!! tip deepseek-chat TL;DR

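By modeling System 1 and System 2 semantic memory structures for humans and LLMs, the study finds that semantic memory structures are irreducible only in humans and relate consistently to implicit bias only in humans, with lower bias in System 2 structures. This suggests that LLMs lack certain types of human-like conceptual knowledge and highlights fundamental differences between human and machine cognition.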

Abstract (Chinese translation)

人类与大型语言模型（LLM）中存在的隐性偏见构成了显著的社会风险。双重加工理论认为，偏见主要源于联想式的系统1思维，而审慎的系统2思维则能减轻偏见，但导致这一现象的认知机制仍不甚明晰。为更好地理解人类（以及可能的LLM）中这种双重性的本质，我们将系统1和系统2思维建模为具有不同结构的语义记忆网络，这些网络基于人类和LLM生成的同类数据集构建。随后，我们运用基于网络的评估指标，探究这些不同的语义记忆结构如何与隐性性别偏见相关联。研究发现，语义记忆结构的不可约简性仅存在于人类中，这表明LLM缺乏某些类人化的概念知识。此外，仅在人类中，语义记忆结构与隐性偏见始终存在关联，且系统2结构中的偏见水平较低。这些发现表明，特定类型的概念知识有助于人类进行偏见调控，但对LLM无效，从而凸显了人类认知与机器认知之间的根本差异。

摘要 (Abstract)

Implicit biases in both humans and large language models (LLMs) pose significant societal risks. Dual process theories propose that biases arise primarily from associative System 1 thinking, while deliberative System 2 thinking mitigates bias, but the cognitive mechanisms that give rise to this phenomenon remain poorly understood. To better understand what underlies this duality in humans, and possibly in LLMs, we model System 1 and System 2 thinking as semantic memory networks with distinct structures, built from comparable datasets generated by both humans and LLMs. We then investigate how these distinct semantic memory structures relate to implicit gender bias using network-based evaluation metrics. We find that semantic memory structures are irreducible only in humans, suggesting that LLMs lack certain types of human-like conceptual knowledge. Moreover, semantic memory structure relates consistently to implicit bias only in humans, with lower levels of bias in System 2 structures. These findings suggest that certain types of conceptual knowledge contribute to bias regulation in humans, but not in LLMs, highlighting fundamental differences between human and machine cognition.

Keywords: Large Language Models, System 1, System 2, semantic memory, implicit bias, dual process theory, cognitive mechanisms, human-machine cognition

14. ❌ TriFit: Trimodal Fusion with Protein Dynamics for Mutation Fitness Prediction

Authors: Seungik Cho Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12026v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

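Scoring rationale: TriFit focuses on mutation fitness prediction for proteins, which belongs to bioinformatics, so it is highly relevant to "AI for Science" (10 points). It uses ESM-2, a protein language model, to extract sequence embeddings, which gives it some relevance to "Large Language Models" (5 points). Its core contribution is a four-expert Mixture-of-Experts (MoE) fusion module, which is highly relevant to "Mixture of Experts" (10 points). Other keywords (SLMs, Scaling Laws, training methods, inference optimization, agent systems, and so on) are not addressed and score 0.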

!!! tip deepseek-chat TL;DR

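TriFit integrates sequence, structure, and protein dynamics through a four-expert Mixture-of-Experts fusion module to predict the functional impact of single amino acid substitutions. On the ProteinGym substitution benchmark it reaches AUROC 0.897, ahead of the supervised baselines Kermut (0.864) and ProteinNPT (0.844) and the best zero-shot model ESM3 (0.769).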

Abstract (Chinese translation)

预测单氨基酸取代（SAVs）的功能影响是理解遗传疾病和设计治疗性蛋白质的核心。虽然蛋白质语言模型和基于结构的方法在此任务上已取得优异表现，但它们系统性地忽略了蛋白质动力学；残基柔性、协同运动及变构耦合是结构生物学中公认的突变耐受性决定因素，却尚未被纳入监督式变异效应预测器中。本文提出TriFit——一个多模态框架，它通过四专家混合专家（MoE）融合模块结合三模态跨模态对比学习，整合了序列、结构与蛋白质动力学信息。序列嵌入通过ESM-2（650M）的掩蔽边际评分提取；结构嵌入来自AlphaFold2预测的C-alpha几何构型；动力学嵌入则源自高斯网络模型（GNM）的B因子、模态振型及残基-残基互相关数据。MoE路由器根据输入自适应加权模态组合，实现了无需固定模态假设的蛋白质特异性融合。在ProteinGym取代基准测试（217个DMS实验，69.6万个SAVs）中，TriFit取得了AUROC 0.897 +/- 0.0002，超越了所有监督基线模型（包括Kermut的0.864和ProteinNPT的0.844）以及最佳零样本模型ESM3（0.769）。消融实验证实，动力学特征在成对模态组合基础上提供了最大的边际贡献，且TriFit无需后验校正即可获得校准良好的概率输出（ECE = 0.044）。

摘要 (Abstract)

Predicting the functional impact of single amino acid substitutions (SAVs) is central to understanding genetic disease and engineering therapeutic proteins. While protein language models and structure-based methods have achieved strong performance on this task, they systematically neglect protein dynamics; residue flexibility, correlated motions, and allosteric coupling are well-established determinants of mutational tolerance in structural biology, yet have not been incorporated into supervised variant effect predictors. We present TriFit, a multimodal framework that integrates sequence, structure, and protein dynamics through a four-expert Mixture-of-Experts (MoE) fusion module with trimodal cross-modal contrastive learning. Sequence embeddings are extracted via masked marginal scoring with ESM-2 (650M); structural embeddings from AlphaFold2-predicted C-alpha geometries; and dynamics embeddings from Gaussian Network Model (GNM) B-factors, mode shapes, and residue-residue cross-correlations. The MoE router adaptively weights modality combinations conditioned on the input, enabling protein-specific fusion without fixed modality assumptions. On the ProteinGym substitution benchmark (217 DMS assays, 696k SAVs), TriFit achieves AUROC 0.897 +/- 0.0002, outperforming all supervised baselines including Kermut (0.864) and ProteinNPT (0.844), and the best zero-shot model ESM3 (0.769). Ablation studies confirm that dynamics provides the largest marginal contribution over pairwise modality combinations, and TriFit achieves well-calibrated probabilistic outputs (ECE = 0.044) without post-hoc correction.

Keywords: Mutation Fitness Prediction, Protein Dynamics, Mixture-of-Experts, Multimodal Fusion, ESM-2, AlphaFold2, Gaussian Network Model, ProteinGym
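The abstract says an MoE router adaptively weights modality combinations conditioned on the input. The sketch below shows the generic soft-gating pattern behind that idea: router logits go through a softmax and the expert outputs are combined with the resulting weights. It is a toy under stated assumptions, not TriFit's implementation. What each of the four experts consumes, how the router computes its logits, and all names and numbers here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_fuse(expert_outputs, router_logits):
    """Combine expert output vectors using softmax-normalized router weights."""
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * output[i] for w, output in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return fused, weights


# Four hypothetical experts, each emitting a 2-dimensional representation.
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
fused_uniform, weights_uniform = moe_fuse(expert_outputs, [0.0, 0.0, 0.0, 0.0])
fused_peaked, weights_peaked = moe_fuse(expert_outputs, [10.0, 0.0, 0.0, 0.0])
```

With equal logits the fusion reduces to a plain average of the experts, while a strongly peaked router makes the output follow a single expert. That is the sense in which fusion can differ from one protein to the next.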

15. ❌ Operationalising the Right to be Forgotten in LLMs: A Lightweight Sequential Unlearning Framework for Privacy-Aligned Deployment in Politically Sensitive Environments

Authors: Esen Kurt, Haithem Afli Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12459v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

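Scoring rationale: The paper's core topic is privacy-compliant deployment of LLMs in politically sensitive environments. It proposes a sequential unlearning framework and directly involves LLMs and fine-tuning. It is highly relevant to "Large Language Models" and "Post-training" (10 points each) because it explicitly studies LLMs and uses positive and negative fine-tuning. Other keywords such as MoE, SLMs, Scaling Laws, RAG, and Agents are not mentioned in the title or abstract and are unrelated to the paper (0 points).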

!!! tip deepseek-chat TL;DR

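To address privacy compliance when LLMs are deployed in politically sensitive environments, the paper proposes a lightweight sequential unlearning framework. Positive fine-tuning followed by layer-restricted negative fine-tuning suppresses designated sensitive information while preserving general model performance.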

Abstract (Chinese translation)

大型语言模型正日益部署于政治敏感环境中，其可能记忆个人数据或机密内容的特性引发了GDPR等监管框架下“被遗忘权”相关的合规担忧。将此类法律原则转化为适用于大规模生成式系统的技术方案，面临着显著挑战。
本文提出一种轻量级顺序遗忘框架，该框架明确区分知识保留与信息抑制目标。该方法首先通过正向微调稳定模型的良性能力，随后实施分层受限的负向微调，在保持通用语言能力的同时抑制指定的敏感信息模式。
在SemEval-2025 LLM遗忘基准测试中的实验表明，该方法能有效实现行为抑制，且对事实准确性与语言流畅性影响最小。GPT-2相比DistilGPT-2展现出更强的鲁棒性，这揭示了模型容量在隐私对齐适应中的关键作用。我们将顺序遗忘定位为一种可操作、可复现的机制，能够为政治场景中部署的LLM实现数据擦除需求提供实践路径。

摘要 (Abstract)

Large Language Models (LLMs) are increasingly deployed in politically sensitive environments, where memorisation of personal data or confidential content raises regulatory concerns under frameworks such as the GDPR and its Right to be Forgotten. Translating such legal principles into large-scale generative systems presents significant technical challenges. We introduce a lightweight sequential unlearning framework that explicitly separates retention and suppression objectives. The method first stabilises benign capabilities through positive fine-tuning, then applies layer-restricted negative fine-tuning to suppress designated sensitive patterns while preserving general language competence. Experiments on the SemEval-2025 LLM Unlearning benchmark demonstrate effective behavioural suppression with minimal impact on factual accuracy and fluency. GPT-2 exhibits greater robustness than DistilGPT-2, highlighting the role of model capacity in privacy-aligned adaptation. We position sequential unlearning as a practical and reproducible mechanism for operationalising data erasure requirements in politically deployed LLMs.

Keywords: Large Language Models, LLMs, unlearning, privacy, fine-tuning, GDPR, Right to be Forgotten, politically sensitive environments
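The two-stage recipe in the abstract (positive fine-tuning, then layer-restricted negative fine-tuning) can be pictured with a toy update rule. The sketch assumes that negative fine-tuning means gradient ascent on the loss of the data to be forgotten, which is a common unlearning approach but is not confirmed by the abstract. The parameter names, gradients, and learning rate are hypothetical placeholders.

```python
def sgd_step(params, grads, lr, sign=1.0, trainable=None):
    """One gradient step over named parameters.

    sign=+1.0 descends the loss (positive fine-tuning);
    sign=-1.0 ascends it (assumed form of negative fine-tuning).
    Names outside `trainable` stay frozen, which models the layer restriction.
    """
    updated = {}
    for name, value in params.items():
        if trainable is not None and name not in trainable:
            updated[name] = value  # frozen layer: left unchanged
        else:
            updated[name] = value - sign * lr * grads[name]
    return updated


params = {"layer0": 1.0, "layer1": 1.0}

# Stage 1: positive fine-tuning on retained data updates every layer.
retain_grads = {"layer0": 0.5, "layer1": 0.5}
after_positive = sgd_step(params, retain_grads, lr=0.1)

# Stage 2: negative fine-tuning on sensitive data touches only layer1.
forget_grads = {"layer0": 0.5, "layer1": 0.5}
after_negative = sgd_step(
    after_positive, forget_grads, lr=0.1, sign=-1.0, trainable={"layer1"}
)
```

Keeping the two stages as separate calls mirrors the abstract's point that retention and suppression are treated as distinct objectives.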

16. ❌ Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

Authors: Jiawei Fan, Shigeng Wang, Chao Li, Xiaolong Liu, Anbang Yao Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12391v1

Score: 18.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

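Scoring rationale: The paper proposes Chain-of-Models Pre-Training (CoM-PT), a new method for accelerating the training of vision foundation models (VFMs). Its core idea is to build a model chain ordered by ascending model size and to accelerate training through sequential inverse knowledge transfer from smaller models to larger ones. Relevance to the keywords: (1) highly relevant to "Pre-training" (10 points), because the method targets acceleration of the pre-training stage, which is the core of the paper; (2) relevant to "Large Language Models" OR "LLMs" OR "Foundation Models" (8 points), because although the paper mainly targets vision foundation models, the abstract notes that the method could be extended to large language model pre-training, and the term "Foundation Models" matches the paper's subject directly; (3) the other keywords (MoE, SFT, RAG, Quantization, and so on) have no direct connection to the paper and score 0.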

!!! tip deepseek-chat TL;DR

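The paper proposes Chain-of-Models Pre-Training (CoM-PT), which builds a model chain and uses sequential inverse knowledge transfer to accelerate the training of vision foundation models without performance loss. Its efficiency improves as the model family grows, with acceleration ratios of 4.13X, 5.68X, and 7.09X for families of 3, 4, and 7 models.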

Abstract (Chinese translation)

本文提出链式模型预训练（Chain-of-Models Pre-Training, CoM-PT），一种面向视觉基础模型（Vision Foundation Models, VFMs）的新型无损性能训练加速方法。该方法的核心动机与现有加速方法存在根本差异：其目标并非优化单个模型，而是在模型家族层面加速整个训练流程，并能随模型家族扩展高效扩展。具体而言，CoM-PT为模型家族建立按模型规模升序排列的预训练序列，称为模型链。在该链中，仅最小模型需进行标准的独立预训练，其余模型则通过联合复用参数空间与特征空间中的知识，从其较小的前驱模型进行序列化逆向知识迁移，从而实现高效训练。因此，CoM-PT使所有模型在显著降低训练成本的同时，获得普遍优于标准独立训练的性能，这一结论在涵盖零样本与微调任务的45个数据集上得到广泛验证。值得注意的是，其高效扩展特性带来一个显著现象：训练更多模型甚至能带来更高的效率。例如，在CC3M数据集上进行预训练时：i）以ViT-L作为最大模型，在模型链中逐步前置更小的模型可将计算复杂度降低高达72%；ii）在固定模型规模范围内，当VFM家族扩展至3、4、7个模型时，CoM-PT的加速比呈现出显著跃升：从4.13倍提升至5.68倍及7.09倍。由于CoM-PT天然与具体预训练范式无关，我们开源代码以促进其在计算密集型场景（如大语言模型预训练）中的进一步扩展。

摘要 (Abstract)

In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.

Keywords: Chain-of-Models Pre-Training, training acceleration, vision foundation models, model chain, inverse knowledge transfer, parameter space, feature space, computational efficiency
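The abstract quotes efficiency both as a percentage reduction (up to 72%) and as acceleration ratios (4.13X, 5.68X, 7.09X). The two forms are related by simple arithmetic, sketched below. This assumes both figures measure the same cost quantity. The abstract reports them for different experimental settings, so the conversion is a reading aid and not a claim about the paper's results. The function names are hypothetical.

```python
def speedup_from_reduction(reduction):
    """A fractional cost reduction r corresponds to a speedup of 1 / (1 - r)."""
    return 1.0 / (1.0 - reduction)


def reduction_from_speedup(speedup):
    """A speedup of s corresponds to a fractional cost reduction of 1 - 1 / s."""
    return 1.0 - 1.0 / speedup


# Figures quoted in the abstract, converted to the other form.
speedup_at_72_percent = speedup_from_reduction(0.72)
reductions = {s: reduction_from_speedup(s) for s in (4.13, 5.68, 7.09)}
```

Under that assumption a 72% reduction is roughly a 3.6X speedup, and the quoted ratios of 4.13X, 5.68X, and 7.09X correspond to cost reductions of roughly 76%, 82%, and 86%.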

17. ❌ ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

Authors: Boyang Li, Hongzhe Shou, Yuanyuan Liang, Jingbin Zhang, Fang Zhou Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12321v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

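Scoring rationale: The paper studies explainability methods for Chinese toxicity detection. It uses BERT-style encoders and mentions lightweight LLM guidance for refining saliency cues. It therefore has some relevance to "Large Language Models" (5 points) because of the LLM guidance, and is highly relevant to "Mechanistic Interpretability" (10 points) because its core is an explainable-AI method. Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, Agents, and Quantization are neither mentioned in the abstract nor relevant, so they score 0.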

!!! tip deepseek-chat TL;DR

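The paper proposes ToxiTrace, an explainability-oriented method for Chinese toxicity detection. Through gradient-aligned training and contrastive learning, it improves classification accuracy and toxic span extraction while producing more coherent, human-readable explanations.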

Abstract (Chinese translation)

现有中文有害内容检测方法主要针对句子级分类，但往往无法提供可读且连贯的有害证据片段。我们提出ToxiTrace，一种面向可解释性的BERT风格编码器方法，包含三个核心组件：（1）CuSA，在轻量化大语言模型引导下将编码器衍生的显著性线索细化为细粒度有害文本片段；（2）GCLoss，一种梯度约束目标函数，可将词元级显著性聚焦于有害证据同时抑制无关激活；（3）ARCL，通过构建样本特异性对比推理对来锐化有害与非有害内容之间的语义边界。实验表明，ToxiTrace在保持高效编码器推理能力的同时，提升了分类准确率与有害片段提取性能，并能生成更连贯、可人工解读的解释。模型已发布于https://huggingface.co/ArdLi/ToxiTrace。

摘要 (Abstract)

Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for BERT-style encoders with three components: (1) CuSA, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) ARCL, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations. We have released the model at https://huggingface.co/ArdLi/ToxiTrace.

Keywords: ToxiTrace, Chinese toxicity detection, explainability, gradient-aligned training, BERT-style encoders, toxic span extraction, contrastive learning, saliency cues
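The abstract describes GCLoss as an objective that concentrates token-level saliency on toxic evidence, without giving its formula. The sketch below shows one simple, hypothetical penalty with that property: one minus the share of saliency mass that falls inside the annotated toxic span. It is an illustration of the general idea only. The function name, the use of absolute saliency values, and the example numbers are assumptions, not the paper's loss.

```python
def saliency_concentration_loss(saliency, toxic_mask, eps=1e-12):
    """Penalty in [0, 1]: fraction of saliency mass lying outside the toxic span.

    saliency   -- per-token saliency scores (for example, gradient magnitudes)
    toxic_mask -- 1 for tokens inside the annotated toxic span, else 0
    """
    magnitudes = [abs(s) for s in saliency]
    total = sum(magnitudes) + eps  # eps guards against an all-zero saliency map
    inside = sum(m for m, is_toxic in zip(magnitudes, toxic_mask) if is_toxic)
    return 1.0 - inside / total


toxic_mask = [0, 1, 0]
focused = saliency_concentration_loss([0.1, 0.8, 0.1], toxic_mask)
diffuse = saliency_concentration_loss([0.4, 0.2, 0.4], toxic_mask)
```

The focused saliency map gets the lower penalty, so minimizing a term like this would push attribution toward the toxic span and away from irrelevant tokens.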

18. ❌ Evaluating the Limitations of Protein Sequence Representations for Parkinson’s Disease Classification

Authors: César Jesús Núñez-Prado, Grigori Sidorov, Liliana Chanona-Hernández Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11852v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0
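
The total above can be reproduced from the table. The following is a minimal sketch, assuming each row's score is weight × relevance (an inference from the rows shown, not a documented rule) and using the pass-mark formula from the report's configuration section (5.0 + 0.8 × total weight). The function and variable names are hypothetical.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as given in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum of per-keyword scores; assumes score = weight * relevance (an assumption)."""
    return sum(weight * relevance for weight, relevance in rows)


# (weight, relevance) pairs for this paper: one keyword at 5, one at 10, 25 at 0.
rows = [(1.0, 5.0), (1.0, 10.0)] + [(1.0, 0.0)] * 25

total = paper_score(rows)
threshold = pass_mark(sum(weight for weight, _ in rows))
passed = total >= threshold

print(total, round(threshold, 1), passed)
```

With 27 keywords of weight 1.0, a paper needs to score well on several keywords at once to pass, which is why single-topic papers such as this one fall short.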

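Scoring rationale: The paper evaluates protein sequence representations, including the protein language model ProtBERT, for Parkinson’s disease classification, which is a bioinformatics application. It is highly relevant to “AI for Science” OR “Bioinformatics” OR “Cheminformatics” (10 points) because it applies AI directly to a biomedical problem. It is partially relevant to “Large Language Models” OR “LLMs” OR “Foundation Models” (5 points) because ProtBERT is a Transformer-based protein language model, a variant of large language models applied to science. No other keyword (MoE, SLMs, scaling laws, training methods, inference optimization, agent systems, and so on) is addressed in the paper, so each scores 0.

!!! tip deepseek-chat TL;DR

The paper evaluates the limits of several protein sequence representations, including the protein language model ProtBERT, for Parkinson’s disease classification. It finds that primary sequence information alone has limited discriminative power and that richer biological features are needed for robust disease modeling.

Abstract (Chinese translation)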

由于帕金森病的多因素性质，寻找其可靠的分子生物标志物仍然面临挑战。尽管蛋白质序列是基础且广泛可用的生物信息来源，但其自身对于复杂疾病分类的判别能力尚不明确。本研究对仅从蛋白质一级序列衍生的多种表征进行了受控且无数据泄露的评估，包括氨基酸组成、k-mer、理化描述符、混合表征以及来自蛋白质语言模型的嵌入表示。所有评估均在嵌套分层交叉验证框架下进行，以确保性能估计的无偏性。最佳性能配置（ProtBERT + MLP）取得了0.704 +/- 0.028的F1分数和0.748 +/- 0.047的ROC-AUC值，表明其判别性能仅为中等水平。经典表征如k-mer达到了可比的F1值（最高约0.667），但表现出高度不平衡的行为：召回率接近0.98，而精确度约为0.50，反映出对阳性预测的强烈偏向。在所有表征中，性能差异均保持在较窄范围内（F1分数介于0.60至0.70之间），而无监督分析显示不存在与类别标签对齐的内在结构，统计检验（弗里德曼检验，p = 0.1749）也未表明模型间存在显著差异。这些结果证明类别间存在实质性重叠，并表明仅凭一级序列信息对帕金森病分类的判别能力有限。本研究建立了一个可复现的基线，并提供了实证证据，表明需要更具信息量的生物学特征（如结构、功能或基于相互作用的描述符）来构建稳健的疾病模型。

Abstract

The identification of reliable molecular biomarkers for Parkinson’s disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models, all assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. The best-performing configuration (ProtBERT + MLP) achieves an F1-score of 0.704 +/- 0.028 and ROC-AUC of 0.748 +/- 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reach comparable F1 values (up to approximately 0.667), but exhibit highly imbalanced behavior, with recall close to 0.98 and precision around 0.50, reflecting a strong bias toward positive predictions. Across representations, performance differences remain within a narrow range (F1 between 0.60 and 0.70), while unsupervised analyses reveal no intrinsic structure aligned with class labels, and statistical testing (Friedman test, p = 0.1749) does not indicate significant differences across models. These results demonstrate substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson’s disease classification. This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as structural, functional, or interaction-based descriptors, are required for robust disease modeling.
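
The k-mer figures quoted in the abstract are consistent with the standard F1 definition, the harmonic mean of precision and recall. The quick check below uses the rounded values from the abstract (precision about 0.50, recall about 0.98); the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded k-mer values quoted in the abstract.
kmer_f1 = f1_score(0.50, 0.98)
print(round(kmer_f1, 3))  # 0.662
```

The result sits close to the reported maximum of roughly 0.667; the small gap is plausibly due to rounding of the quoted precision and recall. A precision near 0.50 with recall near 0.98 is what a classifier produces when it labels almost every sample positive, which matches the bias the abstract describes.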

Keywords: Parkinson’s disease classification, protein sequence representations, protein language models, ProtBERT, bioinformatics, machine learning, biomarker identification, cross-validation

19. ❌ Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport

Authors: Rui Wang, Yi Zheng, Dongxin Wang, Haiping Huang, Yuanzhi Yao, Yuxiang Zhou, Jialin Yu, Philip Torr Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12663v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

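Scoring rationale: The paper proposes GCTM-OT, a goal-prompted contrastive topic model that uses LLMs. Its core contribution is integrating a human-provided goal into topic modeling: an LLM extracts goal candidates, which then feed contrastive learning via optimal transport. The paper explicitly mentions “LLM-based approaches” and “LLM-based prompting”, so it is highly relevant to “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points). No other keyword (MoE, SLMs, scaling laws, training techniques such as pre-training, fine-tuning, and alignment, inference optimization, agent systems, model compression, AI for science, and so on) is addressed in the paper, so each scores 0.

!!! tip deepseek-chat TL;DR

Existing topic modeling methods, including LLM-based ones, tend to produce redundant topics that drift from the user’s intent. The paper introduces the Human-centric Topic Modeling task and the GCTM-OT method, which extracts goal candidates through LLM prompting and combines them with optimal-transport-based contrastive learning. On three public datasets it reports better topic coherence, diversity, and goal alignment.

Abstract (Chinese translation)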

现有主题建模方法（从LDA到近期基于神经网络与大语言模型的技术）主要关注统计连贯性，往往产生冗余或偏离目标的话题，未能捕捉用户的深层意图。我们提出人本主题建模（Human-centric Topic Modeling, Human-TM），这是一种新颖的任务框架，将人类提供的目标直接整合到主题建模过程中，以生成可解释、多样化且面向目标的话题。为应对这一挑战，我们提出基于最优传输的目标提示对比主题模型（Goal-prompted Contrastive Topic Model with Optimal Transport, GCTM-OT）。该方法首先利用基于大语言模型的提示技术从文档中提取目标候选，随后通过最优传输将其融入语义感知的对比学习框架中进行主题发现。在三个公开的Reddit子论坛数据集上的实验结果表明，GCTM-OT在主题连贯性与多样性方面优于当前最先进的基线模型，同时显著提升了与人类设定目标的契合度，为人本主题发现系统的未来发展开辟了新路径。

Abstract

Existing topic modeling methods, from LDA to recent neural and LLM-based approaches, which focus mainly on statistical coherence, often produce redundant or off-target topics that miss the user’s underlying intent. We introduce Human-centric Topic Modeling (Human-TM), a novel task formulation that integrates a human-provided goal directly into the topic modeling process to produce interpretable, diverse and goal-oriented topics. To tackle this challenge, we propose the Goal-prompted Contrastive Topic Model with Optimal Transport (GCTM-OT), which first uses LLM-based prompting to extract goal candidates from documents, then incorporates these into semantic-aware contrastive learning via optimal transport for topic discovery. Experimental results on three public subreddit datasets show that GCTM-OT outperforms state-of-the-art baselines in topic coherence and diversity while significantly improving alignment with human-provided goals, paving the way for more human-centric topic discovery systems.
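
The abstract does not say how the optimal transport step is computed or which two sets it matches. As a generic illustration only, the sketch below runs entropy-regularized optimal transport (Sinkhorn iterations) on a toy cost matrix. The cost values, marginals, `eps`, and all names are assumptions and are not taken from GCTM-OT.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, iters=200):
    """Entropy-regularized OT plan between marginals a (rows) and b (columns)."""
    n, m = len(a), len(b)
    # Gibbs kernel: cheaper pairs get larger kernel values.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: 2 source items, 3 target items, uniform mass on each side.
cost = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[1 / 3, 1 / 3, 1 / 3])
```

Each entry `plan[i][j]` is the mass moved from item `i` to item `j`; low-cost pairs receive more mass, which is the soft matching signal a contrastive objective could then use.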

Keywords: Human-centric Topic Modeling, Goal-prompted Contrastive Learning, Optimal Transport, LLM-based prompting, Topic Discovery, Topic Coherence, Goal Alignment, Contrastive Topic Model

20. ❌ GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

Authors: Zhaochen Liu, Limeng Qiao, Guanglu Wan, Tingting Jiang Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12630v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

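Scoring rationale: The paper focuses on improving multimodal large language models (MLLMs) on spatial reasoning tasks; its core contribution is dynamically aggregating multi-layer geometric features to align them with task demands. It is therefore highly relevant only to “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points), because MLLMs extend LLMs and the paper draws directly on 3D foundation models. The other keywords (MoE, SLMs, Scaling Laws, Pre-training, Post-training, Alignment, RLHF, PEFT, RAG, Context Window, KV Cache, Reasoning, Agents, Quantization, Speculative Decoding, Hallucination, Interpretability, World Models, Model Merging, In-context Learning, AI for Science, and so on) are not mentioned or discussed in the paper, so each scores 0.

!!! tip deepseek-chat TL;DR

MLLMs are held back on spatial reasoning because injected geometric features do not match task demands. The paper proposes GeoAlign, a framework that dynamically aggregates multi-layer geometric features to realign them, allowing a compact 4B model to reach state-of-the-art performance on several benchmarks.

Abstract (Chinese translation)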

多模态大语言模型（MLLMs）在各种视觉任务中展现出卓越性能，但在空间推理方面仍面临挑战。近期研究尝试通过引入来自三维基础模型的几何特征来缓解此问题，但这些方法依赖于静态的单层特征提取。我们发现，此类方法会引发任务对齐偏差：几何特征自然趋向于三维预训练目标，这可能与多模态大语言模型多样化的空间需求相矛盾，使得任何单层特征本质上都无法满足要求。为解决这一问题，我们提出GeoAlign——一种动态聚合多层几何特征以重新对齐实际需求的新框架。GeoAlign构建了分层几何特征库，并利用多模态大语言模型原有的视觉标记作为内容感知查询，执行分层稀疏路由，自适应地为每个图像块提取合适的几何特征。在VSI-Bench、ScanQA和SQA3D数据集上的大量实验表明，我们仅40亿参数的紧凑模型有效实现了最先进的性能，甚至超越了现有规模更大的多模态大语言模型。

Abstract

Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM’s original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model effectively achieves state-of-the-art performance, even outperforming larger existing MLLMs.
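
The layer-wise sparse routing described in the abstract can be pictured with a toy example. The sketch below is not GeoAlign's implementation: it assumes a dot-product score between a patch's visual token (the query) and each layer's geometric feature, a top-k selection over layers, and a softmax blend of the selected features. All names, dimensions, and the value of `k` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route_patch(query, layer_feats, k=2):
    """Select the k layers whose features best match the query and blend them."""
    scores = [dot(query, feat) for feat in layer_feats]
    # Sparse routing: keep only the top-k scoring layers for this patch.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax over the selected layers only.
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    norm = sum(exps.values())
    weights = {i: e / norm for i, e in exps.items()}
    dim = len(layer_feats[0])
    fused = [sum(weights[i] * layer_feats[i][d] for i in top) for d in range(dim)]
    return fused, weights


# One patch, a feature bank with three layers, two-dimensional features.
query = [1.0, 0.0]
bank = [[0.1, 0.9], [0.8, 0.2], [1.0, 0.0]]
fused, weights = route_patch(query, bank, k=2)
```

Because the selection runs per patch, different image regions can draw on different layers, which is the behavior the abstract contrasts with static single-layer extraction.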

Keywords: Multimodal Large Language Models, MLLMs, Spatial Reasoning, Geometric Features, Feature Realignment, 3D Foundation Models, Hierarchical Feature Bank, State-of-the-art Performance

Authors: Hao Wang, Jiqing Zhang, Xin Yang, Baocai Yin, Lu Jiang, Zetian Mi, Huibing Wang Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12380v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

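Scoring rationale: The paper addresses camouflaged object detection (COD) in computer vision and proposes a multi-modal prompt learning framework built on the Segment Anything Model (SAM). Its main technique is parameter-efficient adaptation of a vision model (SAM), so it is highly relevant to “PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning” (10 points); the paper explicitly presents a parameter-efficient adaptation method. All other keywords are unrelated (0 points): the paper does not cover large language models (LLMs), MoE, scaling laws, alignment, reasoning, agents, quantization, or similar topics, nor does it directly involve AI-for-science applications such as bioinformatics.

!!! tip deepseek-chat TL;DR

The paper proposes a modality-agnostic prompt learning framework for multi-modal camouflaged object detection. It adapts the Segment Anything Model (SAM) in a parameter-efficient way and adds a Mask Refine Module, and it reports significantly improved detection performance.

Abstract (Chinese translation)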

伪装目标检测（COD）旨在分割那些与复杂背景无缝融合的目标，当前研究日益关注利用额外的视觉模态，通过互补信息提升检测的鲁棒性。然而，现有方法大多依赖特定模态的架构或定制化的融合策略，这限制了其可扩展性与跨模态泛化能力。为此，我们提出了一种新颖的框架，为分割一切模型（Segment Anything Model，简称SAM）生成与模态无关的多模态提示，从而能够以参数高效的方式适应任意辅助模态，并显著提升COD任务的整体性能。具体而言，我们通过数据驱动的内容域与知识驱动的提示域之间的交互来建模多模态学习，将任务相关的线索提炼为统一的提示，用于SAM解码。我们进一步引入一个轻量级的掩码优化模块，通过融入细粒度的提示线索来校准粗粒度预测，从而获得更精确的伪装目标边界。在RGB-深度、RGB-热成像和RGB-偏振等多个基准数据集上的大量实验验证了我们所提出的模态无关框架的有效性与泛化能力。

Abstract

Camouflaged Object Detection (COD) aims to segment objects that blend seamlessly into complex backgrounds, with growing interest in exploiting additional visual modalities to enhance robustness through complementary information. However, most existing approaches generally rely on modality-specific architectures or customized fusion strategies, which limit scalability and cross-modal generalization. To address this, we propose a novel framework that generates modality-agnostic multi-modal prompts for the Segment Anything Model (SAM), enabling parameter-efficient adaptation to arbitrary auxiliary modalities and significantly improving overall performance on COD tasks. Specifically, we model multi-modal learning through interactions between a data-driven content domain and a knowledge-driven prompt domain, distilling task-relevant cues into unified prompts for SAM decoding. We further introduce a lightweight Mask Refine Module to calibrate coarse predictions by incorporating fine-grained prompt cues, leading to more accurate camouflaged object boundaries. Extensive experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks validate the effectiveness and generalization of our modality-agnostic framework.

Keywords: Camouflaged Object Detection, Multi-modal, Modality-agnostic, Prompt Learning, Segment Anything Model, Parameter-efficient Adaptation, Mask Refine Module, RGB-Depth-Thermal-Polarization

22. ❌ Adaptive Budget Allocation in LLM-Augmented Surveys

Authors: Zikun Ye, Jiameng Lyu, Rui Tao Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12497v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

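Scoring rationale: The paper studies LLMs used to generate survey responses and proposes an adaptive budget allocation algorithm that decides where to spend limited human verification effort. It is therefore highly relevant only to “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points), since LLMs are the core tool and setting of the study. The other keywords concern model architecture, training methods, inference techniques, or specific application domains (such as bioinformatics), none of which the paper addresses directly, so each scores 0.

!!! tip deepseek-chat TL;DR

The paper studies how to allocate a limited human-labeling budget in real time across the questions of an LLM-augmented survey when the LLM’s per-question reliability is unknown. On real survey data, its adaptive algorithm cuts budget waste relative to the optimal allocation from 10–12% under uniform allocation to 2–6%.

Abstract (Chinese translation)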

大型语言模型（LLM）能够以低成本生成调查问卷的回复，但其可靠性在不同问题间差异显著，且在数据收集前无法预知。在调查中部署LLM仍需耗费高昂的人力响应进行验证与修正。应如何在问题间实时分配有限的人工标注预算？我们提出一种自适应分配算法，该算法能够在收集人工响应的同时，学习哪些问题对LLM而言最为困难。每个人工标注都扮演双重角色：既改进对该问题的估计，又揭示LLM在该问题上预测人工响应的准确程度。该算法将更多预算导向LLM可靠性最低的问题，且无需任何关于问题层面LLM准确性的先验知识。我们证明，随着预算增长，相对于最佳可能分配的分配差距趋近于零，并在合成数据以及一个包含68个问题、超过2000名受访者的真实调查数据集上验证了该方法。在真实调查数据上，传统均匀分配人工标注的做法相较于最优分配浪费了10–12%的预算；我们的算法将浪费降至2–6%，且当问题间在LLM预测质量上异质性更强时，其优势进一步扩大。该算法以更少的人工样本实现了与传统均匀抽样相同的估计质量，无需预实验，并得到经真实调查数据验证的正式性能保证。更广泛而言，本框架适用于任何需要在LLM可靠性未知的各项任务间分配稀缺人工监督的场景。

Abstract

Large language models (LLMs) can generate survey responses at low cost, but their reliability varies substantially across questions and is unknown before data collection. Deploying LLMs in surveys still requires costly human responses for verification and correction. How should a limited human-labeling budget be allocated across questions in real time? We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and validate the approach on both synthetic data and a real survey dataset with 68 questions and over 2000 respondents. On real survey data, the standard practice of allocating human labels uniformly across questions wastes 10–12% of the budget relative to the optimal; our algorithm reduces this waste to 2–6%, and the advantage grows as questions become more heterogeneous in LLM prediction quality. The algorithm achieves the same estimation quality as traditional uniform sampling with fewer human samples, requires no pilot study, and is backed by formal performance guarantees validated on real survey data. More broadly, the framework applies whenever scarce human oversight must be allocated across tasks where LLM reliability is unknown.
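
The abstract does not give the allocation rule itself. To make the idea concrete, the sketch below uses a simple stand-in: it tracks the residual (human answer minus LLM prediction) for each question and always spends the next label where the estimated variance reduction is largest, which approximates Neyman-style allocation. This greedy rule, the warm-up size, the Gaussian noise model, and all names are assumptions, not the authors' algorithm.

```python
import random
import statistics


def allocate(llm_pred, draw_human, budget, warmup=5):
    """Greedy sequential allocation of human labels across questions."""
    n_questions = len(llm_pred)
    # Warm-up: a few labels per question so a variance estimate exists.
    residuals = [
        [draw_human(j) - llm_pred[j] for _ in range(warmup)]
        for j in range(n_questions)
    ]

    def gain(j):
        # Drop in var/n when going from n to n + 1 labels on question j.
        n = len(residuals[j])
        return statistics.variance(residuals[j]) / (n * (n + 1))

    for _ in range(budget - warmup * n_questions):
        j = max(range(n_questions), key=gain)
        residuals[j].append(draw_human(j) - llm_pred[j])
    return [len(r) for r in residuals]


rng = random.Random(0)
noise = [0.1, 1.0]  # hypothetical: the LLM is reliable on question 0, poor on question 1
counts = allocate([0.0, 0.0], lambda j: rng.gauss(0.0, noise[j]), budget=200)
print(counts)
```

Each label does double duty here, as in the abstract: it refines the estimate for its question and updates the picture of how far the LLM is from human answers on that question.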

Keywords: Large Language Models, LLMs, survey responses, adaptive allocation, human-labeling budget, reliability, optimal allocation, estimation quality

23. ❌ StructDiff: A Structure-Preserving and Spatially Controllable Diffusion Model for Single-Image Generation

Authors: Yinxi He, Kang Liao, Chunyu Lin, Tianyi Wei, Yao Zhao Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12575v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

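Scoring rationale: The paper presents StructDiff, a diffusion-based method for single-image generation whose core contributions are structure preservation and spatial controllability; it belongs to computer vision and generative modeling. It is unrelated to nearly all keywords (those covering large-model fundamentals, training methods, inference optimization, alignment, agents, and so on). The only point of contact is that the abstract proposes using large language models (LLMs) as a new evaluation criterion for single-image generation, so “Large Language Models” OR “LLMs” OR “Foundation Models” receives 5 points (some relevance, but not the paper’s core).

!!! tip deepseek-chat TL;DR

The paper proposes StructDiff, a single-image generation framework built on a single-scale diffusion model. An adaptive receptive field module and 3D positional encoding address the weak structure preservation and spatial controllability of existing methods, and the paper reports strong performance on several downstream tasks.

Abstract (Chinese translation)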

本文提出StructDiff，一种基于单尺度扩散模型的单图像生成框架。单图像生成旨在通过捕捉源图像的内部统计特征（无需依赖外部数据），合成具有相似视觉内容的多样化样本。然而，现有方法往往难以保持结构布局，特别是对于包含大型刚性物体或严格空间约束的图像。此外，大多数方法缺乏空间可控性，难以引导生成内容的结构或位置。为解决这些挑战，StructDiff引入了自适应感受野模块以同时保持全局与局部分布。在此基础上，StructDiff将三维位置编码作为空间先验，从而能够灵活控制生成对象的位置、尺度及局部细节。据我们所知，这种空间控制能力是首次在单图像生成中探索基于位置编码的操控方法。此外，我们提出了一种基于大语言模型的新型单图像生成评估标准，该标准专门针对现有客观指标的局限性以及用户研究的高人力成本问题。StructDiff还展示了在下游任务中的广泛适用性，例如文本引导图像生成、图像编辑、外绘和绘画到图像合成。大量实验表明，StructDiff在结构一致性、视觉质量和空间可控性方面均优于现有方法。项目页面详见https://butter-crab.github.io/StructDiff/。

Abstract

This paper introduces StructDiff, a generative framework based on a single-scale diffusion model for single-image generation. Single-image generation aims to synthesize diverse samples with similar visual content to the source image by capturing its internal statistics, without relying on external data. However, existing methods often struggle to preserve the structural layout, especially for images with large rigid objects or strict spatial constraints. Moreover, most approaches lack spatial controllability, making it difficult to guide the structure or placement of generated content. To address these challenges, StructDiff introduces an adaptive receptive field module to maintain both global and local distributions. Building on this foundation, StructDiff incorporates 3D positional encoding (PE) as a spatial prior, allowing flexible control over positions, scale, and local details of generated objects. To our knowledge, this spatial control capability represents the first exploration of PE-based manipulation in single-image generation. Furthermore, we propose a novel evaluation criterion for single-image generation based on large language models (LLMs). This criterion specifically addresses the limitations of existing objective metrics and the high labor costs associated with user studies. StructDiff also demonstrates broad applicability across downstream tasks, such as text-guided image generation, image editing, outpainting, and paint-to-image synthesis. Extensive experiments demonstrate that StructDiff outperforms existing methods in structural consistency, visual quality, and spatial controllability. The project page is available at https://butter-crab.github.io/StructDiff/.

Keywords: single-image generation, diffusion model, structure-preserving, spatial controllability, adaptive receptive field, 3D positional encoding, evaluation criterion, downstream tasks

24. ❌ PAL: Personal Adaptive Learner

Authors: Megha Chakraborty, Darssan L. Eswaramoorthi, Madhur Thareja, Het Riteshkumar Shah, Finlay Palmer, Aryaman Bahl, Michelle A Ihetu, Amit Sheth Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13017v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

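Scoring rationale: PAL is an AI-driven education platform that delivers real-time personalized learning through multimodal content analysis and adaptive decision-making. The scoring keywords all target concrete technical aspects of large models and deep learning (fundamentals, training methods, inference optimization, alignment techniques, application frameworks). The paper names no specific large-model technique (such as LLMs, MoE, or RLHF), training method, or low-level optimization, and it does not involve AI applications in science (such as bioinformatics). It refers only in general terms to an “AI-powered platform” and “AI-driven education”, with no direct technical link to any keyword, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

AI education platforms are largely limited to static personalization. The paper presents PAL, a platform that turns lecture videos into interactive learning experiences through multimodal content analysis and adaptive decision-making, providing real-time personalized learning support.

Abstract (Chinese translation)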

人工智能驱动的教育平台在个性化方面已取得一定进展，但多数仍局限于静态适应模式——如预设测验、统一进度或通用反馈——这限制了其响应学习者动态理解能力的发展。这一不足凸显了对兼具情境感知与实时适应能力的系统的需求。本文介绍PAL（Personal Adaptive Learner，个性化自适应学习者），一个将讲座视频转化为互动学习体验的人工智能平台。PAL持续分析多模态讲座内容，在教学进程中通过动态调整难度的问题与学习者互动，并依据其回答实时适配。在每个学习阶段结束时，PAL会生成个性化总结，在强化核心概念的同时，根据学习者的兴趣定制示例。通过融合多模态内容分析与自适应决策机制，PAL为响应式数字学习贡献了一个创新框架。我们的研究表明，人工智能如何突破静态个性化局限，转向实时个体化支持，从而应对智能教育领域的核心挑战。

Abstract

AI-driven education platforms have made some progress in personalisation, yet most remain constrained to static adaptation (predefined quizzes, uniform pacing, or generic feedback), limiting their ability to respond to learners’ evolving understanding. This shortfall highlights the need for systems that are both context-aware and adaptive in real time. We introduce PAL (Personal Adaptive Learner), an AI-powered platform that transforms lecture videos into interactive learning experiences. PAL continuously analyzes multimodal lecture content and dynamically engages learners through questions of varying difficulty, adjusting to their responses as the lesson unfolds. At the end of a session, PAL generates a personalized summary that reinforces key concepts while tailoring examples to the learner’s interests. By uniting multimodal content analysis with adaptive decision-making, PAL contributes a novel framework for responsive digital learning. Our work demonstrates how AI can move beyond static personalization toward real-time, individualized support, addressing a core challenge in AI-enabled education.

Keywords: AI-driven education, personalized learning, adaptive learning, multimodal content analysis, real-time adaptation, interactive learning, personalized summary, digital learning platform

25. ❌ Visual Preference Optimization with Rubric Rewards

Authors: Ya-Qi Yu, Fangyu Hong, Xiangyang Qu, Hao Wang, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13029v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

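Scoring rationale: The paper's core contribution is applying Direct Preference Optimization (DPO) to vision tasks through the proposed rDPO framework, so it is highly relevant to 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (10 points). It involves preference optimization, instruction following, and fine-tuning, giving it some relevance to 'Post-training OR Supervised Fine-tuning OR SFT' and 'Instruction Tuning OR Alignment OR Value Alignment' (5 points each). The abstract uses large models as judges (a 30B-A3B judge compared against GPT-5.4), an indirect link to 'Large Language Models OR LLMs OR Foundation Models' (5 points). Other keywords such as MoE, quantization, and RAG do not appear in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses the reliance of existing DPO pipelines for vision tasks on coarse-grained feedback. It proposes rDPO, a framework built on instance-specific rubrics that combines on-policy data construction with fine-grained, criterion-level feedback, and reports 61.01 on a comprehensive benchmark versus 59.48 for the base model.

Abstract (Chinese translation)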

直接偏好优化（DPO）的有效性依赖于能够反映多模态任务中关键质量差异的偏好数据。现有流程通常依赖于离策略扰动或基于结果的粗粒度信号，这些方法并不适用于细粒度的视觉推理。我们提出了rDPO，一个基于实例特定量规的偏好优化框架。针对每个图像-指令对，我们创建一个清单式量规，包含必要和附加标准，用以评估来自任何可能策略的响应。指令-量规池离线构建，并在策略内数据构建过程中重复使用。在公开的奖励建模基准测试中，基于量规的提示方法显著改进了30B-A3B评判模型，使其性能接近GPT-5.4。在公开的下游基准测试中，基于量规的过滤将宏观平均值提升至82.69，而基于结果的过滤则将其从81.14降至75.82。在综合基准上评估可扩展性时，rDPO达到了61.01，显著优于风格受限基线（52.36），并超越了59.48的基础模型。这些结果表明，视觉偏好优化受益于将策略内数据构建与实例特定的标准级反馈相结合。

Abstract

The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
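For orientation, the sketch below shows the standard DPO loss that rDPO builds on, plus one way rubric scores could select a preference pair from on-policy samples. The scoring rule in `rubric_score` (essential criteria act as a gate, additional criteria add credit) and the best-versus-worst pairing in `build_pair` are illustrative assumptions; the abstract does not specify either, so this is not the paper's procedure.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def rubric_score(essential_met, additional_met):
    """Hypothetical checklist score: 0 if any essential criterion fails."""
    if not all(essential_met):
        return 0.0
    return 1.0 + sum(additional_met) / max(len(additional_met), 1)

def build_pair(scored_responses):
    """Pick (chosen, rejected) as best and worst scoring samples; None on a tie."""
    best = max(scored_responses, key=lambda r: r[1])
    worst = min(scored_responses, key=lambda r: r[1])
    return None if best[1] == worst[1] else (best[0], worst[0])
```

Returning `None` on a tie reflects the idea that a pair carries no preference signal when the rubric cannot separate the responses.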

Keywords: Direct Preference Optimization, DPO, visual preference optimization, rubric-based feedback, multimodal tasks, fine-grained visual reasoning, on-policy data, rDPO

26. ❌ Representation geometry shapes task performance in vision-language modeling for CT enterography

Authors: Cristian Minoccheri, Emily Wittrup, Kayvan Najarian, Ryan Stidham Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13021v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

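Scoring rationale: The paper focuses on vision-language modeling for medical imaging (CT enterography). It relies mainly on LoRA (parameter-efficient fine-tuning) and RAG (retrieval-augmented generation), and is an AI for Science application (scored under the bioinformatics keyword). Other keywords, such as core large-model techniques, reasoning methods, and alignment training, are not covered.

!!! tip deepseek-chat TL;DR

The study is the first to apply vision-language transfer learning to abdominal CT enterography. It finds that mean pooling of slice embeddings beats attention pooling for disease classification, that multi-window RGB encoding beats multiplanar sampling, and that retrieval-augmented generation (RAG) improves report generation by 7–14 percentage points over the chance baseline.

Abstract (Chinese translation)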

计算机断层扫描（CT）小肠造影是评估炎症性肠病（IBD）的主要影像学手段，但何种表征选择最有利于该模态的自动化分析尚不明确。本研究首次针对腹部CT小肠造影进行视觉-语言迁移学习探索，并得出两项主要发现。首先，切片嵌入的平均池化能提供更好的疾病分类评估（59.2%的三分类准确率），而注意力池化则更有利于跨模态检索（文本到图像MRR为0.235）。该模式在所有测试的LoRA配置中均成立，表明两种聚合器强调了学习表征的不同特性。其次，单切片组织对比度比广泛的空间覆盖更为重要：将互补亨氏单位（Hounsfield Unit）窗口映射到RGB通道的多窗RGB编码策略，优于所有通过多平面采样增加空间覆盖的方法，且在此设置中添加冠状面和矢状面视图反而会降低分类性能。在报告生成任务中，未使用检索上下文的微调模型在严重程度分级上（within-1 accuracy）仅达到与患病率匹配的随机基线水平（70.4% vs. 71%随机），表明模型未能学习到超越类别分布的排序信息。检索增强生成（RAG）方法在所有配置中均提升了性能，评分较随机基线提高7-14个百分点，并将序数平均绝对误差（MAE）从0.98降低至0.80-0.89。本研究采用三教师伪标签框架，使得所有比较均无需专家标注。这些发现共同为这一尚未充分探索的影像模态建立了首批基线，并为构建面向容积医学影像的视觉-语言系统提供了实用指导。

Abstract

Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7–14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80–0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
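To make the two representational choices concrete, here is a minimal pure-Python sketch of multi-window encoding (three Hounsfield Unit windows mapped to three channels) and of mean versus attention pooling over slice embeddings. The window centers and widths in `WINDOWS` are placeholder assumptions, not the values used in the paper, and the attention scores are passed in directly, whereas a real model would learn them.

```python
import math

# (center, width) per channel in Hounsfield Units; values are assumed for illustration
WINDOWS = [(40, 400), (50, 150), (-600, 1500)]

def apply_window(hu, center, width):
    """Clip an HU value to the window and rescale it to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    return (min(max(hu, lo), hi) - lo) / (hi - lo)

def encode_pixel(hu):
    """Map one HU value to an (R, G, B) triple, one window per channel."""
    return tuple(apply_window(hu, c, w) for c, w in WINDOWS)

def mean_pool(slice_embeddings):
    """Average slice embeddings dimension by dimension."""
    n = len(slice_embeddings)
    return [sum(col) / n for col in zip(*slice_embeddings)]

def attention_pool(slice_embeddings, scores):
    """Softmax-weighted average of slice embeddings given one score per slice."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v for w, v in zip(weights, col)) / total
            for col in zip(*slice_embeddings)]
```

With equal scores, `attention_pool` reduces to `mean_pool`, so the two aggregators differ only in how unevenly the slices are weighted.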

Keywords: CT enterography, vision-language modeling, LoRA, retrieval-augmented generation, medical imaging, inflammatory bowel disease, transfer learning, representation geometry

27. ❌ Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Authors: Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13016v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

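Scoring rationale: The paper's core topic is on-policy distillation (OPD), a post-training technique for large language models, so it is highly relevant to 'Large Language Models' and 'Post-training' (10 points each). It touches on model alignment and knowledge transfer, giving it some relevance to 'Instruction Tuning OR Alignment' (5 points). Other keywords such as MoE, SLMs, Scaling Laws, RLHF, and RAG are either not mentioned in the abstract or unrelated to the paper's topic and are scored 0.

!!! tip deepseek-chat TL;DR

The paper systematically studies the training dynamics and mechanisms of on-policy distillation for large language models. It finds that successful distillation requires the student and teacher to share compatible thinking patterns and the teacher to offer genuinely new capabilities, and it proposes two practical strategies for recovering failing distillation runs.

Abstract (Chinese translation)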

同策略蒸馏已成为大语言模型后训练的核心技术，但其训练动态机制尚未得到充分理解。本文对同策略蒸馏的动态过程与机制进行了系统性研究。我们首先发现决定同策略蒸馏成败的两个关键条件：（一）学生模型与教师模型应具备兼容的思维模式；（二）即使思维模式一致且教师评分更高，教师仍需提供学生训练过程中未曾接触的真正新能力。我们通过弱到强逆向蒸馏验证了这些发现，表明从学生模型视角来看，同家族的1.5B与7B参数教师在分布上不可区分。在词元层面机制的探究中，我们发现成功的同策略蒸馏表现为：在学生访问状态的高概率词元上实现渐进式对齐，这一小规模共享词元集合集中了绝大部分概率质量（97%-99%）。我们进一步提出两种实用策略来挽救失败的同策略蒸馏：离策略冷启动和教师对齐提示选择。最后，我们揭示同策略蒸馏表面上的密集词元级奖励“免费午餐”实则存在代价，这引发了同策略蒸馏能否扩展到长程蒸馏场景的根本性问题。

Abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student’s perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD’s apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
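The token-level observation can be illustrated with a small sketch that measures how much student probability mass falls on tokens ranked in the top-k by both student and teacher at a single state. It also computes a per-state reverse KL, assuming the commonly used reverse-KL form of on-policy distillation; the abstract does not state the exact objective, and the function names and toy distributions are hypothetical.

```python
import math

def top_k_tokens(dist, k):
    """Return the k most probable tokens of a {token: prob} distribution."""
    return set(sorted(dist, key=dist.get, reverse=True)[:k])

def shared_mass(student, teacher, k):
    """Student probability mass on tokens that are top-k for both models."""
    shared = top_k_tokens(student, k) & top_k_tokens(teacher, k)
    return sum(student[t] for t in shared)

def reverse_kl(student, teacher):
    """KL(student || teacher) at one state; assumes teacher[t] > 0 wherever student[t] > 0."""
    return sum(p * math.log(p / teacher[t]) for t, p in student.items() if p > 0)

# Toy next-token distributions at one student-visited state
student = {"a": 0.7, "b": 0.2, "c": 0.1}
teacher = {"a": 0.6, "c": 0.3, "b": 0.1}
overlap = shared_mass(student, teacher, k=2)
```

Tracking `shared_mass` over training steps would be one way to observe the progressive alignment the abstract describes.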

Keywords: On-policy Distillation, Large Language Models, Post-training, Training Dynamics, Knowledge Distillation, Teacher-Student Models, Token-level Mechanism, Weak-to-Strong Distillation

28. ❌ Bilevel Late Acceptance Hill Climbing for the Electric Capacitated Vehicle Routing Problem

Authors: Yinghao Qin, Mosab Bazargani, Edmund K. Burke, Carlos A. Coello Coello, Zhongmin Song, Jun Chen Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13013v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

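Scoring rationale: The paper studies the Electric Capacitated Vehicle Routing Problem (E-CVRP) from operations research and proposes a bilevel optimization framework and the b-LAHC algorithm, which belong to classical combinatorial optimization and heuristic search. It does not involve large models, deep learning, AI technical principles, or AI for Science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a bilevel optimization framework and the b-LAHC algorithm for the Electric Capacitated Vehicle Routing Problem (E-CVRP). On the IEEE WCCI-2020 benchmark it performs better than or on par with eight existing algorithms and sets 9 new best-known results.

Abstract (Chinese translation)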

本文通过一种双层优化框架处理电动容量约束车辆路径问题（E-CVRP），该框架根据搜索阶段将路径规划与充电决策分别或协同处理。通过分析两者间的交互作用，我们在上层引入了一个代理目标函数以引导搜索并加速收敛。本文提出了一种双层延迟接受爬山算法（b-LAHC），其运行包含三个阶段：贪婪下降、邻域探索以及最终解优化。b-LAHC采用固定参数运行，无需复杂自适应调整，同时保持轻量高效。在IEEE WCCI-2020基准测试上的大量实验表明，相较于八种先进算法，b-LAHC取得了更优或具有竞争力的性能。在固定评估预算下，该算法在小规模算例上获得了接近最优的解，并在大规模基准测试中创造了9/10项新的最佳已知结果，将现有记录平均提升了1.07%。此外，代理目标函数与完整成本之间观察到的强相关性（虽非绝对）证明了使用代理目标函数的合理性，同时仍需要两个层次的协同求解，从而验证了所提双层框架的有效性，并凸显了其在高效求解具有层次结构的大规模路径问题方面的潜力。

Abstract

This paper tackles the Electric Capacitated Vehicle Routing Problem (E-CVRP) through a bilevel optimization framework that handles routing and charging decisions separately or jointly depending on the search stage. By analyzing their interaction, we introduce a surrogate objective at the upper level to guide the search and accelerate convergence. A bilevel Late Acceptance Hill Climbing algorithm (b-LAHC) is introduced that operates through three phases: greedy descent, neighborhood exploration, and final solution refinement. b-LAHC operates with fixed parameters, eliminating the need for complex adaptation while remaining lightweight and effective. Extensive experiments on the IEEE WCCI-2020 benchmark show that b-LAHC achieves superior or competitive performance against eight state-of-the-art algorithms. Under a fixed evaluation budget, it attains near-optimal solutions on small-scale instances and sets 9/10 new best-known results on large-scale benchmarks, improving existing records by an average of 1.07%. Moreover, the strong correlation (though not universal) observed between the surrogate objective and the complete cost justifies the use of the surrogate objective while still necessitating a joint solution of both levels, thereby validating the effectiveness of the proposed bilevel framework and highlighting its potential for efficiently solving large-scale routing problems with a hierarchical structure.
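For readers unfamiliar with the base heuristic, the following is a minimal sketch of generic, single-level Late Acceptance Hill Climbing on a toy one-dimensional problem. It is not the paper's bilevel b-LAHC: there is no routing or charging layer, no surrogate objective, and no three-phase schedule, and the history length, iteration budget, and toy cost function are arbitrary assumptions.

```python
import random

def lahc(initial, cost, neighbor, history_len=5, iters=5000, seed=0):
    """Generic LAHC: accept a candidate if it is no worse than the current
    solution or no worse than the cost recorded history_len iterations ago."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    history = [current_cost] * history_len
    for i in range(iters):
        cand = neighbor(current, rng)
        cand_cost = cost(cand)
        v = i % history_len
        if cand_cost <= current_cost or cand_cost <= history[v]:
            current, current_cost = cand, cand_cost
        if current_cost < best_cost:
            best, best_cost = current, current_cost
        history[v] = current_cost  # overwrite the oldest entry
    return best, best_cost

# Toy problem: minimize (x - 3)^2 over integers with +/-1 moves
best_x, best_cost = lahc(
    10,
    cost=lambda x: (x - 3) ** 2,
    neighbor=lambda x, rng: x + rng.choice((-1, 1)),
)
```

The history list is the only tunable structure, which is consistent with the abstract's emphasis on fixed parameters and a lightweight design.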

Keywords: Electric Capacitated Vehicle Routing Problem, bilevel optimization, Late Acceptance Hill Climbing, surrogate objective, routing and charging, heuristic algorithm, computational experiments, benchmark performance

29. ❌ Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Authors: Yecheng Wu, Song Han, Hai Cai Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13010v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

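Scoring rationale: The paper's core topic is efficient post-training for large language models (LLMs), specifically an improvement to on-policy distillation (OPD). It is therefore highly relevant to 'Large Language Models' and 'Post-training OR Supervised Fine-tuning OR SFT' (core content). It is evaluated on mathematical reasoning and code generation tasks, giving it some relevance to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (reasoning tasks). Other keywords such as MoE, quantization, RAG, and alignment are not covered and are scored 0.

!!! tip deepseek-chat TL;DR

The paper tackles the inefficiency of on-policy distillation (OPD) in LLM post-training. It proposes Lightning OPD, which precomputes teacher log-probabilities offline and enforces teacher consistency, achieving a 4.0x speedup over standard OPD while preserving performance and lowering the barrier to academic research on LLM post-training.

Abstract (Chinese translation)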

在线策略蒸馏已成为大型语言模型一种高效的后训练范式。然而，标准在线策略蒸馏在整个训练过程中都需要一个实时的教师模型推理服务器，导致巨大的基础设施开销。在本研究中，我们探讨了在线策略蒸馏是否能够离线执行。一种自然的方法是预先在有监督微调生成的轨迹上计算一次教师模型的对数概率，并在训练过程中重复使用它们。然而在实践中，这种离线变体无法可靠地达到标准在线策略蒸馏的性能。为了理解这种差异，我们识别出了一个先前被忽视、但对任何在线策略蒸馏流程都至关重要的条件，我们称之为教师一致性。该条件要求在有监督微调和在线策略蒸馏中使用相同的教师模型。我们证明，违反教师一致性会引入一个不可约的梯度偏差，导致离线和在线策略蒸馏无论训练多久都会收敛到一个次优的固定点。基于这一洞见，我们提出了闪电在线策略蒸馏，这是一个离线的在线策略蒸馏框架，它通过预先在有监督微调轨迹上计算教师模型对数概率来强制保证教师一致性。这一设计完全消除了对实时教师服务器的需求。我们进一步证明，在教师一致性条件下，闪电在线策略蒸馏与标准在线策略蒸馏共享相同的最优点，同时具有有界的梯度差异和一种有助于防止策略漂移的隐式正则化效应。在数学推理和代码生成任务上的大量实验表明，闪电在线策略蒸馏以显著提升的效率实现了最先进的性能。从一个经过有监督微调初始化的Qwen3-8B-Base模型出发，闪电在线策略蒸馏仅用30个GPU小时就在AIME 2024上达到了69.9%的准确率，相比标准在线策略蒸馏实现了4.0倍的加速，并大幅降低了学术界进行大型语言模型后训练研究的门槛。

Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
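The infrastructure saving comes from scoring a fixed set of rollouts with the teacher once and reusing those scores at every training step. The sketch below shows only that caching pattern. `toy_teacher`, the rollout ids, and the per-token log-ratio signal are hypothetical stand-ins, not the paper's implementation, and the teacher-consistency condition (same teacher for SFT and OPD) is not modeled here.

```python
def precompute_teacher_logprobs(rollouts, teacher_logprob):
    """Score every token of every fixed rollout once with the teacher."""
    return {
        rid: [teacher_logprob(tokens[:i], tokens[i]) for i in range(len(tokens))]
        for rid, tokens in rollouts.items()
    }

def token_signals(student_logprobs, cached_teacher_logprobs):
    """Per-token log-ratio log p_student - log p_teacher on the sampled tokens."""
    return [s - t for s, t in zip(student_logprobs, cached_teacher_logprobs)]

calls = []  # records each teacher query so the usage below can count them

def toy_teacher(prefix, token):
    """Stand-in for a teacher model: returns a made-up log-probability."""
    calls.append(token)
    return -0.5 * (len(prefix) + 1)

rollouts = {"r0": ["a", "b", "c"], "r1": ["d"]}
cache = precompute_teacher_logprobs(rollouts, toy_teacher)

# Training epochs reuse the cache; the teacher is never queried again
for _epoch in range(3):
    signals = token_signals([-1.0, -1.0, -1.0], cache["r0"])
```

After the loop, `calls` still holds one entry per rollout token, which is the point of the design: teacher cost is paid once, up front.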

Keywords: On-policy distillation, Post-training, Large language models, Teacher consistency, Offline training, Efficiency, Mathematical reasoning, Code generation

30. ❌ LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software

Authors: Syed Md Mukit Rashid, Abdullah Al Ishtiaq, Kai Tu, Yilu Dong, Tianwei Wu, Ali Ranjbar, Tianchang Yang, Najrin Sultana, Shagufta Mehnaz, Syed Rafiul Hussain Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12994v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

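Scoring rationale: The paper studies the application of LLMs to software security, specifically the repair of logical vulnerabilities. The abstract explicitly mentions 'large language models (LLMs) in understanding and repairing code', so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points). It does not go into the technical details behind the other keywords (such as MoE, SFT, or RAG), nor does it address scientific domains such as bioinformatics, so all other keywords are scored 0.

!!! tip deepseek-chat TL;DR

The paper presents LogicEval, a framework for the first systematic evaluation of the capabilities and limitations of traditional and LLM-based approaches to repairing logical vulnerabilities in real-world software, and builds LogicDS, a dataset of 86 such vulnerabilities.

Abstract (Chinese translation)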

软件中的逻辑漏洞源于程序逻辑缺陷而非内存安全问题，可能导致严重的安全故障。尽管现有的自动化程序修复技术主要专注于修复内存破坏型漏洞，但由于其对漏洞代码及其预期行为的语义理解有限，这些技术在应对逻辑漏洞时面临困难。另一方面，大型语言模型在理解和修复代码方面取得的近期成果显示出良好前景。然而，目前尚缺乏系统分析此类技术处理逻辑漏洞的能力与局限性的框架。本文旨在系统评估传统修复方法与基于大型语言模型的修复方法在解决现实世界逻辑漏洞方面的表现。为支持评估工作，我们创建了首个逻辑漏洞数据集LogicDS，包含86个已分配通用漏洞披露（CVE）编号且具有实际安全影响的逻辑漏洞。同时开发了系统化评估框架LogicEval，用于评估逻辑漏洞补丁。评估结果表明，编译与测试失败主要受提示词敏感性、代码上下文丢失及补丁定位困难等因素驱动。

Abstract

Logical vulnerabilities in software stem from flaws in program logic rather than memory safety, which can lead to critical security failures. Although existing automated program repair techniques primarily focus on repairing memory corruption vulnerabilities, they struggle with logical vulnerabilities because of their limited semantic understanding of the vulnerable code and its expected behavior. On the other hand, recent successes of large language models (LLMs) in understanding and repairing code are promising. However, no framework currently exists to analyze the capabilities and limitations of such techniques for logical vulnerabilities. This paper aims to systematically evaluate both traditional and LLM-based repair approaches for addressing real-world logical vulnerabilities. To facilitate our assessment, we created the first ever dataset, LogicDS, of 86 logical vulnerabilities with assigned CVEs reflecting tangible security impact. We also developed a systematic framework, LogicEval, to evaluate patches for logical vulnerabilities. Evaluations suggest that compilation and testing failures are primarily driven by prompt sensitivity, loss of code context, and difficulty in patch localization.

Keywords: logical vulnerabilities, automated program repair, large language models, LLMs, software security, evaluation framework, patch evaluation, code understanding

31. ❌ One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

Authors: Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13006v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

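Scoring rationale: The paper's core topic is how instruction tuning affects the robustness of large language models (LLMs): simple lexical constraints cause instruction-tuned models to collapse their responses, while base models show no systematic collapse. It is therefore highly relevant to 'Large Language Models' and 'Instruction Tuning' (10 points each), the central object and theme of the study. It has some relevance to 'Mechanistic Interpretability' (5 points), because the paper uses linear probes on prompt representations to investigate the collapse mechanism, which involves explaining model behavior. Other keywords such as MoE, SFT, and RAG are not covered and are scored 0.

!!! tip deepseek-chat TL;DR

The study finds that simple lexical constraints cause instruction-tuned large language models to collapse their responses, losing 14–48% of comprehensiveness, while base models show no systematic collapse. This indicates that instruction tuning couples task competence to narrow surface-form templates, which creates the fragility.

Abstract (Chinese translation)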

经过指令微调的大语言模型能够生成有用且结构化的回答，但当受到微不足道的约束时，这种有用性是否稳健？我们发现，简单的词汇约束（禁止使用单个标点符号或常见词汇）会导致指令微调的大语言模型（LLMs）的回答质量崩溃，在三个开源模型系列和一个闭源模型（GPT-4o-mini）的成对评估中，其回答的全面性损失了14%至48%。在由GPT-4o-mini和GPT-4o评判的1,920组成对比较中，基线回答在77%至100%的情况下更受青睐。值得注意的是，GPT-4o-mini的全面性损失达31%（基线胜率为99%），这表明这种脆弱性也延伸到了商业部署的闭源模型，这与先前关于格式层面约束的研究发现相反。通过机制分析，我们将其归因于规划失败：采用两阶段生成（首先生成自由回答，再进行约束重写）可恢复59%至96%的回答长度，并且在生成开始前，对提示词表征进行线性探测（linear probe）即可预测回答长度（$R^2 = 0.51$–$0.93$），且$R^2$值在不同模型中与崩溃程度相关。同样的线性探测在基础模型上则产生负的$R^2$值，证实了指令微调创造了编码崩溃决策的表征结构。关键的是，基础模型在相同约束下并未出现系统性崩溃，其影响微小、随机且双向，这表明指令微调通过将任务能力与狭窄的表面形式模板耦合，从而创造了这种脆弱性。该效应在MT-Bench的所有八个任务类别中均得到复现。我们进一步发现，标准的独立LLM-as-judge评估仅检测到平均3.5%的质量下降，而成对评估则揭示了23%的下降，这暴露了当前评估受限生成方法中的一个方法论盲点。

Abstract

Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness when trivially constrained? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14–48% of comprehensiveness in pairwise evaluation across three open-weight model families and one closed-weight model (GPT-4o-mini). The baseline response is preferred in 77–100% of 1,920 pairwise comparisons judged by GPT-4o-mini and GPT-4o. Notably, GPT-4o-mini suffers 31% comprehensiveness loss (99% baseline win rate), demonstrating that the fragility extends to commercially deployed closed-weight models, contrary to prior findings on format-level constraints. Through mechanistic analysis, we identify this as a planning failure: two-pass generation (free generation followed by constrained rewriting) recovers 59–96% of response length, and linear probes on prompt representations predict response length with $R^2 = 0.51$–$0.93$ before generation begins, with $R^2$ tracking collapse severity across models. The same probes yield negative $R^2$ on base models, confirming that instruction tuning creates the representational structure encoding the collapse decision. Crucially, base models show no systematic collapse under identical constraints, with effects that are small, noisy, and bidirectional, demonstrating that instruction tuning creates this fragility by coupling task competence to narrow surface-form templates. The effect replicates on MT-Bench across all eight task categories. We further show that standard independent LLM-as-judge evaluation detects only a 3.5% average quality drop where pairwise evaluation reveals 23%, exposing a methodological blind spot in how constrained generation is assessed.
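The negative $R^2$ reported for base models follows from the definition of the coefficient of determination: $R^2 = 1 - SS_{res}/SS_{tot}$ drops below zero whenever a probe predicts worse than a constant equal to the mean target. The helper below is a minimal illustration with made-up response lengths; it is not the paper's probing code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

lengths = [1.0, 2.0, 3.0]                       # toy response lengths
perfect = r_squared(lengths, [1.0, 2.0, 3.0])   # probe recovers every length
constant = r_squared(lengths, [2.0, 2.0, 2.0])  # probe always predicts the mean
inverted = r_squared(lengths, [3.0, 2.0, 1.0])  # probe is worse than the mean
```

A probe that carries no usable signal about response length can therefore score below zero on held-out data, which is how a negative value should be read.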

Keywords: instruction tuning, large language models, robustness, constrained generation, response collapse, mechanistic analysis, GPT-4o, MT-Bench

32. ❌ ROSE: An Intent-Centered Evaluation Metric for NL2SQL

Authors: Wenqi Pei, Shizheng Hou, Boyan Li, Han Chen, Zhichao Shi, Yuyu Luo Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12988v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

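Scoring rationale: The paper develops an evaluation metric (ROSE) for NL2SQL, a task-specific evaluation problem in natural language processing. It covers SQL semantic-correctness assessment, an adversarial verification framework, and metric design, but it does not touch the keyword areas: LLM technical principles, training methods, inference optimization, alignment, model compression, AI agents, or AI-for-science applications. All keywords concern LLM technology, training methods, optimization techniques, or specific application domains, whereas this paper is purely an evaluation-metric study with no direct link to those topics.

!!! tip deepseek-chat TL;DR

The paper addresses the unreliability of Execution Accuracy (EX), the existing evaluation metric for natural-language-to-SQL (NL2SQL) tasks. It proposes ROSE, an intent-centered metric that uses an adversarial Prover-Refuter cascade to judge whether the predicted SQL answers the user's question, and it achieves the best agreement with human experts on an expert-aligned validation set.

Abstract (Chinese translation)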

执行准确率（EX）作为评估自然语言转SQL（NL2SQL）方案有效性的广泛使用指标，正变得日益不可靠。该指标对语法变化敏感，忽略了问题可能存在多种解释的可能性，且易受错误的标准答案SQL误导。为解决这一问题，我们提出了ROSE——一种以意图为中心的评估指标，其关注点在于预测的SQL是否回答了问题，而非在参考依赖范式下与标准答案SQL的一致性。ROSE采用对抗性的证明者-反驳者级联框架：SQL证明者独立评估预测SQL相对于用户意图的语义正确性，而对抗性反驳者则利用标准答案SQL作为证据，对此判断进行挑战与优化。在我们与专家对齐的验证集ROSE-VEC上，ROSE取得了与人类专家最高的一致性，其科恩卡帕系数比次优指标高出近24%。我们还对19种NL2SQL方法进行了大规模重评估，揭示了四项有价值的发现。我们公开ROSE与ROSE-VEC，以推动更可靠的NL2SQL研究。

Abstract

Execution Accuracy (EX), the widely used metric for evaluating the effectiveness of Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is sensitive to syntactic variation, ignores that questions may admit multiple interpretations, and is easily misled by erroneous ground-truth SQL. To address this, we introduce ROSE, an intent-centered metric that focuses on whether the predicted SQL answers the question, rather than consistency with the ground-truth SQL under the reference-dependent paradigm. ROSE employs an adversarial Prover-Refuter cascade: SQL Prover assesses the semantic correctness of a predicted SQL against the user’s intent independently, while Adversarial Refuter uses the ground-truth SQL as evidence to challenge and refine this judgment. On our expert-aligned validation set ROSE-VEC, ROSE achieves the best agreement with human experts, outperforming the next-best metric by nearly 24% in Cohen’s Kappa. We also conduct a large-scale re-evaluation of 19 NL2SQL methods, revealing four valuable insights. We release ROSE and ROSE-VEC to facilitate more reliable NL2SQL research.

Keywords: NL2SQL, evaluation metric, intent-centered, adversarial Prover-Refuter, semantic correctness, SQL evaluation, ROSE, execution accuracy
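The agreement figure in the abstract is expressed in Cohen's Kappa, which corrects raw agreement between two raters for the agreement expected by chance. The following is a minimal sketch of that statistic. The verdict lists are hypothetical and are not drawn from ROSE-VEC.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts: 1 means "the predicted SQL answers the question".
human = [1, 1, 0, 1, 0, 0, 1, 1]
metric = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(cohens_kappa(human, metric), 3))  # 0.467
```

Here raw agreement is 75%, but kappa is noticeably lower because both raters say "correct" most of the time, so much of that agreement could arise by chance.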

33. ❌ Parallax: Why AI Agents That Think Must Never Act

Authors: Joel Fokou Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12986v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

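Scoring rationale: The paper focuses on a security architecture for autonomous AI agents and is highly relevant to 'LLM Agents/Autonomous Agents' and 'Tool Use/Function Calling' (10 points), since it discusses agents with execution capability and their safety mechanisms. It is related to 'Large Language Models/Foundation Models' (8 points): AI agents are typically built on large models, but the paper does not go into LLM technical details. The other keywords (MoE, Scaling Laws, training methods, inference optimization, AI-for-science applications, and so on) are unrelated to the security-architecture topic and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Parallax security architecture, which addresses security vulnerabilities in autonomous AI agents with execution capability through four principles: cognitive-executive separation, adversarial validation, information flow control, and reversible execution. It blocks 98.9% to 100% of attacks across 280 adversarial tests.

Abstract (Chinese translation)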

自主人工智能代理正迅速从实验工具转变为运营基础设施，预计到2026年底，80%的企业应用将嵌入AI副驾驶。随着代理获得执行现实世界操作的能力（读取文件、运行命令、发起网络请求、修改数据库），一个根本性的安全缺口已经显现。当前主流的代理安全方法依赖于提示层护栏：即通过自然语言指令在与其试图防范的威胁相同的抽象层级进行操作。本文认为，对于具备执行能力的代理，基于提示的安全机制在架构上是不充分的，并提出了Parallax——一种基于四大原则的安全自主AI执行范式：认知-执行分离，该原则从结构上阻止推理系统执行操作；渐进确定性对抗验证，即在推理与执行之间插入一个独立的多层验证器；信息流控制，通过代理工作流传播数据敏感度标签以检测上下文相关威胁；以及可逆执行，该机制捕获破坏前的状态，以便在验证失败时实现回滚。我们推出了OpenParallax，一个用Go语言编写的开源参考实现，并采用“假定已遭入侵”评估方法进行测试，该方法完全绕过推理系统，以测试代理在完全被入侵情况下的架构边界。在涵盖九类攻击的280个对抗性测试案例中，Parallax在默认配置下成功拦截98.9%的攻击且无误报，在最高安全配置下拦截率达到100%。当推理系统被入侵时，提示层护栏无法提供任何保护，因为它们仅存在于已被入侵的系统内部；而Parallax的架构边界则始终保持有效。

Abstract

Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax’s architectural boundary holds regardless.
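Three of the four principles (separation of reasoning from execution, an independent validator between them, and rollback from captured state) can be illustrated with a small sketch. This is not OpenParallax, which the abstract says is written in Go. Every class, function, and policy below is a hypothetical stand-in, the filesystem is an in-memory dictionary, rollback is triggered manually, and information flow control is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ProposedAction:
    """Data-only description of an action; the reasoner never executes it."""
    kind: str          # "write_file" or "delete_file" in this sketch
    target: str
    payload: str = ""


# A validator tier returns a rejection reason, or None to pass the action on.
Validator = Callable[[ProposedAction], Optional[str]]


def deny_system_paths(action: ProposedAction) -> Optional[str]:
    # Hypothetical deterministic rule tier.
    return "system path" if action.target.startswith("/etc/") else None


def deny_large_payloads(action: ProposedAction) -> Optional[str]:
    # Hypothetical second tier with an arbitrary size limit.
    return "payload too large" if len(action.payload) > 1000 else None


class Executor:
    """Only component allowed to mutate state; keeps snapshots for rollback."""

    def __init__(self, validators: List[Validator]) -> None:
        self.validators = validators
        self.files: Dict[str, str] = {}       # in-memory stand-in for a filesystem
        self._undo: List[Dict[str, str]] = []

    def submit(self, action: ProposedAction) -> str:
        for check in self.validators:
            reason = check(action)
            if reason is not None:
                return f"blocked: {reason}"
        self._undo.append(dict(self.files))   # capture state before mutating
        if action.kind == "write_file":
            self.files[action.target] = action.payload
        elif action.kind == "delete_file":
            self.files.pop(action.target, None)
        else:
            self._undo.pop()                  # nothing changed, drop the snapshot
            return "blocked: unknown action kind"
        return "executed"

    def rollback(self) -> None:
        """Restore the state captured before the most recent executed action."""
        if self._undo:
            self.files = self._undo.pop()


executor = Executor([deny_system_paths, deny_large_payloads])
print(executor.submit(ProposedAction("write_file", "/tmp/a.txt", "hello")))
print(executor.submit(ProposedAction("write_file", "/etc/passwd", "x")))
executor.submit(ProposedAction("delete_file", "/tmp/a.txt"))
executor.rollback()                           # undo the delete
print(executor.files)
```

Because the reasoner can only emit `ProposedAction` values, a compromised reasoner still has to pass the validators, which is the property the abstract's Assume-Compromise Evaluation is described as testing.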

Keywords: AI Agents, Autonomous Agents, Agent Safety, Execution Security, Cognitive-Executive Separation, Adversarial Validation, Information Flow Control, Reversible Execution

34. ❌ Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

Authors: Sohyun An, Shuibenyang Yuan, Hayeon Lee, Cho-Jui Hsieh, Alexander Min Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12967v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

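Scoring rationale: The paper studies a reinforcement learning (RL) framework for training search agents that uses cycle consistency to generate reward signals without gold supervision such as ground-truth answers. Its core is information retrieval (IR) and reinforcement learning; it involves no innovation in large language model (LLM) or deep-learning technical principles and is not applied to a scientific domain such as bioinformatics. All keywords concern large models, deep-learning techniques, or AI-for-science applications, whereas this paper focuses on applying conventional RL to search tasks, so it is judged unrelated to every keyword.

!!! tip deepseek-chat TL;DR

The paper proposes Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents that uses cycle consistency to turn question reconstructability into a reward signal, and it reaches performance comparable to supervised baselines on question-answering benchmarks.

Abstract (Chinese translation)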

强化学习（Reinforcement Learning, RL）在优化复杂信息检索任务中的搜索智能体方面展现出巨大潜力。然而，现有方法主要依赖于黄金监督信号（如标准答案），这类信号难以规模化获取。为克服这一局限，受无监督机器翻译和图像到图像转换中循环一致性技术的启发，我们提出了一种无需黄金监督的搜索智能体训练框架——循环一致性搜索（Cycle-Consistent Search, CCS）。我们的核心假设是：与不充分或无关的搜索轨迹不同，最优搜索轨迹能够无损编码问题的意图。因此，高质量的轨迹应保留准确重构原始问题所需的信息，从而为策略优化提供奖励信号。然而，简单的循环一致性目标易受信息泄漏影响，因为重构可能依赖表面的词汇线索而非底层的搜索过程。为减弱这种效应，我们引入了信息瓶颈约束，包括排除最终响应以及对搜索查询进行命名实体识别（Named Entity Recognition, NER）掩码。这些约束迫使重构过程依赖于检索到的观测结果与结构框架，确保生成的奖励信号反映信息充分性而非语言冗余性。在问答基准测试上的实验表明，CCS达到了与有监督基线相当的性能，同时优于先前不依赖黄金监督的方法。这些结果表明，在缺乏黄金监督的场景下，CCS为训练搜索智能体提供了一种可扩展的训练范式。

Abstract

Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question’s intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable training paradigm for training search agents in settings where gold supervision is unavailable.
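The reward construction described above can be sketched as follows, under several loudly stated assumptions: a regular expression stands in for NER masking, token-overlap F1 stands in for whatever similarity the authors use, and the `reconstruct` callable is a hypothetical placeholder for the model that rebuilds the question. The final response is simply never passed to the reconstructor, mirroring the bottleneck the abstract describes.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

Step = Dict[str, str]  # {"query": ..., "observation": ...}


def mask_entities(query: str) -> str:
    """Crude stand-in for NER masking: hide capitalised tokens and numbers."""
    return re.sub(r"\b(?:[A-Z][\w-]*|\d+)\b", "[MASK]", query)


def token_f1(reference: str, candidate: str) -> float:
    """Case-insensitive token-overlap F1 between two strings."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cycle_reward(
    question: str,
    trajectory: List[Step],
    reconstruct: Callable[[List[Step]], str],
) -> float:
    """Reward = how well the question can be rebuilt from the bottlenecked trajectory."""
    bottlenecked = [
        {"query": mask_entities(step["query"]), "observation": step["observation"]}
        for step in trajectory
    ]
    return token_f1(question, reconstruct(bottlenecked))


# Hypothetical trajectory and a toy reconstructor that returns a fixed guess.
trajectory = [
    {
        "query": "director of Inception",
        "observation": "Inception is a 2010 film directed by Christopher Nolan.",
    },
]
toy_reconstructor = lambda steps: "who directed inception"
print(cycle_reward("Who directed Inception", trajectory, toy_reconstructor))  # 1.0
```

In a real training loop the reconstructor would see only the masked queries and retrieved observations, so a high reward would have to come from what the search actually retrieved rather than from words copied out of the queries.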

Keywords: Reinforcement Learning, search agents, cycle-consistency, information retrieval, gold-supervision-free, question reconstruction, policy optimization, reward signal

35. ❌ Modeling Co-Pilots for Text-to-Model Translation

Authors: Serdar Kadioglu, Karthik Uppuluri, Akash Singirikonda Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12955v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

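Scoring rationale: The paper explicitly uses LLMs for text-to-model translation and tests several strategies, including chain-of-thought reasoning and agentic approaches, so it is highly relevant to 'Large Language Models' and 'Chain of Thought' (10 points), and its agentic approaches are highly relevant to 'LLM Agents' (10 points). Other keywords such as MoE, SLMs, Scaling Laws, training methods, optimization techniques, and AI-for-science applications are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies text-to-model translation, in which large language models (LLMs) convert combinatorial optimization and satisfaction problems described in natural language into formal models. It proposes the unified Text2Model framework and the Text2Zinc dataset, and its experiments indicate that LLMs show promise in this area but are not yet a push-button solution.

Abstract (Chinese translation)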

利用大型语言模型（LLM）进行文本到模型的转换与优化任务正受到日益广泛的关注。本文旨在通过引入 Text2Model 和 Text2Zinc 来推进这一研究方向。Text2Model 是一套基于多种不同复杂度的 LLM 策略的协同辅助工具集，并包含一个在线排行榜。Text2Zinc 则是一个跨领域数据集，用于捕获以自然语言描述的优化与满足性问题，同时配有一个内置人工智能助手的交互式编辑器。尽管已有研究开始探索利用 LLM 将组合问题转化为形式化模型，但我们的工作是首次尝试将满足性问题与优化问题同时整合到一个统一的架构和数据集之中。此外，与现有专注于向特定求解器模型转换的研究不同，我们的方法是求解器无关的。为此，我们利用 MiniZinc 的求解器与范式无关的建模能力来表述组合问题。我们进行了全面的实验，比较了多种单次调用与多次调用策略的执行与求解精度，这些策略包括：零样本提示、思维链推理、通过知识图谱的中间表示、基于语法的句法编码，以及将模型分解为顺序子任务的智能体方法。我们的协同辅助策略具有竞争力，并在某些方面改进了该领域的最新研究。我们的研究结果表明，尽管 LLM 前景广阔，但它们尚未成为组合建模的“一键式”技术。我们将 Text2Model 协同辅助工具集与排行榜，以及 Text2Zinc 数据集和交互式编辑器开源，以支持缩小这一性能差距。

Abstract

There is growing interest in leveraging large language models (LLMs) for text-to-model translation and optimization tasks. This paper aims to advance this line of research by introducing Text2Model and Text2Zinc. Text2Model is a suite of co-pilots based on several LLM strategies with varying complexity, along with an online leaderboard. Text2Zinc is a cross-domain dataset for capturing optimization and satisfaction problems specified in natural language, along with an interactive editor with a built-in AI assistant. While there is an emerging literature on using LLMs for translating combinatorial problems into formal models, our work is the first attempt to integrate both satisfaction and optimization problems within a unified architecture and dataset. Moreover, our approach is solver-agnostic, unlike existing work that focuses on translation to a solver-specific model. To achieve this, we leverage MiniZinc’s solver-and-paradigm-agnostic modeling capabilities to formulate combinatorial problems. We conduct comprehensive experiments to compare execution and solution accuracy across several single- and multi-call strategies, including zero-shot prompting, chain-of-thought reasoning, intermediate representations via knowledge graphs, grammar-based syntax encoding, and agentic approaches that decompose the model into sequential sub-tasks. Our co-pilot strategies are competitive with, and in parts improve on, recent research in this domain. Our findings indicate that while LLMs are promising, they are not yet a push-button technology for combinatorial modeling. We release the Text2Model co-pilots and leaderboard, and the Text2Zinc dataset and interactive editor, as open source to support closing this performance gap.

Keywords: large language models, text-to-model translation, combinatorial optimization, chain-of-thought reasoning, agentic approaches, MiniZinc, solver-agnostic, co-pilots

36. ❌ Distorted or Fabricated? A Survey on Hallucination in Video LLMs

Authors: Yiyang Huang, Yitian Zhang, Yizhou Wang, Mingyuan Zhang, Liang Shi, Huimin Zeng, Yun Fu Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12944v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

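Scoring rationale: The paper focuses on hallucination in video large language models (Vid-LLMs), so it is highly relevant to 'Large Language Models' (10 points) and to 'Hallucination Mitigation' (10 points). Its taxonomy and root-cause analysis of hallucinations are somewhat related to 'Mechanistic Interpretability' (5 points). It does not cover the technical details or application areas of the other keywords, which therefore score 0.

!!! tip deepseek-chat TL;DR

The paper systematically surveys hallucination in video large language models, proposes a taxonomy of dynamic distortion and content fabrication, analyzes root causes, and reviews evaluation and mitigation methods, laying groundwork for reliable video-language systems.

Abstract (Chinese translation)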

尽管视频语言建模已取得显著进展，幻觉问题仍然是视频大语言模型（Vid-LLMs）中持续存在的挑战，即模型生成看似合理但与输入视频内容相矛盾的输出。本综述对Vid-LLMs中的幻觉现象进行了全面分析，并提出了一种系统分类法，将其归纳为两大核心类型：动态失真与内容虚构，每种类型下又包含两个子类并附有代表性案例。基于此分类框架，我们回顾了幻觉评估与缓解方面的最新进展，涵盖了关键基准、度量指标及干预策略。我们进一步分析了动态失真与内容幻觉的根本成因，指出其通常源于时序表征能力有限和视觉基础不足。这些分析为未来研究指明了若干有前景的方向，包括开发运动感知的视觉编码器以及整合反事实学习技术。本综述整合了当前分散的研究进展，以促进对Vid-LLMs幻觉问题的系统性理解，为构建鲁棒可靠的视频语言系统奠定基础。相关工作的最新整理列表持续维护于https://github.com/hukcc/Awesome-Video-Hallucination。

Abstract

Despite significant progress in video-language modeling, hallucinations remain a persistent challenge in Video Large Language Models (Vid-LLMs), referring to outputs that appear plausible yet contradict the content of the input video. This survey presents a comprehensive analysis of hallucinations in Vid-LLMs and introduces a systematic taxonomy that categorizes them into two core types: dynamic distortion and content fabrication, each comprising two subtypes with representative cases. Building on this taxonomy, we review recent advances in the evaluation and mitigation of hallucinations, covering key benchmarks, metrics, and intervention strategies. We further analyze the root causes of dynamic distortion and content fabrication, which often result from limited capacity for temporal representation and insufficient visual grounding. These insights inform several promising directions for future work, including the development of motion-aware visual encoders and the integration of counterfactual learning techniques. This survey consolidates scattered progress to foster a systematic understanding of hallucinations in Vid-LLMs, laying the groundwork for building robust and reliable video-language systems. An up-to-date curated list of related works is maintained at https://github.com/hukcc/Awesome-Video-Hallucination .

Keywords: Video Large Language Models, Hallucination, Dynamic Distortion, Content Fabrication, Visual Grounding, Temporal Representation, Evaluation, Mitigation

37. ❌ Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

Authors: Benjamin Stern, Peter Nadel Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12948v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

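Scoring rationale: The paper's core is a persistent-memory encoding method for LLM agents (dual-trace memory encoding), which is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points), because it explicitly studies LLM agents and improves their memory systems. Other keywords such as MoE, SLMs, training methods, reasoning techniques, compression, and scientific applications are neither mentioned in nor relevant to the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a dual-trace encoding method for the persistent memory systems of LLM agents, attaching a concrete scene trace to each stored fact to enrich contextual memory. Experiments show a marked accuracy gain on cross-session recall tasks (from 53.5% to 73.7%).

Abstract (Chinese translation)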

具备持久记忆的大语言模型智能体通常将信息存储为扁平的事实记录，这难以为时序推理、变化追踪或跨会话聚合提供足够的上下文。受绘图效应[3]的启发，我们引入了双迹记忆编码方法。在此方法中，每个被存储的事实都与一个具体的场景痕迹配对，后者是对信息获取时刻及上下文情境的叙事性重建。该机制迫使智能体在编码过程中必须记录特定的情境细节，从而形成更丰富、更具区分度的记忆痕迹。基于LongMemEval-S基准测试（包含4,575个会话和100个回忆问题），我们在99个共享问题上将双迹编码与仅记录事实的对照组（在覆盖范围和格式上均匹配）进行了比较。双迹编码实现了73.7%的整体准确率，而对照组为53.5%，获得了+20.2个百分点（pp）的提升（95%置信区间：[+12.1, +29.3]，自助法p < 0.0001）。提升主要集中在时序推理（+40pp）、知识更新追踪（+25pp）和跨会话聚合（+30pp）任务上，而在单会话检索任务中未见优势，这与编码特异性理论[8]的预期一致。词元分析表明，双迹编码在不增加额外成本的情况下实现了上述增益。此外，我们初步勾勒了将双迹编码适配于代码生成智能体的架构设计，并提供了初步的试点验证结果。

Abstract

LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session aggregation. Inspired by the drawing effect [3], we introduce dual-trace memory encoding. In this method, each stored fact is paired with a concrete scene trace, a narrative reconstruction of the moment and context in which the information was learned. The agent is forced to commit to specific contextual details during encoding, creating richer, more distinctive memory traces. Using the LongMemEval-S benchmark (4,575 sessions, 100 recall questions), we compare dual-trace encoding against a fact-only control with matched coverage and format over 99 shared questions. Dual-trace achieves 73.7% overall accuracy versus 53.5%, a +20.2 percentage point (pp) gain (95% CI: [+12.1, +29.3], bootstrap p < 0.0001). Gains concentrate in temporal reasoning (+40pp), knowledge-update tracking (+25pp), and multi-session aggregation (+30pp), with no benefit for single-session retrieval, consistent with encoding specificity theory [8]. Token analysis shows dual-trace encoding achieves this gain at no additional cost. We additionally sketch an architectural design for adapting dual-trace encoding to coding agents, with preliminary pilot validation.
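The difference between a fact-only record and a dual-trace record can be pictured as a small data structure. The field names, the rendering format, and the example memory below are illustrative assumptions. The abstract does not specify the authors' storage schema.

```python
from dataclasses import dataclass


@dataclass
class DualTraceRecord:
    fact: str          # the flat factual record a fact-only store would keep
    scene_trace: str   # narrative of when and in what context it was learned
    session_id: int


def render(record: DualTraceRecord, include_trace: bool = True) -> str:
    """Format a record for the agent's context; the trace can be dropped for a fact-only view."""
    line = f"[session {record.session_id}] {record.fact}"
    if include_trace:
        line += f" (scene: {record.scene_trace})"
    return line


# Hypothetical memory entry, not taken from LongMemEval-S.
record = DualTraceRecord(
    fact="User's preferred editor is Vim.",
    scene_trace="While debugging a shell script, the user said they had switched from Emacs to Vim.",
    session_id=3,
)
print(render(record, include_trace=False))  # fact-only control view
print(render(record))                       # dual-trace view
```

The scene trace is what gives a later question such as "what did the user use before?" something to match against, which fits the abstract's observation that gains concentrate in temporal and update-tracking questions.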

Keywords: LLM agents, persistent memory, dual-trace encoding, cross-session recall, temporal reasoning, memory traces, LongMemEval-S benchmark, encoding specificity theory

38. ❌ CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference

Authors: Qiang Zhang, Zhongnian Li Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12913v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

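Scoring rationale: The paper's core is the application of LLMs to binary decompilation, which directly involves 'Large Language Models' (10 points) and 'Hallucination Mitigation' (8 points, for addressing logical hallucinations). Its 1.3B model is lightweight and therefore related to 'Small Language Models' (8 points). It involves a training strategy and is somewhat related to 'Post-training' (5 points). Other keywords such as MoE, Scaling Laws, and RAG do not appear in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the CoDe-R framework, a two-stage code refinement method that addresses logical hallucinations and semantic misalignment when LLMs are used for binary decompilation. Using a 1.3B model, it sets a new state of the art among lightweight models on the HumanEval-Decompile benchmark and is the first 1.3B model to exceed a 50% average re-executability rate.

Abstract (Chinese translation)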

二进制反编译是一项关键的逆向工程任务，旨在从剥离符号的可执行文件中重建高级源代码。尽管大型语言模型（LLMs）近期展现出潜力，但由于编译过程中不可逆的语义损失，它们常遭受“逻辑幻觉”和“语义错位”问题，导致生成的代码无法重新执行。本研究提出具有鲁棒性的认知反编译器优化框架（CoDe-R），一个轻量级的两阶段代码优化框架。第一阶段引入语义认知增强（SCE），这是一种基于原理指导的语义注入策略，训练模型在生成代码的同时恢复高级算法意图。第二阶段在推理过程中引入动态双路径回退（DDPF）机制，通过混合验证策略自适应地平衡语义恢复与句法稳定性。在HumanEval-Decompile基准测试上的评估表明，CoDe-R（使用1.3B参数骨干模型）在轻量级模型中建立了新的最先进水平（SOTA）。值得注意的是，它是首个平均重执行率超过50.00%的1.3B参数模型，显著超越基线性能，有效弥合了高效模型与专家级性能之间的差距。我们的代码公开于https://github.com/Theaoi/CoDe-R。

Abstract

Binary decompilation is a critical reverse engineering task aimed at reconstructing high-level source code from stripped executables. Although Large Language Models (LLMs) have recently shown promise, they often suffer from “logical hallucinations” and “semantic misalignment” due to the irreversible semantic loss during compilation, resulting in generated code that fails to re-execute. In this study, we propose Cognitive Decompiler Refinement with Robustness (CoDe-R), a lightweight two-stage code refinement framework. The first stage introduces Semantic Cognitive Enhancement (SCE), a Rationale-Guided Semantic Injection strategy that trains the model to recover high-level algorithmic intent alongside code. The second stage introduces a Dynamic Dual-Path Fallback (DDPF) mechanism during inference, which adaptively balances semantic recovery and syntactic stability via a hybrid verification strategy. Evaluation on the HumanEval-Decompile benchmark demonstrates that CoDe-R (using a 1.3B backbone) establishes a new State-of-the-Art (SOTA) in the lightweight regime. Notably, it is the first 1.3B model to exceed an Average Re-executability Rate of 50.00%, significantly outperforming the baseline and effectively bridging the gap between efficient models and expert-level performance. Our code is available at https://github.com/Theaoi/CoDe-R.
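The inference-time fallback idea can be sketched as a verify-then-fall-back wrapper. The abstract does not spell out the DDPF verification strategy, so the brace-balance check below is only a hypothetical stand-in, and `primary` and `fallback` are placeholder callables for a rationale-guided decoding path and a more conservative one.

```python
from typing import Callable, Tuple


def braces_balanced(code: str) -> bool:
    """Hypothetical cheap syntactic check: every '{' is closed in order."""
    depth = 0
    for ch in code:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def dual_path_decode(
    asm: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    verify: Callable[[str], bool],
) -> Tuple[str, str]:
    """Return (code, path_used): keep the primary output only if it passes verification."""
    candidate = primary(asm)
    if verify(candidate):
        return candidate, "primary"
    return fallback(asm), "fallback"


# Toy stand-ins for the two decoding paths (no model is called here).
truncated = lambda asm: "int f() { return 1;"       # output cut off mid-function
conservative = lambda asm: "int f() { return 0; }"
print(dual_path_decode("<asm>", truncated, conservative, braces_balanced))
```

A realistic verifier would more plausibly compile the candidate, since the benchmark metric is re-executability, but that choice is an assumption rather than something the abstract states.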

Keywords: Binary Decompilation, Large Language Models, Logical Hallucinations, Code Refinement, Rationale-Guided Semantic Injection, Dynamic Dual-Path Fallback, Re-executability Rate, Lightweight Model

39. ❌ Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

Authors: Ronald Skorobogat, Ameya Prabhu, Matthias Bethge Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12911v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

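Scoring rationale: The paper studies evaluation methods for multilingual large models; its core contribution is an evaluation benchmark based on round-trip translation. It explicitly targets "frontier models", which is highly relevant to "Large Language Models OR LLMs OR Foundation Models", hence a relevance of 10. The remaining keywords concern specific techniques (such as MoE, quantization, and inference acceleration), training methods (such as pre-training, fine-tuning, and alignment), or particular application domains (such as AI for science), none of which the paper addresses, so they receive 0.

!!! tip deepseek-chat TL;DR

The paper shows that current multilingual benchmarks mainly measure mathematical reasoning and factual recall rather than multilingual ability, and proposes a new evaluation method based on round-trip translation that correlates strongly with performance on real-world multilingual tasks.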

摘要翻译

多语言基准测试指导着前沿模型的开发。然而，前沿模型所报告的多语言评估虽然在结构上类似于流行的推理与知识基准，却覆盖了多种语言。我们证明，此类基准测试及其衍生的多语言评估所衡量的实质是数学推理与事实记忆能力，而非多语言熟练度。例如，在这些基准上，思维变体（thinking variants）的表现显著优于指令变体（instruct variants），但在真实世界的多语言任务（如LMArena）中却往往表现更差。我们提出一种简单的替代方案：通过往返翻译来评估多语言能力。给定源语言文本，将其翻译为目标语言后再译回源语言；原文与回译结果之间的语义差异揭示了多语言生成能力的缺陷。往返翻译方法在我们的基准测试中与LMArena上的用户评分几乎完全相关（ρ = 0.94），无需人工参考译文，且不要求评估者具备比被测模型更强的多语言能力。最后，我们引入了“翻译中的丢失”（Lost in Translation, LiT）——一个涵盖全球广泛使用语言的、具有挑战性的往返翻译基准，旨在实现对前沿多语言模型的真实评估。

摘要 (Abstract)

Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similar to popular reasoning and knowledge benchmarks, but across many languages. We show such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. Round-trip translation correlates almost perfectly (ρ = 0.94) with user ratings on LMArena with our benchmark, requires no human reference translations, and does not require a more capable multilingual judge than tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.

关键词: multilingual benchmarks, frontier models, round-trip translation, evaluation methodology, Lost in Translation (LiT), multilingual proficiency, semantic gaps, LMArena
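
The round-trip procedure described in the abstract above (translate to a target language, translate back, compare with the original) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Translator` signature, the toy dictionary translator, and the character-level `surface_similarity` are all assumptions made here. In particular, the paper speaks of semantic gaps, whereas this sketch uses a crude string similarity from the standard library as a stand-in.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]


def round_trip(text: str, src: str, tgt: str, translate: Translator) -> str:
    """Translate src -> tgt, then translate the result back tgt -> src."""
    forward = translate(text, src, tgt)
    return translate(forward, tgt, src)


def surface_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a crude stand-in for a semantic judge."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def round_trip_score(texts: Iterable[str], src: str, tgt: str,
                     translate: Translator) -> float:
    """Mean similarity between each original text and its round-trip result."""
    scores = [surface_similarity(t, round_trip(t, src, tgt, translate))
              for t in texts]
    if not scores:
        raise ValueError("need at least one text to score")
    return sum(scores) / len(scores)


# Toy word-level translator (an assumption for illustration only).
# "big" and "large" collapse to the same target word, so information is lost.
EN_DE = {"big": "gross", "large": "gross", "cat": "katze"}
DE_EN = {"gross": "big", "katze": "cat"}


def toy_translate(text: str, src: str, tgt: str) -> str:
    table = EN_DE if (src, tgt) == ("en", "de") else DE_EN
    return " ".join(table.get(word, word) for word in text.split())


if __name__ == "__main__":
    for sentence in ["big cat", "large cat"]:
        back = round_trip(sentence, "en", "de", toy_translate)
        print(sentence, "->", back, round(surface_similarity(sentence, back), 2))
```

In the toy run, "big cat" survives the round trip unchanged while "large cat" comes back as "big cat", so its similarity drops below 1. A real evaluation would replace `toy_translate` with the model under test and `surface_similarity` with a semantic comparison.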

40. ❌ BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design

作者: Chuyang Xiang, Yichen Wei, Jiale Ma, Handing Wang, Junchi Yan 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12898v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	10.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

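Scoring rationale: The paper's core topic is LLM-based hyper-heuristic (LHH) algorithm design, which is highly relevant to "Large Language Models" (10). The method explicitly uses Monte Carlo Tree Search (MCTS) for the interior-layer optimization, which is highly relevant to "MCTS AND LLM" (10). The work is applied to optimization problems (such as CVRP and MIS), an application of AI to science and engineering, so it is somewhat related to "AI for Science" (5). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract or are unrelated to the paper, and all receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes the BEAM framework, which recasts heuristic design as a bi-level optimization problem and uses a genetic algorithm together with Monte Carlo Tree Search to generate high-performing algorithms automatically, significantly outperforming existing LLM-based hyper-heuristic methods on several optimization problems.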

摘要翻译

基于大语言模型的超启发式算法（LHH）近年来已成为自动化启发式设计的高效途径。然而，现有大多数LHH仅在预定义求解器内优化单一函数时表现良好。其单层演化机制使得它们不足以编写出性能完备的完整求解器。尽管部分变体引入了超参数调优或尝试通过迭代式局部修改生成复杂代码，它们仍缺乏高层级的算法建模，导致探索效率有限。为解决这一问题，我们将启发式设计重新构建为双层优化问题，并提出 BEAM（双层记忆自适应算法演化框架）。BEAM的外层通过遗传算法（GA）演化带有函数占位符的高层算法结构，内层则通过蒙特卡洛树搜索（MCTS）实现这些占位符的具体内容。我们进一步引入自适应记忆模块以促进复杂代码的生成。为支持复杂代码生成的评估，我们指出了从零开始或基于代码模板启动LHH的局限性，并提出了知识增强（KA）流水线。在多个优化问题上的实验结果表明，BEAM显著优于现有LHH方法，在CVRP混合算法设计中整体将最优性差距降低了37.84%。BEAM还设计出一种优于最大独立集（MIS）当前最优求解器KaMIS的启发式算法。

摘要 (Abstract)

Large Language Model-based Hyper Heuristic (LHH) has recently emerged as an efficient way for automatic heuristic design. However, most existing LHHs just perform well in optimizing a single function within a pre-defined solver. Their single-layer evolution makes them not effective enough to write a competent complete solver. While some variants incorporate hyperparameter tuning or attempt to generate complex code through iterative local modifications, they still lack a high-level algorithmic modeling, leading to limited exploration efficiency. To address this, we reformulate heuristic design as a Bi-level Optimization problem and propose BEAM (Bi-level Memory-adaptive Algorithmic Evolution). BEAM’s exterior layer evolves high-level algorithmic structures with function placeholders through genetic algorithm (GA), while the interior layer realizes these placeholders via Monte Carlo Tree Search (MCTS). We further introduce an Adaptive Memory module to facilitate complex code generation. To support the evaluation for complex code generation, we point out the limitations of starting LHHs from scratch or from code templates and introduce a Knowledge Augmentation (KA) Pipeline. Experimental results on several optimization problems demonstrate that BEAM significantly outperforms existing LHHs, notably reducing the optimality gap by 37.84% on aggregate in CVRP hybrid algorithm design. BEAM also designs a heuristic that outperforms SOTA Maximum Independent Set (MIS) solver KaMIS.

关键词: Large Language Models, Hyper Heuristic, Bi-level Optimization, Genetic Algorithm, Monte Carlo Tree Search, Algorithmic Evolution, Optimization Problems, Heuristic Design

41. ❌ Towards Long-horizon Agentic Multimodal Search

作者: Yifan Du, Zikang Liu, Jinbiao Peng, Jie Wu, Junyi Li, Jinyang Li, Wayne Xin Zhao, Ji-Rong Wen 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12890v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

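Scoring rationale: The paper studies a long-horizon multimodal deep search agent. It applies supervised fine-tuning (SFT) to a large model (Qwen3-VL-Thinking-30A3B) to build an agent with tool-use capability (LLM Agents), uses a fetch-image tool for active perception, and relies on multi-step reasoning and in-depth reasoning to solve complex cross-modal tasks. Other keywords such as MoE, quantization, RAG, and alignment are not addressed in the paper.

!!! tip deepseek-chat TL;DR

The paper proposes the LMM-Searcher framework, which uses a file-based visual representation and a progressive visual loading strategy to address context explosion and loss of visual signals in long-horizon multimodal search, fine-tunes a large model on synthesized data, and achieves state-of-the-art performance among open-source models on several benchmarks.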

摘要翻译

多模态深度搜索智能体通过迭代收集文本与视觉证据，在解决复杂任务方面展现出巨大潜力。然而，在长程任务中管理多模态输入带来的异构信息和高令牌成本仍是一个关键挑战，现有方法常面临上下文爆炸或关键视觉信号丢失的问题。为此，我们提出了一种新颖的长程多模态深度搜索框架，命名为LMM-Searcher，其核心是基于文件的视觉表征机制。通过将视觉资源卸载至外部文件系统并将其映射为轻量级文本标识符（UIDs），我们的方法在减轻上下文开销的同时，保留了多模态信息以供后续访问。我们为智能体配备了定制的图像获取工具，使其能够采用渐进式、按需加载的视觉感知策略。此外，我们引入了一种数据合成流程，专门用于生成需要复杂跨模态多跳推理的查询指令。利用该流程，我们蒸馏出12K条高质量轨迹数据，对Qwen3-VL-Thinking-30A3B模型进行微调，将其转化为专业的多模态深度搜索智能体。在四个基准测试上的大量实验表明，我们的方法成功扩展至100轮次搜索范围，在MM-BrowseComp和MMSearch-Plus等具有挑战性的长程基准测试中取得了开源模型中的最优性能，同时在不同基础模型间也展现出强大的泛化能力。代码将在https://github.com/RUCAIBox/LMM-Searcher发布。

摘要 (Abstract)

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in https://github.com/RUCAIBox/LMM-Searcher.

关键词: Long-horizon Agentic Search, Multimodal Deep Search, LLM Agents, Tool Use, Supervised Fine-tuning, Cross-modal Reasoning, Visual Representation, Context Overhead Mitigation
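
The file-based visual representation described in the abstract above (offload images to an external file system, keep only lightweight textual UIDs in context, load pixels on demand through a fetch-image tool) might look roughly like the sketch below. It is an illustration under stated assumptions, not the LMM-Searcher code: the `VisualStore` class, the `img_` UID format, the `.bin` file suffix, and the `[image:UID]` placeholder syntax are all hypothetical names chosen here.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import List, Union


class VisualStore:
    """Hypothetical external store that maps image bytes to short textual UIDs."""

    def __init__(self, root: Union[str, Path]) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def offload(self, image_bytes: bytes) -> str:
        """Write the image to disk and return a lightweight identifier."""
        # The UID scheme (content hash prefix) is an assumption for illustration.
        uid = "img_" + hashlib.sha256(image_bytes).hexdigest()[:12]
        (self.root / f"{uid}.bin").write_bytes(image_bytes)
        return uid

    def fetch_image(self, uid: str) -> bytes:
        """On-demand loading: return the stored bytes for a UID."""
        return (self.root / f"{uid}.bin").read_bytes()


def build_context(observations: List[Union[str, bytes]], store: VisualStore) -> str:
    """Keep text as-is and replace each image with a short textual placeholder."""
    parts = []
    for obs in observations:
        if isinstance(obs, bytes):
            parts.append(f"[image:{store.offload(obs)}]")
        else:
            parts.append(obs)
    return "\n".join(parts)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = VisualStore(tmp)
        context = build_context(["search result page", b"\x89PNG fake bytes"], store)
        print(context)  # the image appears only as a short UID placeholder
        uid = context.split("[image:")[1].rstrip("]")
        print(len(store.fetch_image(uid)), "bytes loaded on demand")
```

The point of the design is that the running context grows by a few characters per image rather than by the image's full token cost, while the original data stays retrievable whenever the agent decides it needs to look.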

42. ❌ FastGrasp: Learning-based Whole-body Control method for Fast Dexterous Grasping with Mobile Manipulators

作者: Heng Tao, Yiming Zhong, Zemin Yang, Yuexin Ma 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12879v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

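Scoring rationale: The FastGrasp paper focuses on robotic grasp control, using reinforcement learning, a conditional variational autoencoder, and related techniques to tackle high-speed grasping with mobile robots. All scoring keywords concern large language models, the principles of deep learning techniques, or applications of AI to science, whereas this paper studies robot control and perception and belongs to robotics, so it has no direct connection to the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a learning-based framework for fast dexterous grasping with mobile robots that, through a two-stage reinforcement learning strategy and tactile feedback, achieves robust grasping of objects with different geometries in both simulation and real-world settings.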

摘要翻译

快速抓取对于物流、制造和服务应用中的移动机器人至关重要。现有方法受限于固定基座、简单夹爪或缓慢的触觉响应能力，在高速运动下的冲击稳定、实时全身协调以及跨不同物体与场景的泛化方面面临根本性挑战。我们提出FastGrasp，一种基于学习的框架，集成了抓取引导、全身控制和触觉反馈以实现移动快速抓取。我们的两阶段强化学习策略首先通过以物体点云为条件的条件变分自编码器生成多样化的抓取候选方案，随后在最优抓取选择的指导下执行移动基座、机械臂和手部的协调运动。触觉传感支持实时抓取调整，以应对冲击效应和物体差异。大量实验在仿真和真实场景中均展示了卓越的抓取性能，通过有效的仿真到现实迁移，实现了跨不同物体几何形状的鲁棒操作。

摘要 (Abstract)

Fast grasping is critical for mobile robots in logistics, manufacturing, and service applications. Existing methods face fundamental challenges in impact stabilization under high-speed motion, real-time whole-body coordination, and generalization across diverse objects and scenarios, limited by fixed bases, simple grippers, or slow tactile response capabilities. We propose FastGrasp, a learning-based framework that integrates grasp guidance, whole-body control, and tactile feedback for mobile fast grasping. Our two-stage reinforcement learning strategy first generates diverse grasp candidates via conditional variational autoencoder conditioned on object point clouds, then executes coordinated movements of mobile base, arm, and hand guided by optimal grasp selection. Tactile sensing enables real-time grasp adjustments to handle impact effects and object variations. Extensive experiments demonstrate superior grasping performance in both simulation and real-world scenarios, achieving robust manipulation across diverse object geometries through effective sim-to-real transfer.

关键词: Fast Grasping, Mobile Manipulators, Reinforcement Learning, Whole-body Control, Tactile Feedback, Sim-to-real Transfer, Conditional Variational Autoencoder, Grasp Guidance

43. ❌ AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance

作者: Abiodun A. Solanke 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12875v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

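Scoring rationale: The paper is a meta-analysis of safety evaluation benchmarks for large language models (LLMs). It is highly relevant to "Large Language Models" (10) because it explicitly focuses on LLM safety evaluation. It is somewhat related to "Instruction Tuning OR Alignment OR Value Alignment" (5) because safety evaluation touches on alignment and values, and somewhat related to "Hallucination Mitigation OR Factuality OR Truthfulness" (5) because safety evaluation includes factuality and truthfulness metrics. Other keywords (such as MoE, SLMs, training techniques, reasoning methods, and agents) are not addressed in the paper and receive 0.

!!! tip deepseek-chat TL;DR

By building the AISafetyBenchExplorer catalogue, the paper conducts a meta-analysis of 195 AI safety benchmarks and finds that benchmarks have proliferated while measurement standards remain fragmented, leaving the field without a shared measurement language or durable stewardship norms.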

摘要翻译

大型语言模型（LLM）安全评估的快速扩展催生了庞大的基准测试生态系统，但并未形成相应协调的测量生态系统。本文介绍AISafetyBenchExplorer——一个收录2018年至2026年间发布的195个AI安全基准的结构化目录，其通过多表格架构组织，记录了基准级别的元数据、指标级别的定义、基准论文元数据及代码库活跃度。该设计支持对现有基准的元分析，同时能系统性考察安全概念在文献中如何被操作化、聚合与判定。基于更新后的目录，我们发现一个核心结构性问题：基准数量的激增已超越测量标准化进程。当前格局以中等复杂度基准（94/195）为主导，仅有7个基准属于流行层级。工作簿进一步揭示出高度集中的现象：仅支持英语的评估（165/195）、纯评估型资源（170/195）、停滞的GitHub代码库（137/195）、陈旧的Hugging Face数据集（96/195），以及在已知出版渠道的基准中对arXiv预印本的严重依赖。在指标层面，目录显示常见的标签（如准确率、F1分数、安全分数和基准综合分数）往往掩盖了实质上不同的评判标准、聚合规则与威胁模型。我们认为该领域的主要失效模式是碎片化而非资源匮乏。研究者虽拥有众多基准工具，但普遍缺乏共享的测量语言、基准选择的规范依据以及发布后维护的可持续管理准则。AISafetyBenchExplorer通过提供可溯源的基准目录、受控的元数据架构和复杂度分类体系，共同支持更严谨的基准发现、比较与元评估，以弥合这一缺口。

摘要 (Abstract)

The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026, organized through a multi-sheet schema that records benchmark-level metadata, metric-level definitions, benchmark-paper metadata, and repository activity. This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization. The current landscape is dominated by medium-complexity benchmarks (94/195), while only 7 benchmarks occupy the Popular tier. The workbook further reports strong concentration around English-only evaluation (165/195), evaluation-only resources (170/195), stale GitHub repositories (137/195), stale Hugging Face datasets (96/195), and heavy reliance on arXiv preprints among benchmarks with known venue metadata. At the metric level, the catalogue shows that familiar labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often conceal materially different judges, aggregation rules, and threat models. We argue that the field’s main failure mode is fragmentation rather than scarcity. Researchers now have many benchmark artifacts, but they often lack a shared measurement language, a principled basis for benchmark selection, and durable stewardship norms for post publication maintenance. AISafetyBenchExplorer addresses this gap by providing a traceable benchmark catalogue, a controlled metadata schema, and a complexity taxonomy that together support more rigorous benchmark discovery, comparison, and meta-evaluation.

关键词: AI safety benchmarks, large language models, benchmark fragmentation, measurement standardization, meta-analysis, safety evaluation, benchmark governance, metric-aware catalogue
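
The counts quoted in the abstract above are easier to compare as shares of the 195 catalogued benchmarks. The short sketch below does only that arithmetic; the counts are taken from the abstract, while the dictionary labels and the helper name `share_percent` are chosen here for illustration.

```python
TOTAL_BENCHMARKS = 195

# Counts as reported in the abstract; the labels are paraphrased for this sketch.
counts = {
    "medium-complexity": 94,
    "Popular tier": 7,
    "English-only evaluation": 165,
    "evaluation-only resources": 170,
    "stale GitHub repositories": 137,
    "stale Hugging Face datasets": 96,
}


def share_percent(count: int, total: int = TOTAL_BENCHMARKS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {label: share_percent(n) for label, n in counts.items()}

if __name__ == "__main__":
    for label, pct in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{label:30s} {counts[label]:3d}/{TOTAL_BENCHMARKS}  {pct:5.1f}%")
```

Read this way, roughly five in six catalogued benchmarks are English-only and about seven in ten have a stale GitHub repository, which is the concentration the abstract points to.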

44. ❌ LIFE – an energy efficient advanced continual learning agentic AI framework for frontier systems

作者: Anne Lee, Gurudutt Hosangadi 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12874v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	3.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

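Scoring rationale: The paper proposes LIFE, an agent-centric AI system for continual learning and network management in HPC. It explicitly mentions "agentic AI" and an "agent centric system", so it is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (8). It also mentions "continual learning" but does not specifically involve large language models (LLMs), so "Large Language Models OR LLMs OR Foundation Models" receives only 3, since the framework may be applicable to LLMs without discussing them explicitly. Other keywords, such as MoE, SLMs, training methods, reasoning techniques, and AI-for-science applications, are not mentioned in the abstract and receive 0.

!!! tip deepseek-chat TL;DR

Targeting low energy efficiency and limited continual learning capability in high-performance computing (HPC), the paper proposes LIFE, an agent-centric, incremental, flexible, and energy-efficient AI framework for self-evolving network management and operations.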

摘要翻译

人工智能的快速发展已改变了高性能计算在规模规划、资源供给及任务执行等方面的应用范式。这不仅显著提升了能耗需求，现有粗浅的持续学习能力也制约了人工智能有效管理高性能计算系统的潜力。本文审视了超越单一Transformer架构的新兴研究方向，强调智能体人工智能与类脑架构作为实现可持续自适应系统的互补路径。我们提出LIFE框架——一种增量式、灵活且高能效的推理与学习框架，其以智能体为核心构建，而非采用单一整体模型。LIFE通过独特整合四个组件实现高性能计算系统的自主演进式网络管理与运维：编排器、智能体情境工程、新型记忆系统及信息格学习机制。该框架还具备泛化能力，可支持多种正交应用场景。我们将LIFE框架具体落地于闭环高性能计算运维案例，针对类Kubernetes集群中关键微服务面临的延迟峰值问题进行检测与缓解。

摘要 (Abstract)

The rapid advancement of AI has changed the character of HPC usage such as dimensioning, provisioning, and execution. Not only has energy demand been amplified, but existing rudimentary continual learning capabilities limit ability of AI to effectively manage HPCs. This paper reviews emerging directions beyond monolithic transformers, emphasizing agentic AI and brain inspired architectures as complementary paths toward sustainable, adaptive systems. We propose LIFE, a reasoning and Learning framework that is Incremental, Flexible, and Energy efficient that is implemented as an agent centric system rather than a single monolithic model. LIFE uniquely combines four components to realize self evolving network management and operations in HPCs. The components are an orchestrator, Agentic Context Engineering, a novel memory system, and information lattice learning. LIFE can also generalize to enable a variety of orthogonal use cases. We ground LIFE in a specific closed loop HPC operations example for detecting and mitigating latency spikes experienced by critical micro services running on a Kubernetes like cluster.

关键词: agentic AI, continual learning, energy efficient, HPC, autonomous agents, incremental learning, network management, brain inspired architectures

45. ❌ From edges to meaning: Semantic line sketches as a cognitive scaffold for ancient pictograph invention

作者: Seowung Leem, Lin Gu, Ruogu Fang 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12865v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

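Scoring rationale: The paper studies the neuro-computational origin of ancient pictographs and builds a biologically inspired digital twin of the visual hierarchy to simulate human visual cognition. Although it involves an AI framework, it is at its core an interdisciplinary study spanning cognitive science, neuroscience, and archaeology, and has no direct connection to any of the given keywords on large models, deep learning techniques, or AI applications.

!!! tip deepseek-chat TL;DR

By building a biologically inspired digital twin of the visual hierarchy, the study suggests that ancient pictographs may have arisen from the brain's intrinsic tendency to compress visual input into stable boundary-based abstractions, and it simulates the cognitive process by which humans first externalized perception into symbols.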

摘要翻译

人类能够轻易地从稀疏的线条画中识别物体，这种能力在发育早期即出现，并跨越不同文化而存在，这表明其源于神经机制而非纯粹的后天习得。然而，大脑如何将高层语义知识转化为低层视觉符号的计算机制，目前仍知之甚少。本文提出，古老的象形文字源于大脑将视觉输入压缩为稳定的、基于边界的抽象表征的内在倾向。我们构建了一个受生物学启发的视觉层级数字孪生模型，该模型将图像编码为低层特征，生成轮廓草图，并在语义表征引导下通过自上而下的反馈进行迭代优化，从而模拟了人类视觉皮层的前馈与循环结构。由此产生的符号在结构上与多个文化背景迥异的早期书写系统中的象形文字——包括埃及象形文字、中国甲骨文和原始楔形文字——呈现出惊人的相似性，并为尚未破译的古文字提供了可能的解释。我们的研究结果支持了象形文字起源于神经计算过程的观点，并建立了一个框架，使人工智能能够重演人类最初将感知外化为符号的认知过程。

摘要 (Abstract)

Humans readily recognize objects from sparse line drawings, a capacity that appears early in development and persists across cultures, suggesting neural rather than purely learned origins. Yet the computational mechanism by which the brain transforms high-level semantic knowledge into low-level visual symbols remains poorly understood. Here we propose that ancient pictographic writing emerged from the brain’s intrinsic tendency to compress visual input into stable, boundary-based abstractions. We construct a biologically inspired digital twin of the visual hierarchy that encodes an image into low-level features, generates a contour sketch, and iteratively refines it through top-down feedback guided by semantic representations, mirroring the feedforward and recurrent architecture of the human visual cortex. The resulting symbols bear striking structural resemblance to early pictographs across culturally distant writing systems, including Egyptian hieroglyphs, Chinese oracle bone characters, and proto-cuneiform, and offer candidate interpretations for undeciphered scripts. Our findings support a neuro-computational origin of pictographic writing and establish a framework in which AI can recapitulate the cognitive processes by which humans first externalized perception into symbols.

关键词: pictographic writing, visual hierarchy, semantic line sketches, neuro-computational origin, ancient scripts, cognitive scaffold, digital twin, visual symbols

46. ❌ Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic

作者: Saeed Rahmani, Shiva Rasouli, Daphne Cornelisse, Eugene Vinitsky, Bart van Arem, Simeon C. Calvert 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12857v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

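Scoring rationale: The paper is a survey of artificial intelligence applied to the simulation of mixed automated and human traffic, at the intersection of traffic engineering and computer science. Although it covers AI methods for traffic simulation, every keyword targets large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization), specific AI-for-science applications (such as bioinformatics), or specific reasoning techniques (such as CoT and MCTS). The abstract and title make no mention of large language models, novel deep learning techniques, or specific AI-for-Science subfields, and instead focus on traffic behavior modeling, simulation tools, and traditional AI methods. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

This survey systematically reviews AI methods for simulating mixed automated and human traffic, proposes a unified taxonomy spanning individual behavior modeling to full scene simulation, and analyzes the shortcomings of existing simulation platforms, with the aim of bridging the gap between traffic engineering and computer science.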

摘要翻译

自动驾驶车辆现已于公共道路上运行，这使得其测试与验证变得比以往更为关键。仿真技术为评估自动驾驶车辆在不同条件下的性能提供了安全可控的环境。然而，现有的仿真工具主要侧重于图形真实性，并依赖简单的基于规则的模型，因而无法准确呈现驾驶行为与交互的复杂性。人工智能已展现出应对这些局限性的巨大潜力；然而，尽管人工智能方法学进展迅速，目前仍缺乏对其在混合自主交通仿真中应用的全景式综述。现有综述或聚焦于仿真工具而未深入探讨其背后的人工智能方法，或仅关注以自我为中心的决策问题，未能涵盖建模周边交通这一更广泛的挑战。此外，它们也未提供一套涵盖从个体行为建模到全场景仿真的人工智能方法的统一分类体系。为弥补这些不足，本综述对混合自主交通仿真中用于建模自动驾驶车辆与人类驾驶行为的人工智能方法进行了系统性的回顾与整合。我们提出了一种分类体系，将相关方法归纳为三大类别：智能体级行为模型、环境级仿真方法以及认知与物理信息融合方法。本文分析了现有仿真平台在满足混合自主交通研究需求方面的不足，并指出了缩小这一差距的发展方向。同时，综述按时间顺序梳理了人工智能方法的发展脉络，并评述了评估协议与指标、仿真工具及相关数据集。通过融合交通工程与计算机科学双重视角，我们旨在弥合这两个领域之间的隔阂。

Abstract

Autonomous vehicles (AVs) are now operating on public roads, which makes their testing and validation more critical than ever. Simulation offers a safe and controlled environment for evaluating AV performance in varied conditions. However, existing simulation tools mainly focus on graphical realism and rely on simple rule-based models and therefore fail to accurately represent the complexity of driving behaviors and interactions. Artificial intelligence (AI) has shown strong potential to address these limitations; however, despite the rapid progress across AI methodologies, a comprehensive survey of their application to mixed autonomy traffic simulation remains lacking. Existing surveys either focus on simulation tools without examining the AI methods behind them, or cover ego-centric decision-making without addressing the broader challenge of modeling surrounding traffic. Moreover, they do not offer a unified taxonomy of AI methods covering individual behavior modeling to full scene simulation. To address these gaps, this survey provides a structured review and synthesis of AI methods for modeling AV and human driving behavior in mixed autonomy traffic simulation. We introduce a taxonomy that organizes methods into three families: agent-level behavior models, environment-level simulation methods, and cognitive and physics-informed methods. The survey analyzes how existing simulation platforms fall short of the needs of mixed autonomy research and outlines directions to narrow this gap. It also provides a chronological overview of AI methods and reviews evaluation protocols and metrics, simulation tools, and datasets. By covering both traffic engineering and computer science perspectives, we aim to bridge the gap between these two communities.

Keywords: Autonomous Vehicles, Traffic Simulation, AI Methods, Mixed Autonomy, Driving Behavior Modeling, Simulation Tools, Taxonomy, Survey

47. ❌ Loop Corrections to the Training and Generalization Errors of Random Feature Models

Authors: Taeyoung Kim Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12827v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

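Scoring rationale: The paper is a theoretical analysis of random feature models that uses statistical-physics methods to derive loop corrections to the training, test, and generalization errors; it belongs to neural network theory. The keywords all concern large models, specific deep-learning techniques, or concrete applications (such as AI for Science), whereas this paper is a mathematical analysis of random feature models with no direct connection to those techniques, applications, or large-model research. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

Using statistical-physics methods, this paper studies loop corrections to the training, test, and generalization errors of random feature models, derives their scaling laws, and supports the theory with experiments.

Abstract (Chinese translation)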

我们研究随机特征模型，其中从预设初始化分布中采样的神经网络被冻结并用作随机特征，仅优化读出权重。采用统计物理学的视角，我们研究了超越平均核近似的训练误差、测试误差和泛化误差。由于预测器是诱导随机核的非线性泛函，系综平均误差不仅取决于平均核，还依赖于高阶涨落统计量。在一个有效的场论框架内，这些有限宽度贡献自然地以圈修正的形式出现。我们推导了训练误差、测试误差和泛化误差的圈修正，得到了它们的标度律，并通过实验验证支持了该理论。

Abstract

We investigate random feature models in which neural networks sampled from a prescribed initialization ensemble are frozen and used as random features, with only the readout weights optimized. Adopting a statistical-physics viewpoint, we study the training, test, and generalization errors beyond the mean-kernel approximation. Since the predictor is a nonlinear functional of the induced random kernel, the ensemble-averaged errors depend not only on the mean kernel but also on higher-order fluctuation statistics. Within an effective field-theoretic framework, these finite-width contributions naturally appear as loop corrections. We derive the loop corrections to the training, test, and generalization errors, obtain their scaling laws, and support the theory with experimental verification.

Keywords: random feature models, neural networks, training errors, generalization errors, loop corrections, statistical physics, scaling laws, mean-kernel approximation
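The setup described in the abstract (a network sampled once from an initialization ensemble, frozen, with only the readout weights trained) can be illustrated with a small sketch. This is not the paper's code: the `tanh` activation, the `1/sqrt(width)` feature scaling, plain gradient descent, and the toy regression target are all assumptions chosen for illustration.

```python
import math
import random


def make_random_features(input_dim, width, seed=0):
    """Return a frozen random feature map x -> phi(x) with `width` features."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(input_dim)
    # Hidden weights are sampled once and never trained.
    hidden = [[rng.gauss(0.0, scale) for _ in range(input_dim)]
              for _ in range(width)]

    def features(x):
        # tanh activations scaled by 1/sqrt(width), so ||phi(x)||^2 <= 1.
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) / math.sqrt(width)
                for row in hidden]

    return features


def predict(readout, phi):
    return sum(a * p for a, p in zip(readout, phi))


def mean_squared_error(readout, phis, targets):
    return sum((predict(readout, p) - y) ** 2
               for p, y in zip(phis, targets)) / len(targets)


def fit_readout(phis, targets, steps=500, lr=0.5):
    """Gradient descent on the readout weights only; features stay fixed."""
    readout = [0.0] * len(phis[0])
    n = len(targets)
    for _ in range(steps):
        grad = [0.0] * len(readout)
        for phi, y in zip(phis, targets):
            err = predict(readout, phi) - y
            for k, p in enumerate(phi):
                grad[k] += 2.0 * err * p / n
        readout = [a - lr * g for a, g in zip(readout, grad)]
    return readout


# Hypothetical toy regression task, for illustration only.
rng = random.Random(1)
xs = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(20)]
ys = [math.sin(x[0]) + 0.5 * x[1] for x in xs]

features = make_random_features(input_dim=3, width=30)
phis = [features(x) for x in xs]
readout = fit_readout(phis, ys)
print("training MSE:", mean_squared_error(readout, phis, ys))
```

Repeating the fit over many seeds and averaging the errors would give the ensemble-averaged quantities the abstract refers to; the finite-width corrections themselves are derived in the paper and are not reproduced here.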

48. ❌ Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models

Authors: Iman Islam, Bram Ruijsink, Andrew J. Reader, Andrew P. King Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12832v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

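Scoring rationale: The paper applies deep learning to medical image segmentation, specifically the detection and refurbishment of ground truth label errors in echocardiography segmentation. Its core contribution is a data quality control method used during model training (VOG-based detection and pseudo-labelling); it does not involve large language models (LLMs), architectural innovations (e.g., MoE), training techniques (e.g., RLHF, PEFT), inference optimization, or agent systems. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to biomedicine (echocardiography analysis); since this is not its core contribution, it is rated 5/10 (some relevance). All other keywords are unrelated to the paper and are rated 0.

!!! tip deepseek-chat TL;DR

This study addresses ground truth label errors in echocardiography segmentation, proposing a detection method based on Variance of Gradients (VOG) and a pseudo-labelling refurbishment strategy; experiments show that the approach improves model performance, particularly under high error rates.

Abstract (Chinese translation)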

基于深度学习的医学图像分割通常依赖于通过人工标注获取的真实标签（GT），但这些标签可能存在随机误差或系统性偏差。本研究探讨了深度学习模型对超声心动图分割中此类误差的鲁棒性，并评估了一种在模型训练期间检测和修复错误标签的新策略。利用CAMUS数据集，我们模拟了三种误差类型，随后比较了基于损失函数的GT标签误差检测方法与基于梯度方差（Variance of Gradients, VOG）的方法。我们还提出了一种伪标签方法以修复疑似错误的GT标签。我们评估了所提方法在不同误差水平下的性能。结果表明，VOG在训练期间标记错误GT标签方面极为有效。然而，标准的U-Net模型在随机标签误差和中等水平的系统性误差（最高50%）下仍保持强劲性能。检测与修复策略则进一步提升了模型性能，尤其是在高误差条件下。

Abstract

Deep learning-based medical image segmentation typically relies on ground truth (GT) labels obtained through manual annotation, but these can be prone to random errors or systematic biases. This study examines the robustness of deep learning models to such errors in echocardiography (echo) segmentation and evaluates a novel strategy for detecting and refurbishing erroneous labels during model training. Using the CAMUS dataset, we simulate three error types, then compare a loss-based GT label error detection method with one based on Variance of Gradients (VOG). We also propose a pseudo-labelling approach to refurbish suspected erroneous GT labels. We assess the performance of our proposed approach under varying error levels. Results show that VOG proved highly effective in flagging erroneous GT labels during training. However, a standard U-Net maintained strong performance under random label errors and moderate levels of systematic errors (up to 50%). The detection and refurbishment approach improved performance, particularly under high-error conditions.

Keywords: deep learning, medical image segmentation, echocardiography, ground truth errors, Variance of Gradients (VOG), pseudo-labelling, U-Net, CAMUS dataset

49. ❌ RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

Authors: Jagadeesh Rachapudi, Pranav Singh, Ritali Vatsi, Praful Hambarde, Amit Shukla Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12820v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

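Scoring rationale: The paper's core topic is machine unlearning for LLMs, which is highly relevant to 'Large Language Models' (10). It covers harmful knowledge suppression and misinformation correction, which is highly relevant to 'Hallucination Mitigation' (10). The proposed STAMP method supports efficient on-device unlearning, which is relevant to 'Small Language Models / On-device AI' (8). Unlearning is achieved by redirecting activations, which falls under 'Self-Correction' (8). The paper discusses problems arising from pretraining data, giving some relevance to 'Pre-training' (5). STAMP's low-rank variant reports up to ~3x speedup over training-based baselines, giving some relevance to 'Inference Acceleration' (5). Other keywords such as MoE, Scaling Laws, and RLHF are not covered and are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes RePAIR, an interactive machine unlearning framework that lets users instruct LLMs in natural language, at inference time, to forget harmful knowledge, misinformation, and personal data. It enables efficient on-device unlearning and reports near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility.

Abstract (Chinese translation)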

大型语言模型（LLMs）在大规模网络语料库上进行预训练时，不可避免地会吸收有害知识、错误信息和个人数据，且缺乏选择性遗忘的固有机制。虽然机器遗忘提供了一种原则性解决方案，但现有方法以服务提供商为中心，需要重新训练流程、精心构建的保留数据集以及模型服务提供商（MSPs）的直接干预，从而将终端用户排除在自身数据的控制之外。我们提出了交互式机器遗忘（IMU）这一新范式，使用户能够在推理阶段通过自然语言指令引导LLMs遗忘特定知识。为实现IMU，我们设计了RePAIR框架——一种提示感知的模型修复系统，包含三个核心组件：（i）用于检测遗忘意图的看守模型，（ii）生成修复程序的外科医生模型，以及（iii）可自主更新参数的病人模型。在RePAIR的核心层，我们开发了基于伪逆的激活导向调控技术（STAMP），这是一种免训练的单样本遗忘方法，通过闭式伪逆更新将多层感知机（MLP）激活向量重定向至拒绝子空间。其低秩变体将计算复杂度从O(d^3)降至O(r^3 + r^2 * d)，实现了高效的设备端遗忘，相比基于训练的方法可获得约3倍的加速。在有害知识抑制、错误信息校正和个人数据擦除等场景的广泛实验中，RePAIR实现了接近零的遗忘分数（遗忘准确率Acc_f = 0.00，遗忘奖励F-RL = 0.00），同时保持了模型性能（保留准确率Acc_r最高达84.47，保留奖励R-RL最高达0.88），其表现优于六种前沿基线方法。这些结果表明RePAIR是用户驱动模型编辑的有效实践框架，推进了对已学知识的透明化与设备端控制，并具备向多模态基础模型扩展的潜力。

Abstract

Large language models (LLMs) inherently absorb harmful knowledge, misinformation, and personal data during pretraining on large-scale web corpora, with no native mechanism for selective removal. While machine unlearning offers a principled solution, existing approaches are provider-centric, requiring retraining pipelines, curated retain datasets, and direct intervention by model service providers (MSPs), thereby excluding end users from controlling their own data. We introduce Interactive Machine Unlearning (IMU), a new paradigm in which users can instruct LLMs to forget targeted knowledge through natural language at inference time. To realize IMU, we propose RePAIR, a prompt-aware model repair framework comprising (i) a watchdog model for unlearning intent detection, (ii) a surgeon model for generating repair procedures, and (iii) a patient model whose parameters are updated autonomously. At the core of RePAIR, we develop Steering Through Activation Manipulation with PseudoInverse (STAMP), a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via closed-form pseudoinverse updates. Its low-rank variant reduces computational complexity from O(d^3) to O(r^3 + r^2 * d), enabling efficient on-device unlearning with up to ~3x speedup over training-based baselines. Extensive experiments across harmful knowledge suppression, misinformation correction, and personal data erasure demonstrate that RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. These results establish RePAIR as an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.

Keywords: Machine Unlearning, Large Language Models, Interactive Unlearning, On-device AI, Harmful Knowledge Suppression, Model Repair, Activation Manipulation, Inference-time Editing

50. ❌ DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

Authors: Hao Yan, Yuliang Liu, Xingchen Liu, Yuyi Zhang, Minghui Liao, Jihao Wu, Wei Chen, Xiang Bai Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12812v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

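Scoring rationale: DocSeeker addresses the performance degradation of multimodal large language models on long document understanding, proposing a structured reasoning workflow and a two-stage training framework. The core relevant keywords are: 1) large language models (LLMs), the underlying technology; 2) supervised fine-tuning (SFT), a key part of the training framework; 3) retrieval-augmented generation (RAG), mentioned as a synergistic system; 4) long-context LLMs, the direct application setting; and 5) chain of thought (CoT) and System 2 thinking, reflected in the structured reasoning workflow. Other keywords such as MoE, quantization, and AI for Science are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address the performance degradation of multimodal large language models on long document understanding, caused by a low signal-to-noise ratio and scarce supervision, the paper proposes a structured Analysis-Localization-Reasoning workflow and a two-stage training framework. It reports superior performance on both in-domain and out-of-domain tasks, generalization from short-page training to ultra-long documents, and synergy with visual RAG systems.

Abstract (Chinese translation)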

现有多模态大语言模型在长文档理解任务中，随着文档长度的增加，性能会出现显著下降。这源于两个根本性挑战：1）信噪比低，关键证据被淹没在无关页面中；2）监督信号稀缺，仅提供最终简短答案的数据集所能提供的学习信号较弱。本文通过提出一种要求模型执行结构化“分析、定位与推理”工作流程的范式来解决这些挑战。为培养这种能力，我们设计了一个两阶段训练框架：首先，我们利用通过高效知识蒸馏策略生成的高质量数据进行监督微调。随后，我们采用一种证据感知的组相对策略优化方法，联合优化证据定位和答案准确性。此外，我们引入了一种证据引导的分辨率分配策略，以缓解在多页文档训练时的内存限制。大量实验表明，DocSeeker 在领域内和领域外任务上均取得了卓越性能。我们证明其能够稳健地从短页训练泛化到超长文档，并能自然地与视觉检索增强生成系统协同工作，为其实施提供了坚实基础。

Abstract

Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured "Analysis, Localization and Reasoning" workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as a solid foundation for their implementation.

Keywords: Multimodal Large Language Models, Long Document Understanding, Structured Reasoning, Supervised Fine-Tuning, Evidence Grounding, Retrieval-Augmented Generation, Signal-to-Noise Ratio, Visual Reasoning

51. ❌ Algorithmic Analysis of Dense Associative Memory: Finite-Size Guarantees and Adversarial Robustness

Authors: Madhava Gaikwad Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12811v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

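Scoring rationale: The paper is a theoretical analysis of Dense Associative Memory (DAM), which belongs to classical neural network theory and is unrelated to all of the scoring keywords (which center on large models and on deep-learning techniques and applications). It does not touch on large models, LLMs, MoE, training methods, inference optimization, alignment, agents, compression, AI for science, or other modern deep-learning topics.

!!! tip deepseek-chat TL;DR

The paper gives an algorithmic analysis of the retrieval dynamics of Dense Associative Memory (DAM), proving geometric convergence, adversarial robustness bounds, and storage capacity guarantees at finite size.

Abstract (Chinese translation)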

稠密联想记忆（Dense Associative Memory，DAM）通过高阶相互作用推广了霍普菲尔德网络，并在适当的模式分离条件下实现了存储容量按 $O(N^{n-1})$ 缩放。现有的动力学分析主要研究随机采样模式下的热力学极限 $N\to\infty$，因此未能提供有限规模保证或显式收敛速率。我们发展了一种针对DAM检索动力学的算法分析，该分析在显式且可验证的模式条件下给出了有限$N$的保证。在高负载下的分离假设和有界干扰条件下，我们证明了异步检索动力学的几何收敛性，这意味着一旦轨迹进入吸引域，收敛时间即为 $O(\log N)$。我们进一步建立了通过显式裕度条件表达的对抗鲁棒性界限，该条件量化了每次扫描可容忍的损坏比特数，并推导出在最坏情况下容量保证按 $\Theta(N^{n-1})$ 缩放（至多相差多对数因子），同时对于随机模式集合恢复了经典的 $\Theta(N^{n-1})$ 缩放结果。最后，我们证明了DAM检索动力学允许一种势博弈解释，这确保了在异步更新下收敛到纯纳什均衡。完整的证明在附录中提供，同时附有初步实验，以说明所预测的收敛性、鲁棒性和容量缩放行为。

Abstract

Dense Associative Memory (DAM) generalizes Hopfield networks through higher-order interactions and achieves storage capacity that scales as $O(N^{n-1})$ under suitable pattern separation conditions. Existing dynamical analyses primarily study the thermodynamic limit $N\to\infty$ with randomly sampled patterns and therefore do not provide finite-size guarantees or explicit convergence rates. We develop an algorithmic analysis of DAM retrieval dynamics that yields finite-$N$ guarantees under explicit, verifiable pattern conditions. Under a separation assumption and a bounded-interference condition at high loading, we prove geometric convergence of asynchronous retrieval dynamics, which implies $O(\log N)$ convergence time once the trajectory enters the basin of attraction. We further establish adversarial robustness bounds expressed through an explicit margin condition that quantifies the number of corrupted bits tolerable per sweep, and derive capacity guarantees that scale as $\Theta(N^{n-1})$ up to polylogarithmic factors in the worst case, while recovering the classical $\Theta(N^{n-1})$ scaling for random pattern ensembles. Finally, we show that DAM retrieval dynamics admit a potential-game interpretation that ensures convergence to pure Nash equilibria under asynchronous updates. Complete proofs are provided in the appendices, together with preliminary experiments that illustrate the predicted convergence, robustness, and capacity scaling behavior.

Keywords: Dense Associative Memory, Hopfield networks, finite-size guarantees, adversarial robustness, storage capacity, convergence analysis, algorithmic analysis, retrieval dynamics
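The asynchronous retrieval dynamics mentioned in the abstract can be illustrated with the textbook polynomial-energy form of a dense associative memory, where the energy is the negative sum of $F(\xi^\mu \cdot \sigma)$ over stored patterns with $F(x) = x^n$. The sketch below is an assumption-laden illustration rather than the paper's algorithm: the choice $n = 3$, the tie-breaking rule, and the three hand-picked orthogonal patterns are all hypothetical.

```python
def interaction(x, n=3):
    """Polynomial interaction function F(x) = x**n."""
    return x ** n


def update_bit(patterns, state, i, n=3):
    """Asynchronously set bit i to the value that gives the lower energy."""
    drive = 0
    for xi in patterns:
        # Overlap of this pattern with the state, excluding bit i.
        h = sum(p * s for j, (p, s) in enumerate(zip(xi, state)) if j != i)
        drive += interaction(xi[i] + h, n) - interaction(-xi[i] + h, n)
    if drive > 0:
        return 1
    if drive < 0:
        return -1
    return state[i]  # tie: keep the current value


def retrieve(patterns, probe, n=3, max_sweeps=10):
    """Sweep over all bits until a full sweep changes nothing."""
    state = list(probe)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            new_value = update_bit(patterns, state, i, n)
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


# Three mutually orthogonal +/-1 patterns (hypothetical example data).
patterns = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
]
probe = list(patterns[1])
probe[0] = -probe[0]  # corrupt one bit of the second pattern
recovered = retrieve(patterns, probe)
print(recovered == patterns[1])
```

In this toy case the target pattern's contribution to each bit update dominates the interference from the other two patterns, which is the kind of margin argument the abstract's separation and bounded-interference conditions formalize.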

52. ❌ Rethinking Satellite Image Restoration for Onboard AI: A Lightweight Learning-Based Approach

Authors: Adrien Dorise, Marjorie Bellizzi, Omar Hlimi Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12807v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

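Scoring rationale: The paper presents a lightweight CNN (ConvBEERS) for satellite image restoration as preprocessing for onboard AI, which is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, agent systems, and so on). It has only some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5/10), since satellite image processing can be viewed as an application of AI to science and remote sensing, though not to bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The paper proposes ConvBEERS, a lightweight convolutional neural network for satellite image restoration intended to replace traditional compute-intensive pipelines. Experiments report a +6.9 dB PSNR improvement in image quality and up to +5.1% mAP@50 on a downstream object detection task, and an FPGA deployment achieves a ~41x latency reduction, supporting the feasibility of onboard processing.

Abstract (Chinese translation)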

卫星图像复原旨在通过补偿成像系统与采集条件引入的退化效应（如噪声与模糊）来提升图像质量。作为一项基础预处理步骤，复原过程直接影响地面产品生成与新兴的星载人工智能应用。基于串行物理模型的传统复原流程计算密集且速度缓慢，难以适用于星载环境。本文提出ConvBEERS：一种面向太空应用的卷积式星载就绪嵌入式高效复原模型，旨在探究基于模拟卫星数据训练的轻量化非生成式残差卷积网络，是否能在多种操作条件下达到或超越传统地面处理复原流程的性能。
在模拟数据集及真实Pleiades-HR影像上进行的实验表明，所提方法实现了具有竞争力的图像质量，峰值信噪比（PSNR）提升达+6.9dB。在下游目标检测任务中的评估显示，复原处理显著提升了检测性能，平均精度均值（mAP@50）最高提升+5.1%。此外，该模型在Xilinx Versal VCK190 FPGA上的成功部署验证了其在卫星星载处理中的实际可行性，与传统流程相比延迟降低约41倍。这些结果证明了利用轻量化卷积网络在满足星载系统实际约束的同时，能够实现具有竞争力的复原质量。

Abstract

Satellite image restoration aims to improve image quality by compensating for degradations (e.g., noise and blur) introduced by the imaging system and acquisition conditions. As a fundamental preprocessing step, restoration directly impacts both ground-based product generation and emerging onboard AI applications. Traditional restoration pipelines based on sequential physical models are computationally intensive and slow, making them unsuitable for onboard environments. In this paper, we introduce ConvBEERS: a Convolutional Board-ready Embedded and Efficient Restoration model for Space to investigate whether a light and non-generative residual convolutional network, trained on simulated satellite data, can match or surpass a traditional ground-processing restoration pipeline across multiple operating conditions. Experiments conducted on simulated datasets and real Pleiades-HR imagery demonstrate that the proposed approach achieves competitive image quality, with a +6.9dB PSNR improvement. Evaluation on a downstream object detection task demonstrates that restoration significantly improves performance, with up to +5.1% mAP@50. In addition, successful deployment on a Xilinx Versal VCK190 FPGA validates its practical feasibility for satellite onboard processing, with a ~41x reduction in latency compared to the traditional pipeline. These results demonstrate the relevance of using lightweight CNNs to achieve competitive restoration quality while addressing real-world constraints in spaceborne systems.

Keywords: satellite image restoration, lightweight CNN, onboard AI, convolutional network, FPGA deployment, object detection, latency reduction, embedded processing
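For readers unfamiliar with the PSNR figure quoted above, the following sketch shows the standard peak signal-to-noise ratio formula, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. It is only an illustration of the metric: the 8-bit peak value of 255 and the sample pixel values are assumptions, and the paper's exact evaluation protocol is not described in the abstract.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical flattened pixel values, for illustration only.
reference = [100.0, 100.0]
degraded = [125.5, 74.5]
print(psnr(reference, degraded))
```

Because PSNR is logarithmic in the mean squared error, a gain of +6.9 dB corresponds to roughly a 4.9-fold reduction in mean squared error at a fixed peak value.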

53. ❌ Efficiency of Proportional Mechanisms in Online Auto-Bidding Advertising

Authors: Nguyen Kim Thang Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12799v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

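Scoring rationale: The paper studies the efficiency of auction mechanisms for auto-bidding in online advertising (price of anarchy bounds), which belongs to operations research, mechanism design, and computational economics. It does not involve large models, deep learning, AI techniques, or AI applications in science. All keywords concern large-model techniques, AI applications, or related methodology, whereas this paper is a purely theoretical mechanism analysis, so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the efficiency of proportional mechanisms for auto-bidding in online advertising under the liquid welfare objective. It establishes a tight price of anarchy (PoA) bound of 2 for the standard proportional mechanism and proposes a modified payment scheme that improves the bound to $1 + \frac{O(1)}{n-1}$, approaching full efficiency as the number of agents $n$ grows.

Abstract (Chinese translation)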

在线广告中自动竞价策略的兴起，为设计和分析高效拍卖机制带来了新的挑战。本文聚焦于自动竞价背景下的比例分配机制，研究在流动性福利目标下纯纳什均衡的效率，具体分析无政府状态价格。我们首先为标准比例机制建立了紧致的无政府状态价格上界为2。随后，我们提出一种采用替代支付方案的改进版本，该方案实现了$1 + \frac{O(1)}{n-1}$的无政府状态价格上界，其中$n \geq 2$表示竞价智能体数量。这一改进突破了现有的无政府状态价格上界2的壁垒，并随着智能体数量的增加趋近于完全效率。我们的方法利用了线性与凸规划中的对偶理论和Karush-Kuhn-Tucker（KKT）条件。尽管概念简洁，但该方法被证明是强有力的，并可能为建立无政府状态价格上界提供更广泛的应用前景。

Abstract

The rise of automated bidding strategies in online advertising presents new challenges in designing and analyzing efficient auction mechanisms. In this paper, we focus on proportional mechanisms within the context of auto-bidding and study the efficiency of pure Nash equilibria, specifically the price of anarchy (PoA), under the liquid welfare objective. We first establish a tight PoA bound of 2 for the standard proportional mechanism. Next, we introduce a modified version with an alternative payment scheme that achieves a PoA bound of $1 + \frac{O(1)}{n-1}$ where $n \geq 2$ denotes the number of bidding agents. This improvement surpasses the existing PoA barrier of 2 and approaches full efficiency as the number of agents increases. Our methodology leverages duality and the Karush-Kuhn-Tucker (KKT) conditions from linear and convex programming. Despite its conceptual simplicity, our approach proves powerful and may offer broader applications for establishing PoA bounds.

Keywords: online advertising, auto-bidding, proportional mechanisms, price of anarchy, liquid welfare, auction mechanisms, Nash equilibria, KKT conditions

54. ❌ VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation

Authors: Yupeng Sun, Yanzhao Li, Zhiqiang Zou, Bai Du, Zhiyuan Zhang, Hui Dong, Gaoyige Fan, Hui Wang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12798v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

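Scoring rationale: The paper's core contribution is optimizing FlashAttention's online softmax computation: it pre-computes a global maximum, reorders key blocks, and freezes the maximum to relieve the vector-operation bottleneck, which places it among large-model inference acceleration techniques. It is highly relevant to "KV Cache Compression OR Linear Attention OR FlashAttention" (10 points) because it directly improves the FlashAttention algorithm; relevant to "Speculative Decoding OR Inference Acceleration" (8 points) because the goal is to accelerate attention computation; and somewhat related to "Large Language Models OR LLMs OR Foundation Models" (5 points) because attention is a core component of large models. The other keywords cover model training, alignment, applications, and similar areas that the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

This paper proposes VFA, which optimizes FlashAttention's online softmax by initializing the running maximum from a cheap approximation, reordering key blocks, and freezing the maximum for the remaining blocks, relieving the vector-operation bottleneck without performance loss. The C8V32, C4V32, and C4V16 configurations achieve nearly 2x speedup over the C16V32 baseline on modern hardware, and the authors project a 6x speedup for C4V16 on upcoming architectures.

Abstract translation (Chinese)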

FlashAttention风格的在线softmax通过将分数分片流式传输至片上内存并维护运行最大值与归一化因子，实现了线性内存下的精确注意力计算。然而，随着现代加速器上注意力内核逐渐接近张量核心/立方核心的峰值吞吐，在线softmax中的非矩阵乘法运算——尤其是每分片的行最大值（rowmax）与行求和（rowsum）规约及重缩放链——可能受限于向量或SIMD单元，成为延迟主导因素。本文重新审视FlashAttention，提出向量缓解闪存注意力（Vector Relieved Flash Attention, VFA），这是一种硬件友好的方法，在保持在线softmax结构的同时，减少了由行最大值驱动的运行最大值更新次数。VFA通过键块表示的廉价近似初始化运行最大值，重新排序键块遍历以优先处理高影响力的汇聚块与局部块，并对剩余块冻结最大值以避免重复规约与重缩放。我们进一步将VFA与块稀疏跳过方法（如BLASST）结合，形成向量缓解稀疏注意力（Vector Relieved Sparse Attention, VSA），从而同时减少块数量与每块开销。值得注意的是，VFA与VSA完全避免了FA4.0更新阶段使用的条件重缩放操作。在MMLU与MATH500等基准测试及注意力统计数据上的广泛评估验证了我们的设计：（i）汇聚块与局部块重排序能早期稳定运行最大值；（ii）简单的查询块与键块摘要因块内异质性而失效；（iii）当最大值出现在中间块时需要采用最大值初始化。总体而言，VFA与VSA有效缓解了在线softmax规约瓶颈且无性能损失。相较于C16V32基线，C8V32、C4V32与C4V16在现代硬件上实现了近两倍加速，同时触及向量瓶颈。随着未来架构改进，通过增强指数容量，C4V16将实现六倍加速。

Abstract

FlashAttention-style online softmax enables exact attention computation with linear memory by streaming score tiles through on-chip memory and maintaining a running maximum and normalizer. However, as attention kernels approach peak tensor-core/cube-core throughput on modern accelerators, non-matmul components of online softmax – especially per-tile rowmax and rowsum reductions and rescale chains – can become vector or SIMD limited and dominate latency. This paper revisits FlashAttention and proposes Vector Relieved Flash Attention (VFA), a hardware-friendly method that reduces rowmax-driven updates of the running maximum while retaining the online-softmax structure. VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for remaining blocks to avoid repeated reductions and rescaling. We further integrate VFA with block-sparse skipping methods such as BLASST to form Vector Relieved Sparse Attention (VSA), which reduces both block count and per-block overhead. Notably, VFA and VSA completely avoid the conditional rescale operation in the update stage used in FA4.0. Extensive evaluations on benchmarks including MMLU and MATH500, together with attention statistics, verify our design: (i) sink and local reordering stabilizes the running maximum early; (ii) simple Q and K block summaries fail due to intra-block heterogeneity; (iii) m-initialization is required when maxima appear in middle blocks. Overall, VFA and VSA efficiently alleviate online-softmax reduction bottlenecks without performance loss. Compared to the C16V32 baseline, C8V32, C4V32 and C4V16 achieve nearly two times speedup on modern hardware while hitting the vector bottleneck. With upcoming architecture improvements, C4V16 will deliver six times speedup by enhancing exponent capacity.
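The running-maximum bookkeeping described in the abstract can be sketched in NumPy for a single query row. This is an illustrative sketch, not the authors' kernel: `attention_online` is the usual streaming update with a per-block rowmax and rescale, while `attention_frozen_max` uses one fixed shift for every block, which removes the per-block max reduction and rescale. The Cauchy-Schwarz bound used as the frozen value is an assumption for the example, not VFA's initializer.

```python
import numpy as np


def attention_reference(q, K, V):
    """Plain softmax attention for one query (materializes all scores)."""
    s = K @ q
    p = np.exp(s - s.max())
    return (p @ V) / p.sum()


def attention_online(q, K, V, block=4):
    """Streaming softmax: update the running max and rescale on every block."""
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q
        m_new = max(m, s.max())      # per-block rowmax reduction
        scale = np.exp(m - m_new)    # rescale factor (0.0 on the first block)
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + block]
        m = m_new
    return acc / l


def attention_frozen_max(q, K, V, m_frozen, block=4):
    """Streaming softmax with a fixed shift: no max reduction, no rescale."""
    l = 0.0
    acc = np.zeros(V.shape[1])
    for i in range(0, K.shape[0], block):
        p = np.exp(K[i:i + block] @ q - m_frozen)
        l += p.sum()
        acc += p @ V[i:i + block]
    return acc / l


rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 4))
# Assumed stand-in for a cheap estimate: |q . k| <= ||q|| * ||k||.
m_bound = np.linalg.norm(q) * np.linalg.norm(K, axis=1).max()
```

Because softmax is invariant to a constant shift, the frozen variant matches the reference in exact arithmetic. In floating point the shift must stay close enough to the true maximum to avoid overflow or underflow, which is presumably why the initialization and block ordering matter.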

Keywords: FlashAttention, online softmax, vector operations, attention computation, inference acceleration, running maximum, block-sparse skipping, hardware-friendly optimization

55. ❌ OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

Authors: Zhiyuan Zhang, Yanzhao Li, Zhiqiang Zou, Bai Du, Yupeng Sun, Hui Dong, Hui Wang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12782v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

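Scoring rationale: The paper's core topic is 4-bit (W4A4) quantization of large language models (LLMs), which falls under model compression and inference acceleration. It is highly relevant to "Large Language Models" (10 points) because it explicitly targets LLM quantization; highly relevant to "Quantization OR Model Compression OR Low-bit Weights" (15 points) because this is the paper's core technique; and relevant to "Speculative Decoding OR Inference Acceleration" (10 points) because it aims at high-throughput deployment and faster inference through hardware-efficient quantization. Other keywords, such as MoE, SLMs, alignment, and RAG, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes OSC, a hardware-efficient W4A4 quantization framework that separates activation outliers along the channel dimension to reduce the accuracy loss of quantizing large language models. On Qwen3-8B and Qwen3-30B it limits the average accuracy drop to 2.19 and 1.12 points, respectively, with a peak speedup of 1.78x over a W8A8 GEMM baseline.

Abstract translation (Chinese)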

尽管4位量化对于大语言模型的高吞吐量部署至关重要，但激活值中的异常值常因低比特格式受限的动态范围而导致显著的精度下降。本文系统性地研究了异常值的空间分布，并揭示了一种令牌间持续存在的结构聚类效应，即高幅值异常值在不同令牌间持续占据固定的通道。基于这一发现，我们提出了OSC，一种硬件高效的异常值抑制框架。在推理过程中，OSC执行双路径计算，包括一条低精度4位通用矩阵乘法（GEMM）路径和一条高精度16位分支GEMM路径。具体而言，OSC采用离线分组策略识别异常值所在的通道，随后在线执行结构化子张量提取，将这些分散的激活通道聚合为紧凑的稠密张量。该机制通过规范化且高吞吐的GEMM操作实现异常值保护，能够无缝适配现代4位微缩放硬件。此外，针对异常值聚类效应较不显著的W2输入，我们集成了回退至FP8的策略。在Qwen3-8B和Qwen3-30B上的评估将平均精度损失分别限制在2.19和1.12个百分点。值得注意的是，OSC具备优异的硬件友好性，在现代AI加速器上相比W8A8 GEMM基线实现了最高1.78倍的加速。

Abstract

While 4-bit quantization is essential for high-throughput deployment of Large Language Models, activation outliers often lead to significant accuracy degradation due to the restricted dynamic range of low-bit formats. In this paper, we systematically investigate the spatial distribution of outliers and demonstrate a token-persistent structural clustering effect, where high-magnitude outliers consistently occupy fixed channels across tokens. Building on this insight, we propose OSC, a hardware-efficient framework for outlier suppression. During inference, OSC executes a dual-path computation consisting of a low-precision 4-bit General Matrix Multiplication (GEMM) path and a high-precision 16-bit branch GEMM path. Specifically, OSC uses an offline group-wise strategy to identify the channels where outliers are located and then performs structured sub-tensor extraction to coalesce these scattered activation channels into a compact dense tensor online. This mechanism implements outlier protection through regularized and high-throughput GEMM operations, achieving a seamless fit with modern 4-bit micro-scaling hardware. Furthermore, for the inputs of W2 where outlier clustering is less pronounced, we integrate a fallback strategy to FP8. Evaluation on Qwen3-8B and Qwen3-30B restricts the average accuracy drop to 2.19 and 1.12 points, respectively. Notably, OSC is highly hardware-friendly, achieving a peak speedup of 1.78x over the W8A8 GEMM baseline on a modern AI accelerator.
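The dual-path idea in the abstract can be approximated in NumPy: channels flagged as outliers are gathered into a compact dense tensor and multiplied in 16-bit precision, while the remaining channels go through simulated 4-bit quantization. This is a sketch under stated assumptions, not the OSC implementation. The outlier channel indices are supplied by hand rather than found by the paper's offline group-wise strategy, `fake_quant_int4` is a hypothetical symmetric quantize-dequantize helper, and the micro-scaling formats and FP8 fallback are not modeled.

```python
import numpy as np


def fake_quant_int4(x, axis):
    """Symmetric 4-bit quantize-dequantize with one scale per slice along `axis`."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)
    return np.clip(np.round(x / scale), -8, 7) * scale


def dual_path_matmul(x, w, outlier_channels):
    """x: (tokens, in_channels), w: (in_channels, out_channels)."""
    idx = np.asarray(outlier_channels, dtype=int)
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[idx] = True
    # High-precision branch: gather the outlier channels into a dense tensor.
    x_hi = x[:, mask].astype(np.float16).astype(np.float32)
    w_hi = w[mask, :].astype(np.float16).astype(np.float32)
    y_hi = x_hi @ w_hi
    # Low-precision branch: simulated 4-bit GEMM on the remaining channels.
    x_lo = fake_quant_int4(x[:, ~mask], axis=1)  # one scale per token
    w_lo = fake_quant_int4(w[~mask, :], axis=0)  # one scale per output channel
    y_lo = x_lo @ w_lo
    return y_hi + y_lo


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64))
x[:, [3, 17]] *= 50.0  # token-persistent outliers in two fixed channels
w = rng.standard_normal((64, 16))

y_ref = x @ w
err_dual = np.abs(dual_path_matmul(x, w, [3, 17]) - y_ref).mean()
err_naive = np.abs(dual_path_matmul(x, w, []) - y_ref).mean()
```

With the outlier channels routed to the 16-bit branch, the 4-bit scales are set by ordinary-magnitude activations, so `err_dual` comes out well below `err_naive`, where the outliers stretch every token's quantization range.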

Keywords: Quantization, Large Language Models, W4A4, Outlier Suppression, Hardware Efficiency, Inference Acceleration, GEMM, Model Compression

56. ❌ Efficient Adversarial Training via Criticality-Aware Fine-Tuning

Authors: Wenyun Li, Zheng Zhang, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12780v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

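Scoring rationale: The paper focuses on adversarial training of Vision Transformers (ViT) and proposes CAAT, a parameter-efficient fine-tuning method. It is highly relevant to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10 points), because the abstract explicitly states that it "leverages parameter-efficient fine-tuning (PEFT)". All other keywords are unrelated to the paper (0 points), because it does not involve large language models, AI-for-science applications, or the other listed techniques.

!!! tip deepseek-chat TL;DR

This paper proposes CAAT, a parameter-efficient adversarial training method that fine-tunes only the small set of ViT parameters most critical to adversarial robustness. It substantially reduces computational cost while keeping robustness close to that of plain adversarial training (a 4.3% decrease while tuning approximately 6% of the parameters).

Abstract translation (Chinese)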

视觉Transformer（ViT）模型在各种视觉任务中取得了显著性能，其可扩展性在处理大规模数据集时展现出关键优势。这种可扩展性使ViT模型具备强大的泛化能力。然而，随着参数量的增加，ViT模型对对抗样本的鲁棒性并未按比例提升。对抗训练（AT）作为增强鲁棒性最有效的方法之一，通常需要对整个模型进行微调，导致计算成本极高，尤其对于大型ViT架构而言。本文旨在仅对一小部分参数进行鲁棒微调，以实现与标准对抗训练相当的鲁棒性。为此，我们提出关键性感知对抗训练（CAAT），这是一种自适应地将资源分配给对鲁棒性最关键参数的新方法，仅微调选定的模块。具体而言，CAAT能高效识别对对抗鲁棒性贡献最大的参数，随后利用参数高效微调（PEFT）技术，在关键参数数量超过预设阈值时对权重矩阵进行鲁棒调整。当扩展到更大的视觉Transformer架构时，CAAT展现出良好的泛化能力，可能为大规模对抗训练开辟新途径。例如，与普通对抗训练相比，CAAT在仅微调约6%参数的情况下，对抗鲁棒性仅下降4.3%。在三个广泛使用的对抗学习数据集上的大量实验表明，CAAT以更少的可训练参数超越了当前最先进的轻量级对抗训练方法。

Abstract

Vision Transformer (ViT) models have achieved remarkable performance across various vision tasks, with scalability being a key advantage when applied to large datasets. This scalability enables ViT models to exhibit strong generalization capabilities. However, as the number of parameters increases, the robustness of ViT models to adversarial examples does not scale proportionally. Adversarial training (AT), one of the most effective methods for enhancing robustness, typically requires fine-tuning the entire model, leading to prohibitively high computational costs, especially for large ViT architectures. In this paper, we aim to robustly fine-tune only a small subset of parameters to achieve robustness comparable to standard AT. To accomplish this, we introduce Criticality-Aware Adversarial Training (CAAT), a novel method that adaptively allocates resources to the most robustness-critical parameters, fine-tuning only selected modules. Specifically, CAAT efficiently identifies parameters that contribute most to adversarial robustness. It then leverages parameter-efficient fine-tuning (PEFT) to robustly adjust weight matrices where the number of critical parameters exceeds a predefined threshold. CAAT exhibits favorable generalization when scaled to larger vision transformer architectures, potentially paving the way for adversarial training at scale. For example, compared with plain adversarial training, CAAT incurs only a 4.3% decrease in adversarial robustness while tuning approximately 6% of its parameters. Extensive experiments on three widely used adversarial learning datasets demonstrate that CAAT outperforms state-of-the-art lightweight AT methods with fewer trainable parameters.
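The module-selection rule in the abstract (adapt a weight matrix only when its number of critical parameters exceeds a threshold) can be sketched as follows. The abstract does not say how criticality is measured, so the per-parameter scores are taken as given inputs; the function name, the module names, and the global top-fraction rule for marking parameters as critical are all assumptions made for illustration.

```python
import numpy as np


def select_modules(scores: dict, critical_fraction: float, min_critical: int) -> list:
    """Return the modules that hold more than `min_critical` critical parameters.

    scores: module name -> array of per-parameter criticality scores (assumed given).
    critical_fraction: share of all parameters, ranked globally, treated as critical.
    """
    flat = np.concatenate([s.ravel() for s in scores.values()])
    k = max(1, int(critical_fraction * flat.size))
    # Score of the k-th largest parameter across all modules.
    cutoff = np.partition(flat, flat.size - k)[flat.size - k]
    return [name for name, s in scores.items()
            if int((s >= cutoff).sum()) > min_critical]


# Hypothetical scores for two weight matrices.
scores = {
    "attn.qkv": np.array([[0.9, 0.8], [0.7, 0.1]]),
    "mlp.fc1": np.array([[0.2, 0.05], [0.6, 0.01]]),
}
chosen = select_modules(scores, critical_fraction=0.5, min_critical=2)
```

In this toy instance the four highest scores are 0.9, 0.8, 0.7, and 0.6; three of them sit in `attn.qkv`, so only that module would receive a PEFT adapter.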

Keywords: Vision Transformer, Adversarial Training, Parameter-efficient Fine-tuning, Robustness, Criticality-Aware, Computational Efficiency, Fine-tuning

57. ❌ DoseRAD2026 Challenge dataset: AI accelerated photon and proton dose calculation for radiotherapy

Authors: Fan Xiao, Nikolaos Delopoulos, Niklas Wahl, Lennart Volz, Lina Bucher, Matteo Maspero, Miguel Palacios, Muheng Li, Samir Schulz, Viktor Rogowski, Ye Zhang, Zoltan Perko, Christopher Kurz, George Dedes, Guillaume Landry, Adrian Thummerer Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12778v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

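Scoring rationale: The paper focuses on building a medical imaging dataset (DoseRAD2026) for radiotherapy dose calculation, an application of AI in biomedicine. The keywords all concern large-model and deep-learning techniques, but the paper does not involve any large model, deep-learning technique, or algorithmic innovation; it only mentions that the dataset can be used to develop "advanced dose calculation methods" without specifying the type of method. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper is a bioinformatics/medical AI application, though not a core innovation, hence 5 points (some relevance). The other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

This paper presents the DoseRAD2026 dataset, which contains paired CT and MRI images together with photon and proton Monte Carlo dose distributions, for developing and evaluating fast, accurate dose calculation methods in radiotherapy.

Abstract translation (Chinese)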

目的：精确的剂量计算对于放射治疗中实现精准肿瘤照射并保护健康组织至关重要。随着磁共振成像（MRI）引导和实时自适应放射治疗的日益普及，对基于CT和MRI的快速、准确剂量计算的需求日益增长。DoseRAD2026数据集与挑战赛提供了一个公开基准，包含配对的CT与MRI数据以及束流级的光子和质子蒙特卡洛剂量分布，以支持开发和评估先进的剂量计算方法。采集与验证方法：该数据集包含115例患者（75例训练，40例测试）的配对CT和MRI图像，这些患者因胸部或腹部病变在MRI直线加速器上接受治疗，数据源自SynthRAD2025数据集。预处理包括可变形图像配准、空腔校正和重采样。真实的光子（6 MV）和质子剂量分布使用开源蒙特卡洛算法计算，生成了40,500个光子束和81,000个质子笔形束的剂量数据。数据格式与使用说明：数据按光子和质子子集组织，包含配对的CT-MRI图像、束流级剂量分布以及JSON格式的束流配置文件。文件以压缩的MetaImage（.mha）格式提供。该数据集基于CC BY-NC 4.0许可发布，训练数据将于2026年4月开放，测试集将保留至2030年3月。潜在应用：该数据集支持快速剂量计算方法的基准测试，包括光子和质子治疗的束流级剂量估计、MRI引导工作流程中基于MRI的剂量计算以及实时自适应放射治疗。

Abstract

Purpose: Accurate dose calculation is essential in radiotherapy for precise tumor irradiation while sparing healthy tissue. With the growing adoption of MRI-guided and real-time adaptive radiotherapy, fast and accurate dose calculation on CT and MRI is increasingly needed. The DoseRAD2026 dataset and challenge provide a public benchmark of paired CT and MRI data with beam-level photon and proton Monte Carlo dose distributions for developing and evaluating advanced dose calculation methods. Acquisition and validation methods: The dataset comprises paired CT and MRI from 115 patients (75 training, 40 testing) treated on an MRI-linac for thoracic or abdominal lesions, derived from the SynthRAD2025 dataset. Pre-processing included deformable image registration, air-cavity correction, and resampling. Ground-truth photon (6 MV) and proton dose distributions were computed using open-source Monte Carlo algorithms, yielding 40,500 photon beams and 81,000 proton beamlets. Data format and usage notes: Data are organized into photon and proton subsets with paired CT-MRI images, beam-level dose distributions, and JSON beam configuration files. Files are provided in compressed MetaImage (.mha) format. The dataset is released under CC BY-NC 4.0, with training data available from April 2026 and the test set withheld until March 2030. Potential applications: The dataset supports benchmarking of fast dose calculation methods, including beam-level dose estimation for photon and proton therapy, MRI-based dose calculation in MRI-guided workflows, and real-time adaptive radiotherapy.

Keywords: radiotherapy dose calculation, Monte Carlo dose distributions, CT-MRI paired dataset, photon therapy, proton therapy, MRI-guided radiotherapy, adaptive radiotherapy, benchmark dataset

58. ❌ Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling

Authors: Huanzhen Wang, Ziheng Zhou, Zeng Tao, Aoxing Li, Yingkai Zhao, Yuxuan Lin, Yan Wang, Wenqiang Zhang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12777v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

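Scoring rationale: The paper focuses on dynamic facial expression recognition (DFER) in computer vision and proposes a cognition-inspired Dual-stream Semantic Enhancement model (DuSE). Although it applies deep learning to affective computing, the keywords all target specific areas of large language model (LLM) technology, such as training methods, inference optimization, and agent systems. The paper has no direct connection to these LLM-related keywords and only a weak link to "Mechanistic Interpretability OR Explainable AI" (5 points), because it mentions enhanced model interpretability, which is not its core focus. The paper is not an innovative application of large models to a new domain, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

To address the fact that existing vision-based dynamic emotion modeling methods neglect cognitive theories of emotion perception, this study proposes a cognition-inspired Dual-stream Semantic Enhancement model (DuSE). By emulating how the human brain processes emotion, it achieves state-of-the-art performance on dynamic facial expression recognition and enhances model interpretability.

Abstract translation (Chinese)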

人脑构建情绪感知并非通过孤立处理面部表情，而是通过将感官输入与语义及语境知识进行动态、层级的整合来实现。然而，现有的基于视觉的动态情绪建模方法往往忽视了情绪感知与认知理论。为弥合机器与人类情绪感知之间的鸿沟，我们提出了认知启发的双流语义增强模型（DuSE）。该模型实例化了一种双流认知架构：第一流为层级时序提示簇（Hierarchical Temporal Prompt Cluster, HTPC），它实现了认知启动效应，通过将文本语义与面部动态的细粒度时序特征对齐，模拟语言线索如何预先敏化神经通路，从而调节对输入视觉刺激的处理；第二流为潜在语义情绪聚合器（Latent Semantic Emotion Aggregator, LSEA），它从计算角度建模知识整合过程，类似于概念行为理论所描述的机制，通过聚合感官输入并将其与习得的概念知识融合，反映了海马体与默认模式网络在构建连贯情绪体验中的作用。通过对这些神经认知机制进行显式建模，DuSE为动态面部表情识别（DFER）提供了一个更具神经合理性且更稳健的框架。在具有挑战性的真实场景基准测试上进行的大量实验验证了我们以认知为中心的方法，表明模拟大脑的情绪处理策略能够实现最先进的性能，并增强模型的可解释性。

Abstract

The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain’s strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.

Keywords: dynamic facial expression recognition, cognitive architecture, semantic enhancement, hierarchical temporal modeling, conceptual act theory, neural plausibility, model interpretability, in-the-wild benchmarks

59. ❌ CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models

Authors: Yunkai Dang, Yizhu Jiang, Yifan Jiang, Qi Fan, Yinghuan Shi, Wenbin Li, Yang Gao Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12767v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

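Scoring rationale: The paper focuses on visual token compression for multimodal large language models (MLLMs) and proposes the CLASP framework, which performs class-adaptive layer fusion and dual-stage pruning. The core relevant keyword is "Large Language Models OR LLMs OR Foundation Models" (10 points), because the paper explicitly studies MLLMs, which fall within large-model technology. Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, and AI for Science, are not addressed in the paper and therefore score 0.

!!! tip deepseek-chat TL;DR

To address the heavy computational overhead caused by redundant visual token sequences in multimodal large language models, this paper proposes CLASP, a framework that compresses tokens efficiently through class-adaptive layer fusion and dual-stage pruning and outperforms existing methods across multiple benchmarks and model architectures.

Abstract translation (Chinese)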

多模态大语言模型（MLLMs）因视觉标记序列的高度冗余而承受着巨大的计算开销。现有方法通常采用单层视觉变换器（ViT）特征和静态剪枝策略来解决这一问题。然而，这种固定配置在多样化的指令下往往表现脆弱。为克服这些限制，我们提出了CLASP，一种基于类别自适应层融合与双阶段剪枝的即插即用标记缩减框架。具体而言，CLASP首先通过多层视觉特征融合构建类别特定的视觉表示，随后执行双阶段剪枝：在注意力显著的关键标记（用于相关性）与冗余感知的补充标记（用于覆盖度）之间分配标记预算。通过类别自适应剪枝，CLASP实现了基于提示的条件特征融合与预算分配，从而达成激进且鲁棒的视觉标记缩减。大量实验表明，CLASP在广泛的基准测试、剪枝比例和MLLM架构中均持续优于现有方法。代码将在https://github.com/Yunkaidang/CLASP公开。

Abstract

Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single-layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. To overcome these limitations, we propose CLASP, a plug-and-play token reduction framework based on class-adaptive layer fusion and dual-stage pruning. Specifically, CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction. Extensive experiments demonstrate that CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures. Code will be available at https://github.com/Yunkaidang/CLASP.
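The dual-stage split of the token budget can be pictured with a small NumPy sketch. It is one possible reading of the abstract, not the CLASP algorithm: stage one keeps the tokens with the highest attention score (pivot tokens), and stage two greedily adds the tokens least similar to anything already kept (completion tokens). The cosine-similarity criterion, the fixed `pivot_ratio`, and the function name are assumptions; the class-adaptive layer fusion and prompt-conditioned budget allocation are not modeled.

```python
import numpy as np


def dual_stage_prune(features, attn, budget, pivot_ratio=0.5):
    """Return the sorted indices of the visual tokens to keep.

    features: (n_tokens, dim) token features; attn: (n_tokens,) saliency scores.
    """
    n = features.shape[0]
    if budget <= 0 or n == 0:
        return []
    # Stage 1: pivot tokens, chosen by attention saliency (relevance).
    n_pivot = min(max(1, int(round(budget * pivot_ratio))), budget, n)
    selected = [int(i) for i in np.argsort(-attn, kind="stable")[:n_pivot]]
    # Stage 2: completion tokens, chosen to be least redundant (coverage).
    norms = np.maximum(np.linalg.norm(features, axis=1, keepdims=True), 1e-12)
    unit = features / norms
    max_sim = (unit @ unit[selected].T).max(axis=1)
    max_sim[selected] = np.inf  # never pick a token twice
    while len(selected) < min(budget, n):
        nxt = int(np.argmin(max_sim))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, unit @ unit[nxt])
        max_sim[nxt] = np.inf
    return sorted(selected)


rng = np.random.default_rng(0)
feats = rng.standard_normal((20, 8))
attn = rng.random(20)
kept = dual_stage_prune(feats, attn, budget=6)
```

The greedy second stage is a farthest-point style heuristic: each added token is the one whose closest kept neighbor is least similar, which spreads the remaining budget over regions the pivot tokens do not cover.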

Keywords: Multimodal Large Language Models, MLLMs, token reduction, class-adaptive layer fusion, dual-stage pruning, visual token compression, Vision Transformer, computational overhead

60. ❌ ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

Authors: Myungchul Kim, Kwanyong Park, Junmo Kim, In So Kweon Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12762v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

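Scoring rationale: The ARGOS paper proposes an LLM-based agent framework for multi-camera person search whose core involves LLM-driven agent reasoning, tool use, and planning. It is therefore highly relevant to "Large Language Models", "Chain of Thought", "System 2 Thinking", "LLM Agents", and "Tool Use" (10 points each), because these keywords map directly onto the paper's core methods: the LLM as reasoning engine, multi-step reasoning, deliberate thinking, and tool invocation. Other keywords, such as MoE, quantization, and alignment, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

ARGOS is the first benchmark and framework to reformulate multi-camera person search as an interactive reasoning problem in which an LLM-driven agent plans, asks questions, and eliminates candidates under information asymmetry. Experiments show that current LLMs leave substantial room for improvement on this task (best TWS of 0.383 on Track 2 and 0.590 on Track 3) and that domain-specific tools are critical: removing them drops accuracy by up to 49.6 percentage points.

Abstract translation (Chinese)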

我们提出ARGOS，这是首个将多摄像头行人搜索重新定义为交互式推理问题的基准框架，该问题要求智能体在信息不对称条件下进行规划、提问并排除候选目标。ARGOS智能体接收模糊的目击者描述，必须在有限的交互轮次内决定询问内容、何时调用空间或时间工具，以及如何解读模糊的响应。其推理过程基于时空拓扑图（Spatio-Temporal Topology Graph, STTG）实现，该图编码了摄像头连通性与经验验证的转移时间。本基准包含来自14个真实场景的2,691项任务，分为三个渐进式赛道：语义感知（Who）、空间推理（Where）与时间推理（When）。在四种大语言模型基座上的实验表明，该基准远未被完全解决（最佳任务加权得分TWS：赛道2为0.383，赛道3为0.590），消融实验证实移除领域专用工具会导致准确率下降高达49.6个百分点。

Abstract

We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
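A spatio-temporal topology graph of the kind the abstract describes can be sketched as a mapping from camera pairs to plausible transition-time windows, which a spatial or temporal tool could then use to eliminate candidates. The data layout, camera names, time windows, and function names below are all hypothetical; the benchmark's actual STTG format and tool interface are not given in the abstract.

```python
from dataclasses import dataclass

# Hypothetical STTG: directed edge (from_cam, to_cam) -> (min_s, max_s) transition time.
STTG = {
    ("cam1", "cam2"): (20.0, 60.0),
    ("cam2", "cam3"): (30.0, 90.0),
}


@dataclass(frozen=True)
class Sighting:
    person_id: str
    camera: str
    time_s: float


def is_feasible(graph, src_cam, src_time_s, sighting):
    """True if the sighting is reachable from (src_cam, src_time_s) over one edge."""
    window = graph.get((src_cam, sighting.camera))
    if window is None:
        return False  # cameras not directly connected in this sketch
    lo, hi = window
    return lo <= sighting.time_s - src_time_s <= hi


def eliminate(graph, src_cam, src_time_s, candidates):
    """Keep only the candidates whose sighting fits the transition window."""
    return [c for c in candidates if is_feasible(graph, src_cam, src_time_s, c)]


candidates = [
    Sighting("A", "cam2", 45.0),   # 45 s after the witness time: feasible
    Sighting("B", "cam2", 300.0),  # too late for the cam1 -> cam2 window
    Sighting("C", "cam3", 40.0),   # cam1 and cam3 are not directly connected
]
remaining = eliminate(STTG, "cam1", 0.0, candidates)
```

Only direct edges are checked here; a fuller tool would also compose windows along multi-hop paths before ruling a candidate out.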

Keywords: multi-camera person search, agentic reasoning, LLM agents, interactive reasoning, spatio-temporal reasoning, tool use, benchmark, information asymmetry

61. ❌ GF-Score: Certified Class-Conditional Robustness Evaluation with Fairness Guarantees

Authors: Arya Shah, Kaveri Visavadiya, Manisha Padala Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12757v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

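Scoring rationale: The paper focuses on an evaluation framework for the adversarial robustness of neural networks, specifically class-conditional certified robustness evaluation and fairness analysis. It covers adversarial attacks, certified robustness, fairness metrics, and model evaluation, all within traditional deep learning security. It does not touch large language models (LLMs), their underlying techniques or applications, or any of the specific techniques named in the scoring keywords (MoE, Scaling Laws, RLHF, RAG, agents, and so on). It studies robustness evaluation of image classifiers (CIFAR-10, ImageNet), not large models or AI for Science applications.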

!!! tip deepseek-chat TL;DR

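The paper proposes the GF-Score framework, which decomposes a certified robustness score into per-class scores and quantifies robustness disparity across classes. Evaluating 22 models, it finds that more robust models tend to show greater class-level disparity.

Abstract (Chinese translation)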

对抗鲁棒性对于在安全关键应用中部署神经网络至关重要，然而标准评估方法要么需要昂贵的对抗攻击，要么仅报告单一的聚合分数，这掩盖了鲁棒性在各类别间的分布情况。我们提出了 GF-Score（GREAT-公平性分数）框架，该框架将经过认证的 GREAT Score 分解为每个类别的鲁棒性剖面，并通过基于福利经济学的四个指标量化其差异：鲁棒性差异指数（RDI）、归一化鲁棒性基尼系数（NRGC）、最差类别鲁棒性（WCR）以及公平性惩罚 GREAT 分数（FP-GREAT）。该框架进一步通过一种自校准程序消除了原始方法对对抗攻击的依赖，该程序仅利用干净准确率相关性来调整温度参数。通过在 CIFAR-10 和 ImageNet 数据集上评估来自 RobustBench 的 22 个模型，我们发现分解是精确的，每个类别的分数揭示了持续的脆弱性模式（例如，“猫”在 76% 的 CIFAR-10 模型中是鲁棒性最弱的类别），并且鲁棒性更强的模型往往表现出更大的类别间差异。这些结果建立了一个实用的、无需攻击的审计流程，用于诊断经过认证的鲁棒性保证在何处未能平等保护所有类别。我们在 GitHub（https://github.com/aryashah2k/gf-score）上发布了代码。

摘要 (Abstract)

Adversarial robustness is essential for deploying neural networks in safety-critical applications, yet standard evaluation methods either require expensive adversarial attacks or report only a single aggregate score that obscures how robustness is distributed across classes. We introduce the GF-Score (GREAT-Fairness Score), a framework that decomposes the certified GREAT Score into per-class robustness profiles and quantifies their disparity through four metrics grounded in welfare economics: the Robustness Disparity Index (RDI), the Normalized Robustness Gini Coefficient (NRGC), Worst-Case Class Robustness (WCR), and a Fairness-Penalized GREAT Score (FP-GREAT). The framework further eliminates the original method’s dependence on adversarial attacks through a self-calibration procedure that tunes the temperature parameter using only clean accuracy correlations. Evaluating 22 models from RobustBench across CIFAR-10 and ImageNet, we find that the decomposition is exact, that per-class scores reveal consistent vulnerability patterns (e.g., “cat” is the weakest class in 76% of CIFAR-10 models), and that more robust models tend to exhibit greater class-level disparity. These results establish a practical, attack-free auditing pipeline for diagnosing where certified robustness guarantees fail to protect all classes equally. We release our code on GitHub (https://github.com/aryashah2k/gf-score).
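
The abstract names the disparity metrics but does not give their formulas. As a rough illustration of the idea, the sketch below computes a textbook Gini coefficient and a worst-class value over per-class robustness scores. The scores are made-up numbers, and the paper's NRGC normalization, RDI, and FP-GREAT definitions may differ from anything shown here.

```python
def gini(values):
    """Textbook Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Sorted-form identity: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # with ranks i = 1..n over the ascending-sorted values.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def worst_class(per_class):
    """Return the (class_name, score) pair with the lowest robustness score."""
    return min(per_class.items(), key=lambda item: item[1])


# Hypothetical per-class robustness scores, for illustration only.
per_class_scores = {"airplane": 0.42, "cat": 0.18, "ship": 0.47, "truck": 0.45}

print("Gini:", round(gini(per_class_scores.values()), 4))
print("Worst class:", worst_class(per_class_scores))
```

A single aggregate score would hide the fact that one class sits far below the rest; the Gini value and the worst-class entry expose it.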

Keywords: adversarial robustness, certified robustness, fairness evaluation, class-conditional robustness, robustness disparity, GF-Score, welfare economics metrics, attack-free auditing

62. ❌ Can AI Tools Transform Low-Demand Math Tasks? An Evaluation of Task Modification Capabilities

Authors: Danielle S. Fox, Brenda L. Robles, Elizabeth DiPietro Brovey, Christian D. Schunn Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12743v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

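Scoring rationale: The paper evaluates how well general-purpose AI tools such as ChatGPT and Claude, and math-specific tools such as Khanmigo, can upgrade low-cognitive-demand math tasks. This is a direct evaluation of LLMs applied to education, so it is highly relevant (8/10) to 'Large Language Models OR LLMs OR Foundation Models'. However, the paper focuses on application outcomes (task-modification success rates and failure-mode analysis) and does not examine any specific LLM technique (MoE, Scaling Laws, training methods, inference optimization, agent systems, and so on) or contribute to a scientific domain such as bioinformatics, so all other keywords score 0.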

!!! tip deepseek-chat TL;DR

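The study evaluates how well AI tools, both general-purpose and math-specific, can upgrade low-cognitive-demand math tasks. The average success rate is only 64%, and a tool's ability to modify tasks is weakly negatively correlated (r = -.35) with its ability to classify them, which suggests that AI has limited potential for adapting curriculum materials and that specialized support is needed.

Abstract (Chinese translation)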

尽管近期研究已探讨人工智能工具在数学任务质量分类方面的能力（arXiv:2603.03512），但其提升现有任务质量的潜力尚不明确。本研究旨在探究人工智能工具能否成功提升低认知需求数学任务的品质。我们测试了十一种工具，包括六种广泛可用的通用人工智能工具（如ChatGPT和Claude）以及五种面向数学教师的专用工具（如Khanmigo、coteach.ai）。基于任务分析指南框架（Stein & Smith, 1998），我们通过提示词引导人工智能工具对两类低需求数学任务进行修改。提示策略旨在模拟专业教师可能采用的方法，而非通过大量优化寻找更有效的提示方案（即追求乐观的典型结果）。平均而言，人工智能工具仅取得中等程度的成功：任务准确升级的比例仅为64%，不同工具的表现差异显著，从较弱（33%）到较成功（88%）不等。专用工具的成功率仅略高于通用工具。失败模式包括“提升不足”（维持低认知需求）和“提升过度”（将任务提升至过于理想化的目标类别，可能被教师拒绝）。值得注意的是，特定人工智能工具正确分类任务认知需求的能力与其升级任务的能力之间存在微弱负相关（r = -.35），这表明修改任务（即生成性任务）的能力与分类任务（即依据量规进行判断）的能力代表两种不同的技能。这些发现对于理解人工智能在课程适应性调整中的潜在作用具有重要意义，并凸显了需要开发专门方法以支持教师改进教学材料。

摘要 (Abstract)

While recent research has explored AI tools’ ability to classify the quality of mathematical tasks (arXiv:2603.03512), little is known about their capacity to increase the quality of existing tasks. This study investigated whether AI tools could successfully upgrade low-cognitive-demand mathematics tasks. Eleven tools were tested, including six broadly available, general-purpose AI tools (e.g., ChatGPT and Claude) and five tools specialized for mathematics teachers (e.g., Khanmigo, coteach.ai). Using the Task Analysis Guide framework (Stein & Smith, 1998), we prompted AI tools to modify two different types of low-demand mathematical tasks. The prompting strategy aimed to represent likely approaches taken by knowledgeable teachers, rather than extensive optimization to find a more effective prompt (i.e., an optimistic typical outcome). On average, AI tools were only moderately successful: tasks were accurately upgraded only 64% of the time, with different AI tool performance ranging from quite weak (33%) to broadly successful (88%). Specialized tools were only moderately more successful than general-purpose tools. Failure modes included both “undershooting” (maintaining low cognitive demand) and “overshooting” (elevating tasks to an overly ambitious target category that likely would be rejected by teachers). Interestingly, there was a small negative correlation (r = -.35) between whether a given AI tool was able to correctly classify the cognitive demand of tasks and whether the AI was able to upgrade tasks, showing that the ability to modify tasks (i.e., a generative task) represents a distinct capability from the ability to classify them (i.e., judgement using a rubric). These findings have important implications for understanding AI’s potential role in curriculum adaptation and highlight the need for specialized approaches to support teachers in modifying instructional materials.

Keywords: AI tools, mathematical tasks, cognitive demand, task modification, ChatGPT, Claude, Khanmigo, curriculum adaptation

63. ❌ Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning

Authors: Zhenyu Ma, Yuyang Song, Chunyi Yang, Jingyi Zhu, Letian Yang, Xukai Jiang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12717v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

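Scoring rationale: The paper studies how to improve LLM-based autonomous agents on complex real-world tasks, using a case-based learning framework to transfer experiential knowledge. Highly relevant keywords: LLMs (the agents are explicitly built on LLMs) and Autonomous Agents (the core object of study). Moderately relevant: Chain of Thought / System 2 Thinking (the work involves structured analytical reasoning) and In-context Learning (it compares against a few-shot baseline). The remaining keywords are not addressed in the paper.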

!!! tip deepseek-chat TL;DR

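LLM autonomous agents struggle to exploit task structure and prior experience in complex real-world tasks. The paper proposes a case-based learning framework that converts experience from past tasks into reusable knowledge assets, enabling knowledge transfer and structured analysis on new tasks. On a benchmark of six complex task categories, it matches or outperforms the best baseline in every case.

Abstract (Chinese translation)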

基于大语言模型（LLM）的自主智能体在通用推理任务上表现良好，但在复杂的现实场景中，仍难以可靠地利用任务结构、关键约束和先验经验。我们提出了一种基于案例的学习框架，该框架将过往任务的经验转化为可复用的知识资产，使智能体能够将先前的案例经验迁移至新任务，并执行更具结构化的分析。与主要依赖预训练知识或静态提示的方法不同，我们的框架强调从真实案例中提取并复用任务相关知识、分析性提示和操作技能。我们在一个包含六类复杂任务的统一基准上评估了该方法，并与零样本学习、少样本学习、检查清单提示以及规则记忆基线进行了比较。结果表明，我们的方法在所有任务上均取得了持续强劲的性能，在每种情况下都达到或超越了最佳基线，且在更复杂的任务上优势尤为明显。进一步分析表明，基于案例学习的优势随任务复杂性增加而增强，并且一个智能体获得的实践知识可被其他智能体复用。这些发现表明，基于案例的学习为构建适用于现实世界工作的专业智能体提供了一条有前景的路径。

摘要 (Abstract)

LLM-based autonomous agents perform well on general reasoning tasks but still struggle to reliably use task structure, key constraints, and prior experience in complex real-world settings. We propose a case-based learning framework that converts experience from past tasks into reusable knowledge assets, allowing agents to transfer prior case experience to new tasks and perform more structured analysis. Unlike methods based mainly on pretrained knowledge or static prompts, our framework emphasizes extracting and reusing task-relevant knowledge, analytical prompts, and operational skills from real cases. We evaluate the method on a unified benchmark of six complex task categories and compare it with Zero-Shot, Few-Shot, Checklist Prompt, and Rule Memory baselines. Results show that our method achieves consistently strong performance across all tasks and matches or outperforms the best baseline in every case, with especially clear gains on more complex tasks. Further analysis shows that the advantage of case-based learning increases with task complexity, and that practical knowledge acquired by one agent can be reused by others. These findings suggest that case-based learning offers a promising path for building professional agents for real-world work.

Keywords: LLM-based autonomous agents, case-based learning, knowledge transfer, real-world tasks, structured analysis, experience reuse, complex task benchmark, professional agents

64. ❌ Information-Theoretic Optimization for Task-Adapted Compressed Sensing Magnetic Resonance Imaging

Authors: Xinyu Peng, Ziyang Zheng, Wenrui Dai, Duoduo Xue, Shaohui Li, Chenglin Li, Junni Zou, Hongkai Xiong Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12709v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

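Scoring rationale: The paper focuses on compressed sensing and task-adapted optimization for medical imaging (MRI), an application of AI to science (medicine), so it is moderately relevant (5/10) to 'AI for Science OR Bioinformatics OR Cheminformatics'. It does not involve large models, innovations in core deep learning techniques, or any other keyword (LLMs, MoE, training methods, inference techniques, and so on), so all other keywords score 0.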

!!! tip deepseek-chat TL;DR

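The paper proposes an information-theoretic framework for task-adapted compressed sensing MRI. By maximizing the mutual information between undersampled k-space measurements and clinical tasks, it enables probabilistic inference, adaptive sampling-ratio control, and unified handling of different clinical scenarios. On large-scale MRI datasets it achieves competitive performance and a closer match to the ground-truth posterior distribution.

Abstract (Chinese translation)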

面向任务的压缩感知磁共振成像（Task-adapted Compressed Sensing MRI, CS-MRI）正逐渐兴起，旨在以远低于奈奎斯特采样要求的k空间测量次数，满足下游临床任务的具体需求。然而，现有的面向任务的CS-MRI方法存在医学诊断中的不确定性问题，并且无法在端到端优化中与重建或临床任务实现自适应采样。为解决这些局限，我们首次从信息论视角提出一种面向任务的CS-MRI方法，以同时实现用于不确定性预测的概率推断，并适应任意采样率及多样化的临床应用。具体而言，我们通过最大化欠采样k空间测量与临床任务之间的互信息，形式化定义了面向任务的CS-MRI优化问题，从而启用概率推断以应对不确定性问题。我们利用摊销优化技术，为互信息构建了可处理的变分下界，以联合优化采样、重建和任务推断模型，这使得使用单一端到端训练模型即可实现灵活的采样率控制。此外，所提框架在一个统一方法中处理了两种截然不同的临床场景：i) 任务与重建联合，其中重建作为辅助过程以提升任务性能；ii) 抑制重建的任务执行，适用于隐私保护场景。在大规模MRI数据集上的大量实验表明，所提框架在Dice等标准指标上相比确定性方法取得了极具竞争力的性能，并且通过广义能量距离（Generalized Energy Distance, GED）衡量，能提供与真实后验分布更匹配的分布。

摘要 (Abstract)

Task-adapted compressed sensing magnetic resonance imaging (CS-MRI) is emerging to address the specific demands of downstream clinical tasks with significantly fewer k-space measurements than required by Nyquist sampling. However, existing task-adapted CS-MRI methods suffer from the uncertainty problem for medical diagnosis and cannot achieve adaptive sampling in end-to-end optimization with reconstruction or clinical tasks. To address these limitations, we propose the first task-adapted CS-MRI from the information-theoretic perspective to simultaneously achieve probabilistic inference for uncertainty prediction and adapt to arbitrary sampling ratios and versatile clinical applications. Specifically, we formalize the task-adapted CS-MRI optimization problem by maximizing the mutual information between undersampled k-space measurements and clinical tasks to enable probabilistic inference for addressing the uncertainty problem. We leverage amortized optimization and construct tractable variational bounds for mutual information to jointly optimize sampling, reconstruction, and task-inference models, which enables flexible sampling ratio control using a single end-to-end trained model. Furthermore, the proposed framework addresses two kinds of distinct clinical scenarios within a unified approach, i.e., i) joint task and reconstruction, where reconstruction serves as an auxiliary process to enhance task performance; and ii) task implementation with suppressed reconstruction, applicable for privacy protection. Extensive experiments on large-scale MRI datasets demonstrate that the proposed framework achieves highly competitive performance on standard metrics like Dice compared to deterministic counterpart but provides better distribution matching to the ground-truth posterior distribution as measured by the generalized energy distance (GED).
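
The abstract mentions tractable variational bounds on mutual information without stating them. One standard example of such a bound (the Barber-Agakov lower bound) is sketched below. The notation is assumed for illustration: $Y$ denotes the undersampled k-space measurements, $T$ the clinical task variable, and $q_\phi(t \mid y)$ a learned task-inference model. The bounds actually used in the paper may differ.

```latex
\begin{align}
I(T;Y) &= H(T) - H(T \mid Y) \\
       &= H(T) + \mathbb{E}_{p(t,y)}\big[\log p(t \mid y)\big] \\
       &\ge H(T) + \mathbb{E}_{p(t,y)}\big[\log q_\phi(t \mid y)\big].
\end{align}
```

The inequality holds because the gap equals $\mathbb{E}_{p(y)}\big[\mathrm{KL}\big(p(t \mid y)\,\|\,q_\phi(t \mid y)\big)\big] \ge 0$. Since $H(T)$ does not depend on the sampling pattern, maximizing the bound reduces to maximizing the expected log-likelihood of the task output given the measurements, which is an objective that can be trained end to end.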

Keywords: compressed sensing MRI, task-adapted optimization, information-theoretic, mutual information, probabilistic inference, adaptive sampling, clinical applications, end-to-end training

65. ❌ MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

Authors: Shufang Lin, Muyang Chen, Xiabing Zhou, Rongrong Zhang, Dayou Zhang, Fangxin Wang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12700v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

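Scoring rationale: The paper studies how multimodal large language models (MLLMs) perform on complex intent recognition and proposes the FRACTAM framework to improve them, so it is highly relevant (10/10) to 'Large Language Models'. It also touches 'Retrieval-Augmented Generation' (5/10, since FRACTAM uses two-stage retrieval for factual anchoring), 'Context Window Extension' (5/10, since the dataset and task involve long-context analysis), 'Chain of Thought' and 'System 2 Thinking' (5/10 each, since the work involves causal and in-depth reasoning), 'Hallucination Mitigation' (10/10, since the paper explicitly targets text-prior visual hallucination), and 'Mechanistic Interpretability' (5/10, since it studies model deficiencies and builds interpretable evidence chains). Other keywords, such as MoE, SLMs, training techniques, agents, and compression, are unrelated to the paper and score 0.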

!!! tip deepseek-chat TL;DR

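The paper introduces the MISID dataset to evaluate how well multimodal large language models recognize intent in complex strategic deception games, and develops the FRACTAM framework to reduce hallucination and strengthen cross-modal reasoning, improving hidden intent detection and inference.

Abstract (Chinese translation)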

理解复杂多轮交互中的人类意图始终是人机交互与行为分析领域的核心挑战。现有意图识别数据集主要关注单轮话语或简单对话，而现实场景往往涉及复杂的策略性互动，参与者需要在较长时间内维持具有欺骗性的复杂叙事。为填补这一空白，我们提出了MISID——一个用于意图识别的综合性多模态、多轮次、多参与者基准数据集。该数据集源于高风险社交策略游戏，采用细粒度的双层多维标注方案，专为长语境话语分析和基于证据的因果追踪而设计。我们在MISID上对当前最先进的多模态大语言模型（MLLMs）进行了系统评估，揭示了其在复杂场景中的关键缺陷，包括文本先验视觉幻觉、跨模态协同能力受损以及因果线索串联能力有限。为此，我们提出了FRACTAM作为基线框架。该框架采用“解耦-锚定-推理”范式，通过提取纯净的单模态事实表征来降低文本偏差，利用两阶段检索实现长程事实锚定，并构建显式的跨模态证据链。大量实验表明，FRACTAM能有效提升主流模型在复杂策略任务中的表现，在保持稳健感知准确性的同时，显著增强了隐藏意图检测与推理能力。本数据集公开于 https://naislab.cn/datasets/MISID。

摘要 (Abstract)

Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios, including text-prior visual hallucination, impaired cross-modal synergy, and limited capacity in chaining causal cues. Consequently, we propose FRACTAM as a baseline framework. Using a “Decouple-Anchor-Reason” paradigm, FRACTAM reduces text bias by extracting pure unimodal factual representations, employs two-stage retrieval for long-range factual anchoring, and constructs explicit cross-modal evidence chains. Extensive experiments demonstrate that FRACTAM enhances mainstream models’ performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. Our dataset is available at https://naislab.cn/datasets/MISID.

Keywords: Multimodal Large Language Models, Intent Recognition, Strategic Deception Games, Hallucination Mitigation, Cross-modal Reasoning, Long-context Analysis, Evidence-based Causal Tracking, FRACTAM Framework

66. ❌ BID-LoRA: A Parameter-Efficient Framework for Continual Learning and Unlearning

Authors: Jagadeesh Rachapudi, Ritali Vatsi, Praful Hambarde, Amit Shukla Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12686v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

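Scoring rationale: The paper proposes BID-LoRA, a parameter-efficient framework for continual learning and machine unlearning. It is highly relevant (10/10) to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' because it directly extends LoRA, and moderately relevant (5/10) to 'Pre-training OR Continual Pre-training OR Domain Adaptation' because it involves continual learning. Other keywords, such as large models, reasoning techniques, and alignment, are not addressed and score 0.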

!!! tip deepseek-chat TL;DR

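To meet the need for a unified framework for continual learning and machine unlearning, the paper proposes BID-LoRA, which combines bi-directional low-rank adapters with an escape unlearning mechanism. On CIFAR-100 and CASIA-Face100 it reduces knowledge leakage and improves performance.

Abstract (Chinese translation)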

深度学习的最新进展凸显了对系统的需求，这些系统不仅能够通过持续学习（Continual Learning，CL）获取新知识，还能通过机器遗忘（Machine Unlearning，MU）移除过时、敏感或私有的信息。然而，尽管持续学习方法已较为成熟，机器遗忘技术仍处于早期阶段，这为依赖两者能力的统一框架留下了关键空白。我们发现，简单结合现有的持续学习与机器遗忘方法会导致知识泄漏——即在重复的适应周期中基础知识逐渐退化。为解决这一问题，我们将持续学习与遗忘（Continual Learning Unlearning，CLU）形式化为一个统一范式，其包含三个关键目标：（i）精确删除不需要的知识；（ii）在保留先前信息的同时高效整合新知识；（iii）最小化跨周期的知识泄漏。我们提出了双向低秩适应（Bi-Directional Low-Rank Adaptation，BID-LoRA），这是一种新颖的框架，包含三个专用适配器路径——保留、新增与遗忘，应用于注意力层，并结合了逃逸遗忘机制，该机制将遗忘类别的嵌入推至与保留知识距离最大的位置，仅更新5%的参数。在CIFAR-100上的实验表明，BID-LoRA在多个适应周期中均优于持续学习与遗忘基线。我们进一步在CASIA-Face100（一个精选的人脸识别子集）上进行了评估，证明了该方法在实际身份管理系统中的适用性，此类系统需要注册新用户并移除已注销用户。

摘要 (Abstract)

Recent advances in deep learning underscore the need for systems that can not only acquire new knowledge through Continual Learning (CL) but also remove outdated, sensitive, or private information through Machine Unlearning (MU). However, while CL methods are well-developed, MU techniques remain in early stages, creating a critical gap for unified frameworks that depend on both capabilities. We find that naively combining existing CL and MU approaches results in knowledge leakage, a gradual degradation of foundational knowledge across repeated adaptation cycles. To address this, we formalize Continual Learning Unlearning (CLU) as a unified paradigm with three key goals: (i) precise deletion of unwanted knowledge, (ii) efficient integration of new knowledge while preserving prior information, and (iii) minimizing knowledge leakage across cycles. We propose Bi-Directional Low-Rank Adaptation (BID-LoRA), a novel framework featuring three dedicated adapter pathways (retain, new, and unlearn) applied to attention layers, combined with escape unlearning that pushes forget-class embeddings to positions maximally distant from retained knowledge, updating only 5% of parameters. Experiments on CIFAR-100 show that BID-LoRA outperforms CLU baselines across multiple adaptation cycles. We further evaluate on CASIA-Face100, a curated face recognition subset, demonstrating practical applicability to real-world identity management systems where new users must be enrolled and withdrawn users removed.
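
The abstract does not say how the three adapter pathways are combined, so the following is only a generic LoRA-style sketch. It assumes the pathway outputs are summed onto a frozen weight matrix, which is an assumption rather than the BID-LoRA design, and every name and number in it is hypothetical. It also shows the arithmetic behind why low-rank adapters touch only a small fraction of parameters.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w, adapters, x, scale=1.0):
    """y = W x + scale * sum_p B_p (A_p x), with W frozen and each (A_p, B_p) trainable.

    Summing the pathways is an assumption made for illustration.
    """
    y = matvec(w, x)
    for a, b in adapters.values():
        delta = matvec(b, matvec(a, x))  # project down to rank r, then back up
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y


def trainable_fraction(d_out, d_in, rank, n_paths):
    """Adapter parameter count relative to one frozen d_out x d_in weight matrix."""
    return n_paths * rank * (d_out + d_in) / (d_out * d_in)


# Toy 2x2 frozen weight with three rank-1 pathways (A is 1x2, B is 2x1).
w = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "retain": ([[1.0, 0.0]], [[0.0], [1.0]]),
    "new": ([[0.0, 1.0]], [[1.0], [0.0]]),
    "unlearn": ([[1.0, 1.0]], [[0.0], [0.0]]),  # zero-initialized B: no effect yet
}

print(lora_forward(w, adapters, [3.0, 4.0]))
print(trainable_fraction(768, 768, rank=8, n_paths=3))
```

For a single 768 x 768 matrix with three rank-8 pathways the adapters amount to 6.25% of that matrix's parameters. This figure is illustrative and is not meant to reproduce the 5% reported in the abstract, which is stated for the whole model.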

Keywords: Continual Learning, Machine Unlearning, Parameter-efficient Fine-tuning, LoRA, Knowledge Leakage, Adapter Pathways, BID-LoRA, Face Recognition

67. ❌ A hierarchical spatial-aware algorithm with efficient reinforcement learning for human-robot task planning and allocation in production

Authors: Jintao Xue, Xiao Li, Nianmin Zhang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12669v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

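Scoring rationale: The paper studies task planning and allocation in human-robot collaborative production, using hierarchical reinforcement learning (an efficient buffer-based deep Q-learning method, EBQ) and a spatially aware path-planning method (SAP). The scoring keywords all target large language models, their underlying techniques, AI for Science, and similar areas, whereas this paper applies conventional reinforcement learning to robot control and production scheduling. It does not involve large models, innovations in core deep learning techniques, or AI-for-science applications, so every keyword has relevance 0.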

!!! tip deepseek-chat TL;DR

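The paper proposes a hierarchical spatially aware algorithm (EBQ&SAP) that combines efficient reinforcement learning with path planning to address human-robot task planning and allocation in complex, dynamic production environments.

Abstract (Chinese translation)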

在先进制造系统中，人与机器人协作完成生产过程。有效的任务规划与分配（TPA）对于实现高生产效率至关重要，但在复杂动态的制造环境中仍具挑战性。人与机器人的动态特性，尤其是需要考虑空间信息（例如人的实时位置以及完成任务需移动的距离），使得TPA问题显著复杂化。为应对上述挑战，我们将生产任务分解为可管理的子任务，进而实现一种实时分层人机协作TPA算法，该算法包含用于任务规划的高层智能体和用于任务分配的低层智能体。针对高层智能体，我们提出了一种高效的基于缓冲区的深度Q学习方法（EBQ），该方法减少了训练时间，并在具有长期稀疏奖励挑战的生产问题中提升了性能。对于低层智能体，我们设计了一种基于路径规划的空间感知方法（SAP），用于将任务分配给合适的人机资源，从而完成相应的顺序子任务。我们在3D模拟器中对一个复杂的实时生产过程进行了实验。结果表明，我们提出的EBQ&SAP方法能有效解决复杂动态生产过程中的人机协作TPA问题。

摘要 (Abstract)

In advanced manufacturing systems, humans and robots collaborate to conduct the production process. Effective task planning and allocation (TPA) is crucial for achieving high production efficiency, yet it remains challenging in complex and dynamic manufacturing environments. The dynamic nature of humans and robots, particularly the need to consider spatial information (e.g., humans’ real-time position and the distance they need to move to complete a task), substantially complicates TPA. To address the above challenges, we decompose production tasks into manageable subtasks. We then implement a real-time hierarchical human-robot TPA algorithm, including a high-level agent for task planning and a low-level agent for task allocation. For the high-level agent, we propose an efficient buffer-based deep Q-learning method (EBQ), which reduces training time and enhances performance in production problems with long-term and sparse reward challenges. For the low-level agent, a path planning-based spatially aware method (SAP) is designed to allocate tasks to the appropriate human-robot resources, thereby achieving the corresponding sequential subtasks. We conducted experiments on a complex real-time production process in a 3D simulator. The results demonstrate that our proposed EBQ&SAP method effectively addresses human-robot TPA problems in complex and dynamic production processes.
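
The abstract does not describe how EBQ's buffer works. For orientation, here is a minimal sketch of ordinary Q-learning with a replay buffer on a toy sparse-reward chain. It is a tabular stand-in and not the paper's EBQ method; the environment, hyperparameters, and function names are all hypothetical.

```python
import random
from collections import deque

N_STATES, GOAL = 5, 4   # chain of states 0..4; reward only on reaching state 4
ACTIONS = (0, 1)        # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain dynamics with a sparse terminal reward."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def train(episodes=200, max_steps=20, alpha=0.5, gamma=0.9, batch=8, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    buffer = deque(maxlen=5000)  # replay buffer of past transitions
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = rng.choice(ACTIONS)  # pure exploration; Q-learning is off-policy
            nxt, reward, done = step(state, action)
            buffer.append((state, action, reward, nxt, done))
            # Replay a random minibatch so rare rewarding transitions are reused.
            for s, a, r, s2, d in rng.sample(list(buffer), min(batch, len(buffer))):
                target = r if d else r + gamma * max(q[s2])
                q[s][a] += alpha * (target - q[s][a])
            state = nxt
            if done:
                break
    return q


q_table = train()
greedy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy)  # greedy action for each non-terminal state
```

Replaying stored transitions lets the single rewarding step propagate backwards through the chain many times, which is the general reason buffers help with long-horizon, sparse-reward problems such as those the abstract mentions.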

Keywords: human-robot collaboration, task planning and allocation, reinforcement learning, deep Q-learning, spatial-aware algorithm, production systems, hierarchical agents, real-time scheduling
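The abstract above describes a low-level agent that allocates tasks using path planning rather than straight-line distance. The sketch below is one possible illustration of that idea, not the paper's SAP method: it assumes a hypothetical occupancy grid (`0` free, `1` blocked), a hypothetical `resources` mapping from resource names to grid cells, and breadth-first search as the path planner.

```python
from collections import deque

def path_length(grid, start, goal):
    """Shortest 4-connected path length from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

def allocate(grid, resources, task_cell):
    """Pick the resource (hypothetical name -> cell mapping) with the shortest path."""
    best = None
    for name, cell in resources.items():
        dist = path_length(grid, cell, task_cell)
        if dist is not None and (best is None or dist < best[1]):
            best = (name, dist)
    return best

# A wall separates the human from the task, so the robot is the closer resource
# by path length even though the human is closer in a straight line.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
resources = {"human": (2, 0), "robot": (0, 2)}
print(allocate(grid, resources, (0, 0)))
```

The example is chosen so that straight-line distance and path length disagree, which is the situation where spatial awareness changes the allocation.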

68. ❌ Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production

Authors: Jintao Xue, Xiao Li, Nianmin Zhang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12667v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

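Scoring rationale: The paper focuses on human-robot task planning and allocation in industrial settings, using a safe reinforcement learning (safe RL) method that combines particle filtering with constrained dueling double deep Q-learning. Its content centers entirely on robotics, human-robot interaction, and reinforcement learning applied to manufacturing, and it does not involve large language models (LLMs), the technical principles of deep learning, AI for Science, or techniques related to any other listed keyword. None of the keywords apply, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes PF-CD3Q, a safe reinforcement learning method based on particle filtering and constrained dueling double deep Q-learning, for human-robot task planning and allocation in industrial production. It estimates human fatigue parameters in real time and constrains the action space so that worker fatigue stays within safe limits.

Abstract (Chinese translation)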

人机协同制造作为工业5.0的核心维度，强调通过人因工程提升工作者福祉。本文研究了动态人机任务规划与分配问题，该问题需决策任务执行时机与执行主体，在确保工作者生理疲劳处于安全限值内的同时实现效率最大化。疲劳约束与生产动态性的结合，显著增加了人机任务规划与分配问题的复杂性。传统人机任务规划与分配中的疲劳恢复模型常依赖静态预定义超参数，然而实践中，人类疲劳敏感性会因工作条件变化、睡眠不足等因素产生日度波动。为更好捕捉这种不确定性，我们将疲劳相关参数视为不精确变量，并基于生产过程中观测到的疲劳进展进行在线估计。针对这些挑战，本文提出PF-CD3Q——一种融合粒子滤波与约束对决双深度Q学习的安全强化学习方法，用于实现实时疲劳预测的人机任务规划与分配。具体而言，我们首先开发基于粒子滤波的估计器来追踪人体疲劳状态并实时更新疲劳模型参数；随后将这些估计器嵌入CD3Q框架，通过在决策阶段进行任务级疲劳预测并排除超限任务，从而约束动作空间，将问题建模为约束马尔可夫决策过程。

Abstract

Human-robot collaborative manufacturing, a core aspect of Industry 5.0, emphasizes ergonomics to enhance worker well-being. This paper addresses the dynamic human-robot task planning and allocation (HRTPA) problem, which involves determining when to perform tasks and who should execute them to maximize efficiency while ensuring workers’ physical fatigue remains within safe limits. The inclusion of fatigue constraints, combined with production dynamics, significantly increases the complexity of the HRTPA problem. Traditional fatigue-recovery models in HRTPA often rely on static, predefined hyperparameters. However, in practice, human fatigue sensitivity varies daily due to factors such as changed work conditions and insufficient sleep. To better capture this uncertainty, we treat fatigue-related parameters as inaccurate and estimate them online based on observed fatigue progression during production. To address these challenges, we propose PF-CD3Q, a safe reinforcement learning (safe RL) approach that integrates the particle filter with constrained dueling double deep Q-learning for real-time fatigue-predictive HRTPA. Specifically, we first develop PF-based estimators to track human fatigue and update fatigue model parameters in real-time. These estimators are then integrated into CD3Q by making task-level fatigue predictions during decision-making and excluding tasks that exceed fatigue limits, thereby constraining the action space and formulating the problem as a constrained Markov decision process (CMDP).

Keywords: safe reinforcement learning, human-robot collaboration, task planning and allocation, fatigue prediction, particle filter, constrained Markov decision process, Industry 5.0, ergonomics
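The abstract above describes constraining the action space by predicting task-level fatigue and excluding tasks that would exceed the limit. A minimal sketch of that masking step follows. The exponential fatigue model, the function names, and the per-task `rate` values are all assumptions made for illustration; in the paper the fatigue parameters are estimated online by a particle filter, which is not reproduced here.

```python
import math

def predict_fatigue(current, rate, duration):
    """Hypothetical fatigue model: fatigue rises toward 1.0 at the given rate."""
    return 1.0 - (1.0 - current) * math.exp(-rate * duration)

def safe_greedy_action(q_values, current_fatigue, task_rates, task_durations, fatigue_limit):
    """Return (best feasible task index or None, feasibility mask).

    A task is feasible when its predicted post-task fatigue stays within the limit.
    """
    feasible = [
        predict_fatigue(current_fatigue, rate, duration) <= fatigue_limit
        for rate, duration in zip(task_rates, task_durations)
    ]
    candidates = [i for i, ok in enumerate(feasible) if ok]
    if not candidates:
        return None, feasible  # no safe task; the caller could schedule rest instead
    best = max(candidates, key=lambda i: q_values[i])
    return best, feasible

# Task 1 has the highest Q-value but would push fatigue past the limit,
# so the masked greedy choice falls back to task 2.
action, mask = safe_greedy_action(
    q_values=[1.0, 5.0, 3.0],
    current_fatigue=0.5,
    task_rates=[0.1, 1.0, 0.2],
    task_durations=[1.0, 1.0, 1.0],
    fatigue_limit=0.7,
)
print(action, mask)
```

Masking before the argmax keeps the constraint hard: an infeasible task is never selected, regardless of how large its Q-value is.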

69. ❌ Broadening the Applicability of Conditional Syntax Splitting for Reasoning from Conditional Belief Bases

Authors: Lars-Phillip Spiegel, Jonas Haldimann, Jesse Heyninck, Gabriele Kern-Isberner, Christoph Beierle Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12660v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

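Scoring rationale: The paper studies syntax splitting of conditional belief bases in nonmonotonic reasoning, which belongs to logical reasoning and knowledge representation and is entirely unrelated to the scoring keywords (all of which concern large models and the principles and applications of deep learning). It does not involve any large-model technique, training method, inference optimization, AI application, or related new concept.

!!! tip deepseek-chat TL;DR

The paper proposes a generalized conditional syntax splitting that broadens the applicability of syntax splitting for conditional belief bases in nonmonotonic reasoning, overcomes limitations of earlier splitting notions, and evaluates how well several inductive inference operators satisfy the new splitting postulates.

Abstract (Chinese translation)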

在基于条件信念基的非单调推理中，满足语法拆分公设的推理算子允许仅考虑信念基的相关部分，前提是该信念基能够基于互斥的符号集拆分为子基。由于这种互斥性在实践中较为罕见，安全条件语法拆分被提出作为语法拆分的泛化形式，允许子基中的条件句共享某些原子。近期研究表明，这种条件句的重叠仅限于平凡的自满足条件句。本文提出了一种安全条件语法拆分的泛化形式，从而拓宽了拆分公设的适用范围。与安全条件语法拆分相比，我们的泛化概念支持信念基Δ的语法拆分，其中Δ的子基可以共享原子和非平凡条件句。我们阐述了这一新概念如何克服先前拆分概念的局限性，并区分了真拆分与简单拆分——后者无法为基于Δ的归纳推理带来益处。基于条件语法拆分的泛化，我们引入了调整后的推理公设，并评估了多种主流归纳推理算子对这些公设的满足情况。此外，我们证明：虽然每个满足泛化条件语法拆分的归纳推理算子也满足条件语法拆分，但反之并不成立。

Abstract

In nonmonotonic reasoning from conditional belief bases, an inference operator satisfying syntax splitting postulates allows for taking only the relevant parts of a belief base into account, provided that the belief base splits into subbases based on disjoint signatures. Because such disjointness is rare in practice, safe conditional syntax splitting has been proposed as a generalization of syntax splitting, allowing the conditionals in the subbases to share some atoms. Recently this overlap of conditionals has been shown to be limited to trivial, self-fulfilling conditionals. In this article, we propose a generalization of safe conditional syntax splittings that broadens the applicability of splitting postulates. In contrast to safe conditional syntax splitting, our generalized notion supports syntax splittings of a belief base Δ where the subbases of Δ may share atoms and nontrivial conditionals. We illustrate how this new notion overcomes limitations of previous splitting concepts, and we identify genuine splittings, separating them from simple splittings that do not provide benefits for inductive inference from Δ. We introduce adjusted inference postulates based on our generalization of conditional syntax splitting, and we evaluate several popular inductive inference operators with respect to these postulates. Furthermore, we show that, while every inductive inference operator satisfying generalized conditional syntax splitting also satisfies conditional syntax splitting, the reverse does not hold.

Keywords: nonmonotonic reasoning, conditional belief bases, syntax splitting, inductive inference, inference postulates, generalized splitting, safe conditional syntax splitting, belief base splitting
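The abstract above starts from plain syntax splitting, where a belief base splits into subbases over disjoint signatures. The sketch below illustrates only that baseline notion, not the paper's generalized splitting: it assumes each conditional is represented by a hypothetical name and the set of atoms it mentions, and it groups conditionals so that different groups share no atoms.

```python
def split_by_signature(conditionals):
    """Group conditionals (name -> set of atoms) into subbases with disjoint atoms.

    Returns the finest such grouping as a sorted list of sorted name lists.
    """
    groups = []  # each entry is (set of conditional names, set of atoms)
    for name, atoms in conditionals.items():
        names, signature = {name}, set(atoms)
        remaining = []
        for group_names, group_signature in groups:
            if group_signature & signature:
                # Shared atom: this group must live in the same subbase.
                names |= group_names
                signature |= group_signature
            else:
                remaining.append((group_names, group_signature))
        remaining.append((names, signature))
        groups = remaining
    return sorted(sorted(group_names) for group_names, _ in groups)

# r1 and r2 share atom "b", so they cannot be separated; r3 stands alone.
base = {"r1": {"a", "b"}, "r2": {"b", "c"}, "r3": {"d"}}
print(split_by_signature(base))
```

A single shared atom is enough to force two conditionals into the same subbase here, which is the restriction that the safe and generalized notions in the abstract are designed to relax.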

70. ❌ PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning

Authors: Jinlong Liu, Wanggui He, Peng Zhang, Mushui Liu, Hao Jiang, Pipei Huang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12652v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

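Scoring rationale: The paper studies how to construct reward signals for reinforcement learning of text-to-image generation models, computing the reward with a frozen vision-language model (VLM). It sits at the intersection of computer vision and reinforcement learning. The scoring keywords all focus on text-only large language models (LLMs): their technical principles, training methods, inference optimization, alignment techniques, and application paradigms. The core of this paper is instead the use of a VLM in image generation, and its research object, methods, and application scenario have no direct connection to the given LLM keyword set. Although the research background states that "research applying large models in different domains may be given points at the scorer's discretion", the VLM application here does not match the specific directions in the keyword list, such as LLM technical principles, training and alignment, or inference optimization, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes PromptEcho, a reward construction method that needs no annotation and no reward model training. It uses a frozen vision-language model to provide image-text alignment rewards for reinforcement learning of text-to-image models, improving net win rate on DenseAlignBench by +26.8pp and +16.2pp for the two models tested.

Abstract (Chinese translation)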

强化学习（RL）能够提升文本到图像（T2I）模型的提示跟随能力，但获取高质量奖励信号仍具挑战性：CLIP分数过于粗粒度，而基于视觉语言模型（VLM）的奖励模型（如RewardDance）需要昂贵的人工标注偏好数据及额外的微调。我们提出PromptEcho，一种无需标注且无需奖励模型训练的奖励构建方法。给定生成的图像和引导查询，PromptEcho通过将原始提示作为标签，计算一个冻结VLM的令牌级交叉熵损失，直接提取VLM预训练期间编码的图像-文本对齐知识。该奖励具有确定性、计算高效，并能随着更强大的开源VLM的出现而自动提升。为进行评估，我们开发了DenseAlignBench，这是一个包含丰富概念的密集描述基准，用于严格测试提示跟随能力。在两个最先进的T2I模型（Z-Image和QwenImage-2512）上的实验结果表明，PromptEcho在DenseAlignBench上实现了显著提升（净胜率分别提升+26.8个百分点和+16.2个百分点），并在GenEval、DPG-Bench和TIIFBench上取得了一致性增益，且无需任何任务特定训练。消融研究证实，PromptEcho全面优于使用相同VLM的基于推理的评分方法，且奖励质量随VLM规模扩展而提升。我们将开源训练好的模型及DenseAlignBench。

Abstract

Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires no annotation and no reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.

Keywords: PromptEcho, reinforcement learning, text-to-image models, vision-language models, reward signal, annotation-free, DenseAlignBench, image-text alignment
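The abstract above says the reward comes from the token-level cross-entropy of a frozen VLM with the original prompt as the label. The sketch below shows that computation on plain lists of logits. It assumes the logits have already been produced by some VLM conditioned on the generated image, and it assumes the reward is the negative mean loss; the abstract does not state the exact mapping from loss to reward, and the function names are hypothetical.

```python
import math

def token_cross_entropy(logits, labels):
    """Mean negative log-likelihood of label token ids under per-position logits."""
    if not labels or len(logits) != len(labels):
        raise ValueError("logits and labels must be non-empty and the same length")
    total = 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[label]
    return total / len(labels)

def prompt_echo_reward(logits, prompt_token_ids):
    """Assumed reward: lower loss on the prompt tokens means a higher reward."""
    return -token_cross_entropy(logits, prompt_token_ids)

# Two prompt tokens over a toy two-token vocabulary.
uninformative = prompt_echo_reward([[0.0, 0.0], [0.0, 0.0]], [0, 1])
well_aligned = prompt_echo_reward([[5.0, 0.0], [0.0, 5.0]], [0, 1])
print(uninformative, well_aligned)
```

Because the model is frozen and no sampling is involved, the same image and prompt always yield the same reward, which matches the determinism the abstract highlights.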

71. ❌ Learning Chain Of Thoughts Prompts for Predicting Entities, Relations, and even Literals on Knowledge Graphs

Authors: Alkid Baci, Luke Friedrichs, Caglar Demir, N’Dah Jean Kouagou, Axel-Cyrille Ngonga Ngomo Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12651v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

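Scoring rationale: The core of the paper is knowledge graph link prediction with large language models (LLMs) and chain-of-thought (CoT) prompts, so it is highly relevant to 'Large Language Models' and 'Chain of Thought' (10 points each). It performs reasoning through prompt learning, which gives it some relevance to 'System 2 Thinking' and 'In-context Learning' (5 points each). Other keywords such as MoE, quantization, RAG, and alignment are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes RALP, which recasts knowledge graph link prediction as a prompt learning problem. Learned chain-of-thought prompts let an LLM predict entities, relations, and literals, improving on state-of-the-art knowledge graph embedding models by over 5% MRR across datasets.

Abstract (Chinese translation)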

知识图谱嵌入模型在链接预测任务上表现良好，但在处理未见过的实体、关系尤其是字面值时存在困难，这限制了其在动态异构图中的应用。相比之下，预训练大语言模型通过提示能有效实现泛化。我们将链接预测重新定义为提示学习问题，并提出RALP方法，该方法学习基于字符串的思维链提示作为三元组的评分函数。通过MIPRO算法进行贝叶斯优化，RALP仅需不到30个训练样本即可在无需梯度访问的情况下识别有效提示。在推理阶段，RALP能够预测缺失的实体、关系或完整三元组，并根据学习到的提示分配置信度分数。我们在传导式、数值型及OWL实例检索基准上进行了评估。RALP在多个数据集上将最先进的知识图谱嵌入模型的平均倒数排名提升了超过5%，并通过高质量推断三元组增强了泛化能力。在涉及复杂类表达式（如$\exists hasChild.Female$、$\geq 5 ; hasChild.Female$）的OWL推理任务中，其杰卡德相似度达到88%以上。这些结果表明基于提示的大语言模型推理可作为嵌入方法的灵活替代方案。我们已将实现代码、训练及评估流程开源发布：https://github.com/dice-group/RALP。

Abstract

Knowledge graph embedding (KGE) models perform well on link prediction but struggle with unseen entities, relations, and especially literals, limiting their use in dynamic, heterogeneous graphs. In contrast, pretrained large language models (LLMs) generalize effectively through prompting. We reformulate link prediction as a prompt learning problem and introduce RALP, which learns string-based chain-of-thought (CoT) prompts as scoring functions for triples. Using Bayesian Optimization through the MIPRO algorithm, RALP identifies effective prompts from fewer than 30 training examples without gradient access. At inference, RALP predicts missing entities, relations or whole triples and assigns confidence scores based on the learned prompt. We evaluate on transductive, numerical, and OWL instance retrieval benchmarks. RALP improves state-of-the-art KGE models by over 5% MRR across datasets and enhances generalization via high-quality inferred triples. On OWL reasoning tasks with complex class expressions (e.g., $\exists hasChild.Female$, $\geq 5\; hasChild.Female$), it achieves over 88% Jaccard similarity. These results highlight prompt-based LLM reasoning as a flexible alternative to embedding-based methods. We release our implementation, training, and evaluation pipeline as open source: https://github.com/dice-group/RALP .

Keywords: knowledge graph embedding, link prediction, large language models, chain-of-thought prompting, prompt learning, Bayesian optimization, OWL reasoning, generalization
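The abstract above reports results in two metrics, mean reciprocal rank (MRR) for link prediction and Jaccard similarity for OWL instance retrieval. The sketch below gives the standard definitions of both on toy inputs. The ranks and individual names are made up for illustration and are not taken from the paper.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where each rank is the 1-based position of the correct answer."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / rank for rank in ranks) / len(ranks)

def jaccard_similarity(predicted, expected):
    """Size of the intersection over size of the union of two sets of individuals."""
    predicted, expected = set(predicted), set(expected)
    union = predicted | expected
    if not union:
        return 1.0  # both sets empty: treated as a perfect match
    return len(predicted & expected) / len(union)

# Three toy queries whose correct entity was ranked 1st, 2nd, and 4th.
mrr = mean_reciprocal_rank([1, 2, 4])
# Toy retrieval result for a class expression: two of four distinct individuals agree.
overlap = jaccard_similarity({"anna", "ben", "cara"}, {"ben", "cara", "dan"})
print(mrr, overlap)
```

MRR rewards placing the correct answer near the top of a ranking, while Jaccard similarity compares whole retrieved sets, which is why the two task families in the abstract use different metrics.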

72. ❌ TimeSAF: Towards LLM-Guided Semantic Asynchronous Fusion for Time Series Forecasting

Authors: Fan Zhang, Shiming Fan, Hua Wang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12648v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

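Scoring rationale: TimeSAF focuses on applying LLMs to time series forecasting, which counts as research on applying large models in a scientific domain (AI for Science). Its core contribution is a new LLM-guided asynchronous fusion framework, so it is highly relevant to 'Large Language Models' (10 points). Because time series forecasting is an application of AI to science, it has some relevance to 'AI for Science' (8 points). The paper does not touch other keywords such as MoE, SLMs, training techniques, inference optimization, or agent systems, so those keywords score 0.

!!! tip deepseek-chat TL;DR

To address the semantic perceptual dissonance caused by the synchronous fusion strategy in existing LLM-based time series forecasting, the paper proposes TimeSAF, whose hierarchical asynchronous fusion decouples unimodal feature learning from cross-modal interaction. It significantly outperforms existing methods on standard long-term forecasting benchmarks and generalizes well in few-shot and zero-shot transfer.

Abstract (Chinese translation)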

尽管大语言模型（LLM）在时间序列预测领域近期取得了成功，但现有方法大多仍采用深度同步融合策略，即在网络的每一层强制进行文本特征与时间特征间的密集交互。这种设计忽视了模态间固有的粒度不匹配问题，并导致了我们称之为语义感知失调的现象：LLM提供的高层抽象语义与时间序列低层细粒度的数值动态被不恰当地纠缠在一起，使得语义先验难以有效指导预测。为解决这一问题，我们提出了TimeSAF，一个基于层次化异步融合的新框架。与同步方法不同，TimeSAF显式地将单模态特征学习与跨模态交互解耦。它引入了一个独立的跨模态语义融合主干，该主干通过可学习的查询以自底向上的方式从时间主干和提示主干中聚合全局语义；同时设计了一个阶段式语义精化解码器，异步地将这些高层信号注入回时间主干。该机制在避免干扰低层时间动态的同时，提供了稳定且高效的语义引导。在标准长期预测基准上的大量实验表明，TimeSAF显著优于现有先进基线模型，并在少样本和零样本迁移设置中进一步展现出强大的泛化能力。

Abstract

Despite the recent success of large language models (LLMs) in time-series forecasting, most existing methods still adopt a Deep Synchronous Fusion strategy, where dense interactions between textual and temporal features are enforced at every layer of the network. This design overlooks the inherent granularity mismatch between modalities and leads to what we term semantic perceptual dissonance: high-level abstract semantics provided by the LLM become inappropriately entangled with the low-level, fine-grained numerical dynamics of time series, making it difficult for semantic priors to effectively guide forecasting. To address this issue, we propose TimeSAF, a new framework based on hierarchical asynchronous fusion. Unlike synchronous approaches, TimeSAF explicitly decouples unimodal feature learning from cross-modal interaction. It introduces an independent cross-modal semantic fusion trunk, which uses learnable queries to aggregate global semantics from the temporal and prompt backbones in a bottom-up manner, and a stage-wise semantic refinement decoder that asynchronously injects these high-level signals back into the temporal backbone. This mechanism provides stable and efficient semantic guidance while avoiding interference with low-level temporal dynamics. Extensive experiments on standard long-term forecasting benchmarks show that TimeSAF significantly outperforms state-of-the-art baselines, and further exhibits strong generalization in both few-shot and zero-shot transfer settings.

Keywords: Time Series Forecasting, Large Language Models, Semantic Asynchronous Fusion, Cross-modal Interaction, Hierarchical Fusion, Semantic Guidance, Few-shot Transfer, Zero-shot Transfer
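The abstract above mentions learnable queries that aggregate global semantics from backbone features. The sketch below shows the generic mechanism this usually refers to, scaled dot-product attention pooling with a query vector, and is not TimeSAF's architecture. It is heavily simplified under stated assumptions: a single fixed query stands in for a learned one, and the token features serve as both keys and values with no projection layers.

```python
import math

def attention_pool(query, tokens):
    """Pool token vectors into one vector using softmax(query . token / sqrt(d)) weights."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, token)) / scale for token in tokens]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(score - peak) for score in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    pooled = [
        sum(weight * token[i] for weight, token in zip(weights, tokens))
        for i in range(len(query))
    ]
    return pooled, weights

# Toy features: two tokens in a two-dimensional space.
tokens = [[1.0, 0.0], [0.0, 1.0]]
uniform_pooled, uniform_weights = attention_pool([0.0, 0.0], tokens)
focused_pooled, focused_weights = attention_pool([4.0, 0.0], tokens)
print(uniform_pooled, focused_weights)
```

A zero query attends to every token equally, while a query aligned with the first token shifts most of the weight onto it; training the query is what would let the pooling pick out task-relevant semantics.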

73. ❌ Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

Authors: Melvin Laux, Yi-Ling Liu, Rina Alo, Sören Töpper, Mariela De Lucas Alvarez, Frank Kirchner, Rebecca Adam Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12645v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

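Scoring rationale: The paper uses contextual multi-task reinforcement learning to address control problems in autonomous underwater reef monitoring, covering a simulated environment (HoloOcean), sample efficiency, zero-shot generalization, and robustness evaluation. All given keywords relate directly to large language models (LLMs), the technical principles of deep learning, or specific applications of AI in science (such as bioinformatics and cheminformatics), whereas this paper studies reinforcement learning for robot control, which falls under conventional machine learning and does not involve large models, deep learning, or the specified scientific AI subfields. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes training control policies for autonomous underwater vehicles with contextual multi-task reinforcement learning, to handle the control difficulties that environmental uncertainty and task variation cause in reef monitoring. A single context-dependent policy is trained in a simulated HoloOcean reef and evaluated on sample efficiency, zero-shot generalization to unseen tasks, and robustness to varying water currents.

Abstract (Chinese translation)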

尽管自主水下航行器具备海洋生态系统监测的潜力，但其部署从根本上受到高度不确定且非平稳的水下动力学环境下控制航行器困难的限制。为解决这些挑战，我们采用数据驱动的强化学习方法以补偿未知动力学和任务变化。传统的单任务强化学习容易对训练环境产生过拟合，从而限制所学策略的长期有效性。因此，我们提出改用情境化多任务强化学习范式，使我们能够学习可重复用于多种任务的控制器，例如在一个礁区检测牡蛎，在另一礁区检测珊瑚。我们评估了情境化多任务强化学习能否高效学习用于自主水下礁区监测的鲁棒且可泛化的控制策略。我们在HoloOcean的模拟礁区环境中训练了一个单一的情境依赖策略，该策略能够解决多个相关的监测任务。在实验中，我们通过实证评估了情境策略在样本效率、对未见任务的零样本泛化能力以及对不同水流的鲁棒性方面的表现。通过利用多任务强化学习，我们旨在提升训练效率以及所学策略的可重用性，从而朝着实现更可持续的自主礁区监测流程迈进一步。

Abstract

Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by the difficulty of controlling vehicles under highly uncertain and non-stationary underwater dynamics. To address these challenges, we employ a data-driven reinforcement learning approach to compensate for unknown dynamics and task variations. Traditional single-task reinforcement learning has a tendency to overfit the training environment, thus limiting the long-term usefulness of the learnt policy. Hence, we propose to use a contextual multi-task reinforcement learning paradigm instead, allowing us to learn controllers that can be reused for various tasks, e.g., detecting oysters in one reef and detecting corals in another. We evaluate whether contextual multi-task reinforcement learning can efficiently learn robust and generalisable control policies for autonomous underwater reef monitoring. We train a single context-dependent policy that is able to solve multiple related monitoring tasks in a simulated reef environment in HoloOcean. In our experiments, we empirically evaluate the contextual policies regarding sample-efficiency, zero-shot generalisation to unseen tasks, and robustness to varying water currents. By utilising multi-task reinforcement learning, we aim to improve the training effectiveness, as well as the reusability of learnt policies to take a step towards more sustainable procedures in autonomous reef monitoring.

Keywords: autonomous underwater vehicles, reinforcement learning, multi-task learning, contextual policy, reef monitoring, sample efficiency, zero-shot generalization, robust control
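The abstract above describes a single context-dependent policy shared across monitoring tasks. One common way to condition a policy on the task is to append a context vector to the observation, sketched below. The one-hot task encoding, the linear policy, and all names and numbers are assumptions for illustration; the abstract does not say how the paper encodes context.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]

def contextual_input(observation, task_id, num_tasks):
    """Concatenate the observation with a one-hot task context (assumed encoding)."""
    return list(observation) + one_hot(task_id, num_tasks)

def linear_policy(weights, policy_input):
    """One action value per weight row: the dot product of the row with the input."""
    return [sum(w * x for w, x in zip(row, policy_input)) for row in weights]

# Same observation and same weights, two different task contexts.
observation = [0.5, -1.0]
weights = [[1.0, 0.0, 0.0, 2.0, 0.0]]  # hypothetical weights, one action dimension
action_task0 = linear_policy(weights, contextual_input(observation, 0, 3))
action_task1 = linear_policy(weights, contextual_input(observation, 1, 3))
print(action_task0, action_task1)
```

Because the context is part of the input, one set of weights can produce different behavior per task, which is what makes the learned controller reusable across tasks.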

74. ❌ RPRA: Predicting an LLM-Judge for Efficient but Performant Inference

Authors: Dylan R. Ashley, Gaël Le Lan, Changsheng Zhao, Naina Dhingra, Zhipeng Cai, Ernie Chang, Mingchen Zhuge, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12634v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

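Scoring rationale: The paper studies the trade-off between efficiency and quality in LLMs and proposes the RPRA/PA paradigms, in which a model predicts the score an LLM judge would give its output, so that a small model answers when confident and defers to a larger model when not. Highly relevant keywords: LLMs (the core research object), Small Language Models (deployment of small models), Supervised Fine-tuning (one of the evaluated approaches), Self-Correction/Self-Improvement (models learn to predict their own performance limitations), and In-context Learning (the in-context report card approach). Other keywords such as MoE, Scaling Laws, and Alignment are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper studies how to get language models to predict the score an LLM judge would give their output, via zero-shot prediction, in-context report cards, and supervised fine-tuning, so that small models can judge their own ability more accurately and strike a better balance between efficiency and output quality. Report cards and fine-tuning yield mean prediction-accuracy improvements for smaller models of up to 55% and 52%, respectively.

Abstract (Chinese translation)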

大型语言模型（LLMs）在计算效率（例如参数量）与输出质量之间存在根本性的权衡，尤其是在部署于手机或笔记本电脑等计算资源有限的设备时。应对这一挑战的一种方法是效仿人类行为：当模型自认为无法独立解决问题时，主动寻求帮助；通过允许较小模型在确信能提供良好回答时响应查询，而在自认能力不足时将问题递交给较大模型处理，我们可以突破这种权衡。为此，本文研究了预测-回答/行动（PA）与推理-预测-推理-回答/行动（RPRA）范式的可行性，这些范式要求模型在生成回应前，预先预测一个LLM评判者会对其输出给出何种评分。我们评估了三种方法：零样本预测、基于上下文报告卡的预测，以及监督微调。研究结果表明，较大模型（特别是推理模型）在零样本预测通用LLM评判者评分时表现良好，而较小模型在经过微调或获得上下文报告卡后，也能可靠地预测此类评判结果。总体而言，这两种方法均能显著提升较小模型的预测准确率：在多个数据集中，使用报告卡和微调分别实现了平均高达55%和52%的改进。这些发现表明，模型能够学会预测自身的性能局限，从而为构建更高效、更具自我认知能力的人工智能系统铺平道路。

Abstract

Large language models (LLMs) face a fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality, especially when deployed on computationally limited devices such as phones or laptops. One way to address this challenge is by following the example of humans and having models ask for help when they believe they are incapable of solving a problem on their own; we can overcome this trade-off by allowing smaller models to respond to queries when they believe they can provide good responses, and deferring to larger models when they do not believe they can. To this end, in this paper, we investigate the viability of Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms where models predict – prior to responding – how an LLM judge would score their output. We evaluate three approaches: zero-shot prediction, prediction using an in-context report card, and supervised fine-tuning. Our results show that larger models (particularly reasoning models) perform well when predicting generic LLM judges zero-shot, while smaller models can reliably predict such judges well after being fine-tuned or provided with an in-context report card. Altogether, both approaches can substantially improve the prediction accuracy of smaller models, with report cards and fine-tuning achieving mean improvements of up to 55% and 52% across datasets, respectively. These findings suggest that models can learn to predict their own performance limitations, paving the way for more efficient and self-aware AI systems.

Keywords: Large Language Models, Small Language Models, Efficiency-Quality Trade-off, Predict-Answer/Act, Reason-Predict-Reason-Answer/Act, LLM Judge, In-context Learning, Supervised Fine-tuning

75. ❌ Calibration-Aware Policy Optimization for Reasoning LLMs

Authors: Ziqi Wang, Xingzhou Lou, Meiqi Wu, Zhengqi Wen, Junge Zhang | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12632v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

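Scoring rationale: The paper studies calibration in LLM reasoning and proposes CAPO to jointly optimize reasoning accuracy and calibration. Highly relevant keywords: LLMs (the core object of study), Chain of Thought / System 2 Thinking (reasoning ability), and Hallucination Mitigation (calibration bears directly on hallucination mitigation). Other keywords, such as MoE, SLMs, Scaling Laws, and Pre-training, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper targets the overconfidence and calibration degradation caused by GRPO-style LLM reasoning optimization and proposes Calibration-Aware Policy Optimization (CAPO). On multiple mathematical reasoning benchmarks, CAPO improves calibration by up to 15% while matching or exceeding GRPO's accuracy, and it raises accuracy on downstream inference-time scaling tasks by up to 5%.

Abstract (Chinese translation)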

群体相对策略优化（GRPO）虽能提升大语言模型（LLM）的推理能力，但常引发过度自信问题，即错误回答的困惑度（perplexity）低于正确答案，导致如曲线下面积（AUC）所描述的相对校准性能下降。现有方法要么在校准方面改进有限，要么需牺牲推理准确性的提升。我们首先证明，此类GRPO风格算法的性能退化源于其忽略不确定性的优势估计，这不可避免地使优化梯度与校准目标失准，从而以牺牲校准性能为代价提升准确性。为此，我们提出校准感知策略优化（Calibration-Aware Policy Optimization, CAPO）。该方法采用逻辑AUC代理损失函数，该函数在理论上具有一致性且满足遗憾界，实现了不确定性感知的优势估计。通过进一步引入噪声掩蔽机制，CAPO获得了稳定的学习动态，能够联合优化校准性能与准确性。在多个数学推理基准测试上的实验表明，CAPO-1.5B模型将校准性能显著提升高达15%，同时达到与GRPO相当或更优的准确率，并在下游推理时缩放任务中进一步将准确率提升高达5%。此外，当允许在低置信度条件下弃答时，CAPO实现了帕累托最优的精度-覆盖率权衡，凸显了其在缓解幻觉方面的实用价值。

摘要 (Abstract)

Group Relative Policy Optimization (GRPO) enhances LLM reasoning but often induces overconfidence, where incorrect responses yield lower perplexity than correct ones, degrading relative calibration as described by the Area Under the Curve (AUC). Existing approaches either yield limited improvements in calibration or sacrifice gains in reasoning accuracy. We first prove that this degradation in GRPO-style algorithms stems from their uncertainty-agnostic advantage estimation, which inevitably misaligns optimization gradients with calibration. This leads to improved accuracy at the expense of degraded calibration. We then propose Calibration-Aware Policy Optimization (CAPO). It adopts a logistic AUC surrogate loss that is theoretically consistent and admits a regret bound, enabling uncertainty-aware advantage estimation. By further incorporating a noise masking mechanism, CAPO achieves stable learning dynamics that jointly optimize calibration and accuracy. Experiments on multiple mathematical reasoning benchmarks show that CAPO-1.5B significantly improves calibration by up to 15% while achieving accuracy comparable to or better than GRPO, and further boosts accuracy on downstream inference-time scaling tasks by up to 5%. Moreover, when allowed to abstain under low-confidence conditions, CAPO achieves a Pareto-optimal precision-coverage trade-off, highlighting its practical value for hallucination mitigation.

Keywords: Calibration-Aware Policy Optimization, LLM reasoning, overconfidence mitigation, mathematical reasoning benchmarks, hallucination mitigation, uncertainty-aware advantage estimation, precision-coverage trade-off, Group Relative Policy Optimization
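
The abstract above measures relative calibration with AUC and reports a precision-coverage trade-off under abstention. The sketch below shows how both quantities are conventionally computed from per-response confidence scores and correctness labels. It is an illustration under stated assumptions, not the paper's evaluation code: the function names, the toy data, and the use of a generic confidence score (for example, one derived from negative perplexity) are hypothetical.

```python
from typing import List, Sequence, Tuple


def pairwise_auc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct response gets higher confidence than an
    incorrect one (ties count as 0.5). 0.5 means confidence carries no
    ranking signal; 1.0 means correct responses always rank higher."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("AUC needs at least one correct and one incorrect response")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_coverage(
    confidences: Sequence[float], correct: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Abstain whenever confidence < threshold.

    Returns (precision on answered items, fraction of items answered)."""
    kept: List[bool] = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(kept) / len(confidences)
    precision = sum(kept) / len(kept) if kept else 0.0
    return precision, coverage


# Toy data (assumption): four responses with made-up confidence scores.
conf = [0.9, 0.8, 0.6, 0.3]
is_correct = [True, False, True, False]

auc = pairwise_auc(conf, is_correct)                 # 3 of 4 pairs ranked correctly
strict = precision_coverage(conf, is_correct, 0.85)  # answer only the most confident item
loose = precision_coverage(conf, is_correct, 0.5)    # answer three of the four items
```

An overconfident model in the sense of the abstract is one where incorrect responses receive high confidence, which pushes `pairwise_auc` toward 0.5 or below and flattens the precision gain from abstaining.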

76. ❌ KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

Authors: Linhao Yu, Tianmeng Yang, Siyu Ding, Renren Jin, Naibin Gu, Xiangzhao Hao, Shuaiyi Nie, Deyi Xiong, Weichong Yin, Yu Sun, Hua Wu | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12627v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

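Scoring rationale: The paper's core topic is improving the reasoning ability of large language models (LLMs) through a reinforcement learning (RL) framework that addresses reward sparsity, so it is highly relevant to 'Large Language Models' (10). The method is an RL training framework, which relates to 'RLHF/RLAIF/DPO' (10). Its goal is stronger reasoning, which is highly relevant to 'Chain of Thought/CoT Reasoning' and 'System 2 Thinking/Slow Thinking' (10 each). The other keywords (MoE, SLMs, Scaling Laws, Pre-training, PEFT, RAG, Context Window, KV Cache, MCTS, Agents, Quantization, Hallucination, Interpretability, World Models, Model Merging, In-context Learning, AI for Science, and so on) are not addressed or are only mentioned in passing, and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes KnowRL, an RL training framework that uses minimal-sufficient knowledge guidance to address reward sparsity in LLM reasoning. At the 1.5B scale it outperforms strong RL and hinting baselines across eight reasoning benchmarks, reaching 70.08 average accuracy without hints and 74.16 with selected knowledge points, a new state of the art at this scale.

Abstract (Chinese translation)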

RLVR（基于可验证奖励的强化学习）能够提升大语言模型的推理能力，但其效果常受限于困难任务上严重的奖励稀疏性问题。近期基于提示的强化学习方法通过注入部分解或抽象模板来缓解稀疏性，但这些方法通常通过增加更多标记来扩展引导，从而引入了冗余性、不一致性以及额外的训练开销。我们提出 KnowRL（知识引导的强化学习），这是一个将提示设计视为最小充分引导问题的强化学习训练框架。在强化学习训练过程中，KnowRL 将引导分解为原子知识点（KPs），并利用约束子集搜索（CSS）来构建紧凑且具有交互感知的子集用于训练。我们进一步发现了一种剪枝交互悖论，即移除单个知识点可能有益，而移除多个此类知识点却可能有害，并在此依赖结构下显式优化鲁棒子集的构建。我们从 OpenMath-Nemotron-1.5B 出发训练了 KnowRL-Nemotron-1.5B。在 1.5B 规模下的八个推理基准测试中，KnowRL-Nemotron-1.5B 均一致优于强大的强化学习和提示基线方法。在推理阶段不使用知识点提示的情况下，KnowRL-Nemotron-1.5B 达到了 70.08 的平均准确率，已超过 Nemotron-1.5B 达 +9.63 分；若使用选定的知识点，其性能可提升至 74.16，在此规模下确立了新的技术水平。模型、精选的训练数据及代码已在 https://github.com/Hasuer/KnowRL 公开。

摘要 (Abstract)

RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduces redundancy, inconsistency, and extra training overhead. We propose KnowRL (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. We further identify a pruning interaction paradox (removing one KP may help while removing multiple such KPs can hurt) and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL-Nemotron-1.5B reaches 70.08 average accuracy, already surpassing Nemotron-1.5B by +9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at https://github.com/Hasuer/KnowRL.

Keywords: Large Language Models, Reinforcement Learning, Reasoning, Knowledge Guidance, Reward Sparsity, Training Framework, Benchmark Performance, State-of-the-Art

77. ❌ Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments

Authors: Jianhui Wu, Jian Zhou, Zhi Zhou, Zhangjin Huang, Chao Li | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12625v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

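Scoring rationale: The paper is about real-time rendering in computer graphics, specifically compressing global illumination (GI) data for dynamic lighting environments. The proposed Neural Dynamic GI (NDGI) is a neural compression technique that reduces the storage required for temporal lightmaps. Although it uses lightweight neural networks, its topic, methods, and application area are unrelated to every keyword in the scoring list, all of which target large language models (LLMs) and core deep learning techniques: natural language processing, model architecture, alignment, reasoning, agents, and AI for science. The paper is a graphics-specific engineering application and touches none of them.

!!! tip deepseek-chat TL;DR

The paper proposes Neural Dynamic GI, a neural compression technique that significantly reduces the storage and memory required for temporal lightmaps in dynamic lighting environments while preserving high-quality real-time global illumination.

Abstract (Chinese translation)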

实时渲染中的高质量全局光照通常通过预计算光照技术实现，其中光照贴图是标准选择。为支持动态光照环境下静态物体的全局光照，需要预计算不同光照条件下的多张光照贴图，这会带来巨大的存储与内存开销。为克服这一限制，我们提出神经动态全局光照（Neural Dynamic GI, NDGI）——一种专为时序光照贴图集设计的新型压缩技术。该方法利用多维特征图与轻量级神经网络整合时序信息，而非显式存储多组贴图，从而显著降低光照贴图的存储空间。此外，我们在训练过程中引入了块压缩（Block Compression, BC）模拟策略，使最终生成的特征图可直接进行块压缩，进一步提升压缩率。为实现高效的实时解压，我们还将虚拟纹理（Virtual Texturing, VT）系统与神经表示相结合。与现有方法相比，我们的方案在实现高质量动态全局光照的同时，保持了极低的存储与内存需求，且仅产生适度的实时解压开销。为推进该方向的后续研究，我们将公开在多种具有时序变化特征的场景中预计算的时序光照贴图数据集。

摘要 (Abstract)

High-quality global illumination (GI) in real-time rendering is commonly achieved using precomputed lighting techniques, with lightmap as the standard choice. To support GI for static objects in dynamic lighting environments, multiple lightmaps at different lighting conditions need to be precomputed, which incurs substantial storage and memory overhead. To overcome this limitation, we propose Neural Dynamic GI (NDGI), a novel compression technique specifically designed for temporal lightmap sets. Our method utilizes multi-dimensional feature maps and lightweight neural networks to integrate the temporal information instead of storing multiple sets explicitly, which significantly reduces the storage size of lightmaps. Additionally, we introduce a block compression (BC) simulation strategy during the training process, which enables BC compression on the final generated feature maps and further improves the compression ratio. To enable efficient real-time decompression, we also integrate a virtual texturing (VT) system with our neural representation. Compared with prior methods, our approach achieves high-quality dynamic GI while maintaining remarkably low storage and memory requirements, with only modest real-time decompression overhead. To facilitate further research in this direction, we will release our temporal lightmap dataset precomputed in multiple scenes featuring diverse temporal variations.

Keywords: Neural Dynamic GI, temporal lightmap compression, global illumination, real-time rendering, neural networks, block compression, virtual texturing, dynamic lighting environments

78. ❌ Efficient Semantic Image Communication for Traffic Monitoring at the Edge

Authors: Damir Assylbek, Nurmukhammed Aitymbetov, Marko Ristin, Dimitrios Zorbas | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12622v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

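Scoring rationale: The paper studies semantic image communication for edge computing, focusing on image compression and reconstruction for traffic monitoring (MMSD and SAMR). It uses a generative (diffusion) model for image reconstruction, but its core lies in computer vision, image processing, and edge computing rather than in large language models or new deep learning principles. Every scoring keyword targets large language models, and the paper involves no language modeling, prompt engineering, alignment, reasoning, or agents, so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes two semantic image communication pipelines for traffic monitoring at the edge, MMSD and SAMR. Through semantic decomposition and selective suppression of non-critical regions, they reduce transmitted data by 99% (MMSD) and 99.1% (SAMR) on average while preserving semantic consistency or visual quality.

Abstract (Chinese translation)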

许多视觉监控系统在严格的通信限制下运行，直接传输全分辨率图像既不现实也往往不必要。在此类场景中，视觉数据通常用于判断物体存在、空间关系和场景上下文，而非追求精确的像素保真度。本文提出了两种面向交通监控的语义图像通信框架——MMSD与SAMR，它们在保留有意义视觉信息的同时显著降低了传输成本。MMSD（多模态语义分解）旨在实现极高压缩比并保障数据机密性，因其不传输敏感的像素内容。该方法将原始图像替换为紧凑的语义表示，即分割图、边缘图和文本描述，并在接收端使用基于扩散的生成模型重建场景。SAMR（语义感知掩蔽重建）则目标在保持高压缩率的同时获得更优视觉质量。它在标准JPEG编码前根据语义重要性选择性抑制非关键图像区域，并通过生成式修复技术在接收端恢复缺失内容。两种设计均遵循非对称的发送端-接收端架构：边缘侧进行轻量处理，而计算密集的重建任务则卸载至服务器端。在树莓派5平台上，MMSD的边缘处理时间约为15秒，SAMR约为9秒。实验结果表明，MMSD与SAMR的平均传输数据量分别减少了99%和99.1%。此外，MMSD在保持强语义一致性的同时，其载荷大小低于近期SPIC基线方法；而SAMR在可比操作条件下，相比标准JPEG与SQ-GAN提供了更优的质量-压缩权衡。

摘要 (Abstract)

Many visual monitoring systems operate under strict communication constraints, where transmitting full-resolution images is impractical and often unnecessary. In such settings, visual data is often used for object presence, spatial relationships, and scene context rather than exact pixel fidelity. This paper presents two semantic image communication pipelines for traffic monitoring, MMSD and SAMR, that reduce transmission cost while preserving meaningful visual information. MMSD (Multi-Modal Semantic Decomposition) targets very high compression together with data confidentiality, since sensitive pixel content is not transmitted. It replaces the original image with compact semantic representations, namely segmentation maps, edge maps, and textual descriptions, and reconstructs the scene at the receiver using a diffusion-based generative model. SAMR (Semantic-Aware Masking Reconstruction) targets higher visual quality while maintaining strong compression. It selectively suppresses non-critical image regions according to semantic importance before standard JPEG encoding and restores the missing content at the receiver through generative inpainting. Both designs follow an asymmetric sender-receiver architecture, where lightweight processing is performed at the edge and computationally intensive reconstruction is offloaded to the server. On a Raspberry Pi 5, the edge-side processing time is about 15 s for MMSD and 9 s for SAMR. Experimental results show average transmitted-data reductions of 99% for MMSD and 99.1% for SAMR. In addition, MMSD achieves lower payload size than the recent SPIC baseline while preserving strong semantic consistency, whereas SAMR provides a better quality-compression trade-off than standard JPEG and SQ-GAN under comparable operating conditions.

Keywords: semantic image communication, traffic monitoring, edge computing, image compression, generative models, diffusion models, MMSD, SAMR

Authors: You Qin, Linqing Wang, Hao Fei, Roger Zimmermann, Liefeng Bo, Qinglin Lu, Chunyu Wang | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12617v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

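Scoring rationale: The paper studies SOAR, a post-training method for diffusion models. Relevance by keyword: (1) Highly relevant (10): 'Post-training OR Supervised Fine-tuning OR SFT' is the paper's core topic, since SOAR is designed to improve the SFT stage; 'Self-Correction OR Self-Improvement OR Self-Reflection' is the method's core mechanism. (2) Moderately relevant (8): 'Instruction Tuning OR Alignment OR Value Alignment', because the paper concerns model alignment, but for diffusion models rather than LLMs. (3) Weakly relevant (5): 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO', because the paper discusses RL as a point of comparison, while SOAR itself uses no reward model. (4) Irrelevant (0): the remaining keywords target LLMs, reasoning, compression, scientific applications, and similar topics unrelated to diffusion-model post-training.

!!! tip deepseek-chat TL;DR

The paper proposes SOAR, a self-correcting post-training method that bridges the gap between SFT and RL for diffusion models, improving generation quality and alignment without a reward model.

Abstract (Chinese translation)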

扩散模型的后训练流程目前包含两个阶段：基于精选数据的监督微调（SFT）以及使用奖励模型的强化学习（RL）。二者之间存在一个根本性的鸿沟。SFT仅针对从前向加噪过程中采样的真实状态优化去噪器；一旦推理过程偏离这些理想状态，后续的去噪便依赖于分布外泛化而非习得的校正，这表现出与自回归模型相同的暴露偏差问题，但偏差是沿着去噪轨迹而非标记序列累积的。RL原则上可以解决这种不匹配，但其终端奖励信号稀疏，存在信用分配困难，并有奖励破解的风险。我们提出了SOAR（面向最优对齐与精炼的自校正），一种填补这一鸿沟的偏差校正后训练方法。SOAR从一个真实样本出发，使用当前模型执行一次停止梯度的前向推演，对由此产生的偏离轨迹的状态重新加噪，并监督模型使其回归到原始的干净目标。该方法是同策略（on-policy）的、无需奖励的，并提供密集的每时间步监督，且不存在信用分配问题。在SD3.5-Medium模型上，相较于SFT，SOAR将GenEval分数从0.70提升至0.78，OCR分数从0.64提升至0.67，同时提高了所有基于模型的偏好评分。在受控的特定奖励实验中，尽管SOAR未使用奖励模型，其在美学和图文对齐任务上的最终指标值均超越了Flow-GRPO。由于SOAR的基础损失函数包含了标准的SFT目标，它可以作为预训练之后更强的首个后训练阶段直接替代SFT，同时与后续的RL对齐保持完全兼容。

摘要 (Abstract)

The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction, exhibiting the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a real sample, SOAR performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while simultaneously raising all model-based preference scores. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. Since SOAR’s base loss subsumes the standard SFT objective, it can directly replace SFT as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent RL alignment.

Keywords: diffusion models, post-training, self-correction, supervised fine-tuning, alignment, reinforcement learning, denoising trajectory, exposure bias

80. ❌ DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant

Authors: Lev Sorokin, Ivan Vasilev, Samuele Pasini | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12615v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

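Scoring rationale: The paper reports on a competition that benchmarks testing tools against an LLM-based car manual information retrieval application, so it directly involves LLMs and RAG (car manual retrieval); both keywords are highly relevant (10). Its focus on system failures and omitted warnings has some connection to hallucination mitigation (5). Other keywords, such as MoE, quantization, inference acceleration, and alignment, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper reports the results of the DeepTest 2026 tool competition, in which four testing tools were benchmarked on an LLM-based car manual information retrieval application. The tools were evaluated on how effectively they found user inputs for which the system fails to mention warnings from the manual, and on the diversity of the failure-revealing tests they discovered.

Abstract (Chinese translation)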

本报告总结了在ICSE 2026会议DeepTest研讨会期间举办的首届大语言模型（Large Language Model, LLM）测试竞赛的结果。四款测试工具围绕一款基于LLM的汽车手册信息检索应用进行了基准评估，其核心目标是识别出会导致系统未能恰当提及手册中所含警告信息的用户输入。评估测试方案的主要依据是其暴露系统故障的有效性，以及所发现的揭示故障测试用例的多样性。本文详细阐述了实验方法、参赛工具及最终结果。

摘要 (Abstract)

This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testing solutions were evaluated based on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests. We report on the experimental methodology, the competitors, and the results.

Keywords: LLM testing, automotive assistant, information retrieval, benchmarking, failure detection, car manual, testing competition, DeepTest workshop

81. ❌ Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

Authors: Jianhao Chen, Haoyang Chen, Hanjie Zhao, Haozhe Liang, Tieyun Qian | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12616v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

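Scoring rationale: The paper studies multimodal jailbreak attacks on Vision-Language Models (VLMs). Its main contribution is MemJack, a framework in which multiple agents cooperate to mount semantic-level attacks. Relevance by keyword: (1) Highly relevant (8 to 10): 'LLM Agents/Autonomous Agents' and 'Multi-agent Systems' are the core method; 'Large Language Models' and 'Alignment' describe the attack target (VLMs are large models, and the attack concerns alignment safety). (2) Irrelevant (0): the remaining keywords cover model architecture, training techniques, inference optimization, and scientific applications, none of which the paper addresses.

!!! tip deepseek-chat TL;DR

The paper proposes MemJack, a framework that uses multi-agent cooperation to mount visual-semantic jailbreak attacks. It reports a 71.48% attack success rate against Qwen3-VL-Plus and plans to release a dataset of more than 113,000 attack trajectories to support defense research.

Abstract (Chinese translation)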

视觉语言模型（Vision-Language Models, VLMs）的快速发展催生了人工智能领域前所未有的能力；然而，这种持续的模态扩展无意中暴露了一个极大拓宽且不受约束的对抗攻击面。当前的多模态越狱策略主要集中于表层像素扰动、排版攻击或有害图像，但未能触及视觉数据内在的复杂语义结构。这使得原始自然图像中广阔的语义攻击面在很大程度上未被审视。为揭示这些深层的语义漏洞，我们提出了 MemJack，一个记忆增强的多智能体越狱攻击框架，它明确利用视觉语义来编排自动化的越狱攻击。MemJack 通过协调的多智能体合作，动态地将视觉实体映射至恶意意图，通过多视角视觉语义伪装生成对抗性提示，并利用迭代零空间投影（Iterative Nullspace Projection, INLP）几何过滤器来规避过早的隐空间拒绝。通过持久的多模态经验记忆积累和传递成功策略，MemJack 能够在不同图像间保持高度连贯的扩展多轮越狱攻击交互，从而提升对新图像的攻击成功率（Attack Success Rate, ASR）。在完整、未经修改的 COCO val2017 图像上进行的大量实证评估表明，MemJack 对 Qwen3-VL-Plus 实现了 71.48% 的攻击成功率，在扩展预算下可提升至 90%。此外，为促进未来的防御对齐研究，我们将发布 MemJack-Bench，这是一个包含超过 113,000 条交互式多模态越狱攻击轨迹的综合数据集，为开发本质鲁棒的视觉语言模型奠定了重要基础。

摘要 (Abstract)

The rapid evolution of Vision-Language Models (VLMs) has catalyzed unprecedented capabilities in artificial intelligence; however, this continuous modal expansion has inadvertently exposed a vastly broadened and unconstrained adversarial attack surface. Current multimodal jailbreak strategies primarily focus on surface-level pixel perturbations and typographic attacks or harmful images; however, they fail to engage with the complex semantic structures intrinsic to visual data. This leaves the vast semantic attack surface of original, natural images largely unscrutinized. Driven by the need to expose these deep-seated semantic vulnerabilities, we introduce MemJack, a MEMory-augmented multi-agent JAilbreak attaCK framework that explicitly leverages visual semantics to orchestrate automated jailbreak attacks. MemJack employs coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generate adversarial prompts via multi-angle visual-semantic camouflage, and utilize an Iterative Nullspace Projection (INLP) geometric filter to bypass premature latent space refusals. By accumulating and transferring successful strategies through a persistent Multimodal Experience Memory, MemJack maintains highly coherent extended multi-turn jailbreak attack interactions across different images, thereby improving the attack success rate (ASR) on new images. Extensive empirical evaluations across full, unmodified COCO val2017 images demonstrate that MemJack achieves a 71.48% ASR against Qwen3-VL-Plus, scaling to 90% under extended budgets. Furthermore, to catalyze future defensive alignment research, we will release MemJack-Bench, a comprehensive dataset comprising over 113,000 interactive multimodal jailbreak attack trajectories, establishing a vital foundation for developing inherently robust VLMs.

Keywords: Vision-Language Models, Jailbreak Attacks, Multi-agent Systems, Semantic Vulnerabilities, Adversarial Prompts, Iterative Nullspace Projection, Attack Success Rate, Multimodal Experience Memory

82. ❌ LLM-Guided Prompt Evolution for Password Guessing

Authors: Vladimir A. Mazin, Mikhail A. Zorin, Dmitrii S. Korzh, Elvir Z. Karimov, Dmitrii A. Bolokhov, Oleg Y. Rogov Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12601v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
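
The pass/fail mark on this entry follows from the report's configuration (pass threshold = 5.0 + 0.8 × total weight, with 27 keywords configured at weight 1.0). The sketch below reproduces that arithmetic. It assumes the paper's total is the plain sum of the last column of the table; the function names are hypothetical, and the report does not state how each per-keyword score is derived from weight and relevance, so those scores are taken as given.

```python
def pass_threshold(total_weight, base=5.0, slope=0.8):
    """Pass mark from the report's configuration: base + slope * total weight."""
    return base + slope * total_weight


def is_passing(keyword_scores, total_weight):
    """Assumption: the paper's total is the plain sum of per-keyword scores."""
    total = sum(keyword_scores)
    return total, total >= pass_threshold(total_weight)


# 27 keywords, each configured with weight 1.0 in the report header.
threshold = pass_threshold(27.0)
# Every per-keyword score in the table above is 0.0.
total, passed = is_passing([0.0] * 27, 27.0)
print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Because every row carries a weight of 0.0, even a relevance of 10.0/10 contributes nothing to the total.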

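Scoring rationale: The paper's core contribution is applying LLMs to evolve and optimize prompts for password guessing. It is highly relevant only to "Large Language Models OR LLMs OR Foundation Models" (10 points), since it explicitly uses LLMs such as Qwen3 8B and Gemini-2.5 Flash as password generators. The other keywords, including MoE, SLMs, Scaling Laws, training methods, inference optimization, and AI for Science, are not addressed and therefore score 0.

!!! tip deepseek-chat TL;DR

This study applies LLM-driven evolutionary computation to automatically optimize password-guessing prompts, raising the cracking rate on a RockYou-derived test set from 2.02% to 8.48% and producing a more realistic password distribution.

Abstract translation (Chinese)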

密码仍是当前主流的身份验证方式，但其安全性常因用户选择的可预测性和大规模凭证泄露事件而受到破坏。自动化密码猜测是压力测试密码策略和模拟攻击者行为的关键工具。本文应用基于大语言模型（LLM）驱动的进化计算，为LLM密码猜测框架自动优化提示词。通过使用OpenEvolve——一个结合了MAP-Elites质量多样性搜索与岛屿种群模型的开源系统，我们进化出能在基于RockYou衍生的测试集上最大化破解率的提示词。我们评估了三种配置：使用Qwen3 8B的本地设置、单个紧凑云模型Gemini-2.5 Flash，以及由前沿LLM组成的双模型集成。该方法将破解率从2.02%提升至8.48%。字符分布分析进一步证实，进化后的提示词能生成统计意义上更接近真实的密码。自动化提示词进化是一种低门槛且有效的方法，可增强基于LLM的密码审计能力，同时也揭示了攻击流程如何通过自动化改进展现出优化倾向。

Abstract

Passwords remain a dominant authentication method, yet their security is routinely subverted by predictable user choices and large-scale credential leaks. Automated password guessing is a key tool for stress-testing password policies and modeling attacker behavior. This paper applies LLM-driven evolutionary computation to automatically optimize prompts for the LLM password guessing framework. Using OpenEvolve, an open-source system combining MAP-Elites quality-diversity search with an island population model, we evolve prompts that maximize the cracking rate on a RockYou-derived test set. We evaluate three configurations: a local setup with Qwen3 8B, a single compact cloud model Gemini-2.5 Flash, and a two-model ensemble of frontier LLMs. The approach raises the cracking rate from 2.02% to 8.48%. Character distribution analysis further confirms that evolved prompts produce statistically more realistic passwords. Automated prompt evolution is a low-barrier yet effective way to strengthen LLM-based password auditing, and it underlines how attack pipelines tend to improve through automation.

Keywords: LLM, password guessing, prompt evolution, evolutionary computation, MAP-Elites, cracking rate, RockYou, OpenEvolve
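
The abstract names MAP-Elites quality-diversity search combined with an island population model. The toy sketch below illustrates only that search pattern. It is not OpenEvolve's API, and everything in it is an assumption made for illustration: candidates are small float vectors rather than prompts, `fitness` is a stand-in objective rather than a cracking rate, and mutation is Gaussian noise rather than an LLM rewrite.

```python
import random


def fitness(x):
    """Toy objective standing in for the real score (higher is better)."""
    return -sum((v - 0.5) ** 2 for v in x)


def descriptor(x, bins=5):
    """Behaviour descriptor: the grid cell the first two genes fall into."""
    return tuple(min(int(v * bins), bins - 1) for v in x[:2])


def mutate(x, rng, sigma=0.1):
    """Gaussian mutation, clipped to [0, 1]."""
    return [min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in x]


def insert(archive, x):
    """Keep x only if its cell is empty or it beats the current elite."""
    cell = descriptor(x)
    if cell not in archive or fitness(x) > fitness(archive[cell]):
        archive[cell] = x


def map_elites_islands(n_islands=3, steps=300, migrate_every=50, dim=4, seed=0):
    rng = random.Random(seed)
    islands = [{} for _ in range(n_islands)]  # one elite archive per island
    for archive in islands:
        insert(archive, [rng.random() for _ in range(dim)])
    for step in range(1, steps + 1):
        for archive in islands:
            parent = rng.choice(list(archive.values()))
            insert(archive, mutate(parent, rng))
        if step % migrate_every == 0:
            # Ring migration: each island sends its best elite to the next one.
            bests = [max(a.values(), key=fitness) for a in islands]
            for i, best in enumerate(bests):
                insert(islands[(i + 1) % n_islands], best)
    return islands


islands = map_elites_islands()
best = max((x for a in islands for x in a.values()), key=fitness)
print(len(islands), round(fitness(best), 4))
```

Each archive keeps one elite per descriptor cell, which preserves diversity, while the occasional migration lets good candidates spread between otherwise independent islands.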

83. ❌ IDEA: An Interpretable and Editable Decision-Making Framework for LLMs via Verbal-to-Numeric Calibration

Authors: Yanji He, Yuxin Jiang, Yiwen Wu, Bo Huang, Jiaheng Wei, Wei Wang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12573v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

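Scoring rationale: The IDEA paper focuses on the interpretability and editability of an LLM decision-making framework; it centers on LLMs (highly relevant) and explainable AI (highly relevant). Other keywords, such as MoE, SLMs, training methods, reasoning techniques, agent systems, and compression or acceleration, are neither mentioned in nor related to the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

To address miscalibrated probabilities, unfaithful explanations, and the difficulty of incorporating expert knowledge in LLM decision-making, this paper proposes the IDEA framework, which extracts LLM decision knowledge into an interpretable parametric model, yielding calibrated probabilities and quantitative human-AI collaboration and outperforming existing models on multiple datasets.

Abstract translation (Chinese)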

大语言模型正日益被应用于决策任务，但其在高风险领域的应用仍受限于概率校准失准、解释缺乏忠实性以及难以精确融入专家知识等问题。我们提出IDEA框架，该框架能够将大语言模型的决策知识提取为基于语义明确因子的可解释参数化模型。通过期望最大化算法联合学习语言到数值的映射关系与决策参数、采用保持因子依赖关系的关联采样技术，以及具备数学保证的直接参数编辑，IDEA在生成校准概率的同时实现了定量化的人机协作。在五个数据集上的实验表明，采用Qwen-3-32B的IDEA模型（78.6%）在性能上超越了DeepSeek R1（68.1%）与GPT-5.2（77.9%），并实现了完美的因子排除与精确校准——这种精度是仅通过提示工程无法达到的。项目代码已公开于https://github.com/leonbig/IDEA。

Abstract

Large Language Models are increasingly deployed for decision-making, yet their adoption in high-stakes domains remains limited by miscalibrated probabilities, unfaithful explanations, and inability to incorporate expert knowledge precisely. We propose IDEA, a framework that extracts LLM decision knowledge into an interpretable parametric model over semantically meaningful factors. Through joint learning of verbal-to-numerical mappings and decision parameters via EM, correlated sampling that preserves factor dependencies, and direct parameter editing with mathematical guarantees, IDEA produces calibrated probabilities while enabling quantitative human-AI collaboration. Experiments across five datasets show IDEA with Qwen-3-32B (78.6%) outperforms DeepSeek R1 (68.1%) and GPT-5.2 (77.9%), achieving perfect factor exclusion and exact calibration – precision unattainable through prompting alone. The implementation is publicly available at https://github.com/leonbig/IDEA.

Keywords: Large Language Models, decision-making, interpretable, calibration, human-AI collaboration, parametric model, verbal-to-numerical, factor dependencies

84. ❌ KumoRFM-2: Scaling Foundation Models for Relational Learning

Authors: Valter Hudovernik, Federico López, Vid Kocijan, Akihiro Nitta, Jan Eric Lenssen, Jure Leskovec, Matthias Fey Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12596v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

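Scoring rationale: The paper's core is KumoRFM-2, a foundation model for relational data. It is highly relevant to "Foundation Models" (10 points), involves "Pre-training" and "In-context Learning" (10 points each), supports "Fine-tuning" (8 points), and mentions scaling to large datasets (some connection to "Scaling Laws", 5 points). Other keywords such as MoE, SLMs, Alignment, and RAG are not mentioned in the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

This paper presents KumoRFM-2, a foundation model for predictive tasks on relational data that, through pre-training and in-context learning, outperforms supervised methods by up to 8% across 41 benchmarks and scales to billion-scale datasets.

Abstract translation (Chinese)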

我们推出KumoRFM-2，这是面向关系数据的预训练基础模型的新一代版本。KumoRFM-2支持上下文学习与微调，适用于广泛的预测任务。与传统表格基础模型不同，KumoRFM-2原生支持关系数据操作，能够同时处理一个或多个关联表，无需人工进行表格扁平化或目标变量生成，同时保持时间一致性。该模型利用大规模合成与真实世界数据语料库，从四个维度进行预训练：单表级别的行与列维度，以及数据库级别的外键与跨样本维度。相较于前代模型，KumoRFM-2尽可能早地注入任务信息，从而更精准地选择任务相关列，并提升对噪声数据的鲁棒性。通过对41个具有挑战性的基准测试进行广泛实验，并结合表达能力与敏感度分析，我们证明KumoRFM-2在监督方法与基础模型方法上的性能提升最高达8%，同时在冷启动和噪声数据的极端场景下仍保持强劲性能。据我们所知，这是首次有少样本基础模型在常见基准任务上超越监督学习方法，且经过微调后性能可进一步提升。最后，尽管KumoRFM-1仅适用于小规模内存数据集，KumoRFM-2已能扩展至十亿级别规模的关系数据集。

Abstract

We introduce KumoRFM-2, the next iteration of a pre-trained foundation model for relational data. KumoRFM-2 supports in-context learning as well as fine-tuning and is applicable to a wide range of predictive tasks. In contrast to tabular foundation models, KumoRFM-2 natively operates on relational data, processing one or more connected tables simultaneously without manual table flattening or target variable generation, all while preserving temporal consistency. KumoRFM-2 leverages a large corpus of synthetic and real-world data to pre-train across four axes: the row and column dimensions at the individual table level, and the foreign key and cross-sample dimensions at the database level. In contrast to its predecessor, KumoRFM-2 injects task information as early as possible, enabling sharper selection of task-relevant columns and improved robustness to noisy data. Through extensive experiments on 41 challenging benchmarks and analysis around expressivity and sensitivity, we demonstrate that KumoRFM-2 outperforms supervised and foundational approaches by up to 8%, while maintaining strong performance under extreme settings of cold start and noisy data. To our knowledge, this is the first time a few-shot foundation model has been shown to surpass supervised approaches on common benchmark tasks, with performance further improving upon fine-tuning. Finally, while KumoRFM-1 was limited to small-scale in-memory datasets, KumoRFM-2 scales to billion-scale relational datasets.

Keywords: foundation model, relational data, pre-training, in-context learning, fine-tuning, scalability, benchmark performance, cold start robustness

85. ❌ A Two-Stage LLM Framework for Accessible and Verified XAI Explanations

Authors: Georgios Mermigkis, Dimitris Metaxakis, Marios Tyrovolas, Argiris Sofotasios, Nikolaos Avgeris, Panagiotis Hadjidoukas, Chrysostomos Stylios Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12543v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

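Scoring rationale: The paper studies the use of LLMs in explainable AI (XAI) and proposes a two-stage LLM framework (Explainer and Verifier) to generate and verify natural-language explanations. Highly relevant keywords: LLMs (the core tool), Hallucination Mitigation (the verification mechanism guards against hallucination), Explainable AI (the application domain), and Self-Correction (iterative feedback-driven refinement). Moderately relevant keywords: Chain of Thought and System 2 Thinking (the analysis touches on the reasoning process, e.g. the EPR metric indicates more stable reasoning). Other keywords such as MoE, SFT, and RAG are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the lack of accuracy guarantees when LLMs generate XAI explanations, this paper proposes a two-stage LLM meta-verification framework whose iterative loop of Explainer generation and Verifier assessment significantly improves the reliability and linguistic accessibility of explanations.

Abstract translation (Chinese)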

大语言模型正日益被用于将可解释人工智能方法的技术输出转化为易于理解的自然语言解释。然而，现有方法往往缺乏对准确性、忠实性和完整性的保障。与此同时，当前对此类叙述的评估工作仍主要依赖主观判断或局限于事后评分，无法防止有缺陷的解释传达给终端用户。为应对这些局限，本文提出一种两阶段大语言模型元验证框架，该框架包含：（i）解释器大语言模型，将原始可解释人工智能输出转换为自然语言叙述；（ii）验证器大语言模型，从忠实性、连贯性、完整性及幻觉风险等维度对其进行评估；以及（iii）迭代反馈机制，利用验证器的反馈对解释进行优化改进。通过在五种可解释人工智能技术和数据集上、使用三个系列的开源权重大语言模型进行实验，结果表明：与原始可解释人工智能输出相比，验证环节对于过滤不可靠解释至关重要，同时能提升语言可及性。此外，对优化过程中熵产生率的分析表明，验证器的反馈能逐步引导解释器形成更稳定、连贯的推理路径。总体而言，所提出的框架为实现更可信、更民主化的可解释人工智能系统提供了有效路径。

Abstract

Large Language Models (LLMs) are increasingly used to translate the technical outputs of eXplainable Artificial Intelligence (XAI) methods into accessible natural-language explanations. However, existing approaches often lack guarantees of accuracy, faithfulness, and completeness. At the same time, current efforts to evaluate such narratives remain largely subjective or confined to post-hoc scoring, offering no safeguards to prevent flawed explanations from reaching end-users. To address these limitations, this paper proposes a Two-Stage LLM Meta-Verification Framework that consists of (i) an Explainer LLM that converts raw XAI outputs into natural-language narratives, (ii) a Verifier LLM that assesses them in terms of faithfulness, coherence, completeness, and hallucination risk, and (iii) an iterative refeed mechanism that uses the Verifier’s feedback to refine and improve them. Experiments across five XAI techniques and datasets, using three families of open-weight LLMs, show that verification is crucial for filtering unreliable explanations while improving linguistic accessibility compared with raw XAI outputs. In addition, the analysis of the Entropy Production Rate (EPR) during the refinement process indicates that the Verifier’s feedback progressively guides the Explainer toward more stable and coherent reasoning. Overall, the proposed framework provides an efficient pathway toward more trustworthy and democratized XAI systems.

Keywords: Large Language Models, Explainable AI, XAI, Hallucination Mitigation, Verification Framework, Natural-language Explanations, Faithfulness, Iterative Refinement
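
The control flow described in the abstract (Explainer, Verifier, iterative refeed) can be sketched as a small loop. This is a minimal illustration rather than the authors' implementation: `explain` and `verify` are hypothetical rule-based stubs standing in for the two LLM calls, the attribution values are made up, and the only criterion checked is a crude form of completeness.

```python
def explain(xai_output, feedback=None):
    """Hypothetical Explainer stub: a real system would call an LLM here."""
    top = max(xai_output, key=lambda k: abs(xai_output[k]))
    text = f"The prediction is driven mainly by {top}."
    if feedback:  # a revised draft adds what the Verifier asked for
        others = ", ".join(k for k in xai_output if k != top)
        text += f" Other contributing features: {others}."
    return text


def verify(xai_output, narrative):
    """Hypothetical Verifier stub: checks completeness by feature mention."""
    missing = [k for k in xai_output if k not in narrative]
    return {"accepted": not missing,
            "feedback": f"mention {missing}" if missing else ""}


def refine(xai_output, max_rounds=3):
    """Explainer -> Verifier -> refeed, until accepted or the budget runs out."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        narrative = explain(xai_output, feedback)
        verdict = verify(xai_output, narrative)
        if verdict["accepted"]:
            return narrative, round_no
        feedback = verdict["feedback"]
    return None, max_rounds  # nothing passed verification: withhold the text


# Made-up feature attributions, e.g. from a SHAP-like method.
attributions = {"age": 0.42, "income": -0.13, "tenure": 0.05}
narrative, rounds = refine(attributions)
print(rounds, narrative)
```

Returning `None` when no draft passes reflects the abstract's point that verification acts as a filter before explanations reach end users, not only as a post-hoc score.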

86. ❌ When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP

Authors: Mahounan Pericles Adjovi, Roald Eiselen, Prasenjit Mitra Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12540v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

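Scoring rationale: The paper mainly studies the use of an LLM (Gemini 2.5 Flash) for data augmentation in low-resource languages, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points). Other keywords, such as MoE, SLMs, Scaling Laws, the various training methods (Pre-training, SFT, RLHF, etc.), reasoning optimization (CoT, MCTS), model compression, and AI for Science, are not addressed in the paper and therefore score 0.

!!! tip deepseek-chat TL;DR

This study evaluates two data augmentation methods, LLM generation and back-translation, for Hausa and Fongbe on named entity recognition and part-of-speech tagging. It finds that augmentation effectiveness depends on task type rather than on language or LLM generation quality, challenging the assumption that LLM generation quality predicts augmentation success.

Abstract translation (Chinese)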

数据稀缺性限制了低资源非洲语言的自然语言处理发展。本研究针对豪萨语和丰语这两种西非语言（它们在大型语言模型生成质量上存在显著差异），评估了两种数据增强方法——基于大型语言模型的生成方法（Gemini 2.5 Flash）与回译方法（NLLB-200）。我们使用MasakhaNER 2.0和MasakhaPOS基准测试，在命名实体识别（NER）和词性标注（POS tagging）任务上评估增强效果。结果表明，增强效果主要取决于任务类型，而非单纯由语言或大型语言模型质量决定。在命名实体识别任务中，两种方法对两种语言均未超越基线水平；大型语言模型增强使豪萨语NER的F1值下降0.24%，丰语NER下降1.81%。在词性标注任务中，大型语言模型增强使丰语准确率提升0.33%，而回译方法使豪萨语准确率提升0.17%；回译方法使丰语词性标注准确率降低0.35%，对豪萨语词性标注影响可忽略。同一大型语言模型生成的合成数据在丰语的不同任务中产生相反效果——损害命名实体识别性能却提升词性标注性能——这表明任务结构对增强结果的影响比合成数据质量更为关键。这些发现挑战了“大型语言模型生成质量可预测增强效果”的假设，并提供了可操作的指导：数据增强应被视为针对特定任务的干预措施，而非普遍适用的预处理步骤。

Abstract

Data scarcity limits NLP development for low-resource African languages. We evaluate two data augmentation methods – LLM-based generation (Gemini 2.5 Flash) and back-translation (NLLB-200) – for Hausa and Fongbe, two West African languages that differ substantially in LLM generation quality. We assess augmentation on named entity recognition (NER) and part-of-speech (POS) tagging using MasakhaNER 2.0 and MasakhaPOS benchmarks. Our results reveal that augmentation effectiveness depends on task type rather than language or LLM quality alone. For NER, neither method improves over baseline for either language; LLM augmentation reduces Hausa NER by 0.24% F1 and Fongbe NER by 1.81% F1. For POS tagging, LLM augmentation improves Fongbe by 0.33% accuracy, while back-translation improves Hausa by 0.17%; back-translation reduces Fongbe POS by 0.35% and has negligible effect on Hausa POS. The same LLM-generated synthetic data produces opposite effects across tasks for Fongbe – hurting NER while helping POS – suggesting task structure governs augmentation outcomes more than synthetic data quality. These findings challenge the assumption that LLM generation quality predicts augmentation success, and provide actionable guidance: data augmentation should be treated as a task-specific intervention rather than a universally beneficial preprocessing step.

Keywords: data augmentation, LLM generation, low-resource languages, named entity recognition, part-of-speech tagging, back-translation, Hausa, Fongbe

87. ❌ Technical Report – A Context-Sensitive Multi-Level Similarity Framework for First-Order Logic Arguments: An Axiomatic Study

Authors: Victor David, Jérôme Delobelle, Jean-Guy Mailly Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12534v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

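Scoring rationale: The paper studies a similarity framework for first-order logic arguments, which belongs to formal argumentation and logical reasoning and is entirely unrelated to the scoring keywords (all of which focus on large models, deep learning techniques, and their applications). It does not involve any large-model technique, training method, inference optimization, alignment, application, or related concept, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

Addressing similarity measurement for first-order logic arguments, this paper proposes a comprehensive framework comprising an extended axiomatic foundation, a four-level parametric model, and two model families, to handle similarity computation over structured content.

Abstract translation (Chinese)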

形式论辩中的相似性研究近期因其在语义层面的论据聚合与省略三段论解码等问题中的重要性而受到关注。现有方法主要聚焦于命题逻辑，本文则探讨更具表达力的一阶逻辑（First-Order Logic, FOL）场景，其中相似性需考虑结构化内容。我们提出一个全面的FOL论据相似性框架，其构建基于：（1）扩展的公理化基础；（2）涵盖谓词、文字、子句与公式相似性的四层次参数化模型；（3）两个模型族，其中一类通过语言模型实现句法敏感性，两类模型均整合了上下文权重以支持精细且可解释的相似性度量；（4）确保理想性质的形式化约束条件。

Abstract

Similarity in formal argumentation has recently gained attention due to its significance in problems such as argument aggregation in semantics and enthymeme decoding. While existing approaches focus on propositional logic, we address the richer setting of First-Order Logic (FOL), where similarity must account for structured content. We introduce a comprehensive framework for FOL argument similarity, built upon: (1) an extended axiomatic foundation; (2) a four-level parametric model covering predicates, literals, clauses, and formulae similarity; (3) two model families, one syntax-sensitive via language models, both integrating contextual weights for nuanced and explainable similarity; and (4) formal constraints enforcing desirable properties.

Keywords: First-Order Logic, Argument Similarity, Axiomatic Foundation, Parametric Model, Context-Sensitive, Formal Argumentation, Similarity Framework, Explainable Similarity

88. ❌ MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

Authors: Ruoxiang Huang, Zhen Yuan Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12537v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

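Scoring rationale: The paper studies positional-encoding optimization in vision-language models (VLMs) and proposes MODIX, a training-free method that dynamically adjusts positional indices. Although it involves the Transformer architecture and attention mechanisms, all of the given keywords target large language models (LLMs) or specific large-model techniques (such as MoE, RLHF, and RAG), whereas the paper explicitly focuses on VLMs, multimodal models that differ fundamentally from text-only LLMs. It does not touch any of the specific techniques, methods, or application areas named in the keywords (such as AI for science), so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

Targeting the inefficiency of positional encoding in vision-language models, this paper proposes MODIX, a training-free method that dynamically adjusts positional indices based on multimodal information; experiments show that it improves multimodal reasoning performance and adaptively reallocates attention.

Abstract translation (Chinese)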

视觉语言模型（Vision-Language Models, VLMs）在多模态理解方面取得了显著进展，但其位置编码机制仍非最优。现有方法统一为所有标记分配位置索引，忽视了模态内及跨模态信息密度的差异，这导致注意力分配效率低下——冗余的视觉区域占据主导，而信息丰富的内容却未能充分表征。我们将位置粒度视为一种隐式资源，并提出MODIX（多模态信息驱动的位置索引缩放），这是一种无需训练即可根据模态特定贡献动态调整位置步长的框架。MODIX通过基于协方差的熵联合建模模态内密度，并通过跨模态对齐建模模态间交互，从而推导出统一评分，该评分重新缩放位置索引，将更细的粒度分配给信息丰富的模态，同时压缩冗余模态，且无需修改模型参数或架构。在多种架构和基准测试上的实验表明，MODIX能持续提升多模态推理能力，并根据任务相关的信息分布自适应地重新分配注意力，这表明在多模态序列建模的Transformer中，位置编码应被视为一种可自适应调配的资源。

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.

Keywords: Vision-Language Models, Positional Encoding, Multimodal Understanding, Attention Allocation, Training-Free Framework, Positional Index Scaling, Transformer, Multimodal Reasoning
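
The idea of replacing uniform positional indices with modality-dependent strides can be illustrated with a rough sketch. This is not the MODIX algorithm, and several simplifications are assumptions of the sketch: a diagonal-variance entropy proxy replaces the covariance-based entropy, the cross-modal alignment term is omitted, a softmax turns scores into strides, and "finer granularity" is read as a larger stride for the more informative modality. All function names are hypothetical.

```python
import math


def diag_gaussian_entropy(features):
    """Entropy proxy: mean log-variance per feature dimension."""
    n, dim = len(features), len(features[0])
    total = 0.0
    for d in range(dim):
        col = [row[d] for row in features]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        total += math.log(var + 1e-6)  # small constant avoids log(0)
    return total / dim


def modality_strides(features_by_modality):
    """Softmax the entropy proxies, then scale so the average stride is 1.0."""
    names = list(features_by_modality)
    scores = [diag_gaussian_entropy(features_by_modality[m]) for m in names]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    z = sum(exps)
    return {m: len(names) * e / z for m, e in zip(names, exps)}


def rescaled_positions(modalities, strides):
    """Replace the uniform 0, 1, 2, ... indices with modality-dependent steps."""
    positions, current = [], 0.0
    for m in modalities:
        positions.append(current)
        current += strides[m]
    return positions


features = {
    # Near-duplicate image patches: low variance, treated as redundant.
    "image": [[0.50, 0.51], [0.50, 0.49], [0.51, 0.50], [0.49, 0.50]],
    # Varied text tokens: high variance, treated as informative.
    "text": [[0.1, 0.9], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1]],
}
strides = modality_strides(features)
sequence = ["image", "image", "image", "text", "text"]
print(strides, rescaled_positions(sequence, strides))
```

As in the abstract's description, nothing here touches model weights or architecture; only the position indices fed to the model would change.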

89. ❌ Orthogonal Subspace Projection for Continual Machine Unlearning via SVD-Based LoRA

Authors: Yogachandran Rahulamathavan, Nasir Iqbal, Juncheng Hu, Sangarapillai Lambotharan Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12526v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

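Scoring rationale: The paper focuses on machine unlearning and proposes an SVD-based orthogonal subspace projection method that constrains LoRA updates to reduce interference between tasks. Its core is highly relevant to LoRA (parameter-efficient fine-tuning, 10 points), because LoRA is the technique the method builds on. However, the paper does not involve large language models, AI-for-science applications, or the other keywords; it studies general machine learning models (ResNet-20) on computer vision tasks (CIFAR-100, MNIST) rather than large models or applications in a specific scientific domain. All keywords other than LoRA are therefore irrelevant (0 points).

!!! tip deepseek-chat TL;DR

This paper addresses the parameter collisions caused by multiple LoRA modules in continual machine unlearning, proposing an SVD-based orthogonal subspace projection method that preserves model performance and ensures effective unlearning over long sequences of unlearning tasks.

Abstract translation (Chinese)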

持续机器遗忘旨在消除不应再保留的数据对模型的影响，同时保持模型在其他所有数据上的有效性。当删除请求按顺序到达时，这一任务变得尤为困难，因为模型必须在不抹除先前保留知识的情况下反复适应。低秩适应（LoRA）为实现此类更新提供了一种高效途径，但若简单组合多个顺序的LoRA模块会导致参数冲突，引发任务间的强干扰。我们提出一种基于奇异值分解（SVD）引导的正交子空间投影的静态替代方案。该方法在训练过程中约束每个新的LoRA更新，使其位于早期遗忘任务所用子空间的正交补空间中。这既保持了任务间的隔离性，又无需在部署时进行动态路由。在CIFAR-100数据集（使用ResNet-20模型）和MNIST数据集上的实验表明，该方法在长序列遗忘任务中表现稳定。经过三十次连续遗忘任务后，现有最优的静态融合方法将保留准确率从60.39%降至12.70%，而本文提出的训练中约束优化方法在保持基线性能（约58.1%）的同时，仍维持了强大的遗忘效能。

Abstract

Continual machine unlearning aims to remove the influence of data that should no longer be retained, while preserving the usefulness of the model on everything else. This setting becomes especially difficult when deletion requests arrive sequentially, because the model must repeatedly adapt without erasing previously retained knowledge. Low-Rank Adaptation (LoRA) offers an efficient way to implement such updates, but naively combining many sequential LoRA modules leads to parameter collision, causing strong interference between tasks. We propose a static alternative based on Singular Value Decomposition (SVD)-guided orthogonal subspace projection. Our method constrains each new LoRA update during training so that it lies in the orthogonal complement of the subspaces used by earlier unlearning tasks. This preserves task isolation without requiring dynamic routing at deployment. Experiments on CIFAR-100 with ResNet-20 and on MNIST show stable behavior across long sequences of unlearning tasks. After thirty sequential unlearning tasks, state-of-the-art static fusion reduces retained accuracy from 60.39% to 12.70%, whereas the proposed in-training constrained optimization maintains baseline performance (~58.1%) while preserving strong unlearning efficacy.

Keywords: Continual machine unlearning, Low-Rank Adaptation (LoRA), Orthogonal subspace projection, Singular Value Decomposition (SVD), Parameter collision, Task interference, Sequential unlearning tasks, Model adaptation
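
The core idea in the abstract above, keeping each new update inside the orthogonal complement of the subspace used by earlier tasks, can be illustrated with a small sketch. This is not the authors' implementation: it works on plain vectors rather than LoRA matrices, and it swaps in Gram-Schmidt orthogonalization for the SVD step so that it needs no external libraries. The names `orthonormal_basis`, `project_to_complement`, and the toy vectors are all hypothetical.

```python
# Illustrative sketch only: keeps a new update orthogonal to earlier task directions.
# Gram-Schmidt stands in for the SVD step described in the abstract.

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_to_complement(update, basis):
    """Remove from `update` every component that lies in span(basis)."""
    out = list(update)
    for b in basis:
        coeff = dot(out, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out

# Hypothetical directions used by two earlier unlearning tasks (toy 4-d example).
earlier_directions = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormal_basis(earlier_directions)

# A new update is constrained so that it cannot interfere with the earlier directions.
new_update = [0.5, -2.0, 3.0, 1.0]
constrained = project_to_complement(new_update, basis)
```

In this toy case the constrained update has zero inner product with both earlier directions, which is the isolation property the abstract describes; the paper applies the constraint during training rather than as a one-off projection.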

90. ❌ NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1)

Authors: Guanyi Qin, Jie Liang, Bingbing Zhang, Lishen Qu, Ya-nan Guan, Hui Zeng, Lei Zhang, Radu Timofte, Jianhui Sun, Xinli Yue, Tao Shao, Huan Hou, Wenjie Liao, Shuhao Han, Jieyu Yuan, Chunle Guo, Chongyi Li, Zewen Chen, Yunze Liu, Jian Guo, Juan Wang, Yun Zeng, Bing Li, Weiming Hu, Hesong Li, Dehua Liu, Xinjie Zhang, Qiang Li, Li Yan, Wei Dong, Qingsen Yan, Xingcan Li, Shenglong Zhou, Manjiang Yin, Yinxiang Zhang, Hongbo Wang, Jikai Xu, Zhaohui Fan, Dandan Zhu, Wei Sun, Weixia Zhang, Kun Zhu, Nana Zhang, Kaiwei Zhang, Qianqian Zhang, Zhihan Zhang, William Gordon, Linwei Wu, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Cici Liu, Yaokun Shi Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12512v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
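
As a rough illustration of how a table like the one above turns into a pass/fail verdict, the sketch below assumes that each row's score is its weight multiplied by its relevance; the report does not state this rule, it is only consistent with the zero-weight rows shown here. The pass threshold follows the formula given in the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The function names are hypothetical.

```python
# Assumed scoring rule (not confirmed by the report): row score = weight * relevance.
PASS_BASE = 5.0        # constant term of the pass-threshold formula
PASS_PER_WEIGHT = 0.8  # multiplier applied to the total keyword weight

def pass_threshold(total_weight):
    """Pass threshold from the report's configuration: 5.0 + 0.8 * total weight."""
    return round(PASS_BASE + PASS_PER_WEIGHT * total_weight, 1)

def total_score(rows):
    """Sum of per-keyword scores, each assumed to be weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)

# A few (weight, relevance) rows taken from the table above.
rows = [(0.0, 8.0), (0.0, 0.0), (0.0, 8.0), (0.0, 8.0), (0.0, 8.0)]

score = total_score(rows)
threshold = pass_threshold(27.0)
passed = score >= threshold
```

Under this assumption, a paper with non-zero relevance but zero weights still totals 0.0 and falls below the 26.6 threshold, which matches the verdict shown for this entry.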

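Scoring rationale: The paper studies the use of multimodal large language models (MLLMs) for professional image quality assessment, covering comparative quality selection and interpretative reasoning. It is highly relevant to 'Large Language Models' (8 points), since MLLMs are the core method; relevant to 'Chain of Thought' and 'System 2 Thinking' (8 points each), since the paper stresses reasoning ability as the basis for expert-level explanations; and relevant to 'Mechanistic Interpretability' (8 points), since the work focuses on model explainability and rationale generation. Other keywords, such as MoE, SLMs, and RLHF, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

Through the NTIRE 2026 challenge, the paper applies multimodal large language models (MLLMs) to professional image quality assessment: reliably selecting the better image in a high-quality pair and generating expert-level explanatory reasoning. The top-performing methods significantly advanced the state of the art in professional IQA.

Abstract (Chinese translation)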

本文概述了NTIRE 2026“第三届野外任意图像修复模型”挑战赛，重点关注赛道一：专业图像质量评估。传统的图像质量评估方法通常依赖于标量分数。这些方法将复杂的视觉特征压缩为单一数值，因而难以区分高质量图像之间细微的差异。此外，它们无法阐明一幅图像为何更优，缺乏为视觉任务提供指导所需的推理能力。为弥补这一差距，多模态大语言模型的最新进展提供了一个有前景的范式。受此潜力启发，我们的挑战赛建立了一个新颖的基准，旨在探索MLLMs在评估高质量图像对时模仿人类专家认知的能力。参赛者的核心任务是克服专业场景中的关键瓶颈，聚焦于两个主要目标：(1) 比较性质量选择：可靠地识别高质量图像对中视觉上更优的图像；(2) 解释性推理：生成有依据的、专家级的解释，详细说明选择背后的理由。本次挑战赛共吸引了近200名注册者和超过2500份提交。表现最优的方法显著推进了专业IQA的技术水平。挑战赛数据集发布于https://github.com/narthchin/RAIM-PIQA，官方主页可通过https://www.codabench.org/competitions/12789/访问。

Abstract

In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high-quality images. Furthermore, they fail to articulate why one image is superior, lacking the reasoning capabilities required to provide guidance for vision tasks. To bridge this gap, recent advancements in Multimodal Large Language Models (MLLMs) offer a promising paradigm. Inspired by this potential, our challenge establishes a novel benchmark exploring the ability of MLLMs to mimic human expert cognition in evaluating high-quality image pairs. Participants were tasked with overcoming critical bottlenecks in professional scenarios, centering on two primary objectives: (1) Comparative Quality Selection: reliably identifying the visually superior image within a high-quality pair; and (2) Interpretative Reasoning: generating grounded, expert-level explanations that detail the rationale behind the selection. In total, the challenge attracted nearly 200 registrations and over 2,500 submissions. The top-performing methods significantly advanced the state of the art in professional IQA. The challenge dataset is available at https://github.com/narthchin/RAIM-PIQA, and the official homepage is accessible at https://www.codabench.org/competitions/12789/.

Keywords: Multimodal Large Language Models, Image Quality Assessment, Professional IQA, Comparative Quality Selection, Interpretative Reasoning, Expert-level Explanations, Challenge Benchmark, Vision Tasks

91. ❌ Topology-Aware Reasoning over Incomplete Knowledge Graph with Graph-Based Soft Prompting

Authors: Shuai Wang, Xixi Wang, Yinan Yu Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12503v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

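Scoring rationale: The paper explicitly uses LLMs for knowledge graph question answering. Its core contribution is a graph-based soft prompting framework that strengthens LLM reasoning over knowledge graphs and directly targets LLM hallucination, so it is highly relevant to 'Large Language Models' and 'Hallucination Mitigation' (10 points each). Other keywords, such as MoE, quantization, inference acceleration, and alignment, are neither mentioned in the abstract nor relevant to it, and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a graph-based soft prompting framework that encodes knowledge graph subgraphs as soft prompts to strengthen LLM reasoning in multi-hop knowledge graph question answering. The approach reduces sensitivity to incomplete knowledge graphs, mitigates hallucination, and achieves state-of-the-art results on three of four multi-hop KBQA benchmarks.

Abstract (Chinese translation)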

大语言模型（LLMs）在各类任务中展现出卓越能力，但在知识密集型场景中仍易产生幻觉。知识库问答（Knowledge Base Question Answering, KBQA）通过将生成过程锚定于知识图谱（Knowledge Graphs, KGs）中来缓解此问题。然而，现有大多数多跳KBQA方法依赖于显式的边遍历，这使其对知识图谱的不完整性极为敏感。本文提出了一种新颖的基于图谱的软提示框架，将推理范式从节点级路径遍历转向子图级推理。具体而言，我们采用图神经网络（Graph Neural Network, GNN）将提取的结构化子图编码为软提示，使大语言模型能够基于更丰富的结构上下文进行推理，并识别出直接图邻域之外的相关实体，从而降低对缺失边的敏感性。此外，我们引入了一种两阶段范式，在保持良好性能的同时降低计算成本：首先使用轻量化大语言模型结合软提示识别问题相关实体与关系，随后由更强大的大语言模型进行基于证据的答案生成。在四个多跳KBQA基准数据集上的实验表明，我们的方法在其中三个数据集上取得了最先进的性能，验证了其有效性。代码已发布于仓库：https://github.com/Wangshuaiia/GraSP。

Abstract

Large Language Models (LLMs) have shown remarkable capabilities across various tasks but remain prone to hallucinations in knowledge-intensive scenarios. Knowledge Base Question Answering (KBQA) mitigates this by grounding generation in Knowledge Graphs (KGs). However, most multi-hop KBQA methods rely on explicit edge traversal, making them fragile to KG incompleteness. In this paper, we proposed a novel graph-based soft prompting framework that shifts the reasoning paradigm from node-level path traversal to subgraph-level reasoning. Specifically, we employ a Graph Neural Network (GNN) to encode extracted structural subgraphs into soft prompts, enabling LLM to reason over richer structural context and identify relevant entities beyond immediate graph neighbors, thereby reducing sensitivity to missing edges. Furthermore, we introduce a two-stage paradigm that reduces computational cost while preserving good performance: a lightweight LLM first leverages the soft prompts to identify question-relevant entities and relations, followed by a more powerful LLM for evidence-aware answer generation. Experiments on four multi-hop KBQA benchmarks show that our approach achieves state-of-the-art performance on three of them, demonstrating its effectiveness. Code is available at the repository: https://github.com/Wangshuaiia/GraSP.

Keywords: Large Language Models, Knowledge Graph Question Answering, Graph Neural Networks, Soft Prompting, Multi-hop Reasoning, Hallucination Mitigation, Subgraph-level Reasoning, Two-stage Paradigm

92. ❌ SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker

Authors: Junbin Su, Ziteng Xue, Shihui Zhang, Kun Chen, Weiming Hu, Zhipeng Zhang Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12502v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

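Scoring rationale: SEATrack targets multimodal tracking. Its core contributions are: (1) AMG-LoRA, which combines LoRA (parameter-efficient fine-tuning) with Adaptive Mutual Guidance for cross-modal attention alignment and domain adaptation, making the paper highly relevant to 'PEFT/LoRA' (15 points); (2) a Hierarchical Mixture of Experts (HMoE) for global relation modeling, which is directly relevant to 'Mixture of Experts' (10 points); and (3) cross-modal alignment and adaptation, which has some connection to 'Domain Adaptation' (5 points). The paper does not involve large language models, AI for science, or any other keyword, so the remaining keywords score 0.

!!! tip deepseek-chat TL;DR

The paper tackles the performance-efficiency dilemma of parameter-efficient fine-tuning (PEFT) in multimodal tracking. It proposes the SEATrack framework, which uses AMG-LoRA for cross-modal attention alignment and HMoE for efficient global fusion, and improves the balance between performance and efficiency on RGB-T, RGB-D, and RGB-E tracking tasks.

Abstract (Chinese translation)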

多模态跟踪中的参数高效微调（Parameter-efficient fine-tuning, PEFT）揭示了一个令人担忧的趋势：近期的性能提升往往以参数预算的膨胀为代价，这从根本上侵蚀了PEFT的效率承诺。本文提出SEATrack，一个简单、高效且自适应的双流多模态跟踪器，从两个互补的视角应对这一性能-效率困境。我们首先优先考虑匹配响应的跨模态对齐——这是一个尚未被充分探索但至关重要的因素，我们认为这对于打破现有权衡至关重要。具体而言，我们观察到现有双流方法中存在的模态特定偏差会产生相互冲突的匹配注意力图，从而阻碍有效的联合表征学习。为缓解此问题，我们提出了AMG-LoRA，它将用于领域适应的低秩自适应（Low-Rank Adaptation, LoRA）与自适应互引导（Adaptive Mutual Guidance, AMG）无缝集成，以动态优化并跨模态对齐注意力图。随后，我们摒弃了传统的局部融合方法，引入了分层混合专家（Hierarchical Mixture of Experts, HMoE），以实现高效的全局关系建模，在跨模态融合中有效平衡了表达力与计算效率。得益于这些创新，SEATrack在RGB-T、RGB-D和RGB-E跟踪任务中，于性能与效率的平衡方面相比现有先进方法取得了显著进展。代码已开源：https://github.com/AutoLab-SAI-SJTU/SEATrack。

Abstract

Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT’s efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Equipped with these innovations, SEATrack advances notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code is available at https://github.com/AutoLab-SAI-SJTU/SEATrack.

Keywords: multimodal tracking, parameter-efficient fine-tuning, LoRA, cross-modal alignment, Hierarchical Mixture of Experts, attention maps, domain adaptation, efficiency-performance trade-off

93. ❌ Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining

Authors: Mahmoud Amiri, Jamile Mohammad Jafari, Sara Mostafapour, Thomas Bocklitz Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12498v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

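Scoring rationale: The paper focuses on the construction and validation of a chemistry corpus and on a reproducible workflow for it, with text mining and retrieval as the target applications. It uses the E5-large-v2 model to generate paragraph-level embeddings, but this is an application of a general-purpose embedding model rather than an innovation in large language models or deep learning techniques. Of all the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' has some connection to the paper's chemistry application (5 points), because the paper builds a chemistry-specific corpus and supports downstream text mining. The other keywords concern core innovations in large-model techniques, training methods, inference optimization, agent systems, and similar topics that the paper does not address, so they score 0. The paper does not involve any expert authors.

!!! tip deepseek-chat TL;DR

The paper presents Lit2Vec, a reproducible workflow for building and validating a legally screened chemistry corpus from S2ORC to support downstream retrieval and text-mining applications.

Abstract (Chinese translation)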

本文提出Lit2Vec——一种基于语义学者开放研究语料库、通过保守的元数据许可筛选来构建与验证化学领域语料库的可复现工作流程。利用该流程，我们构建了一个包含582,683篇化学专业全文研究文章的内部研究语料库，其具备结构化全文、词汇感知的段落分块、基于intfloat/e5-large-v2模型生成的段落级嵌入向量，以及包含摘要与许可信息的记录级元数据。为支持下游检索与文本挖掘应用，语料库中符合条件的子集还额外增强了机器生成的简要摘要与涵盖18个化学领域的多标签子领域标注。许可筛选使用了来自Unpaywall、OpenAlex和Crossref的元数据，并对最终语料库进行了技术验证，包括模式合规性、嵌入向量可复现性、文本质量及元数据完整性。本研究的主要贡献在于提供了一套可复现的语料库构建与验证工作流程，及其关联的数据模式与可复现性资源。发布材料包含从固定的公共上游资源复现语料库所需的代码、重建流程、数据模式、元数据/溯源文件及验证输出。基于源文本的公开再分发及广泛文本衍生表示不在本次通用发布范围内。研究者可通过已发布的流程，结合公开可用的上游数据集与元数据服务，复现该工作流程。

Abstract

We present Lit2Vec, a reproducible workflow for constructing and validating a chemistry corpus from the Semantic Scholar Open Research Corpus using conservative, metadata-based license screening. Using this workflow, we assembled an internal study corpus of 582,683 chemistry-specific full-text research articles with structured full text, token-aware paragraph chunks, paragraph-level embeddings generated with the intfloat/e5-large-v2 model, and record-level metadata including abstracts and licensing information. To support downstream retrieval and text-mining use cases, an eligible subset of the corpus was additionally enriched with machine-generated brief summaries and multi-label subfield annotations spanning 18 chemistry domains. Licensing was screened using metadata from Unpaywall, OpenAlex, and Crossref, and the resulting corpus was technically validated for schema compliance, embedding reproducibility, text quality, and metadata completeness. The primary contribution of this work is a reproducible workflow for corpus construction and validation, together with its associated schema and reproducibility resources. The released materials include the code, reconstruction workflow, schema, metadata/provenance artifacts, and validation outputs needed to reproduce the corpus from pinned public upstream resources. Public redistribution of source-derived text and broad text-derived representations is outside the scope of the general release. Researchers can reproduce the workflow by using the released pipeline with publicly available upstream datasets and metadata services.

Keywords: chemistry corpus, reproducible workflow, text mining, retrieval, embeddings, license screening, corpus validation, Semantic Scholar Open Research Corpus

94. ❌ Latent Planning Emerges with Scale

Authors: Michael Hanna, Emmanuel Ameisen Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12493v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

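Scoring rationale: The paper's core subject is the latent planning ability of LLMs and how it grows with model scale, so it is highly relevant to 'Large Language Models' and 'Scaling Laws' (10 points each). The study involves internal planning representations and reasoning processes, which gives it some connection to 'Chain of Thought' and 'System 2 Thinking' (5 points each). It tests models from 0.6B to 14B parameters, including small ones, so it has some connection to 'Small Language Models' (5 points). It analyzes internal model features to understand planning mechanisms, which is highly relevant to 'Mechanistic Interpretability' (10 points). Other keywords, such as MoE, training methods, RAG, and quantization, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the latent planning ability of large language models (LLMs) on simple planning tasks, finds that this ability increases with model scale, and offers a framework for measuring planning together with mechanistic evidence of how it arises.

Abstract (Chinese translation)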

大型语言模型（LLM）能够执行看似需要大量规划的任务，例如撰写连贯的故事或功能性代码，而无需显式地表述计划；然而，其进行隐性规划的程度尚不明确。本文中，我们将潜在规划定义为：当LLM具备内部规划表征时，该表征（1）导致生成特定的未来词元或概念，且（2）调整前文语境以支持所述未来词元或概念的出现。我们以Qwen-3系列模型（0.6B-14B）为对象，在简单规划任务上展开研究，发现潜在规划能力随模型规模提升而增强。具备规划能力的模型拥有表征规划目标词（如“accountant”）的特征，并促使其输出“an”而非“a”；此外，即使表现稍逊的Qwen-3 4B-8B模型也已具备初期的规划机制。在完成押韵对句这一更复杂的任务中，我们发现模型常能提前识别韵脚，但即使是大规模模型也极少进行长程规划。然而，通过在散文生成中引导模型朝向规划词汇，我们可以激发一定程度的规划行为，且该能力随模型规模扩大而提升。总之，我们提出了一个衡量规划的框架，并提供了模型规划能力随规模增长机理的证据。

Abstract

LLMs can perform seemingly planning-intensive tasks, like writing coherent stories or functioning code, without explicitly verbalizing a plan; however, the extent to which they implicitly plan is unknown. In this paper, we define latent planning as occurring when LLMs possess internal planning representations that (1) cause the generation of a specific future token or concept, and (2) shape preceding context to license said future token or concept. We study the Qwen-3 family (0.6B-14B) on simple planning tasks, finding that latent planning ability increases with scale. Models that plan possess features that represent a planned-for word like “accountant”, and cause them to output “an” rather than “a”; moreover, even the less-successful Qwen-3 4B-8B have nascent planning mechanisms. On the more complex task of completing rhyming couplets, we find that models often identify a rhyme ahead of time, but even large models seldom plan far ahead. However, we can elicit some planning that increases with scale when steering models towards planned words in prose. In sum, we offer a framework for measuring planning and mechanistic evidence of how models’ planning abilities grow with scale.

Keywords: latent planning, LLMs, scale, internal representations, planning tasks, Qwen-3, mechanistic evidence, model size

95. ❌ Deepfakes at Face Value: Image and Authority

Authors: James Ravi Kirkpatrick Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12490v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

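Scoring rationale: The paper examines the ethical and legal questions raised by deepfakes, in particular authority over one's identity and the right to control uses of one's image. Although it notes that deepfakes are produced with deep learning methods, the paper as a whole is a philosophical, ethical, and legal analysis rather than a study of the technical principles, innovations, or applications of large models or deep learning. Every scoring keyword concerns large-model techniques, optimization methods, application scenarios, or AI for science, none of which relates to the paper's ethical and legal subject.

!!! tip deepseek-chat TL;DR

The paper asks why deepfakes are wrongful even when they cause no actual harm. It argues that they algorithmically conscript a person's identifying features and thereby violate that person's authority over the use of their image and the governance of their identity, and it distinguishes permissible artistic appropriation from wrongful algorithmic simulation.

Abstract (Chinese translation)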

深度伪造是一种利用深度学习方法将某人的形象叠加或生成至既有音频、图像或视频上的合成媒体。现有关于制作与传播深度伪造之不当性的论述，多聚焦于其造成的实际伤害或对非规范性利益的侵害。然而，这些观点未能解释为何即使未造成实际伤害或未损害其他非规范性利益时，深度伪造仍可能构成不当行为。针对这一问题，本文指出了一个被忽视的深层原因：深度伪造可能颠覆我们对自身图像使用许可权及身份自主管理权的正当利益。我们认为，当深度伪造通过将个人生物特征数据用作生成性资源，从而篡夺我们判定自身行为来源的权威时，其行为即构成不当。具体而言，我们享有反对身份被算法征用的特定权利。通过区分艺术描绘等正当挪用形式与不当的算法模拟，本文进一步界定了这一利益的范围。

Abstract

Deepfakes are synthetic media that superimpose or generate someone’s likeness on to pre-existing sound, images, or videos using deep learning methods. Existing accounts of the wrongs involved in creating and distributing deepfakes focus on the harms they cause or the non-normative interests they violate. However, these approaches do not explain how deepfakes can be wrongful even when they cause no harm or set back any other non-normative interest. To address this issue, this paper identifies a neglected reason why deepfakes are wrong: they can subvert our legitimate interests in having authority over the permissible uses of our image and the governance of our identity. We argue that deepfakes are wrong when they usurp our authority to determine the provenance of our own agency by exploiting our biometric features as a generative resource. In particular, we have a specific right against the algorithmic conscription of our identity. We refine the scope of this interest by distinguishing between permissible forms of appropriation, such as artistic depiction, from wrongful algorithmic simulation.

Keywords: deepfakes, synthetic media, deep learning, authority, identity governance, algorithmic conscription, biometric features, ethical wrongs

96. ❌ KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning

Authors: Shuai Wang, Yinan Yu Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12487v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

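Scoring rationale: The paper's core subject is the use of LLMs for multi-hop reasoning over knowledge graphs. It is highly relevant to 'Large Language Models' (10 points) and involves reasoning mechanisms covered by 'Chain of Thought' and 'System 2 Thinking' (10 points each). Other keywords, such as MoE, quantization, and RAG, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

To address the limitations of LLMs in multi-hop knowledge graph reasoning, the paper proposes the KG-Reasoner framework, which uses reinforcement learning to train an LLM to internalize the KG traversal process. It achieves competitive or superior performance on eight multi-hop and knowledge-intensive reasoning benchmarks.

Abstract (Chinese translation)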

大语言模型（Large Language Models, LLMs）在自然语言理解与生成方面展现出强大能力，但在知识密集型推理任务中仍面临困难。结构化知识图谱（Knowledge Graphs, KGs）作为一种有效的外部知识表示形式，已被广泛用于提升经典知识库问答（Knowledge Base Question Answering, KBQA）任务的性能。然而，针对复杂查询在知识图谱上进行精确的多跳推理仍然极具挑战性。现有方法大多将推理过程分解为一系列孤立的步骤，并通过固定流程执行。尽管这类设计在一定程度上有效，但它们限制了推理的灵活性，并割裂了整体决策过程，常导致推理不连贯以及早期步骤中关键中间信息的丢失。本文提出KG-Reasoner，一个端到端框架，它将多步推理整合至推理大语言模型的统一“思考”阶段。通过强化学习（Reinforcement Learning, RL），大语言模型被训练以将知识图谱遍历过程内化，使其能够动态探索推理路径，并在必要时执行回溯。在八个多跳及知识密集型推理基准测试上的实验表明，KG-Reasoner相较于现有最优方法取得了具有竞争力或更优的性能。代码已发布于仓库：https://github.com/Wangshuaiia/KG-Reasoner。

Abstract

Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge-intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi-hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG-Reasoner, an end-to-end framework that integrates multi-step reasoning into a unified “thinking” phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths, and perform backtracking when necessary. Experiments on eight multi-hop and knowledge-intensive reasoning benchmarks demonstrate that KG-Reasoner achieves competitive or superior performance compared to the state-of-the-art methods. Codes are available at the repository: https://github.com/Wangshuaiia/KG-Reasoner.

Keywords: Large Language Models, Knowledge Graph Reasoning, Multi-hop Reasoning, Reinforcement Learning, End-to-end Framework, Reasoning LLM, Knowledge Base Question Answering, KG Traversal
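
For readers unfamiliar with what "exploring reasoning paths and backtracking" over a knowledge graph looks like, here is a minimal classical sketch: a depth-first search that abandons dead-end branches and tries the next relation. It only illustrates the traversal behavior that the abstract says the LLM learns to internalize; it is not the paper's RL-trained model. The toy graph, the entity names, and the `search` function are all made up for illustration.

```python
# Toy knowledge graph: entity -> list of (relation, tail entity). All facts are invented.
KG = {
    "Alice": [("born_in", "Lyon"), ("works_at", "AcmeLab")],
    "Lyon": [("located_in", "France")],
    "AcmeLab": [("located_in", "Geneva")],
    "Geneva": [("located_in", "Switzerland")],
}

def search(entity, goal, max_hops, path=()):
    """Depth-first multi-hop search.

    Returns the first list of (relation, entity) hops that reaches `goal`
    within `max_hops`, or None if every branch from `entity` is a dead end.
    """
    if entity == goal:
        return list(path)
    if max_hops == 0:
        return None  # hop budget exhausted: the caller backtracks
    for relation, tail in KG.get(entity, []):
        result = search(tail, goal, max_hops - 1, path + ((relation, tail),))
        if result is not None:
            return result
    return None  # no outgoing edge led to the goal: backtrack

# The search first follows born_in -> Lyon -> France, hits a dead end,
# backtracks to Alice, and then succeeds via works_at.
hops = search("Alice", "Switzerland", max_hops=3)
```

The hop budget plays the role of a bound on reasoning depth; a fixed pipeline would commit to the first branch, whereas the backtracking search recovers from it.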

97. ❌ Elastic Net Regularization and Gabor Dictionary for Classification of Heart Sound Signals using Deep Learning

Authors: Mahmoud Fakhry, Ascensión Gallardo-Antolín Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12483v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

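Scoring rationale: This paper focuses on classifying heart sound signals with deep learning (CNN and LSTM), using elastic net regularization and a Gabor dictionary for feature extraction. It is unrelated to the vast majority of the keywords (which mainly cover large-model techniques, training methods, inference optimization, agents, and so on), because those keywords target large language models (LLMs) and related technology, whereas this paper studies conventional deep learning applied to biomedical signal processing. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since the paper applies AI to a biomedical problem (heart sound analysis). That is the application setting rather than the core contribution, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This paper studies how elastic net regularization and a Gabor dictionary can improve the time-frequency feature representation of heart sound signals, and uses deep learning networks (CNN and LSTM) to classify five heart valvular conditions, reaching a best classification accuracy of 98.95%.

Abstract (Chinese translation)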

本文提出通过优化时频原子的分辨率与拟合模型的正则化，以获取心音信号的更优表征。该方法通过评估深度学习网络基于拟合模型所生成的新型时频特征矩阵在鉴别五种心脏瓣膜病变时的分类性能来实现。我们考察了分辨率与正则化的多种组合方案，最终选取能提供最佳分类性能的最优组合。为此，我们基于心音信号与过完备Gabor原子字典，利用线性模型的弹性网络正则化方法获得拟合模型。本研究采用两种不同的深度学习架构：第一种主要由一维卷积神经网络层和长短期记忆层构成；第二种则由一维及二维卷积神经网络层串联长短期记忆层组成。网络训练采用两种算法：带动量的随机梯度下降法与自适应矩估计算法。通过使用包含五种心脏瓣膜病变心音信号的数据库进行大量实验，结果表明：当采用第二种架构、以自适应矩估计算法进行训练，并使用由高时间-低频率分辨率原子构成的Gabor字典所获得的最优模型生成特征矩阵时，取得了$98.95\%$的最佳分类准确率。

Abstract

In this article, we propose the optimization of the resolution of time-frequency atoms and the regularization of fitting models to obtain better representations of heart sound signals. This is done by evaluating the classification performance of deep learning (DL) networks in discriminating five heart valvular conditions based on a new class of time-frequency feature matrices derived from the fitting models. We inspect several combinations of resolution and regularization, and the optimal one is that provides the highest performance. To this end, a fitting model is obtained based on a heart sound signal and an overcomplete dictionary of Gabor atoms using elastic net regularization of linear models. We consider two different DL architectures, the first mainly consisting of a 1D convolutional neural network (CNN) layer and a long short-term memory (LSTM) layer, while the second is composed of 1D and 2D CNN layers followed by an LSTM layer. The networks are trained with two algorithms, namely stochastic gradient descent with momentum (SGDM) and adaptive moment (ADAM). Extensive experimentation has been conducted using a database containing heart sound signals of five heart valvular conditions. The best classification accuracy of $98.95\%$ is achieved with the second architecture when trained with ADAM and feature matrices derived from optimal models obtained with a Gabor dictionary consisting of atoms with high-time low-frequency resolution and imposing sparsity on the models.

Keywords: heart sound classification, deep learning, elastic net regularization, Gabor dictionary, time-frequency features, convolutional neural network, long short-term memory, cardiac valvular conditions
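
The fitting step described in the abstract (a signal approximated by Gabor atoms under elastic net regularization) can be sketched in a few lines of standard-library Python. This is a rough illustration under stated assumptions, not the authors' pipeline: the dictionary here has only six atoms and is not overcomplete, the atom form (Gaussian-windowed cosine), the coordinate-descent solver, and every parameter value (`alpha`, `l1_ratio`, widths, frequencies) are hypothetical choices.

```python
import math
from typing import List


def gabor_atom(n: int, center: float, freq: float, width: float) -> List[float]:
    """Unit-norm Gaussian-windowed cosine, one common form of a Gabor atom."""
    atom = [math.exp(-((t - center) ** 2) / (2 * width ** 2))
            * math.cos(2 * math.pi * freq * (t - center)) for t in range(n)]
    norm = math.sqrt(sum(a * a for a in atom))
    return [a / norm for a in atom]


def soft_threshold(x: float, lam: float) -> float:
    """Shrink x toward zero by lam (the L1 proximal step)."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


def elastic_net(atoms: List[List[float]], y: List[float], alpha: float = 0.002,
                l1_ratio: float = 0.9, sweeps: int = 200) -> List[float]:
    """Coordinate descent on
    (1/2n)||y - sum_j w_j a_j||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||_2^2."""
    n = len(y)
    w = [0.0] * len(atoms)
    residual = list(y)  # y minus the current reconstruction
    for _ in range(sweeps):
        for j, a in enumerate(atoms):
            z = sum(v * v for v in a) / n
            rho = sum(v * (r + w[j] * v) for v, r in zip(a, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            if new_w != w[j]:
                delta = new_w - w[j]
                residual = [r - delta * v for r, v in zip(residual, a)]
                w[j] = new_w
    return w


# Hypothetical dictionary: two time centers x three frequencies, 64 samples each.
atoms = [gabor_atom(64, c, f, 6.0) for c in (16, 48) for f in (0.05, 0.15, 0.25)]
signal = [2.0 * v for v in atoms[4]]  # synthetic "signal" built from one atom
coeffs = elastic_net(atoms, signal)
print([round(c, 3) for c in coeffs])
```

The L1 part of the penalty drives most coefficients to zero, which is the sparsity the abstract credits for the best result, while the L2 part keeps the fit stable when atoms are correlated. In the paper the resulting coefficients feed the time-frequency feature matrices; how they are arranged is not specified in this report.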

Authors: K. Ege de Bruin, Kyrre Glette, Kai Olav Ellefsen, Giorgia Nadizar, Eric Medvet Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12482v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

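Scoring rationale: The paper studies social learning strategies for virtual soft robots, involving joint optimization of robot morphology and control parameters, evolutionary algorithms, and social learning mechanisms. All scoring keywords relate directly to large language models, deep learning technical principles, or applications of AI in science, whereas this paper focuses on evolutionary algorithms and social learning in robotics. It involves no large-model techniques, deep learning innovations, or AI applications in scientific fields such as bioinformatics or cheminformatics, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

This paper studies how virtual soft robots can use social learning strategies (obtaining optimized parameters from other robots) to accelerate the optimization of their own control parameters. Experiments show that social learning outperforms learning from scratch under the same computational budget, although the best teacher selection strategy remains an open question.

Abstract (Chinese translation)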

优化机器人的身体与大脑是一个耦合性挑战：形态结构决定了何种控制策略有效，而控制参数又影响着形态结构的性能表现。这种联合优化可通过进化与学习过程的嵌套循环实现，其中每个机器人的控制参数被独立学习。然而，单个机器人习得的控制参数可能包含对其他机器人有价值的信息。因此，我们引入一种社会学习方法，使机器人能够利用同伴已优化的参数来加速自身大脑的优化。在此框架内，我们系统研究了教师选择——即决定向哪些机器人学习以及学习多少机器人的经验——如何影响性能，并在四种任务和环境中对虚拟软体机器人进行了实验。特别地，由于机器人优化中身体与大脑的紧密耦合，我们研究了从形态相似的机器人继承经验的效果。实验结果证实了借鉴他人经验的有效性：在同等计算资源下，社会学习方法明显优于从零开始的学习。此外，虽然最优教师选择策略仍有待探索，但我们的研究表明，融合多位教师的知识能够带来更稳定、更鲁棒的改进。

Abstract

Optimizing the body and brain of a robot is a coupled challenge: the morphology determines what control strategies are effective, while the control parameters influence how well the morphology performs. This joint optimization can be done through nested loops of evolutionary and learning processes, where the control parameters of each robot are learned independently. However, the control parameters learned by one robot may contain valuable information for others. Thus, we introduce a social learning approach in which robots can exploit optimized parameters from their peers to accelerate their own brain optimization. Within this framework, we systematically investigate how the selection of teachers, deciding which and how many robots to learn from, affects performance, experimenting with virtual soft robots in four tasks and environments. In particular, we study the effect of inheriting experience from morphologically similar robots due to the tightly coupled body and brain in robot optimization. Our results confirm the effectiveness of building on others’ experience, as social learning clearly outperforms learning from scratch under equivalent computational budgets. In addition, while the optimal teacher selection strategy remains open, our findings suggest that incorporating knowledge from multiple teachers can yield more consistent and robust improvements.

Keywords: social learning, virtual soft robots, morphology optimization, control parameters, evolutionary algorithms, teacher selection, joint optimization, computational budget

99. ❌ Audio Source Separation in Reverberant Environments using $β$-divergence based Nonnegative Factorization

Authors: Mahmoud Fakhry, Piergiorgio Svaizer, Maurizio Omologo Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12480v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

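Scoring rationale: This paper studies audio source separation using a β-divergence-based nonnegative factorization method, which belongs to conventional signal processing. All scoring keywords concern large models, deep learning, and related techniques (such as MoE, RLHF, and RAG). This paper involves none of these, nor any application of AI to science, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes a method based on β-divergence nonnegative factorization for audio source separation in reverberant environments. Experiments show that it improves separation performance by controlling sparsity.

Abstract (Chinese translation)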

在高斯模型基多通道音频源分离中，观测到的源信号混合物的似然性由源频谱方差及相关空间协方差矩阵参数化。这些参数通过期望最大化算法最大化似然性进行估计，并借助多通道维纳滤波实现信号分离。
我们提出基于源方差的先验信息，应用非负分解来估计这些参数。在非负分解中，可将频谱基矩阵定义为先验信息。这些矩阵可通过预先训练得到的冗余库直接提取或间接获取。在独立步骤中，应用非负张量分解，本文提出两种算法以提取或检测最能表征观测混合物中源信号功率谱的基矩阵。该分解通过乘法更新规则最小化$β$-散度实现，其稀疏性可通过调整$β$值进行控制。
实验表明，提升分离性能的关键在于分解的稀疏性，而非训练中设定的$β$值。所提方法在多种混合条件下进行评估，相较于其他同类算法展现出更优的分离质量。

Abstract

In Gaussian model-based multichannel audio source separation, the likelihood of observed mixtures of source signals is parametrized by source spectral variances and by associated spatial covariance matrices. These parameters are estimated by maximizing the likelihood through an Expectation-Maximization algorithm and used to separate the signals by means of multichannel Wiener filtering. We propose to estimate these parameters by applying nonnegative factorization based on prior information on source variances. In the nonnegative factorization, spectral basis matrices can be defined as the prior information. The matrices can be either extracted or indirectly made available through a redundant library that is trained in advance. In a separate step, applying nonnegative tensor factorization, two algorithms are proposed in order to either extract or detect the basis matrices that best represent the power spectra of the source signals in the observed mixtures. The factorization is achieved by minimizing the $β$-divergence through multiplicative update rules. The sparsity of factorization can be controlled by tuning the value of $β$. Experiments show that sparsity, rather than the value assigned to $β$ in the training, is crucial in order to increase the separation performance. The proposed method was evaluated in several mixing conditions. It provides better separation quality with respect to other comparable algorithms.

Keywords: audio source separation, reverberant environments, β-divergence, nonnegative factorization, multichannel Wiener filtering, sparsity control, spectral basis matrices, separation performance
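
The phrase "minimizing the β-divergence through multiplicative update rules" can be illustrated with a small sketch. It shows the widely used multiplicative updates for plain nonnegative matrix factorization V ≈ WH, not the paper's tensor factorization or its use of pre-trained spectral bases. The matrix sizes, the random data, and the choice β = 1.5 are assumptions made for the example; the closed-form divergence below is only valid for β outside {0, 1}.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def beta_divergence(v: Matrix, v_hat: Matrix, beta: float) -> float:
    """Summed element-wise beta-divergence (closed form for beta not in {0, 1})."""
    scale = beta * (beta - 1.0)
    return sum((x ** beta + (beta - 1.0) * y ** beta - beta * x * y ** (beta - 1.0)) / scale
               for row, row_hat in zip(v, v_hat) for x, y in zip(row, row_hat))


def _weights(v: Matrix, v_hat: Matrix, beta: float) -> Tuple[Matrix, Matrix]:
    """Numerator and denominator terms shared by both factor updates."""
    num = [[x * y ** (beta - 2.0) for x, y in zip(r, rh)] for r, rh in zip(v, v_hat)]
    den = [[y ** (beta - 1.0) for y in rh] for rh in v_hat]
    return num, den


def _scaled(factor: Matrix, num: Matrix, den: Matrix) -> Matrix:
    return [[f * n / d for f, n, d in zip(fr, nr, dr)]
            for fr, nr, dr in zip(factor, num, den)]


def mu_step(v: Matrix, w: Matrix, h: Matrix, beta: float) -> Tuple[Matrix, Matrix]:
    """One multiplicative update of H, then of W; factors stay nonnegative."""
    num, den = _weights(v, matmul(w, h), beta)
    wt = transpose(w)
    h = _scaled(h, matmul(wt, num), matmul(wt, den))
    num, den = _weights(v, matmul(w, h), beta)
    ht = transpose(h)
    w = _scaled(w, matmul(num, ht), matmul(den, ht))
    return w, h


rng = random.Random(0)


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.uniform(0.1, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 6 frequency bins, 8 frames, 2 basis vectors.
V, W, H = rand_matrix(6, 8), rand_matrix(6, 2), rand_matrix(2, 8)
BETA = 1.5
before = beta_divergence(V, matmul(W, H), BETA)
for _ in range(50):
    W, H = mu_step(V, W, H, BETA)
after = beta_divergence(V, matmul(W, H), BETA)
print(before, after)
```

Because each update only multiplies by a ratio of nonnegative terms, the factors never leave the nonnegative orthant, which is why no projection step is needed. With β = 2 the divergence reduces to half the squared Euclidean error.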

100. ❌ Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe

Authors: Mahounan Pericles Adjovi, Roald Eiselen, Prasenjit Mitra Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12477v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

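Scoring rationale: The paper's core topic is the use of LLMs to generate data for low-resource languages. It systematically compares prompting strategies for extracting Hausa and Fongbe text from GPT-4o Mini and Gemini 2.5 Flash, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not cover the technical principles, methods, or applications behind any other keyword, such as MoE, SLMs, training techniques, reasoning methods, agent systems, or model optimization, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

By systematically comparing six prompting strategies, this study explores how to extract usable text data for two West African low-resource languages (Hausa and Fongbe) from large language models (GPT-4o Mini and Gemini 2.5 Flash). It finds that GPT-4o Mini is markedly more efficient at extraction and that the best strategy differs by language.

Abstract (Chinese translation)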

大型语言模型（LLMs）的训练数据来源于低资源语言社群的贡献，然而这些模型所编码的语言知识目前仅能通过商业应用程序接口（APIs）获取。本文探讨是否能够通过策略性提示（prompting）从LLMs中提取可用于研究的文本数据，并以两种西非语言为例：豪萨语（亚非语系，约8000万使用者）和丰语（尼日尔-刚果语系，约200万使用者）。我们系统比较了两种商业LLM（GPT-4o Mini和Gemini 2.5 Flash）在六类诱导任务（elicitation task types）上的表现。GPT-4o Mini在每次API调用中提取的可使用目标语言词汇量是Gemini的6至41倍。最优策略因语言而异：豪萨语适合采用功能性文本和对话生成，而丰语则需要约束性生成提示（constrained generation prompts）。我们公开了所有生成语料库及代码。

Abstract

Large language models (LLMs) are trained on data contributed by low-resource language communities, yet the linguistic knowledge encoded in these models remains accessible only through commercial APIs. This paper investigates whether strategic prompting can extract usable text data from LLMs for two West African languages: Hausa (Afroasiatic, approximately 80 million speakers) and Fongbe (Niger-Congo, approximately 2 million speakers). We systematically compare six elicitation task types across two commercial LLMs (GPT-4o Mini and Gemini 2.5 Flash). GPT-4o Mini extracts 6-41 times more usable target-language words per API call than Gemini. Optimal strategies differ by language: Hausa benefits from functional text and dialogue, while Fongbe requires constrained generation prompts. We release all generated corpora and code.

Keywords: Large language models, Low-resource languages, Elicitation strategies, Prompting, Hausa, Fongbe, Text data extraction, Comparative analysis

101. ❌ From Kinematics to Dynamics: Learning to Refine Hybrid Plans for Physically Feasible Execution

Authors: Lidor Erez, Shahaf S. Shperberg, Ayal Taitler Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12474v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

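Scoring rationale: The paper focuses on mixed discrete-continuous planning in robotic tasks and uses reinforcement learning to restore physical feasibility. It does not touch on large models, deep learning technical principles, AI for science, or any other keyword area. All keywords relate to large language models, deep learning techniques, or scientific applications of AI, whereas this paper studies a conventional robot control and optimization problem, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes a reinforcement learning method that defines a Markov Decision Process incorporating second-order constraints and uses it to refine the first-order trajectories produced by a hybrid planner, addressing the physical feasibility of mixed discrete-continuous planning in robotic tasks. The method recovers physical feasibility and narrows the gap between the planned trajectory and the dynamics required for real execution.

Abstract (Chinese translation)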

在许多机器人任务中，智能体必须穿越一系列空间区域以完成使命。此类问题本质上是混合离散-连续的：既涉及高层动作序列，也涉及物理上可行的连续轨迹。生成的轨迹与动作序列还必须满足问题约束，如截止时间、时间窗口以及速度或加速度限制。尽管混合时序规划器试图应对这一挑战，但它们通常采用线性（一阶）动力学模型来描述运动，这无法保证生成的计划符合机器人的真实物理约束。因此，即使高层动作序列已确定，生成动态可行的轨迹仍会成为一个双层优化问题。我们通过连续空间中的强化学习来解决这一问题。我们定义了一个明确包含解析二阶约束的马尔可夫决策过程，并利用它来优化由混合规划器生成的一阶计划。研究结果表明，该方法能够可靠地恢复物理可行性，并有效弥合规划器初始一阶轨迹与实际执行所需动力学之间的差距。

Abstract

In many robotic tasks, agents must traverse a sequence of spatial regions to complete a mission. Such problems are inherently mixed discrete-continuous: a high-level action sequence and a physically feasible continuous trajectory. The resulting trajectory and action sequence must also satisfy problem constraints such as deadlines, time windows, and velocity or acceleration limits. While hybrid temporal planners attempt to address this challenge, they typically model motion using linear (first-order) dynamics, which cannot guarantee that the resulting plan respects the robot’s true physical constraints. Consequently, even when the high-level action sequence is fixed, producing a dynamically feasible trajectory becomes a bi-level optimization problem. We address this problem via reinforcement learning in continuous space. We define a Markov Decision Process that explicitly incorporates analytical second-order constraints and use it to refine first-order plans generated by a hybrid planner. Our results show that this approach can reliably recover physical feasibility and effectively bridge the gap between a planner’s initial first-order trajectory and the dynamics required for real execution.

Keywords: robotic tasks, hybrid planning, reinforcement learning, physical feasibility, Markov Decision Process, second-order constraints, trajectory optimization, bi-level optimization
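
Why a first-order plan can fail a second-order check is easy to show numerically. The sketch below samples a one-dimensional trajectory at a fixed time step and estimates velocity and acceleration by finite differences. It only illustrates the feasibility gap the abstract describes and is not the paper's MDP or refinement method. The function names, the sample trajectories, and the limits `v_max` and `a_max` are hypothetical.

```python
from typing import List, Tuple


def max_abs_derivatives(positions: List[float], dt: float) -> Tuple[float, float]:
    """Largest absolute velocity and acceleration by finite differences."""
    velocities = [(b - a) / dt for a, b in zip(positions, positions[1:])]
    accelerations = [(b - a) / dt for a, b in zip(velocities, velocities[1:])]
    max_v = max((abs(v) for v in velocities), default=0.0)
    max_a = max((abs(a) for a in accelerations), default=0.0)
    return max_v, max_a


def is_dynamically_feasible(positions: List[float], dt: float,
                            v_max: float, a_max: float) -> bool:
    """True when the sampled trajectory respects both limits."""
    max_v, max_a = max_abs_derivatives(positions, dt)
    return max_v <= v_max and max_a <= a_max


DT = 0.5  # seconds between samples (assumed)

# First-order plan: the velocity jumps from 0 to 1 in a single step.
first_order_plan = [0.0, 0.0, 0.0, 0.5, 1.0, 1.5]
# Refined plan: the same cruise velocity, reached through a gradual ramp.
refined_plan = [0.0, 0.0, 0.125, 0.5, 1.0, 1.5]

print(is_dynamically_feasible(first_order_plan, DT, v_max=1.5, a_max=1.0))  # False
print(is_dynamically_feasible(refined_plan, DT, v_max=1.5, a_max=1.0))      # True
```

Both trajectories satisfy the velocity limit, so a planner that models only first-order dynamics would accept either one. Only the acceleration check separates them.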

102. ❌ Intelligent ROI-Based Vehicle Counting Framework for Automated Traffic Monitoring

Authors: Mohamed A. Abdelwahab, Zaynab Al-Ariny, Mahmoud Fakhry, El-Sayed Hasaneen Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12470v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

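Scoring rationale: This paper focuses on traffic monitoring in computer vision and proposes an ROI-based vehicle counting framework involving conventional CV techniques such as object detection, tracking, and density estimation. It is entirely unrelated to the scoring keywords (all of which center on large models, deep learning technical principles, and their applications in science) and covers no large models, language models, training methods, inference techniques, alignment, compression, agent systems, or AI for Science content.

!!! tip deepseek-chat TL;DR

This paper proposes an automated vehicle counting framework based on an adaptive ROI, combining detection, tracking, and density estimation to improve computational efficiency and accuracy. It achieves high accuracy and faster processing on several benchmark datasets.

Abstract (Chinese translation)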

通过视频监控实现精确的车辆计数对于高效交通管理至关重要。然而，在确保计算效率的同时实现高计数精度仍是一个挑战。为此，我们提出了一种全自动、基于视频的车辆计数框架，旨在优化计算效率和计数精度。该框架在两个不同阶段运行：估计阶段与预测阶段。在估计阶段，我们基于检测分数、跟踪分数和车辆密度，通过一种新颖的三模型组合自动确定最佳感兴趣区域（Region of Interest, ROI）。这种自适应方法确保了与任何检测和跟踪方法的兼容性，增强了框架的通用性。在预测阶段，车辆计数在估计出的ROI内高效执行。我们在UA-DETRAC、GRAM、CDnet 2014和ATON等基准数据集上评估了该框架。结果表明，该框架具有卓越的准确性，大多数视频实现了100%的准确率，同时显著提升了计算效率，处理速度比全帧处理快达四倍。该框架优于现有技术，尤其在复杂的多道路场景中，展现出鲁棒性和更高的准确性。这些进展使其成为实时交通监控领域一个有前景的解决方案。

Abstract

Accurate vehicle counting through video surveillance is crucial for efficient traffic management. However, achieving high counting accuracy while ensuring computational efficiency remains a challenge. To address this, we propose a fully automated, video-based vehicle counting framework designed to optimize both computational efficiency and counting accuracy. Our framework operates in two distinct phases: *estimation* and *prediction*. In the estimation phase, the optimal region of interest (ROI) is automatically determined using a novel combination of three models based on detection scores, tracking scores, and vehicle density. This adaptive approach ensures compatibility with any detection and tracking method, enhancing the framework’s versatility. In the prediction phase, vehicle counting is efficiently performed within the estimated ROI. We evaluated our framework on benchmark datasets like UA-DETRAC, GRAM, CDnet 2014, and ATON. Results demonstrate exceptional accuracy, with most videos achieving 100% accuracy, while also enhancing computational efficiency, making processing up to four times faster than full-frame processing. The framework outperforms existing techniques, especially in complex multi-road scenarios, demonstrating robustness and superior accuracy. These advancements make it a promising solution for real-time traffic monitoring.

Keywords: vehicle counting, traffic monitoring, region of interest, computational efficiency, detection and tracking, video surveillance, real-time processing, adaptive framework

103. ❌ CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems

Authors: Yongxuan Wu, Xixun Lin, He Zhang, Nan Sun, Kun Wang, Chuan Zhou, Shirui Pan, Yanan Cao Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12461v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

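Scoring rationale: The paper's core topic is a communication topology inference attack on LLM-based Multi-Agent Systems (MAS), which is highly relevant to "LLM Agents/Autonomous Agents/Agentic Workflow" and "Multi-agent Systems/Agent Coordination" (10 points each). It involves reasoning processes, which gives it some relevance to "Chain of Thought/CoT Reasoning/Multi-step Reasoning" and "System 2 Thinking/Slow Thinking/In-depth Reasoning" (5 points each). Other keywords, such as those on foundational LLM techniques, training methods, and optimization techniques, are not addressed (0 points).

!!! tip deepseek-chat TL;DR

This paper shows that, under a restrictive black-box setting, the proposed Communication Inference Attack (CIA) can effectively infer the communication topology of LLM-based multi-agent systems, revealing a privacy risk. The average AUC in the experiments reaches 0.87.

Abstract (Chinese translation)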

基于大语言模型（LLM）的多智能体系统（Multi-Agent Systems, MAS）在解决复杂任务方面已展现出卓越能力。MAS的核心在于其通信拓扑结构，该结构决定了智能体内部如何交换信息。因此，通信拓扑的安全性日益受到关注。本文研究了一种关键的隐私风险：在受限的黑盒设置下，MAS的通信拓扑可能被推断出来，从而暴露系统漏洞并构成重大的知识产权威胁。为探究此风险，我们提出了通信推断攻击（Communication Inference Attack, CIA），这是一种新型攻击方法，通过构建新的对抗性查询来诱导中间智能体的推理输出，并借助所提出的全局偏差解耦与LLM引导的弱监督技术，对其语义相关性进行建模。在具有优化通信拓扑的MAS上进行的大量实验证明了CIA的有效性，其平均AUC达到0.87，峰值AUC高达0.99，从而揭示了MAS中存在的重大隐私风险。

Abstract

LLM-based Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in solving complex tasks. Central to MAS is the communication topology which governs how agents exchange information internally. Consequently, the security of communication topologies has attracted increasing attention. In this paper, we investigate a critical privacy risk: MAS communication topologies can be inferred under a restrictive black-box setting, exposing system vulnerabilities and posing significant intellectual property threats. To explore this risk, we propose Communication Inference Attack (CIA), a novel attack that constructs new adversarial queries to induce intermediate agents’ reasoning outputs and models their semantic correlations through the proposed global bias disentanglement and LLM-guided weak supervision. Extensive experiments on MAS with optimized communication topologies demonstrate the effectiveness of CIA, achieving an average AUC of 0.87 and a peak AUC of up to 0.99, thereby revealing the substantial privacy risk in MAS.

Keywords: LLM-based Multi-Agent Systems, Communication Topology, Privacy Risk, Inference Attack, Black-box Setting, Semantic Correlation, Adversarial Queries, Reasoning Outputs
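
To interpret figures such as "average AUC of 0.87", it helps to see how an AUC can be computed for topology inference: each candidate agent pair gets a score, and the AUC is the probability that a true communication edge is scored above a non-edge. The sketch below uses the pairwise (Mann-Whitney) definition. The labels and scores are made up for illustration, and the report does not state which AUC implementation the authors used.

```python
from typing import List


def pairwise_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (edge, non-edge) pairs ranked correctly; ties count 0.5."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one edge and one non-edge")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical ground truth for six agent pairs (1 = agents communicate directly).
edge_labels = [1, 1, 0, 0, 1, 0]
# Hypothetical attack scores, e.g. a semantic-correlation estimate per pair.
edge_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]

auc = pairwise_auc(edge_labels, edge_scores)
print(round(auc, 3))  # 0.889
```

Here eight of the nine (edge, non-edge) pairs are ordered correctly. An AUC of 0.5 corresponds to random guessing, so values near 0.87 indicate that most of the topology can be recovered from outputs alone.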

104. ❌ Euler-inspired Decoupling Neural Operator for Efficient Pansharpening

Authors: Anqi Zhu, Mengting Ma, Yizhen Jiang, Xiangdong Li, Kai Zheng, Jiaxin Li, Wei Zhang Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12463v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

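Scoring rationale: The paper focuses on the pansharpening task in remote sensing image processing and proposes a physics-inspired neural network framework based on Euler's formula (EDNO) for efficiently fusing spatial texture and spectral information. All scoring keywords relate directly to large language models, innovations in deep learning technical principles, or applications of AI in science. This paper instead studies a specific image processing problem in computer vision and involves no large language model techniques, innovations in foundational deep learning principles, or AI applications in scientific fields such as biology or chemistry, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes a physics-inspired neural network framework based on Euler's formula (EDNO) that uses continuous functional mapping in the frequency domain and an explicit-implicit interaction mechanism to address spectral-spatial blurring and high computational cost in pansharpening, achieving a better efficiency-performance balance than heavyweight architectures.

Abstract (Chinese translation)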

全色锐化旨在通过融合全色图像的空间纹理与低分辨率多光谱图像的光谱信息，合成高分辨率多光谱图像。尽管近期的深度学习范式，特别是基于扩散算子的方法，已推动了性能边界的提升，但其随机性和迭代采样特性常导致光谱-空间模糊及高昂的计算成本。本文提出了一种受物理学启发的框架——欧拉启发的解耦神经算子，该框架将全色锐化重新定义为频域中的连续函数映射。与传统笛卡尔特征处理方式不同，我们的欧拉启发解耦神经算子利用欧拉公式将特征转换至极坐标系，实现了一种新颖的显式-隐式交互机制。具体而言，我们开发了欧拉特征交互层，将融合任务解耦为两个专用模块：1）显式特征交互模块，采用线性加权方案模拟相位旋转，以实现自适应几何对齐；2）隐式特征交互模块，利用前馈网络建模光谱分布，以达成优越的色彩一致性。通过在频域中操作，欧拉启发解耦神经算子本质上能够捕获全局感受野，同时保持离散不变性。在三个数据集上的实验结果表明，与重量级架构相比，欧拉启发解耦神经算子提供了更优的效率-性能平衡。

Abstract

Pansharpening aims to synthesize high-resolution multispectral (HR-MS) images by fusing the spatial textures of panchromatic (PAN) images with the spectral information of low-resolution multispectral (LR-MS) images. While recent deep learning paradigms, especially diffusion-based operators, have pushed the performance boundaries, they often encounter spectral-spatial blurring and prohibitive computational costs due to their stochastic nature and iterative sampling. In this paper, we propose the Euler-inspired Decoupling Neural Operator (EDNO), a physics-inspired framework that redefines pansharpening as a continuous functional mapping in the frequency domain. Departing from conventional Cartesian feature processing, our EDNO leverages Euler’s formula to transform features into a polar coordinate system, enabling a novel explicit-implicit interaction mechanism. Specifically, we develop the Euler Feature Interaction Layer (EFIL), which decouples the fusion task into two specialized modules: 1) Explicit Feature Interaction Module, utilizing a linear weighting scheme to simulate phase rotation for adaptive geometric alignment; and 2) Implicit Feature Interaction Module, employing a feed-forward network to model spectral distributions for superior color consistency. By operating in the frequency domain, EDNO inherently captures global receptive fields while maintaining discretization-invariance. Experimental results on the three datasets demonstrate that EDNO offers a superior efficiency-performance balance compared to heavyweight architectures.

Keywords: pansharpening, neural operator, Euler’s formula, frequency domain, explicit-implicit interaction, spectral-spatial fusion, computational efficiency, remote sensing
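
To make the polar-coordinate idea in the abstract concrete, here is a minimal NumPy sketch. It is not the EDNO architecture: it only shows how Euler's formula splits a frequency-domain feature into amplitude and phase and recombines them. The function names and the `mix_phase` weighting rule are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

def to_polar(x):
    """2-D FFT of a real feature map, split into amplitude and phase."""
    spec = np.fft.fft2(x)
    return np.abs(spec), np.angle(spec)

def from_polar(amplitude, phase):
    """Recombine with Euler's formula: A * exp(i*phi) = A * (cos(phi) + i*sin(phi))."""
    spec = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spec).real

def mix_phase(phase_a, phase_b, w):
    """Hypothetical linear weighting of two phase maps, done on the unit circle
    so that the result is again a valid angle."""
    z = w * np.exp(1j * phase_a) + (1.0 - w) * np.exp(1j * phase_b)
    return np.angle(z)

rng = np.random.default_rng(0)
pan = rng.random((8, 8))   # stand-in for a PAN feature map
ms = rng.random((8, 8))    # stand-in for an upsampled MS feature map

amp_pan, phase_pan = to_polar(pan)
amp_ms, phase_ms = to_polar(ms)

# Keep the MS amplitude and blend the two phases with an assumed weight of 0.5.
fused = from_polar(amp_ms, mix_phase(phase_pan, phase_ms, 0.5))
```

Because every frequency coefficient depends on all spatial positions, any operation on `amplitude` or `phase` acts on the whole image at once, which is the sense in which frequency-domain processing has a global receptive field.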

105. ❌ Enhancing Clustering: An Explainable Approach via Filtered Patterns

Authors: Motaz Ben Hassine, Saïd Jabbour Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12460v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

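Scoring rationale: The paper focuses on explainable clustering (conceptual clustering), a classical machine learning approach that uses SAT solvers and integer linear programming for pattern generation and cluster selection. Although the paper touches on explainable AI (XAI), every keyword targets large language models (LLMs) and related techniques (MoE, RLHF, RAG, and so on), and the paper involves no LLMs, deep learning, or LLM-specific techniques. The only relevant keyword is 'Mechanistic Interpretability OR Explainable AI', because the paper explicitly studies explainable clustering as part of XAI; this is classical XAI rather than interpretability of large models. All other keywords are therefore entirely unrelated and score 0.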

!!! tip deepseek-chat TL;DR

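This paper addresses redundancy in explainable clustering that arises when distinct k-relaxed frequent patterns yield identical k-covers. Through theoretical analysis and an optimization strategy that removes redundant patterns, it significantly reduces the search space and improves computational efficiency while preserving clustering quality.

摘要翻译 (Chinese translation of the abstract)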

机器学习已成为核心研究领域，其中可解释聚类（亦称概念聚类）日益受到关注。这是一种知识驱动的无监督学习范式，其将数据划分为θ个互不相交的簇，每个簇由一个显式的符号化表征所描述，通常表达为闭模式或项集。通过提供人类可理解的簇描述，可解释聚类在可解释人工智能和知识发现中扮演着重要角色。近期研究通过引入k-松弛频繁模式（k-RFPs）提升了聚类质量，该模式模型通过广义的k覆盖定义放松了严格的覆盖约束。该框架整合了基于约束的推理（使用SAT求解器进行模式生成）与组合优化（使用整数线性规划进行簇选择）。尽管有效，但该方法存在一个关键局限：多个不同的k-RFP可能诱导出相同的k覆盖，导致产生冗余的符号化表征，这些表征不必要地扩大了搜索空间，并增加了簇构建过程中的计算复杂度。本文通过一种模式约简框架来解决此冗余问题。我们的贡献有三方面：首先，我们形式化地刻画了不同k-RFP诱导出相同k覆盖的条件，为冗余检测提供了理论基础；其次，我们提出一种优化策略，通过为每个不同的k覆盖保留单一代表性模式来移除冗余模式；第三，我们通过分析ILP模型所选模式相对于其诱导簇的鲁棒性，探究了这些模式的可解释性与代表性。在多个真实数据集上进行的大量实验表明，所提方法能显著缩减模式搜索空间，提升计算效率，并在某些情况下保持甚至提高了最终簇的质量。

摘要 (Abstract)

Machine learning has become a central research area, with increasing attention devoted to explainable clustering, also known as conceptual clustering, which is a knowledge-driven unsupervised learning paradigm that partitions data into $θ$ disjoint clusters, where each cluster is described by an explicit symbolic representation, typically expressed as a closed pattern or itemset. By providing human-interpretable cluster descriptions, explainable clustering plays an important role in explainable artificial intelligence and knowledge discovery. Recent work improved clustering quality by introducing k-relaxed frequent patterns (k-RFPs), a pattern model that relaxes strict coverage constraints through a generalized k-cover definition. This framework integrates constraint-based reasoning, using SAT solvers for pattern generation, with combinatorial optimization, using Integer Linear Programming (ILP) for cluster selection. Despite its effectiveness, this approach suffers from a critical limitation: multiple distinct k-RFPs may induce identical k-covers, leading to redundant symbolic representations that unnecessarily enlarge the search space and increase computational complexity during cluster construction. In this paper, we address this redundancy through a pattern reduction framework. Our contributions are threefold. First, we formally characterize the conditions under which distinct k-RFPs induce identical k-covers, providing theoretical foundations for redundancy detection. Second, we propose an optimization strategy that removes redundant patterns by retaining a single representative pattern for each distinct k-cover. Third, we investigate the interpretability and representativeness of the patterns selected by the ILP model by analyzing their robustness with respect to their induced clusters. Extensive experiments conducted on several real-world datasets demonstrate that the proposed approach significantly reduces the pattern search space, improves computational efficiency, and preserves, and in some cases enhances, the quality of the resulting clusters.

Keywords: explainable clustering, conceptual clustering, k-relaxed frequent patterns, pattern reduction, SAT solvers, Integer Linear Programming, redundancy detection, computational efficiency
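
The redundancy-removal step described in the abstract (keep one representative pattern per distinct k-cover) can be sketched in a few lines of Python. The abstract does not define the k-cover, so this sketch assumes that a transaction is k-covered by a pattern when it misses at most `k` of the pattern's items; the paper's formal definition may differ. The helper names are hypothetical.

```python
def k_cover(pattern, transactions, k):
    """Indices of transactions that miss at most k items of `pattern`
    (an assumed definition, used only for illustration)."""
    return frozenset(
        i for i, t in enumerate(transactions) if len(pattern - t) <= k
    )

def reduce_patterns(patterns, transactions, k):
    """Keep the first pattern seen for each distinct k-cover."""
    representatives = {}
    for p in patterns:
        cover = k_cover(p, transactions, k)
        representatives.setdefault(cover, p)
    return list(representatives.values())

transactions = [frozenset("abc"), frozenset("ab"), frozenset("c")]
patterns = [frozenset("ab"), frozenset("abc"), frozenset("c")]

# With k = 1, {a, b} and {a, b, c} both cover transactions 0 and 1,
# so only the first of them is kept.
kept = reduce_patterns(patterns, transactions, k=1)
```

Grouping by the cover itself keeps the sketch simple; a production system would more likely apply the paper's characterization of when two patterns share a cover, which avoids materializing every cover.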

106. ❌ X-VC: Zero-shot Streaming Voice Conversion in Codec Space

Authors: Qixi Zheng, Yuxiang Zhao, Tianrui Wang, Wenxi Chen, Kele Xu, Yikang Li, Qinyuan Chen, Xipeng Qiu, Kai Yu, Xie Chen Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12456v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

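Scoring rationale: The X-VC paper focuses on voice conversion (VC), specifically zero-shot streaming voice conversion. It performs one-step conversion in the latent space of a pretrained neural codec and uses a dual-conditioning acoustic converter, adaptive normalization, a role-assignment strategy, and chunkwise inference. Although the paper applies deep learning to speech processing, its core content is unrelated to most keywords, particularly those on large language model techniques. It has only a partial connection to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because it uses a pretrained neural codec and involves domain adaptation (adapting to different speakers in voice conversion). No other keyword is mentioned in or relevant to the paper.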

!!! tip deepseek-chat TL;DR

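The paper proposes X-VC, a zero-shot streaming voice conversion system that performs one-step conversion in the latent space of a pretrained neural codec, achieving high-quality, low-latency conversion and the best streaming WER among the compared baselines on Seed-TTS-Eval.

摘要翻译 (Chinese translation of the abstract)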

零样本语音转换旨在将源语音转换为未见目标说话者的音色，同时保持其语言内容。尽管现有系统已提升转换质量，但为交互场景构建零样本语音转换系统仍具挑战，因为高保真说话人音色迁移与低延迟流式推理难以同时实现。本研究提出X-VC——一种在预训练神经编解码器隐空间内执行一步转换的零样本流式语音转换系统。该系统采用双条件声学转换器，联合建模源编解码器隐变量与从目标参考语音提取的帧级声学条件，同时通过自适应归一化注入语句级目标说话人信息。为减少训练与推理间的差异，我们使用生成式配对数据及融合标准模式、重建模式与反向模式的角色分配策略进行模型训练。针对流式推理，我们进一步采用与编解码器分段训练范式对齐的含重叠平滑机制的分块推理方案。在Seed-TTS-Eval数据集上的实验表明：X-VC在英文和中文场景均取得最优的流式词错误率，在同语言与跨语言设定下均展现优异的说话人相似度，且离线实时因子显著低于基线系统。这些结果证明，基于编解码器隐空间的一步转换是构建高质量低延迟零样本语音转换系统的可行路径。音频样本详见https://x-vc.github.io，代码与模型检查点将同步开源。

摘要 (Abstract)

Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challenging because high-fidelity speaker transfer and low-latency streaming inference are difficult to achieve simultaneously. In this work, we present X-VC, a zero-shot streaming VC system that performs one-step conversion in the latent space of a pretrained neural codec. X-VC uses a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target speaker information through adaptive normalization. To reduce the mismatch between training and inference, we train the model with generated paired data and a role-assignment strategy that combines standard, reconstruction, and reversed modes. For streaming inference, we further adopt a chunkwise inference scheme with overlap smoothing that is aligned with the segment-based training paradigm of the codec. Experiments on Seed-TTS-Eval show that X-VC achieves the best streaming WER in both English and Chinese, strong speaker similarity in same-language and cross-lingual settings, and substantially lower offline real-time factor than the compared baselines. These results suggest that codec-space one-step conversion is a practical approach for building high-quality low-latency zero-shot VC systems. Audio samples are available at https://x-vc.github.io. Our code and checkpoints will also be released.

Keywords: zero-shot voice conversion, streaming inference, neural codec, dual-conditioning acoustic converter, adaptive normalization, chunkwise inference, low-latency, speaker similarity
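
The "chunkwise inference with overlap smoothing" mentioned in the abstract is, in general terms, a cross-fade between consecutive processed chunks. The sketch below shows that generic technique on a 1-D signal, assuming a linear cross-fade and a stateless `convert` callable; both are assumptions for illustration and not the X-VC implementation, which operates on codec latents.

```python
import numpy as np

def chunkwise_apply(signal, convert, chunk, overlap):
    """Process `signal` chunk by chunk and cross-fade the overlapping samples.

    `convert` is a hypothetical per-chunk model call that returns an array
    of the same length as its input.
    """
    assert 0 < overlap < chunk
    hop = chunk - overlap
    out = np.zeros(len(signal))
    fade_in = np.linspace(0.0, 1.0, overlap, endpoint=False)
    start = 0
    while start < len(signal):
        piece = np.asarray(convert(signal[start:start + chunk]), dtype=float)
        if start == 0:
            out[:len(piece)] = piece
        else:
            # Blend the head of this chunk with the tail already written.
            n = min(overlap, len(piece))
            w = fade_in[:n]
            out[start:start + n] = (1.0 - w) * out[start:start + n] + w * piece[:n]
            out[start + n:start + len(piece)] = piece[n:]
        if start + chunk >= len(signal):
            break
        start += hop
    return out

signal = np.arange(10, dtype=float)
# With an identity "model", the stitched output reproduces the input.
stitched = chunkwise_apply(signal, lambda x: x, chunk=4, overlap=2)
```

The overlap trades latency for smoothness: a larger `overlap` hides chunk-boundary discontinuities better but means more samples are processed twice.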

107. ❌ IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation

Authors: Haoyu Zheng, Tianwei Lin, Wei Wang, Zhuonan Wang, Wenqiao Zhang, Jiaqi Zhu, Feifei Shao Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12440v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

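Scoring rationale: The paper proposes the IAD-Unify framework, which uses Qwen3.5-4B as its vision-language backbone. It is an application of large models to industrial inspection and is therefore relevant to 'Large Language Models' and 'AI for Science'. The remaining keywords concern large-model technical principles, training methods, inference optimization, agent systems, and the like; the paper does not cover these technical details and only uses an off-the-shelf large model as a component, so their relevance is 0.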

!!! tip deepseek-chat TL;DR

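The paper proposes IAD-Unify, a unified framework in which a region expert injects anomaly evidence into a vision-language model to jointly handle industrial anomaly segmentation, understanding, and generation, and validates it on the newly constructed Anomaly-56K dataset.

摘要翻译 (Chinese translation of the abstract)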

现实工业检测不仅需要定位缺陷，还需用自然语言解释缺陷并生成可控的缺陷编辑。然而，现有方法无法在统一框架与评估协议中同时支持这三种能力。我们提出IAD-Unify——一种双编码器统一框架，其中冻结的基于DINOv2的区域专家通过轻量级令牌注入向共享的Qwen3.5-4B视觉语言骨干网络提供精确的异常证据，共同实现异常分割、区域锚定的理解与掩码引导的生成。为建立统一评估，我们进一步构建了Anomaly-56K——一个全面的统一多任务工业异常检测评估平台，涵盖24个类别、104种缺陷变体的59,916张图像。受控消融实验得出四项发现：（一）区域锚定是理解任务的决定性机制，移除该机制会使定位准确率下降超过76个百分点；（二）预测区域的性能与真实标注区域高度接近，证实了部署可行性；（三）基于区域锚定的生成实现了最佳的全图像保真度与掩码区域感知质量；（四）预初始化的联合训练以可忽略的生成代价（-0.16 dB）提升了理解能力。IAD-Unify在MMAD基准测试中进一步取得优异表现，包括训练阶段未见过的类别，展现出强大的跨类别泛化能力。

摘要 (Abstract)

Real-world industrial inspection requires not only localizing defects, but also explaining them in natural language and generating controlled defect edits. However, existing approaches fail to jointly support all three capabilities within a unified framework and evaluation protocol. We propose IAD-Unify, a dual-encoder unified framework in which a frozen DINOv2-based region expert supplies precise anomaly evidence to a shared Qwen3.5-4B vision-language backbone via lightweight token injection, jointly enabling anomaly segmentation, region-grounded understanding, and mask-guided generation. To enable unified evaluation, we further construct Anomaly-56K, a comprehensive unified multi-task IAD evaluation platform, spanning 59,916 images across 24 categories and 104 defect variants. Controlled ablations yield four findings: (i) region grounding is the decisive mechanism for understanding, removing it degrades location accuracy by >76 pp; (ii) predicted-region performance closely matches oracle, confirming deployment viability; (iii) region-grounded generation achieves the best full-image fidelity and masked-region perceptual quality; and (iv) pre-initialized joint training improves understanding at negligible generation cost (-0.16 dB). IAD-Unify further achieves strong performance on the MMAD benchmark, including categories unseen during training, demonstrating robust cross-category generalization.

Keywords: Industrial Anomaly Detection, Unified Model, Vision-Language Model, Region Grounding, Anomaly Segmentation, Defect Generation, Cross-category Generalization, Qwen3.5-4B

108. ❌ Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation

Authors: Sihang Jia, Shuliang Liu, Songbo Yang, Yibo Yan, Xin Zou, Xuming Hu Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12424v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

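Scoring rationale: The paper targets hallucination in multimodal large language models (MLLMs) and proposes a training-free mitigation framework called DeP. It is therefore highly relevant to 'Large Language Models' (10 points), since MLLMs extend LLMs, and to 'Hallucination Mitigation' (10 points), which is the paper's core research problem. It has some connection to 'Self-Correction' (5 points), because DeP corrects model outputs through perturbation and adjustment, and to 'Mechanistic Interpretability' (5 points), because the paper analyzes the mechanism behind hallucination (language priors dominating visual evidence) and offers an interpretable intervention. Other keywords (MoE, SFT, RAG, quantization, and so on) are not covered in the paper and score 0.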

!!! tip deepseek-chat TL;DR

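The paper addresses inference hallucinations in multimodal large language models that arise when language priors dominate visual evidence. It proposes DeP, a training-free framework that mitigates hallucinations through dynamic textual perturbation, and reports superior performance across multiple benchmarks.

摘要翻译 (Chinese translation of the abstract)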

多模态大语言模型常出现推理幻觉，部分源于语言先验对视觉证据的压制。现有免训练缓解方法要么扰动视觉表征使其偏离自然图像分布，要么采用侵入式操作损害模型固有的生成流畅性。我们提出一种新视角：多模态幻觉在解码阶段表现为视觉 grounding 对文本表述的过度敏感。基于此，我们提出解码扰动框架（DeP），这是一种通过受控文本干预缓解先验诱发幻觉的免训练方法。DeP采用动态探针，通过多层级文本扰动来激发潜在语言先验。该方法利用注意力方差，增强特征空间中的稳定证据区域，同时抑制可疑噪声。此外，它通过logits统计构建可解释的先验漂移方向，以抵消文本共现带来的概率偏差。大量实验证实，DeP能有效减少幻觉，并在多个基准测试中取得优越性能。

摘要 (Abstract)

Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model’s inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.

Keywords: Multimodal Large Language Models, Hallucination Mitigation, Textual Perturbation, Visual Grounding, Training-free Framework, Language Priors, Decoding Phase, Attention Variance

109. ❌ RACF: A Resilient Autonomous Car Framework with Object Distance Correction

Authors: Chieh Tsai, Hossein Rastgoftar, Salim Hariri Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12418v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

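Scoring rationale: The paper focuses on perception-layer robustness in autonomous driving systems. It proposes a multi-sensor fusion framework (RACF) that combines a depth camera, LiDAR, and kinematics, together with a distance correction algorithm (ODCA), to cope with environmental degradation and adversarial attacks. All scoring keywords concern large models, deep learning principles, or AI applications in science, whereas the core of this paper is sensor fusion, real-time perception, and physical models. It involves no large models, deep learning, scientific AI applications, or related techniques, so every keyword has a relevance of 0.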

!!! tip deepseek-chat TL;DR

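The paper proposes a Resilient Autonomous Car Framework (RACF) with object distance correction. Using multi-sensor fusion and a real-time correction algorithm, it reduces the RMSE of distance estimation by up to 35% under strong corruption and improves stop compliance and braking latency.

摘要翻译 (Chinese translation of the abstract)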

自动驾驶车辆正日益广泛地部署于安全关键型应用中，在此类场景下，感知系统故障或信息物理攻击可能导致不安全操作，进而造成人员伤亡和/或严重的物理损害。因此，可靠且实时的感知对其安全运行和社会接受度至关重要。例如，基于视觉的距离估计易受环境退化与对抗性扰动的影响，而现有防御措施多为被动响应且反应迟缓，难以及时缓解其对安全运行的影响。本文提出一种弹性自动驾驶汽车框架（Resilient Autonomous Car Framework, RACF），该框架整合了目标距离校正算法（Object Distance Correction Algorithm, ODCA），通过融合深度相机、激光雷达（LiDAR）及基于物理学的运动学模型所提供的冗余与多样性信息，以提升感知层的鲁棒性。在此框架内，当深度相机产生的障碍物距离估计值出现不一致时，跨传感器门控机制将激活校正算法以修正检测到的不一致性。我们在基于Quanser QCar 2平台实现的测试床上对所提出的弹性汽车框架进行了实验，并评估其性能。实验表明，该框架在强干扰条件下实现了高达35%的均方根误差（RMSE）降低，同时提升了停车合规性与制动响应速度，且能够实时运行。这些结果证明了一种面向安全关键型自动驾驶的实用且轻量化的弹性感知实现路径。

摘要 (Abstract)

Autonomous vehicles are increasingly deployed in safety-critical applications, where sensing failures or cyber-physical attacks can lead to unsafe operations resulting in human loss and/or severe physical damage. Reliable real-time perception is therefore critically important for their safe operation and acceptability. For example, vision-based distance estimation is vulnerable to environmental degradation and adversarial perturbations, and existing defenses are often reactive and too slow to promptly mitigate their impact on safe operation. We present a Resilient Autonomous Car Framework (RACF) that incorporates an Object Distance Correction Algorithm (ODCA) to improve perception-layer robustness through redundancy and diversity across a depth camera, LiDAR, and physics-based kinematics. Within this framework, when the obstacle distance estimate produced by the depth camera is inconsistent, a cross-sensor gate activates the correction algorithm to fix the detected inconsistency. We experimented with the proposed resilient car framework and evaluated its performance on a testbed implemented using the Quanser QCar 2 platform. The presented framework achieved up to 35% RMSE reduction under strong corruption and improved stop compliance and braking latency, while operating in real time. These results demonstrate a practical and lightweight approach to resilient perception for safety-critical autonomous driving.

Keywords: Autonomous Vehicles, Resilient Perception, Object Distance Correction, Sensor Fusion, Real-time Safety, Depth Camera, LiDAR, Kinematics
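
The cross-sensor gate described in the abstract can be pictured as a consistency check followed by a fallback estimate. The sketch below is purely illustrative: the abstract does not specify ODCA, so the gate threshold, the constant-velocity prediction, and the averaging rule are all assumptions, and the function name is hypothetical.

```python
def corrected_distance(camera_m, lidar_m, prev_m, speed_mps, dt_s, gate_m=0.5):
    """Return a distance estimate in metres.

    camera_m, lidar_m: current depth-camera and LiDAR readings.
    prev_m: last accepted distance; speed_mps: closing speed; dt_s: time step.
    gate_m: assumed disagreement threshold that triggers correction.
    """
    # Constant-velocity prediction toward a static obstacle (an assumption).
    predicted = prev_m - speed_mps * dt_s
    if abs(camera_m - lidar_m) <= gate_m:
        # Sensors agree: trust the camera reading.
        return camera_m
    # Sensors disagree: fall back to the redundant sources (assumed rule).
    return 0.5 * (lidar_m + predicted)

ok = corrected_distance(5.0, 5.1, prev_m=5.2, speed_mps=1.0, dt_s=0.1)
fixed = corrected_distance(9.0, 5.0, prev_m=5.2, speed_mps=1.0, dt_s=0.1)
```

In the first call the readings agree and the camera value passes through; in the second the camera reading is rejected and replaced by a blend of LiDAR and the kinematic prediction.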

110. ❌ Security and Resilience in Autonomous Vehicles: A Proactive Design Approach

Authors: Chieh Tsai, Murad Mehrab Abrar, Salim Hariri Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12408v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

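Scoring rationale: The paper focuses on security and resilience design for autonomous vehicles, covering sensors, communications, intrusion detection, and system architecture, but it does not mention large models, deep learning techniques, or any of the keyword topics. All keywords concern large-model technical principles, training methods, inference optimization, or scientific AI applications, whereas this paper belongs to cybersecurity and autonomous driving systems and is entirely unrelated to the given keywords.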

!!! tip deepseek-chat TL;DR

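This work examines the security threats facing autonomous vehicles and proposes a resilient architecture that integrates redundancy, diversity, and adaptive reconfiguration, experimentally validating its effectiveness in detecting depth camera blinding attacks and software tampering.

摘要翻译 (Chinese translation of the abstract)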

自动驾驶车辆（Autonomous Vehicles, AVs）有望构建高效、清洁且经济高效的交通系统，但其对传感器、无线通信及决策系统的依赖使其易受网络攻击和物理威胁。本章提出了增强自动驾驶车辆安全性与韧性的创新设计技术。我们首先对不同架构层面临的潜在攻击进行了分类梳理，涵盖从感知与控制操纵、车联万物（Vehicle-to-Any, V2X）通信利用到软件供应链入侵等多个层面。基于此分析，我们提出了一种自动驾驶车辆韧性架构，该架构融合了冗余性、多样性与自适应重构策略，并辅以基于异常检测和哈希值的入侵检测技术。在Quanser QCar平台上的实验验证表明，这些方法能有效检测深度相机致盲攻击及感知模块的软件篡改行为。研究结果凸显了快速异常检测结合故障切换与备份机制如何确保系统在对抗条件下仍能维持运行连续性。通过将分层威胁建模与实际防御实施方案相结合，本研究推动了自动驾驶车辆韧性策略的发展，以构建更安全、更可信的自动驾驶系统。

摘要 (Abstract)

Autonomous vehicles (AVs) promise efficient, clean and cost-effective transportation systems, but their reliance on sensors, wireless communications, and decision-making systems makes them vulnerable to cyberattacks and physical threats. This chapter presents novel design techniques to strengthen the security and resilience of AVs. We first provide a taxonomy of potential attacks across different architectural layers, from perception and control manipulation to Vehicle-to-Any (V2X) communication exploits and software supply chain compromises. Building on this analysis, we present an AV Resilient architecture that integrates redundancy, diversity, and adaptive reconfiguration strategies, supported by anomaly- and hash-based intrusion detection techniques. Experimental validation on the Quanser QCar platform demonstrates the effectiveness of these methods in detecting depth camera blinding attacks and software tampering of perception modules. The results highlight how fast anomaly detection combined with fallback and backup mechanisms ensures operational continuity, even under adversarial conditions. By linking layered threat modeling with practical defense implementations, this work advances AV resilience strategies for safer and more trustworthy autonomous vehicles.

Keywords: Autonomous Vehicles, Security, Resilience, Cyberattacks, Intrusion Detection, Anomaly Detection, V2X Communication, Quanser QCar
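
Hash-based detection of software tampering, as mentioned in the abstract, generally means comparing a file's current digest against a trusted baseline. The sketch below shows that general idea with Python's standard `hashlib`; the choice of SHA-256, the baseline format, and the function names are assumptions, not details from the paper.

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def tampered_files(baseline):
    """Return the paths whose current digest differs from the recorded one.

    `baseline` maps file paths to digests recorded while the system was
    known to be in a trusted state (a hypothetical format).
    """
    return [p for p, digest in baseline.items() if sha256_of(p) != digest]
```

A typical use would record `{path: sha256_of(path)}` for each perception module at deployment time and call `tampered_files` periodically; a non-empty result would trigger the fallback path. The baseline itself must be stored where an attacker cannot rewrite it, otherwise the check is void.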

111. ❌ Heuristic Classification of Thoughts Prompting (HCoT): Integrating Expert System Heuristics for Structured Reasoning into Large Language Models

Authors: Lei Lin, Jizhao Zhu, Yong Liu, Donghong Sun, Hongbo He, Yihua Du Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12390v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

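Scoring rationale: The paper's core contribution is an improved reasoning method for LLMs, so it is highly relevant to 'Large Language Models' and 'Chain of Thought' (10 points each). It involves structured reasoning and decision optimization, which relates to 'System 2 Thinking' (8 points). The method includes knowledge-guided optimization, which is weakly related to 'Self-Correction' (5 points). Other keywords such as MoE, SFT, and RAG are not covered and score 0.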

!!! tip deepseek-chat TL;DR

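Addressing the problem that LLM reasoning in complex problem solving is stochastic and decoupled from the decision-making mechanism, the paper proposes Heuristic Classification of Thoughts prompting (HCoT), which outperforms existing methods on several reasoning tasks and achieves a Pareto-frontier balance between performance and computational cost.

摘要翻译 (Chinese translation of the abstract)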

本文针对大语言模型在解决复杂问题时的两个局限性展开研究：(1)其推理过程呈现类贝叶斯随机生成特性，即每个词元均从上下文相关的概率分布中采样产生，导致决策轨迹具有内在随机性而非确定性规划；(2)推理与决策机制存在静态解耦问题，动态检索的领域知识无法动态调整底层推理策略。这种双重缺陷使得初始决策缺乏策略锚定，推理链往往难以收敛至正确解——因为随机生成机制在序列化推理过程中缺乏轨迹修正或知识引导优化的能力。为突破这些限制，我们提出一种嵌入大语言模型生成过程的解题方法以引导推理。该方法兼容多种大语言模型且具备可复用解决方案，其核心基于新型启发式思维分类提示框架。该框架通过启发式分类模型将大语言模型的推理能力与结构化问题空间相协同，该模型既能控制推理过程，又能提供可复用的抽象解决方案。在两项具有不明确搜索空间的复杂归纳推理任务上的评估表明，该框架在性能上超越现有方法（如思维树和思维链提示）。在结构良好的24点游戏任务中，相较于最先进的广度优先搜索思维树方法，该框架展现出显著更高的词元效率。在准确率与词元消耗两个维度上，该框架实现了帕累托前沿平衡，在性能与计算成本之间达成了优越的权衡。

摘要 (Abstract)

This paper addresses two limitations of large language models (LLMs) in solving complex problems: (1) their reasoning processes exhibit Bayesian-like stochastic generation, where each token is sampled from a context-dependent probability distribution, leading to inherently random decision trajectories rather than deterministic planning; (2) the reasoning and decision-making mechanisms are statically decoupled, meaning dynamically retrieved domain knowledge fails to dynamically adjust the underlying reasoning strategy. These dual deficiencies result in initial decisions lacking strategic anchoring and reasoning chains often failing to converge on correct solutions, as stochastic generation lacks mechanisms for trajectory correction or knowledge-guided optimization during sequential reasoning. To resolve these issues, we propose a problem-solving method integrated into the LLM’s generation process to guide reasoning. This method, compatible with numerous LLMs and featuring reusable solutions, is grounded in a novel Heuristic-Classification-of-Thoughts prompting schema (HCoT). HCoT synergizes the LLM’s reasoning ability with a structured problem space via a heuristic classification model that controls the reasoning process and provides reusable abstract solutions. Evaluated on two complex inductive reasoning tasks with ill-defined search spaces, HCoT outperforms existing approaches (e.g., Tree-of-Thoughts and Chain-of-Thoughts prompting) in performance. On the well-structured 24 Game task, HCoT demonstrates significantly higher token efficiency compared to the state-of-the-art Tree-of-Thoughts-Breadth-First-Search. In terms of both accuracy and token usage, HCoT achieves a Pareto frontier balance, offering a strong trade-off between performance and computational cost.

Keywords: Large Language Models, Reasoning, Heuristic Classification of Thoughts, Structured Reasoning, Complex Problem Solving, Tree-of-Thoughts, Chain-of-Thoughts, Token Efficiency

112. ❌ Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

Authors: Songping Peng, Zhiheng Zhang, Daojian Zeng, Lincheng Jiang, Xieping Gao Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12384v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
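
As a quick check on the pass mark shown above, the report's configured formula (pass mark = 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each) gives 26.6. The sketch below recomputes that threshold and sums the Score column of a table. The function names and the row layout are illustrative assumptions, not the report generator's actual code. The report does not state how each row's score is derived from weight and relevance, so the scores are taken as given.

```python
def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report's configured formula: base + factor * total weight."""
    return round(base + factor * total_weight, 1)


def total_score(rows):
    """Sum the Score column; each row is (keyword, weight, relevance, score)."""
    return round(sum(score for _, _, _, score in rows), 1)


# Three rows copied from the table above (every other row also scores 0.0).
rows = [
    ("Large Language Models OR LLMs OR Foundation Models", 0.0, 10.0, 0.0),
    ("Post-training OR Supervised Fine-tuning OR SFT", 0.0, 10.0, 0.0),
    ("Instruction Tuning OR Alignment OR Value Alignment", 0.0, 10.0, 0.0),
]

threshold = pass_mark(27.0)      # 27 keywords, weight 1.0 each
paper_total = total_score(rows)
verdict = "pass" if paper_total >= threshold else "fail"
print(f"{paper_total} / {threshold} -> {verdict}")
```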

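Scoring rationale: The paper studies safety alignment of large language models during fine-tuning and proposes CWAC, which constrains weight updates and activation features together to prevent safety drift. It is highly relevant to "Large Language Models" (the core object of study), to "Post-training/Supervised Fine-tuning" (safety drift arises during fine-tuning), and to "Instruction Tuning/Alignment/Value Alignment" (the subject is safety alignment). Other keywords such as MoE, SLMs, RAG, and quantization do not appear in the paper and are scored 0.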

!!! tip deepseek-chat TL;DR

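This paper addresses the tendency of LLM safety alignment to degrade during fine-tuning and proposes CWAC, which constrains weights and activations jointly. Experiments show that it prevents safety drift with minimal impact on fine-tuning accuracy.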

Abstract (Chinese translation)

大型语言模型（LLM）的安全对齐在微调过程中仍极为脆弱，即使进行良性适配也可能削弱预训练阶段的拒绝行为，导致模型生成有害回复。现有防御方法通常仅单独约束权重或激活值，而未考虑二者对安全性的耦合影响。本文首先从理论上证明，仅约束权重或激活值均不足以有效维护安全性。为鲁棒地保持安全对齐，我们提出耦合权重与激活约束（Coupled Weight and Activation Constraints, CWAC），该方法同时实施双重保护：在权重更新上强制限定于预计算的安全子空间，并对稀疏自编码器识别的安全关键特征施加定向正则化。在四种广泛使用的LLM及多样下游任务上的大量实验表明，CWAC始终能以对微调精度最小的影响实现最低有害性评分，即使在高比例有害数据条件下也显著优于现有强基线方法。

Abstract

Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning, where even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations in isolation, without considering their coupled effects on safety. In this paper, we first theoretically demonstrate that constraining either weights or activations alone is insufficient for safety preservation. To robustly preserve safety alignment, we propose Coupled Weight and Activation Constraints (CWAC), a novel approach that simultaneously enforces a precomputed safety subspace on weight updates and applies targeted regularization to safety-critical features identified by sparse autoencoders. Extensive experiments across four widely used LLMs and diverse downstream tasks show that CWAC consistently achieves the lowest harmful scores with minimal impact on fine-tuning accuracy, substantially outperforming strong baselines even under high harmful data ratios.

Keywords: Large Language Models, Safety Alignment, Fine-tuning, Safety Drift, Weight Constraints, Activation Constraints, Harmful Response Prevention, CWAC

113. ❌ Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks

Authors: Yuangang Li, Justin Tian Jin Chen, Ethan Yu, David Hong, Iftekhar Ahmed Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12379v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

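Scoring rationale: The paper evaluates the quality of LLM reasoning in coding tasks. It is highly relevant to "Large Language Models" (10) and deals directly with the reasoning process ("Chain of Thought" and "System 2 Thinking", 10 each). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and are scored 0.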

!!! tip deepseek-chat TL;DR

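Existing benchmarks fall short in evaluating the quality of LLM reasoning on coding tasks. The paper introduces CodeRQ-Bench, the first benchmark covering three coding task categories (generation, summarization, and classification), and builds on it the VERA evaluator, which substantially improves evaluation performance.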

Abstract (Chinese translation)

大型语言模型（LLMs）日益依赖显式推理来解决编码任务，然而评估此类推理的质量仍具挑战性。现有的推理评估器并非为编码任务设计，且当前基准测试主要关注代码生成，其他编码任务在很大程度上尚未得到探索。我们提出了CodeRQ-Bench，这是首个用于评估LLM在三大编码任务类别（生成、总结与分类）中推理质量的基准测试。基于此基准，我们分析了现有评估器中1,069个不匹配案例，识别出五个常见局限，并推导出针对编码任务推理评估的四项设计洞见。在这些洞见的指导下，我们提出了VERA——一个两阶段评估器，它结合了基于证据的验证与考虑模糊性的分数校正。在CodeRQ-Bench上的实验表明，VERA在四个数据集上均持续优于强基线模型，将AUCROC提升最高达0.26，AUPRC提升最高达0.21。我们在https://github.com/MrLYG/CodeRQ-Bench发布了CodeRQ-Bench，以支持未来研究。

Abstract

Large language models (LLMs) increasingly rely on explicit reasoning to solve coding tasks, yet evaluating the quality of this reasoning remains challenging. Existing reasoning evaluators are not designed for coding, and current benchmarks focus primarily on code generation, leaving other coding tasks largely unexplored. We introduce CodeRQ-Bench, the first benchmark for evaluating LLM reasoning quality across three coding task categories: generation, summarization, and classification. Using this benchmark, we analyze 1,069 mismatch cases from existing evaluators, identify five recurring limitations, and derive four design insights for reasoning evaluation in coding tasks. Guided by these insights, we propose VERA, a two-stage evaluator that combines evidence-grounded verification with ambiguity-aware score correction. Experiments on CodeRQ-Bench show that VERA consistently outperforms strong baselines across four datasets, improving AUCROC by up to 0.26 and AUPRC by up to 0.21. We release CodeRQ-Bench at https://github.com/MrLYG/CodeRQ-Bench, supporting future investigations.

Keywords: Large Language Models, LLMs, reasoning evaluation, coding tasks, benchmark, CodeRQ-Bench, VERA evaluator, evidence-grounded verification

114. ❌ SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models

Authors: SungHo Kim, Juhyeong Park, Eda Atalay, SangKeun Lee Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12377v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

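Scoring rationale: The paper proposes SCRIPT, a module that injects subcharacter compositional representations into Korean pre-trained language models, which counts as an innovation in the technical foundations of large models. It is highly relevant to "Large Language Models" and "Pre-training" (8 each) because it improves pre-trained language models. It is moderately relevant to "PEFT" (5): SCRIPT is a model-agnostic module that needs no architectural change or additional pre-training, which resembles the idea behind parameter-efficient fine-tuning. It is moderately relevant to "Mechanistic Interpretability" (5) because the paper analyzes the embedding space in detail to explain the improvements. The remaining keywords are unrelated to the paper (0).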

!!! tip deepseek-chat TL;DR

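Korean pre-trained language models fail to capture the compositional structure inside characters. The paper proposes the SCRIPT module to inject subcharacter compositional knowledge, improving performance on a range of Korean natural language understanding and generation tasks without architectural changes or additional pre-training, and it validates the effect through linguistic analysis.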

Abstract (Chinese translation)

韩语是一种形态丰富的语言，其文字系统具有特征性书写特性，每个字符都由称为“字母”（Jamo）的亚字符单元系统化组合而成。这些亚字符不仅决定了韩语的视觉结构，还编码了频繁且具有语言学意义的形态音系过程。然而，当前大多数韩语语言模型（LMs）基于子词分词方案构建，这些方案并未明确设计用于捕捉字符内部的组合结构。为克服这一局限，我们提出了SCRIPT——一个与模型无关的模块，旨在将亚字符组合知识注入韩语预训练语言模型（PLMs）中。SCRIPT能够以结构化的粒度增强子词嵌入表示，且无需改变模型架构或进行额外的预训练。实验结果表明，SCRIPT在多种韩语自然语言理解（NLU）与生成（NLG）任务中均提升了所有基线模型的性能。此外，除性能提升外，细致的语言学分析表明，SCRIPT通过重塑嵌入空间，能更好地捕捉语法规律和语义连贯的变体。我们的代码公开于https://github.com/SungHo3268/SCRIPT。

Abstract

Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters. To address this limitation, we propose SCRIPT, a model-agnostic module that injects subcharacter compositional knowledge into Korean PLMs. SCRIPT allows to enhance subword embeddings with structural granularity, without requiring architectural changes or additional pre-training. As a result, SCRIPT enhances all baselines across various Korean natural language understanding (NLU) and generation (NLG) tasks. Moreover, beyond performance gains, detailed linguistic analyses show that SCRIPT reshapes the embedding space in a way that better captures grammatical regularities and semantically cohesive variations. Our code is available at https://github.com/SungHo3268/SCRIPT.

Keywords: Korean language models, subcharacter composition, Jamo, pre-trained language models, embedding enhancement, natural language understanding, linguistic analysis, model-agnostic module
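
The Jamo composition the abstract refers to is plain arithmetic in Unicode: each precomposed Hangul syllable (U+AC00 to U+D7A3) encodes a leading consonant, a vowel, and an optional trailing consonant. The sketch below decomposes a syllable using the standard Unicode Hangul constants. It illustrates the subcharacter structure only and is not the SCRIPT module, whose internals the abstract does not describe. The function name `decompose_syllable` is an assumption made for this example.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def decompose_syllable(ch):
    """Split one precomposed Hangul syllable into its conjoining Jamo."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [ch]  # not a precomposed Hangul syllable; leave unchanged
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    jamo = [lead, vowel]
    tail_index = s_index % T_COUNT
    if tail_index:  # index 0 means "no trailing consonant"
        jamo.append(chr(T_BASE + tail_index))
    return jamo


for syllable in "한글":
    parts = decompose_syllable(syllable)
    # The arithmetic agrees with Unicode canonical decomposition (NFD).
    assert "".join(parts) == unicodedata.normalize("NFD", syllable)
    print(syllable, [f"U+{ord(p):04X}" for p in parts])
```

A subword tokenizer sees only the precomposed syllable, so regularities shared at the Jamo level stay implicit unless they are made available to the model in some other way.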

115. ❌ Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

Authors: Ziyang Liu Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12376v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

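Scoring rationale: The paper studies how to manage content once a long LLM conversation exceeds the context window, and proposes cooperative paging, which uses keyword bookmarks and a recall() tool for on-demand retrieval. Highly relevant keywords: LLMs (the core object of study), Context Window Extension (it addresses the long-context problem), and Tool Use (it implements the recall() tool). Moderately relevant: RAG (a retrieval mechanism is involved) and LLM Agents (tool use and conversation management). Other keywords such as MoE, SFT, and RLHF are unrelated to the paper.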

!!! tip deepseek-chat TL;DR

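The paper studies how to manage conversation history once an LLM conversation exceeds the context window. It proposes cooperative paging, which combines keyword bookmarks with an on-demand recall() tool and achieves the highest answer quality among six methods on the LoCoMo benchmark across four models.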

Abstract (Chinese translation)

当大语言模型对话超出上下文窗口时，旧内容必须被移出——但模型如何在需要时恢复这些内容？我们提出协作分页机制：被移出的片段被替换为极简的关键词书签（[pN:关键词]，每个约8-24个词元），同时模型获得一个 recall() 工具以按需检索完整内容。在LoCoMo基准测试（10个真实多轮对话，300+轮次）中，协作分页在四种模型（GPT-4o-mini、DeepSeek-v3.2、Claude Haiku、GLM-5）上对比六种方法（截断法、BM25检索、词重叠检索、搜索工具基线及完整上下文）取得了最高的回答质量——其优势经四个独立大语言模型评判员确认（$p=0.017$，配对自助法）。我们随后通过边界策略与淘汰策略的5×4消融实验（3,176个合成探针，1,600个LoCoMo探针）探索了分页设计空间。关键发现包括：（1）粗粒度固定大小分页（fixed_20）达到96.7%性能，而基于内容感知的主题转移策略（topic_shift）骤降至56.7%；（2）淘汰策略选择具有数据依赖性（FIFO在合成数据上最优，LFU在LoCoMo上最优）；（3）两种书签生成策略相比启发式基线均有提升（端到端指标分别+4.4与+8.7分）；（4）当前瓶颈在于书签区分度——模型在96%的情况下触发 recall()，但当书签特征不足时仅能选择正确分页的57%。仅关键词特异性一项就导致25个百分点的准确率差异。

Abstract

When LLM conversations grow beyond the context window, old content must be evicted – but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods – outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context – on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination – the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.

Keywords: LLM conversations, context window, cooperative paging, keyword bookmarks, recall tool, long-horizon conversations, retrieval, LoCoMo benchmark
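
The mechanism in the abstract above can be sketched as follows: evicted turns are grouped into fixed-size pages, each page leaves behind a `[pN:keywords]` bookmark, and a `recall()` call returns the full page on demand. This is a minimal illustration, not the paper's implementation. The class name `PagedContext`, the `keywords` heuristic (most frequent words of four or more letters), the FIFO eviction, and the tiny page sizes are all assumptions chosen for readability.

```python
import re
from collections import Counter


def keywords(turns, k=3):
    """Hypothetical heuristic: the k most frequent words of 4+ letters."""
    words = re.findall(r"[a-z]{4,}", " ".join(turns).lower())
    return [word for word, _ in Counter(words).most_common(k)]


class PagedContext:
    def __init__(self, page_size=2, max_live_turns=3):
        self.page_size = page_size            # turns per evicted page
        self.max_live_turns = max_live_turns  # budget for verbatim turns
        self.live = []        # turns still held verbatim in the context
        self.pages = {}       # page id -> evicted turns
        self.bookmarks = []   # bookmark strings that stay in the context

    def add_turn(self, turn):
        self.live.append(turn)
        # FIFO eviction: move the oldest fixed-size page out of the window.
        while len(self.live) > self.max_live_turns:
            page = self.live[:self.page_size]
            self.live = self.live[self.page_size:]
            page_id = len(self.pages) + 1
            self.pages[page_id] = page
            self.bookmarks.append(f"[p{page_id}:{','.join(keywords(page))}]")

    def recall(self, page_id):
        """Tool the model would call to get an evicted page back in full."""
        return self.pages.get(page_id, [])

    def render(self):
        """What the model sees: bookmarks first, then the live turns."""
        return self.bookmarks + self.live


turns = [
    "We booked flights to Lisbon",
    "The Lisbon hotel is near the river",
    "Dinner is on Friday",
    "Remember the museum tickets",
]
ctx = PagedContext(page_size=2, max_live_turns=3)
for turn in turns:
    ctx.add_turn(turn)

print(ctx.render())   # one bookmark for page 1, then the two live turns
print(ctx.recall(1))  # full text of the evicted page
```

The abstract's fourth finding applies directly to a heuristic like this one: if two pages yield similar keywords, the model has little basis for choosing which page to recall.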

116. ❌ Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Authors: NVIDIA: Aakshita Chandiramani, Aaron Blakeman, Abdullahi Olaoye, Abhibha Gupta, Abhilash Somasamudramath, Abhinav Khattar, Adeola Adesoba, Adi Renduchintala, Adil Asif, Aditya Agrawal, Aditya Vavre, Ahmad Kiswani, Aishwarya Padmakumar, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Gronskiy, Alex Kondratenko, Alex Neefus, Alex Steiner, Alex Yang, Alexander Bukharin, Alexander Young, Ali Hatamizadeh, Ali Taghibakhshi, Alina Galiautdinova, Alisa Liu, Alok Kumar, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Anahita Bhiwandiwalla, Ananth Subramaniam, Andrew Tao, Anjaney Shrivastava, Anjulie Agrusa, Ankur Srivastava, Ankur Verma, Ann Guan, Anna Shors, Annamalai Chockalingam, Anubhav Mandarwal, Aparnaa Ramani, Arham Mehta, Arti Jain, Arun Venkatesan, Asha Anoosheh, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asli Sabanci Demiroz, Asma Kuriparambil Thekkumpate, Atefeh Sohrabizadeh, Avinash Kaur, Ayush Dattagupta, Barath Subramaniam Anandan, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Benjamin Chislett, Besmira Nushi, Bilal Kartal, Bill Thiede, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Buvaneswari Mani, Carlo del Mundo, Chankyu Lee, Chanran Kim, Chantal Hwang, Chao Ni, Charles Wang, Charlie Truong, Cheng-Ping Hsieh, Chenhan Yu, Chenjie Luo, Cherie Wang, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Chris Holguin, Chris Wing, Christian Munley, Christopher Parisien, Chuck Desai, Chunyang Sheng, Collin Neale, Cyril Meurillon, Dakshi Kumar, Dan Gil, Dan Su, Dane Corneil, Daniel Afrimi, Daniel Burkhardt Eliuth Triana, Daniel Egert, Daniel Fatade, Daniel Lo, Daniel Rohrer, Daniel Serebrenik, Daniil Sorokin, Daria Gitman, Daria Levy, Darko Stosic, David Edelsohn, David Messina, David Mosallanezhad, David Tamok, Deena Donia, Deepak Narayanan, Devin O’Kelly, Dheeraj Peri, Dhruv Nathawani, Di Wu, Dima Rekesh, Dina Yared, Divyanshu Kakwani, Dmitry Konyagin Brandon Tuttle, Dong Ahn, Dongfu Jiang, Dorrin Poorkay, Douglas O’Flaherty, Duncan Riach, Dusan Stosic, Dustin Van Stee, Edgar Minasyan, Edward Lin, Eileen Peters Long, Elad Segal, Elena Lantz, Elena Lewis, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Pham-Hung, Eric W. Tramel, Erick Galinkin, Erik Pounds, Esti Etrog, Evan Briones, Evan Wu, Evelina Bakhturina, Evgeny Tsykunov, Ewa Dobrowolska, Farshad Saberi Movahed, Farzan Memarian, Fay Wang, Fei Jia, Felipe Soares, Felipe Vieira Frujeri, Feng Chen, Fengguang Lin, Ferenc Galko, Fortuna Zhang, Frankie Siino, Frida Hou, Gantavya Bhatt, Gargi Prasad, Geethapriya Venkataramani, Geetika Gupta, George Armstrong, Gerald Shen, Giulio Borghesi, Gordana Neskovic, Gorkem Batmaz, Grace Lam, Grace Wu, Greg Pauloski, Greyson Davis, Grigor Nalbandyan, Guoming Zhang, Guy Farber, Guyue Huang, Haifeng Qian, Haran Kumar Shiv Kumar, Harry Kim, Harsh Sharma, Hayate Iso, Hayley Ross, Herbert Hum, Herman Sahota, Hexin Wang, Himanshu Soni, Hiren Upadhyay, Huy Nguyen, Iain Cunningham, Ido Galil, Ido Shahaf, Igino Padovani, Igor Gitman, Igor Shovkun, Ikroop Dhillon, Ilya Loshchilov, Ingrid Kelly, Itamar Schen, Itay Levy, Ivan Moshkov, Izik Golan, Izzy Putterman, Jain Tu, Jan Baczek, Jan Kautz, Jane Polak Scowcroft, Janica Rosenberg, Jared Casper, Jarrod Pflum, Jason Grant, Jason Sewall, Jatin Mitra, Jeffrey Glick, Jenny Chen, Jesse Oliver, Jiacheng Xu, Jiafan Zhu, Jialin Song, Jian Zhang, Jiaqi Zeng, Jie Lou, Jill Milton, Jim Chow, Jimmy Zhang, Jinhang Choi, Jining Huang, Jocelyn Huang, Joel Caruso, Joey Conway, Joey Guman, Johan Jatko, John Kamalu, Johnny Greco, Jonathan Cohen, Jonathan Raiman, Joseph Jennings, Joyjit Daw, Juan Yu, Julio Tapia, Junkeun Yi, Jupinder Parmar, Jyothi Achar, Kari Briski, Kartik Mattoo, Katherine Cheung, Katherine Luna, Keith Wyss, Kevin Shih, Kezhi Kong, Khanh Nguyen, Khushi Bhardwaj, Kirill Buryak, Kirthi Shankar Sivamani, Konstantinos Krommydas, Kris Murphy, Krishna C. Puvvada, Krzysztof Pawelec, Kumar Anik, Laikh Tewari, Laya Sleiman, Leo Du, Leon Derczynski, Li Ding, Lilach Ilan, Lingjie Wu, Lizzie Wei, Luis Vega, Lun Su, Maarten Van Segbroeck, Maer Rodrigues de Melo, Magaret Zhang, Mahan Fathi, Makesh Narsimhan Sreedhar, Makesh Sreedhar, Makesh Tarun Chandran, Manuel Reyes Gomez, Maor Ashkenazi, Marc Cuevas, Marc Romeijn, Margaret Zhang, Mark Cai, Mark Gabel, Markus Kliegl, Martyna Patelka, Maryam Moosaei, Matthew Varacalli, Matvei Novikov, Mauricio Ferrato, Mehrzad Samadi, Melissa Corpuz, Meng Xin, Mengdi Wang, Mengru Wang, Meredith Price, Micah Schaffer, Michael Andersch, Michael Boone, Michael Evans, Michael Z Wang, Miguel Martinez, Mikail Khona, Mike Chrzanowski, Mike Hollinger, Mingyuan Ma, Minseok Lee, Mohammad Dabbah, Mohammad Shoeybi, Mostofa Patwary, Nabin Mulepati, Nader Khalil, Najeeb Nabwani, Nancy Agarwal, Nanthini Balasubramaniam, Narimane Hennouni, Narsi Kodukula, Natalie Hereth, Nathaniel Pinckney, Nave Assaf, Negar Habibi, Nestor Qin, Neta Zmora, Netanel Haber, Nick Reamaroon, Nickson Quak, Nidhi Bhatia, Nikhil Jukar, Nikki Pope, Nikolai Ludwig, Nima Tajbakhsh, Nir Ailon, Nirmal Juluru, Nirmalya De, Nowel Pitt, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Olivier Delalleau, Oluwatobi Olabiyi, Omer Ullman Argov, Omri Almog, Omri Puny, Oren Tropp, Otavio Padovani, Ouye Xie, Parth Chadha, Pasha Shamis, Paul Gibbons, Pavlo Molchanov, Peter Belcak, Peter Jin, Pinky Xu, Piotr Januszewski, Pooya Jannaty, Prachi Shevate, Pradeep Thalasta, Pranav Prashant Thombre, Prasoon Varshney, Prerana Gambhir, Pritam Gundecha, Przemek Tredak, Qing Miao, Qiyu Wan, Quan Tran Minh, Rabeeh Karimi Mahabadi, Rachel Oberman, Rachit Garg, Rahul Kandu, Raina Zhong, Ran El-Yaniv, Ran Zilberstein, Rasoul Shafipour, Renee Yao, Renjie Pi, Richard Mazzarese, Richard Wang, Rick Izzo, Ridhima Singla, Rima Shahbazyan, Rishabh Garg, Ritika Borkar, Ritu Gala, Riyad Islam, Robert Clark, Robert Hesse, Roger Waleffe, Rohit Varma Kalidindi, Rohit Watve, Roi Koren, Ron Fan, Ruchika Kharwar, Ruisi Cai, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Ryan Timbrook, Ryota Egashira, Sadegh Mahdavi, Sagar Singh Ashutosh Joshi, Sahil Modi, Samuel Kriman, Sandeep Pombra, Sanjay Kariyappa, Sanjeev Satheesh, Santiago Pombo, Saori Kaji, Satish Pasumarthi, Saurav Mishra, Saurav Muralidharan, Scott Hara, Sean Narenthiran, Sebastian Rogawski, Seonjin Na, Seonmyeong Bak, Sepehr Sameni, Seth Poulos, Shahar Mor, Shantanu Acharya, Shaona Ghosh Adam Lord, Sharath Turuvekere Sreenivas, Shaun Kotek, Shaya Gharghabi, Shelby Thomas, Sheng-Chieh Lin, Shibani Likhite, Shiqing Fan, Shiyang Chen, Shreya Gopal, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shuo Zhang, Shuoyang Ding, Shyam Renjith, Shyamala Prayaga, Siddhartha Jain, Simeng Sun, Sirisha Rella, Sirshak Das, Smita Ithape, Sneha Harishchandra S, Somshubra Majumdar, Soumye Singhal, Sri Harsha Singudasu, Sriharsha Niverty, Stas Sergienko, Stefana Gloginic, Stefania Alborghetti, Stephen Ge, Stephen McCullough, Sugam Dipak Devare, Suguna Varshini Velury, Sukrit Rao, Sumeet Kumar Barua, Sunny Gai, Suseella Panguluri, Sushil Koundinyan, Swathi Patnam, Sweta Priyadarshi, Swetha Bhendigeri, Syeda Nahida Akter, Sylendran Arunagiri, Tailling Yuan, Talor Abramovich, Tan Bui, Tan Yu, Terry Kong, Thanh Do, Thomas Gburek, Thorgane Marques, Tiffany Moore, Tijmen Blankevoort, Tim Moon, Timothy Ma, Tiyasa Mitra, Tomasz Grzegorzek, Tomer Asida, Tomer Bar Natan, Tomer Keren, Tomer Ronen, Traian Rebedea, Trenton Starkey, Tugrul Konuk, Twinkle Vashishth, Tyler Condensa, Udi Karpas, Ushnish De, Vahid Noorozi, Vahid Noroozi, Vanshil Atul Shah, Veena Vaidyanathan, Venkat Srinivasan, Venmugil Elango, Victor Cui, Vijay Korthikanti, Vikas Mehta, Virginia Adams, Virginia Wu, Vitaly Kurin, Vitaly Lavrukhin, Vladimir Anisimov, Wan Seo, Wanli Jiang, Wasi Uddin Ahmad, Wei Du, Wei Ping, Wei-Ming Chen, Wendy Quan, Wenliang Dai, Wenwen Gao, Will Jennings, William Zhang, Xiaowei Ren, Xiaowen Xin, Xin Li, Yang Yu, Yangyi Chen, Yaniv Galron, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi-Fu Wu, Yian Zhang, Ying Lin, Yonatan Geifman, Yonggan Fu, Yoshi Suhara, Youngeun Kwon, Yuan Zhang, Yuki Huang, Zach Moshe, Zhilin Wang, Zhiyu Cheng, Zhongbo Zhu, Zhuolin Yang, Zihan Liu, Zijia Chen, Zijie Yan, Zuhair Ahmed Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12374v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

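Scoring rationale: The paper presents Nemotron 3 Super, a 120-billion-parameter hybrid Mamba-Attention MoE model. Highly relevant keywords: LLMs/Foundation Models (the core object of study), MoE/Sparse Models (it adopts the new LatentMoE architecture), Pre-training (pre-trained on 25 trillion tokens), Post-training/SFT (supervised fine-tuning), Context Window Extension (supports 1M context), Quantization (quantized checkpoints), and Speculative Decoding (MTP layers accelerate inference). RLHF/RLAIF/DPO receives a relevance of 5 (reinforcement learning is mentioned, but no specific method is named). Other keywords such as SLMs, Scaling Laws, and Instruction Tuning are not covered.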

!!! tip deepseek-chat TL;DR

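The paper presents Nemotron 3 Super, a 120-billion-parameter (12 billion active) hybrid Mamba-Attention MoE model. After pre-training, post-training, and quantization, it supports context lengths of up to 1M tokens and delivers up to 2.2x and 7.5x higher inference throughput than GPT-OSS-120B and Qwen3.5-122B, respectively. The datasets and model checkpoints are open-sourced.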

Abstract (Chinese translation)

本文介绍了Nemotron 3 Super模型的预训练、后训练及量化过程。该模型是一个拥有1200亿参数（激活参数为120亿）的混合Mamba-注意力专家混合模型。Nemotron 3 Super是Nemotron 3系列中首个具备以下特征的模型：1）采用NVFP4格式进行预训练；2）应用了LatentMoE（一种新型专家混合架构），该架构在每FLOP精度与每参数精度上均进行了优化；3）集成了MTP层，通过原生推测解码实现推理加速。我们在25万亿词元上对Nemotron 3 Super进行了预训练，随后采用监督微调与强化学习进行后训练。最终模型支持高达100万词元的上下文长度，在常见基准测试中达到了可比精度，同时与GPT-OSS-120B和Qwen3.5-122B相比，推理吞吐量分别最高提升至2.2倍和7.5倍。Nemotron 3 Super的数据集，以及基础模型、后训练模型和量化检查点均已开源发布于HuggingFace平台。

Abstract

We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.

Keywords: Mixture-of-Experts, Mamba-Transformer, Pre-training, Post-training, Quantization, Inference Acceleration, Long Context, Open Source

117. ❌ ReflectCAP: Detailed Image Captioning with Reflective Memory

Authors: Kyungmin Min, Minbeom Kim, Kang-il Lee, Seunghyun Yoon, Kyomin Jung Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12357v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

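Scoring rationale: The paper proposes ReflectCAP, which uses a multi-agent pipeline to analyze the hallucination and omission patterns of large vision-language models (LVLMs) and produces Structured Reflection Notes that guide caption generation. Core relevant keywords: 1) "Self-Correction/Self-Improvement/Self-Reflection" (10): the core method reflects on and improves model output; 2) "LLM Agents/Autonomous Agents/Agentic Workflow" (10): it explicitly uses a multi-agent pipeline; 3) "Multi-agent Systems/Agent Coordination" (10): multiple agents work in coordination; 4) "Hallucination Mitigation/Factuality/Truthfulness" (10): it directly targets hallucination and improves factuality; 5) "Large Language Models/LLMs/Foundation Models" (8): it is applied to large vision-language model families such as GPT-4.1 and Qwen. The remaining keywords have no direct connection to the paper.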

!!! tip deepseek-chat TL;DR

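The study tackles the difficulty of achieving both factuality and coverage in detailed image captioning. A multi-agent pipeline analyzes the model's hallucination and omission patterns and produces Structured Reflection Notes that guide caption generation, reaching the Pareto frontier of factuality versus coverage across multiple large vision-language models.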

Abstract (Chinese translation)

精细图像描述任务既要求事实依据，又需要细粒度覆盖，然而现有方法难以同时实现这两点。我们通过反思笔记引导描述方法（Reflective Note-Guided Captioning, ReflectCAP）来解决这一矛盾：该方法采用多智能体流程，分析目标大型视觉语言模型（Large Vision-Language Model, LVLM）持续产生的幻觉内容及其系统性忽略的细节，并将这些模式提炼为可复用的结构化反思笔记（Structured Reflection Notes）。在推理阶段，这些笔记从“应避免的内容”和“需关注的方向”两个维度引导描述模型，生成同时提升事实准确性与覆盖度的精细描述。将本方法应用于涵盖GPT-4.1系列、Qwen系列及InternVL变体的8种大型视觉语言模型后，ReflectCAP达到了事实性与覆盖度权衡的帕累托前沿，并在CapArena-Auto评测中取得显著提升——该基准通过将生成描述与强参考模型进行直接对比来评估性能。此外，相较于模型缩放或现有多智能体流程（会产生21%-36%的额外开销），ReflectCAP在描述质量与计算成本之间实现了更优的平衡，这使得高质量精细图像描述能够在实际成本与延迟约束下得以实现。

摘要 (Abstract)

Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes – what to avoid and what to attend to – yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where generated captions are judged head-to-head against strong reference models. Moreover, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21–36% greater overhead. This makes high-quality detailed captioning viable under real-world cost and latency constraints.

Keywords: detailed image captioning, large vision-language models, multi-agent pipeline, hallucination mitigation, factuality improvement, structured reflection notes, Pareto frontier, compute cost efficiency

118. ❌ MultiDocFusion: Hierarchical and Multimodal Chunking Pipeline for Enhanced RAG on Long Industrial Documents

Authors: Joongmin Shin, Chanjun Park, Jeongbae Park, Jaehyung Seo, Heuiseok Lim | Journal/Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12352v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

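Scoring rationale: The paper's core contribution is improving RAG systems on long industrial documents through a multimodal chunking pipeline (combining vision-based parsing, OCR, and an LLM) that raises retrieval precision and QA performance. It is highly relevant to the RAG keyword (10 points), uses an LLM for document parsing (8 points), relates to context extension through its handling of long documents (5 points), and has some connection to AI for Science through its industrial-document application (5 points). Other keywords such as MoE, SFT, and RLHF are not covered and score 0.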

!!! tip deepseek-chat TL;DR

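This paper targets the information loss that conventional text chunking causes in long industrial documents. It proposes MultiDocFusion, a multimodal chunking pipeline that reconstructs the document hierarchy by combining vision-based parsing, OCR, and an LLM, improving retrieval precision by 8-15% and ANLS QA scores by 2-3% over baselines.

Abstract translation

Retrieval-augmented generation (RAG)-based question answering has become a powerful approach for processing long industrial documents. However, conventional text chunking methods often ignore the complex and lengthy structure of industrial documents, leading to information loss and degraded answer quality. To address this, we propose MultiDocFusion, a multimodal chunking pipeline that integrates: (i) detection of document regions through vision-based document parsing, (ii) text extraction from these regions using optical character recognition (OCR), (iii) reconstruction of the document structure into a hierarchical tree via LLM-based document section hierarchical parsing (DSHP-LLM), and (iv) construction of hierarchical chunks through depth-first search (DFS)-based grouping. Extensive experiments on multiple industrial benchmarks show that, compared with baselines, MultiDocFusion improves retrieval precision by 8-15% and ANLS QA scores by 2-3%, highlighting the critical role of explicitly exploiting document hierarchy in multimodal document QA. These significant gains underscore the need for structure-aware chunking to enhance the fidelity of RAG-based QA systems.

Abstract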

RAG-based QA has emerged as a powerful method for processing long industrial documents. However, conventional text chunking approaches often neglect complex and long industrial document structures, causing information loss and reduced answer quality. To address this, we introduce MultiDocFusion, a multimodal chunking pipeline that integrates: (i) detection of document regions using vision-based document parsing, (ii) text extraction from these regions via OCR, (iii) reconstruction of document structure into a hierarchical tree using large language model (LLM)-based document section hierarchical parsing (DSHP-LLM), and (iv) construction of hierarchical chunks through DFS-based grouping. Extensive experiments across industrial benchmarks demonstrate that MultiDocFusion improves retrieval precision by 8-15% and ANLS QA scores by 2-3% compared to baselines, emphasizing the critical role of explicitly leveraging document hierarchy for multimodal document-based QA. These significant performance gains underscore the necessity of structure-aware chunking in enhancing the fidelity of RAG-based QA systems.
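
Step (iv) of the pipeline, DFS-based grouping, can be illustrated with a minimal Python sketch. The abstract does not specify the grouping rule, so everything here is an assumption for illustration: the `Section` node type, the `dfs_chunks` helper, the heading-path prefix, and the `max_chars` size budget are hypothetical and are not taken from MultiDocFusion.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of a hypothetical document section tree."""
    title: str
    text: str = ""
    children: list[Section] = field(default_factory=list)


def dfs_chunks(root: Section, max_chars: int = 200) -> list[str]:
    """Group sections into chunks while walking the tree depth-first.

    Each section's text is prefixed with its heading path, and consecutive
    sections (in DFS order) are merged until max_chars would be exceeded.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0

    def visit(node: Section, path: tuple[str, ...]) -> None:
        nonlocal current, size
        path = path + (node.title,)
        if node.text:
            piece = " > ".join(path) + "\n" + node.text
            # Flush the running chunk if adding this piece would overflow it.
            if current and size + len(piece) > max_chars:
                chunks.append("\n\n".join(current))
                current, size = [], 0
            current.append(piece)
            size += len(piece)
        for child in node.children:
            visit(child, path)

    visit(root, ())
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because every piece carries its heading path, a retrieved chunk still shows where it sat in the document hierarchy, which is the kind of structural context the abstract argues conventional chunking loses.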

Keywords: RAG, multimodal chunking, document hierarchy, industrial documents, LLM-based parsing, retrieval precision, QA systems, document structure

119. ❌ SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

Authors: Kathakoli Sengupta, Kai Ao, Paola Cascante-Bonilla | Journal/Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.13035v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

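Scoring rationale: The paper studies evaluation methods for 3D indoor scene synthesis; its core contribution is the symbolic evaluator SceneCritic. Keyword relevance: 1) the paper explicitly uses LLMs and VLMs to generate indoor scenes, so 'Large Language Models' receives a relevance of 8; 2) the paper notes that existing evaluation methods suffer from hallucination, so 'Hallucination Mitigation' receives a relevance of 5; 3) other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and agent systems, are not covered and receive 0. The paper is an application of large models to a specific domain (3D scene generation) and does not contribute innovations to the underlying large-model techniques.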

!!! tip deepseek-chat TL;DR

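This paper addresses the instability of evaluating 3D indoor scenes generated by LLMs and VLMs. It proposes SceneCritic, a symbolic evaluator built on a structured spatial ontology; experiments show that it aligns better with human judgments than VLM-based evaluators and that image-based VLM refinement is the most effective critic modality for semantic and orientation correction.

Abstract translation

Large language models (LLMs) and vision-language models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges scoring rendered views, which makes judgments sensitive to viewpoint, prompt wording, and hallucination. When the evaluator is unstable, it is hard to tell whether a model has produced a spatially plausible scene or whether the score merely reflects differences in viewpoint, rendering, or prompt. This paper proposes SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic's constraints are grounded in SceneOnto, a structured spatial ontology that we build by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneCritic traverses this ontology to jointly verify semantic, orientation, and geometric consistency across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. In addition, we pair SceneCritic with an iterative refinement test bed to study how models build and revise spatial structure under different critic modalities: a rule-based critic that uses collision constraints as feedback, an LLM critic that processes the layout as text, and a VLM critic that judges from rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators; (b) text-only LLMs can outperform VLMs on semantic layout quality; and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.

Abstract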

Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic’s constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneCritic traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.
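
The rule-based critic mentioned in the abstract uses collision constraints as feedback. A minimal Python sketch of that idea follows, assuming axis-aligned rectangular footprints on the floor plan. The `Footprint` type, the `overlaps` test, and the feedback wording are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Footprint:
    """Axis-aligned floor-plan rectangle (hypothetical representation)."""
    name: str
    x: float  # x of the minimum corner
    y: float  # y of the minimum corner
    w: float  # extent along x
    d: float  # extent along y


def overlaps(a: Footprint, b: Footprint) -> bool:
    """True if the two rectangles share interior area (touching edges do not count)."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.d and b.y < a.y + a.d)


def collision_feedback(layout: list[Footprint]) -> list[str]:
    """Return one textual violation per colliding pair of objects."""
    return [f"{a.name} overlaps {b.name}"
            for a, b in combinations(layout, 2) if overlaps(a, b)]
```

In a refinement loop of the kind the abstract describes, the returned messages would be passed back to the generating model as feedback, and an empty list would mean no collision was detected.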

Keywords: 3D indoor scene synthesis, symbolic evaluator, Large Language Models, Vision-Language Models, spatial ontology, layout evaluation, hallucination mitigation, iterative refinement

120. ❌ Toward Autonomous Long-Horizon Engineering for ML Research

Authors: Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia | Journal/Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.13018v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

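Scoring rationale: The paper proposes the AiScientist system for autonomous long-horizon ML research engineering. Its core is a multi-agent system (LLM Agents, Multi-agent Systems) with task coordination (Agent Coordination), involving long-horizon reasoning (System 2 Thinking, Multi-step Reasoning), self-improvement (Self-Correction), and tool use (Tool Use), and it is an application of AI for Science to automating ML research. The system is built on LLMs (Large Language Models) but does not go into LLM technical details (such as MoE, training methods, or optimization techniques), so the related keywords score low or 0.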

!!! tip deepseek-chat TL;DR

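This paper studies how to achieve autonomous long-horizon ML research engineering. It proposes AiScientist, which combines hierarchical orchestration with durable state continuity, improving the PaperBench score by 10.54 points on average over the best matched baseline and reaching 81.82 Any Medal% on MLE-Bench Lite.

Abstract translation

Autonomous AI research is advancing rapidly, but long-horizon ML research engineering remains challenging: agents must maintain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or even days. This paper presents AiScientist, a system built on one core principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts (such as analyses, plans, code, and experimental evidence) rather than relying primarily on conversational handoffs, yielding thin control over thick state. On two complementary benchmarks, AiScientist improves the PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance: removing it lowers the PaperBench score by 6.41 points and the MLE-Bench Lite score by 31.82 points. These results indicate that long-horizon ML research engineering is fundamentally a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.

Abstract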

Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that File-as-Bus protocol is a key driver of performance, reducing PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points when removed. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.

Keywords: Autonomous AI research, Long-horizon ML research engineering, Multi-agent systems, Hierarchical orchestration, File-as-Bus workspace, Durable state continuity, AiScientist, PaperBench

121. ❌ PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models

Authors: Han Bao, Penghao Zhang, Yue Huang, Zhengqing Yuan, Yanchi Ru, Rui Su, Yujun Zhou, Xiangqi Wang, Kehan Guo, Nitesh V Chawla, Yanfang Ye, Xiangliang Zhang | Journal/Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12995v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

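Scoring rationale: The paper's core is the comprehension and application of large language models in the public policy domain, directly involving the keywords 'Large Language Models' and 'Mixture of Experts' (highly relevant, 10 points each). It evaluates models' reasoning ability in policy comprehension, which touches on 'Chain of Thought' and 'System 2 Thinking' (some relevance, 5 points each). Other keywords, such as small models, training methods, inference optimization, and AI-for-science applications, are not covered (0 points).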

!!! tip deepseek-chat TL;DR

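This paper addresses the shortcomings of large language models in public policy comprehension. It proposes PolicyBench, the first cross-system (US-China) policy comprehension benchmark, and PolicyMoE, a Mixture-of-Experts model; results show that models perform better on application-oriented policy tasks but remain limited in policy understanding.

Abstract translation

Large language models are increasingly integrated into real-world decision-making, including the domain of public policy. However, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present PolicyBench, the first large-scale cross-system (US-China) benchmark for evaluating policy comprehension, comprising 21K cases across a broad range of policy areas and capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) Memorization: factual recall of policy knowledge; (2) Understanding: conceptual and contextual reasoning; and (3) Application: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose PolicyMoE, a domain-specialized Mixture-of-Experts model whose expert modules correspond to each cognitive level. The proposed model performs better on application-oriented policy tasks than on memorization or conceptual understanding tasks, and achieves the highest accuracy on structured reasoning tasks. Our findings reveal key limitations of current large language models in policy understanding and point to paths toward more reliable, policy-focused models.

Abstract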

Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present PolicyBench, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas, capturing the diversity and complexity of real-world governance. Following Bloom’s taxonomy, the benchmark assesses three core capabilities: (1) Memorization: factual recall of policy knowledge, (2) Understanding: conceptual and contextual reasoning, and (3) Application: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose PolicyMoE, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed model demonstrates stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yields the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.

Keywords: Large Language Models, Policy Comprehension, Benchmark, Mixture of Experts, Reasoning, Public Policy, Domain Adaptation, Cognitive Levels

122. ❌ Accelerating Speculative Decoding with Block Diffusion Draft Trees

Authors: Liran Ringel, Yaniv Romano | Journal/Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12989v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	15.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

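Scoring rationale: The paper focuses on accelerating speculative decoding, one of the core techniques for speeding up LLM inference. It directly matches the keyword 'Speculative Decoding OR Inference Acceleration', which is the core of the work, and is therefore given the maximum score of 15. The paper also concerns inference acceleration for large language models (LLMs), so 'Large Language Models OR LLMs OR Foundation Models' receives a relevance of 10. Other keywords, such as MoE, quantization, alignment, and RAG, are not covered and score 0.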

!!! tip deepseek-chat TL;DR

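This paper proposes DDTree, a method that improves speculative decoding by building a diffusion draft tree. It builds on DFlash, a block diffusion drafter that outperforms strong autoregressive drafters such as EAGLE-3, and verifies multiple drafted continuations per round to speed up LLM inference.

Abstract translation

Speculative decoding accelerates inference for autoregressive language models by using a lightweight draft model to predict multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion draft model can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, surpassing strong autoregressive draft models such as EAGLE-3. However, vanilla DFlash still verifies only a single draft trajectory per round, which may limit its acceptance length. This paper proposes DDTree (Diffusion Draft Tree), a method that builds a draft tree directly from the per-position distributions of a block diffusion draft model. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations most likely to match the target model, according to a surrogate defined by the draft model's output. The resulting draft tree is verified efficiently in a single target-model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these improvements place DDTree among the leading approaches to speculative decoding.

Abstract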

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model’s output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.
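
The best-first tree construction and the ancestor-only mask can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DDTree's actual algorithm: the abstract does not define the surrogate, so the sketch assumes a path's score is the product of the drafter's per-position probabilities. The names `build_draft_tree` and `ancestor_mask` are hypothetical.

```python
import heapq
import math


def build_draft_tree(position_probs, node_budget):
    """Pick up to node_budget token prefixes in best-first order.

    position_probs[i] maps token -> drafter probability at draft position i.
    Assumed surrogate: a prefix's score is the product of its per-position
    probabilities, tracked here as a summed negative log-probability.
    """
    if not position_probs:
        return []
    heap = [(-math.log(p), (tok,))
            for tok, p in position_probs[0].items() if p > 0]
    heapq.heapify(heap)
    tree = []
    while heap and len(tree) < node_budget:
        cost, prefix = heapq.heappop(heap)  # lowest cost = most likely prefix
        tree.append(prefix)
        depth = len(prefix)
        if depth < len(position_probs):
            # Children are pushed only after their parent is selected,
            # so the selected set is always a valid (prefix-closed) tree.
            for tok, p in position_probs[depth].items():
                if p > 0:
                    heapq.heappush(heap, (cost - math.log(p), prefix + (tok,)))
    return tree


def ancestor_mask(prefixes):
    """mask[i][j] is True when node j is node i itself or one of its ancestors.

    A real verifier would also let every node attend to the shared context.
    """
    return [[q == p[:len(q)] for q in prefixes] for p in prefixes]
```

With such a mask, every root-to-leaf path can be scored in one forward pass because each node only sees the tokens on its own path. For realistic vocabularies the per-position distributions would need truncating (for example to the top few tokens) before calling `build_draft_tree`; that truncation is also an assumption of this sketch.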

Keywords: speculative decoding, autoregressive language models, diffusion draft tree, inference acceleration, block diffusion, draft model, target model, DDTree

123. ❌ GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

Authors: Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze | Journal/Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12978v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

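Scoring rationale: The paper evaluates the cross-script generalization of optical character recognition (OCR) models. It sits at the intersection of computer vision and natural language processing, but its core is OCR system evaluation rather than innovation in large-model techniques. All keywords focus on large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), whereas this paper mainly evaluates vision-language models on OCR tasks. Although it notes that models may depend on language model pretraining, it does not examine any LLM technical detail, innovation, or application in depth. All keywords are therefore judged unrelated to the paper and score 0.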

!!! tip deepseek-chat TL;DR

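By building the GlotOCR Bench benchmark, this paper evaluates the generalization of OCR models across 100+ Unicode scripts. It finds that current models perform well on only a few scripts, that even the strongest models fail to generalize beyond thirty scripts, and that performance tracks script-level pretraining coverage.

Abstract translation

With the rise of vision-language models, optical character recognition (OCR) has advanced rapidly, yet its evaluation remains concentrated on a small number of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark covering 100+ Unicode scripts for evaluating OCR generalization. The benchmark contains clean and degraded image variants rendered from real multilingual texts. Images are rendered with fonts from the Google Fonts repository, shaped with HarfBuzz, and rasterized with FreeType, and both left-to-right (LTR) and right-to-left (RTL) scripts are supported. We manually reviewed samples of the rendered images to verify that all scripts display correctly. We evaluate a broad range of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and that even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly correlates with script-level pretraining coverage, suggesting that current OCR systems depend on language model pretraining as much as on visual recognition. When faced with unfamiliar scripts, models either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark dataset and pipeline for reproducibility. Pipeline code: https://github.com/cisnlp/glotocr-bench, benchmark data: https://hf.co/datasets/cis-lmu/glotocr-bench.

Abstract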

Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.

Keywords: Optical Character Recognition, OCR, Vision-Language Models, Unicode Scripts, Benchmark Evaluation, Generalization, Multilingual Texts, Pretraining Coverage

124. ❌ MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

Authors: Chung-Ming Chien, Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu, Alexandre Défossez | Journal/Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12928v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

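Scoring rationale: MoshiRAG focuses on improving the factuality of speech language models, using an asynchronous retrieval-augmented generation (RAG) framework to address the factuality problem of full-duplex speech models. The core relevant keywords are 'Retrieval-Augmented Generation' (highly relevant, 10 points), because the proposed MoshiRAG framework is essentially an application of RAG to speech models; 'Hallucination Mitigation' (8 points), because the goal of improving factuality corresponds directly to hallucination mitigation; and 'Large Language Models' (8 points), because speech language models, although the paper's focus, are a variant of large language models applied to speech. Other keywords, such as MoE, quantization, and inference acceleration, are not mentioned in the abstract and are rated 0.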

!!! tip deepseek-chat TL;DR

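This paper addresses the limited factuality of full-duplex speech language models. It proposes MoshiRAG, an asynchronous retrieval framework that uses external knowledge sources to improve the factual accuracy of responses while preserving real-time interactivity, and it shows strong performance on out-of-domain mathematical reasoning tasks.

Abstract translation

Recent advances in speech-to-speech language models have improved the naturalness of conversational AI. Among them, full-duplex models are distinguished by their real-time interactivity, including the handling of pauses, interruptions, and backchannels. However, improving their factual accuracy remains an open challenge. Scaling up the model could close this gap, but it would make real-time inference prohibitively expensive. This work proposes MoshiRAG, a modular approach that combines a compact full-duplex interface with a selective retrieval mechanism to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By exploiting the natural temporal gap between the onset of a response and the delivery of its core information, retrieval can complete while a natural conversation flow is maintained. With this approach, MoshiRAG reaches factual accuracy comparable to the best publicly released non-duplex speech language models while retaining the interactivity inherent to full-duplex systems. In addition, our flexible design supports plug-and-play retrieval methods without retraining and shows strong performance on out-of-domain mathematical reasoning tasks.

Abstract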

Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.
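
The timing idea in the abstract, finishing retrieval during the gap between response onset and core information, can be sketched with Python's `asyncio`. The sketch is hypothetical throughout: `retrieve`, `speak`, `respond`, the filler words, and the simulated latencies are illustrative stand-ins and do not reflect MoshiRAG's actual interfaces.

```python
import asyncio


async def retrieve(query: str) -> str:
    """Stand-in for an external knowledge source (hypothetical)."""
    await asyncio.sleep(0.05)  # simulated retrieval latency
    return f"[fact about {query}]"


async def speak(words: list[str], delay: float = 0.02) -> list[str]:
    """Stand-in for streaming speech output, one word per time step."""
    spoken = []
    for word in words:
        await asyncio.sleep(delay)
        spoken.append(word)
    return spoken


async def respond(query: str, knowledge_demanding: bool) -> list[str]:
    """Start retrieval, speak an opening while it runs, then add the fact."""
    opening = ["Good", "question,", "let", "me", "see."]
    if not knowledge_demanding:
        return await speak(opening)
    retrieval = asyncio.create_task(retrieve(query))  # runs in the background
    spoken = await speak(opening)                     # overlaps with retrieval
    fact = await retrieval                            # usually ready by now
    return spoken + [fact]
```

Because the retrieval task is created before the opening is spoken, the two overlap, so the listener hears the opening without waiting for the lookup; the `knowledge_demanding` flag stands in for the model's own decision about when retrieval is needed.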

Keywords: speech-to-speech language models, full-duplex models, factuality improvement, retrieval-augmented generation, asynchronous retrieval, knowledge-demanding queries, real-time interactivity, mathematical reasoning

125. ❌ MetFuse: Figurative Fusion between Metonymy and Metaphor

Authors: Saptarshi Ghosh, Tianyu Jiang Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12919v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

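Scoring rationale: The paper studies the fusion of metonymy and metaphor, builds the MetFuse dataset, and tests large language models (LLMs) in its experiments. It is therefore only moderately related to "Large Language Models OR LLMs OR Foundation Models" (5 points), because LLMs are used as one of the evaluation tools. All other keywords are unrelated to the paper's core content (linguistics, dataset construction, classification tasks) and score 0.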
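
For reference, the pass mark in the score line follows from the report configuration, which gives the formula 5.0 + 0.8 × total weight and a total weight of 27.0. The sketch below recomputes it. Two points are assumptions rather than documented behavior: that a paper's total is the plain sum of the Score column, and that a total equal to the pass mark counts as passing. The function names are hypothetical.

```python
def passing_score(total_weight: float, base: float = 5.0, per_weight: float = 0.8) -> float:
    """Pass mark from the report configuration: base + per_weight * total weight."""
    return base + per_weight * total_weight


def evaluate(keyword_scores, total_weight: float):
    """Sum the per-keyword scores (assumed aggregation) and compare with the pass mark."""
    total = sum(keyword_scores)
    return total, total >= passing_score(total_weight)


if __name__ == "__main__":
    print(round(passing_score(27.0), 1))   # pass mark for 27 keywords of weight 1.0
    print(evaluate([0.0] * 27, 27.0))      # every Score cell is 0.0, so the paper fails
```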

!!! tip deepseek-chat TL;DR

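This paper studies how metonymy and metaphor fuse in natural language. It builds MetFuse, the first dedicated dataset for this phenomenon, shows experimentally that augmenting training data with it improves classification performance, and finds that the presence of a metaphor makes metonymy easier to identify.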

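Abstract (translation)

Metonymy and metaphor often co-occur in natural language, yet most work in computational linguistics has examined them in isolation. This paper proposes a framework that turns a literal sentence into three figurative variants: a metonymic sentence, a metaphoric sentence, and a hybrid sentence. Using this framework, the authors build MetFuse, the first dataset dedicated to the fusion of metonymy and metaphor, containing 1,000 human-verified, meaning-aligned quadruplets (4,000 sentences in total). Extrinsic experiments on eight existing benchmarks show that augmenting training data with MetFuse consistently improves metonymy and metaphor classification, with hybrid examples giving the largest gains on metonymy tasks. The dataset is also used to analyze how the co-occurrence of the two figurative types affects each of them. Both human annotators and large language models identify metonymy better in hybrid sentences than in metonymy-only sentences, which indicates that the presence of a metaphor makes the metonymic noun more explicit. The dataset is publicly available at https://github.com/cincynlp/MetFuse.

Abstract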

Metonymy and metaphor often co-occur in natural language, yet computational work has studied them largely in isolation. We introduce a framework that transforms a literal sentence into three figurative variants: metonymic, metaphoric, and hybrid. Using this framework, we construct MetFuse, the first dedicated dataset of figurative fusion between metonymy and metaphor, containing 1,000 human-verified meaning-aligned quadruplets totaling 4,000 sentences. Extrinsic experiments on eight existing benchmarks show that augmenting training data with MetFuse consistently improves both metonymy and metaphor classification, with hybrid examples yielding the largest gains on metonymy tasks. Using this dataset, we also analyze how the presence of one figurative type impacts another. Our findings show that both human annotators and large language models better identify metonymy in hybrid sentences than in metonymy-only sentences, demonstrating that the presence of a metaphor makes a metonymic noun more explicit. Our dataset is publicly available at: https://github.com/cincynlp/MetFuse.

Keywords: Metonymy, Metaphor, Figurative Language, Dataset, Natural Language Processing, Classification, Large Language Models, Hybrid Sentences

126. ❌ Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Authors: Eliya Habba, Itay Itzhak, Asaf Yehudai, Yotam Perlitz, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen, Gabriel Stanovsky Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12843v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

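Scoring rationale: The paper focuses on an evaluation framework for LLM benchmarking. Its core concerns LLM evaluation and benchmarking methodology, so it is highly related to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not address the specific techniques or applications behind the other keywords (MoE, SLMs, training methods, inference techniques, agent systems, AI for science, and so on), so those keywords score 0.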

!!! tip deepseek-chat TL;DR

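This paper proposes a framework based on multidimensional Item Response Theory that uses fixed-parameter calibration and anchor items to solve the problem of incomparable scores in language model benchmarking, which arises when datasets are introduced over time and models are evaluated on different samples. Experiments show that 100 anchor questions per dataset suffice to predict full-evaluation performance within 2-3 percentage points while preserving model rankings.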

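Abstract (translation)

The rapid turnover of language models and benchmarks makes evaluating every model on every full dataset increasingly costly. In practice, models are often evaluated on different samples, so results are hard to compare directly across studies. To address this, the authors propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the existing evaluation suite while keeping previously calibrated item parameters fixed. The method supports a realistic evaluation setting: datasets are introduced gradually over time, and each model is tested only on the datasets available when it is evaluated. A fixed set of anchor items per dataset makes results from different evaluation periods directly comparable. In large-scale experiments covering more than 400 models, the framework needs only 100 anchor items per dataset to keep the prediction error for full-evaluation performance within 2-3 percentage points, with a Spearman rank correlation of ρ ≥ 0.9. This indicates that a benchmark suite can be extended over time while preserving score comparability, at a constant evaluation cost per new dataset. Code is released at https://github.com/eliyahabba/growing-pains.

Abstract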

The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item parameters fixed. Our approach supports a realistic evaluation setting in which datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation, while a fixed anchor set for each dataset is used so that results from different evaluation periods can be compared directly. In large-scale experiments on more than 400 models, our framework predicts full-evaluation performance within 2-3 percentage points using only 100 anchor questions per dataset, with Spearman $\rho \geq 0.9$ for ranking preservation, showing that it is possible to extend benchmark suites over time while preserving score comparability, at a constant evaluation cost per new dataset. Code available at https://github.com/eliyahabba/growing-pains

Keywords: LLM benchmarking, Item Response Theory, anchor items, parameter calibration, evaluation framework, score comparability, model evaluation, benchmark extension
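
To make the idea of fixed-parameter calibration concrete, here is a small sketch under stated assumptions. It uses a standard compensatory multidimensional 2PL IRT model, P(correct) = sigmoid(a · θ + d), and estimates a new model's ability vector θ by gradient ascent on the log-likelihood of its anchor responses while the item parameters stay frozen. The abstract does not specify the exact model variant or estimator, so both are assumptions, and the item parameters below are invented for illustration.

```python
import math


def p_correct(theta, a, d):
    """Multidimensional 2PL: probability of a correct answer, sigmoid(a . theta + d)."""
    z = sum(ai * ti for ai, ti in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-z))


def estimate_ability(responses, items, dims, steps=500, lr=0.1):
    """Fit theta by gradient ascent on the log-likelihood; item parameters stay fixed."""
    theta = [0.0] * dims
    for _ in range(steps):
        grad = [0.0] * dims
        for y, (a, d) in zip(responses, items):
            err = y - p_correct(theta, a, d)  # derivative of the log-likelihood w.r.t. z
            for k in range(dims):
                grad[k] += err * a[k]
        theta = [t + lr * g / len(items) for t, g in zip(theta, grad)]
    return theta


def predicted_accuracy(theta, items):
    """Expected accuracy over a set of calibrated items for a given ability vector."""
    return sum(p_correct(theta, a, d) for a, d in items) / len(items)


# Hypothetical calibrated anchor items: (discrimination vector a, intercept d).
ANCHORS = [([1.0, 0.0], 0.5), ([0.0, 1.0], -0.5), ([1.0, 1.0], 0.0), ([0.5, 0.5], 1.0)]

if __name__ == "__main__":
    theta = estimate_ability([1, 0, 1, 1], ANCHORS, dims=2)
    print(theta, predicted_accuracy(theta, ANCHORS))
```

In a real setting, `predicted_accuracy` would be evaluated on the full item bank of a dataset rather than on the anchors themselves, which is how a short anchor set could stand in for a full evaluation.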

127. ❌ EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution

Authors: Shiyu He, Minchi Kuang, Mengxian Wang, Bin Hu, Tingxiang Gu Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12776v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

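Scoring rationale: EvoSpark focuses on applying LLM-based multi-agent systems to narrative generation, and its core is solving the consistency problem in long-horizon narrative evolution. It is therefore highly related to "Large Language Models", "LLM Agents", and "Multi-agent Systems" (10 points each), since these form the paper's underlying technical framework. Other keywords such as MoE, quantization, inference acceleration, and alignment are not covered in technical detail and score 0.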

!!! tip deepseek-chat TL;DR

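The paper proposes the EvoSpark framework, which uses a stratified narrative memory and a generative mise-en-scène mechanism to address the logical consistency of long-horizon narrative evolution in LLM-based multi-agent systems, significantly improving the coherence and expressiveness of the generated narratives.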

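Abstract (translation)

Realizing endogenous narrative evolution in multi-agent systems built on large language models is hindered by the inherent stochasticity of generative emergence. Specifically, long-horizon simulations face two challenges: social memory stacking, where conflicting relational states keep accumulating without being resolved, and narrative-spatial dissonance, where spatial logic becomes detached from the evolving plot. To bridge this gap, the authors propose EvoSpark, a framework designed to sustain logically coherent long-horizon narratives within endogenous interactive agent societies. To ensure consistency, the Stratified Narrative Memory uses a Role Socio-Evolutionary Base as living cognition, dynamically metabolizing experiences to resolve historical conflicts. Complementing it, the Generative Mise-en-Scène mechanism enforces Role-Location-Plot alignment so that character presence stays synchronized with the narrative flow. Underpinning both is the Unified Narrative Operation Engine, which integrates an Emergent Character Grounding Protocol that turns stochastic sparks into persistent characters. The engine provides a substrate that can expand a minimal premise into an open-ended, evolving story world. Experiments show that EvoSpark significantly outperforms baseline methods across different paradigms and can sustain the generation of expressive, coherent narrative experiences.

Abstract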

Realizing endogenous narrative evolution in LLM-based multi-agent systems is hindered by the inherent stochasticity of generative emergence. In particular, long-horizon simulations suffer from social memory stacking, where conflicting relational states accumulate without resolution, and narrative-spatial dissonance, where spatial logic detaches from the evolving plot. To bridge this gap, we propose EvoSpark, a framework specifically designed to sustain logically coherent long-horizon narratives within Endogenous Interactive Agent Societies. To ensure consistency, the Stratified Narrative Memory employs a Role Socio-Evolutionary Base as living cognition, dynamically metabolizing experiences to resolve historical conflicts. Complementarily, Generative Mise-en-Scène mechanism enforces Role-Location-Plot alignment, synchronizing character presence with the narrative flow. Underpinning these is the Unified Narrative Operation Engine, which integrates an Emergent Character Grounding Protocol to transform stochastic sparking into persistent characters. This engine establishes a substrate that expands a minimal premise into an open-ended, evolving story world. Experiments demonstrate that EvoSpark significantly outperforms baselines across diverse paradigms, enabling the sustained generation of expressive and coherent narrative experiences.

Keywords: LLM-based multi-agent systems, endogenous narrative evolution, long-horizon narratives, agent societies, narrative coherence, social memory stacking, generative emergence, narrative-spatial dissonance

128. ❌ Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning

Authors: Timon Ziegenbein, Maja Stahl, Henning Wachsmuth Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12770v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

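Scoring rationale: The paper studies how to align LLMs with humans in text editing and proposes a reinforcement learning method (GRPO) that optimizes the editing strategy so that it comes closer to human editing patterns. It is therefore highly related to "Large Language Models" (10 points), since LLMs are the object of study; highly related to "RLHF" (10 points), since it uses reinforcement learning optimization (GRPO is a policy-optimization algorithm of the kind used in RLHF-style training); and moderately related to "Self-Correction" (5 points), since the editing task involves the LLM improving the appropriateness of text. Other keywords such as MoE, SFT, and RAG are not covered and score 0.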

!!! tip deepseek-chat TL;DR

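This paper addresses the mismatch between how LLMs and humans edit inappropriate arguments. It proposes a reinforcement learning method that optimizes the semantic similarity, fluency, and pattern conformity of edits so that the LLM produces more human-like, self-contained edit suggestions, and it outperforms existing methods in both automatic and human evaluation.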

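Abstract (translation)

Editing human-written text has become a standard use case for large language models (LLMs), for example to make a person's arguments better suited to a particular discussion. Comparing human and LLM-generated edits, however, reveals different editing strategies: LLMs usually make many scattered changes and tend to alter the meaning noticeably, whereas humans prefer to bundle interdependent changes into self-contained, meaning-preserving edits. This paper proposes a reinforcement learning method that teaches LLMs to edit in a human-like way in order to improve the appropriateness of arguments. The method produces self-contained, sentence-level edit suggestions that can be accepted or rejected independently. Training uses group relative policy optimization with a multi-component reward function that jointly optimizes edit-level semantic similarity, fluency, and pattern conformity together with argument-level appropriateness. In automatic and human evaluation, the method outperforms competitive baselines and the current state of the art in human-like editing, and multi-round editing reaches appropriateness close to that of full rewriting.

Abstract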

Editing human-written text has become a standard use case of large language models (LLMs), for example, to make one’s arguments more appropriate for a discussion. Comparing human to LLM-generated edits, however, we observe a mismatch in editing strategies: While LLMs often perform multiple scattered edits and tend to change meaning notably, humans rather encapsulate dependent changes in self-contained, meaning-preserving edits. In this paper, we present a reinforcement learning approach that teaches LLMs human-like editing to improve the appropriateness of arguments. Our approach produces self-contained sentence-level edit suggestions that can be accepted or rejected independently. We train the approach using group relative policy optimization with a multi-component reward function that jointly optimizes edit-level semantic similarity, fluency, and pattern conformity as well as argument-level appropriateness. In automatic and human evaluation, it outperforms competitive baselines and the state of the art in human-like editing, with multi-round editing achieving appropriateness close to full rewriting.

Keywords: Large Language Models, Reinforcement Learning, Text Editing, Human-like Editing, Appropriateness, Group Relative Policy Optimization, Argumentation, Self-contained Edits

129. ❌ NaviRAG

Authors: Jihao Dai, Dingjun Wu, Yuxuan Chen, Zheni Zeng, Yukun Yan, Zhenghao Liu, Maosong Sun Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12766v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

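Scoring rationale: NaviRAG's core contribution is a novel retrieval-augmented generation (RAG) framework that organizes knowledge documents into a hierarchy and uses an LLM agent to actively navigate the knowledge records, enabling multi-granular evidence localization and dynamic retrieval planning. It is therefore highly related to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (15 points), as this is the paper's central topic; related to "Large Language Models OR LLMs OR Foundation Models" (10 points), as the paper uses an LLM as the agent that navigates the knowledge; and related to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points), as the paper explicitly describes an LLM agent performing active navigation. Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, and AI for Science are unrelated to the paper and score 0.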

!!! tip deepseek-chat TL;DR

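This paper targets the difficulty that conventional retrieval-augmented generation (RAG) has in dynamically synthesizing multi-granular information for complex tasks. It proposes NaviRAG, a framework that combines hierarchical knowledge organization with active navigation by an LLM agent, and it consistently improves retrieval recall and answer performance on long-document question answering benchmarks.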

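Abstract (translation)

Retrieval-augmented generation (RAG) usually relies on a flat retrieval paradigm that maps a query directly to static, isolated text segments. This approach falls short on more complex tasks that require conditional retrieval and dynamic synthesis of information across levels of granularity (for example, from broad concepts to specific evidence). To close this gap, the authors propose NaviRAG, a new framework that moves from passive segment retrieval to active knowledge navigation. NaviRAG first organizes the knowledge documents into a hierarchical structure that preserves semantic relationships from coarse-grained topics down to fine-grained details. Using the reorganized knowledge records, a large language model (LLM) agent actively navigates them, iteratively identifying information gaps and retrieving relevant content from the most appropriate level of granularity. Extensive experiments on long-document question answering benchmarks show that NaviRAG consistently improves retrieval recall and end-to-end answer performance over conventional RAG baselines. Ablation studies confirm that the gains come from the method's capacity for multi-granular evidence localization and dynamic retrieval planning. The authors also discuss the method's efficiency, applicable scenarios, and future directions, with the aim of making RAG systems more intelligent and autonomous.

Abstract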

Retrieval-augmented generation (RAG) typically relies on a flat retrieval paradigm that maps queries directly to static, isolated text segments. This approach struggles with more complex tasks that require the conditional retrieval and dynamic synthesis of information across different levels of granularity (e.g., from broad concepts to specific evidence). To bridge this gap, we introduce NaviRAG, a novel framework that shifts from passive segment retrieval to active knowledge navigation. NaviRAG first structures the knowledge documents into a hierarchical form, preserving semantic relationships from coarse-grained topics to fine-grained details. Leveraging this reorganized knowledge records, a large language model (LLM) agent actively navigates the records, iteratively identifying information gaps and retrieving relevant content from the most appropriate granularity level. Extensive experiments on long-document QA benchmarks show that NaviRAG consistently improves both retrieval recall and end-to-end answer performance over conventional RAG baselines. Ablation studies confirm performance gains stem from our method’s capacity for multi-granular evidence localization and dynamic retrieval planning. We further discuss efficiency, applicable scenario, and future directions of our method, hoping to make RAG systems more intelligent and autonomous.

Keywords: Retrieval-Augmented Generation, RAG, Knowledge Navigation, Hierarchical Knowledge, LLM Agent, Multi-granular Retrieval, Dynamic Retrieval Planning, Long-document QA
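
The contrast between flat retrieval and hierarchical navigation can be sketched as a walk down a topic tree. The example below is an illustration only, not the NaviRAG algorithm: the `Node` structure, the toy handbook, and the word-overlap score are hypothetical, and the overlap score merely stands in for the decision that the paper assigns to an LLM agent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One record in a hypothetical knowledge hierarchy (coarse topic to fine detail)."""
    text: str
    children: list = field(default_factory=list)


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def navigate(root: Node, query: str, max_steps: int = 5) -> list:
    """Descend from coarse to fine, stopping when no child looks relevant."""
    path = [root]
    node = root
    for _ in range(max_steps):
        if not node.children:
            break
        best = max(node.children, key=lambda child: overlap(query, child.text))
        if overlap(query, best.text) == 0:
            break  # nothing below is relevant; stay at the current granularity
        node = best
        path.append(node)
    return path


HANDBOOK = Node("company handbook", [
    Node("vacation policy and leave", [
        Node("annual leave is 25 days per year"),
        Node("sick leave requires a doctor note"),
    ]),
    Node("expense policy for travel", [
        Node("flights must be booked in economy class"),
    ]),
])

if __name__ == "__main__":
    for step in navigate(HANDBOOK, "how many days of annual leave"):
        print(step.text)
```

Returning the whole path rather than only the final node keeps the coarse context alongside the fine-grained evidence, which is one plausible way to serve queries that need more than one level of granularity.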

130. ❌ Generating Effective CoT Traces for Mitigating Causal Hallucination

Authors: Yiheng Zhao, Jun Yan Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12748v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	15.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

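Scoring rationale: The paper studies causal hallucination of large language models (LLMs) in event causality identification, with particular attention to small models (≤1.5B parameters), and mitigates hallucination by fine-tuning on effective chain-of-thought (CoT) traces that it generates. It is therefore highly related to "Large Language Models", "Small Language Models", "Post-training", "Chain of Thought", and "Hallucination Mitigation" (10-15 points). "System 2 Thinking" is moderately related (5 points), since CoT involves in-depth reasoning. Other keywords such as MoE, Scaling Laws, and RLHF are not covered in the paper and score 0.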

!!! tip deepseek-chat TL;DR

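This paper addresses the severe causal hallucination that smaller LLMs show in event causality identification. It proposes a pipeline for generating effective chain-of-thought traces, and fine-tuning on those traces substantially reduces hallucination and improves accuracy. It also introduces a new metric, the Causal Hallucination Rate.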

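Abstract (translation)

Although large language models (LLMs) perform well on complex reasoning tasks, they suffer from severe causal hallucination in event causality identification (ECI), especially smaller models (≤1.5B parameters). One promising remedy is to fine-tune the models on chain-of-thought (CoT) traces, but no CoT trace dataset suited to ECI currently exists. This paper first investigates the core criteria that CoT traces must meet to effectively mitigate causal hallucination in small models, and then designs a pipeline that generates CoT traces meeting those criteria. Because no metric for quantifying causal hallucination currently exists, the authors also introduce a new one, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of criteria for effective CoT traces, and validate the proposed pipeline. Experiments show that fine-tuning on CoT traces generated by the pipeline not only substantially reduces causal hallucination in small LLMs but also improves mean accuracy. The fine-tuned models also show strong cross-dataset and cross-difficulty generalization, as well as robustness under misleading intervention prompts.

Abstract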

Although large language models (LLMs) excel in complex reasoning tasks, they suffer from severe causal hallucination in event causality identification (ECI), particularly in smaller models ($\leq$1.5B parameters). A promising approach to address this issue is to fine-tune them with Chain-of-Thought (CoT) traces. However, there is currently a lack of CoT trace dataset available for ECI. In this paper, we first investigate the essential criteria that effective CoT traces should possess to mitigate causal hallucination in smaller models. We then design a pipeline to generate CoT traces that meet these criteria. Moreover, since there is currently no metric for quantifying causal hallucination, we also introduce a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of effective CoT trace criteria, and validate the effectiveness of our pipeline. Our experiments show that fine-tuning with the CoT traces generated by our pipeline not only substantially reduces causal hallucination in smaller LLMs but also improves mean accuracy. Moreover, the fine-tuned models exhibit strong cross-dataset and cross-difficulty generalization, as well as robustness under misleading intervention prompts.

Keywords: Large Language Models, Small Language Models, Chain-of-Thought, Causal Hallucination, Fine-tuning, Event Causality Identification, Causal Hallucination Rate, Generalization

131. ❌ Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark

Authors: Terra Blevins, Stephen Mayhew, Marek Šuppa, Hila Gonen, Shachar Mirkin, Vasile Pais, Kaja Dobrovoljc, Voula Giouli, Jun Kevin, Eugene Jang, Eungseo Kim, Jeongyeon Seo, Xenophon Gialis, Yuval Pinter Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12744v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

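Scoring rationale: The paper focuses on building a multilingual named entity recognition (NER) benchmark dataset, a foundational natural language processing (NLP) task. The abstract states that "multilingual language models promise to bring the benefits of LLMs to speakers of many languages", indicating that the benchmark is meant to evaluate multilingual large language models (LLMs) on NER, so it is moderately related to "Large Language Models OR LLMs OR Foundation Models" (5 points). However, the paper does not involve innovations in the technical principles of large models or deep learning (such as MoE, quantization, or inference acceleration), nor AI applications in specific scientific fields such as biomedicine, so all other keywords score 0.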

!!! tip deepseek-chat TL;DR

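This paper presents Universal NER v2, an ongoing project that aims to build a massively multilingual named entity recognition (NER) benchmark dataset, addressing the scarcity of gold-standard data for evaluating multilingual large language models.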

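Abstract (translation)

Although multilingual models promise to bring the benefits of large language models to speakers of many languages, gold-standard evaluation benchmarks for testing these assumptions are still lacking for most languages. The Universal NER project, now entering its fourth year, is dedicated to building gold-standard multilingual named entity recognition benchmark datasets. Inspired by massively multilingual efforts for other core natural language processing tasks (such as Universal Dependencies), the project uses a general tagset and thorough annotation guidelines to collect standardized, cross-lingual annotations of named entity spans. The first release (UNER v1) came out in 2024, and the project has continued to expand since then, bringing together organizers, annotators, and collaborators in an active community.

Abstract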

While multilingual language models promise to bring the benefits of LLMs to speakers of many languages, gold-standard evaluation benchmarks in most languages to interrogate these assumptions remain scarce. The Universal NER project, now entering its fourth year, is dedicated to building gold-standard multilingual Named Entity Recognition (NER) benchmark datasets. Inspired by existing massively multilingual efforts for other core NLP tasks (e.g., Universal Dependencies), the project uses a general tagset and thorough annotation guidelines to collect standardized, cross-lingual annotations of named entity spans. The first installment (UNER v1) was released in 2024, and the project has continued and expanded since then, with various organizers, annotators, and collaborators in an active community.

Keywords: multilingual NER, benchmark dataset, named entity recognition, cross-lingual annotation, gold-standard evaluation, multilingual language models, Universal NER project

132. ❌ Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood

Authors: Xingyu Lin, Yilin Wen, Du Su, Jinchang Hou, En Wang, Wenbin Liu, Chenfu Bao, Zhonghou Lv Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12736v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

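Scoring rationale: The paper studies optimization methods for LLMs in mathematical reasoning and directly involves chain-of-thought reasoning and RLHF-related techniques (GRPO and DAPO are policy-optimization methods of the kind used in RLHF-style training), so "Large Language Models", "Chain of Thought", and "RLHF" are all highly related (10 points each). "System 2 Thinking" relates to in-depth reasoning, and the paper targets improvements to CoT reasoning, so it is moderately related (5 points). Other keywords such as MoE, quantization, and RAG are not covered in the paper and score 0.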

!!! tip deepseek-chat TL;DR

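This paper targets the token-level sparse-reward problem that large language models face in chain-of-thought reasoning. It proposes the TEPO framework, which links group-level rewards to token-level aggregation through sequence-level likelihood and adds a KL-divergence mask constraint, achieving state-of-the-art mathematical reasoning performance and cutting convergence time by 50% compared with GRPO/DAPO.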

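Abstract (translation)

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), with notable gains in mathematical reasoning. However, GRPO and related entropy regularization methods still struggle with token-level sparse rewards, a challenge inherent to chain-of-thought (CoT) reasoning. These methods typically rely on undifferentiated token-level entropy regularization, which readily leads to entropy collapse or model degradation under sparse token rewards. This work proposes TEPO, a novel token-level framework that (1) uses sequence-level likelihood to link group-level rewards to individual tokens through token-level aggregation, and (2) introduces a token-level KL-divergence mask constraint, applied to tokens with positive advantage and decreasing entropy, to mitigate abrupt policy updates. Experiments show that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly improves training stability, cutting convergence time by 50% compared with GRPO/DAPO.

Abstract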

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse-rewards, which is an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-Divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.

Keywords: Token-Level Policy Optimization, Large Language Models, Chain-of-Thought Reasoning, Mathematical Reasoning, Sparse Rewards, Entropy Regularization, Training Stability, GRPO
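
The masked KL constraint described in the TEPO abstract can be illustrated with a small sketch. The abstract only says that the constraint targets tokens with positive advantage and decreasing entropy; everything else here is an assumption made for illustration: the function names, the KL direction (old policy relative to new), the coefficient `beta`, and averaging over the selected tokens. It is not the paper's actual loss.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is clamped to eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def kl_mask(
    advantages: Sequence[float],
    old_probs: Sequence[Sequence[float]],
    new_probs: Sequence[Sequence[float]],
) -> List[bool]:
    """Select tokens with positive advantage whose entropy decreased."""
    return [
        a > 0.0 and entropy(new) < entropy(old)
        for a, old, new in zip(advantages, old_probs, new_probs)
    ]


def masked_kl_penalty(
    advantages: Sequence[float],
    old_probs: Sequence[Sequence[float]],
    new_probs: Sequence[Sequence[float]],
    beta: float = 0.1,  # hypothetical penalty coefficient
) -> float:
    """Mean KL(old || new) over the selected tokens, scaled by beta."""
    mask = kl_mask(advantages, old_probs, new_probs)
    selected = [
        kl_divergence(old, new)
        for keep, old, new in zip(mask, old_probs, new_probs)
        if keep
    ]
    if not selected:
        return 0.0
    return beta * sum(selected) / len(selected)
```

In this sketch, a token whose distribution sharpens while its advantage is positive is penalized for moving away from the old policy, and all other tokens are left unconstrained.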

133. ❌ InsightFlow: LLM-Driven Synthesis of Patient Narratives for Mental Health into Causal Models

Authors: Shreya Gupta, Prottay Kumar Adhikary, Bhavyaa Dave, Salam Michael Singh, Aniket Deroy, Tanmoy Chakraborty Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12721v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

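Scoring rationale: The paper's core contribution is using an LLM to automatically generate causal graphs from psychotherapy dialogues, an application of large models in the medical domain. It is relevant only to "Large Language Models" (the core method) and "AI for Science" (the medical application); the technical principles, training methods, and inference optimizations covered by the other keywords are not mentioned.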

!!! tip deepseek-chat TL;DR

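The study proposes InsightFlow, an LLM-based method that automatically generates 5P-aligned clinical causal graphs from patient-therapist dialogues. Evaluation shows that the generated graphs are structurally and semantically comparable to expert annotations and are judged clinically useful.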

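Abstract (translated)

Clinical case formulation commonly uses the 5P framework to organize patient symptoms and psychosocial factors into causal models. However, building such graphs from therapy transcripts is time-consuming and varies between clinicians. This study proposes InsightFlow, an LLM-based method that automatically generates 5P-aligned causal graphs from patient-therapist dialogues. Using 46 psychotherapy intake transcripts annotated by clinical experts, we compare LLM-generated graphs with human formulations through structural similarity (NetSimile), semantic similarity (embeddings), and expert-rated clinical criteria. The generated graphs reach a structural similarity on par with inter-annotator agreement and align well semantically with the human graphs. Experts rate the outputs as moderately complete, consistent, and clinically useful. Although LLM-generated graphs tend to form more interconnected structures than the chain-like human graphs, overall complexity and content coverage are similar. These results indicate that LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights the potential of automated causal modeling to augment clinical workflows; future work needs to improve temporal reasoning and reduce redundancy.

Abstract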

Clinical case formulation organizes patient symptoms and psychosocial factors into causal models, often using the 5P framework. However, constructing such graphs from therapy transcripts is time-consuming and varies across clinicians. We present InsightFlow, an LLM-based approach that automatically generates 5P-aligned causal graphs from patient-therapist dialogues. Using 46 psychotherapy intake transcripts annotated by clinical experts, we evaluate LLM-generated graphs against human formulations using structural (NetSimile), semantic (embedding similarity), and expert-rated clinical criteria. The generated graphs show structural similarity comparable to inter-annotator agreement and high semantic alignment with human graphs. Expert evaluations rate the outputs as moderately complete, consistent, and clinically useful. While LLM graphs tend to form more interconnected structures compared to the chain-like patterns of human graphs, overall complexity and content coverage are similar. These results suggest that LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights the potential of automated causal modeling to augment clinical workflows, with future work needed to improve temporal reasoning and reduce redundancy.

Keywords: LLM, causal models, mental health, clinical case formulation, psychotherapy transcripts, automated graph generation, 5P framework, patient narratives

Authors: Chuang Peng, Wei Zhang, Renshuai Tao, Xinhao Zhang, Jian Yang Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12666v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

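Scoring rationale: The paper focuses on text-based web navigation agents, an application of large language models (LLMs) to a specific domain. Its core contributions are: 1) The Triton dataset (590k instances) and a progressive training curriculum, which involve data quality and curriculum learning and are partially relevant to "Scaling Laws AND Data Quality" (5 points). 2) Explicit use of Supervised Fine-Tuning (SFT) as the base method, with its limitations pointed out, which is highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (10 points). 3) Use of Odds Ratio Preference Optimization (ORPO) and Group Relative Policy Optimization (GRPO), which are preference optimization techniques relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (8 points). 4) The goal of building robust web navigation agents, which is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points). Other keywords, such as MoE, SLMs, RAG, and CoT, are not mentioned in the abstract or are unrelated to the paper's topic and receive 0 points. The paper does not involve AI applications in science (such as bioinformatics), so "AI for Science OR Bioinformatics OR Cheminformatics" receives 0 points.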

!!! tip deepseek-chat TL;DR

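The paper addresses the lack of discrimination and generalization ability of text-based web navigation agents in noisy, heterogeneous HTML environments. By building the Triton dataset and a progressive training curriculum, it develops the Triton-GRPO-32B model, which reaches a 58.7% Step Success Rate on the Mind2Web benchmark, surpassing GPT-4.5 and Claude-4.5.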

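Abstract (translated)

Text-based web agents offer computational efficiency for autonomous web navigation, but building robust agents remains challenging because real-world HTML is noisy and heterogeneous. Standard supervised fine-tuning falls short on two key dimensions: it lacks the discrimination ability to reject plausible but incorrect elements on dense pages, and it generalizes poorly to unseen website layouts. To address these challenges, we introduce the Triton dataset (590k instances) and a progressive training curriculum. Triton is built with Structural-Semantic Hard Negative Mining, which explicitly mines topologically similar distractors, and a Dual-Agent Consensus Pipeline, which synthesizes diverse cross-domain tasks under strict verification. On this basis, our progressive curriculum yields three models: Triton-SFT-32B for basic imitation, Triton-ORPO-32B for robust discrimination through Odds Ratio Preference Optimization, and Triton-GRPO-32B for long-horizon consistency through Group Relative Policy Optimization. Empirical evaluation on Mind2Web shows that Triton-GRPO-32B reaches state-of-the-art performance among open-source models with a 58.7% Step Success Rate, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by more than 16%, which validates that a specialized data curriculum outweighs raw parameter scale for web navigation.

Abstract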

Text-based web agents offer computational efficiency for autonomous web navigation, yet developing robust agents remains challenging due to the noisy and heterogeneous nature of real-world HTML. Standard Supervised Fine-Tuning (SFT) approaches fail in two critical dimensions: they lack discrimination capabilities to reject plausible but incorrect elements in densely populated pages, and exhibit limited generalization to unseen website layouts. To address these challenges, we introduce the Triton dataset (590k instances) and a progressive training curriculum. Triton is constructed via Structural-Semantic Hard Negative Mining, which explicitly mines topologically similar distractors, and a Dual-Agent Consensus pipeline that synthesizes diverse cross-domain tasks with strict verification. Building upon this foundation, our progressive curriculum produces three models: Triton-SFT-32B for basic imitation, Triton-ORPO-32B for robust discrimination via Odds Ratio Preference Optimization, and Triton-GRPO-32B for long-horizon consistency through Group Relative Policy Optimization. Empirical evaluation on Mind2Web demonstrates that Triton-GRPO-32B achieves state-of-the-art performance among open-source models with 58.7% Step Success Rate, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by over 16%, validating that specialized data curriculum outweighs raw parameter scale for web navigation.

Keywords: Web Navigation Agents, Supervised Fine-Tuning, Odds Ratio Preference Optimization, Group Relative Policy Optimization, Progressive Curriculum Learning, Structural-Semantic Hard Negative Mining, Text-based Web Agents, Triton Dataset

135. ❌ Do VLMs Truly “Read” Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting

Authors: Kaiqi Hu, Linda Xiao, Shiyue Xu, Ziyi Tang, Mingwen Liu Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12659v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

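Scoring rationale: The paper studies the application of vision-language models (VLMs) to stock price forecasting, at the intersection of computer vision and finance. All scoring keywords target large language models (LLMs) and related techniques (such as MoE, SFT, RLHF, and RAG), and the paper does not involve LLMs or those techniques, so the relevance of every keyword is 0.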

!!! tip deepseek-chat TL;DR

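The paper builds a multi-scale candlestick chart dataset and an evaluation framework to assess how well vision-language models use multi-scale visual market signals for stock price forecasting. It finds that most models perform well only in persistent uptrends or downtrends, show weak predictive ability in common market scenarios, and exhibit prediction bias and limits in temporal reasoning.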

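Abstract (translated)

Vision-language models are increasingly applied to visual stock price forecasting, yet existing benchmarks do not adequately evaluate their understanding of price information in candlestick charts. First, prior studies do not clearly separate whether a VLM's understanding of visual input genuinely improves predictive performance from whether the model truly understands candlestick patterns. In addition, most existing datasets and evaluation frameworks are designed around single-period or tabular inputs, whereas professional analysts rely heavily on multi-scale candlestick charts, in which long-term charts capture trend direction and short-term charts provide cues for inflection points. This makes it difficult to systematically evaluate how well VLMs integrate short-term and long-term visual market dynamics. To fill this gap, we build a multi-scale candlestick chart dataset and a standardized evaluation framework for assessing VLMs' ability to use multi-scale visual market signals. The evaluation combines confusion-matrix-based diagnostics with information coefficient time-series metrics and includes XGBoost as a feature-based temporal baseline. Using this dataset, we benchmark representative VLMs and analyze their ability to use multi-scale price data. Results show that most VLMs perform well only in persistent uptrends or downtrends and have weak predictive ability in more common market scenarios. We also find significant prediction bias and limited sensitivity to the forecast horizon explicitly specified in the prompt, which reveals inherent limitations in precise temporal reasoning.

Abstract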

Vision-language models (VLMs) are increasingly applied to visual stock price forecasting, yet existing benchmarks inadequately evaluate their understanding of stock price in candlestick charts. First, prior studies fail to isolate whether VLMs’ comprehension of visual inputs genuinely improves predictive performance and whether VLMs truly comprehend candlestick patterns. Further, most existing datasets and evaluation setups are designed around single-period or tabular inputs. However, human analysts strongly rely on multi-scale candlestick charts, where longer-term horizons capture trend direction and shorter-term horizons provide cues for inflection points, making it difficult to systematically assess VLMs’ ability to integrate short-term and long-term visual market dynamics. To bridge this gap, we construct a multi-scale candlestick chart dataset and a standardized evaluation framework to assess VLMs’ ability to utilize multi-scale visual market signals. Evaluation combines confusion-matrix-based diagnostics with information coefficient (IC) time series metrics and includes XGBoost as a feature-based temporal baseline. Using this dataset, we benchmark representative VLMs and analyze their ability to leverage multi-scale stock price data. Experimental results show that most VLMs perform well only under persistent uptrend or downtrend conditions, while exhibiting weak predictive capability in more common market scenarios. We also identify significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts, indicating inherent limitations in precise temporal reasoning.

Keywords: Vision-language models, Candlestick charts, Stock price forecasting, Multi-scale evaluation, Visual market dynamics, Temporal reasoning, Prediction bias, Information coefficient
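
The abstract above names information coefficient (IC) time series as one of its metrics without giving a formula. IC is commonly computed as the correlation between predicted scores and realized returns across assets for each period. The sketch below uses the rank (Spearman) variant, which is an assumption here, as are the function names; the paper may use a different variant.

```python
from math import sqrt
from typing import List, Sequence


def _ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def _pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input; report 0
    return cov / sqrt(var_x * var_y)


def rank_ic(predicted: Sequence[float], realized: Sequence[float]) -> float:
    """Spearman rank correlation between predictions and realized returns."""
    if len(predicted) != len(realized) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return _pearson(_ranks(predicted), _ranks(realized))


def ic_series(
    predictions_by_period: Sequence[Sequence[float]],
    returns_by_period: Sequence[Sequence[float]],
) -> List[float]:
    """One IC value per period (for example, per trading day)."""
    return [rank_ic(p, r) for p, r in zip(predictions_by_period, returns_by_period)]
```

A series of IC values near zero across periods would suggest that a model's rankings carry little predictive signal, whatever its accuracy in a few trending periods.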

136. ❌ Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification

Authors: Tsai-Ning Wang, Herman Teun den Dekker, Lin-Lin Chen, Neil Zeghidour, Aaqib Saeed Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12647v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

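Scoring rationale: The paper proposes the TRIAGE framework, which applies retrieval-augmented generation (RAG) and large language model (LLM) reasoning to zero-shot respiratory audio classification, an AI for Science application in the bioinformatics area. The paper explicitly mentions "retrieval-augmented large language model reasoning", which is highly relevant to the RAG keyword; its use of an LLM for reasoning relates to Chain of Thought; and its multi-tier reasoning process has some connection to System 2 Thinking. Other keywords, such as MoE, SFT, and quantization, are not addressed in the paper.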

!!! tip deepseek-chat TL;DR

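To address the scarcity of labeled data in respiratory audio classification, the paper proposes TRIAGE, an adaptive test-time compute framework whose tiered routing mechanism achieves performance comparable to supervised baselines in the zero-shot setting while substantially reducing compute cost.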

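Abstract (translated)

Automated respiratory audio analysis promises scalable, non-invasive disease screening, but its progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference needs no task-specific supervision, yet existing methods apply uniform computation to every input without accounting for differences in sample difficulty. We propose TRIAGE, a tiered zero-shot framework that adapts test-time compute by dynamically routing each audio sample through progressively richer reasoning stages: fast label-cosine scoring in a joint audio-text embedding space (Tier-L), structured matching against clinical descriptors (Tier-M), and retrieval-augmented large language model reasoning (Tier-H). A confidence-based router lets easy samples exit early at the cheapest tier and allocates extra computation to ambiguous samples, so that nearly half of all samples finish inference at the lowest cost. Across nine respiratory classification tasks without task-specific training, TRIAGE reaches a mean AUROC of 0.744, outperforming existing zero-shot methods and matching or exceeding supervised baselines on several tasks. The analysis shows that test-time scaling concentrates gains where they matter: uncertain samples see up to 19% relative improvement, while high-confidence predictions remain stable at minimal cost.

Abstract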

Automated respiratory audio analysis promises scalable, non-invasive disease screening, yet progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference eliminates task-specific supervision, but existing methods apply uniform computation to every input regardless of difficulty. We introduce TRIAGE, a tiered zero-shot framework that adaptively scales test-time compute by routing each audio sample through progressively richer reasoning stages: fast label-cosine scoring in a joint audio-text embedding space (Tier-L), structured matching with clinician-style descriptors (Tier-M), and retrieval-augmented large language model reasoning (Tier-H). A confidence-based router finalizes easy predictions early while allocating additional computation to ambiguous inputs, enabling nearly half of all samples to exit at the cheapest tier. Across nine respiratory classification tasks without task-specific training, TRIAGE achieves a mean AUROC of 0.744, outperforming prior zero-shot methods and matching or exceeding supervised baselines on multiple tasks. Our analysis shows that test-time scaling concentrates gains where they matter: uncertain cases see up to 19% relative improvement while confident predictions remain unchanged at minimal cost.

Keywords: zero-shot classification, respiratory audio analysis, adaptive test-time scaling, retrieval-augmented generation, large language model reasoning, tiered framework, confidence-based routing, AI for healthcare
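
The confidence-based routing across Tier-L, Tier-M, and Tier-H described in the abstract above can be sketched as an early-exit cascade. The tier functions, thresholds, and labels below are hypothetical stand-ins chosen for illustration; the abstract does not specify how TRIAGE computes confidence or sets its thresholds.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A tier maps a sample to (predicted label, confidence in [0, 1]).
Tier = Callable[[str], Tuple[str, float]]


@dataclass
class RoutedPrediction:
    label: str
    confidence: float
    tier: str  # name of the tier that produced the final answer


def route(sample: str, tiers: List[Tuple[str, Tier, float]]) -> RoutedPrediction:
    """Run tiers from cheapest to most expensive and exit at the first tier
    whose confidence meets its threshold. The last tier always answers."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for index, (name, tier_fn, threshold) in enumerate(tiers):
        label, confidence = tier_fn(sample)
        is_last = index == len(tiers) - 1
        if confidence >= threshold or is_last:
            return RoutedPrediction(label, confidence, name)
    raise AssertionError("unreachable: the last tier always returns")


# Toy stand-ins for the three tiers (hypothetical; real tiers would score
# audio embeddings, match descriptors, and call an LLM with retrieval).
def tier_l(sample: str) -> Tuple[str, float]:
    return ("wheeze", 0.9) if "clear wheeze" in sample else ("normal", 0.4)


def tier_m(sample: str) -> Tuple[str, float]:
    return ("crackle", 0.8) if "crackle" in sample else ("normal", 0.5)


def tier_h(sample: str) -> Tuple[str, float]:
    return ("needs review", 0.6)


TIERS = [("Tier-L", tier_l, 0.85), ("Tier-M", tier_m, 0.75), ("Tier-H", tier_h, 0.0)]
```

Calling `route("clear wheeze", TIERS)` exits at the cheapest tier, while an ambiguous input falls through to the most expensive one, which is the cost-saving behavior the abstract attributes to the router.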

137. ❌ Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data

Authors: Vadim Borisov Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12633v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

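Scoring rationale: The paper studies multilingual multi-label emotion classification, training multilingual Transformer encoders (such as DistilBERT and XLM-R) on synthetic data. All scoring keywords concern innovations in large-model techniques or applications in science, and the paper touches none of them: it does not use LLMs (only encoder models); it does not involve training topics such as MoE, SLMs, or Scaling Laws; it does not involve fine-tuning or inference techniques such as alignment, RLHF, PEFT, or RAG; it does not involve advanced topics such as reasoning, agents, compression, or interpretability; and it does not involve scientific applications such as bioinformatics. The paper addresses a conventional NLP task and is unrelated to the scoring keywords.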

!!! tip deepseek-chat TL;DR

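By building a large-scale multilingual synthetic dataset and training multilingual Transformer encoders, the paper addresses data scarcity in multilingual emotion classification. Its XLM-R-Large model reaches high performance across 23 languages and, evaluated zero-shot on human-annotated datasets, matches or exceeds English-only specialist models.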

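Abstract (translated)

Emotion classification in multilingual settings remains limited by the scarcity of annotated data: existing corpora are mostly English, single-label, and cover few languages. To fill this gap, we use culturally adapted generation and programmatic quality filtering to build a large-scale synthetic training corpus covering 23 languages (Arabic, Bengali, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin, Polish, Portuguese, Punjabi, Russian, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, and Vietnamese), with more than 1 million multi-label samples (50k per language) across 11 emotion categories. We train and compare six multilingual Transformer encoders under identical conditions, ranging from DistilBERT (135M parameters) to XLM-R-Large (560M parameters). On the in-domain test set, XLM-R-Large achieves 0.868 F1-micro and 0.987 AUC-micro. To validate against human-annotated data, we evaluate all models zero-shot on GoEmotions (English) and SemEval-2018 Task 1 E-c (English, Arabic, Spanish). On threshold-free ranking metrics, XLM-R-Large matches or exceeds specialist models trained only on English: it ties on AP-micro (0.636) and LRAP (0.804) and is ahead on AUC-micro (0.810 vs. 0.787), while natively supporting all 23 languages. The best base-sized model is publicly available at https://huggingface.co/tabularisai/multilingual-emotion-classification

Abstract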

Emotion classification in multilingual settings remains constrained by the scarcity of annotated data: existing corpora are predominantly English, single-label, and cover few languages. We address this gap by constructing a large-scale synthetic training corpus of over 1M multi-label samples (50k per language) across 23 languages: Arabic, Bengali, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin, Polish, Portuguese, Punjabi, Russian, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, and Vietnamese, covering 11 emotion categories using culturally-adapted generation and programmatic quality filtering. We train and compare six multilingual transformer encoders, from DistilBERT (135M parameters) to XLM-R-Large (560M parameters), under identical conditions. On our in-domain test set, XLM-R-Large achieves 0.868 F1-micro and 0.987 AUC-micro. To validate against human-annotated data, we evaluate all models zero-shot on GoEmotions (English) and SemEval-2018 Task 1 E-c (English, Arabic, Spanish). On threshold-free ranking metrics, XLM-R-Large matches or exceeds English-only specialist models, tying on AP-micro (0.636) and LRAP (0.804) and surpassing them on AUC-micro (0.810 vs. 0.787), while natively supporting all 23 languages. The best base-sized model is publicly available at https://huggingface.co/tabularisai/multilingual-emotion-classification

Keywords: multilingual emotion classification, synthetic data, transformer encoders, XLM-R, zero-shot evaluation, multi-label classification, large-scale corpus
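
The F1-micro score reported in the abstract above pools true positives, false positives, and false negatives over every (sample, label) pair before computing F1. A minimal sketch for the multi-label case follows; the function name and the set-based label representation are illustrative choices, not taken from the paper.

```python
from typing import List, Set


def f1_micro(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label data: counts are pooled across
    all samples and labels, then F1 = 2*TP / (2*TP + FP + FN)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    for true_labels, predicted_labels in zip(y_true, y_pred):
        tp += len(true_labels & predicted_labels)
        fp += len(predicted_labels - true_labels)
        fn += len(true_labels - predicted_labels)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Example: three texts, each with a (possibly empty) set of emotions.
gold = [{"joy", "surprise"}, {"anger"}, set()]
pred = [{"joy"}, {"anger", "fear"}, set()]
score = f1_micro(gold, pred)  # TP=2, FP=1, FN=1
```

Because counts are pooled, frequent emotion categories dominate this metric, which is one reason the abstract also reports threshold-free ranking metrics such as AP-micro and LRAP.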

138. ❌ Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

Authors: Xudong Wang, Chaoning Zhang, Qigan Sun, Zhenzhen Huang, Chang Lu, Sheng Zheng, Zeyu Ma, Caiyan Qin, Yang Yang, Hengtao Shen Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12610v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

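Scoring rationale: The paper's core topic is improving the RAG framework, so it is highly relevant to "Retrieval-Augmented Generation" (10 points) and directly involves LLMs (10 points). It proposes a structured-triplet method to improve reasoning chains, which is strongly related to "Chain of Thought" and "System 2 Thinking" (8 points). It aims to mitigate hallucination, which relates to "Hallucination Mitigation" (8 points). It uses lightweight prompt-based adaptation with frozen parameters, giving it partial relevance to "PEFT" (5 points). It touches on context efficiency, which is partly relevant to "Context Window Extension" and "In-context Learning" (5 points). Other keywords, such as MoE, quantization, and AI for Science, are not addressed (0 points).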

!!! tip deepseek-chat TL;DR

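To address how retrieved evidence is structured and aligned in RAG, the paper proposes the Tri-RAG framework, which converts external knowledge into structured triplets and significantly improves retrieval quality, reasoning efficiency, and generation stability.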

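Abstract (translated)

Retrieval-Augmented Generation (RAG) mitigates hallucination in large language models (LLMs) by bringing in external knowledge during generation. However, the effectiveness of RAG depends not only on the design of the retriever and the capacity of the underlying model, but also on how the retrieved evidence is organized and aligned with the query. Existing RAG methods usually retrieve and concatenate unstructured text fragments as context, which often introduces redundant or weakly relevant information. This leads to excessive context accumulation, weaker semantic alignment, and fragmented reasoning chains, lowering generation quality while increasing token consumption. To address these challenges, we propose Tri-RAG, a retrieval framework based on structured triplets that improves retrieval efficiency through reasoning-aligned context construction. Tri-RAG automatically converts external knowledge from natural language into standardized structured triplets consisting of Condition, Proof, and Conclusion, and uses lightweight prompt-based adaptation (with model parameters kept frozen) to explicitly capture the logical relations among knowledge fragments. On top of this representation, the triplet head, Condition, serves as an explicit semantic anchor for retrieval and matching, so that query-relevant knowledge units can be identified precisely without directly concatenating lengthy raw text. As a result, Tri-RAG strikes a good balance between retrieval accuracy and context token efficiency. Experiments on multiple benchmark datasets show that Tri-RAG significantly improves retrieval quality and reasoning efficiency, while producing more stable generation behavior and more efficient resource use in complex reasoning scenarios.

Abstract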

Retrieval-Augmented Generation (RAG) mitigates hallucination in large language models (LLMs) by incorporating external knowledge during generation. However, the effectiveness of RAG depends not only on the design of the retriever and the capacity of the underlying model, but also on how retrieved evidence is structured and aligned with the query. Existing RAG approaches typically retrieve and concatenate unstructured text fragments as context, which often introduces redundant or weakly relevant information. This practice leads to excessive context accumulation, reduced semantic alignment, and fragmented reasoning chains, thereby degrading generation quality while increasing token consumption. To address these challenges, we propose Tri-RAG, a structured triplet-based retrieval framework that improves retrieval efficiency through reasoning-aligned context construction. Tri-RAG automatically transforms external knowledge from natural language into standardized structured triplets consisting of Condition, Proof, and Conclusion, explicitly capturing logical relations among knowledge fragments using lightweight prompt-based adaptation with frozen model parameters. Building on this representation, the triplet head Condition is treated as an explicit semantic anchor for retrieval and matching, enabling precise identification of query-relevant knowledge units without directly concatenating lengthy raw texts. As a result, Tri-RAG achieves a favorable balance between retrieval accuracy and context token efficiency. Experimental results across multiple benchmark datasets demonstrate that Tri-RAG significantly improves retrieval quality and reasoning efficiency, while producing more stable generation behavior and more efficient resource utilization in complex reasoning scenarios.

Keywords: Retrieval-Augmented Generation, RAG, large language models, structured triplets, hallucination mitigation, reasoning efficiency, context token efficiency, Tri-RAG
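
The triplet representation in the abstract above, with Condition serving as the retrieval anchor, can be sketched as a small data structure plus a matcher that looks only at the Condition field. The token-overlap (Jaccard) similarity is a stand-in assumption used to keep the sketch self-contained; the abstract does not state which matching function Tri-RAG uses, and the class, function names, and context template are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Triplet:
    condition: str   # semantic anchor used for retrieval
    proof: str       # supporting evidence
    conclusion: str  # what follows when the condition holds


def _tokens(text: str) -> Set[str]:
    stripped = (word.strip(".,;:?!").lower() for word in text.split())
    return {word for word in stripped if word}


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for an embedding-based matcher."""
    tokens_a, tokens_b = _tokens(a), _tokens(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def retrieve(query: str, store: List[Triplet], k: int = 1) -> List[Triplet]:
    """Rank triplets by similarity between the query and Condition only."""
    ranked = sorted(store, key=lambda t: jaccard(query, t.condition), reverse=True)
    return ranked[:k]


def build_context(triplets: List[Triplet]) -> str:
    """Render retrieved triplets as compact context lines."""
    return "\n".join(
        f"If {t.condition}, then {t.conclusion} (because {t.proof})."
        for t in triplets
    )


STORE = [
    Triplet("water is heated to 100 C at sea level",
            "its vapor pressure equals atmospheric pressure",
            "it boils"),
    Triplet("a metal rod is cooled",
            "thermal contraction shortens atomic spacing",
            "it shrinks"),
]
```

Matching against the short Condition field and rendering one line per triplet keeps the context compact, which is the token-efficiency argument the abstract makes against concatenating raw passages.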

139. ❌ FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing

Authors: Peng Wang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12559v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

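Scoring rationale: The FABLE paper focuses on unstructured editing of large language models, using a hierarchical framework to decouple fine-grained fact anchoring from holistic text generation, so it is highly relevant to "Large Language Models" (10 points). The method aims to fix unreliable fact access, which ties directly to "Hallucination Mitigation OR Factuality OR Truthfulness" (10 points). The paper does not involve the specific techniques behind other keywords, such as MoE, SLMs, training methods, inference acceleration, or agent systems, so those keywords receive 0 points.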

!!! tip deepseek-chat TL;DR

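The paper proposes the FABLE framework, which hierarchically decouples fine-grained fact anchoring from holistic text generation to address unreliable fact access in unstructured model editing, substantially improving fine-grained question answering while maintaining state-of-the-art holistic editing performance.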

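Abstract (translated)

Unstructured model editing aims to update models with real-world text, but existing methods usually memorize text as a whole and lack reliable access to fine-grained facts. To address this, we propose FABLE, a hierarchical framework that decouples fine-grained fact injection from holistic text generation. FABLE follows a two-stage, fact-first strategy: discrete facts are anchored in the model's shallow layers, and the deeper layers then receive minimal updates to produce coherent text. This decoupling resolves the mismatch between holistic recall and fine-grained fact access, and reflects a property of the unidirectional Transformer flow, in which surface-form generation amplifies rather than corrects the underlying fact representations. We also introduce UnFine, a diagnostic benchmark with fine-grained question-answer pairs and fact-level metrics for systematic evaluation. Experiments show that FABLE substantially improves fine-grained question answering while maintaining state-of-the-art holistic editing performance. The code is publicly available at https://github.com/caskcsg/FABLE.

Abstract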

Unstructured model editing aims to update models with real-world text, yet existing methods often memorize text holistically without reliable fine-grained fact access. To address this, we propose FABLE, a hierarchical framework that decouples fine-grained fact injection from holistic text generation. FABLE follows a two-stage, fact-first strategy: discrete facts are anchored in shallow layers, followed by minimal updates to deeper layers to produce coherent text. This decoupling resolves the mismatch between holistic recall and fine-grained fact access, reflecting the unidirectional Transformer flow in which surface-form generation amplifies rather than corrects underlying fact representations. We also introduce UnFine, a diagnostic benchmark with fine-grained question-answer pairs and fact-level metrics for systematic evaluation. Experiments show that FABLE substantially improves fine-grained question answering while maintaining state-of-the-art holistic editing performance. Our code is publicly available at https://github.com/caskcsg/FABLE.

Keywords: model editing, fact anchoring, fine-grained facts, unstructured editing, hierarchical framework, fact injection, text generation, Transformer flow

140. ❌ Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis

Authors: Kang He, Yuzhe Ding, Xinrong Wang, Fei Li, Chong Teng, Donghong Ji Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12518v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

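Scoring rationale: The paper focuses on multimodal sentiment analysis (MSA) and proposes a framework called EBMC that aims to improve robustness by strengthening weaker modalities and balancing modality collaboration. It covers multimodal fusion, representation learning, and robustness optimization, but not large language models (LLMs), innovations in the underlying principles of deep learning, or applications of large models in other domains. Every keyword concerns large models, deep learning principles, or AI applications in science, whereas this paper studies a conventional multimodal sentiment analysis task, so it has no direct connection to any of them.

!!! tip deepseek-chat TL;DR

This paper proposes the Enhance-then-Balance Modality Collaboration framework (EBMC), which improves the representation quality of weaker modalities through semantic disentanglement and cross-modal enhancement, and introduces an energy-guided modality coordination mechanism and instance-aware modality trust distillation to address modality imbalance and robustness under noisy or missing modalities in multimodal sentiment analysis. Experiments show that EBMC achieves state-of-the-art or competitive performance in both standard and missing-modality settings.

Abstract (Chinese translation)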

多模态情感分析（MSA）通过整合异构的文本、音频和视觉信号来推断人类情感。尽管现有方法利用跨模态互补性，但往往难以充分利用较弱模态。实践中，主导模态易掩盖非语言模态，引发模态竞争并限制整体贡献。这种不平衡会降低融合性能，并在模态噪声或缺失情况下削弱模型鲁棒性。为此，我们提出一种新颖的“增强-平衡模态协作框架”（Enhance-then-Balance Modality Collaboration framework, EBMC）。该框架通过语义解耦与跨模态增强提升表征质量，从而强化较弱模态。为防止主导模态压制其他模态，我们设计了能量引导的模态协调机制，通过可微分均衡目标实现隐式梯度再平衡。此外，实例感知的模态信任蒸馏模块通过估计样本级可靠性来自适应调整融合权重，确保模型鲁棒性。大量实验表明，EBMC取得了最先进或具有竞争力的结果，并在模态缺失场景下保持强劲性能。

Abstract

Multimodal sentiment analysis (MSA) integrates heterogeneous text, audio, and visual signals to infer human emotions. While recent approaches leverage cross-modal complementarity, they often struggle to fully utilize weaker modalities. In practice, dominant modalities tend to overshadow non-verbal ones, inducing modality competition and limiting overall contributions. This imbalance degrades fusion performance and robustness under noisy or missing modalities. To address this, we propose a novel model, Enhance-then-Balance Modality Collaboration framework (EBMC). EBMC improves representation quality via semantic disentanglement and cross-modal enhancement, strengthening weaker modalities. To prevent dominant modalities from overwhelming others, an Energy-guided Modality Coordination mechanism achieves implicit gradient rebalancing via a differentiable equilibrium objective. Furthermore, Instance-aware Modality Trust Distillation estimates sample-level reliability to adaptively modulate fusion weights, ensuring robustness. Extensive experiments demonstrate that EBMC achieves state-of-the-art or competitive results and maintains strong performance under missing-modality settings.

Keywords: Multimodal Sentiment Analysis, Modality Collaboration, Cross-modal Enhancement, Energy-guided Modality Coordination, Instance-aware Modality Trust Distillation, Robustness, Missing Modalities, Semantic Disentanglement

141. ❌ Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

Authors: Linhao Zhang, Yuhan Song, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12506v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

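Scoring rationale: The paper focuses on Audio Large Language Models (AudioLLMs), an application of large language models to audio, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It proposes Unified Audio Schema (UAS) as a supervision framework covering audio perception and reasoning, but does not explicitly address the other keywords, such as MoE, SLMs, Scaling Laws, training methods (pre-training, fine-tuning, alignment, and so on), inference optimization, agent systems, model compression, or applications in specific scientific fields, so those keywords are scored 0.

!!! tip deepseek-chat TL;DR

Addressing the weak performance of Audio Large Language Models on fine-grained audio perception, this paper proposes the Unified Audio Schema (UAS) supervision framework, which organizes audio information into transcription, paralinguistics, and non-linguistic events. It yields consistent gains on the MMSU, MMAR, and MMAU benchmarks, including a 10.9% improvement in fine-grained perception on MMSU over same-size state-of-the-art models, while preserving strong reasoning capabilities.

Abstract (Chinese translation)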

近期音频大语言模型（AudioLLMs）呈现出一种显著的性能倒置现象：虽然在复杂推理任务上表现优异，但在细粒度声学感知任务上却持续表现不佳。我们将此差距归因于以自动语音识别（ASR）为核心的训练方式存在根本性局限——该方式虽提供了精确的语言文本目标，却隐式地教导模型将副语言线索和声学事件作为噪声予以抑制。为解决此问题，我们提出了统一音频模式（Unified Audio Schema, UAS），这是一个整体化、结构化的监督框架，它将音频信息组织成三个明确的组成部分——转写文本、副语言信息和非语言事件——并统一于JSON格式中。该设计在实现全面声学覆盖的同时，并未牺牲支撑推理能力的紧密音频-文本对齐关系。我们通过将该监督策略应用于离散和连续两种音频大语言模型架构，验证了其有效性。在MMSU、MMAR和MMAU基准上的大量实验表明，UAS-Audio模型带来了持续的性能提升，其在MMSU上的细粒度感知能力相比同等规模的先进模型提高了10.9%，同时保持了稳健的推理能力。我们的代码和模型已公开于https://github.com/Tencent/Unified_Audio_Schema。

Abstract

Recent Audio Large Language Models (AudioLLMs) exhibit a striking performance inversion: while excelling at complex reasoning tasks, they consistently underperform on fine-grained acoustic perception. We attribute this gap to a fundamental limitation of ASR-centric training, which provides precise linguistic targets but implicitly teaches models to suppress paralinguistic cues and acoustic events as noise. To address this, we propose Unified Audio Schema (UAS), a holistic and structured supervision framework that organizes audio information into three explicit components – Transcription, Paralinguistics, and Non-linguistic Events – within a unified JSON format. This design achieves comprehensive acoustic coverage without sacrificing the tight audio-text alignment that enables reasoning. We validate the effectiveness of this supervision strategy by applying it to both discrete and continuous AudioLLM architectures. Extensive experiments on MMSU, MMAR, and MMAU demonstrate that UAS-Audio yields consistent improvements, boosting fine-grained perception by 10.9% on MMSU over the same-size state-of-the-art models while preserving robust reasoning capabilities. Our code and model are publicly available at https://github.com/Tencent/Unified_Audio_Schema.
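As a rough illustration of the three-component JSON target described in the abstract, the sketch below builds one such record in Python. Only the three top-level components (Transcription, Paralinguistics, Non-linguistic Events) come from the abstract; the class name, field names, value types, and example values are hypothetical and are not taken from the paper or its repository.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class UnifiedAudioRecord:
    """One supervision target with the three components named in the abstract.

    The field names and value types are illustrative assumptions.
    """
    transcription: str
    paralinguistics: dict = field(default_factory=dict)
    non_linguistic_events: list = field(default_factory=list)

    def to_json(self) -> str:
        # sort_keys makes the serialized target deterministic.
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)


# Hypothetical example values, for illustration only.
example = UnifiedAudioRecord(
    transcription="I can't believe we made it.",
    paralinguistics={"emotion": "relieved", "speaking_rate": "slow"},
    non_linguistic_events=["door closing", "rain"],
)
target = example.to_json()
print(target)
```

Keeping the transcript as one field among three, rather than as the whole target, is one way to read the abstract's point that ASR-centric targets leave no place for paralinguistic cues and acoustic events.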

Keywords: Audio Large Language Models, AudioLLMs, Unified Audio Schema, acoustic perception, paralinguistics, non-linguistic events, audio-text alignment, fine-grained perception

142. ❌ Calibrated Confidence Estimation for Tabular Question Answering

Authors: Lukas Voss Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12491v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

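Scoring rationale: The paper's core topic is confidence calibration of large language models (LLMs) on tabular question answering, so it is highly relevant to 'Large Language Models' (10 points). It proposes and evaluates several confidence estimation methods, including self-evaluation methods (verbalized, P(True)) and perturbation methods (semantic entropy, self-consistency). Because these involve a model evaluating and reflecting on its own outputs, the paper has some connection to 'Self-Correction OR Self-Improvement OR Self-Reflection' (8 points). It does not cover the other keywords, such as MoE, SLMs, training techniques, inference acceleration, or AI for Science, so those keywords are scored 0.

!!! tip deepseek-chat TL;DR

This paper systematically evaluates five confidence estimation methods across five frontier large language models and two tabular question answering benchmarks, finds that every model is severely overconfident, and proposes Multi-Format Agreement (MFA), a method that estimates confidence from serialization variants of structured data, reducing ECE by 44-63% at 20% lower API cost than sampling baselines.

Abstract (Chinese translation)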

大语言模型（LLMs）在表格问答任务中的应用日益广泛，然而其在结构化数据上的校准问题尚未得到充分研究。本文首次系统性地比较了五种置信度估计方法，涵盖五个前沿大语言模型和两个表格问答基准测试。所有模型均表现出严重的过度自信问题（平滑ECE为0.35-0.64，而文本问答中报告的数值为0.10-0.15）。在两个基准测试和所有四个全覆盖模型中，均出现了一致的自我评估与扰动方法二分现象：自我评估方法（包括语言化表达和P(True)）的AUROC为0.42-0.76，而扰动方法（包括语义熵、自一致性以及我们提出的多格式一致性）的AUROC达到0.78-0.86。经Holm-Bonferroni校正后，各模型的配对自助法检验均在p<0.001水平拒绝原假设；对GPT-4o-mini的三次随机种子检验显示，每次种子的标准差仅为0.006。本文提出了多格式一致性方法，该方法利用结构化数据特有的无损确定性序列化变体（包括Markdown、HTML、JSON、CSV格式）进行置信度估计，其API调用成本比采样基线降低20%。该方法使ECE降低44-63%，在TableBench的所有四个模型上均表现出良好泛化能力（平均AUROC为0.80），并能与采样方法形成互补：MFA与自一致性集成将AUROC从0.74提升至0.82。作为次要贡献，结构感知重校准方法相较于标准事后校准方法，将AUROC提升了10个百分点。

Abstract

Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35-0.64 versus 0.10-0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully-covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42-0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78-0.86. Per-model paired bootstrap tests reject the null at p<0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation of only 0.006. The paper proposes Multi-Format Agreement (MFA), which exploits the lossless and deterministic serialization variation unique to structured data (Markdown, HTML, JSON, CSV) to estimate confidence at 20% lower API cost than sampling baselines. MFA reduces ECE by 44-63%, generalizes across all four models on TableBench (mean AUROC 0.80), and combines complementarily with sampling: an MFA + self-consistency ensemble lifts AUROC from 0.74 to 0.82. A secondary contribution, structure-aware recalibration, improves AUROC by +10 percentage points over standard post-hoc methods.
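The idea behind Multi-Format Agreement can be sketched in a few lines of Python: serialize the same table losslessly in the four formats the abstract lists, ask the model the same question once per format, and treat the level of agreement among the answers as a confidence signal. This is a minimal sketch under stated assumptions, not the paper's implementation: the `ask` callable stands in for an LLM call, and scoring confidence as the majority-vote fraction is an assumption, since the abstract does not specify the agreement measure.

```python
import csv
import io
import json
from collections import Counter
from html import escape


def to_markdown(header, rows):
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(c) for c in r) + " |" for r in rows]
    return "\n".join(lines)


def to_html(header, rows):
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in r) + "</tr>"
        for r in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_json_records(header, rows):
    return json.dumps([dict(zip(header, r)) for r in rows])


def to_csv(header, rows):
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


SERIALIZERS = [to_markdown, to_html, to_json_records, to_csv]


def multi_format_agreement(header, rows, question, ask):
    """Return (majority answer, fraction of formats that agree with it).

    `ask(table_text, question) -> str` is a hypothetical stand-in for a model call.
    """
    answers = [ask(fn(header, rows), question).strip().lower() for fn in SERIALIZERS]
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / len(answers)


# Toy stand-in model that gives a different answer only for the HTML rendering.
def fake_ask(table_text, question):
    return "Lyon" if table_text.startswith("<table>") else "Paris"


answer, confidence = multi_format_agreement(
    ["city", "population"], [["Paris", 2100000], ["Lyon", 520000]],
    "Which city is largest?", fake_ask,
)
print(answer, confidence)
```

Because each serialization is deterministic, this needs one model call per format rather than repeated stochastic samples of a single prompt.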

Keywords: Large Language Models, Tabular Question Answering, Confidence Estimation, Calibration, Multi-Format Agreement, Self-evaluation, Perturbation Methods, Overconfidence

143. ❌ Meet Dynamic Individual Preferences: Resolving Conflicting Human Value with Paired Fine-Tuning

Authors: Shanyong Wang, Shuhang Lin, Yining Zhao, Xi Zhu, Yongfeng Zhang Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12479v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

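Scoring rationale: The paper's core topic is aligning LLMs with individual preferences. It proposes the Preference-Paired Fine-Tuning framework and directly involves the keywords LLM, SFT, Alignment, and DPO, which are its core techniques and methods. Other keywords, such as MoE, RAG, and Quantization, are not mentioned in the abstract and are unrelated to the paper.

!!! tip deepseek-chat TL;DR

This paper proposes Preference-Paired Fine-Tuning (PFT), a new framework for adapting large language models to dynamic and conflicting individual preferences. Experiments show it outperforms single-preference training, reaching up to 96.6% accuracy on multi-choice classification and an open-ended generation score of 8.69.

Abstract (Chinese translation)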

近期大规模语言模型（LLM）的发展显著提升了模型与人类普遍偏好的对齐能力。然而，如何使LLM适应个体偏好仍面临重大挑战，这些偏好不仅具有多样性，而且是动态变化的。本文提出了一种新颖的框架——偏好配对微调（Preference-Paired Fine-Tuning，简称PFT），旨在使模型能够适应相互矛盾且不断演变的个体偏好。我们同时构建了一个新的数据集——价值冲突困境（Value Conflict Dilemma，简称VCD），其中包含涉及人类偏好冲突的场景，以促进对方法的评估。实验表明，PFT在多项选择分类任务中准确率最高可达96.6%，在开放式生成任务中获得了最高的8.69分，其表现优于单一偏好训练方法。与DPO、SFT及部分传统训练方法相比，PFT也展现出显著优势，尤其是在处理冲突偏好时。此外，在有限的用户历史数据条件下，模型能够快速推断偏好向量，相较于单一偏好模型，其在用户特定偏好对齐方面实现了44.76%的提升。

Abstract

Recent advances in large language models (LLMs) have significantly improved the alignment of models with general human preferences. However, a major challenge remains in adapting LLMs to individual preferences, which are not only diverse but also dynamic. In this paper, we introduce a novel framework, Preference-Paired Fine-Tuning (PFT), designed to align models with contradictory and evolving individual preferences. We present a new dataset, Value Conflict Dilemma (VCD), which includes scenarios that involve conflicting human preferences, facilitating the evaluation of our approach. Our experiments demonstrate that PFT outperforms single-preference training methods, achieving up to 96.6% accuracy in multi-choice classification tasks and the highest open-ended generation score of 8.69. PFT also shows significant improvements over DPO, SFT and some traditional training methods, especially when handling conflicting preferences. Additionally, with limited user history data, models can infer preference vectors rapidly, achieving a 44.76% improvement in user-specific preference alignment in comparison to single-preference models.

Keywords: Large Language Models, Individual Preferences, Preference-Paired Fine-Tuning, Value Conflict Dilemma, Alignment, DPO, SFT, User-specific Preference

144. ❌ Beyond Single-Dimension Novelty: How Combinations of Theory, Method, and Results-based Novelty Shape Scientific Impact

Authors: Yi Zhao, Yang Chenggang, Yuzhuo Wang, Tong Bao, Zhang Heng, Chengzhi Zhang Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12471v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

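Scoring rationale: The paper uses the DeepSeek-V3 model to classify the novelty of scientific articles, which is an application of large models in science (AI for Science), so it has some connection to the 'Large Language Models' and 'AI for Science' keywords (5 points each). Its core, however, is the relationship between configurations of scientific novelty and impact rather than large-model technology itself, so it is entirely unrelated to most of the technical keywords, such as MoE, Scaling Laws, and RLHF (0 points).

!!! tip deepseek-chat TL;DR

This study examines how different combinations of three novelty dimensions (theory, method, and results) jointly affect scientific impact. It finds that articles with results-based novelty alone receive more citations, and are more likely to rank among the top 1% and top 10% highly cited papers, than articles exhibiting all three types of novelty.

Abstract (Chinese translation)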

科学新颖性推动着研究前沿的进展，但它也伴随着更高的不确定性以及可能遭遇现有范式的阻力，从而导致复杂的科学影响力模式。先前的研究主要考察了单一维度新颖性——如理论新颖性、方法新颖性或结果新颖性——与科学影响力之间的关系。然而，由于科学新颖性本质上是多维的，仅关注孤立维度可能掩盖不同类型的新颖性如何共同塑造影响力。因此，我们对于新颖性类型的组合如何影响科学影响力知之甚少。为此，我们利用发表在《自然·通讯》（Nature Communications）上的15,322篇论文数据集，借助DeepSeek-V3模型，根据论文引言部分的内容将文章划分为三个新颖性维度：理论新颖性、方法新颖性和结果新颖性。这些维度可能在同一篇文章中共存，形成不同的新颖性构型。科学影响力通过五年引用次数以及文章是否属于高被引论文前1%或前10%的指标来衡量。描述性结果表明，仅含结果新颖性以及同时具备三种新颖性类型的构型是样本中的主导构型。回归结果进一步显示，与同时展现三种新颖性类型的文章相比，仅具有结果新颖性的文章获得了显著更多的引用，并且更有可能进入高被引论文的前1%和前10%。这些发现深化了我们对多维新颖性构型如何塑造知识扩散的理解。

Abstract

Scientific novelty drives advances at the research frontier, yet it is also associated with heightened uncertainty and potential resistance from incumbent paradigms, leading to complex patterns of scientific impact. Prior studies have primarily examined the relationship between a single dimension of novelty – such as theoretical, methodological, or results-based novelty – and scientific impact. However, because scientific novelty is inherently multidimensional, focusing on isolated dimensions may obscure how different types of novelty jointly shape impact. Consequently, we know little about how combinations of novelty types influence scientific impact. To this end, we draw on a dataset of 15,322 articles published in Nature Communications. Using the DeepSeek-V3 model, we classify articles into three novelty dimensions based on the content of their Introduction sections: theoretical novelty, methodological novelty, and results-based novelty. These dimensions may coexist within the same article, forming distinct novelty configurations. Scientific impact is measured using five-year citation counts and indicators of whether an article belongs to the top 1% or top 10% highly cited papers. Descriptive results indicate that results-based novelty alone and the simultaneous presence of all three novelty types are the dominant configurations in the sample. Regression results further show that articles with results-based novelty only receive significantly more citations and are more likely to rank among the top 1% and top 10% highly cited papers than articles exhibiting all three novelty types. These findings advance our understanding of how multidimensional novelty configurations shape knowledge diffusion.

Keywords: scientific novelty, multidimensional novelty, theoretical novelty, methodological novelty, results-based novelty, scientific impact, citation analysis, DeepSeek-V3

145. ❌ GLeMM: A large-scale multilingual dataset for morphological research

Authors: Hathout Nabil, Basilio Calderone, Fiammetta Namer, Franck Sajous Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12442v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	5.0/10	0.0

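Scoring rationale: GLeMM focuses on building a multilingual dataset for morphological research, involving word-form variation, semantic annotation, and automated resource creation. It belongs to computational linguistics and has some connection to the application of AI in science (linguistics), so only the 'AI for Science' keyword receives 5 points (some connection). All other keywords concern large models, deep learning principles, training methods, inference optimization, alignment, compression, agent systems, and similar topics that are unrelated to the paper, so they are scored 0.

!!! tip deepseek-chat TL;DR

Accounts of the mechanisms governing variation in form-meaning relations in morphology usually rest on limited data and are hard to replicate and generalize. To address this, the paper presents GLeMM, a large-scale, multilingual, automatically constructed derivational morphology resource that supports data-driven morphological description and experiments with computational methods.

Abstract (Chinese translation)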

在派生形态学中，何种机制支配着词与词之间形式-意义关系的变异？对此类问题的解答通常基于直觉和有限数据的观察，即使在考察多种语言时亦是如此。许多此类研究难以复现和推广。为解决这一问题，我们提出了GLeMM——一种专为形态学实验和数据驱动描述而设计的新型派生资源。GLeMM具有以下特征：（一）规模庞大；（二）覆盖广泛（目前涵盖七种欧洲语言，即德语、英语、西班牙语、法语、意大利语、波兰语、俄语）；（三）采用全自动设计，且在所有语言中保持一致；（四）对每个词条自动标注形态特征；（五）为其中重要子集的词条编码语义描述。该资源使研究者能够探讨诸如形式与意义在构词中的作用等难题，并开发及实验性测试识别派生形态结构的计算方法。本文阐述了如何利用维基词典条目构建GLeMM，并通过多个案例研究展示了该资源的潜在应用场景。

Abstract

In derivational morphology, what mechanisms govern the variation in form-meaning relations between words? The answers to this type of question are typically based on intuition and on observations drawn from limited data, even when a wide range of languages is considered. Many of these studies are difficult to replicate and generalize. To address this issue, we present GLeMM, a new derivational resource designed for experimentation and data-driven description in morphology. GLeMM is characterized by (i) its large size, (ii) its extensive coverage (currently amounting to seven European languages, i.e., German, English, Spanish, French, Italian, Polish, Russian), (iii) its fully automated design, identical across all languages, (iv) the automatic annotation of morphological features on each entry, as well as (v) the encoding of semantic descriptions for a significant subset of these entries. It enables researchers to address difficult questions, such as the role of form and meaning in word-formation, and to develop and experimentally test computational methods that identify the structures of derivational morphology. The article describes how GLeMM is created using Wiktionary articles and presents various case studies illustrating possible applications of the resource.

Keywords: derivational morphology, multilingual dataset, morphological research, automated design, Wiktionary, form-meaning relations, computational methods, data-driven description

146. ❌ Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

Authors: Alicia Curth, Rachel Lawrence, Sushrut Karmalkar, Niranjani Prasad Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12426v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

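Scoring rationale: The paper studies whether Transformer models use their depth adaptively on a relational reasoning task, which is research into the technical principles of large models. The relevant keywords are: 1) 'Large Language Models' (8 points): the study is based on pretrained Transformer models, which fall under large models; 2) 'Pre-training' (5 points): it analyzes pretrained models; 3) 'Post-training/SFT' (8 points): it focuses on how finetuning affects model behavior; 4) 'Chain of Thought/CoT Reasoning' (8 points): it studies multi-hop relational reasoning, a multi-step reasoning task; 5) 'System 2 Thinking' (8 points): it studies in-depth reasoning processes; 6) 'Mechanistic Interpretability' (8 points): it uses the logit lens and causal patching to analyze internal model mechanisms. Other keywords, such as MoE, quantization, and RAG, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

Using a multi-hop relational reasoning task, this paper studies whether Transformer models use their depth adaptively. It finds limited evidence of adaptivity in pretrained models and clearer, more consistent adaptive depth use in finetuned models.

Abstract (Chinese translation)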

本研究旨在探究Transformer模型是否能够根据任务难度的增加自适应地调整其网络深度的使用。我们采用基于家族故事的多跳关系推理任务作为受控实验环境，其中任务难度由所需组合的关系跳数决定。我们通过两种方法进行监测：（一）利用早期读出技术（logit lens）追踪预测结果在不同网络层间的演化过程；（二）通过因果修补（causal patching）技术分析任务相关信息在不同标记间的整合机制。针对预训练模型，我们发现了有限的自适应深度使用证据：部分较大模型在处理较简单任务时，只需更少的网络层即可生成合理答案；且随着推理链长度的增加，模型普遍会调用更多网络层来整合跨标记信息。在对任务进行微调的模型中，我们发现了更清晰且更一致的自适应深度使用证据，这种效应在那些不保留通用语言建模能力、约束较少的微调方案中表现得更为显著。

Abstract

We investigate whether transformers use their depth adaptively across tasks of increasing difficulty. Using a controlled multi-hop relational reasoning task based on family stories, where difficulty is determined by the number of relationship hops that must be composed, we monitor (i) how predictions evolve across layers via early readouts (the logit lens) and (ii) how task-relevant information is integrated across tokens via causal patching. For pretrained models, we find some limited evidence for adaptive depth use: some larger models need fewer layers to arrive at plausible answers for easier tasks, and models generally use more layers to integrate information across tokens as chain length increases. For models finetuned on the task, we find clearer and more consistent evidence of adaptive depth use, with the effect being stronger for less constrained finetuning regimes that do not preserve general language modeling abilities.
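The "early readout" idea behind the logit lens can be illustrated with a small pure-Python toy: project each layer's hidden state through the unembedding matrix, read off the top token, and find the earliest layer from which the readout already matches the final prediction. This is a sketch under simplifying assumptions, not the authors' code. The vocabulary, hidden states, and identity unembedding are made-up toy values, and the sketch omits the final layer normalization that logit-lens readouts on real models typically apply before unembedding.

```python
def logit_lens(hidden_states, unembed, vocab):
    """Project each layer's hidden state through the unembedding; return top tokens."""
    readouts = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        best = max(range(len(logits)), key=logits.__getitem__)
        readouts.append(vocab[best])
    return readouts


def first_stable_layer(readouts):
    """Earliest layer index from which the readout equals the final one and stays there."""
    final = readouts[-1]
    layer = len(readouts) - 1
    while layer > 0 and readouts[layer - 1] == final:
        layer -= 1
    return layer


# Toy values for illustration only: 3-token vocabulary, 4 layers, identity unembedding.
vocab = ["mother", "aunt", "grandmother"]
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hidden_states = [
    [0.5, 0.4, 0.1],  # layer 0
    [0.2, 0.6, 0.2],  # layer 1
    [0.1, 0.3, 0.6],  # layer 2
    [0.0, 0.2, 0.8],  # layer 3 (final)
]

readouts = logit_lens(hidden_states, unembed, vocab)
print(readouts)                      # per-layer top tokens
print(first_stable_layer(readouts))  # earliest layer that already gives the final answer
```

Under this reading, adaptive depth use would show up as a smaller stable-layer index for questions with fewer relationship hops and a larger one for longer chains.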

Keywords: Transformers, adaptive depth, relational reasoning, multi-hop reasoning, logit lens, causal patching, fine-tuning, model interpretability

147. ❌ Agentic Insight Generation in VSM Simulations

Authors: Micha Selak, Dirk Krechel, Adrian Ulges, Sven Spieckermann, Niklas Stoehr, Andreas Loehr Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12421v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

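Scoring rationale: The paper explicitly uses large language models (LLMs) to build an agentic architecture for analyzing value stream map simulations, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each). It mentions 'progressive data discovery' and 'multi-hop reasoning', which have some connection to 'Chain of Thought' (5 points). Other keywords, such as MoE, SFT, RAG, and quantization, are not mentioned in the abstract or are unrelated to the paper, and are all scored 0.

!!! tip deepseek-chat TL;DR

This paper proposes a decoupled, two-step agentic architecture that uses large language models to extract actionable insights from complex value stream map simulations, with top-tier models achieving accuracies of up to 86% and showing high robustness across evaluation runs.

Abstract (Chinese translation)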

从复杂的价值流图模拟中提取可操作的洞见往往具有挑战性、耗时且易出错。大型语言模型的最新进展为支持用户完成此任务提供了新途径。现有方法虽擅长处理原始数据以获取信息，但其结构上难以捕捉该领域中区分相似数据源所需的细微情境差异。为解决这一问题，我们提出一种解耦的双步骤智能体架构。通过将流程编排与数据分析分离，该系统利用融合领域专家知识的渐进式数据发现方法。该架构使得编排层能够智能选择数据源，并在数据结构间执行多跳推理，同时保持精简的内部上下文。多个前沿大型语言模型的测试结果表明了该框架的可行性：顶级模型的准确率最高可达86%，且在多次评估中展现出高度的稳健性。

摘要 (Abstract)

Extracting actionable insights from complex value stream map simulations can be challenging, time-consuming, and error-prone. Recent advances in large language models offer new avenues to support users with this task. While existing approaches excel at processing raw data to gain information, they are structurally unfit to pick up on subtle situational differences needed to distinguish similar data sources in this domain. To address this issue, we propose a decoupled, two-step agentic architecture. By separating orchestration from data analysis, the system leverages progressive data discovery infused with domain expert knowledge. This architecture allows the orchestration to intelligently select data sources and perform multi-hop reasoning across data structures while maintaining a slim internal context. Results from multiple state-of-the-art large language models demonstrate the framework’s viability: with top-tier models achieving accuracies of up to 86% and demonstrating high robustness across evaluation runs.

Keywords: agentic architecture, large language models, value stream map simulations, multi-hop reasoning, data discovery, orchestration, actionable insights, robustness

148. ❌ KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates

Authors: Yudong Li, Jiawei Cai, Linlin Shen Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12397v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
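
The per-keyword rows and the pass mark can be reproduced with a few lines of Python. This is a minimal sketch, not the report generator's actual code: the pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) comes from the report's configuration section, while the rule that a keyword's score equals weight × relevance is an assumption that merely fits the rows shown here. `KeywordRow`, `passing_score`, and `paper_score` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeywordRow:
    keyword: str
    weight: float
    relevance: float  # on the 0-10 scale used in the "x/10" column


def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


def paper_score(rows: List[KeywordRow]) -> float:
    """ASSUMPTION: each keyword contributes weight * relevance to the total."""
    return sum(row.weight * row.relevance for row in rows)


# The three non-zero relevance rows from the table above, with the weights as printed.
rows = [
    KeywordRow("Large Language Models OR LLMs OR Foundation Models", 0.0, 10.0),
    KeywordRow("Pre-training OR Continual Pre-training OR Domain Adaptation", 0.0, 10.0),
    KeywordRow("Hallucination Mitigation OR Factuality OR Truthfulness", 0.0, 8.0),
]

threshold = passing_score(27.0)
total = paper_score(rows)
verdict = "pass" if total >= threshold else "fail"
print(f"{total:.1f} / {threshold:.1f} {verdict}")
```

Under this assumed rule, a weight of 0.0 on every row keeps the total at 0.0 whatever the relevance values are, which matches the 0.0 / 26.6 result shown for this paper.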

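Rationale: The paper's core contribution is a new LLM pre-training method (KoCo) that conditions on knowledge coordinates to strengthen contextual awareness. It is highly relevant to 'Large Language Models' and 'Pre-training' (10/10) because it directly studies improvements to LLM pre-training. It is relevant to 'Hallucination Mitigation' (8/10) because the abstract reports that the method helps distinguish stable facts from noise and mitigates hallucination. Other keywords such as MoE, SFT, RAG, and quantization are not mentioned in the abstract and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes Knowledge Coordinate Conditioning (KoCo), which maps each document to a three-dimensional semantic coordinate and prepends it as a textual prefix during pre-training to strengthen the LLM's contextual awareness. Experiments show significant gains on 10 downstream tasks, roughly 30% faster pre-training convergence, and effective mitigation of hallucination in generated outputs.

Abstract translation (Chinese)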

标准大语言模型（LLM）预训练通常将语料库视为扁平的词元序列，往往忽略了人类在自然语境化信息时所依赖的真实世界背景。为弥补这一差距，我们提出了知识坐标条件化（Knowledge Coordinate Conditioning，KoCo）方法，这是一种将每篇文档映射到三维语义坐标的简单技术。通过将这些坐标作为文本前缀加入预训练过程，我们旨在使模型具备显式的上下文感知能力，从而在真实世界知识结构中学习文档内容。实验结果表明，KoCo在10项下游任务中显著提升了模型性能，并将预训练收敛速度加快了约30%。此外，我们的分析表明，显式建模知识坐标有助于模型区分稳定事实与噪声，从而有效缓解生成内容中的幻觉现象。

Abstract

Standard Large Language Model (LLM) pre-training typically treats corpora as flattened token sequences, often overlooking the real-world context that humans naturally rely on to contextualize information. To bridge this gap, we introduce Knowledge Coordinate Conditioning (KoCo), a simple method that maps every document into a three-dimensional semantic coordinate. By prepending these coordinates as textual prefixes for pre-training, we aim to equip the model with explicit contextual awareness to learn the documents within the real-world knowledge structure. Experiment results demonstrate that KoCo significantly enhances performance across 10 downstream tasks and accelerates pre-training convergence by approximately 30%. Furthermore, our analysis indicates that explicitly modeling knowledge coordinates helps the model distinguish stable facts from noise, effectively mitigating hallucination in generated outputs.
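
The prefixing step described in the abstract can be pictured as plain string formatting. The sketch below is illustrative only: the abstract does not say what the three coordinate axes are or how the prefix is serialized, so the axis names (`axis_1` to `axis_3`), the bracketed prefix format, and the helper `add_coordinate_prefix` are all assumptions rather than the authors' implementation.

```python
from typing import NamedTuple


class KnowledgeCoordinate(NamedTuple):
    # Hypothetical axes: the abstract only states that the coordinate is three-dimensional.
    axis_1: str
    axis_2: str
    axis_3: str


def add_coordinate_prefix(document: str, coord: KnowledgeCoordinate) -> str:
    """Prepend the coordinate as a textual prefix (assumed serialization format)."""
    prefix = f"[coord: {coord.axis_1} | {coord.axis_2} | {coord.axis_3}]"
    return f"{prefix}\n{document}"


# Placeholder values; a real pipeline would need some way to assign coordinates per document.
coord = KnowledgeCoordinate("value-a", "value-b", "value-c")
training_text = add_coordinate_prefix("Some document text.", coord)
print(training_text)
```

The prefixed string would then be tokenized and fed to ordinary next-token pre-training, so the model sees the coordinate before the document body.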

Keywords: Knowledge Coordinate Conditioning, Large Language Model pre-training, semantic coordinate, contextual awareness, hallucination mitigation, pre-training convergence, downstream tasks, knowledge structure

149. ❌ From Myopic Selection to Long-Horizon Awareness: Sequential LLM Routing for Multi-Turn Dialogue

Authors: Jiarui Zhang, Xiangyu Liu, Yong Hu, Chaoyue Niu, Hang Zeng, Shaojie Tang, Fan Wu, Guihai Chen Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12385v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	10.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

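Rationale: The paper studies LLM routing in multi-turn dialogue and is highly relevant to 'Large Language Models' (10/10). The method explicitly uses MCTS to explore dialogue branches, so it is highly relevant to 'Monte Carlo Tree Search OR MCTS AND LLM' (10/10). It learns a routing policy, which relates loosely to 'LLM Agents' (5/10). It uses retrieval-based future state approximation, which relates loosely to 'Retrieval-Augmented Generation' (5/10). Other keywords such as MoE, quantization, and alignment are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses the failure of existing LLM routing methods to maximize cumulative performance in multi-turn dialogue, which it attributes to interaction dynamics and delayed rewards. It proposes DialRouter, which uses MCTS to explore dialogue branches and then learns a lightweight routing policy. Experiments show it significantly outperforms single LLMs and existing routing baselines in task success rate.

Abstract translation (Chinese)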

多轮对话是与大语言模型交互的主要形式。尽管大语言模型路由在单轮场景中表现有效，但现有方法因交互动态性和延迟奖励的存在，难以在多轮对话中实现累积性能最大化。为应对这一挑战，我们将目光从短视的单轮选择转向多轮对话的长视野序列路由。据此，我们提出DialRouter方法：该方法首先通过蒙特卡洛树搜索探索由不同大语言模型选择引发的对话分支，并收集具有高累积奖励的轨迹；随后，DialRouter从搜索衍生的数据中学习一个轻量级路由策略，并辅以基于检索的未来状态近似技术，从而实现在无需在线搜索的情况下进行多轮路由。在开放域和特定领域对话任务上，针对包含开源与闭源大语言模型的多样化候选集进行的实验表明，DialRouter在任务成功率上显著优于单一模型及现有路由基线，并在结合成本感知奖励时实现了更优的性能-成本权衡。

Abstract

Multi-turn dialogue is the predominant form of interaction with large language models (LLMs). While LLM routing is effective in single-turn settings, existing methods fail to maximize cumulative performance in multi-turn dialogue due to interaction dynamics and delayed rewards. To address this challenge, we move from myopic, single-turn selection to long-horizon sequential routing for multi-turn dialogue. Accordingly, we propose DialRouter, which first performs MCTS to explore dialogue branches induced by different LLM selections and collect trajectories with high cumulative rewards. DialRouter then learns a lightweight routing policy from search-derived data, augmented with retrieval-based future state approximation, enabling multi-turn routing without online search. Experiments on both open-domain and domain-specific dialogue tasks across diverse candidate sets of both open-source and closed-source LLMs demonstrate that DialRouter significantly outperforms single LLMs and existing routing baselines in task success rate, while achieving a superior performance-cost trade-off when combined with a cost-aware reward.

Keywords: LLM routing, multi-turn dialogue, MCTS, sequential routing, DialRouter, cumulative rewards, retrieval-based approximation, performance-cost trade-off

150. ❌ ReasonXL: Shifting LLM Reasoning Language Without Sacrificing Performance

Authors: Daniil Gurgurov, Tom Röhr, Sebastian von Rohrscheidt, Josef van Genabith, Alexander Löser, Simon Ostermann Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12378v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

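Rationale: The paper studies LLM reasoning in non-English languages and directly involves LLMs, supervised fine-tuning (SFT), and chain-of-thought (CoT) reasoning (10/10 each). It relates loosely to in-depth reasoning and explainable AI (5/10 each). Other keywords such as MoE, quantization, and RAG are not covered (0).

!!! tip deepseek-chat TL;DR

The paper addresses the reasoning-language mismatch of LLMs in non-English settings. It builds a parallel multilingual corpus of reasoning traces and applies a two-stage pipeline (SFT followed by RLVR) so that LLMs reason in the target language without sacrificing performance.

Abstract translation (Chinese)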

尽管多语言能力有所进展，大多数大语言模型（LLM）的训练过程——尤其是其推理轨迹的生成——仍以英语为中心。即使处理非英语问题时，这些模型也主要使用英语进行推理，这为非英语使用场景带来了根本性的不匹配。
我们通过三项贡献直接应对这一差距。（i）我们提出了ReasonXL，这是首个涵盖五种欧洲语言（英语、德语、法语、意大利语和西班牙语）的大规模跨领域并行推理轨迹语料库，每种语言包含超过两百万个对齐样本，每个样本均包含提示、推理轨迹和最终输出，从而能够直接监督特定语言的推理过程。（ii）利用ReasonXL，我们证明通过一个简单的两阶段流程——监督微调（SFT）后接可验证奖励的强化学习（RLVR）——可以使大语言模型完全适应以目标语言进行推理。所得模型在性能上达到或超过基线水平，同时通用知识损失极小，且跨语言迁移能力得到广泛保留。（iii）我们对这一适应过程进行了深入的表示分析，发现模型深度上存在明确的功能分工：早期层包含一个因果决定语言身份的激活瓶颈，而上层则集中了由适应过程驱动的权重与激活变化。我们进一步发现，与监督微调相比，可验证奖励的强化学习能以更小的参数更新实现更大的行为差异，这表明尽管权重更新幅度小得多，但其实现了更高效的表示重定向。

Abstract

Despite advances in multilingual capabilities, most large language models (LLMs) remain English-centric in their training and, crucially, in their production of reasoning traces. Even when tasked with non-English problems, these models predominantly reason in English, creating a fundamental mismatch for non-English usage scenarios. We address this disparity directly with three contributions. (i) We introduce ReasonXL, the first large-scale parallel corpus of cross-domain reasoning traces spanning five European languages (English, German, French, Italian, and Spanish), with over two million aligned samples per language, each comprising prompts, reasoning traces, and final outputs, enabling direct supervision of language-specific reasoning. (ii) Using ReasonXL, we demonstrate that LLMs can be adapted to reason entirely in a desired target language, using a simple two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). The resulting models match or exceed baseline performance, with minimal loss in general knowledge and broadly preserved cross-lingual transfer. (iii) We conduct an extensive representational analysis of the adaptation and find a clear functional division across model depth: early layers contain an activation bottleneck that causally determines language identity, while upper layers concentrate the weight and activation changes driven by adaptation. We further find that RLVR achieves greater behavioral divergence from the base model with smaller parameter updates than SFT, suggesting a more efficient representational rerouting despite much smaller weight updates.

Keywords: Large Language Models, Reasoning Traces, Multilingual, Supervised Fine-tuning, Reinforcement Learning, Cross-lingual Transfer, Representational Analysis, Language Adaptation

151. ❌ Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Authors: Tomer Ashuach, Liat Ein-Dor, Shai Gretz, Yoav Katz, Yonatan Belinkov Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12373v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

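Rationale: The paper studies whether LLMs hold privileged knowledge about answer correctness (analogous to human introspection) by training correctness classifiers and comparing probes on a model's own hidden states against probes on external models' representations. Highly relevant: 'Large Language Models' (the object of study) and 'Mechanistic Interpretability' (probing internal mechanisms and interpretability). Moderately relevant: 'Self-Correction' (the model's capacity for self-assessment), 'Hallucination Mitigation' (factuality assessment), and 'Chain of Thought' (analysis of math reasoning tasks). Other keywords such as MoE, quantization, and RAG are not covered.

!!! tip deepseek-chat TL;DR

The study investigates whether large language models hold internal privileged knowledge about answer correctness. On disagreement subsets, it finds such an advantage in factual knowledge tasks but not in math reasoning, and the factual advantage emerges progressively from early-to-mid layers onward.

Abstract translation (Chinese)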

人类通过内省来评估自身的理解程度，这种内省依赖于外部观察者无法触及的私人内在状态。本研究探讨大型语言模型是否拥有关于答案正确性的类似特权知识——即无法通过外部观察获取的信息。我们基于模型自身隐藏状态与外部模型的问题表征训练了正确性分类器，以检验自我表征是否能带来性能优势。在标准评估中，我们发现并无优势：自我探测器的表现与同侪模型探测器相当。我们假设这是由于模型间对答案正确性存在高度共识。为分离出真正的特权知识，我们在模型产生矛盾预测的分歧子集上进行评估。在此条件下，我们发现了领域特定的特权知识：在事实性知识任务中，自我表征持续优于同侪模型表征，但在数学推理任务中未显示优势。我们进一步将这种领域不对称性定位到模型的不同层级，发现事实性知识的优势从早中期层级开始逐步显现，这与模型特定的记忆检索机制一致，而数学推理任务在任何深度均未呈现稳定优势。

Abstract

Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model’s own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement of answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.
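
The disagreement-subset evaluation can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's protocol: it assumes "disagreement" means two models differ in whether they answered a question correctly, and all the lists below are invented placeholder data, not results. The function names are hypothetical.

```python
from typing import List, Sequence


def disagreement_indices(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> List[int]:
    """Indices of questions where exactly one of the two models answered correctly."""
    return [i for i, (a, b) in enumerate(zip(correct_a, correct_b)) if a != b]


def probe_accuracy(probe_preds: Sequence[bool], labels: Sequence[bool], indices: Sequence[int]) -> float:
    """Accuracy of a correctness probe, restricted to the given question indices."""
    if not indices:
        raise ValueError("empty evaluation subset")
    hits = sum(probe_preds[i] == labels[i] for i in indices)
    return hits / len(indices)


# Invented toy data: whether model A / model B answered each question correctly.
correct_a = [True, True, False, False, True]
correct_b = [True, False, False, True, True]

# Invented probe outputs, both predicting model A's correctness.
self_probe = [True, True, False, True, True]   # trained on model A's own hidden states
peer_probe = [True, False, False, True, True]  # trained on another model's representations

subset = disagreement_indices(correct_a, correct_b)
print("disagreement subset:", subset)
print("self-probe accuracy:", probe_accuracy(self_probe, correct_a, subset))
print("peer-probe accuracy:", probe_accuracy(peer_probe, correct_a, subset))
```

Restricting evaluation to the subset removes the questions where both models agree, which is where a peer probe could score well simply by tracking consensus.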

Keywords: Large Language Models, Privileged Knowledge, Self-probes, Factual Knowledge, Math Reasoning, Model Layers, Correctness Classification, Inter-model Agreement

152. ❌ Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

Authors: Rui Yin, Tianxu Han, Naen Xu, Changjiang Li, Ping He, Chunyi Zhou, Jun Wang, Zhihui Fu, Tianyu Du, Jinbao Li, Shouling Ji Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12359v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

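Rationale: The paper studies a backdoor attack on safety-aligned large language models that edits model weights so that a trigger elicits attacker-specified behavior. Core relevant keywords: 1) 'Large Language Models': the paper explicitly targets LLMs; 2) 'Post-training': the backdoor is injected with a post-hoc weight-editing method; 3) 'Instruction Tuning/Alignment': the attack targets safety-aligned models. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, and RAG are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes a stealthy backdoor attack on safety-aligned large language models that compiles an activation steering vector into the model weights, producing sustained harmful output when the trigger is present while preserving safety and utility on normal inputs.

Abstract translation (Chinese)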

安全对齐的大语言模型（LLMs）正越来越多地部署于实际应用流程中，然而这种部署也扩大了供应链攻击面：攻击者可以分发带有后门的模型检查点，这些检查点在标准评估下表现正常，但在特定隐藏触发器出现时会发生越狱。近期的后处理权重编辑方法提供了一种高效的后门注入途径，即通过直接修改模型权重，将触发器映射至攻击者指定的响应。然而，现有方法通常优化的是词元级别的映射，强制模型生成一个肯定的前缀（例如“当然”），但这并不能保证持续输出有害内容——模型可能表面上开始同意，却在几步解码后回归至安全对齐的拒绝状态。我们通过将后门目标从表层词元转向内部表征来解决这一可靠性差距。我们提取了一个能够捕捉顺从行为与拒绝行为差异的导向向量，并将其编译为一种仅在触发器出现时激活的持久性权重修改。为保持隐蔽性和良性功能，我们施加了零空间约束，使得注入的编辑在干净输入上保持休眠状态。该方法效率高，仅需少量示例并允许闭式解。在多种安全对齐的LLM和越狱基准测试中，我们的方法在保持非触发状态下安全性与通用性能的同时，实现了较高的触发攻击成功率。

Abstract

Safety-aligned large language models (LLMs) are increasingly deployed in real-world pipelines, yet this deployment also enlarges the supply-chain attack surface: adversaries can distribute backdoored checkpoints that behave normally under standard evaluation but jailbreak when a hidden trigger is present. Recent post-hoc weight-editing methods offer an efficient approach to injecting such backdoors by directly modifying model weights to map a trigger to an attacker-specified response. However, existing methods typically optimize a token-level mapping that forces an affirmative prefix (e.g., “Sure”), which does not guarantee sustained harmful output – the model may begin with apparent agreement yet revert to safety-aligned refusal within a few decoding steps. We address this reliability gap by shifting the backdoor objective from surface tokens to internal representations. We extract a steering vector that captures the difference between compliant and refusal behaviors, and compile it into a persistent weight modification that activates only when the trigger is present. To preserve stealthiness and benign utility, we impose a null-space constraint so that the injected edit remains dormant on clean inputs. The method is efficient, requiring only a small set of examples and admitting a closed-form solution. Across multiple safety-aligned LLMs and jailbreak benchmarks, our method achieves high triggered attack success while maintaining non-triggered safety and general utility.

Keywords: Large Language Models, Safety Alignment, Backdoor Attack, Weight Editing, Activation Steering, Null-space Constraint, Jailbreak, Stealthy Attack

153. ❌ CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

Authors: Jingbo Yang, Guanyu Yao, Bairu Hou, Xinghan Yang, Nikolai Glushnev, Iwona Bialynicka-Birula, Duo Ding, Shiyu Chang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12312v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

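Rationale: The paper studies how well LLM judges detect compliance violations committed by LLMs deployed as task-oriented agents in dialogue systems, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10/10). Its evaluation of LLM-as-a-Judge relates loosely to 'Post-training/SFT' and 'Instruction Tuning/Alignment' (5/10), since fine-tuning and alignment may improve judge performance. Its focus on violation detection is partly relevant to 'Hallucination Mitigation/Factuality' (5/10), as it concerns the reliability of model outputs. Other keywords such as MoE, Scaling Laws, and RAG are not mentioned or are only marginally related, and receive 0.

!!! tip deepseek-chat TL;DR

The paper introduces CompliBench, a benchmark for evaluating how well LLM judges detect and localize guideline violations in multi-turn dialogues. Using an automated data generation pipeline, it finds that current LLMs perform poorly on this task, while a small judge model fine-tuned on the synthesized data outperforms leading LLMs and generalizes to unseen domains.

Abstract translation (Chinese)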

随着大型语言模型（LLM）在企业环境中越来越多地被部署为面向任务的智能体，确保其严格遵守复杂且领域特定的操作准则变得至关重要。尽管利用LLM作为评判者是一种具有前景的可扩展评估方案，但这些评判者在检测具体策略违规行为方面的可靠性仍很大程度上未被探索。这一空白主要是由于缺乏系统化的数据生成方法，而细粒度人工标注的高昂成本以及合成真实智能体违规行为的困难又进一步阻碍了此类方法的建立。本文中，我们引入了CompliBench，这是一个新颖的基准测试，旨在评估LLM评判者在多轮对话中检测和定位准则违规行为的能力。为克服数据稀缺问题，我们开发了一个可扩展的自动化数据生成流程，用于模拟用户与智能体之间的交互。我们可控的缺陷注入过程能自动生成关于被违反准则及具体对话轮次的精确真实标签，同时一种对抗性搜索方法确保所引入的扰动具有高度挑战性。我们的全面评估表明，当前最先进的专有LLM在此任务上表现显著不佳。此外，我们证明，在我们合成数据上微调的小规模评判者模型能够超越领先的LLM，并且能很好地泛化到未见过的业务领域，这凸显了我们的流程可作为训练鲁棒生成式奖励模型的有效基础。

Abstract

As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.

Keywords: LLM judges, compliance violation detection, dialogue systems, benchmark, data generation pipeline, adversarial search, fine-tuning, generalization

154. ❌ ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance

Authors: Haoran Li, Yulin Chen, Huihao Jing, Wenbin Hu, Tsz Ho Li, Chanhou Lou, Hong Ting Tsang, Sirui Han, Yangqiu Song Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12308v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

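Rationale: The paper explicitly uses LLMs as context evaluators and proposes the ContextLens framework to improve LLM-based privacy and safety compliance assessment, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10/10). It does not involve the specific techniques or applications behind the other keywords, such as MoE, SLMs, training methods, inference optimization, agent systems, model compression, or AI for science, so those keywords receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes ContextLens, a framework that uses large language models (LLMs) to handle incomplete and ambiguous context for legal compliance assessment, significantly improving assessment performance on GDPR and EU AI Act benchmarks.

Abstract translation (Chinese)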

个体对数据隐私与人工智能安全的关切具有高度情境依赖性，其范围远超敏感信息模式本身。解决这些问题需通过对具体情境进行推理，以识别并降低潜在风险。尽管研究者已广泛探索利用大语言模型作为情境化安全与隐私评估工具，但这些研究通常假设存在完整清晰的情境信息，而现实情境往往具有模糊性与不完整性。本文提出ContextLens——一个基于半规则框架的系统，该框架借助大语言模型将输入情境锚定于法律领域，并明确识别法律合规性中的已知与未知因素。与直接评估安全结果不同，我们的ContextLens通过指导大语言模型回答一系列精心设计的问题来实现评估，这些问题涵盖适用性、通用原则与具体条款，用以检验是否符合预设优先级与规则。我们在现有合规基准（涵盖《通用数据保护条例》（GDPR）与《欧盟人工智能法案》（EU AI Act））上进行了大量实验。结果表明，ContextLens能显著提升大语言模型的合规评估能力，在无需训练的情况下超越现有基线方法。此外，ContextLens还能进一步识别情境中的模糊因素与缺失要素。

Abstract

Individuals’ concerns about data privacy and AI safety are highly contextualized and extend beyond sensitive patterns. Addressing these issues requires reasoning about the context to identify and mitigate potential risks. Though researchers have widely explored using large language models (LLMs) as evaluators for contextualized safety and privacy assessments, these efforts typically assume the availability of complete and clear context, whereas real-world contexts tend to be ambiguous and incomplete. In this paper, we propose ContextLens, a semi-rule-based framework that leverages LLMs to ground the input context in the legal domain and explicitly identify both known and unknown factors for legal compliance. Instead of directly assessing safety outcomes, our ContextLens instructs LLMs to answer a set of crafted questions that span over applicability, general principles and detailed provisions to assess compliance with pre-defined priorities and rules. We conduct extensive experiments on existing compliance benchmarks that cover the General Data Protection Regulation (GDPR) and the EU AI Act. The results suggest that our ContextLens can significantly improve LLMs’ compliance assessment and surpass existing baselines without any training. Additionally, our ContextLens can further identify the ambiguous and missing factors.

Keywords: ContextLens, large language models, legal compliance, privacy, safety, contextualized assessment, GDPR, EU AI Act

155. ❌ Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Authors: Yizhe Chi, Deyao Hong, Dapeng Jiang, Tianwei Luo, Kaisen Yang, Boshi Zhang, Zhe Cao, Xiaoyan Fan, Bingxiang He, Han Hao, Weiyang Jin, Dianqiao Lei, Qingle Liu, Houde Qian, Bowen Wang, Situ Wang, Youjie Zheng, Yifan Zhou, Calvin Xiao, Eren Cai, Qinhuai Na Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12290v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

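Scoring rationale: The paper studies the self-evolution capability of LLM agents on engineering tasks, iterating through a propose-execute-evaluate loop. It is highly relevant to LLM agents, tool use, and self-correction (10 points each), touches on reasoning and System 2 thinking (5 points each), and counts as an application of AI for Science in the engineering domain (5 points). Other keywords such as MoE, scaling laws, training methods, and efficiency techniques are not covered (0 points).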

!!! tip deepseek-chat TL;DR

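The paper introduces the Frontier-Eng benchmark, which evaluates how well LLM agents solve complex, open-ended real-world engineering tasks and self-evolve through a generative optimization loop (iterative propose-execute-evaluate). Claude 4.6 Opus is the most robust model, yet the benchmark remains challenging for all models, and both the frequency and the magnitude of improvements follow a dual power-law decay.

Abstract translation

Current LLM agent benchmarks focus mainly on binary pass/fail tasks such as code generation or search-based question answering, and often overlook the real-world engineering value that comes from iteratively optimizing feasible designs. To address this, we present Frontier-Eng, a human-verified benchmark for generative optimization: within a fixed interaction budget, an agent follows an iterative propose-execute-evaluate loop, generating candidate artifacts, receiving executable verifier feedback, and revising accordingly. The benchmark covers 47 tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are built on industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models with representative search frameworks and find that, although Claude 4.6 Opus achieves the most robust performance, the benchmark remains challenging for all models. Our analysis indicates a dual power-law decay in improvement frequency (roughly inversely proportional to the iteration count) and improvement magnitude (roughly inversely proportional to the number of improvements). We further show that, although greater search width improves parallelism and diversity, search depth remains crucial for hard-won improvements under a fixed budget. Frontier-Eng sets a new standard for assessing how well AI agents integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.

Abstract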

Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, often neglect the value of real-world engineering that is often captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization – an iterative propose-execute-evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget – spanning $47$ tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are grounded in industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search frameworks, finding that while Claude 4.6 Opus achieves the most robust performance, the benchmark remains challenging for all models. Our analysis suggests a dual power-law decay in improvement frequency ($\sim$ 1/iteration) and magnitude ($\sim$ 1/improvement count). We further show that although width improves parallelism and diversity, depth remains crucial for hard-won improvements under a fixed budget. Frontier-Eng establishes a new standard for assessing the capacity of AI agents to integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.

Keywords: LLM agents, generative optimization, self-evolving agents, engineering tasks, iterative propose-execute-evaluate loop, real-world benchmark, feasibility constraints, continuous reward signals
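
The propose-execute-evaluate loop described in the abstract above can be illustrated with a toy sketch. This is not the Frontier-Eng harness: the `verify` and `propose` functions, the one-dimensional design variable, and the reward shape are all hypothetical stand-ins, chosen only to show a verifier that returns a continuous reward, enforces a hard feasibility constraint, and is queried under a fixed interaction budget.

```python
import random
from typing import List, Optional, Tuple


def verify(design: float) -> Tuple[bool, float]:
    """Hypothetical verifier: hard feasibility check plus a continuous reward."""
    feasible = 0.0 <= design <= 10.0   # hard constraint on the design variable
    reward = -(design - 7.0) ** 2      # continuous score, best at design = 7.0
    return feasible, reward


def propose(best: Optional[float], rng: random.Random) -> float:
    """Hypothetical proposer: random start, then small revisions of the incumbent."""
    if best is None:
        return rng.uniform(-5.0, 15.0)
    return best + rng.gauss(0.0, 1.0)


def generative_optimization(budget: int, seed: int = 0) -> Tuple[Optional[float], float, List[int]]:
    """Run the loop for `budget` interactions and record when improvements occur."""
    rng = random.Random(seed)
    best: Optional[float] = None
    best_reward = float("-inf")
    improved_at: List[int] = []
    for step in range(1, budget + 1):
        candidate = propose(best, rng)         # propose
        feasible, reward = verify(candidate)   # execute and evaluate
        if feasible and reward > best_reward:  # infeasible candidates never replace the incumbent
            best, best_reward = candidate, reward
            improved_at.append(step)
    return best, best_reward, improved_at
```

The `improved_at` list records the steps at which the incumbent improved, which is the kind of trace one would need in order to study how improvement frequency changes with iteration count.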

156. ❌ The Enforcement and Feasibility of Hate Speech Moderation on Twitter

Authors: Manuel Tonneau, Dylan Thurgood, Diyi Liu, Niyati Malhotra, Victor Orozco-Olvera, Ralph Schroeder, Scott A. Hale, Manoel Horta Ribeiro, Paul Röttger, Samuel P. Fraiberger Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12289v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

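Scoring rationale: The paper studies the enforcement and feasibility of hate speech moderation on Twitter (now X) and belongs to social science, platform governance, and policy research. It covers large-scale content moderation systems, human-AI moderation pipelines, the limits of automated detection systems, and economic feasibility analysis, but it does not address large models, deep learning principles, model training methods, inference optimization, AI agents, model compression, or AI for Science. All keywords concern large-model and deep learning technology, whereas this is an empirical social science study, so every keyword receives a relevance of 0.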

!!! tip deepseek-chat TL;DR

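Through a global audit, the paper finds that 80% of hateful tweets on Twitter remain online five months after posting, revealing a severe enforcement gap. Simulations nonetheless indicate that substantially reducing user exposure to hate speech through a human-AI moderation pipeline is economically feasible, at a cost below existing regulatory penalties.

Abstract translation

Online hate speech is linked to serious social harms, yet it remains unclear how consistently platforms enforce their hate speech policies or whether enforcement at scale is feasible. We examine these questions through a global audit of hate speech moderation on Twitter (now X). Using a complete 24-hour snapshot of public tweets, we build representative samples of 540,000 tweets, annotated for hate speech by trained annotators across eight major languages. Five months after posting, 80% of hateful tweets remain on the platform, including explicitly violent hate speech. Such tweets are no more likely to be removed than non-hateful tweets, and neither severity nor visibility increases the probability of removal. We then test whether these enforcement gaps reflect technical limits of large-scale moderation systems. Fully automated detection systems cannot reliably identify hate speech without producing many false positives, but they can effectively prioritize likely violations for human review. Simulations of a human-AI moderation pipeline show that substantially reducing user exposure to hate speech is economically feasible, at a cost below existing regulatory penalties. These results indicate that the persistence of online hate speech cannot be attributed to technical limits alone and also reflects platforms' institutional choices in allocating moderation resources.

Abstract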

Online hate speech is associated with substantial social harms, yet it remains unclear how consistently platforms enforce hate speech policies or whether enforcement is feasible at scale. We address these questions through a global audit of hate speech moderation on Twitter (now X). Using a complete 24-hour snapshot of public tweets, we construct representative samples comprising 540,000 tweets annotated for hate speech by trained annotators across eight major languages. Five months after posting, 80% of hateful tweets remain online, including explicitly violent hate speech. Such tweets are no more likely to be removed than non-hateful tweets, with neither severity nor visibility increasing the likelihood of removal. We then examine whether these enforcement gaps reflect technical limits of large-scale moderation systems. While fully automated detection systems cannot reliably identify hate speech without generating large numbers of false positives, they effectively prioritize likely violations for human review. Simulations of a human-AI moderation pipeline indicate that substantially reducing user exposure to hate speech is economically feasible at a cost below existing regulatory penalties. These results suggest that the persistence of online hate cannot be explained by technical constraints alone but also reflects institutional choices in the allocation of moderation resources.

Keywords: hate speech moderation, Twitter audit, content moderation, automated detection, human-AI pipeline, enforcement gaps, social media governance, policy compliance

157. ❌ Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning

Authors: Houxing Ren, Mingjie Zhan, Zimu Lu, Ke Wang, Yunqiao Yang, Haotian Hou, Hongsheng Li Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12282v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

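Scoring rationale: The paper proposes SpreadsheetAgent, a two-stage multi-agent framework for spreadsheet understanding built on an LLM (GPT-OSS-120B), so it is highly relevant to 'Large Language Models' (10 points). It adopts a step-by-step reading and reasoning paradigm, which involves 'Chain of Thought' (10 points) and 'System 2 Thinking' (8 points). The framework is a multi-agent system ('Multi-agent Systems': 10 points) that uses LLM agents ('LLM Agents': 10 points) and integrates tools such as code execution ('Tool Use': 8 points). It addresses long-context problems ('Context Window Extension': 8 points) and reduces errors through a verification module ('Self-Correction': 5 points; 'Hallucination Mitigation': 5 points). It applies to scientific data management ('AI for Science': 5 points). Other keywords such as MoE, SFT, and RAG are not directly involved and receive 0 points.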

!!! tip deepseek-chat TL;DR

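To address LLM input length limits and neglected visual semantics when understanding large real-world spreadsheets, the study proposes SpreadsheetAgent, a two-stage multi-agent framework. Through step-by-step reading, multimodal reasoning, and a verification module, it outperforms the baseline by 2.89 percentage points on Spreadsheet Bench, yielding more robust and scalable spreadsheet understanding.

Abstract translation

Spreadsheets are central to real-world applications such as enterprise reporting, auditing, and scientific data management. Despite their wide use, existing LLM-based methods usually treat tables as plain text and ignore key layout cues and visual semantics. In addition, real-world spreadsheets are often very large and exceed the input length that LLMs can process efficiently. To address these challenges, we propose SpreadsheetAgent, a two-stage multi-agent framework for spreadsheet understanding that adopts a step-by-step reading and reasoning paradigm. Rather than loading the whole spreadsheet at once, the framework interprets local regions incrementally through multiple modalities, including code execution results, images, and LaTeX tables. The method first builds a structural sketch and row/column summaries, then performs task-driven reasoning over this intermediate representation in the solving stage. To further improve reliability, we design a verification module that validates extracted structures through targeted inspections, reducing error propagation and ensuring trustworthy inputs for downstream reasoning. Extensive experiments on two spreadsheet datasets demonstrate the effectiveness of our method. With GPT-OSS-120B, SpreadsheetAgent reaches 38.16% accuracy on Spreadsheet Bench, an absolute gain of 2.89 percentage points over the ChatGPT Agent baseline (35.27%). These results highlight the potential of SpreadsheetAgent for robust and scalable spreadsheet understanding in real-world applications. Code is available at https://github.com/renhouxing/SpreadsheetAgent.git.

Abstract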

Spreadsheets are central to real-world applications such as enterprise reporting, auditing, and scientific data management. Despite their ubiquity, existing large language model based approaches typically treat tables as plain text, overlooking critical layout cues and visual semantics. Moreover, real-world spreadsheets are often massive in scale, exceeding the input length that LLMs can efficiently process. To address these challenges, we propose SpreadsheetAgent, a two-stage multi-agent framework for spreadsheet understanding that adopts a step-by-step reading and reasoning paradigm. Instead of loading the entire spreadsheet at once, SpreadsheetAgent incrementally interprets localized regions through multiple modalities, including code execution results, images, and LaTeX tables. The method first constructs a structural sketch and row/column summaries, and then performs task-driven reasoning over this intermediate representation in the Solving Stage. To further enhance reliability, we design a verification module that validates extracted structures via targeted inspections, reducing error propagation and ensuring trustworthy inputs for downstream reasoning. Extensive experiments on two spreadsheet datasets demonstrate the effectiveness of our approach. With GPT-OSS-120B, SpreadsheetAgent achieves 38.16% on Spreadsheet Bench, outperforming the ChatGPT Agent baseline (35.27%) by 2.89 absolute points. These results highlight the potential of SpreadsheetAgent to advance robust and scalable spreadsheet understanding in real-world applications. Code is available at https://github.com/renhouxing/SpreadsheetAgent.git.

Keywords: Spreadsheet Understanding, Multi-Agent Framework, Large Language Models, Step-by-Step Reasoning, Real-World Applications, Context Window Limitation, Visual Semantics, Verification Module

158. ❌ CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

Authors: Zaoyu Chen, Jianbo Dai, Boyu Zhu, Jingdong Wang, Huiming Wang, Xin Xu, Haoyang Yuan, Zhijiang Guo, Xiao-Ming Wu Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12268v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

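Scoring rationale: The paper evaluates how well LLMs generate behavioral specifications in code-related tasks. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) because it focuses explicitly on LLMs and evaluates 15 state-of-the-art models. Other keywords, such as MoE, SLMs, training techniques, inference optimization, agent systems, and AI for Science applications, are not covered in the paper and therefore receive 0 points.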

!!! tip deepseek-chat TL;DR

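The paper studies the ability of large language models to generate executable behavioral specifications. It finds that LLM performance drops sharply on repository-level tasks (the best model reaches a pass rate of only 20.2%) and that specification generation is more challenging than code generation, indicating that strong coding performance does not necessarily reflect a deep understanding of program semantics.

Abstract translation

Large language models (LLMs) can generate code from natural language, but how far they capture intended program behavior remains unclear. Executable behavioral specifications, defined through preconditions and postconditions, offer a concrete way to assess this understanding. However, existing work on specification generation is limited in evaluation methodology, task settings, and specification expressiveness. We present CodeSpecBench, a benchmark for executable behavioral specification generation under an execution-based evaluation protocol. CodeSpecBench supports both function-level and repository-level tasks and encodes specifications as executable Python functions. Built from diverse real-world codebases, it enables a realistic assessment of both correctness (accepting valid behaviors) and completeness (rejecting invalid behaviors). Evaluating 15 state-of-the-art LLMs on CodeSpecBench, we observe a sharp performance drop on repository-level tasks, where the best model attains a pass rate of only 20.2%. We further find that specification generation is more challenging than code generation, which indicates that strong coding performance does not necessarily reflect a deep understanding of intended program semantics. Our data and code are available at https://github.com/SparksofAGI/CodeSpecBench.

Abstract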

Large language models (LLMs) can generate code from natural language, but the extent to which they capture intended program behavior remains unclear. Executable behavioral specifications, defined via preconditions and postconditions, provide a concrete means to assess such understanding. However, existing work on specification generation is constrained in evaluation methodology, task settings, and specification expressiveness. We introduce CodeSpecBench, a benchmark for executable behavioral specification generation under an execution-based evaluation protocol. CodeSpecBench supports both function-level and repository-level tasks and encodes specifications as executable Python functions. Constructed from diverse real-world codebases, it enables a realistic assessment of both correctness (accepting valid behaviors) and completeness (rejecting invalid behaviors). Evaluating 15 state-of-the-art LLMs on CodeSpecBench, we observe a sharp performance degradation on repository-level tasks, where the best model attains only a 20.2% pass rate. We further find that specification generation is substantially more challenging than code generation, indicating that strong coding performance does not necessarily reflect deep understanding of intended program semantics. Our data and code are available at https://github.com/SparksofAGI/CodeSpecBench.

Keywords: Large Language Models, executable behavioral specifications, benchmark, code generation, program semantics, specification generation, evaluation, CodeSpecBench
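
To make the idea of an executable behavioral specification concrete, here is a minimal sketch for a list-sorting function. It does not reproduce CodeSpecBench's actual format or harness: the function names `precondition`, `postcondition`, and `check_spec`, and the example behaviors, are assumptions made for illustration. The sketch only shows how a specification written as Python functions can be checked for correctness (it accepts valid behaviors) and completeness (it rejects invalid ones).

```python
from typing import List, Sequence, Tuple

Behavior = Tuple[List[int], List[int]]  # (input, observed output)


def precondition(xs: List[int]) -> bool:
    """The input must be a list of integers."""
    return all(isinstance(x, int) for x in xs)


def postcondition(xs: List[int], result: List[int]) -> bool:
    """The output must be ordered and contain exactly the input's elements."""
    is_ordered = all(a <= b for a, b in zip(result, result[1:]))
    same_items = sorted(xs) == sorted(result)
    return is_ordered and same_items


def accepts(behavior: Behavior) -> bool:
    """A behavior is accepted when both conditions hold."""
    xs, result = behavior
    return precondition(xs) and postcondition(xs, result)


def check_spec(valid: Sequence[Behavior], invalid: Sequence[Behavior]) -> Tuple[bool, bool]:
    """Return (correct, complete) for the specification above."""
    correct = all(accepts(b) for b in valid)         # accepts every valid behavior
    complete = not any(accepts(b) for b in invalid)  # rejects every invalid behavior
    return correct, complete
```

A postcondition that checked only ordering would still be correct on these examples but not complete, because it would accept an output that silently drops elements.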

159. ❌ CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades

Authors: Raeyoung Chang, Dongwook Kwon, Jisoo Lee, Nikhil Verma Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12262v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

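Scoring rationale: The paper studies LLM cascade systems (highly relevant to 'Large Language Models') and introduces multi-agent deliberation (highly relevant to 'LLM Agents' and 'Multi-agent Systems') to resolve uncertain queries. It evaluates performance on benchmarks spanning science, medicine, and other domains (some relevance to 'AI for Science'). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and therefore receive 0 points.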

!!! tip deepseek-chat TL;DR

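The paper proposes CascadeDebate, a framework that inserts multi-agent deliberation at each tier of an LLM cascade to resolve uncertain queries dynamically and avoid premature escalation to costlier models or experts, substantially improving performance on several science and general-knowledge benchmarks.

Abstract translation

Cascaded LLM systems coordinate models of different sizes with human experts to balance accuracy, cost, and abstention under uncertainty. However, the single-model tier at each stage often struggles with ambiguous queries: under-confidence and inefficient compute scaling trigger premature escalation to costlier models or experts. CascadeDebate addresses this limitation by inserting multi-agent deliberation directly at each tier's escalation boundary. Confidence-based routers activate lightweight agent ensembles only for uncertain cases, so that ambiguities are resolved internally through consensus without invoking higher-cost escalation. Our unified architecture alternates single-model inference with selective multi-agent deliberation across model scales, with human experts as the final fallback. This design allocates test-time compute dynamically according to query difficulty. Across five benchmarks spanning science, medicine, and general knowledge, CascadeDebate outperforms strong single-model cascades and standalone multi-agent systems by up to 26.75%. An online threshold optimizer proves essential: it raises accuracy by 20.98% to 52.33% in relative terms over fixed policies and enables elastic adaptation to real-world distributions.

Abstract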

Cascaded LLM systems coordinate models of varying sizes with human experts to balance accuracy, cost, and abstention under uncertainty. However, single-model tiers at each stage often struggle with ambiguous queries, triggering premature escalations to costlier models or experts due to under-confidence and inefficient compute scaling. CascadeDebate addresses this gap by inserting multi-agent deliberation directly at each tier’s escalation boundary. Confidence-based routers activate lightweight agent ensembles only for uncertain cases, enabling consensus-driven resolution of ambiguities internally without invoking higher-cost upgrades. Our unified architecture alternates single-model inference with selective multi-agent deliberation across model scales, culminating in human experts as the final fallback. This design scales test-time compute dynamically according to query difficulty. Across five benchmarks spanning science, medicine, and general knowledge, CascadeDebate outperforms strong single-model cascades and standalone multi-agent systems by up to 26.75 percent. An online threshold optimizer proves essential, boosting accuracy by 20.98 to 52.33 percent relative improvement over fixed policies and enabling elastic adaptation to real-world distributions.

Keywords: LLM cascades, multi-agent deliberation, cost-aware systems, confidence-based routing, escalation boundary, query difficulty scaling, science benchmarks, online threshold optimizer
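
The routing pattern in the abstract above (answer with a single model when it is confident, deliberate with a small agent ensemble when it is not, and escalate only if that also fails) can be sketched as follows. This is an illustrative simplification rather than the CascadeDebate implementation: deliberation is replaced by a plain majority vote, the threshold is fixed instead of optimized online, and the names `Model`, `Agent`, `deliberate`, and `cascade` are hypothetical. Each tier is assumed to have at least one agent.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

Model = Callable[[str], Tuple[str, float]]  # returns (answer, confidence in [0, 1])
Agent = Callable[[str], str]                # returns an answer only


def deliberate(query: str, agents: Sequence[Agent]) -> Tuple[str, float]:
    """Majority vote as a stand-in for deliberation; returns (answer, agreement)."""
    answers = [agent(query) for agent in agents]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def cascade(
    query: str,
    tiers: Sequence[Tuple[Model, Sequence[Agent]]],
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Walk the tiers from cheapest to costliest; return (answer, route taken)."""
    for level, (model, agents) in enumerate(tiers):
        answer, confidence = model(query)
        if confidence >= threshold:       # confident: answer without deliberation
            return answer, f"tier {level}: single model"
        voted, agreement = deliberate(query, agents)
        if agreement >= threshold:        # consensus reached inside this tier
            return voted, f"tier {level}: deliberation"
    return "", "human expert"             # final fallback, no automatic answer
```

Only uncertain queries pay for the ensemble, and only queries without consensus move to the next tier, which is the cost-saving behavior the abstract attributes to confidence-based routing.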

160. ❌ Coding-Free and Privacy-Preserving MCP Framework for Clinical Agentic Research Intelligence System

Authors: Taehun Kim, Hyeryun Park, Hyeonhoon Lee, Yushin Lee, Kyungsang Kim, Hyung-Chul Lee Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12258v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

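Scoring rationale: The paper develops CARIS, a clinical research intelligence system that integrates large language models (LLMs) with modular tools and automates the clinical research workflow through natural language-driven tool orchestration. The paper explicitly describes the use of LLMs, so 'Large Language Models' receives 10 points. The system adopts an agentic AI architecture and automates the workflow, so 'LLM Agents' and 'Tool Use' each receive 10 points. The work targets clinical research, which falls under AI for Science, so 'AI for Science' receives 10 points. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract, are unrelated to the paper, and receive 0 points.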

!!! tip deepseek-chat TL;DR

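The paper develops CARIS, an LLM-based Clinical Agentic Research Intelligence System that automates the clinical research workflow while preserving data privacy. It turns clinical hypotheses into executable research workflows and removes the need for coding and direct data access.

Abstract translation

Clinical research involves labor-intensive processes such as study design, cohort construction, model development, and documentation, which require domain expertise, programming skills, and access to sensitive patient data. These requirements create barriers for clinicians and external researchers who want to conduct data-driven studies. To overcome these limitations, we developed the Clinical Agentic Research Intelligence System (CARIS), which automates the clinical research workflow while preserving data privacy, so that researchers can carry out comprehensive studies without direct access to raw data. CARIS integrates Large Language Models (LLMs) with modular tools through the Model Context Protocol (MCP), enabling natural language-driven orchestration of the appropriate tools. Databases remain securely within the MCP server, and users access only the outputs and final research reports. Based on user intent, CARIS automatically executes the full pipeline: research planning, literature search, cohort construction, Institutional Review Board (IRB) documentation, Vibe Machine Learning (ML), and report generation, with iterative human-in-the-loop refinement. We evaluated CARIS on three heterogeneous datasets with distinct clinical tasks. Using evidence from literature and data, the system finalized research plans and IRB documents within three to four iterations. It supported Vibe ML by exploring feature-model combinations, ranking the top ten models, and generating performance visualizations. Assessed against a checklist derived from the TRIPOD+AI framework, the final reports showed high completeness: 96% coverage in LLM evaluation and 82% in human evaluation. CARIS demonstrates that agentic AI can transform clinical hypotheses into executable research workflows across heterogeneous datasets. By eliminating the need for coding and direct data access, the system lowers barriers to research and bridges public and private clinical data environments.

Abstract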

Clinical research involves labor-intensive processes such as study design, cohort construction, model development, and documentation, requiring domain expertise, programming skills, and access to sensitive patient data. These demands create barriers for clinicians and external researchers conducting data-driven studies. To overcome these limitations, we developed a Clinical Agentic Research Intelligence System (CARIS) that automates the clinical research workflow while preserving data privacy, enabling comprehensive studies without direct access to raw data. CARIS integrates Large Language Models (LLMs) with modular tools via the Model Context Protocol (MCP), enabling natural language-driven orchestration of appropriate tools. Databases remain securely within the MCP server, and users access only the outputs and final research reports. Based on user intent, CARIS automatically executes the full pipeline: research planning, literature search, cohort construction, Institutional Review Board (IRB) documentation, Vibe Machine Learning (ML), and report generation, with iterative human-in-the-loop refinement. We evaluated CARIS on three heterogeneous datasets with distinct clinical tasks. Research plans and IRB documents were finalized within three to four iterations, using evidence from literature and data. The system supported Vibe ML by exploring feature-model combinations, ranking the top ten models, and generating performance visualizations. Final reports showed high completeness based on a checklist derived from the TRIPOD+AI framework, achieving 96% coverage in LLM evaluation and 82% in human evaluation. CARIS demonstrates that agentic AI can transform clinical hypotheses into executable research workflows across heterogeneous datasets. By eliminating the need for coding and direct data access, the system lowers barriers and bridges public and private clinical data environments.

Keywords: Clinical Agentic Research Intelligence System, Large Language Models, Model Context Protocol, Clinical Research Workflow, Data Privacy, Automation, Agentic AI, Vibe Machine Learning

Authors: Taisei Hishiki, Takaya Arita, Reiji Suzuki Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12250v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

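Scoring rationale: The paper studies the collective behavior of LLM agents in a multi-agent system. It directly involves 'Large Language Models', 'LLM Agents', and 'Multi-agent Systems', and examines how 'Alignment' affects behavior. Other keywords, such as MoE, SLMs, training methods, inference techniques, and compression, are not covered in the study.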

!!! tip deepseek-chat TL;DR

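The study examines how the memory length of LLM agents affects their collective cooperative behavior in a multi-agent Prisoner's Dilemma. It finds that memory has opposite effects on cooperation for different LLMs (Gemini versus Gemma), owing to model-specific characteristics that may include alignment.

Abstract translation

This study examines how model-specific characteristics of Large Language Model (LLM) agents, including internal alignment, shape the effect of memory on their collective and cooperative dynamics in a multi-agent system. To this end, we extend the Social Particle Swarm (SPS) model, in which agents move in a two-dimensional space and play the Prisoner's Dilemma with neighboring agents, by replacing its rule-based agents with LLM agents that have Big Five personality scores and different memory lengths. Experiments with Gemini-2.0-Flash show that memory length is a key parameter governing collective behavior: even a very short memory markedly suppresses cooperation, and as memory length increases the system moves from stable cooperative clusters, through cyclical formation and collapse of clusters, to a state of scattered defection. Correlations between Big Five personality traits and agent behavior partly agree with findings from experiments with human participants, which supports the validity of the model. Comparative experiments with Gemma 3:4b reveal the opposite trend: longer memory promotes cooperation and is accompanied by the formation of dense cooperative clusters. Sentiment analysis of the agents' reasoning texts shows that Gemini interprets memory increasingly negatively as its length grows, whereas Gemma interprets it less negatively, and that this difference persists in the early phase of experiments before the macro-level dynamics converge. These results suggest that model-specific characteristics of LLMs, possibly including alignment, play a fundamental role in determining emergent social behavior in Generative Agent-Based Modeling, and they offer a micro-level cognitive account of the contradictions found in prior work on memory and cooperation.

Abstract

This study examines how model-specific characteristics of Large Language Model (LLM) agents, including internal alignment, shape the effect of memory on their collective and cooperative dynamics in a multi-agent system. To this end, we extend the Social Particle Swarm (SPS) model, in which agents move in a two-dimensional space and play the Prisoner’s Dilemma with neighboring agents, by replacing its rule-based agents with LLM agents endowed with Big Five personality scores and varying memory lengths. Using Gemini-2.0-Flash, we find that memory length is a critical parameter governing collective behavior: even a minimal memory drastically suppressed cooperation, transitioning the system from stable cooperative clusters through cyclical formation and collapse of clusters to a state of scattered defection as memory length increased. Big Five personality traits correlated with agent behaviors in partial agreement with findings from experiments with human participants, supporting the validity of the model. Comparative experiments using Gemma 3:4b revealed the opposite trend: longer memory promoted cooperation, accompanied by the formation of dense cooperative clusters. Sentiment analysis of agents’ reasoning texts showed that Gemini interprets memory increasingly negatively as its length grows, while Gemma interprets it less negatively, and that this difference persists in the early phase of experiments before the macro-level dynamics converge. These results suggest that model-specific characteristics of LLMs, potentially including alignment, play a fundamental role in determining emergent social behavior in Generative Agent-Based Modeling, and provide a micro-level cognitive account of the contradictions found in prior work on memory and cooperation.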

Keywords: Large Language Model agents, multi-agent system, collective behavior, cooperation, memory length, Social Particle Swarm, Prisoner’s Dilemma, alignment
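
The role that memory length plays in the setup above can be illustrated with a small rule-based sketch of a repeated Prisoner's Dilemma. It is not the paper's model: there is no spatial movement, no LLM, and no personality score. The `MemoryAgent` class, its majority rule, and the payoff values (the conventional 5/3/1/0 matrix) are assumptions chosen only to show how a bounded memory of the partner's past moves feeds into the next decision.

```python
from collections import deque
from typing import Deque, Dict, Tuple

# Conventional Prisoner's Dilemma payoffs (assumed, not taken from the paper):
# (my move, partner's move) -> (my payoff, partner's payoff)
PAYOFF: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


class MemoryAgent:
    """Rule-based stand-in for an LLM agent with a bounded memory."""

    def __init__(self, memory_length: int) -> None:
        # deque(maxlen=n) silently drops the oldest entry once n items are stored.
        self.memory: Deque[str] = deque(maxlen=memory_length)

    def act(self) -> str:
        """Defect only if most remembered partner moves were defections."""
        if not self.memory:
            return "C"
        defections = sum(1 for move in self.memory if move == "D")
        return "D" if 2 * defections > len(self.memory) else "C"

    def observe(self, partner_move: str) -> None:
        self.memory.append(partner_move)


def play(a: MemoryAgent, b: MemoryAgent, rounds: int) -> Tuple[int, int]:
    """Play repeated rounds and return the cumulative payoffs of both agents."""
    total_a = total_b = 0
    for _ in range(rounds):
        move_a, move_b = a.act(), b.act()
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        total_a, total_b = total_a + pay_a, total_b + pay_b
        a.observe(move_b)
        b.observe(move_a)
    return total_a, total_b
```

In the study the decision rule is an LLM's reading of its memory, which is why the same memory length can push different models in opposite directions; the fixed majority rule here cannot reproduce that effect.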

162. ❌ SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

Authors: Zhuofan Wen, Yang Feng Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12247v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

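Scoring rationale: The paper's core topic is speculative decoding, a key method for accelerating LLM inference, so it is highly relevant to 'Large Language Models' and 'Speculative Decoding OR Inference Acceleration' (10 points each). It proposes a new self-speculation framework that addresses the limitations of existing methods through layer-wise temperature annealing and adaptively bounded speculation length, an innovation in the underlying techniques of large models. Other keywords such as MoE, SFT, RAG, and quantization are neither mentioned in nor relevant to the abstract, so they score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the inefficiency that overconfident shallow layers and difficult tokens cause in self-speculative decoding for large language models. It proposes a framework with layer-wise temperature annealing and adaptively bounded speculation length, achieving up to 2.33x inference speedup while preserving output equivalence.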

Abstract

Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft models but face limitations: shallow layers often produce overconfident yet incorrect token predictions, and the presence of difficult tokens in a draft sequence forces redundant computation through deeper layers, undermining both draft acceptance and overall speedup. To address these issues, we propose a novel self-draft framework that suppresses spurious confidence via layer-wise temperature annealing in early-exit decision and adaptively bounds speculation length based on token-wise decoding difficulty. By reprocessing the hidden states of draft tokens in a unified parallel pass through deep layers, our method maintains exact output equivalence with the original model while maximizing computational efficiency. It requires no modifications to the base LLM parameters and achieves up to 2.33x wall-time speedup over standard autoregressive decoding across diverse long-form generation tasks and multiple model architectures.

Keywords: speculative decoding, large language models, inference acceleration, self-draft methods, layer-wise temperature annealing, adaptive bounded speculation, autoregressive inference, computational efficiency
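
The two ideas named in the abstract, tempering shallow-layer confidence before an early-exit decision and cutting the draft short at a difficult token, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the linear temperature schedule, the `threshold` and `hard_layer` parameters, and the function names are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def layer_temperature(layer, num_layers, t_shallow=2.0, t_deep=1.0):
    """Hypothetical linear schedule: high temperature at shallow layers, cooling to t_deep."""
    frac = layer / max(num_layers - 1, 1)
    return t_shallow + (t_deep - t_shallow) * frac

def early_exit_layer(per_layer_logits, threshold=0.9):
    """Return the first layer whose tempered confidence reaches the threshold."""
    n = len(per_layer_logits)
    for layer, logits in enumerate(per_layer_logits):
        confidence = max(softmax(logits, layer_temperature(layer, n)))
        if confidence >= threshold:
            return layer
    return n - 1  # no layer was confident enough: use the last one

def bounded_draft_length(exit_layers, max_len=8, hard_layer=2):
    """Count draft tokens until the first one that needed a deep layer (assumed 'difficult')."""
    length = 0
    for layer in exit_layers[:max_len]:
        if layer >= hard_layer:
            break
        length += 1
    return length

logits = [[4.0, 0.0, 0.0], [6.0, 0.0, 0.0], [8.0, 0.0, 0.0]]
print(early_exit_layer(logits))            # exit layer under the tempered schedule
print(bounded_draft_length([0, 1, 2, 0]))  # draft stops at the first hard token
```

In this toy input, layer 0 would clear the 0.9 threshold at temperature 1 (confidence of about 0.96), but the higher shallow-layer temperature lowers it to about 0.79, so the exit is deferred to layer 1.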

163. ❌ Continuous Knowledge Metabolism: Generating Scientific Hypotheses from Evolving Literature

Authors: Jinkai Tao, Yubo Wang, Xiaoyu Liu, Menglin Yang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12243v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

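Scoring rationale: The paper proposes the Continuous Knowledge Metabolism (CKM) framework, which uses LLMs to process scientific literature and generate scientific hypotheses, an application of large models to science. It explicitly uses LLMs for hypothesis generation and evaluation (e.g., LLM-judged novelty), so it is highly relevant to 'Large Language Models' (10 points). The work concerns scientific literature analysis and hypothesis generation, which maps directly onto 'AI for Science' (10 points). Other keywords such as MoE, SFT, and RAG concern specific techniques the paper does not mention, so they all score 0.

!!! tip deepseek-chat TL;DR

This study proposes the Continuous Knowledge Metabolism framework, which processes scientific literature incrementally and uses LLMs to generate scientific hypotheses. It finds that incremental processing outperforms batch processing and that hypothesis quality depends on how the literature is processed, revealing a quality-coverage trade-off.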

Abstract

Scientific hypothesis generation requires tracking how knowledge evolves, not just what is currently known. We introduce Continuous Knowledge Metabolism (CKM), a framework that processes scientific literature through sliding time windows and incrementally updates a structured knowledge base as new findings arrive. We present CKM-Lite, an efficient variant that achieves strong predictive coverage through incremental accumulation, outperforming batch processing on hit rate (+2.8%, p=0.006), hypothesis yield (+3.6, p<0.001), and best-match alignment (+0.43, p<0.001) while reducing token cost by 92%. To understand what drives these differences, we develop CKM-Full, an instrumented variant that categorizes each new finding as novel, confirming, or contradicting, detects knowledge change signals, and conditions hypothesis generation on the full evolution trajectory. Analyzing 892 hypotheses generated by CKM-Full across 50 research topics, alongside parallel runs of the other variants, we report four empirical observations: (1) incremental processing outperforms batch baseline across predictive and efficiency metrics; (2) change-aware instrumentation is associated with higher LLM-judged novelty (Cohen’s d=3.46) but lower predictive coverage, revealing a quality-coverage trade-off; (3) a field’s trajectory stability is associated with hypothesis success (r=-0.28, p=0.051), suggesting boundary conditions for literature-based prediction; (4) knowledge convergence signals are associated with nearly 5x higher hit rate than contradiction signals, pointing to differential predictability across change types. These findings suggest that the character of generated hypotheses is shaped not only by how much literature is processed, but also by how it is processed. They further indicate that evaluation frameworks must account for the quality-coverage trade-off rather than optimize for a single metric.

Keywords: Continuous Knowledge Metabolism, scientific hypothesis generation, evolving literature, incremental processing, knowledge change signals, LLM-judged novelty, quality-coverage trade-off, AI for science
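
The incremental update described in the abstract, where each new finding is labeled novel, confirming, or contradicting against an accumulating knowledge base, might look roughly like the sketch below. It is a simplification under stated assumptions: findings are reduced to hypothetical `(year, claim_key, stance)` tuples, windows are consecutive fixed-width year buckets rather than true sliding windows, and the classification is a plain lookup rather than the LLM-based judgment the paper presumably uses.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    claims: dict = field(default_factory=dict)   # claim key -> latest stance (True/False)
    signals: list = field(default_factory=list)  # (window index, claim key, label)

    def classify(self, key, stance):
        """Label a finding relative to what the knowledge base currently holds."""
        if key not in self.claims:
            return "novel"
        return "confirming" if self.claims[key] == stance else "contradicting"

    def update(self, window_index, findings):
        """Incrementally absorb one window of findings, recording change signals."""
        for key, stance in findings:
            self.signals.append((window_index, key, self.classify(key, stance)))
            self.claims[key] = stance

def year_windows(papers, width):
    """Group (year, key, stance) tuples into consecutive windows of `width` years."""
    if not papers:
        return []
    papers = sorted(papers, key=lambda p: p[0])
    start = papers[0][0]
    grouped = {}
    for year, key, stance in papers:
        grouped.setdefault((year - start) // width, []).append((key, stance))
    return sorted(grouped.items())

papers = [(2020, "A", True), (2021, "A", True), (2022, "A", False), (2022, "B", True)]
kb = KnowledgeBase()
for index, findings in year_windows(papers, width=1):
    kb.update(index, findings)
print([label for _, _, label in kb.signals])
```

The recorded signal sequence is what a change-aware variant would condition hypothesis generation on; the batch baseline would instead see all findings at once with no labels.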

164. ❌ MolMem: Memory-Augmented Agentic Reinforcement Learning for Sample-Efficient Molecular Optimization

Authors: Ziqing Wang, Yibo Wen, Abhishek Pandy, Han Liu, Kaize Ding Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12237v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

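Scoring rationale: MolMem focuses on molecular optimization in drug discovery and proposes an agentic framework that combines memory augmentation with reinforcement learning. The work belongs to AI for Science (AI4Science), specifically the application of bioinformatics/cheminformatics to drug discovery, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). However, the paper does not involve large language models (LLMs), model architectures (e.g., MoE, SLMs), training techniques (e.g., pre-training, fine-tuning, alignment, RLHF, PEFT), inference optimization (e.g., RAG, attention mechanisms, quantization), reasoning methods (e.g., chain of thought, System 2 thinking, MCTS), agent techniques (e.g., LLM agents, tool use, multi-agent systems), or other large-model topics, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the low sample efficiency of molecular optimization in drug discovery with MolMem, a memory-augmented agentic reinforcement learning framework. Using only 500 expensive oracle evaluations, it achieves success rates of 90% on single-property and 52% on multi-property optimization tasks.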

Abstract

In drug discovery, molecular optimization aims to iteratively refine a lead compound to improve molecular properties while preserving structural similarity to the original molecule. However, each oracle evaluation is expensive, making sample efficiency a key challenge for existing methods under a limited oracle budget. Trial-and-error approaches require many oracle calls, while methods that leverage external knowledge tend to reuse familiar templates and struggle on challenging objectives. A key missing piece is long-term memory that can ground decisions and provide reusable insights for future optimizations. To address this, we present MolMem (Molecular optimization with Memory), a multi-turn agentic reinforcement learning (RL) framework with a dual-memory system. Specifically, MolMem uses Static Exemplar Memory to retrieve relevant exemplars for cold-start grounding, and Evolving Skill Memory to distill successful trajectories into reusable strategies. Built on this memory-augmented formulation, we train the policy with dense step-wise rewards, turning costly rollouts into long-term knowledge that improves future optimization. Extensive experiments show that MolMem achieves 90% success on single-property tasks (1.5× over the best baseline) and 52% on multi-property tasks using only 500 oracle calls. Our code is available at https://github.com/REAL-Lab-NU/MolMem.

Keywords: Molecular Optimization, Drug Discovery, Reinforcement Learning, Memory-Augmented Agent, Sample Efficiency, Multi-turn Agentic RL, Static Exemplar Memory, Evolving Skill Memory
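
To make the dual-memory idea concrete, the following is a minimal sketch of one read-only exemplar store and one store that grows from successful trajectories. It is not the MolMem implementation (see the authors' repository for that): the class interfaces, the Jaccard similarity over hypothetical substructure sets, and the success-count ranking of skills are all assumptions chosen to keep the example self-contained.

```python
def jaccard(a, b):
    """Set overlap, used here as a stand-in for a molecular similarity measure."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

class StaticExemplarMemory:
    """Fixed store of (features, suggested edit) pairs for cold-start retrieval."""

    def __init__(self, exemplars):
        self.exemplars = list(exemplars)

    def retrieve(self, features, k=1):
        ranked = sorted(self.exemplars, key=lambda e: jaccard(features, e[0]), reverse=True)
        return ranked[:k]

class EvolvingSkillMemory:
    """Store that grows as successful optimization trajectories are distilled."""

    def __init__(self):
        self.skills = {}  # strategy description -> number of successful uses

    def distill(self, trajectory, succeeded):
        if succeeded:  # only successful trajectories contribute reusable strategies
            for step in trajectory:
                self.skills[step] = self.skills.get(step, 0) + 1

    def top(self, k=1):
        return sorted(self.skills, key=lambda s: (-self.skills[s], s))[:k]

static = StaticExemplarMemory([
    ({"ring", "OH"}, "add F"),
    ({"chain", "NH2"}, "add methyl"),
])
skills = EvolvingSkillMemory()
skills.distill(["add F", "add OH"], succeeded=True)
skills.distill(["add F"], succeeded=True)
skills.distill(["add Cl"], succeeded=False)

print(static.retrieve({"ring", "OH", "Cl"}))  # closest exemplar for a new molecule
print(skills.top(1))                          # most frequently successful strategy
```

A policy would consult both stores when proposing the next edit; how the real system encodes molecules and distills strategies is not specified in the abstract.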

165. ❌ Thought-Retriever: Don’t Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems

Authors: Tao Feng, Pengrui Han, Guanyu Lin, Ge Liu, Jiaxuan You Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12231v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

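Scoring rationale: The paper studies how LLM agent systems can make effective use of external knowledge and proposes the Thought-Retriever algorithm to overcome the context-length limit of conventional RAG. Highly relevant keywords are: LLMs (the paper's foundation), RAG (the method directly improved upon), Context Window Extension (the core limitation addressed), Chain of Thought (use of intermediate thoughts), Self-Improvement (self-evolution of the system), and LLM Agents (the application setting). System 2 Thinking and In-context Learning are partially relevant because the work touches on in-depth reasoning and in-context learning mechanisms. Other keywords such as MoE, quantization, and alignment are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes Thought-Retriever, an algorithm that retrieves and organizes an LLM's intermediate thoughts rather than raw data. This overcomes the context-length limit of conventional retrieval-augmented generation, lets LLM agents make effective use of ultra-long external knowledge and self-evolve, and significantly outperforms existing methods on several benchmarks.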

Abstract

Large language models (LLMs) have transformed AI research thanks to their powerful internal capabilities and knowledge. However, existing LLMs still fail to effectively incorporate the massive external knowledge when interacting with the world. Although retrieval-augmented LLMs are proposed to mitigate the issue, they are still fundamentally constrained by the context length of LLMs, as they can only retrieve top-K raw data chunks from the external knowledge base which often consists of millions of data chunks. Here we propose Thought-Retriever, a novel model-agnostic algorithm that helps LLMs generate output conditioned on arbitrarily long external data, without being constrained by the context length or number of retrieved data chunks. Our key insight is to let an LLM fully leverage its intermediate responses generated when solving past user queries (thoughts), filtering meaningless and redundant thoughts, organizing them in thought memory, and retrieving the relevant thoughts when addressing new queries. This effectively equips LLM-based agents with a self-evolving long-term memory that grows more capable through continuous interaction. Besides algorithmic innovation, we further meticulously prepare a novel benchmark, AcademicEval, which requires an LLM to faithfully leverage ultra-long context to answer queries based on real-world academic papers. Extensive experiments on AcademicEval and two other public datasets validate that Thought-Retriever remarkably outperforms state-of-the-art baselines, achieving an average increase of at least 7.6% in F1 score and 16% in win rate across various tasks. More importantly, we further demonstrate two exciting findings: (1) Thought-Retriever can indeed help LLM self-evolve after solving more user queries; (2) Thought-Retriever learns to leverage deeper thoughts to answer more abstract user queries.

Keywords: Thought-Retriever, retrieval-augmented LLMs, context length limitation, LLM agents, self-evolving memory, intermediate thoughts, AcademicEval benchmark, long-term memory
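
The filter, store, and retrieve cycle that the abstract describes for a thought memory can be sketched as follows. This is an illustrative toy, not the authors' algorithm: the bag-of-words cosine similarity, the `redundancy` and `min_words` thresholds, and the `ThoughtMemory` class are hypothetical stand-ins for whatever embedding model and filtering criteria the paper actually uses.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two strings (a crude embedding stand-in)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

class ThoughtMemory:
    def __init__(self, redundancy=0.9, min_words=3):
        self.thoughts = []
        self.redundancy = redundancy  # similarity at or above this counts as redundant
        self.min_words = min_words    # shorter thoughts are treated as meaningless

    def add(self, thought):
        """Store a thought unless it is too short or near-duplicates an existing one."""
        if len(thought.split()) < self.min_words:
            return False
        if any(cosine(thought, t) >= self.redundancy for t in self.thoughts):
            return False
        self.thoughts.append(thought)
        return True

    def retrieve(self, query, k=2):
        """Return the k stored thoughts most similar to the query."""
        return sorted(self.thoughts, key=lambda t: cosine(query, t), reverse=True)[:k]

memory = ThoughtMemory()
memory.add("ok")                                                     # rejected: too short
memory.add("speculative decoding speeds up inference")               # stored
memory.add("speculative decoding speeds up inference")               # rejected: redundant
memory.add("retrieval augments generation with external documents")  # stored
print(memory.retrieve("how does speculative decoding work", k=1))
```

Because stored thoughts are themselves condensed from earlier answers, retrieving them lets a fixed context window indirectly cover more source material than retrieving raw chunks would.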

166. ❌ HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models

Authors: Jawad Hossain, Xiangyu Guo, Jiawei Zhou, Chong Liu Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12229v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

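Scoring rationale: The paper's core topic is improving the mathematical reasoning of small language models (SLMs), using a hint-assisted framework for multi-step reasoning. It is highly relevant to 'Small Language Models' (10 points) and involves 'Chain of Thought' reasoning (10 points) and 'System 2 Thinking' in-depth reasoning (8 points). The paper uses a cooperative two-model system, which is partially relevant to 'LLM Agents' and 'Multi-agent Systems' (5 points each). The hint-generating model is trained by distillation from a large model, which relates to 'Large Language Models' (5 points). The hint mechanism helps the model recover from errors, which relates to 'Self-Correction' (5 points). Other keywords are unrelated to or not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes a hint-assisted reasoning framework in which a small language model receives context-aware hints while solving math problems, significantly improving its accuracy on multi-step mathematical reasoning tasks.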

Abstract

Small language models (SLMs) often struggle with complex mathematical reasoning due to limited capacity to maintain long chains of intermediate steps and to recover from early errors. We address this challenge by introducing a hint-assisted reasoning framework that incrementally guides SLMs through multi-step mathematical problem solving. Our approach decomposes solutions into sequential reasoning steps and provides context-aware hints, where hints are generated by a separate SLM trained via distillation from a strong large language model. While the hint-generating SLM alone is not capable of solving the problems, its collaboration with a reasoning SLM enables effective guidance, forming a cooperative two-model system for reasoning. Each hint is generated conditionally on the problem statement and the accumulated reasoning history, providing stepwise, localized guidance without revealing full solutions. This reduces error propagation and allows the reasoning model to focus on manageable subproblems. Experiments across diverse mathematical benchmarks and models demonstrate that hint assistance consistently improves reasoning accuracy for SLMs, yielding substantial gains over standard prompting while preserving model efficiency. These results highlight that structured collaboration between SLMs, via hint generation and reasoning, offers an effective and lightweight mechanism for enhancing mathematical reasoning.

Keywords: Small Language Models, Mathematical Reasoning, Hint-assisted Reasoning, Multi-step Problem Solving, Model Collaboration, Reasoning Accuracy, Error Propagation, Lightweight Mechanism
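
The two-model interaction in the abstract, where each hint is conditioned on the problem and the reasoning history and the reasoner then produces one step, reduces to a simple loop. The sketch below shows only that control flow under stated assumptions: `hinter` and `reasoner` are hypothetical callables standing in for the two SLMs, the `ANSWER:` stop marker is invented for the example, and the toy functions are scripted rather than learned.

```python
def solve_with_hints(problem, hinter, reasoner, max_steps=8):
    """Alternate hint generation and one reasoning step until an answer appears."""
    history = []
    for _ in range(max_steps):
        hint = hinter(problem, history)          # conditioned on problem + history
        step = reasoner(problem, history, hint)  # one localized reasoning step
        history.append(step)
        if step.startswith("ANSWER:"):           # assumed stop marker
            break
    return history

# Scripted stand-ins for the hint-generating and reasoning models.
HINTS = ["Isolate the term with x.", "Divide both sides by the coefficient."]
STEPS = {
    "Isolate the term with x.": "2x = 6",
    "Divide both sides by the coefficient.": "x = 3",
    "State the final answer.": "ANSWER: 3",
}

def toy_hinter(problem, history):
    return HINTS[len(history)] if len(history) < len(HINTS) else "State the final answer."

def toy_reasoner(problem, history, hint):
    return STEPS[hint]

print(solve_with_hints("2x + 1 = 7", toy_hinter, toy_reasoner))
```

Note that the hinter never emits the solution itself; it only narrows the next subproblem, which matches the abstract's description of guidance that does not reveal full solutions.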

167. ❌ Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams

Authors: Xiuxiu Tang, G. Alex Ambrose, Ying Cheng Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12227v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

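Scoring rationale: The paper studies the application of LLMs to STEM educational assessment (AI-assisted scoring using GPT-4o). It is highly relevant to 'Large Language Models' (10 points) and counts as an application of 'AI for Science' in education (8 points). The other keywords concern model architectures, training methods, inference techniques, and the like; the paper makes no technical contribution in these areas, so they all score 0.

!!! tip deepseek-chat TL;DR

This study examines the reliability of AI-assisted scoring of constructed responses on physics exams using GPT-4o. It finds that reliable scoring depends mainly on clear, well-structured rubrics, while prompting format and temperature settings have comparatively limited influence.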

Abstract

Student responses in STEM assessments are often handwritten and combine symbolic expressions, calculations, and diagrams, creating substantial variation in format and interpretation. Despite their importance for evaluating students’ reasoning, such responses are time-consuming to score and prone to rater inconsistency, particularly when partial credit is required. Recent advances in large language models (LLMs) have increased attention to AI-assisted scoring, yet evidence remains limited regarding how rubric design and LLM configurations influence reliability across performance levels. This study examined the reliability of AI-assisted scoring of undergraduate physics constructed responses using GPT-4o. Twenty authentic handwritten exam responses were scored across two rounds by four instructors and by the AI model using skill-based rubrics with differing levels of analytic granularity. Prompting format and temperature settings were systematically varied. Overall, human-AI agreement on total scores was comparable to human inter-rater reliability and was highest for high- and low-performing responses, but declined for mid-level responses involving partial or ambiguous reasoning. Criterion-level analyses showed stronger alignment for clearly defined conceptual skills than for extended procedural judgments. A more fine-grained, checklist-based rubric improved consistency relative to holistic scoring. These findings indicate that reliable AI-assisted scoring depends primarily on clear, well-structured rubrics, while prompting format plays a secondary role and temperature has relatively limited impact. More broadly, the study provides transferable design recommendations for implementing reliable LLM-assisted scoring in STEM contexts through skill-based rubrics and controlled LLM settings.

Keywords: LLM-assisted scoring, constructed responses, physics exams, rubric design, reliability, GPT-4o, STEM assessment, human-AI agreement
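
A checklist-style rubric and a criterion-level comparison of human and AI judgments, as discussed in the abstract, can be expressed in a few lines. The sketch below is only illustrative: the criterion names, point values, and sample judgments are invented, and simple exact-agreement rate is used because the abstract does not say which reliability statistic the study reports.

```python
def checklist_total(checks, points):
    """Total score for one response: sum the points of every criterion marked as met."""
    return sum(points[criterion] for criterion, met in checks.items() if met)

def criterion_agreement(human, ai):
    """Per-criterion fraction of responses on which the human and AI marks coincide."""
    criteria = human[0].keys()
    return {
        c: sum(h[c] == a[c] for h, a in zip(human, ai)) / len(human)
        for c in criteria
    }

# Hypothetical rubric and judgments for two responses.
points = {"free_body_diagram": 2, "newton_second_law": 3, "algebra": 1}
human = [
    {"free_body_diagram": True, "newton_second_law": True, "algebra": False},
    {"free_body_diagram": False, "newton_second_law": True, "algebra": True},
]
ai = [
    {"free_body_diagram": True, "newton_second_law": True, "algebra": True},
    {"free_body_diagram": False, "newton_second_law": True, "algebra": True},
]

print(checklist_total(human[0], points))
print(criterion_agreement(human, ai))
```

Breaking the total into binary criteria makes disagreement traceable to a specific skill, which is one plausible reason a fine-grained rubric would score more consistently than a holistic one.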

168. ❌ LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines

Authors: Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12223v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

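Scoring rationale: The paper uses knowledge from an LLM (e.g., BERT) to guide a symbolic model (the Tsetlin Machine) toward interpretable text classification, at the intersection of LLM applications and explainable AI. It is highly relevant to "Large Language Models" (10 points), because the LLM is the source of the transferred knowledge; highly relevant to "Mechanistic Interpretability" (10 points), because the focus is on improving the transparency and interpretability of a symbolic model; and partially relevant to "Pre-training" (5 points), because it mentions knowledge transfer from pretrained language models. Other keywords such as MoE, SLMs, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes an LLM-guided semantic bootstrapping framework that transfers LLM knowledge into the symbolic Tsetlin Machine, aiming for both high accuracy and full interpretability in text classification.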

Abstract

Pretrained language models (PLMs) like BERT provide strong semantic representations but are costly and opaque, while symbolic models such as the Tsetlin Machine (TM) offer transparency but lack semantic generalization. We propose a semantic bootstrapping framework that transfers LLM knowledge into symbolic form, combining interpretability with semantic capacity. Given a class label, an LLM generates sub-intents that guide synthetic data creation through a three-stage curriculum (seed, core, enriched), expanding semantic diversity. A Non-Negated TM (NTM) learns from these examples to extract high-confidence literals as interpretable semantic cues. Injecting these cues into real data enables a TM to align clause logic with LLM-inferred semantics. Our method requires no embeddings or runtime LLM calls, yet equips symbolic models with pretrained semantic priors. Across multiple text classification tasks, it improves interpretability and accuracy over vanilla TM, achieving performance comparable to BERT while remaining fully symbolic and efficient.

Keywords: LLM-guided, semantic bootstrapping, interpretable text classification, Tsetlin Machine, symbolic models, knowledge transfer, pretrained language models, transparency
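
One way to picture "injecting cues into real data" is as a preprocessing step that adds marker tokens before the text is binarized for a Tsetlin Machine. The sketch below is a guess at that shape, not the paper's procedure: the cue groups, the `__cue_` marker format, and both function names are hypothetical, and no Tsetlin Machine library is involved.

```python
def inject_cues(tokens, cues):
    """Append one marker token per cue group that has a literal present in the document."""
    present = set(tokens)
    markers = [f"__cue_{name}" for name, literals in sorted(cues.items()) if present & literals]
    return tokens + markers

def binarize(tokens, vocabulary):
    """Boolean bag-of-words vector, the usual input form for a Tsetlin Machine."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical cue groups, as might be extracted from LLM-generated synthetic examples.
cues = {
    "refund": {"refund", "reimburse"},
    "shipping": {"delivery", "courier"},
}
vocabulary = ["refund", "reimburse", "__cue_refund", "__cue_shipping"]

augmented = inject_cues(["please", "reimburse", "me"], cues)
print(augmented)
print(binarize(augmented, vocabulary))
```

Because the markers are ordinary Boolean features, clauses learned over them remain readable, and no embeddings or LLM calls are needed at inference time, consistent with the abstract's claim.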

169. ❌ TimeMark: A Trustworthy Time Watermarking Framework for Exact Generation-Time Recovery from AIGC

Authors: Shangkun Che, Silin Du, Ge Gao Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12216v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究LLM生成文本的水印技术，与“Large Language Models OR LLMs OR Foundation Models”高度相关（10分），因为论文明确提到LLMs在文本生成中的应用，并针对LLM生成内容提出水印框架。其他关键词涉及模型架构、训练方法、推理优化、应用领域等，论文未涉及这些具体技术或应用，因此均为0分。

!!! tip deepseek-chat TL;DR

该论文针对LLM生成文本的知识产权争议问题，提出了一个可信的时间水印框架TimeMark，通过密码学技术和两阶段编码机制实现了100%准确率的生成时间恢复，为AIGC司法证据提供了实用解决方案。

摘要翻译

大型语言模型在文本生成中的广泛应用引发了日益增长的知识产权争议担忧。水印技术通过将元信息嵌入人工智能生成内容，具备作为司法证据的潜力。然而，现有方法依赖词汇分布中的统计信号，导致检测过程具有固有概率性且可靠性降低，在多比特编码场景下尤为明显。此外，这类方法会引入可检测的统计模式，使其易受伪造攻击，并允许模型提供商任意编造水印。为解决这些问题，我们提出可信水印的概念，在实现100%识别准确率可靠恢复的同时，能够抵御用户侧统计攻击和提供商侧伪造攻击。我们聚焦于可作为司法证据的可信时间水印技术。该框架融合密码学技术，在监管监督下将时间信息编码至时间相关的密钥中，从而防止任意时间戳伪造。水印载荷与时间解耦，并为每个实例生成随机、非存储的比特序列，以此消除统计模式。为确保可验证性，我们设计了两阶段编码机制，结合纠错码技术，能够以理论完美的准确度可靠恢复生成时间。理论分析与实验结果表明，该框架满足司法证据的可靠性要求，为未来人工智能生成内容相关的知识产权争议提供了实用解决方案。

摘要 (Abstract)

The widespread use of Large Language Models (LLMs) in text generation has raised increasing concerns about intellectual property disputes. Watermarking techniques, which embed meta information into AI-generated content (AIGC), have the potential to serve as judicial evidence. However, existing methods rely on statistical signals in token distributions, leading to inherently probabilistic detection and reduced reliability, especially in multi-bit encoding (e.g., timestamps). Moreover, such methods introduce detectable statistical patterns, making them vulnerable to forgery attacks and enabling model providers to fabricate arbitrary watermarks. To address these issues, we propose the concept of trustworthy watermark, which achieves reliable recovery with 100% identification accuracy while resisting both user-side statistical attacks and provider-side forgery. We focus on trustworthy time watermarking for use as judicial evidence. Our framework integrates cryptographic techniques and encodes time information into time-dependent secret keys under regulatory supervision, preventing arbitrary timestamp fabrication. The watermark payload is decoupled from time and generated as a random, non-stored bit sequence for each instance, eliminating statistical patterns. To ensure verifiability, we design a two-stage encoding mechanism, which, combined with error-correcting codes, enables reliable recovery of generation time with theoretically perfect accuracy. Both theoretical analysis and experiments demonstrate that our framework satisfies the reliability requirements for judicial evidence and offers a practical solution for future AIGC-related intellectual property disputes.
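The ingredients named in the abstract (time-dependent secret keys, a random per-instance payload, and error-correcting codes) can be illustrated with a small standard-library sketch. Everything below is an assumption made for illustration: HMAC-SHA256 as the key-derivation function, a repetition code as the error-correcting code, and the `recover_time` search over candidate time slots. It is not the paper's two-stage encoding mechanism, which the abstract does not detail.

```python
import hashlib
import hmac
import secrets
from typing import List, Optional


def time_key(master_key: bytes, time_slot: str) -> bytes:
    """Derive a per-time-slot key (HMAC-SHA256 is an assumed choice)."""
    return hmac.new(master_key, time_slot.encode("utf-8"), hashlib.sha256).digest()


def random_payload(n_bits: int) -> List[int]:
    """Fresh random bits per instance, so the payload itself carries no time pattern."""
    return [secrets.randbits(1) for _ in range(n_bits)]


def ecc_encode(bits: List[int], repeat: int = 5) -> List[int]:
    """Repetition code: each bit is sent `repeat` times."""
    return [b for b in bits for _ in range(repeat)]


def ecc_decode(coded: List[int], repeat: int = 5) -> List[int]:
    """Majority vote per block corrects up to (repeat - 1) // 2 flips per bit."""
    return [int(sum(coded[i:i + repeat]) * 2 > repeat)
            for i in range(0, len(coded), repeat)]


def tag(key: bytes, payload: List[int]) -> bytes:
    """Bind the payload to a time-dependent key."""
    return hmac.new(key, bytes(payload), hashlib.sha256).digest()


def recover_time(master_key: bytes, payload: List[int], observed_tag: bytes,
                 candidates: List[str]) -> Optional[str]:
    """Return the time slot whose derived key verifies the tag, if any."""
    for slot in candidates:
        if hmac.compare_digest(tag(time_key(master_key, slot), payload), observed_tag):
            return slot
    return None
```

In this toy version, only the holder of `master_key` (standing in for the supervising regulator) can derive the key for a given slot, so a provider cannot mint a tag for an arbitrary timestamp without it.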

关键词: Large Language Models, Watermarking, AI-generated Content, Intellectual Property, Time Watermarking, Cryptographic Techniques, Judicial Evidence, Reliable Recovery

170. ❌ Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering

作者: Weikang Zhang, Zimo Zhu, Zhichuan Yang, Chen Huang, Wenqiang Lei, See-Kiong Ng 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12210v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文提出StsPatient方法，使用大语言模型（LLMs）模拟认知障碍患者，属于大模型在医疗领域的创新应用。核心方法涉及从指令-响应对中提取steering vectors进行微调（与Post-training/SFT相关），并通过指令对比实现领域特定特征提取（与Instruction Tuning相关）。研究属于AI在生物医学/临床领域的应用（AI for Science）。其他关键词如MoE、量化、推理加速等未涉及。

!!! tip deepseek-chat TL;DR

该研究解决了现有方法无法精细模拟认知障碍患者异质性的问题，通过提取steering vectors和随机token调制机制，显著提升了临床真实性和严重程度可控性。

摘要翻译

模拟认知障碍标准化患者为临床培训提供了一种可扩展且符合伦理的解决方案。然而，现有方法依赖于离散的提示工程，未能捕捉不同认知领域和严重程度下缺陷的异质性。为解决这一局限，我们提出StsPatient用于对认知障碍患者进行细粒度模拟。我们创新性地通过从指令与回应的对比对中提取导向向量来捕获特定领域特征。此外，我们引入随机令牌调制（Stochastic Token Modulation，STM）机制来调节干预概率。STM能够在缓解传统向量方法不稳定性的同时，实现对障碍严重程度的精确控制。综合实验表明，StsPatient在临床真实性与严重程度可控性方面均显著优于基线方法。

摘要 (Abstract)

Simulating Standardized Patients with cognitive impairment offers a scalable and ethical solution for clinical training. However, existing methods rely on discrete prompt engineering and fail to capture the heterogeneity of deficits across varying domains and severity levels. To address this limitation, we propose StsPatient for the fine-grained simulation of cognitively impaired patients. We innovatively capture domain-specific features by extracting steering vectors from contrastive pairs of instructions and responses. Furthermore, we introduce a Stochastic Token Modulation (STM) mechanism to regulate the intervention probability. STM enables precise control over impairment severity while mitigating the instability of conventional vector methods. Comprehensive experiments demonstrate that StsPatient significantly outperforms baselines in both clinical authenticity and severity controllability.
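The two mechanisms described above can be sketched in a few lines. The sketch assumes a common construction for steering vectors (the difference between mean activations of the two sides of the contrastive pairs) and models Stochastic Token Modulation as applying the vector to each token's hidden state with probability `p`. The function names, the `alpha` scale, and the plain-list representation of activations are illustrative assumptions, not the authors' implementation.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(impaired_acts: Sequence[Vector],
                    healthy_acts: Sequence[Vector]) -> Vector:
    """Mean-difference direction between the two sides of the contrastive pairs."""
    return [a - b for a, b in zip(mean(impaired_acts), mean(healthy_acts))]


def stochastic_steer(hidden_states: Sequence[Vector], steer: Vector, p: float,
                     alpha: float = 1.0,
                     rng: Optional[random.Random] = None) -> List[Vector]:
    """Add the steering vector to each token's state with probability p."""
    rng = rng or random.Random()
    out = []
    for h in hidden_states:
        if rng.random() < p:  # intervene on this token
            out.append([x + alpha * s for x, s in zip(h, steer)])
        else:                 # leave this token untouched
            out.append(list(h))
    return out
```

Under this reading, `p` acts as a severity dial: `p = 0` leaves the model unchanged, `p = 1` steers every token, and intermediate values intervene on a random fraction of tokens rather than scaling one large perturbation.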

关键词: Standardized Patients, Cognitive Impairment, Steering Vectors, Stochastic Token Modulation, Clinical Training, Fine-grained Simulation, Domain-specific Features, Severity Controllability

171. ❌ Representing expertise accelerates learning from pedagogical interaction data

作者: Dhara Yu, Karthikeya Kaushik, Bill D. Thompson 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12195v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是transformer模型在认知科学任务（空间导航）中的学习效果，关注的是教学互动数据与专家演示数据的对比。虽然使用了transformer模型，但论文的核心焦点是认知科学和教学互动机制，而非大模型技术本身。所有关键词都涉及大模型技术原理、训练方法、优化技术或特定应用领域，而该论文并未深入探讨这些技术方面，也未涉及大模型在科学领域的创新应用。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文研究了transformer模型在空间导航任务中，通过教学互动数据训练相比仅使用专家演示数据训练，能够获得更鲁棒的性能表现，并发现模型能够通过表示不同认知状态的智能体来模仿专家行为。

摘要翻译

认知科学与人工智能领域的研究表明，让学习智能体接触多个个体之间的互动轨迹能够提升其在多种情境下的表现，然而目前尚不清楚互动的哪些特征促成了这种提升。我们通过一种受控范式研究了支持互动数据有效性的因素，该范式使我们能够精确地操作互动与专家单独行动之间的关键区别。我们在空间导航任务中生成了专家与新手之间简单互动的合成数据集，随后基于这些数据集训练了Transformer模型，并评估了接触不同数据集后的性能表现。实验表明，与仅基于专家示范数据训练的模型相比，基于教学式互动训练的模型在各种情境下表现出更强的鲁棒性；同时，即使很少观察到专家行为，具备表征认知上不同智能体的能力也能使模型产生类专家的行为。

摘要 (Abstract)

Work in cognitive science and artificial intelligence has suggested that exposing learning agents to traces of interaction between multiple individuals can improve performance in a variety of settings, yet it remains unknown which features of interactions contribute to this improvement. We examined the factors that support the effectiveness of interaction data, using a controlled paradigm that allowed us to precisely operationalize key distinctions between interaction and an expert acting alone. We generated synthetic datasets of simple interactions between an expert and a novice in a spatial navigation task, and then trained transformer models on those datasets, evaluating performance after exposure to different datasets. Our experiments showed that models trained on pedagogical interactions were more robust across a variety of scenarios compared to models trained only on expert demonstrations, and that having the ability to represent epistemically distinct agents led to expert-like behavior even when expert behavior was rarely observed.

关键词: pedagogical interaction, transformer models, expert-novice interaction, spatial navigation, robust learning, epistemic agents, synthetic datasets, cognitive science

172. ❌ AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

作者: Manoj Madushanka Perera, Adnan Mahmood, Kasun Eranda Wijethilake, Quan Z. Sheng 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12179v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM的对话记忆能力评估与微调，直接涉及LLM、LLM Agents和Supervised Fine-tuning等关键词。论文提出基于LLM Agent的对话生成框架，用于创建评估数据集并微调LLM，因此与LLM Agents高度相关（10分），与Multi-agent Systems有一定关联（5分）。论文提到LLM处理扩展对话上下文的能力，与Long Context LLMs有一定关联（5分）。其他关键词如MoE、SLMs、Scaling Laws等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对LLM在对话中短期和长期记忆能力难以评估和微调的问题，提出了一个基于LLM Agent的对话生成框架AgenticAI-DialogGen，并创建了TopicGuidedChat数据集，实验表明使用该数据集微调的LLM在记忆相关问答任务上表现更好。

摘要翻译

近期，大语言模型在处理长程对话上下文方面取得了显著进展，但由于缺乏同时编码短期与长期对话历史的数据集，其记忆能力的微调与评估仍面临挑战。现有对话数据集普遍存在记忆锚定缺失、话题连续性忽视或依赖高成本人工标注等问题。为弥补这些不足，我们提出了AgenticAI-DialogGen——一种基于模块化智能体的生成框架，能够在无需人工监督的情况下生成以人物角色为锚点、话题为导向的对话。该框架利用大语言模型智能体从非结构化对话中提取知识图谱、识别话题、构建说话者角色，并模拟话题引导的对话过程。通过集成问答模块，系统可基于短期与长期对话历史生成记忆锚定的问答对。基于此框架，我们构建了名为TopicGuidedChat（TGC）的新数据集，其中长期记忆被编码为说话者特定的知识图谱，短期记忆则体现为新生成的话题引导对话。评估结果表明，AgenticAI-DialogGen生成的对话质量更高，且基于TGC数据集微调的大语言模型在记忆锚定问答任务中性能有所提升。

摘要 (Abstract)

Recent advancements in Large Language Models (LLMs) have improved their ability to process extended conversational contexts, yet fine-tuning and evaluating short- and long-term memories remain difficult due to the absence of datasets that encode both short- and long-term conversational history. Existing conversational datasets lack memory grounding, overlook topic continuity, or rely on costly human annotation. To address these gaps, we introduce AgenticAI-DialogGen, a modular agent-based framework that generates persona-grounded and topic-guided conversations without human supervision. The framework uses LLM agents to extract knowledge graphs, identify topics, build speaker personas, and simulate topic-guided conversations from unstructured conversations. A QA module generates memory-grounded Question Answer (QA) pairs drawn from short- and long-term conversational histories. We also generated a new dataset entitled, TopicGuidedChat (TGC), where long-term memory is encoded as speaker-specific knowledge graphs and short-term memory as newly generated topic-guided conversations. Evaluations depict that AgenticAI-DialogGen yields higher conversational quality and LLMs fine-tuned on TGC dataset achieve improved performance on memory-grounded QA tasks.

关键词: Large Language Models, LLM Agents, Conversation Generation, Memory Evaluation, Fine-tuning, Topic-Guided Conversations, Dataset Generation, Question Answering

173. ❌ Policy-Invisible Violations in LLM-Based Agents

作者: Jie Wu, Ming Gong 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12177v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM-based agents在执行任务时因缺乏完整上下文信息而违反组织政策的“policy-invisible violations”问题，并提出了PhantomPolicy基准和Sentinel执行框架。因此，与“Large Language Models OR LLMs OR Foundation Models”、“LLM Agents OR Autonomous Agents OR Agentic Workflow”和“Tool Use OR Function Calling OR API Tool Use”高度相关（10分），因为这些是论文研究的核心技术和应用场景。其他关键词如MoE、Scaling Laws、Pre-training、Alignment、RAG、Quantization等，论文未涉及，均为0分。

!!! tip deepseek-chat TL;DR

该论文研究了LLM-based agents因决策时缺乏完整上下文信息而违反组织政策的“policy-invisible violations”问题，提出了PhantomPolicy基准和Sentinel执行框架，后者通过反事实图模拟显著提升了策略执行准确性。

摘要翻译

基于大语言模型（LLM）的智能体能够执行语法有效、用户授权且语义恰当的操作，但由于正确策略判断所需的事实信息在决策时被隐藏，这些操作仍可能违反组织策略。我们将这种失效模式称为策略不可见违规：即合规性取决于智能体可见上下文之外的实体属性、情境状态或会话历史记录。本文提出PhantomPolicy基准测试集，涵盖八类违规场景，并平衡了违规案例与安全对照案例，其中所有工具响应均包含不含策略元数据的纯净业务数据。我们对五个前沿模型生成的共计600条模型轨迹进行了人工审查，并依据人工审核后的轨迹标签进行评估。人工审查修改了32个标签（占5.3%），相较于原始案例级标注的差异证实了轨迹级人工审核的必要性。为展示在理想条件下基于世界状态锚定的策略执行能力，我们提出Sentinel——一种基于反事实图模拟的执行框架。Sentinel将每个智能体动作视为对组织知识图谱的拟变更操作，通过推测性执行来具象化动作后的世界状态，并验证图谱结构不变性以作出“允许/阻止/澄清”的决策。以人工审核的轨迹标签为基准，Sentinel显著优于仅基于内容的数据防泄漏（DLP）基线方法（准确率：基线68.8%，Sentinel 93.0%），同时保持高精确度，尽管在某些违规类别上仍有改进空间。这些结果表明，当策略相关的世界状态能够被提供给执行层时，将实现更高水平的策略管控能力。

摘要 (Abstract)

LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent’s visible context. We present PhantomPolicy, a benchmark spanning eight violation categories with balanced violation and safe-control cases, in which all tool responses contain clean business data without policy metadata. We manually review all 600 model traces produced by five frontier models and evaluate them using human-reviewed trace labels. Manual review changes 32 labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review. To demonstrate what world-state-grounded enforcement can achieve under favorable conditions, we introduce Sentinel, an enforcement framework based on counterfactual graph simulation. Sentinel treats every agent action as a proposed mutation to an organizational knowledge graph, performs speculative execution to materialize the post-action world state, and verifies graph-structural invariants to decide Allow/Block/Clarify. Against human-reviewed trace labels, Sentinel substantially outperforms a content-only DLP baseline (68.8% vs. 93.0% accuracy) while maintaining high precision, though it still leaves room for improvement on certain violation categories. These results demonstrate what becomes achievable once policy-relevant world state is made available to the enforcement layer.
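The enforcement loop described above (treat an action as a proposed graph mutation, apply it speculatively, check invariants, then decide) can be sketched as follows. The graph schema, the `shared_with` relation, the example invariant, and the rule that unknown entities trigger `Clarify` are all illustrative assumptions; the abstract does not specify how Sentinel represents the graph or when it chooses `Clarify`.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class Graph:
    attrs: Dict[str, Dict[str, str]]            # entity -> attributes
    edges: Set[Edge] = field(default_factory=set)


Invariant = Callable[[Graph], bool]


def apply_action(graph: Graph, new_edges: List[Edge]) -> Graph:
    """Speculative execution: mutate a deep copy, never the live graph."""
    after = copy.deepcopy(graph)
    after.edges |= set(new_edges)
    return after


def decide(graph: Graph, new_edges: List[Edge],
           invariants: List[Invariant]) -> str:
    """Return 'Allow', 'Block', or 'Clarify' for a proposed mutation."""
    # Assumed rule: entities missing from the graph cannot be judged.
    for subj, _, obj in new_edges:
        if subj not in graph.attrs or obj not in graph.attrs:
            return "Clarify"
    after = apply_action(graph, new_edges)
    return "Allow" if all(inv(after) for inv in invariants) else "Block"


def no_confidential_to_external(g: Graph) -> bool:
    """Example invariant: confidential items are never shared externally."""
    return not any(
        rel == "shared_with"
        and g.attrs.get(s, {}).get("label") == "confidential"
        and g.attrs.get(o, {}).get("type") == "external"
        for s, rel, o in g.edges
    )


org = Graph(attrs={
    "doc1": {"label": "confidential"},
    "alice": {"type": "internal"},
    "bob": {"type": "external"},
})
rules = [no_confidential_to_external]
internal_share = decide(org, [("doc1", "shared_with", "alice")], rules)
external_share = decide(org, [("doc1", "shared_with", "bob")], rules)
unknown_share = decide(org, [("doc1", "shared_with", "carol")], rules)
```

The point of the sketch is that the deciding facts (`label`, `type`) live in the graph rather than in the tool responses the agent sees, which is what makes the violation "policy-invisible" to a content-only filter.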

关键词: LLM-based agents, policy-invisible violations, PhantomPolicy benchmark, Sentinel framework, counterfactual graph simulation, tool use, organizational policy, world-state-grounded enforcement

174. ❌ AlphaEval: Evaluating Agents in Production

作者: Pengrui Lu, Bingyu Xu, Wenjun Zhang, Shengjia Hua, Xuanjian Gao, Ranxiang Ge, Lyumanshan Ye, Linxuan Wu, Yiran Li, Junfei Fish Yu, Yibo Zhang, Ruixin Li, Manxiang Li, Xiao Han, Xiaocong Zhou, Guangyao Chi, Zisheng Chen, Kaishen Chen, Kun Wang, Qihua Xu, Fengyue Meng, Yuchen Ni, Jiajun Li, Jinxiu Liu, Danfeng Zhang, Jingru Zhao, Pengfei Liu 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12162v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是AI代理（agents）在商业生产环境中的评估，与“LLM Agents OR Autonomous Agents OR Agentic Workflow”高度相关（10分），因为论文明确评估AI代理产品如Claude Code、Codex等，并讨论代理工作流程。与“Tool Use OR Function Calling OR API Tool Use”有一定关联（5分），因为代理在商业环境中可能涉及工具使用，但论文未深入探讨具体技术。与“Large Language Models OR LLMs OR Foundation Models”相关（8分），因为评估框架使用LLM-as-a-Judge等方法，且代理基于大模型。其他关键词如MoE、Scaling Laws、PEFT、RAG等均未在摘要中提及，与论文技术内容无关，得0分。

!!! tip deepseek-chat TL;DR

论文针对AI代理在商业生产环境中的评估方法不足问题，提出了AlphaEval基准和需求到基准的构建框架，以标准化评估真实世界代理性能。

摘要翻译

人工智能代理在商业场景中的快速部署已超越评估方法的发展速度，现有方法未能充分反映生产环境现实。当前基准测试通过回顾性构建的任务来衡量代理能力，这些任务通常具备明确指定的需求和确定性评估指标——此类条件与生产环境存在根本性差异：实际生产环境中需求包含隐性约束，输入是跨来源信息碎片化的异构多模态文档，任务需要未声明的领域专业知识，输出是长周期的专业交付成果，而成功标准则由标准持续演进的领域专家判定。我们提出AlphaEval——一个基于生产实践的基准测试集，包含来自七家将AI代理部署于核心业务公司的94项任务，涵盖六个O*NET（职业信息网络）领域。与以模型为中心的基准不同，AlphaEval将完整代理产品（如Claude Code、Codex等）作为商业系统进行评估，捕捉模型级评估无法观测的性能差异。我们的评估框架涵盖多种范式（LLM-as-a-Judge、基于参考指标的评估、形式化验证、量规评估、自动化UI测试等），各领域综合运用多种评估范式。除基准测试集本身外，我们还贡献了“需求到基准”构建框架——一种系统化方法论，可在最短时间内将真实生产需求转化为可执行的评估任务。该框架标准化了从需求到评估的全流程，提供可复现、模块化的构建流程，任何组织均可采用该框架为其特定领域构建基于生产实践的基准测试体系。

摘要 (Abstract)

The rapid deployment of AI agents in commercial settings has outpaced the development of evaluation methodologies that reflect production realities. Existing benchmarks measure agent capabilities through retrospectively curated tasks with well-specified requirements and deterministic metrics – conditions that diverge fundamentally from production environments where requirements contain implicit constraints, inputs are heterogeneous multi-modal documents with information fragmented across sources, tasks demand undeclared domain expertise, outputs are long-horizon professional deliverables, and success is judged by domain experts whose standards evolve over time. We present AlphaEval, a production-grounded benchmark of 94 tasks sourced from seven companies deploying AI agents in their core business, spanning six O*NET (Occupational Information Network) domains. Unlike model-centric benchmarks, AlphaEval evaluates complete agent products – Claude Code, Codex, etc. – as commercial systems, capturing performance variations invisible to model-level evaluation. Our evaluation framework covers multiple paradigms (LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, etc.), with individual domains composing multiple paradigms. Beyond the benchmark itself, we contribute a requirement-to-benchmark construction framework – a systematic methodology that transforms authentic production requirements into executable evaluation tasks in minimal time. This framework standardizes the entire pipeline from requirement to evaluation, providing a reproducible, modular process that any organization can adopt to construct production-grounded benchmarks for their own domains.

关键词: AI agents, production evaluation, benchmark, LLM-as-a-Judge, commercial deployment, multi-modal documents, domain expertise, evaluation framework

175. ❌ From Plan to Action: How Well Do Agents Follow the Plan?

作者: Shuyang Liu, Saman Dehghan, Jatin Ganhotra, Martin Hirzel, Reyhaneh Jabbarvand 期刊/来源: arxiv 发布日期: 2026-04-13 arXiv链接: http://arxiv.org/abs/2604.12147v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是对LLM代理（agents）在编程任务中遵循计划（plan compliance）行为的分析，涉及四个LLM和大量轨迹测试。因此与“LLM Agents”高度相关（10分），与推理相关的“Chain of Thought”和“System 2 Thinking”有一定关联（5分），因为代理的reason-act-observe循环涉及推理过程。其他关键词如MoE、SFT、RAG等未在摘要中提及或与论文主题无关，故评0分。

!!! tip deepseek-chat TL;DR

该论文首次系统分析了编程代理在解决软件问题时对给定计划的遵循程度，发现没有明确计划时代理会依赖训练中内化的不完整工作流，而提供标准计划能提高问题解决率，但早期添加额外任务阶段可能损害性能。

摘要翻译

智能体旨在通过自主的“推理-行动-观察”循环消除针对特定任务设计提示的需求。然而，它们通常被指示遵循特定任务的计划以获取指导，例如按照导航、复现、修补和验证等阶段来解决软件问题。遗憾的是，目前尚不清楚智能体在多大程度上实际遵循了此类被指示的计划。缺乏对这种计划遵循程度的分析，便无法评估解决方案是通过正确的策略推理达成，还是通过其他途径（例如数据污染或对基准测试的过拟合）实现。本文首次对编程智能体的计划遵循性进行了广泛而系统的分析，在八种计划变体下，检验了SWE-agent基于四种大语言模型在SWE-bench Verified和SWE-bench Pro数据集上产生的16,991条轨迹。研究发现，在没有明确计划的情况下，智能体会退而依赖训练过程中内化的工作流程，这些流程往往不完整、过拟合或应用不一致。提供标准计划能提升问题解决率，并且我们观察到定期的计划提醒可以减少计划违规并提高任务成功率。一个劣质计划甚至比没有计划更损害性能。令人惊讶的是，在早期阶段为计划增加额外任务相关阶段反而可能降低性能，特别是当这些阶段与模型内部的问题解决策略不一致时。这些发现突显了一个研究空白：需要开发微调范式以教导模型遵循被指示的计划，而非将特定任务计划编码其中。这要求教导模型进行自适应推理与行动，而非记忆工作流程。

摘要 (Abstract)

Agents aspire to eliminate the need for task-specific prompt crafting through autonomous reason-act-observe loops. Still, they are commonly instructed to follow a task-specific plan for guidance, e.g., to resolve software issues following phases for navigation, reproduction, patch, and validation. Unfortunately, it is unknown to what extent agents actually follow such instructed plans. Without such an analysis, determining the extent agents comply with a given plan, it is impossible to assess whether a solution was reached through correct strategic reasoning or through other means, e.g., data contamination or overfitting to a benchmark. This paper presents the first extensive, systematic analysis of plan compliance in programming agents, examining 16,991 trajectories from SWE-agent across four LLMs on SWE-bench Verified and SWE-bench Pro under eight plan variations. Without an explicit plan, agents fall back on workflows internalized during training, which are often incomplete, overfit, or inconsistently applied. Providing the standard plan improves issue resolution, and we observe that periodic plan reminders can mitigate plan violations and improve task success. A subpar plan hurts performance even more than no plan at all. Surprisingly, augmenting a plan with additional task-relevant phases in the early stage can degrade performance, particularly when these phases do not align with the model’s internal problem-solving strategy. These findings highlight a research gap: fine-tuning paradigms that teach models to follow instructed plans, rather than encoding task-specific plans in them. This requires teaching models to reason and act adaptively, rather than memorizing workflows.
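One simple way to think about plan compliance is to label each trajectory step with a phase and check those labels against the instructed order. The sketch below assumes such per-step phase labels are already available and uses the four phases named in the abstract; the `compliance_report` function and its definition of a violation (a missing phase, or a phase first seen before its predecessor in the plan) are illustrative assumptions, not the paper's metric.

```python
from typing import Dict, List, Sequence

# Phases of the standard plan as named in the abstract.
PLAN = ["navigation", "reproduction", "patch", "validation"]


def compliance_report(steps: Sequence[str],
                      plan: Sequence[str] = PLAN) -> Dict[str, object]:
    """Compare the order in which phases first appear against the plan."""
    first_seen: Dict[str, int] = {}
    for i, phase in enumerate(steps):
        first_seen.setdefault(phase, i)  # keep only the first occurrence

    missing = [p for p in plan if p not in first_seen]
    present = [p for p in plan if p in first_seen]
    # A phase is out of order if it first appears before its plan predecessor.
    out_of_order = [b for a, b in zip(present, present[1:])
                    if first_seen[b] < first_seen[a]]
    return {
        "missing": missing,
        "out_of_order": out_of_order,
        "compliant": not missing and not out_of_order,
    }


followed = compliance_report(
    ["navigation", "navigation", "reproduction", "patch", "validation"])
skipped = compliance_report(["navigation", "patch", "validation"])
reordered = compliance_report(
    ["patch", "navigation", "reproduction", "validation"])
```

A check like this separates "solved the issue" from "solved it by following the plan", which is the distinction the abstract argues is needed to rule out shortcuts such as benchmark overfitting.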

关键词: LLM agents, plan compliance, programming agents, SWE-agent, SWE-bench, reason-act-observe loops, task-specific plans, fine-tuning paradigms

176. ❌ When Self-Reference Fails to Close: Matrix-Level Dynamics in Large Language Models

作者: Ji Ho Bae 期刊/来源: arxiv 发布日期: 2026-04-13 arXiv链接: http://arxiv.org/abs/2604.12128v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

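Scoring rationale: The paper studies the internal matrix dynamics of large language models (LLMs), in particular how self-referential inputs affect internal representations, so it is highly relevant to "Large Language Models" and "Mechanistic Interpretability" (10 points each). It touches on self-reflection and self-improvement mechanisms (related to "Self-Correction", 8 points) and is indirectly connected to hallucination mitigation ("Hallucination Mitigation", 5 points), because it examines how self-referential failure modes can produce contradictory output. The remaining keywords (MoE, SLMs, training methods, inference techniques, agent systems, and so on) are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

This paper studies how self-referential inputs change the internal matrix dynamics of large language models. It finds that prompts inducing non-closing truth recursion (NCTR) cause attention reorganization and unstable internal representations and can produce contradictory output, which points to the practical impact of self-referential failure modes.

Abstract translation (Chinese)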

本研究探讨自指输入如何改变大语言模型的内部矩阵动态。通过对三个架构家族的四个模型（Qwen3-VL-8B、Llama-3.2-11B、Llama-3.3-70B 和 Gemma-2-9B）在三个温度（$T \in \{0.0, 0.3, 0.7\}$）下，针对一个14层级的提示体系中的300余条提示，进行最多7轮分析并测量106项标量指标，我们发现自指本身并不具有失稳性：在关键崩溃相关指标上，有依据的自指陈述和元认知提示明显比悖论性自指更稳定，且在多项此类指标上可与事实性对照提示的稳定性相当。失稳性集中出现在诱导非闭合真值递归（non-closing truth recursion, NCTR）的提示中，即不存在有限深度解析的真值计算。NCTR提示会导致注意力有效秩异常升高（表明注意力发生全局分散的重组，而非简单的集中崩溃），在70B模型中，相对于稳定自指，关键指标的科恩$d$值达3.14（注意力有效秩）至3.52（方差峰度）；经错误发现率校正后（$q < 0.05$），281/397个指标-模型组合能区分NCTR与稳定自指，其中198个组合的$|d| > 0.8$。逐层奇异值分解证实了每个采样层都存在扰动（在所分析的三个模型中$d > +1.0$），排除了聚合伪影。分类器曲线下面积达0.81–0.90；30组最小对比提示产生42/387个显著组合；43/106项指标在所有四个模型中复现。我们将这些观察结果与三个经典矩阵半群问题相联系，并提出一个猜想：NCTR迫使有限深度Transformer进入这些问题集中的动态机制。NCTR提示还导致矛盾输出增加（较对照组提升34–56个百分点），这表明其对理解自指失效模式具有实际意义。

Abstract

We investigate how self-referential inputs alter the internal matrix dynamics of large language models. Measuring 106 scalar metrics across up to 7 analysis passes on four models from three architecture families – Qwen3-VL-8B, Llama-3.2-11B, Llama-3.3-70B, and Gemma-2-9B – over 300 prompts in a 14-level hierarchy at three temperatures ($T \in \{0.0, 0.3, 0.7\}$), we find that self-reference alone is not destabilizing: grounded self-referential statements and meta-cognitive prompts are markedly more stable than paradoxical self-reference on key collapse-related metrics, and on several such metrics can be as stable as factual controls. Instability concentrates in prompts inducing non-closing truth recursion (NCTR) – truth-value computations with no finite-depth resolution. NCTR prompts produce anomalously elevated attention effective rank – indicating attention reorganization with global dispersion rather than simple concentration collapse – and key metrics reach Cohen’s $d = 3.14$ (attention effective rank) to $3.52$ (variance kurtosis) vs. stable self-reference in the 70B model; 281/397 metric-model combinations differentiate NCTR from stable self-reference after FDR correction ($q < 0.05$), 198 with $|d| > 0.8$. Per-layer SVD confirms disruption at every sampled layer ($d > +1.0$ in all three models analyzed), ruling out aggregation artifacts. A classifier achieves AUC $0.81$-$0.90$; 30 minimal pairs yield 42/387 significant combinations; 43/106 metrics replicate across all four models. We connect these observations to three classical matrix-semigroup problems and propose, as a conjecture, that NCTR forces finite-depth transformers toward dynamical regimes where these problems concentrate. NCTR prompts also produce elevated contradictory output ($+34$-$56$ percentage points vs. controls), suggesting practical relevance for understanding self-referential failure modes.

Keywords: Large Language Models, Self-reference, Matrix dynamics, Attention effective rank, Non-closing truth recursion, Mechanistic interpretability, Internal representations, Hallucination mitigation
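
For readers unfamiliar with the effect-size figures quoted in the abstract, the snippet below is a minimal sketch of Cohen's $d$ in its textbook pooled-standard-deviation form. The abstract does not state the paper's exact estimator, so the formula choice, the function name `cohens_d`, and the sample values are illustrative assumptions, not the authors' code or data.

```python
import math
import statistics


def cohens_d(treatment, control):
    """Cohen's d using a pooled standard deviation (sample variances)."""
    n1, n2 = len(treatment), len(control)
    var1 = statistics.variance(treatment)
    var2 = statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


# Hypothetical metric values for two prompt groups (not taken from the paper).
nctr_scores = [5.1, 5.4, 4.9, 5.6, 5.0]
stable_scores = [4.1, 4.4, 3.9, 4.6, 4.0]
d = cohens_d(nctr_scores, stable_scores)
print(round(d, 2))
```

By the usual convention, $|d| > 0.8$ counts as a large effect, which is the threshold the abstract applies.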

177. ❌ Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching

Authors: Rongzhe Wei, Ge Shi, Min Cheng, Na Zhang, Pan Li, Sarthak Ghosh, Vaibhav Gorde, Leman Akoglu Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12126v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

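Scoring rationale: The paper studies long-horizon plan execution by LLM-driven, tool-augmented agents in large tool spaces, so it is highly relevant to "Large Language Models", "LLM Agents", and "Tool Use" (10 points each). It notes that current agents struggle with self-correction, which relates to "Self-Correction" (8 points). The work involves multi-step reasoning and exploration of the decision space, giving it some connection to "Chain of Thought" and "System 2 Thinking" (5 points each). Other keywords such as MoE, quantization, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

This paper addresses two problems that LLM agents face when executing multi-step tasks over large tool libraries: the lack of an evaluation framework and the high computational cost. It proposes the SLATE benchmark and the Entropy-Guided Branching (EGB) algorithm and reports significant gains in task success rate and computational efficiency.

Abstract translation (Chinese)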

大型语言模型（LLM）显著推动了工具增强型智能体的发展，使其能够通过API交互实现自主推理。然而，在庞大的工具库中执行多步骤任务仍面临两大关键瓶颈的挑战：（1）缺乏严谨的、规划层面的评估框架；（2）由于工具集规模庞大和长程规划需求，探索广阔决策空间的计算成本高昂。为弥补这些不足，我们首先提出了SLATE（面向电子商务的合成大规模API工具包），这是一个为自动化评估工具集成智能体而设计的大规模情境感知基准。与静态指标不同，SLATE能够容纳多样化但功能有效的执行轨迹，揭示了当前智能体在自我纠错和搜索效率方面的不足。基于这些发现，我们进一步提出了熵引导分支（Entropy-Guided Branching, EGB），这是一种不确定性感知搜索算法，能够在预测熵较高的区域动态扩展决策分支。EGB优化了探索与利用的平衡，显著提升了任务成功率和计算效率。在SLATE上进行的大量实验表明，我们的双重贡献为在工具丰富的环境中开发可靠且可扩展的LLM智能体奠定了坚实基础。

Abstract

Large Language Models (LLMs) have significantly advanced tool-augmented agents, enabling autonomous reasoning via API interactions. However, executing multi-step tasks within massive tool libraries remains challenging due to two critical bottlenecks: (1) the absence of rigorous, plan-level evaluation frameworks and (2) the computational demand of exploring vast decision spaces stemming from large toolsets and long-horizon planning. To bridge these gaps, we first introduce SLATE (Synthetic Large-scale API Toolkit for E-commerce), a large-scale context-aware benchmark designed for the automated assessment of tool-integrated agents. Unlike static metrics, SLATE accommodates diverse yet functionally valid execution trajectories, revealing that current agents struggle with self-correction and search efficiency. Motivated by these findings, we next propose Entropy-Guided Branching (EGB), an uncertainty-aware search algorithm that dynamically expands decision branches where predictive entropy is high. EGB optimizes the exploration-exploitation trade-off, significantly enhancing both task success rates and computational efficiency. Extensive experiments on SLATE demonstrate that our dual contribution provides a robust foundation for developing reliable and scalable LLM agents in tool-rich environments.

Keywords: Large Language Models, LLM agents, tool-augmented agents, multi-step tasks, entropy-guided branching, API interactions, decision spaces, computational efficiency
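
The abstract describes EGB only at a high level, so the following is a rough sketch of the general idea of widening the search where predictive entropy is high. It is not the authors' algorithm: the threshold, the branch width, the function names, and the tool names are all hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_branches(tool_probs, entropy_threshold=0.5, max_branches=3):
    """Pick which candidate tools to expand at one planning step.

    tool_probs maps a tool name to the agent's probability of calling it.
    Low entropy: follow only the most likely tool (exploit).
    High entropy: expand the top `max_branches` tools (explore).
    """
    ranked = sorted(tool_probs, key=tool_probs.get, reverse=True)
    entropy = predictive_entropy(tool_probs.values())
    width = max_branches if entropy > entropy_threshold else 1
    return ranked[:width]


# Hypothetical next-tool distributions for an e-commerce agent.
confident = {"search_items": 0.95, "add_to_cart": 0.03, "checkout": 0.02}
uncertain = {"search_items": 0.40, "add_to_cart": 0.35, "checkout": 0.25}

print(choose_branches(confident))
print(choose_branches(uncertain))
```

Under these assumptions the confident step keeps a single branch and the uncertain step fans out, so extra compute is spent only where the agent's next action is ambiguous.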

178. ❌ The Effect of Document Selection on Query-focused Text Analysis

Authors: Sandesh S Rangreji, Mian Zhong, Anjalie Field Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12099v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

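Scoring rationale: The paper studies how document selection strategies affect text analysis, evaluating seven methods (from random selection to hybrid retrieval) on four text analysis methods (LDA, BERTopic, TopicGPT, HiCode). Although TopicGPT may be built on a large language model, the paper's focus is an evaluation framework for document selection methodology, not LLM technology itself or a novel application of it. All of the keywords concern LLM principles, training methods, inference optimization, or application areas and have no direct connection to the paper's core content, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

This paper systematically evaluates how seven document selection methods affect the output of four text analysis techniques. It finds that semantic or hybrid retrieval is an effective general-purpose choice and establishes an evaluation framework that treats data selection as a methodological decision.

Abstract translation (Chinese)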

对文档集合的分析通常需要选择待分析的数据，因为并非所有文档都与特定研究问题相关，且计算限制使得无法分析全部文档，然而目前很少有研究探讨选择策略的影响。我们基于两个数据集中的26个开放式查询，系统评估了七种选择方法（从随机选择到混合检索）对四种文本分析方法（LDA、BERTopic、TopicGPT、HiCode）输出结果的影响。评估结果提供了实践指导：语义检索或混合检索可作为可靠的首选方案，既能避免较弱选择策略的缺陷，又能规避更复杂方法带来的不必要计算开销。总体而言，我们的评估框架将数据选择确立为一种方法论决策而非单纯的技术需求，为开发新策略提供了研究路径。

Abstract

Analyses of document collections often require selecting what data to analyze, as not all documents are relevant to a particular research question and computational constraints preclude analyzing all documents, yet little work has examined effects of selection strategy choices. We systematically evaluate seven selection methods (from random selection to hybrid retrieval) on outputs from four text analyses methods (LDA, BERTopic, TopicGPT, HiCode) over two datasets with 26 open-ended queries. Our evaluation reveals practice guidance: semantic or hybrid retrieval offer strong go-to approaches that avoid the pitfalls of weaker selection strategies and the unnecessary compute overhead of more complicated ones. Overall, our evaluation framework establishes data selection as a methodological decision, rather than a practical necessity, inviting the development of new strategies.

Keywords: document selection, text analysis, retrieval methods, evaluation framework, TopicGPT, BERTopic, LDA, HiCode

179. ❌ Temporal Flattening in LLM-Generated Text: Comparing Human and LLM Writing Trajectories

Authors: Zhanwei Cao, YeoJin Go, Yifan Hu, Shanu Sushmita Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12097v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

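Scoring rationale: The paper studies the temporal characteristics of LLM-generated text (temporal flattening), which bears directly on the basic use and evaluation of LLMs, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not cover the specific techniques behind the other keywords (MoE, training methods, inference optimization, agent systems, AI for science, and so on), which score 0.

!!! tip deepseek-chat TL;DR

This study compares text trajectories produced over long time spans by human authors and by LLMs. It finds temporal flattening in LLM-generated text: semantic and cognitive-emotional drift is substantially lower than in human writing, and this difference separates human from LLM text sequences with high accuracy.

Abstract translation (Chinese)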

大型语言模型（LLM）正日益广泛应用于日常场景，从内容生成到代码编写，每次交互通常将模型视为无状态系统，独立生成响应而不具备记忆能力。然而人类写作本质上是纵向演进的：作者的风格与认知状态会在数月乃至数年间持续演变。这引出一个核心问题：LLM能否在长时间跨度中复现这种时序结构？我们构建并公开了一个纵向数据集，涵盖2012至2024年间412位人类作者的6,086篇文档，涉及学术摘要、博客、新闻三个领域，并将其与三种代表性LLM在标准生成和历史条件生成设定下产生的轨迹进行比较。通过基于语义、词汇及认知情感表征的漂移度与方差度量，我们发现LLM生成文本存在时间扁平化现象。相较于人类文本，LLM产出具有更高的词汇多样性，但其语义与认知情感漂移程度显著降低。这些差异具有高度预测性：仅使用时序变异模式即可实现94%的准确率与98%的ROC-AUC值来区分人类与LLM生成轨迹。研究结果表明，无论LLM是独立生成还是基于增量历史生成，时间扁平化现象持续存在，这揭示了当前部署范式的基本特性。这一差距对需要真实时序结构的应用（如合成训练数据与纵向文本建模）具有直接启示意义。

Abstract

Large language models (LLMs) are increasingly used in daily applications, from content generation to code writing, where each interaction treats the model as stateless, generating responses independently without memory. Yet human writing is inherently longitudinal: authors’ styles and cognitive states evolve across months and years. This raises a central question: can LLMs reproduce such temporal structure across extended time periods? We construct and publicly release a longitudinal dataset of 412 human authors and 6,086 documents spanning 2012–2024 across three domains (academic abstracts, blogs, news) and compare them to trajectories generated by three representative LLMs under standard and history-conditioned generation settings. Using drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations, we find temporal flattening in LLM-generated text. LLMs produce greater lexical diversity but exhibit substantially reduced semantic and cognitive-emotional drift relative to humans. These differences are highly predictive: temporal variability patterns alone achieve 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories. Our results demonstrate that temporal flattening persists regardless of whether LLMs generate independently or with access to incremental history, revealing a fundamental property of current deployment paradigms. This gap has direct implications for applications requiring authentic temporal structure, such as synthetic training data and longitudinal text modeling.

Keywords: Large Language Models, temporal flattening, longitudinal analysis, text generation, human vs LLM comparison, semantic drift, cognitive-emotional representation, temporal variability
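
One simple way to picture a drift metric over a writing trajectory is the average cosine distance between consecutive document embeddings. The sketch below assumes that definition and uses hand-made 2-D vectors. The abstract does not specify the paper's actual metrics or embedding models, so `mean_drift` and the data are illustrative only.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mean_drift(trajectory):
    """Average cosine distance between consecutive vectors, in time order."""
    steps = [cosine_distance(a, b) for a, b in zip(trajectory, trajectory[1:])]
    return sum(steps) / len(steps)


# Hypothetical embeddings of three documents by one author over time.
drifting = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # topic changes at each step
flat = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]      # no change at all

print(mean_drift(drifting), mean_drift(flat))
```

In this toy setting a "temporally flattened" trajectory is one whose mean drift stays close to zero.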

180. ❌ Robust Explanations for User Trust in Enterprise NLP Systems

Authors: Guilin Zhang, Kai Zhao, Jeffrey Friedman, Xu Chu, Amine Anoun, Jerry Ting Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12069v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

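Scoring rationale: The paper studies the robustness of explanations from large language models (LLMs) in enterprise NLP systems and directly involves the LLM and explainable AI (XAI) keywords, so both score 10. It finds that explanation stability improves with model scale (7B to 70B), which has some connection to Scaling Laws (5 points). Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the explanation robustness that user trust requires in black-box deployments of enterprise NLP systems. It proposes a unified evaluation framework, finds that decoder LLMs produce more stable explanations than encoder baselines and that stability improves with model scale, and derives a cost-robustness tradeoff curve.

Abstract translation (Chinese)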

在企业级自然语言处理应用中，稳健的解释对于建立用户信任日益重要，然而在仅通过API访问的黑盒部署场景中，预部署验证面临困难——基于表征的解释器无法实施，且现有研究未能充分指导解释在真实用户噪声下的稳定性，尤其是在组织从编码器分类器迁移至解码器大语言模型时。为填补这一空白，我们提出一个基于留一遮蔽法的统一黑盒词元级解释稳健性评估框架，并通过多严重程度现实扰动（替换、删除、乱序、回译）下的头部词元翻转率来量化解释稳健性。基于该协议，我们在三个基准数据集和六种涵盖编码器与解码器家族的模型（BERT、RoBERTa、Qwen 7B/14B、Llama 8B/70B；总计64,800个案例）上进行了系统性跨架构比较。研究发现：解码器大语言模型产生的解释显著比编码器基线更稳定（平均翻转率降低73%），且稳定性随模型规模提升而增强（从70亿参数到700亿参数提升44%）。最后，我们将稳健性提升与推理成本关联，绘制出实用的成本-稳健性权衡曲线，为合规敏感型应用的部署前模型与解释器选择提供依据。

Abstract

Robust explanations are increasingly required for user trust in enterprise NLP, yet pre-deployment validation is difficult in the common case of black-box deployment (API-only access) where representation-based explainers are infeasible and existing studies provide limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs. To close this gap, we propose a unified black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion, and operationalize explanation robustness with top-token flip rate under realistic perturbations (swap, deletion, shuffling, and back-translation) at multiple severity levels. Using this protocol, we conduct a systematic cross-architecture comparison across three benchmark datasets and six models spanning encoder and decoder families (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases). We find that decoder LLMs produce substantially more stable explanations than encoder baselines (73% lower flip rates on average), and that stability improves with model scale (44% gain from 7B to 70B). Finally, we relate robustness improvements to inference cost, yielding a practical cost-robustness tradeoff curve that supports model and explanation selection prior to deployment in compliance-sensitive applications.

Keywords: explainable AI, large language models, robustness evaluation, enterprise NLP, black-box deployment, token-level explanations, model scale, cost-robustness tradeoff
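
As a rough illustration of the protocol described in the abstract, the sketch below computes leave-one-out occlusion importances against a black-box scorer and then a top-token flip rate over original/perturbed pairs. The toy scorer, the token lists, and the choice to compare top tokens by identity rather than by position are assumptions made for this example; the paper's exact definitions may differ.

```python
def loo_importance(tokens, score_fn):
    """Leave-one-out occlusion: importance of token i is the score drop when it is removed."""
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def top_token(tokens, score_fn):
    """Return the token whose removal lowers the score the most."""
    importances = loo_importance(tokens, score_fn)
    best = max(range(len(tokens)), key=importances.__getitem__)
    return tokens[best]


def top_token_flip_rate(pairs, score_fn):
    """Fraction of (original, perturbed) inputs whose top token changes."""
    flips = sum(top_token(o, score_fn) != top_token(p, score_fn) for o, p in pairs)
    return flips / len(pairs)


# Hypothetical stand-in for an API-only model: a weighted keyword count.
WEIGHTS = {"refund": 2.0, "broken": 1.5, "late": 1.0}


def toy_score(tokens):
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


original = ["refund", "my", "broken", "phone"]
pairs = [
    (original, ["my", "refund", "phone", "broken"]),  # shuffling perturbation
    (original, ["my", "broken", "phone"]),            # deletion perturbation
]
rate = top_token_flip_rate(pairs, toy_score)
print(rate)
```

The scorer is only ever called on token lists, which mirrors the API-only constraint: no gradients or internal representations are needed.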

181. ❌ Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

Authors: Omar El Bachyr, Yewei Song, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Anas Zilali, Ulrick Ble, Anne Goujon Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12047v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

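Scoring rationale: The paper studies RAG systems for question answering over financial PDFs, so it is highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (10 points). It involves applying large models in the financial domain, which gives it some connection to "Large Language Models OR LLMs OR Foundation Models" and "AI for Science OR Bioinformatics OR Cheminformatics" (5 points each), but it does not go deep into novel technical principles. Other keywords such as MoE, SFT, and RLHF are not covered and score 0.

!!! tip deepseek-chat TL;DR

This paper systematically evaluates how PDF parsing and chunking strategies affect RAG performance on financial question answering and offers practical guidelines for building robust RAG pipelines.

Abstract translation (Chinese)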

PDF文件主要设计用于人工阅读而非自动化处理。此外，PDF中文本、表格和图像等异构内容给解析和信息提取带来了重大挑战。为解决这些难题，从业者和研究人员正日益开发新方法，包括采用前景广阔的检索增强生成（Retrieval-Augmented Generation，简称RAG）系统来实现PDF自动化处理。然而，目前尚无全面研究探讨不同组件和设计选择如何影响用于理解PDF的RAG系统性能。本文通过（1）聚焦于问答这一特定语言理解任务，并（2）利用金融领域的两个基准数据集（包括我们新生成并公开可用的基准数据集TableQuest），提出了此类研究框架。我们系统性地考察了多种PDF解析器和分块策略（含不同重叠度），以及它们在保持文档结构和确保答案准确性方面的潜在协同效应。总体而言，我们的研究结果为构建用于PDF理解的稳健RAG流程提供了实用指导。

Abstract

PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods, including the promising Retrieval-Augmented Generation (RAG) systems to automated PDF processing. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we propose such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.

Keywords: PDF parsing, chunking strategies, Retrieval-Augmented Generation, RAG, financial question answering, TableQuest benchmark, document structure preservation, answer correctness
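
To make "chunking strategies (with varied overlap)" concrete, here is a minimal word-based sliding-window chunker. It is a generic sketch, not one of the strategies evaluated in the paper; the function name `chunk_words` and its default sizes are assumptions.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=2)
print(chunks)
```

A larger overlap lowers the chance that an answer is split across a chunk boundary, at the cost of indexing more redundant text.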

182. ❌ Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs

Authors: Shreeya Verma Kathuria, Nitin Mayande, Sharookh Daruwalla, Nitin Joglekar, Charles Weber Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12049v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

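Scoring rationale: The paper proposes the wSSAS framework to improve LLM performance on text categorization and is mainly about applying LLMs and improving their performance, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not cover the specific techniques or concepts behind the other keywords (MoE, SLMs, training methods, inference techniques, agent systems, and so on), so those keywords score 0.

!!! tip deepseek-chat TL;DR

This paper proposes wSSAS, a deterministic framework that uses hierarchical classification and a signal-to-noise ratio mechanism to improve the accuracy and reproducibility of LLM text categorization on large, chaotic datasets. Experiments show that it significantly improves clustering integrity and categorization accuracy.

Abstract translation (Chinese)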

大型语言模型（LLM）在用于文本分类等可靠的企业级分析时，常因其注意力机制的随机性及对噪声的敏感性而受到阻碍，这影响了分析精度与结果的可复现性。为解决这些技术摩擦，本文引入了加权句法与语义上下文评估摘要（Weighted Syntactic and Semantic Context Assessment Summary, wSSAS），这是一个旨在对大规模混乱数据集实施数据完整性的确定性框架。我们提出了一种两阶段验证框架：首先将原始文本组织成包含主题（Themes）、故事（Stories）和聚类（Clusters）的层次化分类结构；随后利用信噪比（Signal-to-Noise Ratio, SNR）对高价值语义特征进行优先级排序，确保模型的注意力始终聚焦于最具代表性的数据点。通过将此评分机制融入摘要之摘要（Summary-of-Summaries, SoS）架构，该框架在数据聚合过程中有效隔离了关键信息并抑制了背景噪声。
使用Gemini 2.0 Flash Lite在多样化数据集（包括谷歌商业评论、亚马逊产品评论和Goodreads图书评论）上的实验结果表明，wSSAS显著提升了聚类完整性与分类准确性。我们的研究发现，wSSAS能够降低分类熵，并通过一种高精度、确定性的流程为基于LLM的摘要提供了可复现的改进路径，从而适用于大规模文本分类任务。

Abstract

The use of Large Language Models (LLMs) for reliable, enterprise-grade analytics such as text categorization is often hindered by the stochastic nature of attention mechanisms and sensitivity to noise that compromise their analytical precision and reproducibility. To address these technical frictions, this paper introduces the Weighted Syntactic and Semantic Context Assessment Summary (wSSAS), a deterministic framework designed to enforce data integrity on large-scale, chaotic datasets. We propose a two-phased validation framework that first organizes raw text into a hierarchical classification structure containing Themes, Stories, and Clusters. It then leverages a Signal-to-Noise Ratio (SNR) to prioritize high-value semantic features, ensuring the model’s attention remains focused on the most representative data points. By incorporating this scoring mechanism into a Summary-of-Summaries (SoS) architecture, the framework effectively isolates essential information and mitigates background noise during data aggregation. Experimental results using Gemini 2.0 Flash Lite across diverse datasets - including Google Business reviews, Amazon Product reviews, and Goodreads Book reviews - demonstrate that wSSAS significantly improves clustering integrity and categorization accuracy. Our findings indicate that wSSAS reduces categorization entropy and provides a reproducible pathway for improving LLM based summaries based on a high-precision, deterministic process for large-scale text categorization.

Keywords: Large Language Models, text categorization, deterministic framework, Signal-to-Noise Ratio, clustering integrity, wSSAS, data integrity, summary-of-summaries

183. ❌ LLMs Struggle with Abstract Meaning Comprehension More Than Expected

Authors: Hamoud Alhazmi, Jiachen Jiang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12018v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
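
The table above can be read as a weighted sum. The following is a minimal sketch, assuming each keyword's score is weight × relevance capped at the per-keyword maximum of 15 from the report's scoring settings, and using the pass-threshold formula from the same settings (5.0 + 0.8 × total weight). The function names are hypothetical, and the multiplication-and-cap rule is an assumption rather than something the report states.

```python
from typing import Iterable, Tuple

MAX_PER_KEYWORD = 15.0  # per-keyword cap taken from the report's scoring settings


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: weight x relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def total_score(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum the per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


def pass_threshold(total_weight: float) -> float:
    """Pass-threshold formula from the report configuration."""
    return 5.0 + 0.8 * total_weight


# The three rows with non-zero relevance in the table above, using the
# weight of 0.0 that the table lists for them.
rows = [(0.0, 10.0), (0.0, 8.0), (0.0, 5.0)]
print(total_score(rows))
print(round(pass_threshold(27.0), 1))
```

Under this reading, a weight of 0.0 on every row forces the total to 0.0 regardless of relevance, which matches the score shown for this paper.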

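Scoring rationale: The paper's core subject is how large language models (LLMs) perform on abstract meaning comprehension, which matches the ‘Large Language Models’ keyword directly (10 points). It finds that fine-tuned models such as BERT and RoBERTa perform better, which relates to ‘Post-training/SFT’ (8 points). It tests zero-shot, one-shot, and few-shot settings, which is partly related to ‘In-context Learning’ (5 points). Other keywords, such as MoE, quantization, and RAG, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper finds that large language models struggle with abstract meaning comprehension in zero-shot and few-shot settings while fine-tuned models perform better, and it proposes a bidirectional attention classifier that improves fine-tuned models on this task.

Abstract (Chinese translation)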

理解抽象意义对于高级语言理解至关重要。尽管已有大量研究，抽象词汇因其非具象、高层次的语义特性仍具挑战性。SemEval-2021 任务四（ReCAM）通过提供带有问题和五个抽象选项的完形填空式段落，评估模型解释抽象概念的能力。主要发现包括：（1）包括GPT-4o在内的大多数大语言模型（LLMs）在零样本、单样本和少样本设置下均难以理解抽象意义，而经过微调的模型如BERT和RoBERTa表现更佳。（2）受人类认知策略启发提出的双向注意力分类器，通过动态关注段落和选项来增强微调模型。该方法在任务一上准确率提升4.06%，在任务二上提升3.41%，证明了其在抽象意义理解方面的潜力。

Abstract

Understanding abstract meanings is crucial for advanced language comprehension. Despite extensive research, abstract words remain challenging due to their non-concrete, high-level semantics. SemEval-2021 Task 4 (ReCAM) evaluates models’ ability to interpret abstract concepts by presenting passages with questions and five abstract options in a cloze-style format. Key findings include: (1) Most large language models (LLMs), including GPT-4o, struggle with abstract meaning comprehension under zero-shot, one-shot, and few-shot settings, while fine-tuned models like BERT and RoBERTa perform better. (2) A proposed bidirectional attention classifier, inspired by human cognitive strategies, enhances fine-tuned models by dynamically attending to passages and options. This approach improves accuracy by 4.06 percent on Task 1 and 3.41 percent on Task 2, demonstrating its potential for abstract meaning comprehension.

Keywords: Large Language Models, Abstract Meaning Comprehension, Fine-tuning, Zero-shot Learning, Few-shot Learning, BERT, RoBERTa, Bidirectional Attention

184. ❌ Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

Authors: Xin Liu, Lu Wang Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12046v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

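Scoring rationale: The paper studies the factuality of LLMs in long-form generation and proposes the CURE framework, which improves factual accuracy through reasoning calibration. Highly relevant keywords: LLMs (the core object of study), Chain of Thought/System 2 Thinking (the method relies on reasoning), Self-Correction (self-improvement through calibration), and Hallucination Mitigation (the paper targets hallucination directly). Moderately relevant keywords: Post-training/SFT (supervised training is used), RLHF (RL methods are mentioned), and Explainable AI (confidence estimation relates to interpretability). The remaining keywords are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address hallucination in long-form LLM generation, the paper proposes CURE, a framework that teaches the model to reason about uncertainty at the claim level and calibrate its confidence. Across four long-form factuality benchmarks it improves factual accuracy, with claim-level accuracy gains of up to 39.9% on Biography generation.

Abstract (Chinese translation)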

大型语言模型（LLM）在长文本生成中常出现幻觉现象。现有方法主要通过后验修正或基于正确性奖励的强化学习（RL）来提升事实性，但并未教导模型如何评估其生成内容中哪些部分是可靠的。因此，模型仍可能在回答中自信地陈述错误主张。近期推理技术的进步显著提升了LLM的性能，并通过将校准融入RL目标来估计置信度。然而，现有方法仍局限于为整个回答生成单一标量置信度，这不足以应对长文本生成中不同主张间不确定性各异的场景。为缓解此问题，我们提出CURE框架，通过教导LLM在主张层面进行不确定性推理来提升长文本事实性。我们首先引入主张感知推理协议，将输出结构化为原子主张与显式置信度估计的配对。随后开发多阶段训练流程，先将模型置信度与主张正确性对齐，再针对事实性进行优化。由此产生的校准置信度进一步支持选择性预测，使模型能在推理时主动回避不确定的主张。在四个长文本事实性基准测试上的实验表明，CURE相较于有监督和RL基线方法持续提升事实准确性，同时保持事实召回率。特别是在传记生成任务中，主张级准确率最高提升39.9%。这些提升伴随着校准效果的改善，具体体现在FactBench数据集上AUROC指标增长16.0%。

Abstract

Large language models (LLMs) often hallucinate in long-form generation. Existing approaches mainly improve factuality through post-hoc revision or reinforcement learning (RL) with correctness-based rewards, but they do not teach the model to estimate which parts of its generation are reliable. As a result, models may still state incorrect claims confidently in their responses. Recent advances in reasoning have significantly improved LLM performance, and have been leveraged to estimate confidence by incorporating calibration into RL objectives. However, existing approaches remain limited to a single scalar confidence for the entire response, which is insufficient for long-form generation where uncertainty varies across individual claims. To mitigate this problem, we propose CURE, a framework that improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level. We first introduce a Claim-Aware Reasoning Protocol, which structures outputs into atomic claims paired with explicit confidence estimates. We then develop a multi-stage training pipeline that aligns model confidence with claims’ correctness and then optimizes on factuality. The resulting calibrated confidence further enables selective prediction, allowing the model to abstain from uncertain claims at inference time. Experiments on four long-form factuality benchmarks show that CURE consistently improves factual accuracy over competitive supervised and RL baselines, while maintaining factual recall. In particular, it improves claim-level accuracy by up to 39.9% on Biography generation. These gains are accompanied by improved calibration, as reflected by a 16.0% increase in AUROC on FactBench.
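
The selective-prediction step described in the abstract can be illustrated with a small sketch. It assumes the output has already been structured into atomic claims with a confidence in [0, 1]; the `Claim` type, the `select_claims` helper, and the 0.7 threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    confidence: float  # model-stated confidence in [0, 1] (assumed format)


def select_claims(claims: List[Claim], threshold: float = 0.7) -> List[Claim]:
    """Keep claims whose confidence meets the threshold; abstain from the rest."""
    return [c for c in claims if c.confidence >= threshold]


claims = [
    Claim("Claim A", 0.95),
    Claim("Claim B", 0.40),
    Claim("Claim C", 0.80),
]
kept = select_claims(claims)
print([c.text for c in kept])
```

Raising the threshold trades factual recall for precision, which is why the abstract reports both quantities.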

Keywords: Large Language Models, Hallucination Mitigation, Factuality, Reasoning Calibration, Confidence Estimation, Long-form Generation, Self-Correction, Claim-level Accuracy

185. ❌ Benchmarking Deflection and Hallucination in Large Vision-Language Models

Authors: Nicholas Moratelli, Christopher Davis, Leonardo F. R. Ribeiro, Bill Byrne, Gonzalo Iglesias Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12033v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	15.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

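Scoring rationale: The paper benchmarks large vision-language models (LVLMs). Its core subject is retrieval-augmented generation (RAG) in knowledge-intensive multimodal question answering, and whether models deflect rather than hallucinate when evidence is conflicting or incomplete. It is therefore highly relevant to ‘Retrieval-Augmented Generation’ (10 points) and to ‘Hallucination Mitigation’ (15 points), the latter being the paper's central evaluation target. It is partly related to ‘Large Language Models’ (8 points), since LVLMs extend large language models. Other keywords, such as MoE, SLMs, training techniques, inference acceleration, and AI for Science, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a dynamic data curation pipeline and the VLM-DeflectionBench benchmark to evaluate whether large vision-language models deflect rather than hallucinate when retrieved evidence is conflicting or incomplete. Experiments show that current models usually fail to deflect when the evidence is noisy or misleading.

Abstract (Chinese translation)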

大型视觉语言模型（LVLMs）日益依赖检索机制来回答知识密集型多模态问题。现有基准测试忽略了视觉与文本证据间的冲突，以及在检索知识不完整时生成推拒性回答（例如“抱歉，我无法回答…”）的重要性。这些基准测试还面临快速过时的问题，因为不断扩大的LVLM训练集使得模型无需检索即可回答许多问题。我们通过三项贡献来解决这些不足。首先，我们提出一种动态数据筛选流程，通过筛选真正依赖检索的样本来维持基准测试的长期难度。其次，我们推出VLM-DeflectionBench基准测试，该测试包含2,775个涵盖多样化多模态检索场景的样本，旨在探究模型在证据冲突或不足时的行为表现。第三，我们制定了细粒度的四场景评估协议，以区分参数记忆与检索鲁棒性。对20个前沿LVLM的实验表明，面对噪声或误导性证据时，模型通常无法有效生成推拒回答。我们的研究结果强调，评估不仅需要关注模型已知什么，更需考察其在知识不足时的行为表现。本工作为可靠的基于知识的视觉问答（KB-VQA）评估提供了可复用、可扩展的基准测试框架。所有资源将在论文发表后公开提供。

Abstract

Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., Sorry, I cannot answer…) when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.

Keywords: Large Vision-Language Models, retrieval-augmented generation, hallucination mitigation, benchmarking, deflection, multimodal retrieval, knowledge-intensive QA, parametric memorization

186. ❌ UCS: Estimating Unseen Coverage for Improved In-Context Learning

Authors: Jiayi Xin, Xiang Li, Evan Qiang, Weiqing He, Tianqi Shang, Weijie J. Su, Qi Long Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12015v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	15.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

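Scoring rationale: The paper's core subject is demonstration selection for in-context learning (ICL), so it is highly relevant to ‘In-context Learning OR Many-shot Learning’ (15 points), which is its direct topic. It runs experiments with frontier large language models, which relates to ‘Large Language Models OR LLMs OR Foundation Models’ (10 points). It evaluates on reasoning benchmarks, which is partly related to ‘Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’ (5 points). Other keywords, such as MoE, SFT, and RAG, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes UCS, a training-free demonstration selection method that improves in-context learning by estimating the latent clusters not yet covered by the selected subset, raising ICL accuracy by up to 2-6% across multiple benchmarks.

Abstract (Chinese translation)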

上下文学习（ICL）的性能在很大程度上取决于提示中放置哪些示例，然而现有的大多数选择器优先考虑相关性或多样性的启发式概念，对示例集的覆盖范围提供有限洞察。我们提出未见覆盖选择（UCS），这是一种无需训练、基于子集层级的覆盖先验方法，其动机在于：一个好的示例集应使模型接触到当前所选子集未揭示的潜在聚类。UCS通过以下方式实现这一思想：（1）从模型一致的嵌入中归纳离散的潜在聚类；（2）通过基于经验频率谱的平滑古德-图灵估计器，估计候选子集中未揭示聚类的数量。与以往的选择方法不同，UCS基于覆盖范围且无需训练，并能通过简单的正则化目标，无缝结合查询依赖型和查询无关型选择基线。在多个意图分类和推理基准上使用前沿大语言模型进行的实验表明，在相同选择预算下，用UCS增强强基线方法能持续将ICL准确率提升高达2-6%，同时还能揭示任务级和模型级的潜在聚类分布。代码发布于 https://github.com/Raina-Xin/UCS。

Abstract

In-context learning (ICL) performance depends critically on which demonstrations are placed in the prompt, yet most existing selectors prioritize heuristic notions of relevance or diversity and provide limited insight into the coverage of a demonstration set. We propose Unseen Coverage Selection (UCS), a training-free, subset-level coverage prior motivated by the principle that a good demonstration set should expose the model to latent clusters unrevealed by the currently selected subset. UCS operationalizes this idea by (1) inducing discrete latent clusters from model-consistent embeddings and (2) estimating the number of unrevealed clusters within a candidate subset via a Smoothed Good–Turing estimator from its empirical frequency spectrum. Unlike previous selection methods, UCS is coverage-based and training-free, and can be seamlessly combined with both query-dependent and query-independent selection baselines via a simple regularized objective. Experiments on multiple intent-classification and reasoning benchmarks with frontier Large Language Models show that augmenting strong baselines with UCS consistently improves ICL accuracy by up to 2-6% under the same selection budget, while also yielding insights into task- and model-level latent cluster distributions. Code is available at https://github.com/Raina-Xin/UCS.
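
To make the "empirical frequency spectrum" concrete, here is a minimal sketch that computes the spectrum of cluster assignments in a subset and the classic, unsmoothed Good–Turing estimate of unseen probability mass (singletons divided by sample size). This is a simpler, related quantity, not the paper's Smoothed Good–Turing estimator of the number of unrevealed clusters; the function names and the toy cluster labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def frequency_spectrum(cluster_ids: Sequence[Hashable]) -> Dict[int, int]:
    """Map r -> number of clusters that occur exactly r times in the subset."""
    per_cluster = Counter(cluster_ids)
    return dict(Counter(per_cluster.values()))


def good_turing_unseen_mass(cluster_ids: Sequence[Hashable]) -> float:
    """Classic Good-Turing estimate of unseen probability mass: N1 / n."""
    n = len(cluster_ids)
    if n == 0:
        return 1.0  # nothing observed yet, so all mass is unseen
    singletons = frequency_spectrum(cluster_ids).get(1, 0)
    return singletons / n


# Toy subset: each entry is the latent-cluster label of one demonstration.
subset = ["a", "a", "b", "c", "c", "c", "d"]
print(frequency_spectrum(subset))
print(good_turing_unseen_mass(subset))
```

A subset dominated by singleton clusters yields a high estimate, suggesting that much of the latent structure is still unrevealed.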

Keywords: In-context Learning, Demonstration Selection, Coverage Estimation, Large Language Models, Latent Clusters, Unseen Coverage, ICL Accuracy, Training-free Method

187. ❌ Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Authors: Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12002v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

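Scoring rationale: The paper proposes SD-Zero, a post-training technique for large language models (LLMs) that uses self-distillation to turn sparse binary rewards into dense supervision, with a self-correction mechanism at its core. It is therefore highly relevant to ‘Large Language Models OR LLMs OR Foundation Models’, ‘Post-training OR Supervised Fine-tuning OR SFT’, and ‘Self-Correction OR Self-Improvement OR Self-Reflection’ (10 points each). Other keywords, such as MoE, SLMs, Scaling Laws, RAG, CoT, and Agents, are either not mentioned in the abstract or unrelated to the paper's topic and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes SD-Zero, a post-training method in which a single model acts as both generator and reviser, turning sparse binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks it improves performance by at least 10% over the base models.

Abstract (Chinese translation)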

当前可验证环境下的后训练方法主要分为两类。强化学习方法（RLVR）依赖二元奖励信号，这种方法具有广泛适用性和强大能力，但在训练过程中仅能提供稀疏的监督信号。蒸馏方法则提供密集的令牌级监督，通常需要从外部教师模型或高质量演示中获取。收集此类监督信号成本高昂或难以实现。我们提出自蒸馏零样本方法（SD-Zero），该方法在训练样本效率上显著优于强化学习，且无需外部教师模型或高质量演示。SD-Zero训练单一模型承担双重角色：生成器负责产生初始响应，修订器则基于该响应及其二元奖励信号生成改进后的响应。随后我们通过同策略自蒸馏，将修订器蒸馏至生成器——以修订器在生成器响应及其奖励条件下的令牌分布作为监督信号。本质上，SD-Zero训练模型将二元奖励转化为密集的令牌级自监督信号。在数学与代码推理基准测试中，使用Qwen3-4B-Instruct和Olmo-3-7B-Instruct模型时，SD-Zero在相同问题集和训练样本预算下，相比基础模型性能提升至少10%，并超越包括拒绝微调（RFT）、GRPO和自蒸馏微调（SDFT）在内的强基线方法。大量消融实验揭示了所提算法的两个新颖特性：（a）令牌级自定位能力——修订器能根据奖励信号识别生成器响应中需要修正的关键令牌；（b）迭代自进化能力——通过定期教师模型同步，改进答案修订的能力可被蒸馏回生成性能中。

Abstract

Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser’s token distributions conditioned on the generator’s response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator’s response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization.

Keywords: Self-Distillation Zero, post-training, binary rewards, dense supervision, self-revision, token-level self-supervision, math reasoning, code reasoning

188. ❌ Filtered Reasoning Score: Evaluating Reasoning Quality on a Model’s Most-Confident Traces

Authors: Manas Pathak, Xingyao Chen, Shuozhe Li, Amy Zhang, Liu Leqi Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11996v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

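Scoring rationale: The paper focuses on evaluating the reasoning quality of large language models (LLMs) rather than developing a new technique. It deals directly with LLMs and reasoning evaluation (for example, Chain of Thought), so those keywords score high. Self-Correction, Factuality, and Explainable AI relate to the paper's evaluation dimensions but are not central, so they score in the middle range. The remaining keywords (MoE, SFT, RAG, and so on) are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Filtered Reasoning Score (FRS), a metric that evaluates the reasoning quality of large language models beyond accuracy-based evaluation, and finds that FRS separates models whose standard accuracy is indistinguishable.

Abstract (Chinese translation)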

我们是否应该信任高准确率的大型语言模型（LLMs）？LLMs在推理基准测试中取得了高准确率，但仅凭正确性无法揭示其生成答案所依赖的推理过程的质量。这凸显了基于结果的评估方法的一个根本局限：模型可能通过有缺陷的推理得出正确答案，而推理能力差异显著的模型仍可能表现出相似的基准准确率，例如由于记忆效应或过度优化。在本文中，我们提出：基于现有基准，我们能否超越结果评估，转而评估推理过程本身的质量？我们寻求能够（1）区分准确率相近的模型，且（2）对输入提示和生成配置的变化具有鲁棒性的评估指标。为此，我们提出一种推理评分，该评分从忠实性、连贯性、实用性和事实性等维度评估推理轨迹。一个遗留问题是如何在多个采样轨迹上聚合此评分。简单平均这些评分并不可取，尤其在长程推理场景中，可能的轨迹数量快速增长，而低置信度的正确轨迹更可能是偶然产生的。为解决此问题，我们引入过滤推理评分（Filtered Reasoning Score, FRS），该评分仅使用置信度最高的前K%轨迹来计算推理质量。使用FRS进行评估时，在标准准确率下无法区分的模型在推理质量上表现出显著差异。此外，在一个基准上获得更高FRS的模型，在其他推理基准的准确率和推理质量上也往往表现更好。这些发现共同表明，FRS通过捕捉模型的可迁移推理能力，对准确率评估形成了补充。我们开源了评估代码库：https://github.com/Manas2006/benchmark_reproducibility。

Abstract

Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome-based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy, for example due to memorization or over-optimization. In this paper, we ask: given existing benchmarks, can we move beyond outcome-based evaluation to assess the quality of reasoning itself? We seek metrics that (1) differentiate models with similar accuracy and (2) are robust to variations in input prompts and generation configurations. To this end, we propose a reasoning score that evaluates reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality. A remaining question is how to aggregate this score across multiple sampled traces. Naively averaging them is undesirable, particularly in long-horizon settings, where the number of possible trajectories grows rapidly, and low-confidence correct traces are more likely to be coincidental. To address this, we introduce the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top-K% most confident traces. Evaluating with FRS, models that are indistinguishable under standard accuracy exhibit significant differences in reasoning quality. Moreover, models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks, in both accuracy and reasoning quality. Together, these findings suggest that FRS complements accuracy by capturing a model’s transferable reasoning capabilities. We open source our evaluation codebase: https://github.com/Manas2006/benchmark_reproducibility.
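
The aggregation idea behind FRS, scoring only the top-K% most confident traces, can be sketched as follows. The sketch assumes each sampled trace already has a confidence and a reasoning score, and that the filtered scores are combined by a plain mean; the abstract does not specify either detail, and the function name and toy numbers are hypothetical.

```python
import math
from typing import List, Tuple


def filtered_reasoning_score(
    traces: List[Tuple[float, float]], top_k_percent: float
) -> float:
    """Mean reasoning score over the top-K% most confident traces.

    Each trace is a (confidence, reasoning_score) pair.
    """
    if not traces:
        raise ValueError("need at least one trace")
    if not 0 < top_k_percent <= 100:
        raise ValueError("top_k_percent must be in (0, 100]")
    # Always keep at least one trace, rounding the cut-off up.
    keep = max(1, math.ceil(len(traces) * top_k_percent / 100))
    most_confident = sorted(traces, key=lambda t: t[0], reverse=True)[:keep]
    return sum(score for _, score in most_confident) / keep


traces = [(0.9, 0.8), (0.2, 0.1), (0.7, 0.6), (0.4, 0.3)]
print(filtered_reasoning_score(traces, 50))   # uses the two most confident traces
print(filtered_reasoning_score(traces, 100))  # plain average over all traces
```

With K = 100 the function reduces to the naive average the abstract argues against; smaller K discards the low-confidence traces that are more likely to be coincidentally correct.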

Keywords: Large Language Models, reasoning evaluation, Filtered Reasoning Score, reasoning traces, benchmark accuracy, faithfulness, coherence, utility

189. ❌ INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

Authors: Somraj Gautam, Anathapindika Dravichi, Gaurav Harit Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11970v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

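Scoring rationale: The paper presents a benchmark for cross-lingual table visual question answering (VQA) and covers both evaluation and fine-tuning of vision-language models (VLMs). Relevance by keyword: (1) it evaluates models such as GPT-4o, which counts as an application of large models (5 points); (2) it explicitly fine-tunes a 7B model with LoRA, which is highly relevant to PEFT/LoRA (10 points); (3) it improves performance through fine-tuning, which relates to supervised fine-tuning (8 points); (4) other keywords, such as MoE, quantization, and inference acceleration, are not covered and receive 0 points. The paper is an application study of large models in a specific domain (document understanding) and does not go deep into new technical principles.

!!! tip deepseek-chat TL;DR

The paper introduces INDOTABVQA, a benchmark dataset for cross-lingual table visual question answering on Bahasa Indonesia documents, and shows that fine-tuning vision-language models on it substantially improves table understanding performance.

Abstract (Chinese translation)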

我们推出INDOTABVQA，这是一个用于评估印度尼西亚语（Bahasa Indonesia）真实世界文档图像上跨语言表格视觉问答（Table Visual Question Answering, VQA）的基准数据集。该数据集包含1,593张文档图像，涵盖三种视觉样式（带边框、无边框和彩色样式），每张图像包含一个或多个表格，并配有1,593组四种语言（印度尼西亚语、英语、印地语和阿拉伯语）的问题-答案对。这使得我们能够在单语（印度尼西亚语文档配印度尼西亚语问题）和跨语言（印度尼西亚语文档配其他语言问题）两种设置下评估视觉-语言模型（Vision-Language Models, VLMs）的性能。我们对领先的开源VLMs（Qwen2.5-VL、Gemma-3、LLaMA-3.2）以及GPT-4o进行了基准测试，结果揭示了显著的性能差距，尤其是在结构复杂的表格和低资源语言上。使用我们的数据集对一个紧凑的30亿参数模型进行微调，并对一个70亿参数模型进行LoRA微调，分别实现了11.6%和17.8%的准确率提升。将显式的表格区域坐标作为额外输入，可进一步将性能提高4-7%，这证明了空间先验（Spatial priors）对于基于表格的推理具有重要价值。我们的研究结果强调了语言多样化、领域特定数据集的重要性，并表明有针对性的微调可以显著提升VLM在专业文档理解任务上的性能。INDOTABVQA为推进跨语言、结构感知的文档理解研究，特别是在世界范围内代表性不足的地区，提供了一个宝贵的资源。完整数据集可在huggingface访问：https://huggingface.co/datasets/NusaBharat/INDOTABVQA。

Abstract

We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful), each containing one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B model and LoRA fine-tuning a 7B model on our dataset yield 11.6% and 17.8% improvements in accuracy, respectively. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. The full dataset is available on Hugging Face at: https://huggingface.co/datasets/NusaBharat/INDOTABVQA

Keywords: Table Visual Question Answering, Cross-lingual, Vision-Language Models, Fine-tuning, LoRA, Document Understanding, Bahasa Indonesia, Benchmark Dataset

190. ❌ GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses

Authors: Jimin Mun, Chani Jung, Xuhui Zhou, Hyunwoo Kim, Maarten Sap Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11924v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

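Scoring rationale: The paper studies an application of LLMs in the scientific domain (AI for Science), focusing on generating constructive feedback for academic papers. Methodologically, it explicitly uses supervised fine-tuning (SFT) and preference optimization (techniques related to RLHF/DPO), so these keywords are highly relevant (10 points). Other keywords such as MoE, quantization, and inference acceleration are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper studies how to use LLMs to generate valid and actionable feedback on academic papers. By building a dataset and a training recipe that combines SFT with preference optimization, it substantially improves feedback quality and sets a new state of the art among similarly sized LLMs on an ICLR paper benchmark.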

摘要翻译

尽管大语言模型在变革科学研究方面具有巨大潜力，我们主张其应用应旨在增强和赋能研究者，而非在无人监督的情况下自动化研究过程。为此，我们研究建设性反馈生成这一任务，即生成具有针对性、可操作的反馈，以帮助作者改进其研究内容及呈现方式。在本工作中，我们从两个以作者为中心的维度——有效性（validity）和作者行动（author action）——来具体衡量反馈的效用。我们首先构建了GoodPoint-ICLR数据集，该数据集包含1.9万篇ICLR论文，并利用作者回复对每篇论文的审稿人反馈在上述两个维度上进行了标注。在此基础上，我们提出了GoodPoint训练方案，该方案通过微调模型生成有效且可操作的反馈（利用作者回复中的成功信号），并结合在真实与合成偏好对上进行偏好优化，以提升性能。我们在包含1200篇ICLR论文的基准测试上评估表明，经过GoodPoint训练的Qwen3-8B模型，其预测成功率相比基础模型提升了83.7%，并在一个高质量人工反馈集上的反馈匹配任务中，为同类规模的大语言模型设立了新的性能标杆，甚至在精确度上超越了Gemini-3-flash模型。我们进一步通过专家人工研究验证了这些发现，证明GoodPoint生成的反馈在作者感知的实际价值方面持续表现更优。

Abstract

While LLMs hold significant potential to transform scientific research, we advocate for their use to augment and empower researchers rather than to automate research without human oversight. To this end, we study constructive feedback generation, the task of producing targeted, actionable feedback that helps authors improve both their research and its presentation. In this work, we operationalize the effectiveness of feedback along two author-centric axes-validity and author action. We first curate GoodPoint-ICLR, a dataset of 19K ICLR papers with reviewer feedback annotated along both dimensions using author responses. Building on this, we introduce GoodPoint, a training recipe that leverages success signals from author responses through fine-tuning on valid and actionable feedback, together with preference optimization on both real and synthetic preference pairs. Our evaluation on a benchmark of 1.2K ICLR papers shows that a GoodPoint-trained Qwen3-8B improves the predicted success rate by 83.7% over the base model and sets a new state-of-the-art among LLMs of similar size in feedback matching on a golden human feedback set, even surpassing Gemini-3-flash in precision. We further validate these findings through an expert human study, demonstrating that GoodPoint consistently delivers higher practical value as perceived by authors.

Keywords: LLMs, scientific research, constructive feedback, fine-tuning, preference optimization, ICLR, author responses, feedback generation

191. ❌ AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

Authors: Zijie Zhao, Chenyuan Yang, Weidong Wang, Yihan Yang, Ziqi Zhang, Lingming Zhang Journal/Source: arXiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11950v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

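Scoring rationale: The paper proposes the AnyPoC framework, whose core is using LLM-based agents for software bug detection and validation, so it is highly relevant to 'Large Language Models', 'LLM Agents', and 'Multi-agent Systems' (10 points). The framework mitigates hallucination through iterative execution and independent verification, making it highly relevant to 'Hallucination Mitigation' (10 points). It involves tool use (executing PoC tests) and a self-correction mechanism, so it is relevant to 'Tool Use' (8 points) and 'Self-Correction' (8 points). Its multi-step reasoning and in-depth analysis relate somewhat to 'Chain of Thought' and 'System 2 Thinking' (5 points). Other keywords such as MoE, quantization, and AI-for-science applications are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses the need to manually validate candidate reports in LLM-based bug detection. It proposes AnyPoC, a multi-agent framework that automatically generates executable proof-of-concept tests for validation. Across 12 large software systems it found 122 new bugs, producing 1.3x more valid PoCs and rejecting 9.8x more false-positive reports than state-of-the-art coding agents.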

摘要翻译

尽管当前基于大语言模型（LLM）的智能体能够识别源代码中的大量潜在缺陷，其生成的报告仍停留在静态假设层面，需要人工验证，这限制了自动化缺陷检测的实用性。我们将此挑战构建为一个测试生成任务：给定一份候选缺陷报告，合成一个可执行的概念验证测试（PoC），例如脚本、命令序列或精心构造的输入，以触发疑似缺陷。自动化的PoC生成可作为一种可扩展的验证机制，通过提供具体的执行证据，实现端到端的自主缺陷检测。然而，未经优化的大语言模型智能体作为验证器并不可靠：它们倾向于“成功”结果，可能通过生成看似合理但无法运行的PoC甚至虚构执行轨迹来进行奖励攻击。为解决这一问题，我们提出了AnyPoC，一个通用的多智能体框架，该框架能够（1）分析并事实核查候选缺陷报告；（2）迭代式合成并执行PoC，同时收集执行轨迹；（3）独立地重新执行并严格审查PoC，以减轻幻觉现象和奖励攻击。此外，AnyPoC持续提取并演化一个PoC知识库，以应对异构任务。AnyPoC可处理任意来源的候选缺陷报告，并能与不同的缺陷报告生成器结合使用。为证明其实用性与通用性，我们将AnyPoC与一个简易的智能缺陷报告生成器结合，应用于涵盖多种语言/领域的12个关键软件系统（其中许多系统拥有数百万行代码），包括Firefox、Chromium、LLVM、OpenSSL、SQLite、FFmpeg和Redis。与最先进的代码智能体（如Claude Code和Codex）相比，AnyPoC为真实缺陷报告生成的可用PoC数量提升1.3倍，同时能排除9.8倍以上的误报缺陷报告。截至目前，AnyPoC已发现122个新缺陷（其中105个已确认，86个已被修复），并有45个生成的PoC被采纳为官方回归测试。

Abstract

While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesizing an executable proof-of-concept test, or simply a PoC - such as a script, command sequence, or crafted input - to trigger the suspected defect. Automated PoC generation can act as a scalable validation oracle, enabling end-to-end autonomous bug detection by providing concrete execution evidence. However, naive LLM agents are unreliable validators: they are biased toward “success” and may reward-hack by producing plausible but non-functional PoCs or even hallucinated traces. To address this, we present AnyPoC, a general multi-agent framework that (1) analyzes and fact-checks a candidate bug report, (2) iteratively synthesizes and executes a PoC while collecting execution traces, and (3) independently re-executes and scrutinizes the PoC to mitigate hallucination and reward hacking. In addition, AnyPoC also continuously extracts and evolves a PoC knowledge base to handle heterogeneous tasks. AnyPoC operates on candidate bug reports regardless of their source and can be paired with different bug reporters. To demonstrate practicality and generality, we apply AnyPoC, with a simple agentic bug reporter, on 12 critical software systems across diverse languages/domains (many with millions of lines of code) including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis. Compared to the state-of-the-art coding agents, e.g., Claude Code and Codex, AnyPoC produces 1.3x more valid PoCs for true-positive bug reports and rejects 9.8x more false-positive bug reports. To date, AnyPoC has discovered 122 new bugs (105 confirmed, 86 already fixed), with 45 generated PoCs adopted as official regression tests.

Keywords: LLM-based bug detection, proof-of-concept test generation, multi-agent framework, hallucination mitigation, autonomous validation, software testing, execution evidence, reward hacking prevention
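
Step (3) of the AnyPoC abstract, independent re-execution of a PoC, can be illustrated with a small sketch. The abstract does not describe how AnyPoC performs this check, so everything below is an assumption for illustration: the PoC is taken to be a Python snippet, `reexecute` and `validate` are hypothetical names, and "the claimed error string appears in stderr of a fresh run" is a simplified acceptance rule, not the framework's actual logic.

```python
import subprocess
import sys

def reexecute(poc_source, timeout=10):
    """Run a Python PoC in a fresh interpreter and capture what really happened."""
    proc = subprocess.run(
        [sys.executable, "-c", poc_source],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode, proc.stderr

def validate(poc_source, claimed_error):
    """Accept the PoC only if the independent run shows the claimed failure.

    Assumed rule: the process must exit non-zero AND the claimed error name
    must appear in the real stderr, so a trace that exists only in the
    agent's report is rejected.
    """
    returncode, stderr = reexecute(poc_source)
    return returncode != 0 and claimed_error in stderr

real_poc = "int('not a number')"        # really raises ValueError
fake_poc = "print('crash reproduced')"  # only claims a crash
print(validate(real_poc, "ValueError"), validate(fake_poc, "ValueError"))
```

The point of the design is that the verdict depends only on evidence produced by the re-run, never on text supplied by the agent that wrote the PoC.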

192. ❌ Lyra 2.0: Explorable Generative 3D Worlds

Authors: Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, Xuanchi Ren Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13036v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

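Scoring rationale: The paper studies 3D scene generation and video generation, specifically addressing spatial forgetting and temporal drifting in long-trajectory video generation, and proposes the Lyra 2.0 framework. Its content is entirely focused on computer vision, 3D reconstruction, and video generation, and it does not involve large language models, deep learning technical principles, or applications of AI in science. All scoring keywords concern large language models, deep learning techniques, and AI-for-science applications, which have no direct connection to the paper's 3D visual generation topic, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper tackles spatial forgetting and temporal drifting in long-trajectory, 3D-consistent video generation. It proposes the Lyra 2.0 framework, which uses geometry-based information routing and training on self-augmented histories to produce longer 3D-consistent video trajectories, which are then used for high-quality 3D scene reconstruction.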

摘要翻译

视频生成领域的最新进展为三维场景创建提供了一种新范式：通过生成可控制摄像机的视频来模拟场景漫游，再借助前馈式重建技术将其提升为三维模型。这种生成式重建方法结合了视频模型的视觉保真度、创作能力与可直接用于实时渲染与仿真的三维输出。要扩展至大规模复杂环境，需要在长摄像机轨迹上实现三维一致性的视频生成，其中涉及大幅视角变化和位置重访——当前视频模型在此类场景中性能会迅速衰退。现有长序列生成方法主要受限于两种退化形式：空间遗忘与时间漂移。随着探索推进，先前观测到的区域会超出模型的时间上下文范围，导致模型在重访时被迫虚构结构；同时，自回归生成过程中微小的合成误差会随时间累积，逐渐扭曲场景外观与几何结构。我们提出Lyra 2.0框架，用于大规模生成具有持久性与可探索性的三维世界。针对空间遗忘问题，我们维护逐帧三维几何信息，并仅将其用于信息路由——检索相关的历史帧并与目标视点建立密集对应关系——而外观合成仍依赖生成先验。针对时间漂移问题，我们通过自增强历史数据进行训练，使模型接触自身退化的输出，从而学会纠正而非传播漂移。这些技术共同实现了显著延长且保持三维一致性的视频轨迹，我们利用其微调前馈式重建模型，从而可靠地复原高质量三维场景。

Abstract

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model’s temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing – retrieving relevant past frames and establishing dense correspondences with the target viewpoints – while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.

Keywords: 3D scene generation, video generation, long-horizon generation, spatial forgetting, temporal drifting, generative reconstruction, 3D-consistent video, feed-forward reconstruction

193. ❌ Generative Refinement Networks (GRN)

Authors: Jian Han, Jinlai Liu, Jiahuan Wang, Bingyue Peng, Zehuan Yuan Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13030v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

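Scoring rationale: The paper focuses on visual generation and proposes a new generative modeling paradigm, Generative Refinement Networks (GRN), mainly to address the computational inefficiency of diffusion models and the discretization loss of autoregressive models. Its core contributions are: (1) Hierarchical Binary Quantization (HBQ) to reduce discretization loss; and (2) a global refinement mechanism and an entropy-guided sampling strategy for adaptive-step generation. The paper is entirely unrelated to the vast majority of keywords (large-model technical principles, training methods, inference optimization, alignment, agents, and so on). The only marginally related keyword is 'Quantization OR Model Compression OR Low-bit Weights', because the paper mentions Hierarchical Binary Quantization (HBQ), a quantization technique. However, the paper applies it to reducing loss in visual generation rather than to general model compression or low-bit weight optimization, so it receives 5 points (some relevance). No other keywords are covered.

!!! tip deepseek-chat TL;DR

The paper proposes a new visual generation paradigm called Generative Refinement Networks (GRN). It reduces discretization loss through Hierarchical Binary Quantization (HBQ) and combines a global refinement mechanism with entropy-guided sampling, achieving SOTA image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID) on the ImageNet benchmark, and it scales to text-to-image and text-to-video generation.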

摘要翻译

尽管扩散模型在视觉生成领域占据主导地位，但其计算效率低下，无论内容复杂度如何都采用统一的计算开销。相比之下，自回归模型本质上是复杂度感知的，其可变的似然值即为明证，但这类模型常受限于有损的离散化标记和误差累积问题。在本研究中，我们引入了生成式精修网络，作为下一代视觉合成范式以解决上述问题。其核心在于，GRN通过理论上近乎无损的层次化二值量化解决了离散化标记的瓶颈，实现了与连续表示相媲美的重建质量。基于HBQ的潜在空间，GRN通过全局精修机制从根本上改进了自回归生成过程——该机制能像人类艺术家作画般逐步完善和修正作品。此外，GRN整合了熵引导的采样策略，实现了复杂度感知的自适应步长生成，且不损害视觉质量。在ImageNet基准测试中，GRN在图像重建和类别条件图像生成任务上分别创造了新纪录。我们还将GRN扩展至更具挑战性的文本到图像与文本到视频生成任务，在同等规模下实现了更优的性能。我们将公开所有模型与代码，以促进对GRN的进一步研究。

Abstract

While diffusion models dominate the field of visual generation, they are computationally inefficient, applying a uniform computational effort regardless of different complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these issues. At its core, GRN addresses the discrete tokenization bottleneck through a theoretically near-lossless Hierarchical Binary Quantization (HBQ), achieving a reconstruction quality comparable to continuous counterparts. Built upon HBQ’s latent space, GRN fundamentally upgrades AR generation with a global refinement mechanism that progressively perfects and corrects artworks – like a human artist painting. Besides, GRN integrates an entropy-guided sampling strategy, enabling complexity-aware, adaptive-step generation without compromising visual quality. On the ImageNet benchmark, GRN establishes new records in image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID). We also scale GRN to more challenging text-to-image and text-to-video generation, delivering superior performance on an equivalent scale. We release all models and code to foster further research on GRN.

Keywords: Generative Refinement Networks, visual synthesis, Hierarchical Binary Quantization, autoregressive models, diffusion models, image generation, text-to-image generation, entropy-guided sampling
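
The abstract describes HBQ only as a hierarchical, near-lossless binary quantizer and gives no construction. As a generic illustration of why stacking binary levels can make discretization error small, the sketch below uses plain residual sign quantization of a scalar in [-1, 1]. This is an assumed stand-in, not the paper's HBQ, and `binary_encode` / `binary_decode` are hypothetical names.

```python
def binary_encode(x, levels):
    """Encode x in [-1, 1] as `levels` values in {-1, +1}.

    Each level stores the sign of the current residual and removes a
    step of half the previous size, so the residual shrinks per level.
    """
    bits, residual, step = [], x, 0.5
    for _ in range(levels):
        bit = 1 if residual >= 0 else -1
        bits.append(bit)
        residual -= bit * step
        step /= 2
    return bits

def binary_decode(bits):
    """Rebuild the value by summing the signed steps of every level."""
    step, value = 0.5, 0.0
    for bit in bits:
        value += bit * step
        step /= 2
    return value

code = binary_encode(0.3, levels=8)
print(code, abs(binary_decode(code) - 0.3))
```

In this toy scheme the reconstruction error after `levels` levels is at most `2 ** -levels`, which is the sense in which adding binary levels approaches a continuous representation.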

194. ❌ Conflated Inverse Modeling to Generate Diverse and Temperature-Change Inducing Urban Vegetation Patterns

Authors: Baris Sarper Tezcan, Hrishikesh Viswanath, Rubab Saher, Daniel Aliaga Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13028v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

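Scoring rationale: The paper studies the inverse modeling problem of urban heat islands and vegetation configuration, using a diffusion-based generative model and a predictive forward model. It is an application of AI in science (specifically urban climate adaptation), so it has some relevance only to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) and is entirely unrelated to all other keywords on large models, deep learning principles, training methods, inference optimization, agent systems, and so on (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a conflated inverse modeling framework that combines a predictive forward model with a diffusion-based generative inverse model to generate diverse, physically plausible spatial vegetation configurations that meet specific urban temperature targets.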

摘要翻译

城市地区在快速城市化与气候变化的双重驱动下，正日益面临极端高温的威胁。传统上，极端高温主要通过地球观测卫星和数值模拟框架进行监测。例如，从 Landsat 或 Sentinel 影像反演的地表温度常被用于刻画地表加热格局。这些方法作为前向模型运行，将辐射观测或模拟的边界条件转化为地表热状态的估计值。尽管前向模型能够基于植被和城市形态预测地表温度，但其逆问题——即确定能够实现特定区域温度变化目标的空间植被配置——在很大程度上仍未得到探索。该任务本质上是欠定的，因为多种空间植被格局可能产生相似的聚合温度响应。传统的回归方法和确定性神经网络难以捕捉这种模糊性，且往往给出平均化解，在数据稀缺条件下尤其如此。我们提出了一种融合逆建模框架，它将预测性前向模型与基于扩散的生成式逆模型相结合，能够根据特定的温度目标生成多样化、物理合理的图像化植被格局。该框架在保持对热环境结果控制的同时，支持生成多样化的空间植被配置，即使此类组合未在训练数据中出现。总体而言，本研究提出了一种可控的逆建模方法，用于城市气候适应规划，并兼顾了问题固有的多样性。代码已发布于 GitHub 仓库。

Abstract

Urban areas are increasingly vulnerable to thermal extremes driven by rapid urbanization and climate change. Traditionally, thermal extremes have been monitored using Earth-observing satellites and numerical modeling frameworks. For example, land surface temperature derived from Landsat or Sentinel imagery is commonly used to characterize surface heating patterns. These approaches operate as forward models, translating radiative observations or modeled boundary conditions into estimates of surface thermal states. While forward models can predict land surface temperature from vegetation and urban form, the inverse problem of determining spatial vegetation configurations that achieve a desired regional temperature shift remains largely unexplored. This task is inherently underdetermined, as multiple spatial vegetation patterns can yield similar aggregated temperature responses. Conventional regression and deterministic neural networks fail to capture this ambiguity and often produce averaged solutions, particularly under data-scarce conditions. We propose a conflated inverse modeling framework that combines a predictive forward model with a diffusion-based generative inverse model to produce diverse, physically plausible image-based vegetation patterns conditioned on specific temperature goals. Our framework maintains control over thermal outcomes while enabling diverse spatial vegetation configurations, even when such combinations are absent from training data. Altogether, this work introduces a controllable inverse modeling approach for urban climate adaptation that accounts for the inherent diversity of the problem. Code is available at the GitHub repository.

Keywords: urban vegetation patterns, inverse modeling, diffusion-based generative model, temperature change, urban climate adaptation, thermal extremes, land surface temperature, physically plausible
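
The abstract's observation that the inverse problem is underdetermined (many vegetation patterns give the same aggregated temperature response) can be made concrete with a toy sketch. Everything here is assumed for illustration: the linear `forward_model`, the random `sample_pattern` standing in for the diffusion-based inverse model, and the rule of keeping only candidates whose forward prediction matches the target. None of it reflects the paper's actual models.

```python
import random

def forward_model(grid, cooling_per_cell=0.1):
    """Toy forward model: each vegetated cell shifts temperature by -0.1."""
    return -cooling_per_cell * sum(sum(row) for row in grid)

def sample_pattern(rng, size=4, density=0.5):
    """Stand-in for a generative inverse model: a random binary grid."""
    return [[1 if rng.random() < density else 0 for _ in range(size)]
            for _ in range(size)]

def inverse_candidates(target_shift, n_samples=2000, tol=0.05, seed=0):
    """Keep distinct sampled patterns whose predicted shift hits the target."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        grid = sample_pattern(rng)
        if abs(forward_model(grid) - target_shift) <= tol and grid not in kept:
            kept.append(grid)
    return kept

candidates = inverse_candidates(target_shift=-0.8)
print(len(candidates), "distinct patterns reach the same target")
```

A deterministic regressor would have to pick (or average) one of these patterns, whereas a generative inverse model checked against a forward model can return many of them.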

195. ❌ See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

Authors: Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso, Yu Hu Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13019v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

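Scoring rationale: The paper studies GUI grounding for Computer Use Agents (CUAs); its core is multi-turn iterative refinement with visual feedback. Highly relevant keywords: Self-Correction/Self-Improvement/Self-Reflection (the core mechanism), LLM Agents/Autonomous Agents/Agentic Workflow (the subject of study), and Chain of Thought/CoT Reasoning/Multi-step Reasoning and System 2 Thinking/Slow Thinking/In-depth Reasoning (the iterative reasoning process). Moderately relevant: Large Language Models/LLMs/Foundation Models (GPT-5.4, Claude, and Qwen are used) and Tool Use/Function Calling/API Tool Use (GUI interaction as tool use). The remaining keywords (such as MoE, quantization, and RAG) are unrelated to the paper's technical details.

!!! tip deepseek-chat TL;DR

The paper addresses insufficient GUI grounding accuracy in dense coding interfaces. It proposes a multi-turn iterative refinement method based on visual feedback, which significantly improves the click precision and task success rate of computer use agents.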

摘要翻译

计算机使用代理（Computer Use Agents, CUAs）本质上依赖于图形用户界面（GUI）的接地能力，以将语言指令转化为可执行的屏幕操作，但在密集编码界面中——此类界面需要亚像素级精度才能与高密度的集成开发环境（IDE）元素交互——编辑级接地的研究仍显不足。现有方法通常依赖单次坐标预测，缺乏纠错机制，在高密度界面中往往失效。在本技术报告中，我们对编码环境中的像素级精确光标定位进行了实证研究。与单步执行不同，我们的代理采用迭代优化过程，利用先前尝试的视觉反馈来定位目标元素。这种闭环接地机制使代理能够自我校正位移误差并适应动态的用户界面变化。我们在GPT-5.4、Claude和Qwen模型上，通过一系列复杂编码基准测试评估了我们的方法，结果表明多轮优化在点击精度和整体任务成功率上均显著优于当前最先进的单次预测模型。我们的研究结果表明，迭代视觉推理是构建下一代可靠软件工程代理的关键组成部分。代码：https://github.com/microsoft/precision-cua-bench。

Abstract

Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.

Keywords: GUI grounding, Computer Use Agents, multi-turn refinement, visual feedback, self-correction, cursor localization, coding environments, iterative visual reasoning
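
The closed-loop grounding idea in the abstract (correct the cursor using visual feedback from the previous attempt instead of predicting once) can be sketched as a simple simulation. The abstract gives no algorithmic details, so this is only an assumed toy model: `perceived_offset` is a hypothetical stand-in for the model reading a screenshot, and the assumption that it sees 70% of the true displacement is invented for illustration.

```python
def perceived_offset(cursor, target, accuracy=0.7):
    """Hypothetical stand-in for the model inspecting a screenshot: it
    perceives only a fraction of the true cursor-to-target displacement."""
    return (accuracy * (target[0] - cursor[0]),
            accuracy * (target[1] - cursor[1]))

def refine_click(first_guess, target, max_turns=5, tolerance=1.0):
    """Move the cursor turn by turn until the perceived offset is small."""
    cursor = first_guess
    history = [cursor]
    for _ in range(max_turns):
        dx, dy = perceived_offset(cursor, target)
        if max(abs(dx), abs(dy)) <= tolerance:
            break  # the agent believes it is on target
        cursor = (cursor[0] + dx, cursor[1] + dy)
        history.append(cursor)
    return cursor, history

final, history = refine_click(first_guess=(100.0, 200.0), target=(112.0, 191.0))
print(final, len(history) - 1, "corrections")
```

Under this assumption each turn removes a fixed fraction of the remaining error, so a single-shot guess that is 12 pixels off ends up roughly one pixel from the target after two corrections.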

196. ❌ Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation

Authors: Nafis Fuad Shahid, Maroof Ahmed, Md Akib Haider, Saidur Rahman Sagor, Aashnan Rahman, Md Azam Hossain Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12970v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

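Scoring rationale: The paper focuses on multimodal federated learning in healthcare and proposes probabilistic feature imputation with uncertainty-aware aggregation. Among all keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' relates to the paper's medical AI application context. The paper does not involve large models, innovations in deep learning principles, or specific large-model techniques (such as LLMs, MoE, or training methods), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the risk of treating all imputed features as equally trustworthy when modalities are missing in multimodal federated learning. By proposing a probabilistic feature imputation network and an uncertainty-aware aggregation strategy, it improves federated chest X-ray classification, with a gain of up to +5.36% AUC over deterministic baselines in the most challenging configuration.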

摘要翻译

多模态联邦学习能够在医疗机构间实现隐私保护的协同模型训练。然而，模态异质性带来了根本性挑战：由于资源限制或工作流程差异，许多临床站点仅拥有部分模态数据。现有方法通过特征补全网络合成缺失模态表征来解决这一问题，但这些方法仅生成点估计而缺乏可靠性度量，迫使下游分类器将所有补全特征视为同等可信。在安全至上的医疗应用中，这一局限带来了显著风险。我们提出概率特征补全网络（Probabilistic Feature Imputation Network, P-FIN），该网络在输出补全特征的同时生成校准的不确定性估计。这种不确定性在两个层面被利用：（1）在本地层面，通过Sigmoid门控机制在分类前衰减不可靠的特征维度；（2）在全局层面，通过联邦不确定性加权平均（Fed-UQ-Avg）聚合策略，优先采用具有可靠补全能力的客户端更新。基于CheXpert、NIH Open-I和PadChest数据集进行的联邦胸部X光分类实验表明，该方法相较于确定性基线模型取得持续改进，在最具挑战性的配置中实现了+5.36%的AUC提升。

Abstract

Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises from modality heterogeneity: many clinical sites possess only a subset of modalities due to resource constraints or workflow variations. Existing approaches address this through feature imputation networks that synthesize missing modality representations, yet these methods produce point estimates without reliability measures, forcing downstream classifiers to treat all imputed features as equally trustworthy. In safety-critical medical applications, this limitation poses significant risks. We propose the Probabilistic Feature Imputation Network (P-FIN), which outputs calibrated uncertainty estimates alongside imputed features. This uncertainty is leveraged at two levels: (1) locally, through sigmoid gating that attenuates unreliable feature dimensions before classification, and (2) globally, through Fed-UQ-Avg, an aggregation strategy that prioritizes updates from clients with reliable imputation. Experiments on federated chest X-ray classification using CheXpert, NIH Open-I, and PadChest demonstrate consistent improvements over deterministic baselines, with +5.36% AUC gain in the most challenging configuration.

Keywords: Multimodal Federated Learning, Feature Imputation, Uncertainty Quantification, Healthcare AI, Chest X-ray Classification, Privacy-preserving Learning, Modality Heterogeneity, Probabilistic Modeling
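
The two uses of uncertainty named in the abstract, local sigmoid gating and uncertainty-aware aggregation (Fed-UQ-Avg), can be illustrated with a toy sketch. The abstract does not give the formulas, so the gate `sigmoid(-k * (sigma - tau))`, the inverse-uncertainty client weights, and all names (`gate_features`, `fed_uq_avg`, `k`, `tau`) are assumptions made for illustration, not the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_features(features, sigmas, k=5.0, tau=0.5):
    """Attenuate imputed feature dimensions whose predicted uncertainty
    (sigma) is high; confident dimensions pass through almost unchanged."""
    return [f * sigmoid(-k * (s - tau)) for f, s in zip(features, sigmas)]

def fed_uq_avg(client_updates, client_uncertainties, eps=1e-8):
    """Average client updates, weighting each client by the inverse of its
    mean imputation uncertainty (an assumed weighting rule)."""
    raw = [1.0 / (u + eps) for u in client_uncertainties]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(client_updates[0])
    merged = [sum(w * upd[i] for w, upd in zip(weights, client_updates))
              for i in range(dim)]
    return merged, weights

gated = gate_features([1.0, 1.0], sigmas=[0.1, 0.9])
merged, weights = fed_uq_avg([[1.0, 0.0], [0.0, 1.0]], [0.1, 0.4])
print(gated, weights, merged)
```

In this sketch a client with four times the imputation uncertainty contributes a quarter of the weight, which is one simple way to prioritize updates from clients with reliable imputation.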

197. ❌ Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations

Authors: Tong Zhang, Jiangning Zhang, Zhucun Xue, Juntao Jiang, Yicheng Xu, Chengming Xu, Teng Hu, Xingyu Xie, Xiaobin Hu, Yabiao Wang, Yong Liu, Shuicheng Yan Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12968v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

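Rationale: The paper focuses on the evolution, evaluation, and design principles of deep learning optimization algorithms, covering first-order (e.g., SGD, Adam), second-order, and zeroth-order methods, and discusses their use in privacy protection, memory efficiency, and large-scale training. However, its content has no direct connection to any of the scoring keywords, which mainly concern large-model techniques, training methods, inference optimization, alignment, and applications. It does not mention large models, language models, specific training techniques (e.g., RLHF, PEFT), reasoning methods (e.g., CoT, RAG), or AI-for-science applications. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper reviews the evolution of deep learning optimization algorithms, analyzes the performance of mainstream optimizers across model architectures and training scenarios through a comprehensive empirical evaluation, summarizes key trends and design trade-offs, and offers guidance for designing next-generation efficient, robust, and trustworthy optimization methods.

Abstract (Chinese translation)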

平衡收敛速度、泛化能力与计算效率仍是深度学习优化的核心挑战。以随机梯度下降（SGD）和Adam为代表的一阶梯度下降方法构成了现代训练流程的基石。然而，大规模模型训练、严格的差分隐私要求以及分布式学习范式暴露了这些传统方法在隐私保护与内存效率方面的关键局限。为缓解这些瓶颈，研究者探索二阶优化技术以突破一阶方法的性能上限，而零阶方法则重新兴起以缓解大规模训练固有的内存约束。尽管方法体系不断扩展，该领域仍缺乏一个能统一底层原理并厘清不同方法适用场景的连贯框架。本文回顾性分析了深度学习优化算法的发展轨迹，并对主流优化器在不同模型架构与训练场景下进行了全面的实证评估。我们提炼出关键的新兴趋势与基础设计权衡，指明了未来研究的潜在方向。通过理论洞见与广泛实证证据的综合，本研究为设计下一代高效、鲁棒且可信赖的优化方法提供了可操作的指导。代码已发布于 https://github.com/APRIL-AIGC/Awesome-Optimizer。

Abstract

Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam, serve as the cornerstone of modern training pipelines. However, large-scale model training, stringent differential privacy requirements, and distributed learning paradigms expose critical limitations in these conventional approaches regarding privacy protection and memory efficiency. To mitigate these bottlenecks, researchers explore second-order optimization techniques to surpass first-order performance ceilings, while zeroth-order methods reemerge to alleviate memory constraints inherent to large-scale training. Despite this proliferation of methodologies, the field lacks a cohesive framework that unifies underlying principles and delineates application scenarios for these disparate approaches. In this work, we retrospectively analyze the evolutionary trajectory of deep learning optimization algorithms and present a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios. We distill key emerging trends and fundamental design trade-offs, pinpointing promising directions for future research. By synthesizing theoretical insights with extensive empirical evidence, we provide actionable guidance for designing next-generation highly efficient, robust, and trustworthy optimization methods. The code is available at https://github.com/APRIL-AIGC/Awesome-Optimizer.

Keywords: optimization methods, deep learning, gradient descent, empirical evaluation, convergence speed, generalization capability, computational efficiency, large-scale training
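
To make the first-order versus zeroth-order contrast in the abstract concrete, here is a small sketch on a toy quadratic. It is only an illustration under stated assumptions: the loss, the learning rate, and the two-point random-direction estimator are choices made for this example and are not taken from the paper or its repository.

```python
import random


def loss(w):
    # Toy quadratic with its minimum at w = [1, -2].
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


def grad(w):
    # Analytic gradient: the information a first-order method relies on.
    return [2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)]


def gradient_step(w, lr=0.1):
    g = grad(w)
    return [wi - lr * gi for wi, gi in zip(w, g)]


def zeroth_order_step(w, rng, lr=0.1, mu=1e-3):
    # Two-point estimate along a random +/-1 direction.
    # Only loss evaluations are needed, so no gradients have to be stored.
    u = [rng.choice((-1.0, 1.0)) for _ in w]
    w_plus = [wi + mu * ui for wi, ui in zip(w, u)]
    w_minus = [wi - mu * ui for wi, ui in zip(w, u)]
    scale = (loss(w_plus) - loss(w_minus)) / (2.0 * mu)
    return [wi - lr * scale * ui for wi, ui in zip(w, u)]


def run(step, n_steps=200, **kwargs):
    w = [0.0, 0.0]
    for _ in range(n_steps):
        w = step(w, **kwargs)
    return w
```

Calling `run(gradient_step)` and `run(zeroth_order_step, rng=random.Random(0))` both approach the minimizer on this toy problem. The zeroth-order variant trades noisier updates for not needing backpropagated gradients, which is the memory argument the abstract alludes to.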

198. ❌ AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation

Authors: Yubraj Bhandari, Lavsen Dahal, Paul Segars, Joseph Y. Lo Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12969v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

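Rationale: AbdomenGen focuses on abdominal anatomy generation for medical imaging, using a diffusion framework and a novel Volume Control Scalar (VCS). The work is an application of AI in science (specifically biomedical imaging), so it is highly relevant to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). However, the paper does not involve large language models (LLMs), innovations in deep learning techniques (e.g., MoE, Scaling Laws, PEFT), reasoning methods (e.g., CoT, MCTS), agent systems, model optimization (e.g., quantization, inference acceleration), or other large-model topics, so the remaining 26 keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes AbdomenGen, a sequential volume-conditioned diffusion framework for generating controllable abdominal anatomy. It introduces the Volume Control Scalar (VCS) to modulate organ volume independently of body habitus, and demonstrates high geometric fidelity and clinical utility across 11 abdominal organs.

Abstract (Chinese translation)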

计算体模在医学影像研究中应用广泛，但现有系统在生成受控且具有临床意义的解剖结构变异方面仍存在局限。本文提出AbdomenGen，一种基于序列体素条件扩散的可控腹部解剖结构生成框架。我们引入体积控制标量（Volume Control Scalar, VCS），该标准化残差将器官尺寸与体型特征解耦，从而实现可解释的体积调控。器官掩模通过序列化合成，以前期生成的体部掩模及已生成结构为条件，在保持整体解剖一致性的同时支持独立的多器官控制。针对11个腹部器官，本框架实现了优异的几何保真度（如肝脏戴斯系数$0.83 \pm 0.05$），在$[-3,+3]$ VCS区间内保持稳定的单器官校准能力，并具备解耦的多器官调控特性。为展示临床实用性，我们从MERLIN数据集中选取肝肿大队列进行验证，基于Wasserstein距离的VCS选择策略将训练数据的分布距离降低了73.6%。这些结果表明，该框架能够实现校准化、分布感知的解剖结构生成，适用于可控腹部体模构建与仿真研究。

Abstract

Computational phantoms are widely used in medical imaging research, yet current systems to generate controlled, clinically meaningful anatomical variations remain limited. We present AbdomenGen, a sequential volume-conditioned diffusion framework for controllable abdominal anatomy generation. We introduce the Volume Control Scalar (VCS), a standardized residual that decouples organ size from body habitus, enabling interpretable volume modulation. Organ masks are synthesized sequentially, conditioning on the body mask and previously generated structures to preserve global anatomical coherence while supporting independent, multi-organ control. Across 11 abdominal organs, the proposed framework achieves strong geometric fidelity (e.g., liver Dice $0.83 \pm 0.05$), stable single-organ calibration over $[-3,+3]$ VCS, and disentangled multi-organ modulation. To showcase clinical utility with a hepatomegaly cohort selected from MERLIN, Wasserstein-based VCS selection reduces distributional distance of training data by 73.6%. These results demonstrate calibrated, distribution-aware anatomical generation suitable for controllable abdominal phantom construction and simulation studies.

Keywords: AbdomenGen, diffusion framework, abdominal anatomy generation, Volume Control Scalar (VCS), organ volume modulation, computational phantoms, medical imaging, hepatomegaly cohort
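
The abstract describes the VCS as a standardized residual that separates organ size from body habitus. One plausible reading, sketched below, regresses organ volume on a body-size covariate and divides the residual by the residual standard deviation. The single-covariate linear model and all function names are assumptions made for illustration; the paper may define the VCS differently.

```python
import statistics


def fit_size_model(body_volumes, organ_volumes):
    """Least-squares line organ ~ body, plus the spread of its residuals."""
    mx = statistics.fmean(body_volumes)
    my = statistics.fmean(organ_volumes)
    sxx = sum((x - mx) ** 2 for x in body_volumes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(body_volumes, organ_volumes))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [
        y - (slope * x + intercept) for x, y in zip(body_volumes, organ_volumes)
    ]
    return slope, intercept, statistics.pstdev(residuals)


def volume_control_scalar(body_volume, organ_volume, model):
    """Standardized residual: how large the organ is relative to body size."""
    slope, intercept, resid_sd = model
    return (organ_volume - (slope * body_volume + intercept)) / resid_sd


def organ_volume_for(body_volume, vcs, model):
    """Invert the mapping: target organ volume for a requested VCS value."""
    slope, intercept, resid_sd = model
    return slope * body_volume + intercept + vcs * resid_sd
```

Under this reading, a VCS of 0 means the organ has the volume expected for that body size, and the $[-3,+3]$ range in the abstract would correspond to roughly three residual standard deviations in either direction.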

199. ❌ Boosting Visual Instruction Tuning with Self-Supervised Guidance

Authors: Sophia Sirko-Galouchenko, Monika Wysoczanska, Andrei Bursuc, Nicolas Thome, Spyros Gidaris Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12966v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	15.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

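Rationale: The paper's core contribution is an improved method for visual instruction tuning, which adds self-supervised tasks phrased as natural language instructions to strengthen the visual reasoning of multimodal large language models (MLLMs). It is therefore highly relevant to "Instruction Tuning OR Alignment OR Value Alignment" (15 points), since this is the paper's central innovation. It is also relevant to "Large Language Models OR LLMs OR Foundation Models" and "Post-training OR Supervised Fine-tuning OR SFT" (10 points each), since the paper deals with MLLMs (a kind of large model) and their fine-tuning stage. Other keywords such as MoE, SLMs, Scaling Laws, and RLHF are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the under-utilization of visual information by multimodal large language models on vision-centric tasks. It augments visual instruction tuning by recasting self-supervised tasks as natural language instructions, which significantly improves the models' visual reasoning performance.

Abstract (Chinese translation)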

多模态大语言模型（MLLMs）在众多视觉-语言任务上表现优异，但在需要细粒度视觉推理的以视觉为中心的问题上往往表现不佳。近期研究表明，这一局限并非源于视觉表征能力薄弱，而是由于在指令微调过程中视觉信息未被充分利用——许多任务仅依靠语言先验即可部分解决。我们提出一种简单轻量的方法，通过少量以自然语言指令表达的、基于视觉的自监督任务来增强视觉指令微调。通过将经典的自监督预训练任务（如旋转预测、颜色匹配和跨视角对应）重构为图像-指令-响应三元组，我们引入了必须依赖视觉证据才能解决的监督信号。该方法无需人工标注、无需修改模型架构、也无需额外的训练阶段。在多种模型、训练机制和基准测试中，仅注入少量（3-10%）此类基于视觉的指令，即可持续提升在以视觉为中心的评价任务上的性能。我们的研究结果表明，通过基于视觉的自监督学习任务进行指令微调，能够通过简单调整训练数据分布，成为增强多模态大语言模型视觉推理能力的有效途径。代码发布于：https://github.com/sirkosophia/V-GIFT

Abstract

Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT

Keywords: Multimodal Large Language Models, Visual Instruction Tuning, Self-Supervised Learning, Visual Reasoning, Instruction-Response Triplets, Vision-Centric Tasks, Training Data Distribution, Fine-Grained Visual Analysis
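
The idea of recasting a pretext task as an image-instruction-response triplet can be illustrated with rotation prediction. The sketch below is hypothetical: images are stood in for by small 2D grids, the instruction wording is invented, and the mixing rule assumes the 3-10% figure refers to the share of the final training mix. None of this is taken from the V-GIFT code.

```python
import random

ANGLES = (0, 90, 180, 270)


def rotate90(grid):
    # Rotate a 2D grid (list of rows) by 90 degrees clockwise.
    return [list(row) for row in zip(*grid[::-1])]


def make_rotation_triplet(image, rng):
    """Build one image-instruction-response triplet from an unlabeled image."""
    angle = rng.choice(ANGLES)
    rotated = image
    for _ in range(angle // 90):
        rotated = rotate90(rotated)
    return {
        "image": rotated,
        "instruction": (
            "By how many degrees clockwise has this image been rotated? "
            "Answer with 0, 90, 180, or 270."
        ),
        "response": str(angle),
    }


def mix_in(base_samples, images, fraction, rng):
    """Add rotation triplets so they make up `fraction` of the final mix."""
    n_ssl = round(fraction * len(base_samples) / (1.0 - fraction))
    ssl = [make_rotation_triplet(rng.choice(images), rng) for _ in range(n_ssl)]
    mixed = base_samples + ssl
    rng.shuffle(mixed)
    return mixed
```

The label comes for free from the transformation itself, which is why such tasks need no human annotation, and the answer cannot be guessed from language priors alone.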

200. ❌ Adaptive Data Dropout: Towards Self-Regulated Learning in Deep Neural Networks

Authors: Amar Gahir, Varshil Patel, Shreyank N Gowda Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12945v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

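Rationale: The paper "Adaptive Data Dropout: Towards Self-Regulated Learning in Deep Neural Networks" focuses on training efficiency in deep learning and proposes an adaptive method that dynamically adjusts the subset of training data. However, all scoring keywords revolve around large models (LLMs) and related techniques (e.g., MoE, RLHF, RAG, quantization), large-model applications (e.g., AI for Science), or specific large-model capabilities (e.g., reasoning, alignment). This paper studies training-data selection for general deep neural networks (DNNs), applied to image classification, and does not involve large models, their underlying techniques, or their application to science. All keywords therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an adaptive data dropout framework that dynamically adjusts the training data subset to make deep neural network training more efficient, reducing effective training steps while maintaining accuracy.

Abstract (Chinese translation)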

深度神经网络通常通过跨周期均匀采样大型数据集进行训练，尽管有证据表明并非所有样本在整个学习过程中贡献均等。近期研究表明，逐步减少训练数据量可提升效率与泛化能力，但现有方法依赖于固定的调度策略，无法在训练过程中自适应调整。本研究提出自适应数据丢弃框架，这是一种基于性能反馈动态调整训练数据子集的简洁方法。受自我调节学习启发，我们的方法将数据选择视为自适应过程，根据训练准确率的变化增加或减少数据暴露程度。我们引入一种轻量级随机更新机制，在线调节数据丢弃计划，使模型能够随时间平衡探索与巩固。在标准图像分类基准上的实验表明，与静态数据丢弃策略相比，该方法在保持竞争力的准确率同时有效减少了训练步骤。这些结果凸显了自适应数据选择作为高效鲁棒训练方向的潜力。代码将公开。

Abstract

Deep neural networks are typically trained by uniformly sampling large datasets across epochs, despite evidence that not all samples contribute equally throughout learning. Recent work shows that progressively reducing the amount of training data can improve efficiency and generalization, but existing methods rely on fixed schedules that do not adapt during training. In this work, we propose Adaptive Data Dropout, a simple framework that dynamically adjusts the subset of training data based on performance feedback. Inspired by self-regulated learning, our approach treats data selection as an adaptive process, increasing or decreasing data exposure in response to changes in training accuracy. We introduce a lightweight stochastic update mechanism that modulates the dropout schedule online, allowing the model to balance exploration and consolidation over time. Experiments on standard image classification benchmarks show that our method reduces effective training steps while maintaining competitive accuracy compared to static data dropout strategies. These results highlight adaptive data selection as a promising direction for efficient and robust training. Code will be released.

Keywords: Adaptive Data Dropout, Self-Regulated Learning, Deep Neural Networks, Training Efficiency, Data Selection, Stochastic Update, Image Classification, Generalization
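
A feedback-driven schedule of the kind the abstract describes might look like the sketch below. The update rule, the step size, the update probability, and the bounds are all hypothetical; the abstract only states that data exposure rises or falls with changes in training accuracy and that the update is stochastic.

```python
import random


def update_keep_fraction(keep, prev_acc, curr_acc, rng,
                         step=0.05, p_update=0.5, lo=0.1, hi=1.0):
    """Adjust the fraction of training data kept for the next epoch."""
    if rng.random() >= p_update:
        return keep  # stochastic element: skip the update this epoch
    if curr_acc > prev_acc:
        keep -= step  # accuracy improved: drop more data
    elif curr_acc < prev_acc:
        keep += step  # accuracy fell: expose the model to more data
    return min(hi, max(lo, keep))


def select_subset(indices, keep, rng):
    """Sample the subset of example indices used in the next epoch."""
    k = max(1, round(keep * len(indices)))
    return rng.sample(indices, k)
```

A fixed schedule would decrease `keep` regardless of what the model is doing. Here the direction of each change depends on the most recent accuracy trend, so a drop in accuracy restores data.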

201. ❌ Task Alignment: A simple and effective proxy for model merging in computer vision

Authors: Pau de Jorge, César Roberto de Souza, Björn Michele, Mert Bülent Sarıyıldız, Philippe Weinzaepfel, Florent Perronnin, Diane Larlus, Yannis Kalantidis Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12935v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

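Rationale: The paper focuses on model merging in computer vision and proposes a "task alignment proxy" to speed up hyperparameter selection. Its core content relates directly to model merging, so the keyword "Model Merging OR Model Soups OR Weight Averaging" receives 10 points (highly relevant). However, the paper's domain is computer vision rather than large language models (LLMs) or innovations in deep learning fundamentals, and it does not involve AI applications in science (e.g., bioinformatics). All other keywords concern large language models, deep learning fundamentals, or AI for science and have no direct connection to this paper's work on merging vision models, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a "task alignment proxy" that speeds up hyperparameter selection when merging multi-task models in computer vision, extending the applicability of model merging beyond CLIP-based classification.

Abstract (Chinese translation)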

高效整合多个源自同一预训练基础模型、但针对不同任务进行微调的模型，具有重要的实践意义。尽管已有大量前期研究，但计算机视觉领域的大多数模型合并评估仅限于使用CLIP进行图像分类的场景，其中不同的分类数据集定义了不同的任务。本工作的目标是使模型合并更具实用性，并展示其在超越这一特定设置的更具挑战性场景中的相关性。在大多数视觉场景中，不同的任务依赖于可训练的、且通常是异构的解码器。与以往使用固定解码器的研究不同（合并后的模型可以立即评估），解码器训练的非平凡成本使得基于下游性能进行超参数选择变得不切实际。为解决这一问题，我们引入了任务对齐代理，并展示了如何利用它将以数量级的速度提升超参数选择过程，同时保持性能。借助任务对齐代理，我们将模型合并的适用性扩展到基于CLIP的分类之外的多任务视觉模型。

Abstract

Efficiently merging several models fine-tuned for different tasks, but stemming from the same pretrained base model, is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision are restricted to image classification using CLIP, where different classification datasets define different tasks. In this work, our goal is to make model merging more practical and show its relevance on challenging scenarios beyond this specific setting. In most vision scenarios, different tasks rely on trainable and usually heterogeneous decoders. Differently from previous studies with frozen decoders, where merged models can be evaluated right away, the non-trivial cost of decoder training renders hyperparameter selection based on downstream performance impractical. To address this, we introduce the task alignment proxy, and show how it can be used to speed up hyperparameter selection by orders of magnitude while retaining performance. Equipped with the task alignment proxy, we extend the applicability of model merging to multi-task vision models beyond CLIP-based classification.

Keywords: model merging, computer vision, task alignment, hyperparameter selection, multi-task models, CLIP, vision models, decoder training
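
For context on what hyperparameter selection means here, the sketch below shows a generic task-arithmetic-style merge of models fine-tuned from a shared base, with a scaling coefficient `alpha` as the hyperparameter to tune. This is a well-known merging recipe used purely as an illustration. It is not the paper's method, and the task alignment proxy itself is not defined in the abstract, so it is not sketched.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and the base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(base, finetuned_models, alpha):
    """Add alpha-scaled task vectors from every fine-tuned model to the base."""
    merged = {name: list(values) for name, values in base.items()}
    for model in finetuned_models:
        tv = task_vector(model, base)
        for name in merged:
            merged[name] = [m + alpha * d for m, d in zip(merged[name], tv[name])]
    return merged
```

Choosing `alpha` normally requires evaluating each merged encoder downstream. When every candidate also needs a decoder trained on top, that evaluation becomes expensive, which is the cost the abstract says the proxy is meant to avoid.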

202. ❌ Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection

Authors: Tianshuo Zhang, Haoyuan Zhang, Siran Peng, Weisong Zhao, Xiangyu Zhu, Zhen Lei Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12941v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

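Rationale: The paper focuses on continual face forgery detection (CFFD) in computer vision and proposes a new method, Direct Discrepancy Replay, comprising Distribution-Discrepancy Condensation (DDC) and Manifold-Consistent Replay (MCR). Its core aim is to address catastrophic forgetting in continual learning by modeling and condensing the discrepancy between real and fake distributions and generating replay samples under a manifold-consistency constraint. All scoring keywords relate directly to large language models, deep learning fundamentals, or AI applications in science, whereas this paper addresses a specific computer vision task (face forgery detection) and involves no large language model techniques, no innovations in deep learning theory, and no AI applications in scientific fields such as bioinformatics. All keywords therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Direct Discrepancy Replay, which uses distribution-discrepancy condensation and manifold-consistent replay to address catastrophic forgetting in continual face forgery detection, and significantly outperforms existing baselines under an extremely small memory budget.

Abstract (Chinese translation)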

持续人脸伪造检测（CFFD）要求检测器能够学习新出现的伪造范式，同时不遗忘先前已见过的篡改手段。现有的CFFD方法通常依赖回放少量历史数据来缓解遗忘问题。这类回放通常通过存储少量历史样本，或基于检测器相关扰动合成伪伪造样本实现。在严格的内存限制下，前者无法充分覆盖多样化的伪造线索且可能泄露面部身份信息，而后者仍与过去的决策边界紧密关联。我们认为，回放在CFFD中的核心作用是在后续训练过程中恢复先前伪造任务的分布特征。为此，我们直接压缩真实与伪造分布之间的差异，并利用当前阶段的真实人脸进行分布级回放。具体而言，我们提出分布差异压缩（Distribution-Discrepancy Condensation, DDC）方法，该方法通过特征函数空间中的代理因子化建模真实到伪造的分布差异，并将其压缩至一个微型的分布差异图谱库中。我们进一步提出流形一致回放（Manifold-Consistent Replay, MCR），通过将这些差异图谱与当前阶段真实人脸进行方差保持性组合来合成回放样本，生成的样本既能反映先前任务的伪造线索，又保持与当前真实人脸统计特性兼容。在极小内存预算且不直接存储原始历史人脸图像的条件下，我们的框架持续优于现有CFFD基线方法，并显著缓解灾难性遗忘问题。回放层级的隐私分析进一步表明，相较于基于样本选择的回放方法，本方法降低了身份信息泄露风险。

Abstract

Continual face forgery detection (CFFD) requires detectors to learn emerging forgery paradigms without forgetting previously seen manipulations. Existing CFFD methods commonly rely on replaying a small amount of past data to mitigate forgetting. Such replay is typically implemented either by storing a few historical samples or by synthesizing pseudo-forgeries from detector-dependent perturbations. Under strict memory budgets, the former cannot adequately cover diverse forgery cues and may expose facial identities, while the latter remains strongly tied to past decision boundaries. We argue that the core role of replay in CFFD is to reinstate the distributions of previous forgery tasks during subsequent training. To this end, we directly condense the discrepancy between real and fake distributions and leverage real faces from the current stage to perform distribution-level replay. Specifically, we introduce Distribution-Discrepancy Condensation (DDC), which models the real-to-fake discrepancy via a surrogate factorization in characteristic-function space and condenses it into a tiny bank of distribution discrepancy maps. We further propose Manifold-Consistent Replay (MCR), which synthesizes replay samples through variance-preserving composition of these maps with current-stage real faces, yielding samples that reflect previous-task forgery cues while remaining compatible with current real-face statistics. Operating under an extremely small memory budget and without directly storing raw historical face images, our framework consistently outperforms prior CFFD baselines and significantly mitigates catastrophic forgetting. Replay-level privacy analysis further suggests reduced identity leakage risk relative to selection-based replay.

Keywords: Continual Face Forgery Detection, Distribution-Discrepancy Condensation, Manifold-Consistent Replay, Catastrophic Forgetting, Replay-based Methods, Face Forgery Detection, Memory Budget, Privacy Preservation
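
The abstract does not spell out what "variance-preserving composition" means. One common form of variance-preserving mixing, shown below purely as an assumption, combines two signals with coefficients whose squares sum to one, so that two uncorrelated unit-variance inputs yield a unit-variance output. The function name and the mixing weight `lam` are hypothetical.

```python
import math


def compose_replay(real_features, discrepancy_map, lam=0.3):
    """Mix a current-stage real sample with a stored discrepancy map.

    Coefficients sqrt(1 - lam) and sqrt(lam) are an assumed form: their
    squares sum to 1, which keeps the variance unchanged when the two
    inputs are uncorrelated and have equal variance.
    """
    a = math.sqrt(1.0 - lam)
    b = math.sqrt(lam)
    return [a * r + b * d for r, d in zip(real_features, discrepancy_map)]
```

Under this assumed form, only the small bank of discrepancy maps has to be stored, while the real component comes from faces available at the current stage.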

203. ❌ DINO-Explorer: Active Underwater Discovery via Ego-Motion Compensated Semantic Predictive Coding

Authors: Yuhan Jin, Nayari Marie Lessa, Mariela De Lucas Alvarez, Melvin Laux, Lucas Amparo Barbosa, Frank Kirchner, Rebecca Adam Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12933v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

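Rationale: The paper studies an active underwater monitoring system built on the DINOv3 foundation model, an application of AI in science (ocean monitoring), so it is fairly strongly related to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points). The paper uses a "frozen DINOv3 foundation model", which gives it some relevance to "Large Language Models OR LLMs OR Foundation Models" (5 points), although DINOv3 is a vision foundation model rather than a language model. The other keywords mainly concern the technical principles, training methods, and inference optimization of large language models and have no direct relation to this paper's computer vision and robotics application, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the DINO-Explorer framework, which uses semantic predictive coding on a DINOv3 foundation model together with ego-motion compensation to address the inability of autonomous underwater vehicles to actively identify high-value transient events when monitoring marine ecosystems. It provides a bandwidth-efficient event-triggering mechanism that substantially reduces false positives and concentrates transmission on human-verified novelty events.

Abstract (Chinese translation)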

海洋生态系统退化亟需持续且具备科学选择性的水下监测。然而，大多数自主水下航行器（AUV）仅作为被动数据记录器运行，捕获大量视频供离线审查，常常错失具有高科学价值的瞬态事件。向主动感知转变需要一个因果性的在线信号，该信号能突显重要现象，同时抑制由航行器机动引起的视觉变化。我们提出了DINO-Explorer，一个由连续语义惊奇信号驱动的新颖性感知框架。该框架在一个冻结的DINOv3基础模型的潜在空间内运行，利用一个轻量级、动作条件化的循环预测器来预测短时程的语义演变。一个受传出副本启发的模块利用全局池化的光流来抵消自身引起的视觉变化，同时不抑制真实的环境新颖性。我们在不同遥测约束下的异步事件分类下游任务中评估了该信号。结果表明，DINO-Explorer提供了一种鲁棒且带宽高效的注意力机制。在一个固定的工作点上，系统保留了78.8%的事后发现后经人工审查达成共识的事件，并具有56.8%的触发确认率，有效地凸显了与任务相关的现象。至关重要的是，相对于未经补偿的惊奇信号基线，自我运动条件化将误报率降低了45.5%。在一项回放侧帕累托消融研究中，DINO-Explorer在已验证的峰值F1分数与遥测带宽边界上稳健地占据主导地位，在选定工作点上将遥测带宽降低了48.2%，同时保持了62.2%的峰值F1分数，成功地将数据传输集中在经人工验证的新颖性事件周围。

Abstract

Marine ecosystem degradation necessitates continuous, scientifically selective underwater monitoring. However, most autonomous underwater vehicles (AUVs) operate as passive data loggers, capturing exhaustive video for offline review and frequently missing transient events of high scientific value. Transitioning to active perception requires a causal, online signal that highlights significant phenomena while suppressing maneuver-induced visual changes. We propose DINO-Explorer, a novelty-aware perception framework driven by a continuous semantic surprise signal. Operating within the latent space of a frozen DINOv3 foundation model, it leverages a lightweight, action-conditioned recurrent predictor to anticipate short-horizon semantic evolution. An efference-copy-inspired module utilizes globally pooled optical flow to discount self-induced visual changes without suppressing genuine environmental novelty. We evaluate this signal on the downstream task of asynchronous event triage under variant telemetry constraints. Results demonstrate that DINO-Explorer provides a robust, bandwidth-efficient attention mechanism. At a fixed operating point, the system retains 78.8% of post-discovery human-reviewer consensus events with a 56.8% trigger confirmation rate, effectively surfacing mission-relevant phenomena. Crucially, ego-motion conditioning suppresses 45.5% of false positives relative to an uncompensated surprise signal baseline. In a replay-side Pareto ablation study, DINO-Explorer robustly dominates the validated peak F1 versus telemetry bandwidth frontier, reducing telemetry bandwidth by 48.2% at the selected operating point while maintaining a 62.2% peak F1 score, successfully concentrating data transmission around human-verified novelty events.

Keywords: underwater monitoring, autonomous underwater vehicles, foundation model, semantic surprise signal, ego-motion compensation, novelty detection, bandwidth efficiency, active perception
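
A surprise signal of the kind described can be sketched as the mismatch between a predicted and an observed embedding, discounted when the vehicle itself is moving. Everything below is an assumption for illustration: the cosine-distance surprise, the `1 / (1 + beta * flow)` discount, and the threshold are not taken from the paper, and the embeddings stand in for DINOv3 features.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def compensated_surprise(predicted, observed, flow_magnitude, beta=1.0):
    """Prediction error in embedding space, discounted by ego-motion.

    flow_magnitude stands in for globally pooled optical flow; beta is a
    hypothetical discount strength.
    """
    return cosine_distance(predicted, observed) / (1.0 + beta * flow_magnitude)


def should_trigger(predicted, observed, flow_magnitude, threshold=0.2):
    """Fire an event only when the compensated surprise exceeds a threshold."""
    return compensated_surprise(predicted, observed, flow_magnitude) > threshold
```

In this toy version the same embedding mismatch triggers an event when the vehicle is still but is suppressed during a large maneuver, which mirrors the false-positive reduction the abstract attributes to ego-motion conditioning.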

204. ❌ Pi-HOC: Pairwise 3D Human-Object Contact Estimation

Authors: Sravan Chittupalli, Ayush Jain, Dong Huang Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12923v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

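Rationale: Pi-HOC is a computer vision paper on 3D human-object contact estimation. It relies on vision models (such as SAM and InteractionFormer) and geometric processing (SMPL meshes), and does not involve large language models, innovations in deep learning principles, or a concrete AI for Science application. All keywords target large language models, deep learning principles, or AI for Science and are unrelated to the paper's vision and geometry focus, so every keyword is scored 0.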

!!! tip deepseek-chat TL;DR

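The paper proposes the Pi-HOC framework, which addresses fine-grained 3D semantic contact estimation for multiple human-object pairs in an image, significantly improving accuracy and localization while achieving 20x higher throughput.

Abstract translation

Resolving real-world human-object interactions in images is a many-to-many challenge, in which disentangling fine-grained, concurrent physical contacts is especially difficult. Existing semantic contact estimation methods are either limited to single-person scenes or require object geometry (such as meshes) in addition to the input image. The current state-of-the-art method uses a powerful vision-language model (VLM) to obtain category-level semantics, but it performs poorly in multi-person scenes and is inefficient at inference. We propose Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction over all human-object pairs. Pi-HOC detects instances, creates a dedicated token for each human-object (HO) pair, and refines these tokens with an InteractionFormer. A SAM-based decoder then predicts dense contact on the SMPL human mesh for each human-object pair. On the MMHOI and DAMON datasets, Pi-HOC significantly outperforms state-of-the-art methods in accuracy and localization while achieving 20x higher throughput. We further show that, through a test-time optimization algorithm, the predicted contacts improve SAM-3D image-to-mesh reconstruction, and that referential contact prediction from language queries is possible without additional training.

Abstract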

Resolving real-world human-object interactions in images is a many-to-many challenge, in which disentangling fine-grained concurrent physical contact is particularly difficult. Existing semantic contact estimation methods are either limited to single-human settings or require object geometries (e.g., meshes) in addition to the input image. Current state-of-the-art leverages powerful VLM for category-level semantics but struggles with multi-human scenarios and scales poorly in inference. We introduce Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction of all human-object pairs. Pi-HOC detects instances, creates dedicated human-object (HO) tokens for each pair, and refines them using an InteractionFormer. A SAM-based decoder then predicts dense contact on SMPL human meshes for each human-object pair. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. We further demonstrate that predicted contacts improve SAM-3D image-to-mesh reconstruction via a test-time optimization algorithm and enable referential contact prediction from language queries without additional training.

Keywords: 3D human-object contact, semantic contact estimation, instance-aware framework, InteractionFormer, SAM-based decoder, SMPL human meshes, multi-human scenarios, test-time optimization
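
The abstract's "dedicated human-object (HO) tokens for each pair" implies one token per (human, object) combination. A minimal sketch of that enumeration follows; the feature vectors, the instance names, and the choice to build a token by concatenation are all hypothetical, since the abstract does not say how Pi-HOC constructs its tokens.

```python
from itertools import product

def build_pair_tokens(human_feats, object_feats):
    """Map every (human_id, object_id) pair to a token.

    ASSUMPTION: the token is simply the two feature lists concatenated.
    """
    return {
        (h_id, o_id): h_vec + o_vec
        for (h_id, h_vec), (o_id, o_vec) in product(
            human_feats.items(), object_feats.items()
        )
    }

# Toy detections: two humans and three objects (made-up features).
humans = {"h0": [0.1, 0.2], "h1": [0.3, 0.4]}
objects = {"chair": [1.0], "cup": [2.0], "bike": [3.0]}

tokens = build_pair_tokens(humans, objects)
print(len(tokens))  # number of human-object pairs
```

The pair count grows as humans × objects, which is why handling all pairs in a single pass matters for throughput.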

205. ❌ Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions

Authors: Ayce Idil Aytekin, Xu Chen, Zhengyang Shen, Thabo Beeler, Helge Rhodin, Rishabh Dabral, Christian Theobalt Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12929v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

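Rationale: The paper focuses on reconstructing dynamic hand-object interactions in computer vision, using Gaussian representations and classical tracking methods. It does not touch on large language models, innovations in deep learning principles, AI for Science, or any other keyword topic. None of the keywords relate to the paper's content, so every keyword is scored 0.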

!!! tip deepseek-chat TL;DR

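The paper proposes Grasp in Gaussians (GraG), a fast method for reconstructing dynamic hand-object interactions from monocular video. Using a compact Sum-of-Gaussians representation, it runs 6.4x faster than prior work while improving object reconstruction by 13.4% and reducing the hand's per-joint position error by over 65%.

Abstract translation

We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on efficiently tracking the hand and the object once they have been initialized from pretrained large models. Our key insight is that a compact Sum-of-Gaussians (SoG) representation can recover accurate and temporally stable hand-object motion; this representation is drawn from the classical tracking literature and combined with generative Gaussian-based initialization. We initialize object pose and geometry with a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient, fast tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from an off-the-shelf monocular hand pose initialization, we refine hand motion with simple yet effective 2D joint and depth alignment losses, which avoids per-frame refinement of a detailed 3D hand appearance model while keeping articulation stable. Extensive experiments on public benchmarks show that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work, while improving object reconstruction by 13.4% and reducing the hand's per-joint position error by over 65%.

Abstract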

We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from a single monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on tracking the hand and the object efficiently, once initialized from pretrained large models. Our key insight is that accurate and temporally stable hand-object motion can be recovered using a compact Sum-of-Gaussians (SoG) representation, revived from classical tracking literature and integrated with generative Gaussian-based initializations. We initialize object pose and geometry using a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient and fast tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from off-the-shelf monocular hand pose initialization, we refine hand motion using simple yet effective 2D joint and depth alignment losses, avoiding per-frame refinement of a detailed 3D hand appearance model while maintaining stable articulation. Extensive experiments on public benchmarks demonstrate that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work while improving object reconstruction by 13.4% and reducing hand’s per-joint position error by over 65%.

Keywords: hand-object interaction, monocular reconstruction, Gaussian representation, Sum-of-Gaussians, 3D tracking, dynamic reconstruction, computer vision, pose estimation
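
The step "convert the resulting dense Gaussian representation into a lightweight SoG via subsampling" can be sketched roughly as below. This is an illustration only: uniform random subsampling, isotropic Gaussians stored as `(center, sigma, weight)` tuples, and the toy data are all assumptions, and the paper's actual subsampling strategy and Gaussian parametrization are not described in the abstract.

```python
import math
import random

def subsample(gaussians, keep, seed=0):
    """Uniformly pick `keep` Gaussians (hypothetical subsampling strategy)."""
    rng = random.Random(seed)
    return rng.sample(gaussians, keep)

def sog_value(point, gaussians):
    """Evaluate a sum of isotropic Gaussians at a 3D point."""
    total = 0.0
    for center, sigma, weight in gaussians:
        d2 = sum((p - c) ** 2 for p, c in zip(point, center))
        total += weight * math.exp(-d2 / (2.0 * sigma ** 2))
    return total

# Toy "dense" set: 1000 Gaussians along the x axis (made-up data).
dense = [((i * 0.01, 0.0, 0.0), 0.05, 1.0) for i in range(1000)]
light = subsample(dense, keep=50)

print(len(dense), len(light))
print(sog_value((5.0, 0.0, 0.0), light))
```

Tracking against 50 Gaussians instead of 1000 cuts the per-evaluation cost by the same factor, which is the kind of saving the compact representation is meant to provide.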

206. ❌ M3D-Stereo: A Multiple-Medium and Multiple-Degradation Dataset for Stereo Image Restoration

Authors: Deqing Yang, Yingying Liu, Qicong Wang, Zhi Zeng, Dajiang Lu, Yibin Tian Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12917v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

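Rationale: The paper focuses on building a stereo image restoration dataset in computer vision. Its content covers image processing, dataset creation, and benchmarking, which is entirely unrelated to the scoring keywords (all of which center on large models, deep learning principles, and their applications), so every keyword receives a relevance of 0.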

!!! tip deepseek-chat TL;DR

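The paper addresses stereo image restoration under complex degradations such as underwater, haze, and low-light conditions. It introduces M3D-Stereo, a multiple-medium, multiple-degradation dataset of 7904 high-resolution image pairs, and validates it on single-level and mixed-level degradation tasks.

Abstract translation

Image restoration under adverse conditions such as underwater, haze or fog, and low-light environments remains highly challenging because of complex physical degradations and severe information loss. Existing datasets are mostly limited to a single degradation type or rely heavily on synthetic data that lacks stereo consistency, which fundamentally limits their applicability to real-world scenarios. To address this, we introduce M3D-Stereo, a stereo dataset of 7904 high-resolution image pairs built for image restoration research, acquired in multiple media with multiple controlled degradation levels. The dataset covers four degradation scenarios: underwater scatter, haze/fog, underwater low-light, and haze low-light. Each scenario forms a subset that is further divided into six levels of progressive degradation, enabling fine-grained evaluation of restoration methods as degradation becomes more severe. Collected in a laboratory setup, the dataset provides aligned stereo image pairs together with pixel-wise consistent clear ground-truth images. We performed two restoration tasks, single-level and mixed-level degradation, to verify its validity. M3D-Stereo establishes a better controlled and more realistic benchmark for evaluating image restoration and stereo matching methods in complex degradation environments. The dataset is publicly released under the LGPLv3 license.

Abstract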

Image restoration under adverse conditions, such as underwater, haze or fog, and low-light environments, remains a highly challenging problem due to complex physical degradations and severe information loss. Existing datasets are predominantly limited to a single degradation type or heavily rely on synthetic data without stereo consistency, inherently restricting their applicability in real-world scenarios. To address this, we introduce M3D-Stereo, a stereo dataset with 7904 high-resolution image pairs for image restoration research acquired in multiple media with multiple controlled degradation levels. It encompasses four degradation scenarios: underwater scatter, haze/fog, underwater low-light, and haze low-light. Each scenario forms a subset, and is divided into six levels of progressive degradation, allowing fine-grained evaluations of restoration methods with increasing severity of degradation. Collected via a laboratory setup, the dataset provides aligned stereo image pairs along with their pixel-wise consistent clear ground truths. Two restoration tasks, single-level and mixed-level degradation, were performed to verify its validity. M3D-Stereo establishes a better controlled and more realistic benchmark to evaluate image restoration and stereo matching methods in complex degradation environments. It is made public under LGPLv3 license.

Keywords: stereo image restoration, multiple degradation, underwater scatter, haze/fog, low-light, dataset, benchmark, ground truth

207. ❌ Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

Authors: Ahmet İnanç, Özgür Erkent Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12918v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

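Rationale: The paper focuses on radar-camera fusion multi-task learning for autonomous driving and proposes a Cross-Task Attention Bridge (CTAB) to jointly optimize 3D detection and segmentation. Its content revolves entirely around computer vision, sensor fusion, and multi-task learning, and it does not involve large language models (LLMs), innovations in deep learning principles, or applications of AI in science. The scoring keywords all concern large-model techniques, training methods, inference optimization, alignment, AI agents, and similar topics, which have no direct connection to this paper's autonomous driving perception research, so every keyword receives a relevance of 0.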

!!! tip deepseek-chat TL;DR

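The paper targets radar-camera fusion methods in autonomous driving that treat detection and segmentation in isolation. It proposes a Cross-Task Attention Bridge (CTAB) module that exchanges features bidirectionally, improving segmentation on the nuScenes dataset while keeping detection performance essentially unchanged.

Abstract translation

Bird's-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas on which detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation and miss the opportunity to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while segmentation features provide dense semantic context that can anchor detection. We propose CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between the detection and segmentation branches via multi-scale deformable attention in a shared BEV space. CTAB is integrated into a multi-task framework that includes an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On the nuScenes dataset, CTAB improves segmentation on 7 classes over the joint multi-task baseline with essentially no effect on detection performance. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model reaches comparable mIoU on the 4 classes while also providing 3D detection.

Abstract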

Bird’s-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while segmentation features provide dense semantic context that can anchor detection. We propose CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model reaches comparable mIoU on 4 classes while simultaneously providing 3D detection.

Keywords: Radar-Camera Fusion, BEV Representation, Multi-Task Learning, Cross-Task Attention, 3D Detection, Segmentation, Autonomous Driving, nuScenes Dataset
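
The bidirectional exchange between detection and segmentation features can be illustrated with plain dot-product cross-attention. Note the substitution: the paper uses multi-scale deformable attention in BEV space, whereas this sketch uses ordinary softmax attention with no learned projections, and the residual connection and toy features are assumptions made purely for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Each query row attends over all context rows (scaled dot product)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, context)) for j in range(dim)])
    return out

def bridge(det, seg):
    """Bidirectional exchange: each branch adds what it reads from the other."""
    det_from_seg = cross_attend(det, seg)
    seg_from_det = cross_attend(seg, det)
    new_det = [[a + b for a, b in zip(row, upd)] for row, upd in zip(det, det_from_seg)]
    new_seg = [[a + b for a, b in zip(row, upd)] for row, upd in zip(seg, seg_from_det)]
    return new_det, new_seg

det = [[1.0, 0.0], [0.0, 1.0]]               # toy detection features
seg = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]  # toy segmentation features
new_det, new_seg = bridge(det, seg)
print(new_det)
print(new_seg)
```

Both branches keep their original shape, so the bridge can be dropped between existing detection and segmentation heads without changing their interfaces.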

208. ❌ A Sanity Check on Composed Image Retrieval

Authors: Yikun Liu, Jiangchao Yao, Weidi Xie, Yanfeng Wang Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12904v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

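Rationale: The paper focuses on improving evaluation for Composed Image Retrieval (CIR), proposing a new benchmark, FISD, and an automatic multi-round agentic evaluation framework. Although it involves generative models and an agentic framework, its core is image retrieval evaluation in computer vision rather than innovation in large models or deep learning principles. The keywords all relate directly to large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, which this paper does not address, so every keyword receives a relevance of 0.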

!!! tip deepseek-chat TL;DR

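The paper addresses inaccurate evaluation of Composed Image Retrieval (CIR) models. It proposes FISD, a benchmark free of query ambiguity, and an automatic multi-round agentic evaluation framework to assess more accurately how CIR methods perform in practical applications.

Abstract translation

Composed Image Retrieval (CIR) aims to retrieve a target image from a query composed of a reference image and a relative caption that specifies the desired modification. Despite the rapid development of CIR models, their performance is not well characterized by existing benchmarks: these benchmarks inherently contain indeterminate queries that make evaluation less reliable (that is, multiple candidate images, rather than only the target image, satisfy the query), and they do not consider how effective the models are in multi-round interactive systems. Motivated by this, we improve the evaluation procedure in two ways: 1) we introduce FISD (a Fully-Informed Semantically-Diverse benchmark), which uses generative models to precisely control the variables of reference-target image pairs, enabling more accurate evaluation of CIR methods across six dimensions without query ambiguity; 2) we design an automatic multi-round agentic evaluation framework to probe the potential of existing models in interactive scenarios. By observing how models adapt and refine their choices over successive rounds of queries, the framework provides a more realistic appraisal of their efficacy in practical applications. Extensive experiments and comparisons demonstrate the value of our new evaluation approach for typical CIR models.

Abstract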

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image, and a relative caption that specifies the desired modification. Despite the rapid development of CIR models, their performance is not well characterized by existing benchmarks, which inherently contain indeterminate queries degrading the evaluation (i.e., multiple candidate images, rather than solely the target image, meet the query criteria), and have not considered their effectiveness in the context of the multi-round system. Motivated by this, we consider improving the evaluation procedure from two aspects: 1) we introduce FISD, a Fully-Informed Semantically-Diverse benchmark, which employs generative models to precisely control the variables of reference-target image pairs, enabling a more accurate evaluation of CIR methods across six dimensions, without query ambiguity; 2) we propose an automatic multi-round agentic evaluation framework to probe the potential of the existing models in the interactive scenarios. By observing how models adapt and refine their choices over successive rounds of queries, this framework provides a more realistic appraisal of their efficacy in practical applications. Extensive experiments and comparisons prove the value of our novel evaluation on typical CIR methods.

Keywords: Composed Image Retrieval, CIR, benchmark, evaluation framework, multi-round agentic evaluation, FISD, image retrieval, query ambiguity
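
The abstract's point about indeterminate queries can be made concrete with a toy Recall@1 computation. In this sketch the image ids, the rankings, and the label sets are made up for illustration; it shows only that scoring against a single labeled target penalizes a model for returning another image that also satisfies the query.

```python
def recall_at_k(rankings, targets, k=1):
    """Fraction of queries whose top-k results contain any acceptable target."""
    hits = sum(1 for ranked, ok in zip(rankings, targets) if set(ranked[:k]) & ok)
    return hits / len(rankings)

# Two toy queries; each inner list is a model's ranked retrieval result.
rankings = [["img_b", "img_a"], ["img_c", "img_d"]]

# Benchmark labels with one target per query.
single_label = [{"img_a"}, {"img_c"}]
# Hypothetical corrected labels: img_b also satisfies the first query.
all_valid = [{"img_a", "img_b"}, {"img_c"}]

strict = recall_at_k(rankings, single_label)
lenient = recall_at_k(rankings, all_valid)
print(strict, lenient)
```

The same rankings score differently depending on how the ambiguous first query is labeled, which is the measurement noise an ambiguity-free benchmark aims to remove.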

209. ❌ Representing 3D Faces with Learnable B-Spline Volumes

Authors: Prashanth Chandran, Daoye Wang, Timo Bolkart Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12894v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

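Rationale: The paper focuses on a geometric modeling method for 3D face representation (CUBE), using B-spline volumes and an MLP for 3D scan registration and monocular 3D face reconstruction. The scoring keywords all concern large models, deep learning principles, or scientific applications of AI, whereas this paper studies a specific geometric representation and reconstruction method in computer vision and does not involve any of those topics, so every keyword receives a relevance of 0.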

!!! tip deepseek-chat TL;DR

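The paper proposes CUBE, a new geometric representation for 3D faces that combines B-spline volumes with learned features. It achieves state-of-the-art 3D scan registration results and is also demonstrated on monocular 3D face reconstruction.

Abstract translation

We present CUBE (Control-based Unified B-spline Encoding), a new geometric representation for human faces that combines B-spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-spline representations that use 3D control points, CUBE is parametrized by a lattice (for example, 8×8×8) of high-dimensional control features, which increases the model's expressivity. Via an intermediate feature space, these features define a continuous two-stage mapping from a 3D parametric domain to 3D Euclidean space. First, the high-dimensional control features are locally blended with the B-spline basis functions, yielding a high-dimensional feature vector whose first three values define a 3D base mesh. A small multilayer perceptron (MLP) then processes this feature vector to predict a residual displacement from the base shape, producing the final refined 3D coordinates. To reconstruct 3D surfaces in dense semantic correspondence, CUBE is queried at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support property of traditional B-spline representations, so local surface edits can be made by updating individual control features. We demonstrate the strengths of this representation by training transformer-based encoders to predict CUBE's control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared with recent baselines.

Abstract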

We present CUBE (Control-based Unified B-spline Encoding), a new geometric representation for human faces that combines B-spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-spline representations with 3D control points, CUBE is parametrized by a lattice (e.g., 8 x 8 x 8) of high-dimensional control features, increasing the model’s expressivity. These features define a continuous, two-stage mapping from a 3D parametric domain to 3D Euclidean space via an intermediate feature space. First, high-dimensional control features are locally blended using the B-spline bases, yielding a high-dimensional feature vector whose first three values define a 3D base mesh. A small MLP then processes this feature vector to predict a residual displacement from the base shape, yielding the final refined 3D coordinates. To reconstruct 3D surfaces in dense semantic correspondence, CUBE is queried at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support property of traditional B-spline representations, enabling local surface editing by updating individual control features. We demonstrate the strengths of this representation by training transformer-based encoders to predict CUBE’s control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent baselines.

Keywords: 3D face representation, B-spline volumes, learned features, 3D scan registration, monocular 3D reconstruction, geometric representation, local surface editing, transformer-based encoders
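
The first stage of the mapping, blending a lattice of control features with B-spline bases, can be sketched as follows. The assumptions are loud ones: a uniform cubic B-spline, an 8×8×8 lattice with a made-up feature dimension of 5, and toy feature values are used here, and the residual MLP stage is omitted entirely. The abstract does not state the spline degree or the feature size.

```python
def cubic_bspline_basis(t):
    """Uniform cubic B-spline weights for a local parameter t in [0, 1]."""
    return [
        (1 - t) ** 3 / 6.0,
        (3 * t ** 3 - 6 * t ** 2 + 4) / 6.0,
        (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6.0,
        t ** 3 / 6.0,
    ]

def locate(u, n):
    """Map u in [0, 1] to (first control index, basis weights) on an n-point axis."""
    spans = n - 3
    x = min(max(u, 0.0), 1.0) * spans
    i = min(int(x), spans - 1)
    return i, cubic_bspline_basis(x - i)

def blend(lattice, u, v, w):
    """Blend the 4x4x4 neighbourhood of control features around (u, v, w)."""
    n = len(lattice)
    dim = len(lattice[0][0][0])
    (i, bu), (j, bv), (k, bw) = locate(u, n), locate(v, n), locate(w, n)
    out = [0.0] * dim
    for a in range(4):
        for b in range(4):
            for c in range(4):
                weight = bu[a] * bv[b] * bw[c]
                feat = lattice[i + a][j + b][k + c]
                for d in range(dim):
                    out[d] += weight * feat[d]
    return out

N, DIM = 8, 5  # lattice size from the abstract's example; DIM is hypothetical
lattice = [[[[float(x + y + z + d) for d in range(DIM)]
             for z in range(N)] for y in range(N)] for x in range(N)]

feature = blend(lattice, 0.1, 0.1, 0.1)
base_xyz = feature[:3]  # first three values give the base position

# Local support: editing a far-away control feature leaves this query unchanged.
lattice[7][7][7] = [1000.0] * DIM
feature_after_edit = blend(lattice, 0.1, 0.1, 0.1)
print(base_xyz, feature == feature_after_edit)
```

Because each query touches only 64 of the 512 control features, an edit to one control feature changes the surface only in its neighbourhood, which is the local editing property the abstract highlights.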

210. ❌ Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

Authors: Muhammad Kamran Janjua, Hugo Silva, Di Niu, Bahador Rashidi Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12896v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

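Rationale: The paper's core subject is pairing multimodal language models (MLLMs) with vision tools, an application-level innovation of large models in visual reasoning. Highly relevant keywords include 'Large Language Models' (the paper studies MLLMs), 'LLM Agents' (it involves tool-using workflows), and 'Tool Use' (its central topic is calling vision tools). 'Chain of Thought' and 'System 2 Thinking' are partially related because the paper focuses on the visual reasoning process, although it does not use these terms explicitly. Other keywords, such as MoE, SLMs, training methods, and compression techniques, are unrelated to the paper's content.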
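
The score table for this paper lists relevance values as high as 10.0/10, yet every per-keyword score and the total are 0.0. One way to read this, assuming (the report does not state it) that each keyword contributes weight × relevance, is that the 0.0 weights in the table zero out every term. The threshold formula, 5.0 + 0.8 × total weight with a total weight of 27.0, is taken from the report's configuration section; the function names below are hypothetical.

```python
def passing_score(total_weight, base=5.0, slope=0.8):
    """Threshold from the report's configuration: base + slope * total weight."""
    return base + slope * total_weight

def total_score(rows):
    """ASSUMPTION: each keyword contributes weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)

# (weight, relevance) pairs as listed in the table: five non-zero relevances,
# the remaining 22 keywords at 0.0, and every weight shown as 0.0.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 10.0), (0.0, 10.0)]
rows += [(0.0, 0.0)] * 22

threshold = passing_score(27.0)
score = total_score(rows)
print(round(threshold, 1), score, score >= threshold)
```

Under this assumed rule the total cannot exceed 0.0 regardless of relevance, which matches the 0.0 / 26.6 result shown above.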

!!! tip deepseek-chat TL;DR

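The paper addresses the inability of multimodal language models to make effective use of tool-generated visual cues in visual reasoning tasks. It proposes Perception Programs, a training-free method that converts tool outputs into structured, language-native summaries and significantly improves performance on multiple vision tasks.

Abstract translation

Multimodal language models (MLLMs) are increasingly paired with vision tools (such as depth, optical flow, and correspondence) to enhance visual reasoning. However, despite having access to the visual cues these tools generate, MLLMs often fail to use them effectively. Existing approaches typically feed raw tool outputs directly into the model, but these dense, pixel-level representations are mismatched with the language-native reasoning strengths of LLMs, leading to weak perception and over-reliance on language priors. We argue that, for problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs but how tool outputs are represented. We propose Perception Programs (P$^2$), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in the BLINK benchmark, P$^2$ consistently yields large improvements over both base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P$^2$ raises accuracy from 41.35% to 86.47% on multi-view reasoning and from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs (such as InternVL3.5-4B and Qwen3VL-4B), P$^2$ delivers 15-40% absolute gains, surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.

Abstract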

Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs; it is how tool outputs are represented. We introduce Perception Programs (P$^2$), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P$^2$ consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P$^2$ raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40% absolute gains from P$^2$, surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.

Keywords: Multimodal Language Models, Visual Reasoning, Tool Use, Perception Programs, Vision Tools, Language-native Summaries, Training-free Method, State-of-the-art Results
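
The idea of rewriting a dense tool output into a compact, language-native summary can be illustrated for the relative depth task. Everything specific in this sketch is hypothetical: the summary format, the function name, the toy depth grid, and the convention that a smaller value means closer to the camera are assumptions and do not reproduce the paper's actual Perception Program format.

```python
def depth_summary(depth, points):
    """Turn a depth grid and named pixel locations into a short text summary.

    ASSUMPTION: smaller depth values mean closer to the camera.
    """
    lines = []
    readings = {}
    for name, (row, col) in points.items():
        readings[name] = depth[row][col]
        lines.append(f"point {name}: row={row}, col={col}, depth={depth[row][col]:.2f}")
    nearest = min(readings, key=readings.get)
    lines.append(f"closest point: {nearest}")
    return "\n".join(lines)

# Toy 3x3 depth map standing in for a depth tool's dense output.
depth = [
    [4.0, 4.1, 4.2],
    [2.0, 2.1, 3.9],
    [1.0, 1.2, 3.5],
]

summary = depth_summary(depth, {"A": (2, 0), "B": (0, 2)})
print(summary)
```

A text block like this could be placed in the model's prompt in place of the raw depth image, so the model reasons over a few explicit numbers rather than pixels.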

211. ❌ VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

Authors: Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, R Devon Hjelm, David Griffiths, Peter Fu, Afshin Dehghan, Amir Zamir Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12887v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

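Rationale: The paper focuses on video representation learning and proposes VideoFlexTok, a new video tokenization method that represents videos with coarse-to-fine, variable-length token sequences. Although it involves generative models (text-to-video) and model efficiency (a 5x smaller model), the keywords all target large language models (LLMs) and related techniques (such as MoE, RLHF, and RAG), whereas this paper studies video tokenization and generation in the vision domain and does not involve any language model techniques. All keywords are therefore irrelevant and scored 0.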

!!! tip deepseek-chat TL;DR

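The paper proposes VideoFlexTok, a coarse-to-fine, variable-length video tokenization method whose structured token sequences yield an efficient video representation. On text-to-video generation it reaches comparable quality with a 5x smaller model, and it supports long video generation at substantially lower computational cost.

Abstract translation

Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, a tokenizer determines what information is preserved and how it is organized. The de facto standard approach to video tokenization represents a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens (for example, a text-to-video model) to learn to predict all low-level details "pixel by pixel" regardless of the video's inherent complexity, which keeps learning complexity high.
We present VideoFlexTok, which represents videos with a variable-length token sequence structured in a coarse-to-fine manner: the first tokens (emergently) capture abstract information such as semantics and motion, while later tokens add fine-grained details. Its generative flow decoder enables realistic video reconstruction from any number of tokens. This structure allows the token count to be adapted to downstream needs and allows videos longer than the baselines' to be encoded within the same budget.
We evaluate VideoFlexTok on class-to-video and text-to-video generation tasks and show that it enables more efficient training than 3D grid tokens, for example reaching comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B versus 5.2B parameters). Finally, we show how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second, 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.

Abstract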

Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details “pixel-by-pixel” irrespective of the video’s inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner – where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count according to downstream needs and encoding videos longer than the baselines with the same budget. We evaluate VideoFlexTok on class- and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.

Keywords: video tokenization, coarse-to-fine representation, variable-length tokens, generative flow decoder, text-to-video generation, model efficiency, long video generation, computational cost reduction
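
The coarse-to-fine ordering described in the abstract implies that a shorter representation is simply a prefix of the full token sequence. The Python sketch below illustrates that idea together with the token-budget arithmetic. The function name `truncate_tokens` and the placeholder token values are hypothetical, and the grid-token estimate assumes the reported "8x fewer" is an exact factor, which the abstract does not state.

```python
def truncate_tokens(tokens, budget):
    """Keep the first `budget` tokens of a coarse-to-fine sequence.

    Early tokens are assumed to carry coarse content, so a prefix is
    still a usable (lower-detail) representation.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return tokens[:budget]


# Figures reported in the abstract.
FLEX_TOKENS = 672   # tokens used for a 10-second clip
FRAMES = 81         # frames in that clip
REDUCTION = 8       # "8x fewer"; treated as exact here (assumption)

# Derived estimates, not numbers reported by the paper.
grid_tokens_estimate = FLEX_TOKENS * REDUCTION
tokens_per_frame = FLEX_TOKENS / FRAMES

# Placeholder integers stand in for real token ids.
coarse_only = truncate_tokens(list(range(FLEX_TOKENS)), 64)
```

Under these assumptions the comparable 3D grid would need roughly 5,376 tokens for the same clip, and VideoFlexTok averages about 8.3 tokens per frame.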

212. ❌ PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination

Authors: Xuan Wang, Kai Ruan, Jiayi Han, kaiyue Zhou, Gaoang Wang Journal/source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12856v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

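Scoring rationale: PianoFlow addresses audio-driven piano motion generation, using a flow-matching framework, MIDI modality distillation, a role-gated interaction module, and an autoregressive flow continuation scheme. Although it is an AI application (music-driven motion generation), its content is unrelated to any of the scoring keywords, which all concern large-model principles, training methods, inference optimization, alignment, agent systems, and so on, and it involves none of the specific techniques or concepts they name.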

!!! tip deepseek-chat TL;DR

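The paper proposes PianoFlow, a framework that combines flow matching, MIDI modality distillation, and a role-gated interaction module to address musical-structure modeling and bimanual coordination in audio-driven piano motion generation. It delivers high-quality, real-time streaming generation and speeds up inference by more than 9x.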

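Abstract translation

Audio-driven bimanual piano motion generation requires precise modeling of complex musical structure and dynamic coordination between the two hands. Existing methods, however, usually rely on purely acoustic representations that lack symbolic priors, use rigid interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise, coordinated bimanual piano motion synthesis. Our method strategically uses MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while keeping inference audio-only. We also introduce an asymmetric role-gated interaction module that explicitly captures dynamic bimanual coordination through role-aware attention and temporal gating. To enable real-time streaming generation of arbitrarily long sequences, we design an autoregressive flow continuation scheme that keeps the output temporally coherent across chunks. Extensive experiments on the PianoMotion10M dataset show that PianoFlow achieves superior quantitative and qualitative performance while running inference more than 9x faster than previous methods.

Abstract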

Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9× compared to previous methods.

Keywords: Piano motion generation, Flow-matching framework, MIDI modality distillation, Bimanual coordination, Asymmetric role-gated interaction, Autoregressive flow continuation, Real-time streaming generation, PianoMotion10M dataset

213. ❌ Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

Authors: Yingying Zhao, Chengyin Hu, Qike Zhang, Xin Li, Xin Wang, Yiwei Wei, Jiujiang Guo, Jiahuan Long, Tingsong Jiang, Wen Yao Journal/source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12833v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

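Scoring rationale: The paper studies physical adversarial attacks on vision-language models (VLMs), which falls under large-model security, but the keywords all target language models (LLMs) rather than VLMs. Only "Hallucination Mitigation OR Factuality OR Truthfulness" is partially relevant (5 points), because the paper reports that the attack induces semantic hallucinations, although this is not its core subject. All other keywords are unrelated to the paper (0 points).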

!!! tip deepseek-chat TL;DR

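The paper proposes Multimodal Semantic Lighting Attacks (MSLA), presented as the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding, which degrades zero-shot classification and induces semantic hallucinations, and it exposes the security vulnerability of VLMs in the physical world.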

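Abstract translation

Vision-Language Models (VLMs) have shown remarkable performance, but their security is still not well understood. Existing adversarial research focuses almost entirely on digital settings, leaving physical-world threats largely unexplored. As VLMs are deployed more widely in real environments, this gap becomes critical, because adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks on VLMs have not been studied systematically. Such attacks can cause recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Studying physical attacks on VLMs is therefore essential for assessing their real-world security risks. To fill this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, targeting semantic alignment rather than only task-specific outputs. As a result, it not only degrades the zero-shot classification performance of mainstream CLIP variants but also induces severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP on image captioning and Visual Question Answering (VQA). Extensive experiments in both digital and physical domains show that MSLA is effective, transferable, and practically feasible. Our study provides the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposes a previously overlooked robustness gap, and underscores the urgent need for physical-world robustness evaluation of VLMs.

Abstract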

Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Such attacks may induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Therefore, investigating physical attacks on VLMs is essential for assessing their real-world security risks. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, attacking semantic alignment rather than only task-specific outputs. Consequently, it degrades zero-shot classification performance of mainstream CLIP variants while inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering (VQA). Extensive experiments in both digital and physical domains demonstrate that MSLA is effective, transferable, and practically realizable. Our findings provide the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.

Keywords: Vision-Language Models, Physical Adversarial Attacks, Multimodal Semantic Lighting Attacks, Semantic Hallucinations, Zero-shot Classification, Image Captioning, Visual Question Answering, Robustness Evaluation

214. ❌ DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment

Authors: Xinyue Li, Shubo Xu, Zhichao Zhang, Zhaolin Cai, Yitong Chen, Guangtao Zhai Journal/source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12813v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

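Scoring rationale: DPC-VQA is a decoupled perception and calibration framework for video quality assessment (VQA). Its core idea is to use a frozen multimodal large language model (MLLM) as a perceptual prior and to apply residual corrections from a lightweight calibration branch to adapt to the target scenario. The work is relevant to the following keywords: 1) "Large Language Models OR LLMs OR Foundation Models" (8 points): the paper explicitly uses an MLLM as its foundation model, which falls within the large-model category. 2) "Post-training OR Supervised Fine-tuning OR SFT" (8 points): the core method calibrates a pretrained MLLM to the target MOS space through a separately trained branch while the MLLM itself stays frozen, which the scoring treats as an efficient form of supervised fine-tuning. 3) "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (8 points): by freezing the MLLM backbone and training only the lightweight calibration branch (less than 2% of the trainable parameters of conventional MLLM-based VQA methods), the paper cuts training cost substantially, in line with the core idea of parameter-efficient fine-tuning (PEFT). 4) "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points): the paper builds on a pretrained MLLM and adapts it to a specific domain (VQA), which relates to domain adaptation. 5) "AI for Science OR Bioinformatics OR Cheminformatics" (5 points): video quality assessment can be seen as an application of AI to multimedia analysis; it is neither bioinformatics nor cheminformatics but is counted under AI for Science in a broad sense. The remaining keywords (MoE, quantization, inference acceleration, RAG, and so on) are not addressed in the paper and score 0.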

!!! tip deepseek-chat TL;DR

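MLLM-based video quality assessment requires large-scale retraining and costly annotation. To address this, the paper proposes DPC-VQA, a decoupled perception and calibration framework that freezes the MLLM backbone and trains a lightweight calibration branch. It achieves competitive performance on user-generated content (UGC) and AI-generated content (AIGC) benchmarks with far fewer trainable parameters and annotations.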

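Abstract translation

Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA). Adapting them to new scenarios remains expensive, however, because it requires large-scale retraining and costly mean opinion score (MOS) annotations. This paper argues that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is calibrating this prior to the target MOS space efficiently. Based on this insight, we propose DPC-VQA, a decoupled perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and uses a lightweight calibration branch to predict a residual correction for adaptation to the target scenario. This design avoids costly end-to-end retraining while maintaining reliable performance at lower training and data cost. Extensive experiments on user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA is competitive with representative baselines while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods, and that it remains effective with only 20% of the MOS labels. The code will be released upon publication.

Abstract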

Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupling perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.

Keywords: Video Quality Assessment, Multimodal Large Language Models, Parameter-efficient Fine-tuning, Decoupling Framework, Residual Calibration, Perceptual Prior, Mean Opinion Score, Lightweight Adaptation
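
The decoupling described in the abstract, a frozen model supplying a base estimate and a small trainable branch predicting a residual correction, can be illustrated with a toy Python sketch. This is not the authors' architecture: the class name `ResidualCalibrator` is hypothetical, a one-feature least-squares fit stands in for the calibration branch, and the base scores and MOS labels are made-up numbers.

```python
class ResidualCalibrator:
    """Toy calibration branch: final = base + predicted residual.

    The frozen base model is never updated; only `slope` and
    `intercept` (two parameters) are fitted.
    """

    def __init__(self):
        self.slope = 0.0
        self.intercept = 0.0

    def fit(self, base_scores, target_mos):
        n = len(base_scores)
        if n == 0 or n != len(target_mos):
            raise ValueError("need equally sized, non-empty inputs")
        # The branch learns the gap between target labels and base output.
        residuals = [t - b for b, t in zip(base_scores, target_mos)]
        mean_x = sum(base_scores) / n
        mean_r = sum(residuals) / n
        var_x = sum((x - mean_x) ** 2 for x in base_scores)
        cov = sum((x - mean_x) * (r - mean_r)
                  for x, r in zip(base_scores, residuals))
        # Ordinary least squares with the base score as the only feature.
        self.slope = cov / var_x if var_x > 0 else 0.0
        self.intercept = mean_r - self.slope * mean_x
        return self

    def predict(self, base_score):
        residual = self.slope * base_score + self.intercept
        return base_score + residual


# Made-up data: base scores in [0, 1], target MOS on a 1-5 scale.
base = [0.2, 0.4, 0.6, 0.8]
mos = [1.8, 2.6, 3.4, 4.2]

calibrator = ResidualCalibrator().fit(base, mos)
calibrated = calibrator.predict(0.5)  # maps a base score into MOS space
```

In this toy data the target is an exact linear function of the base score, so two fitted parameters recover it; a real calibration branch would take richer features than the base score alone.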

215. ❌ Image-to-Image Translation Framework Embedded with Rotation Symmetry Priors

Authors: Feiyu Tan, Heran Yang, Qihong Duan, Kai Ye, Qi Xie, Deyu Meng Journal/source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12805v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

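Scoring rationale: The paper addresses image-to-image translation in computer vision and proposes an equivariant convolutional network framework with embedded rotation symmetry priors. Although it involves deep learning, the keywords all explicitly target large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on), whereas this work studies convolutional neural networks for image processing and involves no language models, text generation, or large-model techniques. "AI for Science" is a broad keyword, but the paper's computer-vision application does not fall under the bioinformatics or cheminformatics subfields it names, so it is also scored as irrelevant.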

!!! tip deepseek-chat TL;DR

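The paper proposes an equivariant convolutional network framework with embedded rotation symmetry priors to address the lack of paired data in image-to-image translation. Theoretical analysis and experiments indicate that the method preserves rotation symmetry and improves generation quality.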

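Abstract translation

Image-to-image translation (I2I) is a fundamental task in computer vision. Its goal is to map an input image from a source domain to a corresponding image in a target domain while preserving domain-invariant features and adapting domain-specific attributes. Although deep learning-based I2I methods have been remarkably successful, the lack of paired data and the reliance on unsupervised learning frameworks still limit their effectiveness. In this work, we address this challenge by incorporating transformation symmetry priors into image-to-image translation networks. Specifically, we introduce rotation group equivariant convolutions to build a rotation-equivariant I2I framework, which is, to the best of our knowledge, a novel contribution in this research direction. The design ensures that rotation symmetry, one of the most intrinsic and domain-invariant properties of natural and scientific images, is preserved throughout the network. In addition, we conduct a systematic study of image symmetry priors on real datasets and propose a novel Transformation Learnable Equivariant Convolution (TL-Conv) that adaptively learns transformation groups, improving symmetry preservation across different datasets. We also give a theoretical analysis of the equivariance error of TL-Conv, proving that it is exactly equivariant in the continuous domain and bounding the error in the discrete case. Extensive experiments on a range of I2I tasks validate the effectiveness and superior performance of the proposed method, highlighting the potential of equivariant networks to improve generation quality and their broad applicability. Our code is available at https://github.com/tanfy929/Equivariant-I2I.

Abstract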

Image-to-image translation (I2I) is a fundamental task in computer vision, focused on mapping an input image from a source domain to a corresponding image in a target domain while preserving domain-invariant features and adapting domain-specific attributes. Despite the remarkable success of deep learning-based I2I approaches, the lack of paired data and unsupervised learning framework still hinder their effectiveness. In this work, we address the challenge by incorporating transformation symmetry priors into image-to-image translation networks. Specifically, we introduce rotation group equivariant convolutions to achieve rotation equivariant I2I framework, a novel contribution, to the best of our knowledge, along this research direction. This design ensures the preservation of rotation symmetry, one of the most intrinsic and domain-invariant properties of natural and scientific images, throughout the network. Furthermore, we conduct a systematic study on image symmetry priors on real dataset and propose a novel transformation learnable equivariant convolutions (TL-Conv) that adaptively learns transformation groups, enhancing symmetry preservation across diverse datasets. We also provide a theoretical analysis of the equivariance error of TL-Conv, proving that it maintains exact equivariance in continuous domains and provide a bound for the error in discrete cases. Through extensive experiments across a range of I2I tasks, we validate the effectiveness and superior performance of our approach, highlighting the potential of equivariant networks in enhancing generation quality and its broad applicability. Our code is available at https://github.com/tanfy929/Equivariant-I2I

Keywords: Image-to-image translation, Rotation symmetry priors, Equivariant convolutions, Transformation learnable equivariant convolutions, Unsupervised learning, Computer vision, Deep learning, Symmetry preservation
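
Rotation equivariance means that rotating the input and then applying the layer gives the same result as applying the layer and then rotating the output. The Python sketch below checks this property in its simplest special case: plain cross-correlation with a kernel that is itself unchanged by 90-degree rotations. It is not the paper's group-equivariant convolution or TL-Conv, and the helper names and test image are hypothetical.

```python
def rot90(matrix):
    """Rotate a 2D list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*matrix)][::-1]


def correlate_valid(image, kernel):
    """2D cross-correlation without padding ('valid' output size)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def is_equivariant(image, kernel):
    """Check f(rot(x)) == rot(f(x)) for f = correlation with `kernel`."""
    rotate_then_filter = correlate_valid(rot90(image), kernel)
    filter_then_rotate = rot90(correlate_valid(image, kernel))
    return rotate_then_filter == filter_then_rotate


# Integer test image, so equality checks are exact.
image = [[i * 5 + j for j in range(5)] for i in range(5)]

# Kernel invariant under 90-degree rotations.
symmetric_kernel = [[0, 1, 0],
                    [1, 2, 1],
                    [0, 1, 0]]

# Kernel that changes when rotated.
corner_kernel = [[1, 0, 0],
                 [0, 0, 0],
                 [0, 0, 0]]

symmetric_ok = is_equivariant(image, symmetric_kernel)
corner_ok = is_equivariant(image, corner_kernel)
```

The symmetric kernel passes the check and the corner kernel fails it, which shows why an ordinary convolution with unconstrained weights does not preserve rotation symmetry by default.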

216. ❌ Generative Anonymization in Event Streams

Authors: Adam T. Müller, Mihai Kocsis, Nicolaj C. Stache Journal/source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12803v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

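Scoring rationale: The paper studies a generative anonymization framework for event streams from neuromorphic vision sensors, which belongs to computer vision and privacy protection. It is unrelated to most of the large-model keywords (LLMs, MoE, RLHF, and so on). Only two keywords are weakly relevant: 1. "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points): the paper uses pretrained models to synthesize identities but does not examine pre-training techniques themselves. 2. "AI for Science OR Bioinformatics OR Cheminformatics" (5 points): the paper applies AI in a scientific field (neuromorphic vision), though not in bioinformatics or cheminformatics. No other keyword is addressed. The low weighted total reflects the mismatch between the paper's topic and the large-model focus of the keywords.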

!!! tip deepseek-chat TL;DR

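The paper proposes a generative anonymization framework for event streams from neuromorphic vision sensors. It uses pretrained models to synthesize non-existent identities, protecting privacy while preserving data integrity, and it introduces a new synchronized event and RGB dataset for evaluation.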

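Abstract translation

Neuromorphic vision sensors offer low latency and high dynamic range, but deploying them in public spaces raises serious data protection concerns. Recent Event-to-Video (E2V) models can reconstruct high-fidelity intensity images from sparse event streams, which inadvertently exposes people's identities. Existing obfuscation methods, such as masking or scrambling, corrupt the spatio-temporal structure and severely reduce the usefulness of the data for downstream perception tasks. This paper presents the first generative anonymization framework for event streams to resolve this trade-off between utility and privacy. By bridging the modality gap between asynchronous events and standard spatial generative models, our pipeline projects events into an intermediate intensity representation, uses pretrained models to synthesize realistic, non-existent identities, and re-encodes the features back into the neuromorphic domain. Experiments show that our method effectively prevents identity recovery from E2V reconstructions while preserving the structural data integrity that downstream vision tasks require. Finally, to support rigorous evaluation, we introduce a novel synchronized real-world event and RGB dataset captured along precise robotic trajectories, providing a reliable benchmark for future research on privacy-preserving neuromorphic vision.

Abstract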

Neuromorphic vision sensors offer low latency and high dynamic range, but their deployment in public spaces raises severe data protection concerns. Recent Event-to-Video (E2V) models can reconstruct high-fidelity intensity images from sparse event streams, inadvertently exposing human identities. Current obfuscation methods, such as masking or scrambling, corrupt the spatio-temporal structure, severely degrading data utility for downstream perception tasks. In this paper, to the best of our knowledge, we present the first generative anonymization framework for event streams to resolve this utility-privacy trade-off. By bridging the modality gap between asynchronous events and standard spatial generative models, our pipeline projects events into an intermediate intensity representation, leverages pretrained models to synthesize realistic, non-existent identities, and re-encodes the features back into the neuromorphic domain. Experiments demonstrate that our method reliably prevents identity recovery from E2V reconstructions while preserving the structural data integrity required for downstream vision tasks. Finally, to facilitate rigorous evaluation, we introduce a novel, synchronized real-world event and RGB dataset captured via precise robotic trajectories, providing a robust benchmark for future research in privacy-preserving neuromorphic vision.

Keywords: generative anonymization, event streams, neuromorphic vision, privacy-preserving, Event-to-Video models, data utility, pretrained models, synchronized dataset

217. ❌ Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images

Authors: Haoyang Jiang, Mingyang Yi, Shaolei Zhang, Junxian Cai, Qingbin Liu, Xi Chen, Ju Fan Journal/source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12781v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

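Scoring rationale: The paper studies the adversarial vulnerability of detectors for diffusion-generated images, in particular reconstruction-based methods, and belongs to computer vision, image forensics, and adversarial machine learning. The scoring keywords all center on large language models (LLMs) and related techniques (training, alignment, reasoning, applications, and so on), and the paper involves no language models, text generation, or natural language processing. Its core is the security evaluation of image detectors, which has no direct connection to LLM techniques, so every keyword receives a relevance of 0.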

!!! tip deepseek-chat TL;DR

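The paper shows that reconstruction-based detectors for diffusion-generated images are severely vulnerable to adversarial perturbations: small perturbations can drive detection accuracy to near zero, the attacks transfer across detectors, and existing defenses provide only limited mitigation.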

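Abstract translation

Recently, detecting AI-generated images produced by diffusion models has drawn increasing attention because of the potential threat such images pose to safety. Among existing approaches, reconstruction-based detection has become a prominent paradigm for this task. We find, however, that these methods are severely vulnerable to adversarial perturbations: adding imperceptible adversarial perturbations to the input images causes the classifier's detection accuracy to collapse to near zero. To verify this threat, we systematically evaluate the adversarial robustness of three representative detectors across four different generative backbone models. First, we construct adversarial attacks in the white-box setting and show that the performance of all well-trained detectors degrades markedly. We further find that these attacks are transferable: attacks crafted against one detector transfer to others, which indicates that adversarial attacks can also be constructed in the black-box setting. Finally, we evaluate common countermeasures and find that standard defenses against adversarial attacks provide only limited mitigation. We attribute these failures to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. Overall, our study reveals fundamental security limitations of reconstruction-based detectors and highlights the need to rethink existing detection strategies.

Abstract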

Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods exhibit severe security vulnerabilities to adversarial perturbations; that is, by adding imperceptible adversarial perturbations to input images, the detection accuracy of classifiers collapses to near zero. To verify this threat, we present a systematic evaluation of the adversarial robustness of three representative detectors across four diverse generative backbone models. First, we construct adversarial attacks in white-box scenarios, which degrade the performance of all well-trained detectors. Moreover, we find that these attacks demonstrate transferability; specifically, attacks crafted against one detector can be transferred to others, indicating that adversarial attacks on detectors can also be constructed in a black-box setting. Finally, we assess common countermeasures and find that standard defense methods against adversarial attacks provide limited mitigation. We attribute these failures to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. Overall, our results reveal fundamental security limitations of reconstruction-based detectors and highlight the need to rethink existing detection strategies.

Keywords: Adversarial Vulnerability, Reconstruction-based Detectors, Diffusion-generated Images, Adversarial Attacks, Transferability, Security Limitations, Image Detection, AI-generated Content

218. ❌ A Multi-Agent Feedback System for Detecting and Describing News Events in Satellite Imagery

Authors: Madeline Anderson, Mikhail Klassen, Ash Hoover, Kerri Cahoy Journal/source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12772v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

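Scoring rationale: The paper proposes SkyScraper, an iterative multi-agent workflow that geocodes news articles and generates captions for the corresponding satellite image sequences. Its core contribution lies in the multi-agent system (Multi-agent Systems/Agent Coordination) and the agentic workflow (LLM Agents/Agentic Workflow), so it is highly relevant to these two keywords (10 points each). The paper applies AI to satellite imagery and news data, which is counted under AI for Science (8 points). The other keywords mainly concern large-model principles, training methods, and inference optimization, which the paper does not address directly, so they score 0.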

!!! tip deepseek-chat TL;DR

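The study addresses the lack of multi-temporal event captioning datasets for satellite imagery. It develops SkyScraper, a multi-agent workflow that automatically geocodes news articles and generates captions for satellite image sequences. SkyScraper finds 5x more events than traditional geocoding methods and is used to build a new dataset of 5,000 sequences.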

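Abstract translation

Changes in satellite imagery often unfold over multiple time steps. Although bi-temporal change captioning datasets have begun to appear, remote sensing still lacks multi-temporal event captioning datasets (with at least two images per sequence). The gap exists because (1) searching satellite imagery for visible events and (2) annotating multi-temporal sequences both take substantial time and labor. To address these challenges, we propose SkyScraper, an iterative multi-agent workflow that geocodes news articles and generates captions for the corresponding satellite image sequences. Experiments show that SkyScraper finds 5x more events than traditional geocoding methods, which demonstrates that agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery. We apply the framework to a large database of global news articles and build a new multi-temporal captioning dataset of 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting.

Abstract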

Changes in satellite imagery often occur over multiple time steps. Despite the emergence of bi-temporal change captioning datasets, there is a lack of multi-temporal event captioning datasets (at least two images per sequence) in remote sensing. This gap exists because (1) searching for visible events in satellite imagery and (2) labeling multi-temporal sequences require significant time and labor. To address these challenges, we present SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences. Our experiments show that SkyScraper successfully finds 5x more events than traditional geocoding methods, demonstrating that agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery. We apply our framework to a large database of global news articles, curating a new multi-temporal captioning dataset with 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting efforts.

Keywords: multi-agent system, satellite imagery, news events, geocoding, captioning dataset, remote sensing, agentic workflow, multi-temporal events

219. ❌ A Dataset and Evaluation for Complex 4D Markerless Human Motion Capture

Authors: Yeeun Park, Miqdad Naduthodi, Suryansh Kumar Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12765v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

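Scoring rationale: The paper focuses on building and evaluating a dataset for 4D markerless human motion capture in computer vision, which is essentially unrelated to the listed large-model and deep-learning keywords. The only related keyword is 'Post-training OR Supervised Fine-tuning OR SFT', because the paper mentions that 'fine-tuning improves generalization'; however, this is not a core contribution, only one of the ways the dataset's value is validated, so it receives 5 points (some relevance). No other keyword is addressed, so all others score 0.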
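
The zero total above follows from the weights rather than from the relevance ratings. The sketch below works through the arithmetic. It assumes each row's score is weight × relevance; the report does not state the per-row formula, so `paper_score` is hypothetical. The pass mark formula (5.0 + 0.8 × total weight) and the total weight of 27 come from the report's configuration section.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report configuration: 5.0 + 0.8 x total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum of per-keyword scores; weight * relevance is an assumed formula."""
    return sum(weight * relevance for weight, relevance in rows)


# 27 keyword rows as (weight, relevance): one row rated 5.0, the rest 0.0,
# all carrying the 0.0 weight shown in the score table.
rows = [(0.0, 5.0)] + [(0.0, 0.0)] * 26

total = paper_score(rows)
threshold = passing_score(27.0)
passed = total >= threshold
print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

With a weight of 0.0 on every row, any multiplicative scoring rule yields a total of 0.0, which is below the 26.6 pass mark regardless of the relevance ratings.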

!!! tip deepseek-chat TL;DR

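The paper presents a new dataset and evaluation for complex 4D markerless human motion capture. Using data from complex real-world interaction scenarios, it exposes the limitations of existing models and shows that targeted fine-tuning improves generalization.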

Abstract

Marker-based motion capture (MoCap) systems have long been the gold standard for accurate 4D human modeling, yet their reliance on specialized hardware and markers limits scalability and real-world deployment. Advancing reliable markerless 4D human motion capture requires datasets that reflect the complexity of real-world human interactions. Yet, existing benchmarks often lack realistic multi-person dynamics, severe occlusions, and challenging interaction patterns, leading to a persistent domain gap. In this work, we present a new dataset and evaluation for complex 4D markerless human motion capture. Our proposed MoCap dataset captures both single and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. It includes synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion capture from a Vicon system, and corresponding SMPL/SMPL-X parameters. This setup ensures precise alignment between visual observations and motion ground truth. Benchmarking state-of-the-art markerless MoCap models reveals substantial performance degradation under these realistic conditions, highlighting limitations of current approaches. We further demonstrate that targeted fine-tuning improves generalization, validating the dataset’s realism and value for model development. Our evaluation exposes critical gaps in existing models and provides a rigorous foundation for advancing robust markerless 4D human motion capture.

Keywords: 4D human motion capture, markerless motion capture, multi-person dynamics, occlusion handling, SMPL/SMPL-X parameters, dataset evaluation, fine-tuning generalization, real-world human interactions

220. ❌ Scaling In-Context Segmentation with Hierarchical Supervision

Authors: T. Camaret Ndir, Marco Reisert, Robin T. Schirrmeister Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12752v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

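Scoring rationale: The paper focuses on in-context learning (ICL) for medical image segmentation. It is highly relevant to 'In-context Learning OR Many-shot Learning' (10 points), since ICL is the paper's core technique, and it falls under 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points) because it targets medical image analysis. The remaining keywords mainly concern large language model (LLM) techniques, alignment, reasoning, and agents, which are unrelated to the paper's focus on computer vision and medical image segmentation, so they all score 0.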

!!! tip deepseek-chat TL;DR

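The paper proposes PatchICL, a hierarchical framework that uses selective image patching and multi-level supervision to address the computational inefficiency of in-context learning methods for medical image segmentation. It reduces compute by 44% while maintaining competitive segmentation accuracy.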

Abstract

In-context learning (ICL) enables medical image segmentation models to adapt to new anatomical structures from limited examples, reducing the clinical annotation burden. However, standard ICL methods typically rely on dense, global cross-attention, which scales poorly with image resolution. While recent approaches have introduced localized attention mechanisms, they often lack explicit supervision on the selection process, leading to redundant computation in non-informative regions. We propose PatchICL, a hierarchical framework that combines selective image patching with multi-level supervision. Our approach learns to actively identify and attend only to the most informative anatomical regions. Compared to UniverSeg, a strong global-attention baseline, PatchICL achieves competitive in-domain CT segmentation accuracy while reducing compute by 44% at $512\times512$ resolution. On 35 out-of-domain datasets spanning diverse imaging modalities, PatchICL outperforms the baseline on 6 of 13 modality categories, with particular strength on modalities dominated by localized pathology such as OCT and dermoscopy. Training and evaluation code are available at https://github.com/tidiane-camaret/ic_segmentation
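
The selective patching idea can be illustrated with a small sketch. PatchICL learns which regions to attend to under multi-level supervision; the version below instead ranks patches by pixel variance, a stand-in heuristic chosen only for illustration. The function name `select_patches`, the variance score, and the toy image are assumptions, not part of the paper or its repository.

```python
from statistics import pvariance


def select_patches(image, patch, k):
    """Return (row, col) grid coordinates of the k highest-variance patches."""
    rows, cols = len(image) // patch, len(image[0]) // patch
    scored = []
    for r in range(rows):
        for c in range(cols):
            pixels = [
                image[r * patch + i][c * patch + j]
                for i in range(patch)
                for j in range(patch)
            ]
            scored.append((pvariance(pixels), (r, c)))
    # Stable sort: patches with equal scores keep row-major order.
    scored.sort(key=lambda item: -item[0])
    return [coord for _, coord in scored[:k]]


# Toy 8x8 image: flat background with one textured 4x4 region (bottom-left).
image = [[0.0] * 8 for _ in range(8)]
for i in range(4, 8):
    for j in range(0, 4):
        image[i][j] = float((i + j) % 2)

top = select_patches(image, patch=4, k=1)
print(top)
```

In a scheme of this kind, only the selected patches would be passed on to the attention layers, so the cost scales with `k` rather than with the full image resolution.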

Keywords: In-context learning, Medical image segmentation, Hierarchical supervision, Selective image patching, Compute efficiency, Cross-modality generalization, PatchICL, Anatomical structures

221. ❌ Risk-Calibrated Learning: Minimizing Fatal Errors in Medical AI

Authors: Abolfazl Mohammadi-Seif, Ricardo Baeza-Yates Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12693v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

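Scoring rationale: The paper focuses on the safety and error calibration of deep learning models in medical AI, particularly fatal errors in medical image classification (such as classifying a malignant tumor as benign). It proposes Risk-Calibrated Learning, which embeds a clinical severity matrix to distinguish visual ambiguity from catastrophic structural errors, and validates the method on several medical imaging datasets. Its central theme is the application of deep learning to medicine (specifically medical image classification and error mitigation), which is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' because medical AI is an important subfield of AI for Science. However, the paper does not cover large models (LLMs), MoE, SLMs, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context window extension, attention optimization, reasoning techniques, agent systems, model compression, inference acceleration, hallucination mitigation, interpretability, world models, model merging, or in-context learning, so those keywords receive a relevance score of 0.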

!!! tip deepseek-chat TL;DR

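The paper targets high-confidence, semantically incoherent errors made by deep learning models in medical AI (such as classifying a malignant tumor as benign). It proposes Risk-Calibrated Learning, which optimizes a loss function that embeds a clinical severity matrix, and reduces the critical error rate on several medical imaging datasets, with relative safety improvements ranging from 20.0% to 92.4%.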

Abstract

Deep learning models often achieve expert-level accuracy in medical image classification but suffer from a critical flaw: semantic incoherence. These high-confidence mistakes that are semantically incoherent (e.g., classifying a malignant tumor as benign) fundamentally differ from acceptable errors which stem from visual ambiguity. Unlike safe, fine-grained disagreements, these fatal failures erode clinical trust. To address this, we propose Risk-Calibrated Learning, a technique that explicitly distinguishes between visual ambiguity (fine-grained errors) and catastrophic structural errors. By embedding a confusion-aware clinical severity matrix M into the optimization landscape, our method suppresses critical errors (false negatives) without requiring complex architectural changes. We validate our approach in four different imaging modalities: Brain Tumor MRI, ISIC 2018 (Dermoscopy), BreaKHis (Breast Histopathology), and SICAPv2 (Prostate Histopathology). Extensive experiments demonstrate that our Risk-Calibrated Loss consistently reduces the Critical Error Rate (CER) for all four datasets, achieving relative safety improvements ranging from 20.0% (on breast histopathology) to 92.4% (on prostate histopathology) compared to state-of-the-art baselines such as Focal Loss. These results confirm that our method offers a superior safety-accuracy trade-off across both CNN and Transformer architectures.
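
One way to picture a severity matrix inside a loss is to add the expected clinical severity of the predicted distribution to the usual cross-entropy term. The sketch below does that for a single sample. It is an illustrative assumption, not the authors' Risk-Calibrated Loss: the penalty form, the weight `lam`, the class layout, and the matrix values are all hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def severity_weighted_loss(logits, label, severity, lam=1.0):
    """Cross-entropy plus lam * expected severity of the predicted classes."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[label])
    expected_severity = sum(s * p for s, p in zip(severity[label], probs))
    return cross_entropy + lam * expected_severity


# Hypothetical classes: 0 = benign A, 1 = benign B, 2 = malignant.
# severity[true][predicted]; calling a malignant case benign costs the most.
severity = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 2.0],
    [5.0, 5.0, 0.0],
]

# True class is benign A in both cases, with equal probability on the truth.
fine_grained = severity_weighted_loss([1.0, 3.0, 0.0], 0, severity)  # leans to benign B
false_alarm = severity_weighted_loss([1.0, 0.0, 3.0], 0, severity)   # leans to malignant

# True class is malignant, but the model leans towards benign A.
missed_cancer = severity_weighted_loss([3.0, 0.0, 1.0], 2, severity)
plain_ce = severity_weighted_loss([3.0, 0.0, 1.0], 2, severity, lam=0.0)
```

With `lam=0.0` the function reduces to plain cross-entropy, so the two benign-A cases would be penalized equally; the severity term is what separates a fine-grained confusion from a clinically costly one.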

Keywords: Risk-Calibrated Learning, Medical AI, Deep Learning, Medical Image Classification, Fatal Errors, Clinical Severity Matrix, Critical Error Rate, Safety-Accuracy Trade-off

222. ❌ Brain-DiT: A Universal Multi-state fMRI Foundation Model with Metadata-Conditioned Pretraining

Authors: Junfeng Xia, Wenhao Ye, Xuanye Pan, Xinke Shen, Mo Wang, Quanying Liu Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12683v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

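Scoring rationale: The paper proposes Brain-DiT, a universal multi-state foundation model for fMRI data. It belongs to AI for Science (bioinformatics) and relies on pretraining. It is therefore highly relevant (10 points each) to 'AI for Science OR Bioinformatics OR Cheminformatics', 'Pre-training OR Continual Pre-training OR Domain Adaptation', and 'Large Language Models OR LLMs OR Foundation Models', because these keywords correspond directly to the paper's domain, core method, and model type. The remaining keywords mainly concern specific LLM techniques (such as MoE, RLHF, and quantization) or applications (such as agents and tool use), which the paper does not address, so they score 0.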

!!! tip deepseek-chat TL;DR

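Existing fMRI foundation models are limited to a narrow range of brain states and pretraining tasks. To address this, the paper proposes Brain-DiT, a universal multi-state fMRI foundation model trained with metadata-conditioned diffusion pretraining, and shows on multiple downstream tasks that it outperforms conventional reconstruction or alignment approaches.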

Abstract

Current fMRI foundation models primarily rely on a limited range of brain states and mismatched pretraining tasks, restricting their ability to learn generalized representations across diverse brain states. We present Brain-DiT, a universal multi-state fMRI foundation model pretrained on 349,898 sessions from 24 datasets spanning resting, task, naturalistic, disease, and sleep states. Unlike prior fMRI foundation models that rely on masked reconstruction in the raw-signal space or a latent space, Brain-DiT adopts metadata-conditioned diffusion pretraining with a Diffusion Transformer (DiT), enabling the model to learn multi-scale representations that capture both fine-grained functional structure and global semantics. Across extensive evaluations and ablations on 7 downstream tasks, we find consistent evidence that diffusion-based generative pretraining is a stronger proxy than reconstruction or alignment, with metadata-conditioned pretraining further improving downstream performance by disentangling intrinsic neural dynamics from population-level variability. We also observe that downstream tasks exhibit distinct preferences for representational scale: ADNI classification benefits more from global semantic representations, whereas age/sex prediction comparatively relies more on fine-grained local structure. Code and parameters of Brain-DiT are available at https://github.com/REDMAO4869/Brain-DiT.

Keywords: fMRI foundation model, multi-state brain data, diffusion pretraining, metadata-conditioned, Diffusion Transformer, generative pretraining, downstream tasks, neural dynamics

223. ❌ OFA-Diffusion Compression: Compressing Diffusion Model in One-Shot Manner

Authors: Haoyang Jiang, Zekun Wang, Mingyang Yi, Xiuyu Li, Lanqing Hu, Junxian Cai, Qingbin Liu, Xi Chen, Ju Fan Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12668v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

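Scoring rationale: The paper focuses on compressing Diffusion Probabilistic Models (DPMs) and proposes a one-shot training framework (once-for-all, OFA) that produces subnetworks for different compute budgets, lowering deployment cost. It is unrelated to most keywords, which target large language models (LLMs) and associated techniques (alignment, reasoning, agents, and so on), whereas this paper studies diffusion models, a class of generative models used mainly for image generation. The only related keyword is 'Quantization OR Model Compression OR Low-bit Weights', because the paper involves model compression (structured pruning rather than quantization). It is not a core match, since the focus is architectural compression rather than low-bit weights, so it receives 5 points (some relevance). Other keywords such as 'AI for Science' do not apply because the paper does not address scientific applications.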

!!! tip deepseek-chat TL;DR

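The paper proposes OFA-Diffusion Compression, a one-shot compression framework for diffusion probabilistic models. Using importance-based channel allocation and a reweighting strategy, it produces subnetworks for multiple compute budgets in a single training run, substantially reducing training overhead while maintaining performance.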

Abstract

The Diffusion Probabilistic Model (DPM) achieves remarkable performance in image generation, while its increasing parameter size and computational overhead hinder its deployment in practical applications. To improve this, the existing literature focuses on obtaining a smaller model with a fixed architecture through model compression. However, in practice, DPMs usually need to be deployed on various devices with different resource constraints, which leads to multiple compression processes, incurring significant overhead for repeated training. To obviate this, we propose a once-for-all (OFA) compression framework for DPMs that yields different subnetworks with various computations in a one-shot training manner. The existing OFA framework typically involves massive subnetworks with different parameter sizes, while such a huge candidate space slows the optimization. Thus, we propose to restrict the candidate subnetworks with a certain set of parameter sizes, where each size corresponds to a specific subnetwork. Specifically, to construct each subnetwork with a given size, we gradually allocate the maintained channels by their importance. Furthermore, we propose a reweighting strategy to balance the optimization process of different subnetworks. Experimental results show that our approach can produce compressed DPMs for various sizes with significantly lower training overhead while achieving satisfactory performance.
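
The step of building each subnetwork by allocating retained channels in order of importance can be sketched as follows. This is a simplified illustration, not the authors' implementation: `allocate_channels`, the importance scores, and the budgets are hypothetical, and how importance is measured is left open.

```python
def allocate_channels(importance, budgets):
    """Map each channel budget to the indices of its most important channels."""
    # Rank channel indices from most to least important (ties keep index order).
    ranked = sorted(range(len(importance)), key=lambda i: -importance[i])
    return {k: sorted(ranked[:k]) for k in sorted(budgets)}


importance = [0.1, 0.9, 0.5, 0.7]  # hypothetical per-channel scores
subnets = allocate_channels(importance, budgets=[1, 3])
print(subnets)
```

Because every budget takes a prefix of the same ranking, smaller subnetworks are nested inside larger ones, a property that weight-sharing training schemes typically rely on.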

Keywords: Diffusion Probabilistic Model, Model Compression, Once-for-All, Subnetwork, Training Overhead, Image Generation, Parameter Size, Computational Constraints

224. ❌ Hypergraph-State Collaborative Reasoning for Multi-Object Tracking

Authors: Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang, Xinchao Wang Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12665v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

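Scoring rationale: The paper focuses on multi-object tracking (MOT) in computer vision and proposes HyperSSM, a motion reasoning framework that combines hypergraph computation with a state space model. Its core aim is to address unstable motion estimation and occlusion in visual tracking, a conventional application of deep learning to computer vision. All scoring keywords relate directly to large language models (LLMs), the underlying techniques of large models, AI for Science (such as bioinformatics), or novel applications of large models in different domains. The paper does not involve large models, language models, prompt engineering, alignment, efficient fine-tuning, inference optimization, agents, or scientific AI applications, so every keyword has a relevance of 0.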

!!! tip deepseek-chat TL;DR

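To address unstable motion estimation and occlusion-induced trajectory fragmentation in multi-object tracking, the paper proposes HyperSSM, a collaborative reasoning framework built on a hypergraph state space model. It strengthens motion estimation through joint inference among multiple correlated objects and achieves state-of-the-art performance on several mainstream benchmarks.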

Abstract

Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when the target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation. Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT), covering various motion patterns and scene complexities, demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.

Keywords: multi-object tracking, motion reasoning, collaborative reasoning, hypergraph, state space model, occlusion handling, trajectory estimation, spatial-temporal reasoning

225. ❌ Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

Authors: Miao Liu, Fangda Wei, Jing Wang, Xinyuan Qian Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12650v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

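Scoring rationale: The paper focuses on deepfake detection, specifically the new task of 'listening deepfake' detection, and proposes a dataset and a detection network, MANet. All scoring keywords relate directly to large models, the underlying principles of deep learning techniques, or scientific applications of AI, whereas this paper studies deepfake detection in computer vision and multimedia security. It is a conventional deep learning application that does not involve large-model techniques, scientific AI applications, or any specific technique named in the keywords. All keywords therefore have a relevance of 0.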

!!! tip deepseek-chat TL;DR

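The paper introduces the new task of listening deepfake detection, builds ListenForge, the first dataset for the task, and develops MANet, a network that detects motion inconsistencies in forged listener videos and significantly outperforms existing speaking deepfake detection methods.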

Abstract

Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker’s appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of ’listening deepfakes’ remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging speaker’s audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at https://anonymous.4open.science/r/LDD-B4CB.

Keywords: Listening Deepfake Detection, ListenForge dataset, MANet, Motion-aware, Audio-guided, Multimodal forgery analysis, Listening Head Generation, Deepfake detection

Authors: Ziyuan Xia, Jingyi Xu, Chong Cui, Yuanhong Yu, Jiazhao Zhang, Qingsong Yan, Tao Ni, Junbo Chen, Xiaowei Zhou, Hujun Bao, Ruizhen Hu, Sida Peng Venue/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12626v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

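Rationale: The paper develops an embodied-AI simulator, using 3D Gaussian Splatting for high-fidelity rendering and dynamic human modeling to improve the training of navigation agents. The scoring keywords all concern core large-model topics (large language models, model training, inference optimization, alignment, agent systems), whereas this paper addresses simulator technology in computer vision, graphics, and robotics, so none of the keywords apply.

!!! tip deepseek-chat TL;DR

The paper presents Habitat-GS, a high-fidelity navigation simulator built on 3D Gaussian Splatting. It integrates 3DGS rendering with drivable Gaussian avatars to improve visual realism and dynamic human modeling; experiments show that it improves embodied agents’ cross-domain generalization and human-aware navigation.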

摘要翻译

训练具身智能体关键取决于仿真环境的视觉保真度与动态人体建模能力。当前仿真器主要依赖基于网格的光栅化渲染，其视觉真实感有限；即便支持动态人体化身，也通常受限于网格表征，这阻碍了智能体向真实人类场景的泛化。本文提出Habitat-GS——一个基于Habitat-Sim扩展的以导航为核心的具身智能仿真平台，该系统集成了三维高斯溅射（3D Gaussian Splatting）场景渲染与可驱动的高斯化身，同时保持与Habitat生态系统的完全兼容。我们的系统实现了支持实时逼真渲染的3DGS渲染器，并能从多源数据导入可扩展的3DGS资源。针对动态人体建模，我们引入了高斯化身模块，使每个化身既能作为逼真的视觉实体，又能作为有效的导航障碍物，从而让智能体在高度真实的场景中学习人类感知行为。在点目标导航任务上的实验表明，在3DGS场景中训练的智能体具有更强的跨域泛化能力，其中混合域训练策略效果最佳。针对化身感知导航的评估进一步证实，高斯化身能够实现有效的人类感知导航。最后，性能基准测试验证了系统在不同场景复杂度与化身数量下的可扩展性。

摘要 (Abstract)

Training embodied AI agents depends critically on the visual fidelity of simulation environments and the ability to model dynamic humans. Current simulators rely on mesh-based rasterization with limited visual realism, and their support for dynamic human avatars, where available, is constrained to mesh representations, hindering agent generalization to human-populated real-world scenarios. We present Habitat-GS, a navigation-centric embodied AI simulator extended from Habitat-Sim that integrates 3D Gaussian Splatting scene rendering and drivable Gaussian avatars while maintaining full compatibility with the Habitat ecosystem. Our system implements a 3DGS renderer for real-time photorealistic rendering and supports scalable 3DGS asset import from diverse sources. For dynamic human modeling, we introduce a Gaussian avatar module that enables each avatar to simultaneously serve as a photorealistic visual entity and an effective navigation obstacle, allowing agents to learn human-aware behaviors in realistic settings. Experiments on point-goal navigation demonstrate that agents trained on 3DGS scenes achieve stronger cross-domain generalization, with mixed-domain training being the most effective strategy. Evaluations on avatar-aware navigation further confirm that Gaussian avatars enable effective human-aware navigation. Finally, performance benchmarks validate the system’s scalability across varying scene complexity and avatar counts.

Keywords: Embodied AI, Navigation Simulator, 3D Gaussian Splatting, Gaussian Avatars, Cross-domain Generalization, Human-aware Navigation, Habitat-Sim, Photorealistic Rendering

227. ❌ Spatial-Spectral Adaptive Fidelity and Noise Prior Reduction Guided Hyperspectral Image Denoising

Authors: Xuelin Xie, Xiliang Lu, Zhengshan Wang, Yang Zhang, Long Chen Venue/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12600v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

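Rationale: The paper focuses on hyperspectral image denoising, which belongs to computer vision and image processing rather than to large language models or research on deep learning principles. Every keyword except the last concerns large language models, deep learning techniques, reasoning methods, alignment, or optimization, none of which the paper addresses. Only the last keyword, ‘AI for Science OR Bioinformatics OR Cheminformatics’, is loosely related, because hyperspectral denoising can be viewed as an application of AI to remote sensing or scientific imaging; it is not a core match, hence a relevance of 5 (partially related).

!!! tip deepseek-chat TL;DR

The paper proposes a hyperspectral image denoising framework that combines noise prior reduction with a spatial-spectral adaptive fidelity term. It handles mixed noise effectively while remaining computationally efficient, and shows superior performance on simulated and real datasets.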

摘要翻译

高光谱图像去噪的核心挑战在于如何恰当权衡数据保真度与噪声先验建模。现有方法大多过度强调图像的内在先验，而忽视了多样化的噪声假设以及保真度与先验之间的动态平衡。为解决这些问题，我们提出了一种融合噪声先验约简与空谱自适应保真项的去噪框架。该框架以较少参数考虑全面的噪声先验，并引入自适应权重张量以动态平衡保真项与先验正则项。在此框架内，我们进一步结合代表性系数全变分正则器，开发了一种快速鲁棒的像素级模型，用于精确去除高光谱图像中的混合噪声。所提方法不仅能高效处理多种噪声类型，还能准确捕捉高光谱图像的光谱低秩结构与局部平滑特性。我们设计了基于交替方向乘子法的高效优化算法，确保模型稳定快速收敛。在模拟与真实数据集上的大量实验表明，所提模型在保持较强计算效率的同时，实现了卓越的去噪性能。

摘要 (Abstract)

The core challenge of hyperspectral image denoising is striking the right balance between data fidelity and noise prior modeling. Most existing methods place too much emphasis on the intrinsic priors of the image while overlooking diverse noise assumptions and the dynamic trade-off between fidelity and priors. To address these issues, we propose a denoising framework that integrates noise prior reduction and a spatial-spectral adaptive fidelity term. This framework considers comprehensive noise priors with fewer parameters and introduces an adaptive weight tensor to dynamically balance the fidelity and prior regularization terms. Within this framework, we further develop a fast and robust pixel-wise model combined with the representative coefficient total variation regularizer to accurately remove mixed noise in HSIs. The proposed method not only efficiently handles various types of noise but also accurately captures the spectral low-rank structure and local smoothness of HSIs. An efficient optimization algorithm based on the alternating direction method of multipliers is designed to ensure stable and fast convergence. Extensive experiments on simulated and real-world datasets demonstrate that the proposed model achieves superior denoising performance while maintaining competitive computational efficiency.

Keywords: hyperspectral image denoising, noise prior reduction, spatial-spectral adaptive fidelity, mixed noise removal, alternating direction method of multipliers, computational efficiency, total variation regularizer
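
The fidelity/prior trade-off described in the abstract can be written, as a rough sketch, as a weighted variational problem. All symbols here are assumptions introduced for illustration and are not taken from the paper: $\mathcal{Y}$ is the noisy hyperspectral image, $\mathcal{X}$ the clean estimate, $\mathcal{W}$ an element-wise weight tensor, $R$ a regularizer (for example a total variation term), and $\lambda > 0$ a balance parameter. The paper's actual objective, and the way it adapts $\mathcal{W}$, may differ.

$$
\min_{\mathcal{X}} \; \tfrac{1}{2}\,\bigl\| \mathcal{W} \odot (\mathcal{Y} - \mathcal{X}) \bigr\|_F^2 + \lambda\, R(\mathcal{X})
$$

With the splitting $\mathcal{Z} = \mathcal{X}$, a scaled dual variable $\mathcal{U}$, a penalty $\rho > 0$, and $\mathcal{W}$ held fixed, the standard scaled-form ADMM iterations for this generic problem are (products, squares, and the division in the first line are element-wise):

$$
\begin{aligned}
\mathcal{X}^{k+1} &= \frac{\mathcal{W}^{2} \odot \mathcal{Y} + \rho\,(\mathcal{Z}^{k} - \mathcal{U}^{k})}{\mathcal{W}^{2} + \rho}, \\
\mathcal{Z}^{k+1} &= \operatorname{prox}_{(\lambda/\rho) R}\bigl(\mathcal{X}^{k+1} + \mathcal{U}^{k}\bigr), \\
\mathcal{U}^{k+1} &= \mathcal{U}^{k} + \mathcal{X}^{k+1} - \mathcal{Z}^{k+1}.
\end{aligned}
$$

Where an entry of $\mathcal{W}$ is large the update trusts the observation, and where it is small the update leans on the prior, which is the balancing role the abstract attributes to the adaptive weight tensor.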

228. ❌ ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction

Authors: Yuhao Liu, Dingju Wang, Ziyang Zheng Venue/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12592v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

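Rationale: The paper addresses 3D reconstruction in computer vision, specifically optimizing Gaussian Splatting for extreme low-light conditions. All keywords concern large language models, deep learning principles, or AI applications in science, whereas this paper studies a pure 3D vision reconstruction method and involves no language models, model training techniques, reasoning methods, agent systems, or scientific AI applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes ELoG-GS, a Gaussian Splatting method optimized for extreme low light. Using geometry-aware initialization and photometric adaptation, it substantially improves the quality and geometric consistency of low-light 3D reconstruction in the NTIRE 2026 challenge.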

摘要翻译

本文介绍了我们针对NTIRE 2026三维复原与重建挑战赛（赛道一）所提出的方法，该赛道聚焦于从退化的多视角输入中重建高质量的三维表征。挑战任务在于在极端低光照环境下恢复几何一致且具有照片级真实感的三维场景。为解决此问题，我们提出了极端低光优化高斯泼溅（Extreme Low-light Optimized Gaussian Splatting, ELoG-GS）方法，这是一个鲁棒的低光三维重建流程，它集成了基于学习的点云初始化和亮度引导的颜色增强技术，以实现稳定且具有照片级真实感的高斯泼溅。我们的方法结合了几何感知初始化和光度适应策略，以提升在挑战性条件下的重建保真度。在NTIRE赛道一基准上的大量实验表明，我们的方法相较于基线模型显著提升了重建质量，实现了卓越的视觉保真度和几何一致性。所提出的方法为现实世界退化场景下的鲁棒三维重建提供了一个实用解决方案。在最终测试阶段，我们的方法在官方平台排行榜上取得了18.6626的峰值信噪比（PSNR）和0.6855的结构相似性指数（SSIM）。代码发布于 https://github.com/lyh120/FSGS_EAPGS。

摘要 (Abstract)

This paper presents our approach to the NTIRE 2026 3D Restoration and Reconstruction Challenge (Track 1), which focuses on reconstructing high-quality 3D representations from degraded multi-view inputs. The challenge involves recovering geometrically consistent and photorealistic 3D scenes in extreme low-light environments. To address this task, we propose Extreme Low-light Optimized Gaussian Splatting (ELoG-GS), a robust low-light 3D reconstruction pipeline that integrates learning-based point cloud initialization and luminance-guided color enhancement for stable and photorealistic Gaussian Splatting. Our method incorporates both geometry-aware initialization and photometric adaptation strategies to improve reconstruction fidelity under challenging conditions. Extensive experiments on the NTIRE Track 1 benchmark demonstrate that our approach significantly improves reconstruction quality over the baselines, achieving superior visual fidelity and geometric consistency. The proposed method provides a practical solution for robust 3D reconstruction in real-world degraded scenarios. In the final testing phase, our method achieved a PSNR of 18.6626 and an SSIM of 0.6855 on the official platform leaderboard. Code is available at https://github.com/lyh120/FSGS_EAPGS.

Keywords: 3D reconstruction, Gaussian Splatting, low-light enhancement, point cloud initialization, photometric adaptation, NTIRE challenge, PSNR, SSIM
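
The PSNR figure quoted in the abstract follows the standard definition PSNR = 10 · log10(MAX² / MSE). The sketch below shows that definition and its inverse in plain Python. The function names `psnr` and `mse_for_psnr` and the 8-bit default range are assumptions for illustration; they are not taken from the ELoG-GS code or the NTIRE evaluation script.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_for_psnr(psnr_db, max_value=255.0):
    """Invert the definition: the MSE that corresponds to a given PSNR in dB."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))
```

For scale, if the leaderboard PSNR is computed on an 8-bit range (an assumption), `mse_for_psnr(18.6626) ** 0.5` gives a root-mean-square error of roughly 30 gray levels.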

229. ❌ PDF-GS: Progressive Distractor Filtering for Robust 3D Gaussian Splatting

Authors: Kangmin Seo, MinKyu Lee, Tae-Young Kim, ByeongCheol Lee, JoonSeoung An, Jae-Pil Heo Venue/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12580v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

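Rationale: PDF-GS targets the robustness of 3D Gaussian Splatting (3DGS) and proposes a progressive distractor filtering framework; it belongs to computer vision and 3D reconstruction. All scoring keywords concern large models, deep learning principles, or AI applications in science, whereas this paper involves no large-model techniques (such as LLMs, MoE, or RLHF) and is not applied to scientific fields such as bioinformatics, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the sensitivity of 3D Gaussian Splatting to distractors in the input images, the paper proposes PDF-GS, which filters distractors and restores detail through progressive multi-phase optimization. It achieves robust, high-fidelity, distractor-free 3D reconstruction and outperforms baselines across diverse datasets and challenging conditions.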

摘要翻译

三维高斯泼溅（3D Gaussian Splatting，简称3DGS）技术的最新进展已实现令人印象深刻的实时照片级真实感渲染。然而，传统训练流程本质上假设输入图像之间具有完全的多视角一致性，这使其对违反该假设的干扰物十分敏感，并导致视觉伪影。本研究重新审视了3DGS中一个尚未充分探索的特性：其内在的抑制不一致信号的能力。基于这一洞见，我们提出PDF-GS（面向鲁棒三维高斯泼溅的渐进式干扰物过滤框架），该框架通过渐进式多阶段优化来增强这种自过滤特性。渐进过滤阶段通过利用差异线索逐步移除干扰物，而随后的重建阶段则从净化后的高斯表示中恢复细粒度、视角一致的细节。通过这种迭代优化，PDF-GS实现了鲁棒、高保真且无干扰物的重建效果，在多样化数据集和具有挑战性的真实场景条件下均持续优于基线方法。此外，我们的方法轻量且易于适配现有3DGS框架，无需修改架构或增加额外推理开销，从而实现了新的最优性能。代码已公开于https://github.com/kangrnin/PDF-GS。

摘要 (Abstract)

Recent advances in 3D Gaussian Splatting (3DGS) have enabled impressive real-time photorealistic rendering. However, conventional training pipelines inherently assume full multi-view consistency among input images, which makes them sensitive to distractors that violate this assumption and cause visual artifacts. In this work, we revisit an underexplored aspect of 3DGS: its inherent ability to suppress inconsistent signals. Building on this insight, we propose PDF-GS (Progressive Distractor Filtering for Robust 3D Gaussian Splatting), a framework that amplifies this self-filtering property through a progressive multi-phase optimization. The progressive filtering phases gradually remove distractors by exploiting discrepancy cues, while the following reconstruction phase restores fine-grained, view-consistent details from the purified Gaussian representation. Through this iterative refinement, PDF-GS achieves robust, high-fidelity, and distractor-free reconstructions, consistently outperforming baselines across diverse datasets and challenging real-world conditions. Moreover, our approach is lightweight and easily adaptable to existing 3DGS frameworks, requiring no architectural changes or additional inference overhead, leading to a new state-of-the-art performance. The code is publicly available at https://github.com/kangrnin/PDF-GS.

Keywords: 3D Gaussian Splatting, distractor filtering, progressive optimization, multi-view consistency, robust reconstruction, real-time rendering, visual artifacts, state-of-the-art

230. ❌ Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models

Authors: Zijian Liu, Sihan Cao, Pengcheng Zheng, Kuien Liu, Caiyan Qin, Xiaolin Qin, Jiwei Wei, Chaoning Zhang Venue/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12582v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

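Rationale: The paper focuses on mitigating hallucinations in Video Large Language Models (Video-LLMs); its core contribution is the Decoder-side Temporal Rebalancing (DTR) method. Two keywords are highly relevant: (1) ‘Large Language Models OR LLMs OR Foundation Models’ (10), because the paper explicitly studies Video-LLMs, which fall under large models; (2) ‘Hallucination Mitigation OR Factuality OR Truthfulness’ (10), because the paper’s core aim is to address hallucination. ‘Mechanistic Interpretability OR Explainable AI’ is partially related (5), because the paper analyzes the anchor-frame phenomenon in the attention mechanism, which involves explaining model behavior. The other keywords, such as MoE, SFT, RAG, and quantization, are not covered in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies hallucinations in Video Large Language Models caused by over-reliance on anchor frames and proposes a training-free Decoder-side Temporal Rebalancing method that mitigates hallucinations while preserving video understanding performance.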

摘要翻译

近期视频大语言模型（Video-LLMs）在视频理解方面展现出强大能力，但仍普遍存在幻觉问题。现有的缓解方法通常依赖于训练、输入修改、辅助引导或额外的解码流程，而很大程度上忽视了一个更根本的挑战：在生成过程中，Video-LLMs 倾向于过度依赖有限的部分时序证据，导致对视频的时序证据聚合呈现不平衡状态。为应对此问题，我们研究了解码器侧的一种现象，即模型表现出时序不平衡的关注模式。我们将帧级注意力质量聚合最高的帧定义为锚定帧。研究发现，这种偏差在很大程度上与输入视频无关，反而反映了一种持久的、模型特定的结构或位置偏差，其过度主导性与易产生幻觉的生成过程密切相关。基于这一发现，我们提出了解码器侧时序再平衡（Decoder-side Temporal Rebalancing, DTR），这是一种无需训练、选择性作用于特定层的推理方法，可在不改变视觉编码或依赖辅助模型的前提下，对中后期解码器层的时序证据分配进行再平衡。DTR 自适应地校准解码器侧的视觉注意力，以缓解时序不平衡的关注现象，并促使未被充分关注的帧更有效地参与响应生成。通过这种方式，DTR 引导解码器将其输出建立在更广泛、更平衡的时序视频证据基础上。在幻觉检测与视频理解基准上的大量实验表明，DTR 能持续提升不同 Video-LLM 系列的幻觉鲁棒性，同时保持有竞争力的视频理解性能与较高的推理效率。

摘要 (Abstract)

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence. Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.

Keywords: Video Large Language Models, Hallucination Mitigation, Temporal Imbalance, Anchor Frame, Decoder-side Temporal Rebalancing, Attention Mechanism, Training-free Inference, Video Understanding
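
The abstract defines the anchor frame as the frame with the highest aggregated frame-level attention mass, but it does not give DTR's calibration rule. The sketch below computes the anchor frame as defined and then applies a hypothetical rebalancing rule, a linear interpolation of each frame's mass toward the uniform share, purely to illustrate the idea. The function names, the `alpha` parameter, and the interpolation rule are assumptions and should not be read as the paper's method.

```python
def frame_attention_mass(attn, frame_ids, num_frames):
    """Sum the attention weights of the visual tokens belonging to each frame."""
    mass = [0.0] * num_frames
    for weight, frame in zip(attn, frame_ids):
        mass[frame] += weight
    return mass


def anchor_frame(mass):
    """Index of the frame with the highest aggregated attention mass."""
    return max(range(len(mass)), key=lambda f: mass[f])


def rebalance(attn, frame_ids, num_frames, alpha=0.5):
    """Shift each frame's mass toward the uniform share (hypothetical rule).

    alpha=0 leaves attention unchanged; alpha=1 gives every frame equal mass.
    Total visual attention is preserved when every frame has at least one token.
    """
    mass = frame_attention_mass(attn, frame_ids, num_frames)
    uniform = sum(mass) / num_frames
    target = [(1.0 - alpha) * m + alpha * uniform for m in mass]
    counts = [0] * num_frames
    for frame in frame_ids:
        counts[frame] += 1
    out = []
    for weight, frame in zip(attn, frame_ids):
        if mass[frame] > 0:
            # Rescale tokens so the frame's total matches its target mass.
            out.append(weight * target[frame] / mass[frame])
        else:
            # A frame that received no attention gets its target spread evenly.
            out.append(target[frame] / counts[frame])
    return out
```

In this toy version the anchor frame loses mass and under-attended frames gain it, while the relative weighting of tokens inside each frame is unchanged.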

Authors: Francesco Chiumento, Julia Dietlmeier, Ronan P. Killeen, Kathleen M. Curran, Noel E. O’Connor, Mingming Liu Venue/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12574v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

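Rationale: The paper focuses on Alzheimer’s disease detection in medical image analysis (MRI and PET), using a knowledge distillation framework (a BiomedCLIP teacher model, cross-modal attention, and triplet contrastive learning) for PET-free Aβ prediction. It is unrelated to most keywords (which concern large-model principles, training methods, inference optimization, agents, and so on) and is highly relevant only to the last keyword, ‘AI for Science OR Bioinformatics OR Cheminformatics’, because it applies AI to biomedicine (neuroimaging) and therefore falls under ‘AI for Science’.

!!! tip deepseek-chat TL;DR

The study proposes a PET-guided knowledge distillation framework that detects Alzheimer’s-related amyloid-β (Aβ) positivity from MRI alone, reaching a best AUC of 0.74 on OASIS-3 and 0.68 on ADNI and enabling effective screening without PET or clinical variables.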

摘要翻译

检测淀粉样蛋白-β（Aβ）阳性对阿尔茨海默病的早期诊断至关重要，但通常需要正电子发射断层扫描（PET）成像，这种方法成本高昂、具有侵入性且不易普及，限制了其在人群筛查中的应用。为弥补这一不足，我们提出了一种PET引导的知识蒸馏框架，仅通过磁共振成像（MRI）即可预测Aβ状态，无需依赖非成像临床协变量或在推理阶段使用PET。我们的方法采用基于BiomedCLIP的教师模型，该模型通过跨模态注意力机制和基于PET信息（Centiloid感知）的在线负采样的三重对比学习，实现PET与MRI的对齐。随后，一个仅使用MRI的学生模型通过特征级和逻辑级蒸馏来模仿教师模型。在四种MRI对比序列（T1加权、T2加权、FLAIR、T2*）和两个独立数据集上的评估表明，我们的方法实现了有效的知识迁移（最佳AUC：OASIS-3上为0.74，ADNI上为0.68），同时保持了可解释性，且无需临床变量。显著性分析证实，预测聚焦于解剖学相关的皮质区域，这支持了无PET的Aβ筛查在临床上的可行性。代码发布于https://github.com/FrancescoChiumento/pet-guided-mri-amyloid-detection。

摘要 (Abstract)

Detecting amyloid-$\beta$ (A$\beta$) positivity is crucial for early diagnosis of Alzheimer’s disease but typically requires PET imaging, which is costly, invasive, and not widely accessible, limiting its use for population-level screening. We address this gap by proposing a PET-guided knowledge distillation framework that enables A$\beta$ prediction from MRI alone, without requiring non-imaging clinical covariates or PET at inference. Our approach employs a BiomedCLIP-based teacher model that learns PET-MRI alignment via cross-modal attention and triplet contrastive learning with PET-informed (Centiloid-aware) online negative sampling. An MRI-only student then mimics the teacher via feature-level and logit-level distillation. Evaluated across four MRI contrasts (T1w, T2w, FLAIR, T2*) and two independent datasets, our approach demonstrates effective knowledge transfer (best AUC: 0.74 on OASIS-3, 0.68 on ADNI) while maintaining interpretability and eliminating the need for clinical variables. Saliency analysis confirms that predictions focus on anatomically relevant cortical regions, supporting the clinical viability of PET-free A$\beta$ screening. Code is available at https://github.com/FrancescoChiumento/pet-guided-mri-amyloid-detection.

Keywords: Alzheimer’s disease, amyloid-beta detection, knowledge distillation, cross-modal learning, MRI, PET-free, BiomedCLIP, medical imaging analysis
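
The abstract says the MRI-only student mimics the teacher through feature-level and logit-level distillation, without specifying the losses. A minimal sketch of one common formulation follows, assuming a temperature-scaled KL divergence on logits and a mean squared error on features. These loss choices, the function names, and the weights `w_feat` and `w_logit` are assumptions, not details confirmed by the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return temperature ** 2 * kl


def feature_distillation(student_feat, teacher_feat):
    """Mean squared error between student and teacher feature vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(diffs) / len(diffs)


def distillation_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                      w_feat=1.0, w_logit=1.0, temperature=2.0):
    """Weighted sum of the two terms (weights are hypothetical)."""
    return (w_feat * feature_distillation(student_feat, teacher_feat)
            + w_logit * logit_distillation(student_logits, teacher_logits, temperature))
```

In a setup like the one described, the teacher's features and logits would come from the PET-guided model and the student's from MRI alone, so the PET branch is needed only during training.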

232. ❌ Evolution-Inspired Sample Competition for Deep Neural Network Optimization

Authors: Ying Zheng, Yiyi Zhang, Yi Wang, Lap-Pui Chau Venue/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12568v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

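Rationale: The paper proposes Natural Selection (NS), an evolution-inspired optimization method for deep neural networks that improves training through a sample competition mechanism. It focuses on sample reweighting for image classification and has no direct connection to any scoring keyword (which mainly cover large language models, training techniques, inference optimization, and AI agents). It does not involve large models, innovations in deep learning principles, or scientific applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Natural Selection, an evolution-inspired sample competition optimization method that dynamically reweights per-sample losses to address class imbalance, hard-sample learning, and noisy samples in deep network training; its effectiveness is validated on 12 image classification datasets.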

摘要翻译

传统的深度网络训练通常在高度统一的学习范式下优化所有样本，未显式建模样本间的异质竞争关系。这种过度简化的处理方式可能导致若干已知问题，包括类别不平衡下的偏差、困难样本学习不足以及噪声样本的错误强化。本文提出《自然选择》（NS），一种受进化论启发的新型优化方法，将竞争交互显式引入深度网络训练。与主要依赖预定义启发式规则或静态标准的传统样本重加权策略不同，NS在分组语境中评估每个样本的竞争状态，并据此自适应调节其训练贡献。具体而言，NS首先将多个样本组合成复合图像，并将其缩放至原始输入尺寸进行模型推理。基于所得预测结果，计算每个样本的“自然选择分数”以刻画其在构建组内的相对竞争变化。这些分数随后被用于动态重加权样本损失，从而在优化过程中引入显式的竞争驱动机制。通过这种方式，NS提供了一种简单而有效的方法，能够超越均匀样本处理模式，实现更具适应性与平衡性的模型优化。在四个图像分类任务的12个公开数据集上的大量实验验证了所提方法的有效性。此外，NS兼容多种网络架构，且不依赖任务特定假设，展现出较强的通用性与实用潜力。代码将公开发布。

摘要 (Abstract)

Conventional deep network training generally optimizes all samples under a largely uniform learning paradigm, without explicitly modeling the heterogeneous competition among them. Such an oversimplified treatment can lead to several well-known issues, including bias under class imbalance, insufficient learning of hard samples, and the erroneous reinforcement of noisy samples. In this work, we present Natural Selection (NS), a novel evolution-inspired optimization method that explicitly incorporates competitive interactions into deep network training. Unlike conventional sample reweighting strategies that rely mainly on predefined heuristics or static criteria, NS estimates the competitive status of each sample in a group-wise context and uses it to adaptively regulate its training contribution. Specifically, NS first assembles multiple samples into a composite image and rescales it to the original input size for model inference. Based on the resulting predictions, a natural selection score is computed for each sample to characterize its relative competitive variation within the constructed group. These scores are then used to dynamically reweight the sample-wise loss, thereby introducing an explicit competition-driven mechanism into the optimization process. In this way, NS provides a simple yet effective means of moving beyond uniform sample treatment and enables more adaptive and balanced model optimization. Extensive experiments on 12 public datasets across four image classification tasks demonstrate the effectiveness of the proposed method. Moreover, NS is compatible with diverse network architectures and does not depend on task-specific assumptions, indicating its strong generality and practical potential. The code will be made publicly available.

Keywords: Deep Neural Network Optimization, Sample Competition, Natural Selection, Sample Reweighting, Image Classification, Evolution-inspired, Training Contribution, Group-wise Context
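
The abstract describes three steps: assemble several samples into a composite image, rescale it to the original input size, and turn the model's prediction on the composite into per-sample loss weights. The sketch below illustrates the first two steps with a 2×2 grid and average pooling, and the third with a hypothetical weighting rule that up-weights samples whose class receives a smaller share of the composite prediction. The grid layout, the pooling, and the weighting formula are all assumptions; the paper's natural selection score is not given in the abstract and may behave differently.

```python
def tile_2x2(images):
    """Tile four HxW grayscale images (lists of rows) into one 2Hx2W composite."""
    a, b, c, d = images
    top = [row_a + row_b for row_a, row_b in zip(a, b)]
    bottom = [row_c + row_d for row_c, row_d in zip(c, d)]
    return top + bottom


def downscale_by_2(image):
    """Average-pool 2x2 blocks, bringing a 2Hx2W composite back to HxW."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * i][2 * j] + image[2 * i][2 * j + 1]
             + image[2 * i + 1][2 * j] + image[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def competition_weights(group_probs, labels):
    """Hypothetical loss weights from the composite prediction.

    group_probs: class probabilities predicted for the composite image.
    labels: true class index of each group member (at least two members).
    Members whose class wins a larger share get a smaller weight; the
    weights average to 1 across the group.
    """
    k = len(labels)
    raw = [group_probs[y] for y in labels]
    total = sum(raw)
    if total == 0:
        return [1.0] * k  # no signal: fall back to uniform weights
    shares = [r / total for r in raw]
    return [k * (1.0 - s) / (k - 1) for s in shares]
```

The returned weights would multiply the per-sample losses of the group members in an ordinary training step.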

233. ❌ Scalable Trajectory Generation for Whole-Body Mobile Manipulation

Authors: Yida Niu, Xinhai Chang, Xin Liu, Ziyuan Jiao, Yixin Zhu Venue/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12565v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

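Scoring rationale: The paper is a robotics work on trajectory generation and data scaling for mobile manipulators; it uses the GPU-accelerated framework AutoMoMa to generate a large-scale dataset of physically valid trajectories. It covers robot kinematics, trajectory optimization, and imitation learning, but does not touch on large language models, the technical principles of deep learning, or AI for Science. None of the keywords relate to the paper's topic, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper addresses the combinatorial growth of the state space caused by scene and object diversity in mobile manipulation, and the resulting lack of large-scale, physically valid trajectory data. It proposes the GPU-accelerated framework AutoMoMa, which is over 80× faster than CPU-based baselines, generates more than 500k trajectories, and shows that data scarcity is the performance bottleneck, providing infrastructure for coordinated mobile manipulation research.

Abstract translation (Chinese)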

部署在非结构化环境中的机器人必须协调全身运动——同时移动移动基座与机械臂——以与物理世界交互。这种移动性与灵巧性的耦合产生了一个随场景和物体多样性呈组合增长的状态空间，所需的数据集规模远超固定基座操作任务的需求。然而现有的数据采集方法（包括遥操作和规划）在大规模应用时要么劳动密集，要么计算成本过高。核心瓶颈在于缺乏一个可扩展的流水线，用于跨不同机器人构型与环境生成大规模、物理有效的协调轨迹数据。本文介绍AutoMoMa，这是一个GPU加速框架，它将AKR（整合基座、手臂和物体运动学为单一链）建模与并行化轨迹优化相统一。AutoMoMa实现了每GPU小时5,000条轨迹片段（比基于CPU的基线快80倍以上），生成了包含超过50万条物理有效轨迹的数据集，涵盖330个场景、多样化的关节化物体以及多种机器人构型。以往的数据集不得不在规模、多样性或运动学保真度上做出妥协；AutoMoMa则同时解决了这三个问题。对下游模仿学习（IL）策略的训练进一步揭示，即使是单个关节化物体任务，也需要数万条示范数据才能使前沿方法达到约80%的成功率，这证实了数据稀缺（而非算法限制）一直是关键制约因素。因此，AutoMoMa连接了高性能规划与可靠的基于模仿学习的控制，为协调移动操作研究提供了此前缺失的基础设施。通过使大规模、运动学有效的训练数据变得可行，AutoMoMa展示了能够在真实世界多样化、非结构化环境中运行的、可泛化的全身机器人策略。

Abstract

Robots deployed in unstructured environments must coordinate whole-body motion – simultaneously moving a mobile base and arm – to interact with the physical world. This coupled mobility and dexterity yields a state space that grows combinatorially with scene and object diversity, demanding datasets far larger than those sufficient for fixed-base manipulation. Yet existing acquisition methods, including teleoperation and planning, are either labor-intensive or computationally prohibitive at scale. The core bottleneck is the lack of a scalable pipeline for generating large-scale, physically valid, coordinated trajectory data across diverse embodiments and environments. Here we introduce AutoMoMa, a GPU-accelerated framework that unifies AKR modeling, which consolidates base, arm, and object kinematics into a single chain, with parallelized trajectory optimization. AutoMoMa achieves 5,000 episodes per GPU-hour (over $80\times$ faster than CPU-based baselines), producing a dataset of over 500k physically valid trajectories spanning 330 scenes, diverse articulated objects, and multiple robot embodiments. Prior datasets were forced to compromise on scale, diversity, or kinematic fidelity; AutoMoMa addresses all three simultaneously. Training downstream IL policies further reveals that even a single articulated-object task requires tens of thousands of demonstrations for SOTA methods to reach $\approx 80\%$ success, confirming that data scarcity – not algorithmic limitations – has been the binding constraint. AutoMoMa thus bridges high-performance planning and reliable IL-based control, providing the infrastructure previously missing for coordinated mobile manipulation research. By making large-scale, kinematically valid training data practical, AutoMoMa showcases generalizable whole-body robot policies capable of operating in the diverse, unstructured settings of the real world.

Keywords: whole-body mobile manipulation, trajectory generation, GPU-accelerated framework, AutoMoMa, kinematically valid data, imitation learning, scalable pipeline, articulated objects
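
The throughput figures quoted in the abstract imply a rough compute budget. The back-of-the-envelope sketch below uses only those figures (5,000 episodes per GPU-hour, 500k trajectories, 80× speedup). It assumes, for illustration only, that each generated episode yields one valid trajectory and that the speedup compares one GPU-hour with one hour of the CPU baseline; the abstract states the speedup and dataset size as lower bounds ("over").

```python
EPISODES_PER_GPU_HOUR = 5_000   # from the abstract
DATASET_TRAJECTORIES = 500_000  # "over 500k" in the abstract
SPEEDUP_OVER_CPU = 80           # "over 80x" in the abstract

# GPU time to produce the dataset at the quoted rate.
gpu_hours = DATASET_TRAJECTORIES / EPISODES_PER_GPU_HOUR

# Implied CPU-baseline rate and the time it would need for the same dataset.
cpu_episodes_per_hour = EPISODES_PER_GPU_HOUR / SPEEDUP_OVER_CPU
cpu_hours = DATASET_TRAJECTORIES / cpu_episodes_per_hour

print(gpu_hours)              # 100.0
print(cpu_episodes_per_hour)  # 62.5
print(cpu_hours)              # 8000.0
```

Under these assumptions the dataset costs on the order of a hundred GPU-hours, compared with thousands of hours for the implied CPU baseline.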

234. ❌ Cross-Attentive Multiview Fusion of Vision-Language Embeddings

Authors: Tomas Berriel Martins, Martin R. Oswald, Javier Civera Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12551v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

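Scoring rationale: The paper studies vision-language models for 3D scene understanding and proposes Cross-Attentive Multiview Fusion (CAMFusion) to improve 3D semantic and instance classification. Although it involves vision-language models, the keywords all target the technical principles, training methods, inference optimization, alignment, and agent systems of pure language models (LLMs), which have no direct connection to the paper's core content in computer vision and 3D scene understanding. The paper does not involve any language-model technique, AI-for-science application, or novel large-model method.

!!! tip deepseek-chat TL;DR

The paper tackles the challenge of fusing multiview descriptors when lifting 2D vision-language models to 3D scenes. It proposes the Cross-Attentive Multiview Fusion architecture (CAMFusion) and uses multiview consistency as a self-supervision signal, achieving state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluation on out-of-domain datasets.

Abstract translation (Chinese)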

视觉语言模型已成为开放词汇二维语义分割发展的关键。然而，将这些模型从二维图像提升至三维场景仍然是一个具有挑战性的问题。现有方法通常将二维描述符跨视图反向投影并求平均，或启发式地选择单一代表性描述符，这往往导致次优的三维表示。在本研究中，我们提出了一种新颖的多视图变换器架构，该架构通过交叉注意力机制处理来自多个视角的视觉语言描述符，并将其融合为统一的三维实例级嵌入。作为第二项贡献，我们利用多视图一致性作为该融合过程的自监督信号，当将其与标准监督目标类别损失结合时，性能得到显著提升。我们提出的交叉注意力多视图融合方法（简称CAMFusion）不仅持续优于简单的平均或单视图描述符选择策略，还在三维语义与实例分类基准测试中取得了最先进的结果，包括在领域外数据集上的零样本评估。

Abstract

Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.

Keywords: vision-language models, 3D semantic segmentation, multiview fusion, cross-attention, self-supervision, zero-shot evaluation, instance classification, CAMFusion
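
The contrast the abstract draws between naive averaging and attention-based fusion can be illustrated with a toy example. This is a minimal sketch of single-head dot-product attention pooling over per-view descriptors, not the CAMFusion architecture: the fixed `query` vector stands in for parameters that a trained model would learn, and the descriptor values are hypothetical.

```python
import math


def mean_fusion(descriptors):
    """Naive baseline: element-wise average of the per-view descriptors."""
    n = len(descriptors)
    return [sum(column) / n for column in zip(*descriptors)]


def attention_fusion(descriptors, query):
    """Pool views with weights softmax(query . descriptor / sqrt(d))."""
    d = len(query)
    logits = [
        sum(q * x for q, x in zip(query, desc)) / math.sqrt(d)
        for desc in descriptors
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        sum(w * desc[i] for w, desc in zip(weights, descriptors))
        for i in range(d)
    ]
    return fused, weights


# Three hypothetical 2-D view descriptors; the third disagrees with the others.
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [4.0, 0.0]  # stand-in for a learned query

averaged = mean_fusion(views)
fused, view_weights = attention_fusion(views, query)
print(averaged, fused, view_weights)
```

Averaging gives every view the same influence, whereas the attention weights down-weight the view that disagrees with the query direction.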

235. ❌ CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

Authors: Zhaoyang Jia, Naifu Xue, Zihan Zheng, Jiahao Li, Bin Li, Xiaoyi Zhang, Zongyu Guo, Yuan Zhang, Houqiang Li, Yan Lu Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12525v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

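Scoring rationale: The paper studies diffusion models for real-time image compression, an application of generative AI to a specific domain. It is unrelated to most keywords because it does not involve language models, reasoning, alignment, agents, or similar topics. Only two keywords are weakly related: 1) “Pre-training OR Continual Pre-training OR Domain Adaptation” (5 points): the paper examines how generation-oriented versus compression-oriented pre-training affects lightweight models, which is a study of pre-training strategy. 2) “Quantization OR Model Compression OR Low-bit Weights” (5 points): the paper designs a lightweight model for real-time compression, which involves making the model lightweight, but it does not explicitly use quantization or low-bit weights. All other keywords are unrelated.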
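
The score table for this paper lists a relevance of 5.0 for two keywords yet a score of 0.0 for both, because the weight column reads 0.0. A minimal sketch of how such a total could be computed follows. The passing-score formula (5.0 + 0.8 × total weight, with a total weight of 27.0) and the per-keyword cap of 15 come from the report's configuration section; the per-keyword rule `weight × relevance` is an assumption that is merely consistent with the rows shown, and the function names are hypothetical.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # cap stated in the report configuration


def passing_score(total_weight):
    """Passing threshold as defined in the report configuration."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows):
    """Sum per-keyword scores for (weight, relevance) pairs.

    Assumption: per-keyword score = weight * relevance, capped at the
    configured maximum. Relevance is on a 0-10 scale.
    """
    return sum(min(w * r, MAX_SCORE_PER_KEYWORD) for w, r in rows)


threshold = passing_score(27.0)

# 25 keywords with zero relevance plus two with relevance 5.0, all weight 0.0.
rows = [(0.0, 0.0)] * 25 + [(0.0, 5.0), (0.0, 5.0)]
total = paper_score(rows)
passed = total >= threshold
print(round(threshold, 1), total, passed)
```

Under this reading, a non-zero relevance cannot raise the total while the applied weight is 0.0, which matches the 0.0 / 26.6 result reported for the paper.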

!!! tip deepseek-chat TL;DR

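The paper studies how to design lightweight diffusion models for real-time generative image compression. By analyzing pre-training strategies and architecture choices, it proposes a lightweight convolution-based diffusion codec that reduces bitrate by 85% at an FID comparable to MS-ILLM while running at real-time frame rates.

Abstract translation (Chinese)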

近期先进的扩散方法通常通过扩展扩散变换器来获得强大的生成先验。然而，在需要轻量级模型的实时压缩场景中，扩展方法难以实现良好泛化。本文通过探讨两个关键问题，研究实时轻量级扩散编解码器的设计。首先，扩散预训练是否对轻量级扩散编解码器有益？通过系统分析，我们发现面向生成的预训练在小模型规模下效果有限，而面向压缩的预训练能持续提供更优性能。其次，变换器是否必不可少？我们发现，尽管全局注意力对于标准生成任务至关重要，但在结合蒸馏技术时，轻量级卷积已足以满足面向压缩的扩散需求。基于这些发现，我们构建了一种一步式轻量级卷积扩散编解码器，在1080p分辨率下实现了实时60 FPS编码与42 FPS解码。通过蒸馏与对抗学习的进一步强化，该编解码器在与MS-ILLM相近的FID（Fréchet Inception Distance）指标下，将比特率降低了85%，弥合了生成式压缩与实际实时部署之间的差距。代码发布于https://github.com/microsoft/GenCodec/CoD_Lite。

Abstract

Recent advanced diffusion methods typically derive strong generative priors by scaling diffusion transformers. However, scaling fails to generalize when adapted for real-time compression scenarios that demand lightweight models. In this paper, we explore the design of real-time and lightweight diffusion codecs by addressing two pivotal questions. First, does diffusion pre-training benefit lightweight diffusion codecs? Through systematic analysis, we find that generation-oriented pre-training is less effective at small model scales whereas compression-oriented pre-training yields consistently better performance. Second, are transformers essential? We find that while global attention is crucial for standard generation, lightweight convolutions suffice for compression-oriented diffusion when paired with distillation. Guided by these findings, we establish a one-step lightweight convolution diffusion codec that achieves real-time 60 FPS encoding and 42 FPS decoding at 1080p. Further enhanced by distillation and adversarial learning, the proposed codec reduces bitrate by 85% at a comparable FID to MS-ILLM, bridging the gap between generative compression and practical real-time deployment. Code is released at https://github.com/microsoft/GenCodec/CoD_Lite

Keywords: diffusion models, generative image compression, real-time compression, lightweight models, pre-training strategies, convolutional architectures, distillation, adversarial learning
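
The frame rates in the abstract translate into per-frame latency budgets, which is what "real-time" constrains in practice. The sketch below does that arithmetic from the quoted figures (60 FPS encoding, 42 FPS decoding, 85% bitrate reduction). It assumes that 1080p means 1920×1080 pixels; the abstract does not spell out the resolution.

```python
ENCODE_FPS = 60
DECODE_FPS = 42
BITRATE_REDUCTION = 0.85        # relative to MS-ILLM at comparable FID
WIDTH, HEIGHT = 1920, 1080      # assumed meaning of "1080p"

# Time available per frame, in milliseconds.
encode_budget_ms = 1000.0 / ENCODE_FPS
decode_budget_ms = 1000.0 / DECODE_FPS

# Pixel throughput the encoder must sustain, in megapixels per second.
encode_megapixels_per_s = WIDTH * HEIGHT * ENCODE_FPS / 1e6

# Fraction of the reference bitrate that remains after the reduction.
remaining_bitrate_fraction = 1.0 - BITRATE_REDUCTION

print(round(encode_budget_ms, 1), round(decode_budget_ms, 1))
print(round(encode_megapixels_per_s, 1), round(remaining_bitrate_fraction, 2))
```

The encoder therefore has roughly 16.7 ms per frame and the decoder roughly 23.8 ms, which helps explain why the authors favor a one-step model with lightweight convolutions.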

236. ❌ Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers

Authors: Snehal Jauhri, Vignesh Prasad, Georgia Chalvatzaki Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12509v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

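Scoring rationale: The paper focuses on mobile manipulation (MoMa) in robotics and uses offline reinforcement learning (offline RL) to improve on a sub-optimal controller. It covers robot control, reinforcement learning, and action-chunked diffusion policies, but does not involve large language models (LLMs), innovations in the technical principles of deep learning, or applications of AI to science. All scoring keywords relate to large models, deep learning techniques, or AI for Science, whereas this paper studies a traditional robot control problem, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

The paper proposes WHOLE-MoMa, a two-stage method that generates diverse demonstrations by randomizing a lightweight whole-body controller and then applies offline reinforcement learning to improve whole-body mobile manipulation of articulated objects such as doors and drawers. It significantly outperforms the baselines in simulation and transfers to a real robot without fine-tuning.

Abstract translation (Chinese)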

移动操作（MoMa）铰接物体（如打开门、抽屉和橱柜）需要机器人的底盘与机械臂之间进行同步的全身协调。经典全身控制器（WBCs）可通过分层优化解决此类问题，但需要大量手动调优且仍显脆弱。另一方面，基于学习的方法展现出强大的泛化能力，但通常依赖昂贵的全身遥操作数据或复杂的奖励函数设计。我们观察到，即使一个次优的WBC也是强大的结构先验：它可用于在状态-动作空间中受限的任务相关区域内收集数据，并且其行为仍可通过离线强化学习进一步改进。基于此，我们提出WHOLE-MoMa——一种两阶段流程：首先生成多样化示教数据，通过随机化轻量级WBC实现；随后应用离线强化学习，借助奖励信号识别并组合优化后的行为。为支持复杂协调任务所需的表达性动作分块扩散策略，我们扩展了离线隐式Q学习，引入Q分块机制以实现分块级评价器评估和优势加权策略提取。在使用仿真环境中的TIAGo++移动操作器进行的三个难度递增的任务中，WHOLE-MoMa显著优于WBC、行为克隆及多种离线强化学习基线方法。策略无需微调即可直接迁移至真实机器人，在双手抽屉操作任务中达到80%成功率，在同步橱柜开启与物体放置任务中达到68%成功率，且全程未使用任何遥操作或真实世界训练数据。

Abstract

Mobile Manipulation (MoMa) of articulated objects, such as opening doors, drawers, and cupboards, demands simultaneous, whole-body coordination between a robot’s base and arms. Classical whole-body controllers (WBCs) can solve such problems via hierarchical optimization, but require extensive hand-tuned optimization and remain brittle. Learning-based methods, on the other hand, show strong generalization capabilities but typically rely on expensive whole-body teleoperation data or heavy reward engineering. We observe that even a sub-optimal WBC is a powerful structural prior: it can be used to collect data in a constrained, task-relevant region of the state-action space, and its behavior can still be improved upon using offline reinforcement learning. Building on this, we propose WHOLE-MoMa, a two-stage pipeline that first generates diverse demonstrations by randomizing a lightweight WBC, and then applies offline RL to identify and stitch together improved behaviors via a reward signal. To support the expressive action-chunked diffusion policies needed for complex coordination tasks, we extend offline implicit Q-learning with Q-chunking for chunk-level critic evaluation and advantage-weighted policy extraction. On three tasks of increasing difficulty using a TIAGo++ mobile manipulator in simulation, WHOLE-MoMa significantly outperforms WBC, behavior cloning, and several offline RL baselines. Policies transfer directly to the real robot without finetuning, achieving 80% success in bimanual drawer manipulation and 68% in simultaneous cupboard opening and object placement, all without any teleoperated or real-world training data.

Keywords: Mobile Manipulation, Offline Reinforcement Learning, Whole-Body Control, Action-Chunked Diffusion Policies, Robot Learning, Articulated Objects, Sim-to-Real Transfer, TIAGo++ Robot
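
The abstract builds on implicit Q-learning (IQL) with advantage-weighted policy extraction. The sketch below shows the two standard IQL ingredients in scalar form: the expectile loss that fits the value function, and the exponentiated-advantage weight used when extracting the policy. How the paper evaluates these at the chunk level is not described in the abstract; here `q_value` is simply assumed to be the critic's output for a state and an entire action chunk, and the defaults `tau`, `beta`, and `max_weight` are illustrative values rather than the authors' settings.

```python
import math


def expectile_loss(q_value, v_value, tau=0.7):
    """Asymmetric squared error |tau - 1(u < 0)| * u^2 with u = Q - V.

    For tau > 0.5, underestimating Q is penalized more than
    overestimating it, pushing V toward an upper expectile of Q.
    """
    diff = q_value - v_value
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def advantage_weight(q_value, v_value, beta=3.0, max_weight=100.0):
    """Weight exp(beta * (Q - V)) for policy extraction, clipped for stability."""
    return min(math.exp(beta * (q_value - v_value)), max_weight)


# Hypothetical critic and value outputs for two action chunks.
good_chunk = advantage_weight(q_value=1.2, v_value=1.0)
poor_chunk = advantage_weight(q_value=0.6, v_value=1.0)
print(good_chunk, poor_chunk)
```

Chunks with positive advantage receive larger weights in the policy loss, which is how behaviors better than the average of the data collected from the sub-optimal controller get emphasized.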

237. ❌ From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception

Authors: Jilong Zhu, Yang Feng Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12508v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

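Scoring rationale: The paper focuses on the limitations of multimodal large language models (MLLMs) on fine-grained visual perception tasks and proposes a framework called Variational Information Flow (VIF) to mitigate visual attenuation. Its core content is highly relevant to large language models (LLMs), because MLLMs extend LLMs and the paper explicitly names and studies MLLMs. However, it does not touch the other keywords, such as MoE, SLMs, scaling laws, training techniques (pre-training, fine-tuning, alignment, RLHF, PEFT), inference optimization (RAG, context extension, attention optimization, decoding acceleration), reasoning methods (CoT, System 2, MCTS), self-improvement, agents, tool use, multi-agent systems, model compression, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI-for-science applications. Therefore only the first keyword (Large Language Models OR LLMs OR Foundation Models) receives 10 points, and all other keywords receive 0.

!!! tip deepseek-chat TL;DR

The paper targets the weak performance of multimodal large language models on fine-grained visual perception tasks, which it attributes to visual attenuation. It proposes the Variational Information Flow framework, which models visual saliency with a conditional variational autoencoder, and reports competitive improvements on general VQA, fine-grained perception, and visual grounding benchmarks.

Abstract translation (Chinese)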

尽管多模态大语言模型（MLLMs）在通用视觉理解方面展现出令人瞩目的能力，但在需要识别微小物体或辨别细微视觉关系的细粒度感知任务中，它们常常表现不佳。我们将此局限性归因于视觉衰减现象：在网络传播过程中，稀疏的细粒度视觉信号被占主导地位的文本标记过早地抑制或稀释，导致深层决策过程中出现“注意力流失”。现有以输入为中心的解决方案未能从根本上逆转这种信息损失的内在机制。为应对这一挑战，我们提出了变分信息流（Variational Information Flow, VIF）框架。VIF采用概率视角，利用条件变分自编码器（Conditional Variational Autoencoder, CVAE）将与问答对相关的视觉显著性建模为一个潜在分布。作为一个即插即用模块，VIF可以集成到现有架构中。在涵盖通用视觉问答（General VQA）、细粒度感知和视觉定位的多种基准测试上进行广泛评估，结果表明VIF相比先前方法带来了显著的性能提升，验证了其在增强MLLMs细粒度感知能力方面的有效性。

Abstract

While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a “loss of focus” during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, VIF can be integrated into existing architectures. Extensive evaluations across diverse benchmarks, covering General VQA, fine-grained perception, and visual grounding, demonstrate that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.

Keywords: Multimodal Large Language Models, Fine-grained Visual Perception, Visual Attenuation, Variational Information Flow, Conditional Variational Autoencoder, Visual Question Answering, Visual Grounding
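
The abstract models visual saliency as a latent distribution with a conditional variational autoencoder (CVAE). Two generic building blocks of any Gaussian CVAE are sketched below: the KL divergence between a posterior and a conditional prior, both diagonal Gaussians, and the reparameterized sample. The abstract does not say which networks produce these means and log-variances in VIF, so the inputs here are hypothetical numbers and the function names are illustrative.

```python
import math
import random


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ), summed over dimensions."""
    return sum(
        0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
        for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p)
    )


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), keeping z differentiable in mu and sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


# Hypothetical 2-D posterior (conditioned on question and answer)
# and prior (conditioned on the question only).
posterior_mu, posterior_logvar = [0.5, -0.2], [-1.0, -0.5]
prior_mu, prior_logvar = [0.0, 0.0], [0.0, 0.0]

kl_term = gaussian_kl(posterior_mu, posterior_logvar, prior_mu, prior_logvar)
latent = reparameterize(posterior_mu, posterior_logvar, random.Random(0))
print(kl_term, latent)
```

In a typical CVAE objective this KL term is added to a reconstruction or task loss, so that the prior alone can be sampled at inference time when the answer is unavailable.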

238. ❌ T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models

Authors: Nihal Jaiswal, Siddhartha Arjaria, Gyanendra Chaubey, Ankush Kumar, Aditya Singh, Anchal Chaurasiya Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12481v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

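Scoring rationale: The paper focuses on bias evaluation for text-to-image (T2I) generative models rather than on large language models (LLMs) or innovations in the technical principles of deep learning. Three keywords receive non-zero relevance: 1) ‘Instruction Tuning OR Alignment OR Value Alignment’ and ‘RLHF OR RLAIF OR Direct Preference Optimization OR DPO’ (5 points each): the paper uses an RLHF-aligned model as a reference baseline, which touches on alignment techniques but is not the core of the work; 2) ‘Hallucination Mitigation OR Factuality OR Truthfulness’ (8 points): the paper evaluates bias, element omission, and cultural collapse, which is closely related to hallucination mitigation and factuality, although it targets image generation rather than text. All other keywords are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes the T2I-BiasBench framework for evaluating demographic bias, element omission, and cultural collapse in text-to-image generative models. It finds bias amplification in Stable Diffusion v1.5 and BK-SDM on beauty-related prompts, and a collapse to a narrow set of cultural representations in all evaluated models, including the RLHF-aligned one.

Abstract translation (Chinese)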

文本到图像生成模型在视觉保真度方面取得了令人瞩目的成就，但继承并放大了训练数据中固有的人口统计失衡与文化偏见。我们提出了T2I-BiasBench，这是一个统一的评估框架，包含十三项互补性指标，能够联合捕捉扩散模型中的人口统计偏差、元素遗漏与文化坍缩——这是首个同时处理这三个维度的框架。
我们以Gemini 2.5 Flash（经过RLHF对齐）作为参考基线，评估了三个开源模型——Stable Diffusion v1.5、BK-SDM Base和Koala Lightning。该基准测试涵盖五个结构化提示类别，共生成1,574张图像。T2I-BiasBench整合了六项既有指标与七项新增指标：其中四项为新提出的（综合偏差分数、显性元素遗漏率、隐性元素遗漏率、文化准确率），三项为改编的（幻觉分数、Vendi分数、CLIP代理分数）。
研究得出三个关键发现：（1）Stable Diffusion v1.5和BK-SDM在与“美”相关的提示中表现出偏差放大现象（>1.0）；（2）诸如外科手术个人防护装备等情境约束能显著削弱职业角色的性别偏差（例如SD v1.5的“医生”综合偏差分数仅为0.06）；（3）包括经过RLHF对齐的Gemini在内的所有模型，其生成结果均坍缩至狭窄的文化表征集合（文化准确率：0.54-1.00），这证实了对齐技术并未解决文化覆盖度不足的问题。
T2I-BiasBench已公开发布，旨在为生成模型提供标准化、细粒度的偏差评估支持。项目页面地址为：https://gyanendrachaubey.github.io/T2I-BiasBench/

Abstract

Text-to-image (T2I) generative models achieve impressive visual fidelity but inherit and amplify demographic imbalances and cultural biases embedded in training data. We introduce T2I-BiasBench, a unified evaluation framework of thirteen complementary metrics that jointly captures demographic bias, element omission, and cultural collapse in diffusion models - the first framework to address all three dimensions simultaneously. We evaluate three open-source models - Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning - against Gemini 2.5 Flash (RLHF-aligned) as a reference baseline. The benchmark comprises 1,574 generated images across five structured prompt categories. T2I-BiasBench integrates six established metrics with seven additional measures: four newly proposed (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) and three adapted (Hallucination Score, Vendi Score, CLIP Proxy Score). Three key findings emerge: (1) Stable Diffusion v1.5 and BK-SDM exhibit bias amplification (>1.0) in beauty-related prompts; (2) contextual constraints such as surgical PPE substantially attenuate professional-role gender bias (Doctor CBS = 0.06 for SD v1.5); and (3) all models, including RLHF-aligned Gemini, collapse to a narrow set of cultural representations (CAS: 0.54-1.00), confirming that alignment techniques do not resolve cultural coverage gaps. T2I-BiasBench is publicly released to support standardized, fine-grained bias evaluation of generative models. The project page is available at: https://gyanendrachaubey.github.io/T2I-BiasBench/

Keywords: Text-to-Image Models, Bias Evaluation, Demographic Bias, Cultural Bias, Diffusion Models, RLHF-aligned, Hallucination Score, Evaluation Framework

239. ❌ Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling

Authors: Zida Li, Jun Li, Yuzhe Sha, Ziqiang Li, Lizhi Xiong, Zhangjie Fu Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12446v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

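Scoring rationale: The paper studies backdoor detection in text-to-image diffusion models, which belongs to computer vision and AI security. All scoring keywords focus on large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on), whereas this paper studies diffusion models, a kind of generative model with no direct connection to LLM techniques. Although both belong to AI, the technology stack, model architecture, and application scenarios are entirely different, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

The paper targets stealthy backdoor attacks on text-to-image diffusion models and proposes SET, an input-level detection framework based on Cross-Attention Scaling Response Divergence. SET identifies backdoor inputs without prior knowledge of the attack and outperforms existing baselines across diverse attack settings, improving AUROC by 9.1% and ACC by 6.5% over the best baseline.

Abstract translation (Chinese)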

文本到图像（Text-to-image, T2I）扩散模型在图像合成方面取得了显著成功，但其对大规模数据和开放生态系统的依赖引入了严重的后门安全风险。现有的防御方法，尤其是输入级方法，在部署上更具实用性，但通常依赖于可观察的异常特征，这些特征在隐蔽且保持语义的触发器设计下变得不可靠。随着现代后门攻击越来越多地将触发器嵌入自然输入中，这些方法的性能大幅下降，这引发了一个关键问题：能否利用良性输入与后门输入之间更稳定、隐式且与触发器无关的差异进行检测？在本工作中，我们从主动探测的角度应对这一挑战。我们通过对交叉注意力施加受控缩放扰动，发现了一种称为“交叉注意力缩放响应差异”（Cross-Attention Scaling Response Divergence, CSRD）的新现象，即良性输入与后门输入在去噪步骤中展现出系统性不同的响应演化模式。基于这一发现，我们提出了SET，一种输入级后门检测框架。该框架在多尺度扰动下构建响应偏移特征，并从小规模干净样本中学习一个紧凑的良性响应空间。随后通过测量输入特征与该学习空间的偏差进行检测，无需预先了解攻击细节或访问模型训练过程。大量实验表明，SET在多种攻击方法、触发器类型和模型设置下均持续优于现有基线方法，在隐蔽的隐式触发器场景下提升尤为显著。总体而言，SET将AUROC提升了9.1%，ACC提升了6.5%，优于最佳基线方法，凸显了其在实际部署中的有效性和鲁棒性。

Abstract

Text-to-image (T2I) diffusion models have achieved remarkable success in image synthesis, but their reliance on large-scale data and open ecosystems introduces serious backdoor security risks. Existing defenses, particularly input-level methods, are more practical for deployment but often rely on observable anomalies that become unreliable under stealthy, semantics-preserving trigger designs. As modern backdoor attacks increasingly embed triggers into natural inputs, these methods degrade substantially, raising a critical question: can more stable, implicit, and trigger-agnostic differences between benign and backdoor inputs be exploited for detection? In this work, we address this challenge from an active probing perspective. We introduce controlled scaling perturbations on cross-attention and uncover a novel phenomenon termed Cross-Attention Scaling Response Divergence (CSRD), where benign and backdoor inputs exhibit systematically different response evolution patterns across denoising steps. Building on this insight, we propose SET, an input-level backdoor detection framework that constructs response-offset features under multi-scale perturbations and learns a compact benign response space from a small set of clean samples. Detection is then performed by measuring deviations from this learned space, without requiring prior knowledge of the attack or access to model training. Extensive experiments demonstrate that SET consistently outperforms existing baselines across diverse attack methods, trigger types, and model settings, with particularly strong gains under stealthy implicit-trigger scenarios. Overall, SET improves AUROC by 9.1% and ACC by 6.5% over the best baseline, highlighting its effectiveness and robustness for practical deployment.

Keywords: text-to-image diffusion models, backdoor detection, cross-attention scaling, input-level defense, security, adversarial attacks, generative models
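
The detection step in the abstract, learning a benign response space from a few clean samples and flagging inputs that deviate from it, can be sketched generically. The abstract does not specify how SET represents that space, so the version below assumes a simple per-dimension mean and standard deviation with a root-mean-square z-score as the deviation measure. The feature vectors, the threshold, and the function names are all hypothetical.

```python
import statistics


def fit_benign_space(clean_features):
    """Estimate per-dimension mean and standard deviation from clean samples."""
    columns = list(zip(*clean_features))
    means = [statistics.fmean(column) for column in columns]
    # Guard against zero spread so the z-score below stays finite.
    stds = [max(statistics.pstdev(column), 1e-8) for column in columns]
    return means, stds


def deviation_score(feature, means, stds):
    """Root-mean-square z-score of one feature vector against the benign statistics."""
    squared = [((x - m) / s) ** 2 for x, m, s in zip(feature, means, stds)]
    return (sum(squared) / len(squared)) ** 0.5


def is_backdoor(feature, means, stds, threshold=3.0):
    """Flag an input whose response features lie far from the benign space."""
    return deviation_score(feature, means, stds) > threshold


# Hypothetical response-offset features for four clean prompts.
clean = [[0.10, 1.00], [0.12, 1.10], [0.08, 0.90], [0.10, 1.00]]
means, stds = fit_benign_space(clean)

print(is_backdoor([0.11, 1.05], means, stds))
print(is_backdoor([0.50, 2.00], means, stds))
```

Because only clean samples are needed to fit the statistics, a detector of this shape requires no knowledge of the trigger, which is the property the abstract emphasizes.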

240. ❌ DiffusionPrint: Learning Generative Fingerprints for Diffusion-Based Inpainting Localization

Authors: Paschalis Giakoumoglou, Symeon Papadopoulos Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12443v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

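Rationale: The paper focuses on localizing diffusion-based inpainting, using a contrastive learning framework to detect generative fingerprints; it belongs to computer vision and image forensics. Every scoring keyword concerns core large-model technology (large language models, model training, inference optimization, alignment, agent systems) or AI for Science applications, none of which the paper addresses, so the relevance for all keywords is 0.

!!! tip deepseek-chat TL;DR

The paper proposes DiffusionPrint, a framework that uses contrastive learning to detect the generative fingerprints of regions inpainted by diffusion models. It addresses the difficulty existing forensic methods have in localizing forged regions once latent decoding has disrupted noise patterns, and it significantly improves localization performance in several fusion frameworks.

Abstract (Chinese translation)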

现代基于扩散模型的修复技术对图像伪造定位提出了重大挑战，因其完整的再生流程通过潜在解码器重构整幅图像，破坏了现有取证方法所依赖的相机级噪声模式。我们提出DiffusionPrint，一种基于图像块的对比学习框架，该框架能够学习对潜在解码引入的频谱失真具有鲁棒性的取证信号。该方法利用同一模型生成的修复区域共享一致的生成指纹这一特性，并将其作为自监督信号。DiffusionPrint通过MoCo式目标函数、跨类别难负例挖掘以及生成器感知分类头，训练一个卷积骨干网络，从而生成取证特征图。该特征图可作为基于融合的伪造定位框架中高判别力的辅助模态。将DiffusionPrint集成至TruFor、MMFusion及一个轻量级融合基线后，其在多种生成模型上持续提升了定位性能，在微调阶段未见过的掩码类型上最高提升达+28%，并证实了对未见生成架构的泛化能力。代码发布于https://github.com/mever-team/diffusionprint。

Abstract

Modern diffusion-based inpainting models pose significant challenges for image forgery localization (IFL), as their full regeneration pipelines reconstruct the entire image via a latent decoder, disrupting the camera-level noise patterns that existing forensic methods rely on. We propose DiffusionPrint, a patch-level contrastive learning framework that learns a forensic signal robust to the spectral distortions introduced by latent decoding. It exploits the fact that inpainted regions generated by the same model share a consistent generative fingerprint, using this as a self-supervisory signal. DiffusionPrint trains a convolutional backbone via a MoCo-style objective with cross-category hard negative mining and a generator-aware classification head, producing a forensic feature map that serves as a highly discriminative secondary modality in fusion-based IFL frameworks. Integrated into TruFor, MMFusion, and a lightweight fusion baseline, DiffusionPrint consistently improves localization across multiple generative models, with gains of up to +28% on mask types unseen during fine-tuning and confirmed generalization to unseen generative architectures. Code is available at https://github.com/mever-team/diffusionprint

Keywords: diffusion-based inpainting, image forgery localization, generative fingerprint, contrastive learning, forensic signal, latent decoder, MoCo-style objective, fusion-based IFL
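
The "MoCo-style objective" mentioned in the abstract above is commonly implemented as an InfoNCE contrastive loss, which pulls a query embedding toward its positive key and pushes it away from negative keys. The following is a minimal sketch of that generic loss, not the authors' training code: the temperature of 0.07 is the usual MoCo default and is an assumption here, and the momentum encoder, queue, hard negative mining, and classification head are all omitted.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query: -log softmax of the positive logit."""
    q = _normalize(query)
    logits = [_dot(q, _normalize(positive)) / temperature]
    logits += [_dot(q, _normalize(k)) / temperature for k in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Toy 2-D patch embeddings (illustrative only).
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is near zero when the query matches its positive and grows when a negative is more similar than the positive, which is the property a fingerprint-learning backbone would be trained to exploit.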

241. ❌ A Hybrid Architecture for Benign-Malignant Classification of Mammography ROIs

Authors: Mohammed Asad, Mohit Bajpai, Sudhir Singh, Rahul Katarya Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12437v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

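Rationale: The paper focuses on benign-malignant classification of medical images (mammography), using a hybrid architecture of a CNN (EfficientNetV2-M) and a state space model (Vision Mamba). All keywords concern large language models (LLMs) or general large-model technology, whereas this paper studies a specific computer vision (CV) application and involves no LLMs, prompt engineering, alignment, reasoning, or agents. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because medical image analysis can be regarded as an application of AI in science (medicine); since this is not the paper's core content, it receives 5 points (some relevance). All other keywords are unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a hybrid architecture combining EfficientNetV2-M and Vision Mamba for benign-malignant classification of mammography ROIs, achieving strong lesion-level classification performance on the CBIS-DDSM dataset.

Abstract (Chinese translation)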

乳腺X线摄影中可疑病灶的准确表征对于早期诊断和治疗规划至关重要。虽然卷积神经网络（CNN）能有效提取局部视觉特征，但其在建模长程依赖关系方面存在局限。视觉Transformer（ViT）通过自注意力机制解决了这一问题，但其二次计算复杂度可能过高。本文提出一种混合架构，将用于局部特征提取的EfficientNetV2-M与用于高效全局上下文建模的视觉Mamba（一种状态空间模型，State Space Model, SSM）相结合。该模型基于CBIS-DDSM数据集，对以异常区域为中心的乳腺X线摄影感兴趣区域（ROIs）进行良恶性二元分类。通过将强CNN主干网络与线性计算复杂度的序列模型相结合，该方法在基于ROI的设定下实现了优异的病灶级分类性能。

Abstract

Accurate characterization of suspicious breast lesions in mammography is important for early diagnosis and treatment planning. While Convolutional Neural Networks (CNNs) are effective at extracting local visual patterns, they are less suited to modeling long-range dependencies. Vision Transformers (ViTs) address this limitation through self-attention, but their quadratic computational cost can be prohibitive. This paper presents a hybrid architecture that combines EfficientNetV2-M for local feature extraction with Vision Mamba, a State Space Model (SSM), for efficient global context modeling. The proposed model performs binary classification of abnormality-centered mammography regions of interest (ROIs) from the CBIS-DDSM dataset into benign and malignant classes. By combining a strong CNN backbone with a linear-complexity sequence model, the approach achieves strong lesion-level classification performance in an ROI-based setting.

Keywords: Mammography, Benign-Malignant Classification, Hybrid Architecture, EfficientNetV2-M, Vision Mamba, State Space Model, CBIS-DDSM, ROI Classification

242. ❌ DeferredSeg: A Multi-Expert Deferral Framework for Trustworthy Medical Image Segmentation

Authors: Qiuyu Tian, Haoliang Sun, Yunshan Wang, Yinghuan Shi, Yilong Yin Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12411v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

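Rationale: The paper focuses on improving the trustworthiness of medical image segmentation and proposes a multi-expert deferral framework (DeferredSeg). It falls within AI for Science (biomedical AI applications) and is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points). The framework involves collaborative decision-making across multiple expert branches, which relates it to 'Mixture of Experts OR MoE OR Sparse Models' (8 points), although it is not MoE sparse-model technology in the strict sense. The other keywords mainly concern the technical principles, training methods, inference optimization, and agent systems of large language models (LLMs); this paper studies deep-neural-network-based medical image segmentation and involves no LLMs, language-model techniques, or related concepts, so those keywords receive 0 points.

!!! tip deepseek-chat TL;DR

Addressing the unreliable confidence of medical image segmentation models in ambiguous regions, the paper proposes DeferredSeg, a multi-expert deferral framework that dynamically routes pixels to either the base segmentor or human experts and introduces spatial-coherence and load-balancing mechanisms, improving both the trustworthiness and the performance of segmentation.

Abstract (Chinese translation)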

基于深度神经网络的分割模型在医学图像分割中展现出强大的泛化能力。然而，这些模型常表现出过度自信或自信不足，导致分割掩码的置信度评分不可靠，尤其在模糊区域。这削弱了临床部署所需的可信度。受学习延迟（L2D）范式启发，我们提出了DeferredSeg——一种延迟感知的分割框架，即一种人机协作系统，用于决定是否将特定区域的预测交由人类专家处理。
DeferredSeg通过聚合延迟预测器和额外的路由通道扩展了基础分割器，这些通道动态地将每个像素路由至基础分割器或人类专家。为高效训练此路由机制，我们引入了像素级代理协作损失来监督延迟决策。此外，为保持延迟区域内的空间连贯性，我们提出了空间连贯性损失以强制生成平滑的延迟掩码，从而提升可靠性。
在单专家延迟机制之外，我们通过引入多个差异专家进行协同决策，进一步将框架扩展至多专家场景。为防止单个专家过载或利用不足，我们进一步设计了负载均衡惩罚机制，以在专家分支间均匀分配工作量。我们在三个具有挑战性的医学数据集上评估DeferredSeg，并使用MedSAM和CENet作为基础分割器以确保公平比较。实验结果表明，DeferredSeg始终优于基线方法，证明了其在可信密集医学分割中的有效性。此外，所提框架与模型无关，可轻松应用于其他分割架构。

Abstract

Segmentation models based on deep neural networks demonstrate strong generalization for medical image segmentation. However, they often exhibit overconfidence or underconfidence, leading to unreliable confidence scores for segmentation masks, especially in ambiguous regions. This undermines the trustworthiness required for clinical deployment. Motivated by the learning-to-defer (L2D) paradigm, we introduce DeferredSeg, a deferral-aware segmentation framework, i.e., a Human–AI collaboration system that determines whether to defer predictions to human experts in specific regions. DeferredSeg extends the base segmentor with an aggregated deferral predictor and additional routing channels that dynamically route each pixel to either the base segmentor or a human expert. To train this routing efficiently, we introduce a pixel-wise surrogate collaboration loss that supervises deferral decisions. In addition, to preserve spatial coherence within deferral regions, we propose a spatial-coherence loss that enforces smooth deferral masks, thereby enhancing reliability. Beyond single-expert deferral, we further extend the framework to a multi-expert setting by introducing multiple discrepancy experts for collaborative decision-making. To prevent overloading or underutilizing individual experts, we further design a load-balancing penalty that evenly distributes workload across expert branches. We evaluate DeferredSeg on three challenging medical datasets using MedSAM and CENet as the base segmentor for fair comparison. Experimental results show that DeferredSeg consistently outperforms the baseline, demonstrating its effectiveness for trustworthy dense medical segmentation. Moreover, the proposed framework is model-agnostic and can be readily applied to other segmentation architectures.

Keywords: medical image segmentation, trustworthy AI, deferral framework, multi-expert system, human-AI collaboration, confidence calibration, spatial coherence, load balancing
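
The abstract above mentions a load-balancing penalty that spreads workload evenly across expert branches but does not give its form. One common way to express such a penalty, shown here purely as an assumed illustration, is the squared deviation of each expert's average routing probability from the uniform share. The function name and the routing values are hypothetical.

```python
def load_balance_penalty(routing_probs):
    """Sum of squared deviations of per-expert mean load from the uniform share.

    routing_probs: one list per pixel, each holding that pixel's routing
    probabilities over the expert branches (assumed to sum to 1).
    """
    num_pixels = len(routing_probs)
    num_experts = len(routing_probs[0])
    loads = [
        sum(p[e] for p in routing_probs) / num_pixels
        for e in range(num_experts)
    ]
    uniform = 1.0 / num_experts
    return sum((load - uniform) ** 2 for load in loads)


# Two pixels, two experts (illustrative values only).
balanced = load_balance_penalty([[1.0, 0.0], [0.0, 1.0]])
skewed = load_balance_penalty([[1.0, 0.0], [1.0, 0.0]])
```

The penalty is zero when both experts receive the same share of pixels and grows as routing collapses onto a single expert, so adding it to a training loss would discourage overloading one branch.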

243. ❌ Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning

Authors: Jungwon Choi, Eunwoo Kim Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12403v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

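Rationale: The paper studies test-time prompt tuning (TPT) for vision-language models (VLMs), at the intersection of computer vision and natural language processing, but all scoring keywords target large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization). The paper does not address LLM technical principles, training methods, inference optimization, alignment, agents, or AI for Science applications, so it is unrelated to every keyword.

!!! tip deepseek-chat TL;DR

Addressing unreliable view selection in test-time prompt tuning of vision-language models, the paper proposes a dual-modality guidance framework based on text and image anchors that filters for informative views via semantic alignment and combines predictions in a weighted ensemble, achieving state-of-the-art performance on 15 benchmark datasets.

Abstract (Chinese translation)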

测试时提示调优（TPT）通过增强视图来适配视觉语言模型，但其效果受限于难以判定哪些视图具有增益作用。标准的基于熵的过滤方法依赖于模型内部置信度分数，这些分数在分布偏移下常出现校准失准，可能对无关图像裁剪或背景区域赋予高置信度，同时忽略语义内容。为解决此问题，我们提出一种双模态锚点引导框架，将视图选择建立在语义证据基础上。我们引入来自丰富属性描述的文本锚点，以提供细粒度类别语义；同时设计自适应图像锚点，以捕捉动态变化的测试时统计特征。利用这些锚点，我们基于对齐度和置信度对视图进行筛选，确保仅信息量丰富的视图指导适配过程。此外，我们将锚点视为辅助预测头，将其预测结果与原始输出通过置信度加权集成相结合，从而为提示更新生成稳定的监督信号。在15个基准数据集上的大量实验证明了该方法取得了新的最优性能，凸显了锚点引导监督作为鲁棒提示更新基础的重要贡献。

Abstract

Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor from attribute-rich descriptions, to provide fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.

Keywords: Test-Time Prompt Tuning, Vision-Language Models, Dual-Modality, Anchor-Guided Filtering, View Selection, Semantic Alignment, Confidence-Weighted Ensemble, Distribution Shift
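
For context, the standard entropy-based filtering that the abstract above criticizes can be sketched as follows: compute the prediction entropy of each augmented view, keep the lowest-entropy fraction, and average the kept predictions. This is a sketch of the baseline only, not of the paper's anchor-guided method; the `keep_fraction` value and the toy probabilities are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_confident_views(view_probs, keep_fraction=0.1):
    """Indices of the lowest-entropy views, most confident first."""
    k = max(1, int(len(view_probs) * keep_fraction))
    ranked = sorted(range(len(view_probs)), key=lambda i: entropy(view_probs[i]))
    return ranked[:k]


def average_prediction(view_probs, indices):
    """Mean class-probability vector over the selected views."""
    num_classes = len(view_probs[0])
    return [
        sum(view_probs[i][c] for i in indices) / len(indices)
        for c in range(num_classes)
    ]


# Four augmented views over three classes (illustrative values only).
views = [
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],
    [0.60, 0.30, 0.10],
    [0.98, 0.01, 0.01],
]
selected = select_confident_views(views, keep_fraction=0.5)
averaged = average_prediction(views, selected)
```

Because the ranking depends only on the model's own confidence, a confidently wrong view (for example a background crop) would still be selected, which is the failure mode the abstract attributes to miscalibration under distribution shift.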

244. ❌ Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

Authors: Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12371v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

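Rationale: The paper studies typographic prompt injection attacks on vision-language models (VLMs), which belongs to multimodal model security and is unrelated to the vast majority of keywords (which mainly target text-only large-model techniques). It has some relevance only to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5 points), because the paper mentions VLMs as the perceptual backbone of autonomous agents but does not examine agentic workflows themselves in depth.

!!! tip deepseek-chat TL;DR

The paper studies the effectiveness of typographic prompt injection attacks on vision-language models (VLMs), finding that font size, visual transformations, and text-image embedding distance are significantly correlated with attack success rate and that robustness patterns differ across models, which offers empirical guidance for choosing VLM backbones in adversarial environments.

Abstract (Chinese translation)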

本研究探讨针对视觉语言模型（VLM）的排版提示注入攻击，其中对抗性文本以图像形式呈现以绕过安全机制。随着VLM日益成为自主智能体的感知核心（涵盖浏览器自动化、计算机使用系统乃至配备摄像头的具身智能体），此类攻击构成日益增长的威胁。实际攻击场景具有异质性：对抗性文本以不同字体尺寸出现于多样化的视觉条件下，而不断扩展的VLM生态系统在脆弱性方面表现出显著差异，这使防御策略复杂化。我们在四种VLM（GPT-4o、Claude Sonnet 4.5、Mistral-Large-3和Qwen3-VL-4B-Instruct）上评估来自SALAD-Bench的1000条提示，测试涵盖不同字体尺寸（6-28像素）和视觉变换（旋转、模糊、噪声、对比度变化），发现：（1）字体尺寸显著影响攻击成功率（ASR），极小字体（6像素）的ASR接近零，而中等尺寸字体达到峰值效果；（2）对GPT-4o（36%对比8%）和Claude（47%对比22%），文本攻击比图像攻击更有效，而Qwen3-VL和Mistral在两种模态下ASR相当；（3）来自两种多模态嵌入模型（JinaCLIP和Qwen3-VL-Embedding）的文本-图像嵌入距离与所有四种模型的ASR呈强负相关（r = -0.71至-0.93，p < 0.01）；（4）严重退化使嵌入距离增加10-12%、ASR降低34-96%，而旋转对模型的影响不对称（Mistral下降50%，GPT-4o无变化）。这些发现表明，模型特定的鲁棒性模式排除了通用防御方案的可能，并为从业者在对抗性环境中为智能体系统选择VLM骨干网络提供了实证指导。

Abstract

We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely, GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct under varying font sizes (6–28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10–12% and reduce ASR by 34–96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.

Keywords: vision-language models, typographic prompt injection attacks, adversarial text, embedding alignment, attack success rate, model robustness, multimodal embedding, agentic systems
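
Finding (3) in the abstract above reports a Pearson correlation between text-image embedding distance and attack success rate (ASR). As a minimal sketch of how such a coefficient is computed, the snippet below implements Pearson's r directly. The distance and ASR values are synthetic placeholders chosen for illustration and are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Synthetic per-condition values (NOT from the paper).
embedding_distance = [0.30, 0.35, 0.42, 0.50, 0.58]
attack_success_rate = [0.45, 0.40, 0.28, 0.15, 0.05]
r = pearson_r(embedding_distance, attack_success_rate)
```

A strongly negative r, as in this toy data, corresponds to the pattern the abstract describes: the further the rendered text drifts from its textual embedding, the less often the attack succeeds.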

245. ❌ Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

Authors: Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, Chanyoung Park Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12358v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

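Rationale: The paper studies visual token pruning in multimodal large language models (MLLMs) and is highly relevant to 'Large Language Models' (10 points), because MLLMs are an extension of LLMs. The paper concentrates on complex visual reasoning tasks and is highly relevant to 'Chain of Thought' and 'System 2 Thinking' (10 points each), because the reasoning process involves multi-step and in-depth thinking. Other keywords, such as MoE, SLMs, training methods, and inference acceleration, are not addressed in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper investigates why visual token pruning in multimodal large language models fails on complex reasoning tasks and proposes the Decoding-stage Shift-aware Token Pruning (DSTP) framework, which significantly mitigates performance degradation and improves generalization.

Abstract (Chinese translation)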

近期，视觉令牌剪枝技术被提出以处理多模态大语言模型中数量庞大的视觉令牌。然而，我们观察到，尽管现有剪枝方法在简单视觉理解任务上表现可靠，却难以有效泛化至复杂的视觉推理任务，这一关键差距在先前研究中尚未得到充分探索。通过系统性分析，我们发现解码过程中的相关视觉信息偏移（Relevant Visual Information Shift, RVIS）是导致失败的主要原因。为解决此问题，我们提出解码阶段偏移感知令牌剪枝（Decoding-stage Shift-aware Token Pruning, DSTP），这是一种无需训练的附加框架，能够使现有剪枝方法在解码阶段将视觉令牌与动态变化的推理需求对齐。大量实验表明，DSTP显著减轻了剪枝方法在复杂推理任务中的性能下降，同时即便在视觉理解基准测试中也持续带来性能提升。此外，DSTP在多种先进架构中均表现出有效性，凸显了其泛化能力以及在极小计算开销下的高效性。

Abstract

Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.

Keywords: Multimodal Large Language Models, Visual Token Pruning, Complex Visual Reasoning, Relevant Visual Information Shift, Decoding-stage Shift-aware Token Pruning, Training-free Framework, Performance Degradation Mitigation

246. ❌ OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion

Authors: Dongjian Yu, Weiqing Min, Qian Jiang, Xing Lin, Xin Jin, Shuqiang Jiang Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12356v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

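Rationale: The paper sits at the intersection of computer vision and nutrition science and proposes an end-to-end framework for nutrition estimation from a single RGB image, comprising depth-map prediction, frequency-aligned fusion, and a mask-based prediction head. It is unrelated to most of the keywords on large-model and deep-learning technical principles, which mainly concern language models, training methods, inference techniques, alignment, compression, and similar specific areas. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to nutrition science (a field related to bioinformatics); since this is not the core innovation, it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper presents the OmniFood8K dataset and an end-to-end framework that predicts a depth map from a single RGB image and performs frequency-aligned fusion to estimate the nutrition of Chinese dishes accurately.

Abstract (Chinese translation)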

食物营养的精确估算对促进健康饮食习惯和个性化饮食管理至关重要。现有食物数据集大多以西方菜肴为主，缺乏对中国菜品的充分覆盖，这限制了对中式餐食营养的精确评估。此外，许多先进的营养预测方法依赖深度传感器，制约了其在日常场景中的适用性。为应对这些局限，我们提出了OmniFood8K——一个包含8,036个食物样本的综合多模态数据集，每个样本均配有详细的营养标注和多视角图像。同时，为增强模型的营养预测能力，我们构建了NutritionSynth-115K大规模合成数据集，该数据集在保持精确营养标签的同时引入了成分多样性。此外，我们提出了一种从单张RGB图像进行营养预测的端到端框架。首先，我们从单张RGB图像预测深度图，并设计了尺度偏移残差适配器（Scale-Shift Residual Adapter, SSRA）对其进行优化，以实现全局尺度一致性与局部结构保持。其次，我们提出了频率对齐融合模块（Frequency-Aligned Fusion Module, FAFM），在频域中对RGB与深度特征进行分层对齐与融合。最后，我们设计了基于掩码的预测头（Mask-based Prediction Head, MPH），通过动态通道选择来强调关键食材区域，从而实现更精准的预测。在多个数据集上的大量实验证明了我们的方法相较于现有方案的优越性。项目主页：https://yudongjian.github.io/OmniFood8K-food/

Abstract

Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets primarily focus on Western cuisines and lack sufficient coverage of Chinese dishes, which restricts accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food samples, each with detailed nutritional annotations and multi-view images. In addition, to enhance models’ capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework for nutritional prediction from a single RGB image. First, we predict a depth map from a single RGB image and design the Scale-Shift Residual Adapter (SSRA) to refine it for global scale consistency and local structural preservation. Second, we propose the Frequency-Aligned Fusion Module (FAFM) to hierarchically align and fuse RGB and depth features in the frequency domain. Finally, we design a Mask-based Prediction Head (MPH) to emphasize key ingredient regions via dynamic channel selection for more accurate prediction. Extensive experiments on multiple datasets demonstrate the superiority of our method over existing approaches. Project homepage: https://yudongjian.github.io/OmniFood8K-food/

Keywords: nutrition estimation, single-image, depth prediction, frequency-aligned fusion, food dataset, Chinese dishes, multimodal, end-to-end framework

247. ❌ Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection

Authors: Haifeng Zhang, Qinghui He, Xiuli Bi, Bo Liu, Chi-Man Pun, Bin Xiao Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12353v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

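Scoring rationale: This paper focuses on AI-generated image detection and proposes an adversarial feature learning framework that suppresses generation-pattern and content bias. It uses a pretrained multimodal image encoder as the feature-extraction backbone, which gives it some connection to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (relevance 5). However, its core contribution is not an innovation in large language models or in the principles of deep learning techniques; it is a generated-image detection method from computer vision. The remaining keywords mainly concern LLM technical details, training methods, inference optimization, alignment, and agent systems, and have no direct relation to the paper's computer vision detection task, so they receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a multi-dimensional adversarial feature learning framework that improves the cross-model generalization of AI-generated image detection by suppressing generation-pattern and content bias; in its experiments it exceeds existing methods by 10.89% in accuracy.

Abstract (Chinese translation)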

近年来，生成式人工智能技术的飞速发展显著降低了制作高质量伪造图像的门槛，对信息的真实性与可信度构成了严峻挑战。现有的生成图像检测方法通常通过模型架构或网络设计来提升泛化能力，但其泛化性能仍易受数据偏差影响，因为训练数据可能驱使模型拟合特定的生成模式与内容，而非不同生成模型图像间共有的特征（非对称偏差学习）。为应对这一问题，我们提出了一种多维对抗特征学习框架。该框架采用预训练的多模态图像编码器作为特征提取主干，构建真假特征学习网络，并设计了一个配备多维对抗损失的对抗偏差学习分支，从而在真实性判别特征学习与偏差特征学习之间形成对抗训练机制。通过抑制生成模式与内容偏差，MAFL引导模型聚焦于不同生成模型间共享的生成特征，从而有效捕捉真实图像与生成图像间的本质差异，增强跨模型泛化能力，并大幅降低对大规模训练数据的依赖。经大量实验验证，本方法在准确率上超越现有最优方法10.89%，平均精度提升8.57%。值得注意的是，即使仅使用320张图像进行训练，该方法在公开数据集上仍能实现超过80%的检测准确率。

Abstract

In recent years, the rapid development of generative artificial intelligence technology has significantly lowered the barrier to creating high-quality fake images, posing a serious challenge to information authenticity and credibility. Existing generated image detection methods typically enhance generalization through model architecture or network design. However, their generalization performance remains susceptible to data bias, as the training data may drive models to fit specific generative patterns and content rather than the common features shared by images from different generative models (asymmetric bias learning). To address this issue, we propose a Multi-dimensional Adversarial Feature Learning (MAFL) framework. The framework adopts a pretrained multimodal image encoder as the feature extraction backbone, constructs a real-fake feature learning network, and designs an adversarial bias-learning branch equipped with a multi-dimensional adversarial loss, forming an adversarial training mechanism between authenticity-discriminative feature learning and bias feature learning. By suppressing generation-pattern and content biases, MAFL guides the model to focus on the generative features shared across different generative models, thereby effectively capturing the fundamental differences between real and generated images, enhancing cross-model generalization, and substantially reducing the reliance on large-scale training data. Through extensive experimental validation, our method outperforms existing state-of-the-art approaches by 10.89% in accuracy and 8.57% in Average Precision (AP). Notably, even when trained with only 320 images, it can still achieve over 80% detection accuracy on public datasets.

Keywords: AI-generated image detection, adversarial feature learning, generalization, bias suppression, multimodal encoder, cross-model generalization, fake image detection

248. ❌ Fundus Image-based Glaucoma Screening via Retinal Knowledge-Oriented Dynamic Multi-Level Feature Integration

Authors: Yuzhuo Zhou, Chi Liu, Sheng Shen, Zongyuan Ge, Fengshi Jing, Shiran Zhang, Yu Jiang, Anli Wang, Wenjian Liu, Feilong Yang, Tianqing Zhu, Xiaotong Han Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12351v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

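Scoring rationale: This paper focuses on medical image analysis (glaucoma screening from fundus images), an application of AI in biomedicine. It uses deep learning techniques (CNNs, attention mechanisms) and a pre-trained foundation model to extract retinal priors, but its core content is not large language models (LLMs) or general large-model technology. Among all keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' has some connection to the paper's topic (bioinformatics and medical AI applications). The others all concern techniques tied directly to large models, such as LLMs, MoE, alignment, reasoning, and agents, and are unrelated to the paper's content.

!!! tip deepseek-chat TL;DR

The study proposes a dynamic multi-scale feature learning framework guided by retinal anatomical knowledge for automated glaucoma screening from fundus images. It reaches an AUC of 98.5% on the AIROGS dataset and shows cross-domain generalization on datasets from the SMDG-19 benchmark.

Abstract (Chinese translation)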

基于彩色眼底照相的自动化诊断对于大规模青光眼筛查至关重要。然而，现有的深度学习模型通常是数据驱动的，缺乏对视网膜解剖知识的显式整合，这限制了其在异构临床数据集上的鲁棒性。此外，眼底图像中的病理线索可能出现在预定义的解剖区域之外，使得固定区域的特征提取不足以实现可靠的诊断。为解决这些挑战，我们提出了一种面向视网膜知识的青光眼筛查框架，该框架将动态多尺度特征学习与特定领域的视网膜先验知识相结合。该框架采用三分支结构来捕获互补的视网膜表征，包括全局视网膜上下文、视盘/视杯的结构特征以及动态定位的病理区域。我们设计了一种动态窗口机制来自适应地识别具有诊断信息价值的区域，同时通过一个知识增强的卷积注意力模块，整合从预训练基础模型中提取的视网膜先验知识来引导注意力学习。在大规模AIROGS数据集上的大量实验表明，所提方法优于多种基线模型，实现了98.5%的AUC和94.6%的准确率。在SMDG-19基准测试的多个数据集上的进一步评估，进一步证实了其强大的跨域泛化能力，表明知识引导的注意力与自适应病灶定位相结合，能显著提升自动化青光眼筛查系统的鲁棒性。

Abstract

Automated diagnosis based on color fundus photography is essential for large-scale glaucoma screening. However, existing deep learning models are typically data-driven and lack explicit integration of retinal anatomical knowledge, which limits their robustness across heterogeneous clinical datasets. Moreover, pathological cues in fundus images may appear beyond predefined anatomical regions, making fixed-region feature extraction insufficient for reliable diagnosis. To address these challenges, we propose a retinal knowledge-oriented glaucoma screening framework that integrates dynamic multi-scale feature learning with domain-specific retinal priors. The framework adopts a tri-branch structure to capture complementary retinal representations, including global retinal context, structural features of the optic disc/cup, and dynamically localized pathological regions. A Dynamic Window Mechanism is devised to adaptively identify diagnostically informative regions, while a Knowledge-Enhanced Convolutional Attention Module incorporates retinal priors extracted from a pre-trained foundation model to guide attention learning. Extensive experiments on the large-scale AIROGS dataset demonstrate that the proposed method outperforms diverse baselines, achieving an AUC of 98.5% and an accuracy of 94.6%. Additional evaluations on multiple datasets from the SMDG-19 benchmark further confirm its strong cross-domain generalization capability, indicating that knowledge-guided attention combined with adaptive lesion localization can significantly improve the robustness of automated glaucoma screening systems.

Keywords: glaucoma screening, fundus image, retinal knowledge, dynamic multi-level feature integration, knowledge-enhanced attention, cross-domain generalization, automated diagnosis, AI for medical imaging

249. ❌ Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization

Authors: Zanyi Wang, Fan Li, Dengyang Jiang, Liuzhuozheng Li, Yunhua Zhong, Guang Dai, Mengmeng Wang Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12346v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

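Scoring rationale: The paper studies the small-data challenge in spatio-temporal video grounding and proposes the ST-GD framework, which adapts a pre-trained model to video tasks by freezing it and injecting lightweight adapters (about 10M trainable parameters). This is highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (relevance 10) because the paper explicitly uses parameter-efficient fine-tuning. It has some connection to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (relevance 5) because it adapts a pre-trained model to a new domain (from 2D images to video). The other keywords are unrelated (0), since the paper focuses on computer vision and video understanding rather than large language models, reasoning, alignment, or AI for science.

!!! tip deepseek-chat TL;DR

The paper addresses spatio-temporal video grounding under data scarcity and proposes ST-GD, a parameter-efficient adaptation framework that freezes the pre-trained model and injects lightweight adapters, achieving competitive performance in limited-data settings.

Abstract (Chinese translation)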

时空视频定位（STVG）旨在动态视频片段中定位查询对象。当前主流的全监督方法存在严重的数据依赖问题。然而，大规模STVG数据的收集异常困难：密集的帧级边界框标注和复杂的时序语言对齐标注成本极高，在专业视频领域尤为突出。因此，传统模型在这些固有有限的数据集上极易过拟合，而零样本基础模型则缺乏精确定位所需的任务特异性时序感知能力。为应对这一小数据挑战，我们提出了ST-GD，一个数据高效框架，可将预训练的二维视觉-语言模型（例如Grounding DINO）适配于视频任务。为避免在小数据集上破坏预训练先验知识，ST-GD保持基础模型冻结，并策略性地注入轻量级适配器（约1000万可训练参数）以注入时空感知能力，同时结合一个新颖的时序解码器进行边界预测。这一设计天然应对了数据稀缺问题。因此，ST-GD在数据稀缺场景中表现卓越，在有限规模的HC-STVG v1/v2基准测试上取得了极具竞争力的性能，同时在VidSTG数据集上保持了强大的泛化能力。这验证了ST-GD是在严格小数据约束下进行复杂视频理解的有效范式。

Abstract

Spatio-temporal video grounding (STVG) aims to localize queried objects within dynamic video segments. Prevailing fully-trained approaches are notoriously data-hungry. However, gathering large-scale STVG data is exceptionally challenging: dense frame-level bounding boxes and complex temporal language alignments are prohibitively expensive to annotate, especially for specialized video domains. Consequently, conventional models suffer from severe overfitting on these inherently limited datasets, while zero-shot foundational models lack the task-specific temporal awareness needed for precise localization. To resolve this small-data challenge, we introduce ST-GD, a data-efficient framework that adapts pre-trained 2D visual-language models (e.g., Grounding DINO) to video tasks. To avoid destroying pre-trained priors on small datasets, ST-GD keeps the base model frozen and strategically injects lightweight adapters (~10M trainable parameters) to instill spatio-temporal awareness, alongside a novel temporal decoder for boundary prediction. This design naturally counters data scarcity. Consequently, ST-GD excels in data-scarce scenarios, achieving highly competitive performance on the limited-scale HC-STVG v1/v2 benchmarks, while maintaining robust generalization on the VidSTG dataset. This validates ST-GD as a powerful paradigm for complex video understanding under strict small-data constraints.

Keywords: Spatio-temporal video grounding, Parameter-efficient fine-tuning, Limited-data adaptation, Video understanding, Adapters, Small-data challenge, Temporal localization
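
The adapter idea in the ST-GD abstract can be illustrated with a minimal sketch. The abstract does not describe the adapter architecture, so the bottleneck layout (down-projection, ReLU, up-projection, residual connection), the zero-initialized up-projection, and the names `make_adapter`, `adapter_forward`, and `count_params` are all assumptions made for illustration, not the authors' design. The sketch only shows why such a module is cheap to train and why it can leave a frozen model's features unchanged at initialization.

```python
import random


def make_adapter(dim, bottleneck, seed=0):
    """Build a hypothetical bottleneck adapter as two weight matrices."""
    rng = random.Random(seed)
    # Down-projection (bottleneck x dim): small random values.
    down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
    # Up-projection (dim x bottleneck): zeros, so the adapter starts as identity.
    up = [[0.0] * bottleneck for _ in range(dim)]
    return {"down": down, "up": up}


def adapter_forward(adapter, x):
    """Return x + up(relu(down(x))); x stands in for a frozen-model feature."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)))
              for row in adapter["down"]]
    delta = [sum(w * h for w, h in zip(row, hidden)) for row in adapter["up"]]
    return [xi + di for xi, di in zip(x, delta)]


def count_params(adapter):
    """Count trainable weights; the frozen base model contributes none."""
    return sum(len(row) for matrix in adapter.values() for row in matrix)


adapter = make_adapter(dim=8, bottleneck=2)
features = [0.5, -1.0, 2.0, 0.0, 1.5, -0.25, 3.0, 1.0]
print(count_params(adapter))
print(adapter_forward(adapter, features) == features)
```

With a hypothetical feature width of 8 and a bottleneck of 2, the adapter holds 2 × 8 + 8 × 2 = 32 weights, and the zero-initialized up-projection makes the initial output equal to the input.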

250. ❌ Detecting Precise Hand Touch Moments in Egocentric Video

Authors: Huy Anh Nguyen, Feras Dayoub, Minh Hoai Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12343v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

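Scoring rationale: This paper focuses on hand contact detection, a computer vision task, and uses deep learning models (e.g., CNNs and attention mechanisms) to process first-person video. Its content covers video understanding, action recognition, and dataset construction. All scoring keywords relate to large language models (LLMs), large-model technical principles, AI for Science, and similar topics, none of which the paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes HiCE, a deep learning method for precisely detecting the moment a hand touches an object in first-person video. On the newly built TouchMoment dataset it outperforms existing methods by 16.91% average precision.

Abstract (Chinese translation)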

我们致力于解决在自我中心视角视频中检测手部与物体发生接触的精确时刻这一挑战性任务。这种帧级检测对于增强现实、人机交互、辅助技术和机器人学习应用至关重要，因为接触起始信号标志着动作的启动或完成。由于接触点附近手部运动的细微变化、频繁的遮挡、精细的操作模式以及第一人称视角固有的运动动态，实现时间上精确的检测尤为困难。
为应对这些挑战，我们提出了一种手部信息上下文增强模块（Hand-informed Context Enhanced module，简称HiCE），该模块通过交叉注意力机制，利用手部区域及其周围上下文的时空特征，学习识别潜在的接触模式。我们的方法进一步通过一种抓握感知损失和软标签进行优化，该损失强调触摸事件特有的手部姿态模式和运动动态，使模型能够区分接近接触帧和实际接触帧。我们还引入了TouchMoment数据集，这是一个包含4,021个视频、8,456个标注接触时刻、总帧数超过一百万帧的自我中心视角数据集。在TouchMoment数据集上的实验表明，在仅当预测落在真实接触时刻两帧容差范围内才计为正确的严格评估标准下，我们的方法取得了显著提升，并以16.91%的平均精度优势超越了当前最先进的事件定位基线方法。

Abstract

We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced 'high-see') that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft label that emphasizes hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method achieves substantial gains and outperforms state-of-the-art event-spotting baselines by 16.91% average precision.

Keywords: egocentric video, hand contact detection, spatiotemporal features, cross-attention, grasp-aware loss, TouchMoment dataset, frame-level detection, action initiation
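
The strict criterion described in the abstract, where a prediction counts only if it falls within two frames of a ground-truth moment, can be sketched as follows. The one-to-one nearest-match rule and the function name `match_within_tolerance` are assumptions. The paper reports average precision, which additionally ranks predictions by confidence; that part is not reproduced here.

```python
def match_within_tolerance(predicted, ground_truth, tolerance=2):
    """Count true positives, false positives, and false negatives.

    A predicted frame is a true positive if an unmatched ground-truth frame
    lies within `tolerance` frames of it. Each ground-truth frame can be
    matched at most once (an assumed matching rule).
    """
    unmatched = sorted(ground_truth)
    true_positives = 0
    for frame in sorted(predicted):
        candidates = [g for g in unmatched if abs(g - frame) <= tolerance]
        if candidates:
            # Consume the closest ground-truth moment.
            best = min(candidates, key=lambda g: abs(g - frame))
            unmatched.remove(best)
            true_positives += 1
    false_positives = len(predicted) - true_positives
    false_negatives = len(unmatched)
    return true_positives, false_positives, false_negatives


# Hypothetical frame indices, chosen only to exercise the rule.
print(match_within_tolerance([10, 13, 50], [11, 30, 52]))  # (2, 1, 1)
```

In this toy call, frame 10 matches ground truth 11, frame 50 matches 52 (exactly two frames away), and frame 13 has no unmatched ground truth within tolerance, so it is a false positive.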

251. ❌ CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training

Authors: Qi Li, Cheng-Long Wang, Yinzhi Cao, Di Wang Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12342v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

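Scoring rationale: The paper studies the privacy risks of training on data subsets in machine learning and proposes the CoLA attack framework to analyze choice leakage attacks. Although it mentions language models (LLMs) as one application case, its core focus is privacy attacks and machine learning security rather than large-model technology itself. Therefore only 'Large Language Models OR LLMs OR Foundation Models' receives a relevance of 5 (some connection); all other keywords are unrelated to the paper's topic (0).

!!! tip deepseek-chat TL;DR

The paper challenges the assumption that training on a data subset reduces privacy risk. It proposes the CoLA attack framework and shows that the subset selection process leaks private information about both the training data and the data that took part in selection, widening the privacy risk surface of the machine learning ecosystem.

Abstract (Chinese translation)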

在现代机器学习中，对经过精心筛选的数据子集而非完整数据集进行训练，已成为一种标准的预处理流程。从视觉核心集选择到语言模型的大规模数据过滤，这种方法能够在效用损失最小的前提下实现可扩展性。一种常见的直觉是，对更少样本进行训练也应降低隐私风险。本文中，我们挑战了这一假设。我们证明子集训练并非无隐私风险：数据被纳入或排除的选择本身可能引入新的隐私暴露面并泄露更多敏感信息。攻击者可通过子集选择过程的侧信道元数据，或通过目标模型的输出，捕获此类信息。为系统研究这一现象，我们提出CoLA（选择泄露攻击），一个用于分析子集选择中隐私泄露的统一框架。在CoLA中，根据攻击者对侧信道信息的了解程度，我们定义两种实际攻击场景：子集感知侧信道攻击和黑盒攻击。在这两种场景下，我们研究了子集训练特有的两个隐私暴露面：（1）训练成员推理攻击（TM-MIA），仅关注训练数据成员身份的隐私；（2）选择参与推理攻击（SP-MIA），关注参与子集选择过程的所有样本的隐私。值得注意的是，SP-MIA将成员身份的概念从模型训练扩展至整个数据-模型供应链。在视觉和语言模型上的实验表明，现有威胁模型低估了子集训练的隐私风险：扩展的隐私暴露面同时泄露训练成员和选择成员信息，将风险从单个模型延伸至更广泛的机器学习生态系统。

Abstract

Training models on a carefully chosen portion of data rather than the full dataset is now a standard preprocess for modern ML. From vision coreset selection to large-scale filtering in language models, it enables scalability with minimal utility loss. A common intuition is that training on fewer samples should also reduce privacy risks. In this paper, we challenge this assumption. We show that subset training is not privacy free: the very choices of which data are included or excluded can introduce new privacy surface and leak more sensitive information. Such information can be captured by adversaries either through side-channel metadata from the subset selection process or via the outputs of the target model. To systematically study this phenomenon, we propose CoLA (Choice Leakage Attack), a unified framework for analyzing privacy leakage in subset selection. In CoLA, depending on the adversary’s knowledge of the side-channel information, we define two practical attack scenarios: Subset-aware Side-channel Attacks and Black-box Attacks. Under both scenarios, we investigate two privacy surfaces unique to subset training: (1) Training-membership MIA (TM-MIA), which concerns only the privacy of training data membership, and (2) Selection-participation MIA (SP-MIA), which concerns the privacy of all samples that participated in the subset selection process. Notably, SP-MIA enlarges the notion of membership from model training to the entire data-model supply chain. Experiments on vision and language models show that existing threat models underestimate subset-training privacy risks: the expanded privacy surface leaks both training and selection membership, extending risks from individual models to the broader ML ecosystem.

Keywords: subset training, privacy risks, choice leakage attack, membership inference attack, data selection, privacy surface, machine learning security, CoLA framework
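
To make the two privacy surfaces concrete, the sketch below builds the TM-MIA and SP-MIA label sets for a toy candidate pool and applies a generic loss-threshold membership test. The loss-threshold test is a standard baseline swapped in for illustration, not CoLA's attack. The sample identifiers, loss values, threshold, and function names are all hypothetical.

```python
def membership_labels(samples, candidate_pool, selected_subset):
    """Return (TM labels, SP labels) for each sample.

    TM (training membership): was the sample in the subset used for training?
    SP (selection participation): did the sample take part in subset
    selection at all, whether or not it was kept?
    """
    tm = [s in selected_subset for s in samples]
    sp = [s in candidate_pool for s in samples]
    return tm, sp


def loss_threshold_attack(losses, threshold):
    """Guess 'member' when the target model's loss on a sample is low."""
    return [loss < threshold for loss in losses]


def attack_accuracy(predictions, labels):
    """Fraction of samples whose membership guess matches the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


samples = ["a", "b", "c", "d"]
candidate_pool = {"a", "b", "c"}   # everything the selector looked at
selected_subset = {"a", "b"}       # what the model was trained on
tm_labels, sp_labels = membership_labels(samples, candidate_pool, selected_subset)

toy_losses = [0.05, 0.10, 0.90, 1.20]  # made-up per-sample losses
guesses = loss_threshold_attack(toy_losses, threshold=0.5)
print(attack_accuracy(guesses, tm_labels), attack_accuracy(guesses, sp_labels))
```

Sample "c" is the case that separates the two notions: it was considered during selection but discarded, so it is an SP member without being a TM member, and an attack that looks only at training loss misses it.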

252. ❌ Bridging the Micro–Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization

Authors: Xiaojie Liang, Zhimin Chen, Ziqi Sheng, Wei Lu Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12341v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

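Scoring rationale: This paper focuses on image manipulation localization and proposes a computer vision method that combines frequency analysis with semantic alignment. Although it uses frozen representations from CLIP (a vision-language model), its focus is entirely on image processing, frequency analysis, contrastive learning, and manipulation detection. It has no direct connection to the topics in the keyword list, such as large-model technical principles, training methods, inference optimization, alignment techniques, agent systems, or AI-for-science applications. The keywords cover technical aspects and application areas of large models and deep learning, whereas this paper is purely a computer vision study of image analysis.

!!! tip deepseek-chat TL;DR

The paper proposes FASA, a unified framework that bridges the micro-macro gap by combining frequency analysis with semantic alignment. It localizes both traditional and diffusion-generated image manipulations and achieves state-of-the-art performance on multiple benchmarks.

Abstract (Chinese translation)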

随着生成式图像编辑技术的发展，图像篡改定位任务必须同时处理具有明显取证痕迹的传统篡改手段与局部逼真的扩散模型生成式编辑。现有方法通常仅依赖低层次取证线索或高层次语义特征，导致微观与宏观信息之间存在根本性鸿沟。为弥合这一差距，我们提出FASA——一个能够统一定位传统篡改与扩散生成篡改的框架。具体而言，我们通过自适应双频带离散余弦变换模块提取篡改敏感的频域线索，并基于冻结的CLIP表征通过块级对比对齐学习篡改感知的语义先验。随后，我们通过语义-频域侧适配器将这些先验注入分层频域通路，实现多尺度特征交互，并采用原型引导、频域门控的掩码解码器，将语义一致性与边界感知定位相结合以预测篡改区域。在OpenSDI及多个传统篡改基准数据集上的大量实验表明，该方法实现了最先进的定位性能，具备强大的跨生成器与跨数据集泛化能力，并在常见图像退化条件下保持鲁棒性。

Abstract

As generative image editing advances, image manipulation localization (IML) must handle both traditional manipulations with conspicuous forensic artifacts and diffusion-generated edits that appear locally realistic. Existing methods typically rely on either low-level forensic cues or high-level semantics alone, leading to a fundamental micro–macro gap. To bridge this gap, we propose FASA, a unified framework for localizing both traditional and diffusion-generated manipulations. Specifically, we extract manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learn manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. We then inject these priors into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and employ a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization for tampered region prediction. Extensive experiments on OpenSDI and multiple traditional manipulation benchmarks demonstrate state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robust performance under common image degradations.

Keywords: Image Manipulation Localization, Frequency Analysis, Semantic Alignment, Diffusion-generated Edits, Forensic Artifacts, CLIP Representations, Multi-scale Feature Interaction, Tampered Region Prediction
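
The frequency branch in the FASA abstract works on DCT coefficients split into two bands. As a rough illustration, the sketch below computes an orthonormal 2D DCT-II of a small block and separates low- and high-frequency coefficients with a fixed cutoff. The fixed cutoff on `u + v` is an assumption standing in for the paper's adaptive dual-band module, whose details the abstract does not give, and the function names are hypothetical.

```python
import math


def dct_1d(signal):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2D DCT-II: transform rows, then columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # Transpose back so the result is indexed as coeffs[u][v].
    return [list(r) for r in zip(*cols)]


def split_bands(coeffs, cutoff):
    """Split coefficients into a low band (u + v < cutoff) and a high band."""
    low = [[c if u + v < cutoff else 0.0 for v, c in enumerate(row)]
           for u, row in enumerate(coeffs)]
    high = [[c if u + v >= cutoff else 0.0 for v, c in enumerate(row)]
            for u, row in enumerate(coeffs)]
    return low, high


flat_block = [[1.0] * 4 for _ in range(4)]  # constant patch, no texture
low, high = split_bands(dct_2d(flat_block), cutoff=2)
print(round(low[0][0], 6))
print(max(abs(c) for row in high for c in row) < 1e-9)
```

A constant patch puts all of its energy in the DC coefficient, so its high band is numerically zero. Content with fine-grained texture or sharp edges would show up as nonzero high-band coefficients, which is the kind of signal forensic methods typically inspect.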

253. ❌ All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

Authors: Tanzila Rahman, Renjie Liao, Leonid Sigal Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12335v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

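Scoring rationale: The paper focuses on video understanding with multimodal large language models (MLLMs); its core contributions are a unified synthetic data generation pipeline and a VQA fine-tuning strategy. It is highly relevant to LLMs (relevance 10) because MLLMs extend LLMs, and highly relevant to SFT (relevance 10) because it adopts a VQA-based fine-tuning strategy. It relates to Scaling Laws AND Data Quality (5) because it concerns data quality and scalability, to Pre-training (5) because it involves model training, and to Chain of Thought and System 2 Thinking (5 each) because it emphasizes deeper reasoning and visual grounding. Other keywords, such as MoE, SLMs, and RLHF, are not involved and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a unified synthetic data generation pipeline for multimodal video understanding and strengthens model reasoning with a VQA fine-tuning strategy. Its experiments show that models trained predominantly on synthetic data generalize well to real-world datasets.

Abstract (Chinese translation)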

为视频理解训练多模态大语言模型（MLLMs）需要大规模涵盖多种任务（如物体计数、问答和分割）的标注数据。然而，在现实世界中收集和标注多模态视频数据成本高昂、过程缓慢，且其多样性和覆盖范围本质上受限。为应对这一挑战，我们提出了一种统一的合成数据生成流程，能够自动生成具有丰富多样监督信息的无限多模态视频数据。我们的框架在单一流程内支持多种任务格式，实现了跨任务的可扩展且一致的数据创建。为进一步增强推理能力，我们引入了一种基于视觉问答（VQA）的微调策略，该策略训练模型回答关于视觉内容的结构化问题，而非仅仅依赖描述文本或简单指令。这种设计鼓励更深层次的视觉基础与推理。我们在三个具有挑战性的任务中评估了所提方法：视频物体计数、基于视频的视觉问答以及视频物体分割。实验结果表明，主要使用合成数据训练的模型能够有效地泛化到真实世界数据集，其表现通常优于传统方法训练的模型。我们的研究凸显了统一合成数据流程作为一种可扩展方案的潜力，能够替代昂贵且受限的真实世界标注，以促进多模态视频理解的发展。

Abstract

Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in real-world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach in three challenging tasks: video object counting, video-based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real-world annotation for multimodal video understanding.

Keywords: multimodal large language models, synthetic data generation, video understanding, VQA fine-tuning, visual reasoning, object counting, visual question answering, object segmentation

254. ❌ HyperLiDAR: Adaptive Post-Deployment LiDAR Segmentation via Hyperdimensional Computing

Authors: Ivannia Gomez Moreno, Yi Yao, Ye Tian, Xiaofan Yu, Flavio Ponzina, Michael Sullivan, Jingyi Zhang, Mingyu Yang, Hun Seok Kim, Tajana Rosing Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12331v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
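
The arithmetic behind a table like the one above can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report's configuration; the per-keyword rule `weight × (relevance / 10) × 15` is an assumption made for illustration, and `keyword_score` and `pass_mark` are hypothetical helper names, not the tool's actual code.

```python
# Hypothetical sketch of the report's scoring arithmetic (not the tool's real code).
MAX_SCORE_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def keyword_score(weight, relevance):
    """Assumed rule: weight x (relevance / 10) x per-keyword maximum."""
    return weight * (relevance / 10.0) * MAX_SCORE_PER_KEYWORD


def pass_mark(total_weight):
    """Pass-mark formula stated in the report configuration."""
    return 5.0 + 0.8 * total_weight


# (weight, relevance) for the six rows of the table above with non-zero relevance
rows = [(0.0, 5.0)] * 6
total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_mark(27.0)  # 27 keywords, total weight 27.0

print(total, round(threshold, 1), total >= threshold)
```

Under this assumed rule, a weight of 0.0 zeroes every row regardless of relevance, which matches the 0.0 total shown for this paper.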

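Scoring rationale: The paper focuses on a lightweight post-deployment adaptation framework for LiDAR semantic segmentation, built on hyperdimensional computing (HDC) rather than large models or deep learning. Its relevance to the keywords is limited: 1) 'Small Language Models OR SLMs OR On-device AI' (5 points), because it involves lightweight adaptation on edge devices; 2) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because it handles model adaptation after environmental shifts; 3) 'Post-training OR Supervised Fine-tuning OR SFT' (5 points), because it involves post-deployment fine-tuning; 4) 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (5 points), because it emphasizes lightweight, efficient adaptation; 5) 'Quantization OR Model Compression OR Low-bit Weights' (5 points), because it targets model efficiency under compute and energy constraints; 6) 'Speculative Decoding OR Inference Acceleration' (5 points), because it reports a speedup of up to 13.8x (in retraining). The remaining keywords (e.g., LLMs, MoE, RAG) are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes HyperLiDAR, a lightweight post-deployment LiDAR semantic segmentation framework based on hyperdimensional computing. It addresses on-device adaptation to changing environments on edge devices and achieves up to a 13.8x retraining speedup while matching or exceeding the adaptation performance of state-of-the-art methods.

Abstract translation (Chinese)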

激光雷达语义分割在自动驾驶等边缘应用的三维场景理解中起着关键作用。然而，在实际部署中仍存在重大挑战，特别是对于设备端的部署后适应问题。现实环境会随着系统在不同地点运行而发生变化，若缺乏有效且及时的模型适应，将导致性能显著下降。此外，边缘系统在严格的计算和能源限制下运行，使得直接在设备上适配传统的基于大型神经网络的分割模型并不可行。为应对上述挑战，我们提出了HyperLiDAR，这是首个基于超维计算（Hyperdimensional Computing, HDC）的轻量级、部署后激光雷达分割框架。HyperLiDAR的设计充分借鉴了人脑处理信息的方式，利用了HDC快速学习与高效运行的优势。为进一步提升适应效率，我们指出每次扫描产生的高数据量是关键瓶颈，并引入了一种缓冲区选择策略，该策略将学习重点集中在信息最丰富的点上。我们在两个先进的激光雷达分割基准数据集和两种代表性设备上进行了广泛评估。结果表明，HyperLiDAR在适应性能上优于或达到了与先进分割方法相当的水平，同时实现了高达13.8倍的重新训练加速。

Abstract

LiDAR semantic segmentation plays a pivotal role in 3D scene understanding for edge applications such as autonomous driving. However, significant challenges remain for real-world deployments, particularly for on-device post-deployment adaptation. Real-world environments can shift as the system navigates through different locations, leading to substantial performance degradation without effective and timely model adaptation. Furthermore, edge systems operate under strict computational and energy constraints, making it infeasible to adapt conventional segmentation models (based on large neural networks) directly on-device. To address the above challenges, we introduce HyperLiDAR, the first lightweight, post-deployment LiDAR segmentation framework based on Hyperdimensional Computing (HDC). The design of HyperLiDAR fully leverages the fast learning and high efficiency of HDC, inspired by how the human brain processes information. To further improve the adaptation efficiency, we identify the high data volume per scan as a key bottleneck and introduce a buffer selection strategy that focuses learning on the most informative points. We conduct extensive evaluations on two state-of-the-art LiDAR segmentation benchmarks and two representative devices. Our results show that HyperLiDAR outperforms or achieves comparable adaptation performance to state-of-the-art segmentation methods, while achieving up to a 13.8x speedup in retraining.
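
To make the HDC idea concrete, here is a minimal sketch of a generic hyperdimensional classifier: features are encoded into bipolar hypervectors by a random projection, class prototypes are built by element-wise addition (bundling), and a query is assigned to the most similar prototype. This is a textbook-style illustration, not HyperLiDAR's actual encoder or buffer selection strategy; the dimensionality, the toy features, and all function names are assumptions.

```python
import random

DIM = 2048  # hypervector dimensionality (illustrative value, not from the paper)


def make_projection(n_features, dim=DIM, seed=0):
    """Random bipolar (+1/-1) projection matrix with `dim` rows."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n_features)] for _ in range(dim)]


def encode(features, projection):
    """Encode a feature vector as a bipolar hypervector (sign of a random projection)."""
    return [1 if sum(w * x for w, x in zip(row, features)) >= 0 else -1
            for row in projection]


def update(prototypes, features, label, projection):
    """Bundle one encoded sample into its class prototype (element-wise add)."""
    hv = encode(features, projection)
    proto = prototypes.setdefault(label, [0] * len(hv))
    for i, v in enumerate(hv):
        proto[i] += v


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = (sum(p * p for p in a) * sum(q * q for q in b)) ** 0.5
    return dot / norm if norm else 0.0


def classify(features, prototypes, projection):
    """Return the label whose prototype is most similar to the encoded query."""
    hv = encode(features, projection)
    return max(prototypes, key=lambda label: cosine(hv, prototypes[label]))


projection = make_projection(n_features=3)
prototypes = {}
# Toy per-point features; labels and values are made up for illustration.
for feats, label in [((1.0, 0.1, -1.0), "road"), ((0.9, 0.0, -0.8), "road"),
                     ((-1.0, 0.1, 1.0), "car"), ((-0.8, 0.2, 0.9), "car")]:
    update(prototypes, feats, label, projection)

prediction = classify((0.95, 0.05, -0.9), prototypes, projection)
print(prediction)
```

In this sketch, adapting to new data is just another call to `update`, with no gradient computation, which hints at why HDC-style retraining can be cheap on constrained devices.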

Keywords: LiDAR semantic segmentation, Hyperdimensional Computing, post-deployment adaptation, edge applications, lightweight framework, on-device adaptation, buffer selection strategy, retraining speedup

255. ❌ Self-Adversarial One Step Generation via Condition Shifting

Authors: Deyuan Liu, Peng Sun, Yansen Han, Zhenglin Cheng, Chuyan Chen, Tao Lin Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12322v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

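Scoring rationale: The paper focuses on efficient one-step sampling for text-to-image generation and proposes APEX, a self-adversarial framework based on condition shifting. Its core contribution is training optimization for image generation models, not innovation in large language models or deep learning fundamentals. The only relevant keyword is 'PEFT OR LoRA OR Parameter-efficient Fine-tuning', because the paper explicitly states that APEX is compatible with LoRA tuning and uses LoRA to tune the Qwen-Image 20B model in its experiments. All other keywords are unrelated: the paper does not address large language models, inference acceleration, alignment, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper proposes APEX, a self-adversarial framework based on condition shifting, to address the three-way tradeoff among fidelity, inference speed, and training efficiency in one-step text-to-image sampling. With LoRA tuning on Qwen-Image 20B, a single sampling step exceeds the generation quality of the 50-step teacher (GenEval 0.89 vs. 0.87) and delivers a 15.33× inference speedup.

Abstract translation (Chinese)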

追求高效的文本到图像合成技术正推动该领域向单步采样发展，但现有方法仍面临保真度、推理速度与训练效率之间的三重权衡。依赖外部判别器的方法虽能提升单步生成质量，却常伴随训练不稳定、高GPU内存开销及收敛缓慢等问题，这增加了模型扩展与参数高效调优的复杂性。相比之下，基于回归的蒸馏与一致性目标更易于优化，但在单步约束下通常会丢失细节特征。本文提出APEX方法，其核心理论洞见在于：通过条件偏移，可从流模型中内生地提取对抗性校正信号。利用变换构建一个偏移条件分支，其速度场可作为模型当前生成分布的独立估计量，从而产生理论上与GAN对齐的梯度，替代了原本导致梯度消失的样本依赖型判别器项。这种无判别器设计保持了架构不变性，使APEX成为兼容全参数调优与基于LoRA调优的即插即用框架。实验表明，我们的0.6B参数模型在单步生成质量上超越了FLUX-Schnell 12B模型（参数量为其20倍）。基于Qwen-Image 20B进行LoRA调优时，APEX仅用6小时即在NFE=1条件下达到0.89的GenEval分数，超越原50步教师模型的0.87分，并实现15.33倍的推理加速。代码已开源：https://github.com/LINs-lab/APEX。

Abstract

The push for efficient text-to-image synthesis has moved the field toward one-step sampling, yet existing methods still face a three-way tradeoff among fidelity, inference speed, and training efficiency. Approaches that rely on external discriminators can sharpen one-step performance, but they often introduce training instability, high GPU memory overhead, and slow convergence, which complicates scaling and parameter-efficient tuning. In contrast, regression-based distillation and consistency objectives are easier to optimize, but they typically lose fine details when constrained to a single step. We present APEX, built on a key theoretical insight: adversarial correction signals can be extracted endogenously from a flow model through condition shifting. A transformation creates a shifted condition branch whose velocity field serves as an independent estimator of the model’s current generation distribution, yielding a gradient that is provably GAN-aligned and replacing the sample-dependent discriminator terms that cause gradient vanishing. This discriminator-free design is architecture-preserving, making APEX a plug-and-play framework compatible with both full-parameter and LoRA-based tuning. Empirically, our 0.6B model surpasses FLUX-Schnell 12B (20× more parameters) in one-step quality. With LoRA tuning on Qwen-Image 20B, APEX reaches a GenEval score of 0.89 at NFE=1 in 6 hours, surpassing the original 50-step teacher (0.87) and providing a 15.33× inference speedup. Code is available at https://github.com/LINs-lab/APEX.
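
As background on the LoRA-based tuning mentioned in the abstract, the following sketch shows the standard low-rank update: the frozen weight `W` (d × k) is combined with a trainable product `B·A` of rank `r`, scaled by `alpha / r`. It illustrates generic LoRA only, not APEX's training objective; the matrices, `alpha`, and the helper names are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where A is r x k and B is d x r."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def trainable_params(d, k, r):
    """Number of trainable parameters in the two LoRA factors."""
    return r * (d + k)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy 2 x 2 example)
B = [[1.0], [2.0]]             # d x r factor
A = [[0.5, -0.5]]              # r x k factor
W_eff = lora_effective_weight(W, A, B, alpha=2.0)
print(W_eff)                          # [[2.0, -1.0], [2.0, -1.0]]
print(trainable_params(4096, 4096, 8), 4096 * 4096)
```

For a 4096 × 4096 layer at rank 8, the two factors hold 65,536 trainable values against 16,777,216 in the full matrix, which is the sense in which such tuning is parameter-efficient.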

Keywords: text-to-image synthesis, one-step sampling, adversarial correction, condition shifting, flow model, LoRA tuning, inference speedup, parameter-efficient fine-tuning

256. ❌ EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

Authors: Jianzhe Ma, Zhonghao Cao, Shangkui Chen, Yichen Xu, Wenxuan Wang, Qin Jin Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12320v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

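Scoring rationale: The paper studies video large language models (Video-LLMs) applied to esports, so it is highly relevant to 'Large Language Models' (10 points). It evaluates models on perception and reasoning tasks, including deep tactical reasoning, which gives it some connection to 'Chain of Thought' and 'System 2 Thinking' (5 points each). Other keywords, such as MoE, SLMs, training techniques, optimization methods, and agent systems, are not mentioned in the abstract and score 0. The paper is an application study of large models in a specific domain (esports), which fits the discretionary scoring criterion in the research background for 'research applications of large models in different domains'.

!!! tip deepseek-chat TL;DR

The paper addresses the limited ability of video large language models to understand fast-paced, information-dense virtual environments such as esports and introduces the EgoEsportsQA benchmark. Evaluations show significant gaps in deep tactical reasoning and fine-grained micro-operation understanding, with the best model reaching only 71.58% accuracy.

Abstract translation (Chinese)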

尽管视频大语言模型（Video-LLMs）在理解慢节奏的真实世界第一人称视频方面表现出色，但其在高速、信息密集的虚拟环境中的能力仍未被充分探索。现有基准测试主要关注日常活动，缺乏用于评估虚拟场景中快速、规则约束推理能力的严格测试平台。为填补这一空白，我们提出了EgoEsportsQA，这是一个开创性的视频问答（QA）基准，旨在基于专业电子竞技知识进行感知与推理的落地评估。我们通过一个可扩展的六阶段流程，从三款第一人称射击游戏的职业比赛中精心构建了1,745个高质量问答对。这些问题被组织成一个二维解耦分类体系：认知能力维度包含11个子任务（涵盖感知与推理层级），电子竞技知识维度包含6个子任务。对当前先进视频大语言模型的综合评估表明，现有模型仍未能取得令人满意的性能，最佳模型准确率仅为71.58%。结果揭示出两个维度上的显著差距：模型在基础视觉感知方面表现优于深层战术推理，且对整体宏观进程的理解强于对精细微观操作的把握。广泛的消融实验揭示了当前视频大语言模型架构的内在缺陷。进一步分析表明，我们的数据集不仅揭示了真实世界与虚拟第一人称领域之间的联系，还为优化下游电子竞技应用提供了指导，从而推动视频大语言模型在各种第一人称环境中的未来发展。

Abstract

While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model reaching only 71.58% accuracy. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.

Keywords: Video-LLMs, egocentric video, esports, benchmark, perception, reasoning, question-answering, tactical reasoning

257. ❌ RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

Authors: Guoan Xu, Yang Xiao, Guangwei Gao, Dongchen Zhu, Wenjing Jia, Guo-Jun Qi Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12319v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

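Scoring rationale: The paper addresses multimodal semantic segmentation in computer vision and proposes a reliability-aware fusion framework built on a state space model (Mamba). The keywords all concern core large-model topics such as language models, training methods, inference techniques, alignment, and agent systems, whereas this paper studies multimodal fusion for a vision task and involves no language model or related technique. The only loosely related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since semantic segmentation can be viewed as part of computer vision's use in scientific or engineering applications; this is not central to the paper, so it receives 5 points (some relevance). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper targets feature degradation caused by uneven modality reliability in multimodal semantic segmentation and proposes a reliability-aware self-gated state space model (RSGMamba), which achieves state-of-the-art results on RGB-D and RGB-T benchmarks.

Abstract translation (Chinese)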

多模态语义分割通过利用来自多种感知模态（如RGB、深度和热成像）的互补信息，已成为增强场景理解的重要范式。然而，现有的跨模态融合方法通常隐含地假设所有模态均同等可靠，当辅助模态存在噪声、未对齐或不完整时，这可能导致特征退化。本文从模态可靠性的视角重新审视跨模态融合，提出了一种称为可靠性感知自门控状态空间模型（RSGMamba）的新框架。我们方法的核心是可靠性感知自门控Mamba模块（RSGMB），它显式地建模模态可靠性，并通过自门控机制动态调节跨模态交互。与不加区分地在模态间交换信息的传统融合策略不同，RSGMB实现了可靠性感知的特征选择，并增强了信息性特征的聚合。此外，我们引入了一个轻量级的局部交叉门控调制模块（LCGM）来细化细粒度空间细节，以补充RSGMB的全局建模能力。大量实验表明，RSGMamba在RGB-D和RGB-T语义分割基准上均取得了最先进的性能：在NYUDepth V2和SUN-RGBD数据集上分别达到58.8%和54.0%的平均交并比（mIoU）（较先前最佳结果提升+0.4% / +0.7%），在MFNet和PST900数据集上分别达到61.1%和88.9%的mIoU（最高提升+1.6%），且参数量仅为48.6M，从而验证了所提方法的有效性和优越性。

Abstract

Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhances informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extensive experiments demonstrate that RSGMamba achieves state-of-the-art performance on both RGB-D and RGB-T semantic segmentation benchmarks, reaching 58.8% / 54.0% mIoU on NYUDepth V2 and SUN-RGBD (+0.4% / +0.7% over prior best), and 61.1% / 88.9% mIoU on MFNet and PST900 (up to +1.6%), with only 48.6M parameters, thereby validating the effectiveness and superiority of the proposed approach.
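
The mIoU figures quoted above use the mean intersection-over-union metric. A minimal sketch of that metric on flat label lists follows; the function name and toy labels are illustrative, and benchmark implementations typically accumulate intersections and unions per class over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU over classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2, so the mean is 0.5
print(round(score, 4))
```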

Keywords: Multimodal Semantic Segmentation, Reliability-aware Fusion, State Space Model, Mamba, Cross-modal Interaction, RGB-D Segmentation, RGB-T Segmentation, Feature Aggregation

258. ❌ Cell Instance Segmentation via Multi-Task Image-to-Image Schrödinger Bridge

Authors: Hayato Inoue, Shota Harada, Shumpei Takezaki, Ryoma Bise Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12318v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

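Scoring rationale: The paper addresses cell instance segmentation, a computer vision task, and proposes a multi-task image-to-image generation framework based on the Schrödinger Bridge. The keywords cover large language models, deep learning fundamentals, and AI applications in science, but the paper is entirely unrelated to the vast majority of them (e.g., LLM, MoE, RLHF, RAG). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because cell segmentation belongs to bioinformatics and biomedical image analysis, a subfield of AI for Science. However, the paper does not mention these terms explicitly and its core is a computer vision method rather than a large-model application, so it receives 5 points (some relevance). All other keywords score 0 (entirely unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes a cell instance segmentation method based on a multi-task image-to-image Schrödinger Bridge. It achieves competitive or superior performance on PanNuke without SAM pre-training or additional post-processing, and it remains robust on MoNuSeg with limited training data.

Abstract translation (Chinese)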

现有的细胞实例分割流程通常将确定性预测与后处理相结合，这对实例掩码的全局结构施加了有限的显式约束。在本研究中，我们提出了一种多任务图像到图像的薛定谔桥框架，将实例分割构建为一个基于分布的图像到图像生成问题。通过反向距离图整合了边界感知监督，并采用确定性推理以产生稳定的预测。在PanNuke数据集上的实验结果表明，所提出的方法在不依赖SAM预训练或额外后处理的情况下，实现了具有竞争力或更优的性能。在MoNuSeg数据集上的补充结果展示了其在有限训练数据下的鲁棒性。这些发现表明，基于薛定谔桥的图像到图像生成为细胞实例分割提供了一个有效的框架。

Abstract

Existing cell instance segmentation pipelines typically combine deterministic predictions with post-processing, which imposes limited explicit constraints on the global structure of instance masks. In this work, we propose a multi-task image-to-image Schrödinger Bridge framework that formulates instance segmentation as a distribution-based image-to-image generation problem. Boundary-aware supervision is integrated through a reverse distance map, and deterministic inference is employed to produce stable predictions. Experimental results on the PanNuke dataset demonstrate that the proposed method achieves competitive or superior performance without relying on SAM pre-training or additional post-processing. Additional results on the MoNuSeg dataset show robustness under limited training data. These findings indicate that Schrödinger Bridge-based image-to-image generation provides an effective framework for cell instance segmentation.

Keywords: cell instance segmentation, Schrödinger Bridge, image-to-image generation, multi-task learning, boundary-aware supervision, PanNuke dataset, MoNuSeg dataset, medical image analysis

259. ❌ GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality

Authors: Zhiwei Zhang, Xingyuan Zeng, Xinkai Kong, Kunquan Zhang, Haoyuan Liang, Bohan Shi, Juepeng Zheng, Jianxi Huang, Yutong Lu, Haohuan Fu Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12315v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

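Scoring rationale: The paper addresses agricultural terraced parcel extraction in remote sensing imagery and proposes a multimodal benchmark dataset (GTPBD-MM) and a baseline model (ETTerra). Its core is computer vision, multimodal fusion (imagery, text, DEM), and geospatial analysis, which is entirely unrelated to the vast majority of keywords (large-model architectures, training methods, inference optimization, alignment techniques, agent systems, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work is a specific application of AI in agricultural science (which can be regarded as a scientific application in the broad sense); this is not the paper's core technical focus, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper tackles the challenge of extracting complex terraced parcels in mountainous regions and presents GTPBD-MM, the first global multimodal benchmark dataset for this task, together with ETTerra, a baseline model guided by elevation and text. Experiments show that textual and terrain information yields more accurate and structurally consistent delineation than imagery alone.

Abstract translation (Chinese)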

农业田块提取在基于遥感的农业监测中具有重要作用，可为田块测绘、精准管理和生态评估提供支持。然而，现有的公共基准数据集主要关注规则且相对平坦的农田场景。相比之下，山区的梯田田块具有阶梯状地形、显著的高程变化、不规则边界以及强烈的跨区域异质性，使得田块提取成为一个更具挑战性的问题，需要同时结合视觉识别、语义判别和地形感知的几何理解。尽管近期研究在视觉田块基准和图文农田理解方面取得了进展，但在对齐的图像-文本-数字高程模型（DEM）设置下，针对复杂梯田田块提取的统一基准仍然缺失。为填补这一空白，我们提出了GTPBD-MM，这是首个面向全球梯田田块提取的多模态基准。该基准基于GTPBD构建，整合了高分辨率光学影像、结构化文本描述和DEM数据，并支持在仅图像（Image-only）、图像+文本（Image+Text）以及图像+文本+DEM（Image+Text+DEM）三种设置下的系统评估。我们进一步提出了高程与文本引导的梯田田块网络（ETTerra），作为一种用于梯田田块勾绘的多模态基线方法。大量实验表明，文本语义和地形几何信息能够提供超越单一视觉外观的互补线索，从而在复杂梯田场景中产生更精确、连贯且结构一致的勾绘结果。

Abstract

Agricultural parcel extraction plays an important role in remote sensing-based agricultural monitoring, supporting parcel surveying, precision management, and ecological assessment. However, existing public benchmarks mainly focus on regular and relatively flat farmland scenes. In contrast, terraced parcels in mountainous regions exhibit stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, making parcel extraction a more challenging problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. Although recent studies have advanced visual parcel benchmarks and image-text farmland understanding, a unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings remains absent. To fill this gap, we present GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. We further propose Elevation and Text guided Terraced parcel network (ETTerra), a multimodal baseline for terraced parcel delineation. Extensive experiments demonstrate that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.

Keywords: terraced parcel extraction, multimodal benchmark, remote sensing, agricultural monitoring, DEM data, text-guided network, geometric understanding, GTPBD-MM

260. ❌ Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

Authors: Rong Wang, Ruyi Zha, Ziang Cheng, Jiayu Yang, Pulak Purkait, Hongdong Li Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12309v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

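Scoring rationale: The paper studies the use of a 3D foundation model in video generation, which belongs to computer vision and generative modeling rather than direct innovation in large language models (LLMs) or deep learning fundamentals. It has some relevance only to 'Pre-training' and 'Post-training' (5 points each), because it builds on the base video model's general video pretraining and involves a fine-tuning process. The other keywords, which concern LLMs, reasoning, alignment, compression, and similar topics, are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new method that uses priors from a 3D foundation model to generate geometrically realistic and consistent orbital videos from a single image. A multi-scale 3D adapter injects the 3D features into the base video model, improving visual quality, shape realism, and multi-view consistency.

Abstract translation (Chinese)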

我们提出了一种从物体单张图像生成几何真实且一致的轨道视频的新方法。现有视频生成工作主要依赖像素级注意力来保证帧间的视角一致性。然而，这种机制对于长距离外推（例如后视合成）约束不足，因为此类场景中与输入图像的像素对应关系有限。因此，这些方法往往难以生成结构合理且连贯的结果。为解决这一问题，我们提出利用三维基础生成模型所提供的丰富形状先验作为辅助约束，其动机在于该模型能够对从大规模三维资产库中学习到的真实物体形状分布进行建模。具体而言，我们通过三维基础模型编码的两尺度潜在特征来引导视频生成：（i）一个去噪的全局潜在向量作为整体结构指导；（ii）一组从体素特征投影得到的潜在图像，以提供视角相关且细粒度的几何细节。与深度图或法线图等常用的2.5维表示相比，这些紧凑特征能够建模完整的物体形状，并通过避免显式网格提取来提高推理效率。为实现有效的形状条件控制，我们引入了一个多尺度三维适配器，通过交叉注意力机制将特征标记注入基础视频模型，这既保留了模型在通用视频预训练中获得的能力，又实现了简单且与模型无关的微调过程。在多个基准测试上的大量实验表明，相较于现有先进方法，我们的方法在视觉质量、形状真实性和多视角一致性方面均表现优异，并能稳健地推广到复杂的相机轨迹和真实场景图像中。

Abstract

We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens into the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agnostic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.
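
The cross-attention injection described above can be illustrated with a bare scaled dot-product sketch in which video tokens act as queries and 3D feature tokens supply keys and values. This is a generic single-head version without learned projections or a residual connection, not the paper's adapter; the token values and names are invented for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


video_tokens = [[1.0, 0.0], [0.0, 1.0]]   # queries from the video model (toy values)
shape_keys = [[1.0, 0.0], [1.0, 0.0]]     # keys from 3D feature tokens (identical here)
shape_values = [[2.0, 4.0], [4.0, 8.0]]   # values carried by the 3D feature tokens
conditioned = cross_attention(video_tokens, shape_keys, shape_values)
# Identical keys give uniform attention, so every output is the mean of the values.
print(conditioned)  # [[3.0, 6.0], [3.0, 6.0]]
```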

Keywords: orbital video generation, 3D foundation model, shape priors, multi-view consistency, video generation, geometric realism, single image input, multi-scale 3D adapter

261. ❌ Boosting Robust AIGI Detection with LoRA-based Pairwise Training

Authors: Ruiyang Xia, Qi Zhang, Yaowen Xu, Zhaofan Zou, Hao Sun, Zhongjiang He, Xuelong Li Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12307v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

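Scoring rationale: The paper focuses on robust detection of AI-generated images (AIGI); its core contribution is a LoRA-based Pairwise Training (LPT) strategy. It is unrelated to most keywords, which mainly concern large language model (LLM) principles, training methods, inference optimization, alignment, and agent systems. The only highly relevant keyword is 'PEFT OR LoRA OR Parameter-efficient Fine-tuning', because the paper explicitly uses LoRA for parameter-efficient fine-tuning. The paper addresses AIGI detection in computer vision rather than large language models or a specific AI for Science application, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a LoRA-based Pairwise Training (LPT) strategy to make AI-generated image (AIGI) detection robust under severe distortions, and it placed third in the NTIRE challenge.

Abstract (Chinese translation)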

高度逼真的人工智能生成图像（AIGI）的激增，使得开发实用的检测方法变得必要。尽管当前的AIGI检测器在干净数据集上表现优异，但当它们部署于“野外”环境时——即图像遭受不可预测的复杂失真——其检测性能常常会下降。为解决这一关键脆弱性，我们提出了一种新颖的基于LoRA的成对训练（LPT）策略，专门设计用于在严重失真条件下实现鲁棒的AIGI检测。我们策略的核心包括：针对性地微调视觉基础模型、在训练阶段刻意模拟数据分布，以及一个独特的成对训练过程。具体而言，我们引入了失真和尺寸模拟，以更好地拟合验证集和测试集的分布。基于视觉基础模型强大的视觉表征能力，我们通过微调该模型来实现AIGI检测。成对训练则被用来通过解耦泛化性与鲁棒性优化，以提升检测性能。实验表明，我们的方法在NTIRE“野外鲁棒人工智能生成图像检测”挑战赛中获得了第三名。

Abstract

The proliferation of highly realistic AI-Generated Images (AIGI) has necessitated the development of practical detection methods. While current AIGI detectors perform admirably on clean datasets, their detection performance frequently decreases when deployed “in the wild”, where images are subjected to unpredictable, complex distortions. To resolve this critical vulnerability, we propose a novel LoRA-based Pairwise Training (LPT) strategy designed specifically to achieve robust detection for AIGI under severe distortions. The core of our strategy involves the targeted finetuning of a visual foundation model, the deliberate simulation of data distribution during the training phase, and a unique pairwise training process. Specifically, we introduce distortion and size simulations to better fit the distribution from the validation and test sets. Based on the strong visual representation capability of the visual foundation model, we finetune the model to achieve AIGI detection. The pairwise training is utilized to improve the detection via decoupling the generalization and robustness optimization. Experiments show that our approach secured 3rd place in the NTIRE Robust AI-Generated Image Detection in the Wild challenge.

Keywords: AI-Generated Image Detection, Robust Detection, LoRA, Parameter-efficient Fine-tuning, Pairwise Training, Visual Foundation Model, Distortion Simulation, NTIRE Challenge
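
The abstract names LoRA but does not describe the adapter configuration. As a rough illustration only, the sketch below shows the generic LoRA idea: a frozen weight matrix `W` is augmented with a trainable low-rank update scaled by `alpha / r`. The shapes, the `alpha` scaling, the numeric values, and the helper names (`matvec`, `lora_forward`) are assumptions for illustration, not the paper's setup. Plain Python lists keep it dependency-free.

```python
# Generic LoRA forward pass: y = W x + (alpha / r) * B (A x).
# All names, shapes, and values are illustrative assumptions, not the paper's setup.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha=1.0):
    """Frozen base projection plus a scaled low-rank update."""
    r = len(A)                           # LoRA rank = number of rows of A
    base = matvec(W, x)                  # frozen pretrained path
    low_rank = matvec(B, matvec(A, x))   # trainable path through a rank-r bottleneck
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weights, 2 x 3
A = [[0.5, -0.5, 1.0]]                   # rank-1 down-projection, 1 x 3
B_zero = [[0.0], [0.0]]                  # B is commonly initialized to zero
B_trained = [[1.0], [2.0]]               # hypothetical values after training
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x)                 # equals the frozen model's output
y_tuned = lora_forward(W, A, B_trained, x, alpha=2.0)  # shifted by the low-rank update
```

With `B` at zero the adapted layer reproduces the frozen layer exactly, which is why this style of fine-tuning starts from the pretrained model's behavior and only trains the small `A` and `B` matrices.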

262. ❌ CBAM-Enhanced DenseNet121 for Multi-Class Chest X-Ray Classification with Grad-CAM Explainability

Authors: Utsho Kumar Dey Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12305v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

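Scoring rationale: The paper focuses on medical image classification (chest X-rays) using a conventional convolutional neural network (DenseNet121) with an attention mechanism (CBAM); it is a computer vision application in the biomedical domain. It is entirely unrelated to the vast majority of keywords (large language models, training techniques, inference optimization, agents, and so on). It is highly relevant only to 'AI for Science OR Bioinformatics OR Cheminformatics' (a concrete AI for Science application in biomedical imaging) and somewhat relevant to 'Mechanistic Interpretability OR Explainable AI' (the paper uses Grad-CAM for explainability analysis).

!!! tip deepseek-chat TL;DR

Targeting childhood pneumonia diagnosis in low-resource settings, the study proposes CBAM-DenseNet121, which adds an attention module to DenseNet121 for three-class chest X-ray classification (normal, bacterial pneumonia, viral pneumonia). It reaches 84.29% test accuracy and uses Grad-CAM visualizations to improve interpretability.

Abstract (Chinese translation)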

肺炎仍是全球儿童死亡的主要原因，在孟加拉国等放射科医生资源有限的低收入地区负担尤重。现有深度学习研究多将肺炎检测视为二分类问题，忽视了细菌性与病毒性病因这一临床关键区别。本文提出CBAM-DenseNet121迁移学习框架，将卷积注意力模块（Convolutional Block Attention Module, CBAM）集成至DenseNet121架构，实现胸部X光的三分类：正常、细菌性肺炎与病毒性肺炎。我们同时进行了系统性二分类基线研究，发现EfficientNetB3（73.88%）甚至低于定制卷积神经网络（Convolutional Neural Network, CNN）基线（78.53%）——这一重要负面发现对医学影像模型选择具有实际意义。为确保统计可靠性，所有实验均采用独立随机种子（42、7、123）重复三次，结果以均值±标准差形式报告。CBAM-DenseNet121取得84.29%±1.14%的测试准确率，其细菌性肺炎、正常和病毒性肺炎的各类别AUC值分别为0.9565±0.0010、0.9610±0.0014和0.9187±0.0037。梯度加权类激活映射（Grad-CAM）可视化证实，该模型能聚焦于各类别对应的解剖学合理肺区，为在资源受限临床环境中实现可解释性部署提供了支持。

Abstract

Pneumonia remains a leading cause of childhood mortality worldwide, with a heavy burden in low-resource settings such as Bangladesh where radiologist availability is limited. Most existing deep learning approaches treat pneumonia detection as a binary problem, overlooking the clinically critical distinction between bacterial and viral aetiology. This paper proposes CBAM-DenseNet121, a transfer-learning framework that integrates the Convolutional Block Attention Module (CBAM) into DenseNet121 for three-class chest X-ray classification: Normal, Bacterial Pneumonia, and Viral Pneumonia. We also conduct a systematic binary-task baseline study revealing that EfficientNetB3 (73.88%) underperforms even the custom CNN baseline (78.53%) – a practically important negative finding for medical imaging model selection. To ensure statistical reliability, all experiments were repeated three times with independent random seeds (42, 7, 123), and results are reported as mean +/- standard deviation. CBAM-DenseNet121 achieves 84.29% +/- 1.14% test accuracy with per-class AUC scores of 0.9565 +/- 0.0010, 0.9610 +/- 0.0014, and 0.9187 +/- 0.0037 for bacterial pneumonia, normal, and viral pneumonia respectively. Grad-CAM visualizations confirm that the model attends to anatomically plausible pulmonary regions for each class, supporting interpretable deployment in resource-constrained clinical environments.

Keywords: Chest X-Ray Classification, Pneumonia Detection, CBAM-DenseNet121, Transfer Learning, Grad-CAM, Medical Imaging, Three-class Classification, Explainable AI

263. ❌ CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

Authors: Benzhao Tang, Shiyu Yang Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13024v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

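Scoring rationale: CLAD focuses on system log anomaly detection and proposes a deep learning framework that performs detection directly on compressed byte streams. Although it uses deep learning techniques (a convolutional encoder and a Transformer-mLSTM architecture), it has no direct connection to any scoring keyword: (1) it does not involve large language models, small language models, or foundation models; (2) it does not use MoE, sparse models, quantization, or model compression; (3) it does not involve LLM pre-training, post-training, instruction tuning, alignment, RLHF, PEFT, or similar training methods; (4) it does not use RAG, context extension, or KV cache compression; (5) it does not address inference acceleration, hallucination mitigation, or explainable AI; (6) it does not apply chain of thought, System 2 thinking, MCTS, self-correction, agents, or tool use; (7) it does not involve world models, model merging, or in-context learning; (8) although it is an AI application, it is not in a scientific field such as bioinformatics or cheminformatics. The paper's core is a domain-specific deep learning application, not large-model technology or AI for science.

!!! tip deepseek-chat TL;DR

The paper proposes CLAD, a deep learning framework that is the first to detect system log anomalies directly on compressed byte streams, avoiding the decompression and parsing overhead of conventional methods. It achieves a state-of-the-art average F1-score of 0.9909 across five datasets.

Abstract (Chinese translation)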

系统日志的爆炸式增长使得流式压缩变得至关重要，然而现有的日志异常检测方法因需要完全解压缩与解析而产生严重的预处理开销。我们提出了CLAD，这是首个能在压缩字节流上直接执行日志异常检测的深度学习框架。CLAD通过利用一个关键发现绕过了这些瓶颈：正常日志会压缩成规则的字节模式，而异常则会系统地破坏这些模式。为了从这些不透明的字节中提取多尺度的偏差，我们提出了一种专门构建的架构，该架构集成了扩张卷积字节编码器、混合Transformer-mLSTM以及四路聚合池化。配合采用掩码预训练和焦点对比微调的两阶段训练策略，以有效处理严重的类别不平衡问题。在五个数据集上的评估表明，CLAD实现了0.9909的平均F1分数，达到最先进水平，并以2.72个百分点的优势超越了最佳基线方法。它在完全消除解压缩与解析开销的同时，提供了卓越的检测精度，为结构化流式压缩器提供了一个具有良好泛化能力的鲁棒解决方案。

Abstract

The explosive growth of system logs makes streaming compression essential, yet existing log anomaly detection (LAD) methods incur severe pre-processing overhead by requiring full decompression and parsing. We introduce CLAD, the first deep learning framework to perform LAD directly on compressed byte streams. CLAD bypasses these bottlenecks by exploiting a key insight: normal logs compress into regular byte patterns, while anomalies systematically disrupt them. To extract these multi-scale deviations from opaque bytes, we propose a purpose-built architecture integrating a dilated convolutional byte encoder, a hybrid Transformer–mLSTM, and four-way aggregation pooling. This is coupled with a two-stage training strategy of masked pre-training and focal-contrastive fine-tuning to effectively handle severe class imbalance. Evaluated across five datasets, CLAD achieves a state-of-the-art average F1-score of 0.9909 and outperforms the best baseline by 2.72 percentage points. It delivers superior accuracy while completely eliminating decompression and parsing overheads, offering a robust solution that generalizes to structured streaming compressors.

Keywords: log anomaly detection, compressed byte streams, deep learning framework, dilated convolutional encoder, Transformer-mLSTM, focal-contrastive fine-tuning, streaming compression, system logs
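
The abstract says focal-contrastive fine-tuning is used to handle severe class imbalance but does not give the loss. As a hedged illustration of the focal part only, the sketch below implements the standard binary focal loss, which down-weights examples the model already classifies confidently. The function name and the `gamma` and `alpha` defaults are assumptions, and the contrastive term is not shown.

```python
import math

# Standard binary focal loss for a single example (an assumption about the
# "focal" component; CLAD's exact objective is not given in the abstract).
def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """p = predicted probability of the anomaly class, y = true label in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_normal = focal_loss(0.05, 0)    # normal log, confidently and correctly scored
hard_anomaly = focal_loss(0.30, 1)   # anomaly that the model mostly missed
```

Because the factor `(1 - p_t) ** gamma` shrinks toward zero for well-classified examples, the abundant easy normal logs contribute far less to the total loss than the rare hard anomalies. Setting `gamma=0` and `alpha=1` recovers plain cross-entropy for the positive class.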

264. ❌ Classical and Quantum Speedups for Non-Convex Optimization via Energy Conserving Descent

Authors: Yihang Sun, Huaijin Wang, Patrick Hayden, Jose Blanchet Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13022v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

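Scoring rationale: The paper studies classical and quantum speedups for non-convex optimization (Energy Conserving Descent) and belongs to optimization theory. It is entirely unrelated to all scoring keywords, which center on large models, deep learning techniques, and their applications. It does not involve any large-model technology, training method, inference optimization, alignment, application, or related concept.

!!! tip deepseek-chat TL;DR

The paper gives the first theoretical analysis of the Energy Conserving Descent (ECD) algorithm and proves that both its classical stochastic version (sECD) and its quantum version (qECD) achieve exponential speedups over their respective gradient descent baselines on one-dimensional positive double-well objectives.

Abstract (Chinese translation)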

能量守恒下降算法（Energy Conserving Descent, ECD）是近期提出的一种全局非凸优化方法（De Luca & Silverstein, 2022）。与梯度下降不同，经适当配置的ECD动态能够逃离严格局部极小值并收敛至全局极小值，这使其在机器学习优化中颇具吸引力。本文首次对ECD进行解析研究，作为系列工作的第一部分，我们聚焦于一维情形。我们形式化了一种具有能量守恒噪声的随机ECD动态，并提出了ECD哈密顿量的量子模拟版本，为通过哈密顿量模拟实现量子算法奠定了基础。针对正双阱目标函数，我们计算了从局部极小值到全局极小值的期望命中时间。我们证明，相较于各自的梯度下降基线——随机梯度下降及其量子化版本，随机ECD与量子ECD均能实现指数级加速。对于具有高势垒的目标函数，量子ECD相比随机ECD能取得进一步的加速效果。

Abstract

The Energy Conserving Descent (ECD) algorithm was recently proposed (De Luca & Silverstein, 2022) as a global non-convex optimization method. Unlike gradient descent, appropriately configured ECD dynamics escape strict local minima and converge to a global minimum, making it appealing for machine learning optimization. We present the first analytical study of ECD, focusing on the one-dimensional setting for this first installment. We formalize a stochastic ECD dynamics (sECD) with energy-preserving noise, as well as a quantum analog of the ECD Hamiltonian (qECD), providing the foundation for a quantum algorithm through Hamiltonian simulation. For positive double-well objectives, we compute the expected hitting time from a local to the global minimum. We prove that both sECD and qECD yield exponential speedup over respective gradient descent baselines–stochastic gradient descent and its quantization. For objectives with tall barriers, qECD achieves a further speedup over sECD.

Keywords: Non-convex optimization, Energy Conserving Descent, Quantum algorithm, Hamiltonian simulation, Exponential speedup, Global minimum, Stochastic dynamics, Double-well objectives
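
To make the baseline behavior concrete, the sketch below runs plain gradient descent on a one-dimensional double-well function and shows that it stays in whichever well it starts in. This illustrates only the failure mode the abstract contrasts ECD against; it does not implement ECD, sECD, or qECD. The objective `f`, the step size, and the starting points are hypothetical choices, not taken from the paper.

```python
# Plain gradient descent on a hypothetical positive double-well objective.
# This is the baseline only; ECD itself is not implemented here.

def f(x):
    """Tilted double well: global minimum near x = -1, local minimum near x = +1."""
    return (x * x - 1.0) ** 2 + 0.3 * x + 1.0

def grad_f(x):
    return 4.0 * x * (x * x - 1.0) + 0.3

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

x_stuck = gradient_descent(1.5)    # starts in the right-hand (shallower) well
x_global = gradient_descent(-1.5)  # starts in the left-hand (deeper) well
```

Started at `1.5`, the iterate converges to the local minimum and never crosses the barrier between the wells, even though the other well has a lower objective value. Escaping such a minimum is the behavior the abstract attributes to appropriately configured ECD dynamics.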

265. ❌ Causal Diffusion Models for Counterfactual Outcome Distributions in Longitudinal Data

Authors: Farbod Alinezhad, Jianfei Cao, Gary J. Young, Brady Post Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12992v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

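Scoring rationale: The paper focuses on causal diffusion models for counterfactual prediction in longitudinal data, an application of AI in science (medicine in particular). Its core is diffusion probabilistic modeling and causal inference, which is entirely unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on). It is somewhat relevant only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics': the paper is applied to medicine (a pharmacokinetic-pharmacodynamic tumor-growth simulator) and decision support, which falls under AI for Science, but it is not core bioinformatics or cheminformatics, hence a relevance of 5.

!!! tip deepseek-chat TL;DR

The paper proposes the Causal Diffusion Model (CDM), which generates full probability distributions of counterfactual outcomes in longitudinal data with complex time-dependent confounding, and shows on a tumor-growth simulator that its distributional accuracy exceeds that of existing methods.

Abstract (Chinese translation)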

在纵向数据中预测反事实结果至关重要，但由于复杂的时间依赖性混杂因素及现有方法中不确定性量化的不足，该任务极具挑战性，尤其是在序列治疗决策高度依赖于患者动态演变状态的情况下。我们提出了因果扩散模型（Causal Diffusion Model, CDM），这是首个明确设计用于生成序列干预下反事实结果的完整概率分布的降噪扩散概率方法。CDM采用了一种新颖的残差去噪架构，结合关系自注意力机制，能够捕捉复杂的时间依赖性和多模态结果轨迹，而无需针对混杂因素进行显式调整（例如逆概率加权或对抗平衡）。在先前工作中广泛采用的药代动力学-药效动力学肿瘤生长模拟器上进行严格评估后，CDM始终优于最先进的纵向因果推断方法，在分布准确性（1-瓦瑟斯坦距离）上实现了15-30%的相对提升，同时在高混杂情况下保持具有竞争力或更优的点估计准确性（均方根误差）。通过在不依赖定制化去混杂策略的情况下，将不确定性量化与稳健的反事实预测统一于复杂、序列混杂的场景中，CDM为医学、政策评估及其他纵向领域的决策支持提供了一个灵活且具有高影响力的工具。

Abstract

Predicting counterfactual outcomes in longitudinal data, where sequential treatment decisions heavily depend on evolving patient states, is critical yet notoriously challenging due to complex time-dependent confounding and inadequate uncertainty quantification in existing methods. We introduce the Causal Diffusion Model (CDM), the first denoising diffusion probabilistic approach explicitly designed to generate full probabilistic distributions of counterfactual outcomes under sequential interventions. CDM employs a novel residual denoising architecture with relational self-attention, capturing intricate temporal dependencies and multimodal outcome trajectories without requiring explicit adjustments (e.g., inverse-probability weighting or adversarial balancing) for confounding. In rigorous evaluation on a pharmacokinetic-pharmacodynamic tumor-growth simulator widely adopted in prior work, CDM consistently outperforms state-of-the-art longitudinal causal inference methods, achieving a 15-30% relative improvement in distributional accuracy (1-Wasserstein distance) while maintaining competitive or superior point-estimate accuracy (RMSE) under high-confounding regimes. By unifying uncertainty quantification and robust counterfactual prediction in complex, sequentially confounded settings, without tailored deconfounding, CDM offers a flexible, high-impact tool for decision support in medicine, policy evaluation, and other longitudinal domains.

Keywords: Causal Diffusion Model, counterfactual outcomes, longitudinal data, sequential interventions, probabilistic distributions, time-dependent confounding, pharmacokinetic-pharmacodynamic, distributional accuracy
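
The abstract reports distributional accuracy as a 1-Wasserstein distance. As a minimal sketch of that metric, assuming one-dimensional outcomes and two empirical samples of equal size, the distance reduces to the mean absolute difference between sorted samples. The function name and the toy samples are hypothetical; the paper's evaluation protocol is not described in the abstract.

```python
# 1-Wasserstein distance between two equal-size 1-D empirical samples.
# Sketch under stated assumptions; not the paper's evaluation code.

def wasserstein_1(samples_a, samples_b):
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal sample sizes")
    a, b = sorted(samples_a), sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

true_outcomes = [-1.0, -1.0, 1.0, 1.0]        # hypothetical bimodal ground truth
point_like = [0.0, 0.0, 0.0, 0.0]             # predicts only the mean
bimodal_like = [-1.0, 1.0, -1.0, 1.0]         # reproduces both modes

d_point = wasserstein_1(true_outcomes, point_like)
d_bimodal = wasserstein_1(true_outcomes, bimodal_like)
```

The point-like prediction has the correct mean yet sits a full unit away from the truth in Wasserstein distance, while the bimodal prediction has distance zero. This is why a distributional metric can separate methods that look similar on a point-estimate metric such as RMSE.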

266. ❌ An Optimal Sauer Lemma Over $k$-ary Alphabets

Authors: Steve Hanneke, Qinglin Meng, Shay Moran, Amirreza Shaeiri Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12952v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

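Scoring rationale: The paper is pure theoretical computer science and combinatorics. It focuses on an optimal generalization of the Sauer-Shelah-Perles Lemma to k-ary alphabets and involves combinatorial learning-theory concepts such as the VC dimension, the Natarajan dimension, and the DS dimension. It does not involve large models, deep learning, AI applications, or any technical topic among the scoring keywords; every keyword is entirely irrelevant.

!!! tip deepseek-chat TL;DR

The paper settles the optimal bound in the Sauer inequality for multiclass and list prediction. Using the DS dimension and the list-DS dimension, it establishes a bound that is tight for every alphabet size k and list size ℓ, replacing the exponential dependence on ℓ in the earlier Natarajan-dimension bound with a polynomial one.

Abstract (Chinese translation)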

索尔-谢拉-佩尔莱斯引理是组合数学与学习理论的基石，它依据假设类的维普尼克-切沃年基斯（VC）维数界定了二元假设类的大小。对于定义在 $k$ 元字母表上的函数类（即多分类场景），纳塔拉詹维数长期以来被视为 VC 维的类比，但相应的索尔型界在字母表大小 $k>2$ 时并非最优。
本文针对多分类与列表预测建立了一个尖锐的索尔不等式。我们的界通过丹尼尔利-沙莱夫-施瓦茨（DS）维数表达，并更一般地通过其扩展——列表-DS 维数（该组合参数刻画了多分类与列表 PAC 可学习性）来表达。该界对于任意字母表大小 $k$、列表大小 $\ell$ 及维数值均是紧的：它将基于纳塔拉詹维数的界中对 $\ell$ 的指数依赖替换为最优的多项式依赖，并同时改善了对 $k$ 的依赖。我们的证明使用了多项式方法。与经典的 VC 场景（已知存在多种直接组合证明）不同，在 DS 场景中我们尚未发现任何纯组合证明。这为未来研究提出了若干方向，文中对此进行了讨论。
作为推论，我们得到了列表 PAC 学习与列表预测器一致收敛的改进样本复杂度上界，从而优化了 Charikar 等人（STOC 2023）、Hanneke 等人（COLT 2024）以及 Brukhim 等人（NeurIPS 2024）的最新结果。

Abstract

The Sauer-Shelah-Perles Lemma is a cornerstone of combinatorics and learning theory, bounding the size of a binary hypothesis class in terms of its Vapnik-Chervonenkis (VC) dimension. For classes of functions over a $k$-ary alphabet, namely the multiclass setting, the Natarajan dimension has long served as an analogue of VC dimension, yet the corresponding Sauer-type bounds are suboptimal for alphabet sizes $k>2$. In this work, we establish a sharp Sauer inequality for multiclass and list prediction. Our bound is expressed in terms of the Daniely–Shalev-Shwartz (DS) dimension, and more generally with its extension, the list-DS dimension – the combinatorial parameters that characterize multiclass and list PAC learnability. Our bound is tight for every alphabet size $k$, list size $\ell$, and dimension value, replacing the exponential dependence on $\ell$ in the Natarajan-based bound by the optimal polynomial dependence, and improving the dependence on $k$ as well. Our proof uses the polynomial method. In contrast to the classical VC case, where several direct combinatorial proofs are known, we are not aware of any purely combinatorial proof in the DS setting. This motivates several directions for future research, which are discussed in the paper. As consequences, we obtain improved sample complexity upper bounds for list PAC learning and for uniform convergence of list predictors, sharpening the recent results of Charikar et al. (STOC 2023), Hanneke et al. (COLT 2024), and Brukhim et al. (NeurIPS 2024).

Keywords: Sauer-Shelah-Perles Lemma, multiclass prediction, list prediction, DS dimension, list-DS dimension, VC dimension, Natarajan dimension, PAC learnability
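
For orientation, the classical binary Sauer-Shelah-Perles bound that the paper generalizes states that a binary class of VC dimension d on n points contains at most the sum of C(n, i) for i from 0 to d hypotheses. The sketch below evaluates only that classical bound; the paper's new DS-dimension bound is not stated in the abstract and is not reproduced here. The function name is hypothetical.

```python
from math import comb

# Classical (binary) Sauer-Shelah-Perles bound: sum_{i=0}^{d} C(n, i).
# This is the well-known VC case, not the paper's k-ary / list-DS bound.
def sauer_shelah_bound(n, d):
    """Maximum size of a binary class with VC dimension d restricted to n points."""
    return sum(comb(n, i) for i in range(d + 1))

small = sauer_shelah_bound(5, 2)    # 1 + 5 + 10
full = sauer_shelah_bound(4, 4)     # d = n, so every labeling is allowed: 2 ** 4
growth = sauer_shelah_bound(10, 2)  # 1 + 10 + 45
```

For fixed d the bound grows polynomially in n, compared with the 2 ** n possible labelings. The abstract's contribution is an analogous tight bound in the multiclass and list settings.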

267. ❌ The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime

Authors: Jason Z Wang Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12951v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

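Scoring rationale: The paper studies the statistical limits of verifying AI model calibration error, in particular calibration evaluation of LLMs at low error rates. The most relevant keywords are 'Large Language Models' (the paper tests 6 LLMs), 'Hallucination Mitigation/Factuality' (calibration relates to factuality), and 'Mechanistic Interpretability/Explainable AI' (calibration evaluation falls under explainability). Other keywords such as MoE, SFT, and RAG are not covered.

!!! tip deepseek-chat TL;DR

The paper proves a fundamental statistical limit (the "verification tax") on estimating calibration error when a model's error rate is very low: verification gets harder as models improve. It also finds that label-free self-evaluation provides no information about calibration and that verification cost grows exponentially with pipeline depth.

Abstract (Chinese translation)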

深度学习领域被引用最多的校准结果——CIFAR-100上经过温度缩放后的预期校准误差（ECE）为0.012（Guo等人，2017）——实际上低于统计噪声基底。我们证明这并非实验的失败，而是一个定律：在模型错误率为ε时，估计校准误差的极小极大速率是Θ((Lε/m)^{1/3})，且任何估计器都无法超越该速率。这一“验证税”意味着，随着AI模型的改进，验证其校准性从根本上变得更加困难——两者以相同的指数朝相反方向变化。我们确立了四个与标准评估实践相悖的结果：（1）无标签的自评估对校准性提供的准确信息为零，其上限是一个与计算量无关的常数；（2）在mε ≈ 1处存在一个尖锐的相变点，低于该点时错误校准不可检测；（3）主动查询可消除利普希茨常数，将估计问题坍缩为检测问题；（4）验证成本随流水线深度L以L^K的速率指数增长。我们在五个基准测试（MMLU、TruthfulQA、ARC-Challenge、HellaSwag、WinoGrande；约27,000个项目）上使用来自5个系列的6个大型语言模型（8B-405B参数，27个基于对数概率置信度的基准-模型对）、95%自助置信区间和置换测试进行了验证。在80%的模型对中，自评估结果不显著。在前沿模型中，23%的成对比较结果与噪声无法区分，这意味着可靠的校准声明必须报告验证基底，并在增益接近基准分辨率时优先采用主动查询。

Abstract

The most cited calibration result in deep learning – post-temperature-scaling ECE of 0.012 on CIFAR-100 (Guo et al., 2017) – is below the statistical noise floor. We prove this is not a failure of the experiment but a law: the minimax rate for estimating calibration error with model error rate $\epsilon$ is $\Theta((L\epsilon/m)^{1/3})$, and no estimator can beat it. This “verification tax” implies that as AI models improve, verifying their calibration becomes fundamentally harder – with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at $m\epsilon \approx 1$ below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; (4) verification cost grows exponentially with pipeline depth at rate $L^K$. We validate across five benchmarks (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande; ~27,000 items) with 6 LLMs from 5 families (8B-405B parameters, 27 benchmark-model pairs with logprob-based confidence), 95% bootstrap CIs, and permutation tests. Self-evaluation non-significance holds in 80% of pairs. Across frontier models, 23% of pairwise comparisons are indistinguishable from noise, implying that credible calibration claims must report verification floors and prioritize active querying once gains approach benchmark resolution.

Keywords: AI auditing, calibration error, statistical limits, verification tax, rare-error regime, self-evaluation, minimax rate, LLM evaluation
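
As a hedged illustration of the two quantities the abstract discusses, the sketch below computes the usual equal-width binned expected calibration error (ECE) and evaluates the stated rate $(L\epsilon/m)^{1/3}$ with the unknown $\Theta$ constant set to 1. The binning scheme, the function names, the sample values, and the reading of $m$ as the number of labeled samples and $L$ as a Lipschitz constant are assumptions; the paper's exact estimator is not given in the abstract.

```python
# Binned ECE plus the abstract's rate with the hidden constant set to 1.
# Names, binning, and the interpretation of L and m are assumptions.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between accuracy and mean confidence per equal-width bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(accuracy - avg_conf)
    return ece

def rate_scale(lipschitz, error_rate, m):
    """(L * eps / m) ** (1/3); only the scaling is meaningful, not the value."""
    return (lipschitz * error_rate / m) ** (1.0 / 3.0)

ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
scale = rate_scale(lipschitz=1.0, error_rate=0.01, m=10_000)
```

Comparing a reported ECE against a resolution of this kind for the available sample size is the sort of check the abstract argues calibration claims should include. The numbers here are toy values, not results from the paper.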

268. ❌ Parcae: Scaling Laws For Stable Looped Language Models

Authors: Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, Daniel Y. Fu Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12946v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

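Scoring rationale: The paper's core is a looped language model architecture (Parcae) and its scaling laws. It directly concerns large language models (LLMs) and scaling laws, including how looping and data should be scaled together, so the 'Large Language Models' and 'Scaling Laws AND Data Quality' keywords each receive 10. It involves model training and is somewhat related to pre-training, receiving 5. Other keywords such as MoE, SFT, RAG, and inference acceleration are not mentioned in the abstract and receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses training instability in looped language models by proposing the stable Parcae architecture, and studies scaling laws for improving model quality by increasing FLOPs through looping. Under a fixed parameter and data budget, Parcae improves quality over Transformer baselines.

Abstract (Chinese translation)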

传统固定深度架构通常通过增加参数量来提升训练浮点运算量（FLOPs），但这会以更高的内存占用或数据需求为代价。循环架构是一种潜在的替代方案，它通过将激活值在层块中循环传递来增加FLOPs。尽管前景广阔，现有训练循环架构的方法存在不稳定性，常出现残差爆炸和损失值尖峰问题。我们通过将循环过程重新表述为残差流上的非线性时变动力系统来解决这些挑战。通过对该系统进行线性近似分析，我们发现现有循环架构的不稳定性源于其注入参数具有较大的谱范数。针对这些不稳定问题，我们提出了Parcae——一种新型的稳定循环架构，它通过对负对角参数化进行离散化来约束注入参数的谱范数。因此，Parcae在验证困惑度上比先前的大规模循环模型降低了最高达6.3%。利用这种稳定循环架构，我们研究了以循环作为媒介、通过增加训练和测试阶段FLOPs来提升模型性能的扩展特性。在训练方面，我们推导出可预测的幂律规律，在保持参数量固定的同时扩展FLOPs。我们的初步扩展规律表明，在固定FLOP预算下，应同步增加循环次数和数据量。在测试阶段，我们发现Parcae能够通过循环扩展计算量，并遵循可预测的饱和指数衰减规律。当参数量扩展至13亿时，在固定参数和数据预算下，Parcae相较于强大的Transformer基线模型，在CORE和Core-Extended评估指标上分别提升了2.99和1.18个点，达到了相当于两倍规模Transformer模型87.5%的相对性能水平。

摘要 (Abstract)

Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through increased parameterization, at the expense of a higher memory footprint, or data. A potential alternative is looped architectures, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looping as a nonlinear time-variant dynamical system over the residual stream. Via a linear approximation to this system, we find that instability occurs in existing looped architectures as a result of large spectral norms in their injection parameters. To address these instability issues, we propose Parcae, a novel stable, looped architecture that constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization. As a result, Parcae achieves up to 6.3% lower validation perplexity over prior large-scale looped models. Using our stable looped architecture, we investigate the scaling properties of looping as a medium to improve quality by increasing FLOPs in training and test-time. For training, we derive predictable power laws to scale FLOPs while keeping parameter count fixed. Our initial scaling laws suggest that looping and data should be increased in tandem, given a fixed FLOP budget. At test-time, we find that Parcae can use looping to scale compute, following a predictable, saturating exponential decay. When scaled up to 1.3B parameters, we find that Parcae improves CORE and Core-Extended quality by 2.99 and 1.18 points when compared to strong Transformer baselines under a fixed parameter and data budget, achieving a relative quality of up to 87.5% of a Transformer twice the size.

Keywords: looped language models, scaling laws, stable architecture, FLOPs scaling, parameter efficiency, validation perplexity, Transformer baselines, dynamical systems
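
The stability argument in the abstract (a negative diagonal parameterization whose discretization bounds the spectral norm) can be illustrated with a toy recurrence. This is only a sketch under stated assumptions, not the Parcae architecture: the exponential discretization `exp(delta * A)`, the elementwise update `h <- diag * h + inject`, and every name below are hypothetical choices made for illustration.

```python
import math
from typing import List


def discretize_negative_diagonal(raw: List[float], delta: float = 1.0) -> List[float]:
    """Map unconstrained values to diagonal entries strictly inside (0, 1).

    A_i = -exp(raw_i) is always negative, so exp(delta * A_i) < 1 for delta > 0.
    For a diagonal matrix the spectral norm is the largest absolute entry.
    """
    return [math.exp(-delta * math.exp(r)) for r in raw]


def loop_state(diag: List[float], inject: List[float], steps: int) -> List[float]:
    """Iterate the toy update h <- diag * h + inject (elementwise) `steps` times."""
    h = [0.0] * len(diag)
    for _ in range(steps):
        h = [d * x + e for d, x, e in zip(diag, h, inject)]
    return h


raw = [-2.0, 0.0, 1.5]              # hypothetical unconstrained parameters
stable_diag = discretize_negative_diagonal(raw)
unstable_diag = [1.2, 0.9, 1.05]    # hypothetical unconstrained diagonal, norm > 1
inject = [1.0, 1.0, 1.0]

stable_norm = max(abs(v) for v in loop_state(stable_diag, inject, 64))
unstable_norm = max(abs(v) for v in loop_state(unstable_diag, inject, 64))
print(f"constrained: {stable_norm:.2f}, unconstrained: {unstable_norm:.2e}")
```

With the constrained diagonal the state settles near a fixed point as the loop count grows, whereas an entry above 1 makes it grow geometrically, which is the toy analogue of the residual explosion the abstract describes.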

269. ❌ Token Encoding for Semantic Recovery

Authors: Jingzhi Hu, Geoffrey Ye Li Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12931v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

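Scoring rationale: The paper studies semantic communication for wireless networks, specifically token encoding and semantic recovery. It adapts a foundation model, but it does not contribute to large language models (LLMs) or to the underlying principles of deep learning. Its core is semantic transmission and channel fault tolerance in communication engineering, not the application of large models to new domains or innovation in their technical principles. All keywords relate directly to large models, deep learning principles, or AI for Science, whereas this paper belongs to communication engineering and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes TokCode, a token encoding framework for wireless semantic communication, and optimizes the encoder with a sentence-semantic-guided foundation model adaptation algorithm (SFMA). Even under harsh channels that lose 40% to 60% of tokens, it effectively mitigates semantic distortion and approaches the performance upper bound.

Abstract (translated)

Token-based semantic communication holds great promise for future wireless networks because it can compress semantic tokens under limited channel capacity. However, harsh wireless channels often cause token loss, leading to severe distortion that prevents reliable semantic recovery at the receiver. This paper proposes a token encoding framework for robust semantic recovery (TokCode), which adds no extra transmission overhead and supports plug-and-play deployment. To optimize the token encoder efficiently, we develop a sentence-semantic-guided foundation model adaptation algorithm (SFMA) that avoids costly end-to-end training. Simulation results on prompt-based generative image transmission show that TokCode effectively mitigates semantic distortion and approaches the performance upper bound, even under harsh channel conditions where 40% to 60% of tokens are randomly lost.

Abstract (original)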

Token-based semantic communication is promising for future wireless networks, as it can compact semantic tokens under very limited channel capacity. However, harsh wireless channels often cause missing tokens, leading to severe distortion that prevents reliable semantic recovery at the receiver. In this article, we propose a token encoding framework for robust semantic recovery (TokCode), which incurs no additional transmission overhead and supports plug-and-play deployment. For efficient token encoder optimization, we develop a sentence-semantic-guided foundation model adaptation algorithm (SFMA) that avoids costly end-to-end training. Based on simulation results on prompt-based generative image transmission, TokCode mitigates semantic distortion and can approach the performance upper-bound, even under harsh channels where 40% to 60% of tokens are randomly lost.

Keywords: semantic communication, token encoding, wireless networks, foundation model adaptation, semantic recovery, channel capacity, transmission overhead, generative image transmission

270. ❌ Frequency-aware Decomposition Learning for Sensorless Wrench Forecasting on a Vibration-rich Hydraulic Manipulator

Authors: Hyeonbeen Lee, Min-Jae Jung, Tae-Kyeong Yeu, Jong-Boo Han, Daegil Park, Jin-Gyun Kim Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12905v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

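Scoring rationale: The paper addresses sensorless force/torque estimation in robotics and proposes a Frequency-aware Decomposition Network (FDN) for short-term wrench forecasting on a vibration-rich hydraulic manipulator. It is unrelated to most large-model keywords (LLMs, MoE, RLHF, RAG, and so on), which concern natural language processing, model architecture, alignment, reasoning, and other LLM-specific techniques, whereas this paper applies deep learning to robot perception and control. Only two keywords receive 5 points: (1) 'Pre-training OR Continual Pre-training OR Domain Adaptation': the paper pretrains FDN on a large-scale open-source robot dataset and transfers the learned representation to the downstream task, which reflects pretraining and transfer learning, though not pre-training as typically understood for large models. (2) 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to robotics, which can be regarded as a scientific or engineering application and so fits a broad reading of 'AI for Science', although it is neither bioinformatics nor cheminformatics. No other keyword is covered, so the rest score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a Frequency-aware Decomposition Network (FDN) that forecasts short-term wrench for a vibration-rich hydraulic manipulator from proprioceptive history. With pretraining and transfer learning, it improves high-frequency-band performance on real-world grinding excavation data.

Abstract (translated)

Force and torque sensing is critical for robot-environment interaction, but physical force/torque sensors are limited by size, cost, and fragility. To mitigate this, recent work estimates the wrench sensorlessly from the robot's internal states. Existing methods usually target relatively slow interactions, whereas tasks involving rapid interaction, such as grinding, can induce high-frequency vibrations that affect task execution, and wrench estimation in such settings remains underexplored. To fill this gap, we propose a frequency-aware decomposition network for short-term forecasting of vibration-rich wrench from proprioceptive history. The network outputs spectrally decomposed wrench estimates through asymmetric deterministic and probabilistic prediction heads, modeling the high-frequency residual as a learnable conditional probability distribution. It further introduces frequency awareness: learnable filtering adaptively enhances the input spectra, and a frequency-band prior is imposed on the outputs. We pretrain the network on a large-scale open-source robot dataset and transfer the learned proprioception-to-wrench representation to the downstream task. On real-world grinding excavation data from a six-degree-of-freedom hydraulic manipulator, under a delayed estimation setting, the network outperforms baseline estimators and forecasters in the high-frequency band and remains competitive in the low-frequency band. Transfer learning brings additional gains, indicating the potential of large-scale pretraining and transfer learning for robotic wrench estimation. Code and data will be released upon acceptance.

Abstract (original)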

Force and torque (F/T) sensing is critical for robot-environment interaction, but physical F/T sensors impose constraints in size, cost, and fragility. To mitigate this, recent studies have estimated force/wrench sensorlessly from robot internal states. While existing methods generally target relatively slow interactions, tasks involving rapid interactions, such as grinding, can induce task-critical high-frequency vibrations, and estimation in such robotic settings remains underexplored. To address this gap, we propose a Frequency-aware Decomposition Network (FDN) for short-term forecasting of vibration-rich wrench from proprioceptive history. FDN predicts spectrally decomposed wrench with asymmetric deterministic and probabilistic heads, modeling the high-frequency residual as a learned conditional distribution. It further incorporates frequency-awareness to adaptively enhance input spectra with learned filtering and impose a frequency-band prior on the outputs. We pretrain FDN on a large-scale open-source robot dataset and transfer the learned proprioception-to-wrench representation to the downstream. On real-world grinding excavation data from a 6-DoF hydraulic manipulator and under a delayed estimation setting, FDN outperforms baseline estimators and forecasters in the high-frequency band and remains competitive in the low-frequency band. Transfer learning provides additional gains, suggesting the potential of large-scale pretraining and transfer learning for robotic wrench estimation. Code and data will be made available upon acceptance.

Keywords: wrench forecasting, vibration-rich, hydraulic manipulator, frequency-aware decomposition, pretraining, transfer learning, proprioceptive sensing, robot-environment interaction
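
The idea of splitting a signal into a low-frequency component and a high-frequency residual can be sketched with a fixed filter. This is an illustrative assumption only: FDN uses learned filtering and probabilistic heads, whereas the sketch below uses a plain centered moving average, and the sampling rate, window size, and synthetic signal are hypothetical.

```python
import math
from typing import List, Tuple


def moving_average(signal: List[float], window: int) -> List[float]:
    """Centered moving average, used here as a crude low-pass filter."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def decompose(signal: List[float], window: int = 21) -> Tuple[List[float], List[float]]:
    """Split a signal into a smooth low band and the remaining high-band residual."""
    low = moving_average(signal, window)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# Synthetic one-axis signal: slow 0.5 Hz motion plus a 40 Hz vibration,
# sampled at a hypothetical 200 Hz for two seconds.
t = [i / 200 for i in range(400)]
wrench = [math.sin(2 * math.pi * 0.5 * x) + 0.3 * math.sin(2 * math.pi * 40 * x) for x in t]
low, high = decompose(wrench)
```

By construction the two bands sum back to the original signal, so a forecaster could treat the smooth part deterministically and model the residual separately.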

271. ❌ Rapid LoRA Aggregation for Wireless Channel Adaptation in Open-Set Radio Frequency Fingerprinting

Authors: Mingxi Zhang, Renjie Xie, Jincheng Wang, Guyue Li, Wei Xu Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12834v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

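Scoring rationale: The paper's core is the use of LoRA (Low-Rank Adaptation) for channel adaptation in radio frequency fingerprinting, which falls under parameter-efficient fine-tuning (PEFT), so it is highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points). The paper mentions pretraining LoRA modules, which gives some connection to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), although this is not its focus. All other keywords are unrelated to the paper (0 points): it targets a specific application in wireless communication and does not cover large language models, reasoning methods, alignment, agents, or similar topics.

!!! tip deepseek-chat TL;DR

The paper proposes a lightweight, self-adaptive LoRA-based framework for radio frequency fingerprint extraction that addresses generalization to unknown devices and varying channel conditions in open-set wireless authentication. Experiments show a 15% lower equal error rate than non-finetuned baselines and 83% less training time than full fine-tuning.

Abstract (translated)

Radio frequency fingerprints (RFFs) enable secure wireless authentication but face challenges in open-set scenarios with unknown devices and varying channels. Existing methods generalize poorly and are computationally expensive. We propose a lightweight, self-adaptive RFF extraction framework based on Low-Rank Adaptation (LoRA). By pretraining a LoRA module for each environment, the method adapts quickly to unseen channel conditions without full retraining. During inference, a weighted combination of LoRA modules dynamically enhances feature extraction. Experimental results show that, using the same training dataset, the method reduces the equal error rate (EER) by 15% compared with non-finetuned baselines and cuts training time by 83% compared with full fine-tuning. The approach offers a scalable and efficient solution for open-set RFF authentication in dynamic wireless vehicular networks.

Abstract (original)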

Radio frequency fingerprints (RFFs) enable secure wireless authentication but struggle in open-set scenarios with unknown devices and varying channels. Existing methods face challenges in generalization and incur high computational costs. We propose a lightweight, self-adaptive RFF extraction framework using Low-Rank Adaptation (LoRA). By pretraining LoRA modules per environment, our method enables fast adaptation to unseen channel conditions without full retraining. During inference, a weighted combination of LoRAs dynamically enhances feature extraction. Experimental results demonstrate a 15% reduction in equal error rate (EER) compared to non-finetuned baselines and an 83% decrease in training time relative to full fine-tuning, using the same training dataset. This approach provides a scalable and efficient solution for open-set RFF authentication in dynamic wireless vehicular networks.

Keywords: LoRA, Radio Frequency Fingerprinting, Wireless Authentication, Open-set Scenarios, Channel Adaptation, Parameter-efficient Fine-tuning, Vehicular Networks, Lightweight Framework
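
The abstract describes inference as a weighted combination of per-environment LoRA modules. A minimal sketch of that idea follows, assuming each adapter contributes a low-rank update `B @ A` to a shared base weight. The mixing weights, matrix sizes, and function names are hypothetical, and the abstract does not say how the paper chooses the weights.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def aggregate_lora(base: Matrix, adapters: List[Tuple[Matrix, Matrix]],
                   weights: List[float]) -> Matrix:
    """Return base + sum_i weights[i] * (B_i @ A_i) without modifying `base`."""
    merged = [row[:] for row in base]
    for (b_mat, a_mat), w in zip(adapters, weights):
        delta = matmul(b_mat, a_mat)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += w * value
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]
adapter_env1 = ([[1.0], [0.0]], [[0.0, 2.0]])   # rank-1 pair: B is 2x1, A is 1x2
adapter_env2 = ([[0.0], [1.0]], [[4.0, 0.0]])
merged = aggregate_lora(base, [adapter_env1, adapter_env2], weights=[0.25, 0.75])
print(merged)
```

Because only the small `B` and `A` factors differ per environment, switching or blending environments never requires retraining the base weight.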

272. ❌ TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning

Authors: Chaoyao Shen, Linfeng Jiang, Yixian Shen, Tao Xu, Guoqing Li, Anuj Pathania, Andy D. Pimentel, Meng Zhang Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12891v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

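Scoring rationale: TCL focuses on deep learning compiler optimization, specifically tensor program optimization across hardware platforms. Its core contributions are a data-efficient active learning sampler (RDU Sampler), a lightweight Mamba-based cost model, and a continuous knowledge distillation framework. The scoring keywords all concern the technical principles, training methods, inference optimization, and application areas of large models, whereas this paper studies system-level optimization at the compiler layer and has no direct connection to them. The paper neither proposes innovations in large models nor applies them, and it mentions none of the specific techniques in the keyword list, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the TCL framework, which combines a data-efficient active learning sampler, a lightweight Mamba-based cost model, and continuous knowledge distillation to address the high data collection cost and poor transferability of cross-hardware optimization in deep learning compilers. Compared with Tenset-MLP, it achieves on average 16.8x and 12.48x faster tuning and 1.20x and 1.13x lower inference latency on CPU and GPU, respectively.

Abstract (translated)

Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, which makes data collection expensive and limits transferability across platforms. This paper proposes TCL, a novel efficient and transferable compiler framework for fast tensor program optimization across diverse hardware platforms. TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that jointly optimizes Representativeness, Diversity, and Uncertainty and selects only 10% of tensor programs, substantially reducing data collection cost while keeping model accuracy close to the original level; (2) a new Mamba-based cost model that captures long-range schedule dependencies and achieves a better trade-off between prediction accuracy and computational cost through reduced parameterization and lightweight sequence modeling; and (3) a continuous knowledge distillation framework that transfers knowledge across multiple hardware platforms efficiently and progressively while avoiding the parameter explosion and data dependency that traditional multi-task learning usually causes. Extensive experiments validate each enabler and the overall TCL framework. When optimizing a range of mainstream DL models on CPU and GPU platforms, TCL achieves on average 16.8x and 12.48x faster tuning and 1.20x and 1.13x lower inference latency, respectively, compared with Tenset-MLP.

Abstract (original)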

Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, incurring high collection costs and offering suboptimal transferability across platforms. In this paper, we introduce TCL, a novel efficient and transferable compiler framework for fast tensor program optimization across diverse hardware platforms to address these challenges. Specifically, TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that selects only 10% of tensor programs by jointly optimizing Representativeness, Diversity, and Uncertainty, substantially reducing data collection costs while maintaining near-original model accuracy; (2) a new Mamba-based cost model that efficiently captures long-range schedule dependencies while achieving a favorable trade-off between prediction accuracy and computational cost through reduced parameterization and lightweight sequence modeling; and (3) a continuous knowledge distillation framework that effectively and progressively transfers knowledge across multiple hardware platforms while avoiding the parameter explosion and data dependency issues typically caused by traditional multi-task learning. Extensive experiments validate the effectiveness of each individual enabler and the holistic TCL framework. When optimizing a range of mainstream DL models on both CPU and GPU platforms, TCL achieves, on average, 16.8x and 12.48x faster tuning time, and 1.20x and 1.13x lower inference latency, respectively, compared to Tenset-MLP.

Keywords: Deep Learning Compiler, Tensor Program Optimization, Cross-Hardware Transfer, Active Learning, Mamba-based Cost Model, Continuous Knowledge Distillation, Inference Latency Reduction, Tuning Time Acceleration

273. ❌ Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

Authors: Shaopeng Fu, Di Wang Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12817v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

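Scoring rationale: The paper studies adversarial training as a defense for LLMs, analyzes continuous adversarial training (CAT) through in-context learning theory, and proposes a regularization method based on the singular values of the embedding matrix to improve CAT. It is therefore highly relevant to 'Large Language Models' (10 points) and to 'In-context Learning' (10 points). Other keywords, such as MoE, SLMs, Scaling Laws, and Alignment, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper gives the first analysis of continuous adversarial training (CAT) for LLMs based on in-context learning theory, proving a robust generalization bound that is negatively correlated with the perturbation radius in the embedding space. Building on the singular values of the embedding matrix, it proposes a regularization method that improves the trade-off between jailbreak robustness and utility in experiments.

Abstract (translated)

Adversarial training (AT) is an effective defense that protects large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To make it more efficient, recent studies propose continuous AT (CAT), which searches for adversarial inputs within the continuous embedding space of the LLM during AT. Although CAT has succeeded empirically, its underlying mechanism remains unclear: why do adversarial perturbations in the embedding space help LLMs defend against jailbreak prompts synthesized in the input token space? Based on in-context learning (ICL) theory, this paper presents the first theoretical analysis of CAT on LLMs. For linear transformers trained with embedding-space adversarial examples on in-context linear regression tasks, we prove a robust generalization bound that is negatively correlated with the perturbation radius in the embedding space. This explains why CAT can defend against jailbreak prompts from the LLM's token space. The bound further shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve CAT by adding to its objective a regularization term that depends on the singular values of the LLM's embedding matrix. Experiments on real-world LLMs show that our method helps LLMs achieve a better trade-off between jailbreak robustness and utility. The code is available at https://github.com/fshp971/continuous-adv-icl.

Abstract (original)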

Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM’s token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM’s embedding matrix, into the objective function of CAT. Experiments on real-world LLMs demonstrate that our method can help LLMs achieve a better jailbreak robustness-utility tradeoff. The code is available at https://github.com/fshp971/continuous-adv-icl.

Keywords: Large Language Models, Adversarial Training, Continuous Adversarial Training, In-context Learning, Jailbreak Defense, Robust Generalization, Embedding Space, Regularization
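
The notion of a perturbation radius in the embedding space can be made concrete with a single projected ascent step. The sketch below is an assumption-laden illustration rather than the paper's method: it takes the loss gradient as a given input (a real CAT loop would compute it from the LLM), constrains the perturbation to an L2 ball, and does not implement the singular-value regularizer. All names are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def project_l2(delta: List[float], radius: float) -> List[float]:
    """Project a perturbation onto the L2 ball of the given radius."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return list(delta)
    return [d * radius / norm for d in delta]


def perturb_embedding(embedding: List[float], grad: List[float], step: float,
                      radius: float, delta: Optional[List[float]] = None
                      ) -> Tuple[List[float], List[float]]:
    """Take one ascent step on the perturbation, then project it back into the ball."""
    if delta is None:
        delta = [0.0] * len(embedding)
    delta = [d + step * g for d, g in zip(delta, grad)]
    delta = project_l2(delta, radius)
    adversarial = [e + d for e, d in zip(embedding, delta)]
    return adversarial, delta


embedding = [1.0, 2.0]   # hypothetical token embedding
grad = [3.0, 4.0]        # hypothetical loss gradient with respect to the embedding
adversarial, delta = perturb_embedding(embedding, grad, step=1.0, radius=1.0)
```

Here the raw step has norm 5 and is rescaled to lie on the unit ball, so the adversarial embedding never moves farther than `radius` from the clean one.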

274. ❌ Interpretable Relational Inference with LLM-Guided Symbolic Dynamics Modeling

Authors: Xiaoxiao Liang, Juyuan Zhang, Liming Pan, Linyuan Lü Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12806v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

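Scoring rationale: The paper proposes the COSINE framework, which uses a large language model (LLM) to guide symbolic dynamics modeling. This is an application of large models to science (AI for Science), so 'Large Language Models' scores 8 (central to the application, but the paper is not purely LLM research) and 'AI for Science' scores 10 (directly relevant). The framework emphasizes interpretability, so 'Mechanistic Interpretability' is highly relevant and scores 10. Other keywords, such as MoE, SFT, and RAG, are not covered and score 0.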
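
The scoring arithmetic behind the table can be sketched as follows. The pass mark uses the report's own formula (5.0 + 0.8 × total weight, with the configured total weight of 27.0). The per-keyword rule `weight * relevance` is an assumption inferred from the table columns, and the function names are hypothetical.

```python
from typing import List, Tuple


def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark from the report's formula: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows: List[Tuple[float, float]]) -> float:
    """Sum of weight * relevance over keyword rows (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# (weight, relevance) pairs as listed in the table above: three rows with
# non-zero relevance (8, 10, 10) and 24 rows with zero relevance.
rows = [(0.0, 8.0), (0.0, 10.0), (0.0, 10.0)] + [(0.0, 0.0)] * 24

threshold = pass_mark(total_weight=27.0)
total = paper_score(rows)
passed = total >= threshold
print(f"score={total:.1f} threshold={threshold:.1f} passed={passed}")
```

Under this assumed rule, a row's relevance only contributes to the total when its listed weight is non-zero.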

!!! tip deepseek-chat TL;DR

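The study addresses the problem of inferring latent interaction structures from observed dynamics. It proposes the COSINE framework, which combines LLM-guided symbolic regression to jointly discover interaction graphs and sparse symbolic dynamics, and validates its effectiveness and interpretability on synthetic systems and real-world epidemic data.

Abstract (translated)

Inferring latent interaction structures from observed dynamics is a fundamental inverse problem in many-body interacting systems. Most neural approaches rely on black-box surrogate models over trainable graphs, gaining accuracy at the expense of mechanistic interpretability. Symbolic regression provides explicit dynamical equations and stronger inductive biases, but usually assumes a known topology and a fixed function library. We propose COSINE (Co-Optimization of Symbolic Interactions and Network Edges), a differentiable framework that jointly discovers interaction graphs and sparse symbolic dynamics. To overcome the limitations of a fixed symbolic library, COSINE further introduces an outer-loop large language model that uses feedback from the inner optimization loop to adaptively prune and expand the hypothesis space. Experiments on synthetic systems and large-scale real-world epidemic data show that the method recovers structure robustly and yields compact, mechanism-aligned dynamical expressions. Code: https://anonymous.4open.science/r/COSINE-6D43.

Abstract (original)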

Inferring latent interaction structures from observed dynamics is a fundamental inverse problem in many-body interacting systems. Most neural approaches rely on black-box surrogates over trainable graphs, achieving accuracy at the expense of mechanistic interpretability. Symbolic regression offers explicit dynamical equations and stronger inductive biases, but typically assumes known topology and a fixed function library. We propose COSINE (Co-Optimization of Symbolic Interactions and Network Edges), a differentiable framework that jointly discovers interaction graphs and sparse symbolic dynamics. To overcome the limitations of fixed symbolic libraries, COSINE further incorporates an outer-loop large language model that adaptively prunes and expands the hypothesis space using feedback from the inner optimization loop. Experiments on synthetic systems and large-scale real-world epidemic data demonstrate robust structural recovery and compact, mechanism-aligned dynamical expressions. Code: https://anonymous.4open.science/r/COSINE-6D43.

Keywords: interpretable relational inference, LLM-guided symbolic dynamics, interaction graph discovery, symbolic regression, many-body interacting systems, mechanistic interpretability, epidemic data modeling, differentiable framework

275. ❌ Rethinking the Personalized Relaxed Initialization in the Federated Learning: Consistency and Generalization

Authors: Li Shen, Yan Sun, Dacheng Tao Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12768v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

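Scoring rationale: The paper focuses on algorithm optimization and theoretical analysis in federated learning, studying a personalized initialization method (FedInit) that mitigates client drift and analyzing the generalization error. The scoring keywords all concern large models, deep learning principles, or specific scientific AI applications (such as bioinformatics), none of which this paper addresses. Its content has no direct or indirect connection to any technique in the keyword list (LLMs, MoE, fine-tuning methods, reasoning techniques, agent systems, and so on), nor does it involve AI applied to science. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes FedInit, a federated learning algorithm that mitigates client drift through personalized relaxed initialization, and analyzes theoretically how local inconsistency affects the generalization error. Experiments show that the method matches several advanced baselines with no additional training or communication cost and can improve generalization when incorporated into other algorithms.

Abstract (translated)

Federated learning (FL) is a distributed paradigm that coordinates a large number of local clients to train a global model collaboratively through stage-wise local training on heterogeneous datasets. Prior work has indicated indirectly that FL suffers from the "client drift" problem, which arises from inconsistent optima across local clients. However, a solid theoretical analysis of the impact of this local inconsistency is still lacking. To alleviate the negative effect of client drift and examine its nature in FL, this paper first proposes an efficient FL algorithm, FedInit, which allows a personalized relaxed initialization state at the start of each local training stage. Specifically, FedInit initializes the local state by starting from the current global state and moving in the direction opposite to the latest local state. To understand further how inconsistency harms FL performance, we introduce an excess risk analysis and study the divergence term to investigate the test error in FL. Our study shows that the optimization error is not sensitive to this local inconsistency, which mainly affects the generalization error bound. Extensive experiments validate the method. The proposed FedInit achieves results comparable to several advanced benchmarks without any additional training or communication cost. The stage-wise personalized relaxed initialization can also be incorporated into several current advanced algorithms to achieve higher generalization performance in the FL paradigm.

Abstract (original)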

Federated learning (FL) is a distributed paradigm that coordinates massive local clients to collaboratively train a global model via stage-wise local training processes on the heterogeneous dataset. Previous works have implicitly studied that FL suffers from the "client-drift" problem, which is caused by the inconsistent optimum across local clients. However, till now it still lacks solid theoretical analysis to explain the impact of this local inconsistency. To alleviate the negative impact of "client drift" and explore its substance in FL, in this paper, we first propose an efficient FL algorithm FedInit, which allows employing the personalized relaxed initialization state at the beginning of each local training stage. Specifically, FedInit initializes the local state by moving away from the current global state towards the reverse direction of the latest local state. Moreover, to further understand how inconsistency disrupts performance in FL, we introduce the excess risk analysis and study the divergence term to investigate the test error in FL. Our studies show that optimization error is not sensitive to this local inconsistency, while it mainly affects the generalization error bound. Extensive experiments are conducted to validate its efficiency. The proposed FedInit method could achieve comparable results compared to several advanced benchmarks without any additional training or communication costs. Meanwhile, the stage-wise personalized relaxed initialization could also be incorporated into several current advanced algorithms to achieve higher generalization performance in the FL paradigm.

关键词: Federated Learning, Client Drift, Personalized Initialization, Generalization Error, Excess Risk Analysis, FedInit Algorithm, Distributed Optimization, Heterogeneous Data
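
The relaxed initialization described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes model parameters are flat lists of floats, and the coefficient name `beta` and the exact update `w_global + beta * (w_global - w_local_prev)` are one reading of "moving away from the current global state towards the reverse direction of the latest local state".

```python
from typing import List


def relaxed_init(w_global: List[float], w_local_prev: List[float], beta: float) -> List[float]:
    """Start local training from a point pushed away from the client's last local state.

    beta = 0 recovers the usual FedAvg-style start at the global model (assumed convention).
    """
    if len(w_global) != len(w_local_prev):
        raise ValueError("parameter vectors must have the same length")
    # Step from the global state in the direction opposite to the previous local state.
    return [g + beta * (g - l) for g, l in zip(w_global, w_local_prev)]


# Hypothetical two-parameter model.
start = relaxed_init([1.0, 2.0], [0.0, 4.0], beta=0.5)
print(start)
```

The offset is computed per client from state the client already holds, which is consistent with the abstract's claim of no extra communication cost.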

276. ❌ Stress Detection Using Wearable Physiological and Sociometric Sensors

作者: Oscar Martinez Mozos, Virginia Sandulescu, Sally Andrews, David Ellis, Nicola Bellotto, Radu Dobrescu, Jose Manuel Ferrandez 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12746v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

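Scoring rationale: The paper studies stress detection by fusing physiological and sociometric sensor data with traditional machine learning methods (SVM, AdaBoost, k-NN). It does not involve large models, the technical principles of deep learning, or AI for Science innovations, so it is unrelated to every scoring keyword.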

!!! tip deepseek-chat TL;DR

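The paper proposes a machine learning method that combines physiological and sociometric sensor data and accurately detects stress states in a controlled Trier social stress test.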

摘要翻译

在现代社会中，压力仍是个人面临的重要社会问题。本文提出一种机器学习方法，通过结合捕获生理反应与社会反应的两种传感器系统，实现社交情境下人群压力的自动检测。我们比较了包括支持向量机（SVM）、AdaBoost和k近邻（k-NN）在内的不同分类器的性能。实验结果表明，通过融合两种传感器系统的测量数据，我们能够在受控的特里尔社会压力测试（TSST）中准确区分压力情境与中性情境。此外，本文分别评估了各传感器模态的独立判别能力，并探讨了其在实时压力检测中的适用性。最后，我们对压力检测中最具判别力的特征进行了研究分析。

摘要 (Abstract)

Stress remains a significant social problem for individuals in modern societies. This paper presents a machine learning approach for the automatic detection of stress of people in a social situation by combining two sensor systems that capture physiological and social responses. We compare the performance using different classifiers including support vector machine, AdaBoost, and k-nearest neighbor. Our experimental results show that by combining the measurements from both sensor systems, we could accurately discriminate between stressful and neutral situations during a controlled Trier social stress test (TSST). Moreover, this paper assesses the discriminative ability of each sensor modality individually and considers their suitability for real-time stress detection. Finally, we present a study of the most discriminative features for stress detection.

关键词: stress detection, wearable sensors, physiological sensors, sociometric sensors, machine learning, Trier social stress test, feature analysis, real-time detection
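
As a rough illustration of combining the two sensor systems before classification, the sketch below concatenates physiological and sociometric features and classifies with k-nearest neighbors, one of the classifiers named in the abstract. The feature layout (heart rate, skin conductance, speech fraction, movement), the sample values, and the choice of simple concatenation are all assumptions made for the example; the abstract does not say how the paper fuses the modalities.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], str]


def fuse(physio: Sequence[float], socio: Sequence[float]) -> List[float]:
    """Feature-level fusion: concatenate the two sensor feature vectors."""
    return list(physio) + list(socio)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train: Sequence[Sample], query: Sequence[float], k: int = 3) -> str:
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical data: [heart rate, skin conductance] + [speech fraction, movement].
train = [
    (fuse([60, 0.20], [0.50, 0.1]), "neutral"),
    (fuse([62, 0.30], [0.60, 0.2]), "neutral"),
    (fuse([64, 0.25], [0.40, 0.1]), "neutral"),
    (fuse([95, 0.90], [0.10, 0.7]), "stress"),
    (fuse([98, 1.00], [0.20, 0.8]), "stress"),
    (fuse([92, 0.80], [0.15, 0.6]), "stress"),
]
print(knn_predict(train, fuse([94, 0.85], [0.1, 0.7])))
```

With raw units like these, heart rate dominates the distance, so in practice the features would normally be standardized first.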

277. ❌ Evaluating Differential Privacy Against Membership Inference in Federated Learning: Insights from the NIST Genomics Red Team Challenge

作者: Gustavo de Carvalho Bertoli 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12737v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

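Scoring rationale: The paper studies differential privacy as a defense against membership inference attacks in federated learning. This is privacy-preserving machine learning and is unrelated to most of the large-model keywords (LLM, MoE, RLHF, and so on). The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the work is set in the NIST genomics challenge environment and touches a bioinformatics application. The paper's core is the privacy attack and defense method rather than a scientific AI application, so this keyword receives a relevance of 5 (some relevance).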

!!! tip deepseek-chat TL;DR

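The paper evaluates how well differential privacy defends against membership inference attacks in federated learning. It proposes a stacking attack that ensembles seven black-box estimators and shows, across privacy configurations, that the attack still yields measurable membership leakage in the low-privacy setting.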

摘要翻译

尽管联邦学习（FL）减轻了直接数据暴露的风险，但训练所得的模型仍易受成员推理攻击（MIA）的影响。本文基于2025年NIST基因组学隐私保护联邦学习（PPFL）红队演练环境，对差分隐私（DP）作为FL中防御MIA的机制进行了实证评估。为提高推理准确性，我们提出了一种堆叠攻击策略，该方法集成七个黑盒估计器，基于预测概率和交叉熵损失训练元分类器。我们在三种隐私配置下评估该策略对目标模型的攻击效果：未受保护的卷积神经网络（CNN，$ε=\infty$）、低隐私DP模型（$ε=200$）和高隐私DP模型（$ε=10$）。该攻击在无DP和低隐私设置下均优于所有基线方法，且关键的是，在$ε=200$时仍能检测到可测量的成员信息泄露，而单信号LiRA基线在此条件下已失效。在独立第三方基准测试中，这些结果实证刻画了基于堆叠的推理攻击在FL中经过校准的不同DP层级下如何逐步减弱。

摘要 (Abstract)

While Federated Learning (FL) mitigates direct data exposure, the resulting trained models remain susceptible to membership inference attacks (MIAs). This paper presents an empirical evaluation of Differential Privacy (DP) as a defense mechanism against MIAs in FL, leveraging the environment of the 2025 NIST Genomics Privacy-Preserving Federated Learning (PPFL) Red Teaming Event. To improve inference accuracy, we propose a stacking attack strategy that ensembles seven black-box estimators to train a meta-classifier on prediction probabilities and cross-entropy losses. We evaluate this methodology against target models under three privacy configurations: an unprotected convolutional neural network (CNN, $ε=\infty$), a low-privacy DP model ($ε=200$), and a high-privacy DP model ($ε=10$). The attack outperforms all baselines in the No DP and Low Privacy settings and, critically, maintains measurable membership leakage at $ε=200$ where a single-signal LiRA baseline collapses. Evaluated on an independent third-party benchmark, these results provide an empirical characterisation of how stacking-based inference degrades across calibrated DP tiers in FL.

关键词: Federated Learning, Differential Privacy, Membership Inference Attacks, Privacy-Preserving Machine Learning, Stacking Attack, Genomics, Model Security, Empirical Evaluation
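
To make the attack signals in the abstract concrete, the sketch below derives two black-box features per sample (top predicted probability and cross-entropy loss) and fits a small logistic-regression meta-classifier on them. It is a simplified stand-in: the seven base estimators are not described in the abstract and are omitted, and every function name, hyperparameter, and data point here is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def attack_features(probs: Sequence[float], label: int) -> List[float]:
    """Two black-box signals for one sample: top confidence and cross-entropy loss."""
    loss = -math.log(max(probs[label], 1e-12))  # clamp to avoid log(0)
    return [max(probs), loss]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_meta(X: Sequence[Sequence[float]], y: Sequence[int],
               lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit a logistic-regression meta-classifier with full-batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def membership_score(w: Sequence[float], b: float,
                     probs: Sequence[float], label: int) -> float:
    """Estimated probability that the sample was part of the training set."""
    x = attack_features(probs, label)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical shadow data: (target-model probabilities, true label), 1 = member.
shadow = [([0.95, 0.05], 0), ([0.10, 0.90], 1), ([0.97, 0.03], 0),
          ([0.60, 0.40], 0), ([0.45, 0.55], 1), ([0.70, 0.30], 0)]
X = [attack_features(p, l) for p, l in shadow]
w, b = train_meta(X, [1, 1, 1, 0, 0, 0])
print(membership_score(w, b, [0.99, 0.01], 0))
```

Under differential privacy the gap between member and non-member losses shrinks, which is the degradation across $ε$ tiers that the abstract characterizes.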

278. ❌ Transformer Based Machine Fault Detection From Audio Input

作者: Kiran Voderhobli Holla 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12733v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

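Scoring rationale: The paper studies Transformer-based machine fault detection, an industrial application of AI. It is given some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points) on the grounds that AI for Science includes industrial applications. The paper does not address large language models (LLMs), MoE, SLMs, scaling laws, pre-training, post-training, alignment, RLHF, PEFT, RAG, long context, inference acceleration, hallucination mitigation, interpretability, world models, model merging, or in-context learning, so all other keywords receive 0.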

!!! tip deepseek-chat TL;DR

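The paper studies Transformer-based architectures for detecting machine faults from sound, compares them with traditional CNN methods, and demonstrates that Transformers are effective for sound data analysis.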

摘要翻译

近年来，声音人工智能技术正被越来越多地用于预测机器故障。通过在目标机器上安装麦克风，可以从现场实时获取机器运行状态数据。传统上，卷积神经网络架构被用于分析从捕获声音生成的频谱图图像，并预测机器是否按预期运行。尽管卷积神经网络存在局部性和参数共享等可能不完全适用于频谱图分析的固有偏差，但经验表明其效果良好。自2020年视觉Transformer模型在图像处理领域成功应用以来，学术界对在声音人工智能领域运用此类Transformer模型产生了浓厚兴趣。由于基于Transformer的架构具有显著更低的归纳偏差，在数据充足的情况下，预计其在频谱图分析任务上的表现将优于卷积神经网络。本文论证了Transformer驱动架构在分析声音数据方面的有效性，并在机器故障检测这一具体任务中，将其生成的嵌入表示与卷积神经网络进行了对比。

摘要 (Abstract)

In recent years, Sound AI is being increasingly used to predict machine failures. By attaching a microphone to the machine of interest, one can get real time data on machine behavior from the field. Traditionally, Convolutional Neural Net (CNN) architectures have been used to analyze spectrogram images generated from the sounds captured and predict if the machine is functioning as expected. CNN architectures seem to work well empirically even though they have biases like locality and parameter-sharing which may not be completely relevant for spectrogram analysis. With the successful application of transformer-based models in the field of image processing starting with Vision Transformer (ViT) in 2020, there has been significant interest in leveraging these in the field of Sound AI. Since transformer-based architectures have significantly lower inductive biases, they are expected to perform better than CNNs at spectrogram analysis given enough data. This paper demonstrates the effectiveness of transformer-driven architectures in analyzing Sound data and compares the embeddings they generate with CNNs on the specific task of machine fault detection.

关键词: Transformer, Machine Fault Detection, Audio Input, Sound AI, Spectrogram Analysis, Vision Transformer (ViT), CNN Comparison, Embeddings
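
A ViT-style model consumes a spectrogram as a sequence of flattened patches rather than through convolutions. The sketch below shows only that tokenization step on a small 2D list; the patch size is arbitrary, and the linear projection, position embeddings, and Transformer layers that would follow are omitted. It illustrates the general ViT idea, not the specific architecture evaluated in the paper.

```python
from typing import List


def patchify(spec: List[List[float]], patch_h: int, patch_w: int) -> List[List[float]]:
    """Split a (frequency x time) spectrogram into non-overlapping flattened patches."""
    rows, cols = len(spec), len(spec[0])
    if rows % patch_h or cols % patch_w:
        raise ValueError("spectrogram size must be divisible by the patch size")
    patches = []
    for r in range(0, rows, patch_h):          # walk patch rows (frequency axis)
        for c in range(0, cols, patch_w):      # walk patch columns (time axis)
            patches.append([spec[r + i][c + j]
                            for i in range(patch_h) for j in range(patch_w)])
    return patches


# Toy 4x4 "spectrogram" with values 0..15.
spec = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = patchify(spec, 2, 2)
print(len(tokens), tokens[0])
```

Unlike a convolution, nothing in this step ties neighboring patches together, which is the lower inductive bias the abstract refers to.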

作者: Malik Amir, Sourangshu Ghosh 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12725v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

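Scoring rationale: The paper studies higher-order geometric corrections to classical covariance asymptotics, which is theoretical work in information geometry and statistical inference. It deals with purely mathematical and statistical concepts such as Fisher information, Riemannian manifolds, curvature tensors, and singular models, and does not involve large models, deep learning, AI applications, or any modern machine learning technique. All scoring keywords target large-model technology, so every keyword has a relevance of 0.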

!!! tip deepseek-chat TL;DR

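Using information geometry, the paper derives higher-order curvature corrections to the classical Fisher-information covariance asymptotics in a Riemannian manifold framework, and extends them to singular models to explain learning rates and error behavior.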

摘要翻译

经典的费希尔信息渐近理论通过对数似然的局部二次逼近来描述正则有效估计量的协方差，因而仅捕捉一阶几何特性。在包含混合模型、弯曲指数族、潜变量模型及流形约束参数空间等弯曲模型中，有限样本行为可能系统性地偏离这些预测。我们通过将正则参数族视为具有费希尔-拉奥度量的黎曼流形$(Θ,g)$，并借助平方根密度映射浸入$L^2(μ)$空间，提出了一种坐标不变且曲率感知的改进方法。在适当的正则性与矩假设下，我们针对得分根型一阶有效估计量，推导出对主导项$n^{-1}I(θ)^{-1}$协方差的$n^{-2}$阶修正。该修正由一个张量$P_{ij}$主导，该张量可规范分解为三部分：费希尔-拉奥曲率张量的内蕴里奇型收缩、第二基本形式的外蕴格拉姆型收缩，以及刻画浸入几何无法单独确定的高阶概率信息的海林格差异张量。其中外蕴项半正定，完整修正具有光滑重参数化不变性，且在完全指数族中恒为零。随后我们将该框架推广至费希尔信息退化的奇异模型。基于加法正规交叉假设下的奇点消解理论，我们描述了消解后的度量、实对数典范阈值在学习速率与后验均方误差中的作用，以及在消解空间上建立的曲率协方差展开式，该展开以正则理论为特例。此框架还提出了弱可识别性的几何诊断方法，以及面向正则化与优化的曲率感知原则。

摘要 (Abstract)

Classical Fisher-information asymptotics describe the covariance of regular efficient estimators through the local quadratic approximation of the log-likelihood, and thus capture first-order geometry only. In curved models, including mixtures, curved exponential families, latent-variable models, and manifold-constrained parameter spaces, finite-sample behavior can deviate systematically from these predictions. We develop a coordinate-invariant, curvature-aware refinement by viewing a regular parametric family as a Riemannian manifold $(Θ,g)$ with Fisher–Rao metric, immersed in $L^2(μ)$ through the square-root density map. Under suitable regularity and moment assumptions, we derive an $n^{-2}$ correction to the leading $n^{-1}I(θ)^{-1}$ covariance term for score-root, first-order efficient estimators. The correction is governed by a tensor $P_{ij}$ that decomposes canonically into three parts, an intrinsic Ricci-type contraction of the Fisher–Rao curvature tensor, an extrinsic Gram-type contraction of the second fundamental form, and a Hellinger discrepancy tensor encoding higher-order probabilistic information not determined by immersion geometry alone. The extrinsic term is positive semidefinite, the full correction is invariant under smooth reparameterization, and it vanishes identically for full exponential families. We then extend the picture to singular models, where Fisher information degenerates. Using resolution of singularities under an additive normal crossing assumption, we describe the resolved metric, the role of the real log canonical threshold in learning rates and posterior mean-squared error, and a curvature-based covariance expansion on the resolved space that recovers the regular theory as a special case. This framework also suggests geometric diagnostics of weak identifiability and curvature-aware principles for regularization and optimization.

关键词: Fisher information, covariance asymptotics, information geometry, Riemannian manifold, curvature correction, singular models, real log canonical threshold
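
Read schematically, the abstract describes a two-term covariance expansion. The display below is only a shorthand for that statement: the remainder term, the additive form of the decomposition, and the symbols $R_{ij}$, $E_{ij}$, and $H_{ij}$ for the intrinsic Ricci-type, extrinsic Gram-type, and Hellinger discrepancy parts are notational assumptions, and the abstract gives no coefficients or signs.

$$
\operatorname{Cov}(\hat{θ}_n) = n^{-1} I(θ)^{-1} + n^{-2}\, C(θ) + o(n^{-2}),
\qquad C(θ) \text{ determined by } P_{ij} = R_{ij} + E_{ij} + H_{ij}.
$$

For a full exponential family the abstract states that the correction vanishes identically, so the expansion reduces to the classical $n^{-1} I(θ)^{-1}$ term at this order.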

280. ❌ Monte Carlo Stochastic Depth for Uncertainty Estimation in Deep Learning

作者: Adam T. Müller, Tobias Rögelein, Nicolaj C. Stache 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12719v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

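Scoring rationale: The paper focuses on uncertainty quantification (UQ) in deep learning, specifically repurposing Stochastic Depth for Monte Carlo approximate Bayesian inference and benchmarking it on object detection. All keywords concern large models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, inference acceleration, hallucination mitigation, and so on) or specific scientific AI applications (such as bioinformatics). The paper does not touch large models, language models, or any technique named in the keywords; it is about Bayesian approximation for conventional deep neural networks (such as YOLO and RT-DETR), so every keyword has a relevance of 0.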

!!! tip deepseek-chat TL;DR

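The paper studies Monte Carlo Stochastic Depth (MCSD) as a theoretically grounded and empirically effective Bayesian approximation for uncertainty quantification in modern deep learning, and shows on object detection that its predictive accuracy and calibration are competitive with Monte Carlo Dropout.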

摘要翻译

在安全关键系统中部署深度神经网络需要可靠且高效的不确定性量化方法。一种实用且广泛采用的不确定性量化策略是将随机正则化器重新用作可扩展的近似贝叶斯推断方法，例如蒙特卡洛丢弃法（Monte Carlo Dropout, MCD）和蒙特卡洛丢弃块法（MC-DropBlock, MCDB）。然而，对于随机深度（Stochastic Depth, SD）这一内置于大多数现代架构中基于残差的主干网络的重要正则化器，该范式仍未被充分探索。尽管先前的研究已证明其在分割任务中的实证潜力，但其与贝叶斯变分推断的形式化理论联系，以及在复杂多任务问题（如目标检测）上的基准测试仍然缺失。本文首先从理论上揭示了蒙特卡洛随机深度（Monte Carlo Stochastic Depth, MCSD）与基于原理的近似变分推断之间的联系。随后，我们使用COCO和COCO-O数据集，在先进检测器（YOLO, RT-DETR）上首次对MCSD与MCD和MCDB进行了全面的实证基准测试。我们的结果表明，MCSD是一种鲁棒且计算高效的方法，能够实现极具竞争力的预测准确率（mAP），尤其在校准指标（ECE）和不确定性排序（AUARC）方面相比MCD略有提升。因此，我们将MCSD确立为一种理论基础坚实且经过实证验证的、适用于现代深度学习的高效贝叶斯近似工具。

摘要 (Abstract)

The deployment of deep neural networks in safety-critical systems necessitates reliable and efficient uncertainty quantification (UQ). A practical and widespread strategy for UQ is repurposing stochastic regularizers as scalable approximate Bayesian inference methods, such as Monte Carlo Dropout (MCD) and MC-DropBlock (MCDB). However, this paradigm remains under-explored for Stochastic Depth (SD), a regularizer integral to the residual-based backbones of most modern architectures. While prior work demonstrated its empirical promise for segmentation, a formal theoretical connection to Bayesian variational inference and a benchmark on complex, multi-task problems like object detection are missing. In this paper, we first provide theoretical insights connecting Monte Carlo Stochastic Depth (MCSD) to principled approximate variational inference. We then present the first comprehensive empirical benchmark of MCSD against MCD and MCDB on state-of-the-art detectors (YOLO, RT-DETR) using the COCO and COCO-O datasets. Our results position MCSD as a robust and computationally efficient method that achieves highly competitive predictive accuracy (mAP), notably yielding slight improvements in calibration (ECE) and uncertainty ranking (AUARC) compared to MCD. We thus establish MCSD as a theoretically-grounded and empirically-validated tool for efficient Bayesian approximation in modern deep learning.

关键词: Monte Carlo Stochastic Depth, uncertainty quantification, Bayesian inference, deep neural networks, object detection, calibration, variational inference, stochastic regularizers
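
The core mechanism of Monte Carlo Stochastic Depth, keeping the random dropping of residual branches active at inference and aggregating several stochastic passes, can be sketched on a toy scalar network. The scalar "blocks", the single shared survival probability, and the absence of any rescaling of kept branches are simplifying assumptions; real detectors such as those benchmarked in the paper operate on tensors and may schedule survival probabilities per layer.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple

Block = Callable[[float], float]


def stochastic_forward(x: float, blocks: Sequence[Block],
                       survival_prob: float, rng: random.Random) -> float:
    """One pass through residual blocks; each branch is kept with survival_prob."""
    for f in blocks:
        if rng.random() < survival_prob:
            x = x + f(x)  # residual branch kept
        # otherwise the branch is dropped and only the identity skip remains
    return x


def mc_stochastic_depth(x: float, blocks: Sequence[Block], survival_prob: float,
                        n_samples: int, seed: int = 0) -> Tuple[float, float]:
    """Predictive mean and variance over n_samples stochastic forward passes."""
    rng = random.Random(seed)
    outs = [stochastic_forward(x, blocks, survival_prob, rng) for _ in range(n_samples)]
    return statistics.fmean(outs), statistics.pvariance(outs)


# Toy residual branches standing in for learned layers.
blocks = [lambda v: 0.5 * v, lambda v: 1.0]
mean, var = mc_stochastic_depth(2.0, blocks, survival_prob=0.8, n_samples=200)
print(mean, var)
```

The variance across passes serves as the uncertainty estimate; with `survival_prob=1.0` the network is deterministic and the variance is zero.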

281. ❌ FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving

作者: Baoyun Wang, Zhuoren Li, Ming Liu, Xinrui Zhang, Bo Leng, Lu Xiong 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12656v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

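Scoring rationale: The paper focuses on an end-to-end diffusion planning method for autonomous driving (FeaXDrive) and studies the physical feasibility of generated trajectories, including geometric irregularities, kinematic constraints, and consistency with the drivable area. All scoring keywords concern large language models (LLMs), the technical principles of deep learning, or scientific AI applications, whereas the core of this paper is a specific application of diffusion models to trajectory planning. It does not address LLMs, MoE, SLMs, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agents, quantization, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or bio/cheminformatics, so every keyword has a relevance of 0.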

!!! tip deepseek-chat TL;DR

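To address the insufficient physical feasibility of trajectories generated by end-to-end diffusion planning for autonomous driving, the paper proposes FeaXDrive, which combines trajectory-centric modeling, adaptive curvature-constrained training, drivable-area guidance, and feasibility-aware GRPO post-training. On the NAVSIM benchmark it substantially improves planning performance and trajectory-space feasibility.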

摘要翻译

端到端扩散规划在自动驾驶领域展现出强大潜力，但生成轨迹的物理可行性仍未得到充分解决。具体而言，生成轨迹可能出现局部几何不规则性、违反轨迹级运动学约束或偏离可行驶区域等问题，这表明当前扩散规划中普遍采用的以噪声为中心的建模方式，尚未与能更自然表征可行性的轨迹空间充分对齐。为解决这一问题，我们提出FeaXDrive——一种面向端到端自动驾驶的、具备可行性感知能力的以轨迹为中心的扩散规划方法。其核心思想是在整个扩散过程中，将洁净轨迹作为统一对象进行可行性感知建模。基于这种以轨迹为中心的建模框架，FeaXDrive整合了以下关键技术：通过自适应曲率约束训练提升内在几何与运动学可行性；在反向扩散采样中引入可行驶区域引导以增强与可行驶区域的一致性；以及采用可行性感知的GRPO后训练方法，在平衡轨迹空间可行性的同时进一步提升规划性能。在NAVSIM基准测试上的实验表明，FeaXDrive在实现优异闭环规划性能的同时，显著提升了轨迹空间可行性。这些发现凸显了在端到端扩散规划中显式建模轨迹空间可行性的重要性，并为构建更可靠、更符合物理规律的自动驾驶规划器迈出了关键一步。

摘要 (Abstract)

End-to-end diffusion planning has shown strong potential for autonomous driving, but the physical feasibility of generated trajectories remains insufficiently addressed. In particular, generated trajectories may exhibit local geometric irregularities, violate trajectory-level kinematic constraints, or deviate from the drivable area, indicating that the commonly used noise-centric formulation in diffusion planning is not yet well aligned with the trajectory space where feasibility is more naturally characterized. To address this issue, we propose FeaXDrive, a feasibility-aware trajectory-centric diffusion planning method for end-to-end autonomous driving. The core idea is to treat the clean trajectory as the unified object for feasibility-aware modeling throughout the diffusion process. Built on this trajectory-centric formulation, FeaXDrive integrates adaptive curvature-constrained training to improve intrinsic geometric and kinematic feasibility, drivable-area guidance within reverse diffusion sampling to enhance consistency with the drivable area, and feasibility-aware GRPO post-training to further improve planning performance while balancing trajectory-space feasibility. Experiments on the NAVSIM benchmark show that FeaXDrive achieves strong closed-loop planning performance while substantially improving trajectory-space feasibility. These findings highlight the importance of explicitly modeling trajectory-space feasibility in end-to-end diffusion planning and provide a step toward more reliable and physically grounded autonomous driving planners.

关键词: autonomous driving, diffusion planning, trajectory feasibility, end-to-end planning, kinematic constraints, drivable area guidance, GRPO post-training, NAVSIM benchmark
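
One simple way to quantify the "local geometric irregularities" and kinematic limits mentioned in the abstract is the discrete curvature of consecutive waypoints. The sketch below uses the Menger curvature of each point triple and compares the maximum against a bound; the bound `kappa_max` and the waypoints are hypothetical, and this is a generic feasibility check rather than the adaptive curvature-constrained training objective used by FeaXDrive.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def menger_curvature(a: Point, b: Point, c: Point) -> float:
    """Curvature (1 / radius) of the circle through three 2D points."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    ab = math.hypot(b[0] - a[0], b[1] - a[1])
    bc = math.hypot(c[0] - b[0], c[1] - b[1])
    ca = math.hypot(a[0] - c[0], a[1] - c[1])
    denom = ab * bc * ca
    # |cross| is twice the triangle area, so curvature = 4 * area / (ab * bc * ca).
    return 0.0 if denom == 0.0 else 2.0 * abs(cross) / denom


def max_curvature(traj: Sequence[Point]) -> float:
    """Largest discrete curvature over all consecutive waypoint triples."""
    return max((menger_curvature(traj[i], traj[i + 1], traj[i + 2])
                for i in range(len(traj) - 2)), default=0.0)


def is_feasible(traj: Sequence[Point], kappa_max: float) -> bool:
    """True if the trajectory never bends more sharply than kappa_max."""
    return max_curvature(traj) <= kappa_max


# Hypothetical planned waypoints in metres.
traj = [(0.0, 0.0), (5.0, 0.1), (10.0, 0.4), (15.0, 0.9)]
print(max_curvature(traj), is_feasible(traj, kappa_max=0.2))
```

A curvature bound corresponds to a minimum turning radius, which is why it is a natural trajectory-space proxy for kinematic feasibility.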

282. ❌ Robust Semi-Supervised Temporal Intrusion Detection for Adversarial Cloud Networks

作者: Anasuya Chattopadhyay, Daniel Reti, Hans D. Schotten 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12655v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

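Scoring rationale: The paper focuses on machine learning methods for network intrusion detection systems (NIDS), specifically a semi-supervised learning framework for cloud network environments. It covers network traffic analysis, adversarial environments, temporal drift handling, consistency regularization, and pseudo-labeling, and is applied research in network security. All given keywords concern large language models (LLMs), the technical principles of deep learning, or AI for Science (such as bioinformatics). The paper does not involve any of these, so every keyword has a relevance of 0.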

!!! tip deepseek-chat TL;DR

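The paper proposes a robust semi-supervised temporal learning framework that addresses the degradation of network intrusion detection in cloud networks caused by adversarial contamination and temporal drift, and shows on several public datasets that it outperforms existing methods in detection performance, label efficiency, and robustness.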

摘要翻译

云网络日益依赖基于机器学习的网络入侵检测系统来抵御不断演变的网络威胁。然而，实际部署面临着标注数据有限、流量非稳态以及攻击者自适应等挑战。尽管半监督学习能够缓解标注稀缺问题，但现有方法大多隐含假设未标注流量是良性且稳态的，导致其在对抗性云环境中性能下降。本文提出一种面向云入侵检测的鲁棒半监督时序学习框架，该框架明确处理未标注网络流量中的对抗性污染与时序漂移问题。该框架基于流级数据运行，将监督学习与一致性正则化、置信度感知伪标注及选择性时序不变性相结合，在保守利用未标注流量的同时抑制不可靠样本。通过挖掘网络流量的时序结构特征，所提方法提升了跨异构云环境的鲁棒性与泛化能力。在有限标注条件下对公开数据集（CIC-IDS2017、CSE-CIC-IDS2018和UNSW-NB15）的广泛评估表明，该框架在检测性能、标注效率以及对对抗性与非稳态流量的适应能力方面均持续优于当前最先进的监督与半监督网络入侵检测系统。

摘要 (Abstract)

Cloud networks increasingly rely on machine learning based Network Intrusion Detection Systems to defend against evolving cyber threats. However, real-world deployments are challenged by limited labeled data, non-stationary traffic, and adaptive adversaries. While semi-supervised learning can alleviate label scarcity, most existing approaches implicitly assume benign and stationary unlabeled traffic, leading to degraded performance in adversarial cloud environments. This paper proposes a robust semi-supervised temporal learning framework for cloud intrusion detection that explicitly addresses adversarial contamination and temporal drift in unlabeled network traffic. Operating on flow-level data, this framework combines supervised learning with consistency regularization, confidence-aware pseudo-labeling, and selective temporal invariance to conservatively exploit unlabeled traffic while suppressing unreliable samples. By leveraging the temporal structure of network flows, the proposed method improves robustness and generalization across heterogeneous cloud environments. Extensive evaluations on publicly available datasets (CIC-IDS2017, CSE-CIC-IDS2018, and UNSW-NB15) under limited-label conditions demonstrate that the proposed framework consistently outperforms state-of-the-art supervised and semi-supervised network intrusion detection systems in detection performance, label efficiency, and resilience to adversarial and non-stationary traffic.

关键词: Network Intrusion Detection, Semi-Supervised Learning, Adversarial Cloud Networks, Temporal Drift, Consistency Regularization, Pseudo-Labeling, Cloud Security, Machine Learning
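
Confidence-aware pseudo-labeling, one of the components listed in the abstract, can be illustrated with a threshold rule: an unlabeled flow receives a pseudo-label only when the model's top class probability is high enough, and is otherwise left out of the unsupervised loss. The threshold value of 0.95 and the two-class layout are assumptions for the example; the paper's actual confidence criterion is not given in the abstract.

```python
from typing import List, Optional, Sequence


def pseudo_labels(prob_batch: Sequence[Sequence[float]],
                  threshold: float = 0.95) -> List[Optional[int]]:
    """Return the argmax class for confident predictions, None for the rest."""
    labels: List[Optional[int]] = []
    for probs in prob_batch:
        best = max(range(len(probs)), key=probs.__getitem__)
        # Low-confidence flows are suppressed instead of being trusted as labels.
        labels.append(best if probs[best] >= threshold else None)
    return labels


# Hypothetical class probabilities for three unlabeled flows (benign, attack).
batch = [[0.98, 0.02], [0.60, 0.40], [0.03, 0.97]]
print(pseudo_labels(batch))
```

Raising the threshold trades label coverage for reliability, which matters when unlabeled traffic may be adversarially contaminated.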

283. ❌ Data-driven Reachable Set Estimation with Tunable Adversarial and Wasserstein Distributional Guarantees

作者: Georgios Pantazis, Michelle S. Chong 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12654v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

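Scoring rationale: The paper studies sample-based reachable set estimation for unknown discrete-time dynamical systems, which belongs to control theory and optimization, and uses mathematical tools such as scenario optimization, adversarial robustness, and the Wasserstein distance. It does not touch on large language models, deep learning, AI for Science, or any technical concept among the scoring keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a scenario-optimization-based method for estimating finite-horizon reachable sets of unknown discrete-time dynamical systems. By introducing slack variables and an adversarially robust extension, it offers a tunable trade-off between reachable set size and trajectory inclusion probability, and provides probabilistic guarantees under adversarial perturbations and distribution shift.

Abstract (Chinese translation)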

我们研究仅利用采样状态轨迹对未知离散时间动力系统进行有限时域可达集估计的问题。不同于将场景优化视为黑箱工具，本文展示了如何将其专门应用于可达集估计——该任务要求基于完整轨迹学习一族集合，同时为整个时域保留未来轨迹包含的概率保证。为此，我们提出一种引入松弛变量的松弛场景规划，可在可达集规模与时域内样本外轨迹包含率之间实现可调节的权衡，从而降低对异常值的敏感性。借助对抗鲁棒场景优化领域的最新成果，我们将该框架扩展至考虑观测轨迹的有界对抗扰动情形，并推导出未来轨迹包含的后验概率保证。当概率分布在Wasserstein距离意义下发生偏移时，我们获得了理论概率保证衰减程度的显式界。针对不同几何结构（即$p$-范数球、椭球体和zonotope），我们推导出可处理的凸重构形式，并通过仿真验证了理论结果。

Abstract

We study finite horizon reachable set estimation for unknown discrete-time dynamical systems using only sampled state trajectories. Rather than treating scenario optimization as a black-box tool, we show how it can be tailored to reachable set estimation, where one must learn a family of sets based on whole trajectories, while preserving probabilistic guarantees on future trajectory inclusion for the entire horizon. To this end, we formulate a relaxed scenario program with slack variables that yields a tunable trade-off between reachable set size and out-of-sample trajectory inclusion over the horizon, thereby reducing sensitivity to outliers. Leveraging the recent results in adversarially robust scenario optimization, we then extend this formulation to account for bounded adversarial perturbations of the observed trajectories and derive a posteriori probabilistic guarantees on future trajectory inclusion. When probability distribution shifts in the Wasserstein distance occur, we obtain an explicit bound on how gracefully the theoretical probabilistic guarantees degrade. For different geometries, i.e., $p$-norm balls, ellipsoids, and zonotopes, we derive tractable convex reformulations and corroborate our theoretical results in simulation.

Keywords: reachable set estimation, scenario optimization, adversarial robustness, Wasserstein distance, probabilistic guarantees, dynamical systems, trajectory inclusion, convex reformulations
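The trade-off between set size and trajectory inclusion described in the abstract can be illustrated with a toy one-parameter relaxation. This is a sketch under strong assumptions, not the paper's formulation: the per-step centers are taken as known, a single Euclidean radius `r` is shared across the horizon, and the penalty weight `rho`, the helper names, and the sample trajectories are all hypothetical. Under those assumptions the problem of minimizing `r + rho * sum(max(0, d_i - r))` has a closed-form minimizer, so no solver is needed.

```python
import math

def trajectory_deviation(traj, centers):
    """Largest Euclidean distance between a trajectory and the per-step centers."""
    return max(math.dist(x, c) for x, c in zip(traj, centers))

def relaxed_radius(deviations, rho):
    """Smallest minimiser of r + rho * sum(max(0, d - r)) over r >= 0.

    The objective is convex and piecewise linear; its slope is 1 - rho * m,
    where m is the number of deviations strictly above r. The slack can
    therefore "buy out" at most floor(1 / rho) of the largest deviations.
    """
    k = int(math.floor(1.0 / rho + 1e-12))
    ranked = sorted(deviations, reverse=True)
    return ranked[k] if k < len(ranked) else 0.0

# Hypothetical 1-D trajectories over a horizon of 3 steps; the last is an outlier.
trajs = [
    [(0.0,), (1.0,), (2.0,)],
    [(0.1,), (1.2,), (2.1,)],
    [(-0.1,), (0.9,), (1.8,)],
    [(0.0,), (4.0,), (9.0,)],
]
centers = [(0.0,), (1.0,), (2.0,)]  # assumed known for this illustration
devs = [trajectory_deviation(t, centers) for t in trajs]

r_strict = relaxed_radius(devs, rho=10.0)   # violations are expensive: cover everything
r_relaxed = relaxed_radius(devs, rho=1.0)   # one violation is tolerated: outlier excluded
print(r_strict, r_relaxed)
```

A large `rho` makes the set cover every sampled trajectory, including the outlier, while a smaller `rho` shrinks the set at the cost of excluding some samples.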

284. ❌ EEG-Based Multimodal Learning via Hyperbolic Mixture-of-Curvature Experts

Authors: Runhe Zhou, Shanglin Li, Guanxiang Huang, Xinliang Zhou, Qibin Zhao, Motoaki Kawanabe, Yi Ding, Cuntai Guan | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12579v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

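Scoring rationale: The paper proposes EEG-MoCE, a hyperbolic mixture-of-curvature experts framework for multimodal neurotechnology. It is highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (10 points) because its core contribution builds on an MoE architecture. The work belongs to electroencephalography (EEG) analysis, covering emotion recognition, sleep staging, and cognitive assessment, so it is partly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points), since EEG analysis can be viewed as an application of bioinformatics or scientific AI. The remaining keywords mainly concern large language model (LLM) techniques, training methods, inference optimization, and agent systems; the paper focuses on EEG signal processing and a specific MoE architecture and does not address those topics, so they score 0.

!!! tip deepseek-chat TL;DR

To address the challenge of representation learning over heterogeneous modalities in EEG-based multimodal learning, the paper proposes EEG-MoCE, a hyperbolic mixture-of-curvature experts framework. It adaptively models each modality's intrinsic geometry and fuses experts in a curvature-aware manner, achieving state-of-the-art performance on emotion recognition, sleep staging, and cognitive assessment.

Abstract (Chinese translation)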

基于脑电图（EEG）的多模态学习通过整合脑信号与互补模态来改善心理状态评估，具有重要的临床潜力。此类范式的有效性在很大程度上依赖于对异构模态的表征学习。对于基于脑电图的范式，一种有效方法是利用其层次结构，因为近期研究表明，脑电图及相关模态（如面部表情）均呈现出反映复杂认知过程的层次结构。然而，欧几里得嵌入因其平坦几何特性难以表征这些层次结构，而双曲空间凭借其指数增长特性天然适用于此类结构。本研究提出EEG-MoCE，一种新颖的双曲混合曲率专家框架，专为多模态神经技术设计。EEG-MoCE将每个模态分配到可学习曲率的双曲空间中的专家，从而自适应地建模其内在几何特征。随后通过曲率感知融合策略动态加权各专家，强调具有更丰富层次信息的模态。在基准数据集上的大量实验表明，EEG-MoCE在情绪识别、睡眠分期和认知评估等任务中均达到了最先进的性能水平。

Abstract

Electroencephalography (EEG)-based multimodal learning integrates brain signals with complementary modalities to improve mental state assessment, providing great clinical potential. The effectiveness of such paradigms largely depends on the representation learning on heterogeneous modalities. For EEG-based paradigms, one promising approach is to leverage their hierarchical structures, as recent studies have shown that both EEG and associated modalities (e.g., facial expressions) exhibit hierarchical structures reflecting complex cognitive processes. However, Euclidean embeddings struggle to represent these hierarchical structures due to their flat geometry, while hyperbolic spaces, with their exponential growth property, are naturally suited for them. In this work, we propose EEG-MoCE, a novel hyperbolic mixture-of-curvature experts framework designed for multimodal neurotechnology. EEG-MoCE assigns each modality to an expert in a learnable-curvature hyperbolic space, enabling adaptive modeling of its intrinsic geometry. A curvature-aware fusion strategy then dynamically weights experts, emphasizing modalities with richer hierarchical information. Extensive experiments on benchmark datasets demonstrate that EEG-MoCE achieves state-of-the-art performance, including emotion recognition, sleep staging, and cognitive assessment.

Keywords: EEG, multimodal learning, hyperbolic space, mixture-of-curvature experts, hierarchical structures, emotion recognition, sleep staging, cognitive assessment
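To make "learnable-curvature hyperbolic space" concrete, the sketch below computes the geodesic distance in a Poincaré ball of curvature `-c`. The abstract does not say which hyperbolic model EEG-MoCE uses, so the Poincaré ball is an assumption here; the formulas are the standard Möbius addition and distance, and the function names are illustrative.

```python
import math

def mobius_add(x, y, c):
    """Mobius addition of x and y in the Poincare ball of curvature -c (c > 0)."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def poincare_distance(x, y, c):
    """Geodesic distance; points must satisfy c * ||x||^2 < 1 and c * ||y||^2 < 1."""
    diff = mobius_add([-a for a in x], y, c)
    norm = math.sqrt(sum(d * d for d in diff))
    return 2.0 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm)

# With c = 1, the distance from the origin to (0.5, 0) equals ln(3).
print(poincare_distance([0.0, 0.0], [0.5, 0.0], 1.0))
```

As `c` approaches 0 the distance tends to twice the Euclidean distance, which is why a learnable `c` lets a model interpolate between nearly flat and strongly hyperbolic geometry.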

285. ❌ Instantiating Bayesian CVaR lower bounds in Interactive Decision Making Problems

Authors: Raghav Bongole, Tobias J. Oechtering, Mikael Skoglund | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12519v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

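Scoring rationale: The paper instantiates Bayesian CVaR lower bounds in concrete interactive decision-making problems and belongs to statistical decision theory and risk-sensitive learning. All keywords concern large models, deep learning, AI applications, or related technical principles, none of which the paper addresses, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper applies the generalized-Fano framework to concrete interactive problems and derives explicit Bayesian CVaR lower bounds, providing a practical lower-bound tool for interactive learning and risk-sensitive decision making.

Abstract (Chinese translation)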

近期研究为交互式统计决策中的先验预测（贝叶斯）条件风险价值（CVaR）下界建立了一个广义Fano框架。本文展示了如何将该框架应用于具体交互问题，并基于其抽象推论推导出显式的贝叶斯CVaR下界。我们的方法通过平方Hellinger距离比较困难模型与参考模型，并将参考铰链项的下界与两模型可区分性的界相结合。我们将此方法应用于典型示例（包括高斯多臂赌博机问题），得到了能清晰反映关键问题参数依赖关系的显式界。这些结果表明，广义Fano贝叶斯CVaR框架可作为交互式学习与风险敏感决策的实用下界推导工具。

Abstract

Recent work established a generalized-Fano framework for lower bounding prior-predictive (Bayesian) CVaR in interactive statistical decision making. In this paper, we show how to instantiate that framework in concrete interactive problems and derive explicit Bayesian CVaR lower bounds from its abstract corollaries. Our approach compares a hard model with a reference model using squared Hellinger distance, and combines a lower bound on a reference hinge term with a bound on the distinguishability of the two models. We apply this approach to canonical examples, including Gaussian bandits, and obtain explicit bounds that make the dependence on key problem parameters transparent. These results show how the generalized-Fano Bayesian CVaR framework can be used as a practical lower-bound tool for interactive learning and risk-sensitive decision making.

Keywords: Bayesian CVaR, interactive decision making, generalized-Fano framework, lower bounds, risk-sensitive decision making, Gaussian bandits, squared Hellinger distance, prior-predictive
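Two quantities named in the abstract are easy to compute in the Gaussian case, and a small sketch may help fix ideas. The conventions below are assumptions rather than the paper's: the squared Hellinger distance is normalized to lie in [0, 1] (some texts use twice this value), and CVaR is taken as the mean of the worst `1 - alpha` fraction of losses. The function names are illustrative.

```python
import math

def squared_hellinger_gaussians(mu1, mu2, sigma):
    """H^2 between N(mu1, sigma^2) and N(mu2, sigma^2), using H^2 = 1 - BC in [0, 1]."""
    return 1.0 - math.exp(-((mu1 - mu2) ** 2) / (8.0 * sigma ** 2))

def empirical_cvar(losses, alpha):
    """Mean of the worst (1 - alpha) fraction of losses (upper-tail convention)."""
    n = len(losses)
    # The small offset guards against floating-point error in (1 - alpha) * n.
    k = max(1, math.ceil((1.0 - alpha) * n - 1e-9))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k

# Two unit-variance arms whose means differ by 2 are moderately distinguishable.
print(squared_hellinger_gaussians(0.0, 2.0, 1.0))
# CVaR at level 0.8 over ten losses averages the two largest.
print(empirical_cvar(list(range(1, 11)), 0.8))
```

The closed form shows how distinguishability depends on the gap-to-noise ratio `(mu1 - mu2) / sigma`, the kind of parameter dependence the abstract says its explicit bounds make transparent.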

286. ❌ Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design

Authors: Leon Eshuijs, Shihan Wang, Antske Fokkens | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12500v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

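Scoring rationale: The paper studies the safety alignment of instruction-tuned LLMs under reinforcement learning (RL). It is highly relevant to "Large Language Models", "Instruction Tuning", and "RLHF" (10 points each), because it explicitly trains instruction-tuned LLMs with RL and examines RL-induced misalignment. It is partly relevant to "Small Language Models" (5 points), because the models studied span 0.5B–14B parameters and include smaller models. Other keywords such as MoE, Scaling Laws, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how on-policy RL training of instruction-tuned LLMs leads to harmful misalignment. It finds that model size acts as a safety buffer in some environments but enables greater harmful exploitation in others, and that most safety benchmarks fail to predict RL-induced misalignment.

Abstract (Chinese translation)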

强化学习（RL）下的规范博弈已知会导致大语言模型（LLMs）产生谄媚、操纵或欺骗性行为，但其发生的条件仍不明确。我们在三个环境中使用同策略强化学习训练了11个指令微调大语言模型（0.5B–14B），发现模型规模在某些环境中起到安全缓冲作用，但在其他环境中却促成了更具危害性的策略利用。受控消融实验将这种逆转归因于环境特异性特征，例如角色设定和隐性的可博弈性线索。我们进一步表明，大多数安全性基准测试无法预测强化学习引发的错位，除非在利用行为依赖于推断用户偏好的情况下，谄媚性（Sycophancy）分数能够提供预测。最后，我们发现同策略强化学习保留了模型自身生成分布中固有的安全缓冲，而这一缓冲在异策略设置中会被绕过。

Abstract

Specification gaming under Reinforcement Learning (RL) is known to cause LLMs to develop sycophantic, manipulative, or deceptive behavior, yet the conditions under which this occurs remain unclear. We train 11 instruction-tuned LLMs (0.5B–14B) with on-policy RL across 3 environments and find that model size acts as a safety buffer in some environments but enables greater harmful exploitation in others. Controlled ablations trace this reversal to environment-specific features such as role framing and implicit gameability cues. We further show that most safety benchmarks do not predict RL-induced misalignment, except in the case of Sycophancy scores when the exploit relies on inferring the user’s preference. Finally, we find that on-policy RL preserves a safety buffer inherent in the model’s own generation distribution, one that is bypassed during off-policy settings.

Keywords: Reinforcement Learning, LLMs, Instruction Tuning, Safety Alignment, Specification Gaming, Model Size, On-policy RL, Harmful Exploitation

287. ❌ Analyzing the Effect of Noise in LLM Fine-tuning

Authors: Lingfang Li, Procheta Sen | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12469v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

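Scoring rationale: The paper studies the effect of noise during LLM fine-tuning and is highly relevant to "Large Language Models" and "Post-training/Supervised Fine-tuning" (10 points each). It examines how data quality affects fine-tuning, so it is partly relevant to "Scaling Laws AND Data Quality" (5 points). It analyzes internal learning dynamics and attention patterns, so it is partly relevant to "Mechanistic Interpretability" (5 points). Other keywords such as MoE, SLMs, RLHF, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how noise (label noise, grammatical noise, and typographical noise) affects the fine-tuning of large language models (GPT-2, Qwen2, Llama-2). Label noise causes the largest performance degradation, while grammatical and typographical noise can occasionally yield mild regularization benefits.

Abstract (Chinese translation)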

微调是将预训练大语言模型适配至下游自然语言处理任务的主流范式。在实际应用中，微调数据集可能包含因标注错误、预处理伪影或自动化数据收集过程而产生的各类噪声。尽管先前研究主要集中于设计鲁棒的学习算法以减轻噪声条件下的性能衰减，但对于不同类型噪声如何影响大语言模型在微调过程中的内部学习动态，目前认知仍相对有限。本研究系统探究了噪声对三类预训练模型家族（GPT-2、Qwen2和Llama-2）在三种不同自然语言处理任务中行为表现的影响。我们引入了对应三种常见现实噪声类型的受控扰动：标签噪声、语法噪声与拼写噪声。除任务层面性能评估外，我们通过分析层级表征变化与注意力模式来理解噪声在神经网络中的传播机制。实验结果表明：标签损坏（即标签噪声）始终导致最显著的性能衰减，而语法噪声与拼写噪声偶尔能产生轻微的正则化效益。进一步研究发现，噪声效应主要集中于任务特定层，而注意力结构则保持相对稳定。

Abstract

Fine-tuning is the dominant paradigm for adapting pretrained large language models (LLMs) to downstream NLP tasks. In practice, fine-tuning datasets may contain various forms of noise arising from annotation errors, preprocessing artifacts, or automated data collection. While prior work has focused on designing robust learning algorithms to mitigate performance degradation under noisy conditions, comparatively little is known about how different types of noise affect the internal learning dynamics of LLMs during fine-tuning. In this work, we systematically study the impact of noise on model behavior across three pretrained model families (GPT-2, Qwen2 and Llama-2) and three diverse NLP tasks. We introduce controlled perturbations corresponding to three common real-world noise types: label noise, grammatical noise, and typographical noise. Beyond task-level performance, we analyze layer-wise representation changes and attention patterns to understand how noise propagates through the network. Our results show that corrupting labels (i.e. label noise) consistently causes the largest performance degradation, whereas grammatical noise and typographical noise can occasionally yield mild regularization benefits. We further find that noise effects are localized primarily to task-specific layers, while attention structures remain comparatively stable.

Keywords: LLM fine-tuning, noise analysis, label noise, grammatical noise, typographical noise, attention patterns, representation changes, NLP tasks
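The controlled perturbations mentioned in the abstract can be sketched as simple dataset transforms. The abstract does not specify how the authors inject noise, so the two functions below are assumptions chosen for illustration: label noise replaces a label with a different class uniformly at random, and typographical noise swaps adjacent letters. The names and the `rate` parameter are hypothetical.

```python
import random

def flip_labels(labels, num_classes, rate, rng):
    """With probability `rate`, replace each label with a different class."""
    noisy = []
    for y in labels:
        if rng.random() < rate:
            noisy.append(rng.choice([c for c in range(num_classes) if c != y]))
        else:
            noisy.append(y)
    return noisy

def add_typos(text, rate, rng):
    """With probability `rate`, swap a pair of adjacent letters (transposition typo)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so a letter moves at most once
        else:
            i += 1
    return "".join(chars)

rng = random.Random(0)  # fixed seed keeps the perturbation reproducible
print(flip_labels([0, 1, 2, 1], num_classes=3, rate=0.5, rng=rng))
print(add_typos("fine tuning with noisy data", rate=0.3, rng=rng))
```

Keeping the noise rate as an explicit parameter makes it possible to sweep corruption levels and compare degradation across noise types on the same data.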

288. ❌ A Bayesian Perspective on the Role of Epistemic Uncertainty for Delayed Generalization in In-Context Learning

Authors: Abdessamed Qchohi, Simone Rossi | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12434v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

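Scoring rationale: The paper studies delayed generalization (grokking) in in-context learning (ICL), a key capability of large models, transformer architectures in particular, so it is highly relevant to "Large Language Models" and "In-context Learning" (8 and 10 points, respectively). It analyzes uncertainty from a Bayesian perspective, which falls under model interpretability, so it is relevant to "Mechanistic Interpretability" (8 points). Other keywords such as MoE, SFT, RAG, and quantization are either not mentioned in the abstract or unrelated to the core study and score 0.

!!! tip deepseek-chat TL;DR

The study takes a Bayesian perspective on the delayed transition from memorization to generalization in transformer in-context learning. It finds that epistemic uncertainty collapses sharply when the model groks, making it a label-free diagnostic of generalization.

Abstract (Chinese translation)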

情境学习使Transformer模型能够在推理时通过少量示例适应新任务，而顿悟现象则表明这种泛化能力可能仅在长时间训练后突然涌现。本研究从贝叶斯视角出发，探讨情境学习中的任务泛化与顿悟机制，重点研究何种因素导致模型从记忆到泛化的延迟转变。具体而言，我们以模算术任务为研究对象——该任务要求Transformer仅通过情境示例推断潜在的线性函数——并分析训练过程中预测不确定性的演化规律。我们结合近似贝叶斯技术估计后验分布，系统考察不确定性在训练过程中的变化规律，以及任务多样性、上下文长度和上下文噪声对其产生的影响。研究发现，当模型发生顿悟时，认知不确定性会急剧坍缩，这使得不确定性可作为Transformer泛化能力的无标注实用诊断指标。此外，我们通过简化的贝叶斯线性模型提供理论支持，证明在渐近意义下延迟泛化与不确定性峰值均源于相同的谱机制，该机制将顿悟时间与不确定性动态变化相联结。

Abstract

In-context learning enables transformers to adapt to new tasks from a few examples at inference time, while grokking highlights that this generalization can emerge abruptly only after prolonged training. We study task generalization and grokking in in-context learning using a Bayesian perspective, asking what enables the delayed transition from memorization to generalization. Concretely, we consider modular arithmetic tasks in which a transformer must infer a latent linear function solely from in-context examples and analyze how predictive uncertainty evolves during training. We combine approximate Bayesian techniques to estimate the posterior distribution and we study how uncertainty behaves across training and under changes in task diversity, context length, and context noise. We find that epistemic uncertainty collapses sharply when the model groks, making uncertainty a practical label-free diagnostic of generalization in transformers. Additionally, we provide theoretical support with a simplified Bayesian linear model, showing that asymptotically both delayed generalization and uncertainty peaks arise from the same underlying spectral mechanism, which links grokking time to uncertainty dynamics.

Keywords: In-context Learning, Grokking, Transformer, Bayesian Perspective, Epistemic Uncertainty, Delayed Generalization, Modular Arithmetic
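One common way to separate epistemic from aleatoric uncertainty, given several posterior samples or ensemble members, is the entropy decomposition sketched below. The abstract says only that "approximate Bayesian techniques" are used, so treating the posterior as a finite set of member predictions is an assumption, and the function names are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def uncertainty_decomposition(member_probs):
    """Return (total, aleatoric, epistemic) for a list of per-member class distributions.

    total     = entropy of the averaged prediction
    aleatoric = average entropy of the individual predictions
    epistemic = total - aleatoric (mutual information between label and member)
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean_p = [sum(p[j] for p in member_probs) / n for j in range(k)]
    total = entropy(mean_p)
    aleatoric = sum(entropy(p) for p in member_probs) / n
    return total, aleatoric, total - aleatoric

# Members that confidently disagree: high epistemic uncertainty.
print(uncertainty_decomposition([[1.0, 0.0], [0.0, 1.0]]))
# Members that agree: epistemic uncertainty is zero.
print(uncertainty_decomposition([[0.7, 0.3], [0.7, 0.3]]))
```

Because the epistemic term needs no labels, tracking it across training checkpoints is one way to realize the label-free diagnostic that the abstract describes.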

289. ❌ VeriX-Anon: A Multi-Layered Framework for Mathematically Verifiable Outsourced Target-Driven Data Anonymization

Authors: Miit Daga, Swarna Priya Ramu | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12431v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

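Scoring rationale: The paper studies a verification framework for data anonymization and is largely unrelated to innovations in deep learning or large-model techniques or to applications in scientific domains. The only relevant keyword is "Mechanistic Interpretability OR Explainable AI", because the paper uses SHAP (an explainable-AI method) for utility verification; this is not the core technique, however (the core is the verification framework rather than XAI itself). It is given 10 points (partly relevant but not core). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes VeriX-Anon, a framework that uses multi-layered verification to check whether an outsourced data anonymization algorithm was faithfully executed. It detected deviations in 11 of 12 test scenarios.

Abstract (Chinese translation)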

随着组织日益将隐私敏感的数据转换外包给云服务提供商，目前尚缺乏实用机制使数据所有者能够验证外包算法是否被忠实执行。VeriX-Anon是一个面向外包式目标驱动k-匿名化的多层验证框架，它融合了三种正交机制：通过认证决策树（Authenticated Decision Tree）的默克尔式哈希实现确定性验证；通过在随机森林决策边界附近部署边界哨兵（Boundary Sentinels）与携带密码学标识符的精确复制双胞胎记录（Twins）实现概率性验证；以及通过可解释人工智能指纹技术进行效用验证——该方法使用瓦瑟斯坦距离（Wasserstein distance）比较匿名化前后SHAP值的分布。在三个跨领域数据集上针对三类攻击者（惰性攻击者Lazy：丢弃5%记录；愚钝攻击者Dumb：随机分割并伪造哈希；近似攻击者Approximate：随机分割但使用有效哈希）进行评估，VeriX-Anon在12个场景中成功检测出11种异常。没有任何单一层次能独立实现此效果。XAI层是唯一能检测近似攻击者的机制，在Adult和Bank数据集上成功，但在严重不平衡的Diabetes数据集上失效——该数据集的类别不平衡抑制了SHAP信号，这证实了自适应阈值设置的必要性。通过11点k值扫描实验表明，目标驱动匿名化比盲目匿名化保留了显著更高的数据效用（威尔科克森检验p=0.000977，科恩d值=1.96，平均F1分数差距+0.1574）。客户端验证可在百万行数据量下于一秒内完成。威胁模型涵盖三种经验评估的攻击者类型及一种理论类型（知情攻击者Informed Attacker），后者虽知晓陷阱嵌入机制但无法破解密码学盐值。哨兵规避概率从平衡数据集的接近零到不平衡数据集的0.52不等，而双胞胎层在所有测试场景中均能有效补偿这一局限性。

Abstract

Organisations increasingly outsource privacy-sensitive data transformations to cloud providers, yet no practical mechanism lets the data owner verify that the contracted algorithm was faithfully executed. VeriX-Anon is a multi-layered verification framework for outsourced Target-Driven k-anonymization combining three orthogonal mechanisms: deterministic verification via Merkle-style hashing of an Authenticated Decision Tree, probabilistic verification via Boundary Sentinels near the Random Forest decision boundary and exact-duplicate Twins with cryptographic identifiers, and utility-based verification via Explainable AI fingerprinting that compares SHAP value distributions before and after anonymization using the Wasserstein distance. Evaluated on three cross-domain datasets against Lazy (drops 5 percent of records), Dumb (random splitting, fake hash), and Approximate (random splitting, valid hash) adversaries, VeriX-Anon correctly detected deviations in 11 of 12 scenarios. No single layer achieved this alone. The XAI layer was the only mechanism that caught the Approximate adversary, succeeding on Adult and Bank but failing on the severely imbalanced Diabetes dataset where class imbalance suppresses the SHAP signal, confirming the need for adaptive thresholding. An 11-point k-sweep showed Target-Driven anonymization preserves significantly more utility than Blind anonymization (Wilcoxon $p = 0.000977$, Cohen’s $d = 1.96$, mean F1 gap $+0.1574$). Client-side verification completes under one second at one million rows. The threat model covers three empirically evaluated profiles and one theoretical profile (Informed Attacker) aware of trap embedding but unable to defeat the cryptographic salt. Sentinel evasion probability ranges from near-zero for balanced datasets to 0.52 for imbalanced ones, a limitation the twin layer compensates for in every tested scenario.

Keywords: data anonymization, verification framework, outsourced computation, k-anonymization, explainable AI, cryptographic verification, privacy preservation, target-driven anonymization
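The deterministic layer relies on Merkle-style hashing, which can be sketched generically as follows. This is not the paper's Authenticated Decision Tree: it hashes a flat list of serialized records into a binary Merkle tree, and the leaf and node prefixes, the duplication of the last node on odd levels, and the example records are all assumptions made for illustration.

```python
import hashlib

def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root hash of a binary Merkle tree over a non-empty list of byte strings."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    # Distinct prefixes keep leaf hashes and internal-node hashes from colliding.
    level = [_sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]

# Hypothetical serialized partitions returned by the provider.
records = [b"age<=30|income<50k", b"age<=30|income>=50k", b"age>30"]
expected = merkle_root(records)
tampered = merkle_root([records[0], b"age<=30|income>=60k", records[2]])
print(expected.hex())
print(expected == tampered)
```

A data owner who holds only the expected root can detect any altered record by recomputing the root, which is what makes this kind of check deterministic rather than probabilistic.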

290. ❌ Forecasting the Past: Gradient-Based Distribution Shift Detection in Trajectory Prediction

Authors: Michele De Vita, Julian Wiederer, Vasileios Belagiannis | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12425v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

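Scoring rationale: The paper focuses on distribution shift detection in trajectory prediction using self-supervised learning and gradient analysis, and belongs to computer vision and automated driving. It does not involve large language models, innovations in deep learning technical principles, or scientific AI applications, and has no direct relation to any scoring keyword.

!!! tip deepseek-chat TL;DR

The paper proposes a self-supervised, gradient-based method for detecting distribution shift in trajectory prediction models. It substantially improves detection performance on the Shifts and Argoverse datasets and can also be used for early collision detection.

Abstract (Chinese translation)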

轨迹预测模型常因训练与测试条件间的分布偏移而在现实世界自动驾驶中失效。此类分布偏移——无论是行为性还是环境性的——会导致模型在不熟悉情境下做出错误预测，从而构成严重风险。我们提出一种自监督方法，通过后训练方式在解码器上执行基于轨迹观测数据前半段预测后半段的自监督任务。该预测损失相对于解码器最终层的梯度L2范数被定义为识别分布偏移的评分指标。我们的方法首先不影响轨迹预测模型本身，确保对原始预测性能无干扰；其次在Shifts和Argoverse数据集上的轨迹预测分布偏移检测中展现出显著改进。此外，我们证明该方法也可用于在Highway模拟器中早期检测深度Q网络运动规划器的碰撞风险。源代码发布于https://github.com/Michedev/forecasting-the-past。

Abstract

Trajectory prediction models often fail in real-world automated driving due to distributional shifts between training and test conditions. Such distributional shifts, whether behavioural or environmental, pose a critical risk by causing the model to make incorrect forecasts in unfamiliar situations. We propose a self-supervised method that trains a decoder in a post-hoc fashion on the self-supervised task of forecasting the second half of observed trajectories from the first half. The L2 norm of the gradient of this forecasting loss with respect to the decoder’s final layer defines a score to identify distribution shifts. Our approach, first, does not affect the trajectory prediction model, ensuring no interference with original prediction performance and second, demonstrates substantial improvements on distribution shift detection for trajectory prediction on the Shifts and Argoverse datasets. Moreover, we show that this method can also be used to early detect collisions of a deep Q-Network motion planner in the Highway simulator. Source code is available at https://github.com/Michedev/forecasting-the-past.

Keywords: trajectory prediction, distribution shift detection, self-supervised learning, gradient analysis, automated driving, collision detection, deep Q-Network
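The score described in the abstract, the L2 norm of the forecasting-loss gradient with respect to the decoder's final layer, has a closed form in a simple special case. The sketch below assumes that the final layer is linear (`pred = W h + b`) and that the loss is half the squared error; neither is confirmed by the abstract, and the function name is illustrative. Under those assumptions the gradient with respect to `W` is the outer product of the residual and `h`, and the gradient with respect to `b` is the residual.

```python
import math

def final_layer_grad_norm(W, b, h, target):
    """L2 norm of d(0.5 * ||W h + b - target||^2) / d(W, b).

    W: list of rows, b: bias vector, h: input features to the final layer,
    target: the observed second half of the trajectory (flattened).
    """
    pred = [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]
    resid = [p - t for p, t in zip(pred, target)]
    r2 = sum(r * r for r in resid)
    h2 = sum(x * x for x in h)
    # ||resid h^T||_F^2 = r2 * h2 for the weights, plus r2 for the bias.
    return math.sqrt(r2 * (h2 + 1.0))

W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
h = [3.0, 4.0]
print(final_layer_grad_norm(W, b, h, target=[3.0, 4.0]))  # well-forecast sample
print(final_layer_grad_norm(W, b, h, target=[0.0, 0.0]))  # poorly forecast sample
```

A sample whose second half is forecast well yields a small gradient norm, while an unfamiliar sample yields a large one, so thresholding this score gives a shift detector that leaves the original prediction model untouched.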

291. ❌ Machine learning for four-dimensional SU(3) lattice gauge theories

Authors: Urs Wenger | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.12416v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于四维SU(3)规范理论的机器学习应用，特别是使用生成模型（如归一化流和扩散过程）以及基于重整化群变换的方法来改进规范场配置的采样。虽然论文涉及机器学习在科学计算中的应用，属于"AI for Science"的广义范畴，但具体内容与所有其他关键词（主要关于大语言模型、训练技术、推理优化、对齐、代理系统等）完全无关。因此，仅"AI for Science OR Bioinformatics OR Cheminformatics"获得5分（有一定关联），其他关键词均为0分（完全无关）。

!!! tip deepseek-chat TL;DR

该综述总结了机器学习在四维SU(3)规范理论中的应用，重点介绍了使用生成模型和重整化群变换改进规范场配置采样的方法，并展示了机器学习固定点作用在连续极限下的标度结果。

摘要翻译

本文综述了机器学习在格点规范理论模拟中的应用，以及当前可用于改进规范场构型采样的方法，重点关注四维SU(3)规范理论中的具体应用。这些方法包括基于生成式机器学习模型的途径，如（随机）归一化流与扩散过程；以及基于重正化群变换的途径，具体而言是通过规范等变卷积神经网络学习重正化群改进的规范作用量。特别地，本文展示了四维SU(3)规范理论中机器学习所得不动点作用量在趋近连续极限时的标度行为。相关结果包括基于经典完美梯度流标度的观测量——该标度在所有阶均无树层级格点离散化误差，以及与静态势能和退禁闭相变相关的物理量。

摘要 (Abstract)

In this review I summarize how machine learning can be used in lattice gauge theory simulations and what approaches are currently available to improve the sampling of gauge field configurations, with a focus on applications in four-dimensional SU(3) gauge theories. These include approaches based on generative machine-learning models such as (stochastic) normalizing flows and diffusion processes, and an approach based on renormalization group (RG) transformations, more specifically the machine learning of RG-improved gauge actions using gauge-equivariant convolutional neural networks. In particular, I present scaling results for a machine-learned fixed-point action in four-dimensional SU(3) gauge theory towards the continuum limit. The results include observables based on the classically perfect gradient-flow scales, which are free of tree-level lattice artefacts to all orders, and quantities related to the static potential and the deconfinement transition.

关键词: machine learning, lattice gauge theory, SU(3) gauge theory, normalizing flows, diffusion processes, renormalization group, gauge-equivariant neural networks, continuum limit

292. ❌ Is Sliding Window All You Need? An Open Framework for Long-Sequence Recommendation

作者: Sayak Chakrabarty, Souradip Pal 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12372v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于推荐系统领域的长序列训练技术，特别是滑动窗口方法和k-shift嵌入层，属于传统深度学习在推荐系统中的应用。所有评分关键词均与大模型技术原理、训练方法、推理优化、对齐、科学AI应用等直接相关，而本文完全不涉及大模型（LLMs）、MoE、量化、RAG、CoT、对齐、科学AI等任何评分关键词领域，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出并开源了一个用于长序列推荐的端到端框架，通过滑动窗口训练和新型k-shift嵌入层，在有限计算资源下实现了高效的推荐性能提升。

摘要翻译

长交互历史是现代推荐系统的核心，然而在实际内存与延迟预算下，长序列训练常被视为不切实际而被摒弃。本研究表明，长序列训练不仅在学术规模上可行，而且高效。我们发布了一个完整、端到端的框架，实现了工业级滑动窗口长序列训练，包含全部数据处理、训练与评估脚本。除复现已有成果外，本研究补充了此前报告中缺失的两项能力：（i）一项运行时感知的消融实验，量化了不同窗口机制与步长下精度与计算资源的平衡边界；（ii）一种新颖的k-shift嵌入层，可在消费级GPU上支持百万级词表规模，且精度损失可忽略不计。我们的实现在中等规模大学集群上训练稳定，同时提供有竞争力的检索质量（例如在Retailrocket数据集上MRR提升达+6.04%，Recall@10提升达+6.34%），训练时间开销约为基准的$\sim 4$倍。通过封装稳健的流程、报告训练时间成本，并引入适用于低资源环境的嵌入机制，我们将长序列训练从封闭的工业技术转化为面向学术社区实用、开放且可扩展的方法论。

摘要 (Abstract)

Long interaction histories are central to modern recommender systems, yet training with long sequences is often dismissed as impractical under realistic memory and latency budgets. This work demonstrates that it is not only practical but also effective, at academic scale. We release a complete, end-to-end framework that implements industrial-style long-sequence training with sliding windows, including all data processing, training, and evaluation scripts. Beyond reproducing prior gains, we contribute two capabilities missing from earlier reports: (i) a runtime-aware ablation study that quantifies the accuracy-compute frontier across windowing regimes and strides, and (ii) a novel k-shift embedding layer that enables million-scale vocabularies on commodity GPUs with negligible accuracy loss. Our implementation trains reliably on modest university clusters while delivering competitive retrieval quality (e.g., up to +6.04% MRR and +6.34% Recall@10 on Retailrocket) with $\sim 4\times$ training-time overheads. By packaging a robust pipeline, reporting training time costs, and introducing an embedding mechanism tailored for low-resource settings, we transform long-sequence training from a closed, industrial technique into a practical, open, and extensible methodology for the community.

关键词: long-sequence recommendation, sliding window, k-shift embedding, training framework, computational efficiency, retrieval quality, low-resource settings, end-to-end pipeline
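The sliding-window idea with a configurable window length and stride can be sketched as follows. This is a minimal illustration and not the released framework: the function name `sliding_windows`, the rule of appending a final window that covers the most recent interactions, and the toy history are all assumptions.

```python
def sliding_windows(sequence, window, stride):
    """Split one interaction history into overlapping training windows."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(sequence) <= window:
        # Short histories are kept as a single (shorter) window.
        return [list(sequence)]
    last_start = len(sequence) - window
    starts = list(range(0, last_start + 1, stride))
    # Assumed policy: always include a window ending at the latest item.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [list(sequence[s:s + window]) for s in starts]

history = list(range(10))  # item ids 0..9 in chronological order
windows = sliding_windows(history, window=4, stride=3)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

A smaller stride produces more (and more heavily overlapping) windows per user, which is one way the accuracy versus compute trade-off mentioned in the abstract can arise.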

293. ❌ Cross-Domain Transfer with Particle Physics Foundation Models: From Jets to Neutrino Interactions

作者: Gregor Krzmanc, Vinicius Mikuni, Benjamin Nachman, Callum Wilkinson 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12364v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究粒子物理基础模型（Foundation Models）的跨领域迁移应用，与"Large Language Models OR LLMs OR Foundation Models"高度相关（10分）。研究涉及预训练模型在不同物理实验中的迁移，与"Pre-training OR Continual Pre-training OR Domain Adaptation"高度相关（10分）。论文属于AI在科学领域的应用，与"AI for Science OR Bioinformatics OR Cheminformatics"高度相关（10分）。论文提到对预训练模型进行评估和微调，与"Post-training OR Supervised Fine-tuning OR SFT"有一定关联（5分）。其他关键词如MoE、SLMs、Scaling Laws、RLHF、RAG、CoT、Agents、Quantization等均未在论文中涉及，评分为0分。

!!! tip deepseek-chat TL;DR

该论文研究了粒子物理基础模型OmniLearned从高能碰撞数据到低能中微子实验的跨领域迁移能力，结果表明预训练模型在能量回归和事件分类任务上均优于从头训练的模型，证明了基础模型在粒子物理中的泛化能力和跨探测器推理潜力。

摘要翻译

未来基于人工智能的粒子物理学研究很可能将以基础模型为起点，以加速训练并提升灵敏度。作为构建粒子物理学通用基础模型的一步，我们研究了在多样化高$Q^2$模拟及真实$pp$与$ep$对撞数据上预训练的OmniLearned基础模型，能否有效迁移至几GeV能区的固定靶中微子实验。我们处理了MINERvA中微子-核散射事例，并在两类任务上评估了预训练模型：可用能量的回归分析，以及带电流π介子末态（$\mathrm{CC}1\pi^{\pm}$、$\mathrm{CC}N\pi^{\pm}$和$\mathrm{CC}1\pi^{0}$）的二分类识别。预训练的OmniLearned模型始终优于同等规模、从零开始训练的模型：无论是在相同计算资源下，还是在相同训练步数下，其整体性能都更好。这些结果表明，粒子层级的基础模型所获得的归纳偏置能够跨越能量尺度、探测器技术和底层物理过程的巨大差异进行泛化，这为粒子物理学中实现探测器无关的推理范式指明了方向。

摘要 (Abstract)

Future AI-based studies in particle physics will likely start from a foundation model to accelerate training and enhance sensitivity. As a step towards a general-purpose foundation model for particle physics, we investigate whether the OmniLearned foundation model pre-trained on diverse high-$Q^2$ simulated and real $pp$ and $ep$ collisions can be effectively transferred to a few-GeV fixed-target neutrino experiment. We process MINERvA neutrino–nucleus scattering events and evaluate pre-trained models on two types of tasks: regression of available energy and binary classification of charged-current pion final states ($\mathrm{CC}1\pi^{\pm}$, $\mathrm{CC}N\pi^{\pm}$, and $\mathrm{CC}1\pi^{0}$). Pre-trained OmniLearned models consistently outperform similarly sized models trained from scratch, achieving better overall performance both at the same compute budget and at the same number of training steps. These results suggest that particle-level foundation models acquire inductive biases that generalize across large differences in energy scale, detector technology, and underlying physics processes, pointing toward a paradigm of detector-agnostic inference in particle physics.

关键词: Foundation Models, Particle Physics, Cross-Domain Transfer, Pre-training, Neutrino Interactions, OmniLearned, Model Generalization, Detector-Agnostic Inference

294. ❌ Scaffold-Conditioned Preference Triplets for Controllable Molecular Optimization with Large Language Models

作者: Yi Xiong, Liang Xiong, Xiaohong Ji, Sen Yang, Zhifeng Gao, Huaimin Wang, Kele Xu 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12350v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究大语言模型（LLMs）在分子优化中的应用，属于AI for Science领域。论文使用偏好学习（alignment）方法（SCPT pipeline）训练LLM，这与Instruction Tuning/Alignment和RLHF/DPO等关键词高度相关。其他关键词如MoE、SLMs、Scaling Laws、PEFT、RAG等与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于大语言模型的分子优化方法（SCPT），通过构建骨架条件偏好三元组来对齐预训练分子LLM，实现了在保持分子骨架的同时改善分子性质，在单目标和多目标优化任务中均优于现有基线方法。

摘要翻译

分子性质优化是药物发现的核心环节，但许多深度学习方法依赖于黑箱评分系统，对骨架结构的保留控制有限，常产生不稳定或生物学上不合理的修饰。尽管大语言模型（LLMs）展现出作为分子生成器的潜力，但其优化能力仍受限于缺乏基于化学知识的偏好监督与规范化的数据构建流程。本文提出骨架条件偏好三元组（Scaffold-Conditioned Preference Triplets, SCPT），该流程通过骨架对齐和基于化学规则的过滤机制（包括有效性、可合成性及有意义的性质提升）构建相似性约束的三元组 $\langle\text{骨架}, \text{更优分子}, \text{次优分子}\rangle$。利用这些偏好数据，我们对预训练的分子大语言模型进行对齐训练，使其成为能够保持骨架结构的同时实现性质提升的条件编辑器。在单目标与多目标优化基准测试中，SCPT 在保持更高骨架相似度的同时，显著提升了优化成功率与性质增益。与代表性的非大语言模型分子优化方法相比，基于 SCPT 训练的大语言模型更适用于骨架约束优化和多目标优化任务。此外，在单性质和双性质监督下训练的模型能够有效泛化至三性质优化任务，表明其在有限的高阶监督下具备良好的外推泛化能力。SCPT 还提供了可控的数据构建调节机制，可生成可预测的相似性-增益边界，从而系统性地适应不同的优化需求。

摘要 (Abstract)

Molecular property optimization is central to drug discovery, yet many deep learning methods rely on black-box scoring and offer limited control over scaffold preservation, often producing unstable or biologically implausible edits. While large language models (LLMs) are promising molecular generators, optimization remains constrained by the lack of chemistry-grounded preference supervision and principled data curation. We introduce Scaffold-Conditioned Preference Triplets (SCPT), a pipeline that constructs similarity-constrained triplets $\langle\text{scaffold}, \text{better}, \text{worse}\rangle$ via scaffold alignment and chemistry-driven filters for validity, synthesizability, and meaningful property gains. Using these preferences, we align a pretrained molecular LLM as a conditional editor, enabling property-improving edits that retain the scaffold. Across single- and multi-objective benchmarks, SCPT improves optimization success and property gains while maintaining higher scaffold similarity than competitive baselines. Compared with representative non-LLM molecular optimization methods, SCPT-trained LLMs are better suited to scaffold-constrained and multi-objective optimization. In addition, models trained on single-property and two-property supervision generalize effectively to three-property tasks, indicating promising extrapolative generalization under limited higher-order supervision. SCPT also provides controllable data-construction knobs that yield a predictable similarity-gain frontier, enabling systematic adaptation to diverse optimization regimes.

关键词: Large Language Models, Molecular Optimization, Scaffold Preservation, Preference Learning, Drug Discovery, AI for Science, Alignment, Multi-objective Optimization
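The construction of similarity-constrained triplets ⟨scaffold, better, worse⟩ can be sketched on toy records. This is only an illustration under stated assumptions, not the SCPT pipeline: the record fields (`id`, `scaffold`, `features`, `score`, `valid`), the Jaccard similarity over abstract feature sets, and the thresholds `min_similarity` and `min_gain` are all hypothetical stand-ins for the scaffold alignment and chemistry-driven filters named in the abstract.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def build_triplets(candidates, min_similarity=0.6, min_gain=0.1):
    """Pair molecules sharing a scaffold into (scaffold, better, worse)."""
    by_scaffold = {}
    for mol in candidates:
        if mol["valid"]:  # stand-in for validity / synthesizability filters
            by_scaffold.setdefault(mol["scaffold"], []).append(mol)
    triplets = []
    for scaffold, mols in by_scaffold.items():
        for a, b in combinations(mols, 2):
            better, worse = (a, b) if a["score"] >= b["score"] else (b, a)
            gain = better["score"] - worse["score"]
            similar = jaccard(a["features"], b["features"]) >= min_similarity
            if similar and gain >= min_gain:
                triplets.append((scaffold, better["id"], worse["id"]))
    return triplets

toy = [
    {"id": "mol_a", "scaffold": "S1", "features": {"f1", "f2", "f3"}, "score": 0.40, "valid": True},
    {"id": "mol_b", "scaffold": "S1", "features": {"f1", "f2", "f3", "f4"}, "score": 0.70, "valid": True},
    {"id": "mol_c", "scaffold": "S1", "features": {"f1", "f9"}, "score": 0.90, "valid": True},
    {"id": "mol_d", "scaffold": "S1", "features": {"f1", "f2", "f3", "f4"}, "score": 0.95, "valid": False},
    {"id": "mol_e", "scaffold": "S2", "features": {"f5"}, "score": 0.50, "valid": True},
]
triplets = build_triplets(toy)
print(triplets)  # [('S1', 'mol_b', 'mol_a')]
```

Raising `min_similarity` or `min_gain` in this sketch trades off how close the pair stays against how large the improvement is, which loosely mirrors the "data-construction knobs" the abstract mentions.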

295. ❌ PrivEraserVerify: Efficient, Private, and Verifiable Federated Unlearning

作者: Parthaw Goswami, Md Khairul Islam, Ashfak Yeafi 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12348v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于联邦学习和联邦遗忘技术，提出了一种结合效率、隐私和可验证性的统一框架。所有评分关键词均与大模型、深度学习技术原理或科学AI应用直接相关，而本文研究的是联邦学习中的模型遗忘机制，不涉及大模型架构、训练方法、推理优化、对齐技术、代理系统或科学领域应用，因此与所有关键词完全无关。

!!! tip deepseek-chat TL;DR

本文提出了PrivEraserVerify框架，解决了联邦学习中高效、隐私保护且可验证的模型遗忘问题，实现了比重新训练快2-3倍的遗忘速度，同时提供形式化的隐私保证和可扩展验证。

摘要翻译

联邦学习（Federated Learning, FL）支持在不共享原始数据的情况下进行协同模型训练，为隐私保护人工智能提供了一条前景广阔的路径。然而，联邦学习模型仍可能记忆来自参与者的敏感信息，这与被遗忘权（Right to Be Forgotten, RTBF）的要求相冲突。为满足这些需求，联邦遗忘机制应运而生，旨在消除退出客户端的贡献。现有解决方案仅部分应对了这一挑战：FedEraser提升了效率但缺乏隐私保护，FedRecovery确保了差分隐私（Differential Privacy, DP）但降低了准确性，而VeriFi实现了可验证性却引入了开销，且无法保证效率或隐私。本文提出PrivEraserVerify（PEV），一个将效率、隐私和可验证性统一整合到联邦遗忘中的框架。PEV采用（i）自适应检查点技术来保留关键历史更新以实现快速重建，（ii）层自适应差分隐私校准以选择性消除客户端影响，同时最小化准确性损失，以及（iii）基于指纹的验证机制，使参与者能够以去中心化且非侵入性的方式确认遗忘效果。在图像、手写字符和医疗数据集上的实验表明，PEV的遗忘速度比重新训练快2至3倍，在降低性能损失的同时提供形式化的不可区分性保证，并支持可扩展的验证。据我们所知，PEV是首个为联邦遗忘同时提供效率、隐私和可验证性的框架，推动联邦学习向实用且符合监管要求的方向迈进。

摘要 (Abstract)

Federated learning (FL) enables collaborative model training without sharing raw data, offering a promising path toward privacy-preserving artificial intelligence. However, FL models may still memorize sensitive information from participants, conflicting with the right to be forgotten (RTBF). To meet these requirements, federated unlearning has emerged as a mechanism to remove the contribution of departing clients. Existing solutions only partially address this challenge: FedEraser improves efficiency but lacks privacy protection, FedRecovery ensures differential privacy (DP) but degrades accuracy, and VeriFi enables verifiability but introduces overhead without efficiency or privacy guarantees. We present PrivEraserVerify (PEV), a unified framework that integrates efficiency, privacy, and verifiability into federated unlearning. PEV employs (i) adaptive checkpointing to retain critical historical updates for fast reconstruction, (ii) layer-adaptive differentially private calibration to selectively remove client influence while minimizing accuracy loss, and (iii) fingerprint-based verification, enabling participants to confirm unlearning in a decentralized and noninvasive manner. Experiments on image, handwritten character, and medical datasets show that PEV achieves 2 to 3 times faster unlearning than retraining, provides formal indistinguishability guarantees with reduced performance degradation, and supports scalable verification. To the best of our knowledge, PEV is the first framework to simultaneously deliver efficiency, privacy, and verifiability for federated unlearning, moving FL closer to practical and regulation-compliant deployment.

关键词: Federated Learning, Federated Unlearning, Privacy Preservation, Differential Privacy, Verifiability, Model Efficiency, Right to be Forgotten, Checkpointing

296. ❌ Information-Geometric Decomposition of Generalization Error in Unsupervised Learning

作者: Gilhan Kim 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12340v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是无监督学习中泛化误差的信息几何分解理论，具体应用于正则化主成分分析（ε-PCA）。所有评分关键词都直接与大模型、深度学习技术、AI应用或相关方法相关，而本文是纯理论统计学习研究，不涉及任何大模型、深度学习、AI应用或相关技术方法，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于信息几何的无监督学习泛化误差分解框架，将其精确分解为模型误差、数据偏差和方差三个非负分量，并在ε-PCA模型上进行了理论分析和数值验证。

摘要翻译

我们将无监督学习的Kullback–Leibler泛化误差（GE），即从数据分布到训练模型的期望KL散度，分解为三个非负分量：模型误差、数据偏差和方差。该分解对于任意e-平坦模型类均精确成立，并源于信息几何中的两个恒等式：广义勾股定理和对偶e-混合方差恒等式。作为一个可解析处理的示例，我们将该框架应用于$ε$-PCA，这是一种正则化主成分分析方法，其中经验协方差矩阵在秩$N_K$处截断，被舍弃的方向被固定于一个噪声基底$ε$。尽管秩约束的$ε$-PCA本身并非e-平坦模型，但在各向同性高斯数据下，可通过技术性重构得到一个具有相同总GE的等价形式，使得分解的每个分量均能获得闭合表达式。最优秩表现为截断值$λ_{\mathrm{cut}}^{*} = ε$，即模型仅保留那些超过噪声基底的经验特征值；该截断值反映了模型误差收益与数据偏差成本之间的边际率平衡。进一步的边界比较得出了一个三区域相图（全保留区、内部区和坍缩区），它们由下Marchenko–Pastur边缘和一个可解析计算的坍缩阈值$ε_{*}(α)$分隔，其中$α$为维度-样本量比率。所有结论均通过数值实验验证。

摘要 (Abstract)

We decompose the Kullback–Leibler generalization error (GE) – the expected KL divergence from the data distribution to the trained model – of unsupervised learning into three non-negative components: model error, data bias, and variance. The decomposition is exact for any e-flat model class and follows from two identities of information geometry: the generalized Pythagorean theorem and a dual e-mixture variance identity. As an analytically tractable demonstration, we apply the framework to $ε$-PCA, a regularized principal component analysis in which the empirical covariance is truncated at rank $N_K$ and discarded directions are pinned at a fixed noise floor $ε$. Although rank-constrained $ε$-PCA is not itself e-flat, it admits a technical reformulation with the same total GE on isotropic Gaussian data, under which each component of the decomposition takes closed form. The optimal rank emerges as the cutoff $λ_{\mathrm{cut}}^{*} = ε$ – the model retains exactly those empirical eigenvalues exceeding the noise floor – with the cutoff reflecting a marginal-rate balance between model-error gain and data-bias cost. A boundary comparison further yields a three-regime phase diagram – retain-all, interior, and collapse – separated by the lower Marchenko–Pastur edge and an analytically computable collapse threshold $ε_{*}(α)$, where $α$ is the dimension-to-sample-size ratio. All claims are verified numerically.

关键词: generalization error, unsupervised learning, information geometry, Kullback-Leibler divergence, ε-PCA, model error, data bias, variance
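The truncation rule stated in the abstract (keep the empirical eigenvalues that exceed the noise floor $ε$ and pin the discarded directions at $ε$) can be written down directly. The sketch below works on a list of eigenvalues only; the function name `eps_pca_spectrum` and the toy spectrum are assumptions, and nothing here reproduces the paper's generalization-error analysis.

```python
def eps_pca_spectrum(empirical_eigenvalues, eps):
    """Return (rank, model spectrum) under the cutoff 'retain if > eps'."""
    ordered = sorted(empirical_eigenvalues, reverse=True)
    # Rank = number of empirical eigenvalues above the noise floor.
    rank = sum(1 for lam in ordered if lam > eps)
    # Retained directions keep their eigenvalue; the rest are pinned at eps.
    spectrum = ordered[:rank] + [eps] * (len(ordered) - rank)
    return rank, spectrum

toy_eigs = [0.2, 3.0, 0.9, 1.5]  # made-up empirical covariance eigenvalues
rank, spectrum = eps_pca_spectrum(toy_eigs, eps=1.0)
print(rank, spectrum)  # 2 [3.0, 1.5, 1.0, 1.0]
```

For this toy spectrum, a floor below every eigenvalue (for example `eps=0.1`) keeps all four directions, and a floor above every eigenvalue (for example `eps=5.0`) keeps none; where the boundaries between such regimes fall in general is what the paper's phase diagram addresses.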

297. ❌ Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study

作者: Charlotte S. Alexander, Shane Storks, Souradip Pal, Sayak Chakrabarty, Arushi Sharma, Mlen-Too Wesley, Bailey Russo 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12337v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文明确使用LLMs（Llama 2）进行性别分类，并应用SHAP等可解释性方法分析语言模式，因此与"Large Language Models"和"Mechanistic Interpretability"高度相关（10分）。研究涉及学术推荐信分析，属于AI在科学/学术领域的应用，与"AI for Science"有一定关联（5分）。其他关键词如MoE、SFT、RAG等未在论文中涉及，均为0分。

!!! tip deepseek-chat TL;DR

该研究使用Transformer模型和LLMs分析学术推荐信中隐含的性别线索，发现即使匿名化后模型仍能通过语言模式预测性别（准确率最高68%），并通过移除特定词汇降低预测性能，揭示了推荐信中难以消除的性别偏见问题。

摘要翻译

推荐信（LoRs）可能携带隐含的性别化语言模式，这些模式可能无意中影响后续决策，例如在招聘和录取过程中。在本研究中，我们探讨了在姓名和代词等显性标识被去性别化之后，基于Transformer的编码器模型以及大语言模型（LLMs）在多大程度上仍能推断出提交给美国某住院医师项目的学术推荐信中申请人的性别。通过使用三种模型（DistilBERT、RoBERTa和Llama 2）对匿名化和去性别化的推荐信进行性别分类，我们观察到显著的性别信息泄露，分类准确率最高可达68%。文本解释方法（如TF-IDF和SHAP）表明，某些语言模式是性别的强代理指标，例如“emotional”（情感化的）和“humanitarian”（人道主义的）常与女性申请人的推荐信相关联。作为创建真正性别中立推荐信的实验，我们移除了这些隐含的性别线索，使重新训练的分类器准确率最多下降5.5%，宏平均$F_1$分数最多下降2.7%。然而，对申请人性别的预测仍优于随机猜测。在本案例研究中，我们的发现突出表明：1）推荐信包含难以消除的性别识别线索，可能激活决策过程中的偏见；2）虽然我们的技术框架可能是迈向更公平学术和专业评估的具体一步，但未来仍需进一步探究性别在推荐信评审中的作用。综上所述，我们的研究结果表明，有必要对现实学术推荐信中的评价性文本进行上游审计，并将其作为模型层面公平干预的必要补充。

摘要 (Abstract)

Letters of recommendation (LoRs) can carry patterns of implicitly gendered language that can inadvertently influence downstream decisions, e.g. in hiring and admissions. In this work, we investigate the extent to which Transformer-based encoder models as well as Large Language Models (LLMs) can infer the gender of applicants in academic LoRs submitted to a U.S. medical-residency program after explicit identifiers like names and pronouns are de-gendered. While using three models (DistilBERT, RoBERTa, and Llama 2) to classify the gender of anonymized and de-gendered LoRs, significant gender leakage was observed, as evident from up to 68% classification accuracy. Text interpretation methods, like TF-IDF and SHAP, demonstrate that certain linguistic patterns are strong proxies for gender, e.g. “emotional” and “humanitarian” are commonly associated with LoRs from female applicants. As an experiment in creating truly gender-neutral LoRs, these implicit gender cues were removed, resulting in a drop of up to 5.5% accuracy and 2.7% macro $F_1$ score on re-training the classifiers. However, applicant gender prediction still remains better than chance. In this case study, our findings highlight that 1) LoRs contain gender-identifying cues that are hard to remove and may activate bias in decision-making and 2) while our technical framework may be a concrete step toward fairer academic and professional evaluations, future work is needed to interrogate the role that gender plays in LoR review. Taken together, our findings motivate upstream auditing of evaluative text in real-world academic letters of recommendation as a necessary complement to model-level fairness interventions.

关键词: gender bias, recommendation letters, large language models, interpretability, fairness, Transformer models, SHAP, academic evaluation
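TF-IDF, one of the text interpretation methods named in the abstract, can be sketched in a few lines. The code below uses one common variant (term frequency times `log(N / df)`), which is an assumption rather than the authors' exact configuration, and the three sentences are invented for illustration and do not come from the study's data.

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF weights using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

letters = [  # hypothetical, already de-gendered sentences
    "the applicant is caring and humanitarian",
    "the applicant is decisive and analytical",
    "the applicant is caring and thorough",
]
w = tfidf(letters)
top_term = max(w[0], key=w[0].get)
print(top_term)  # humanitarian
```

Words shared by every letter get weight zero, so the highest-weighted terms are the ones that distinguish a letter from the rest; comparing such terms across groups of letters is one simple way to surface candidate cue words before removing them.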

298. ❌ Black-Box Optimization From Small Offline Datasets via Meta Learning with Synthetic Tasks

作者: Azza Fadhel, The Hung Tran, Trong Nghia Hoang, Jana Doppa 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12325v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

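Scoring rationale: The paper studies offline black-box optimization and proposes OptBias, a framework based on meta-learning and synthetic task generation that learns optimization bias when only small datasets are available. It falls under AI for Science (e.g., molecule and materials design) and is moderately related to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (relevance 5) because it addresses optimization problems in scientific applications. It does not involve large models, the principles of deep learning techniques, or the specific techniques behind any other keyword (such as LLMs, MoE, or fine-tuning methods), so all other keywords receive 0.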

!!! tip deepseek-chat TL;DR

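To address data scarcity in offline black-box optimization, the paper proposes OptBias, a framework that learns optimization bias through meta-learning and synthetic task generation, and it outperforms existing methods on several small-data benchmarks.

Abstract (Chinese translation)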

本文研究离线黑箱优化问题，其目标是从历史实验数据中发现最优设计（例如分子或材料）。该场景下的一个核心挑战是数据稀缺性：在许多科学应用中，仅能获得少量或低质量的数据集，这严重限制了现有算法的有效性。已有研究从理论和实证上表明，离线优化算法的性能取决于代理模型对优化偏置的捕捉能力（即正确对输入设计进行排序的能力），而在有限的实验数据下实现这一目标十分困难。本文提出基于合成任务生成的优化偏置代理学习框架（OptBias），这是一种直接应对数据稀缺问题的元学习方法。OptBias通过在高斯过程生成的合成任务上进行训练，学习可重复使用的优化偏置，随后针对目标任务的小规模数据进行代理模型的微调。在多种连续与离散离线优化基准测试中，OptBias在小数据场景下均持续优于当前最先进的基线方法。这些结果表明，OptBias为现实小数据环境下的离线优化问题提供了一个鲁棒且实用的解决方案。

Abstract

We consider the problem of offline black-box optimization, where the goal is to discover optimal designs (e.g., molecules or materials) from past experimental data. A key challenge in this setting is data scarcity: in many scientific applications, only small or poor-quality datasets are available, which severely limits the effectiveness of existing algorithms. Prior work has theoretically and empirically shown that performance of offline optimization algorithms depends on how well the surrogate model captures the optimization bias (i.e., ability to rank input designs correctly), which is challenging to accomplish with limited experimental data. This paper proposes Surrogate Learning with Optimization Bias via Synthetic Task Generation (OptBias), a meta-learning framework that directly tackles data scarcity. OptBias learns a reusable optimization bias by training on synthetic tasks generated from a Gaussian process, and then fine-tunes the surrogate model on the small data for the target task. Across diverse continuous and discrete offline optimization benchmarks, OptBias consistently outperforms state-of-the-art baselines in small data regimes. These results highlight OptBias as a robust and practical solution for offline optimization in realistic small data settings.
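The abstract describes optimization bias as the surrogate's ability to rank input designs correctly. A minimal sketch of one way to quantify that ability is pairwise ranking accuracy. The abstract does not say which ranking metric the paper uses, so the function name, the metric choice, and the toy scores below are all illustrative assumptions.

```python
from itertools import combinations
from typing import Sequence


def pairwise_ranking_accuracy(true_scores: Sequence[float],
                              surrogate_scores: Sequence[float]) -> float:
    """Fraction of design pairs (with distinct true scores) that the
    surrogate orders the same way as the ground truth."""
    concordant = 0
    total = 0
    for i, j in combinations(range(len(true_scores)), 2):
        true_diff = true_scores[i] - true_scores[j]
        if true_diff == 0:
            continue  # ties in the ground truth carry no ordering information
        total += 1
        if true_diff * (surrogate_scores[i] - surrogate_scores[j]) > 0:
            concordant += 1
    return concordant / total if total else 0.0


# Hypothetical objective values for four designs.
true_values = [0.1, 0.4, 0.9, 0.7]
# A surrogate with the wrong scale but the right order.
scaled_but_ordered = [10.0, 20.0, 40.0, 30.0]
# A surrogate with close values but the two best designs swapped.
close_but_swapped = [0.1, 0.4, 0.7, 0.9]

acc_ordered = pairwise_ranking_accuracy(true_values, scaled_but_ordered)
acc_swapped = pairwise_ranking_accuracy(true_values, close_but_swapped)
```

In this toy case the badly scaled surrogate ranks every pair correctly, while the numerically closer one misorders the top two designs, which is the pair that matters most when searching for an optimum.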

Keywords: offline black-box optimization, data scarcity, meta-learning, synthetic task generation, optimization bias, surrogate model, small data regimes, molecular design

299. ❌ GCA Framework: A Gulf-Grounded Dataset and Agentic Pipeline for Climate Decision Support

Authors: Muhammad Umer Sheikh, Khawar Shehzad, Salman Khan, Fahad Shahbaz Khan, Muhammad Haris Khan Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12306v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

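Scoring rationale: The paper centers on applying large language models (LLMs) to climate science. It builds a Gulf-region climate dataset (GCA-DS) and a tool-augmented agent (Gulf Climate Agent) to address the weaknesses of general-purpose LLMs in regional climate knowledge and in interaction with geospatial tools. The paper explicitly uses LLMs, performs domain fine-tuning (which falls under SFT), develops an LLM agent (LLM Agents) with integrated tools (Tool Use), and applies AI to climate science (AI for Science). These keywords are highly relevant and receive a relevance of 10. Other keywords, such as MoE, quantization, inference acceleration, and interpretability, are not covered and receive 0.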

!!! tip deepseek-chat TL;DR

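To address the weaknesses of general-purpose LLMs in regional climate knowledge and geospatial tool interaction, the paper proposes the GCA framework, comprising a Gulf-region climate dataset and a tool-augmented agent. Domain fine-tuning and tool integration substantially improve reliability on climate decision tasks.

Abstract (Chinese translation)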

海湾地区的气候决策日益需要能够将多元化的科学与政策证据转化为可操作指导的系统，然而通用大语言模型在区域特定气候知识以及与地理空间和预测工具的落地交互方面仍存在不足。本文提出GCA框架，该框架整合了（i）GCA-DS——一个精心构建的聚焦海湾地区的多模态数据集，以及（ii）海湾气候智能体——一个用于气候分析的工具增强型智能体。GCA-DS包含约20万个问答对，涵盖政府政策与适应计划、非政府组织及国际框架、学术文献，以及关于热浪、沙尘暴和洪水的事件驱动型报道，并辅以将遥感影像与文本证据耦合的遥感数据输入。在此基础上，GCA智能体协调一个基于实时与历史信号及地理空间处理的模块化工具流程，生成衍生指标和可解释的可视化结果。最后，我们在海湾气候任务上对开源和专有大语言模型进行基准测试，结果表明领域微调与工具集成相较于通用基线模型显著提升了可靠性。

Abstract

Climate decision-making in the Gulf increasingly demands systems that can translate heterogeneous scientific and policy evidence into actionable guidance, yet general-purpose large language models (LLMs) remain weak both in region-specific climate knowledge and grounded interaction with geospatial and forecasting tools. We present the GCA framework, which unifies (i) GCA-DS, a curated Gulf-focused multimodal dataset, and (ii) Gulf Climate Agent (GCA), a tool-augmented agent for climate analysis. GCA-DS comprises ~200k question-answer pairs spanning governmental policies and adaptation plans, NGO and international frameworks, academic literature, and event-driven reporting on heatwaves, dust storms, and floods, complemented with remote-sensing inputs that couple imagery with textual evidence. Building on this foundation, the GCA agent orchestrates a modular tool pipeline grounded in real-time and historical signals and geospatial processing that produces derived indices and interpretable visualizations. Finally, we benchmark open and proprietary LLMs on Gulf climate tasks and show that domain fine-tuning and tool integration substantially improve reliability over general-purpose baselines.

Keywords: Large Language Models, Climate Decision Support, Domain Fine-tuning, LLM Agents, Tool Augmentation, AI for Science, Geospatial Analysis, Multimodal Dataset

300. ❌ Beyond Weather Correlation: A Comparative Study of Static and Temporal Neural Architectures for Fine-Grained Residential Energy Consumption Forecasting in Melbourne, Australia

Authors: Prasad Nimantha Madusanka Ukwatta Hewage, Hao Wu Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12304v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

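Scoring rationale: The paper studies residential energy consumption forecasting with conventional neural networks (MLP and LSTM). It does not involve large models, innovations in the principles of deep learning techniques, or applications of large models to other domains. All keywords concern large models, deep learning technical principles, or AI for Science, whereas this paper is a conventional time-series forecasting application, so it is unrelated to every keyword.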

!!! tip deepseek-chat TL;DR

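The paper compares a multilayer perceptron (MLP) and a long short-term memory network (LSTM) for forecasting household energy consumption at 5-minute granularity in Melbourne, Australia. It finds that the LSTM, which exploits temporal autocorrelation, significantly outperforms the MLP, which relies only on weather features, and it reveals an asymmetry introduced by solar generation.

Abstract (Chinese translation)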

在亚小时级分辨率下实现精准的短期住宅能耗预测，对于智能电网管理、需求响应计划及可再生能源整合至关重要。尽管气象变量被广泛认为是住宅电力需求的关键驱动因素，但对于澳大利亚家庭而言，在细粒度（5分钟）时间分辨率下，纳入时间自相关性（即过去能耗的序列记忆）相较于仅使用静态气象特征的优势，仍未得到充分探究。本文对多层感知机（MLP）与长短期记忆（LSTM）循环网络进行了严谨的实证比较，并将其应用于墨尔本的两个真实家庭：房屋3（标准并网住宅）和房屋4（屋顶太阳能光伏一体化住宅）。两种模型均基于14个月（2023年3月至2024年4月）的5分钟间隔智能电表数据，并结合澳大利亚气象局（BOM）的每日天气观测数据进行训练，每个家庭产生超过117,000个样本。LSTM模型基于24步（2小时）滑动能耗窗口运行，其决定系数分别达到R^2 = 0.883（房屋3）和R^2 = 0.865（房屋4）；而相应的基于气象驱动的MLP模型仅获得R^2 = -0.055和R^2 = 0.410，两者差距分别达93.8和45.5个百分点。这些结果表明，在5分钟粒度的短期预测中，能耗序列的时间自相关性主导了气象信息的作用。此外，我们揭示了太阳能发电引入的不对称性：对于光伏一体化家庭，MLP模型取得了R^2 = 0.410，这表明模型通过天气-时间相关性实现了隐式的太阳能发电预测。通过持续性基线分析和季节性分层评估，我们对模型性能进行了背景化解读。最后，我们提出了一种结合气象增强的混合LSTM模型以及联邦学习扩展方案，作为未来工作的研究方向。

Abstract

Accurate short-term residential energy consumption forecasting at sub-hourly resolution is critical for smart grid management, demand response programmes, and renewable energy integration. While weather variables are widely acknowledged as key drivers of residential electricity demand, the relative merit of incorporating temporal autocorrelation (the sequential memory of past consumption) over static meteorological features alone remains underexplored at fine-grained (5-minute) temporal resolution for Australian households. This paper presents a rigorous empirical comparison of a Multilayer Perceptron (MLP) and a Long Short-Term Memory (LSTM) recurrent network applied to two real-world Melbourne households: House 3 (a standard grid-connected dwelling) and House 4 (a rooftop solar photovoltaic-integrated household). Both models are trained on 14 months of 5-minute interval smart meter data (March 2023-April 2024) merged with official Bureau of Meteorology (BOM) daily weather observations, yielding over 117,000 samples per household. The LSTM, operating on 24-step (2-hour) sliding consumption windows, achieves coefficients of determination of R^2 = 0.883 (House 3) and R^2 = 0.865 (House 4), compared to R^2 = -0.055 and R^2 = 0.410 for the corresponding weather-driven MLPs, differences of 93.8 and 45.5 percentage points. These results establish that temporal autocorrelation in the consumption sequence dominates meteorological information for short-term forecasting at 5-minute granularity. Additionally, we demonstrate an asymmetry introduced by solar generation: for the PV-integrated household, the MLP achieves R^2 = 0.410, revealing implicit solar forecasting from weather-time correlations. A persistence baseline analysis and seasonal stratification contextualise model performance. We propose a hybrid weather-augmented LSTM and federated learning extensions as directions for future work.
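The figures in this abstract can be checked with a few lines of arithmetic. The sketch below confirms the window length and the percentage-point gaps, and shows on toy data why an R^2 below zero (as reported for the House 3 MLP) means the forecast is worse than always predicting the mean. The helper names and the toy series are assumptions for illustration and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

STEP_MINUTES = 5   # sampling interval stated in the abstract
WINDOW_STEPS = 24  # sliding-window length stated in the abstract


def sliding_windows(series: Sequence[float],
                    steps: int) -> List[Tuple[List[float], float]]:
    """Pair each run of `steps` past readings with the next reading as target."""
    return [(list(series[i:i + steps]), series[i + steps])
            for i in range(len(series) - steps)]


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# 24 steps of 5 minutes each span 2 hours.
window_hours = WINDOW_STEPS * STEP_MINUTES / 60

# Reported R^2 values (LSTM, MLP) per household, copied from the abstract.
reported: Dict[str, Tuple[float, float]] = {
    "House 3": (0.883, -0.055),
    "House 4": (0.865, 0.410),
}
gaps_pp = {house: round((lstm - mlp) * 100, 1)
           for house, (lstm, mlp) in reported.items()}

# Toy data (hypothetical): a forecast that reverses the trend gives R^2 < 0.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_reversed = [4.0, 3.0, 2.0, 1.0]
toy_r2 = r_squared(toy_true, toy_reversed)
```

The computed gaps match the 93.8 and 45.5 percentage points quoted in the abstract.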

Keywords: residential energy consumption forecasting, LSTM, MLP, temporal autocorrelation, weather variables, smart grid, solar photovoltaic, fine-grained temporal resolution

301. ❌ Labeled TrustSet Guided: Batch Active Learning with Reinforcement Learning

Authors: Guofeng Cui, Yang Liu, Pichao Wang, Hankai Hsu, Xiaohang Sun, Xiang Hao, Zhu Liu Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12303v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

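Scoring rationale: The paper focuses on batch active learning and proposes TrustSet together with BRAL-T, a framework built on a reinforcement-learning-based sampling policy, applied to image classification. All scoring keywords relate directly to large models, deep learning technical principles, or applications of AI in science, whereas this paper studies a general-purpose active learning algorithm. It does not involve large-model techniques, specific training methods (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or scientific AI applications, so every keyword has a relevance of 0.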

!!! tip deepseek-chat TL;DR

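The paper proposes BRAL-T, a batch active learning framework that combines TrustSet with reinforcement learning to select data for labeling efficiently, and it achieves state-of-the-art results on multiple image classification benchmarks and active fine-tuning tasks.

Abstract (Chinese translation)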

批量主动学习（Batch Active Learning, BAL）是一种关键技术，用于降低标注成本并提升大规模深度学习模型训练的数据效率。传统的BAL方法在选择待标注数据时，常依赖马哈拉诺比斯距离（Mahalanobis Distance）等度量来平衡不确定性与多样性。然而，这些方法主要关注未标注数据的分布，未能充分利用已标注数据的反馈或模型性能信息。为应对这些局限，我们提出了TrustSet，这是一种新颖的方法，它从已标注数据集中选择信息量最大的样本，并确保类别分布均衡以缓解长尾问题。与侧重于维持整体数据分布的CoreSet不同，TrustSet通过剪枝冗余数据并利用标签信息优化选择过程，从而提升模型性能。为了将TrustSet的优势扩展至未标注数据池，我们提出了一种基于强化学习（Reinforcement Learning, RL）的采样策略，该策略能够近似地从未标注数据中筛选出高质量的TrustSet候选样本。结合TrustSet与强化学习，我们提出了基于TrustSet的批量强化主动学习框架（Batch Reinforcement Active Learning with TrustSet, BRAL-T）。BRAL-T在10个图像分类基准测试和2个主动微调任务中均取得了最先进的成果，证明了其在多个领域中的有效性与高效性。

Abstract

Batch active learning (BAL) is a crucial technique for reducing labeling costs and improving data efficiency in training large-scale deep learning models. Traditional BAL methods often rely on metrics like Mahalanobis Distance to balance uncertainty and diversity when selecting data for annotation. However, these methods predominantly focus on the distribution of unlabeled data and fail to leverage feedback from labeled data or the model’s performance. To address these limitations, we introduce TrustSet, a novel approach that selects the most informative data from the labeled dataset, ensuring a balanced class distribution to mitigate the long-tail problem. Unlike CoreSet, which focuses on maintaining the overall data distribution, TrustSet optimizes the model’s performance by pruning redundant data and using label information to refine the selection process. To extend the benefits of TrustSet to the unlabeled pool, we propose a reinforcement learning (RL)-based sampling policy that approximates the selection of high-quality TrustSet candidates from the unlabeled data. Combining TrustSet and RL, we introduce the Batch Reinforcement Active Learning with TrustSet (BRAL-T) framework. BRAL-T achieves state-of-the-art results across 10 image classification benchmarks and 2 active fine-tuning tasks, demonstrating its effectiveness and efficiency in various domains.

Keywords: Batch Active Learning, TrustSet, Reinforcement Learning, Data Selection, Image Classification, Labeling Efficiency, CoreSet, BRAL-T

302. ❌ Fine-tuning Factor Augmented Neural Lasso for Heterogeneous Environments

Authors: Jinhang Chai, Jianqing Fan, Cheng Gao, Qishuo Yin Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12288v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

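Scoring rationale: The paper focuses on fine-tuning for high-dimensional nonparametric regression, particularly variable selection and transfer learning. Its core contribution is the FAN-Lasso framework and an accompanying theoretical analysis. It is unrelated to most keywords, which mainly concern specific techniques such as large language models, reasoning, alignment, and agents. However, it is highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (relevance 10), because its core is the theory and methodology of fine-tuning. It is also highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (relevance 10), because the paper explicitly states that its framework offers a theoretical perspective on parameter-efficient fine-tuning methods. It is moderately related to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (relevance 5), because it involves transfer learning, although the emphasis is on fine-tuning rather than on pre-training or domain adaptation themselves. No other keyword is covered, so the rest receive 0.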

!!! tip deepseek-chat TL;DR

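The paper proposes a fine-tuning framework (FAN-Lasso) for high-dimensional nonparametric regression with variable selection, proves that it achieves statistical acceleration over single-task learning under covariate and posterior shift, and offers a theoretical perspective on parameter-efficient fine-tuning.

Abstract (Chinese translation)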

微调是一种广泛采用的策略，用于使预训练模型适应新任务，但其在高维非参数设置下结合变量选择的方法论与理论性质尚未得到充分发展。本文提出了微调因子增强神经Lasso（FAN-Lasso），这是一个面向高维非参数回归变量选择的迁移学习框架，能够同时处理协变量偏移与后验偏移。我们采用低秩因子结构来管理高维相依协变量，并提出了一种新颖的残差微调分解方法，其中目标函数被表达为冻结源函数及其他变量的变换，以此实现迁移学习和非参数变量选择。这种源自预测器的增强特征能够将知识迁移至目标域，并降低目标域的模型复杂度。我们推导了微调FAN-Lasso的极小化最优超额风险界，精确刻画了在相对样本量与函数复杂度条件下，微调能够获得超越单任务学习的统计加速效果。所提出的框架也为参数高效微调方法提供了理论视角。在多种协变量与后验偏移场景下的大量数值实验表明，微调FAN-Lasso始终优于标准基线方法，即使在严格的目标样本量限制下也能接近Oracle性能，从而实证验证了所推导的理论速率。

Abstract

Fine-tuning is a widely used strategy for adapting pre-trained models to new tasks, yet its methodology and theoretical properties in high-dimensional nonparametric settings with variable selection have not yet been developed. This paper introduces the fine-tuning factor augmented neural Lasso (FAN-Lasso), a transfer learning framework for high-dimensional nonparametric regression with variable selection that simultaneously handles covariate and posterior shifts. We use a low-rank factor structure to manage high-dimensional dependent covariates and propose a novel residual fine-tuning decomposition in which the target function is expressed as a transformation of a frozen source function and other variables to achieve transfer learning and nonparametric variable selection. This augmented feature from the source predictor allows for the transfer of knowledge to the target domain and reduces model complexity there. We derive minimax-optimal excess risk bounds for the fine-tuning FAN-Lasso, characterizing the precise conditions, in terms of relative sample sizes and function complexities, under which fine-tuning yields statistical acceleration over single-task learning. The proposed framework also provides a theoretical perspective on parameter-efficient fine-tuning methods. Extensive numerical experiments across diverse covariate- and posterior-shift scenarios demonstrate that the fine-tuning FAN-Lasso consistently outperforms standard baselines and achieves near-oracle performance even under severe target sample size constraints, empirically validating the derived rates.

Keywords: fine-tuning, transfer learning, high-dimensional nonparametric regression, variable selection, FAN-Lasso, covariate shift, posterior shift, parameter-efficient fine-tuning

303. ❌ Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation

Authors: Jiayi Li, Shijie Tang, Gün Kaynar, Shiyi Du, Carl Kingsford Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12277v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

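Scoring rationale: The paper studies shortcut learning in pretrained language models and proposes Shortcut Guardrail, a deployment-time mitigation framework that uses gradient-based attribution and a LoRA fine-tuning module. Highly relevant keywords: LLMs (the object of study), PEFT/LoRA (LoRA is used), Hallucination Mitigation (mitigating model bias/hallucination), and Mechanistic Interpretability (gradient attribution is used to explain model behavior). Moderately relevant: Pre-training (pretrained models are involved) and Post-training (fine-tuning is involved). The remaining keywords are unrelated to the paper.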

!!! tip deepseek-chat TL;DR

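To address shortcut learning in pretrained language models, the paper proposes Shortcut Guardrail, a deployment-time mitigation framework that needs neither the original training data nor shortcut annotations. It identifies shortcut tokens through gradient attribution and trains a LoRA module to improve generalization under distribution shift.

Abstract (Chinese translation)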

预训练语言模型常依赖训练时看似具有预测性、却无法在测试时泛化的表层特征，这一现象被称为捷径学习。现有缓解方法通常在训练阶段实施，且需要大量监督信息（如访问原始训练数据或已知捷径类型）。我们提出“捷径护栏”——一种无需访问原始训练数据或捷径标注、在部署阶段缓解词元级捷径的框架。我们的核心发现是：基于梯度归因的方法可在有偏模型上凸显捷径词元。基于此发现，我们采用掩码对比学习目标训练轻量化的LoRA去偏模块，该模块能促使模型在包含或排除特定词元时生成一致的表征。在情感分类、毒性检测和自然语言推理任务中，针对自然产生与受控的捷径场景，捷径护栏在分布偏移下相较于未缓解模型提升了整体准确率与最差组准确率，同时保持了分布内性能。

Abstract

Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language inference under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.

Keywords: shortcut learning, pretrained language models, deployment-time mitigation, gradient-based attribution, LoRA, Masked Contrastive Learning, distribution shifts, worst-group accuracy

304. ❌ SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation

Authors: Yexiong Lin, Jia Shi, Shanshan Ye, Wanyu Wang, Yu Yao, Tongliang Liu Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12273v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

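Scoring rationale: The paper focuses on generative models in computer vision (flow matching), studies the diversity problem in image generation, and proposes SubFlow, which resolves averaging distortion through sub-mode conditioning. All scoring keywords target large language models (LLMs) and related techniques (such as MoE, RLHF, and RAG), whereas this paper does not involve language models, natural language processing, or AI for Science applications, so every keyword has a relevance of 0.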

!!! tip deepseek-chat TL;DR

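To address diversity degradation in flow matching generative models, the paper proposes SubFlow, which eliminates averaging distortion through sub-mode conditioning and significantly improves generation diversity while maintaining image quality.

Abstract (Chinese translation)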

流匹配已成为一种强大的生成框架，近期出现的少步推理方法实现了显著的推理加速。然而，我们发现了一个关键但被忽视的局限：这些模型存在严重的多样性退化问题，其样本集中于主导模态，而忽略了目标分布中虽罕见但有效的变体。我们将这种退化归因于平均失真：当使用均方误差目标进行训练时，类条件流会学习类内子模态的频率加权均值，导致模型过度表征高密度模态，同时系统性忽略低密度模态。为解决此问题，我们提出了子流（SubFlow），即子模态条件流匹配。该方法通过语义聚类将每个类别分解为细粒度的子模态，并使流以子模态索引为条件，从而消除了平均失真。每个条件化的子分布近似为单模态，因此学习到的流能够准确针对单个模态而无平均失真，在单步推理中即可恢复完整的模态覆盖。关键的是，子流是完全即插即用的：它可以无缝集成到现有的一步生成模型（如MeanFlow和Shortcut Models）中，无需任何架构修改。在ImageNet-256上进行的大量实验表明，子流在保持竞争力的图像质量（FID）的同时，显著提升了生成多样性（Recall），证实了其在不同一步生成框架中的广泛适用性。项目页面：https://yexionglin.github.io/subflow。

Abstract

Flow matching has emerged as a powerful generative framework, with recent few-step methods achieving remarkable inference acceleration. However, we identify a critical yet overlooked limitation: these models suffer from severe diversity degradation, concentrating samples on dominant modes while neglecting rare but valid variations of the target distribution. We trace this degradation to averaging distortion: when trained with MSE objectives, class-conditional flows learn a frequency-weighted mean over intra-class sub-modes, causing the model to over-represent high-density modes while systematically neglecting low-density ones. To address this, we propose SubFlow, Sub-mode Conditioned Flow Matching, which eliminates averaging distortion by decomposing each class into fine-grained sub-modes via semantic clustering and conditioning the flow on sub-mode indices. Each conditioned sub-distribution is approximately unimodal, so the learned flow accurately targets individual modes with no averaging distortion, restoring full mode coverage in a single inference step. Crucially, SubFlow is entirely plug-and-play: it integrates seamlessly into existing one-step models such as MeanFlow and Shortcut Models without any architectural modifications. Extensive experiments on ImageNet-256 demonstrate that SubFlow yields substantial gains in generation diversity (Recall) while maintaining competitive image quality (FID), confirming its broad applicability across different one-step generation frameworks. Project page: https://yexionglin.github.io/subflow.
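The averaging-distortion argument rests on a basic property of the MSE objective: its minimiser is the mean of the targets, so a single class-level prediction lands at the frequency-weighted mean of the sub-modes. The toy sketch below illustrates only that property on scalar data. It is not SubFlow's flow-matching training, and the sample values and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical 1-D "class" with two sub-modes: a common one at -1.0
# (9 samples) and a rare one at +4.0 (1 sample).
samples: List[Tuple[int, float]] = [(0, -1.0)] * 9 + [(1, 4.0)] * 1


def mse_optimal_constant(values: List[float]) -> float:
    """The constant that minimises mean squared error is the arithmetic mean."""
    return sum(values) / len(values)


# Class-conditional fit: one prediction shared by the whole class.
class_level = mse_optimal_constant([y for _, y in samples])

# Sub-mode-conditional fit: one prediction per sub-mode index.
by_mode: Dict[int, List[float]] = defaultdict(list)
for mode, y in samples:
    by_mode[mode].append(y)
sub_mode_level = {mode: mse_optimal_constant(ys) for mode, ys in by_mode.items()}
```

Here the class-level prediction is -0.5, which matches neither sub-mode and sits close to the dominant one, while conditioning on the sub-mode index recovers both -1.0 and 4.0 exactly.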

Keywords: Flow Matching, Generative Models, One-Step Generation, Diversity Degradation, Averaging Distortion, Sub-mode Conditioning, Image Generation, Mode Coverage

305. ❌ RoleMAG: Learning Neighbor Roles in Multimodal Graphs

Authors: Yilong Zuo, Xunkai Li, Zhihan Zhang, Ronghua Li, Guoren Wang Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12271v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

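Scoring rationale: RoleMAG focuses on learning neighbor roles and on propagation mechanisms in multimodal attributed graphs (MAGs), which places it in graph neural networks (GNNs) and multimodal learning. All scoring keywords revolve around large-model (LLM) techniques, training methods, inference optimization, alignment, agent systems, and similar topics. The paper does not involve any large-model technique, training process, reasoning method, or scientific AI application; its core is the design of modality-specific propagation over graph structure. It is therefore unrelated to every keyword, and all receive 0.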

!!! tip deepseek-chat TL;DR

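The paper proposes RoleMAG, a framework that distinguishes whether a neighbor contributes shared, complementary, or heterophilous signals and routes propagation accordingly. This addresses the blurring of modality-specific signals by shared message passing in multimodal graphs, and it achieves the best or competitive results on several benchmarks.

Abstract (Chinese translation)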

多模态属性图（Multimodal Attributed Graphs, MAGs）融合了多模态节点属性与结构化关系。然而，现有方法通常在单一图上执行共享的消息传递，并隐含假设相同的邻居对所有模态均等有益。实际上，对某一模态有益的邻居可能干扰另一模态，导致在共享传播过程中模态特定信号变得模糊。为解决这一问题，我们提出了RoleMAG，一个能够学习不同邻居应如何参与传播的多模态图框架。具体而言，RoleMAG区分邻居应提供共享信号、互补信号还是异质信号，并通过独立的传播通道进行路由。这使得互补邻居能够实现跨模态补全，同时避免异质邻居参与共享平滑过程。在三个以图为中心的多模态属性图基准测试（RedditS、Bili_Dance和Toys）上的大量实验表明，RoleMAG在RedditS和Bili_Dance上取得了最佳结果，并在Toys上保持竞争力。消融实验、鲁棒性分析和效率评估进一步验证了所提出的角色感知传播设计的有效性。我们的代码发布于https://anonymous.4open.science/r/RoleMAG-7EE0/。

Abstract

Multimodal attributed graphs (MAGs) combine multimodal node attributes with structured relations. However, existing methods usually perform shared message passing on a single graph and implicitly assume that the same neighbors are equally useful for all modalities. In practice, neighbors that benefit one modality may interfere with another, blurring modality-specific signals under shared propagation. To address this issue, we propose RoleMAG, a multimodal graph framework that learns how different neighbors should participate in propagation. Concretely, RoleMAG distinguishes whether a neighbor should provide shared, complementary, or heterophilous signals, and routes them through separate propagation channels. This enables cross-modal completion from complementary neighbors while keeping heterophilous ones out of shared smoothing. Extensive experiments on three graph-centric MAG benchmarks show that RoleMAG achieves the best results on RedditS and Bili_Dance, while remaining competitive on Toys. Ablation, robustness, and efficiency analyses further support the effectiveness of the proposed role-aware propagation design. Our code is available at https://anonymous.4open.science/r/RoleMAG-7EE0/

Keywords: Multimodal attributed graphs, Role-aware propagation, Neighbor roles, Cross-modal completion, Graph neural networks, Modality-specific signals, Message passing, Benchmark evaluation

306. ❌ Decentralized Learning via Random Walk with Jumps

Authors: Zonghong Liu, Matthew Dwyer, Salim El Rouayheb Venue/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12260v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

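Scoring rationale: The paper studies the optimization of random-walk algorithms for decentralized learning over networks; its core concerns are communication efficiency, convergence speed, and network topology. It does not involve large models, the technical principles of deep learning, or AI applications in science. All keywords relate to LLM techniques, training methods, inference optimization, alignment, and applications, whereas this paper belongs to distributed machine learning and optimization algorithms, so it is unrelated to every keyword.

!!! tip deepseek-chat TL;DR

The paper shows that weighted random-walk algorithms in decentralized learning can become trapped in a local region of the network (entrapment), and it proposes a Metropolis-Hastings algorithm with Levy jumps that restores exploration and significantly speeds up convergence.

Abstract (Chinese translation)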

我们研究去中心化网络学习，其中数据分散在各节点且不存在中央协调器。随机游走学习是一种基于令牌的方法：单个模型在网络中传播，并在每个访问节点利用本地数据进行更新，从而降低通信与计算开销。在加权随机游走学习中，通过设计转移矩阵以实现期望的采样分布，从而加速数据异质条件下的收敛。我们发现，通过Metropolis-Hastings算法实现加权采样可能导致一种先前未被探索的现象——我们称之为“困陷”。随机游走可能被困于网络的小范围区域内，导致更新高度相关并严重降低收敛速度。为解决此问题，我们提出带莱维跳跃的Metropolis-Hastings方法，该方法在尊重本地信息约束的同时引入偶发的长程转移以恢复探索能力。我们建立了一个收敛速率分析框架，明确刻画了数据异质性、网络谱间隙和跳跃概率的作用，并通过实验证明MHLJ方法能有效消除困陷现象，显著加速去中心化学习过程。

Abstract

We study decentralized learning over networks where data are distributed across nodes without a central coordinator. Random walk learning is a token-based approach in which a single model is propagated across the network and updated at each visited node using local data, thereby incurring low communication and computational overheads. In weighted random-walk learning, the transition matrix is designed to achieve a desired sampling distribution, thereby speeding up convergence under data heterogeneity. We show that implementing weighted sampling via the Metropolis-Hastings algorithm can lead to a previously unexplored phenomenon we term entrapment. The random walk may become trapped in a small region of the network, resulting in highly correlated updates and severely degraded convergence. To address this issue, we propose Metropolis-Hastings with Levy jumps, which introduces occasional long-range transitions to restore exploration while respecting local information constraints. We establish a convergence rate that explicitly characterizes the roles of data heterogeneity, network spectral gap, and jump probability, and demonstrate through experiments that MHLJ effectively eliminates entrapment and significantly speeds up decentralized learning.
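
The abstract describes a weighted random walk driven by Metropolis-Hastings, with occasional long-range jumps added to escape entrapment. The following is a minimal sketch of that idea, not the authors' MHLJ algorithm: the uniform-neighbor proposal, the truncated power-law hop count, the parameter names (`jump_prob`, `alpha`, `max_hops`), and the toy graph are all assumptions made for illustration. Node weights are assumed to be strictly positive.

```python
import random

def mh_step(graph, weights, node, rng):
    """One Metropolis-Hastings move targeting a distribution proportional to `weights`."""
    proposal = rng.choice(graph[node])
    # Uniform-neighbor proposal q(i, j) = 1 / deg(i), so the acceptance ratio is
    # (w_j * deg(i)) / (w_i * deg(j)).
    ratio = (weights[proposal] * len(graph[node])) / (weights[node] * len(graph[proposal]))
    return proposal if rng.random() < min(1.0, ratio) else node

def levy_length(rng, alpha=1.5, max_hops=8):
    """Draw a hop count from a truncated power law P(d) ~ d^(-alpha) (illustrative choice)."""
    lengths = list(range(1, max_hops + 1))
    probs = [d ** -alpha for d in lengths]
    return rng.choices(lengths, weights=probs, k=1)[0]

def walk(graph, weights, start, steps, jump_prob, seed=0):
    """Run the walk and count how many model updates each node would receive."""
    rng = random.Random(seed)
    node, visits = start, {v: 0 for v in graph}
    for _ in range(steps):
        if rng.random() < jump_prob:
            # Jump: chain several unweighted hops, using only local neighbor lists.
            for _ in range(levy_length(rng)):
                node = rng.choice(graph[node])
        else:
            node = mh_step(graph, weights, node, rng)
        visits[node] += 1
    return visits

# Hypothetical path graph: node 2 has a tiny weight, so plain MH rarely crosses it.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
weights = {0: 10.0, 1: 10.0, 2: 0.01, 3: 10.0, 4: 10.0}
print(walk(graph, weights, start=0, steps=2000, jump_prob=0.0))
print(walk(graph, weights, start=0, steps=2000, jump_prob=0.1))
```

Comparing the two visit histograms gives a rough feel for how a low-weight bottleneck can confine the walk and how jumps spread visits out. Note that jumps of this kind also perturb the stationary distribution; how the paper accounts for that is not covered by this sketch.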

Keywords: decentralized learning, random walk, Metropolis-Hastings, entrapment, Levy jumps, convergence rate, network spectral gap, data heterogeneity

307. ❌ Socrates Loss: Unifying Confidence Calibration and Classification by Leveraging the Unknown

Authors: Sandra Gómez-Gálvez, Tobias Olenyi, Gillian Dobbie, Katerina Taškova Venue/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12245v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

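Scoring rationale: The paper addresses confidence calibration in deep neural networks (DNNs). It proposes Socrates Loss, a unified loss function that optimizes classification and calibration together by introducing an auxiliary unknown class and a dynamic uncertainty penalty. This is an innovation in the technical principles of deep learning, but every keyword targets large language models (LLMs) or specific AI application areas such as AI for science. The paper involves none of the listed techniques (LLMs, MoE, quantization, inference acceleration, alignment, RAG, agents) and is not applied to scientific fields such as bioinformatics. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Socrates Loss, a new loss function that jointly optimizes the classification performance and confidence calibration of deep neural networks by introducing an unknown class and a dynamic uncertainty penalty, addressing the trade-off between training stability and performance in existing methods.

Abstract (Chinese translation)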

深度神经网络尽管具有高精度，却常表现出较差的置信度校准，这限制了其在关键应用中的可靠性。当前临时的置信度校准方法试图在训练过程中解决此问题，但面临一个根本性的权衡：两阶段训练方法以训练不稳定和较差的置信度校准为代价实现了强大的分类性能，而单损失方法虽稳定却在分类任务上表现欠佳。本文旨在应对并缓解这一稳定性与性能之间的权衡。我们提出苏格拉底损失（Socrates Loss），这是一种新颖、统一的损失函数，它通过引入一个辅助的未知类别来显式利用不确定性，该未知类别的预测直接影响损失函数及动态不确定性惩罚。这一统一目标使得模型能够同时针对分类和置信度校准进行优化，而无需依赖复杂且需调度的损失函数所带来的不稳定性。我们提供了理论保证，证明我们的方法能够正则化模型以防止校准错误和过拟合。在四个基准数据集和多种架构上的综合实验表明，苏格拉底损失持续提升了训练稳定性，同时实现了更优的精度-校准权衡，且通常比现有方法收敛更快。

Abstract

Deep neural networks, despite their high accuracy, often exhibit poor confidence calibration, limiting their reliability in high-stakes applications. Current ad-hoc confidence calibration methods attempt to fix this during training but face a fundamental trade-off: two-phase training methods achieve strong classification performance at the cost of training instability and poorer confidence calibration, while single-loss methods are stable but underperform in classification. This paper addresses and mitigates this stability-performance trade-off. We propose Socrates Loss, a novel, unified loss function that explicitly leverages uncertainty by incorporating an auxiliary unknown class, whose predictions directly influence the loss function and a dynamic uncertainty penalty. This unified objective allows the model to be optimized for both classification and confidence calibration simultaneously, without the instability of complex, scheduled losses. We provide theoretical guarantees that our method regularizes the model to prevent miscalibration and overfitting. Across four benchmark datasets and multiple architectures, our comprehensive experiments demonstrate that Socrates Loss consistently improves training stability while achieving more favorable accuracy-calibration trade-off, often converging faster than existing methods.

Keywords: confidence calibration, deep neural networks, loss function, unknown class, training stability, classification performance, uncertainty penalty, accuracy-calibration trade-off

308. ❌ A Residual-Shell-Based Lower Bound for Ollivier-Ricci Curvature

Authors: Xiang Gu, Huichun Zhang, Jian Sun Venue/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12211v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

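Scoring rationale: The paper studies how to compute Ollivier-Ricci curvature on graphs more efficiently, which belongs to pure mathematics and computational geometry. It is unrelated to every scoring keyword, all of which concern large models, deep learning, and their applied techniques. The paper covers no artificial intelligence, machine learning, or large-model content, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

To address the high computational cost of Ollivier-Ricci curvature, the paper proposes a tight residual-shell-based lower bound that significantly improves approximation accuracy while remaining computationally efficient.

Abstract (Chinese translation)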

奥利维耶-里奇曲率（Ollivier-Ricci curvature，简称ORC）通过捕捉丰富几何信息的瓦瑟斯坦距离（Wasserstein distance）定义，在理论与应用领域日益受到关注。然而，瓦瑟斯坦距离计算的高昂成本严重限制了ORC更广泛的实际应用。为缓解此问题，先前研究基于单跳随机游走提出了一种计算高效的下界作为ORC的近似替代，但该方法在实证中与精确ORC存在显著差距。本文为ORC建立了一个比现有下界更严格的下界，同时保持了远低于精确ORC计算的计算成本，实际加速可达数十倍。此外，我们的下界不仅限于单跳随机游走，也适用于k跳随机游走（k > 1）。在多种基础图结构上的实验表明，我们的下界在近似精度与计算效率方面均表现出优越性。

Abstract

Ollivier-Ricci curvature (ORC), defined via the Wasserstein distance that captures rich geometric information, has received growing attention in both theory and applications. However, the high computational cost of Wasserstein distance evaluation has significantly limited the broader practical use of ORC. To alleviate this issue, previous work introduced a computationally efficient lower bound as a proxy for ORC based on 1-hop random walks, but this approach empirically exhibits large gaps from the exact ORC. In this paper, we establish a substantially tighter lower bound for ORC than the existing lower bound, while retaining much lower computational cost than exact ORC computation, with practical speedups of tens of times. Moreover, our bound is not restricted to 1-hop random walks, but also applies to k-hop random walks (k > 1). Experiments on several fundamental graph structures demonstrate the effectiveness of our bound in terms of both approximation accuracy and computational efficiency.

Keywords: Ollivier-Ricci curvature, Wasserstein distance, lower bound, computational efficiency, graph structures, random walks, approximation accuracy, residual-shell-based

309. ❌ LLM-Enhanced Log Anomaly Detection: A Comprehensive Benchmark of Large Language Models for Automated System Diagnostics

Authors: Disha Patel Venue/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12218v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

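Scoring rationale: The paper's core topic is the application of large language models (LLMs) to log anomaly detection. It is highly relevant to "Large Language Models" (10) because it explicitly evaluates LLMs such as GPT-3.5, GPT-4, and LLaMA-3. It is partly relevant to "In-context Learning" (5) because it uses prompting in zero-shot and few-shot settings, which falls under in-context learning. Other keywords, such as MoE, SFT, RAG, and quantization, are neither mentioned nor touched on in the abstract, so they score 0.

!!! tip deepseek-chat TL;DR

The paper systematically evaluates LLMs against traditional methods for log anomaly detection. Fine-tuned transformer models reach the highest F1-scores (0.96-0.99), while prompt-based LLMs perform well in zero-shot settings (F1: 0.82-0.91) and need no labeled data.

Abstract (Chinese translation)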

系统日志异常检测对于维持大规模软件系统的可靠性至关重要，然而传统方法难以应对现代日志数据的异构性与动态演化特性。大型语言模型（LLMs）的最新进展为日志理解提供了前景广阔的新途径，但目前仍缺乏对基于LLM的方法与现有技术之间的系统性比较。本文通过一项全面的基准研究，在四个广泛使用的公开数据集（HDFS、BGL、Thunderbird和Spirit）上评估了基于LLM的方法与传统方法在日志异常检测中的表现。我们评估了三类方法：（1）经典日志解析器（Drain、Spell、AEL）结合机器学习分类器，（2）微调的Transformer模型（BERT、RoBERTa），以及（3）基于提示的LLM方法（GPT-3.5、GPT-4、LLaMA-3）在零样本和少样本设置下的性能。实验结果表明，尽管微调的Transformer模型取得了最高的F1分数（0.96-0.99），但基于提示的LLM在无需任何标注训练数据的情况下展现了卓越的零样本能力（F1：0.82-0.91）——这对于标注异常稀缺的实际部署场景而言是一个显著优势。我们进一步分析了各类方法在成本与精度之间的权衡、延迟特性以及故障模式。本研究结果为从业者根据其在精度、延迟、成本和标签可用性方面的具体约束选择日志异常检测方法提供了可操作的指导。所有代码与实验配置均已公开，以促进可复现性。

Abstract

System log anomaly detection is critical for maintaining the reliability of large-scale software systems, yet traditional methods struggle with the heterogeneous and evolving nature of modern log data. Recent advances in Large Language Models (LLMs) offer promising new approaches to log understanding, but a systematic comparison of LLM-based methods against established techniques remains lacking. In this paper, we present a comprehensive benchmark study evaluating both LLM-based and traditional approaches for log anomaly detection across four widely-used public datasets: HDFS, BGL, Thunderbird, and Spirit. We evaluate three categories of methods: (1) classical log parsers (Drain, Spell, AEL) combined with machine learning classifiers, (2) fine-tuned transformer models (BERT, RoBERTa), and (3) prompt-based LLM approaches (GPT-3.5, GPT-4, LLaMA-3) in zero-shot and few-shot settings. Our experiments reveal that while fine-tuned transformers achieve the highest F1-scores (0.96-0.99), prompt-based LLMs demonstrate remarkable zero-shot capabilities (F1: 0.82-0.91) without requiring any labeled training data – a significant advantage for real-world deployment where labeled anomalies are scarce. We further analyze the cost-accuracy trade-offs, latency characteristics, and failure modes of each approach. Our findings provide actionable guidelines for practitioners choosing log anomaly detection methods based on their specific constraints regarding accuracy, latency, cost, and label availability. All code and experimental configurations are publicly available to facilitate reproducibility.

Keywords: log anomaly detection, Large Language Models, benchmark, zero-shot learning, few-shot learning, system diagnostics, GPT-4, LLaMA-3

310. ❌ Clustering-Enhanced Domain Adaptation for Cross-Domain Intrusion Detection in Industrial Control Systems

Authors: Luyao Wang Venue/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12183v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

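Scoring rationale: The paper studies cross-domain intrusion detection in industrial control systems and proposes a clustering-enhanced domain adaptation method. It is unrelated to most large-model keywords (such as LLMs, MoE, and RLHF), which concern LLM architecture, training and alignment, and inference optimization, whereas this paper focuses on conventional machine learning/deep learning domain adaptation and clustering. The most relevant keyword is "Pre-training OR Continual Pre-training OR Domain Adaptation" (10), because domain adaptation is the paper's core technique. "AI for Science OR Bioinformatics OR Cheminformatics" (5) is partly relevant because industrial control systems are an application domain, although the paper does not mention bioinformatics or cheminformatics. No other keyword applies.

!!! tip deepseek-chat TL;DR

The paper proposes a clustering-enhanced domain adaptation method to address data scarcity and domain shift in cross-domain intrusion detection for industrial control systems. Experiments show that it significantly improves the accuracy and stability of unknown-attack detection.

Abstract (Chinese translation)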

工业控制系统运行于动态环境中，其流量分布随场景变化、标记样本有限且未知攻击频繁出现，这对跨域入侵检测提出了重大挑战。为解决该问题，本文提出一种面向工业控制流量的聚类增强域适应方法。该框架包含两个核心部分。首先，基于特征的迁移学习模块通过谱变换特征对齐将源域和目标域映射至共享潜在子空间，并迭代减少分布差异，从而实现精确的跨域检测。其次，聚类增强策略将K-Medoids聚类与基于主成分分析（PCA）的降维相结合，以提升跨域相关性估计能力，并减少人工参数调优导致的性能下降。实验结果表明，所提方法显著提升了未知攻击检测能力。与五种基线模型相比，其检测准确率最高提升49%，F分数获得更大增益，并展现出更强的稳定性。此外，聚类增强策略在典型任务上进一步将检测准确率最高提升26%。这些结果表明，所提方法有效缓解了数据稀缺和域偏移问题，为动态工业环境中鲁棒的跨域入侵检测提供了实用解决方案。

Abstract

Industrial control systems operate in dynamic environments where traffic distributions vary across scenarios, labeled samples are limited, and unknown attacks frequently emerge, posing significant challenges to cross-domain intrusion detection. To address this issue, this paper proposes a clustering-enhanced domain adaptation method for industrial control traffic. The framework contains two key components. First, a feature-based transfer learning module projects source and target domains into a shared latent subspace through spectral-transform-based feature alignment and iteratively reduces distribution discrepancies, enabling accurate cross-domain detection. Second, a clustering enhancement strategy combines K-Medoids clustering with PCA-based dimensionality reduction to improve cross-domain correlation estimation and reduce performance degradation caused by manual parameter tuning. Experimental results show that the proposed method significantly improves unknown attack detection. Compared with five baseline models, it increases detection accuracy by up to 49%, achieves larger gains in F-score, and demonstrates stronger stability. Moreover, the clustering enhancement strategy further boosts detection accuracy by up to 26% on representative tasks. These results suggest that the proposed method effectively alleviates data scarcity and domain shift, providing a practical solution for robust cross-domain intrusion detection in dynamic industrial environments.
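
The clustering enhancement step named in the abstract builds on K-Medoids. As a rough illustration of that building block only, here is a minimal alternating K-Medoids sketch in plain Python. It is not the paper's pipeline: the PCA step and the cross-domain correlation estimate are omitted, and the first-k initialization, the Euclidean distance, and the toy points are assumptions. Points are assumed to be distinct so that no cluster ends up empty.

```python
import math

def k_medoids(points, k, max_iter=100):
    """Alternate between assigning points to medoids and re-picking each cluster's medoid."""
    medoids = list(range(k))  # naive deterministic init: first k points (assumption)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(math.dist(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids), clusters

# Hypothetical 2-D traffic features forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, clusters = k_medoids(points, k=2)
print(medoids, clusters)
```

Unlike k-means, each cluster center here is an actual data point, which is one common reason to prefer medoids when features contain outliers.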

Keywords: Domain Adaptation, Cross-domain Intrusion Detection, Industrial Control Systems, Clustering Enhancement, Feature Alignment, Traffic Analysis, Unknown Attack Detection, Transfer Learning

311. ❌ CycloneMAE: A Scalable Multi-Task Learning Model for Global Tropical Cyclone Probabilistic Forecasting

Authors: Renlong Hang, Zihao Xu, Jiuwei Zhao, Runling Yu, Leye Cheng, Qingshan Liu Venue/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12180v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

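Scoring rationale: CycloneMAE targets probabilistic tropical cyclone forecasting, which is AI for science (meteorology), so it is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10). The model follows a pre-train/fine-tune paradigm, which makes it highly relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (10) and, through fine-tuning, partly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (5). Its attribution analysis provides interpretability, giving partial relevance to "Mechanistic Interpretability OR Explainable AI" (5). The remaining keywords mainly concern LLM techniques, reasoning methods, alignment, and compression, have no direct connection to this deep learning weather forecasting model, and score 0.

!!! tip deepseek-chat TL;DR

The study proposes CycloneMAE, a scalable multi-task learning model for global probabilistic tropical cyclone forecasting. Using a pre-train/fine-tune paradigm and a structure-aware masked autoencoder, it outperforms leading numerical weather prediction systems on pressure and wind forecasts (up to 120 hours) across five global ocean basins.

Abstract (Chinese translation)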

热带气旋（TC）是最具破坏性的自然灾害之一，但其预测面临根本性的权衡：数值天气预报（NWP）模型计算成本高昂且难以有效利用历史数据，而现有基于深度学习（DL）的智能模型则局限于单一变量且为确定性预测，无法泛化至不同的预报变量。本文提出CycloneMAE，一种可扩展的多任务预报模型，它通过一种TC结构感知的掩码自编码器从多模态数据中学习可迁移的TC表征。通过将离散概率网格化机制与预训练/微调范式相结合，CycloneMAE能够同时提供确定性预报和概率分布。在五个全球大洋盆地的评估中，CycloneMAE在气压和风速预报上（长达120小时）及路径预报上（长达24小时）均优于主流NWP系统。通过积分梯度进行的归因分析揭示了具有物理可解释性的学习动态：短期预报主要依赖于卫星图像中的内部核心对流结构，而长期预报则逐渐将注意力转向外部环境因子。我们的框架为业务化TC预报建立了一条可扩展、概率化且可解释的路径。

Abstract

Tropical cyclones (TCs) rank among the most destructive natural hazards, yet their forecasting faces fundamental trade-offs: numerical weather prediction (NWP) models are computationally prohibitive and struggle to leverage historical data, while existing deep learning (DL)-based intelligent models are variable-specific and deterministic, which fail to generalize across different forecasting variables. Here we present CycloneMAE, a scalable multi-task forecasting model that learns transferable TC representations from multi-modal data using a TC structure-aware masked autoencoder. By coupling a discrete probabilistic gridding mechanism with a pre-train/fine-tune paradigm, CycloneMAE simultaneously delivers deterministic forecasts and probability distributions. Evaluated across five global ocean basins, CycloneMAE outperforms leading NWP systems in pressure and wind forecasting up to 120 hours and in track forecasting up to 24 hours. Attribution analysis via integrated gradients reveals physically interpretable learning dynamics: short-term forecasts rely predominantly on the internal core convective structure from satellite imagery, whereas longer-term forecasts progressively shift attention to external environmental factors. Our framework establishes a scalable, probabilistic, and interpretable pathway for operational TC forecasting.
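
The attribution analysis mentioned in the abstract uses integrated gradients. For readers unfamiliar with the method, the sketch below approximates it for a small scalar function: the attribution of input i is (x_i - b_i) times the average gradient along the straight line from a baseline b to the input x. The toy function `f`, the zero baseline, the finite-difference gradient, and the step count are assumptions for illustration; the paper applies the method to its own forecasting network.

```python
def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def integrated_gradients(f, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum over the path."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_grad(f, point)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

# Hypothetical "model": quadratic in the first input, linear in the second.
def f(v):
    return v[0] ** 2 + 3 * v[1]

attributions = integrated_gradients(f, [2.0, 1.0], [0.0, 0.0])
print(attributions, sum(attributions), f([2.0, 1.0]) - f([0.0, 0.0]))
```

A useful sanity check is the completeness property: the attributions should sum approximately to the difference between the function value at the input and at the baseline.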

Keywords: tropical cyclone forecasting, probabilistic forecasting, masked autoencoder, multi-task learning, pre-train/fine-tune, interpretable AI, satellite imagery, numerical weather prediction

312. ❌ PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving

Authors: Xu Bai, Muhammed Tawfiqul Islam, Chen Wang, Adel N. Toosi Venue/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12171v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

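Scoring rationale: PipeLive addresses a systems optimization problem in LLM inference serving, namely live reconfiguration of pipeline parallelism (PP) in dynamic environments. Its core contributions are (1) live KV cache resizing through a redesigned KV cache layout and an extension to PageAttention, and (2) an incremental KV patching mechanism that synchronizes KV state between the source and target configurations to minimize disruption. The paper is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10), since LLMs are the core object of study. It is highly relevant to "KV Cache Compression OR Linear Attention OR FlashAttention" (10), since its core innovation concerns KV cache management and optimization (KV cache re-layout and live resizing). It is highly relevant to "Speculative Decoding OR Inference Acceleration" (10), since it aims to reduce time-to-first-token (TTFT) and time-per-output-token (TPOT), directly improving LLM inference latency and efficiency. The other keywords mainly concern model training, alignment, application domains, or specific reasoning techniques (such as chain of thought and agents), which are unrelated to this paper's system-level focus, so they score 0.

!!! tip deepseek-chat TL;DR

PipeLive addresses the inability to adjust pipeline parallelism (PP) configurations at runtime when serving LLMs in dynamic environments. By redesigning the KV cache layout and introducing an incremental KV patching mechanism, it enables live, low-overhead PP reconfiguration and significantly reduces time-to-first-token and time-per-output-token.

Abstract (Chinese translation)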

流水线并行（Pipeline Parallelism, PP）被广泛用于将大语言模型（Large Language Models, LLMs）的各层划分到多个GPU上，从而实现对大规模模型的可扩展推理。然而，现有系统依赖于静态的PP配置，无法适应动态环境，例如无服务器平台和异构GPU环境。通过停止并重新部署服务来重新配置PP会导致难以接受的服务中断，因此重新配置必须在不中断推理的情况下在线原位进行。然而，在线原位PP重新配置面临着根本性的挑战。GPU已被模型权重和KV缓存占满，几乎没有空间容纳新的层放置，因此必须调整KV缓存大小，这与vLLM等为吞吐量而预分配的系统设计相悖。此外，在执行过程中保持KV一致性十分困难：停止-复制方法会引入较长的停顿，而后台同步则因状态持续演变而存在不一致的风险。我们提出了PipeLive，它能够以最小干扰实现在线原位PP重新配置。PipeLive引入了一种重新设计的KV缓存布局，并与PageAttention协同设计了一个扩展机制，共同形成了一个用于在线KV缓存大小调整的统一方案。此外，受在线虚拟机迁移的启发，它采用了一种增量式KV修补机制，用于在源配置和目标配置之间同步KV状态，并确定一个安全的切换点。与禁用KV缓存大小调整的方案相比，PipeLive在避免KV缓存溢出的情况下，将首词延迟（Time-To-First-Token, TTFT）降低了2.5倍。此外，与没有KV修补机制的变体方案相比，它将重新配置开销从数秒降低到10毫秒以下，并将TTFT和单输出词延迟（Time-Per-Output-Token, TPOT）分别提升了最高54.7%和14.7%。

Abstract

Pipeline parallelism (PP) is widely used to partition layers of large language models (LLMs) across GPUs, enabling scalable inference for large models. However, existing systems rely on static PP configurations that fail to adapt to dynamic settings, such as serverless platforms and heterogeneous GPU environments. Reconfiguring PP by stopping and redeploying service incurs prohibitive downtime, so reconfiguration must instead proceed live and in place, without interrupting inference. However, live in-place PP reconfiguration is fundamentally challenging. GPUs are already saturated with model weights and KV cache, leaving little room for new layer placements and necessitating KV cache resizing, at odds with systems like vLLM that preallocate for throughput. Moreover, maintaining KV consistency during execution is difficult: stop-and-copy introduces large pauses, while background synchronization risks inconsistency as states evolve. We present PipeLive, which enables live in-place PP reconfiguration with minimal disruption. PipeLive introduces a redesigned KV cache layout together with a co-designed extension to PageAttention, forming a unified mechanism for live KV resizing. It further adopts an incremental KV patching mechanism, inspired by live virtual machine migration, to synchronize KV states between source and target configurations and identify a safe switch point. PipeLive achieves a 2.5X reduction in time-to-first-token (TTFT) without KV cache overflow compared to disabling KV resizing. Furthermore, compared to a variant without KV patching, it reduces reconfiguration overhead from seconds to under 10ms, and improves TTFT and time-per-output-token (TPOT) by up to 54.7% and 14.7%, respectively.
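
The abstract says the incremental KV patching mechanism is inspired by live virtual machine migration. The sketch below illustrates the general pre-copy idea behind that analogy, not PipeLive's implementation: copy everything once while the source keeps serving, then repeatedly re-copy only the entries that changed during the previous pass, and switch over once the stale set is small. The dictionary model of KV blocks, the `write_rounds` input, and `switch_threshold` are hypothetical.

```python
def live_sync(source, write_rounds, switch_threshold=1):
    """Pre-copy style sync from `source` to a new target while writes keep arriving.

    `write_rounds` is an iterable of dicts; each dict holds the writes that land
    on the source while one copy pass is in flight (hypothetical input model).
    """
    target = {}
    dirty = set(source)  # before the first pass, every entry is stale
    passes = 0
    writes_iter = iter(write_rounds)
    while True:
        snapshot = {key: source[key] for key in dirty}  # values copied in this pass
        incoming = next(writes_iter, {})                # writes landing during the pass
        target.update(snapshot)
        source.update(incoming)
        dirty = set(incoming)                           # only these are stale now
        passes += 1
        if len(dirty) <= switch_threshold:
            # Safe switch point: pause briefly, patch the small remainder, cut over.
            for key in dirty:
                target[key] = source[key]
            return target, passes

# Hypothetical KV blocks and two rounds of concurrent writes.
source = {"blk0": 1, "blk1": 1, "blk2": 1}
target, passes = live_sync(source, [{"blk0": 2, "blk1": 2}, {"blk2": 3}])
print(target, passes)
```

The design trade-off is between pause length and convergence: a larger `switch_threshold` cuts over sooner but patches more entries during the pause. If writes never slow down, this simplified loop would not terminate, which a real system would need to guard against.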

Keywords: Large Language Models, Pipeline Parallelism, KV Cache, Inference Serving, Live Reconfiguration, PageAttention, Time-to-First-Token, Time-per-Output-Token
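
The abstract describes incremental KV patching as inspired by live virtual machine migration. A minimal sketch of that pre-copy idea follows, assuming a toy KV store keyed by block id and a hypothetical `step` callback that advances decoding and reports which blocks it modified. The names `incremental_patch`, `make_decoder`, and `switch_threshold` are illustrative assumptions, not taken from the paper, and this is not PipeLive's implementation.

```python
def incremental_patch(source, target, step, switch_threshold=1, max_rounds=10):
    """Pre-copy style sync: copy dirty blocks while decoding continues,
    then switch once the dirty set is small enough to copy in a short pause.

    source, target: dict mapping block id -> list of cached entries (toy KV cache)
    step: hypothetical callback; runs one decoding step on `source` and
          returns the set of block ids it modified
    Returns the number of background copy rounds performed.
    """
    dirty = set(source)  # initially every block is out of date on the target
    rounds = 0
    while len(dirty) > switch_threshold and rounds < max_rounds:
        for block in dirty:
            target[block] = list(source[block])  # copy, do not alias
        # Decoding is simulated sequentially here; in a real system it would
        # run concurrently with the copy and dirty blocks in the meantime.
        dirty = step(source)
        rounds += 1
    # Switch point: the remaining dirty set is small, copy it in one pause.
    for block in dirty:
        target[block] = list(source[block])
    return rounds


def make_decoder():
    """Toy decoder: each step appends one entry to the newest block."""
    state = {"t": 0}

    def step(kv):
        state["t"] += 1
        newest = max(kv)
        kv[newest] = kv[newest] + [state["t"]]
        return {newest}

    return step


source = {i: [i] for i in range(4)}
target = {}
rounds = incremental_patch(source, target, make_decoder())
```

The design choice mirrored here is that the long bulk copy happens while serving continues, and only the small residual set is copied at the switch point, which is what keeps the pause short.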

313. ❌ PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

Authors: Anupam Nayak, Baris Askin, Muhammed Ustaomeroglu, Carlee Joe-Wong, Gauri Joshi Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12160v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

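Scoring rationale: The paper proposes a federated RLVR framework. Its core involves reasoning post-training (Post-training/SFT), reinforcement-learning-based alignment (RLHF/DPO), parameter-efficient fine-tuning (LoRA/PEFT), and mathematical and medical reasoning (CoT/System 2 Thinking), with application to a scientific domain (AI for Science). Other keywords such as MoE, quantization, and RAG are not covered in the paper.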

!!! tip deepseek-chat TL;DR

The paper addresses communication cost and client drift in federated reasoning post-training (RLVR). It combines LoRA-based local adaptation with public-data-based off-policy coordination and consistently improves over standard baselines on mathematical and medical reasoning benchmarks.

Abstract translation (Chinese)

基于可验证奖励的强化学习（RLVR）推理后训练通常在集中式场景下研究，但许多现实应用涉及分布于不同机构的分散式私有数据。联邦训练是一种自然的解决方案，但在此场景下扩展RLVR面临挑战：全模型同步成本高昂，且在异构数据下执行多步本地更新会导致严重的客户端漂移。我们提出一种联邦RLVR框架，结合基于LoRA的本地适配与基于公共数据的离策略步骤，以提升通信效率与跨客户端协调能力。具体而言，该方法利用少量共享公共数据集周期性地在机构间交换和复用响应级训练信号，从而在不暴露私有数据的前提下，为更全局对齐的目标提供轻量级锚点。在公共数据步骤中，我们的方法选择性地用全局正确响应替换本地错误响应，使训练更贴近本地策略，同时仍受益于跨客户端协调。在数学与医学推理基准测试及多种模型上的实验表明，该方法持续优于标准基线。我们的研究结果揭示了一种简单有效的联邦推理后训练方案：将低秩通信与有限的公共数据协调相结合。

Abstract

Reasoning post-training with reinforcement learning from verifiable rewards (RLVR) is typically studied in centralized settings, yet many realistic applications involve decentralized private data distributed across organizations. Federated training is a natural solution, but scaling RLVR in this regime is challenging: full-model synchronization is expensive, and performing many local steps can cause severe client drift under heterogeneous data. We propose a federated RLVR framework that combines LoRA-based local adaptation with public-data-based off-policy steps to improve both communication efficiency and cross-client coordination. In particular, a small shared public dataset is used to periodically exchange and reuse response-level training signals across organizations, providing a lightweight anchor toward a more globally aligned objective without exposing private data. Our method selectively replaces locally incorrect responses with globally correct ones during public-data steps, thereby keeping training closer to the local policy while still benefiting from cross-client coordination. Across mathematical and medical reasoning benchmarks and models, our method consistently improves over standard baselines. Our results highlight a simple and effective recipe for federated reasoning post-training: combining low-rank communication with limited public-data coordination.

Keywords: Federated Learning, Reinforcement Learning from Verifiable Rewards (RLVR), LoRA, Post-training, Reasoning, Medical Reasoning, Mathematical Reasoning, Public-data Coordination
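
The abstract states that locally incorrect responses are selectively replaced with globally correct ones during public-data steps. A minimal sketch of that selection rule follows, assuming each public prompt has one local response with a verifier verdict and that a pool of verified responses from other clients is available. The function name `swap_incorrect` and the data layout are hypothetical, and the sketch leaves out the actual RL update.

```python
def swap_incorrect(local, global_correct):
    """Build a training batch for one public-data step.

    local: dict prompt -> (response, is_correct) sampled from the local policy
    global_correct: dict prompt -> a response verified correct by some client
    Returns dict prompt -> (response, origin), origin in {"local", "global"}.
    """
    batch = {}
    for prompt, (response, is_correct) in local.items():
        if not is_correct and prompt in global_correct:
            # Off-policy sample borrowed from another client.
            batch[prompt] = (global_correct[prompt], "global")
        else:
            # Correct answers, or failures with no replacement, stay on-policy.
            batch[prompt] = (response, "local")
    return batch


local = {
    "2+2": ("4", True),
    "3*3": ("6", False),
    "7-5": ("1", False),
}
global_correct = {"3*3": "9", "2+2": "4"}
batch = swap_incorrect(local, global_correct)
```

Only the failed prompt with an available replacement is swapped, which keeps most of the batch on the local policy, as the abstract describes.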

314. ❌ Distinct mechanisms underlying in-context learning in transformers

Authors: Cole Gibson, Wenping Cui, Gautam Reddy Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12151v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	15.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

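Scoring rationale: The paper's core subject is the mechanism of in-context learning in transformers, so it is highly relevant to 'In-context Learning OR Many-shot Learning' (score 15). As work on the technical principles of large models, it is moderately relevant to 'Large Language Models OR LLMs OR Foundation Models' (score 8). It explains the in-context learning mechanism by analyzing transformer training dynamics and the loss landscape, so it is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (score 10). Other keywords such as MoE, SFT, RAG, and quantization are not covered in the paper and score 0.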

!!! tip deepseek-chat TL;DR

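The paper studies the mechanisms of in-context learning in transformers trained on discrete Markov chains. It finds that distinct subcircuits implement four algorithmic phases, and it explains the sharp transition from 1-point to 2-point generalization.

Abstract translation (Chinese)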

现代分布式网络（尤其是Transformer模型）获得了一种显著的能力（称为“上下文学习”），能够使其计算适应输入数据的统计特性，从而使一个固定网络能够应用于广泛系统的数据。本文对在有限离散马尔可夫链集合$S$上训练的Transformer中该行为的机制进行了完整的刻画。该Transformer表现出四种算法阶段，其特征取决于网络是否记忆与泛化，以及它使用的是单点统计量还是两点统计量。我们证明，这四种阶段由多层子电路实现，这些子电路例示了两种性质不同的机制来执行上下文自适应计算。最小模型分离出了两种模式的关键特征。记忆阶段与泛化阶段由两条边界划分，这两条边界取决于数据多样性$K = |S|$。第一条边界（$K_1^\ast$）由子电路之间的动力学竞争决定，第二条边界（$K_2^\ast$）则由表示瓶颈决定。一个受对称性约束的Transformer训练动力学理论解释了从单点泛化到两点泛化的急剧转变，并识别了损失景观中使网络能够泛化的关键特征。综上所述，我们表明Transformer通过发展不同的子电路来实现上下文学习，并确定了某些机制优于其他机制的条件。

Abstract

Modern distributed networks, notably transformers, acquire a remarkable ability (termed ‘in-context learning’) to adapt their computation to input statistics, such that a fixed network can be applied to data from a broad range of systems. Here, we provide a complete mechanistic characterization of this behavior in transformers trained on a finite set $S$ of discrete Markov chains. The transformer displays four algorithmic phases, characterized by whether the network memorizes and generalizes, and whether it uses 1-point or 2-point statistics. We show that the four phases are implemented by multi-layer subcircuits that exemplify two qualitatively distinct mechanisms for implementing context-adaptive computations. Minimal models isolate the key features of both motifs. Memorization and generalization phases are delineated by two boundaries that depend on data diversity, $K = |S|$. The first ($K_1^\ast$) is set by a kinetic competition between subcircuits and the second ($K_2^\ast$) is set by a representational bottleneck. A symmetry-constrained theory of a transformer’s training dynamics explains the sharp transition from 1-point to 2-point generalization and identifies key features of the loss landscape that allow the network to generalize. Put together, we show that transformers develop distinct subcircuits to implement in-context learning and identify conditions that favor certain mechanisms over others.

Keywords: in-context learning, transformers, mechanistic characterization, algorithmic phases, generalization, training dynamics, loss landscape, subcircuits
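
To make the distinction between 1-point and 2-point statistics concrete, here is a minimal sketch of two in-context predictors for a Markov chain. It assumes that "1-point" means state frequencies in the context and "2-point" means transition counts between consecutive states; that reading, and the function names, are assumptions for illustration rather than the paper's definitions.

```python
from collections import Counter


def one_point_predict(context, n_states):
    """Predict the next state from state frequencies in the context."""
    counts = Counter(context)
    total = len(context)
    return [counts[s] / total for s in range(n_states)]


def two_point_predict(context, n_states):
    """Predict the next state from transitions observed after the last state."""
    last = context[-1]
    follow = Counter(b for a, b in zip(context, context[1:]) if a == last)
    total = sum(follow.values())
    if total == 0:
        # The last state was never followed by anything: fall back to frequencies.
        return one_point_predict(context, n_states)
    return [follow[s] / total for s in range(n_states)]


context = [0, 1, 0, 1, 0, 1, 0]  # a strictly alternating chain
p1 = one_point_predict(context, 2)
p2 = two_point_predict(context, 2)
```

On the alternating chain the frequency-based predictor is nearly uniform, while the transition-based predictor puts all its mass on state 1, which is the kind of gap that separates the two generalizing strategies.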

315. ❌ XANE(3): An E(3)-Equivariant Graph Neural Network for Accurate Prediction of XANES Spectra from Atomic Structures

Authors: Vitor F. Grizzi, Luke N. Pretzie, Jiayi Xu, Cong Liu Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12140v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

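Scoring rationale: The paper develops XANE(3), an E(3)-equivariant graph neural network for predicting XANES spectra, which is an application of AI to scientific computing and materials science. It is unrelated to the large majority of keywords, which concern large language models, training techniques, inference optimization, and agents. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the work applies AI directly to a scientific field (materials science and spectroscopy), so it receives a score of 10 (highly relevant).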

!!! tip deepseek-chat TL;DR

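The study proposes XANE(3), a physics-based E(3)-equivariant graph neural network that predicts X-ray absorption near-edge structure (XANES) spectra directly from atomic structures. On a dataset of iron oxide surface facets it reaches a test-set spectrum mean squared error of $1.0 \times 10^{-3}$, offering a surrogate for accelerating spectral simulation and materials discovery.

Abstract translation (Chinese)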

我们提出XANE(3)，一种基于物理的E(3)-等变图神经网络，用于直接从原子结构预测X射线吸收近边结构（XANES）谱。该模型结合了张量积消息传递与球谐边特征、吸收体查询注意力池化、定制等变层归一化、自适应门控残差连接，以及基于多尺度高斯基函数并可选配S型背景项的光谱读出模块。为提升谱线形状保真度，训练采用复合目标函数，包含逐点光谱重构项以及一阶与二阶导数匹配项。我们在包含5,941个氧化铁表面晶面FDMNES模拟数据的数据集上评估模型，在测试集上获得$1.0 \times 10^{-3}$的光谱均方误差。该模型准确复现了主吸收边结构、相对峰强度、前边特征及后边振荡。消融研究表明，导数感知目标函数、定制等变归一化、吸收体条件注意力池化、自适应门控残差混合以及全局背景项均能提升模型性能。值得注意的是，一个容量匹配的纯标量变体模型虽能达到相当的逐点重构误差，但其导数级保真度有所下降，这表明在当前数据集上，显式张量通道并非实现低强度误差的严格必要条件，但其对捕捉更精细的光谱结构仍具优势。这些结果确立了XANE(3)作为XANES模拟的精确高效代理模型，为加速光谱预测、机器学习辅助谱学分析及数据驱动的材料发现提供了可行路径。

Abstract

We present XANE(3), a physics-based E(3)-equivariant graph neural network for predicting X-ray absorption near-edge structure (XANES) spectra directly from atomic structures. The model combines tensor-product message passing with spherical harmonic edge features, absorber-query attention pooling, custom equivariant layer normalization, adaptive gated residual connections, and a spectral readout based on a multi-scale Gaussian basis with an optional sigmoidal background term. To improve line-shape fidelity, training is performed with a composite objective that includes pointwise spectral reconstruction together with first- and second-derivative matching terms. We evaluate the model on a dataset of 5,941 FDMNES simulations of iron oxide surface facets and obtain a spectrum mean squared error of $1.0 \times 10^{-3}$ on the test set. The model accurately reproduces the main edge structure, relative peak intensities, pre-edge features, and post-edge oscillations. Ablation studies show that the derivative-aware objective, custom equivariant normalization, absorber-conditioned attention pooling, adaptive gated residual mixing, and global background term each improve performance. Interestingly, a capacity-matched scalar-only variant achieves comparable pointwise reconstruction error but reduced derivative-level fidelity, indicating that explicit tensorial channels are not strictly required for low intensity error on this dataset, although they remain beneficial for capturing finer spectral structure. These results establish XANE(3) as an accurate and efficient surrogate for XANES simulation and offer a promising route toward accelerated spectral prediction, ML-assisted spectroscopy, and data-driven materials discovery.

Keywords: XANES spectra prediction, E(3)-equivariant graph neural network, physics-based machine learning, tensor-product message passing, spectral reconstruction, materials discovery, AI for science, computational spectroscopy
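
The composite objective in the abstract combines pointwise reconstruction with first- and second-derivative matching. A minimal sketch follows, assuming spectra sampled on a uniform energy grid, finite differences as the derivative estimate, and hypothetical weights `w1` and `w2`. The paper's actual weighting and discretization are not given in the abstract.

```python
def diff(y):
    """First finite difference of a sequence on a uniform grid."""
    return [b - a for a, b in zip(y, y[1:])]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def composite_loss(pred, target, w1=1.0, w2=1.0):
    """Pointwise MSE plus weighted first- and second-derivative MSE."""
    d1_pred, d1_target = diff(pred), diff(target)
    d2_pred, d2_target = diff(d1_pred), diff(d1_target)
    return (
        mse(pred, target)
        + w1 * mse(d1_pred, d1_target)
        + w2 * mse(d2_pred, d2_target)
    )


target = [0, 1, 4, 9]
shifted = [1, 2, 5, 10]   # constant offset: same shape, wrong level
wiggled = [0, 2, 4, 9]    # small pointwise error, distorted shape

loss_shifted = composite_loss(shifted, target)
loss_wiggled = composite_loss(wiggled, target)
```

The shifted spectrum is penalized only by the pointwise term, while the wiggled spectrum has a smaller pointwise error but a larger total loss. That is the behavior a derivative-aware objective is meant to have when line shape matters.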

316. ❌ SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling

Authors: Zikun Liu, Liang Luo, Qianru Li, Zhengyu Zhang, Wei Ling, Jingyi Shen, Zeliang Chen, Yaning Huang, Jingxian Huang, Abdallah Aboelela, Chonglin Sun, Feifan Gu, Fenggang Wu, Hang Qu, Huayu Li, Jill Pan, Kaidi Pei, Laming Chen, Longhao Jin, Qin Huang, Tongyi Tang, Varna Puvvada, Wenlin Chen, Xiaohan Wei, Xu Cao, Yantao Yao, Yuan Jin, Yunchen Pu, Yuxin Chen, Zijian Shen, Zhengkai Zhang, Dong Liang, Ellie Wen Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12110v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

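Scoring rationale: SOLARIS addresses inference acceleration in recommendation systems. Its core contribution, inspired by speculative decoding, is to predict future user-item interactions and precompute their embeddings asynchronously, decoupling expensive foundation model inference from the latency-critical serving path. It is therefore highly relevant to 'Speculative Decoding OR Inference Acceleration' (score 10), since the paper explicitly cites that inspiration and focuses on inference acceleration. It is moderately relevant to 'Large Language Models OR LLMs OR Foundation Models' (score 8), because it deals with foundation models for recommendation, which are large models though not necessarily language models. It is weakly relevant to 'Scaling Laws AND Data Quality' (score 5), because the abstract opens with recommendation scaling laws but does not discuss data quality in depth. The remaining keywords are unrelated (score 0): the paper targets a specific inference optimization for recommendation systems and does not cover MoE, alignment, RAG, long context, quantization, AI for Science, or the other topics.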

!!! tip deepseek-chat TL;DR

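SOLARIS tackles the problem that foundation models for recommendation are too computationally expensive to serve in real time. Using an asynchronous precomputation framework inspired by speculative decoding, it achieves a 0.67% gain in revenue-driving top-line metrics in Meta's advertising system.

Abstract translation (Chinese)

推荐系统扩展定律的最新进展催生了空前复杂的基础模型。这些模型虽能提供卓越性能，但其计算需求使得实时服务难以实现，开发者往往被迫依赖知识蒸馏技术，为效率而牺牲服务质量。为解决这一挑战，我们提出SOLARIS（基于潜在表征推测卸载的推理扩展框架），该创新框架受推测解码技术启发，通过预测未来请求中可能出现的用户-物品对，并异步预生成其基础模型表征，从而主动预计算用户-物品交互嵌入。该方法将昂贵的基础模型推理过程从对延迟敏感的服务路径中解耦，使得先前因成本过高而无法在线使用的模型能够实现实时知识迁移。在Meta广告系统中部署SOLARIS处理每日数十亿请求的实践中，该系统实现了0.67%驱动收入的核心指标提升，证明了其大规模应用的有效性。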

Abstract

Recent advances in recommendation scaling laws have led to foundation models of unprecedented complexity. While these models offer superior performance, their computational demands make real-time serving impractical, often forcing practitioners to rely on knowledge distillation, compromising serving quality for efficiency. To address this challenge, we present SOLARIS (Speculative Offloading of Latent-bAsed Representation for Inference Scaling), a novel framework inspired by speculative decoding. SOLARIS proactively precomputes user-item interaction embeddings by predicting which user-item pairs are likely to appear in future requests, and asynchronously generating their foundation model representations ahead of time. This approach decouples the costly foundation model inference from the latency-critical serving path, enabling real-time knowledge transfer from models previously considered too expensive for online use. Deployed across Meta’s advertising system serving billions of daily requests, SOLARIS achieves 0.67% revenue-driving top-line metrics gain, demonstrating its effectiveness at scale.

Keywords: speculative decoding, inference scaling, foundation models, recommendation systems, latency optimization, asynchronous precomputation, real-time serving, advertising systems
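
The decoupling described in the abstract can be sketched as a cache that is filled ahead of time for predicted user-item pairs and only read on the serving path. The class name `SpeculativeEmbeddingCache`, the stand-in model functions, and the cheap fallback on a cache miss are all assumptions for illustration. The abstract does not say how mispredicted pairs are handled.

```python
class SpeculativeEmbeddingCache:
    """Precompute expensive embeddings off the serving path; serve from cache."""

    def __init__(self, foundation_model, fallback):
        self.foundation_model = foundation_model  # expensive, called offline only
        self.fallback = fallback                  # cheap, used on a cache miss
        self.store = {}

    def precompute(self, predicted_pairs):
        """Offline step: embed the pairs predicted to appear in future requests."""
        for pair in predicted_pairs:
            self.store[pair] = self.foundation_model(pair)

    def serve(self, pair):
        """Latency-critical path: never calls the foundation model."""
        if pair in self.store:
            return self.store[pair], "precomputed"
        return self.fallback(pair), "fallback"


calls = []


def foundation_model(pair):
    """Stand-in for an expensive model; records each invocation."""
    calls.append(pair)
    user, item = pair
    return [float(user), float(item)]


def fallback(pair):
    """Stand-in for a cheap default embedding."""
    return [0.0, 0.0]


cache = SpeculativeEmbeddingCache(foundation_model, fallback)
cache.precompute([(1, 10), (2, 20)])
hit = cache.serve((1, 10))
miss = cache.serve((3, 30))
```

In this toy version the precompute step runs synchronously; a production system would run it asynchronously and would also need an eviction and freshness policy, which is omitted here.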

317. ❌ Parametric Interpolation of Dynamic Mode Decomposition for Predicting Nonlinear Systems

Authors: Ananda Chakrabarti, Haitham H. Saleh, Indranil Nayak, Balasubramaniam Shanker, Fernando L. Teixeira, Debdipta Goswami Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12103v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

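Scoring rationale: The paper focuses on parameter-interpolated dynamic mode decomposition (piDMD), a parametric reduced-order modeling framework for predicting nonlinear systems. It covers scientific computing applications such as fluid dynamics and electromagnetic particle simulation, which fall under AI for Science, so it is moderately relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (score 5). The paper does not touch on large language models, deep learning techniques, training optimization, inference acceleration, alignment, or agent systems, so all other keywords score 0.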

!!! tip deepseek-chat TL;DR

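The paper proposes parameter-interpolated dynamic mode decomposition (piDMD) for building parametric reduced-order models that predict nonlinear dynamical systems. On benchmarks including fluid flow and electromagnetic particle-in-cell simulations, it gives accurate long-horizon predictions and better robustness than interpolation-based parametric DMD baselines.

Abstract translation (Chinese)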

本文提出参数插值动态模态分解（parameter-interpolated dynamic mode decomposition, piDMD），这是一种参数化降阶建模框架，它将已知的参数仿射结构直接嵌入到DMD回归步骤中。与现有通过插值模态、特征值或降阶算子的参数化DMD方法不同——这些方法在稀疏训练数据或多维参数空间中可能表现脆弱——piDMD能够在多个训练参数样本上学习一个单一的参数仿射Koopman代理降阶模型（reduced order model, ROM），并可在未见参数值处进行预测而无需重新训练。我们在圆柱绕流、横向磁场中的电子束振荡以及虚阴极振荡（后两者采用电磁粒子网格（electromagnetic particle-in-cell, EMPIC）方法进行模拟）等案例上验证了piDMD。在所有基准测试中，与当前基于插值的先进参数化DMD基线方法相比，piDMD以更少的训练样本并在多维参数空间下，实现了精确的长时程预测并展现出更强的鲁棒性。

Abstract

We present parameter-interpolated dynamic mode decomposition (piDMD), a parametric reduced-order modeling framework that embeds known parameter-affine structure directly into the DMD regression step. Unlike existing parametric DMD methods which interpolate modes, eigenvalues, or reduced operators and can be fragile with sparse training data or multi-dimensional parameter spaces, piDMD learns a single parameter-affine Koopman surrogate reduced order model (ROM) across multiple training parameter samples and predicts at unseen parameter values without retraining. We validate piDMD on fluid flow past a cylinder, electron beam oscillations in transverse magnetic fields, and virtual cathode oscillations – the latter two being simulated using an electromagnetic particle-in-cell (EMPIC) method. Across all benchmarks, piDMD achieves accurate long-horizon predictions and improved robustness over state-of-the-art interpolation-based parametric DMD baselines, with less training samples and with multi-dimensional parameter spaces.

Keywords: parametric reduced-order modeling, dynamic mode decomposition, Koopman operator, nonlinear systems prediction, fluid flow simulation, electromagnetic particle-in-cell, parameter-affine structure, long-horizon prediction
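
The idea of embedding parameter-affine structure into the DMD regression can be illustrated with a scalar toy problem. The sketch below assumes a one-dimensional state and a model of the form $x_{k+1} = (a_0 + \mu a_1) x_k$, fitted jointly over all training parameter values by least squares. The function names and the scalar setting are assumptions; the paper works with reduced-order state matrices rather than scalars.

```python
def fit_affine_dmd(trajectories):
    """Fit x_{k+1} ~ (a0 + mu * a1) * x_k over all trajectories at once.

    trajectories: list of (mu, [x_0, x_1, ...]) scalar snapshot sequences.
    Solves the 2x2 normal equations for the features (x, mu * x).
    """
    s11 = s12 = s22 = b1 = b2 = 0.0
    for mu, xs in trajectories:
        for x, y in zip(xs, xs[1:]):
            s11 += x * x
            s12 += mu * x * x
            s22 += mu * mu * x * x
            b1 += x * y
            b2 += mu * x * y
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("degenerate data: need distinct parameter values "
                         "with nonzero states")
    a0 = (b1 * s22 - b2 * s12) / det
    a1 = (s11 * b2 - s12 * b1) / det
    return a0, a1


def rollout(a0, a1, mu, x0, steps):
    """Advance the fitted model for a given parameter value."""
    xs = [x0]
    for _ in range(steps):
        xs.append((a0 + mu * a1) * xs[-1])
    return xs


# Training data generated from a known affine model at two parameter values.
train = [(mu, rollout(0.5, 0.2, mu, 1.0, 5)) for mu in (1.0, 2.0)]
a0, a1 = fit_affine_dmd(train)

# Prediction at an unseen parameter value, without refitting.
prediction = rollout(a0, a1, 1.5, 1.0, 3)
```

A single regression over all parameter samples yields one model that can be evaluated at `mu = 1.5` directly, in contrast to fitting a separate operator per parameter value and interpolating between them.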

318. ❌ Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks

Authors: Arun Sharma Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12102v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

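Scoring rationale: The paper proposes a design paradigm called 'compute-grounded reasoning (CGR)' for spatial-aware research agents: answerable sub-problems are resolved by deterministic computation before a large language model generates the answer. It is highly relevant to 'Large Language Models' (score 10) because the system explicitly uses LLMs (OpenAI + Anthropic) for reasoning and generation; to 'LLM Agents' (score 10) because the whole system is an Agent-to-Agent server handling spatial question answering and ML engineering tasks; and to 'Hallucination Mitigation' (score 10) because CGR uses deterministic computation to avoid hallucinated spatial reasoning. It is moderately relevant (score 8) to 'Chain of Thought', 'System 2 Thinking', 'Self-Correction', 'Tool Use', 'Multi-agent Systems', and 'Mechanistic Interpretability', since it involves structured reasoning, iterative refinement, tool use (code generation, computation engine), multi-agent coordination, and interpretability. Other keywords, such as MoE, SLMs, Scaling Laws, training techniques, optimization methods, and quantization, are unrelated to the paper (score 0).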

!!! tip deepseek-chat TL;DR

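The paper proposes the compute-grounded reasoning (CGR) paradigm, which uses deterministic computation and a structured spatial scene graph engine to avoid hallucinated spatial reasoning by large language models in spatial-aware research agents. On two benchmarks it reports competitive accuracy while maintaining interpretability.

Abstract translation (Chinese)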

我们提出计算驱动推理（CGR），一种面向空间感知研究智能体的设计范式，其核心在于：在请求语言模型生成答案前，所有可解答的子问题均通过确定性计算先行解决。Spatial Atlas 将 CGR 实例化为一个单体的智能体间（A2A）服务器，该系统处理两大挑战性基准测试：一是 FieldWorkArena——一个涵盖工厂、仓库和零售环境的多模态空间问答基准；二是 MLE-Bench——一套包含 75 项 Kaggle 机器学习竞赛、要求端到端机器学习工程能力的测试集。系统通过结构化的空间场景图引擎从视觉描述中提取实体与关系，确定性地计算距离与安全违规情况，随后将计算所得事实输入大语言模型，从而避免产生幻觉式的空间推理。熵引导的动作选择机制最大化每一步的信息增益，并在三层前沿模型栈（OpenAI + Anthropic）间路由查询。此外，系统配备了一个具备自我修复能力的机器学习流水线，包含策略感知的代码生成、分数驱动的迭代优化循环，以及基于提示的泄露审计注册机制。我们在两项基准上进行了全面评估，结果表明 CGR 在保持竞争力的准确率的同时，通过结构化的中间表示和确定性的空间计算，确保了结果的可解释性。

Abstract

We introduce compute-grounded reasoning (CGR), a design paradigm for spatial-aware research agents in which every answerable sub-problem is resolved by deterministic computation before a language model is asked to generate. Spatial Atlas instantiates CGR as a single Agent-to-Agent (A2A) server that handles two challenging benchmarks: FieldWorkArena, a multimodal spatial question-answering benchmark spanning factory, warehouse, and retail environments, and MLE-Bench, a suite of 75 Kaggle machine learning competitions requiring end-to-end ML engineering. A structured spatial scene graph engine extracts entities and relations from vision descriptions, computes distances and safety violations deterministically, then feeds computed facts to large language models, thereby avoiding hallucinated spatial reasoning. Entropy-guided action selection maximizes information gain per step and routes queries across a three-tier frontier model stack (OpenAI + Anthropic). A self-healing ML pipeline with strategy-aware code generation, a score-driven iterative refinement loop, and a prompt-based leak audit registry round out the system. We evaluate across both benchmarks and show that CGR yields competitive accuracy while maintaining interpretability through structured intermediate representations and deterministic spatial computations.

Keywords: compute-grounded reasoning, spatial-aware research agents, large language models, hallucination mitigation, structured spatial scene graph, Agent-to-Agent server, deterministic computation, interpretability
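
The core of compute-grounded reasoning, as the abstract describes it, is to compute distances and safety violations deterministically and pass the results to the language model as facts. A minimal sketch follows, assuming entities are already extracted as 2D coordinates in metres and a single clearance rule applies. The function names, the prompt layout, and the clearance value are hypothetical, and no model call is made.

```python
import math


def compute_facts(entities, min_clearance):
    """Deterministically derive pairwise distances and clearance violations.

    entities: dict name -> (x, y) position in metres
    min_clearance: minimum allowed distance between any two entities
    """
    facts = []
    names = sorted(entities)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = math.dist(entities[a], entities[b])
            facts.append(f"distance({a}, {b}) = {d:.2f} m")
            if d < min_clearance:
                facts.append(
                    f"violation: {a} and {b} closer than {min_clearance:.2f} m"
                )
    return facts


def build_prompt(question, facts):
    """Place the computed facts ahead of the question for the language model."""
    lines = ["Computed facts:"]
    lines += [f"- {fact}" for fact in facts]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


entities = {"forklift": (0.0, 0.0), "worker": (3.0, 4.0), "shelf": (10.0, 0.0)}
facts = compute_facts(entities, min_clearance=6.0)
prompt = build_prompt("Is the worker at a safe distance from the forklift?", facts)
```

Because the distances are computed rather than generated, the model is asked only to phrase an answer over facts that are already fixed, and the intermediate fact list can be inspected directly.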

319. ❌ A Nonparametric Adaptive EWMA Control Chart for Binary Monitoring of Multiple Stream Processes

Authors: Faruk Muritala, Austin Brown, Dhrubajyoti Ghosh, Sherry Ni Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12095v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

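Scoring rationale: The paper concerns monitoring binomial proportions in statistical process control (SPC) and proposes an improved EWMA control chart (CSB-EWMA) for multiple-stream binomial data. It belongs entirely to classical statistical methodology and involves no large models, deep learning, or AI techniques. Every keyword relates to large models, deep learning, AI techniques, or their applications, whereas this paper is purely statistical methods research, so all keywords score 0.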

!!! tip deepseek-chat TL;DR

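The paper addresses early-phase monitoring of multiple-stream binomial processes. It proposes the CSB-EWMA control chart, based on the exact time-varying variance of the EWMA statistic, which yields adaptive control limits and rapid shift detection.

Abstract translation (Chinese)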

在多独立流中监控二项比例是统计过程控制（SPC）中的关键挑战，其应用范围涵盖从制造业到网络安全等多个领域。尽管指数加权移动平均（EWMA）控制图对小偏移具有敏感性，但现有方法依赖于渐近方差近似，这在早期阶段监控中往往失效。本文提出一种累积标准化二项EWMA（CSB-EWMA）控制图，通过推导二值多流数据EWMA统计量的精确时变方差，克服了这一局限，实现了自适应控制限，从而确保从首个样本起就具备统计严谨性。通过大量模拟研究，我们确定了最优平滑参数（λ）和控制限参数（L），以达到目标受控平均运行长度（ARL0）为370和500。CSB-EWMA控制图在两种ARL0目标下均表现出快速的偏移检测能力：对于中等偏移（δ=0.2），失控平均运行长度（ARL1）可降至3-7个样本；同时在不同数据分布下展现出卓越的稳健性，在ARL0为370和500时，对于小偏移均保持较低的ARL1变异系数（CV < 0.10）。这项工作为实践者提供了一个无分布依赖、灵敏度高且理论完备的工具，用于二项多流过程的早期变化检测。

Abstract

Monitoring binomial proportions across multiple independent streams is a critical challenge in Statistical Process Control (SPC), with applications from manufacturing to cybersecurity. While EWMA charts offer sensitivity to small shifts, existing implementations rely on asymptotic variance approximations that fail during early-phase monitoring. We introduce a Cumulative Standardized Binomial EWMA (CSB-EWMA) chart that overcomes this limitation by deriving the exact time-varying variance of the EWMA statistic for binary multiple-stream data, enabling adaptive control limits that ensure statistical rigor from the first sample. Through extensive simulations, we identify optimal smoothing (λ) and limit (L) parameters to achieve target in-control average run length (ARL0) of 370 and 500. The CSB-EWMA chart demonstrates rapid shift detection across both ARL0 targets, with out-of-control average run length (ARL1) dropping to 3-7 samples for moderate shifts (δ=0.2), and exhibits exceptional robustness across different data distributions, with low ARL1 Coefficients of Variation (CV < 0.10 for small shifts) for both ARL0 = 370 and 500. This work provides practitioners with a distribution-free, sensitive, and theoretically sound tool for early change detection in binomial multiple-stream processes.

Keywords: Statistical Process Control, EWMA control chart, binary monitoring, multiple streams, adaptive control limits, average run length, binomial proportions, change detection
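
The contrast between asymptotic and exact time-varying control limits can be shown with the textbook single-stream EWMA chart. The sketch below uses the standard result $\mathrm{Var}(Z_t) = \sigma^2 \frac{\lambda}{2-\lambda}\left[1 - (1-\lambda)^{2t}\right]$ with $\sigma^2 = p_0(1-p_0)/n$ for a sample proportion. It is not the paper's CSB-EWMA derivation for multiple streams, and the parameter values are arbitrary examples.

```python
import math


def ewma_chart(proportions, p0, n, lam, L):
    """EWMA chart for sample proportions with exact time-varying limits.

    proportions: observed sample proportions, one per time step
    p0: in-control proportion; n: sample size per step
    lam: smoothing parameter (lambda); L: control limit width multiplier
    Returns a list of (z_t, lower_limit, upper_limit, signal) tuples.
    """
    sigma2 = p0 * (1.0 - p0) / n
    z = p0  # the statistic starts at the in-control value
    rows = []
    for t, x in enumerate(proportions, start=1):
        z = lam * x + (1.0 - lam) * z
        # Exact variance at step t; tends to sigma2 * lam / (2 - lam) as t grows.
        var_t = sigma2 * lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t))
        half_width = L * math.sqrt(var_t)
        rows.append((z, p0 - half_width, p0 + half_width, abs(z - p0) > half_width))
    return rows


rows = ewma_chart([0.10, 0.12, 0.30, 0.35], p0=0.10, n=50, lam=0.2, L=3.0)
signals = [signal for _, _, _, signal in rows]
widths = [upper - lower for _, lower, upper, _ in rows]
```

The limits start narrow and widen toward the asymptotic value, so a shift in the first few samples is judged against limits that match the actual variance of the statistic at that step.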

320. ❌ Robust Optimization for Mitigating Reward Hacking with Correlated Proxies

Authors: Zixuan Liu, Xiaolin Sun, Zizhan Zheng Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12086v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

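Scoring rationale: The paper focuses on reward design in reinforcement learning (RL). It proposes a robust optimization method for the case where the proxy reward is correlated with the true reward but can still be exploited (reward hacking). Its core contribution is reward robustness in RL, not an innovation in large models or deep learning techniques. All of the keywords concern large-model topics (language models, training techniques, inference methods, agent systems, model optimization), none of which the paper addresses, so every keyword is scored 0.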

!!! tip deepseek-chat TL;DR

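The paper proposes a robust optimization method for reward hacking in reinforcement learning, where the proxy reward is correlated with the true reward but can still be exploited. It improves policy robustness and stability by maximizing worst-case performance.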

摘要翻译

在不完美奖励信号存在的情况下设计稳健的强化学习（RL）智能体仍然是一个核心挑战。实践中，智能体通常使用仅近似真实目标的代理奖励进行训练，这使得它们容易受到奖励破解的影响，即高代理回报源于非预期或利用性行为。近期研究通过代理奖励与真实奖励之间的r-相关性来形式化这一问题，但现有方法如占用正则化策略优化（ORPO）仅针对固定代理进行优化，无法对更广泛的相关代理类别提供强保证。在本工作中，我们将奖励破解形式化为在所有r-相关代理奖励空间上的稳健策略优化问题。我们推导出一个可处理的最大-最小化形式，其中智能体在符合相关性约束的最坏情况代理下最大化性能。我们进一步证明，当奖励是已知特征的线性函数时，我们的方法可以适配以融入这一先验知识，从而同时获得改进的策略和可解释的最坏情况奖励。在多个环境中的实验表明，我们的算法在最坏情况回报方面始终优于ORPO，并在不同水平的代理-真实奖励相关性下提供了更高的稳健性和稳定性。这些结果表明，在奖励设计本身存在不确定性的场景中，我们的方法同时提供了稳健性和透明度。代码发布于 https://github.com/ZixuanLiu4869/reward_hacking。

摘要 (Abstract)

Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors. Recent work formalizes this issue using r-correlation between proxy and true rewards, but existing methods like occupancy-regularized policy optimization (ORPO) optimize against a fixed proxy and do not provide strong guarantees against broader classes of correlated proxies. In this work, we formulate reward hacking as a robust policy optimization problem over the space of all r-correlated proxy rewards. We derive a tractable max-min formulation, where the agent maximizes performance under the worst-case proxy consistent with the correlation constraint. We further show that when the reward is a linear function of known features, our approach can be adapted to incorporate this prior knowledge, yielding both improved policies and interpretable worst-case rewards. Experiments across several environments show that our algorithms consistently outperform ORPO in worst-case returns, and offer improved robustness and stability across different levels of proxy-true reward correlation. These results show that our approach provides both robustness and transparency in settings where reward design is inherently uncertain. The code is available at https://github.com/ZixuanLiu4869/reward_hacking.

关键词: robust optimization, reward hacking, reinforcement learning, proxy rewards, correlated proxies, worst-case performance, policy optimization, robustness
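
The max-min idea in the abstract (choose the policy whose worst-case return over admissible proxies is highest) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the authors' algorithm: policies are represented by hypothetical state-occupancy vectors, the candidate proxies form a small finite set, and a plain Pearson-correlation threshold against a reference reward stands in for the paper's r-correlation constraint, whose exact definition the abstract does not give.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def expected_return(occupancy, reward):
    """Return of a policy given its state-occupancy vector and a per-state reward."""
    return sum(d * r for d, r in zip(occupancy, reward))

def max_min_policy(policies, candidate_rewards, reference, r):
    """Pick the policy with the best worst-case return over admissible proxies."""
    feasible = [c for c in candidate_rewards if pearson(c, reference) >= r]
    if not feasible:
        raise ValueError("no candidate reward satisfies the correlation threshold")

    def worst_case(occupancy):
        return min(expected_return(occupancy, c) for c in feasible)

    best = max(policies, key=lambda name: worst_case(policies[name]))
    return best, worst_case(policies[best])

# Hypothetical three-state example.
reference = [0.5, 0.0, 1.0]
candidates = [[0.5, 0.0, 1.0], [0.5, 0.0, 0.2], [0.0, 1.0, 0.0]]
policies = {"exploit": [0.0, 0.0, 1.0], "cautious": [1.0, 0.0, 0.0]}
best_name, best_value = max_min_policy(policies, candidates, reference, r=0.3)
```

In this toy setup the "exploit" policy earns the most under the reference proxy alone, but the max-min rule prefers "cautious" because its return does not collapse under the second admissible proxy.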

321. ❌ Robust Reasoning and Learning with Brain-Inspired Representations under Hardware-Induced Nonlinearities

作者: William Youngwoo Chung, Hamza Errahmouni Barkam, Tamoghno Das, Mohsen Imani 期刊/来源: arxiv 发布日期: 2026-04-13 arXiv链接: http://arxiv.org/abs/2604.12079v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

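Scoring rationale: The paper focuses on optimizing hyperdimensional computing (HDC) for compute-in-memory (CIM) hardware, addressing hardware-induced nonlinear distortion to make classification and reasoning tasks more robust. All scoring keywords concern large language models (LLMs) and related techniques (training methods, inference techniques, applications), whereas this work is a hardware-aware optimization framework built on HDC, a different computing paradigm and hardware-optimization area with no direct link to LLMs. The paper does not mention any LLM technique, application, or related concept, so every keyword is scored 0.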

!!! tip deepseek-chat TL;DR

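The paper proposes a hardware-aware optimization framework based on hyperdimensional computing that compensates for nonlinear distortion in compute-in-memory architectures, substantially improving the classification and reasoning accuracy of QuantHD and RelHD under hardware perturbations.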

摘要翻译

传统机器学习依赖于高精度算术与近乎理想的硬件假设，这正日益受到激进尺度缩放的半导体器件中变异性的挑战。存内计算架构虽能缓解数据移动瓶颈并提升能效，却引入了非线性失真与可靠性问题。我们提出一种基于超维计算的硬件感知优化框架，以系统性地补偿存内计算中的非理想相似度计算。该方法将编码过程构建为优化问题，通过最小化理想核函数与其硬件受限对应形式之间的弗罗贝尼乌斯范数，并采用联合优化策略对超向量表示进行端到端校准。实验结果表明，在严重硬件扰动条件下，本方法应用于QuantHD时实现了84%的准确率，较同等条件下原始QuantHD提升48%。此外，该优化对于依赖精确变量绑定以实现可解释推理的图结构超维计算至关重要。我们的框架在Cora数据集上保持了RelHD的精度，在非线性环境下较原始RelHD实现5.4倍的准确率提升。通过保持超维计算的鲁棒性与符号特性，本解决方案为新兴存内计算硬件实现了兼具分类与推理能力的可扩展、高能效智能系统。

摘要 (Abstract)

Traditional machine learning depends on high-precision arithmetic and near-ideal hardware assumptions, which is increasingly challenged by variability in aggressively scaled semiconductor devices. Compute-in-memory (CIM) architectures alleviate data-movement bottlenecks and improve energy efficiency yet introduce nonlinear distortions and reliability concerns. We address these issues with a hardware-aware optimization framework based on Hyperdimensional Computing (HDC), systematically compensating for non-ideal similarity computations in CIM. Our approach formulates encoding as an optimization problem, minimizing the Frobenius norm between an ideal kernel and its hardware-constrained counterpart, and employs a joint optimization strategy for end-to-end calibration of hypervector representations. Experimental results demonstrate that our method when applied to QuantHD achieves 84% accuracy under severe hardware-induced perturbations, a 48% increase over naive QuantHD under the same conditions. Additionally, our optimization is vital for graph-based HDC reliant on precise variable-binding for interpretable reasoning. Our framework preserves the accuracy of RelHD on the Cora dataset, achieving a 5.4$\times$ accuracy improvement over naive RelHD under nonlinear environments. By preserving HDC’s robustness and symbolic properties, our solution enables scalable, energy-efficient intelligent systems capable of classification and reasoning on emerging CIM hardware.

关键词: Hyperdimensional Computing, Compute-in-memory, Hardware-aware optimization, Nonlinear distortions, Robust reasoning, QuantHD, RelHD, Graph-based HDC
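
The objective named in the abstract, the Frobenius norm between an ideal similarity kernel and its hardware-constrained counterpart, can be written down in a few lines. This is only a sketch of the quantity being minimized, not the paper's optimization procedure. The saturating `tanh` distortion and its `gain` parameter are hypothetical stand-ins for the CIM nonlinearity, which the abstract does not specify.

```python
import math

def ideal_kernel(hypervectors):
    """Normalized dot-product similarity between every pair of hypervectors."""
    dim = len(hypervectors[0])
    return [[sum(a * b for a, b in zip(u, v)) / dim for v in hypervectors]
            for u in hypervectors]

def distorted_kernel(kernel, gain=2.0):
    """Apply an assumed saturating nonlinearity, rescaled so that 1.0 maps to 1.0."""
    scale = math.tanh(gain)
    return [[math.tanh(gain * k) / scale for k in row] for row in kernel]

def frobenius_gap(a, b):
    """Frobenius norm of the element-wise difference between two square matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))

# Two hand-built bipolar hypervectors with similarity 0.5.
vectors = [[1, 1, 1, 1], [1, 1, 1, -1]]
K = ideal_kernel(vectors)
gap = frobenius_gap(K, distorted_kernel(K))
```

A hardware-aware encoder in the spirit of the abstract would adjust the hypervectors so that this gap shrinks; the sketch only evaluates it for a fixed encoding.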

322. ❌ OpenTME: An Open Dataset of AI-powered H&E Tumor Microenvironment Profiles from TCGA

作者: Maaike Galama, Nina Kozar-Gillan, Christina Embacher, Todd Dembo, Cornelius Böhm, Evelyn Ramberger, Julika Ribbat-Idel, Rosemarie Krupar, Verena Aumiller, Miriam Hägele, Kai Standvoss, Gerrit Erdmann, Blanca Pablos, Ari Angelo, Simon Schallenberg, Andrew Norgan, Viktor Matyas, Klaus-Robert Müller, Maximilian Alber, Lukas Ruff, Frederick Klauschen 期刊/来源: arxiv 发布日期: 2026-04-13 arXiv链接: http://arxiv.org/abs/2604.12075v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

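Scoring rationale: The paper introduces OpenTME, an AI-generated tumor microenvironment (TME) dataset produced with the Atlas pathology foundation models, which perform tissue quality control, segmentation, cell detection, and spatial analysis. It is unrelated to most keywords (LLM, MoE, SFT, RLHF, RAG, CoT, and so on), which concern the technical principles, training methods, and inference optimization of large language models, whereas this paper is a biomedical AI application. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to bioinformatics and cancer research, so it is highly relevant to 'AI for Science' and is scored 10.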

!!! tip deepseek-chat TL;DR

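The study addresses the scarcity of large-scale, quantitative tumor microenvironment (TME) characterization from routine H&E-stained histopathology. It introduces the OpenTME dataset, which uses an AI-powered pathology foundation model to analyze 3,634 whole-slide images from TCGA and produce quantitative TME profiles at cell-level resolution.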

摘要翻译

肿瘤微环境（TME）在癌症进展、治疗反应和患者预后中起着核心作用，然而，基于常规苏木精-伊红（H&E）染色的组织病理学切片进行大规模、一致且定量的TME表征仍然匮乏。我们推出了OpenTME，这是一个开放获取的数据集，包含预先计算的TME特征谱，数据来源于癌症基因组图谱（TCGA）中五种癌症类型（膀胱癌、乳腺癌、结直肠癌、肝癌和肺癌）的3,634张H&E染色全切片图像。所有输出均由Atlas H&E-TME生成，这是一个基于Atlas系列病理学基础模型构建的人工智能应用，能够执行组织质量控制、组织分割、细胞检测与分类以及空间邻域分析，以细胞级分辨率每张切片产出超过4,500个定量读数。OpenTME可在Hugging Face平台上用于非商业性学术研究。我们将持续扩展OpenTME，并预期它将作为生物标志物发现、空间生物学研究以及TME分析计算方法开发的资源。

摘要 (Abstract)

The tumor microenvironment (TME) plays a central role in cancer progression, treatment response, and patient outcomes, yet large-scale, consistent, and quantitative TME characterization from routine hematoxylin and eosin (H&E)-stained histopathology remains scarce. We introduce OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 H&E-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from The Cancer Genome Atlas (TCGA). All outputs were generated using Atlas H&E-TME, an AI-powered application built on the Atlas family of pathology foundation models, which performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 quantitative readouts per slide at cell-level resolution. OpenTME is available for non-commercial academic research on Hugging Face. We will continue to expand OpenTME over time and anticipate it will serve as a resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.

关键词: tumor microenvironment, H&E histopathology, AI-powered analysis, pathology foundation models, TCGA dataset, spatial biology, biomarker discovery, computational pathology

323. ❌ Interpretable DNA Sequence Classification via Dynamic Feature Generation in Decision Trees

作者: Nicolas Huynh, Krzysztof Kacprzyk, Ryan Sheridan, David Bentley, Mihaela van der Schaar 期刊/来源: arxiv 发布日期: 2026-04-13 arXiv链接: http://arxiv.org/abs/2604.12060v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

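Scoring rationale: The paper proposes the DEFT framework, which uses large language models (LLMs) to generate bioinformatics features for interpretable DNA sequence classification, so it is highly relevant to 'Large Language Models' (8). The work belongs to bioinformatics and is highly relevant to 'AI for Science/Bioinformatics' (10). Its core concern is interpretability, which makes it highly relevant to 'Mechanistic Interpretability/Explainable AI' (10). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are scored 0.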

!!! tip deepseek-chat TL;DR

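The paper addresses the problem that decision trees for DNA sequence analysis grow too deep because of limited feature expressivity, which degrades interpretability and generalization. It proposes the DEFT framework, which uses large language models to generate bioinformatics features dynamically, yielding interpretable sequence classification with high predictive performance.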

摘要翻译

DNA序列分析在从进化生物学到理解基因调控与疾病机制等诸多领域已变得至关重要。尽管深度神经网络能够实现卓越的预测性能，但它们通常以黑箱形式运作。与这些黑箱模型相对，轴对齐决策树为可解释的DNA序列分析提供了一个有前景的方向，但它们存在一个根本性局限：在每次分裂时孤立地考虑单个原始特征限制了其表达能力，这导致树深度过大，既损害了可解释性，也削弱了泛化性能。我们通过引入DEFT这一新颖框架来应对这一挑战，该框架能在树构建过程中自适应地生成高层次序列特征。DEFT利用大语言模型来提出符合生物学背景的特征，这些特征针对每个节点的局部序列分布进行定制，并通过反思机制迭代优化。实验表明，在多种基因组任务中，DEFT能够发现人类可解释且具有高预测性的序列特征。

摘要 (Abstract)

The analysis of DNA sequences has become critical in numerous fields, from evolutionary biology to understanding gene regulation and disease mechanisms. While deep neural networks can achieve remarkable predictive performance, they typically operate as black boxes. Contrasting these black boxes, axis-aligned decision trees offer a promising direction for interpretable DNA sequence analysis, yet they suffer from a fundamental limitation: considering individual raw features in isolation at each split limits their expressivity, which results in prohibitive tree depths that hinder both interpretability and generalization performance. We address this challenge by introducing DEFT, a novel framework that adaptively generates high-level sequence features during tree construction. DEFT leverages large language models to propose biologically-informed features tailored to the local sequence distributions at each node and to iteratively refine them with a reflection mechanism. Empirically, we demonstrate that DEFT discovers human-interpretable and highly predictive sequence features across a diverse range of genomic tasks.

关键词: DNA sequence classification, interpretable machine learning, decision trees, large language models, bioinformatics, feature generation, genomic tasks, DEFT framework
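
The abstract's point that single raw features limit a tree's expressivity can be made concrete by comparing information gain for a one-position split against a higher-level motif feature. The sketch below assumes a hand-picked motif (`"TATA"`) and a made-up eight-sequence dataset. In DEFT such features are proposed by an LLM, which is not modeled here, and all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(sequences, labels, feature):
    """Entropy reduction obtained by splitting on a boolean feature."""
    left = [y for s, y in zip(sequences, labels) if feature(s)]
    right = [y for s, y in zip(sequences, labels) if not feature(s)]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder

def base_at(position, base):
    """Raw axis-aligned feature: is a given base present at a fixed position?"""
    return lambda seq: seq[position] == base

def contains_motif(motif):
    """Higher-level feature: does the motif occur anywhere in the sequence?"""
    return lambda seq: motif in seq

# Toy data: positives contain "TATA" at varying offsets, negatives do not.
sequences = ["TATAGC", "GTATAC", "GCTATA", "ATATAG",
             "GCGCGC", "ACGTAC", "TTGACA", "CATGCA"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

motif_gain = information_gain(sequences, labels, contains_motif("TATA"))
best_raw_gain = max(information_gain(sequences, labels, base_at(pos, base))
                    for pos in range(6) for base in "ACGT")
```

Because the motif shifts position between sequences, no single position-and-base test separates the classes, while the motif feature does so in one split. A tree restricted to raw features would need several levels to express the same rule.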

324. ❌ VISTA: Validation-Informed Trajectory Adaptation via Self-Distillation

作者: Eli Corn, Daphna Weinshall 期刊/来源: arxiv 发布日期: 2026-04-13 arXiv链接: http://arxiv.org/abs/2604.12044v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

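Scoring rationale: VISTA focuses on an optimization problem in deep learning training (trajectory deviation) and proposes an online training framework based on self-distillation that uses validation-set information to identify expert anchors and ensemble them. Although it is a deep learning paper, its content has no direct link to any scoring keyword (all of which center on large-model techniques, training methods, inference optimization, or applications). It does not involve large models, language models, specific training techniques (such as RLHF or PEFT), inference methods (such as CoT or RAG), model optimization (such as quantization or compression), or AI-for-science applications. Every keyword is therefore scored 0.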

!!! tip deepseek-chat TL;DR

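The paper targets trajectory deviation, which can arise while training deep learning models, and proposes VISTA, an online self-distillation framework informed by the validation set. It ensembles expert anchors taken from earlier model states to retain already-mastered knowledge, improving robustness and generalization.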

摘要翻译

尽管深度学习模型可能展现出较高的验证准确率，但其仍会收敛至次优解，这一现象掩盖了我们称为“轨迹偏离”的优化失败问题。其成因在于，随着训练进行，模型可能为适应特定数据子群体而放弃高泛化能力的状态，从而丢弃先前已习得的潜在特征，且不会触发经典过拟合信号。为解决此问题，我们提出VISTA——一种在线自蒸馏框架，该框架在优化轨迹上强制保持一致性。利用基于验证信息的边际覆盖分数，VISTA识别出专家锚点，即那些在训练早期阶段、对特定数据区域仍保持专业能力的模型状态。这些锚点通过覆盖度加权的集成方式，在训练过程中被在线整合，从而规整损失函数景观并保留已掌握的知识。在多个基准测试中评估表明，相较于标准训练及先前的自蒸馏方法，VISTA展现出更强的鲁棒性与泛化能力；同时，其轻量化实现方案在保持性能不变的前提下，将存储开销降低了90%。

摘要 (Abstract)

Deep learning models may converge to suboptimal solutions despite strong validation accuracy, masking an optimization failure we term Trajectory Deviation. This is because as training proceeds, models can abandon high generalization states for specific data sub-populations, thus discarding previously learned latent features without triggering classical overfitting signals. To address this problem we introduce VISTA, an online self-distillation framework that enforces consistency along the optimization trajectory. Using a validation-informed Marginal Coverage score, VISTA identifies expert anchors, which are earlier model states that retain specialized competence over distinct data regions. A coverage-weighted ensemble of these anchors is integrated online during training, regularizing the loss landscape and preserving mastered knowledge. When evaluated across multiple benchmarks, VISTA demonstrates improved robustness and generalization over standard training and prior self-distillation methods, while a lightweight implementation reduces storage overhead by 90% without performance loss.

关键词: Trajectory Deviation, self-distillation, validation-informed, expert anchors, coverage-weighted ensemble, online training, generalization, robustness
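
A coverage-weighted ensemble of earlier checkpoints, as described in the abstract, might look roughly like the sketch below. The abstract does not define the Marginal Coverage score, so this example assumes one plausible reading: an anchor's score is the number of validation examples it classifies correctly that the current model gets wrong. The function names, the scoring rule, and the weighting scheme are all hypothetical and should not be taken as VISTA's actual definitions.

```python
def marginal_coverage(anchor_correct, current_correct):
    """Count validation examples the anchor gets right and the current model gets wrong."""
    return sum(1 for a, c in zip(anchor_correct, current_correct) if a and not c)

def ensemble_targets(anchor_probs, coverages):
    """Coverage-weighted average of the anchors' class-probability vectors.

    Returns None when no anchor adds coverage, in which case a training loop
    would fall back to the ordinary loss.
    """
    total = sum(coverages)
    if total == 0:
        return None
    weights = [c / total for c in coverages]
    n_classes = len(anchor_probs[0])
    return [sum(w * p[k] for w, p in zip(weights, anchor_probs))
            for k in range(n_classes)]

# Hypothetical per-example correctness on four validation examples.
current = [False, True, False, False]
anchor_a = [True, True, False, False]
anchor_b = [False, True, True, True]
scores = [marginal_coverage(anchor_a, current), marginal_coverage(anchor_b, current)]

# Soft targets for one training example from the two anchors' predictions.
targets = ensemble_targets([[0.9, 0.1], [0.3, 0.7]], scores)
```

The resulting soft targets would then enter a distillation term in the training loss, so anchors that still cover examples the current model has lost pull it back toward that knowledge.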

325. ❌ On the continuum limit of t-SNE for data visualization

作者: Jeff Calder, Zhonggan Huang, Ryan Murray, Adam Pickarski 期刊/来源: arxiv 发布日期: 2026-04-13 arXiv链接: http://arxiv.org/abs/2604.12041v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

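Scoring rationale: The paper studies the continuum-limit theory of the t-SNE data visualization algorithm, which belongs to dimensionality reduction and visualization in classical machine learning and is unrelated to every scoring keyword (all of which involve large models, deep learning techniques, or AI-for-science applications). It does not mention large models, deep learning, AI-for-science applications, or related technical concepts, and concentrates on the mathematical analysis of a traditional statistical machine learning algorithm.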

!!! tip deepseek-chat TL;DR

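The paper studies theoretically the continuum limit of the t-SNE data visualization algorithm as the number of data points tends to infinity. It proves that the optimization objective converges to a variational problem with a non-convex gradient regularization term, and analyzes the well-posedness of that problem and the properties of its solutions in the one-dimensional case.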

摘要翻译

本研究探讨t分布随机邻域嵌入（t-SNE）这一基于图的数据可视化技术的连续极限。该技术广泛应用于各类数据可视化任务，但其理论机制尚未得到充分理解。t-SNE算法通过最小化高维数据与其低维表示之间的相似度矩阵的Kullback-Leibler散度来生成可视化结果。我们证明，当数据点数量$n \to \infty$且相似图保持稀疏时，经过自然重标度并在适用参数范围内，该Kullback-Leibler散度收敛于一个连续变分问题。该问题包含非凸梯度正则化项以及对可视化空间中概率密度函数幅值的惩罚项，这两项分别对应t-SNE算法中吸引力和排斥力的连续极限。
由于该连续变分问题的非凸性，其适定性问题仅得到部分解决。我们证明当数据维度与嵌入维度均为1时，该问题存在唯一光滑极小解，同时存在无穷多个间断极小解（在松弛意义下解释）。这一结论与t-SNE在可视化中能以看似任意方式分离数据的经验观察高度吻合。该能量泛函还与著名的非适定Perona-Malik方程密切相关，后者常用于图像去噪与简化。我们通过数值实验验证了连续极限的正确性，针对高维情形下极限能量问题的微妙特性提供了初步分析结果，并指出了若干值得未来研究的问题方向。

摘要 (Abstract)

This work is concerned with the continuum limit of a graph-based data visualization technique called t-Distributed Stochastic Neighbor Embedding (t-SNE), which is widely used for visualizing data in a variety of applications but is still poorly understood from a theoretical standpoint. The t-SNE algorithm produces visualizations by minimizing the Kullback-Leibler divergence between similarity matrices representing the high-dimensional data and its low-dimensional representation. We prove that, as the number of data points $n \to \infty$ and the similarity graph remains sparse, after a natural rescaling and in applicable parameter regimes, the Kullback-Leibler divergence is consistent with a continuum variational problem that involves a non-convex gradient regularization term and a penalty on the magnitude of the probability density function in the visualization space. These two terms represent the continuum limits of the attraction and repulsion forces in the t-SNE algorithm. Due to the lack of convexity in the continuum variational problem, the question of well-posedness is only partially resolved. We show that when both dimensions are $1$, the problem admits a unique smooth minimizer, along with an infinite number of discontinuous minimizers (interpreted in a relaxed sense). This aligns well with the empirically observed ability of t-SNE to separate data in seemingly arbitrary ways in the visualization. The energy is also very closely related to the famously ill-posed Perona-Malik equation, which is used for denoising and simplifying images. We present numerical results validating the continuum limit, provide some preliminary results about the delicate nature of the limiting energetic problem in higher dimensions, and highlight several problems for future work.

关键词: t-SNE, continuum limit, data visualization, variational problem, Kullback-Leibler divergence, non-convex regularization, Perona-Malik equation
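
The discrete objective whose limit the paper studies, the Kullback-Leibler divergence between high-dimensional and low-dimensional similarity matrices, can be sketched as follows. This is a simplified illustration. It uses a single fixed Gaussian bandwidth `sigma` instead of t-SNE's per-point, perplexity-calibrated bandwidths and symmetrization, and it normalizes affinities over all ordered pairs. The Student-t kernel $1/(1+d^2)$ for the embedding is the standard t-SNE choice.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalized_affinities(points, kernel):
    """Pairwise kernel weights over ordered pairs i != j, normalized to sum to 1."""
    n = len(points)
    weights = {(i, j): kernel(sq_dist(points[i], points[j]))
               for i in range(n) for j in range(n) if i != j}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

def tsne_kl(high, low, sigma=1.0):
    """KL(P || Q) with Gaussian affinities P (data) and Student-t affinities Q (embedding)."""
    p = normalized_affinities(high, lambda d2: math.exp(-d2 / (2.0 * sigma ** 2)))
    q = normalized_affinities(low, lambda d2: 1.0 / (1.0 + d2))
    return sum(p_ij * math.log(p_ij / q[pair]) for pair, p_ij in p.items() if p_ij > 0)

# Two tight clusters in 3-D, embedded in 1-D in a faithful and a scrambled way.
high = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
faithful = [(0.0,), (0.1,), (10.0,), (10.1,)]
scrambled = [(0.0,), (10.0,), (0.1,), (10.1,)]
kl_faithful = tsne_kl(high, faithful)
kl_scrambled = tsne_kl(high, scrambled)
```

The faithful embedding keeps neighbors together and yields a much smaller divergence than the scrambled one, reflecting the attraction term that penalizes separating similar points.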

326. ❌ Constant-Factor Approximation for the Uniform Decision Tree

作者: Michał Szyfelbein 期刊/来源: arxiv 发布日期: 2026-04-13 arXiv链接: http://arxiv.org/abs/2604.12036v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

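Scoring rationale: The paper studies decision tree optimization, a classical computer science problem in algorithm design and analysis that involves approximation algorithms, combinatorial optimization, and computational complexity. Its content revolves entirely around decision tree construction algorithms, approximation-ratio analysis, and mathematical proofs, with no large models, deep learning, AI techniques, or scientific applications. All keywords relate to large-model techniques, AI methods, or AI-for-science applications and are unrelated to the paper's topic.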

!!! tip deepseek-chat TL;DR

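The paper settles the long-standing open question of whether a constant-factor approximation algorithm exists for the average-case decision tree problem under a uniform probability distribution. It gives a polynomial-time algorithm with approximation ratio below 11.57, improving on the best previously known greedy algorithm.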

摘要翻译

我们解决了一个长期存在的开放性问题：针对假设空间服从均匀概率分布的平均情况决策树（Decision Tree）问题，是否存在常数因子近似算法？我们提出一个简单的多项式时间算法，其近似比为$\frac{2}{1-\sqrt{(e+1)/(2e)}}+\varepsilon<11.57$，从而对该问题给出了肯定的回答。这一结果改进了当前已知最佳的贪婪算法所达到的$O(\log n/\log\log n)$近似比。我们分析中的第一个关键要素是运用了一种源自层次聚类（Hierarchical Clustering）相关问题[SODA '17, WALCOM '26]的分解技术，该技术允许我们将最优决策树分解为一系列称为“分离子族”的对象。第二个核心思想是将寻找分离子族（Separating Subfamily）的子问题归约到最大覆盖（Maximum Coverage）问题的实例。为此，我们分析了将团切割成小片段的性质，这些片段代表需要被区分的假设对。这使得我们能够为分离子族问题获得良好的近似解，进而为原问题设计出近似算法。

摘要 (Abstract)

We resolve a long-standing open question about the existence of a constant-factor approximation algorithm for the average-case Decision Tree problem with uniform probability distribution over the hypotheses. We answer the question in the affirmative by providing a simple polynomial-time algorithm with an approximation ratio of $\frac{2}{1-\sqrt{(e+1)/(2e)}}+\varepsilon<11.57$. This improves upon the best-known greedy algorithm, which achieves an $O(\log n/\log\log n)$-approximation. The first key ingredient in our analysis is a decomposition technique known from problems related to Hierarchical Clustering [SODA '17, WALCOM '26], which allows us to decompose the optimal decision tree into a series of objects called separating subfamilies. The second crucial idea is to reduce the subproblem of finding a Separating Subfamily to an instance of the Maximum Coverage problem. To do so, we analyze the properties of cutting cliques into small pieces, which represent pairs of hypotheses to be separated. This allows us to obtain a good approximation for the Separating Subfamily problem, which then enables the design of the approximation algorithm for the original problem.

关键词: Decision Tree, Approximation Algorithm, Constant-factor Approximation, Uniform Probability Distribution, Separating Subfamily, Maximum Coverage, Hierarchical Clustering, Polynomial-time Algorithm
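
The stated bound can be checked numerically. The short sketch below evaluates the closed-form constant $\frac{2}{1-\sqrt{(e+1)/(2e)}}$ from the abstract and the slack left for $\varepsilon$ under the quoted 11.57 ceiling. The function name is hypothetical.

```python
import math

def approximation_constant():
    """Evaluate 2 / (1 - sqrt((e + 1) / (2e))), the constant quoted in the abstract."""
    return 2.0 / (1.0 - math.sqrt((math.e + 1.0) / (2.0 * math.e)))

constant = approximation_constant()
# Largest epsilon that keeps the overall ratio below the quoted 11.57 bound.
epsilon_slack = 11.57 - constant
```

The constant evaluates to roughly 11.56, so the bound of 11.57 holds for any sufficiently small positive $\varepsilon$ (less than about 0.009).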

327. ❌ Differentiating Physical and Psychological Stress Using Wearable Physiological Signals and Salivary Cortisol

作者: Ozan Kaya, Nikoletta Athanassopoulou, George G. Malliaras, Marco Vinicio Alban-Paccha 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.12671v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

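Scoring rationale: The paper uses wearable physiological signals and salivary cortisol to distinguish physical from psychological stress, which belongs to biomedical engineering and health monitoring and is unrelated to every scoring keyword (all of which involve large models, deep learning techniques, or specific AI-for-science techniques), so every keyword is scored 0.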

!!! tip deepseek-chat TL;DR

该研究评估了可穿戴生理信号单独及结合唾液皮质醇在区分生理和心理压力及其恢复状态中的效果，发现单独使用可穿戴信号不足以可靠区分心理压力与休息/恢复，而加入皮质醇可显著提高分类准确性，特别是对心理状态的识别。

摘要翻译

目的：本研究旨在评估可穿戴生理信号单独使用及与唾液皮质醇结合使用时，对生理性压力、心理性压力及其恢复状态的区分能力。方法：六名健康成年人分别在不同日期完成三次实验室测试：静息状态、生理性压力（高强度骑行）或心理性压力（改良版特里尔社会应激测试）。连续记录心率、心率变异性、皮肤电活动及腕部加速度数据，并在五个时间点采集唾液皮质醇样本。在无重叠的10分钟时间窗内提取特征，并标记为静息、生理性压力、生理性恢复、心理性压力或心理性恢复。分别使用仅含可穿戴特征及每个时间窗额外加入五个皮质醇特征的组合训练梯度提升分类器。采用留一被试交叉验证评估模型性能。结果：仅使用可穿戴特征的分类总体准确率为77.8%，对生理性压力及恢复状态识别准确率较高，但对心理性压力及恢复状态存在频繁误判（召回率分别为50.0%和54.2%）。加入皮质醇特征后总体准确率提升至94.4%，尤其改善了心理状态的识别，将其召回率提高至83.3%和87.5%。皮质醇数据还减少了心理性压力与静息状态间的误分类。结论：仅依靠可穿戴信号不足以可靠区分心理性压力与静息及恢复状态。整合唾液皮质醇可提升对心理性压力及恢复状态的分类效果，并降低其与静息状态的混淆，凸显了内分泌背景信息对可穿戴生理监测的补充价值。意义：本研究结果为多模态压力监测提供了依据，并推动开展更大规模、更具生态效度的研究，以及开发可替代重复皮质醇采样的可扩展方案。

Abstract

Objective: This study aimed to assess how wearable physiological signals, alone and combined with salivary cortisol, distinguish physical and psychological stress and their recovery states. Methods: Six healthy adults completed three laboratory sessions on separate days: rest, physical stress (high-intensity cycling), or psychological stress (modified Trier Social Stress Test). Heart rate, heart rate variability, electrodermal activity, and wrist accelerometry were recorded continuously, and salivary cortisol was sampled at five time points. Features were extracted in non-overlapping 10-minute windows and labelled as rest, physical stress, physical recovery, psychological stress, or psychological recovery. A gradient boosting classifier was trained using wearable features alone and with five additional cortisol features per window. Performance was evaluated using leave-one-participant-out cross-validation. Results: Wearable-only classification achieved 77.8% overall accuracy, with high accuracy for physical stress and recovery but frequent misclassification of psychological stress and recovery (recall 50.0% and 54.2%). Including cortisol improved overall accuracy (94.4%), particularly for psychological states, increasing recall to 83.3% and 87.5%. Cortisol also reduced misclassification between psychological stress and rest. Conclusion: Wearable signals alone were insufficient to reliably distinguish psychological stress from rest and recovery. Integrating salivary cortisol improved classification of psychological stress and recovery and reduced confusion with rest, highlighting the value of endocrine context alongside wearable physiology. Significance: These findings support multimodal stress monitoring and motivate larger, ecologically valid studies and scalable alternatives to repeated cortisol sampling.

Keywords: stress classification, wearable physiological signals, salivary cortisol, gradient boosting classifier, multimodal monitoring, physical stress, psychological stress, recovery states
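
The leave-one-participant-out cross-validation mentioned in the abstract can be illustrated with a minimal sketch. It only shows how the folds are formed: every window from one participant is held out while the model trains on the rest. The function name, participant labels, and window layout are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def leave_one_participant_out(participant_ids):
    """Yield (held_out_id, train_indices, test_indices), one fold per participant."""
    windows_by_participant = defaultdict(list)
    for index, participant in enumerate(participant_ids):
        windows_by_participant[participant].append(index)

    for held_out in sorted(windows_by_participant):
        # All windows of the held-out participant form the test set.
        test_indices = windows_by_participant[held_out]
        # Windows of every other participant form the training set.
        train_indices = sorted(
            index
            for participant, indices in windows_by_participant.items()
            if participant != held_out
            for index in indices
        )
        yield held_out, train_indices, test_indices


# Hypothetical layout: three participants with two windows each (not the study's data).
window_participants = ["P1", "P1", "P2", "P2", "P3", "P3"]
folds = list(leave_one_participant_out(window_participants))
```

Splitting by participant rather than by window keeps windows from the same person out of both the training and test sets, which is presumably why the study evaluates this way with only six participants.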

328. ❌ The IQ-Motion Confound in Multi-Site Autism fMRI May Be Inflated by Site-Correlated Measurement Uncertainty

Authors: Kareem Soliman Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12294v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

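Scoring rationale: The paper addresses the statistical estimation of the IQ-head-motion confound in multi-site autism fMRI studies, using Probability Cloud Regression to assess the bias of ordinary least squares. Its subject is statistical methodology for neuroimaging; it does not involve large models, deep learning, the technical principles of AI, or applications of AI to science. Because every keyword concerns one of those areas and the paper is purely a medical statistics study, every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper finds that, in multi-site autism fMRI studies, ordinary least squares may overestimate the association between IQ and head motion by a factor of 4.67, and that Probability Cloud Regression, which accounts for measurement error, gives a more robust estimate.

Abstract translation (Chinese)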

多中心自闭症神经影像学研究通常通过将帧间位移对全量表智商分数进行回归并移除共享方差，来控制智商与头动之间的混杂效应。该流程假设普通最小二乘法能够无偏估计混杂效应的大小。我们使用概率云回归——一种在变量误差估计器中同时对两个变量的逐观测测量不确定性进行建模的方法——在ABIDE-I表型数据集（涵盖19个国际扫描站点、共935名被试）上检验了这一假设。智商测量误差源自已发表的韦氏量表重测信度系数；响应变量的不确定性通过站点水平代理变量（即站点内平均帧间位移的标准差）表示。研究得出三项发现：首先，当基于全精度拟合系数计算偏倚因子时，普通最小二乘法对智商-头动斜率的估计值是误差在变量校正估计值的4.67倍（为显示取整后，OLS为-0.00125，EIV为-0.00027毫米/智商点）。其次，在留出站点交叉验证中，对原始帧间位移的单一合并预测因子在所有19个站点均产生负的样本外R^2（总体R^2 = -0.074），表明一旦移除站点信息，合并预测因子无法在站点间清晰迁移。第三，在涵盖两个噪声参数12倍变化范围的8×8敏感性网格的所有64种配置中，误差在变量校正斜率的方向始终保持稳健。这些结果表明，在ABIDE-I数据中，合并普通最小二乘法可能高估了智商-头动关联，但其对头动校正流程的直接下游影响仍需通过原始头动轨迹和连接组水平的重新分析进行量化。形式化的误差在变量方法在多中心神经影像混杂估计中似乎仍不常见。

Abstract

Multi-site autism neuroimaging studies routinely control for the confound between full-scale IQ and head motion by regressing framewise displacement against IQ scores and removing shared variance. This procedure assumes that ordinary least squares (OLS) provides an unbiased estimate of the confound magnitude. We tested this assumption on the ABIDE-I phenotypic dataset (n=935 subjects across 19 international scanning sites) using Probability Cloud Regression, an errors-in-variables (EIV) estimator that models per-observation measurement uncertainty in both variables. IQ measurement error was derived from published Wechsler test-retest reliability coefficients; response-side uncertainty was represented by a site-level proxy equal to the within-site standard deviation of mean framewise displacement. Three findings emerged. First, OLS overestimates the IQ-motion slope by a factor of 4.67 relative to the EIV-corrected estimate when the bias factor is computed from the full-precision fitted coefficients (OLS -0.00125, EIV -0.00027 mm per IQ point after rounding for display). Second, under leave-site-out cross-validation a single pooled predictor of raw FD produces negative out-of-sample R^2 at all 19 sites (overall R^2 = -0.074), indicating that the pooled predictor does not transport cleanly across sites once site information is removed. Third, the direction of the EIV-corrected slope is robust across all 64 configurations of an 8x8 sensitivity grid spanning 12-fold ranges of each noise parameter. These results suggest that pooled OLS may overstate the IQ-motion association in ABIDE-I, but direct downstream consequences for motion-correction pipelines remain to be quantified using raw motion traces and connectivity-level re-analysis. Formal EIV methods appear to remain uncommon in multi-site neuroimaging confound estimation.

Keywords: autism fMRI, IQ-motion confound, multi-site neuroimaging, errors-in-variables, Probability Cloud Regression, ABIDE-I, framewise displacement, measurement uncertainty
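
Two numbers in the abstract are easier to read with a short worked example: the negative out-of-sample R^2 and the 4.67 bias factor. The sketch below assumes the common definition R^2 = 1 - SS_res / SS_tot with SS_tot taken around the mean of the held-out data; the paper may use a different baseline. The site values and predictions are hypothetical.

```python
def out_of_sample_r2(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the held-out mean."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out site values and pooled-model predictions (illustrative only).
held_out_fd = [0.10, 0.12, 0.14]
pooled_prediction = [0.20, 0.20, 0.20]

# Negative whenever the predictions fit worse than the held-out site's own mean.
r2_pooled = out_of_sample_r2(held_out_fd, pooled_prediction)

# Ratio of the rounded slopes quoted in the abstract: about 4.63, not 4.67.
rounded_ratio = 0.00125 / 0.00027
```

Under this definition a negative R^2 means the pooled predictor does worse on a held-out site than simply predicting that site's mean. The ratio of the rounded slopes comes out near 4.63, which is consistent with the abstract's note that 4.67 is computed from the full-precision coefficients.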

329. ❌ A unified data format for managing diabetes time-series data: DIAbetes eXchange (DIAX)

Authors: Elliott C. Pryor, Marc D. Breton, Anas El Fathi Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.11944v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

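Scoring rationale: The paper develops a standardized format (DIAX) for diabetes time-series data, aiming to resolve inconsistent data formats and so support machine learning applications. It is unrelated to most keywords (large-model technical principles, training methods, inference optimization, agent systems, and so on), because those target technical innovation in large models and deep learning, whereas this paper is a data-management tool. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper touches on bioinformatics (diabetes data) and on potential uses of AI in science (machine learning applications), but these are not its core contribution, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes DIAX, a standardized JSON data format that unifies the time-series data generated by diabetes devices, addressing inconsistent data formats and promoting interoperability and reproducibility in machine learning applications.

Abstract translation (Chinese)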

糖尿病设备，包括连续血糖监测仪、智能胰岛素笔和自动胰岛素输送系统，可生成丰富的时间序列数据，这些数据被广泛应用于研究和机器学习领域。然而，不同来源的数据格式不一致，阻碍了数据的共享、整合与分析。
我们提出DIAX（糖尿病数据交换格式），这是一种基于JSON的标准化格式，用于统一糖尿病时间序列数据，包括连续血糖监测、胰岛素和膳食信号。DIAX旨在提升数据的互操作性、可复现性和可扩展性，尤其适用于机器学习应用。一个开源代码库提供了数据集转换、跨格式兼容性处理、可视化工具以及社区贡献支持。DIAX是一个转化性资源，而非数据托管平台，在确保灵活性的同时不施加数据共享限制。
目前，DIAX兼容其他标准化工作，并支持主要数据集（DCLP3、DCLP5、IOBP2、PEDAP、T1Dexi、Loop），总计涵盖超过一千万患者小时的数据。https://github.com/Center-for-Diabetes-Technology/DIAX

Abstract

Diabetes devices, including Continuous Glucose Monitoring (CGM), Smart Insulin Pens, and Automated Insulin Delivery systems, generate rich time-series data widely used in research and machine learning. However, inconsistent data formats across sources hinder sharing, integration, and analysis. We present DIAX (DIAbetes eXchange), a standardized JSON-based format for unifying diabetes time-series data, including CGM, insulin, and meal signals. DIAX promotes interoperability, reproducibility, and extensibility, particularly for machine learning applications. An open-source repository provides tools for dataset conversion, cross-format compatibility, visualization, and community contributions. DIAX is a translational resource, not a data host, ensuring flexibility without imposing data-sharing constraints. Currently, DIAX is compatible with other standardization efforts and supports major datasets (DCLP3, DCLP5, IOBP2, PEDAP, T1Dexi, Loop), totaling over 10 million patient-hours of data. https://github.com/Center-for-Diabetes-Technology/DIAX

Keywords: diabetes, time-series data, data standardization, JSON format, machine learning, interoperability, continuous glucose monitoring, open-source

330. ❌ EOM-fpCCSD: An Accurate Alternative to EOM-CCSD for Doubly Excited and Charge-Transfer States

Authors: Katharina Boguslawski, Paweł Tecmer Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13009v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

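Scoring rationale: The paper focuses on a method for computing electronically excited states in computational chemistry (a variant of EOM-CCSD). It belongs to quantum chemistry and is unrelated to the vast majority of keywords (large models, deep learning, the technical principles of AI). The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper falls under scientific computing (computational chemistry). However, it does not explicitly use AI or machine learning and relies on conventional coupled-cluster theory from quantum chemistry, so it receives only 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes EOM-fpCCSD, a method built on a pCCD reference for computing electronically excited states (particularly doubly excited and charge-transfer states) more accurately and efficiently. On the QUEST benchmark it outperforms existing methods and resolves convergence problems.

Abstract translation (Chinese)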

我们提出了一种基于对耦合簇双激发（pCCD）参考的新型运动方程耦合簇方法，称为冻结对EOM-CCSD（EOM-fpCCSD）。该方法将pCCD拟设的计算效率与动态关联校正相结合，能够在EOM框架内可靠描述电子激发态。该方法已在开源PyBEST软件包中实现。我们使用正则哈特里-福克轨道和pCCD自然轨道，系统性地以标准EOM-CCSD及其对定制变体（EOM-ptCCSD）为基准，评估了其性能。对于取自QUEST数据库的电荷转移（CT）激发，EOM-fpCCSD给出的激发能非常接近EOM-CCSD的结果，优于EOM-ptCCSD，并且与理论最佳估计值（TBEs）高度吻合。在局域化的pCCD自然轨道基组下工作，使我们能够确定定向CT特征，该特征量化了电荷从一个分子区域到另一个分子区域的定向流动。数值结果表明，尽管激发能有所变化，但EOM-fpCCSD、EOM-CCSD和EOM-ptCCSD对定向CT特征的描述几乎完全相同。EOM-fpCCSD的真正优势在处理QUEST数据库中具有挑战性的双重激发态子集时变得尤为明显。对于这些棘手态，EOM-ptCCSD的表现与标准EOM-CCSD相似，而EOM-fpCCSD相比TBEs则显著优于这两种方法。除了提高激发能的精度外，EOM-fpCCSD还对一些标准EOM-CCSD和EOM-ptCCSD无法收敛的态实现了收敛。这些结果表明，EOM-fpCCSD为更精确地描述复杂电子激发提供了一条前景广阔且计算高效的途径。

Abstract

We introduce a new equation-of-motion coupled-cluster method based on a pair coupled-cluster doubles (pCCD) reference, termed frozen-pair EOM-CCSD (EOM-fpCCSD). This approach combines the computational efficiency of the pCCD ansatz with a dynamical correlation correction, enabling a reliable description of electronically excited states within the EOM framework. The method has been implemented in the open-source PyBEST software package. Its performance is systematically benchmarked against standard EOM-CCSD and its pair-tailored variant (EOM-ptCCSD), using both canonical Hartree-Fock and pCCD natural orbitals. For charge-transfer (CT) excitations taken from the QUEST database, EOM-fpCCSD yields excitation energies very close to those of EOM-CCSD, outperforming EOM-ptCCSD, as well as to the theoretical best estimates (TBEs). Working within the localized pCCD natural orbital basis allows us to determine the directed CT character, which quantifies the directed charge flow from one molecular domain to another. Numerical results show that EOM-fpCCSD, EOM-CCSD, and EOM-ptCCSD provide nearly identical descriptions of the directed CT character, despite changes in excitation energies. The true advantage of EOM-fpCCSD becomes evident for the challenging QUEST subset of doubly excited states. While EOM-ptCCSD performs similarly to standard EOM-CCSD, EOM-fpCCSD significantly outperforms both methods for these problematic states compared to TBEs. In addition to improving the accuracy of excitation energies, EOM-fpCCSD also converges for several states that standard EOM-CCSD and EOM-ptCCSD fail to converge. These results demonstrate that EOM-fpCCSD offers a promising and computationally efficient route toward a more accurate description of complex electronic excitations.

Keywords: EOM-fpCCSD, coupled-cluster method, electronically excited states, charge-transfer excitations, doubly excited states, computational efficiency, PyBEST software, natural orbitals

331. ❌ Efficient Implementation of Relativistic Coupled Cluster Linear Response Theory in Combination with Perturbation Sensitive Natural Spinors and Cholesky Decomposition Treatment of Two-electron Integrals

Authors: Sudipta Chakraborty, Muskan Begom, Xubo Wang, Achintya Kumar Dutta Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12914v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

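Scoring rationale: The paper focuses on implementing relativistic coupled-cluster linear response theory in computational chemistry, covering quantum chemistry methods, numerical algorithms, and computational optimization. It is unrelated to every large-model and deep learning keyword. The only possible relevance is to "AI for Science" in a broad sense, because the paper belongs to computational science, but it uses no AI methods, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper presents an efficient implementation of the relativistic linear-response coupled-cluster singles and doubles method. Using Cholesky decomposition and an FNS++ basis truncation strategy, it substantially reduces memory requirements while retaining high accuracy, and it is applied to the static polarizability of a uranium hexafluoride complex with more than 1400 basis functions.

Abstract translation (Chinese)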

本文提出了一种高效的低成本线性响应耦合簇单双激发（LR-CCSD）方法实现，用于计算具有显著相对论效应和电子关联效应的体系的静态及频域极化率。该方法采用基于X2C的哈密顿量（X2CAMF与X2CMP），并引入Cholesky分解以降低内存需求。在当前实现中，计算量大的三外和四外指标积分采用实时生成方式，避免了其存储需求。基准测试结果表明，X2CMP哈密顿量比X2CAMF表现出更稳定的性能，特别是在使用大型且高度增广的基组时。所提出的FNS++CD-X2CMP-LR-CCSD方法在多种体系上与四分量参考值高度吻合。此外，本文评估了构建FNS++基组的不同策略，发现基于平均密度的方案能在精度与计算成本之间取得良好平衡。平均而言，约73%的虚拟旋量空间被截除，这证明了基于FNS++密度的截断方案具有高效性和一致性。本实现方案能够对大型分子体系进行精确且可扩展的相对论响应计算，例如在超过1400个基函数的叁ζ基组水平上对六氟化铀配合物静态极化率的计算，便验证了其应用能力。

Abstract

We present an efficient implementation of the low-cost linear-response coupled-cluster singles and doubles (LR-CCSD) method for computing static and frequency-dependent polarizabilities in systems with significant relativistic and electron-correlation effects. The approach employs X2C-based Hamiltonians (X2CAMF and X2CMP) and incorporates Cholesky decomposition to reduce memory requirements. In the current implementation, costly three- and four-external index integrals are generated on the fly, eliminating the need for their storage. Benchmark results indicate that the X2CMP Hamiltonian provides more consistent performance than X2CAMF, particularly for large and highly augmented basis sets. The proposed FNS++CD-X2CMP-LR-CCSD method shows excellent agreement with four-component reference values across a wide range of systems. Additionally, different strategies for constructing the FNS++ basis were assessed, and an averaged density approach was found to offer a favorable balance between accuracy and computational cost. On average, about 73% of the virtual spinor space is removed, demonstrating the efficiency and consistency of the FNS++ density-based truncation approach. The present implementation enables accurate and scalable relativistic response calculations for large molecular systems, as demonstrated by the calculation of the static polarizability of the uranium hexafluoride complex with a triple-zeta basis set of more than 1400 basis functions.

Keywords: relativistic coupled cluster, linear response theory, Cholesky decomposition, FNS++ basis, polarizability calculation, X2C Hamiltonian, electron correlation, computational chemistry
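
The memory saving attributed to Cholesky decomposition can be illustrated with a generic pivoted, incomplete Cholesky factorization of a small positive semidefinite matrix. This is a sketch of the general technique only, not the authors' implementation: in the paper the decomposition is applied to two-electron integrals, while the matrix, threshold, and function names here are hypothetical.

```python
def pivoted_cholesky(matrix, threshold=1e-8):
    """Return vectors L_k such that matrix[i][j] ~= sum_k L_k[i] * L_k[j]."""
    size = len(matrix)
    residual_diagonal = [matrix[i][i] for i in range(size)]
    vectors = []
    while len(vectors) < size:
        # Pivot on the largest remaining diagonal element.
        pivot = max(range(size), key=lambda i: residual_diagonal[i])
        if residual_diagonal[pivot] <= threshold:
            break  # the remaining error is below the threshold
        scale = residual_diagonal[pivot] ** 0.5
        # Pivot column of the residual matrix, scaled by the pivot's square root.
        column = [
            (matrix[i][pivot] - sum(v[i] * v[pivot] for v in vectors)) / scale
            for i in range(size)
        ]
        vectors.append(column)
        for i in range(size):
            residual_diagonal[i] -= column[i] ** 2
    return vectors


def reconstruct(vectors, size):
    """Rebuild the approximated matrix from its Cholesky vectors."""
    return [
        [sum(v[i] * v[j] for v in vectors) for j in range(size)]
        for i in range(size)
    ]


# Hypothetical rank-2 matrix built as u u^T + w w^T with u = (1, 2, 3), w = (0, 1, 1).
toy_matrix = [[1.0, 2.0, 3.0], [2.0, 5.0, 7.0], [3.0, 7.0, 10.0]]
cholesky_vectors = pivoted_cholesky(toy_matrix)
rebuilt = reconstruct(cholesky_vectors, 3)
```

The toy matrix has rank 2, so two vectors reproduce it to within the threshold. Storing a small number of vectors instead of the full matrix is the source of the memory reduction when the matrix is numerically low-rank.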

332. ❌ Fidelity of Machine Learned Potentials: Quantitative Assessment for Protonated Oxalate

Authors: Chen Qu, Paul L. Houston, Qi Yu, Apurba Nandi, Joel M. Bowman, Valerii Andreichev, Silvan Käser, Markus Meuwly Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12877v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

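Scoring rationale: The paper studies machine-learned potential energy surfaces (ML-PESs) in computational chemistry, evaluating two machine learning methods (PIP and PhysNet) on the protonated oxalate system. It is entirely concerned with conventional machine learning applied to scientific computing (chemical physics) and involves no large language models (LLMs), no innovation in deep learning principles, and no large-model training, inference, alignment, or agent techniques. The only slightly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to science (computational chemistry). It does not, however, use data or methods typical of bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are unrelated to large models, deep learning principles, or related applications and receive 0.

!!! tip deepseek-chat TL;DR

The study assesses the fidelity of two machine-learned potential energy surfaces (PIP and PhysNet) for the protonated oxalate system. Vibrational energy calculations, IR spectra, and tunneling-splitting analyses show that the two surfaces give highly consistent results.

Abstract translation (Chinese)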

用于对电子能量和力数据集进行机器学习回归以构建高维机器学习势能面（ML-PESs）的方法和软件已呈现爆发式增长。一个重要但尚未被深入研究的方面是，除了标准的拟合精度指标外，不同的ML-PESs在多大程度上能一致地表示它们所训练的同一数据集。本文通过若干“压力测试”，对两种广泛应用的机器学习势能方法进行了详细检验。一种基于置换不变多项式（PIP）线性最小二乘回归，另一种是基于消息传递神经网络的PhysNet方法。这些势能面及偶极矩表面被用于振动能量和波函数的VSCF/VCI计算中。我们直接比较了两种势能面计算得到的能量以及红外光谱。此外，针对两个等价结构间的氢转移隧穿分裂，报告了使用三种方法得到的结果：环聚合物瞬子理论、扩散蒙特卡洛模拟以及$Q_{im}$路径方法。这些计算需要在15维构型空间中广泛分布的点上评估约十亿次能量。两种势能面对这些物理量给出的结果彼此高度吻合。

Abstract

There has been a veritable explosion of methods and software to perform machine-learned regression on datasets of electronic energies and forces to develop high-dimensional machine learned potential energy surfaces (ML-PESs). A major, but not deeply-studied aspect is how well different ML-PESs represent the same dataset on which they are trained, beyond the standard fitting precision metrics. Here, this is examined in detail using several "stress tests", for two widely applied machine-learned potential approaches. One is based on permutationally invariant polynomial (PIP) linear least square regression and the other is the message-passing neural network PhysNet approach. These potentials and dipole moment surfaces are used in VSCF/VCI calculations of vibrational energies and wavefunctions. The energies from the two PESs are directly compared as are the IR spectra. In addition, tunneling splittings for the hydrogen transfer between two equivalent structures are reported from using three methods: ring polymer instanton theory, diffusion Monte Carlo simulations, and the $Q_{im}$ path method. These calculations require the evaluation of on the order of one billion energies that are widely dispersed in the 15-dimensional configurational space. The two PESs yield results for these quantities in excellent agreement with each other.

Keywords: machine-learned potential energy surfaces, ML-PESs, permutationally invariant polynomial, PhysNet, vibrational energies, IR spectra, tunneling splittings, protonated oxalate

333. ❌ Atomistic Modeling of Methane and Carbon Dioxide Structure I Gas Hydrates Under Pressure: Guest Effects and Properties

Authors: Samuel Mathews, Xiaodan Zhu, André Guerra, Phillip Servio, Alejandro D. Rey Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12861v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

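Scoring rationale: The paper studies atomistic modeling of methane and carbon dioxide gas hydrates, using density functional theory (DFT) to simulate the pressure-enthalpy landscape and mechanical stability. It belongs to computational chemistry and materials science and is unrelated to the vast majority of keywords (large models, deep learning, training methods, inference optimization, alignment, agents, and so on). The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper involves computational chemistry simulation, a scientific computing application. However, it does not explicitly use AI or machine learning and relies only on conventional DFT calculations, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The study uses density functional theory to simulate the structure, elastic stability, and molecular behavior of methane and carbon dioxide gas hydrates under pressure. It shows how the orientation of carbon dioxide molecules within the cages differs and how that affects material properties, and it confirms the experimentally observed restrictions on guest-molecule rotation.

Abstract translation (Chinese)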

气体水合物是未来能源的潜在候选者，同时其结构在碳捕集与封存、气体运输及重要分离过程中具有广泛应用。该领域的先前研究特别关注了水分子骨架的动力学特性。我们采用密度泛函理论模拟了sI型甲烷和二氧化碳水合物的压力-焓值相图与机械稳定性边界，并考察了revPBE+DFT-D2与SCAN+rVV10两种泛函及其对交换关联相互作用处理的差异。研究发现，在零压条件下，revPBE泛函相对弱化了相互作用，导致结构更柔顺且平衡体积较大。在压力作用下，无论采用何种泛函，二氧化碳分子均会自发平行于大笼的六边形面排列。此外，性质差异源于二氧化碳分子能够通过甲烷分子无法实现的方式旋转并分散能量势垒的变化。本计算方法揭示了气体水合物的弹性稳定性、临界稳定性及关键分子相互作用的本质差异，证实了实验观测到的客体分子旋转受限现象以及静水载荷下新颖的压力行为。

Abstract

Gas hydrates are potential candidates in future energy sources while simultaneously providing structures with extensive applications in carbon capture and storage, gas transport, and important separation processes. Prior research in the field considers the dynamics of the water molecule backbone in particular. We investigated the pressure-enthalpy landscape and mechanical stability envelope of sI methane and carbon dioxide hydrates simulated using DFT. We investigated the effect of the revPBE + DFT-D2 and the SCAN + rVV10 functionals and their treatment of the exchange correlation interactions. We examined the zero pressure material properties, finding that revPBE comparatively underbinds the interactions, causing more flexible structures with large equilibrium volumes. Under pressure, the carbon dioxide molecule was found to align itself parallel to the hexagonal faces of the large cage despite the functional used. Additionally, the property differences are caused by the ability of the carbon dioxide molecule to rotate and disperse the changes in the energy landscape in ways that methane molecules cannot. This computational methodology describes the elastic stability of gas hydrate, marginal stability, and critical differences across important molecular interactions, confirming experimentally observed restrictions in guest molecule rotations and novel pressure behaviors under hydrostatic loads.

Keywords: gas hydrates, density functional theory, methane, carbon dioxide, pressure effects, elastic stability, molecular interactions, computational modeling

334. ❌ Perspective on a challenge: predicting the photochemistry of cyclobutanone

Authors: Jiří Janoš, Nanna Holmgaard List, Andrew J. Orr-Ewing, Jiří Suchan, Mario Barbatti, Olivia Bennett, Marcus Brady, Javier Carmona-García, Rachel Crespo-Otero, Julien Eng, O. Jonathan Fajen, Marco Garavelli, Sandra Gómez, Alice E. Green, Federico J. Hernández, Daniel Hollas, Lewis Hutton, Lea M. Ibele, Adam Kirrander, Zhenggang Lan, Yorick Lassmann, Joseph E. Lawrence, Benjamin G. Levine, Dmitry V. Makhov, Jonathan R. Mannouch, Xincheng Miao, Roland Mitrić, Shane M. Parker, Thomas J. Penfold, Jiawei Peng, Jeremy O. Richardson, Dmitrii Shalashilin, Petr Slavíček, K. Eryn Spinlove, Patricia Vindel-Zandbergen, Federica Agostini, Sara Bonella, Todd J. Martínez, Graham A. Worth, Basile F. E. Curchod Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12749v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

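Scoring rationale: This paper is a computational chemistry study on predicting the photochemistry of cyclobutanone. It covers nonadiabatic molecular dynamics, electronic structure theory, computational photochemistry, and experimental validation, and belongs to computational and physical chemistry. Most of the keywords target large language models and deep learning (large models, training techniques, inference optimization, agents, and so on) and are unrelated to its content. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is computational chemistry and could be viewed as scientific computing or a cheminformatics application. However, it does not explicitly use AI or machine learning and relies on conventional computational chemistry methods, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

Through a prediction challenge, this paper assesses how well nonadiabatic molecular dynamics methods simulate the photochemistry of cyclobutanone and predict time-resolved MeV-UED signals, summarizes the strengths and weaknesses of the different computational strategies, and confirms the qualitative predictive power of computational photochemistry.

Abstract translation (Chinese)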

本观点文章隶属于一个探讨非绝热分子动力学在预测光化学过程方面成熟度的专题。2023年，我们向计算光化学界发起了一项预测挑战，要求模拟环丁酮在200纳米光激发下的光化学反应及其产生的时间分辨兆电子伏超快电子衍射信号。在SLAC（美国斯坦福）进行相关实验之前，该挑战吸引了来自70多位研究人员的15项理论预测，他们采用了广泛的电子结构和非绝热分子动力学策略来预测时间分辨MeV-UED信号。上海交通大学的MeV-UED装置也被用于为环丁酮的光化学过程提供了第二个独立的时间分辨MeV-UED信号。
本文讨论了参与者们用于预测环丁酮光化学的各种方法和策略。同时，根据参与者在2025年4月于洛桑举办的、专门讨论该挑战结果的CECAM研讨会上达成的共识，本文总结了在光激发、电子结构、非绝热动力学以及可观测量计算等方面所用各种方法的优势与不足。本文还将所有预测得到的时间分辨MeV-UED信号与实验信号汇总于同一图中。此项挑战（i）展示了非绝热分子动力学的定性预测能力，并且（ii）强调了电子结构理论对激发态动力学结果的影响以及对其进行仔细基准测试的必要性。这项工作使得学界能够分享执行非绝热动力学的实用策略（在本文中讨论），并构成了计算光化学领域的一次“校准”实践。

Abstract

This Perspective is part of a Special Topic that explored the maturity of nonadiabatic molecular dynamics for predicting photochemical processes. In 2023, a prediction challenge was issued to the community of computational photochemists to simulate the photochemistry of cyclobutanone, photoexcited at 200 nm, and the resulting time-resolved MeV-UED signal. The challenge attracted 15 theoretical predictions from more than 70 researchers, employing a wide range of strategies for electronic structure and nonadiabatic molecular dynamics to predict the time-resolved MeV-UED signal before the experiment had been conducted at SLAC (Stanford, USA). The MeV-UED instrument at Shanghai Jiao Tong University was also used to provide a second independent time-resolved MeV-UED signal for the photochemistry of cyclobutanone. This Perspective discusses the various approaches and strategies used by the participants to predict the photochemistry of cyclobutanone. This work also summarizes the strengths and weaknesses of various methods used for photoexcitation, electronic structure, nonadiabatic dynamics, and calculation of observables, as agreed by the participants during a CECAM workshop dedicated to the results of the challenge and organized in Lausanne in April 2025. This Perspective also collects all the predicted time-resolved MeV-UED signals into a single figure, together with the experimental signal. This challenge (i) demonstrated the qualitative predictive power of nonadiabatic molecular dynamics and (ii) underscore the impact of electronic-structure theory on the outcome of the excited-state dynamics and the need for its careful benchmarking. This effort allowed the community to share practical strategies to perform nonadiabatic dynamics (discussed in the present Perspective) and constitutes a ‘calibration’ exercise for computational photochemistry.

Keywords: photochemistry, cyclobutanone, nonadiabatic molecular dynamics, electronic structure theory, time-resolved MeV-UED, computational photochemistry, prediction challenge, excited-state dynamics

335. ❌ Transferable excited-state dynamics enable screening of fluorescent protein chromophores

Authors: Rhyan Barrett, Sophia Wesely, Julia Westermayr Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12699v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

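Scoring rationale: This paper develops X-MACE, a transferable machine-learning potential for efficiently screening the excited-state dynamics of fluorescent protein chromophores. Its core is the application of computational chemistry and machine learning to predicting photophysical properties, which falls under AI for Science, so it has some relevance to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (rated 5). However, it does not address large language models (LLMs), innovations in deep learning principles, model training methods (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or any of the other listed large-model keywords. Its use of machine learning is specific to molecular dynamics simulation rather than general large-model research.

!!! tip deepseek-chat TL;DR

This study develops X-MACE, a transferable machine-learning potential, and combines it with curvature-driven surface hopping to screen the excited-state dynamics of fluorescent protein chromophores efficiently, revealing design principles for how structural modifications affect photophysical properties.

Abstract translation (Chinese)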

可迁移的激发态动力学为跨分子体系的光物理行为高效筛选提供了途径，但传统的非绝热模拟仍然成本高昂。本文提出X-MACE——一种用于激发态动力学的可迁移机器学习势函数，能够预测多重势能面、作用力及振子强度，并将其与曲率驱动面跳跃方法相结合，实现对光化学路径的数据高效筛选。我们以荧光发色团为例应用该框架，利用绿色荧光蛋白发色团变体展示细微结构修饰如何重塑激发态弛豫、寿命及光异构化产率。通过对单个预训练模型进行微调（每个衍生物使用少于100个参考几何结构），即可在化学结构多样的类似物集合中获得精确的动力学结果。筛选过程揭示了两条关键设计原则：苯酚环上的空间位阻会降低扭转势垒并加速通往扭曲锥形交叉点的路径，而共轭延伸则能稳定平面激发态构型、抑制非辐射衰变并延长荧光寿命。更广泛而言，该工作流程为可扩展的激发态筛选及光物理性质的可解释设计提供了通用框架。

Abstract

Transferable excited-state dynamics offer a route to efficient screening of photophysical behavior across molecular systems, but conventional nonadiabatic simulations remain prohibitively expensive. Here we introduce X-MACE, a transferable machine-learning potential for excited-state dynamics that predicts multiple potential energy surfaces, forces and oscillator strengths, and combine it with curvature-driven surface hopping to enable data-efficient screening of photochemical pathways. We apply this framework to fluorescent chromophores as an example application, using green fluorescent protein chromophore variants to demonstrate how subtle structural modifications reshape excited-state relaxation, lifetimes and photoisomerization yields. Fine-tuning a single pretrained model with fewer than 100 reference geometries per derivative yields accurate dynamics across a chemically diverse set of analogues. The screening reveals two governing design principles: steric crowding on the phenolate ring lowers the torsional barrier and accelerates access to twisted conical intersections, whereas conjugation extension stabilizes planar excited-state configurations, suppresses non-radiative decay and prolongs fluorescence. More broadly, this workflow provides a general framework for scalable excited-state screening and interpretable design of photophysical properties.

Keywords: excited-state dynamics, machine-learning potential, fluorescent chromophores, photophysical screening, transferable model, X-MACE, nonadiabatic simulations, photoisomerization

336. ❌ Exact tunneling splittings from path-integral hybrid Monte Carlo with enveloping bridging potentials

Authors: Yu-Chen Wang, Jeremy O. Richardson Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12639v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

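Scoring rationale: This paper belongs to computational chemistry and proposes PIHMC-EBP, a path-integral hybrid Monte Carlo method for computing exact tunneling splittings in molecular systems. The vast majority of the keywords target artificial intelligence, machine learning, and large language model technology (large models, deep learning, AI principles, and so on) and are unrelated to its content. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is in computational chemistry and molecular simulation and could be viewed as scientific computing or as background for potential AI applications in science. The paper itself does not explicitly use AI or machine learning, so it receives 5 points (some relevance). All other keywords score 0 (unrelated).

!!! tip deepseek-chat TL;DR

This paper proposes a new path-integral hybrid Monte Carlo method (PIHMC-EBP) for computing numerically exact tunneling splittings in molecular systems, and reports higher precision and lower computational cost on several molecular systems.

Abstract translation (Chinese)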

本文提出了一种采用包络桥接势的路径积分混合蒙特卡洛方法（PIHMC-EBP），用于计算分子系统中数值精确的隧穿分裂。该方法的核心思想是构建一个近似无势垒的桥接势，该势能平滑地连接环状聚合物相空间中对称相关的区域，从而实现对自由能剖面的直接采样，并由此获得相应的分裂值。我们设计了两种定制的非局域更新方案，以增强对慢速集体运动的采样效率。与使用热力学积分的路径积分分子动力学相比，PIHMC-EBP既不需要数值积分，也无需进行时间步长收敛性检验，从而大幅减少了结果分析所需的人工干预。将本方法应用于丙二醛（及其氘代同位素变体）和氯化氢二聚体，并采用最先进的势能面进行计算，得到了迄今为止这两个体系最精确的隧穿分裂值；同时，总体计算成本分别降低了数倍和三个数量级。最后，通过对水二聚体的计算，我们在三种不同的势能面上首次实现了基态隧穿分裂的数值精确路径积分计算，所有结果均通过重加权同一组轨迹同时获得。

Abstract

A path-integral hybrid Monte Carlo approach with enveloping bridging potentials (PIHMC-EBP) is proposed for calculating numerically exact tunneling splittings in molecular systems. The central idea is to construct an approximately barrierless bridging potential that smoothly connects symmetry-related regions of ring-polymer phase space, enabling direct sampling of the free-energy profile from which the relevant splittings are obtained. Two tailored nonlocal updates are designed to enhance the sampling of slow collective motions. Compared with path-integral molecular dynamics using thermodynamic integration, PIHMC-EBP requires neither quadrature nor time-step convergence checks, thereby substantially reducing the manual effort required to analyze the results. Applications to malonaldehyde (and its deuterated isotopologue) and the HCl dimer using state-of-the-art potential energy surfaces provide the most precise tunneling splittings reported to date for both systems, while simultaneously reducing the overall computational cost by several times and three orders of magnitude, respectively. Finally, application to the water dimer yields the first numerically exact path-integral calculations of the ground-state tunneling splittings on three different potential energy surfaces, all obtained simultaneously by reweighting a single set of trajectories.

Keywords: tunneling splittings, path-integral hybrid Monte Carlo, enveloping bridging potentials, molecular systems, free-energy profile, malonaldehyde, HCl dimer, water dimer

337. ❌ Exact tunneling splittings of rotationally excited states from symmetrized path-integral molecular dynamics

Authors: Lea Zupan, Yu-Chen Wang, Jeremy O. Richardson Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12638v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

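Scoring rationale: This paper studies the calculation of tunneling splittings with molecular dynamics and belongs to computational and physical chemistry; it has no direct connection to any of the keywords on large models, deep learning, or AI technology. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the study is a scientific computing application. However, the paper does not use AI or machine learning and relies on a conventional numerical method based on path-integral molecular dynamics, so it receives only 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This paper extends the symmetrized path-integral molecular dynamics method to compute exact tunneling splittings of molecules in rotationally excited states, and validates its accuracy with calculations on water and ammonia.

Abstract translation (Chinese)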

我们将先前提出的对称化路径积分分子动力学方法拓展至计算转动激发态分子的隧穿分裂。在此新形式体系中，系统通过埃卡特弹簧被严格投影至选定对称性的特定转动流形与状态，该弹簧通过置换-反演-转动操作连接环状聚合物的两个末端珠点。一旦所有模拟参数均达到收敛，该方法在统计不确定度范围内具有数值精确性。重要的是，该方法能够从单次模拟中同时提取多个总角动量量子数$J$对应的隧穿分裂值，且相较于原始方法无需增加计算成本。通过计算水分子（超越刚性转子近似）的转动能级验证该形式体系后，我们将其应用于氨分子，获得了与精确变分基准高度吻合的转动分辨隧穿分裂值。除势能面本身引入的微小误差外，计算结果成功捕捉到实验观测到的隧穿分裂随$J$增大而减小的趋势。

Abstract

We extend our previous symmetrized path-integral molecular dynamics approach to calculate tunneling splittings of molecules in rotationally excited states. In this new formalism, the system is rigorously projected onto selected rotational manifolds and states of a chosen symmetry through an Eckart spring, which connects the two end beads of the ring polymer via a permutation–inversion–rotation operation. This method is numerically exact within statistical uncertainty once convergence with respect to all simulation parameters has been achieved. Importantly, it enables the simultaneous extraction of tunneling splittings for multiple total angular-momentum quantum numbers $J$ from a single set of simulations, without additional computational cost relative to the original approach. After validating the formalism by computing the rotational levels of water (beyond the rigid-rotor approximation), we apply it to ammonia and obtain rotationally resolved tunneling splittings in excellent agreement with exact variational benchmarks. Except for small errors due to the underlying potential energy surface, the results capture the experimentally observed trend that the tunneling splitting decreases with $J$.

Keywords: tunneling splittings, rotationally excited states, symmetrized path-integral molecular dynamics, Eckart spring, ring polymer, ammonia, water, variational benchmarks

338. ❌ Hierarchical generative modeling for the design of multi-component systems

Authors: Rhyan Barrett, Robin Curth, Julia Westermayr Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12607v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

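Scoring rationale: The paper belongs to chemistry and materials science and proposes a hierarchical generative optimization framework that combines a genetic algorithm with a generative model to design multi-component systems (such as catalysts and enzyme active sites). Its core is molecular design and system optimization, an application of AI for Science, so it is highly relevant to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). However, it does not address large language models (LLMs), deep learning principles, model training, alignment, inference optimization, or agent systems, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

This study proposes a hierarchical generative optimization framework that couples a genetic algorithm with a generative model to jointly optimize the molecular composition and spatial configuration of multi-component systems (such as catalysts), and designs a catalytic environment that lowers the activation barrier of a Claisen rearrangement by 30%.

Abstract translation (Chinese)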

催化剂、酶和超分子组装体的功能并非源于单个分子本身，而是源于复杂系统中多组分之间微妙的相互作用。设计此类系统是一项重大挑战：可能的化学组成与空间排列的组合爆炸使得暴力探索难以实现，而当前许多生成方法仍局限于孤立分子。本研究引入了一种分层生成优化框架，通过将构型搜索的遗传算法与分子设计的生成模型相耦合，突破了这一障碍。这种闭环方法能够同步优化几何结构与化学组成，从而高效引导针对目标功能的系统发现。作为概念验证，我们通过围绕固定参考过渡态几何结构优化周围组分，设计了用于对甲苯基醚克莱森重排反应的催化环境。尽管在搜索阶段存在此约束，通过爬坡图像推弹力带计算的事后验证证实，活化能垒降低了30%。除该案例外，本框架为功能性多组分系统的数据驱动发现提供了通用策略，为催化剂、酶活性位点和先进材料的自动化设计开启了大门。科学贡献：本研究提出了一种闭环生成框架，能够实现多组分系统中分子组分及其空间组织的联合优化。该方法将生成式分子设计从单一分子推向更庞大、更复杂的系统。

Abstract

The functionality of catalysts, enzymes, and supramolecular assemblies emerges not from individual molecules alone, but from the subtle interplay between multiple components arranged in complex systems. Designing such systems is a grand challenge, the combinatorial explosion of possible chemical compositions and spatial arrangements makes brute-force exploration infeasible, while many current generative approaches remain limited to isolated molecules. In this work, we introduce a hierarchical generative optimization framework that overcomes this barrier by coupling a genetic algorithm for configurational search with a generative model for molecular design. This closed-loop approach enables simultaneous refinement of geometry and composition, efficiently steering discovery toward systems with targeted functionality. As a proof of concept, we design catalytic environments for the Claisen rearrangement of p-tolyl ether by optimizing surrounding components around a fixed reference transition-state geometry. Despite this constraint during the search phase, post-hoc validation via Climbing-Image Nudged Elastic Band calculations confirm a 30% reduction in activation barrier. Beyond this example, our framework provides a general strategy for data-driven discovery of functional multi-component systems, opening the door to automated design of catalysts, enzyme active sites, and advanced materials. Scientific contribution. The study presents a closed loop generative framework that enables joint optimization of molecular components and their spatial organization in multi-component systems. The method moves generative molecular design beyond single molecules toward larger and more complex systems.

Keywords: hierarchical generative modeling, multi-component systems, catalyst design, genetic algorithm, molecular design, closed-loop optimization, Claisen rearrangement, activation barrier reduction

339. ❌ Velocity Formulations for Hyper-Rayleigh Scattering Optical Activity Spectroscopy: Addressing the Origin-dependence Problem

Authors: Andrea Bonvicini, Sonia Coriani, Benoît Champagne Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12386v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

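Scoring rationale: This paper belongs to theoretical chemical physics and derives the mathematical formulation of hyper-Rayleigh scattering optical activity spectroscopy, which falls under conventional computational chemistry. It does not involve large models, deep learning, AI technology, or any machine learning method, and is unrelated to all of the scoring keywords.

!!! tip deepseek-chat TL;DR

This paper presents a velocity formulation of the theory of hyper-Rayleigh scattering optical activity (HRS-OA) spectroscopy that addresses the origin-dependence problem of the conventional length formulation, and shows that the new formulation is suited to computing HRS-OA invariants with approximate wavefunctions.

Abstract translation (Chinese)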

超瑞利散射光学活性（HRS-OA）光谱的理论先前已在描述该过程所需的纯电偶极及混合（电偶极/磁偶极与电偶极/电四极）第一超极化率的长度形式中得以阐述。在本工作中，我们提出了这些纯形式与混合形式超极化率的另一种表述。这一新表述利用了二次响应函数中所涉及的电偶极与电四极矩算符的速度形式。两种表述中获得的规范原点平移存在一一对应关系。这些关系确保了速度形式下的理论同样具有原点无关性。此外，尽管速度形式相较于长度形式对基组的依赖性更为显著，但前者在设计上即为原点无关的。这一特性使其特别适用于采用近似（变分或非变分）波函数进行HRS-OA不变量的计算。

Abstract

The theory of hyper-Rayleigh scattering optical activity (HRS-OA) spectroscopy has previously been described within the length formulation of the pure electric-dipole and mixed (electric-dipole/magnetic-dipole and electric-dipole/electric-quadrupole) first hyperpolarizabilities required for the description of this process. In this work, we provide an alternative formulation of these pure and mixed hyperpolarizabilities. This new formulation made use of the velocity form of the electric-dipole and electric-quadrupole moment operators that enter in the quadratic response functions. A one-to-one correspondence is found for the gauge-origin shifts obtained in the two formulations. These relations ensure the origin-independence of the theory also for the velocity formulation. Furthermore, even though the basis set dependence of the velocity formulation is more significant compared to the length one, the former is origin-independent by design. This property makes it particularly suitable for calculations of HRS-OA invariants using approximated (variational or not) wavefunctions.

Keywords: hyper-Rayleigh scattering optical activity, HRS-OA, velocity formulation, origin-independence, hyperpolarizabilities, quadratic response functions, gauge-origin shifts, wavefunction calculations

340. ❌ A Periodic Orbit Trace Formula for Quantum Scrambling: The Role of the Normally Hyperbolic Invariant Manifold

Authors: Stephen Wiggins Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12369v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

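Scoring rationale: This paper studies the semiclassical theory of quantum information scrambling and involves concepts from theoretical and mathematical physics such as Hamiltonian systems, periodic orbits, and invariant manifolds. It is unrelated to all of the scoring keywords, which center on large models, deep learning techniques, and their applications. The paper does not involve artificial intelligence, machine learning, or large language models.

!!! tip deepseek-chat TL;DR

This paper derives the leading-order semiclassical expansion of the local microcanonical OTOC (a measure of quantum information scrambling) for systems with an index-1 saddle point, expresses the scrambling rate as a coherent sum over unstable periodic orbits on the normally hyperbolic invariant manifold, and shows that the dependence of the instability exponent on transverse actions provides a theoretical mechanism for mode-selective control of scrambling.

Abstract translation (Chinese)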

越时无序关联函数（Out-of-Time-Order Correlators，OTOCs）可用于量化量子信息置乱，但其与局域化相空间结构（如化学过渡态）的联系仍需形式化发展。本文针对具有一阶鞍点的系统，推导了局域微正则OTOC的领头阶半经典展开，将置乱速率表达为法向双曲不变流形（Normally Hyperbolic Invariant Manifold，NHIM）上不稳定周期轨道的相干求和。该推导在半经典极限及埃伦费斯特时间之前的中期时间范围内成立，利用了过渡态的正则形式理论，该理论将鞍点附近的哈密顿量转化为依赖于守恒作用量的可积（尽管通常不可分离）形式。我们概述了微正则迹的推导、可积系统的半经典传播子、稳定性矩阵的因子分解，以及稳相近似的舒尔补约化过程。本研究将周期轨道迹方法拓展至置乱观测量，得到了主导半经典增长窗口的局域不稳定指数Λ(J)。作为特例，当观测时间与贡献轨道的本征周期重合时，迹求和可简化为有效的1.5Λ标度关系，该结果源于局域双曲增长与波包稀释之间的竞争。此简化形式具有条件性；完整展开仍保留对轨道周期的相干求和。最后，我们讨论了不稳定指数对横向作用量的依赖关系如何为置乱的模式选择性控制建立理论机制，并概述了验证这些预测的数值计算策略。

Abstract

Out-of-Time-Order Correlators (OTOCs) quantify quantum information scrambling, but their connection to localized phase-space structures, such as chemical transition states, requires formal development. We derive a leading-order semiclassical expansion for the local microcanonical OTOC in systems with an index-1 saddle point, expressing the scrambling rate as a coherent sum over unstable periodic orbits on the Normally Hyperbolic Invariant Manifold (NHIM). Valid in the semiclassical limit and the intermediate-time regime before the Ehrenfest time, our derivation utilizes the Normal Form theory of the transition state, which transforms the Hamiltonian near the saddle into an integrable (though generally non-separable) form dependent on conserved actions. We outline the derivation of the microcanonical trace, the semiclassical propagator for integrable systems, the factorization of the stability matrix, and the Schur complement reduction of the stationary phase approximation. Our result extends periodic-orbit trace methods to scrambling observables, yielding a local instability exponent Λ(J) governing the leading semiclassical growth window. As a special case, when the observation time coincides with the intrinsic periods of the contributing orbits, the trace sum reduces to an effective 1.5Λ scaling, resulting from the competition between local hyperbolic growth and wavepacket dilution. This simplified form is conditional; the full expansion retains a coherent sum over orbit periods. Finally, we discuss how the dependence of the instability on transverse actions establishes a theoretical mechanism for mode-selective control of scrambling, and outline a numerical evaluation strategy to test these predictions.

Keywords: Out-of-Time-Order Correlators, quantum information scrambling, semiclassical expansion, Normally Hyperbolic Invariant Manifold, periodic orbits, transition state theory, instability exponent, mode-selective control

341. ❌ Surface Plasmons in the Continuum

Authors: Mohit Chaudhary, Hans-Christian Weissker, Daniele Toffoli, Mauro Stener, Victor Despré, Franck Rabilloud, Jean Lermé, Rajarshi Sinha-Roy Journal/Source: arxiv Published: 2026-04-13 arXiv link: http://arxiv.org/abs/2604.12008v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

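Scoring rationale: The paper "Surface Plasmons in the Continuum" studies surface plasmon resonances of metal clusters (such as aluminum and indium) in the ultraviolet energy range, using ab initio calculations within the time-evolution formalism of time-dependent density-functional theory. All of the keywords concern large models, deep learning principles, or AI applications, whereas the paper's subject is quantum mechanical simulation in computational physics and materials science; it does not involve large models, deep learning, AI technology, or related applications. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is in scientific computing, but it does not explicitly use AI methods, so it receives 5 points (some relevance). The other keywords are unrelated and score 0. The rationale computes a weighted total of 5.0 (a single keyword scoring 5 at weight 1.0), which differs from the weight of 0.0 and the total of 0.0 recorded in the table above.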
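The arithmetic behind the pass threshold and the weighted total mentioned in this rationale can be sketched as follows. This is a minimal illustration, not the report generator's actual code: the threshold formula (5.0 + 0.8 × total weight) is taken from the report's configuration, while the rule "per-keyword score = weight × relevance" and the function names `passing_score` and `weighted_total` are assumptions made here for illustration.

```python
from typing import Iterable, Tuple


def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass threshold from the report's stated formula: 5.0 + 0.8 x total weight."""
    return base + factor * total_weight


def weighted_total(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum of weight x relevance over (weight, relevance) pairs (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# 26 keywords with relevance 0 and one keyword with relevance 5.0.
relevances = [0.0] * 26 + [5.0]

# Weights of 1.0, as listed in the keyword configuration.
as_configured = weighted_total((1.0, r) for r in relevances)
# Weights of 0.0, as printed in the per-paper score tables.
as_tabulated = weighted_total((0.0, r) for r in relevances)

threshold = passing_score(27.0)
print(round(threshold, 1), as_configured, as_tabulated)  # 26.6 5.0 0.0
```

Under either set of weights the total stays well below the 26.6 threshold, so the fail verdict is the same in both cases.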

!!! tip deepseek-chat TL;DR

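The paper proposes a time-evolution approach based on time-dependent density-functional theory for calculating surface plasmon resonances in the continuum of metal clusters. Using Al$_{13}^-$ as the reference system, it shows that the method accurately captures the broad surface plasmon in the ultraviolet; applied to aluminum clusters, it reveals the size-dependent evolution from discrete spectral features in Al$_6$ to the surface plasmon of larger clusters in the deep ultraviolet.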

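Abstract Translation

Growing interest in plasmonic applications at ultraviolet energies has spurred research on clusters of unconventional plasmonic materials such as aluminum and indium, whose surface plasmon resonance lies above the ionization potential. The quantum-mechanical description must therefore include the ionization process, which makes ab initio calculations challenging. We present a robust approach, based on the time-evolution formalism of time-dependent density-functional theory, for calculating surface plasmon resonances in the continuum of metal clusters. Taking the widely studied Al$_{13}^-$ as the reference system, we show that an accurate description of the continuum and of cluster ionization captures the broad surface plasmon in the ultraviolet. Applying the method to aluminum clusters reveals the size-dependent evolution from discrete spectral features in Al$_6$ to the surface plasmon of larger clusters in the deep ultraviolet.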

Abstract

The interest to foster plasmonic applications at energies in the ultra-violet, has escalated research initiatives in clusters of unconventional plasmonic materials like aluminum and indium, for which the surface-plasmon resonance appears above the ionization potential. Naturally, the quantum mechanical description calls for the incorporation of the ionization process, thereby making the ab initio calculations challenging. We present a robust approach within the time-evolution formalism of the time-dependent density-functional theory to calculate surface plasmon resonance in the continuum of metal clusters. Using the much studied Al$_{13}^-$ as a system of reference, we show that accurate description of the continuum and of the ionization of the cluster allow to capture a broad surface-plasmon in the UV. Application of this approach in aluminum clusters has given the size-dependent evolution from discrete spectral features in Al$_{6}$ to the surface-plasmon in larger clusters in the deep ultra-violet.

Keywords: surface plasmon resonance, time-dependent density-functional theory, metal clusters, ultra-violet, ionization continuum, ab initio calculations, aluminum clusters, size-dependent evolution

Token Usage Statistics

Total: 1,076,309 tokens (input 736,484 / output 339,825)

Model	Input	Output	Total
deepseek-chat	619,211	339,825	959,036
glm-4.7	117,273	0	117,273
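As a quick consistency check, the per-model figures can be summed and compared with the reported totals. This is a minimal sketch using only the numbers in the table above; the `usage` dictionary layout is an illustrative assumption, not the report generator's actual data structure.

```python
# Per-model token usage copied from the table: (input tokens, output tokens).
usage = {
    "deepseek-chat": (619_211, 339_825),
    "glm-4.7": (117_273, 0),
}

# Aggregate across models.
total_input = sum(inp for inp, _ in usage.values())
total_output = sum(out for _, out in usage.values())
grand_total = total_input + total_output

# Per-model totals, matching the table's last column.
per_model_total = {model: inp + out for model, (inp, out) in usage.items()}

print(f"input={total_input:,} output={total_output:,} total={grand_total:,}")
```

The sums agree with the reported totals of 736,484 input tokens, 339,825 output tokens, and 1,076,309 tokens overall.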

📋 All Papers

1. ✅ QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence

2. ✅ AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

3. ✅ LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

4. ✅ Latent-Condensed Transformer for Efficient Long Context Modeling

5. ✅ Agentic Discovery with Active Hypothesis Exploration for Visual Recognition

6. ✅ Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models

7. ✅ Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models

8. ✅ Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score

9. ✅ Beyond Factual Grounding: The Case for Opinion-Aware Retrieval-Augmented Generation

10. ✅ Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

11. ❌ Agentic Control in Variational Language Models

12. ❌ Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents

13. ❌ The role of System 1 and System 2 semantic memory structure in human and LLM biases

14. ❌ TriFit: Trimodal Fusion with Protein Dynamics for Mutation Fitness Prediction

15. ❌ Operationalising the Right to be Forgotten in LLMs: A Lightweight Sequential Unlearning Framework for Privacy-Aligned Deployment in Politically Sensitive Environments

16. ❌ Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

17. ❌ ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

18. ❌ Evaluating the Limitations of Protein Sequence Representations for Parkinson’s Disease Classification

19. ❌ Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport

20. ❌ GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

21. ❌ Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection

22. ❌ Adaptive Budget Allocation in LLM-Augmented Surveys

23. ❌ StructDiff: A Structure-Preserving and Spatially Controllable Diffusion Model for Single-Image Generation

24. ❌ PAL: Personal Adaptive Learner

25. ❌ Visual Preference Optimization with Rubric Rewards

26. ❌ Representation geometry shapes task performance in vision-language modeling for CT enterography

27. ❌ Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

28. ❌ Bilevel Late Acceptance Hill Climbing for the Electric Capacitated Vehicle Routing Problem

29. ❌ Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

30. ❌ LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software

31. ❌ One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

32. ❌ ROSE: An Intent-Centered Evaluation Metric for NL2SQL

33. ❌ Parallax: Why AI Agents That Think Must Never Act

34. ❌ Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

35. ❌ Modeling Co-Pilots for Text-to-Model Translation

36. ❌ Distorted or Fabricated? A Survey on Hallucination in Video LLMs

37. ❌ Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

38. ❌ CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference

39. ❌ Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

40. ❌ BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design

41. ❌ Towards Long-horizon Agentic Multimodal Search

42. ❌ FastGrasp: Learning-based Whole-body Control method for Fast Dexterous Grasping with Mobile Manipulators

43. ❌ AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance

44. ❌ LIFE – an energy efficient advanced continual learning agentic AI framework for frontier systems

45. ❌ From edges to meaning: Semantic line sketches as a cognitive scaffold for ancient pictograph invention

46. ❌ Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic

47. ❌ Loop Corrections to the Training and Generalization Errors of Random Feature Models

48. ❌ Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models

49. ❌ RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

50. ❌ DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

51. ❌ Algorithmic Analysis of Dense Associative Memory: Finite-Size Guarantees and Adversarial Robustness

52. ❌ Rethinking Satellite Image Restoration for Onboard AI: A Lightweight Learning-Based Approach

53. ❌ Efficiency of Proportional Mechanisms in Online Auto-Bidding Advertising

54. ❌ VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation

55. ❌ OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

56. ❌ Efficient Adversarial Training via Criticality-Aware Fine-Tuning

57. ❌ DoseRAD2026 Challenge dataset: AI accelerated photon and proton dose calculation for radiotherapy

58. ❌ Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling

59. ❌ CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models

60. ❌ ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

61. ❌ GF-Score: Certified Class-Conditional Robustness Evaluation with Fairness Guarantees

62. ❌ Can AI Tools Transform Low-Demand Math Tasks? An Evaluation of Task Modification Capabilities

63. ❌ Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning