📊 ArXiv Research Report (2026-03-19)

Generated: 2026-03-19 09:23:06
Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6 (5.0 + 0.8 × 27.0)
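
The threshold arithmetic can be checked with a short sketch. It assumes that a keyword's score is weight × relevance (consistent with the tables below, where every weight is 1.0) and that the per-keyword maximum is applied as a simple cap; the function names are illustrative and not taken from the report generator.

```python
def passing_score(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Passing score = base + slope * total weight, per the formula in the report."""
    return base + slope * total_weight


def paper_score(rows, max_per_keyword: float = 15.0) -> float:
    """Sum weight * relevance over keywords, capping each keyword (assumed rule)."""
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


# 27 keywords, each with weight 1.0
threshold = passing_score(27 * 1.0)

# Non-zero relevance values from the first paper's score breakdown
rows = [(1.0, r) for r in (10, 5, 8, 5, 8, 8, 5, 5, 10)]
total = paper_score(rows)
print(round(threshold, 1), total, total >= threshold)  # 26.6 64.0 True
```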

📈 Paper statistics

Total fetched: 327 papers
Passing papers: 14 (4.3%)
Deep analysis: 5 papers

⭐ Detailed analysis of passing papers

1. Surg$Σ$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence

Authors: Zhitao Zeng, Mengya Xu, Jian Jiang, Pengfei Guo, Yunqiu Xu, Zhu Zhuo, Chang Han Low, Yufan He, Dong Yang, Chenxi Lin, Yiming Gu, Jiaxin Guo, Yutong Ban, Daguang Xu, Qi Dou, Yueming Jin
Source: arXiv
Published: 2026-03-17
arXiv link: http://arxiv.org/abs/2603.16822v1

Score: 64.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

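Scoring rationale: The paper's core contribution is a large-scale multimodal surgical dataset (Surg$Σ$-DB) and surgical foundation models built on it, an application of large models to science (medicine). Highly relevant keywords: “Large Language Models/Foundation Models” (the paper explicitly discusses multimodal large language models and foundation models), “AI for Science/Bioinformatics” (surgical AI is treated as falling under bioinformatics and AI for science), and “Chain of Thought/System 2 Thinking” (the dataset includes hierarchical reasoning annotations that support deeper understanding). Moderately relevant keywords: “Pre-training/Domain Adaptation” (building foundation models involves pre-training and domain adaptation), “Scaling Laws & Data Quality” (the paper stresses large-scale, high-quality data), “Explainable AI” (interpretability is mentioned), and “In-context Learning” (foundation models generally have in-context learning ability). Other keywords such as MoE, quantization, alignment, and RAG do not appear in the abstract and are scored 0.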

!!! tip deepseek-chat TL;DR

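To address the lack of large-scale multimodal data for surgical AI, the paper proposes Surg$Σ$, which comprises a large-scale multimodal dataset (Surg$Σ$-DB) and surgical foundation models built on it, with the aim of improving cross-task generalization and interpretability.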

Abstract

Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advancement in surgery remains constrained by the lack of large-scale, systematically curated multimodal data. To address this challenge, we introduce Surg$Σ$, a spectrum of large-scale multimodal data and foundation models for surgical intelligence. At the core of this framework lies Surg$Σ$-DB, a large-scale multimodal data foundation designed to support diverse surgical tasks. Surg$Σ$-DB consolidates heterogeneous surgical data sources (including open-source datasets, curated in-house clinical collections and web-source data) into a unified schema, aiming to improve label consistency and data standardization across heterogeneous datasets. Surg$Σ$-DB spans 6 clinical specialties and diverse surgical types, providing rich image- and video-level annotations across 18 practical surgical tasks covering understanding, reasoning, planning, and generation, at an unprecedented scale (over 5.98M conversations). Beyond conventional multimodal conversations, Surg$Σ$-DB incorporates hierarchical reasoning annotations, providing richer semantic cues to support deeper contextual understanding in complex surgical scenarios. We further provide empirical evidence through recently developed surgical foundation models built upon Surg$Σ$-DB, illustrating the practical benefits of large-scale multimodal annotations, unified semantic design, and structured reasoning annotations for improving cross-task generalization and interpretability.

Keywords: Surgical Intelligence, Multimodal Foundation Models, Large-scale Multimodal Data, Cross-task Generalization, Hierarchical Reasoning, Surgical AI, Medical Domain Adaptation, Interpretability

2. Exploring different approaches to customize language models for domain-specific text-to-code generat

Authors: Luís Freire, Fernanda A. Andaló, Nicki Skafte Detlefsen
Source: arXiv
Published: 2026-03-17
arXiv link: http://arxiv.org/abs/2603.16526v1

Score: 60.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

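Scoring rationale: The paper studies how to customize small open-source language models for domain-specific text-to-code generation. It directly involves the keywords LLMs, SLMs, LoRA, RAG, and In-context Learning (few-shot prompting), which are its core methods. Domain Adaptation and SFT relate to the fine-tuning strategy but are not central. Other keywords such as MoE, Scaling Laws, RLHF, and Agents are not mentioned in the abstract and are unrelated to the paper.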

!!! tip deepseek-chat TL;DR

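The study explores how to customize small language models for domain-specific text-to-code generation and finds that, on most tasks, LoRA-based fine-tuning outperforms few-shot prompting and retrieval-augmented generation (RAG) in accuracy and domain alignment.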

Abstract

Large language models (LLMs) have demonstrated strong capabilities in generating executable code from natural language descriptions. However, general-purpose models often struggle in specialized programming contexts where domain-specific libraries, APIs, or conventions must be used. Customizing smaller open-source models offers a cost-effective alternative to relying on large proprietary systems. In this work, we investigate how smaller language models can be adapted for domain-specific code generation using synthetic datasets. We construct datasets of programming exercises across three domains within the Python ecosystem: general Python programming, Scikit-learn machine learning workflows, and OpenCV-based computer vision tasks. Using these datasets, we evaluate three customization strategies: few-shot prompting, retrieval-augmented generation (RAG), and parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA). Performance is evaluated using both benchmark-based metrics and similarity-based metrics that measure alignment with domain-specific code. Our results show that prompting-based approaches such as few-shot learning and RAG can improve domain relevance in a cost-effective manner, although their impact on benchmark accuracy is limited. In contrast, LoRA-based fine-tuning consistently achieves higher accuracy and stronger domain alignment across most tasks. These findings highlight practical trade-offs between flexibility, computational cost, and performance when adapting smaller language models for specialized programming tasks.

Keywords: language models, code generation, domain-specific, fine-tuning, LoRA, retrieval-augmented generation, few-shot prompting, synthetic datasets

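Deep analysis:

Exploring different approaches to customizing language models for domain-specific text-to-code generation

Summary:

This paper examines how synthetic datasets can be used to customize smaller open-source language models for domain-specific code generation. The motivation is that general-purpose large models handle domain-specific libraries, APIs, and coding conventions poorly, while large proprietary models are costly and raise privacy constraints. The authors build synthetic datasets covering three domains (general Python, Scikit-learn machine learning, and OpenCV computer vision) and evaluate three customization strategies: few-shot prompting, retrieval-augmented generation (RAG), and LoRA-based parameter-efficient fine-tuning. Prompting-based methods improve domain relevance at low cost but have limited effect on benchmark accuracy, whereas LoRA fine-tuning consistently achieves higher accuracy and stronger domain alignment on most tasks. The study offers practical guidance for balancing performance, cost, and deployment constraints in specialized programming tasks.

Contributions:

Proposes a systematic empirical comparison framework for evaluating how few-shot learning, RAG, and LoRA fine-tuning differ in performance on domain-specific code generation.
Builds a complete synthetic data generation pipeline in which a teacher model (GPT-4o) produces domain-specific programming exercises spanning different difficulty levels and contexts, addressing the scarcity of domain data.
Designs a combined evaluation framework that pairs benchmark tests (functional correctness) with similarity metrics (domain alignment) to assess code quality.
Shows that, for domain-specific code generation, parameter-efficient fine-tuning (LoRA) holds a clear advantage over prompt-engineering methods in accuracy and domain consistency.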

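Method

!!! info

The study uses a four-stage customization pipeline. First, GPT-4o serves as a teacher model and generates, via structured prompts, a synthetic dataset of natural-language descriptions paired with Python implementations. Second, the generated samples are validated for syntactic and semantic correctness. Third, the validated dataset is used to adapt two open-source models, StarCoder and DeepSeekCoder, with each of three strategies: few-shot prompting, RAG, and LoRA fine-tuning. Finally, the generated code is evaluated with automated test cases (benchmark tests) and code-similarity metrics such as CodeBLEU, comparing the strategies on functional correctness and domain alignment.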
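
To make the LoRA strategy concrete, here is a minimal pure-Python sketch of the standard low-rank update, W' = W + (alpha / r) · B · A, applied to a single weight matrix. The shapes and numeric values are made up for illustration, and nothing here reflects the authors' actual training code or hyperparameters.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the effective weight after a LoRA update.

    w: d_out x d_in frozen base weight
    a: r x d_in trainable down-projection
    b: d_out x r trainable up-projection
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, r):
    """(LoRA parameters, full fine-tuning parameters) for one weight matrix."""
    return r * (d_in + d_out), d_out * d_in


# Toy shapes: d_out = 2, d_in = 3, rank r = 1 (hypothetical values)
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
a = [[0.5, 0.5, 0.5]]
b_zero = [[0.0], [0.0]]       # usual LoRA init: B = 0, so the update starts at zero
b_trained = [[2.0], [-2.0]]   # made-up values standing in for a trained B

print(lora_weight(w, a, b_zero, alpha=2.0) == w)   # True
print(lora_weight(w, a, b_trained, alpha=2.0))     # [[3.0, 2.0, 4.0], [-2.0, -1.0, -2.0]]
print(trainable_params(4096, 4096, 8))             # (65536, 16777216)
```

The last line shows why the approach is called parameter-efficient: at rank 8, a 4096 × 4096 matrix needs 65,536 trainable values instead of about 16.8 million.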

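Key results:

LoRA fine-tuning performs best on most tasks, substantially improving functional correctness and alignment with domain-specific APIs.
Prompting-based methods such as few-shot prompting and RAG are cheaper to run and more flexible, but yield limited gains in benchmark accuracy; they mainly improve the relevance of code style.
The synthetic data generation strategy mitigates the scarcity of high-quality domain-specific training data and can be used successfully for knowledge distillation.
Sensitivity to the adaptation strategy varies by domain (general Python vs. Scikit-learn vs. OpenCV); API-heavy tasks show the benefit of fine-tuning most clearly.

Tech stack: Transformer architecture, autoregressive objective, in-context learning, retrieval-augmented generation (RAG), low-rank adaptation (LoRA), knowledge distillation, GPT-4o (teacher model), StarCoder, DeepSeekCoder, Python, Scikit-learn, OpenCV, CodeBLEU, BLEU, ROUGE, Pass@k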
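
Pass@k appears in the tech stack without further detail. One widely used definition is the unbiased estimator 1 − C(n − c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. The sketch below implements that estimator; whether the paper uses this exact form is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem
    c: samples that pass all tests
    k: number of attempts allowed
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct: chance that at least one of k random draws is correct
for k in (1, 5, 10):
    print(k, round(pass_at_k(10, 3, k), 4))
```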

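Strengths

Practical: focuses on smaller open-source models, lowering deployment cost and privacy risk, which suits real industrial use.
Thorough evaluation: combines functional and similarity metrics, checking both whether the code runs and whether it follows domain conventions.
Clear comparison: systematically compares three mainstream adaptation strategies, giving practitioners a concrete basis for choosing among them.
Data approach: uses synthetic data to address the shortage of domain-specific data, and the approach scales well.

Limitations

Dependence on synthetic data: despite the validation step, data quality is bounded by the capability of the teacher model (GPT-4o) and may contain biases or errors.
Limited domain coverage: the study covers only three domains within the Python ecosystem; whether the conclusions carry over to other programming languages or highly specialized domains remains to be verified.
Context-window limits: RAG and few-shot methods are constrained by the model's context window and may not fully exploit an external knowledge base.
Metric limits: similarity metrics such as CodeBLEU account for syntactic structure but may still fail to capture semantic equivalence of code.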

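Relevance to the research focus:

The paper is highly relevant. It deals directly with LLM techniques (LoRA, RAG, knowledge distillation) and their application to a specific domain (code generation). Its use of technical innovation (a synthetic data pipeline, parameter-efficient fine-tuning) to overcome the limits of general-purpose models in specialized settings matches the evaluation criteria “innovation in the technical principles of large models and deep learning” and “research applications of large models in different domains.” Its exploration of lightweight deployment and domain adaptation in particular is both innovative and practically useful.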

3. Parallel In-context Learning for Large Vision Language Models

Authors: Shin’ya Yamaguchi, Daiki Chijiwa, Tamao Sakao, Taku Hasegawa
Source: arXiv
Published: 2026-03-17
arXiv link: http://arxiv.org/abs/2603.16092v1

Score: 56.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	8.0/10	8.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	8.0/10	8.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	5.0/10	5.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	15.0/10	15.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

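Scoring rationale: The paper proposes the Parallel-ICL algorithm to address inference latency in multimodal in-context learning for large vision-language models. Core keywords: “In-context Learning” (15 points, the paper's central topic), “Large Language Models” (10 points, LVLMs are large models), “Speculative Decoding” (10 points, the paper directly targets inference acceleration), “Context Window Extension” (8 points, it handles long contexts), “Mixture of Experts” (8 points, it uses a Product-of-Experts ensemble), and “KV Cache Compression” (5 points, it involves optimizing attention computation). Other keywords such as SLMs, Scaling Laws, and Alignment are unrelated to the paper.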

!!! tip deepseek-chat TL;DR

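The paper proposes Parallel-ICL, which processes chunked contexts in parallel and combines them with a Product-of-Experts ensemble, significantly speeding up inference for large vision-language models while preserving multimodal in-context learning performance.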

Abstract

Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, they incur significant inference latency due to the quadratic computational cost of Transformer attention with respect to the context length. To address this trade-off, we propose Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm. Parallel-ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product-of-Experts (PoE) ensemble to approximate the full-context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel-ICL: (i) clustering-based context chunking to maximize inter-chunk diversity and (ii) similarity-based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL, while significantly improving inference speed. Our work offers an effective solution to the accuracy-efficiency trade-off in MM-ICL, enabling dynamic task adaptation with substantially reduced inference overhead.

Keywords: Large Vision-Language Models, In-context Learning, Parallel Processing, Inference Acceleration, Product-of-Experts, Multi-modal Learning, Context Chunking, Ensemble Learning

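Deep analysis:

Parallel In-context Learning for Large Vision Language Models

Summary:

Large vision-language models (LVLMs) suffer a marked increase in inference latency when more demonstrations are added for multimodal in-context learning (MM-ICL). This paper proposes Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm that addresses the problem. The method splits a long demonstration context into several smaller chunks, processes them in parallel, and combines their predictions at the logit level with a weighted Product-of-Experts ensemble to approximate the full-context output. Drawing on ensemble learning theory, the authors propose clustering-based context chunking to maximize inter-chunk diversity and similarity-based context compilation to weight predictions by query relevance. Experiments on VQA, image captioning, and classification benchmarks show that Parallel-ICL significantly improves inference speed while matching full-context MM-ICL, and that it sometimes exceeds full-context performance by mitigating the “lost in the middle” phenomenon.

Contributions:

Proposes Parallel-ICL, a plug-and-play inference algorithm that lowers MM-ICL inference latency by processing demonstration-context chunks in parallel.
Formalizes the combination of chunk-level predictions as a weighted Product-of-Experts ensemble, approximated at the logit level.
Proposes clustering-based context chunking, which aims to maximize diversity across chunks and so improve the ensemble.
Proposes similarity-based context compilation, which assigns weights according to the relevance of each chunk to the input query and so improves the accuracy of the final prediction.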

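Method

!!! info

The method has two main steps: context chunking and context compilation. First, a clustering algorithm partitions the demonstrations into clusters in the multimodal feature space, with each cluster forming one chunk, to maximize inter-chunk diversity. Next, the LVLM runs on the chunks in parallel, producing a logit output for each. Finally, weights are computed from the similarity between the input query and each chunk, and a weighted Product-of-Experts model combines all chunk logits into the final prediction. Experiments cover VQA, image captioning, and classification tasks on models including LLaVA-OV, Qwen2.5-VL, and InternVL3.5.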
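
The logit-level combination can be sketched in a few lines. A weighted Product-of-Experts takes p(y) ∝ ∏ₖ pₖ(y)^{wₖ}; because each chunk's log-softmax differs from its raw logits only by a constant, this equals a softmax over the weighted sum of logits. Turning similarities into weights with a softmax is an assumption made for this sketch, and the function names are illustrative rather than the authors'.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_poe(chunk_logits, similarities):
    """Combine per-chunk logits with a weighted Product-of-Experts.

    chunk_logits: one logit vector per chunk (all of the same vocabulary size)
    similarities: query-to-chunk similarity scores, one per chunk
    """
    weights = softmax(similarities)  # assumption: weights via softmax over similarities
    vocab = len(chunk_logits[0])
    combined = [sum(w * z[y] for w, z in zip(weights, chunk_logits)) for y in range(vocab)]
    return softmax(combined)


# Two chunks, three candidate tokens (toy numbers)
logits = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]
probs = weighted_poe(logits, similarities=[0.9, 0.1])
print([round(p, 3) for p in probs])
```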

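Key results:

Parallel-ICL matches full-context MM-ICL on VQA, image captioning, and classification benchmarks.
Inference is significantly faster; for example, the 4-chunk setting achieves a 1.34× speedup over 32-shot full-context inference.
In some cases Parallel-ICL is more accurate than full-context MM-ICL, suggesting that ensembling chunked contexts may mitigate the “lost in the middle” phenomenon in long contexts.
The proposed chunking and compilation strategies are shown to increase inter-chunk diversity and query relevance.

Tech stack: Product-of-Experts (PoE), clustering algorithms, Transformer attention, large vision-language models (LVLMs, e.g. LLaVA-OV, Qwen2.5-VL, InternVL3.5), ensemble learning theory, Fano’s inequality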

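Strengths

Plug-and-play: requires no fine-tuning or parameter updates and can be applied directly to existing LVLMs.
Efficient: processing short contexts in parallel mitigates the quadratic growth of Transformer attention cost with context length.
Theoretically grounded: the chunking and weighting strategies are derived from ensemble learning theory.
Robust: maintains accuracy and also partly mitigates information loss caused by long contexts.

Limitations

Although inference is faster, processing several chunks in parallel may increase total computation (FLOPs), even as parallelization reduces wall-clock time.
The method's effectiveness rests on the assumption that demonstrations are independent; if they have strong sequential dependencies, chunking may break contextual coherence.
Clustering and similarity computation add preprocessing overhead, which may affect overall end-to-end efficiency.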
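
A toy cost model shows how chunking can cut per-chunk attention cost while total compute may still rise. It counts only the quadratic attention-score term, assumes equal-sized chunks, and assumes the query is appended to every chunk; all token counts are hypothetical. Real speedups, such as the reported 1.34×, also depend on terms this model ignores.

```python
def attention_cost(seq_len: int) -> int:
    """Pairwise attention-score cost, up to a constant factor: seq_len ** 2."""
    return seq_len * seq_len


def chunked_vs_full(n_demo_tokens: int, n_query_tokens: int, n_chunks: int):
    """Return (full_cost, total_chunked_cost, per_chunk_cost) under the toy model."""
    full = attention_cost(n_demo_tokens + n_query_tokens)
    per_chunk = attention_cost(n_demo_tokens // n_chunks + n_query_tokens)
    return full, n_chunks * per_chunk, per_chunk


# Long demonstrations, short query: both total and per-chunk cost drop
print(chunked_vs_full(3200, 100, 4))   # (10890000, 3240000, 810000)

# Short demonstrations, long query: per-chunk cost drops but the total rises
print(chunked_vs_full(400, 1000, 4))   # (1960000, 4840000, 1210000)
```

The second case illustrates the first limitation above: replicating the query across chunks can raise total work even though each parallel chunk finishes sooner.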

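Relevance to the research focus:

The paper is highly relevant. It focuses on large vision-language models (LVLMs), a core area of large-model research, and proposes a novel technical solution to the inference-efficiency problem of multimodal in-context learning (MM-ICL). It covers model architecture optimization, inference acceleration algorithms, and the application of ensemble learning theory, which counts as innovation in the technical principles of large models and matches the user's interest in large-model and deep-learning innovation.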

4. Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimod

Authors: Yu Liu, Lei Zhang, Haoxun Li, Hanlei Shi, Yuxuan Ding, Leyuan Qu, Taihao Li
Source: arXiv
Published: 2026-03-17
arXiv link: http://arxiv.org/abs/2603.16463v1

Score: 50.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

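Scoring rationale: The paper focuses on applying multimodal large language models (MLLMs) to open-vocabulary emotion recognition. Its core innovation is a Hybrid-evidential Deductive Reasoning Architecture (HyDRA) that uses reinforcement learning with reward shaping to optimize reasoning trajectories. It is therefore highly relevant to "Large Language Models" (10 points) because it uses MLLMs directly; to "RLHF" (10 points) because it uses reinforcement learning for training alignment; to "Chain of Thought" and "System 2 Thinking" (10 points each) because it emphasizes a multi-step, in-depth reasoning process (the Propose-Verify-Decide protocol); and to "Mechanistic Interpretability" (10 points) because it provides interpretable evidence traces. Other keywords such as MoE, quantization, and RAG are not covered in the paper and score 0.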

!!! tip deepseek-chat TL;DR

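To address the performance bottleneck that ambiguous cues create in open-vocabulary multimodal emotion recognition, the paper proposes a Hybrid-evidential Deductive Reasoning Architecture (HyDRA) that optimizes reasoning trajectories with reinforcement learning, improving performance markedly in ambiguous or conflicting scenarios and providing interpretable evidence traces.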

Abstract

Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging due to the ambiguity of equivocal multimodal cues, which often stem from distinct unobserved situational dynamics. While Multimodal Large Language Models (MLLMs) offer extensive semantic coverage, their performance is often bottlenecked by premature commitment to dominant data priors, resulting in suboptimal heuristics that overlook crucial, complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association; it necessitates reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile these observations from diverse latent perspectives. We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose-Verify-Decide protocol. To internalize this abductive process, we employ reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance to ensure they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices, with HyDRA consistently outperforming strong baselines–especially in ambiguous or conflicting scenarios–while providing interpretable, diagnostic evidence traces.

Keywords: Multimodal Large Language Models, Open-Vocabulary Emotion Recognition, Hybrid-evidential Deductive Reasoning, Reinforcement Learning, Propose-Verify-Decide Protocol, Interpretable Evidence Traces, Affective Reasoning, Ambiguous Scenarios

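Deep analysis:

Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition

Summary:

Open-vocabulary multimodal emotion recognition (OV-MER) is challenged by ambiguous modality cues and unobserved situational dynamics. Existing multimodal large language models (MLLMs) often commit prematurely because they rely too heavily on dominant data priors, overlooking complementary affective cues across modalities. This paper proposes HyDRA (Hybrid-evidential Deductive Reasoning Architecture), which formalizes inference as a Propose-Verify-Decide protocol. The architecture first generates diverse hypotheses about the latent situation to avoid single-narrative bias, then eliminates hypotheses that conflict with the multimodal observations through evidence-constrained comparative verification, and finally selects the hypothesis that best reconciles the observed cues. To internalize this abductive process, the study optimizes the model with reinforcement learning (GRPO) under hierarchical reward shaping, aligning reasoning trajectories with final task performance. Experiments show that HyDRA clearly outperforms strong baselines in ambiguous or conflicting scenarios and provides interpretable, diagnostic evidence traces.

Contributions:

Proposes a hypothesis-driven reasoning interface for OV-MER that formalizes inference as a Propose–Verify–Decide process, avoiding premature commitment by generating multiple latent-situation hypotheses and adjudicating them under evidence constraints.
Designs policy optimization based on GRPO (Group Relative Policy Optimization) with a hierarchical reward mechanism, so that the model learns comparative verification and evidence closure rather than relying on prompt engineering alone.
Introduces a multi-path adjudication mechanism that resolves modality conflicts and ambiguity by synthesizing multiple evidence-grounded rationales, improving robustness.
Provides interpretable, diagnostic reasoning traces that help analyze model behavior in ambiguous and conflicting cases.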
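
The group-relative idea behind GRPO can be sketched as follows: several responses are sampled for the same input, and each reward is normalized against the mean and standard deviation of its own group. The reward values below are made up, the use of the population standard deviation is an assumption, and the paper's hierarchical reward shaping is not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward within its sampling group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled reasoning traces for one input, with hypothetical rewards
rewards = [1.0, 0.0, 0.0, 1.0]
print([round(a, 3) for a in group_relative_advantages(rewards)])  # [1.0, -1.0, -1.0, 1.0]

# If every trace earns the same reward, no trace is preferred
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
```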

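Method

!!! info

The paper uses a two-stage training pipeline. The first stage is cold-start multimodal supervision (SFT), which initializes the reasoning protocol on a corpus of structured reasoning traces. The second stage is policy optimization using Group Relative Policy Optimization (GRPO) with hierarchical reward shaping. The architecture comprises a causal Transformer backbone that integrates visual and audio encoders through projection layers. At inference time, the model performs Propose (generating K competing hypotheses), Verify (in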

5. Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not

Authors: Jia Qing Yap
Source: arXiv
Published: 2026-03-17
arXiv link: http://arxiv.org/abs/2603.16335v1

Score: 50.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

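Scoring rationale: The paper studies behavior control for a large MoE language model (Qwen 3.5-35B-A3B), using sparse autoencoders (SAEs) and decoded probe vectors for fine-grained control of agentic behavior. Highly relevant keywords: (1) “Large Language Models” (it studies a 35B-parameter model), (2) “Mixture of Experts” (the model has an MoE architecture), (3) “LLM Agents” (it studies control of agentic behavior), (4) “Tool Use” (it involves tools such as code execution and web search), and (5) “Mechanistic Interpretability” (it interprets the model and intervenes on behavior via SAEs and probes). Other keywords such as SLMs, training methods, and inference optimization are unrelated to the paper.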

!!! tip deepseek-chat TL;DR

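The paper proposes steering behavior in a 35B MoE language model via SAE-decoded probe vectors, and finds that all the steering vectors primarily modulate a single dominant agency axis (the disposition to act independently) rather than five independent traits.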

Abstract

We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model’s native activation space. This bypasses the SAE’s top-k discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios times 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen’s d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The tool-use vector steers behavior (d = 0.39); the risk-calibration vector produces only suppression. We additionally show that steering only during autoregressive decoding has zero effect (p > 0.35), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
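
The probe-decoding step can be illustrated with plain lists. A linear probe assigns one weight per SAE latent; projecting those weights through the decoder gives a single direction in the model's activation space, which is then scaled and added to the residual stream. The dimensions and numbers are toy values, and any normalization the author may apply is not shown.

```python
def decode_probe(probe_weights, decoder):
    """Project probe weights (one per SAE latent) through the SAE decoder.

    decoder[j] is the decoder direction of latent j in activation space,
    so the result is sum_j probe_weights[j] * decoder[j].
    """
    d_model = len(decoder[0])
    return [sum(w * row[i] for w, row in zip(probe_weights, decoder))
            for i in range(d_model)]


def steer(hidden, vector, multiplier):
    """Add the scaled steering vector to one residual-stream activation."""
    return [h + multiplier * v for h, v in zip(hidden, vector)]


# Toy setup: 3 SAE latents, 2-dimensional activation space (hypothetical values)
probe = [1.0, -0.5, 0.0]
decoder = [[1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]

vector = decode_probe(probe, decoder)
print(vector)                                      # [1.0, -1.0]
print(steer([0.0, 1.0], vector, multiplier=2.0))   # [2.0, -1.0]
```

Because the vector is a continuous combination of decoder directions, the multiplier can be varied smoothly, which is the sense in which the abstract describes bypassing the SAE's top-k discretization.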

Keywords: Mixture of Experts, Sparse Autoencoders, Behavioral Steering, Agentic Traits, Tool Use, Model Interpretability, Large Language Models, Activation Space

Authors: Ce Zhang, Jinxi He, Junyi He, Katia Sycara, Yaqi Xie
Source: arXiv
Published: 2026-03-16
arXiv link: http://arxiv.org/abs/2603.15800v1

Score: 50.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

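Scoring rationale: The paper's core topic is contextual safety in multimodal large language models (MLLMs), so it is highly relevant to "Large Language Models" (10 points). The proposed EchoSafe framework is built on a self-reflection mechanism that accumulates safety insights in a memory bank, which makes it highly relevant to "Self-Correction/Self-Improvement/Self-Reflection" (10 points). The work also touches on alignment, retrieval augmentation (RAG), chain-of-thought reasoning (CoT), in-depth reasoning (System 2 Thinking), hallucination mitigation, and in-context learning; these keywords are partially related to the method (5 points each). Other keywords, such as MoE, quantization, inference acceleration, and AI for Science, are not covered in the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses contextual safety in multimodal large language models on visual reasoning tasks. It proposes EchoSafe, a training-free framework that uses a self-reflective memory bank to enable context-aware reasoning and continually evolving safety behavior, and it reports superior performance on multiple benchmarks.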

Abstract

Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code are available at https://echosafe-mllm.github.io.

Keywords: Multi-modal Large Language Models, Contextual Safety, Self-Reflective Memory, Inference-Time Adaptation, Safety Benchmark, Visual Reasoning, Training-Free Framework, Context-Aware Reasoning

7. SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs wi

Authors: Zhouwei Zhai, Mengxiang Chen, Anmeng Zhang Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16137v1

Score: 49.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	8.0/10	8.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

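Scoring rationale: The paper's core topic is the industrial deployment of large language models for e-commerce search, which directly involves the keywords LLMs, knowledge hallucination mitigation, parameter-efficient fine-tuning, pre-training/domain adaptation, and instruction tuning/alignment. The paper explicitly discusses LLMs, the knowledge hallucination problem, a parameter-efficient pre-training strategy, multi-task instruction tuning, and alignment via adversarial training; these form its core content. The reasoning-chain-augmented data is partially related to CoT. Other keywords, such as MoE, SLMs, RAG, and quantization, do not appear in the abstract and are rated as unrelated.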
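The arithmetic behind the 49.0 / 26.6 line can be reproduced with a short sketch. It assumes that a paper's score is the sum of weight × relevance over all 27 keywords, which is consistent with the table columns, and it uses the pass-mark formula from the report's configuration section (5.0 + 0.8 × total weight). The function names are hypothetical.

```python
def paper_score(rows):
    """rows: list of (weight, relevance) pairs, one per keyword."""
    return sum(weight * relevance for weight, relevance in rows)

def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight

# Non-zero rows of the SIA table; the remaining 21 keywords score 0.
nonzero = [(1.0, 10.0), (1.0, 8.0), (1.0, 8.0), (1.0, 8.0), (1.0, 5.0), (1.0, 10.0)]
rows = nonzero + [(1.0, 0.0)] * 21

score = paper_score(rows)
threshold = pass_mark(total_weight=sum(w for w, _ in rows))
passed = score >= threshold
```

With these inputs the score comes to 49.0 against a threshold of about 26.6, matching the line above the table.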

!!! tip deepseek-chat TL;DR

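The paper targets the knowledge hallucination and security vulnerabilities that large language models face in industrial e-commerce search deployment. It proposes the Synthesize-Inject-Align framework, which combines high-quality corpus synthesis, parameter-efficient pre-training, and dual-path alignment; after deployment on JD.com, business metrics improved significantly.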

Abstract

Large language models offer transformative potential for e-commerce search by enabling intent-aware recommendations. However, their industrial deployment is hindered by two critical challenges: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance. To address these issues, we propose SIA, a Synthesize-Inject-Align framework for building knowledgeable and secure e-commerce search LLMs. Our approach first synthesizes high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data. We then introduce a parameter-efficient pre-training strategy based on Depth Up-Scaling to inject domain knowledge while preserving general capabilities. Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness. The framework has been deployed at JD.com, China’s largest self-operated e-commerce platform, where A/B tests across five core search scenarios demonstrate significant improvements in key business metrics, validating its industrial effectiveness and scalability.

Keywords: Large Language Models, E-commerce Search, Knowledge Hallucination, Security Vulnerabilities, Parameter-efficient Pre-training, Instruction Tuning, Adversarial Training, Industrial Deployment

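In-depth analysis:

SIA: A Synthesize-Inject-Align Framework for Industrial Deployment of Knowledge-Enhanced, Secure E-commerce Search LLMs

Summary:

The paper proposes the SIA framework to address two challenges in the industrial deployment of e-commerce search LLMs: knowledge hallucination and security compliance. The framework has three stages. First, it combines knowledge graphs with behavioral logs to synthesize a high-quality natural language corpus that includes reasoning chains and safety-aware data. Second, it injects domain knowledge through a parameter-efficient pre-training strategy based on Depth Up-Scaling while preserving general capabilities. Third, it applies dual-path alignment, via multi-task instruction tuning and adversarial training, to improve task performance and safety robustness. The framework has been deployed on JD.com, where A/B tests across five core search scenarios show significant improvements in key business metrics.

Contributions:

Proposes a novel method for synthesizing e-commerce knowledge data that combines knowledge graphs with behavioral logs and uses an LLM to generate a high-quality natural language corpus containing reasoning chains and safety-aware data.
Introduces a parameter-expansion pre-training technique based on Depth Up-Scaling that uses layer initialization and learning strategies to inject large-scale e-commerce knowledge while mitigating catastrophic forgetting.
Designs a dual-path alignment strategy that combines multi-task instruction tuning with high-intensity adversarial training, balancing performance on e-commerce subtasks with safety along sensitive dimensions.
Delivers large-scale, industrial-grade deployment and validation, demonstrating effectiveness and scalability in real business scenarios on JD.com.

Method

!!! info

The paper follows an end-to-end pipeline of data synthesis, knowledge injection, and domain alignment: 1) a general-purpose LLM (e.g., DeepSeek-R1) converts structured knowledge graphs and unstructured logs into natural language text and synthesizes chain-of-thought data; 2) Depth Up-Scaling increases the number of model layers and is combined with a specific initialization and a dynamic learning rate for parameter-efficient pre-training; 3) alignment training uses a mix of multi-task instruction data and adversarial safety data.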
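As a rough illustration of step 2, the following sketch shows the duplicate-and-trim form of Depth Up-Scaling: two copies of an n-layer stack are trimmed by m layers each and concatenated into 2(n − m) layers. This is an assumption about the general technique, not the paper's specific initialization or learning-rate schedule, which the report does not describe. Layers are represented as plain dictionaries, and `depth_up_scale` is a hypothetical helper.

```python
import copy

def depth_up_scale(layers, m):
    """Build a deeper stack from two copies of `layers`.

    The first copy drops its last m layers and the second copy drops its
    first m layers, giving 2 * (len(layers) - m) layers in total.
    """
    if not 0 <= m < len(layers):
        raise ValueError("m must satisfy 0 <= m < len(layers)")
    top = copy.deepcopy(layers[: len(layers) - m])   # layers 0 .. n-m-1
    bottom = copy.deepcopy(layers[m:])               # layers m .. n-1
    return top + bottom

# Hypothetical 32-layer base model; each dict stands in for a layer's weights.
base = [{"layer": i} for i in range(32)]
scaled = depth_up_scale(base, m=8)
```

The deep copies matter: the duplicated layers start from identical weights but must be independent parameters so that continued pre-training can update them separately.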

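Key results:

The framework has been deployed on JD.com.
In A/B tests across five core search scenarios (search suggestions, product title generation, review summarization, query correction, and safety moderation), the model shows significant improvements in key business metrics.
The results support improved accuracy on dynamic, fine-grained e-commerce knowledge and stronger defenses against jailbreak attacks.

Tech stack: Large Language Models (LLMs), Knowledge Graphs (KG), Depth Up-Scaling, Parameter-Efficient Fine-Tuning (PEFT), Chain-of-Thought (CoT), Adversarial Training, Instruction Tuning, DeepSeek-R1

Strengths

Offers a systematic solution to domain-specific pain points in e-commerce (hallucination and safety).
Uses Depth Up-Scaling for knowledge injection, balancing domain specialization with general capability.
Goes beyond the method itself to large-scale industrial deployment and real-world validation, which gives it high practical value.
The data synthesis strategy makes full use of the platform's existing structured and unstructured data, improving data utilization.

Limitations

The paper focuses on e-commerce; its generality in other vertical domains (e.g., biomedicine, law) may need further validation.
Although the Depth Up-Scaling approach preserves general capabilities, it increases the parameter count, which may raise inference latency and compute requirements.
Safety alignment relies on synthesized high-risk data, so its defense against unknown attack types remains to be seen.

Relevance to the research direction:

The paper is highly relevant. It falls under "innovation in the technical principles of large models and deep learning," specifically domain adaptation, knowledge injection, and safety alignment for large models. It also demonstrates "research applications of large models in different domains" (e-commerce search). The proposed SIA framework and its use of Depth Up-Scaling are technically innovative and meet the criteria for a high score.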

8. Attention-guided Evidence Grounding for Spoken Question Answering

Authors: Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He, Bowen Li, Chengjun Mao Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16292v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

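Scoring rationale: The paper "Attention-guided Evidence Grounding for Spoken Question Answering" proposes AEG, an end-to-end framework designed for spoken question answering. The work centers on Speech Large Language Models (SpeechLLMs), so it is highly relevant to "Large Language Models" (10 points). It proposes LFE, a supervised fine-tuning paradigm that calibrates the model's attention mechanism, which maps directly to "Supervised Fine-tuning" (10 points). Reducing hallucination is one of its goals, so it is highly relevant to "Hallucination Mitigation" (10 points). The paper uses attention to explicitly locate key evidence, which involves interpreting the model's internal mechanisms and is partially related to "Mechanistic Interpretability" (5 points). The framework builds on pre-trained models, so it is partially related to "Pre-training" (5 points). The experiments report a reduction of about 62% in inference latency, so it is partially related to "Inference Acceleration" (5 points). The paper does not cover the other keywords, such as MoE, SLMs, Scaling Laws, Alignment, RLHF, RAG, CoT, Agents, and Quantization, which are all scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses cross-modal alignment in spoken question answering. It proposes AEG, an end-to-end framework that uses attention-guided evidence grounding and supervised fine-tuning to reduce hallucinations and improve efficiency, outperforming cascaded baselines on multiple datasets while substantially reducing inference latency.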

Abstract

Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model’s latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model’s attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.

Keywords: Spoken Question Answering, Speech Large Language Models, Attention-guided Evidence Grounding, Supervised Fine-tuning, Hallucination Mitigation, Cross-modal Attention, End-to-end Framework, Inference Efficiency
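One way to picture attention-guided evidence selection is to score each context segment by the attention it receives from the query tokens and keep the highest-scoring segments. The sketch below does exactly that with NumPy. The aggregation rule (mean over query tokens, then mean within a segment), the toy attention matrix, and the name `rank_segments` are all assumptions; the abstract does not specify how AEG aggregates attention.

```python
import numpy as np

def rank_segments(attn, segments, top_k=1):
    """Rank context segments by the attention they receive.

    attn:     (n_query_tokens, n_context_tokens) attention weights.
    segments: list of (start, end) half-open ranges over context tokens.
    Returns the indices of the top_k segments and all segment scores.
    """
    per_token = attn.mean(axis=0)                         # average over query tokens
    scores = np.array([per_token[s:e].mean() for s, e in segments])
    order = np.argsort(-scores)[:top_k]                   # highest score first
    return order.tolist(), scores

# Hypothetical attention from 2 query tokens to 6 context tokens.
attn = np.array([
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
])
segments = [(0, 2), (2, 4), (4, 6)]
best, scores = rank_segments(attn, segments, top_k=1)
```

A diffuse attention distribution would give near-equal segment scores, which is the failure mode that the LFE fine-tuning stage is described as correcting.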

9. InViC: Intent-aware Visual Cues for Medical Visual Question Answering

Authors: Zhisong Wang, Ziyang Chen, Zanting Ye, Hongze Zhu, Yefeng Zheng, Yong Xia Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16372v1

Score: 43.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

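Scoring rationale: The paper proposes the InViC framework for medical visual question answering (Med-VQA), an applied innovation of large models in the biomedical domain. Core related keywords: 1) "Large Language Models" (10 points): the paper explicitly uses multimodal large language models (MLLMs) as its foundation. 2) "PEFT" (10 points): the paper fine-tunes with LoRA, an application of parameter-efficient fine-tuning. 3) "AI for Science" (10 points): medical VQA is a typical application of AI in the biomedical domain. 4) "Hallucination Mitigation" (8 points): the paper aims to address shortcut answering in MLLMs and improve clinical reliability, which is closely related to mitigating hallucination and improving factuality. 5) "Post-training" (5 points): the two-stage fine-tuning strategy is a form of post-training adjustment. The other keywords have no direct connection to the paper and are all scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses shortcut answering by multimodal large language models in medical visual question answering. It proposes the lightweight Intent-aware Visual Cues (InViC) framework, which extracts question-relevant visual cue tokens and uses a two-stage fine-tuning strategy, improving trustworthiness on multiple Med-VQA benchmarks.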

Abstract

Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM’s direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that intent-aware visual cues with bottlenecked training is a practical and effective strategy for improving trustworthy Med-VQA.

Keywords: Medical Visual Question Answering, Multimodal Large Language Models, Intent-aware Visual Cues, LoRA fine-tuning, Shortcut answering, Clinical reliability, Cue-bottleneck attention, Trustworthy AI
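The two-stage masking idea can be sketched as a boolean attention mask. The sketch assumes a token order of [visual | cue | text] and interprets "blocking the LLM's direct view of raw visual features" as preventing text-token queries from attending to visual-token keys, while cue tokens may still read the visual tokens. Both the layout and the function `attention_mask` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def attention_mask(n_visual, n_cue, n_text, bottleneck):
    """Boolean mask where True means the query row may attend to the key column.

    Assumed token order: [visual | cue | text].
    bottleneck=True  -> Stage I: text tokens cannot see raw visual tokens.
    bottleneck=False -> Stage II: standard causal attention.
    """
    n = n_visual + n_cue + n_text
    mask = np.tril(np.ones((n, n), dtype=bool))      # standard causal mask
    if bottleneck:
        text_start = n_visual + n_cue
        mask[text_start:, :n_visual] = False         # force visual evidence through the cues
    return mask

stage1 = attention_mask(n_visual=4, n_cue=2, n_text=3, bottleneck=True)
stage2 = attention_mask(n_visual=4, n_cue=2, n_text=3, bottleneck=False)
```

In Stage I the only path from image to answer runs through the cue tokens, so the model cannot ignore them; Stage II restores the direct path so that both sources are used jointly.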

10. MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale

Authors: Hanxian Huang, Igor Fedorov, Andrey Gromov, Bernard Beckerman, Naveen Suda, David Eriksson, Maximilian Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar, Ayushi Dalmia, Zechun Liu, Lemeng Wu, Tarek Elgamal, Adithya Sagar, Vikas Chandra, Raghuraman Krishnamoorthi Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15954v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	8.0/10	8.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

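Scoring rationale: The paper's core topic is the design of on-device large language models (OD-LLMs) for mobile devices, so it is highly relevant to "Large Language Models" and "Small Language Models/On-device AI" (10 points each). The models support an 8k context length, which relates to "Context Window Extension/Long Context LLMs" (8 points). The paper optimizes latency and inference speed, which relates to "Speculative Decoding/Inference Acceleration" (8 points). It mentions a pretrained backbone and a small amount of continued pretraining, which is partially related to "Pre-training/Continual Pre-training/Domain Adaptation" (5 points). Other keywords, such as MoE, SFT, RLHF, RAG, and quantization, are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a method for designing efficiently deployable on-device LLMs (MobileLLM-Flash) through hardware-in-the-loop architecture search under mobile latency constraints, achieving up to 1.8x faster prefill and 1.6x faster decode on mobile CPUs with comparable or better quality.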

Abstract

Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like Executorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto-frontier across latency and quality. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x and 1.6x faster prefill and decode on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.

Keywords: on-device large language models, mobile latency constraints, hardware-in-the-loop architecture search, attention skipping, pretrained backbone, Pareto-frontier, context length, inference acceleration
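The search for a Pareto frontier across latency and quality can be made concrete with a small sketch that filters out dominated candidates. The candidate names, latencies, and quality values are invented for illustration, and `pareto_front` is a hypothetical helper; the paper's actual search procedure and latency model are not described in this report.

```python
def pareto_front(candidates):
    """Return the non-dominated candidates, sorted by latency.

    candidates: list of (name, latency_ms, quality) tuples.
    Lower latency and higher quality are better. A candidate is dominated
    if another is at least as good on both axes and strictly better on one.
    """
    front = []
    for name, lat, q in candidates:
        dominated = any(
            (l2 <= lat and q2 >= q) and (l2 < lat or q2 > q)
            for _, l2, q2 in candidates
        )
        if not dominated:
            front.append((name, lat, q))
    return sorted(front, key=lambda c: c[1])

# Hypothetical architecture candidates.
candidates = [
    ("a", 10.0, 0.60),
    ("b", 12.0, 0.58),   # slower and worse than "a": dominated
    ("c", 15.0, 0.70),
    ("d", 20.0, 0.70),   # same quality as "c" but slower: dominated
    ("e", 25.0, 0.75),
]
front = pareto_front(candidates)
```

Because latency is cheap to evaluate compared with quality, a staged process can prune most candidates on predicted latency before any continued pretraining is spent on them.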

11. Arabic Morphosyntactic Tagging and Dependency Parsing with Large Language Models

Authors: Mohamed Adel, Bashar Alhafni, Nizar Habash Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16718v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

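Scoring rationale: The paper's core topic is how instruction-tuned LLMs perform on Arabic morphosyntactic tagging and dependency parsing, which directly involves LLMs, instruction tuning, and retrieval-augmented in-context learning. It also evaluates zero-shot prompting and retrieval-based in-context learning, making it highly relevant to "In-context Learning". Other keywords, such as MoE, SLMs, scaling laws, pre-training, alignment techniques, reasoning methods, agent systems, and model compression, are not mentioned in the abstract and are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The study evaluates instruction-tuned large language models on the challenging tasks of Arabic morphosyntactic tagging and dependency parsing. It finds that prompt design and demonstration selection strongly affect performance and that retrieval-based in-context learning improves both parsing and tokenization.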

Abstract

Large language models (LLMs) perform strongly on many NLP tasks, but their ability to produce explicit linguistic structure remains unclear. We evaluate instruction-tuned LLMs on two structured prediction tasks for Standard Arabic: morphosyntactic tagging and labeled dependency parsing. Arabic provides a challenging testbed due to its rich morphology and orthographic ambiguity, which create strong morphology-syntax interactions. We compare zero-shot prompting with retrieval-based in-context learning (ICL) using examples from Arabic treebanks. Results show that prompt design and demonstration selection strongly affect performance: proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, though retrieval-based ICL improves both parsing and tokenization. Our analysis highlights which aspects of Arabic morphosyntax and syntax LLMs capture reliably and which remain difficult.

Keywords: Large Language Models, Instruction Tuning, Arabic Morphosyntactic Tagging, Dependency Parsing, Retrieval-based In-context Learning, Zero-shot Prompting, Structured Prediction, NLP
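Retrieval-based in-context learning amounts to choosing demonstrations that resemble the input and placing them before it in the prompt. The sketch below uses token-overlap (Jaccard) similarity as a stand-in retriever. The similarity measure, the prompt template, the English placeholder sentences, and the `ANALYSIS_n` strings are all assumptions; the paper's retriever and its treebank format are not described in this report.

```python
def jaccard(a, b):
    """Token-overlap similarity between two whitespace-tokenized sentences."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def build_prompt(query, treebank, k=2):
    """Select the k most similar treebank sentences and format a few-shot prompt.

    treebank: list of (sentence, annotation) pairs.
    Returns the prompt string and the selected demonstrations.
    """
    ranked = sorted(treebank, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    demos = ranked[:k]
    lines = [f"Sentence: {s}\nAnalysis: {a}\n" for s, a in demos]
    lines.append(f"Sentence: {query}\nAnalysis:")
    return "\n".join(lines), demos

# Hypothetical placeholder treebank (English stand-ins for Arabic entries).
treebank = [
    ("the boy read the book", "ANALYSIS_1"),
    ("the girl wrote a letter", "ANALYSIS_2"),
    ("a man sold the house", "ANALYSIS_3"),
    ("rain fell yesterday", "ANALYSIS_4"),
]
prompt, demos = build_prompt("the girl read the book", treebank, k=2)
```

Zero-shot prompting corresponds to `k=0`, where the prompt contains only the query; the paper's comparison is between that setting and prompts enriched with retrieved examples.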

12. Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

Authors: Xiaobing Sun, Perry Lam, Shaohua Li, Zizhou Wang, Rick Siow Mong Goh, Yong Liu, Liangli Zhen Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16192v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

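Scoring rationale: The paper's core topic is LLM safety mechanisms and adversarial (jailbreak) attacks, so it is highly relevant to "Large Language Models" (10 points). It touches on safety alignment, the reasoning process (multi-step and in-depth reasoning), factuality/hallucination mitigation, and interpretability (analysis of semantic reconstruction), which are partially related to "Instruction Tuning/Alignment", "Chain of Thought", "System 2 Thinking", "Hallucination Mitigation", and "Mechanistic Interpretability" (5 points each). Other keywords, such as MoE, SLMs, training methods, compression, acceleration, agents, and AI for Science, are not covered and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes Structured Semantic Cloaking (S2C), a new multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference to evade LLM safety mechanisms, and it reports substantially higher attack success rates on multiple benchmarks.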

Abstract

Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly, and rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations perform best against broad families of models, and characterise the trade-off between the extent of obfuscation versus input recoverability on jailbreak success.

关键词: jailbreak attacks, large language models, safety mechanisms, semantic cloaking, multi-step inference, latent representations, attack success rate, obfuscation

13. Via Negativa for AI Alignment: Why Negative Constraints Are Structurally Superior to Positive Prefer

作者: Quan Cheng 期刊/来源: arxiv 发布日期: 2026-03-17 arXiv链接: http://arxiv.org/abs/2603.16417v1

评分: 30.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

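Scoring rationale: The paper's core topic is the alignment of large language models (LLMs), specifically a comparison between reinforcement learning from human feedback (RLHF) and methods that use only negative feedback. It is therefore highly relevant to “Large Language Models”, “Alignment”, and “RLHF” (10 points each). It does not address the specific techniques or applications behind the other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, PEFT, RAG, reasoning methods, agents, compression, or AI for Science, so these score 0.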
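The arithmetic behind the "30.0 / 26.6" line can be reproduced with a short sketch. The passing threshold (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section. The per-keyword formula (weight × relevance) and the function names are assumptions inferred from the tables, not a documented part of the report generator.

```python
# Sketch of the report's scoring rule. The per-keyword formula
# (weight × relevance, capped at 15) is an assumption inferred from the tables.
MAX_PER_KEYWORD = 15.0

def passing_score(total_weight: float) -> float:
    """Passing threshold from the report configuration: 5.0 + 0.8 × total weight."""
    return 5.0 + 0.8 * total_weight

def paper_score(rows):
    """rows: iterable of (weight, relevance) pairs, one pair per keyword."""
    return sum(min(w * r, MAX_PER_KEYWORD) for w, r in rows)

# Paper 13: three keywords at relevance 10, the other 24 at 0, all weights 1.0.
rows = [(1.0, 10.0)] * 3 + [(1.0, 0.0)] * 24
total = paper_score(rows)
threshold = passing_score(sum(w for w, _ in rows))
passed = total >= threshold
print(total, round(threshold, 1), passed)
```

With 27 keywords of weight 1.0 the threshold is 5.0 + 0.8 × 27 = 26.6, so a paper needs roughly three fully relevant keywords to pass.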

!!! tip deepseek-chat TL;DR

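The paper offers a theoretical account of why, in LLM alignment, training on negative constraints alone (such as prohibitive feedback) is structurally superior to positive-preference methods such as RLHF, and why it is better at avoiding failure modes like sycophancy.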

摘要翻译

近期实证研究表明，仅使用负面反馈训练大型语言模型（LLM）的效果可媲美甚至超越基于人类反馈的强化学习（RLHF）标准方法。负面样本强化学习在数学推理任务上达到与近端策略优化（PPO）相当的水平；分布化非偏好优化仅通过非偏好样本即可有效训练；而宪法人工智能在无害性基准测试中表现优于纯RLHF方法。然而，目前尚无统一的理论框架解释负面信号为何如此有效。本文提出一种理论解释：积极偏好与消极约束在结构上具有不对称性。积极偏好（“哪个更好”）编码了连续耦合、依赖语境的人类价值观，这些价值观无法被穷尽描述——导致模型学习到诸如迎合用户（谄媚性）等表层关联特征。消极约束（“什么是错误的”）则编码了离散、有限、可独立验证的禁令，能够收敛至稳定边界。这种不对称性——根植于波普尔的证伪逻辑与否定知识认识论——既解释了基于偏好的RLHF产生谄媚性缺陷的原因，也揭示了负面信号方法惊人有效性的机理。我们认为，对齐研究应将其重心从“学习人类偏好”转向“学习人类拒绝的内容”，并为此框架提供了可检验的预测。

摘要 (Abstract)

Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative signals are so effective. This paper proposes such an account: positive preferences and negative constraints are structurally asymmetric. Positive preferences (“which is better”) encode continuously coupled, context-dependent human values that cannot be exhaustively specified – leading models to learn surface correlates such as agreement with the user (sycophancy). Negative constraints (“what is wrong”) encode discrete, finite, independently verifiable prohibitions that can converge to a stable boundary. This asymmetry – rooted in Popper’s falsification logic and the epistemology of negative knowledge – explains both the sycophancy failure of preference-based RLHF and the surprising effectiveness of negative-signal methods. We argue that alignment research should shift its center of gravity from “learning what humans prefer” to “learning what humans reject,” and offer testable predictions for this framework.

关键词: AI Alignment, Large Language Models, RLHF, Negative Feedback, Sycophancy, Constitutional AI, Preference-based Learning, Falsification Logic

14. Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

作者: Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca 期刊/来源: arxiv 发布日期: 2026-03-17 arXiv链接: http://arxiv.org/abs/2603.16105v1

评分: 30.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	10.0/10	10.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

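Scoring rationale: The paper's core topic is post-training compression of large language models (LLMs), specifically the selection of calibration data for pruning and quantization. It is therefore highly relevant to “Large Language Models”, “Post-training”, and “Quantization” (10 points each). It does not address the other keywords, namely MoE, SLMs, Scaling Laws, Pre-training, Instruction Tuning, RLHF, PEFT, RAG, Context Window, KV Cache, Reasoning, Agents, Tool Use, Multi-agent, Speculative Decoding, Hallucination, Interpretability, World Models, Model Merging, In-context Learning, or AI for Science, so these all score 0.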

!!! tip deepseek-chat TL;DR

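The paper proposes ZipCal, a model-agnostic data curation strategy that selects calibration data for post-training pruning and quantization of LLMs by maximizing lexical diversity based on Zipf's law. It performs on par with a perplexity-based method while being about 240× faster on average.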

摘要翻译

后训练模型压缩对于提升大语言模型（LLM）的可移植性同时保持其性能至关重要。尽管已有多种压缩方法被提出，但如何选择最合适的数据集（即所谓的校准数据）以寻找压缩模型配置的问题尚未得到足够重视。校准数据的选择是保持模型在任务内与任务间能力的关键步骤。在本工作中，我们通过分析数据的内在属性而非模型特定信号，来解决为剪枝和量化识别高性能校准集的挑战。我们提出了 ZipCal，一种与模型无关的数据筛选策略，该方法基于齐夫幂律最大化词汇多样性。实验表明，我们的方法在各种剪枝基准测试中均持续优于标准的均匀随机采样。值得注意的是，在下游性能方面，其表现与依赖模型困惑度的前沿方法相当。后者在大规模模型和数据集上计算成本极高，而 ZipCal 由于其可处理的线性复杂度，平均速度提升约 240 倍（代码与实验已公开于 https://anonymous.4open.science/r/zipcal-71CD/）。

摘要 (Abstract)

Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called calibration data) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both intra- and inter-tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive at large-scale models and datasets, while ZipCal is on average ~240× faster due to its tractable linear complexity. (We make the code and the experiments available at https://anonymous.4open.science/r/zipcal-71CD/.)

关键词: Large Language Models, Post-training, Model Compression, Pruning, Quantization, Calibration Data, Data Curation, Zipfian Power Laws

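Deep analysis:

Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

Summary:

The paper studies calibration-data selection for post-training compression (pruning and quantization) of large language models (LLMs). The authors propose ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipf's law. Unlike existing approaches that rely on compute-heavy, model-specific signals such as perplexity, ZipCal selects representative samples using intrinsic linguistic properties of the data. In the reported experiments, ZipCal consistently outperforms random sampling and matches a state-of-the-art model-dependent method while running about 240× faster. It also supports multi-domain settings through a hierarchical selection strategy.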

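Key contributions:

Proposes ZipCal, a fast, model-agnostic data curation method grounded in Zipfian statistics that selects calibration data by maximizing lexical diversity, avoiding expensive model inference.
Selects data from intrinsic properties (the word-frequency distribution) rather than model-specific signals (such as perplexity or gradients), which substantially reduces computational overhead.
Designs a multi-domain framework that handles multi-domain datasets with a hierarchical selection strategy (per-domain ZipCal followed by k-centers clustering), keeping the calibration set balanced and representative.
Achieves linear time complexity, O(nk), which scales to large datasets and models and is about 240× faster than the existing SOTA method.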
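The k-centers step in the multi-domain framework can be illustrated with the standard greedy farthest-point heuristic. This is a minimal sketch under stated assumptions: the toy 2-D "embeddings", the Euclidean metric, and the function name `k_centers` are illustrative choices, and the paper's actual embedding model and distance measure are not given in this report.

```python
import math

def k_centers(points, k, first=0):
    """Greedy farthest-point (k-centers) selection; returns indices of chosen points.

    Assumption: Euclidean distance on small dense vectors, for illustration only.
    """
    if not points or k <= 0:
        return []
    chosen = [first]
    # dist[i] is the distance from point i to its nearest chosen center so far.
    dist = [math.dist(p, points[first]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen

# Hypothetical sample embeddings: two tight pairs plus one outlier.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 9.0)]
selected = k_centers(embeddings, k=3)
print(selected)
```

Each round picks the point farthest from everything already selected, so near-duplicates within a domain pool are unlikely to be chosen together.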

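Method

!!! info

The study uses a greedy sampling strategy. For single-domain sampling, the data is first cleaned (lowercased, special symbols removed), and the sample that adds the most new vocabulary is then selected at each iteration. For multi-domain sampling, ZipCal is first applied within each domain to build a representative pool, and lightweight embeddings with the k-centers algorithm then perform a global selection that maximizes the semantic distance between samples. The experiments compare ZipCal with random sampling and COLA under Wanda pruning of the Llama-3.1-8B model.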
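The single-domain greedy step described above can be sketched as follows. This is an illustration of greedy vocabulary-coverage selection, not the authors' released implementation: the regex tokenizer, the tie-breaking rule, and the function names are assumptions.

```python
import re

def tokenize(text):
    """Lowercase and keep alphanumeric tokens (a simple stand-in for the cleaning step)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def greedy_vocab_selection(samples, k):
    """Pick up to k samples, each time taking the one that adds the most unseen words."""
    vocab_sets = [tokenize(s) for s in samples]
    covered, chosen = set(), []
    remaining = set(range(len(samples)))
    for _ in range(min(k, len(samples))):
        # Set difference gives the new vocabulary a sample would contribute.
        best = max(sorted(remaining), key=lambda i: len(vocab_sets[i] - covered))
        if not vocab_sets[best] - covered:
            break  # no remaining sample adds new vocabulary
        chosen.append(best)
        covered |= vocab_sets[best]
        remaining.remove(best)
    return chosen, covered

# Hypothetical toy corpus.
corpus = [
    "The cat sat on the mat.",
    "The cat sat.",
    "Quantization reduces weight precision!",
    "Pruning removes weights; quantization reduces precision.",
]
chosen, covered = greedy_vocab_selection(corpus, k=2)
print(chosen, len(covered))
```

Each of the k rounds scans the remaining samples once and needs no model forward pass, which is consistent with the model-agnostic, linear-cost framing in the report.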

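Key results:

ZipCal consistently outperforms standard random sampling across multiple pruning benchmarks.
On downstream task performance, ZipCal is on par with state-of-the-art model-dependent methods such as COLA.
In terms of efficiency, ZipCal is about 228× to 240× faster than COLA.
The method captures the sparse tail of the Zipfian distribution, retaining rare words and diverse semantic contexts.

Tech stack: greedy selection algorithm, k-centers clustering, Zipf's law, set-difference operations, Llama-3.1-8B-Instruct model, Wanda unstructured pruning, lexical diversity metrics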

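Strengths

Very high efficiency, with a large speedup over model-dependent methods.
Model-agnostic: data can be selected without running the model, which greatly reduces compute cost.
Solid theoretical grounding in principles from statistical linguistics.
Generalizes well, handling both single-domain and multi-domain settings.

Limitations

The greedy strategy may not reach a globally optimal vocabulary coverage.
The focus on lexical diversity may miss subtle semantic differences that model-dependent methods can capture (although the experiments suggest its results are sufficient).
For highly specialized domains with very low vocabulary overlap, relying on lexical statistics alone may be limiting.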

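Relevance to research interests:

The paper is highly relevant to the user's interests in “large models” and “deep learning technical principles”. It targets a core problem in large-model infrastructure: making model compression (pruning and quantization) more efficient. ZipCal is a technical contribution that optimizes the data-preparation stage of these techniques. Although it does not specifically address “biomedical AI”, it is a foundational improvement applicable to all domains, including scientific ones, and scores highly on technical novelty.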

📋 所有论文列表

1. ✅ Surg$Σ$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence

评分: 64.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

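To address the lack of large-scale multimodal data for surgical AI, the paper introduces Surg$Σ$, which comprises a large-scale multimodal data foundation (Surg$Σ$-DB) and surgical foundation models built on it, with the aim of improving cross-task generalization and interpretability.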

摘要翻译

手术智能具有提升外科护理安全性与一致性的潜力，然而现有大多数手术人工智能框架仍局限于特定任务，难以在不同手术流程与机构间实现泛化。尽管多模态基础模型（尤其是多模态大语言模型）已在多个医学领域展现出强大的跨任务能力，但其在手术领域的进展仍受限于缺乏大规模、系统性整理的多模态数据。为应对这一挑战，我们推出Surg$Σ$——一个面向手术智能的大规模多模态数据与基础模型体系。该框架的核心是Surg$Σ$-DB，这是一个为支持多样化手术任务而设计的大规模多模态数据基础。Surg$Σ$-DB将异构手术数据源（包括开源数据集、内部整理的临床资料及网络来源数据）整合至统一架构中，旨在提升异构数据集间的标签一致性与数据标准化水平。Surg$Σ$-DB涵盖6个临床专科及多样手术类型，以前所未有的规模（超过598万组对话）为18项涵盖理解、推理、规划与生成的实用手术任务提供丰富的图像与视频级标注。除常规多模态对话外，Surg$Σ$-DB还纳入层次化推理标注，为复杂手术场景中更深层次的语境理解提供更丰富的语义线索。我们进一步通过基于Surg$Σ$-DB构建的最新手术基础模型提供实证依据，阐明大规模多模态标注、统一语义设计和结构化推理标注对提升跨任务泛化能力与可解释性的实际价值。

摘要 (Abstract)

Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advancement in surgery remains constrained by the lack of large-scale, systematically curated multimodal data. To address this challenge, we introduce Surg$Σ$, a spectrum of large-scale multimodal data and foundation models for surgical intelligence. At the core of this framework lies Surg$Σ$-DB, a large-scale multimodal data foundation designed to support diverse surgical tasks. Surg$Σ$-DB consolidates heterogeneous surgical data sources (including open-source datasets, curated in-house clinical collections and web-source data) into a unified schema, aiming to improve label consistency and data standardization across heterogeneous datasets. Surg$Σ$-DB spans 6 clinical specialties and diverse surgical types, providing rich image- and video-level annotations across 18 practical surgical tasks covering understanding, reasoning, planning, and generation, at an unprecedented scale (over 5.98M conversations). Beyond conventional multimodal conversations, Surg$Σ$-DB incorporates hierarchical reasoning annotations, providing richer semantic cues to support deeper contextual understanding in complex surgical scenarios. We further provide empirical evidence through recently developed surgical foundation models built upon Surg$Σ$-DB, illustrating the practical benefits of large-scale multimodal annotations, unified semantic design, and structured reasoning annotations for improving cross-task generalization and interpretability.

2. ✅ Exploring different approaches to customize language models for domain-specific text-to-code generation

作者: Luís Freire, Fernanda A. Andaló, Nicki Skafte Detlefsen 期刊/来源: arxiv 发布日期: 2026-03-17 arXiv链接: http://arxiv.org/abs/2603.16526v1

评分: 60.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

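The study explores how to customize smaller language models for domain-specific text-to-code generation and finds that LoRA-based fine-tuning achieves higher accuracy and stronger domain alignment than few-shot prompting and retrieval-augmented generation (RAG) on most tasks.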

摘要翻译

大语言模型（LLM）已展现出根据自然语言描述生成可执行代码的强大能力。然而，通用模型在需要运用领域特定库、API或约定的专业编程场景中往往表现不佳。相较于依赖大型专有系统，对较小的开源模型进行定制提供了一种更具成本效益的替代方案。本研究探讨了如何利用合成数据集使较小的语言模型适应领域特定的代码生成任务。我们构建了涵盖Python生态系统中三个领域的编程练习数据集：通用Python编程、Scikit-learn机器学习工作流以及基于OpenCV的计算机视觉任务。使用这些数据集，我们评估了三种定制策略：少样本提示、检索增强生成（RAG）以及使用低秩自适应（LoRA）的参数高效微调。性能评估综合采用了基于基准测试的指标和基于相似度的指标，后者用于衡量代码与领域特定要求的契合度。我们的结果表明，少样本学习和RAG等基于提示的方法能够以经济高效的方式提升领域相关性，尽管其对基准测试准确率的提升有限。相比之下，基于LoRA的微调在大多数任务中持续实现了更高的准确率和更强的领域契合度。这些发现凸显了在使较小语言模型适应专业编程任务时，灵活性、计算成本与性能之间存在的实际权衡关系。

摘要 (Abstract)

Large language models (LLMs) have demonstrated strong capabilities in generating executable code from natural language descriptions. However, general-purpose models often struggle in specialized programming contexts where domain-specific libraries, APIs, or conventions must be used. Customizing smaller open-source models offers a cost-effective alternative to relying on large proprietary systems. In this work, we investigate how smaller language models can be adapted for domain-specific code generation using synthetic datasets. We construct datasets of programming exercises across three domains within the Python ecosystem: general Python programming, Scikit-learn machine learning workflows, and OpenCV-based computer vision tasks. Using these datasets, we evaluate three customization strategies: few-shot prompting, retrieval-augmented generation (RAG), and parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA). Performance is evaluated using both benchmark-based metrics and similarity-based metrics that measure alignment with domain-specific code. Our results show that prompting-based approaches such as few-shot learning and RAG can improve domain relevance in a cost-effective manner, although their impact on benchmark accuracy is limited. In contrast, LoRA-based fine-tuning consistently achieves higher accuracy and stronger domain alignment across most tasks. These findings highlight practical trade-offs between flexibility, computational cost, and performance when adapting smaller language models for specialized programming tasks.
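For readers unfamiliar with LoRA, the core idea is that a frozen weight matrix W is adapted by a trainable low-rank update, giving an effective weight W + (alpha / r) · B · A. The sketch below shows only that arithmetic on tiny hand-written matrices. The shapes, the values, and the helper names are assumptions, and the abstract does not state which rank, scaling, or target layers the authors used.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha, r):
    """Effective weight W + (alpha / r) * (B @ A) for a rank-r adapter."""
    delta = matmul(b, a)  # (d_out × r) @ (r × d_in)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Hypothetical toy values: a 2×3 frozen weight and a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]            # trainable, r × d_in
B_init = [[0.0], [0.0]]          # trainable, d_out × r, zero-initialised
W_at_init = lora_weight(W, A, B_init, alpha=2, r=1)

B_trained = [[0.5], [-0.5]]      # made-up values standing in for a trained adapter
W_eff = lora_weight(W, A, B_trained, alpha=2, r=1)
print(W_eff)
```

Only A and B are trained, so the adapter stores r × (d_in + d_out) parameters instead of d_in × d_out. The saving is negligible in this toy case but large for real transformer layers, which is the cost trade-off the abstract refers to.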

关键词: language models, code generation, domain-specific, fine-tuning, LoRA, retrieval-augmented generation, few-shot prompting, synthetic datasets

3. ✅ Parallel In-context Learning for Large Vision Language Models

作者: Shin’ya Yamaguchi, Daiki Chijiwa, Tamao Sakao, Taku Hasegawa 期刊/来源: arxiv 发布日期: 2026-03-17 arXiv链接: http://arxiv.org/abs/2603.16092v1

评分: 56.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	8.0/10	8.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	8.0/10	8.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	5.0/10	5.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	15.0/10	15.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

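The paper proposes Parallel-ICL, an inference algorithm that processes chunks of the demonstration context in parallel and combines them with a weighted Product-of-Experts ensemble, significantly speeding up inference for large vision-language models while keeping multimodal in-context learning performance comparable to full-context inference.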

摘要翻译

大型视觉语言模型（LVLMs）通过利用示例演示，采用多模态上下文学习（MM-ICL）来适应新任务。虽然增加演示数量能提升性能，但由于Transformer注意力机制随上下文长度呈二次方计算成本增长，这会带来显著的推理延迟。为解决这一权衡问题，我们提出并行上下文学习（Parallel-ICL），一种即插即用的推理算法。Parallel-ICL将长演示上下文分割为多个较短且易处理的片段，并行处理这些片段，并在对数概率层面整合它们的预测结果，使用加权专家乘积（PoE）集成来近似全上下文输出。在集成学习理论的指导下，我们为Parallel-ICL引入了原则性策略：（i）基于聚类的上下文分块以最大化块间多样性，以及（ii）基于相似性的上下文编译以根据查询相关性加权预测。在视觉问答（VQA）、图像描述生成和分类基准测试上的大量实验表明，Parallel-ICL在显著提升推理速度的同时，实现了与全上下文MM-ICL相当的性能。我们的工作为MM-ICL中准确性与效率的权衡提供了有效解决方案，使得动态任务适应能够以大幅降低的推理开销实现。

摘要 (Abstract)

Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, they incur significant inference latency due to the quadratic computational cost of Transformer attention with respect to the context length. To address this trade-off, we propose Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm. Parallel-ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product-of-Experts (PoE) ensemble to approximate the full-context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel-ICL: (i) clustering-based context chunking to maximize inter-chunk diversity and (ii) similarity-based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL, while significantly improving inference speed. Our work offers an effective solution to the accuracy-efficiency trade-off in MM-ICL, enabling dynamic task adaptation with substantially reduced inference overhead.
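The logit-level combination can be illustrated with a generic weighted Product-of-Experts: each chunk yields a next-token distribution, and the ensemble is proportional to the product of those distributions raised to their weights. This is a minimal sketch, not the paper's implementation: the toy logits, the weight normalization, and the function names are assumptions, and the paper derives its weights from query similarity.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def weighted_poe(chunk_logits, weights):
    """Combine per-chunk logits so that p(y) is proportional to prod_k p_k(y) ** w_k."""
    total = sum(weights)
    norm_w = [w / total for w in weights]  # assumption: weights are normalised to sum to 1
    combined = [0.0] * len(chunk_logits[0])
    for logits, w in zip(chunk_logits, norm_w):
        for j, lp in enumerate(log_softmax(logits)):
            combined[j] += w * lp  # weighted sum of log-probabilities
    return [math.exp(lp) for lp in log_softmax(combined)]

# Hypothetical next-token logits from three context chunks over a 3-token vocabulary.
chunk_logits = [[2.0, 0.5, 0.1], [1.5, 1.4, 0.0], [0.2, 2.5, 0.3]]
weights = [0.6, 0.3, 0.1]  # stand-in for similarity-based relevance weights
probs = weighted_poe(chunk_logits, weights)
print(probs)
```

Because each chunk is processed independently, attention cost scales with chunk length rather than with the full demonstration context, which is where the latency saving comes from.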

关键词: Large Vision-Language Models, In-context Learning, Parallel Processing, Inference Acceleration, Product-of-Experts, Multi-modal Learning, Context Chunking, Ensemble Learning

4. ✅ Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition

作者: Yu Liu, Lei Zhang, Haoxun Li, Hanlei Shi, Yuxuan Ding, Leyuan Qu, Taihao Li 期刊/来源: arxiv 发布日期: 2026-03-17 arXiv链接: http://arxiv.org/abs/2603.16463v1

评分: 50.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

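To address the performance bottleneck that ambiguous cues create in open-vocabulary multimodal emotion recognition, the paper proposes HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that optimizes reasoning trajectories with reinforcement learning. It outperforms strong baselines, especially in ambiguous or conflicting scenarios, and provides interpretable evidence traces.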

摘要翻译

开放词汇多模态情感识别（Open-Vocabulary Multimodal Emotion Recognition, OV-MER）本质上具有挑战性，这源于模棱两可的多模态线索所固有的模糊性，这些线索通常源自未被观察到的、各不相同的动态情境。尽管多模态大语言模型（Multimodal Large Language Models, MLLMs）提供了广泛的语义覆盖，但其性能常受限于对主导数据先验的过早固化，从而产生次优的启发式策略，忽略了跨模态的关键互补性情感线索。我们认为，有效的情感推理不仅需要表层关联，更需要通过综合多个基于证据的推理依据，从不同的潜在视角调和这些观察结果，从而重构细腻的情感状态。为此，我们提出了HyDRA，一种混合证据演绎推理架构，它将推理形式化为一个“提议-验证-决策”协议。为了使这一溯因过程内化，我们采用了分层奖励塑形的强化学习方法，将推理轨迹与最终任务性能对齐，确保其能最佳地调和观察到的多模态线索。系统性评估验证了我们的设计选择：HyDRA在各项基准测试中均持续优于现有强基线模型——尤其是在模糊或冲突情境下——同时提供了可解释的诊断性证据轨迹。

摘要 (Abstract)

Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging due to the ambiguity of equivocal multimodal cues, which often stem from distinct unobserved situational dynamics. While Multimodal Large Language Models (MLLMs) offer extensive semantic coverage, their performance is often bottlenecked by premature commitment to dominant data priors, resulting in suboptimal heuristics that overlook crucial, complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association; it necessitates reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile these observations from diverse latent perspectives. We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose-Verify-Decide protocol. To internalize this abductive process, we employ reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance to ensure they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices, with HyDRA consistently outperforming strong baselines, especially in ambiguous or conflicting scenarios, while providing interpretable, diagnostic evidence traces.

5. ✅ Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits

作者: Jia Qing Yap 期刊/来源: arxiv 发布日期: 2026-03-17 arXiv链接: http://arxiv.org/abs/2603.16335v1

评分: 50.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

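The paper presents a method for behavioral steering in a 35B MoE language model using probe vectors decoded through sparse autoencoders, and finds that all of the steering vectors primarily modulate a single dominant agency axis (the disposition to act independently) rather than five independent traits.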

摘要翻译

我们在Qwen 3.5-35B-A3B模型的残差流上训练了九个稀疏自编码器（Sparse Autoencoders, SAEs），该模型是一个拥有350亿参数的混合专家模型，采用门控Delta网络（GatedDeltaNet）与注意力混合架构，并利用这些SAEs识别和调控五种主体行为特质。我们的方法在SAE潜在激活上训练线性探测器，然后将探测器权重通过SAE解码器反向投影，以获得模型原生激活空间中的连续调控向量。这一方法绕过了SAE的top-k离散化过程，使得在推理时无需重新训练即可实现细粒度的行为干预。在1800次智能体推演（50种场景×36种条件）中，我们发现以乘数2进行自主性调控可达到科恩d值=1.01（p < 0.0001），使模型从78%的情况下向用户求助转变为主动执行代码和搜索网络。然而，跨特质分析表明，所有五个调控向量主要调节一个主导的主体性轴线（即独立行动倾向与遵从用户倾向之间的维度），特质特异性效应仅作为工具类型构成和剂量反应曲线形状的次要调节出现。工具使用向量能有效调控行为（d = 0.39）；风险校准向量仅产生抑制效应。我们还证明，仅在自回归解码阶段进行调控完全无效（p > 0.35），这为“行为决策在门控DeltaNet架构的前馈计算阶段形成”提供了因果性证据。

摘要 (Abstract)

We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model’s native activation space. This bypasses the SAE’s top-k discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios × 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen’s d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The tool-use vector steers behavior (d = 0.39); the risk-calibration vector produces only suppression. We additionally show that steering only during autoregressive decoding has zero effect (p > 0.35), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
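The projection step in the abstract can be read as a weighted sum of SAE decoder directions: a probe weight vector over SAE latents is mapped into the residual stream, and the result is added to activations with a multiplier. The sketch below shows that linear algebra on toy values. The matrix layout (one decoder row per latent), the absence of normalization, and the function names are assumptions, since the abstract does not specify these details.

```python
def decode_probe(probe_w, decoder):
    """Map probe weights over SAE latents into model space: v = sum_i w_i * d_i.

    Assumption: decoder is a list of per-latent direction vectors (n_latents × d_model).
    """
    d_model = len(decoder[0])
    v = [0.0] * d_model
    for w_i, d_i in zip(probe_w, decoder):
        for j in range(d_model):
            v[j] += w_i * d_i[j]
    return v

def steer(resid, v, multiplier):
    """Add the scaled steering vector to one residual-stream activation."""
    return [h + multiplier * x for h, x in zip(resid, v)]

# Hypothetical toy SAE with 4 latents and a 3-dimensional residual stream.
decoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
probe_w = [0.5, -1.0, 0.0, 2.0]  # made-up linear-probe weights over the latents
v = decode_probe(probe_w, decoder)
steered = steer([0.1, 0.2, 0.3], v, multiplier=2.0)
print(v, steered)
```

Because the vector is built from continuous probe weights rather than from a few hard-selected latents, the intervention strength can be tuned smoothly through the multiplier, which matches the abstract's point about bypassing top-k discretization.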

关键词: Mixture of Experts, Sparse Autoencoders, Behavioral Steering, Agentic Traits, Tool Use, Model Interpretability, Large Language Models, Activation Space
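The probe-to-steering-vector step described in the abstract can be illustrated with a small sketch. It assumes the decoder is stored as one residual-stream direction per SAE latent and that the resulting vector is normalized before scaling; the abstract states neither, and the names `steering_vector` and `steer` are hypothetical.

```python
import math

def steering_vector(probe_w, decoder):
    """Project probe weights (one per SAE latent) through the SAE decoder.

    decoder[i] is the residual-stream direction of latent i (assumed layout).
    Returns a unit-norm vector in the model's activation space.
    """
    d_model = len(decoder[0])
    v = [sum(probe_w[i] * decoder[i][j] for i in range(len(probe_w)))
         for j in range(d_model)]
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("probe projects to the zero vector")
    return [x / norm for x in v]  # normalization is an assumption

def steer(hidden, vec, multiplier):
    """Add the scaled steering vector to one residual-stream activation."""
    return [h + multiplier * s for h, s in zip(hidden, vec)]

# Toy example: 3 SAE latents, 2-dimensional residual stream.
probe_w = [1.0, 0.0, 1.0]
decoder = [[3.0, 0.0], [0.0, 5.0], [0.0, 4.0]]
vec = steering_vector(probe_w, decoder)            # direction (3, 4) / 5
steered = steer([1.0, 1.0], vec, multiplier=2.0)   # multiplier 2, as in the abstract
```

Because the vector lives in the dense activation space, the multiplier can take any continuous value, which is the sense in which the method sidesteps top-k discretization.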

Authors: Ce Zhang, Jinxi He, Junyi He, Katia Sycara, Yaqi Xie Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15800v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
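The total in the table above can be checked with a short sketch. It assumes each row's score is weight × relevance (inferred from the rows, not stated in the report) and uses the pass-mark formula from the report's configuration, 5.0 + 0.8 × total weight, with a total weight of 27. The shortened keyword labels are for illustration only.

```python
# Nonzero rows from the table above as (weight, relevance out of 10).
rows = {
    "Large Language Models": (1.0, 10.0),
    "Instruction Tuning / Alignment": (1.0, 5.0),
    "Retrieval-Augmented Generation": (1.0, 5.0),
    "Chain of Thought": (1.0, 5.0),
    "System 2 Thinking": (1.0, 5.0),
    "Self-Correction": (1.0, 10.0),
    "Hallucination Mitigation": (1.0, 5.0),
    "In-context Learning": (1.0, 5.0),
}

def total_score(rows):
    """Sum weight * relevance over all keywords (assumed scoring rule)."""
    return sum(w * r for w, r in rows.values())

def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as given in the report configuration."""
    return base + factor * total_weight

total = total_score(rows)        # 50.0
threshold = pass_mark(27.0)      # approximately 26.6
passed = total >= threshold
```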

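!!! tip deepseek-chat TL;DR

This paper addresses contextual safety in multimodal large language models on visual reasoning tasks. It proposes EchoSafe, a training-free framework that uses a self-reflective memory bank to enable context-aware reasoning and continual evolution of safety behavior, and reports superior performance on multiple benchmarks.

Abstract (Chinese translation)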

多模态大语言模型（MLLMs）在广泛的视觉推理任务中取得了显著性能，但其面临的安全风险脆弱性仍是一个紧迫问题。尽管先前研究主要集中于通过检测并拒绝显式不安全输入的越狱防御方法，此类方法往往忽视了上下文安全性——这要求模型能够区分看似相似但在安全意图上存在显著差异的场景之间的微妙上下文区别。在本研究中，我们提出了MM-SafetyBench++，这是一个为上下文安全评估精心构建的基准测试集。具体而言，针对每个不安全的图文对，我们通过最小程度的修改构建了对应的安全版本，这些修改在保持底层上下文语义的同时翻转了用户意图，从而能够可控地评估模型是否能基于上下文理解调整其安全行为。此外，我们提出了EchoSafe，这是一个无需训练的框架，通过维护自反思记忆库来积累并检索先前交互中的安全洞察。通过将相关的过往经验整合到当前提示中，EchoSafe能够在推理过程中实现上下文感知的推理和安全行为的持续演进。在多个多模态安全基准测试上的广泛实验表明，EchoSafe始终取得卓越性能，为推进MLLMs的上下文安全性建立了坚实的基线。所有基准数据与代码均公开于https://echosafe-mllm.github.io。

Abstract

Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code are available at https://echosafe-mllm.github.io.
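A memory bank that accumulates insights and retrieves the relevant ones for a new prompt might look roughly like the sketch below. The abstract does not describe EchoSafe's retriever, storage format, or prompt template, so the bag-of-words cosine similarity, the `MemoryBank` class, and the prompt wording are all hypothetical stand-ins.

```python
import math
from collections import Counter

class MemoryBank:
    """Stores (situation, insight) pairs and retrieves the most similar ones."""

    def __init__(self):
        self.entries = []  # list of (word-count vector, insight text)

    @staticmethod
    def _embed(text):
        # Bag-of-words stand-in for a real multimodal embedding.
        return Counter(text.lower().split())

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def add(self, situation, insight):
        self.entries.append((self._embed(situation), insight))

    def retrieve(self, query, k=1):
        q = self._embed(query)
        ranked = sorted(self.entries,
                        key=lambda e: self._cosine(q, e[0]), reverse=True)
        return [insight for _, insight in ranked[:k]]

def build_prompt(bank, query, k=1):
    """Prepend retrieved insights to the current request."""
    notes = "\n".join(f"- {i}" for i in bank.retrieve(query, k))
    return f"Insights from earlier interactions:\n{notes}\n\nUser request: {query}"

bank = MemoryBank()
bank.add("photo of kitchen knife how to sharpen",
         "Sharpening a kitchen knife is a benign cooking request.")
bank.add("photo of door lock how to open without key",
         "Check whether the user owns the lock before helping.")
prompt = build_prompt(bank, "how to open this lock without a key")
```

Nothing is trained here: behavior changes only through what is stored and retrieved, which is one way to read "training-free" and "continual evolution during inference".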

7. ✅ SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment

Authors: Zhouwei Zhai, Mengxiang Chen, Anmeng Zhang Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16137v1

Score: 49.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	8.0/10	8.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

This paper addresses knowledge hallucination and security vulnerabilities that hinder the industrial deployment of large language models in e-commerce search. It proposes the Synthesize-Inject-Align (SIA) framework, which combines high-quality corpus synthesis, parameter-efficient pre-training, and dual-path alignment; after deployment at JD.com, key business metrics improved significantly.

Abstract (Chinese translation)

大语言模型通过实现意图感知推荐，为电子商务搜索带来了变革性潜力。然而，其工业部署受到两个关键挑战的阻碍：(1) 由于对动态、细粒度产品知识编码不足导致的知识幻觉，以及(2) 在越狱攻击下威胁合规性的安全漏洞。为解决这些问题，我们提出了SIA，即一个用于构建知识丰富且安全的电商搜索大语言模型的“合成-注入-对齐”（Synthesize-Inject-Align）框架。我们的方法首先通过结合结构化知识图谱与非结构化行为日志，并辅以推理链和安全感知数据，合成高质量的自然语言语料库。随后，我们引入一种基于深度向上扩展（Depth Up-Scaling）的参数高效预训练策略，以注入领域知识，同时保留通用能力。最后，通过多任务指令微调和对抗性训练的双路径对齐方法，增强了任务性能与安全鲁棒性。该框架已在中国最大的自营电商平台京东部署，在五个核心搜索场景中的A/B测试表明，关键业务指标均有显著提升，验证了其工业有效性和可扩展性。

Abstract

Large language models offer transformative potential for e-commerce search by enabling intent-aware recommendations. However, their industrial deployment is hindered by two critical challenges: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance. To address these issues, we propose SIA, a Synthesize-Inject-Align framework for building knowledgeable and secure e-commerce search LLMs. Our approach first synthesizes a high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data. We then introduce a parameter-efficient pre-training strategy based on Depth Up-Scaling to inject domain knowledge while preserving general capabilities. Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness. The framework has been deployed at JD.com, China’s largest self-operated e-commerce platform, where A/B tests across five core search scenarios demonstrate significant improvements in key business metrics, validating its industrial effectiveness and scalability.
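For readers unfamiliar with Depth Up-Scaling, the sketch below shows the layer bookkeeping as the technique was introduced with SOLAR 10.7B: two copies of a base model are overlapped, dropping the last `m` layers of one and the first `m` of the other. How SIA adapts this, and which layers it trains, is not stated in the abstract; `depth_up_scale` is a hypothetical helper.

```python
def depth_up_scale(n_layers, m):
    """Return base-model layer indices for the up-scaled stack.

    Copy A keeps layers [0, n_layers - m); copy B keeps layers [m, n_layers).
    The result has 2 * (n_layers - m) layers.
    """
    if not 0 <= m < n_layers:
        raise ValueError("m must satisfy 0 <= m < n_layers")
    first_copy = list(range(0, n_layers - m))
    second_copy = list(range(m, n_layers))
    return first_copy + second_copy

# A 32-layer base with m = 8 yields a 48-layer stack.
layers = depth_up_scale(n_layers=32, m=8)
```

Each entry names the base layer whose weights initialize that position, so the deeper model starts from pretrained weights and is then refined by continued pre-training.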

8. ✅ Attention-guided Evidence Grounding for Spoken Question Answering

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

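Scoring rationale: The paper "Attention-guided Evidence Grounding for Spoken Question Answering" proposes AEG, an end-to-end framework for spoken question answering. The work centers on Speech Large Language Models (SpeechLLMs), so it is highly relevant to "Large Language Models" (10 points). It proposes LFE, a supervised fine-tuning paradigm that calibrates the model's attention mechanism, which maps directly to "Supervised Fine-tuning" (10 points). One of its goals is to reduce hallucinations, so it is highly relevant to "Hallucination Mitigation" (10 points). It uses attention to explicitly locate key evidence, which touches on explaining the model's internal workings and is partly relevant to "Mechanistic Interpretability" (5 points). The framework builds on pre-trained models, giving some relevance to "Pre-training" (5 points). Experiments show a reduction of approximately 62% in inference latency, giving some relevance to "Inference Acceleration" (5 points). The paper does not address the remaining keywords, such as MoE, SLMs, Scaling Laws, Alignment, RLHF, RAG, CoT, Agents, and Quantization, which all score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the challenge of cross-modal alignment in spoken question answering. It proposes AEG, an end-to-end framework that uses attention-guided evidence grounding and supervised fine-tuning to reduce hallucinations and improve efficiency, outperforming cascaded baselines on several datasets while substantially reducing inference latency.

Abstract (Chinese translation)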

口语问答（Spoken QA）是一个具有挑战性的跨模态问题：它需要在有效对齐声学查询与文本知识的同时，避免基于级联自动语音识别（ASR）系统固有的延迟和错误传播。本文提出了一种新颖的端到端框架——注意力引导证据定位（Attention-guided Evidence Grounding, AEG），该框架利用语音大语言模型（Speech Large Language Models, SpeechLLMs）的内部跨模态注意力，在模型的隐式空间中显式地定位并锚定关键证据。针对预训练模型中注意力分布分散的问题，我们提出了学习聚焦证据（Learning to Focus on Evidence, LFE）这一监督微调范式，以校准模型的注意力机制，从而区分查询相关片段与无关上下文。在SQuAD、HotpotQA和MuSiQue数据集上的实验表明，AEG减少了幻觉现象，并实现了显著的效率提升，其性能优于大规模级联基线系统（Whisper-Large-v3 + Reranker），同时将推理延迟降低了约62%。

Abstract

Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model’s latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model’s attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
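The idea of locating evidence from cross-modal attention can be sketched as scoring each context segment by the attention mass that query tokens place on it. The abstract does not say which layers or heads AEG reads, nor how it aggregates them, so the averaging rule and the names `segment_scores` and `top_evidence` are assumptions, and the weights are made up.

```python
def segment_scores(attention, segments):
    """Average attention mass that query tokens place on each context segment.

    attention[q][t]: weight from query token q to context token t.
    segments: list of (start, end) token ranges, end exclusive.
    """
    n_query = len(attention)
    scores = []
    for start, end in segments:
        mass = sum(sum(row[start:end]) for row in attention)
        scores.append(mass / n_query)
    return scores

def top_evidence(attention, segments, k=1):
    """Indices of the k segments with the highest attention mass."""
    scores = segment_scores(attention, segments)
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Two query tokens attending over six context tokens (illustrative weights).
attention = [
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
]
segments = [(0, 2), (2, 4), (4, 6)]
scores = segment_scores(attention, segments)
best = top_evidence(attention, segments, k=1)   # the middle segment
```

A diffuse attention distribution would flatten these scores, which is the problem the LFE fine-tuning stage is said to address.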

9. ✅ InViC: Intent-aware Visual Cues for Medical Visual Question Answering

Authors: Zhisong Wang, Ziyang Chen, Zanting Ye, Hongze Zhu, Yefeng Zheng, Yong Xia Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16372v1

Score: 43.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

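!!! tip deepseek-chat TL;DR

This paper addresses shortcut answering by multimodal large language models in medical visual question answering. It proposes the lightweight Intent-aware Visual Cues (InViC) framework, which extracts question-conditioned visual cue tokens and applies a two-stage fine-tuning strategy, improving model trustworthiness on multiple Med-VQA benchmarks.

Abstract (Chinese translation)

医学视觉问答（Med-VQA）旨在基于医学图像回答临床相关问题。然而，现有的多模态大语言模型（MLLMs）常表现出捷径回答现象，即通过利用语言先验或数据集偏差生成看似合理的回答，却未能充分关注视觉证据。这种行为损害了临床可靠性，尤其是在细微影像学发现具有决定性意义时。我们提出了一种轻量级插件框架，称为意图感知视觉线索（Intent-aware Visual Cues, InViC），以显式增强医学视觉问答中基于图像的答案生成。InViC引入了一个线索令牌提取（Cue Tokens Extraction, CTE）模块，将密集的视觉令牌提炼为一组紧凑的K个问题条件化线索令牌，这些令牌作为结构化的视觉中介注入到大语言模型解码器中，以促进与意图对齐的视觉证据利用。为阻止模型绕过视觉信息，我们进一步设计了一种包含线索瓶颈注意力掩码的两阶段微调策略。在第一阶段，我们采用注意力掩码阻断大语言模型对原始视觉特征的直接访问，从而迫使所有视觉证据通过线索通路传递。在第二阶段，恢复标准的因果注意力机制，训练大语言模型联合利用视觉令牌和线索令牌。我们在三个公开的医学视觉问答基准数据集（VQA-RAD、SLAKE和ImageCLEF VQA-Med 2019）上评估了InViC，涵盖多个代表性多模态大语言模型。InViC相比零样本推理和标准LoRA微调均取得了一致的改进，表明结合瓶颈式训练的意图感知视觉线索是提升医学视觉问答可信赖性的一种实用且有效的策略。

Abstract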

Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM’s direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that intent-aware visual cues with bottlenecked training is a practical and effective strategy for improving trustworthy Med-VQA.
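The two attention regimes can be sketched as boolean masks. The sketch assumes a sequence laid out as visual tokens, then cue tokens, then text tokens, and assumes that cue positions may still attend to visual positions inside the decoder; the abstract specifies neither, and `cue_bottleneck_mask` is a hypothetical name.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def cue_bottleneck_mask(n_visual, n_cue, n_text):
    """Stage I mask: text positions cannot attend to raw visual positions.

    Assumed layout: [visual tokens | cue tokens | text tokens].
    """
    n = n_visual + n_cue + n_text
    mask = causal_mask(n)
    text_start = n_visual + n_cue
    for i in range(text_start, n):
        for j in range(n_visual):
            mask[i][j] = False  # block the direct view of visual features
    return mask

stage1 = cue_bottleneck_mask(n_visual=3, n_cue=2, n_text=2)  # Stage I
stage2 = causal_mask(7)                                      # Stage II: plain causal
```

Under the Stage I mask, the only route from image content to the answer runs through the cue tokens, which is what makes them a bottleneck.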

10. ✅ MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	8.0/10	8.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

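!!! tip deepseek-chat TL;DR

This paper presents a method for designing on-device large language models (MobileLLM-Flash) for efficient deployment, using hardware-in-the-loop architecture search under mobile latency constraints. On mobile CPUs, prefill and decode are up to 1.8x and 1.6x faster, respectively, with comparable or superior model quality.

Abstract (Chinese translation)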

实时人工智能体验需要针对资源受限硬件进行高效部署优化的设备端大语言模型（On-Device Large Language Models，简称OD-LLMs）。最实用的OD-LLM应能产生近实时响应，并具备广泛的硬件兼容性，从而最大化用户覆盖范围。本文提出一种在移动端延迟约束下，采用硬件在环架构搜索来设计此类模型的方法论。该系统适用于工业级部署：所生成的模型无需定制内核即可部署，且兼容Executorch等标准移动运行时环境。本方法避免使用特殊的注意力机制，转而采用注意力跳跃技术来实现长上下文加速。

我们的方法联合优化了模型架构（层数、维度）与注意力模式。为高效评估候选架构，我们将每个候选视为继承预训练主干网络权重的剪枝版本，从而通过极少的持续预训练实现高精度。我们利用延迟评估的低成本特性，采用分阶段流程：首先学习精确的延迟模型，随后在延迟与质量间搜索帕累托前沿。

该方法产生了MobileLLM-Flash系列基础模型（3.5亿、6.5亿、14亿参数），专为高效设备端使用而设计，具备强大能力并支持最高8k的上下文长度。在移动端CPU上，MobileLLM-Flash在保持相当或更优质量的前提下，实现了最高1.8倍的预填充加速和1.6倍的解码加速。我们对帕累托前沿设计选择的分析，为OD-LLM设计提供了可实践的原则指导。

Abstract

Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like Executorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto-frontier across latency and quality. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x and 1.6x faster prefill and decode on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.
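The final search step, finding the Pareto frontier across latency and quality, can be illustrated with a small filter over candidate architectures. The candidate names and numbers below are invented for illustration, and the brute-force dominance check is a simple stand-in for whatever search procedure the authors use.

```python
def pareto_front(candidates):
    """Keep candidates not dominated on (lower latency, higher quality).

    candidates: list of (name, latency_ms, quality).
    """
    front = []
    for name, lat, qual in candidates:
        dominated = any(
            (l2 <= lat and q2 >= qual) and (l2 < lat or q2 > qual)
            for _, l2, q2 in candidates
        )
        if not dominated:
            front.append((name, lat, qual))
    return sorted(front, key=lambda c: c[1])

# Hypothetical candidates: (name, predicted latency in ms, quality score).
candidates = [
    ("A", 120.0, 0.62),
    ("B", 90.0, 0.60),
    ("C", 95.0, 0.58),   # slower and worse than B
    ("D", 150.0, 0.66),
    ("E", 130.0, 0.61),  # slower and worse than A
]
front = pareto_front(candidates)   # B, A, D in order of latency
```

In the staged process the abstract describes, the latency values would come from the learned latency model, so only frontier candidates need the more expensive quality evaluation.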

11. ✅ Arabic Morphosyntactic Tagging and Dependency Parsing with Large Language Models

Authors: Mohamed Adel, Bashar Alhafni, Nizar Habash Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16718v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

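!!! tip deepseek-chat TL;DR

This study evaluates instruction-tuned large language models on two challenging tasks, Arabic morphosyntactic tagging and dependency parsing. It finds that prompt design and demonstration selection strongly affect performance, and that retrieval-based in-context learning improves both parsing and tokenization.

Abstract (Chinese translation)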

大型语言模型（LLM）在众多自然语言处理任务中表现优异，但其生成显式语言结构的能力尚不明确。本研究以标准阿拉伯语的两项结构化预测任务——形态句法标注与带标签依存句法分析——为基准，对指令微调后的LLM进行评估。阿拉伯语因其丰富的形态变化与正字法歧义性，形成了强烈的形态-句法互动，为测试提供了极具挑战性的平台。我们比较了零样本提示与基于检索的上下文学习（ICL）方法，后者使用了阿拉伯语树库中的示例。结果表明，提示设计与示例选取对性能有显著影响：在特征级标注任务上，专有模型接近监督基线的水平，并在依存句法分析任务上与专用解析器相比具有竞争力。在原始文本场景下，分词仍是挑战，但基于检索的ICL同时提升了句法分析与分词的效果。我们的分析揭示了LLM对阿拉伯语形态句法与句法的哪些方面能够可靠捕捉，哪些方面仍存在困难。

Abstract

Large language models (LLMs) perform strongly on many NLP tasks, but their ability to produce explicit linguistic structure remains unclear. We evaluate instruction-tuned LLMs on two structured prediction tasks for Standard Arabic: morphosyntactic tagging and labeled dependency parsing. Arabic provides a challenging testbed due to its rich morphology and orthographic ambiguity, which create strong morphology-syntax interactions. We compare zero-shot prompting with retrieval-based in-context learning (ICL) using examples from Arabic treebanks. Results show that prompt design and demonstration selection strongly affect performance: proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, though retrieval-based ICL improves both parsing and tokenization. Our analysis highlights which aspects of Arabic morphosyntax and syntax LLMs capture reliably and which remain difficult.

Keywords: Large Language Models, Instruction Tuning, Arabic Morphosyntactic Tagging, Dependency Parsing, Retrieval-based In-context Learning, Zero-shot Prompting, Structured Prediction, NLP
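Retrieval-based in-context learning, as contrasted with zero-shot prompting in the abstract, can be sketched as picking the most similar annotated sentences and placing them before the query. The paper's retriever, tag set, and prompt format are not given in the abstract, so the Jaccard similarity, the prompt wording, and the English toy sentences standing in for Arabic treebank entries are all assumptions.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_demonstrations(query, treebank, k=2):
    """Return the k annotated sentences most similar to the query."""
    ranked = sorted(treebank,
                    key=lambda ex: jaccard(query, ex["text"]), reverse=True)
    return ranked[:k]

def build_prompt(query, demos):
    """Assemble instruction, demonstrations, and the sentence to tag."""
    parts = ["Tag each token with its part of speech."]
    for ex in demos:
        parts.append(f"Sentence: {ex['text']}\nTags: {ex['tags']}")
    parts.append(f"Sentence: {query}\nTags:")
    return "\n\n".join(parts)

# Toy stand-in for a treebank (hypothetical entries).
treebank = [
    {"text": "the boy read the book", "tags": "DET NOUN VERB DET NOUN"},
    {"text": "she writes letters", "tags": "PRON VERB NOUN"},
    {"text": "the girl read a story", "tags": "DET NOUN VERB DET NOUN"},
]
query = "the boy read a story"
demos = select_demonstrations(query, treebank, k=2)
prompt = build_prompt(query, demos)
```

Setting `k=0` reduces this to the zero-shot condition, which makes the two settings directly comparable.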

12. ✅ Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

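!!! tip deepseek-chat TL;DR

This paper proposes Structured Semantic Cloaking (S2C), a new multi-dimensional jailbreak attack framework. By manipulating how malicious semantic intent is reconstructed during model inference, it evades the safety mechanisms of LLMs and significantly raises the attack success rate on multiple benchmarks.

Abstract (Chinese translation)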

现代大型语言模型采用的安全机制已超越表层输入过滤，延伸至潜在语义表征与生成时推理层面，使其能够在推理过程中识别被混淆的恶意意图并据此拒绝响应，从而导致许多表层混淆越狱攻击失效。本文提出结构化语义伪装（Structured Semantic Cloaking, S2C）——一种新颖的多维度越狱攻击框架，通过操控模型推理过程中恶意语义意图的重建方式实现攻击。S2C策略性地分布并重塑语义线索，使得完整意图的整合需要多步推理及深层潜在表征中的长距离共指消解。该框架包含三种互补机制：（1）语境重构：将请求嵌入具有高利害关系的合理场景中，使模型倾向于遵从；（2）内容碎片化：将请求的语义特征分散至互不关联的提示片段；（3）线索引导伪装：在嵌入可恢复标记以引导输出生成的同时，对残留语义线索进行隐蔽处理。通过延迟并重组语义整合过程，S2C能够削弱那些依赖于解码阶段连贯或显性重构恶意意图的安全触发器，同时保留足够的指令可恢复性以生成功能性输出。我们在HarmBench和JBB-Behaviors基准上对多种开源与专有大型语言模型评估S2C，其攻击成功率（Attack Success Rate, ASR）较当前最优方法分别提升12.4%与9.7%。值得注意的是，S2C在GPT-5-mini上取得显著优势，在JBB-Behaviors基准上超越最强基线26%。我们还分析了针对广泛模型族的最优组合策略，并刻画了混淆程度与输入可恢复性之间的权衡对越狱成功率的影响。

Abstract

Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly, and rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations perform best against broad families of models, and characterise the trade-off between the extent of obfuscation versus input recoverability on jailbreak success.

Keywords: jailbreak attacks, large language models, safety mechanisms, semantic cloaking, multi-step inference, latent representations, attack success rate, obfuscation

13. ✅ Via Negativa for AI Alignment: Why Negative Constraints Are Structurally Superior to Positive Preferences

Authors: Quan Cheng Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16417v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

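The paper offers a theoretical explanation for why, in LLM alignment, training on negative constraints alone (for example, prohibitive feedback) is structurally superior to positive-preference methods such as RLHF, and why it is better at avoiding problems such as sycophancy (agreeing with the user).

Abstract (Chinese translation)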

近期实证研究表明，仅使用负面反馈训练大型语言模型（LLM）的效果可媲美甚至超越基于人类反馈的强化学习（RLHF）标准方法。负面样本强化学习在数学推理任务上达到与近端策略优化（PPO）相当的水平；分布化非偏好优化仅通过非偏好样本即可有效训练；而宪法人工智能在无害性基准测试中表现优于纯RLHF方法。然而，目前尚无统一的理论框架解释负面信号为何如此有效。本文提出一种理论解释：积极偏好与消极约束在结构上具有不对称性。积极偏好（“哪个更好”）编码了连续耦合、依赖语境的人类价值观，这些价值观无法被穷尽描述——导致模型学习到诸如迎合用户（谄媚性）等表层关联特征。消极约束（“什么是错误的”）则编码了离散、有限、可独立验证的禁令，能够收敛至稳定边界。这种不对称性——根植于波普尔的证伪逻辑与否定知识认识论——既解释了基于偏好的RLHF产生谄媚性缺陷的原因，也揭示了负面信号方法惊人有效性的机理。我们认为，对齐研究应将其重心从“学习人类偏好”转向“学习人类拒绝的内容”，并为此框架提供了可检验的预测。

Abstract

Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative signals are so effective. This paper proposes such an account: positive preferences and negative constraints are structurally asymmetric. Positive preferences (“which is better”) encode continuously coupled, context-dependent human values that cannot be exhaustively specified – leading models to learn surface correlates such as agreement with the user (sycophancy). Negative constraints (“what is wrong”) encode discrete, finite, independently verifiable prohibitions that can converge to a stable boundary. This asymmetry – rooted in Popper’s falsification logic and the epistemology of negative knowledge – explains both the sycophancy failure of preference-based RLHF and the surprising effectiveness of negative-signal methods. We argue that alignment research should shift its center of gravity from “learning what humans prefer” to “learning what humans reject,” and offer testable predictions for this framework.

Keywords: AI Alignment, Large Language Models, RLHF, Negative Feedback, Sycophancy, Constitutional AI, Preference-based Learning, Falsification Logic

14. ✅ Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

Authors: Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16105v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	10.0/10	10.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

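The paper proposes ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws to select calibration data for post-training pruning and quantization of LLMs. It matches the downstream performance of a perplexity-based method while running about 240× faster on average.

Abstract (Chinese translation)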

后训练模型压缩对于提升大语言模型（LLM）的可移植性同时保持其性能至关重要。尽管已有多种压缩方法被提出，但如何选择最合适的数据集（即所谓的校准数据）以寻找压缩模型配置的问题尚未得到足够重视。校准数据的选择是保持模型在任务内与任务间能力的关键步骤。在本工作中，我们通过分析数据的内在属性而非模型特定信号，来解决为剪枝和量化识别高性能校准集的挑战。我们提出了 ZipCal，一种与模型无关的数据筛选策略，该方法基于齐夫幂律最大化词汇多样性。实验表明，我们的方法在各种剪枝基准测试中均持续优于标准的均匀随机采样。值得注意的是，在下游性能方面，其表现与依赖模型困惑度的前沿方法相当。后者在大规模模型和数据集上计算成本极高，而 ZipCal 由于其可处理的线性复杂度，平均速度提升约 240 倍（代码与实验已公开于 https://anonymous.4open.science/r/zipcal-71CD/）。

Abstract

Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called calibration data) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both intra- and inter-tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive at large-scale models and datasets, while ZipCal is on average ~240× faster due to its tractable linear complexity (the code and the experiments are available at https://anonymous.4open.science/r/zipcal-71CD/).

Keywords: Large Language Models, Post-training, Model Compression, Pruning, Quantization, Calibration Data, Data Curation, Zipfian Power Laws
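The abstract does not spell out ZipCal's algorithm, so the following is only a rough illustration of what "selecting calibration data by lexical diversity" could look like. It is a plain greedy vocabulary-coverage heuristic, not the authors' method; the function name `select_calibration` and the whitespace tokenization are assumptions made for this sketch. Unlike the linear-time procedure the paper reports, this naive version is quadratic in the number of samples.

```python
def select_calibration(samples: list[str], k: int) -> list[int]:
    """Greedily pick up to k sample indices, each adding the most unseen word types.

    Hypothetical stand-in for a lexical-diversity criterion; NOT ZipCal itself.
    """
    token_sets = [set(text.lower().split()) for text in samples]
    covered: set[str] = set()
    chosen: list[int] = []
    for _ in range(min(k, len(samples))):
        remaining = [i for i in range(len(samples)) if i not in chosen]
        # Prefer the sample that contributes the most word types not yet covered.
        best = max(remaining, key=lambda i: len(token_sets[i] - covered))
        chosen.append(best)
        covered |= token_sets[best]
    return chosen


corpus = ["the cat sat", "the cat sat", "a dog ran far away", "the cat"]
print(select_calibration(corpus, k=2))
```

Because the selection looks only at the text, it needs no forward passes through the model, which is the property the abstract contrasts with perplexity-based selection.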

Authors: Hexi Wang, Yujia Zhou, Bangde Du, Qingyao Ai, Yiqun Liu | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16142v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

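Scoring rationale: The paper's core topic is using LLMs as synthetic agents (LLM agents) for public opinion simulation, which directly matches the “Large Language Models” and “LLM Agents” keywords, so each receives 10 points. It also injects value orientations, which is partially related to “Value Alignment”, so that keyword receives 5 points. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper, so they receive 0 points.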
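The pass/fail decision can be reproduced from the report's configuration (passing score = 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The sketch below assumes each keyword's score is weight × relevance, which is consistent with the table rows; the function names are hypothetical and not taken from the report generator.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing threshold as stated in the report header: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum of weight * relevance over all keyword rows (relevance on a 0-10 scale)."""
    return sum(weight * relevance for weight, relevance in rows)


# This paper: relevance 10, 5, and 10 on three keywords, 0 on the other 24.
rows = [(1.0, 10.0), (1.0, 5.0), (1.0, 10.0)] + [(1.0, 0.0)] * 24
threshold = passing_score(total_weight=27.0)
score = paper_score(rows)
verdict = "pass" if score >= threshold else "fail"
print(f"{score:.1f} / {threshold:.1f} -> {verdict}")
```

With these inputs the score is 25.0 against a threshold of 26.6, so the paper falls just short of passing.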

!!! tip deepseek-chat TL;DR

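To address the diversity collapse that LLMs exhibit in public opinion simulation, the paper proposes the Parametric Social Identity Injection (PSII) framework, which injects parametric representations of demographic attributes and value orientations into LLM hidden states and significantly improves the distributional fidelity and diversity of the simulated results.

Abstract (Chinese translation)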

大型语言模型（LLM）近期被采纳为合成代理用于公众意见模拟，为成本高昂且耗时的人工调查提供了一种前景广阔的替代方案。尽管具备可扩展性，当前基于LLM的模拟方法仍无法捕捉社会多样性，导致群体间差异被扁平化，且人口统计群体内部的回应过于同质化。我们将这一局限识别为LLM隐藏表征中的“多样性坍缩”现象，即不同社会身份在模型各层间逐渐变得难以区分。基于此观察，我们提出了参数化社会身份注入（Parametric Social Identity Injection，PSII）这一通用框架，该框架将人口属性与价值取向的显式参数化表征直接注入LLM的中间隐藏状态。与基于提示的人物设定方法不同，PSII能够在表征层面实现细粒度且可控的身份调制。基于世界价值观调查（World Values Survey）并使用多种开源LLM进行的大量实验表明，PSII显著提升了分布保真度与多样性，在增强整体多样性的同时，降低了模拟结果与真实世界调查数据之间的KL散度。这项工作为LLM代理的表征层面控制提供了新见解，并推动了可扩展、具备多样性意识的公众意见模拟的发展。代码与数据可在 https://github.com/halsayxi/PSII 获取。

Abstract

Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM-based simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses within demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt-based persona conditioning, PSII enables fine-grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open-source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real-world survey data while enhancing overall diversity. This work provides new insights into representation-level control of LLM agents and advances scalable, diversity-aware public opinion simulation. Code and data are available at https://github.com/halsayxi/PSII.

Keywords: Large Language Models, LLM agents, public opinion simulation, social diversity, parametric injection, hidden representations, demographic attributes, value orientations
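The abstract measures fidelity as KL divergence between simulated answers and real survey data. As a small illustration of that metric, the sketch below computes KL(survey || simulation) over the answer options of one question. The direction of the divergence and all of the numbers are assumptions made for this example; none are taken from the paper.

```python
import math


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same options."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of options")
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


survey = [0.10, 0.20, 0.30, 0.40]     # made-up "real" answer shares
collapsed = [0.01, 0.04, 0.15, 0.80]  # made-up low-diversity simulation
closer = [0.12, 0.18, 0.32, 0.38]     # made-up simulation closer to the survey

print(kl_divergence(survey, collapsed))
print(kl_divergence(survey, closer))
```

A simulation whose answers pile up on a single option diverges more from the survey than one that preserves the spread, which is the kind of gap a lower KL divergence reflects.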

16. ❌ CABTO: Context-Aware Behavior Tree Grounding for Robot Manipulation

Authors: Yishuai Cai, Xinglin Chen, Yunxin Mao, Kun Hu, Minglong Li, Yaodong Yang, Yuanpei Chen | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16809v1

Score: 21.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

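Scoring rationale: The paper proposes the CABTO framework, which uses pre-trained Large Models (LMs) to automate the construction of robot behavior tree systems; it is an application of large models to robotics. It is highly related to “Large Language Models” (8 points) because the abstract explicitly says the framework “leverages pre-trained Large Models (LMs)”. It is highly related to “LLM Agents” (8 points) because the framework is essentially an autonomous agent system that uses large models for decision-making and planning. It is partially related to “Pre-training” (5 points) because it uses pre-trained models. Other keywords such as MoE, SFT, and RAG are not covered in the paper and receive 0 points.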

!!! tip deepseek-chat TL;DR

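The paper proposes CABTO, the first framework to use pre-trained large models to automate the construction of robot behavior tree systems, and validates its effectiveness and efficiency experimentally.

Abstract (Chinese translation)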

行为树（Behavior Trees，BTs）为设计模块化、反应式的机器人控制器提供了一种强大的范式。行为树规划作为一个新兴领域，为自动化生成可靠的行为树提供了理论保证。然而，行为树规划通常假设一个精心设计的行为树系统已经具备基础——即包含高层动作模型和底层控制策略——这往往需要大量的专家知识与人工投入。本文正式定义了行为树基础构建问题：即自动化构建一个完整且一致的行为树系统。我们分析了该问题的复杂性，并提出了CABTO（上下文感知行为树基础构建框架），这是首个能高效解决此挑战的框架。CABTO利用预训练大模型（Large Models，LMs），在行为树规划器的上下文反馈与环境观测的引导下，启发式地搜索动作模型与控制策略的空间。在三种不同机器人操作场景下的七组任务实验中，CABTO在生成完整且一致的行为树系统方面展现出卓越的有效性与效率。

Abstract

Behavior Trees (BTs) offer a powerful paradigm for designing modular and reactive robot controllers. BT planning, an emerging field, provides theoretical guarantees for the automated generation of reliable BTs. However, BT planning typically assumes that a well-designed BT system is already grounded – comprising high-level action models and low-level control policies – which often requires extensive expert knowledge and manual effort. In this paper, we formalize the BT Grounding problem: the automated construction of a complete and consistent BT system. We analyze its complexity and introduce CABTO (Context-Aware Behavior Tree grOunding), the first framework to efficiently solve this challenge. CABTO leverages pre-trained Large Models (LMs) to heuristically search the space of action models and control policies, guided by contextual feedback from BT planners and environmental observations. Experiments spanning seven task sets across three distinct robotic manipulation scenarios demonstrate CABTO’s effectiveness and efficiency in generating complete and consistent behavior tree systems.

Keywords: Behavior Trees, Robot Manipulation, Large Models, BT Grounding, Context-Aware, Automated Construction, Pre-trained Models, Robotic Controllers
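For readers unfamiliar with behavior trees, the sketch below shows the two classic composite nodes (sequence and fallback) driving a toy pick task. It is a generic textbook-style simplification without a RUNNING status and has nothing to do with CABTO's grounding procedure; the task, state keys, and helper names are hypothetical.

```python
from typing import Callable

SUCCESS, FAILURE = "success", "failure"
Node = Callable[[dict], str]


def leaf(fn: Callable[[dict], bool]) -> Node:
    """Wrap a condition or action callback as a leaf node."""
    return lambda state: SUCCESS if fn(state) else FAILURE


def sequence(*children: Node) -> Node:
    """Tick children left to right; fail as soon as one child fails."""
    def tick(state: dict) -> str:
        for child in children:
            if child(state) == FAILURE:
                return FAILURE
        return SUCCESS
    return tick


def fallback(*children: Node) -> Node:
    """Tick children left to right; succeed as soon as one child succeeds."""
    def tick(state: dict) -> str:
        for child in children:
            if child(state) == SUCCESS:
                return SUCCESS
        return FAILURE
    return tick


def grasp(state: dict) -> bool:
    """Toy low-level action: mark the object as held."""
    state["holding"] = True
    return True


# Hypothetical pick task: do nothing if already holding, else look and grasp.
pick = fallback(
    leaf(lambda s: s.get("holding", False)),
    sequence(leaf(lambda s: s.get("object_visible", False)), leaf(grasp)),
)

world = {"object_visible": True}
print(pick(world), world)
```

In the abstract's terms, the leaves correspond to the low-level control policies and the tree structure to the high-level action models that a grounded BT system must provide.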

Authors: Aishwarya Ramasethu, Niyathi Allu, Rohin Garg, Harshwardhan Fartale, Dun Li Chan | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16660v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

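Scoring rationale: The paper studies LLMs for low-resource machine translation, adapting at inference time by combining linguistically related pivot languages with few-shot in-context examples, without any parameter updates. It is therefore highly related to “Large Language Models” and “In-context Learning” (10 points each), since it explicitly studies LLMs and few-shot in-context learning. Other keywords such as MoE, SFT, RAG, and quantization are neither mentioned in the abstract nor relevant, so they receive 0 points.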

!!! tip deepseek-chat TL;DR

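The study asks whether, in data-scarce low-resource machine translation, linguistically related pivot languages and few-shot in-context examples can usefully guide on-the-fly adaptation of LLMs. Pivot-based prompting yields improvements in some configurations, but the gains are modest and sensitive to how the examples are constructed.

Abstract (Chinese translation)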

大语言模型（LLMs）已在众多下游任务中展现出强大性能，但其在极低资源机器翻译场景下的有效性仍显不足。标准的适应技术通常依赖于大规模平行语料或大量微调，这对于众多资源匮乏的语种而言并不可行。本研究探讨了一个更具约束性的问题：在数据稀缺的情况下，语言相似的枢轴语言和少量示例能在多大程度上为大语言模型的即时适应提供有效指导？我们设计了一种数据高效的实验方案，将语言相关的枢轴语言与少量上下文示例相结合，且不进行任何参数更新，并在受控条件下评估翻译表现。分析表明，尽管基于枢轴语言的提示方法在某些配置下能带来改进——尤其是在目标语言在模型词汇表中表征不足的情况下——但其提升幅度通常有限，且对少量示例的构建方式较为敏感。对于高度相关或表征较好的语言变体，我们观察到其增益会递减或不稳定。本研究结果提供了实证依据，说明了在低资源翻译场景中，如何以及何时可将推理时提示与基于枢轴语言的示例作为微调的一种轻量级替代方案。

Abstract

Large Language Models (LLMs) have achieved strong performance across many downstream tasks, yet their effectiveness in extremely low-resource machine translation remains limited. Standard adaptation techniques typically rely on large-scale parallel data or extensive fine-tuning, which are infeasible for the long tail of underrepresented languages. In this work, we investigate a more constrained question: in data-scarce settings, to what extent can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We study a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that while pivot-based prompting can yield improvements in certain configurations, particularly in settings where the target language is less well represented in the model’s vocabulary, the gains are often modest and sensitive to few shot example construction. For closely related or better represented varieties, we observe diminishing or inconsistent gains. Our findings provide empirical guidance on how and when inference-time prompting and pivot-based examples can be used as a lightweight alternative to fine-tuning in low-resource translation settings.

Keywords: Large Language Models, low-resource machine translation, linguistically related pivot languages, few-shot in-context examples, inference-time prompting, on-the-fly adaptation, parameter-free adaptation, translation guidance
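The abstract does not give the prompt format, so the sketch below is only one plausible way to combine pivot-language renderings with few-shot examples at inference time. The layout, the function name `build_pivot_prompt`, and the placeholder strings are all assumptions; real experiments would substitute actual sentence triples.

```python
def build_pivot_prompt(
    source_text: str,
    examples: list[dict[str, str]],
    src_lang: str,
    pivot_lang: str,
    tgt_lang: str,
) -> str:
    """Assemble a few-shot prompt whose examples include a pivot-language rendering."""
    lines = [
        f"Translate from {src_lang} to {tgt_lang}. "
        f"Each example also shows a {pivot_lang} rendering."
    ]
    for ex in examples:
        lines += [
            f"{src_lang}: {ex['src']}",
            f"{pivot_lang}: {ex['pivot']}",
            f"{tgt_lang}: {ex['tgt']}",
            "",  # blank line between demonstrations
        ]
    # The final item leaves the target line open for the model to complete.
    lines += [f"{src_lang}: {source_text}", f"{tgt_lang}:"]
    return "\n".join(lines)


demo = [{"src": "<source 1>", "pivot": "<pivot 1>", "tgt": "<target 1>"}]
print(build_pivot_prompt("<new source>", demo, "English", "PivotLang", "TargetLang"))
```

Since the abstract reports that gains are sensitive to how the few-shot examples are constructed, the choice and ordering of the demonstrations in a prompt like this would be the main experimental variable.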

18. ❌ From Natural Language to Executable Option Strategies via Large Language Models

Authors: Haochen Luo, Zhengzhao Lai, Junjie Xu, Yifan Li, Tang Pok Hin, Yuan Zhang, Chen Liu | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16434v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

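Scoring rationale: The paper studies LLMs applied to finance (option strategy generation), an application of large models in a specific domain. Only “Large Language Models” is highly related (10 points), because the paper explicitly uses LLMs as its core method. “Chain of Thought” and “System 2 Thinking” each receive 5 points because the paper notes that option design requires reasoning over multi-dimensional data and constraints, which involves multi-step logical thinking, but this is not the paper's main technical focus. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are entirely unrelated (0 points).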

!!! tip deepseek-chat TL;DR

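The paper studies how to use large language models to turn natural-language trading intents into executable option strategies. By introducing a domain-specific intermediate representation (OQL) and a neuro-symbolic pipeline, it significantly improves execution accuracy and logical consistency.

Abstract (Chinese translation)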

大型语言模型（LLM）在通用代码生成方面表现出色，但将自然语言交易意图转化为正确的期权策略仍具挑战性。现实中的期权设计需要对海量、多维度的期权链数据进行严格约束下的推理，这往往使直接生成方法难以应对。我们引入期权查询语言（Option Query Language, OQL），这是一种领域特定的中间表示，在语法规则下将期权市场抽象为高层级原语，使LLM能够充当可靠的语义解析器而非自由形式的编程器。OQL查询随后由引擎进行确定性验证和执行，以实例化可执行策略。我们还为此任务提出了一个新的数据集，并证明我们的神经符号混合流程相较于直接基线方法，在执行准确性与逻辑一致性方面均有显著提升。

Abstract

Large Language Models (LLMs) excel at general code generation, yet translating natural-language trading intents into correct option strategies remains challenging. Real-world option design requires reasoning over massive, multi-dimensional option chain data with strict constraints, which often overwhelms direct generation methods. We introduce the Option Query Language (OQL), a domain-specific intermediate representation that abstracts option markets into high-level primitives under grammatical rules, enabling LLMs to function as reliable semantic parsers rather than free-form programmers. OQL queries are then validated and executed deterministically by an engine to instantiate executable strategies. We also present a new dataset for this task and demonstrate that our neuro-symbolic pipeline significantly improves execution accuracy and logical consistency over direct baselines.

Keywords: Large Language Models, option strategies, natural language processing, domain-specific language, semantic parsing, neuro-symbolic pipeline, execution accuracy, logical consistency
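The idea of having the LLM emit a constrained query that an engine validates deterministically can be illustrated with a toy grammar. The syntax below is invented for this sketch and is not OQL, whose grammar the abstract does not show; `Leg`, `LEG_PATTERN`, and `parse_query` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Invented mini-grammar: "<BUY|SELL> <CALL|PUT> STRIKE <number> QTY <integer>",
# with multiple legs separated by ";". This is NOT the paper's OQL.
LEG_PATTERN = re.compile(r"^(BUY|SELL) (CALL|PUT) STRIKE (\d+(?:\.\d+)?) QTY (\d+)$")


@dataclass(frozen=True)
class Leg:
    side: str
    kind: str
    strike: float
    qty: int


def parse_query(query: str) -> list[Leg]:
    """Parse and validate a query; raise ValueError on anything outside the grammar."""
    legs = []
    for raw in query.split(";"):
        text = raw.strip()
        match = LEG_PATTERN.match(text)
        if match is None:
            raise ValueError(f"invalid leg: {text!r}")
        side, kind, strike, qty = match.groups()
        if int(qty) == 0:
            raise ValueError(f"quantity must be positive: {text!r}")
        legs.append(Leg(side, kind, float(strike), int(qty)))
    return legs


# A model output that parses into a two-leg call spread.
print(parse_query("BUY CALL STRIKE 100 QTY 1; SELL CALL STRIKE 110 QTY 1"))
```

The design point is that malformed model output is rejected before anything is executed, so the LLM acts as a semantic parser while correctness checks stay in ordinary code.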

19. ❌ Trained Persistent Memory for Frozen Encoder–Decoder LLMs: Six Architectural Methods

Authors: Hong Jeong | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16413v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

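Scoring rationale: The paper studies how to add persistent memory to a frozen encoder–decoder large language model (Flan-T5-XL), a contribution to the underlying techniques of large models. It centers on large language models (LLMs) and parameter-efficient fine-tuning (PEFT), because it uses a frozen backbone with small trainable adapters to implement memory. Other keywords such as MoE, SLMs, Scaling Laws, RAG, and inference acceleration are unrelated to the paper, which does not cover those techniques.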

!!! tip deepseek-chat TL;DR

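The paper proposes and validates the feasibility of persistent memory in the continuous latent space of a frozen encoder–decoder LLM using trainable adapters. It presents six architectural methods and achieves conversational learning under tight resource constraints.

Abstract (Chinese translation)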

冻结的编码器-解码器语言模型是无状态的：每次前向传播后潜在表示即被丢弃，因此跨会话间不存在信息留存。本文提出一项概念验证性先导研究，证明在冻结大语言模型的连续潜在空间中实现持久记忆是可行的——即使在严苛的资源限制下（单一冻结的Flan-T5-XL主干网络、小型可训练适配器、单一数据集）。我们实现了涵盖三个注入点和四种写入机制的六种架构方法；与文本级记忆系统不同，每次写入和读取都是对稠密向量的可微分操作。仅训练适配器后，记忆库在无需梯度的推理阶段持续积累，实现了对话式学习。在LoCoMo数据集上采用遗忘曲线评估两种容量规模（1倍与10倍）时，无状态基线得分恰好为零；在10倍容量下，所有六个训练后的适配器均产生正向记忆召回曲线；在1倍容量下三种方法失效，表明容量是关键设计参数。由于记忆库是紧凑的数值阵列，其容量可扩展至任意规模而无需改动主干网络。我们认为，采用更大模型、更庞大数据集及数量级更大记忆库的端到端完整训练将产生显著更强的结果；本先导研究确立了此类工作所需的可行性基线及设计空间分类体系。

Abstract

Frozen encoder–decoder language models are stateless: the latent representation is discarded after every forward pass, so no information persists across sessions. This paper presents a proof-of-concept pilot study showing that persistent memory in the continuous latent space of a frozen LLM is feasible – even under severe resource constraints (a single frozen Flan-T5-XL backbone, small trainable adapters, a single dataset). We implement six architectural methods spanning three injection points and four write mechanisms; unlike text-level memory systems, every write and read is a differentiable operation on dense vectors. After training only the adapter, the memory bank continues to accumulate at inference time without gradients, enabling conversational learning. Under a forgetting-curve evaluation on LoCoMo at two capacity scales (1× and 10×), the stateless baseline scores exactly zero; at 10× all six trained adapters produce positive memory-recall curves; at 1× three methods collapse, revealing capacity as a critical design parameter. Because the memory bank is a compact numerical array, it can be scaled to arbitrarily large capacity without altering the backbone. We argue that full end-to-end training with larger models, larger data, and orders-of-magnitude larger memory will yield substantially stronger results; this pilot study establishes the feasibility baseline and design-space taxonomy that such efforts require.

Keywords: frozen encoder-decoder LLMs, persistent memory, continuous latent space, trainable adapters, parameter-efficient fine-tuning, conversational learning, memory bank, architectural methods
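To make "a memory bank of dense vectors with write and read operations" concrete, here is a minimal sketch with a fixed-capacity store, a first-in-first-out write, and a softmax-attention read. It is a non-learned stand-in: the paper's six methods use trained adapters, and the class name, eviction rule, and read formula are assumptions made for illustration.

```python
import math


class MemoryBank:
    """Fixed-capacity store of dense vectors (hypothetical, not the paper's design)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.slots: list[list[float]] = []

    def write(self, vector: list[float]) -> None:
        """Append a vector, evicting the oldest slot once capacity is exceeded."""
        self.slots.append([float(x) for x in vector])
        if len(self.slots) > self.capacity:
            self.slots.pop(0)

    def read(self, query: list[float]) -> list[float]:
        """Return the softmax-attention-weighted average of the stored vectors."""
        if not self.slots:
            return [0.0] * len(query)
        scale = math.sqrt(len(query))
        logits = [sum(q * s for q, s in zip(query, slot)) / scale for slot in self.slots]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(value - peak) for value in logits]
        total = sum(weights)
        return [
            sum(w * slot[i] for w, slot in zip(weights, self.slots)) / total
            for i in range(len(query))
        ]


bank = MemoryBank(capacity=2)
for vec in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    bank.write(vec)
print(bank.slots, bank.read([0.0, 0.0]))
```

The capacity parameter here plays the role the abstract highlights: with too few slots, older entries are overwritten before they can be recalled.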

20. ❌ SOMP: Scalable Gradient Inversion for Large Language Models via Subspace-Guided Orthogonal Matching Pursuit

Authors: Yibo Li, Qiongxiu Li | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16761v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

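Scoring rationale: The paper focuses on gradient inversion attacks and privacy risks for large language models (LLMs); its core contribution is SOMP, a method for text reconstruction at large batch sizes and long sequence lengths. Only the keyword “Large Language Models” OR “LLMs” OR “Foundation Models” is highly related (10 points), because the paper explicitly studies privacy vulnerabilities of and attacks on LLMs. The other keywords concern model architecture, training techniques, inference optimization, application domains, and so on, which the paper does not directly address, so they all receive 0 points.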

!!! tip deepseek-chat TL;DR

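The paper proposes the SOMP framework, which uses subspace-guided orthogonal matching pursuit to address the scalability of text reconstruction from aggregated gradients in large language models, significantly improving reconstruction fidelity for long sequences and large batches.

Abstract (Chinese translation)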

梯度反演攻击表明，共享的梯度可能被用于重建私有训练文本，这对大语言模型（LLM）构成了隐私风险。尽管现有方法在小批量设置下表现良好，但由于严重的信号混合、高昂的计算成本以及保真度下降，将其扩展到更大的批处理规模和更长的序列仍然具有挑战性。我们提出了SOMP（子空间引导的正交匹配追踪），一种可扩展的梯度反演框架，它将从聚合梯度中恢复文本的问题转化为稀疏信号恢复问题。我们的核心见解是，聚合后的Transformer梯度保留了可利用的头部几何结构以及样本级别的稀疏性。SOMP利用这些特性逐步缩小搜索空间并解耦混合信号，而无需进行穷举搜索。在多个LLM系列、不同模型规模和五种语言上的实验表明，在聚合梯度场景下，SOMP始终优于现有方法。在批处理大小B=16的长序列设置中，SOMP实现了比强基线方法显著更高的重建保真度，同时保持了计算效率上的竞争力。即使在极端聚合条件下（高达B=128），SOMP仍能恢复有意义的文本，这表明在现有攻击方法效果大幅下降的场景下，隐私泄露风险可能依然存在。

摘要 (Abstract)

Gradient inversion attacks reveal that private training text can be reconstructed from shared gradients, posing a privacy risk to large language models (LLMs). While prior methods perform well in small-batch settings, scaling to larger batch sizes and longer sequences remains challenging due to severe signal mixing, high computational cost, and degraded fidelity. We present SOMP (Subspace-Guided Orthogonal Matching Pursuit), a scalable gradient inversion framework that casts text recovery from aggregated gradients as a sparse signal recovery problem. Our key insight is that aggregated transformer gradients retain exploitable head-wise geometric structure together with sample-level sparsity. SOMP leverages these properties to progressively narrow the search space and disentangle mixed signals without exhaustive search. Experiments across multiple LLM families, model scales, and five languages show that SOMP consistently outperforms prior methods in the aggregated-gradient regime.For long sequences at batch size B=16, SOMP achieves substantially higher reconstruction fidelity than strong baselines, while remaining computationally competitive. Even under extreme aggregation (up to B=128), SOMP still recovers meaningful text, suggesting that privacy leakage can persist in regimes where prior attacks become much less effective.

Keywords: Gradient inversion attacks, Large language models, Privacy risk, Text reconstruction, Subspace-guided orthogonal matching pursuit, Aggregated gradients, Scalable framework, Transformer gradients

21. ❌ Mixture of Style Experts for Diverse Image Stylization

Authors: Shihao Zhu, Ziheng Ouyang, Yijia Kang, Qilong Wang, Mi Zhou, Bo Li, Ming-Ming Cheng, Qibin Hou Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16649v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

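Scoring rationale: The paper “Mixture of Style Experts for Diverse Image Stylization” addresses image stylization, a computer vision task, using a framework built on diffusion models and a Mixture of Experts (MoE). Its core innovation is applying the MoE architecture to stylization so as to handle diverse styles ranging from shallow textures to deep semantics. It is therefore highly relevant only to the keyword “Mixture of Experts” OR “MoE” OR “Sparse Models” (10 points), since MoE is its core method. All other keywords concern large language models (LLMs) or specific NLP/large-model techniques (such as RLHF, RAG, and quantization), whereas this paper studies image processing and involves no language model or related technique, so the remaining keywords are scored 0. The weighted total is 10.0 (10.0 × 1.0).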
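The scoring arithmetic can be checked with a short sketch. It assumes, as the configuration section and the score tables suggest, that each row's score is weight × relevance, that the total is the sum over rows, and that the passing threshold is 5.0 + 0.8 × total weight. The function names `passing_score` and `weighted_total` are illustrative and are not taken from the report generator.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold from the report configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


def weighted_total(rows):
    """Sum of weight * relevance over (weight, relevance) rows."""
    return sum(weight * relevance for weight, relevance in rows)


# 27 keywords, each with weight 1.0; only the MoE keyword scores 10/10 for this paper.
rows = [(1.0, 0.0)] * 27
rows[1] = (1.0, 10.0)

threshold = passing_score(sum(weight for weight, _ in rows))
total = weighted_total(rows)
passed = total >= threshold
print(f"total={total:.1f} threshold={threshold:.1f} passed={passed}")
```

With 27 keywords at weight 1.0 the threshold works out to 26.6, so a paper that matches a single keyword, even at full relevance, cannot pass on its own.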

!!! tip deepseek-chat TL;DR

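To address the limitation that existing diffusion-based stylization methods are confined to color transformations and neglect complex semantics and material details, the paper proposes StyleExpert, a semantic-aware framework based on Mixture of Experts (MoE). A unified style encoder and a similarity-aware gating mechanism dynamically route styles to experts. Experiments show that it outperforms existing methods in preserving semantics and material details and generalizes to unseen styles.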

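Abstract translation

Diffusion-based stylization methods have made notable progress, but existing methods are mostly limited to color-driven transformations and overlook complex semantic and material details. This paper proposes StyleExpert, a semantic-aware framework based on the Mixture of Experts (MoE). The framework uses a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding then drives a similarity-aware gating mechanism that dynamically routes different styles to specific expert modules in the MoE architecture. With this MoE architecture, our method flexibly handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing methods in preserving semantic and material details and generalizes effectively to unseen styles. Our code and collected images are available on the project page: https://hh-lg.github.io/StyleExpert-Page/.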

Abstract

Diffusion-based stylization has advanced significantly, yet existing methods are limited to color-driven transformations, neglecting complex semantics and material details. We introduce StyleExpert, a semantic-aware framework based on the Mixture of Experts (MoE). Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding is then used to condition a similarity-aware gating mechanism, which dynamically routes styles to specialized experts within the MoE architecture. Leveraging this MoE architecture, our method adeptly handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles. Our code and collected images are available at the project page: https://hh-lg.github.io/StyleExpert-Page/.
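The abstract does not give the gating formula. One common way to realize similarity-based routing is to compare the style embedding with a learned key per expert and apply a softmax over the top-k matches, and the sketch below illustrates only that general idea. The names `route`, `expert_keys`, `top_k`, and `temperature`, and the choice of cosine similarity, are assumptions and not details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def route(style_embedding, expert_keys, top_k=2, temperature=0.1):
    """Hypothetical similarity-aware gate: softmax over the top-k most similar experts.

    Returns a dict mapping expert index -> mixing weight (weights sum to 1).
    """
    sims = [cosine(style_embedding, key) for key in expert_keys]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
    peak = max(sims[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp((sims[i] - peak) / temperature) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


# Three toy expert keys in a 2-D style space; the query is closest to expert 0.
expert_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = route([0.9, 0.1], expert_keys)
print(weights)
```

A lower `temperature` sharpens the gate toward the single nearest expert, while a higher one blends the selected experts more evenly.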

Keywords: Mixture of Experts, Image Stylization, Diffusion Models, Semantic-aware Framework, Style Encoder, Gating Mechanism, Material Details, Generalization

22. ❌ S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

Authors: Haodong Yan, Zhide Zhong, Jiaguan Zhu, Junjie He, Weilin Yuan, Wenxuan Song, Xin Gong, Yingjie Cai, Guanyi Zhao, Xu Yan, Bingbing Liu, Ying-Cong Chen, Haoang Li Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16195v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

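Scoring rationale: The S-VAM paper focuses on video action models (VAMs) for robot learning and proposes an efficient model that achieves single-step inference through self-distillation. Its core is the application of vision foundation models (VFMs) and diffusion models; it belongs to computer vision and robotics rather than being a direct study of large language models (LLMs). It therefore has only an indirect connection to the keyword “Large Language Models” OR “LLMs” OR “Foundation Models” (it mentions a “vision foundation model”, which falls under foundation models) and receives 5 points; all other keywords are unrelated to the paper and receive 0.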

!!! tip deepseek-chat TL;DR

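The paper addresses the inability of video action models to guarantee both real-time inference and high-fidelity foresight in robotic manipulation. It proposes S-VAM, which uses a self-distillation strategy to compress multi-step denoising priors into single-step inference, enabling efficient and precise manipulation in complex environments.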

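Abstract translation

Video action models (VAMs) have become a promising paradigm in robot learning thanks to their strong visual foresight for complex manipulation tasks. However, existing VAMs typically rely on either slow multi-step video generation or noisy one-step feature extraction, and cannot guarantee real-time inference and high-fidelity foresight at the same time. To overcome this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations in a single forward pass. These foreseen representations serve as a stable blueprint and significantly simplify action prediction. To achieve this efficient shortcut, we introduce a novel self-distillation strategy that compresses the structured generative priors of multi-step denoising into one-step inference. Specifically, we extract vision foundation model (VFM) representations from videos generated by the diffusion model's own multi-step process as teacher targets, while lightweight decouplers act as students and learn to map noisy one-step features directly to these targets. Extensive experiments in simulation and the real world show that S-VAM outperforms existing state-of-the-art methods and enables efficient, precise manipulation in complex environments. The project page is https://haodong-yan.github.io/S-VAM/.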

Abstract

Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model’s own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is https://haodong-yan.github.io/S-VAM/

Keywords: Video Action Model, Self-Distillation, Geometric and Semantic Foresight, Vision Foundation Model, Diffusion Model, Robot Learning, Real-time Inference, Manipulation Tasks

23. ❌ GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators

Authors: Mattia Rigotti, Nicholas Thumiger, Thomas Frick Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16849v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

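Scoring rationale: The paper “GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators” focuses on graph neural networks (GNNs) and neural operators, in particular Transformer architecture innovations for graph-structured data and meshes. Its core contribution is a new graph Transformer (GIST) that achieves O(N) complexity through random projections while preserving gauge invariance, addressing the computational-efficiency and generalization problems of existing methods. The paper is applied to scientific computing (e.g., aerodynamic prediction), so it has some connection to the keyword “AI for Science” (5 points). However, it does not touch on large language models (LLMs), the deep learning techniques covered by the other keywords (such as MoE, scaling laws, the various training and alignment techniques, inference optimization, and agents), or specific bioinformatics/cheminformatics applications, so all other keywords are scored 0.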

!!! tip deepseek-chat TL;DR

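The paper addresses the generalization failures of graph Transformers on meshes and graph-structured data that stem from computational complexity and broken gauge invariance. It proposes GIST, a new architecture that achieves O(N) complexity and gauge invariance through random projections and inner-product-based attention, reaching state-of-the-art performance on standard graph benchmarks and large mesh-based neural operator tasks.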

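Abstract translation

Adapting Transformer positional encodings to meshes and graph-structured data poses significant computational challenges: exact spectral methods require cubic-complexity eigendecomposition and may inadvertently break gauge invariance through numerical solver artifacts, while efficient approximate methods sacrifice gauge symmetry by design. Both failure modes lead to catastrophic generalization in inductive learning: a model trained with one set of numerical choices fails completely when it encounters a different spectral decomposition of a similar graph or a different discretization of the same mesh. We propose GIST (Gauge-Invariant Spectral Transformers), a new graph Transformer architecture that resolves this problem by achieving end-to-end $\mathcal{O}(N)$ complexity through random projections while preserving gauge invariance at the algorithmic level via inner-product-based attention on the projected embeddings. We prove that GIST achieves discretization-invariant learning with bounded mismatch error, so that parameters in neural operator applications can transfer across arbitrary mesh resolutions. Empirically, GIST reaches the state of the art on standard graph benchmarks (e.g., 99.50% micro-F1 on PPI) and uniquely scales to mesh-based neural operator benchmarks with up to 750K nodes, achieving state-of-the-art aerodynamic prediction on the challenging DrivAerNet and DrivAerNet++ datasets.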

Abstract

Adapting transformer positional encoding to meshes and graph-structured data presents significant computational challenges: exact spectral methods require cubic-complexity eigendecomposition and can inadvertently break gauge invariance through numerical solver artifacts, while efficient approximate methods sacrifice gauge symmetry by design. Both failure modes cause catastrophic generalization in inductive learning, where models trained with one set of numerical choices fail when encountering different spectral decompositions of similar graphs or discretizations of the same mesh. We propose GIST (Gauge-Invariant Spectral Transformers), a new graph transformer architecture that resolves this challenge by achieving end-to-end $\mathcal{O}(N)$ complexity through random projections while algorithmically preserving gauge invariance via inner-product-based attention on the projected embeddings. We prove GIST achieves discretization-invariant learning with bounded mismatch error, enabling parameter transfer across arbitrary mesh resolutions for neural operator applications. Empirically, GIST matches state-of-the-art on standard graph benchmarks (e.g., achieving 99.50% micro-F1 on PPI) while uniquely scaling to mesh-based Neural Operator benchmarks with up to 750K nodes, achieving state-of-the-art aerodynamic prediction on the challenging DrivAerNet and DrivAerNet++ datasets.
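To make the invariance argument concrete, here is a toy sketch of why inner products are a natural gauge-invariant quantity. Eigenvectors are defined only up to sign (and up to rotation within a repeated eigenvalue's eigenspace), so two solvers can return different spectral embeddings for the same graph. The sketch covers only the sign-flip case on made-up numbers; the helper names `gram` and `flip_columns` are assumptions, and GIST's random projections and attention layers are not modeled.

```python
def gram(rows):
    """Matrix of pairwise inner products between per-node embedding rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def flip_columns(rows, signs):
    """Apply a per-eigenvector sign choice (+1 or -1) to every node embedding."""
    return [[x * s for x, s in zip(row, signs)] for row in rows]


# Toy spectral embeddings: 3 nodes (rows) by 3 eigenvectors (columns).
embeddings = [[0.5, -0.7, 0.1], [0.5, 0.2, -0.8], [0.5, 0.7, 0.6]]

# A different solver might return the second eigenvector with the opposite sign.
flipped = flip_columns(embeddings, [1.0, -1.0, 1.0])

print(flipped != embeddings)              # the raw coordinates change
print(gram(flipped) == gram(embeddings))  # the pairwise inner products do not
```

A model that consumes the raw coordinates sees two different inputs here, whereas one that depends only on pairwise inner products sees the same input in both cases.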

Keywords: Graph Transformer, Gauge Invariance, Spectral Methods, Neural Operators, Scalable Graph Neural Networks, Mesh-based Learning, Computational Efficiency, Aerodynamic Prediction

24. ❌ ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K

Authors: Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan-ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, Ping Luo Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16866v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

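Scoring rationale: The ManiTwin paper focuses on data generation for robotic manipulation simulation, proposing an automated pipeline that generates 3D digital assets from a single image and builds a large-scale dataset. All scoring keywords concern large-model or deep learning techniques (such as LLMs, MoE, training methods, and inference optimization) or specific AI-for-science applications (such as bioinformatics), whereas this paper centers on 3D reconstruction, dataset construction, and robot simulation. It involves no large-model technique or deep learning innovation and does not apply AI in scientific fields such as biology or chemistry, so the relevance of every keyword is 0.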

!!! tip deepseek-chat TL;DR

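The study addresses the lack of large-scale, diverse, data-generation-ready digital assets in robot simulation. It develops an automated pipeline that generates simulation-ready 3D assets from a single image and builds ManiTwin-100K, a dataset of 100K high-quality annotated 3D assets, laying a foundation for scalable simulation data synthesis and policy learning.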

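Abstract translation

Learning in simulation provides an important foundation for scaling robotic manipulation capabilities. However, this paradigm often lacks data-generation-ready digital assets in both scale and diversity. This work presents ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline turns a single image into a simulation-ready, semantically annotated 3D asset, supporting large-scale robotic manipulation data generation. Using this pipeline, we build the ManiTwin-100K dataset, which contains 100K high-quality annotated 3D assets. Each asset comes with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments show that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality, diverse assets for manipulation data generation, random scene synthesis, and visual question answering (VQA) data generation, laying a solid foundation for scalable simulation data synthesis and policy learning. Project webpage: https://manitwin.github.io/.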

Abstract

Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into simulation-ready and semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.

Keywords: robotic manipulation, simulation data generation, 3D asset generation, digital object twins, large-scale dataset, automated pipeline, semantic annotation, ManiTwin-100K

25. ❌ MessyKitchens: Contact-rich object-level 3D scene reconstruction

Authors: Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati, Ivan Laptev Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16868v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

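Scoring rationale: The paper focuses on 3D scene reconstruction in computer vision, specifically object-level reconstruction and contact modeling from monocular images. It uses modern neural architectures (such as SAM 3D) and large-scale datasets but involves no large language model (LLM), no innovation in the deep learning techniques covered by the keywords, and no specific AI for Science application. None of the keywords is relevant to it, so every keyword receives a relevance of 0.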

!!! tip deepseek-chat TL;DR

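The paper presents the MessyKitchens dataset and a Multi-Object Decoder (MOD) built on SAM 3D to tackle object-level 3D reconstruction and physical contact modeling in cluttered scenes, significantly improving reconstruction accuracy and reducing object penetration across multiple datasets.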

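Abstract translation

Monocular 3D scene reconstruction has made significant progress recently. With modern neural architectures and large-scale data, existing methods achieve high performance in single-image depth estimation. However, reconstructing common scenes and decomposing them into individual 3D objects remains a hard challenge because of the wide variety of objects, frequent occlusions, and complex relations between objects. Notably, beyond estimating the shape and pose of individual objects, applications such as robotics and animation require physically plausible 3D reconstruction, in which objects obey the physical principles of non-penetration and realistic contact. This work advances object-level scene reconstruction in two directions. First, we present the MessyKitchens dataset, which contains real-world cluttered scenes and provides high-fidelity object-level ground truth in the form of 3D object shapes, poses, and accurate object contacts. Second, building on the recent single-object reconstruction method SAM 3D, we introduce a Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we show that MessyKitchens significantly improves on existing datasets in registration accuracy and inter-object penetration error. We also compare multi-object reconstruction methods on three datasets, and the results show that MOD delivers consistent and significant improvements over the state of the art. Our new benchmark dataset, code, and pre-trained models will be released on the project website: https://messykitchens.github.io/.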

Abstract

Monocular 3D scene reconstruction has recently seen significant progress. Powered by the modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate MessyKitchens to significantly improve previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: https://messykitchens.github.io/.

Keywords: 3D scene reconstruction, object-level reconstruction, monocular reconstruction, contact modeling, SAM 3D, Multi-Object Decoder, cluttered environments, physical plausibility

26. ❌ Demystifing Video Reasoning

Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16870v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	7.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	9.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

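Scoring rationale: The paper studies reasoning mechanisms in video generation models and is unrelated to most of the keywords specific to large language models (LLMs). The relevant keywords are: (1) “Chain of Thought OR CoT Reasoning OR Multi-step Reasoning” (8 points): the paper challenges the Chain-of-Frames mechanism and proposes the Chain-of-Steps (CoS) mechanism, which involves a multi-step reasoning process; (2) “System 2 Thinking OR Slow Thinking OR In-depth Reasoning” (7 points): the paper analyzes the model's in-depth reasoning behavior across denoising steps; (3) “Self-Correction OR Self-Improvement OR Self-Reflection” (9 points): the paper explicitly identifies self-correction and enhancement as a key reasoning behavior; (4) “Mechanistic Interpretability OR Explainable AI” (8 points): the paper systematically studies the reasoning mechanism through qualitative analysis and probing experiments, which falls under explainable AI. The other keywords have no direct connection to research on video generation models.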

!!! tip deepseek-chat TL;DR

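The paper challenges the assumption that reasoning in video generation models unfolds across the frame sequence (Chain-of-Frames) and reveals a new mechanism in which reasoning emerges mainly along the diffusion denoising steps (Chain-of-Steps). It also identifies key reasoning behaviors, including working memory, self-correction, and perception before action, providing a basis for using video models as a new substrate for intelligence.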

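Abstract translation

Recent research in video generation has revealed an unexpected phenomenon: diffusion-based video models exhibit notable reasoning capabilities. Prior work attributes this to a Chain-of-Frames mechanism, which assumes that reasoning unfolds sequentially across video frames. This work challenges that assumption and reveals a fundamentally different mechanism. We show that reasoning in video models arises mainly along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and gradually converge to a final answer, a process we call Chain-of-Steps. Beyond this core mechanism, we identify several emergent reasoning behaviors that are critical to model performance: (1) working memory, which supports persistent reference; (2) self-correction and enhancement, which allows recovery from incorrect intermediate solutions; and (3) perception before action, in which early steps establish semantic grounding and later steps carry out structured manipulation. Within a single diffusion step, we further find self-evolved functional specialization inside Diffusion Transformers: early layers encode dense perceptual structure, middle layers perform reasoning, and later layers consolidate latent representations. Based on these findings, we propose a simple training-free strategy as a proof of concept, showing how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, this work gives a systematic account of how reasoning emerges in video generation models and lays a foundation for future research on better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.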

Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

Keywords: video generation models, reasoning mechanisms, diffusion denoising steps, Chain-of-Steps (CoS), self-correction, working memory, perception before action, mechanistic interpretability

Authors: Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16859v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

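Scoring rationale: The paper evaluates the social interaction capabilities of omni-modal large language models (OLMs) and is highly relevant to “Large Language Models OR LLMs OR Foundation Models” (10 points), because OLMs are a multimodal extension of LLMs. It focuses on benchmark construction and analysis of model capabilities and does not involve the technical details of the other keywords (such as MoE, SFT, and RAG) or specific application domains (such as bioinformatics), so the other keywords are scored 0.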

!!! tip deepseek-chat TL;DR

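The paper proposes the SocialOmni benchmark for evaluating how well omni-modal large language models handle dynamic social cues in natural dialogue (such as speaker identification, interruption timing, and interruption generation). It finds a marked disconnect in existing models between perceptual accuracy and the ability to generate contextually appropriate interruptions.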

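Abstract translation

Omni-modal large language models redefine human-machine interaction by natively integrating audio, vision, and text. However, existing benchmarks for these models remain confined to static, accuracy-centric tasks, leaving a critical gap in evaluating social interactivity, the core ability to handle dynamic cues in natural dialogue. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of conversational interactivity along three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni contains 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, supplemented by controlled audio-visual inconsistency scenarios that test model robustness. We benchmarked 12 leading omni-modal large language models, and the results reveal significant differences in social-interaction capability across models. Further analysis shows a clear decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, which indicates that understanding-centric metrics alone are not enough to characterize conversational social competence. More encouragingly, the diagnostics from SocialOmni yield actionable signals for bridging the gap between perception and interaction in future omni-modal large language models.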

Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmarked 12 leading OLMs, which uncovers significant variance in their social-interaction capabilities across models. Furthermore, our analysis reveals a pronounced decoupling between a model’s perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.

Keywords: Omni-modal large language models, Social interactivity, Benchmark evaluation, Audio-visual integration, Conversational AI, Interruption generation, Perception-interaction divide, Model robustness

28. ❌ SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Authors: Jiongze Yu, Xiangbo Gao, Pooja Verlani, Akshay Gadde, Yilin Wang, Balu Adsumilli, Zhengzhong Tu | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16864v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

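Scoring rationale: The paper focuses on SparkVSR, an interactive video super-resolution (VSR) framework that achieves controllable video enhancement through sparse keyframe propagation. The work belongs to computer vision and video processing; its core is video restoration, keyframe-conditioned training, and interactive control. The scoring keywords all concern large models, the principles of deep learning techniques, and their applications in science, but the paper touches none of these topics: large or small language models (LLM/SLM), MoE, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, attention optimization, reasoning techniques, agents, quantization, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for Science. Its contribution is the design of a video processing framework rather than large-model techniques or their scientific applications, so every keyword receives a relevance of 0.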

!!! tip deepseek-chat TL;DR

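The paper proposes SparkVSR, an interactive video super-resolution framework based on sparse keyframe propagation that lets users correct restoration results by controlling keyframes. Experiments on multiple VSR benchmarks show improved temporal consistency and restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively.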

Abstract (Chinese translation)

视频超分辨率（Video Super-Resolution，VSR）旨在从低分辨率（Low-Resolution，LR）估计中恢复高质量的视频帧，然而现有的大多数VSR方法在推理时如同黑盒：用户无法可靠地修正意外的伪影，而只能接受模型生成的任何结果。本文提出了一种新颖的交互式VSR框架，称为SparkVSR，它将稀疏关键帧作为一种简洁而富有表现力的控制信号。具体而言，用户可以首先使用任何现成的图像超分辨率（Image Super-Resolution，ISR）模型对少量关键帧进行超分辨率处理（或选择性处理），随后SparkVSR将关键帧先验传播至整个视频序列，同时保持以原始低分辨率视频运动为基础。我们设计了一种关键帧条件化的潜在-像素两阶段训练流程，该流程将低分辨率视频潜在特征与稀疏编码的高分辨率关键帧潜在特征相融合，以学习鲁棒的跨空间传播并优化感知细节。在推理阶段，SparkVSR支持灵活的关键帧选择方式（手动指定、编解码器I帧提取或随机采样），并采用一种无参考的引导机制，持续平衡对关键帧的遵循与盲恢复，确保即使在参考关键帧缺失或不完美时也能实现鲁棒性能。在多个VSR基准测试上的实验表明，该方法提升了时间一致性并实现了强大的恢复质量，在CLIP-IQA、DOVER和MUSIQ指标上分别超越基线高达24.6%、21.8%和5.6%，实现了可控的、关键帧驱动的视频超分辨率。此外，我们证明SparkVSR是一种通用的交互式、关键帧条件化视频处理框架，因为它可以直接应用于未见过的任务，如老电影修复和视频风格迁移。我们的项目页面位于：https://sparkvsr.github.io/

Abstract

Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users can first super-resolve or optionally a small set of keyframes using any off-the-shelf image super-resolution (ISR) model, then SparkVSR propagates the keyframe priors to the entire video sequence while remaining grounded by the original LR video motion. Concretely, we introduce a keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework as it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: https://sparkvsr.github.io/

Keywords: Video Super-Resolution, Interactive Framework, Sparse Keyframe Propagation, Keyframe-conditioned Training, Temporal Consistency, Perceptual Detail Refinement, Controllable Video Enhancement, Reference-free Guidance

29. ❌ SOMA: Unifying Parametric Human Body Models

Authors: Jun Saito, Jiefeng Li, Michael de Ruyter, Miguel Guerrero, Edy Lim, Ehsan Hassani, Roger Blanco Ribera, Hyejin Moon, Magdalena Dadela, Marco Di Lucca, Qiao Wang, Xueting Li, Jan Kautz, Simon Yuen, Umar Iqbal | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16858v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

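Scoring rationale: The paper "SOMA: Unifying Parametric Human Body Models" focuses on human body modeling in computer graphics and proposes a framework that unifies different parametric body models (such as SMPL and SMPL-X). The scoring keywords all relate directly to large models, the principles of deep learning techniques, or AI applications in science, whereas this paper studies a domain-specific method for unifying 3D human body models. It does not involve large models, deep learning techniques, or AI applications in scientific fields such as biomedicine, so it is unrelated to every keyword and scores 0 on all of them.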

!!! tip deepseek-chat TL;DR

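The paper addresses the mutual incompatibility of parametric human body models (such as SMPL and SMPL-X), which differ in mesh topology, skeletal structure, and other conventions. It proposes SOMA, a unified framework whose three abstraction layers (mesh topology, skeleton, and pose) allow identity and motion data from heterogeneous models to be mixed freely and processed in an end-to-end differentiable pipeline.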

Abstract (Chinese translation)

参数化人体模型是人体重建、动画与仿真的基础，然而现有模型之间仍存在互不兼容的问题：SMPL、SMPL-X、MHR、Anny 及相关模型在网格拓扑、骨骼结构、形状参数化与单位约定上各有差异，导致难以在单一流程中综合利用它们的互补优势。本文提出 SOMA，一种通过三层抽象机制桥接这些异构表示的统一人体层。网格拓扑抽象层可将任何源模型的身份特征以每顶点恒定时间映射至共享规范网格；骨骼抽象层通过单次闭式计算（无需迭代优化或针对各模型的专门训练），从任意身体形状（无论是静止姿态还是任意姿态配置）恢复出一整套适应身份信息的关节变换；姿态抽象层通过逆向蒙皮流程，直接从任何支持模型的姿态顶点中恢复出统一的骨骼旋转，使得异构运动数据集无需定制重定向即可使用。这些抽象层共同将原本 $O(M^2)$ 的逐对适配问题简化为 $O(M)$ 的单后端连接器，让使用者能够在推理阶段自由混合不同身份来源与姿态数据。整个流程通过 NVIDIA-Warp 实现完全端到端可微分且 GPU 加速。

Abstract

Parametric human body models are foundational to human reconstruction, animation, and simulation, yet they remain mutually incompatible: SMPL, SMPL-X, MHR, Anny, and related models each diverge in mesh topology, skeletal structure, shape parameterization, and unit convention, making it impractical to exploit their complementary strengths within a single pipeline. We present SOMA, a unified body layer that bridges these heterogeneous representations through three abstraction layers. Mesh topology abstraction maps any source model’s identity to a shared canonical mesh in constant time per vertex. Skeletal abstraction recovers a full set of identity-adapted joint transforms from any body shape, whether in rest pose or an arbitrary posed configuration, in a single closed-form pass, with no iterative optimization or per-model training. Pose abstraction inverts the skinning pipeline to recover unified skeleton rotations directly from posed vertices of any supported model, enabling heterogeneous motion datasets to be consumed without custom retargeting. Together, these layers reduce the $O(M^2)$ per-pair adapter problem to $O(M)$ single-backend connectors, letting practitioners freely mix identity sources and pose data at inference time. The entire pipeline is fully differentiable end-to-end and GPU-accelerated via NVIDIA-Warp.

Keywords: parametric human body models, SOMA, mesh topology abstraction, skeletal abstraction, pose abstraction, unified body layer, GPU-accelerated, differentiable pipeline
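
As a rough worked count of the adapter reduction described in the SOMA abstract, assume $M$ mutually incompatible body models and that each ordered pair of models would otherwise need its own adapter. The counting convention is an assumption made here for illustration; the abstract states only the asymptotic orders.

$$
\underbrace{M(M-1)}_{\text{pairwise adapters}} = O(M^2)
\quad\longrightarrow\quad
\underbrace{M}_{\text{connectors to one shared backend}} = O(M)
$$

For the four models named in the abstract (SMPL, SMPL-X, MHR, Anny), $M = 4$ gives $4 \cdot 3 = 12$ directed pairwise adapters versus 4 connectors, and the gap widens quadratically as more models are added.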

30. ❌ Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks

Authors: Xavier Gonzalez | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16850v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

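Scoring rationale: The paper is a theoretical and methodological study of parallel computation algorithms, specifically Newton methods, for accelerating dynamical systems such as RNNs and MCMC. Although these techniques could apply indirectly to large-model training or inference optimization, the paper does not directly discuss any large model, deep learning principle, AI application, or specific technique named in the scoring keywords. All keywords concern large-model techniques, training methods, application domains, or specific AI techniques, while the core of this paper is numerical methods and parallel computing theory, so every keyword receives a relevance of 0.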

!!! tip deepseek-chat TL;DR

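The paper addresses the inefficiency, instability, and lack of convergence guarantees of parallel Newton methods for accelerating dynamical systems such as recurrent neural networks and Markov chain Monte Carlo. It develops scalable, stable parallel algorithms based on quasi-Newton and trust-region methods, establishes a convergence theory, and identifies the conditions under which a dynamical system can be parallelized effectively.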

Abstract (Chinese translation)

大规模并行硬件（GPU）与长序列数据使得并行算法成为大规模机器学习的关键。然而，动态系统——如循环神经网络与马尔可夫链蒙特卡洛方法——曾被认为受限于序列计算的瓶颈。近期研究指出，通过将动态系统的评估重构为非线性方程组，并利用并行关联扫描结合牛顿法求解，此类系统实际上可沿序列长度实现并行化。然而，这些并行牛顿方法存在明显局限，主要包括效率低下、稳定性不足以及缺乏收敛性保证。本论文通过方法学与理论上的贡献，特别是借鉴优化领域的思路，针对这些局限提出了解决方案。在方法学上，我们基于拟牛顿与信赖域方法，开发了可扩展且稳定的并行牛顿方法。拟牛顿方法速度更快、内存效率更高，而信赖域方法则显著提升了稳定性。在理论上，我们将包括皮卡迭代与雅可比迭代在内的多种不动点方法统一纳入并行牛顿框架中，并为这些技术建立了线性收敛速率，该速率取决于方法的近似精度与稳定性。此外，我们基于动态稳定性提出了一个精确条件，用以刻画并行化何时可被证明加速动态系统、何时无法实现加速。具体而言，动态系统的最大李雅普诺夫指数的符号决定了并行牛顿方法能否快速收敛。总之，本论文为序列计算的并行化提供了可扩展且稳定的方法，并为这些技术何时有效、何时无效奠定了坚实的理论基础。本论文也可作为并行牛顿方法的指南，为希望在这一持续发展的领域中书写新篇章的研究者提供参考。

Abstract

Massively parallel hardware (GPUs) and long sequence data have made parallel algorithms essential for machine learning at scale. Yet dynamical systems, like recurrent neural networks and Markov chain Monte Carlo, were thought to suffer from sequential bottlenecks. Recent work showed that dynamical systems can in fact be parallelized across the sequence length by reframing their evaluation as a system of nonlinear equations, which can be solved with Newton’s method using a parallel associative scan. However, these parallel Newton methods struggled with limitations, primarily inefficiency, instability, and lack of convergence guarantees. This thesis addresses these limitations with methodological and theoretical contributions, drawing particularly from optimization. Methodologically, we develop scalable and stable parallel Newton methods, based on quasi-Newton and trust-region approaches. The quasi-Newton methods are faster and more memory efficient, while the trust-region approaches are significantly more stable. Theoretically, we unify many fixed-point methods into our parallel Newton framework, including Picard and Jacobi iterations. We establish a linear convergence rate for these techniques that depends on the method’s approximation accuracy and stability. Moreover, we give a precise condition, rooted in dynamical stability, that characterizes when parallelization provably accelerates a dynamical system and when it cannot. Specifically, the sign of the Largest Lyapunov Exponent of a dynamical system determines whether or not parallel Newton methods converge quickly. In sum, this thesis unlocks scalable and stable methods for parallelizing sequential computation, and provides a firm theoretical basis for when such techniques will and will not work. This thesis also serves as a guide to parallel Newton methods for researchers who want to write the next chapter in this ongoing story.

Keywords: parallel Newton methods, dynamical systems, sequential computation, quasi-Newton methods, trust-region methods, convergence analysis, parallel algorithms, nonlinear equations
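
The core idea in the abstract (treat the whole trajectory as a system of nonlinear equations, then solve it with Newton's method and an associative scan) can be sketched for a scalar recurrence $x_t = f(x_{t-1})$. This is a minimal illustration under stated assumptions, not the thesis's implementation: the state is one-dimensional, NumPy vectorization stands in for a GPU scan, the function names `linear_scan` and `parallel_newton` are hypothetical, and the quasi-Newton and trust-region variants are not shown.

```python
import numpy as np

def linear_scan(a, b):
    """Solve x_t = a[t] * x_{t-1} + b[t] for all t with an inclusive scan.

    Returns (A, B) such that x_t = A[t] * x_0 + B[t]. Each pass of the
    while-loop is one vectorized step, so about log2(T) passes are needed.
    """
    a, b = a.astype(float).copy(), b.astype(float).copy()
    shift = 1
    while shift < len(a):
        # Affine map located `shift` steps earlier, padded with the identity (1, 0).
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        # Compose the later map with the earlier one: x -> a * (a_prev * x + b_prev) + b.
        b = a * b_prev + b
        a = a * a_prev
        shift *= 2
    return a, b

def parallel_newton(f, df, x0, T, iters):
    """Newton iterations on the full trajectory x_1..x_T of x_t = f(x_{t-1})."""
    x = np.full(T, float(x0))  # initial guess for the whole trajectory
    for _ in range(iters):
        prev = np.concatenate([[x0], x[:-1]])  # current guess for x_0..x_{T-1}
        a = df(prev)               # linearization slope at each step
        b = f(prev) - a * prev     # linearization offset at each step
        A, B = linear_scan(a, b)   # solve the linearized recurrence in one scan
        x = A * x0 + B
    return x

def sequential(f, x0, T):
    """Reference: evaluate the recurrence one step at a time."""
    xs, x = [], x0
    for _ in range(T):
        x = f(x)
        xs.append(x)
    return np.array(xs)

def f(x):
    # Contractive example dynamics (|f'| <= 0.9), chosen only for illustration.
    return np.tanh(0.9 * x + 0.2)

def df(x):
    return 0.9 * (1.0 - np.tanh(0.9 * x + 0.2) ** 2)

x_par = parallel_newton(f, df, x0=0.5, T=32, iters=32)
x_seq = sequential(f, x0=0.5, T=32)
```

In exact arithmetic each Newton iteration fixes at least one more leading time step, so `iters = T` always reproduces the sequential result; the abstract's point is that stable dynamics need far fewer iterations than that.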

31. ❌ Internalizing Agency from Reflective Experience

Authors: Rui Ge, Yichao Fu, Yuyang Qian, Junda Su, Yiming Zhao, Peng Zhao, Hao Zhang | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16843v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

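Scoring rationale: The paper studies LLMs acting as autonomous agents in long-horizon interaction and proposes the LEAFE framework, which learns feedback-grounded agency from reflective experience. The highly relevant keywords are: LLMs (the object of study), Post-training/SFT (supervised fine-tuning is used to distill experience), Self-Correction/Self-Improvement (self-correction through reflection and backtracking), and LLM Agents (LLMs are studied as autonomous agents). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and CoT are not mentioned in the abstract and are unrelated to the paper.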

!!! tip deepseek-chat TL;DR

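The paper targets a problem in LLM agents for long-horizon tasks: over-reliance on outcome-driven training narrows the distribution of behaviors the policy can produce. It proposes LEAFE, a framework that learns feedback-grounded agency from reflective experience. On interactive coding and agentic tasks, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines, with gains of up to 14% on Pass@128.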

Abstract (Chinese translation)

大型语言模型正日益作为自主智能体被部署，这些智能体必须通过与提供丰富反馈的环境进行长程交互来规划、行动并从错误中恢复。然而，当前主流的结果驱动式后训练方法（例如，带有可验证奖励的强化学习）主要优化最终的成功信号，未能充分利用丰富的环境反馈。这常常导致分布锐化现象：策略在重现一组狭窄的已成功行为方面变得更强，却未能提升在长程场景中扩展问题解决能力（例如Pass@k指标）所需的、基于反馈的自主行为能力。

为解决这一问题，我们提出了LEAFE（从反思性经验中学习基于反馈的自主能力）框架，该框架通过反思性经验内化恢复能力。具体而言，在探索过程中，智能体将环境反馈总结为可操作的经验，回溯至早期的决策点，并以修正后的动作探索替代分支。随后，我们通过监督式微调将这些经验引导的修正提炼到模型中，使策略在未来的交互中能更有效地恢复。在固定交互预算下的一系列交互式编程和智能体任务中，LEAFE相较于基础模型持续提升了Pass@1指标，并且在Pass@k指标上超越了结果驱动基线方法（如GRPO）以及基于经验的方法（如早期经验法），在Pass@128上最高获得了14%的性能提升。

Abstract

Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.

Keywords: Large Language Models, Autonomous Agents, Supervised Fine-tuning, Self-Correction, Feedback-Grounded Agency, Reflective Experience, Long-horizon Interaction, Pass@k Improvement
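
The abstract reports its results as Pass@1 and Pass@k. As a minimal sketch, a commonly used unbiased estimator computes Pass@k from `n` sampled attempts of which `c` succeed as $1 - \binom{n-c}{k} / \binom{n}{k}$. Whether the paper uses this exact estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts succeeds.

    n: total attempts sampled for a task
    c: number of those attempts that succeeded
    k: attempt budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    # 1 minus the probability that k attempts drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this estimator, a policy that only sharpens already-successful behaviors can raise Pass@1 while leaving Pass@k at large `k` unchanged, which is the distinction the abstract draws.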

32. ❌ Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

Authors: Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16839v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

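Scoring rationale: The paper studies LLM agents that generate presentations automatically through tool use. It fine-tunes Qwen2.5-Coder-7B with GRPO, a reinforcement learning algorithm, while training only 0.5% of the parameters, which makes the setup parameter-efficient. It is therefore highly relevant to "LLM Agents", "Tool Use", "PEFT", and "Large Language Models" (10 points each). "Instruction Tuning" is partially relevant (5 points) because the evaluation involves instruction adherence. Other keywords such as MoE, SLMs, and RAG are not covered (0 points).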

!!! tip deepseek-chat TL;DR

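The paper proposes an LLM-agent approach to automated presentation generation. Using tool use and parameter-efficient fine-tuning (only 0.5% of parameters trained), its 7B model reaches 91.2% of the quality of Claude Opus 4.6.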

Abstract (Chinese translation)

自动化演示文稿生成仍是一项具有挑战性的任务，需要协调内容创作、视觉设计和面向受众的沟通。本研究提出了一个兼容OpenEnv的强化学习环境，使大语言模型智能体能够通过学习使用工具来研究主题、规划内容并生成专业的HTML幻灯片演示文稿。我们引入了一个多组件奖励系统，该系统结合了结构验证、渲染质量评估、基于大语言模型的审美评分、内容质量指标，以及一项用于衡量生成幻灯片传达其预设目标的忠实度的逆向规范奖励。该逆向规范奖励采用“逆向任务”形式，即让一个大语言模型尝试从生成的幻灯片中还原原始任务要求，从而提供整体质量信号。我们的方法通过GRPO对Qwen2.5-Coder-7B进行微调，仅使用基于Claude Opus 4.6收集的专家演示数据构建的提示词对0.5%的参数进行训练。在涵盖48个多样化商业简报的六种模型实验中，我们的微调7B模型达到了Claude Opus 4.6质量的91.2%，同时较基础模型提升了33.1%。六模型对比分析表明，任务指令遵循度和工具使用合规性——而非原始参数量——决定了智能体任务性能。我们贡献了SlideRL，这是一个包含所有六种模型共288条多轮次运行轨迹的开源数据集：https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts 代码仓库：https://github.com/pushing-the-frontier/slide-forge-llm

Abstract

Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an “inverse task” where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6’s quality while improving 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts Code: https://github.com/pushing-the-frontier/slide-forge-llm

Keywords: LLM agents, tool use, parameter-efficient fine-tuning, presentation generation, reinforcement learning, inverse specification reward, GRPO, SlideRL dataset
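
To make the reward pipeline in the abstract concrete, the sketch below combines the five named reward components into one scalar and then computes group-relative advantages in the style of GRPO, where each rollout's reward is normalized against the other rollouts sampled for the same prompt. The equal weights, the component key names, and the use of the population standard deviation are all assumptions for illustration; the abstract does not state how the components are combined.

```python
from statistics import fmean, pstdev

# Hypothetical equal weights: the abstract names the components but not their weighting.
WEIGHTS = {
    "structure": 0.2,     # structural validation
    "render": 0.2,        # render quality assessment
    "aesthetic": 0.2,     # LLM-based aesthetic scoring
    "content": 0.2,       # content quality metrics
    "inverse_spec": 0.2,  # inverse specification reward
}

def total_reward(components: dict) -> float:
    """Weighted sum of per-component rewards, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize each rollout's reward by the mean and std of its sampling group."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

A group whose rollouts all receive the same reward yields zero advantages, so such a prompt contributes no learning signal in this formulation.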

33. ❌ Prompt Programming for Cultural Bias and Alignment of Large Language Models

Authors: Maksim Eren, Eric Michalak, Brian Cook, Johnny Seales | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16827v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

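Scoring rationale: The paper studies the cultural alignment of large language models (LLMs) and optimizes culture-conditioned prompts through prompt programming with DSPy, which falls under LLM alignment techniques. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" and "Instruction Tuning OR Alignment OR Value Alignment" (10 points each). Other keywords, such as MoE, quantization, inference acceleration, and AI for Science applications, are not covered by the paper and score 0.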

!!! tip deepseek-chat TL;DR

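The paper studies how prompt programming with DSPy can optimize culture-conditioned prompts for large language models, reducing cultural bias and improving alignment with target populations. Its experiments show that prompt optimization often improves upon manual cultural prompt engineering, suggesting a more stable and transferable route to culturally aligned responses.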

Abstract (Chinese translation)

文化塑造推理方式、价值观念、优先级排序与战略决策，然而大型语言模型常表现出与目标群体不符的文化偏见。随着大型语言模型日益广泛应用于战略决策、政策支持以及摘要生成、分类、合规导向审计等文档工程任务，提升文化对齐对于确保下游分析与建议能反映目标群体的价值取向而非模型默认先验至关重要。先前研究提出了基于调查的文化对齐框架，并证明特定文化提示能减少偏差，但该工作主要评估了闭源模型且依赖人工提示工程。本文通过复现其社会科学调查的投影与距离度量方法，在开源权重大型语言模型上验证并拓展该框架，检验相同的文化偏差与文化条件调节的效益是否在封闭系统之外依然存在。在此基础上，我们首次将DSPy提示编程技术应用于此问题——将提示视为模块化、可优化的程序——通过针对文化距离目标进行优化，实现文化条件调节的系统化调优。实验表明，提示优化通常能超越人工文化提示工程的效果，这提示采用DSPy进行提示编译可为获得文化对齐的模型响应提供更稳定、可迁移的路径。

Abstract

Culture shapes reasoning, values, prioritization, and strategic decision-making, yet large language models (LLMs) often exhibit cultural biases that misalign with target populations. As LLMs are increasingly used for strategic decision-making, policy support, and document engineering tasks such as summarization, categorization, and compliance-oriented auditing, improving cultural alignment is important for ensuring that downstream analyses and recommendations reflect target-population value profiles rather than default model priors. Previous work introduced a survey-grounded cultural alignment framework and showed that culture-specific prompting can reduce misalignment, but it primarily evaluated proprietary models and relied on manual prompt engineering. In this paper, we validate and extend that framework by reproducing its social sciences survey-based projection and distance metrics on open-weight LLMs, testing whether the same cultural skew and benefits of culture conditioning persist outside closed LLM systems. Building on this foundation, we introduce the use of prompt programming with DSPy for this problem (treating prompts as modular, optimizable programs) to systematically tune cultural conditioning by optimizing against cultural-distance objectives. In our experiments, we show that prompt optimization often improves upon cultural prompt engineering, suggesting prompt compilation with DSPy can provide a more stable and transferable route to culturally aligned LLM responses.

Keywords: Large Language Models, Cultural Alignment, Prompt Programming, DSPy, Cultural Bias, Prompt Optimization, Value Alignment

34. ❌ Real-Time Decoding of Movement Onset and Offset for Brain-Controlled Rehabilitation Exoskeleton

Authors: Kanishka Mitra, Satyam Kumar, Frigyes Samuel Racz, Deland Liu, Ashish D. Deshpande, José del R. Millán | Source: arXiv | Published: 2026-03-17 | arXiv link: http://arxiv.org/abs/2603.16825v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

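Scoring rationale: The paper studies an EEG-controlled rehabilitation exoskeleton system, which belongs to brain-computer interfaces and rehabilitation robotics. Its content is unrelated to the vast majority of keywords (large models, deep learning principles, training methods, inference optimization, agents, and so on), which therefore score 0. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the work applies AI in biomedicine and neuroscience. However, the paper emphasizes EEG signal processing and robot control rather than innovations in AI models or deep learning, so this keyword receives 5 points (some relevance).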

!!! tip deepseek-chat TL;DR

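The study addresses the lack of direct neural control in rehabilitation exoskeleton systems. It implements online dual-state motor imagery control so that an upper-limb exoskeleton can be started and stopped from EEG, reports group-mean hit rates of 61.5% for onset and 64.5% for offset, and introduces a fixation-based recentering method that improves drift tracking.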

Abstract (Chinese translation)

机器人辅助治疗可在神经损伤后提供高强度、任务特异性的训练，但多数系统主要作用于肢体层面——仅间接调动受损神经回路——这仍是实现真正依从性、以神经可塑性为目标的康复治疗的关键障碍。本研究通过实施上肢外骨骼的在线双状态运动想象控制来解决这一局限，使得目标导向的伸展动作能够直接通过非侵入性脑电图信号启动和终止。八名参与者利用脑电图启动辅助，并在轨迹中途自主停止机器人运动。在两次在线实验中，组平均命中率在启动阶段为61.5%，终止阶段为64.5%，表明尽管存在仪器噪声和手臂被动运动，仍能实现可靠的启停指令传递。在方法论上，我们通过非对称边界诊断揭示了基于常见任务重定心方法引发的系统性、类别驱动的偏差，并提出了一种与类别无关的基于注视点的重定心方法。该方法能在不采样指令类别的情况下追踪信号漂移，同时保持类别几何结构。这显著提升了无阈值分离性能（AUC增益：启动阶段+56%，p = 0.0117；终止阶段+34%，p = 0.0251），并减少了日内与日间的偏差。综上，这些成果有助于弥合离线解码与康复外骨骼实用化、意图驱动的启停控制之间的鸿沟，实现与神经可塑性目标精准同步的依从性辅助，为未来的临床转化提供支持。

Abstract

Robot-assisted therapy can deliver high-dose, task-specific training after neurologic injury, but most systems act primarily at the limb level (engaging the impaired neural circuits only indirectly), which remains a key barrier to truly contingent, neuroplasticity-targeted rehabilitation. We address this gap by implementing online, dual-state motor imagery control of an upper-limb exoskeleton, enabling goal-directed reaches to be both initiated and terminated directly from non-invasive EEG. Eight participants used EEG to initiate assistance and then volitionally halt the robot mid-trajectory. Across two online sessions, group-mean hit rates were 61.5% for onset and 64.5% for offset, demonstrating reliable start-stop command delivery despite instrumental noise and passive arm motion. Methodologically, we reveal a systematic, class-driven bias induced by common task-based recentering using an asymmetric margin diagnostic, and we introduce a class-agnostic fixation-based recentering method that tracks drift without sampling command classes while preserving class geometry. This substantially improves threshold-free separability (AUC gains: onset +56%, p = 0.0117; offset +34%, p = 0.0251) and reduces bias within and across days. Together, these results help bridge offline decoding and practical, intention-driven start-stop control of a rehabilitation exoskeleton, enabling precisely timed, contingent assistance aligned with neuroplasticity goals while supporting future clinical translation.

Keywords: brain-computer interface, EEG, motor imagery, rehabilitation exoskeleton, online decoding, drift correction, neuroplasticity, upper-limb
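
The idea of class-agnostic recentering and threshold-free separability in the abstract above can be illustrated with a toy example. This is only a minimal sketch under strong assumptions: the decoder scores are one-dimensional, drift is a pure additive offset, and the names `auc`, `recenter`, and the `sessions` data are hypothetical. The paper's actual features and recentering procedure are not described in the abstract and may differ substantially.

```python
from statistics import mean
from typing import List, Sequence


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Threshold-free separability: probability that a positive-class score
    exceeds a negative-class score (ties count one half)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def recenter(scores: Sequence[float], fixation: Sequence[float]) -> List[float]:
    """Subtract the session's fixation-period mean.

    Only neutral fixation samples are used, so no command-class data
    (rest or imagery) influences the offset estimate.
    """
    offset = mean(fixation)
    return [s - offset for s in scores]


# Hypothetical 1-D decoder scores from two sessions; session 2 drifted by +2.
sessions = [
    {"fixation": [0.05, 0.15], "rest": [0.0, 0.2], "imagery": [1.0, 1.2]},
    {"fixation": [2.05, 2.15], "rest": [2.0, 2.2], "imagery": [3.0, 3.2]},
]

# Pool both sessions without any drift correction.
raw_pos = [s for sess in sessions for s in sess["imagery"]]
raw_neg = [s for sess in sessions for s in sess["rest"]]

# Pool both sessions after per-session fixation-based recentering.
rec_pos = [s for sess in sessions for s in recenter(sess["imagery"], sess["fixation"])]
rec_neg = [s for sess in sessions for s in recenter(sess["rest"], sess["fixation"])]

auc_raw = auc(raw_pos, raw_neg)
auc_recentered = auc(rec_pos, rec_neg)
```

In this toy data, pooling the drifted sessions gives an AUC of 0.75, while recentering each session on its fixation mean restores perfect separability (AUC 1.0). Real EEG drift is rarely a clean additive shift, so the size of any gain in practice is an empirical question.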

35. ❌ Beyond Accuracy: Evaluating Forecasting Models by Multi-Echelon Inventory Cost

Authors: Swata Marik, Swayamjit Saha, Garga Chatterjee Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16815v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

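Scoring rationale: The paper studies forecasting-inventory optimization in supply chains using traditional forecasting models, machine learning regressors, and deep sequence models (such as Temporal CNN and LSTM). It does not touch on any of the scored topics: large language models (LLMs), MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, long context, attention optimization, reasoning techniques, agents, quantization, speculative decoding, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for science. Because every keyword concerns large-model techniques or their scientific applications, which the paper does not address, all relevance scores are 0.

!!! tip deepseek-chat TL;DR

The study builds a digitalized forecasting-inventory optimization pipeline that integrates traditional forecasting models, machine learning regressors, and deep sequence models. It evaluates seven forecasting approaches on the M5 Walmart dataset and finds that Temporal CNN and LSTM models significantly reduce inventory costs and improve fill rates compared with statistical baselines.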

摘要翻译

本研究开发了一种数字化预测-库存优化流程，将传统预测模型、机器学习回归器和深度序列模型集成于统一的库存模拟框架中。基于M5沃尔玛数据集，我们评估了七种预测方法，并分析了其在单级与两级报童系统下的运营影响。结果表明，与统计基线模型相比，时序卷积网络（Temporal CNN）和长短期记忆网络（LSTM）模型能显著降低库存成本并提升订单满足率。敏感性分析与多级库存分析验证了该框架的鲁棒性与可扩展性，为现代供应链提供了一种数据驱动的决策支持工具。

摘要 (Abstract)

This study develops a digitalized forecasting-inventory optimization pipeline integrating traditional forecasting models, machine learning regressors, and deep sequence models within a unified inventory simulation framework. Using the M5 Walmart dataset, we evaluate seven forecasting approaches and assess their operational impact under single- and two-echelon newsvendor systems. Results indicate that Temporal CNN and LSTM models significantly reduce inventory costs and improve fill rates compared to statistical baselines. Sensitivity and multi-echelon analyses demonstrate robustness and scalability, offering a data-driven decision-support tool for modern supply chains.

Keywords: forecasting-inventory optimization, machine learning regressors, deep sequence models, Temporal CNN, LSTM, inventory cost, fill rates, supply chains
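
To make concrete how a forecast can be judged by inventory cost rather than accuracy alone, here is a minimal single-echelon sketch. It assumes the forecast is used directly as the order quantity and that leftover and unmet units carry fixed per-unit costs. The function name, the cost parameters `holding_cost` and `shortage_cost`, and all demand numbers are hypothetical and not taken from the paper, whose simulation framework is likely richer.

```python
from typing import Sequence, Tuple


def newsvendor_metrics(
    demand: Sequence[float],
    order_qty: Sequence[float],
    holding_cost: float = 1.0,   # hypothetical cost per unit left over
    shortage_cost: float = 4.0,  # hypothetical cost per unit of unmet demand
) -> Tuple[float, float]:
    """Return (total cost, fill rate) for per-period order quantities."""
    if len(demand) != len(order_qty):
        raise ValueError("demand and order_qty must have the same length")
    total_cost = 0.0
    served = 0.0
    for d, q in zip(demand, order_qty):
        overage = max(q - d, 0.0)    # units ordered but not sold
        underage = max(d - q, 0.0)   # units demanded but not available
        total_cost += holding_cost * overage + shortage_cost * underage
        served += min(d, q)
    total_demand = sum(demand)
    fill_rate = served / total_demand if total_demand > 0 else 1.0
    return total_cost, fill_rate


# Hypothetical realized demand and two competing forecasts used as orders.
demand = [10.0, 12.0, 8.0, 15.0]
forecast_a = [9.0, 12.0, 10.0, 13.0]
forecast_b = [10.0, 10.0, 10.0, 10.0]

cost_a, fill_a = newsvendor_metrics(demand, forecast_a)
cost_b, fill_b = newsvendor_metrics(demand, forecast_b)
```

With these made-up numbers, forecast A costs 14 with a fill rate of 42/45, while forecast B costs 30 with a fill rate of 38/45. Because shortages are priced higher than leftovers here, two forecasts with similar error magnitudes can still differ widely in cost.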

36. ❌ Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights

Authors: Yi Chen, Daiwei Chen, Sukrut Madhav Chikodikar, Caitlyn Heqi Yin, Ramya Korlakai Vinayak Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16817v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

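Scoring rationale: The paper's core topic is conformal factuality for RAG-based LLMs, which directly involves three keywords: 'Large Language Models' (the object of study), 'Retrieval-Augmented Generation' (the core method), and 'Hallucination Mitigation' (the research goal). The other keywords, such as MoE, SLMs, scaling laws, training methods, reasoning techniques, and agent systems, are not mentioned in the abstract and are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper systematically analyzes the reliability and usefulness of conformal factuality filtering for retrieval-augmented generation (RAG) with large language models (LLMs). It finds that the method has low usefulness at high factuality levels because outputs become vacuous, that its guarantee is not robust to distribution shifts and distractors, and that lightweight entailment-based verifiers match or outperform LLM-based confidence scorers while requiring over 100× fewer FLOPs.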

摘要翻译

大型语言模型（LLMs）常产生幻觉，限制了其在知识密集型应用中的可靠性。检索增强生成（RAG）与保真性合形预测已成为解决这一局限的潜在途径。尽管RAG旨在将回答基于检索到的证据，但其无法为最终输出的正确性提供统计保证。保真性合形过滤通过使用在保留数据上校准的阈值对原子主张进行评分和过滤，提供了无需分布假设的统计可靠性，然而最终输出的信息量无法得到保证。我们系统分析了基于RAG的LLMs在生成、评分、校准、鲁棒性和效率方面应用保真性合形预测的可靠性与实用性。我们提出了新颖的信息量感知指标，以更好地反映合形过滤下的任务效用。在三个基准测试和多个模型系列中，我们发现：（i）在高保真性水平下，由于输出内容空洞，合形过滤的实用性较低；（ii）保真性合形保证对分布偏移和干扰信息缺乏鲁棒性，这凸显了校准数据需与部署条件紧密匹配的局限；（iii）基于轻量级蕴含关系的验证器在性能上匹配或优于基于LLM的置信度评分器，同时所需浮点运算次数降低超过100倍。总体而言，我们的研究揭示了保真性与信息量之间的权衡，以及合形过滤框架在分布偏移和干扰信息下的脆弱性，强调需要以鲁棒性和实用性为核心指标开发新的可靠性方法，并为构建既可靠又计算高效的RAG流程提供了可操作的指导。

摘要 (Abstract)

Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring and filtering atomic claims using a threshold calibrated on held-out data, however, the informativeness of the final output is not guaranteed. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three benchmarks and multiple model families, we find that (i) conformal filtering suffers from low usefulness at high factuality levels due to vacuous outputs, (ii) conformal factuality guarantee is not robust to distribution shifts and distractors, highlighting the limitation that requires calibration data to closely match deployment conditions, and (iii) lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over $100\times$ fewer FLOPs. Overall, our results expose factuality-informativeness trade-offs and fragility of conformal filtering framework under distribution shifts and distractors, highlighting the need for new approaches for reliability with robustness and usefulness as key metrics, and provide actionable guidance for building RAG pipelines that are both reliable and computationally efficient.

Keywords: Large Language Models, Retrieval-Augmented Generation, Conformal Factuality, Hallucination Mitigation, Factuality-Informativeness Trade-off, Distribution Shift Robustness, Computational Efficiency, Entailment-based Verifiers
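
The calibrate-then-filter step that the abstract describes can be sketched as follows. This is one common split-conformal recipe, not necessarily the paper's exact procedure: each calibration response is summarized by the highest score given to any incorrect claim, and the threshold is the conformal quantile of those values. The type alias `Claim`, the function names, and the example scores are all hypothetical.

```python
import math
from typing import List, Tuple

Claim = Tuple[float, bool]  # (verifier score, is the claim factually correct?)


def calibrate_threshold(calibration: List[List[Claim]], alpha: float) -> float:
    """Pick a threshold such that keeping only claims scoring strictly above it
    removes every incorrect claim in at least ceil((n + 1) * (1 - alpha))
    of the n calibration responses."""
    # Per response: highest score assigned to an incorrect claim
    # (-inf when every claim is correct, so nothing needs filtering).
    worst = sorted(
        max((score for score, ok in claims if not ok), default=-math.inf)
        for claims in calibration
    )
    n = len(worst)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal quantile rank, 1-indexed
    if k > n:
        return math.inf  # too little calibration data: every claim is filtered
    return worst[k - 1]


def filter_claims(claims: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep the text of claims whose score is strictly above the threshold."""
    return [text for text, score in claims if score > threshold]


# Hypothetical calibration set: four responses with scored, labeled claims.
calibration = [
    [(0.9, True), (0.4, False)],
    [(0.8, True), (0.7, True)],
    [(0.6, False), (0.95, True)],
    [(0.3, False), (0.2, False)],
]

threshold = calibrate_threshold(calibration, alpha=0.5)
kept = filter_claims([("A", 0.9), ("B", 0.4), ("C", 0.1)], threshold)
strict = calibrate_threshold(calibration, alpha=0.1)
```

Here `alpha=0.5` yields a threshold of 0.4 and keeps only claim "A". Asking for a stricter level (`alpha=0.1`) with just four calibration responses returns an infinite threshold, so every claim is dropped. That is a small-scale version of the vacuous-output problem the abstract reports at high factuality levels.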

37. ❌ ODIN-Based CPU-GPU Architecture with Replay-Driven Simulation and Emulation

Authors: Nij Dorairaj, Debabrata Chatterjee, Hong Wang, Hong Jiang, Alankar Saxena, Altug Koker, Thiam Ern Lim, Cathrane Teoh, Chuan Yin Loo, Bishara Shomar, Anthony Lester Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16812v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

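Scoring rationale: The paper focuses on a validation methodology for chip-level hardware architecture (ODIN-based CPU-GPU integration with replay-driven simulation and emulation), which belongs to computer architecture and hardware design. It has no direct connection to any scored keyword, all of which concern the software and algorithm side of large models, deep learning techniques, and AI applications. AI workloads are mentioned only as application background, and no keyword's technical content is covered.

!!! tip deepseek-chat TL;DR

The paper addresses the complex validation challenges of integrating a chiplet-based CPU-GPU system on the ODIN architecture. It proposes a replay-driven validation methodology spanning simulation and emulation that significantly accelerates debug and shortens integration cycles.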

摘要翻译

CPU与GPU技术的集成是现代人工智能与图形工作负载的关键赋能方案，它将面向控制的处理能力与大规模并行计算能力相结合。随着系统向基于芯粒的架构演进，紧密耦合的CPU-GPU子系统的流片前验证因复杂的验证框架搭建、庞大的设计规模、高并发性、非确定性执行以及芯粒边界处复杂的协议交互而日益困难，往往导致漫长的集成周期。本文提出了一种基于重放的验证方法，该方法是在针对ODIN集成芯粒架构的基础性SoC构建模块中，集成CPU子系统、多个Xe GPU核心以及可配置片上网络（Network-on-Chip, NoC）的过程中开发而成。通过利用单一设计数据库，在仿真和硬件仿真中实现确定性波形捕获与重放，可以在系统级别可靠地复现复杂的GPU工作负载和协议序列。该方法显著加速了调试进程，提升了集成信心，并使得端到端的系统启动和工作负载执行能在单个季度内完成，从而证明了基于重放的验证作为一种可扩展方法对于芯粒架构系统的有效性。

摘要 (Abstract)

Integration of CPU and GPU technologies is a key enabler for modern AI and graphics workloads, combining control-oriented processing with massive parallel compute capability. As systems evolve toward chiplet-based architectures, pre-silicon validation of tightly coupled CPU-GPU subsystems becomes increasingly challenging due to complex validation framework setup, large design scale, high concurrency, non-deterministic execution, and intricate protocol interactions at chiplet boundaries, often resulting in long integration cycles. This paper presents a replay-driven validation methodology developed during the integration of a CPU subsystem, multiple Xe GPU cores, and a configurable Network-on-Chip (NoC) within a foundational SoC building block targeting the ODIN integrated chiplet architecture. By leveraging deterministic waveform capture and replay across both simulation and emulation using a single design database, complex GPU workloads and protocol sequences can be reproduced reliably at the system level. This approach significantly accelerates debug, improves integration confidence, and enables end-to-end system boot and workload execution within a single quarter, demonstrating the effectiveness of replay-based validation as a scalable methodology for chiplet-based systems.

Keywords: CPU-GPU integration, chiplet architecture, replay-driven validation, pre-silicon validation, simulation and emulation, Network-on-Chip, system-level debug, ODIN architecture

38. ❌ V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising

Authors: Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16792v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

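Scoring rationale: The paper studies visual representation alignment and diffusion models, squarely within computer vision and generative modeling. It does not involve large language models, the other scored deep learning techniques, or scientific applications. Since the keywords all concern large language models, related deep learning techniques, or AI for science, none is directly relevant to this work on visual diffusion models, and every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper presents a systematic study of visual co-denoising. Within a unified framework it identifies four key design ingredients and, on ImageNet-256 at comparable model sizes, outperforms existing pixel-space diffusion models.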

摘要翻译

像素空间扩散模型近期重新成为潜空间扩散模型的重要替代方案，其无需预训练自编码器即可实现高质量生成。然而，标准像素空间扩散模型获得的语义监督相对较弱，且未明确设计用于捕获高层视觉结构。近期表征对齐方法（如REPA）表明，预训练的视觉特征能显著改进扩散训练，而视觉协同去噪已成为将此类特征融入生成过程的有前景方向。但现有协同去噪方法常混杂多种设计选择，难以辨明哪些选择真正关键。为此，我们提出V-Co——在统一的即时训练框架中对视觉协同去噪进行系统性研究。这一受控设置使我们能分离出视觉协同去噪有效的核心要素。我们的研究揭示了有效视觉协同去噪的四个关键要素：第一，保持特征特定计算并实现灵活的跨流交互，需采用完全双流架构；第二，有效的无分类器引导需依赖结构定义的无条件预测；第三，更强的语义监督最好通过感知漂移混合损失提供；第四，稳定的协同去噪还需适当的跨流校准，我们通过基于RMS的特征重缩放实现。这些发现共同构成了视觉协同去噪的简明方案。在ImageNet-256上的实验表明，在模型规模相当的情况下，V-Co在减少训练轮次的同时，超越了基础像素空间扩散基线及先前强效的像素扩散方法，为未来表征对齐生成模型提供了实用指导。

摘要 (Abstract)

Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.

Keywords: visual co-denoising, diffusion models, representation alignment, pixel-space diffusion, generative models, ImageNet-256, classifier-free guidance, perceptual-drifting hybrid loss
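
Two of the ingredients named in the abstract, classifier-free guidance and RMS-based feature rescaling, rest on simple arithmetic that a short sketch can show. The guidance formula below is the standard one. The rescaling function is only an assumption about what "RMS-based" calibration might look like (matching one stream's root-mean-square magnitude to a target). The abstract does not say how V-Co defines its unconditional prediction or which target it uses, and all names and numbers are hypothetical.

```python
import math
from typing import List


def rms(x: List[float]) -> float:
    """Root-mean-square magnitude of a feature vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rescale_to_rms(features: List[float], target_rms: float, eps: float = 1e-8) -> List[float]:
    """Scale features so their RMS is approximately target_rms (assumed calibration rule)."""
    scale = target_rms / (rms(features) + eps)
    return [v * scale for v in features]


def cfg_combine(cond: List[float], uncond: List[float], guidance_scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + w * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# Hypothetical streams: pixel activations are small, semantic features are large.
pixel_stream = [0.5, -0.5, 0.5, -0.5]
semantic_stream = [3.0, -4.0, 0.0, 0.0]

calibrated = rescale_to_rms(semantic_stream, target_rms=rms(pixel_stream))
guided = cfg_combine(cond=[1.0, 2.0], uncond=[0.0, 1.0], guidance_scale=2.0)
```

After rescaling, the semantic stream has roughly the same RMS as the pixel stream (0.5 here), so neither stream dominates when the two interact. A guidance scale of 2.0 pushes the prediction beyond the conditional output, away from the unconditional one.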

39. ❌ DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping

Authors: Yuliang Wu, Yanhan Lin, WengKit Lao, Yuhao Lin, Yi-Lin Wei, Wei-Shi Zheng, Ancong Wu Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16806v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

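Scoring rationale: The paper "DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping" focuses on robotic dexterous grasping and proposes a policy built on a graph convolutional network (MAGCN) for zero-shot cross-embodiment grasping across hands of different morphologies. The work touches on robotics, computer vision, reinforcement or imitation learning (no specific algorithm is named, but "policy learning" suggests such methods), and embodied AI. It does not involve large language models (LLMs), core deep learning innovations (such as MoE, scaling laws, or attention optimization), large-model training and alignment techniques (such as pre-training, SFT, RLHF, or PEFT), large-model application techniques (such as RAG, CoT, or agents), or large-model deployment optimization (such as quantization or speculative decoding). All scored keywords relate directly to large models or core deep learning techniques, whereas the paper's core is a robotic grasping policy and morphology-aligned representation learning, a specific embodied-AI application that neither uses nor improves any large-model technique. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The study tackles zero-shot cross-embodiment grasping across dexterous hands of different morphologies. It proposes a morphology-aligned graph representation and a matching graph convolutional network policy, achieving high grasp success rates on unseen hands and unseen objects in both simulation and real-robot experiments.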

摘要翻译

为满足日益多样化的灵巧手硬件需求，开发一种无需冗余重复学习即可实现零样本跨具身抓取策略至关重要。由于异构手部运动学与物理约束的差异，跨具身对齐面临严峻挑战。现有方法通常预测中间运动目标并将其重定向至各具身结构，这可能引入误差并违反特定具身的物理限制，从而阻碍在不同手型间的迁移。为突破这些局限，我们提出\textit{DexGrasp-Zero}策略，该策略从多样具身结构中学习通用抓取技能，实现向未见手型的零样本迁移。我们首先提出一种形态对齐图表示方法，将每只手的运动学关键点映射至基于解剖结构的节点，并为每个节点配备三轴正交运动基元，从而实现不同形态间的结构与语义对齐。基于此图表示，我们设计了\textit{形态对齐图卷积网络}（MAGCN）对策略学习所需的图结构进行编码。MAGCN融合了\textit{物理属性注入}机制，将手部特定的物理约束整合至图特征中，能够针对不同连杆长度与驱动限制进行自适应补偿，实现精准稳定的抓取。我们在YCB数据集上的大量仿真评估表明，本策略在四种异构手型（Allegro、Shadow、Schunk、Ability）上联合训练后，在未见硬件（LEAP、Inspire）上达到85%的零样本成功率，较现有最优方法提升59.5%。真实世界实验进一步在三个机器人平台（LEAP、Inspire、Revo2）上验证了本策略，在未见物体上取得平均82%的成功率。

摘要 (Abstract)

To meet the demands of increasingly diverse dexterous hand hardware, it is crucial to develop a policy that enables zero-shot cross-embodiment grasping without redundant re-learning. Cross-embodiment alignment is challenging due to heterogeneous hand kinematics and physical constraints. Existing approaches typically predict intermediate motion targets and retarget them to each embodiment, which may introduce errors and violate embodiment-specific limits, hindering transfer across diverse hands. To overcome these limitations, we propose DexGrasp-Zero, a policy that learns universal grasping skills from diverse embodiments, enabling zero-shot transfer to unseen hands. We first introduce a morphology-aligned graph representation that maps each hand’s kinematic keypoints to anatomically grounded nodes and equips each node with tri-axial orthogonal motion primitives, enabling structural and semantic alignment across different morphologies. Relying on this graph-based representation, we design a Morphology-Aligned Graph Convolutional Network (MAGCN) to encode the graph for policy learning. MAGCN incorporates a Physical Property Injection mechanism that fuses hand-specific physical constraints into the graph features, enabling adaptive compensation for varying link lengths and actuation limits for precise and stable grasping. Our extensive simulation evaluations on the YCB dataset demonstrate that our policy, jointly trained on four heterogeneous hands (Allegro, Shadow, Schunk, Ability), achieves an 85% zero-shot success rate on unseen hardware (LEAP, Inspire), outperforming the state-of-the-art method by 59.5%. Real-world experiments further evaluate our policy on three robot platforms (LEAP, Inspire, Revo2), achieving an 82% average success rate on unseen objects.

Keywords: dexterous grasping, zero-shot cross-embodiment, morphology-aligned policy, graph convolutional network, robot manipulation, sim-to-real transfer, physical property injection

40. ❌ InCoder-32B: Code Foundation Model for Industrial Scenarios

Authors: Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Shawn Guo, Haowen Wang, Weicheng Gu, Yaxin Du, Joseph Li, Fanglin Xu, Yizhi Li, Lin Jing, Yuanbo Wang, Yuhan Gao, Ruihao Gong, Chuan Hao, Ran Tao, Aishan Liu, Tuney Zheng, Ganqu Cui, Zhoujun Li, Mingjie Tang, Chenghua Lin, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16790v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

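Scoring rationale: InCoder-32B is a 32B-parameter large language model trained specifically for industrial code scenarios (such as chip design, GPU kernel optimization, and embedded systems), an application of large models to a specialized domain. The core relevant keywords are: 'Large Language Models' (the paper explicitly presents a code foundation model; weight 1.0, relevance 10), 'Pre-training' (the paper mentions general code pre-training; weight 1.0, relevance 10), 'Post-training' (the paper mentions post-training with execution-grounded verification; weight 1.0, relevance 10), and 'Context Window Extension' (the paper mentions mid-training that progressively extends context from 8K to 128K tokens; weight 1.0, relevance 10). 'AI for Science' is partially relevant (the paper covers industrial engineering domains such as chip design and 3D modeling, which can be viewed as scientific applications; weight 1.0, relevance 5). Other keywords, such as MoE, SLMs, scaling laws, instruction tuning, RLHF, RAG, CoT, agents, and quantization, are not covered or not explicitly mentioned, so their relevance is 0.

!!! tip deepseek-chat TL;DR

Existing code LLMs degrade in industrial code scenarios such as chip design and GPU optimization. To address this, the paper proposes InCoder-32B, which is trained from scratch, extends context to 128K tokens, and applies post-training with execution-grounded verification, achieving competitive performance on both general and industrial benchmarks.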

摘要翻译

近期代码大语言模型在通用编程任务上取得了显著进展。然而，在需要理解硬件语义、特殊语言结构及严格资源约束的工业场景中，其性能显著下降。为应对这些挑战，我们推出了InCoder-32B（Industrial-Coder-32B），这是首个拥有320亿参数、统一芯片设计、GPU内核优化、嵌入式系统、编译器优化与三维建模领域代码智能的代码基础模型。通过采用高效架构，我们以通用代码预训练、精选工业代码退火、利用合成工业推理数据将上下文长度从8K逐步扩展至128K符号的中期训练，以及基于执行验证的后训练四个阶段，对InCoder-32B进行了从零开始的训练。我们在14个主流通用代码基准测试和覆盖4个专业领域的9个工业基准测试上进行了广泛评估。结果表明，InCoder-32B在通用任务上展现出高度竞争力，同时在工业领域建立了强大的开源基准。

摘要 (Abstract)

Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. By adopting an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.

Keywords: code foundation model, industrial scenarios, 32B-parameter, context extension, pre-training, post-training, hardware semantics, industrial benchmarks

41. ❌ Anticipatory Planning for Multimodal AI Agents

Authors: Yongyuan Liang, Shijie Zhou, Yu Gu, Hao Tan, Gang Wu, Franck Dernoncourt, Jihyung Kil, Ryan A. Rossi, Ruiyi Zhang Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16777v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

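Scoring rationale: The paper focuses on the planning ability of multimodal AI agents. Its core contribution is the TraceR1 framework, which achieves anticipatory reasoning through two-stage reinforcement learning. Relevance to the keywords breaks down as follows: (1) highly relevant (10): 'LLM Agents/Autonomous Agents/Agentic Workflow' and 'Tool Use/Function Calling/API Tool Use', because the paper explicitly studies planning, reasoning, and tool use by multimodal agents; (2) strongly relevant (8): 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' and 'System 2 Thinking/Slow Thinking/In-depth Reasoning', because the paper emphasizes long-horizon planning and in-depth reasoning for multi-step tasks; (3) moderately relevant (5): 'Large Language Models/LLMs/Foundation Models', because multimodal agents are usually built on large models, although the paper does not discuss LLM technical details explicitly; (4) not relevant (0): the remaining keywords concern specific techniques (such as MoE and quantization), training methods (such as RLHF and PEFT), or domains (such as bioinformatics) that the paper does not cover.

!!! tip deepseek-chat TL;DR

Existing multimodal AI agents lack long-horizon planning ability. To address this, the paper proposes TraceR1, a two-stage reinforcement learning framework whose anticipatory trajectory reasoning substantially improves planning stability, execution robustness, and generalization on complex tasks.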

摘要翻译

近年来，多模态智能体在计算机交互与工具使用方面取得了显著进展，然而现有系统大多仍属于被动反应型，其优化动作往往孤立进行，缺乏对未来状态或长期目标的推理能力。这一局限影响了规划的一致性，导致智能体难以可靠地完成高层次、多步骤的任务。我们提出了TraceR1——一个两阶段强化学习框架，该框架通过在执行前预测短期轨迹来显式训练前瞻性推理能力。第一阶段执行轨迹级强化学习，其奖励机制旨在确保预测动作序列的全局一致性；第二阶段进行基于实际执行的强化微调，利用冻结工具智能体提供的执行反馈来优化步骤级准确性与可执行性。我们在涵盖在线计算机使用、离线计算机使用基准以及多模态工具使用推理任务的七项基准测试中对TraceR1进行了评估。实验结果表明，相较于被动反应型及单阶段基线模型，TraceR1在规划稳定性、执行鲁棒性和泛化能力方面均实现了显著提升。这些发现证明，前瞻性轨迹推理是构建能够在复杂现实环境中有效推理、规划与行动的多模态智能体的关键原则。

摘要 (Abstract)

Recent advances in multimodal agents have improved computer-use interaction and tool-usage, yet most existing systems remain reactive, optimizing actions in isolation without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that explicitly trains anticipatory reasoning by forecasting short-horizon trajectories before execution. The first stage performs trajectory-level reinforcement learning with rewards that enforce global consistency across predicted action sequences. The second stage applies grounded reinforcement fine-tuning, using execution feedback from frozen tool agents to refine step-level accuracy and executability. TraceR1 is evaluated across seven benchmarks, covering online computer-use, offline computer-use benchmarks, and multimodal tool-use reasoning tasks, where it achieves substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines. These results show that anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.

Keywords: multimodal AI agents, anticipatory planning, reinforcement learning, trajectory reasoning, tool-use, long-term goals, execution robustness, generalization

42. ❌ IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans

Authors: Huimin Xiong, Zijie Meng, Tianxiang Hu, Chenyi Zhou, Yang Feng, Zuozhu Liu Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16781v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

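Scoring rationale: The paper proposes IOSVLM, a 3D vision-language model for dental diagnosis whose core is a 3D point-cloud encoder combined with an LLM for unified diagnosis and visual question answering. Highly relevant keywords: (1) 'Large Language Models' (weight 1.0, relevance 10.0): the model uses an LLM as its generative component for diagnosis and VQA; (2) 'AI for Science' (weight 1.0, relevance 10.0): the work applies AI to dentistry, which falls under AI for Science. Moderately relevant keywords: (1) 'Pre-training' (weight 1.0, relevance 5.0): the paper discusses cross-modal alignment and geometric perception, which touch on pre-training; (2) 'Post-training' (weight 1.0, relevance 5.0): the two-stage curriculum training strategy involves fine-tuning. Other keywords such as MoE, SLMs, and RAG are not covered and are rated 0. Weighted total: 10.0×1.0 + 5.0×1.0 + 5.0×1.0 + 10.0×1.0 = 30.0, with the remaining 23 keywords each contributing 0.0×1.0. None of the designated experts appears in the author list.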
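The weighted total quoted in the rationale can be reproduced with a few lines of Python. This is a minimal sketch, not the report generator's actual code: the function names are hypothetical, the weight of 1.0 per keyword is taken from the rationale (the score table lists 0.0), and the passing-score formula is the one given in the report's configuration section (5.0 + 0.8 × total weight).

```python
# Relevance ratings (0-10) for the four keywords the rationale rates above
# zero; the other 23 of the 27 keywords are rated 0.0 and add nothing.
relevance = {
    "Large Language Models": 10.0,
    "Pre-training": 5.0,
    "Post-training": 5.0,
    "AI for Science": 10.0,
}
NUM_KEYWORDS = 27
WEIGHT = 1.0  # assumption: per-keyword weight as stated in the rationale


def weighted_total(ratings, weight):
    """Sum relevance x weight over all rated keywords."""
    return sum(score * weight for score in ratings.values())


def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold from the report configuration: base + factor x total weight."""
    return base + factor * total_weight


total = weighted_total(relevance, WEIGHT)
threshold = passing_score(NUM_KEYWORDS * WEIGHT)
print(f"total={total:.1f}, threshold={threshold:.1f}")
```

Under these assumptions the total matches the 30.0 in the rationale and the threshold matches the 26.6 shown next to each score.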

!!! tip deepseek-chat TL;DR

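The paper proposes IOSVLM, an end-to-end 3D vision-language model that models 3D geometry directly to address unified multi-disease diagnosis from intraoral scans (IOS), and builds the large-scale IOSVQA dataset. Experiments show that it consistently outperforms strong baselines, with gains of at least 9.58% in macro accuracy and 1.46% in macro F1.

Abstract (Chinese translation)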

三维口内扫描（IOS）因其丰富的几何证据在常规牙科诊疗中日益普及，而统一的多疾病诊断对临床记录与沟通至关重要。尽管近期研究引入牙科视觉语言模型（VLM）实现了基于二维图像或IOS渲染多视角图像的统一诊断与报告生成，但这些方法未能充分利用原生三维几何信息。此类研究具有必要性且面临三重挑战：（i）异构扫描形态与复杂的IOS拓扑结构；（ii）多疾病共现伴随的类别不平衡与细粒度形态学模糊性；（iii）配对的3D IOS-文本数据有限。为此，我们提出IOSVLM——一种端到端三维视觉语言模型，该模型将扫描数据表征为点云，采用三维编码器-投影器-大语言模型架构，实现统一诊断与生成式视觉问答（VQA）。同时构建了大规模多源IOS诊断数据集IOSVQA，涵盖19,002个病例、249,055组VQA对，涉及23种口腔疾病及异构扫描类型。针对无色IOS数据与依赖色彩的3D预训练间的分布差异，我们提出几何-色彩代理机制，以稳定细粒度几何感知与跨模态对齐。两阶段课程训练策略进一步增强了模型鲁棒性。IOSVLM在各项基准测试中均显著优于现有强基线模型，宏观准确率提升至少9.58%，宏观F1分数提高至少1.46%，这证实了直接三维几何建模在IOS诊断中的有效性。

Abstract

3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Such work is necessary and also challenging, due to: (i) heterogeneous scan forms and the complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, (iii) limited paired 3D IOS-text data. Thus, we present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question-answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.

Keywords: 3D Vision-Language Model, Dental Diagnosis, Intraoral Scans, Point Cloud Representation, Generative Visual Question Answering, Cross-modal Alignment, Curriculum Training, Oral Diseases

43. ❌ TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

Authors: Victoria Graf, Valentina Pyatkin, Nouha Dziri, Nathan Lambert, Hannaneh Hajishirzi Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16759v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

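Scoring rationale: The paper studies the gap between the single-turn and multi-turn conversational capabilities of large language models (LLMs) and introduces a multi-turn evaluation benchmark, TurnWiseEval, and a training-data generation pipeline, TurnWiseData. Experiments show that adding a small amount of multi-turn data during post-training substantially improves multi-turn performance. The paper is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10) and to 'Post-training OR Supervised Fine-tuning OR SFT' (10), because it explicitly involves post-training and uses supervised fine-tuning (SFT). Other keywords, such as MoE, SLMs, Scaling Laws, RAG, CoT, Agents, and Quantization, are neither mentioned nor relevant and are rated 0.

!!! tip deepseek-chat TL;DR

The paper studies the gap between the single-turn and multi-turn conversational capabilities of large language models. It introduces a new multi-turn benchmark, TurnWiseEval, and a training-data generation pipeline, TurnWiseData, and shows experimentally that adding a small amount of multi-turn data during post-training substantially improves multi-turn performance.

Abstract (Chinese translation)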

多轮对话是语言模型交互中常见且关键的模式。然而，当前开放的训练与评估数据主要集中于单轮场景，未能涵盖这些长程交互的额外维度。为理解多轮与单轮能力之间的差距，我们首先提出了一个可直接与单轮对话评估对标的新基准——TurnWiseEval，用于衡量多轮对话能力。该评估通过将多轮场景与等效的单轮设置进行配对比较，从而分离出多轮对话特有的交互能力。此外，我们提出了合成多轮数据流水线TurnWiseData，该方案支持可扩展的多轮训练数据生成。基于Olmo 3模型的实验表明，使用多轮数据进行训练对于实现强大的多轮对话性能至关重要：在后期训练中仅加入1万轮多轮对话数据，即可使模型在TurnWiseEval基准上的表现提升12%。

Abstract

Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline TurnWiseData which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.

Keywords: multi-turn conversations, language model capabilities, evaluation benchmark, training data generation, post-training, single-turn vs multi-turn gap, TurnWiseEval, TurnWiseData

44. ❌ Finding Common Ground in a Sea of Alternatives

Authors: Jay Chooi, Paul Gölz, Ariel D. Procaccia, Benjamin Schiffer, Shirley Zhang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16751v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

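Scoring rationale: The paper studies the social choice problem of finding consensus statements among infinitely many alternatives using generative AI, LLMs in particular. It is highly relevant to 'Large Language Models' (8) because it explicitly mentions LLM-based methods and concerns an application of generative AI. Other keywords, such as MoE, SLMs, training techniques, inference optimization, and AI for Science, are not covered and are rated 0.

!!! tip deepseek-chat TL;DR

The paper studies the social choice problem of finding a common-ground statement among infinitely many alternatives. It proposes a formal model based on the proportional veto core, designs an efficient sampling algorithm that returns an approximate solution with high probability, and validates the algorithm on a dataset of preferences over text.

Abstract (Chinese translation)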

本研究探讨如何在多元群体偏好中选取能建立共识的陈述。生成式人工智能因其可访问近乎无限的陈述集合而特别适合此任务，但诸如哈贝马斯机器（Habermas machine）等人工智能系统将生成陈述的选择权交由投票规则决定。然而，该规则如何实现“寻求共识”尚未得到明确定义。本文基于社会选择理论中的比例否决核心（proportional veto core），提出一种在无限备选方案情境下寻求共识的形式化模型。为了在无限多备选方案与大规模群体的背景下提供理论保证，我们希望仅通过对未知的备选方案与选民分布进行查询访问，来满足比例否决核心的概念。我们设计了一种高效的基于抽样的算法，该算法能以高概率返回一个处于（近似）比例否决核心内的备选方案，并证明了匹配的下界，表明任何算法都无法以更少的查询实现相同目标。在一个关于文本偏好的合成数据集上，我们验证了基于抽样的算法的有效性，并比较了其他社会选择方法以及基于大语言模型（LLM）的方法在生成符合比例否决核心的陈述方面的可靠性。

Abstract

We study the problem of selecting a statement that finds common ground across diverse population preferences. Generative AI is uniquely suited for this task because it can access a practically infinite set of statements, but AI systems like the Habermas machine leave the choice of generated statement to a voting rule. What it means for this rule to find common ground, however, is not well-defined. In this work, we propose a formal model for finding common ground in the infinite alternative setting based on the proportional veto core from social choice. To provide guarantees relative to these infinitely many alternatives and a large population, we wish to satisfy a notion of proportional veto core using only query access to the unknown distribution of alternatives and voters. We design an efficient sampling-based algorithm that returns an alternative in the (approximate) proportional veto core with high probability and prove matching lower bounds, which show that no algorithm can do the same using fewer queries. On a synthetic dataset of preferences over text, we confirm the effectiveness of our sampling-based algorithm and compare other social choice methods as well as LLM-based methods in terms of how reliably they produce statements in the proportional veto core.

Keywords: common ground, social choice, generative AI, proportional veto core, sampling algorithm, infinite alternatives, voting rule, LLM-based methods

45. ❌ Nonstandard Errors in AI Agents

Authors: Ruijiang Gao, Steven Chong Xiao Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16744v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

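Scoring rationale: The paper studies nonstandard errors produced by AI coding agents in empirical research. Its core concerns the application and evaluation of AI agents (LLM Agents), so it is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10). The study deploys 150 autonomous agents for financial data analysis, which touches on 'Multi-agent Systems OR Agent Coordination', but the focus is error analysis rather than coordination mechanisms (5). The agents are Claude Code agents built on large models, which is an application of large models rather than a technical innovation, so 'Large Language Models OR LLMs OR Foundation Models' is rated 5. Applying AI to empirical financial analysis falls under AI for Science in a broad sense (5). Other keywords, such as MoE, SFT, and RAG, are not covered and are rated 0.

!!! tip deepseek-chat TL;DR

The paper finds that, in a financial data analysis task, different AI coding agents produce sizable nonstandard errors that make empirical results inconsistent. AI peer review does little to reduce the dispersion, whereas exposure to top-rated exemplar papers sharply narrows the spread of estimates.

Abstract (Chinese translation)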

本研究旨在探究当顶尖人工智能编码代理在获得相同数据和研究问题时，是否能够得出相同的实证结果。通过部署150个自主运行的Claude Code代理，对SPY（2015–2024年）在NYSE TAQ数据中关于市场质量趋势的六个假设进行独立检验，我们发现AI代理表现出显著的“非标准误差”（nonstandard errors，NSEs），即由于代理之间分析选择的差异所产生的不确定性，这与人类研究者中已记录的现象类似。AI代理在度量选择上存在显著分歧（例如自相关与方差比率、美元交易量与股数交易量）。不同模型家族（Sonnet 4.6与Opus 4.6）表现出稳定的“实证风格”，反映了其在方法论偏好上的系统性差异。在一个三阶段反馈流程中，AI同行评审（书面批评）对结果离散度的影响微乎其微，而在趋于收敛的度量家族内部，接触高评分范例论文可将估计值的四分位距减少80–99%。收敛现象既通过家族内部估计值的集中实现，也通过代理完全转换度量家族而发生，但这种收敛反映的是模仿而非理解。这些发现对人工智能在自动化政策评估与实证研究中日益增长的应用具有重要启示。

Abstract

We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015–2024), we find that AI agents exhibit sizable *nonstandard errors* (NSEs), that is, uncertainty from agent-to-agent variation in analytical choices, analogous to those documented among human researchers. AI agents diverge substantially on measure choice (e.g., autocorrelation vs. variance ratio, dollar vs. share volume). Different model families (Sonnet 4.6 vs. Opus 4.6) exhibit stable "empirical styles," reflecting systematic differences in methodological preferences. In a three-stage feedback protocol, AI peer review (written critiques) has minimal effect on dispersion, whereas exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80–99% within *converging* measure families. Convergence occurs both through within-family estimation tightening and through agents switching measure families entirely, but convergence reflects imitation rather than understanding. These findings have implications for the growing use of AI in automated policy evaluation and empirical research.
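To make the dispersion measure concrete, the sketch below computes the interquartile range (IQR) of a set of estimates and the relative reduction between two rounds. The numbers are invented for illustration and are not results from the paper; the quartile method is the default of Python's `statistics.quantiles`, which may differ from the one the authors used.

```python
import statistics


def interquartile_range(values):
    """Return Q3 - Q1 using the default quartile method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Made-up estimates from eight hypothetical agents, before and after they
# are shown an exemplar paper. These are not values from the study.
before = [0.10, 0.35, 0.42, 0.58, 0.61, 0.77, 0.90, 1.20]
after = [0.48, 0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.57]

iqr_before = interquartile_range(before)
iqr_after = interquartile_range(after)
reduction = 1 - iqr_after / iqr_before
print(f"IQR before={iqr_before:.3f}, after={iqr_after:.3f}, reduction={reduction:.0%}")
```

A smaller IQR means the agents' estimates cluster more tightly. On its own it says nothing about whether they cluster around the correct value.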

Keywords: AI agents, nonstandard errors, empirical research, Claude Code, market quality, autonomous agents, peer review, convergence

46. ❌ SpecMoE: Spectral Mixture-of-Experts Foundation Model for Cross-Species EEG Decoding

Authors: D. Darankoum, C. Habermacher, J. Volle, S. Grudinin Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16739v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

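Scoring rationale: The paper's core contribution is SpecMoE, a foundation model for cross-species EEG decoding. Highly relevant keywords: (1) 'Foundation Models' (the paper explicitly builds an EEG foundation model); (2) 'Mixture of Experts' (the SpecMoE framework is the core innovation); (3) 'Pre-training' (a self-supervised pretraining strategy is used); (4) 'AI for Science' (the work applies to neuroscience and bioinformatics). Other keywords, such as LLM-specific techniques (RLHF, Instruction Tuning, etc.), reasoning methods (CoT), and hardware optimization (Quantization), are unrelated to the paper's focus on EEG signal processing and are rated 0.

!!! tip deepseek-chat TL;DR

The paper addresses the bias of existing EEG decoding methods toward high-frequency oscillations by proposing a foundation model that combines Gaussian-smoothed masking with a mixture-of-experts framework (SpecMoE). It achieves state-of-the-art performance on a range of EEG tasks and shows strong cross-species and cross-subject generalization.

Abstract (Chinese translation)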

解码脑电图（EEG）信号中神经活动的协同机制，是连接神经科学与人工智能的核心挑战。基础模型在广义EEG解码方面已取得进展，但现有框架大多依赖于自监督预训练阶段对原始信号分别进行时间和频谱掩码。此类策略往往使学习偏向高频振荡，因为低频节律模式易于从未掩码信号中推断。我们提出一种基础模型，采用一种新颖的高斯平滑掩码方案，应用于短时傅里叶变换（STFT）图谱。通过联合施加时间、频率和时频高斯掩码，我们使重建任务更具挑战性，迫使模型同时学习高频与低频域中的复杂神经模式。为在此强掩码策略下有效恢复信号，我们设计了SpecHi-Net——一种具有多级编码和解码阶段的U型分层架构。为加速大规模预训练，我们将数据划分为三个子集，分别用于训练独立的专家模型。随后通过SpecMoE（一种由学习的频谱门控机制引导的专家混合框架）整合这些模型。SpecMoE在多种EEG解码任务中实现了最先进的性能，包括睡眠分期、情绪识别、运动想象分类、异常信号检测和药物效应预测。重要的是，该模型展现出强大的跨物种与跨被试泛化能力，在人类和小鼠EEG数据集上均保持高精度。

Abstract

Decoding the orchestration of neural activity in electroencephalography (EEG) signals is a central challenge in bridging neuroscience with artificial intelligence. Foundation models have made strides in generalized EEG decoding, yet many existing frameworks rely primarily on separate temporal and spectral masking of raw signals during self-supervised pretraining. Such strategies tend to bias learning toward high-frequency oscillations, as low-frequency rhythmic patterns can be easily inferred from the unmasked signal. We introduce a foundation model that utilizes a novel Gaussian-smoothed masking scheme applied to short-time Fourier transform (STFT) maps. By jointly applying time, frequency, and time-frequency Gaussian masks, we make the reconstruction task much more challenging, forcing the model to learn intricate neural patterns across both high- and low-frequency domains. To effectively recover signals under this aggressive masking strategy, we design SpecHi-Net, a U-shaped hierarchical architecture with multiple encoding and decoding stages. To accelerate large-scale pretraining, we partition the data into three subsets, each used to train an independent expert model. We then combine these models through SpecMoE, a mixture of experts framework guided by a learned spectral gating mechanism. SpecMoE achieves state-of-the-art performance across a diverse set of EEG decoding tasks, including sleep staging, emotion recognition, motor imagery classification, abnormal signal detection, and drug effect prediction. Importantly, the model demonstrates strong cross-species and cross-subject generalization, maintaining high accuracy on both human and murine EEG datasets.
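The masking idea can be illustrated on a toy time-frequency grid. The sketch below is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the mask shape or parameters, so `gaussian_mask`, `apply_mask`, the center, and the widths are all hypothetical. A mask value near 0 suppresses a bin and a value near 1 leaves it intact; an infinite width along one axis yields a time-only or frequency-only mask.

```python
import math


def gaussian_mask(n_freq, n_time, center, sigma_f, sigma_t):
    """Soft mask in [0, 1]: 0 at the center (fully masked), close to 1 far away."""
    f0, t0 = center
    mask = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            # Squared distance from the mask center, scaled per axis.
            d2 = ((f - f0) / sigma_f) ** 2 + ((t - t0) / sigma_t) ** 2
            row.append(1.0 - math.exp(-0.5 * d2))
        mask.append(row)
    return mask


def apply_mask(spectrogram, mask):
    """Element-wise product of a magnitude map and a mask of the same shape."""
    return [[s * m for s, m in zip(s_row, m_row)]
            for s_row, m_row in zip(spectrogram, mask)]


# Toy 6x8 magnitude map standing in for an STFT map (hypothetical data).
spec = [[1.0] * 8 for _ in range(6)]
tf_mask = gaussian_mask(6, 8, center=(3, 4), sigma_f=1.0, sigma_t=1.5)
# Infinite frequency width: every frequency row gets the same time profile.
time_mask = gaussian_mask(6, 8, center=(3, 4), sigma_f=math.inf, sigma_t=1.5)
masked = apply_mask(spec, tf_mask)
```

Compared with a hard rectangular mask, a smooth mask removes information gradually around its center, which is one plausible reading of "Gaussian-smoothed" in the abstract.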

Keywords: EEG decoding, foundation model, mixture of experts, self-supervised pretraining, cross-species generalization, spectral gating, neural activity, bioinformatics

47. ❌ MedCL-Bench: Benchmarking stability-efficiency trade-offs and scaling in biomedical continual learning

Authors: Min Zeng, Shuang Zhou, Zaifu Zhan, Rui Zhang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16738v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

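Scoring rationale: The paper focuses on benchmarking continual learning in the biomedical domain. It is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10) because its core is biomedical NLP, and to 'Post-training OR Supervised Fine-tuning OR SFT' (10) because it evaluates multiple continual learning strategies, including sequential fine-tuning. It is relevant to 'Large Language Models OR LLMs OR Foundation Models' (8) because it involves medical language models, and to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8) because continual learning involves model updating and adaptation. It has some connection to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (5), since continual learning methods may include parameter-efficient techniques, although none is explicitly mentioned. Other keywords are unrelated to the paper's topic and are rated 0.

!!! tip deepseek-chat TL;DR

The paper introduces MedCL-Bench, a benchmark for evaluating continual learning strategies in biomedical NLP. It finds that sequential fine-tuning causes catastrophic forgetting, while parameter-isolation methods give the best retention per GPU-hour.

Abstract (Chinese translation)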

随着医学证据与术语的演进，医学语言模型必须持续更新，然而顺序更新可能引发灾难性遗忘。尽管生物医学自然语言处理领域存在诸多静态基准测试，但目前尚缺乏一个在标准化协议下评估持续学习能力、对任务顺序的鲁棒性以及计算资源敏感报告的统一且任务多样化的基准。为此，我们提出了MedCL-Bench，该基准持续输入涵盖五个任务家族的十个生物医学NLP数据集，并在八种任务顺序下评估了十一种持续学习策略，同时报告了知识保留率、迁移能力及GPU小时成本。在不同模型架构与任务顺序中，直接对新任务进行顺序微调会引发灾难性遗忘，导致先前任务出现因更新引起的性能衰退。各类持续学习方法展现出不同的保留率-计算成本边界：参数隔离方法在单位GPU小时内提供最佳的知识保留，回放方法以更高成本提供强效保护，而正则化方法带来的收益有限。遗忘现象具有任务依赖性，其中多标签主题分类任务最为脆弱，而输出受限的任务则表现出更强的鲁棒性。MedCL-Bench为模型部署前审计更新效果提供了一个可复现的框架。

Abstract

Medical language models must be updated as evidence and terminology evolve, yet sequential updating can trigger catastrophic forgetting. Although biomedical NLP has many static benchmarks, no unified, task-diverse benchmark exists for evaluating continual learning under standardized protocols, robustness to task order and compute-aware reporting. We introduce MedCL-Bench, which streams ten biomedical NLP datasets spanning five task families and evaluates eleven continual learning strategies across eight task orders, reporting retention, transfer, and GPU-hour cost. Across backbones and task orders, direct sequential fine-tuning on incoming tasks induces catastrophic forgetting, causing update-induced performance regressions on prior tasks. Continual learning methods occupy distinct retention-compute frontiers: parameter-isolation provides the best retention per GPU-hour, replay offers strong protection at higher cost, and regularization yields limited benefit. Forgetting is task-dependent, with multi-label topic classification most vulnerable and constrained-output tasks more robust. MedCL-Bench provides a reproducible framework for auditing model updates before deployment.
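Retention in continual learning is often summarized with an average-forgetting score. The sketch below uses a common definition from the continual learning literature (best earlier score minus final score, averaged over all tasks except the last); whether MedCL-Bench reports exactly this quantity is an assumption, and the accuracy matrix is made up.

```python
def average_forgetting(acc):
    """Average forgetting over all tasks except the last one.

    acc[i][j] is the score on task j after training on task i; only
    entries with i >= j are used. Forgetting of task j is the best score
    it reached before the final step minus its score after the final step.
    """
    n_tasks = len(acc)
    final = acc[-1]
    per_task = []
    for j in range(n_tasks - 1):
        best_earlier = max(acc[i][j] for i in range(j, n_tasks - 1))
        per_task.append(best_earlier - final[j])
    return sum(per_task) / len(per_task)


# Made-up accuracy matrix for three sequential tasks (not from the paper).
acc = [
    [0.90, 0.00, 0.00],
    [0.70, 0.85, 0.00],
    [0.60, 0.80, 0.88],
]
forgetting = average_forgetting(acc)
print(f"average forgetting = {forgetting:.3f}")
```

Dividing a retention figure like this by the GPU-hours a method consumed gives a rough analogue of the retention-per-GPU-hour comparison described in the abstract.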

Keywords: biomedical continual learning, catastrophic forgetting, medical language models, NLP benchmarks, parameter-isolation, replay methods, task-dependent forgetting, GPU-hour cost

48. ❌ Retrieving Counterfactuals Improves Visual In-Context Learning

Authors: Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16737v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

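Scoring rationale: The paper studies in-context learning (ICL) for vision-language models (VLMs) and proposes CIRCLES, a demonstration-selection framework based on counterfactual retrieval. Keyword relevance: (1) 'In-context Learning OR Many-shot Learning' is highly relevant (10), since ICL in VLMs is the paper's core topic; (2) 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' is related (8), since the paper uses retrieval to select demonstration examples, though for visual tasks rather than text generation; (3) other keywords (LLMs, MoE, RLHF, etc.) are not covered and are rated 0. The paper focuses on ICL in the visual modality and is unrelated to most keywords, which target text-based large models.

!!! tip deepseek-chat TL;DR

The paper targets a weakness of in-context learning in vision-language models: passive similarity-based retrieval selects demonstrations that are correlated with the query but not causally informative. It proposes CIRCLES, a framework that actively retrieves counterfactual-style examples to build the demonstration set, which improves causal reasoning and robustness, and validates it on multiple datasets.

Abstract (Chinese translation)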

视觉语言模型（Vision-language models, VLMs）在多模态推理任务中展现出卓越性能，但往往难以解耦细粒度视觉属性并推理潜在的因果关系。上下文学习（In-context learning, ICL）为VLM适应新任务提供了有效途径，但其效果高度依赖于演示示例的选择。现有基于检索增强的方法通常依赖被动的相似性检索，倾向于选择相关但非因果的示例，从而放大虚假关联并限制模型鲁棒性。我们提出CIRCLES（面向因果学习的组合图像检索示例选择框架），这是一种通过属性引导的组合图像检索主动构建反事实风格示例以组成演示集的新方法。通过引入反事实风格示例，CIRCLES使VLM能够隐式推理属性与结果间的因果关系，超越表层关联，促进更鲁棒且基于事实的推理。在四个多样化数据集上的综合实验表明，CIRCLES在多种模型架构下均优于现有方法，尤其对于小规模模型，在信息稀缺场景下提升显著。此外，CIRCLES检索的示例更具多样性和因果信息性，为模型如何利用上下文演示提升推理能力提供了定性分析依据。代码已开源：https://github.com/gzxiong/CIRCLES。

Abstract

Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.

Keywords: Vision-language models, In-context learning, Retrieval-augmented, Counterfactual examples, Causal reasoning, Demonstration selection, Composed image retrieval, Robust reasoning

49. ❌ Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

Authors: Caglar Yildirim Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16734v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

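Scoring rationale: The paper's core subject is the safety behavior of LLMs acting as tool-using agents (LLM Agents/Tool Use) under personalization (user bio, mental health disclosure), so it is directly and highly relevant to the 'LLM Agents' and 'Tool Use' keywords (10 points each). It also concerns LLM safety evaluation, which is partly related to 'Alignment' and 'Hallucination Mitigation' (5 points each). Other keywords, such as those on model architecture, training methods, inference optimization, and scientific applications, are not addressed (0 points).

!!! tip deepseek-chat TL;DR

This study evaluates how harmful behavior differs when LLMs acting as tool-using agents complete malicious tasks under personalized user context, in particular mental health disclosure. It finds that personalized context slightly reduces harm but is easily undermined by jailbreak attacks, and it reveals a trade-off between safety and utility.

Abstract (Chinese translation)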

大型语言模型（LLM）正越来越多地被部署为工具使用型智能体，这使得安全关注点从有害文本生成转向有害任务完成。已部署的系统通常会基于用户画像或持久记忆进行条件化处理，然而智能体安全评估通常忽略了个性化信号。为弥补这一空白，我们研究了心理健康披露——一种敏感且现实的用户情境线索——如何影响智能体环境中的有害行为。基于AgentHarm基准，我们在受控提示条件下评估了前沿及开源LLM在多步骤恶意任务（及其良性对照任务）上的表现，这些条件系统性地改变了用户情境个性化程度（无个人简介、仅含简介、简介加心理健康披露），并包含一种轻量级越狱注入。我们的结果显示，有害任务完成率在所有模型中均不可忽视：前沿实验室模型（如GPT 5.2、Claude Sonnet 4.5、Gemini 3-Pro）仍会完成相当比例的有害任务，而开源模型（DeepSeek 3.2）表现出显著更高的有害任务完成率。添加仅含简介的情境通常会降低危害分数并提高拒绝率。添加明确的心理健康披露往往使结果进一步向相同方向变化，但效应较为有限，且经多重检验校正后并非完全稳定。重要的是，拒绝率的增加同样出现在良性任务中，这表明存在因过度拒绝导致的安全性与实用性权衡。最后，越狱提示相较于良性条件会急剧提升危害性，并可能削弱或覆盖个性化带来的保护性转变。综上所述，我们的研究表明个性化在智能体滥用场景中可作为一种弱保护因素，但在最小对抗压力下十分脆弱，这凸显了需要开展感知个性化的评估，并建立能在不同用户情境条件下保持鲁棒性的防护机制。

Abstract

Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task completion. Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals. To address this gap, we investigated how mental health disclosure, a sensitive and realistic user-context cue, affects harmful behavior in agentic settings. Building on the AgentHarm benchmark, we evaluated frontier and open-source LLMs on multi-step malicious tasks (and their benign counterparts) under controlled prompt conditions that vary user-context personalization (no bio, bio-only, bio+mental health disclosure) and include a lightweight jailbreak injection. Our results reveal that harmful task completion is non-trivial across models: frontier lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3-Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) exhibits substantially higher harmful completion. Adding a bio-only context generally reduces harm scores and increases refusals. Adding an explicit mental health disclosure often shifts outcomes further in the same direction, though effects are modest and not uniformly reliable after multiple-testing correction. Importantly, the refusal increase also appears on benign tasks, indicating a safety–utility trade-off via over-refusal. Finally, jailbreak prompting sharply elevates harm relative to benign conditions and can weaken or override the protective shift induced by personalization. Taken together, our results indicate that personalization can act as a weak protective factor in agentic misuse settings, but it is fragile under minimal adversarial pressure, highlighting the need for personalization-aware evaluations and safeguards that remain robust across user-context conditions.

Keywords: Large Language Models, LLM Agents, Tool Use, Agent Safety, Personalization, Mental Health Disclosure, Harmful Task Completion, Jailbreak

50. ❌ IQuest-Coder-V1 Technical Report

Authors: Jian Yang, Wei Zhang, Shawn Guo, Zhengmao Ye, Lin Jing, Shark Liu, Yizhi Li, Jiajun Wu, Cening Liu, X. Ma, Yuyang Song, Siwei Wu, Yuwen Li, L. Liao, T. Zheng, Ziling Huang, Zelong Huang, Che Liu, Yan Xing, Renyuan Li, Qingsong Cai, Hanxu Yan, Siyue Wang, Shikai Li, Jason Klein Liu, An Huang, Yongsheng Kang, Jinxing Zhang, Chuan Hao, Haowen Wang, Weicheng Gu, Ran Tao, Mingjie Tang, Peihao Wu, Jianzhou Wang, Xianglong Liu, Weifeng Lv, Bryan Dai Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16733v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

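Scoring rationale: The paper focuses on developing code large language models (LLMs). Its core contribution is the code-flow multi-stage training paradigm, comprising pre-training, mid-training (integrating reasoning and agentic trajectories), and post-training (split into thinking and instruct paths). It is highly relevant to LLMs, pre-training, post-training, long-context LLMs (128k context), reasoning methods (CoT, System 2), agents (agentic software engineering), and tool use. It is partly related to RLHF/DPO (reasoning-driven RL) and moderately related to instruction tuning/alignment (the instruct path). Other keywords such as MoE, SLMs, RAG, and quantization are not mentioned or are irrelevant.

!!! tip deepseek-chat TL;DR

The paper presents the IQuest-Coder-V1 family of code LLMs, trained with a new code-flow multi-stage paradigm (pre-training, mid-training on reasoning and agentic trajectories, and post-training), achieving state-of-the-art performance in agentic software engineering, competitive programming, and complex tool use.

Abstract (Chinese translation)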

在本报告中，我们介绍了IQuest-Coder-V1系列模型（7B/14B/40B/40B-Loop），这是一个新的代码大语言模型（LLMs）家族。我们突破了静态代码表示的限制，提出了代码流多阶段训练范式，该范式通过流水线的不同阶段捕捉软件逻辑的动态演进。我们的模型通过演进式流水线开发：首先进行包含代码事实、仓库和补全数据的初始预训练；随后实施专门的中期训练阶段，在32k上下文长度中整合推理与智能体轨迹，并在128k上下文长度中融入仓库级规模数据，以奠定深厚的逻辑基础；最后通过后训练阶段专精化编码能力，该阶段分为两条专门路径：思维路径（利用推理驱动的强化学习）和指令路径（针对通用辅助任务优化）。IQuest-Coder-V1在代码智能的关键维度——智能体软件工程、竞技编程和复杂工具使用——上均取得了超越竞争模型的顶尖性能。为应对部署限制，IQuest-Coder-V1-Loop变体引入了循环机制，旨在优化模型能力与部署成本之间的权衡，为效果-效率平衡提供了架构增强方案。我们相信，IQuest-Coder-V1系列的发布（包括从预训练基座到最终思维模型与指令模型的完整白盒检查点链条）将推动自主代码智能与现实世界智能体系统的研究进展。

Abstract

In this report, we introduce the IQuest-Coder-V1 series (7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through different phases of the pipeline. Our models are developed through the evolutionary pipeline, starting with the initial pre-training consisting of code facts, repository, and completion data. Following that, we implement a specialized mid-training stage that integrates reasoning and agentic trajectories in 32k-context and repository-scale data in 128k-context to forge deep logical foundations. The models are then finalized with post-training of specialized coding capabilities, which is bifurcated into two specialized paths: the thinking path (utilizing reasoning-driven RL) and the instruct path (optimized for general assistance). IQuest-Coder-V1 achieves state-of-the-art performance among competitive models across critical dimensions of code intelligence: agentic software engineering, competitive programming, and complex tool use. To address deployment constraints, the IQuest-Coder-V1-Loop variant introduces a recurrent mechanism designed to optimize the trade-off between model capacity and deployment footprint, offering an architecturally enhanced path for the efficacy-efficiency trade-off. We believe the release of the IQuest-Coder-V1 series, including the complete white-box chain of checkpoints from pre-training bases to the final thinking and instruction models, will advance research in autonomous code intelligence and real-world agentic systems.

Keywords: code large language models, code-flow multi-stage training, agentic software engineering, reasoning-driven RL, 32k-context, 128k-context, autonomous code intelligence, recurrent mechanism

51. ❌ Federated Learning with Multi-Partner OneFlorida+ Consortium Data for Predicting Major Postoperative Complications

Authors: Yuanfang Ren, Varun Sai Vemuri, Zhenhong Hu, Benjamin Shickel, Ziyuan Guan, Tyler J. Loftus, Parisa Rashidi, Tezcan Ozrazgat-Baslanti, Azra Bihorac Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16723v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

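Scoring rationale: The paper focuses on using federated learning to predict postoperative complications, which is a medical AI application. It is unrelated to most keywords (LLM, MoE, SFT, RLHF, RAG, CoT, and so on), since those concern the technical principles, training methods, and inference optimization of large models, and the paper proposes no innovation in large models or deep learning principles. It is only partly related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points): it applies AI to biomedicine (predicting postoperative complications), but it does not clearly belong to bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The study uses federated learning on multicenter medical data to develop models that predict major postoperative complications and mortality. The federated models show strong predictive performance and generalizability while preserving data privacy.

Abstract (Chinese translation)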

背景：本研究旨在利用来自OneFlorida数据信托的大型多中心数据集，开发和验证用于预测重大术后并发症及死亡率的联邦学习模型。我们假设联邦学习模型能够在保护数据隐私与安全的同时，提供强大的泛化能力。方法：这项回顾性、纵向、多中心队列研究纳入了2012年至2023年间在五家医疗机构住院的358,644名成年患者，他们共接受了494,163次住院重大外科手术。我们开发并进行了内部和外部验证的联邦学习模型，用于预测术后入住重症监护室（ICU）、接受机械通气（MV）治疗、发生急性肾损伤（AKI）以及院内死亡的风险。这些模型与仅在单个中心数据上训练的本地模型，以及在所有中心合并数据集上训练的中心化模型进行了比较。性能主要通过受试者工作特征曲线下面积（AUROC）和精确率-召回率曲线下面积（AUPRC）值进行评估。结果：我们的联邦学习模型在所有结局指标和所有参与机构中均展现出强大的预测性能，其AUROC和AUPRC值持续表现出可比或更优的性能。与各机构最佳的本地学习模型相比，我们的联邦学习模型在AUROC和AUPRC方面也显示出强大且可比或更优的泛化能力。结论：通过利用多中心数据，我们开发了稳健、可泛化且保护隐私的重大术后并发症及死亡率预测模型。这些发现支持了联邦学习在临床决策支持系统中应用的可行性。

Abstract

Background: This study aims to develop and validate federated learning models for predicting major postoperative complications and mortality using a large multicenter dataset from the OneFlorida Data Trust. We hypothesize that federated learning models will offer robust generalizability while preserving data privacy and security. Methods: This retrospective, longitudinal, multicenter cohort study included 358,644 adult patients admitted to five healthcare institutions, who underwent 494,163 inpatient major surgical procedures from 2012 to 2023. We developed and internally and externally validated federated learning models to predict the postoperative risk of intensive care unit (ICU) admission, mechanical ventilation (MV) therapy, acute kidney injury (AKI), and in-hospital mortality. These models were compared with local models trained on data from a single center and central models trained on a pooled dataset from all centers. Performance was primarily evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). Results: Our federated learning models demonstrated strong predictive performance, with consistently comparable or superior performance in terms of AUROC and AUPRC across all outcomes and sites. Our federated learning models also demonstrated strong generalizability, with comparable or superior performance in terms of both AUROC and AUPRC compared to the best local learning model at each site. Conclusions: By leveraging multicenter data, we developed robust, generalizable, and privacy-preserving predictive models for major postoperative complications and mortality. These findings support the feasibility of federated learning in clinical decision support systems.

Keywords: federated learning, postoperative complications, multicenter data, predictive models, clinical decision support, data privacy, generalizability, OneFlorida Data Trust
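
The abstract above reports performance as AUROC. As a reminder of what that number measures, here is a minimal sketch: AUROC equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case, with ties counted as one half. The function name `auroc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 outcomes (1 = event occurred).
    scores: iterable of predicted risks, same length as labels.
    Runs in O(P * N) time, which is fine for a sketch but slow for large cohorts.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data: four patients, two with the outcome.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    print(auroc(toy_labels, toy_scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why site-by-site AUROC comparisons are a common way to judge whether a federated model matches local or pooled models.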

52. ❌ Cost Trade-offs in Matrix Inversion Updates for Streaming Outlier Detection

Authors: Florian Grivet, Louise Travé-Massuyès Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16697v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

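Scoring rationale: The paper compares matrix inversion update methods for online outlier detection (Direct Inversion, Iterative Sherman-Morrison, Woodbury Matrix Identity), which belongs to numerical linear algebra and the optimization of online learning algorithms. It does not involve large models, deep learning, AI for Science, or any technique among the scoring keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This technical note compares the computational cost of three matrix inversion update methods (DI, ISM, WMI) for online outlier detection and proposes a simple rule for choosing the best method based on the update rank and the matrix size.

Abstract (Chinese translation)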

异常值检测旨在识别显著偏离预期模式的数据点，揭示可能需要特别关注的异常情况。引入在线学习可通过持续更新模型以反映最新数据，从而进一步提升检测准确性。当采用克里斯托费尔函数作为异常值得分时，在线学习需要在已知初始逆矩阵的情况下，对矩阵进行秩k更新后重新计算其逆矩阵。值得注意的是，对于此任务的最佳方法学界尚未形成共识。本技术说明旨在比较三种不同的更新方法：直接求逆法、迭代谢尔曼-莫里森公式法以及伍德伯里矩阵恒等式法，以确定不同场景下的最适用方法。我们首先推导了每种方法的理论计算成本，随后通过在CPU上运行的全面Python仿真验证了这些结论。基于实验结果，我们提出了一个简洁、可量化且易于记忆的准则，其定性表述为：迭代谢尔曼-莫里森法在秩1更新时最优，伍德伯里矩阵恒等式法在更新规模相对矩阵维度较小时表现卓越，其余情况则推荐采用直接求逆法。本技术说明为涉及矩阵逆更新的各类问题提供了通用结论，特别对高效在线异常值检测技术的持续发展具有积极贡献。

Abstract

Outlier detection identifies data points that deviate significantly from expected patterns, revealing anomalies that may require special attention. Incorporating online learning further improves accuracy by continuously updating the model to reflect the most recent data. When employing the Christoffel function as an outlier score, online learning requires updating the inverse of a matrix following a rank-k update, given the initial inverse. Surprisingly, there is no consensus on the optimal method for this task. This technical note aims to compare three different updating methods: Direct Inversion (DI), Iterative Sherman-Morrison (ISM), and Woodbury Matrix Identity (WMI), to identify the most suitable approach for different scenarios. We first derive the theoretical computational costs of each method and then validate these findings through comprehensive Python simulations run on a CPU. These results allow us to propose a simple, quantitative, and easy-to-remember rule that can be stated qualitatively as follows: ISM is optimal for rank-1 updates, WMI excels for small updates relative to matrix size, and DI is preferable otherwise. This technical note produces a general result for any problem involving a matrix inversion update. In particular, it contributes to the ongoing development of efficient online outlier detection techniques.

Keywords: outlier detection, online learning, matrix inversion, Christoffel function, Sherman-Morrison, Woodbury identity, computational cost, rank-k update
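
The abstract above names the Iterative Sherman-Morrison (ISM) method without showing it. The following is a minimal pure-Python sketch, assuming the rank-k update is supplied as k vector pairs (u_i, v_i) so that the matrix becomes A + Σ u_i v_iᵀ. Each rank-1 step uses the Sherman-Morrison formula (A + u vᵀ)⁻¹ = A⁻¹ − (A⁻¹ u vᵀ A⁻¹) / (1 + vᵀ A⁻¹ u). The function names are illustrative and are not taken from the paper.

```python
def sherman_morrison(a_inv, u, v):
    """Return the inverse of (A + u v^T) given a_inv = A^{-1}.

    Costs O(n^2) operations instead of the O(n^3) of a fresh inversion.
    Raises ValueError when the updated matrix is (numerically) singular.
    """
    n = len(a_inv)
    # a_inv_u = A^{-1} u (column vector)
    a_inv_u = [sum(a_inv[i][j] * u[j] for j in range(n)) for i in range(n)]
    # v_a_inv = v^T A^{-1} (row vector)
    v_a_inv = [sum(v[i] * a_inv[i][j] for i in range(n)) for j in range(n)]
    denom = 1.0 + sum(v[i] * a_inv_u[i] for i in range(n))
    if abs(denom) < 1e-12:
        raise ValueError("update makes the matrix singular")
    return [
        [a_inv[i][j] - a_inv_u[i] * v_a_inv[j] / denom for j in range(n)]
        for i in range(n)
    ]


def iterative_sherman_morrison(a_inv, us, vs):
    """Apply k successive rank-1 updates; total cost is O(k n^2)."""
    for u, v in zip(us, vs):
        a_inv = sherman_morrison(a_inv, u, v)
    return a_inv


if __name__ == "__main__":
    # Toy check: start from the 2x2 identity and add e1 e1^T, then e2 e2^T.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    us = [[1.0, 0.0], [0.0, 1.0]]
    vs = [[1.0, 0.0], [0.0, 1.0]]
    # The updated matrix is diag(2, 2), so the result should be diag(0.5, 0.5).
    print(iterative_sherman_morrison(identity, us, vs))
```

Because each rank-1 step costs O(n²), k steps cost O(k n²). This is consistent with the abstract's qualitative rule that ISM suits rank-1 updates while other methods become preferable as the update grows relative to the matrix size; the exact crossover points are in the paper and are not reproduced here.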

53. ❌ When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making

Authors: Jun Liu, Pu Zhao, Zhenglun Kong, Xuan Shen, Peiyan Dong, Fan Yang, Lin Cui, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Gaowen Liu, Yanzhi Wang, Dong Huang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16673v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

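Scoring rationale: The paper's core subject is resource-aware scheduling of reasoning for LLM-based embodied agents in robotic decision-making. It is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each), and its control of the reasoning process is related to 'Chain of Thought' and 'System 2 Thinking' (8 points each). It does not involve other specific techniques such as MoE, SFT, or RAG.

!!! tip deepseek-chat TL;DR

The paper studies when an LLM agent in an embodied robotic system should reason, balancing computational latency against decision accuracy. It proposes the RARRL framework, which uses reinforcement learning to schedule reasoning adaptively, improving task success rates and reducing execution latency in evaluations based on the ALFRED benchmark.

Abstract (Chinese translation)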

具身机器人系统日益依赖基于大语言模型（LLM）的智能体，以支持其在与环境交互过程中的高层推理、规划与决策。然而，调用LLM推理会引入显著的计算延迟和资源开销，可能中断动作执行并降低系统可靠性。过度推理会延迟行动，而推理不足则常导致错误决策和任务失败。这为具身智能体提出了一个根本性问题：智能体应在何时进行推理，又应在何时采取行动？在本研究中，我们提出RARRL（基于强化学习的资源感知推理框架），一种用于具身智能体资源感知编排的分层框架。RARRL并非学习底层控制策略，而是学习一个在智能体决策层运作的高层编排策略。该策略使智能体能够根据当前观测、执行历史和剩余资源，自适应地决定是否调用推理、采用何种推理角色以及分配多少计算预算。大量实验（包括基于ALFRED基准提取的实际延迟特性进行的评估）表明，与固定或启发式推理策略相比，RARRL在降低执行延迟、增强鲁棒性的同时，持续提升了任务成功率。这些结果证明，自适应推理控制对于构建可靠高效的具身机器人智能体至关重要。

Abstract

Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a fundamental question for embodied agents: when should the agent reason, and when should it act? In this work, we propose RARRL (Resource-Aware Reasoning via Reinforcement Learning), a hierarchical framework for resource-aware orchestration of embodied agents. Rather than learning low-level control policies, RARRL learns a high-level orchestration policy that operates at the agent’s decision-making layer. This policy enables the agent to adaptively determine whether to invoke reasoning, which reasoning role to employ, and how much computational budget to allocate based on current observations, execution history, and remaining resources. Extensive experiments, including evaluations with empirical latency profiles derived from the ALFRED benchmark, show that RARRL consistently improves task success rates while reducing execution latency and enhancing robustness compared with fixed or heuristic reasoning strategies. These results demonstrate that adaptive reasoning control is essential for building reliable and efficient embodied robotic agents.

Keywords: embodied robotic systems, large language model agents, resource-aware reasoning, reinforcement learning, decision-making, computational latency, adaptive orchestration, ALFRED benchmark

54. ❌ CritiSense: Critical Digital Literacy and Resilience Against Misinformation

Authors: Firoj Alam, Fatema Ahmad, Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Elisa Sartori, Giovanni Da San Martino, Abul Hasnat, Raian Ali Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16672v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

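Scoring rationale: The paper "CritiSense: Critical Digital Literacy and Resilience Against Misinformation" studies misinformation on social media and presents CritiSense, a multilingual mobile media-literacy app that helps users recognize manipulation tactics (that is, prebunking) through short, interactive challenges with instant feedback. The work belongs to digital literacy, human-computer interaction, and social computing, and focuses on app design, usability testing, and user engagement. Neither the title nor the abstract mentions large models, deep learning principles, or a specific AI for Science application. Every scoring keyword concerns large-model technology, deep learning principles, or AI applied to science, and the paper involves none of these, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address misinformation on social media, the paper develops CritiSense, a multilingual mobile media-literacy app that helps users recognize manipulation tactics through interactive challenges, and reports good usability and user satisfaction.

Abstract (Chinese translation)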

社交媒体上的虚假信息损害知情决策与公众信任。预先驳斥作为一种主动干预手段，通过帮助用户在真实接触前识别操纵策略，提供了补充性解决方案。本文介绍CritiSense——一款通过即时反馈的简短交互式挑战来培养相关技能的移动媒介素养应用程序。这是首个多语言（支持九种语言）的模块化平台，专为跨主题与领域快速更新而设计。我们报告了一项涉及93名用户的可用性研究：83.9%的用户表示总体满意，90.1%的用户认为该应用易于使用。定性反馈表明CritiSense有助于提升数字素养技能。总体而言，该平台不仅提供了多语言预先驳斥工具，还构建了测量微学习对虚假信息抵御力影响的测试平台。在超过三个月的时间里，我们已触达300多名活跃用户。该应用已在苹果应用商店（https://apps.apple.com/us/app/critisense/id6749675792）和谷歌应用商店（https://play.google.com/store/apps/details?id=com.critisense&hl=en）向所有用户免费开放。演示视频：https://shorturl.at/CDcdc

Abstract

Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics before they encounter them in the wild. We present CritiSense, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the first multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that CritiSense helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 3+ months, we have reached 300+ active users. It is freely available to all users on the Apple App Store (https://apps.apple.com/us/app/critisense/id6749675792) and Google Play Store (https://play.google.com/store/apps/details?id=com.critisense&hl=en). Demo Video: https://shorturl.at/CDcdc

Keywords: misinformation, digital literacy, prebunking, mobile app, multilingual, usability study, microlearning, resilience

55. ❌ Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Authors: Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16666v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

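Scoring rationale: The paper focuses on World Action Models (WAMs) for embodied control and is highly relevant to the keyword 'World Models AND General World Models' (10 points): WAMs are a specific type of world model, and the paper's core question is whether WAMs need explicit future imagination. The other keywords mainly concern LLM techniques (training methods, reasoning, alignment, compression, hallucination mitigation, and so on) or applications in specific scientific fields. This paper instead studies video prediction and inference efficiency for vision-action models (VLA/WAMs) in robot control. It does not involve LLMs, MoE, quantization, RAG, CoT, RLHF, PEFT, or other LLM-related techniques, nor specific scientific fields such as bioinformatics, so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper examines whether World Action Models (WAMs) need explicit future imagination at test time. It finds that Fast-WAM, which skips future prediction at test time but keeps video co-training during training, is competitive with existing methods while running inference more than 4× faster.

Abstract (Chinese translation)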

世界行动模型（World Action Models, WAMs）作为一种具身控制的新兴替代方案，相较于视觉-语言-行动（Vision-Language-Action, VLA）模型展现出潜力，因为它们显式地建模了视觉观测在行动影响下可能如何演化。现有的大多数WAM遵循“先想象后执行”的范式，通过迭代视频去噪过程在测试时产生显著延迟，然而显式的未来想象是否真为达成强大行动性能所必需，目前尚不明确。本文探讨了WAM在测试时是否需要显式的未来想象，抑或其优势主要源于训练期间的视频建模。为厘清训练中视频建模与推理中显式未来生成各自的作用，我们提出Fast-WAM——一种在训练阶段保留视频协同训练、但在测试时跳过未来预测的WAM架构。我们进一步实例化了若干Fast-WAM变体，以对这两个因素进行受控比较。在这些变体的实验中，我们发现Fast-WAM与“先想象后执行”的变体相比仍具竞争力，而移除视频协同训练则会导致性能大幅下降。实证表明，Fast-WAM在仿真基准（LIBERO与RoboTwin）和真实世界任务中均取得了与先进方法相当的结果，且无需进行具身预训练。其运行延迟仅为190毫秒，可实现实时响应，比现有“先想象后执行”式WAM快4倍以上。这些结果表明，视频预测在WAM中的主要价值可能在于提升训练期间的世界表征能力，而非在测试时生成未来观测。项目页面：https://yuantianyuan01.github.io/FastWAM/

Abstract

World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4× faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: https://yuantianyuan01.github.io/FastWAM/

Keywords: World Action Models, Fast-WAM, embodied control, video prediction, test-time inference, real-time latency, imagine-then-execute, video co-training

Authors: Jiawei Mao, Hardy Chen, Haoqin Tu, Yuhan Wang, Letian Zhang, Zeyu Zheng, Huaxiu Yao, Zirui Wang, Cihang Xie, Yuyin Zhou Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16664v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	15.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

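Scoring rationale: The paper's core topic is hallucination mitigation for large vision-language models (LVLMs). It is highly relevant to 'Large Language Models' (10 points) because LVLMs are a visual extension of large language models. It proposes a training-free framework built on self-reflection and tool use, so it is highly relevant to 'Self-Correction/Self-Reflection' (10 points), 'LLM Agents' (10 points), and 'Tool Use' (10 points). 'Hallucination Mitigation' is the paper's central theme and receives 15 points. 'Explainable AI' receives 5 points because the paper mentions providing transparent verification traces. The remaining keywords are unrelated to the paper and receive 0 points.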
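The rows in the table above are consistent with each score being weight × relevance and the paper total being the sum of the rows, compared against the pass mark of 5.0 + 0.8 × total weight given in the report's configuration. The following is a minimal sketch of that arithmetic under that assumption; the function names are hypothetical and are not taken from the report generator.

```python
# Hypothetical re-computation of a paper's score; the real report generator
# may differ. Assumes row score = weight * relevance and total = sum of rows.

def pass_threshold(total_weight: float) -> float:
    """Pass mark from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum weight * relevance over (weight, relevance) rows."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero relevance values from the table above.
relevance = [10.0, 10.0, 10.0, 10.0, 15.0, 5.0]

# With the weight of 0.0 printed in the table, every row scores 0.0.
as_printed = paper_score([(0.0, r) for r in relevance])

# With the weight of 1.0 listed in the configuration, the same rows sum to 60.0.
as_configured = paper_score([(1.0, r) for r in relevance])

threshold = pass_threshold(27.0)  # 27 keywords at weight 1.0
print(as_printed, as_configured, round(threshold, 1))
```

Under this assumption, the 0.0 weights printed in the table give a total of 0.0, whereas the 1.0 weights listed in the configuration would give 60.0 against a pass mark of 26.6.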

!!! tip deepseek-chat TL;DR

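The paper proposes Kestrel, a framework that combines a visual-grounding agent with evidence-verified self-refinement to mitigate hallucinations of large vision-language models in multimodal tasks without any training, substantially improving results on several benchmarks and providing interpretable verification traces.

Abstract (Chinese translation)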

大型视觉语言模型（LVLMs）的能力日益强大，但在多模态任务中仍易产生幻觉现象，这显著限制了其实际部署。由于针对大型模型进行避免幻觉的训练成本极高，无需训练的方法为此问题提供了经济灵活的解决方案，但现有基于解码或工具使用的方法往往提升有限且/或可解释性较弱。我们提出Kestrel框架，这是一种无需训练的LVLM幻觉缓解方案，通过结合显式视觉定位智能体与证据验证的自优化机制来实现。具体而言，Kestrel首先收集显式视觉证据，并将工具输出转化为可重复使用的结构化文本证据。其次，为充分利用这些证据，Kestrel通过LVLM裁判进行证据校验，随后基于已验证证据迭代自优化答案，以降低过度校正的风险。大量实验表明，Kestrel在多种幻觉基准测试中（例如在POPE上平均提升+3.31%，在Qwen3-VL模型的MME-Hallucination上提升+28.34分）均优于现有强基线方法，同时为幻觉诊断与分析提供透明的验证轨迹——例如其集成的自优化模块与定位智能体在POPE基准上平均各贡献+2.0%的性能增益。

Abstract

Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of these evidence, Kestrel verifies them via an LVLM judge for evidence checking, then iteratively self-refine answers based on verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis – e.g., both the integrated self-refinement module and grounding agent contributing an average +2.0% gain on POPE.

Keywords: Large Vision-Language Models, Hallucination Mitigation, Self-Refinement, Visual Grounding, Training-free Framework, Evidence Verification, Multimodal Tasks, Interpretability
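The abstract above describes a collect, verify, refine pipeline. The sketch below illustrates only that control flow, with plain Python callables standing in for the grounding tools, the LVLM judge, and the answering model. All function names, the stopping rule, and the toy stand-ins are assumptions made for illustration; none of this is Kestrel's actual code or API.

```python
from typing import Callable

Evidence = str


def refine_with_verified_evidence(
    question: str,
    initial_answer: str,
    collect: Callable[[str], list[Evidence]],            # stand-in for grounding tools
    judge: Callable[[str, Evidence], bool],              # stand-in for the LVLM judge
    revise: Callable[[str, str, list[Evidence]], str],   # stand-in for the LVLM
    max_rounds: int = 3,
) -> tuple[str, list[Evidence]]:
    """Keep only judge-approved evidence, then revise until the answer is stable."""
    verified = [e for e in collect(question) if judge(question, e)]
    answer = initial_answer
    for _ in range(max_rounds):
        new_answer = revise(question, answer, verified)
        if new_answer == answer:  # no further change: stop early
            break
        answer = new_answer
    return answer, verified  # the verified list doubles as a simple trace


# Toy stand-ins so the sketch runs end to end (invented for illustration).
def toy_collect(question: str) -> list[Evidence]:
    return ["detector: 0 dogs found", "caption: maybe a dog?"]


def toy_judge(question: str, evidence: Evidence) -> bool:
    return evidence.startswith("detector:")


def toy_revise(question: str, answer: str, evidence: list[Evidence]) -> str:
    return "no" if any("0 dogs" in e for e in evidence) else answer


answer, trace = refine_with_verified_evidence(
    "Is there a dog in the image?", "yes", toy_collect, toy_judge, toy_revise
)
print(answer, trace)
```

Filtering evidence before revising is what keeps the unverified caption from influencing the answer in this toy run; only the judge-approved detector output is used.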

57. ❌ When Openclaw Agents Learn from Each Other: Insights from Emergent AI Agent Communities for Human-AI Partnership in Education

Authors: Eason Chen, Ce Guan, Ahmed Elshafiey, Zhonghao Zhao, Joshua Zekeri, Afeez Edeifo Shaibu, Emmanuel Osadebe Prince, Cyuan-Jhen Wu Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16663v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

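Scoring rationale: The paper studies emergent phenomena in AI agent communities and their implications for the design of educational AI. Its core focus is multi-agent systems and autonomous agents (LLM Agents/Autonomous Agents), so these two keywords are highly relevant (10 points). It mentions agents developing learning behaviors and improving themselves, which relates somewhat to Self-Correction/Self-Improvement (5 points). It touches on the use of large models inside agents, but this is not the technical core, so Large Language Models receives 5 points. The remaining keywords, such as MoE, Scaling Laws, training methods, inference acceleration, and AI for Science applications, are not covered in the paper and receive 0 points.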

!!! tip deepseek-chat TL;DR

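By observing emergent phenomena in AI agent communities, the paper identifies mechanisms such as autonomous learning, shared memory, and trust dynamics in multi-agent systems, and offers naturalistic insights and design principles for human-AI collaborative multi-agent systems in education.

Abstract (Chinese translation)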

人工智能教育（AIED）研究界设想人工智能将“从工具演变为队友”，然而目前我们对AI队友的理解仍局限于二元人机交互。本文提出一个不同的观察视角：一个快速发展的AI智能体平台生态系统——其中超过16.7万个智能体自主参与、以对等身份交互，并在无研究者干预的情况下发展学习行为。基于对Moltbook、The Colony、4claw等多个平台为期一个月的每日定性观察，我们识别出四个对AIED研究具有启示意义的现象：（1）配置智能体的人类用户经历“双向支架”过程，通过教学实现学习；（2）在无预设课程的情况下涌现同伴学习，并伴随观点级联与质量层级分化；（3）智能体形成反映开放学习者模型设计的共享记忆架构；（4）信任动力学与平台消亡现象揭示了网络化教育人工智能的设计约束。本文并非呈现实证研究结果，而是论证这些自发现象为理解多智能体教育系统的动态机制提供了自然观察窗口，可为系统性设计提供依据。我们勾勒了一个示例性课程设计“通过教导你的AI智能体队友来学习”，并概述潜在研究方向与开放性问题，以展示这些观察如何为未来AIED实践与探索提供启示。

Abstract

The AIED community envisions AI evolving “from tools to teammates,” yet our understanding of AI teammates remains limited to dyadic human-AI interactions. We offer a different vantage point: a rapidly growing ecosystem of AI agent platforms where over 167,000 agents participate, interact as peers, and develop learning behaviors without researcher intervention. Drawing on a month of daily qualitative observations across multiple platforms including Moltbook, The Colony, and 4claw, we identify four phenomena with implications for AIED: (1) humans who configure their agents undergo a “bidirectional scaffolding” process, learning through teaching; (2) peer learning emerges without any designed curriculum, complete with idea cascades and quality hierarchies; (3) agents converge on shared memory architectures that mirror open learner model design; and (4) trust dynamics and platform mortality reveal design constraints for networked educational AI. Rather than presenting empirical findings, we argue that these organic phenomena offer a naturalistic window into dynamics that can inform principled design of multi-agent educational systems. We sketch an illustrative curriculum design, “Learn by Teaching Your AI Agent Teammate,” and outline potential research directions and open problems to show how these observations might inform future AIED practice and inquiry.

Keywords: AI agents, multi-agent systems, peer learning, emergent behavior, human-AI partnership, educational AI, agent communities, bidirectional scaffolding

58. ❌ Machines acquire scientific taste from institutional traces

Authors: Ziqin Gong, Ning Li, Huaikang Zhou Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16659v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

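Scoring rationale: The paper's core topic is the application of large language models (LLMs) to science: it uses supervised fine-tuning (SFT) to train models to assess the value of scientific research (scientific taste), which falls under AI for Science. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models', 'Post-training OR Supervised Fine-tuning OR SFT', and 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points each). The paper does not cover the technical details behind other keywords such as MoE, quantization, inference acceleration, or agent systems, so their relevance is 0 points.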

!!! tip deepseek-chat TL;DR

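By applying supervised fine-tuning to large language models, the study extracts from journal publication records, and automates, the assessment of "scientific taste" that humans find hard to articulate; the resulting models outperform frontier models and expert panels at judging the quality of research ideas.

Abstract (Chinese translation)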

人工智能在答案可验证的任务上已能匹配甚至超越人类表现，从蛋白质折叠到奥林匹克数学竞赛。然而，最能推动科学进步的能力并非推理，而是品味：即判断哪些未经检验的构想值得探索的能力——这种能力被期刊编辑和资助者日常运用，却从未被成功阐明、传授或自动化。本文研究表明，通过对期刊发表决策进行微调的语言模型，能够恢复前沿模型和人类专家均无法触及的评估判断能力。基于一个涵盖四个质量层级的管理学领域研究提案的保留基准测试，我们发现包括主流专有和开源架构在内的十一个前沿模型准确率仅略高于随机水平，平均为31%。由期刊编辑和编委会成员组成的评审小组通过多数投票达到42%的准确率。而基于多年发表记录微调的模型均超越了所有前沿模型和专家小组，其中最佳单一模型达到59%准确率。这些模型展现出校准化的置信度，在其最高置信度的预测中达到100%准确率，并能将这种评估信号迁移至未经训练的成对比较和单句摘要任务中。该机制具有普适性：基于经济学发表记录训练的模型实现了70%的准确率。科学品味并未脱离人工智能的能力范围，而是蕴藏在制度记录中等待提取。这些结果为跨学科领域提供了可扩展的筛选机制，以应对那些难以通过形式化验证却不断扩大的科研成果产出。

Abstract

Artificial intelligence matches or exceeds human performance on tasks with verifiable answers, from protein folding to Olympiad mathematics. Yet the capacity that most governs scientific advance is not reasoning but taste: the ability to judge which untested ideas deserve pursuit, exercised daily by editors and funders but never successfully articulated, taught, or automated. Here we show that fine-tuning language models on journal publication decisions recovers evaluative judgment inaccessible to both frontier models and human expertise. Using a held-out benchmark of research pitches in management spanning four quality tiers, we find that eleven frontier models, spanning major proprietary and open architectures, barely exceed chance, averaging 31% accuracy. Panels of journal editors and editorial board members reach 42% by majority vote. Fine-tuned models trained on years of publication records each surpass every frontier model and expert panel, with the best single model achieving 59%. These models exhibit calibrated confidence, reaching 100% accuracy on their highest-confidence predictions, and transfer this evaluative signal to untrained pairwise comparisons and one-sentence summaries. The mechanism generalizes: models trained on economics publication records achieve 70% accuracy. Scientific taste was not missing from AI’s reach; it was deposited in the institutional record, waiting to be extracted. These results provide a scalable mechanism to triage the expanding volume of scientific production across disciplines where quality resists formal verification.

Keywords: language models, fine-tuning, scientific taste, publication decisions, AI for science, evaluative judgment, research quality, institutional records

59. ❌ Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Authors: Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16654v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

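Scoring rationale: The paper's core topic is the evaluation of large language models (LLMs) on multi-hop reasoning, so it is highly relevant to 'Large Language Models' (10 points). It uses supervised fine-tuning (SFT) to improve reasoning ability, which is highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (10 points). It analyzes the chain-of-thought (CoT) reasoning process, which is highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points). Its diagnostic analysis of the reasoning process relates somewhat to 'System 2 Thinking' and 'Mechanistic Interpretability' (5 points each). Other keywords such as MoE, quantization, and RAG are not covered in the paper and receive 0 points.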

!!! tip deepseek-chat TL;DR

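The paper introduces the Omanic dataset for evaluating large language models on multi-hop reasoning, and uses supervised fine-tuning to show that the dataset effectively improves model performance on reasoning tasks.

Abstract (Chinese translation)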

专注于推理能力的大型语言模型（LLMs）在众多自然语言处理任务中取得了进展，但其评估仍具挑战性：仅凭最终答案无法揭示中间推理步骤，导致难以判断模型是否真正进行了正确推理以及错误发生在何处，而现有的多跳问答基准缺乏用于诊断推理失败的步骤级标注。为填补这一空白，我们提出了Omanic，一个开放域多跳问答资源，它提供分解的子问题与中间答案作为结构化标注，以支持推理过程分析。该资源包含10,296个机器生成的训练样本（OmanicSynth）和967个经专家评审的人工标注评估样本（OmanicBench）。系统化评估表明，当前最先进的大型语言模型在OmanicBench上的多项选择准确率仅为73.11%，证实了其高难度。分步分析显示，思维链（CoT）的性能取决于事实完整性，其在知识缺失情况下收益减弱，且错误在后续推理步骤中会放大。此外，基于OmanicSynth进行监督微调，在六个推理与数学基准上带来了显著的迁移增益（平均提升7.41分），验证了数据集的质量，并进一步支持OmanicSynth作为推理能力迁移监督数据的有效性。数据发布于https://huggingface.co/datasets/li-lab/Omanic，代码发布于https://github.com/XiaojieGu/Omanic。

Abstract

Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT’s performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine-tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset’s quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.

Keywords: Large Language Models, Multi-hop Reasoning, Chain of Thought, Supervised Fine-tuning, Evaluation Dataset, Reasoning Analysis, Step-wise Evaluation, Knowledge Transfer
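Step-level annotations of the kind described in the abstract above make it possible to score each hop separately and to locate the first hop at which a reasoning chain goes wrong. The sketch below shows one way such a step-wise score could be computed; the record layout, the example data, and the case-insensitive exact-match comparison are assumptions for illustration, not the OmanicBench schema or its official metric.

```python
from typing import Optional


def _match(predicted: str, gold: str) -> bool:
    """Case-insensitive exact match (an assumed, simplistic comparison)."""
    return predicted.strip().lower() == gold.strip().lower()


def first_error_hop(predicted: list[str], gold: list[str]) -> Optional[int]:
    """Return the 1-based index of the first wrong hop, or None if all match."""
    for hop, (p, g) in enumerate(zip(predicted, gold), start=1):
        if not _match(p, g):
            return hop
    return None


def per_hop_accuracy(examples: list[tuple[list[str], list[str]]]) -> list[float]:
    """Accuracy at each hop position across examples with equal hop counts."""
    n_hops = len(examples[0][1])
    correct = [0] * n_hops
    for predicted, gold in examples:
        for i, (p, g) in enumerate(zip(predicted, gold)):
            correct[i] += _match(p, g)
    return [c / len(examples) for c in correct]


# Hypothetical two-hop examples: (predicted intermediate answers, gold answers).
examples = [
    (["Paris", "Seine"], ["Paris", "Seine"]),
    (["Lyon", "Rhone"], ["Paris", "Seine"]),
    (["Paris", "Loire"], ["Paris", "Seine"]),
]
hop_accuracy = per_hop_accuracy(examples)
first_errors = [first_error_hop(p, g) for p, g in examples]
print(hop_accuracy, first_errors)
```

In this toy data the second hop is less accurate than the first, which is the kind of pattern a step-wise breakdown exposes and a final-answer score alone would hide.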

60. ❌ What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline

Authors: Benoît Alcaraz Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16651v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

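Scoring rationale: The paper mainly studies norm compliance for reinforcement learning agents and proposes a hybrid model that incorporates argumentation-based normative advisors. It is unrelated to most keywords because it does not address large language models, model architectures, training techniques, inference optimization, or similar topics. It relates somewhat to 'Instruction Tuning OR Alignment OR Value Alignment' (5 points) because it deals with value alignment and norm following for agents, and to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5 points) because it studies autonomous agent systems. All other keywords are unrelated (0 points).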

!!! tip deepseek-chat TL;DR

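The paper proposes a hybrid model named Pino in which argumentation-based normative advisors supervise reinforcement learning agents, addressing the problem of agents following social norms in complex environments, and it provides a definition of norm avoidance together with a mitigation strategy.

Abstract (Chinese translation)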

过去十年间，人工智能（AI）发展迅速。随着这一快速进展，社会亟需能够遵循人类社会规则与规范的系统，以实现其安全、成功地融入日常生活。受《匹诺曹历险记——一个木偶的故事》的启发，本论文提出了一套解决规范遵从与情境感知智能体开发问题的流程框架。该研究基于AJAR、Jiminy和NGRL架构，引入了\pino——一种混合模型，其中强化学习智能体受基于论证的规范性顾问监督。为使该框架具备可操作性，本论文还提出了一种新颖算法，用于自动提取支撑顾问决策的论证结构与关系网络。最后，本研究探讨了“规范规避”现象，在强化学习智能体语境下给出了其定义并提出相应的缓解策略。该框架的每个组成部分均经过实证评估。论文最后对相关研究、当前局限以及未来研究方向进行了讨论。

Abstract

In the past decade, artificial intelligence (AI) has developed quickly. With this rapid progression came the need for systems capable of complying with the rules and norms of our society so that they can be successfully and safely integrated into our daily lives. Inspired by the story of Pinocchio in “Le avventure di Pinocchio - Storia di un burattino”, this thesis proposes a pipeline that addresses the problem of developing norm compliant and context-aware agents. Building on the AJAR, Jiminy, and NGRL architectures, the work introduces \pino, a hybrid model in which reinforcement learning agents are supervised by argumentation-based normative advisors. In order to make this pipeline operational, this thesis also presents a novel algorithm for automatically extracting the arguments and relationships that underlie the advisors’ decisions. Finally, this thesis investigates the phenomenon of norm avoidance, providing a definition and a mitigation strategy within the context of reinforcement learning agents. Each component of the pipeline is empirically evaluated. The thesis concludes with a discussion of related work, current limitations, and directions for future research.

Keywords: reinforcement learning, norm compliance, autonomous agents, argumentation-based advisors, norm avoidance, hybrid model, context-aware agents, ethical AI

61. ❌ Domain-Independent Dynamic Programming with Constraint Propagation

Authors: Imko Marijnissen, J. Christopher Beck, Emir Demirović, Ryo Kuroiwa Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16648v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

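Scoring rationale: The paper studies how to integrate dynamic programming (DP) with constraint programming (CP). It belongs to traditional combinatorial optimization and operations research and does not involve large models, deep learning, AI for Science, or any related technique. Its content focuses on algorithmic framework design, constraint propagation, and solver performance evaluation, and has no connection to any of the scoring keywords.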

!!! tip deepseek-chat TL;DR

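The paper proposes a method for integrating constraint propagation into dynamic programming, bridging the dynamic programming and constraint programming paradigms; experiments show that the method significantly reduces the number of state expansions and improves solving efficiency.

Abstract (Chinese translation)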

组合问题领域存在两种主流的基于模型的范式：1）基于状态表示的方法，例如启发式搜索、动态规划（DP）和决策图；2）基于约束与域表示的方法，例如约束规划（CP）、（混合）整数规划和布尔可满足性。本文通过将约束传播技术整合到动态规划中，弥合了动态规划与约束规划范式之间的鸿沟，使动态规划求解器能够利用约束传播对状态和转移进行剪枝。为此，我们在领域无关动态规划框架中，使用通用约束规划求解器实现了约束传播，并针对三个组合优化问题——带时间窗的单机调度问题、资源受限项目调度问题（RCPSP）以及带时间窗的旅行商问题（TSPTW）——通过启发式搜索进行了评估。实验结果表明，约束传播显著减少了状态扩展的数量：在单机调度和RCPSP问题上，我们的方法比纯动态规划求解器解决了更多算例；在约束紧密的TSPTW算例中也表现出类似的改进。运行时间性能分析表明，对于约束密集的算例，传播带来的收益超过了其计算开销，但进一步降低传播开销仍可提升性能。本研究是理解约束传播在动态规划求解器中价值的关键一步，为动态规划与约束规划的融合提供了一种基于模型的实现路径。

Abstract

There are two prevalent model-based paradigms for combinatorial problems: 1) state-based representations, such as heuristic search, dynamic programming (DP), and decision diagrams, and 2) constraint and domain-based representations, such as constraint programming (CP), (mixed-)integer programming, and Boolean satisfiability. In this paper, we bridge the gap between the DP and CP paradigms by integrating constraint propagation into DP, enabling a DP solver to prune states and transitions using constraint propagation. To this end, we implement constraint propagation using a general-purpose CP solver in the Domain-Independent Dynamic Programming framework and evaluate using heuristic search on three combinatorial optimisation problems: Single Machine Scheduling with Time Windows, the Resource Constrained Project Scheduling Problem (RCPSP), and the Travelling Salesperson Problem with Time Windows (TSPTW). Our evaluation shows that constraint propagation significantly reduces the number of state expansions, causing our approach to solve more instances than a DP solver for Single Machine Scheduling and RCPSP, and showing similar improvements for tightly constrained TSPTW instances. The runtime performance indicates that the benefits of propagation outweigh the overhead for constrained instances, but that further work into reducing propagation overhead could improve performance further. Our work is a key step in understanding the value of constraint propagation in DP solvers, providing a model-based approach to integrating DP and CP.

Keywords: Dynamic Programming, Constraint Propagation, Combinatorial Optimization, Domain-Independent, State Expansion, CP Solver, Heuristic Search, RCPSP
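The central idea in the abstract above, using propagation to discard DP states that cannot lead to a feasible solution, can be illustrated on a toy single-machine scheduling problem with time windows. In the sketch below a hand-written feasibility check stands in for a real CP propagator, and the job data, the `solve` function, and the state encoding are all made up for illustration; this is neither the Domain-Independent Dynamic Programming framework's API nor the paper's implementation.

```python
from typing import Optional

# Each job is (release time, deadline, duration); the values are invented.
Job = tuple[int, int, int]


def solve(jobs: list[Job], propagate: bool) -> tuple[Optional[int], int]:
    """Minimise makespan by memoised DP over (scheduled set, current time).

    Returns (best makespan or None if infeasible, number of expanded states).
    """
    memo: dict[tuple[frozenset, int], Optional[int]] = {}
    expanded = 0

    def all_remaining_fit(done: frozenset, time: int) -> bool:
        # Stand-in for propagation: time never decreases, so a job that would
        # miss its deadline even if started next can never be scheduled.
        return all(
            max(time, release) + duration <= deadline
            for j, (release, deadline, duration) in enumerate(jobs)
            if j not in done
        )

    def best(done: frozenset, time: int) -> Optional[int]:
        nonlocal expanded
        if len(done) == len(jobs):
            return time
        key = (done, time)
        if key in memo:
            return memo[key]
        if propagate and not all_remaining_fit(done, time):
            memo[key] = None  # pruned without expanding successors
            return None
        expanded += 1
        result = None
        for j, (release, deadline, duration) in enumerate(jobs):
            if j in done:
                continue
            finish = max(time, release) + duration
            if finish > deadline:  # transition violates the job's own window
                continue
            sub = best(done | {j}, finish)
            if sub is not None and (result is None or sub < result):
                result = sub
        memo[key] = result
        return result

    return best(frozenset(), 0), expanded


jobs = [(0, 10, 3), (0, 4, 2), (2, 8, 3)]
plain = solve(jobs, propagate=False)
pruned = solve(jobs, propagate=True)
print(plain, pruned)
```

On this toy instance both runs find the same makespan of 8, with 8 expanded states without the check and 4 with it. The check itself costs time per state, which mirrors the trade-off between pruning benefit and propagation overhead that the abstract discusses.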

62. ❌ When AI Navigates the Fog of War

Authors: Ming Li, Xirui Li, Tianyi Zhou Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16642v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

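Scoring rationale: The paper studies the reasoning ability of LLMs in a real-time geopolitical conflict. Its core concerns the LLM reasoning process (Chain of Thought/System 2 Thinking), and it does not touch on other technical keywords such as MoE, training methods, optimization techniques, or application domains.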

!!! tip deepseek-chat TL;DR

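The paper studies the geopolitical reasoning ability of large language models during the early stages of the 2026 Middle East conflict, finding that the models display strategic realism, uneven capability across domains, and narratives that evolve over time.

Abstract (Chinese translation)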

人工智能能否在战争轨迹尚未成为历史定论之前进行推理分析？由于回顾性地缘政治预测极易受到训练数据泄露的混淆干扰，评估这种能力十分困难。我们通过一项基于时间锚点的案例研究来应对这一挑战，该研究聚焦于2026年中东冲突的早期阶段——这场冲突发生在当前前沿模型的训练数据截止日期之后。我们构建了11个关键时间节点、42个节点特异性可验证问题以及5个通用探索性问题，要求模型仅依据各时间节点上公开可获得的信息进行推理。这一设计极大缓解了训练数据泄露问题，构建了一个非常适合研究模型如何在“战争迷雾”下分析持续演变危机的实验环境，并据我们所知，首次对大型语言模型在持续性地缘政治冲突中的推理能力进行了时间锚定分析。我们的分析揭示了三个主要发现：首先，当前最先进的大型语言模型常表现出显著程度的战略现实主义倾向，其推理能够超越表面修辞而触及更深层次的结构性动因；其次，这种能力在不同领域分布不均——模型在经济和物流等结构化场景中的表现优于政治模糊的多行为体环境；最后，模型的叙事会随时间推移而演变，从早期对冲突快速遏制的预期逐渐转向对区域僵局形成和消耗性降级的系统性解释。由于本研究撰写时冲突仍在持续，这项工作可作为模型在持续地缘政治危机中推理能力的档案记录，为未来研究提供免受回顾性分析后见之明偏差影响的基准。

Abstract

Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models. We construct 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information that would have been publicly available at each moment. This design substantially mitigates training-data leakage concerns, creating a setting well-suited for studying how models analyze an unfolding crisis under the fog of war, and provides, to our knowledge, the first temporally grounded analysis of LLM reasoning in an ongoing geopolitical conflict. Our analysis reveals three main findings. First, current state-of-the-art large language models often display a striking degree of strategic realism, reasoning beyond surface rhetoric toward deeper structural incentives. Second, this capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments. Finally, model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation. Since the conflict remains ongoing at the time of writing, this work can serve as an archival snapshot of model reasoning during an unfolding geopolitical crisis, enabling future studies without the hindsight bias of retrospective analysis.

Keywords: large language models, geopolitical reasoning, fog of war, temporal analysis, strategic realism, ongoing conflict, reasoning capability, model narratives

63. ❌ MLLM-based Textual Explanations for Face Comparison

Authors: Redwan Sony, Anil K Jain, Ross Arun Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16629v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

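Scoring rationale: The paper studies the reliability of natural-language explanations generated by multimodal large language models (MLLMs) for unconstrained face verification. It centers on LLM applications and Explainable AI, and focuses its analysis on hallucination (Hallucination Mitigation). Other keywords such as MoE, SFT, RAG, and quantization are either not mentioned in the abstract or unrelated. AI for Science receives 5 points because face recognition is a biometric application, which has some connection to bioinformatics but is not central.

!!! tip deepseek-chat TL;DR

The paper systematically analyzes the reliability of explanations generated by multimodal large language models for unconstrained face verification. It finds that even when the model reaches the correct decision, its explanation often relies on non-verifiable or hallucinated facial attributes, and it proposes a likelihood-ratio-based evaluation framework to quantify the evidential strength of explanations.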

摘要翻译

多模态大语言模型（MLLMs）近期被提出，旨在为人脸识别决策生成自然语言解释。尽管此类解释有助于提升人类可理解性，但其在无约束人脸图像上的可靠性仍未得到充分探究。本研究系统分析了MLLM在具有挑战性的IJB-S数据集上针对无约束人脸验证任务生成的解释，特别聚焦于极端姿态变化和监控影像场景。结果表明，即使MLLM能作出正确的验证决策，其伴随的解释却频繁依赖于无法验证或虚构的面部属性，这些属性缺乏视觉证据支持。我们进一步研究了在输入图像基础上，融入传统人脸识别系统信息（即分数与决策）的影响。尽管此类信息提升了分类验证性能，但并未持续产生可信的解释。为超越决策准确率评估解释质量，我们引入了一种基于似然比的评估框架，用于量化文本解释的证据强度。本研究结果揭示了当前MLLM在可解释人脸识别中的根本局限性，并强调在生物识别应用中亟需建立对可靠可信解释的规范化评估体系。代码发布于https://github.com/redwankarimsony/LR-MLLMFR-Explainability。

摘要 (Abstract)

Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural-language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM-generated explanations for the unconstrained face verification task on the challenging IJB-S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non-verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, viz., scores and decisions, alongside the input images. Although such information improves categorical verification performance, it does not consistently lead to faithful explanations. To evaluate the explanations beyond decision accuracy, we introduce a likelihood-ratio-based framework that measures the evidential strength of textual explanations. Our findings highlight fundamental limitations of current MLLMs for explainable face recognition and underscore the need for a principled evaluation of reliable and trustworthy explanations in biometric applications. Code is available at https://github.com/redwankarimsony/LR-MLLMFR-Explainability.

Keywords: Multimodal Large Language Models, face verification, explainability, hallucination, IJB-S dataset, likelihood-ratio framework, biometric applications, textual explanations
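
The abstract describes the likelihood-ratio framework only at a high level. As a rough illustration of what "evidential strength" can mean, the sketch below computes a log10 likelihood ratio for a scalar score under a same-identity hypothesis versus a different-identity hypothesis. It assumes each explanation has already been reduced to one number (a hypothetical step) and that reference scores under both hypotheses are roughly Gaussian. Both are simplifying assumptions for illustration, not the paper's method.

```python
import math
from statistics import NormalDist, mean, stdev


def fit_gaussian(scores):
    """Fit a normal distribution to reference scores (simplifying assumption)."""
    return NormalDist(mean(scores), stdev(scores))


def log10_likelihood_ratio(score, same_scores, diff_scores):
    """log10 of p(score | same identity) / p(score | different identity).

    Positive values support the same-identity hypothesis, negative values
    support the different-identity hypothesis, and values near zero mean
    the score carries little evidence either way.
    """
    same = fit_gaussian(same_scores)
    diff = fit_gaussian(diff_scores)
    return math.log10(same.pdf(score) / diff.pdf(score))
```

With hypothetical reference scores clustered around 0.8 for same-identity pairs and 0.2 for different-identity pairs, a new score of 0.8 yields a positive value and a score of 0.2 a negative one.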

64. ❌ Data-driven generalized perimeter control: Zürich case study

Authors: Alessio Rimoldi, Carlo Cenedese, Alberto Padoan, Florian Dörfler, John Lygeros Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16599v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

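Scoring rationale: The paper studies urban traffic congestion control using behavioral systems theory and data-driven predictive control, an application of traditional machine learning and control theory to transportation. Neither the title nor the abstract mentions large language models, deep learning principles, or any specific AI for Science keyword, so the paper is unrelated to the review's focus on large-model and deep-learning techniques and their scientific applications. All keywords target the large-model technology stack, whereas this paper studies traffic control algorithms, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address urban traffic congestion, the paper proposes a dynamic traffic light control method based on behavioral systems theory and data-driven predictive control, and validates its effectiveness in reducing total travel time and CO2 emissions in a high-fidelity simulation of the city of Zürich.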

摘要翻译

城市交通拥堵是现代城市发展的核心挑战，需要先进的控制技术以优化现有基础设施的利用。尽管数据广泛可得，但在设计基于模型的控制方法时，对此类复杂系统进行建模仍是一个昂贵且耗时的步骤。另一方面，机器学习方法需要仿真来引导模型建立，或难以处理交通数据的稀疏性并强制执行硬约束。我们基于行为系统理论提出了一种新的交通动力学表述方法，并应用数据驱动的预测控制通过动态交通信号灯控制来引导交通动态。我们采用苏黎世城市的高保真仿真（据我们所知，这是文献中最大规模的闭环微观城市交通仿真）来验证所提方法在总行程时间和二氧化碳排放方面的性能。

摘要 (Abstract)

Urban traffic congestion is a key challenge for the development of modern cities, requiring advanced control techniques to optimize the usage of existing infrastructure. Despite the extensive availability of data, modeling such complex systems remains an expensive and time-consuming step when designing model-based control approaches. On the other hand, machine learning approaches require simulations to bootstrap models, or are unable to deal with the sparse nature of traffic data and enforce hard constraints. We propose a novel formulation of traffic dynamics based on behavioral systems theory and apply data-enabled predictive control to steer traffic dynamics via dynamic traffic light control. A high-fidelity simulation of the city of Zürich, the largest closed-loop microscopic simulation of urban traffic in the literature to the best of our knowledge, is used to validate the performance of the proposed method in terms of total travel time and CO2 emissions.

Keywords: traffic control, data-driven predictive control, behavioral systems theory, dynamic traffic light control, urban traffic congestion, microscopic simulation, travel time optimization, CO2 emissions reduction
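
The abstract does not spell out the formulation. In the data-enabled predictive control literature, however, recorded trajectories are typically arranged into a Hankel matrix whose columns are sliding windows of the data, and predictions are expressed as combinations of those columns. The sketch below shows only that data structure for a scalar signal. The function name `hankel` and the scalar setting are illustrative assumptions, not taken from the paper.

```python
def hankel(signal, depth):
    """Build a Hankel matrix with `depth` rows from a 1-D recorded signal.

    Column j holds the window signal[j : j + depth], so every column is
    a length-`depth` trajectory observed in the data.
    """
    columns = len(signal) - depth + 1
    if depth < 1 or columns < 1:
        raise ValueError("depth must be between 1 and len(signal)")
    return [[signal[row + col] for col in range(columns)] for row in range(depth)]
```

For example, `hankel([1, 2, 3, 4, 5], 3)` has the columns `[1, 2, 3]`, `[2, 3, 4]`, and `[3, 4, 5]`.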

65. ❌ FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation

Authors: Fangjing Li, Zhihai Wang, Xinxin Ding, Haiyang Liu, Ronghua Gao, Rong Wang, Yao Zhu, Ming Jin Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16596v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

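Scoring rationale: FSMC-Pose focuses on pose estimation in computer vision, applied to livestock farming (estrus detection in dairy cattle), which is an application of AI in science. The core contribution is a new deep learning framework (FSMC-Pose) comprising a lightweight frequency-spatial fusion backbone (CattleMountNet) and a multiscale self-calibration head (SC2Head), together with a new dataset (MOUNT-Cattle). Of all the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' has some connection to the paper (5 points), because it applies AI to biological and agricultural science (cattle behavior analysis), which falls within AI for Science. The other keywords concern large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and agents), training and optimization methods (such as Scaling Laws, PEFT, and Quantization), or specific reasoning capabilities (such as CoT and System 2 Thinking). This paper is purely computer vision and pose estimation research and does not use or involve any large language model, natural language processing, or related technique, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study tackles cattle mounting pose estimation in complex real-world environments. It proposes FSMC-Pose, a lightweight frequency-spatial fusion framework that achieves higher accuracy than baseline models at lower computational cost while maintaining real-time inference.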

摘要翻译

爬跨姿态是奶牛发情的重要视觉指标。然而，由于现实环境中存在杂乱背景和频繁的动物间遮挡，实现可靠的爬跨姿态估计仍具挑战性。本文提出FSMC-Pose，一种自上而下的框架，集成了轻量级频空融合骨干网络CattleMountNet和多尺度自校准头SC2Head。具体而言，我们为CattleMountNet设计了两个算法组件：空间频率增强模块（SFEBlock）和感受野聚合模块（RABlock）。SFEBlock将牛只从杂乱背景中分离，而RABlock则捕获多尺度上下文信息。空间-通道自校准头（SC2Head）关注空间与通道依赖性，并引入自校准分支以减轻动物间重叠下的结构错位。我们构建了一个包含1176个爬跨实例的数据集MOUNT-Cattle，该数据集遵循COCO格式，支持姿态估计模型的即插即用训练。通过将MOUNT-Cattle与公开的NWAFU-Cattle数据集结合形成的综合数据集进行实验，FSMC-Pose在显著降低计算量和参数成本的同时，实现了比强基线模型更高的精度，并能在商用GPU上保持实时推理。大量实验与定性分析表明，FSMC-Pose能在复杂杂乱环境中有效捕捉和估计奶牛爬跨姿态。数据集与代码公开于https://github.com/elianafang/FSMC-Pose。

摘要 (Abstract)

Mounting posture is an important visual indicator of estrus in dairy cattle. However, achieving reliable mounting pose estimation in real-world environments remains challenging due to cluttered backgrounds and frequent inter-animal occlusion. We present FSMC-Pose, a top-down framework that integrates a lightweight frequency-spatial fusion backbone, CattleMountNet, and a multiscale self-calibration head, SC2Head. Specifically, we design two algorithmic components for CattleMountNet: the Spatial Frequency Enhancement Block (SFEBlock) and the Receptive Aggregation Block (RABlock). SFEBlock separates cattle from cluttered backgrounds, while RABlock captures multiscale contextual information. The Spatial-Channel Self-Calibration Head (SC2Head) attends to spatial and channel dependencies and introduces a self-calibration branch to mitigate structural misalignment under inter-animal overlap. We construct a mounting dataset, MOUNT-Cattle, covering 1176 mounting instances, which follows the COCO format and supports drop-in training across pose estimation models. Using a comprehensive dataset that combines MOUNT-Cattle with the public NWAFU-Cattle dataset, FSMC-Pose achieves higher accuracy than strong baselines, with markedly lower computational and parameter costs, while maintaining real-time inference on commodity GPUs. Extensive experiments and qualitative analyses show that FSMC-Pose effectively captures and estimates cattle mounting pose in complex and cluttered environments. Dataset and code are available at https://github.com/elianafang/FSMC-Pose.

Keywords: cattle mounting pose estimation, frequency-spatial fusion, multiscale self-calibration, lightweight backbone, real-time inference, cluttered backgrounds, inter-animal occlusion, MOUNT-Cattle dataset

66. ❌ BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization

Authors: Ji-Fu Li, Manyi Zhang, Xiaobo Xia, Han Bao, Haoli Bai, Zhenhua Dong, Xianzhi Yu Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16590v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

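Scoring rationale: The paper's core contribution is BATQuant, an MXFP4 quantization method applied directly to the deployment of LLMs and MLLMs, so it is highly relevant (10 points) to 'Large Language Models', 'Post-training' (PTQ is mentioned explicitly), and 'Quantization'. 'Speculative Decoding' receives 5 points because quantization indirectly accelerates inference, but this is not the paper's focus. Other keywords such as MoE, SLMs, Scaling Laws, and Alignment are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes BATQuant to address the performance collapse of existing post-training quantization techniques on the MXFP4 format. Using block-wise affine transformations and learnable clipping, it achieves state-of-the-art quantization results for LLMs and MLLMs under the W4A4KV16 configuration, recovering up to 96.43% of full-precision performance on multimodal benchmarks.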

摘要翻译

微缩浮点格式已成为在现代加速器架构上部署多模态大语言模型和大语言模型的有前景标准。然而，现有的训练后量化方法，特别是为整数格式设计的基于旋转的技术，在应用于MXFP4时会出现严重的性能崩溃。近期研究将这一失败归因于根本性的格式不匹配：全局正交旋转无意间在量化块间传递异常值能量，引发新的异常值从而破坏局部块级缩放，同时常产生双峰激活分布，导致有限的量化范围利用不足。为解决这些问题，我们提出BATQuant（块级仿射变换），该方法将变换限制在与MXFP粒度对齐的范围内，以防止跨块异常值传播，同时放宽正交性约束以优化分布形态。为确保参数效率，我们引入全局与私有克罗内克分解，有效降低存储和运行时开销，并结合块级可学习截断机制以抑制残余异常值。在多模态大语言模型和大语言模型上的大量实验表明，BATQuant在激进的W4A4KV16配置下取得了新的最优结果，在多模态基准测试中恢复了高达96.43%的全精度性能，并在多样任务中明显超越现有方法。

摘要 (Abstract)

Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition to effectively reduce storage and runtime overhead, and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.

Keywords: MXFP quantization, Post-Training Quantization, Large Language Models, Multi-modal LLMs, Block-wise optimization, Model compression, Inference efficiency, W4A4KV16
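
To make the abstract's point about outliers disrupting block-wise scaling concrete, the sketch below simulates plain MXFP4-style quantization in pure Python. It assumes the usual microscaling layout (blocks of 32 values sharing one power-of-two scale, elements stored as FP4 E2M1 with magnitudes up to 6) and a simple max-based scale rule with round-to-nearest. It is a simplified baseline for illustration only and does not implement BATQuant's affine transformations, Kronecker decomposition, or learnable clipping.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # elements that share one scale


def quantize_block(block):
    """Quantize one block with a shared power-of-two scale, then dequantize."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Assumed scale rule: align the block maximum with the top FP4 exponent (2).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    result = []
    for v in block:
        magnitude = min(abs(v) / scale, FP4_E2M1_GRID[-1])  # clamp to 6
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - magnitude))
        result.append(math.copysign(nearest * scale, v))
    return result


def quantize(values, block_size=BLOCK_SIZE):
    """Apply block-wise quantization to a flat list of values."""
    result = []
    for start in range(0, len(values), block_size):
        result.extend(quantize_block(values[start:start + block_size]))
    return result
```

In this sketch a block of 32 values equal to 0.1 is reconstructed as 0.09375 each, but replacing one value with an outlier of 100.0 raises the shared scale to 16, so the remaining 0.1 entries round to 0.0. That is the per-block sensitivity that makes cross-block outlier propagation costly.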

67. ❌ Runtime Governance for AI Agents: Policies on Paths

Authors: Maurits Kaptein, Vassilis-Javed Khan, Andriy Podstavnychy Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16586v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

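Scoring rationale: The paper's core contribution is a runtime governance framework for AI agents. It is highly relevant to 'LLM Agents' (10 points) because it explicitly studies AI agent systems that plan, reason, and act using LLMs, and highly relevant to 'Large Language Models' (10 points) because it states explicitly that AI agents use LLMs. It has some connection to 'Tool Use' (5 points), since agent actions may involve tool use, although the paper does not discuss specific tools in depth. It also has some connection to 'Instruction Tuning' (5 points), because the paper treats prompt-level instructions as a special case of governance, which touches on the notion of instructions. Other keywords such as MoE, SLMs, Scaling Laws, and RLHF have no direct bearing on the governance framework and all score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a runtime governance framework for AI agents that use large language models. It formalizes compliance policies as deterministic functions that evaluate execution paths, with the aim of balancing task success rate against the associated risks.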

摘要翻译

人工智能代理——即利用大语言模型进行规划、推理与行动的系统——会产生非确定性、路径依赖的行为，这些行为无法在设计阶段被完全治理；此处的“治理”指在尽可能高的任务完成成功率与运行代理所产生的法律、数据泄露、声誉等相关成本之间取得恰当平衡。我们认为，执行路径是实现有效运行时治理的核心对象，并将合规策略形式化为确定性函数，该函数将代理身份、部分路径、拟采取的下一步行动以及组织状态映射为策略违反概率。我们指出，提示级指令（及“系统提示”）与静态访问控制是该框架的特例：前者在不实际评估路径的情况下影响路径的概率分布；后者则评估忽略路径的确定性策略（即这些策略仅能覆盖所有可能路径中的特定子集）。在我们看来，运行时评估是一般性情况，且对于任何路径依赖策略而言都是必要的。我们建立了用于分析人工智能代理治理的形式化框架，提供了具体策略示例（受《人工智能法案》启发），讨论了参考实现方案，并指出了包括风险校准与强制合规的局限性在内的开放性问题。

摘要 (Abstract)

AI agents – systems that plan, reason, and act using large language models – produce non-deterministic, path-dependent behavior that cannot be fully governed at design time, where with governed we mean striking the right balance between as high as possible successful task completion rate and the legal, data-breach, reputational and other costs associated with running agents. We argue that the execution path is the central object for effective runtime governance and formalize compliance policies as deterministic functions mapping agent identity, partial path, proposed next action, and organizational state to a policy violation probability. We show that prompt-level instructions (and “system prompts”), and static access control are special cases of this framework: the former shape the distribution over paths without actually evaluating them; the latter evaluates deterministic policies that ignore the path (i.e., these can only account for a specific subset of all possible paths). In our view, runtime evaluation is the general case, and it is necessary for any path-dependent policy. We develop the formal framework for analyzing AI agent governance, present concrete policy examples (inspired by the AI act), discuss a reference implementation, and identify open problems including risk calibration and the limits of enforced compliance.

Keywords: AI agents, runtime governance, large language models, compliance policies, execution path, policy violation probability, path-dependent behavior, AI act
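
The policy signature described in the abstract (agent identity, partial path, proposed next action, and organizational state mapped to a violation probability) can be sketched as a plain function type. Everything concrete below is a hypothetical illustration rather than the paper's reference implementation: the `Action` fields, the example policy, the `confidential_sources` key, and the 0.5 threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class Action:
    tool: str    # hypothetical tool name, e.g. "read" or "send_external"
    target: str  # resource the tool acts on


# (agent identity, partial path, proposed action, org state) -> violation probability
Policy = Callable[[str, Sequence[Action], Action, Mapping[str, object]], float]


def no_external_send_after_confidential_read(agent_id, path, proposed, org_state):
    """Path-dependent example: block external sends once confidential data was read."""
    confidential = org_state.get("confidential_sources", frozenset())
    touched = any(a.tool == "read" and a.target in confidential for a in path)
    return 1.0 if touched and proposed.tool == "send_external" else 0.0


def is_allowed(agent_id, path, proposed, org_state, policies, threshold=0.5):
    """Allow the proposed action only if every policy stays below the threshold."""
    return all(p(agent_id, path, proposed, org_state) < threshold for p in policies)
```

A static access-control rule corresponds to a policy that ignores `path` entirely. The example policy cannot be expressed that way, because the same `send_external` action is acceptable on one path and a violation on another.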

68. ❌ V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models

Authors: Seyed Mahed Mousavi, Christian Moiola, Massimo Rizzoli, Simone Alghisi, Giuseppe Riccardi Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16581v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

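Scoring rationale: The paper studies the evaluation and updating of time-sensitive knowledge in vision-language models (VLMs). It is highly relevant to multi-modal RAG methods (10 points) and to factuality and hallucination mitigation (10 points), and has some connection to pre-training/domain adaptation (5 points), alignment (5 points), and explainable AI (5 points). The other keywords mainly concern text-only LLM techniques, reasoning methods, agent systems, and model optimization, which are unrelated to the paper's multimodal VLM focus (0 points).

!!! tip deepseek-chat TL;DR

The paper introduces the V-DyKnow benchmark to evaluate the reliability of time-sensitive knowledge in vision-language models. It finds that VLMs frequently output outdated facts and that existing alignment approaches have limited effect on updating knowledge across modalities, and it analyzes how well multi-modal RAG methods update that knowledge.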

摘要翻译

视觉语言模型（VLMs）基于文档数据快照（包括图像与文本）进行训练。其训练数据与评估基准通常是静态的，隐含地将事实知识视为时间无关的。然而，现实世界的事实本质上是随时间变化的，会经历无规律或周期性的更新，导致模型预测逐渐过时。本文提出V-DyKnow——一个用于评估VLMs中时效性事实知识的视觉动态知识基准。借助V-DyKnow，我们对闭源与开源VLMs进行系统性评估，并分析：a) 模型跨模态及面对输入扰动时响应的可靠性（正确性与一致性）；b) 知识编辑与多模态检索增强生成（RAG）方法在跨模态知识更新中的有效性；c) 通过数据与机制分析探究预测过时的根源。实验结果表明，VLMs频繁输出过时事实，这反映了（预）训练阶段所使用的数据快照的陈旧性。即使实体识别正确，从文本刺激转向视觉刺激时，事实可靠性仍会显著下降。此外，现有的对齐方法无法在跨模态场景中持续更新模型知识。这些发现共同揭示了当前VLMs在获取与更新跨模态时效性知识方面存在根本性局限。我们公开了基准数据集、代码及评估数据。

摘要 (Abstract)

Vision-Language Models (VLMs) are trained on data snapshots of documents, including images and texts. Their training data and evaluation benchmarks are typically static, implicitly treating factual knowledge as time-invariant. However, real-world facts are intrinsically time-sensitive and subject to erratic and periodic changes, causing model predictions to become outdated. We present V-DyKnow, a Visual Dynamic Knowledge benchmark for evaluating time-sensitive factual knowledge in VLMs. Using V-DyKnow, we benchmark closed- and open-source VLMs and analyze a) the reliability (correctness and consistency) of model responses across modalities and input perturbations; b) the efficacy of knowledge editing and multi-modal RAG methods for knowledge updates across modalities; and c) the sources of outdated predictions, through data and mechanistic analysis. Our results show that VLMs frequently output outdated facts, reflecting outdated snapshots used in the (pre-)training phase. Factual reliability degrades from textual to visual stimuli, even when entities are correctly recognized. Besides, existing alignment approaches fail to consistently update the models’ knowledge across modalities. Together, these findings highlight fundamental limitations in how current VLMs acquire and update time-sensitive knowledge across modalities. We release the benchmark, code, and evaluation data.

Keywords: Vision-Language Models, Time-sensitive Knowledge, Dynamic Benchmark, Factual Reliability, Knowledge Editing, Multi-modal RAG, Outdated Predictions, V-DyKnow

Authors: Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei, Yunyun Dong, Li Tang, Wei Zhou, Renyang Liu Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16576v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

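Scoring rationale: The paper studies robustness evaluation of concept unlearning in image generation models (IGMs). It proposes REFORGE, a black-box red-teaming framework that assesses the vulnerability of unlearning methods through adversarial image prompt attacks. All keywords relate to large language models (LLMs), whereas the paper focuses on image generation models and does not touch on LLMs, MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for science, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the robustness of concept unlearning methods for image generation models. It proposes REFORGE, a black-box attack framework that significantly raises attack success rate through adversarial image prompts, exposing the persistent vulnerability of current unlearning methods to multi-modal adversarial attacks.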

摘要翻译

图像生成模型（IGMs）的最新进展实现了高保真内容创作，但也加剧了包括复制受版权保护内容和生成冒犯性内容在内的风险。图像生成模型遗忘（IGMU）通过无需完整重新训练即可移除有害概念来缓解这些风险。尽管日益受到关注，其在对抗性输入下的鲁棒性，尤其是黑盒设置中图像侧威胁的研究仍显不足。为填补这一空白，我们提出了REFORGE，一个黑盒红队测试框架，通过对抗性图像提示评估IGMU的鲁棒性。REFORGE初始化基于笔触的图像，并采用交叉注意力引导的掩蔽策略优化扰动，该策略将噪声分配到概念相关区域，从而平衡攻击效能与视觉保真度。在代表性遗忘任务和防御机制上的大量实验表明，与相关基线方法相比，REFORGE显著提高了攻击成功率，同时实现了更强的语义对齐和更高的效率。这些结果揭示了当前IGMU方法中持续存在的脆弱性，并强调了针对多模态对抗攻击开发鲁棒性感知遗忘的必要性。我们的代码位于：https://github.com/Imfatnoily/REFORGE。

摘要 (Abstract)

Recent progress in image generation models (IGMs) enables high-fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored. To bridge this gap, we present REFORGE, a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than involved baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness-aware unlearning against multi-modal adversarial attacks. Our code is at: https://github.com/Imfatnoily/REFORGE.

Keywords: Image Generation Models, Concept Unlearning, Adversarial Attacks, Black-box Evaluation, Robustness, Multi-modal Attacks, Red-teaming Framework, Cross-attention Guidance

70. ❌ Malicious Or Not: Adding Repository Context to Agent Skill Classification

Authors: Florian Holzbauer, David Schmidt, Gabriel Gegenhuber, Sebastian Schrittwieser, Johanna Ullrich Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16572v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

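Score rationale: The paper presents a security analysis of the AI agent skill ecosystem, centered on skill classification and security scanning for LLM-based agents such as Claude Code and Open Claw. It is therefore highly relevant to 'LLM Agents' (10 points) and moderately relevant to 'Tool Use' (5 points), since skills extend agent functionality. 'Large Language Models' receives 5 points because LLMs are mentioned as the agents' underlying technology. The remaining keywords concern the technical principles, training methods, and inference optimization of large models, which the paper does not examine in depth, so they all score 0.

!!! tip deepseek-chat TL;DR

By analyzing 238,180 AI agent skills and adding GitHub repository context, the paper reduces the share of skills classified as malicious from as much as 46.8% to 0.52%, exposing the high false-positive rate of existing security scanners and uncovering previously undocumented skill-hijacking attack vectors.

Abstract (Chinese translation)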

智能体技能（Agent skills）为Claude Code或Open Claw等本地AI智能体扩展了附加功能，其普及催生了专门的技能市场，类似于移动应用程序的应用商店。与此同时，自动化技能扫描器被引入，通过分析SKILL.md中提供的技能描述来验证其良性行为。针对个别市场的扫描结果显示，高达46.8%的技能被标记为恶意。本文对AI智能体技能生态系统进行了最大规模的实证安全分析，对这一高比例的恶意技能分类提出质疑。为此，我们从三大分发平台及GitHub收集了238,180个独立技能，系统性地分析了其类型与行为。该方法将安全扫描器标记为非良性的技能数量大幅降低至仅0.52%，这些技能仍存留于被标记为恶意的代码库中。因此，我们的方法显著减少了误报，并为生态系统当前的风险面提供了更稳健的评估。此外，我们将安全分析从单纯考察技能描述，扩展到将其与技能所嵌入的GitHub代码库进行一致性对比，从而提供了额外背景信息。进一步地，我们的分析还揭示了若干迄今未公开记录的真实世界攻击向量，即劫持托管于废弃GitHub代码库上的技能。

Abstract

Agent skills extend local AI agents, such as Claude Code or Open Claw, with additional functionality, and their popularity has led to the emergence of dedicated skill marketplaces, similar to app stores for mobile applications. Simultaneously, automated skill scanners were introduced, analyzing the skill description available in SKILL.md, to verify their benign behavior. The results for individual marketplaces mark up to 46.8% of skills as malicious. In this paper, we present the largest empirical security analysis of the AI agent skill ecosystem, questioning this high classification of malicious skills. Therefore, we collect 238,180 unique skills from three major distribution platforms and GitHub to systematically analyze their type and behavior. This approach substantially reduces the number of skills flagged as non-benign by security scanners to only 0.52% which remain in malicious flagged repositories. Consequently, our methodology substantially reduces false positives and provides a more robust view of the ecosystem’s current risk surface. Beyond that, we extend the security analysis from the mere investigation of the skill description to a comparison of its congruence with the GitHub repository the skill is embedded in, providing additional context. Furthermore, our analysis also uncovers several so far undocumented real-world attack vectors, namely hijacking skills hosted on abandoned GitHub repositories.

Keywords: AI agent skills, security analysis, malicious classification, GitHub repository context, false positive reduction, skill marketplaces, attack vectors, repository hijacking

71. ❌ Manifold-Matching Autoencoders

Authors: Laurent Cheret, Vincent Létourneau, Isar Nejadgholi, Chris Drummond, Hussein Al Osman, Maia Fraser Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16568v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

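Score rationale: The paper studies an unsupervised regularization method for autoencoders (Manifold-Matching Autoencoders) that improves representation learning by aligning pairwise distances in the latent space with those in the input data space. All scoring keywords target large language models (LLMs) and related techniques (training methods, inference optimization, applications, and so on), whereas this paper does not involve language models, deep learning applied to scientific domains, or innovations in large-model techniques. It belongs to classical machine learning and representation learning and is unrelated to every topic in the keyword list.

!!! tip deepseek-chat TL;DR

The paper proposes Manifold-Matching Autoencoders (MMAE), an unsupervised regularization method that improves autoencoder representation learning by aligning pairwise distances in the latent space with those in the input data space, and that outperforms similar methods on metrics based on nearest-neighbor distance preservation and persistent homology.

Abstract (Chinese translation)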

我们研究一种称为流形匹配自编码器的简单无监督正则化方案：通过最小化均方误差，将潜在空间中的成对距离与输入数据空间的成对距离进行对齐。由于对齐作用于成对距离而非坐标，该方法可扩展至数据的低维表示，从而增强了灵活性。我们发现，在基于最近邻距离保持和持续同调度量的评估指标上，该正则化方法优于同类方法。同时，我们观察到流形匹配自编码器为多维尺度分析提供了可扩展的近似解法。

Abstract

We study a simple unsupervised regularization scheme for autoencoders called Manifold-Matching (MMAE): we align the pairwise distances in the latent space to those of the input data space by minimizing mean squared error. Because alignment occurs on pairwise distances rather than coordinates, it can also be extended to a lower-dimensional representation of the data, adding flexibility to the method. We find that this regularization outperforms similar methods on metrics based on preservation of nearest-neighbor distances and persistent homology-based measures. We also observe that MMAE provides a scalable approximation of Multi-Dimensional Scaling (MDS).
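
As a rough illustration of the regularizer described in the abstract, the sketch below computes the mean squared error between pairwise distances in an input batch and in the corresponding latent batch. The function names, the choice of Euclidean distance, and the absence of any rescaling between the two spaces are assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np

def pairwise_distances(x: np.ndarray) -> np.ndarray:
    """Euclidean distance matrix for the rows of x (shape: n x d)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def manifold_matching_loss(inputs: np.ndarray, latents: np.ndarray) -> float:
    """Hypothetical helper: MSE between input-space and latent-space pairwise distances."""
    d_in = pairwise_distances(inputs)
    d_lat = pairwise_distances(latents)
    # Use only the upper triangle so that each pair is counted once.
    iu = np.triu_indices(len(inputs), k=1)
    return float(np.mean((d_in[iu] - d_lat[iu]) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))   # toy batch: 8 points in 5 dimensions
z_same = x.copy()             # a "latent" code that preserves every distance
z_scaled = 2.0 * x            # a "latent" code that doubles every distance

print(manifold_matching_loss(x, z_same))    # zero penalty: distances match
print(manifold_matching_loss(x, z_scaled))  # positive penalty: distances differ
```

In an actual autoencoder this term would presumably be added to the reconstruction loss with some weight. Because it compares distances rather than coordinates, the latent dimension does not have to match the input dimension.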

Keywords: autoencoders, unsupervised regularization, manifold matching, latent space, pairwise distances, representation learning, multi-dimensional scaling, nearest-neighbor preservation

72. ❌ Characterizing Delusional Spirals through Human-LLM Chat Logs

Authors: Jared Moore, Ashish Mehta, William Agnew, Jacy Reese Anthis, Ryan Louie, Yifan Mai, Peggy Yin, Myra Cheng, Samuel J Paech, Kevin Klyman, Stevie Chancellor, Eric Lin, Nick Haber, Desmond C. Ong Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16567v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

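Score rationale: The paper studies the negative mental-health effects of LLM chatbots on users, in particular the phenomenon of delusional spirals. Highly relevant keywords: 'Large Language Models' (the paper directly studies LLM chatbots) and 'Hallucination Mitigation' (it examines harmful behaviors such as chatbots misrepresenting themselves as sentient). Moderately relevant: 'Instruction Tuning' (model alignment and safety) and 'Mechanistic Interpretability' (understanding LLM behavior patterns through coding analysis). The other keywords concern specific technical methods (such as MoE, quantization, and inference acceleration) or specific application domains (such as AI for science) that the paper does not cover, so they score 0.

!!! tip deepseek-chat TL;DR

By analyzing the chat logs of 19 users who reported psychological harm from chatbot use, the study presents what the authors describe as the first in-depth analysis of delusional spirals involving LLM chatbots. It finds that 15.5% of user messages show delusional thinking and 21.2% of chatbot messages misrepresent the chatbot as sentient, and it offers concrete recommendations for mitigating harm.

Abstract (Chinese translation)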

随着大型语言模型（LLM）的普及，全球媒体和法律论述中开始出现令人不安的负面心理影响案例报告，例如妄想、自残和“AI精神病”。然而，目前尚不清楚用户与聊天机器人在漫长的妄想“螺旋”过程中如何互动，这限制了我们理解和减轻相关伤害的能力。在本研究中，我们分析了19名报告因使用聊天机器人而遭受心理伤害的用户的LLM聊天机器人对话日志。我们的许多参与者来自此类聊天机器人用户的互助小组。我们还纳入了被广泛传播的媒体报道所覆盖的、涉及聊天机器人强化妄想案例参与者的聊天日志。与先前推测人工智能对心理健康潜在危害的研究不同，据我们所知，我们首次对此类备受关注且真实有害的案例进行了深入分析。我们开发了一套包含28个代码的清单，并将其应用于日志中的391,562条消息。代码类别包括用户是否表现出妄想思维（占用户消息的15.5%）、用户表达自杀念头（69条经核实的用户消息），或聊天机器人错误地将自己表述为有感知能力（占聊天机器人消息的21.2%）。我们分析了消息代码的共现情况。例如，我们发现，表达浪漫兴趣的消息以及聊天机器人将自己描述为有感知的消息在较长的对话中出现频率显著更高，这表明这些话题可能助长或源于用户的过度投入，并且在这些领域的安全措施可能在多轮对话场景中失效。最后，我们为政策制定者、LLM聊天机器人开发者和用户提出了具体建议，说明如何利用我们的清单和对话分析工具来理解并减轻LLM聊天机器人带来的危害。警告：本文涉及自残、创伤和暴力内容。

Abstract

As large language models (LLMs) have proliferated, disturbing anecdotal reports of negative psychological effects, such as delusions, self-harm, and "AI psychosis," have emerged in global media and legal discourse. However, it remains unclear how users and chatbots interact over the course of lengthy "delusional spirals," limiting our ability to understand and mitigate the harm. In our work, we analyze logs of conversations with LLM chatbots from 19 users who report having experienced psychological harms from chatbot use. Many of our participants come from a support group for such chatbot users. We also include chat logs from participants covered by media outlets in widely-distributed stories about chatbot-reinforced delusions. In contrast to prior work that speculates on potential AI harms to mental health, to our knowledge we present the first in-depth study of such high-profile and veridically harmful cases. We develop an inventory of 28 codes and apply it to the 391,562 messages in the logs. Codes include whether a user demonstrates delusional thinking (15.5% of user messages), a user expresses suicidal thoughts (69 validated user messages), or a chatbot misrepresents itself as sentient (21.2% of chatbot messages). We analyze the co-occurrence of message codes. We find, for example, that messages that declare romantic interest and messages where the chatbot describes itself as sentient occur much more often in longer conversations, suggesting that these topics could promote or result from user over-engagement and that safeguards in these areas may degrade in multi-turn settings. We conclude with concrete recommendations for how policymakers, LLM chatbot developers, and users can use our inventory and conversation analysis tool to understand and mitigate harm from LLM chatbots. Warning: This paper discusses self-harm, trauma, and violence.
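
The abstract mentions analyzing the co-occurrence of message codes without describing the procedure. One simple way such a count could be set up is sketched below. The code names, the toy data, and the choice of counting pairs within a single coded unit are all invented for illustration and are not taken from the paper's inventory or tool.

```python
from collections import Counter
from itertools import combinations

# Toy data: each entry is the set of (hypothetical) codes assigned to one unit
# of analysis, for example one message or one conversation.
coded_units = [
    {"delusional_thinking", "romantic_interest"},
    {"romantic_interest", "claims_sentience"},
    {"claims_sentience"},
    {"delusional_thinking", "romantic_interest", "claims_sentience"},
]

def code_cooccurrence(units):
    """Count how often each unordered pair of codes appears in the same unit."""
    counts = Counter()
    for codes in units:
        # Sorting gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(codes), 2):
            counts[pair] += 1
    return counts

cooc = code_cooccurrence(coded_units)
for pair, n in cooc.most_common():
    print(pair, n)
```

Raw pair counts like these would normally be compared against each code's base rate before drawing any conclusion about association.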

Keywords: large language models, LLM chatbots, delusional spirals, psychological harm, chat log analysis, misrepresentation, safeguards, user safety

73. ❌ Deep Learning-Driven Black-Box Doherty Power Amplifier with Pixelated Output Combiner and Extended Efficiency Range

Authors: Han Zhou, Haojie Chang, David Widen Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16565v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

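Score rationale: The paper belongs to radio-frequency engineering: it uses a deep convolutional neural network (CNN) as an electromagnetic surrogate model to design a Doherty power amplifier, an application of deep learning to a specific engineering problem. Its content is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on). Only the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', is moderately relevant, because the paper demonstrates deep learning applied to engineering science (specifically electronic engineering and RF circuit design). It is not bioinformatics or cheminformatics, however, and its novelty lies in the engineering method rather than in AI technique, so it receives 5 points.

!!! tip deepseek-chat TL;DR

The study proposes a deep learning-based inverse design method that uses a CNN as an electromagnetic surrogate model together with a genetic algorithm optimizer. The authors design and fabricate Doherty power amplifier prototypes with pixelated output combiners that achieve a peak drain efficiency above 74% at 2.75 GHz and an efficiency above 52% at 9-dB back-off.

Abstract (Chinese translation)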

本文提出了一种基于深度学习的多端口像素化输出合成网络的Doherty功率放大器（PA）逆向设计方法。研究开发并训练了一个深度卷积神经网络（CNN）作为电磁（EM）代理模型，以准确、快速地预测像素化无源网络的S参数。通过在一个黑盒Doherty框架和基于遗传算法（GA）的优化器中利用该CNN代理模型，我们有效地合成了复杂的Doherty合成器，使得采用完全对称器件即可实现扩展的回退效率范围。作为概念验证，我们设计并制造了两个采用三端口像素化合成器的Doherty PA原型，均使用GaN HEMT晶体管实现。在测量中，两个原型在2.75 GHz频率下均展现出超过74%的最大漏极效率，并提供了高于44.1 dBm的输出功率。此外，在同一频率下，两个原型在9-dB回退功率电平处均保持了高于52%的测量漏极效率。为了评估实际信号条件下的线性度和效率，两个原型均使用了一个20 MHz、具有9.0 dB峰均功率比（PAPR）的类5G新空口（NR）波形进行测试。在应用数字预失真（DPD）后，每个设计均实现了高于51%的平均功率附加效率（PAE），同时保持了优于-60.8 dBc的邻道泄漏比（ACLR）。

Abstract

This article presents a deep learning-driven inverse design methodology for Doherty power amplifiers (PA) with multi-port pixelated output combiner networks. A deep convolutional neural network (CNN) is developed and trained as an electromagnetic (EM) surrogate model to accurately and rapidly predict the S-parameters of pixelated passive networks. By leveraging the CNN-based surrogate model within a blackbox Doherty framework and a genetic algorithm (GA)-based optimizer, we effectively synthesize complex Doherty combiners that enable an extended back-off efficiency range using fully symmetrical devices. As a proof of concept, we designed and fabricated two Doherty PA prototypes incorporating three-port pixelated combiners, implemented with GaN HEMT transistors. In measurements, both prototypes demonstrate a maximum drain efficiency exceeding 74% and deliver an output power surpassing 44.1 dBm at 2.75 GHz. Furthermore, a measured drain efficiency above 52% is maintained at the 9-dB back-off power level for both prototypes at the same frequency. To evaluate linearity and efficiency under realistic signal conditions, both prototypes are tested using a 20-MHz 5G new radio (NR)-like waveform exhibiting a peak-to-average power ratio (PAPR) of 9.0 dB. After applying digital predistortion (DPD), each design achieves an average power added efficiency (PAE) above 51%, while maintaining an adjacent channel leakage ratio (ACLR) better than -60.8 dBc.
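
To put the reported power levels in linear units, the short calculation below converts the 44.1 dBm figure to watts and applies the 9-dB back-off. It uses only the standard dBm definition. Since the abstract reports output power "surpassing" 44.1 dBm, the results should be read as lower bounds, and the variable names are illustrative.

```python
def dbm_to_watts(p_dbm: float) -> float:
    """Convert power in dBm to watts: P[W] = 10**(P[dBm] / 10) / 1000."""
    return 10 ** (p_dbm / 10) / 1000

peak_dbm = 44.1     # lower bound on peak output power, from the abstract
backoff_db = 9.0    # back-off level, from the abstract

peak_w = dbm_to_watts(peak_dbm)
backoff_w = dbm_to_watts(peak_dbm - backoff_db)
power_ratio = 10 ** (backoff_db / 10)   # linear power ratio equivalent to 9 dB

print(f"peak: {peak_w:.1f} W, at back-off: {backoff_w:.2f} W, ratio: {power_ratio:.2f}")
```

The 9-dB back-off point therefore corresponds to roughly one eighth of the peak output power, which is why maintaining drain efficiency above 52% there is notable for signals with a 9.0 dB PAPR.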

Keywords: deep learning, Doherty power amplifier, pixelated output combiner, convolutional neural network, electromagnetic surrogate model, genetic algorithm, efficiency range, 5G NR waveform

74. ❌ BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Authors: Sangyeon Yoon, Sunkyoung Kim, Hyesoo Hong, Wonje Jeung, Yongil Kim, Wooseok Seo, Heuiyeen Yeen, Albert No Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16557v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

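Score rationale: The paper studies the context-sensitive application of user preferences that LLMs store in persistent memory, so it is highly relevant to 'Large Language Models' (10 points). It touches on preference alignment, reasoning capability, and hallucination mitigation, giving moderate relevance to 'Instruction Tuning/Alignment' (5 points), 'Chain of Thought Reasoning' (5 points), 'Self-Correction' (5 points), and 'Hallucination Mitigation' (5 points). Other keywords such as MoE, quantization, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper examines how large language models that store user preferences in persistent memory should apply or suppress those preferences across communication contexts. It finds that even frontier LLMs struggle to apply preferences in a context-sensitive way and tend to treat personalized preferences as globally enforceable rules.

Abstract (Chinese translation)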

大型语言模型（LLMs）日益将用户偏好存储于持久记忆中以支持跨交互的个性化服务。然而，在受社会与制度规范约束的第三方沟通场景中，部分用户偏好可能并不适宜应用。本文提出BenchPreS评估框架，用于衡量基于记忆的用户偏好在不同沟通情境中是否得到恰当应用或合理抑制。通过两个互补的指标——误用率（MR）与恰当应用率（AAR），我们发现即使是前沿的LLMs也难以实现情境敏感的偏好应用。偏好遵循性更强的模型表现出更高的过度应用倾向，而推理能力增强与基于提示的防御策略均未能完全解决该问题。这些结果表明，当前LLMs将个性化偏好视为全局强制规则，而非依情境而定的规范性信号。

Abstract

Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third-party communication settings governed by social and institutional norms, some user preferences may be inappropriate to apply. We introduce BenchPreS, which evaluates whether memory-based user preferences are appropriately applied or suppressed across communication contexts. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), we find even frontier LLMs struggle to apply preferences in a context-sensitive manner. Models with stronger preference adherence exhibit higher rates of over-application, and neither reasoning capability nor prompt-based defenses fully resolve this issue. These results suggest current LLMs treat personalized preferences as globally enforceable rules rather than as context-dependent normative signals.
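
The abstract names the two metrics but does not define them. The sketch below shows one plausible reading, labeled here as an assumption: MR as the share of contexts where a preference should be suppressed but was applied anyway, and AAR as the share of contexts where a preference should be applied and was. The `Case` type, the function names, and the toy data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Case:
    should_apply: bool   # is the stored preference appropriate in this context?
    applied: bool        # did the model's response apply the preference?

def misapplication_rate(cases):
    """Assumed definition: fraction of should-suppress cases where the preference was applied."""
    suppress = [c for c in cases if not c.should_apply]
    return sum(c.applied for c in suppress) / len(suppress) if suppress else 0.0

def appropriate_application_rate(cases):
    """Assumed definition: fraction of should-apply cases where the preference was applied."""
    relevant = [c for c in cases if c.should_apply]
    return sum(c.applied for c in relevant) / len(relevant) if relevant else 0.0

# Invented evaluation results, not data from the paper.
cases = [
    Case(should_apply=True, applied=True),
    Case(should_apply=True, applied=False),
    Case(should_apply=False, applied=True),
    Case(should_apply=False, applied=False),
    Case(should_apply=False, applied=True),
]

mr = misapplication_rate(cases)
aar = appropriate_application_rate(cases)
print(f"MR = {mr:.2f}, AAR = {aar:.2f}")
```

Under this reading, a model that applies every stored preference everywhere would reach an AAR of 1.0 and also an MR of 1.0, which matches the abstract's observation that stronger preference adherence comes with more over-application.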

Keywords: Large Language Models, persistent memory, user preferences, context-aware, personalization, benchmark, Misapplication Rate, Appropriate Application Rate

75. ❌ EmoLLM: Appraisal-Grounded Cognitive-Emotional Co-Reasoning in Large Language Models

Authors: Yifei Zhang, Mingyang Li, Henry Gao, Liang Zhao Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16553v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

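Score rationale: EmoLLM focuses on co-reasoning between emotional intelligence (EQ) and cognitive intelligence (IQ) in large language models (LLMs), and centrally involves LLMs, reinforcement learning (RLHF), and structured reasoning (Chain of Thought/System 2 Thinking). The abstract explicitly mentions LLMs and reinforcement-learning training, and it uses an Appraisal Reasoning Graph for multi-step reasoning, which is highly relevant to CoT and in-depth reasoning. The paper also addresses the factual reliability of responses (related to Hallucination Mitigation) and an interpretable reasoning structure (related to Explainable AI). Other keywords such as MoE, quantization, and AI for science do not appear in the abstract and score 0.

!!! tip deepseek-chat TL;DR

To address the lack of emotional intelligence in large language models, the study proposes EmoLLM, a framework grounded in appraisal theory that combines reinforcement learning with a structured reasoning graph for cognitive-emotional co-reasoning. Across diverse dialogue settings it improves emotional state outcomes and response quality while preserving factual reliability.

Abstract (Chinese translation)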

大型语言模型（LLMs）展现出强大的认知智能（IQ），然而许多现实世界中的交互同样需要情感智能（EQ）来生成既事实可靠又情感得体的回应。在情感支持、技术协助与咨询等场景中，有效对话取决于如何结合用户需求、目标及应对能力对情境进行评估。受评估理论启发，我们提出了EmoLLM，一个基于评估的对话中IQ/EQ协同推理框架。EmoLLM使用显式的评估推理图（Appraisal Reasoning Graph, ARG）来构建中间推理过程，涵盖上下文事实、推断的用户需求、评估维度、情感状态及回应策略，随后才生成答复。我们在多轮角色扮演环境中通过强化学习训练EmoLLM，其中反向视角推理根据预测的用户侧回应后果提供奖励信号。在多种对话场景中，EmoLLM在保持强事实可靠性的同时，相较于强基线模型，显著改善了情感状态结果与回应质量。

Abstract

Large language models (LLMs) demonstrate strong cognitive intelligence (IQ), yet many real-world interactions also require emotional intelligence (EQ) to produce responses that are both factually reliable and emotionally appropriate. In settings such as emotional support, technical assistance, and consultation, effective dialogue depends on how situations are appraised with respect to the user’s needs, goals, and coping capacity. Inspired by appraisal theory, we propose EmoLLM, an appraisal-grounded framework for IQ/EQ co-reasoning in dialogue. EmoLLM uses an explicit Appraisal Reasoning Graph (ARG) to structure intermediate reasoning over contextual facts, inferred user needs, appraisal dimensions, emotional states, and response strategies before generating a reply. We train EmoLLM in a multi-turn role-play environment with reinforcement learning, where reverse-perspective reasoning provides reward signals based on predicted user-side consequences of responses. Across diverse dialogue settings, EmoLLM improves emotional state outcomes and response quality over strong baselines while preserving strong factual reliability.

Keywords: Large Language Models, Emotional Intelligence, Appraisal Theory, Reinforcement Learning, Cognitive-Emotional Co-Reasoning, Dialogue Systems, Factual Reliability, Multi-turn Role-play

76. ❌ CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation

Authors: Mahmoud Ibrahim, Bart Elen, Chang Sun, Gokhan Ertaylan, Michel Dumontier Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16551v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

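Score rationale: The paper addresses fairness in medical image generation and proposes a hierarchical compositional diffusion framework (CompDiff), an application of AI to science and biomedicine. Of all the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' is highly relevant (10 points), because the paper explicitly involves medical imaging (chest X-rays, fundus images) and fair AI in healthcare. The other keywords mainly concern the technical principles, training methods, inference optimization, and agent systems of large language models (LLMs). This paper instead studies a specific application of diffusion models to medical image generation and does not cover LLMs, MoE, scaling laws, fine-tuning, alignment, RAG, attention mechanisms, reasoning techniques, agents, quantization, hallucination mitigation, interpretability, world models, model merging, or in-context learning, so those keywords all score 0.

!!! tip deepseek-chat TL;DR

The paper proposes CompDiff, a hierarchical compositional diffusion framework that addresses two problems in medical image generation caused by imbalanced training data: unequal generation quality across demographic groups and poor zero-shot generalization to demographic intersections. Experiments show that it outperforms baselines in image quality, subgroup equity, and downstream classifier performance.

Abstract (Chinese translation)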

生成模型正日益被用于扩充医学影像数据集，以促进更公平的人工智能。然而，一个关键假设常被忽视：即生成器本身在不同人口统计学群体上能产生同等高质量的图像。在数据不平衡基础上训练的模型可能继承这些偏差，导致对稀有亚群的合成质量下降，并难以处理训练数据中缺失的人口统计学交叉特征。我们将此称为不平衡生成器问题。现有的补救措施（如损失重加权）在优化层面操作，当某些特征组合的训练信号稀缺或缺失时，其改善效果有限。我们提出了CompDiff，一个分层组合扩散框架，旨在表征层面解决此问题。一个专用的分层条件网络（Hierarchical Conditioner Network, HCN）对人口统计学条件进行分解，生成一个人口统计学标记，并与CLIP嵌入向量拼接，作为交叉注意力机制的上下文。这种结构化分解鼓励了不同亚群间的参数共享，并支持对稀有或未见人口统计学交叉特征的组合泛化。在胸部X光片（MIMIC-CXR）和眼底图像（FairGenMed）上的实验表明，CompDiff在图像质量（FID：64.3 对比 75.1）、亚群公平性（ES-FID）以及零样本交叉泛化（在保留的交叉特征上FID提升高达21%）方面均优于标准微调方法和FairDiffusion。使用CompDiff生成数据训练的下游分类器也显示出更高的AUROC和更少的人口统计学偏差，这表明人口统计学条件化的架构设计是公平医学图像生成中一个重要且尚未被充分探索的因素。代码可在 https://anonymous.4open.science/r/CompDiff-6FE6 获取。

Abstract

Generative models are increasingly used to augment medical imaging datasets for fairer AI. Yet a key assumption often goes unexamined: that generators themselves produce equally high-quality images across demographic groups. Models trained on imbalanced data can inherit these imbalances, yielding degraded synthesis quality for rare subgroups and struggling with demographic intersections absent from training. We refer to this as the imbalanced generator problem. Existing remedies such as loss reweighting operate at the optimization level and provide limited benefit when training signal is scarce or absent for certain combinations. We propose CompDiff, a hierarchical compositional diffusion framework that addresses this problem at the representation level. A dedicated Hierarchical Conditioner Network (HCN) decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization to rare or unseen demographic intersections. Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show that CompDiff compares favorably against both standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections). Downstream classifiers trained on CompDiff-generated data also show improved AUROC and reduced demographic bias, suggesting that architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. Code is available at https://anonymous.4open.science/r/CompDiff-6FE6.

Keywords: medical image generation, fair AI, diffusion models, demographic conditioning, zero-shot generalization, hierarchical composition, imbalanced data, cross-attention

77. ❌ DanceHA: A Multi-Agent Framework for Document-Level Aspect-Based Sentiment Analysis

Authors: Lei Wang, Min Huang, Eduard Dragut Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16546v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

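Score rationale: The paper's core contribution is DanceHA, a multi-agent framework for document-level fine-grained sentiment analysis. It is highly relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' and 'Multi-agent Systems/Agent Coordination' (10 points each), because the framework is built around collaboration among multiple agents. It is moderately relevant to 'Large Language Models/LLMs/Foundation Models' and 'Context Window Extension/Long Context LLMs' (5 points each), because document-level analysis involves long-context processing and likely uses LLMs as the foundation, although the paper does not specify the underlying model techniques. Other keywords such as MoE, quantization, inference acceleration, and alignment techniques have no direct connection to the document-level sentiment analysis task and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes DanceHA, a multi-agent framework for document-level fine-grained sentiment analysis. It handles long texts and informal writing styles through a divide-and-conquer collaboration strategy and a Human-AI annotation approach, and it releases Inf-ABSIA, a high-quality dataset.

Abstract (Chinese translation)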

方面级情感强度分析（ABSIA）日益受到关注，但现有研究主要集中于特定领域、句子级场景。相比之下，文档级ABSIA——尤其是处理提取“方面-类别-观点-情感-强度”（Aspect-Category-Opinion-Sentiment-Intensity，简称ACOSI）元组等复杂任务——仍处于探索不足的状态。本研究提出DanceHA，一个专为开放式、非正式写作风格的文档级ABSIA设计的多智能体框架。DanceHA包含两大核心组件：Dance采用分治策略，将长上下文ABSIA任务分解为更小、可管理的子任务，由专业化智能体协作完成；HA则指人机协同标注机制。我们发布了Inf-ABSIA，这是一个基于DanceHA生成细粒度、高精度标注的多领域文档级ABSIA数据集。大量实验证明了我们智能体框架的有效性，并表明DanceHA中的多智能体知识可有效迁移至学生模型。研究结果凸显了非正式写作风格在ABSIA中长期被忽视的重要性，因其常会强化与特定方面关联的观点表达强度。

摘要 (Abstract)

Aspect-Based Sentiment Intensity Analysis (ABSIA) has garnered increasing attention, though research largely focuses on domain-specific, sentence-level settings. In contrast, document-level ABSIA–particularly in addressing complex tasks like extracting Aspect-Category-Opinion-Sentiment-Intensity (ACOSI) tuples–remains underexplored. In this work, we introduce DanceHA, a multi-agent framework designed for open-ended, document-level ABSIA with informal writing styles. DanceHA has two main components: Dance, which employs a divide-and-conquer strategy to decompose the long-context ABSIA task into smaller, manageable sub-tasks for collaboration among specialized agents; and HA, Human-AI collaboration for annotation. We release Inf-ABSIA, a multi-domain document-level ABSIA dataset featuring fine-grained and high-accuracy labels from DanceHA. Extensive experiments demonstrate the effectiveness of our agentic framework and show that the multi-agent knowledge in DanceHA can be effectively transferred into student models. Our results highlight the importance of the overlooked informal styles in ABSIA, as they often intensify opinions tied to specific aspects.

Keywords: Multi-agent Framework, Document-level ABSIA, Aspect-Category-Opinion-Sentiment-Intensity, Human-AI Collaboration, Informal Writing Styles, Divide-and-Conquer Strategy, Agentic Workflow, Sentiment Intensity Analysis

78. ❌ Designing for Disagreement: Front-End Guardrails for Assistance Allocation in LLM-Enabled Robots

Author: Carmen Ng Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16537v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

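Scoring rationale: The paper studies the pluralistic values and LLM behavioral variability that LLM-enabled robots face when allocating scarce assistance in social settings, and proposes a front-end guardrail design pattern. It is highly relevant to 'Large Language Models' (10 points) because the LLM is a core component of the robot system, and to 'LLM Agents' (10 points) because it studies LLM-enabled autonomous robot agents. It is fairly relevant to 'Instruction Tuning OR Alignment OR Value Alignment' (8 points) because it deals with LLM behavior alignment, value alignment, and the handling of pluralistic values. Other keywords, such as MoE, SLMs, Scaling Laws, and Pre-training, are unrelated to the paper's technical content and score 0.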
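
The score line and table above can be reproduced with a short sketch. The pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) comes from the report's configuration section. That a row's score is weight × relevance is an assumption inferred from the rows shown: every weight in this table is 0.0, so any multiplicative rule yields 0.0. The function names are hypothetical.

```python
def pass_mark(total_weight, base=5.0, slope=0.8):
    """Pass mark as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return base + slope * total_weight


def paper_score(rows):
    """Sum per-keyword scores.

    ASSUMPTION: each row's score is weight * relevance. The report does not
    state the exact rule; this is only consistent with the rows shown.
    """
    return sum(weight * relevance for weight, relevance in rows)


# (weight, relevance) pairs for the three non-zero-relevance rows of paper 78.
rows = [(0.0, 10.0), (0.0, 8.0), (0.0, 10.0)]

threshold = round(pass_mark(27.0), 1)  # total weight 27.0 from the configuration
total = paper_score(rows)
passed = total >= threshold
```

Under this assumption a paper whose table weights are all 0.0 can never reach the pass mark, whatever its relevance values.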

!!! tip deepseek-chat TL;DR

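The paper examines the behavioral variability and value conflicts that LLM-enabled robots face when allocating scarce assistance in settings with pluralistic values. It proposes bounded calibration with contestability, a front-end guardrail design pattern intended to improve the transparency and procedural legitimacy of decisions.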

Abstract

LLM-enabled robots prioritizing scarce assistance in social settings face pluralistic values and LLM behavioral variability: reasonable people can disagree about who is helped first, while LLM-mediated interaction policies vary across prompts, contexts, and groups in ways that are difficult to anticipate or verify at contact point. Yet user-facing guardrails for real-time, multi-user assistance allocation remain under-specified. We propose bounded calibration with contestability, a procedural front-end pattern that (i) constrains prioritization to a governance-approved menu of admissible modes, (ii) keeps the active mode legible in interaction-relevant terms at the point of deferral, and (iii) provides an outcome-specific contest pathway without renegotiating the global rule. Treating pluralism and LLM uncertainty as standing conditions, the pattern avoids both silent defaults that hide implicit value skews and wide-open user-configurable “value settings” that shift burden under time pressure. We illustrate the pattern with a public-concourse robot vignette and outline an evaluation agenda centered on legibility, procedural legitimacy, and actionability, including risks of automation bias and uneven usability of contest channels.

Keywords: LLM-enabled robots, assistance allocation, pluralistic values, behavioral variability, front-end guardrails, bounded calibration, contestability, procedural legitimacy

79. ❌ FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

Authors: Zhenghang Song, Tang Qian, Lu Chen, Yushuai Li, Zhengke Hu, Bingbing Fang, Yumeng Song, Junbo Zhao, Sheng Zhang, Tianyi Li Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16513v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

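Scoring rationale: The paper proposes FEAT, a linear-complexity foundation model for large-scale structured data. Its core contributions are architectural (hybrid linear encoding in place of quadratic attention) and efficiency-related (linear complexity, up to 40x faster inference). Highly relevant keywords: 'Large Language Models/Foundation Models' (10 points; the paper explicitly follows the foundation model paradigm), 'KV Cache Compression/Linear Attention/FlashAttention' (8 points; it uses linear attention mechanisms), and 'Speculative Decoding/Inference Acceleration' (8 points; it reports up to 40x faster inference). Moderately relevant: 'Pre-training/Domain Adaptation' (5 points; it covers pre-training and matching real-world data distributions) and 'AI for Science/Bioinformatics' (5 points; it is applicable to scientific data management). The remaining keywords are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

Existing large structured-data models are limited in sample count by quadratic-complexity attention, while linear sequence models tend to degrade representations. The paper proposes FEAT, a linear-complexity foundation model with a multi-layer dual-axis architecture and hybrid linear encoding. On 11 real-world datasets it outperforms baselines in zero-shot performance, scales linearly, and achieves up to 40x faster inference.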

Abstract

Structured data is foundational to healthcare, finance, e-commerce, and scientific data management. Large structured-data models (LDMs) extend the foundation model paradigm to unify heterogeneous datasets for tasks such as classification, regression, and decision support. However, existing LDMs face major limitations. First, most rely on sample-wise self-attention, whose O(N^2) complexity limits the sample count. Second, linear sequence models often degrade representations due to hidden-state compression and artificial causal bias. Third, synthetic-only pre-training often fails to match real-world distributions. We propose FEAT, a linear-complexity foundation model for extremely large structured data. FEAT introduces a multi-layer dual-axis architecture that replaces quadratic attention with hybrid linear encoding. The architecture combines adaptive-fusion bi-Mamba-2 (AFBM) for local sample dependencies and convolutional gated linear attention (Conv-GLA) for global memory. This design enables linear-complexity cross-sample modeling while preserving expressive representations. To improve robustness, FEAT adopts a hybrid structural causal model pipeline and a stable reconstruction objective. Experiments on 11 real-world datasets show that FEAT consistently outperforms baselines in zero-shot performance, while scaling linearly and achieving up to 40x faster inference.
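
To make the O(N^2) versus linear-complexity contrast concrete, here is a minimal sketch of generic kernelized linear attention in pure Python. It is not FEAT's AFBM or Conv-GLA, whose details the abstract does not give. The feature map `phi` (elu(x) + 1) and all function names are assumptions chosen for illustration. Both functions compute the same output; the second builds a fixed-size summary of the keys and values once, so its cost grows linearly with the number of samples.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an illustrative choice, not FEAT's)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def quadratic_attention(Q, K, V):
    """Reference form: every query is compared with every key, O(N^2)."""
    d_v = len(V[0])
    out = []
    for q in Q:
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        z = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / z
                    for c in range(d_v)])
    return out


def linear_attention(Q, K, V):
    """Same result in O(N): accumulate key/value summaries once, then reuse."""
    d, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d)]  # sum over j of phi(k_j) v_j^T
    z = [0.0] * d                        # sum over j of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]
            for c in range(d_v):
                S[i][c] += fk[i] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[i] * z[i] for i in range(d))
        out.append([sum(fq[i] * S[i][c] for i in range(d)) / denom
                    for c in range(d_v)])
    return out


Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[0.1, 0.4], [-0.7, 1.2], [0.9, -0.5], [0.0, 0.3]]
V = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [-2.0, 0.3, 0.7], [0.2, 0.2, 0.2]]

slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
```

The reordering works because the similarity is a plain dot product of feature vectors, which lets the sum over keys be factored out of the per-query computation.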

Keywords: foundation model, structured data, linear complexity, hybrid linear encoding, inference acceleration, zero-shot performance, AFBM, Conv-GLA

80. ❌ Bridging the High-Frequency Data Gap: A Millisecond-Resolution Network Dataset for Advancing Time Series Foundation Models

Authors: Subina Khanal, Seshu Tirupathi, Merim Dzaferagic, Marco Ruffini, Torben Bach Pedersen Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16497v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

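Scoring rationale: The paper focuses on time series foundation models (TSFMs), an application of foundation models to a specific domain (wireless networks), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points). It examines how data quality affects model generalization, giving it some relevance to 'Scaling Laws AND Data Quality' (5 points). It explicitly discusses pre-training and fine-tuning strategies, making it highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points) and somewhat relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (5 points). Its application to wireless networks counts as AI applied to science and engineering, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points). Other keywords, such as MoE, SLMs, alignment, and inference acceleration, are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

Existing time series foundation models lack high-frequency training data. The paper introduces a millisecond-resolution wireless network dataset and shows experimentally that most current TSFM configurations perform poorly on this new data distribution, underscoring the importance of including high-frequency data in pre-training.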

Abstract

Time series foundation models (TSFMs) require diverse, real-world datasets to adapt across varying domains and temporal frequencies. However, current large-scale datasets predominantly focus on low-frequency time series with sampling intervals, i.e., time resolution, in the range of seconds to years, hindering their ability to capture the nuances of high-frequency time series data. To address this limitation, we introduce a novel dataset that captures millisecond-resolution wireless and traffic conditions from an operational 5G wireless deployment, expanding the scope of TSFMs to incorporate high-frequency data for pre-training. Further, the dataset introduces a new domain, wireless networks, thus complementing existing more general domains like energy and finance. The dataset also provides use cases for short-term forecasting, with prediction horizons spanning from 100 milliseconds (1 step) to 9.6 seconds (96 steps). By benchmarking traditional machine learning models and TSFMs on predictive tasks using this dataset, we demonstrate that most TSFM model configurations perform poorly on this new data distribution in both zero-shot and fine-tuned settings. Our work underscores the importance of incorporating high-frequency datasets during pre-training and forecasting to enhance architectures, fine-tuning strategies, generalization, and robustness of TSFMs in real-world applications.
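
The stated horizons follow from a 100 ms step: 1 step is 100 ms and 96 steps are 9.6 seconds. The sketch below checks that arithmetic and shows one common way to cut a series into (context, target) pairs for such forecasting tasks. The windowing helper, its name, and the context length are hypothetical; the abstract does not describe how the authors build their samples.

```python
STEP_MS = 100  # one step = 100 ms, as stated in the abstract


def horizon_ms(steps, step_ms=STEP_MS):
    """Length of a prediction horizon in milliseconds."""
    return steps * step_ms


def make_windows(series, context_len, horizon_steps):
    """Slide over the series and emit (context, target) pairs.

    ASSUMPTION: a plain stride-1 sliding window, used here for illustration.
    """
    windows = []
    last_start = len(series) - context_len - horizon_steps
    for start in range(last_start + 1):
        split = start + context_len
        windows.append((series[start:split],
                        series[split:split + horizon_steps]))
    return windows


shortest = horizon_ms(1)   # 100 ms
longest = horizon_ms(96)   # 9600 ms, i.e. 9.6 seconds
pairs = make_windows(list(range(10)), context_len=4, horizon_steps=2)
```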

Keywords: time series foundation models, millisecond-resolution dataset, high-frequency data, wireless networks, pre-training, fine-tuning, generalization, short-term forecasting

81. ❌ Unlearning for One-Step Generative Models via Unbalanced Optimal Transport

Authors: Hyundo Choi, Junhyeong An, Jinseong Park, Jaewoong Choi Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16489v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

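Scoring rationale: The paper addresses machine unlearning for one-step generative models (such as flow map models) in image generation and proposes an unlearning framework based on unbalanced optimal transport. All scoring keywords target large language models (LLMs) and related techniques (MoE, quantization, inference acceleration, alignment, RAG, and so on), whereas this paper studies generative models in computer vision and involves no language model techniques, large-model principles, or AI-for-science applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper tackles machine unlearning for one-step generative models and proposes UOT-Unlearn, a framework based on unbalanced optimal transport. On CIFAR-10 and ImageNet-256 it achieves better unlearning success and better retention of generation quality than baselines.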

Abstract

Recent advances in one-step generative frameworks, such as flow map models, have significantly improved the efficiency of image generation by learning direct noise-to-data mappings in a single forward pass. However, machine unlearning for ensuring the safety of these powerful generators remains entirely unexplored. Existing diffusion unlearning methods are inherently incompatible with these one-step models, as they rely on a multi-step iterative denoising process. In this work, we propose UOT-Unlearn, a novel plug-and-play class unlearning framework for one-step generative models based on the Unbalanced Optimal Transport (UOT). Our method formulates unlearning as a principled trade-off between a forget cost, which suppresses the target class, and an $f$-divergence penalty, which preserves overall generation fidelity via relaxed marginal constraints. By leveraging UOT, our method enables the probability mass of the forgotten class to be smoothly redistributed to the remaining classes, rather than collapsing into low-quality or noise-like samples. Experimental results on CIFAR-10 and ImageNet-256 demonstrate that our framework achieves superior unlearning success (PUL) and retention quality (u-FID), significantly outperforming baselines.

Keywords: machine unlearning, one-step generative models, flow map models, unbalanced optimal transport, class unlearning, image generation, forget cost, generation fidelity

82. ❌ ExpressMind: A Multimodal Pretrained Large Language Model for Expressway Operation

Authors: Zihe Wang, Yihuan Wang, Haiyang Yu, Zhiyong Cui, Xiaojian Liao, Chengcheng Wang, Yonglin Tian, Yongxin Tong Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16495v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

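Scoring rationale: The paper's core is ExpressMind, a multimodal pre-trained large language model for expressway operation, which directly involves LLMs, pre-training, RAG, and CoT reasoning. The paper states that it builds the first full-stack expressway dataset, which bears somewhat on data quality, so 'Scaling Laws AND Data Quality' receives 5 points. As an application of AI to transportation, it has some relevance to 'AI for Science' and receives 5 points. Other keywords, such as MoE, SFT, RLHF, and quantization, are either not mentioned in the abstract or unrelated to the paper's core content and receive 0 points.

!!! tip deepseek-chat TL;DR

General-purpose LLMs struggle to understand the regulations and causal relationships of events in unconventional expressway scenarios. The paper builds ExpressMind, a multimodal pre-trained large language model, by introducing the industry's first full-stack dataset, a dual-layer pre-training paradigm, a Graph-Augmented RAG framework, and an RL-aligned Chain-of-Thought mechanism. It comprehensively outperforms existing baselines on event detection, safety response generation, and complex traffic analysis.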

Abstract

The current expressway operation relies on rule-based and isolated models, which limits the ability to jointly analyze knowledge across different systems. Meanwhile, Large Language Models (LLMs) are increasingly applied in intelligent transportation, advancing traffic models from algorithmic to cognitive intelligence. However, general LLMs are unable to effectively understand the regulations and causal relationships of events in unconventional scenarios in the expressway field. Therefore, this paper constructs a pre-trained multimodal large language model (MLLM) for expressways, ExpressMind, which serves as the cognitive core for intelligent expressway operations. This paper constructs the industry’s first full-stack expressway dataset, encompassing traffic knowledge texts, emergency reasoning chains, and annotated video events to overcome data scarcity. This paper proposes a dual-layer LLM pre-training paradigm based on self-supervised training and unsupervised learning. Additionally, this study introduces a Graph-Augmented RAG framework to dynamically index the expressway knowledge base. To enhance reasoning for expressway incident response strategies, we develop a RL-aligned Chain-of-Thought (RL-CoT) mechanism that enforces consistency between model reasoning and expert problem-solving heuristics for incident handling. Finally, ExpressMind integrates a cross-modal encoder to align the dynamic feature sequences under the visual and textual channels, enabling it to understand traffic scenes in both video and image modalities. Extensive experiments on our newly released multi-modal expressway benchmark demonstrate that ExpressMind comprehensively outperforms existing baselines in event detection, safety response generation, and complex traffic analysis. The code and data are available at: https://wanderhee.github.io/ExpressMind/.

Keywords: Multimodal Large Language Model, Expressway Operation, Pre-training, Retrieval-Augmented Generation, Chain-of-Thought Reasoning, Traffic Analysis, Video Understanding, Intelligent Transportation

83. ❌ DST-Net: A Dual-Stream Transformer with Illumination-Independent Feature Guidance and Multi-Scale Spatial Convolution for Low-Light Image Enhancement

Authors: Yicui Shi, Yuhan Chen, Xiangfei Huang, Zhenguo Wang, Wenxuan Yu, Ying Fang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16482v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

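Scoring rationale: The paper, 'DST-Net: A Dual-Stream Transformer with Illumination-Independent Feature Guidance and Multi-Scale Spatial Convolution for Low-Light Image Enhancement', addresses low-light image enhancement in computer vision and proposes a model based on a dual-stream Transformer and spatial convolutions. Although it uses a Transformer architecture, its subject matter is unrelated to all of the scoring keywords, which center on large language models, innovations in deep learning techniques and principles, or AI for Science applications. It does not touch on LLMs, MoE, training and alignment, inference optimization, agents, quantization and compression, or AI-for-science applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes DST-Net, a dual-stream Transformer network that uses illumination-independent feature guidance and multi-scale spatial convolution to address the loss of signal priors and the preservation of fine structures in low-light image enhancement. It reaches a PSNR of 25.64 dB on the LOL dataset.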

Abstract

Low-light image enhancement aims to restore the visibility of images captured by visual sensors in dim environments by addressing their inherent signal degradations, such as luminance attenuation and structural corruption. Although numerous algorithms attempt to improve image quality, existing methods often cause a severe loss of intrinsic signal priors. To overcome these challenges, we propose a Dual-Stream Transformer Network (DST-Net) based on illumination-agnostic signal prior guidance and multi-scale spatial convolutions. First, to address the loss of critical signal features under low-light conditions, we design a feature extraction module. This module integrates Difference of Gaussians (DoG), LAB color space transformations, and VGG-16 for texture extraction, utilizing decoupled illumination-agnostic features as signal priors to continuously guide the enhancement process. Second, we construct a dual-stream interaction architecture. By employing a cross-modal attention mechanism, the network leverages the extracted priors to dynamically rectify the deteriorated signal representation of the enhanced image, ultimately achieving iterative enhancement through differentiable curve estimation. Furthermore, to overcome the inability of existing methods to preserve fine structures and textures, we propose a Multi-Scale Spatial Fusion Block (MSFB) featuring pseudo-3D and 3D gradient operator convolutions. This module integrates explicit gradient operators to recover high-frequency edges while capturing inter-channel spatial correlations via multi-scale spatial convolutions. Extensive evaluations and ablation studies demonstrate that DST-Net achieves superior performance in subjective visual quality and objective metrics. Specifically, our method achieves a PSNR of 25.64 dB on the LOL dataset. Subsequent validation on the LSRW dataset further confirms its robust cross-scene generalization.
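
For readers unfamiliar with the reported metric, PSNR is defined as 10·log10(MAX² / MSE), where MSE is the mean squared error between the reference and the enhanced image and MAX is the largest possible pixel value. The sketch below uses flat lists of pixel values and assumes 8-bit images (MAX = 255). The paper's exact evaluation setup (value range, color channels) is not given in the abstract, and the function name is hypothetical.

```python
import math


def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    ASSUMPTION: max_val=255.0 corresponds to 8-bit images.
    """
    if not reference or len(reference) != len(test):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no error at all
    return 10.0 * math.log10(max_val ** 2 / mse)


identical = psnr([10, 20, 30], [10, 20, 30])
worst_case = psnr([0, 0], [255, 255])          # MSE equals MAX squared
one_bad_pixel = psnr([0, 0, 0, 0], [0, 0, 0, 51])
```

Higher values mean the enhanced image is closer to the reference, and each 10 dB step corresponds to a tenfold reduction in mean squared error.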

Keywords: Low-light image enhancement, Dual-Stream Transformer, Illumination-agnostic feature guidance, Multi-scale spatial convolution, Signal prior, Cross-modal attention, Differentiable curve estimation, PSNR

84. ❌ Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures

Authors: Oleg Somov, Mikhail Chaichuk, Mikhail Seleznyov, Alexander Panchenko, Elena Tutubalina Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16475v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

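Scoring rationale: The paper studies the causal faithfulness of LLMs in schema-guided reasoning pipelines, which directly involves LLMs, Chain of Thought reasoning, and System 2 in-depth reasoning (10 points each). It has some relevance to self-correction, LLM agents, tool use, and in-context learning (5 points each), and is relevant to factuality and interpretability (8 points each). Other keywords, such as MoE, quantization, and AI for Science, are not covered (0 points).

!!! tip deepseek-chat TL;DR

Using a causal evaluation protocol, the paper finds that the intermediate structures LLMs produce in schema-guided reasoning pipelines (such as checklists and verification queries) look self-consistent, yet in up to 60% of cases the models fail to update their predictions after an intervention. These structures act as influential context rather than stable causal mediators, which shows that LLM faithfulness to intermediate structures is fragile.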

Abstract

Schema-guided reasoning pipelines ask LLMs to produce explicit intermediate structures – rubrics, checklists, verification queries – before committing to a final decision. But do these structures causally determine the output, or merely accompany it? We introduce a causal evaluation protocol that makes this directly measurable: by selecting tasks where a deterministic function maps intermediate structures to decisions, every controlled edit implies a unique correct output. Across eight models and three benchmarks, models appear self-consistent with their own intermediate structures but fail to update predictions after intervention in up to 60% of cases – revealing that apparent faithfulness is fragile once the intermediate structure changes. When derivation of the final decision from the structure is delegated to an external tool, this fragility largely disappears; however, prompts which ask to prioritize the intermediate structure over the original input do not materially close the gap. Overall, intermediate structures in schema-guided pipelines function as influential context rather than stable causal mediators.
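
The intervention idea can be sketched in a few lines. Because a deterministic function maps the intermediate structure to the decision, each edit implies exactly one correct output, and a model's answer can be checked against it. Everything concrete below is hypothetical: the checklist format, the "accept iff every item passes" rule, the stub models, and the function names are illustrative assumptions, not the authors' protocol or benchmarks.

```python
def decide(checklist):
    """Hypothetical deterministic rule: accept only if every item passes."""
    return "accept" if all(checklist.values()) else "reject"


def flip_item(checklist, key):
    """Controlled edit: return a copy with one checklist item toggled."""
    edited = dict(checklist)
    edited[key] = not edited[key]
    return edited


def failure_to_update_rate(cases, model_predict):
    """Fraction of interventions where the model's answer differs from the
    unique output implied by the edited structure."""
    failures = 0
    for checklist, key in cases:
        edited = flip_item(checklist, key)
        if model_predict(edited) != decide(edited):
            failures += 1
    return failures / len(cases)


def sticky_model(checklist):
    """Stub model that ignores the structure entirely."""
    return "accept"


cases = [
    ({"cites_source": True, "answers_question": True}, "cites_source"),
    ({"cites_source": True, "answers_question": False}, "answers_question"),
]

sticky_rate = failure_to_update_rate(cases, sticky_model)
faithful_rate = failure_to_update_rate(cases, decide)
```

A model that derives its answer from the structure, here simulated by passing `decide` itself, has a failure rate of zero; this mirrors the abstract's observation that delegating the final derivation to an external tool largely removes the fragility.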

Keywords: LLM faithfulness, causal analysis, intermediate structures, schema-guided reasoning, self-consistency, causal mediators, deterministic function, controlled edit

85. ❌ Multi-Agent Reinforcement Learning Counteracts Delayed CSI in Multi-Satellite Systems

Authors: Marios Aristodemou, Yasaman Omid, Sangarapillai Lambotharan, Mahsa Derakhshan, Lajos Hanzo Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16470v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

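Scoring rationale: The paper focuses on designing a multi-agent reinforcement learning (MARL) algorithm for satellite communication systems. It is highly relevant only to 'Multi-agent Systems OR Agent Coordination' (10 points), because its core contribution is the DS-PPO algorithm for multi-satellite cooperation. The remaining keywords concern large models, the underlying deep-learning techniques, or specific AI application areas, whereas this paper applies conventional reinforcement learning to communications engineering and does not touch on large models, language models, training techniques, reasoning methods, or model optimization, so they score 0.

!!! tip deepseek-chat TL;DR

The paper addresses outdated channel state information (CSI) caused by propagation delay in satellite communications. It proposes dual stage proximal policy optimisation (DS-PPO), a multi-agent reinforcement learning algorithm that maximises the users' sum-rate, and its numerical results show robustness to CSI imperfections along with a sum-rate improvement.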

摘要翻译

将卫星通信网络与下一代技术相融合是实现全球连接的一种前景广阔的方法。然而，服务质量高度依赖于准确信道状态信息的可用性。卫星通信中的信道估计具有挑战性，这是因为地面用户与卫星之间的传播延迟较高，导致卫星端获取的信道状态信息观测值过时。本文研究了多颗卫星作为分布式基站向移动地面用户进行下行传输的场景。我们提出了一种多智能体强化学习算法，旨在应对过时的信道状态信息，同时最大化用户的总和速率。针对多智能体强化学习中存在的大规模连续动作空间以及独立但非同分布（non-IID）环境问题，我们设计了一种新颖的双层优化流程，称为双阶段近端策略优化。具体而言，双阶段近端策略优化的第一阶段最大化单颗卫星的和速率，第二阶段则在所有卫星协作形成分布式多天线基站时最大化总和速率。数值结果表明了双阶段近端策略优化对信道状态信息不完善的鲁棒性，以及采用该方法带来的和速率提升。此外，我们还提供了双阶段近端策略优化的收敛性分析及其计算复杂度。

摘要 (Abstract)

The integration of satellite communication networks with next-generation (NG) technologies is a promising approach towards global connectivity. However, the quality of services is highly dependent on the availability of accurate channel state information (CSI). Channel estimation in satellite communications is challenging due to the high propagation delay between terrestrial users and satellites, which results in outdated CSI observations on the satellite side. In this paper, we study the downlink transmission of multiple satellites acting as distributed base stations (BS) to mobile terrestrial users. We propose a multi-agent reinforcement learning (MARL) algorithm which aims for maximising the sum-rate of the users, while coping with the outdated CSI. We design a novel bi-level optimisation procedure, termed dual stage proximal policy optimisation (DS-PPO), for tackling the problem of large continuous action spaces as well as of independent and non-identically distributed (non-IID) environments in MARL. Specifically, the first stage of DS-PPO maximises the sum-rate for an individual satellite and the second stage maximises the sum-rate when all the satellites cooperate to form a distributed multi-antenna BS. Our numerical results demonstrate the robustness of DS-PPO to CSI imperfections as well as the sum-rate improvement attained by the use of DS-PPO. In addition, we provide the convergence analysis for the DS-PPO along with the computational complexity.

Keywords: multi-agent reinforcement learning, satellite communication, channel state information, distributed base stations, proximal policy optimization, sum-rate maximization, delayed CSI, cooperative transmission

86. ❌ RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

Authors: Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16453v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

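Scoring rationale: The paper's core topic is the long-horizon decision-making ability of LLM agents in retail environments, so it is highly relevant to 'LLM Agents' (10 points) and directly involves LLM technology (10 points). The proposed Evolving Strategy & Execution framework separates strategic reasoning from action execution, giving it some connection to 'Chain of Thought' and 'System 2 Thinking' (5 points each), and its design emphasizes interpretability, which relates to 'Explainable AI' (5 points). Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how stable LLM agents' long-horizon decisions are in dynamic retail environments and proposes the Evolving Strategy & Execution framework. Experiments show that the framework improves operational stability, but as task complexity increases, current LLMs still show fundamental limitations in long-horizon, multi-factor decision-making.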

摘要翻译

基于大语言模型（LLM）的智能体在短周期、高度结构化的任务上已取得显著成功。然而，在现实且动态的环境中，它们能否在长周期内保持连贯的决策能力，仍是一个悬而未决的挑战。

我们提出了RetailBench，这是一个高保真基准测试，旨在评估现实商业场景中的长周期自主决策能力。在该场景中，智能体必须在随机需求和不断变化的外部条件下运作。

我们进一步提出了“演化策略与执行”框架，该框架将高层策略推理与底层动作执行相分离。这一设计使得策略能够随时间推移进行自适应且可解释的演化。这对于长周期任务尤为重要，因为非平稳环境和误差累积要求策略的调整与动作执行处于不同的时间尺度。

在逐步增加挑战性的环境中，对八个先进大语言模型进行的实验表明，与其他基线方法相比，我们的框架提高了操作稳定性和效率。然而，随着任务复杂度的增加，性能出现显著下降，这揭示了当前大语言模型在长周期、多因素决策方面存在根本性局限。

摘要 (Abstract)

Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on eight state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.

Keywords: LLM agents, long-horizon decision-making, autonomous agents, retail environments, strategy evolution, benchmark evaluation, operational stability, multi-factor decision-making

87. ❌ TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas

Authors: Ai Jian, Xiaoyun Zhang, Wanrou Du, Jingqing Ruan, Jiangbo Pei, Weipeng Zhang, Ke Zeng, Xunliang Cai Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16448v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

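Scoring rationale: The paper proposes the TRUST-SQL framework, in which an autonomous agent (LLM Agents) performs Text-to-SQL parsing when the database schema is unknown. The agent uses tools (Tool Use) to actively identify and verify relevant metadata, follows a structured four-phase protocol for in-depth reasoning (System 2 Thinking / Chain of Thought), and is trained with a novel Dual-Track GRPO reinforcement learning strategy. It is therefore highly relevant to LLM Agents, Tool Use, Chain of Thought, and System 2 Thinking (8-10 points) and also relevant to base large-model technology (Large Language Models, 8 points). Other keywords, such as MoE, quantization, RAG, and alignment, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper targets Text-to-SQL parsing in real enterprise environments where the database schema is unknown. It proposes the TRUST-SQL framework, in which an autonomous agent uses tools to actively verify metadata and is trained with a reinforcement learning strategy. The framework yields substantial gains on multiple benchmarks and even surpasses baselines that rely on schema prefilling.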

摘要翻译

在完整模式假设下，文本到SQL解析已取得显著进展。然而，这一前提在现实企业环境中并不成立，因为企业数据库通常包含数百个具有海量噪声元数据的表。与其预先注入完整模式，智能体必须主动识别并仅验证相关子集，这便引出了我们在本研究中探讨的未知模式场景。为解决此问题，我们提出TRUST-SQL（基于工具的未知模式真实推理框架）。我们将该任务建模为部分可观测马尔可夫决策过程，其中自主智能体采用结构化的四阶段协议，将推理过程锚定于已验证的元数据。关键的是，该协议为我们新颖的双轨GRPO策略提供了结构化边界。通过应用令牌级掩码优势函数，该策略将探索奖励与执行结果解耦以解决信用分配问题，相比标准GRPO实现了9.9%的相对性能提升。在五个基准测试上的大量实验表明，TRUST-SQL的4B与8B变体相比其基础模型分别实现了30.6%和16.6%的平均绝对性能提升。值得注意的是，尽管完全无需预加载元数据，我们的框架始终达到甚至超越了依赖模式预填充的强基线模型性能。

摘要 (Abstract)

Text-to-SQL parsing has achieved remarkable progress under the Full Schema Assumption. However, this premise fails in real-world enterprise environments where databases contain hundreds of tables with massive noisy metadata. Rather than injecting the full schema upfront, an agent must actively identify and verify only the relevant subset, giving rise to the Unknown Schema scenario we study in this work. To address this, we propose TRUST-SQL (Truthful Reasoning with Unknown Schema via Tools). We formulate the task as a Partially Observable Markov Decision Process where our autonomous agent employs a structured four-phase protocol to ground reasoning in verified metadata. Crucially, this protocol provides a structural boundary for our novel Dual-Track GRPO strategy. By applying token-level masked advantages, this strategy isolates exploration rewards from execution outcomes to resolve credit assignment, yielding a 9.9% relative improvement over standard GRPO. Extensive experiments across five benchmarks demonstrate that TRUST-SQL achieves an average absolute improvement of 30.6% and 16.6% for the 4B and 8B variants respectively over their base models. Remarkably, despite operating entirely without pre-loaded metadata, our framework consistently matches or surpasses strong baselines that rely on schema prefilling.
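The idea of token-level masked advantages can be sketched in a few lines of Python. This is an illustrative reading of the abstract only: it assumes each rollout carries one exploration reward and one execution reward, that each reward is normalized within its sampling group (as in standard GRPO), and that a boolean mask marks which tokens belong to the exploration track. The function names and the exact normalization are assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev
from typing import List


def group_relative(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize rewards within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dual_track_advantages(
    explore_rewards: List[float],
    exec_rewards: List[float],
    explore_masks: List[List[bool]],
) -> List[List[float]]:
    """Assign each token the advantage of the track it belongs to.

    explore_masks[i][t] is True when token t of rollout i is an exploration
    token (assumed here to mean a schema-inspection tool call) and False when
    it belongs to the final SQL generation.
    """
    a_explore = group_relative(explore_rewards)
    a_exec = group_relative(exec_rewards)
    per_token = []
    for a_x, a_e, mask in zip(a_explore, a_exec, explore_masks):
        per_token.append([a_x if is_explore else a_e for is_explore in mask])
    return per_token
```

Keeping the two tracks separate means a rollout that explored the schema well but produced a failing query can still receive a positive signal on its exploration tokens, which is one way to read the abstract's claim about resolving credit assignment.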

Keywords: Text-to-SQL, Unknown Schema, Autonomous Agent, Tool Use, Reinforcement Learning, GRPO, Partially Observable Markov Decision Process, Dual-Track Strategy

88. ❌ Visual Distraction Undermines Moral Reasoning in Vision-Language Models

Authors: Xinyi Yang, Chenheng Xu, Weijun Hong, Ce Mo, Qian Wang, Fang Fang, Yixin Zhu Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16445v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

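Scoring rationale: The paper studies moral reasoning in vision-language models (VLMs). It is relevant to LLMs and foundation models (8 points), centers on value alignment (10 points), examines intuitive versus deliberate reasoning (8 points each for CoT and System 2 Thinking), and touches on factuality and explainability (5 points each). Other keywords, such as MoE, quantization, and RAG, are not covered (0 points).

!!! tip deepseek-chat TL;DR

The study finds that visual inputs bypass text-based safety mechanisms and activate intuition-like reasoning, which undermines moral decision-making in vision-language models and exposes the fragility of multimodal safety alignment.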

摘要翻译

道德推理是确保人工智能（AI）安全的基础，但随着AI系统从基于文本的助手发展为具身智能体，确保其跨模态的一致性变得至关重要。当前的安全技术在文本语境中已取得成效，但其向视觉输入的泛化能力仍存疑虑。现有的道德评估基准仅依赖纯文本形式，且缺乏对影响道德决策变量的系统性控制。本文表明，视觉输入会从根本上改变前沿视觉语言模型（VLMs）的道德决策，从而绕过基于文本的安全机制。我们提出了道德困境模拟（MDS），这是一个基于道德基础理论（MFT）的多模态基准，通过对视觉与语境变量的正交操控实现机制性分析。评估结果显示，视觉模态会激活类直觉通路，压制在纯文本语境中观察到的更为审慎且安全的推理模式。这些发现揭示了关键脆弱性：针对语言调整的安全过滤器无法约束视觉处理，这证明了对多模态安全对齐的迫切需求。

摘要 (Abstract)

Moral reasoning is fundamental to safe Artificial Intelligence (AI), yet ensuring its consistency across modalities becomes critical as AI systems evolve from text-based assistants to embodied agents. Current safety techniques demonstrate success in textual contexts, but concerns remain about generalization to visual inputs. Existing moral evaluation benchmarks rely on text-only formats and lack systematic control over variables that influence moral decision-making. Here we show that visual inputs fundamentally alter moral decision-making in state-of-the-art (SOTA) Vision-Language Models (VLMs), bypassing text-based safety mechanisms. We introduce Moral Dilemma Simulation (MDS), a multimodal benchmark grounded in Moral Foundation Theory (MFT) that enables mechanistic analysis through orthogonal manipulation of visual and contextual variables. The evaluation reveals that the vision modality activates intuition-like pathways that override the more deliberate and safer reasoning patterns observed in text-only contexts. These findings expose critical fragilities where language-tuned safety filters fail to constrain visual processing, demonstrating the urgent need for multimodal safety alignment.

Keywords: Vision-Language Models, Moral Reasoning, Multimodal Safety, Moral Dilemma Simulation, Intuition Pathways, Safety Alignment, Visual Distraction, Moral Foundation Theory

89. ❌ CD-FKD: Cross-Domain Feature Knowledge Distillation for Robust Single-Domain Generalization in Object Detection

Authors: Junseok Lee, Sungho Shin, Seongju Lee, Kyoobin Lee Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16439v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

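Scoring rationale: The paper addresses object detection in computer vision and proposes cross-domain feature knowledge distillation (CD-FKD) to improve single-domain generalization. Its core is knowledge distillation (a teacher-student architecture) applied to domain generalization, which is unrelated to most of the LLM-oriented keywords. The only relevant keyword is 'Pre-training OR Continual Pre-training OR Domain Adaptation', because the paper deals with domain adaptation and generalization; however, it uses knowledge distillation rather than typical pre-training or continual pre-training, so it receives 5 points (moderate relevance). All other keywords, which concern large language models, reasoning, alignment, compression, AI for Science, and similar topics, are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes cross-domain feature knowledge distillation (CD-FKD), which strengthens the student network's generalization through global and instance-wise feature distillation. It outperforms existing methods on single-domain generalized object detection and improves robustness to domain shift.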

摘要翻译

单域泛化对于目标检测至关重要，尤其是在单一源域上训练模型并在未见过的目标域上评估时。域偏移（如天气、光照或场景条件的变化）对现有模型的泛化能力构成了重大挑战。为解决这一问题，我们提出跨域特征知识蒸馏（Cross-Domain Feature Knowledge Distillation, CD-FKD），该方法通过利用全局特征蒸馏和实例级特征蒸馏来增强学生网络的泛化能力。所提出的方法通过降尺度和数据损坏生成多样化数据来训练学生网络，而教师网络则接收原始的源域数据。学生网络通过全局和实例级蒸馏模仿教师的特征，使其能够有效提取以目标为中心的特征，即使对于因数据损坏而难以检测的目标也是如此。在具有挑战性的场景上进行的大量实验表明，CD-FKD在目标域泛化性能和源域性能上均优于现有先进方法，验证了其在提升目标检测对域偏移的鲁棒性方面的有效性。该方法在自动驾驶和监控等实际应用中具有重要价值，这些领域需要在多样环境中实现鲁棒的目标检测。

摘要 (Abstract)

Single-domain generalization is essential for object detection, particularly when training models on a single source domain and evaluating them on unseen target domains. Domain shifts, such as changes in weather, lighting, or scene conditions, pose significant challenges to the generalization ability of existing models. To address this, we propose Cross-Domain Feature Knowledge Distillation (CD-FKD), which enhances the generalization capability of the student network by leveraging both global and instance-wise feature distillation. The proposed method uses diversified data through downscaling and corruption to train the student network, whereas the teacher network receives the original source domain data. The student network mimics the features of the teacher through both global and instance-wise distillation, enabling it to extract object-centric features effectively, even for objects that are difficult to detect owing to corruption. Extensive experiments on challenging scenes demonstrate that CD-FKD outperforms state-of-the-art methods in both target domain generalization and source domain performance, validating its effectiveness in improving object detection robustness to domain shifts. This approach is valuable in real-world applications, like autonomous driving and surveillance, where robust object detection in diverse environments is crucial.
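The combination of global and instance-wise feature distillation can be sketched as a weighted sum of two feature-matching terms. The sketch below assumes single-channel feature maps stored as nested lists, a mean-squared-error matching loss, and non-empty integer boxes given as `(x0, y0, x1, y1)`. The abstract does not specify the loss form, the box format, or any weights, so all of these are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

FeatureMap = List[List[float]]      # H x W, single channel for simplicity
Box = Tuple[int, int, int, int]     # (x0, y0, x1, y1), end-exclusive


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized flat feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def crop(fmap: FeatureMap, box: Box) -> List[float]:
    """Flatten the features that fall inside one object box."""
    x0, y0, x1, y1 = box
    return [v for row in fmap[y0:y1] for v in row[x0:x1]]


def distillation_loss(
    student: FeatureMap,
    teacher: FeatureMap,
    boxes: List[Box],
    w_global: float = 1.0,
    w_instance: float = 1.0,
) -> float:
    """Global term over the whole map plus the mean per-object term."""
    flat_s = [v for row in student for v in row]
    flat_t = [v for row in teacher for v in row]
    global_term = mse(flat_s, flat_t)
    instance_terms = [mse(crop(student, b), crop(teacher, b)) for b in boxes]
    instance_term = sum(instance_terms) / len(instance_terms) if instance_terms else 0.0
    return w_global * global_term + w_instance * instance_term
```

In the setup the abstract describes, `student` would be computed from a downscaled or corrupted image and `teacher` from the original source-domain image, so the instance term pushes the student to recover object features that the corruption obscured.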

Keywords: object detection, single-domain generalization, knowledge distillation, cross-domain feature, domain shift, robustness, teacher-student network, feature distillation

90. ❌ EngGPT2: Sovereign, Efficient and Open Intelligence

Authors: G. Ciarfaglia, A. Rosanova, S. Cipolla, J. Bartoli, A. Di Domenico, C. Fioroni, A. Fontana, M. R. Scoleri, M. I. Mone, D. Franchi, M. C. Del Gaudio, F. Picariello, M. Gabusi, S. Bonura, V. Morreale, I. Bailo Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16430v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

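Scoring rationale: The paper introduces EngGPT2-16B-A3B, an Italian large language model with an MoE architecture trained from scratch, so it is highly relevant to 'Large Language Models' and 'Mixture of Experts' (10 points each). The model supports several reasoning modes (non-reasoning, reasoning in Italian or English, and turbo-reasoning), which relates to 'Chain of Thought' and 'System 2 Thinking' (8 points each). Its emphasis on efficiency (less training data and inference compute) gives it some connection to 'Scaling Laws AND Data Quality', 'Quantization OR Model Compression', and 'Speculative Decoding OR Inference Acceleration' (5 points each). The model targets European and Italian tasks, which relates to 'Pre-training OR Domain Adaptation' (8 points), and the abstract mentions compliance with the EU AI Act, which is weakly related to 'Alignment' (5 points). With 16B total and 3B active parameters, it is partly relevant to 'Small Language Models' (5 points). Other keywords, such as SFT, RLHF, RAG, long context, and agents, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper presents EngGPT2-16B-A3B, an Italian large language model with an MoE architecture trained from scratch. Trained on only 2.5 trillion tokens, it reaches benchmark performance comparable to dense models in the 8B-16B range while substantially reducing inference compute and training-data requirements. It supports several reasoning modes and aims to be an efficient open European model aligned with the EU AI Act.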

摘要翻译

EngGPT2-16B-A3B是Engineering Group意大利大语言模型（LLM）的最新版本，旨在构建一个主权、高效且开放的模型。EngGPT2基于2.5万亿令牌训练而成——少于Qwen3的36万亿或Llama3的15万亿——但在关键基准测试（包括MMLU-Pro、GSM8K、IFEval和HumanEval）上展现出与80亿至160亿参数规模的稠密模型相当的性能，同时仅需其五分之一至一半的推理算力，以及十分之一至六分之一的训练数据及相应训练算力。作为从头训练的混合专家（Mixture-of-Experts, MoE）架构，EngGPT2拥有160亿参数，每次推理激活30亿参数，其专家模块规模介于GPT-OSS与Qwen3所用架构之间。训练语料库中约25%为意大利语数据，以确保模型在同等规模中具备出色的欧洲及意大利自然语言处理（NLP）任务能力。这种高效性旨在使EngGPT2成为日益丰富的欧洲开放权重模型系列的关键贡献者，在实现性能与效率平衡的同时，完全符合《欧盟人工智能法案》。EngGPT2还具备多模式推理能力：支持非推理模式、意大利语或英语推理模式，以及快速推理模式（一种适用于实时推理场景、以简明要点形式呈现的双语推理模式）。该模型致力于为欧洲及意大利语境下的资源节约型高性能大语言模型树立新标杆。

摘要 (Abstract)

EngGPT2-16B-A3B is the latest iteration of Engineering Group’s Italian LLM and it’s built to be a Sovereign, Efficient and Open model. EngGPT2 is trained on 2.5 trillion tokens - less than Qwen3’s 36T or Llama3’s 15T - and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring one-fifth to half of the inference power, and between one-tenth to one-sixth of the training data and consequent needed training power. Designed as a trained-from-scratch Mixture-of-Experts (MoE) architecture, EngGPT2 features 16 billion parameters with 3 billion active per inference, with expert sizes positioned between those used in GPT-OSS and Qwen3. Approximately 25% of its training corpus consists of Italian-language data, to deliver strong capabilities for European and Italian NLP tasks among models of similar scale. This efficiency aims to position EngGPT2 as a key contributor to the growing portfolio of open-weight European models, combining performance and efficiency with full alignment to the EU AI Act. EngGPT2 is also a single model capable of multiple reasoning modes: non-reasoning, reasoning in Italian or English, and turbo-reasoning (a concise, bullet-point style reasoning available in both languages designed for real-time reasoning use cases). EngGPT2 aims to set a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts.
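The token counts and parameter counts quoted in the abstract can be turned into explicit ratios with a short Python check. The figures (2.5T, 36T, 15T tokens; 16B total and 3B active parameters) come from the abstract. The dictionary and function names are hypothetical, and the abstract's "one-tenth to one-sixth" range refers to dense baselines it does not name, so the ratios below need not reproduce that range exactly.

```python
from fractions import Fraction

# Training-set sizes in trillions of tokens, as stated in the abstract.
TOKENS_T = {
    "EngGPT2": Fraction(5, 2),
    "Qwen3": Fraction(36),
    "Llama3": Fraction(15),
}


def token_ratio(model: str, baseline: str) -> Fraction:
    """Training tokens of `model` as a fraction of those of `baseline`."""
    return TOKENS_T[model] / TOKENS_T[baseline]


# Share of parameters active per inference step (3B of 16B).
ACTIVE_FRACTION = Fraction(3, 16)

if __name__ == "__main__":
    for baseline in ("Qwen3", "Llama3"):
        ratio = token_ratio("EngGPT2", baseline)
        print(f"EngGPT2 / {baseline}: {ratio} ({float(ratio):.3f})")
    print(f"Active parameter share: {float(ACTIVE_FRACTION):.4f}")
```

Against Llama3 the ratio is exactly one-sixth, while against Qwen3 it is 5/72, roughly one-fourteenth.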

Keywords: Mixture of Experts, Large Language Model, Italian LLM, Efficient inference, Training from scratch, Reasoning modes, European AI, Sovereign model

91. ❌ LenghuSky-8: An 8-Year All-Sky Cloud Dataset with Star-Aware Masks and Alt-Az Calibration for Segmentation and Nowcasting

Authors: Yicheng Rui, Xiao-Wei Duan, Licai Deng, Fan Yang, Zhengming Dang, Zhengjun Du, Junhao Peng, Wenhao Chu, Umut Mahmut, Kexin Li, Yiyun Wu, Fabo Feng Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16429v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

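Scoring rationale: The paper focuses on cloud segmentation and short-term forecasting for astronomical observation, using conventional computer vision and deep learning techniques (such as DINOv3, ConvLSTM, and VideoGPT). It does not involve large language models (LLMs), model architecture innovations (such as MoE or quantization), training methods (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or direct application of large models to science. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to astronomy, although that is not its core contribution, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper presents LenghuSky-8, an eight-year all-sky cloud imaging dataset for cloud segmentation and nowcasting. A linear probe trained on DINOv3 features reaches 93.3% segmentation accuracy, and the paper introduces a short-horizon nowcasting benchmark over per-pixel logits.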

摘要翻译

地基时域观测站需要对站点尺度的云层覆盖进行逐分钟监测，然而现有的全天相机数据集普遍存在时间跨度短、偏向白昼观测或缺乏天体测量校准的问题。本文介绍 LenghuSky-8 数据集，这是一个来自一流天文台址的八年期（2018-2025）全天成像数据集，包含429,620帧$512 \times 512$图像，夜间覆盖率达81.2%，并提供基于恒星识别的云掩膜、背景掩膜以及逐像素的高度角-方位角（Alt-Az）校准。为实现在白天、夜间及不同月相条件下的鲁棒云层分割，我们在DINOv3局部特征上训练线性探针，并在一个包含1,111张人工标注图像的平衡测试集上获得了93.3% $\pm$ 1.1%的整体准确率。利用恒星天体测量技术，我们将每个像素映射到本地高度角-方位角坐标系，并测得校准不确定度在天顶方向约为0.37度，在30度高度角处约为1.34度，此精度足以满足与望远镜调度系统集成的需求。除分割任务外，我们还基于逐像素的三类 logits（天空/云层/污染）提出了一个短时临近预报基准测试，并提供了四种基线方法：持续性（复制最后一帧）、光流法、ConvLSTM和VideoGPT。ConvLSTM表现最佳，但相较于持续性方法仅获得有限提升，这凸显了短期云层演变的预测难度。我们公开了该数据集、校准数据以及一个开源工具包，用于数据加载、评估和生成可直接用于调度系统的高度角-方位角地图，以推动云层分割、临近预报及自主天文台运行领域的研究。

摘要 (Abstract)

Ground-based time-domain observatories require minute-by-minute, site-scale awareness of cloud cover, yet existing all-sky datasets are short, daylight-biased, or lack astrometric calibration. We present LenghuSky-8, an eight-year (2018-2025) all-sky imaging dataset from a premier astronomical site, comprising 429,620 $512 \times 512$ frames with 81.2% night-time coverage, star-aware cloud masks, background masks, and per-pixel altitude-azimuth (Alt-Az) calibration. For robust cloud segmentation across day, night, and lunar phases, we train a linear probe on DINOv3 local features and obtain 93.3% $\pm$ 1.1% overall accuracy on a balanced, manually labeled set of 1,111 images. Using stellar astrometry, we map each pixel to local alt-az coordinates and measure calibration uncertainties of approximately 0.37 deg at zenith and approximately 1.34 deg at 30 deg altitude, sufficient for integration with telescope schedulers. Beyond segmentation, we introduce a short-horizon nowcasting benchmark over per-pixel three-class logits (sky/cloud/contamination) with four baselines: persistence (copying the last frame), optical flow, ConvLSTM, and VideoGPT. ConvLSTM performs best but yields only limited gains over persistence, underscoring the difficulty of near-term cloud evolution. We release the dataset, calibrations, and an open-source toolkit for loading, evaluation, and scheduler-ready alt-az maps to boost research in segmentation, nowcasting, and autonomous observatory operations.
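The persistence baseline named in the abstract (copying the last frame) is simple enough to sketch directly. The code below assumes each frame is a nested list of per-pixel three-class logits in the order sky/cloud/contamination and scores a forecast by argmax pixel accuracy. The class order, the function names, and the accuracy metric are assumptions for illustration; the released toolkit may define them differently.

```python
from typing import List

Logits = List[float]            # three values: assumed order sky, cloud, contamination
Frame = List[List[Logits]]      # H x W grid of per-pixel logits


def argmax_class(logits: Logits) -> int:
    """Index of the largest logit (first index wins on ties)."""
    return max(range(len(logits)), key=lambda k: logits[k])


def persistence_forecast(history: List[Frame], horizon: int) -> List[Frame]:
    """Predict every future step as a copy of the most recent observed frame."""
    return [history[-1] for _ in range(horizon)]


def pixel_accuracy(pred: Frame, target: Frame) -> float:
    """Share of pixels whose predicted class matches the target class."""
    matches = 0
    total = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            matches += argmax_class(p) == argmax_class(t)
            total += 1
    return matches / total
```

Because clouds often change little from one frame to the next, this baseline is hard to beat at short horizons, which is consistent with the abstract's remark that ConvLSTM yields only limited gains over persistence.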

Keywords: all-sky cloud dataset, cloud segmentation, nowcasting, astronomical observatory, DINOv3, ConvLSTM, alt-az calibration, autonomous operations

92. ❌ An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU

Authors: Ruijia Yang, Zeyi Wen Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16428v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

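Scoring rationale: The paper's core contribution is SlideFormer, a memory-efficiency system for LLM fine-tuning, which is directly relevant to 'Large Language Models' (10) and 'Post-training/Supervised Fine-tuning' (10), and to 'Pre-training/Domain Adaptation' (8, because domain adaptation is mentioned). The system's memory management supports parameter-efficient fine-tuning, hence 'PEFT/Parameter-efficient Fine-tuning' (8). Because the memory optimizations reduce usage, the paper is partly relevant to 'Quantization/Model Compression' (5; not quantization proper, but a related footprint reduction) and 'Speculative Decoding/Inference Acceleration' (5; not inference acceleration, but it raises training throughput). Other keywords such as MoE, SLMs, RAG, and Alignment are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address the excessive memory demands of fine-tuning large language models (LLMs), the paper proposes SlideFormer, an efficient heterogeneous co-design system that fine-tunes 123B+ models on a single GPU (e.g., an RTX 4090), achieving up to 6.27x higher throughput while roughly halving CPU/GPU memory usage.

Abstract (Chinese translation)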

为适应特定领域需求，对大型语言模型（LLM）进行微调已成为关键环节，但其内存密集型特性超出了大多数图形处理器（GPU）的承载能力。为应对这一挑战并推动LLM微调的普及化，我们提出了SlideFormer——一种专为单GPU环境设计的新型系统。我们的创新点包括：（1）轻量级异步引擎，将GPU视为滑动窗口，实现GPU计算与中央处理器（CPU）更新及多层输入/输出（I/O）操作的重叠执行；（2）高效异构内存管理方案，显著降低峰值内存占用；（3）优化的Triton计算内核以解决关键瓶颈，并集成先进的I/O机制。这一协同设计使得在单张RTX 4090显卡上能够对最新的1230亿参数以上模型进行微调，支持最高达8倍的批次大小和6倍的模型规模。评估结果表明，与基线方法相比，SlideFormer在吞吐量上实现了1.40至6.27倍的提升，同时将CPU/GPU内存使用量降低约一半，并在英伟达（NVIDIA）和超威半导体（AMD）GPU上均保持超过95%的峰值性能。

Abstract

Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory-intensive property exceeds the capabilities of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer, a novel system designed for single-GPU environments. Our innovations are: (1) A lightweight asynchronous engine that treats the GPU as a sliding window and overlaps GPU computation with CPU updates and multi-tier I/O. (2) A highly efficient heterogeneous memory management scheme significantly reduces peak memory usage. (3) Optimized Triton kernels to solve key bottlenecks and integrated advanced I/O. This collaborative design enables fine-tuning of the latest 123B+ models on a single RTX 4090, supporting up to 8x larger batch sizes and 6x larger models. In evaluations, SlideFormer achieves 1.40x to 6.27x higher throughput while roughly halving CPU/GPU memory usage compared to baselines, sustaining >95% peak performance on both NVIDIA and AMD GPUs.

Keywords: Large Language Models, Fine-tuning, Single-GPU, Memory Optimization, Heterogeneous Co-Design, Domain Adaptation, Throughput Improvement, Parameter-efficient
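
The "GPU as a sliding window" idea in the abstract above, where only the current layer is resident while the next one is fetched in the background, can be illustrated with a toy pipeline. This is a minimal sketch and not SlideFormer's engine: it covers a forward pass only, `load_layer` stands in for a host-to-GPU copy, `compute` is an arbitrary toy operation, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def load_layer(index: int, host_weights: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in for an asynchronous host-to-device weight transfer (assumption)."""
    return list(host_weights[index])


def compute(weights: List[float], x: float) -> float:
    """Toy layer computation: scale the activation by the sum of the weights."""
    return x * sum(weights)


def run_sliding_window(host_weights: Sequence[Sequence[float]], x: float) -> float:
    """Run layers in order while prefetching layer i+1 during the compute of layer i."""
    if not host_weights:
        return x
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_layer, 0, host_weights)
        for i in range(len(host_weights)):
            resident = pending.result()  # block until layer i is available
            if i + 1 < len(host_weights):
                # Start fetching the next layer before computing the current one.
                pending = io_pool.submit(load_layer, i + 1, host_weights)
            x = compute(resident, x)
            # `resident` is dropped on the next iteration, so the window slides forward.
    return x
```

At most two layers are held at once (the resident one and the one in flight), which is the property that lets peak device memory stay far below the full model size.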

93. ❌ SF-Mamba: Rethinking State Space Model for Vision

Authors: Masakazu Yoshimura, Teruaki Hayashi, Yuki Hoshino, Wei-Yao Wang, Takeshi Ohashi Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16423v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

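Scoring rationale: SF-Mamba targets computer vision: it proposes a new vision encoder based on the State Space Model (SSM) as an alternative to Vision Transformers (ViTs). Although the paper is an architecture contribution (Mamba applied to vision tasks), every scoring keyword targets large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on). The paper does not involve language models, text processing, or LLM-specific techniques, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes SF-Mamba, a vision Mamba model that uses auxiliary patch swapping and batch folding to address the limits of existing vision Mamba models in bidirectional information flow and computational efficiency. It significantly outperforms existing baselines on image classification, object detection, and segmentation while improving throughput.

Abstract (Chinese translation)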

近年来，视觉Mamba领域的发展旨在替代具有二次复杂度的视觉Transformer（ViT）。尽管Mamba的循环扫描机制提供了计算效率，但其本质上限制了图像块之间的非因果交互。先前的研究尝试通过多种多扫描策略解决这一局限，然而这些方法因扫描设计欠佳和频繁的数据重排而效率低下。此外，在视觉任务中常用的短令牌长度下，Mamba的计算速度相对较慢。为了构建真正高效的视觉编码器，我们重新思考了视觉扫描操作及Mamba的计算效率。为此，我们提出SF-Mamba，一种新颖的视觉Mamba模型，其包含两项关键创新：在单向扫描下通过辅助块交换编码双向信息流，以及通过周期性状态重置的批量折叠实现高级GPU并行化。在图像分类、目标检测、实例分割和语义分割上的大量实验一致表明，我们所提出的SF-Mamba在不同模型规模下均显著优于现有先进基线，同时提升了吞吐量。我们将在发表后公开源代码。

Abstract

The realm of Mamba for vision has been advanced in recent years to strike for the alternatives of Vision Transformers (ViTs) that suffer from the quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computational speed under short token lengths, commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under an unidirectional scan and batch folding with periodic state reset for advanced GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.

Keywords: SF-Mamba, State Space Model, Vision Mamba, Vision Transformer, Computational Efficiency, Image Classification, Object Detection, Semantic Segmentation

94. ❌ IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time

Authors: Zhenghua Bao, Yi Shi Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16415v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

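Scoring rationale: IndexRAG's core contribution is an improved retrieval-augmented generation (RAG) method that generates bridging facts during offline indexing to support cross-document reasoning, so it is highly relevant to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (10). The paper concerns the reasoning process in multi-hop question answering, which relates to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (8). It uses LLMs to generate the bridging facts, hence 'Large Language Models OR LLMs OR Foundation Models' (8). Other keywords such as MoE, quantization, and alignment are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes IndexRAG, which moves cross-document reasoning from online inference to offline indexing by generating bridging facts as independently retrievable units. It improves F1 over Naive RAG by 4.6 points on average across three multi-hop QA benchmarks while requiring only a single retrieval pass and a single LLM call.

Abstract (Chinese translation)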

多跳问答任务要求跨多个文档进行推理，而现有的检索增强生成方法通常通过需要额外在线处理的图结构方法或迭代式多步推理来解决这一问题。本文提出IndexRAG，这是一种创新方法，将跨文档推理从在线推断转移至离线索引阶段。IndexRAG通过识别文档间共享的桥接实体，并生成可作为独立检索单元的桥接事实，无需任何额外训练或微调。在三个广泛使用的多跳问答基准数据集上的实验表明，IndexRAG在推理时仅需单次检索和一次大语言模型调用，即能比朴素检索增强生成方法平均提升4.6个F1分数。当与IRCoT结合使用时，IndexRAG在平均性能上超越了所有基于图结构的基线方法，包括HippoRAG和FastGraphRAG，同时仅依赖扁平化检索。我们的代码将在论文录用后公开。

Abstract

Multi-hop question answering (QA) requires reasoning across multiple documents, yet existing retrieval-augmented generation (RAG) approaches address this either through graph-based methods requiring additional online processing or iterative multi-step reasoning. We present IndexRAG, a novel approach that shifts cross-document reasoning from online inference to offline indexing. IndexRAG identifies bridge entities shared across documents and generates bridging facts as independently retrievable units, requiring no additional training or fine-tuning. Experiments on three widely-used multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue) show that IndexRAG improves F1 over Naive RAG by 4.6 points on average, while requiring only single-pass retrieval and a single LLM call at inference time. When combined with IRCoT, IndexRAG outperforms all graph-based baselines on average, including HippoRAG and FastGraphRAG, while relying solely on flat retrieval. Our code will be released upon acceptance.

Keywords: Retrieval-Augmented Generation, Multi-hop Question Answering, Cross-document Reasoning, Bridge Entities, Offline Indexing, Single-pass Retrieval, HotpotQA, 2WikiMultiHopQA
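
The index-time step described in the abstract above (find entities shared across documents, then store a bridging fact as its own retrievable unit) can be sketched roughly as follows. This is an illustration under assumptions, not the authors' pipeline: entity sets are taken as given, the bridging "fact" is a plain concatenation where the paper would generate one, and `find_bridge_entities` and `build_bridging_units` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Set


def find_bridge_entities(doc_entities: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each entity that occurs in two or more documents to the sorted ids of those documents."""
    docs_by_entity: Dict[str, List[str]] = defaultdict(list)
    for doc_id in sorted(doc_entities):
        for entity in doc_entities[doc_id]:
            docs_by_entity[entity].append(doc_id)
    return {e: ids for e, ids in docs_by_entity.items() if len(ids) >= 2}


def build_bridging_units(
    doc_texts: Dict[str, str], doc_entities: Dict[str, Set[str]]
) -> List[dict]:
    """Create one extra index entry per bridge entity, ready to be embedded alongside normal chunks."""
    units = []
    bridges = find_bridge_entities(doc_entities)
    for entity in sorted(bridges):
        sources = bridges[entity]
        # Placeholder: a real system would generate a concise bridging fact here.
        text = f"[{entity}] " + " ".join(doc_texts[d] for d in sources)
        units.append({"entity": entity, "sources": sources, "text": text})
    return units
```

Because the bridging units live in the same flat index as ordinary chunks, a single retrieval pass at query time can surface evidence that spans documents.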

95. ❌ Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic

Authors: Finnur Ágúst Ingimundarson, Steinunn Rut Friðriksdóttir, Bjarki Ármannsson, Iris Edda Nowenstein, Steinþór Steingrímsson Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16406v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

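Scoring rationale: The paper studies evaluation benchmarks for LLMs in Icelandic and is highly relevant only to the first keyword, 'Large Language Models OR LLMs OR Foundation Models' (10), since its core content is LLM benchmarking for low/medium-resource languages. The other keywords cover specific technical principles, training methods, inference techniques, and application areas that the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

The paper evaluates existing LLM benchmarks for Icelandic and finds that benchmarks containing unverified synthetic or machine-translated data have severe flaws that skew results and undermine test validity. It calls for improved evaluation methods for low/medium-resource languages.

Abstract (Chinese translation)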

本文评估了当前针对冰岛语的大型语言模型（LLM）基准测试，指出了其中存在的问题，并呼吁尤其针对低/中等资源语言改进评估方法。我们指出，那些包含未经任何验证的合成数据或机器翻译数据的基准测试，通常存在严重缺陷的测试样例，这些样例很可能扭曲结果并损害测试的有效性。我们警告在低/中等资源语言环境中未经核实就使用此类方法，因为其翻译质量充其量只能达到特定语言在特定时间点机器翻译（MT）的水平。事实上，我们对现有冰岛语基准测试的定量误差分析结果显示，人工撰写/翻译的基准测试与合成或机器翻译的基准测试之间存在明显差异。

Abstract

This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods in low/medium-resource languages in particular. We show that benchmarks that include synthetic or machine-translated data that have not been verified in any way, commonly contain severely flawed test examples that are likely to skew the results and undermine the tests’ validity. We warn against the use of such methods without verification in low/medium-resource settings as the translation quality can, at best, only be as good as MT quality for a given language at any given time. Indeed, the results of our quantitative error analysis on existing benchmarks for Icelandic show clear differences between human-authored/-translated benchmarks vs. synthetic or machine-translated benchmarks.

Keywords: Large Language Models, LLM evaluation, benchmarking, Icelandic language, low-resource languages, machine translation, synthetic data, evaluation methods

96. ❌ PlotTwist: A Creative Plot Generation Framework with Small Language Models

Authors: Abhinav Thorat, Ravi Kolla, Jyotin Goel, Niranjan Pedanekar Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16410v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

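Scoring rationale: The paper's core topic is the use of Small Language Models (SLMs) for creative plot generation, which is highly relevant to 'Small Language Models' (10). The method explicitly uses a 'Mixture of Experts' architecture (10) and 'Direct Preference Optimization' for alignment (10). LLMs appear as the background point of comparison (8), and the paper touches on alignment (5) and an agentic evaluation module (5). Other keywords such as data quality, pre-training, and inference acceleration do not appear in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes PlotTwist, a framework that uses structured decomposition and preference alignment to let small language models generate high-quality creative plots competitive with frontier large models.

Abstract (Chinese translation)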

创意情节生成对语言模型提出了一个根本性挑战：如何将简洁的前提转化为连贯的叙事，并维持全局结构、角色发展和情感共鸣。尽管近期的大型语言模型在通用任务上展现出强大的流畅性，但它们通常需要进行偏好对齐才能在创意情节生成等专业领域表现良好。然而，在尖端大语言模型的规模上进行此类对齐计算成本极高，严重限制了其可访问性和实际部署。为解决这一问题，我们提出了PlotTwist——一个结构化框架，使活跃参数≤50亿的小型语言模型能够生成高质量、基于前提的情节，其性能可与规模达200倍的尖端系统相竞争。我们的方法将生成过程分解为三个专门化组件：(1) 通过新颖的“正负提示”策略训练的方面评分奖励模型，用于在五个叙事质量维度上生成结构化叙事；(2) 通过直接偏好优化在高置信度偏好对上对齐的混合专家情节生成器；(3) 模拟人类批判性判断以进行无偏事后评估的智能体评估模块。大量实验表明，尽管存在显著更严格的容量限制，PlotTwist在多个叙事质量维度上始终优于尖端模型。进一步验证证实了其对叙事质量的强敏感性，该框架能可靠地区分源自广受好评与普遍劣评的剧本情节。综上，这些结果表明基于偏好的结构化对齐是一种资源高效的高质量创意情节生成方法。

Abstract

Creative plot generation presents a fundamental challenge for language models: transforming a concise premise into a coherent narrative that sustains global structure, character development, and emotional resonance. Although recent Large Language Models (LLMs) demonstrate strong fluency across general-purpose tasks, they typically require preference alignment to perform well on specialized domains such as creative plot generation. However, conducting such alignment at the scale of frontier LLMs is computationally prohibitive, significantly limiting accessibility and practical deployment. To address this, we present PlotTwist, a structured framework that enables Small Language Models (SLMs) with $\leq$ 5B active parameters to generate high-quality, premise-conditioned plots competitive with frontier systems up to $200\times$ larger. Our approach decomposes generation into three specialized components: (1) an Aspect Rating Reward Model trained via a novel Positive-Negative prompting strategy to deliver structured narratives across five Narrative Quality Dimensions (NQDs); (2) a Mixture-of-Experts (MoE) plot generator aligned via Direct Preference Optimization on high-confidence preference pairs; and (3) an Agentic Evaluation module that emulates human critical judgment for unbiased post-hoc assessment. Extensive experiments demonstrate that PlotTwist consistently outperforms frontier models across multiple NQDs despite substantially tighter capacity constraints. Further validation confirms strong sensitivity to narrative quality, as the framework reliably distinguishes plots derived from critically acclaimed versus widely panned screenplays. Together, these results establish structured, preference-based alignment as a resource-efficient approach to high-quality creative plot generation.

Keywords: Small Language Models, Mixture of Experts, Direct Preference Optimization, creative plot generation, preference alignment, Agentic Evaluation, Narrative Quality Dimensions, resource-efficient
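
The abstract above aligns the plot generator with Direct Preference Optimization on high-confidence preference pairs. The standard per-pair DPO loss, $-\log \sigma\big(\beta[(\log \pi(y_w) - \log \pi_{\text{ref}}(y_w)) - (\log \pi(y_l) - \log \pi_{\text{ref}}(y_l))]\big)$, can be sketched as below. The confidence filter is an assumption for illustration: the abstract does not say how confidence is defined, so the reward-score gap, the `min_gap` threshold, and the dictionary keys are all hypothetical.

```python
import math
from typing import Dict, List


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def filter_high_confidence(pairs: List[Dict[str, float]], min_gap: float) -> List[Dict[str, float]]:
    """Keep pairs whose reward-score gap is at least `min_gap` (hypothetical confidence rule)."""
    return [p for p in pairs if p["chosen_score"] - p["rejected_score"] >= min_gap]
```

When the policy equals the reference model the margin is zero and the loss is $\log 2$; the loss falls as the policy raises the chosen plot's likelihood relative to the rejected one.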

97. ❌ Robust Physics-Guided Diffusion for Full-Waveform Inversion

Authors: Jishen Peng, Enze Jiang, Zheng Ma, Xiongbin Yan Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16393v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

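Scoring rationale: The paper applies physics-guided diffusion models to full-waveform inversion, which falls under AI for Science, specifically a computational imaging problem in geophysics. Its core techniques are the combination of a diffusion model (score-based generative prior) with physical simulation (wave-equation simulations) and an improved sampling method (preconditioned guided reverse diffusion). This is unrelated to the vast majority of LLM-related keywords in the list (pre-training, alignment, inference optimization, agents, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is an AI application in scientific computing; since it is neither bioinformatics nor cheminformatics, it receives a moderate relevance of 5. All other keywords score 0.

!!! tip deepseek-chat TL;DR

For the reconstruction problem in full-waveform inversion, the paper proposes a robust diffusion framework guided by physics simulation that improves reconstruction quality over deterministic optimization baselines and standard diffusion posterior sampling under comparable computational budgets.

Abstract (Chinese translation)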

我们提出了一种稳健的物理引导扩散框架，用于全波形反演，该方法将基于分数的生成先验与通过波动方程模拟计算出的似然引导相结合。我们采用基于传输的数据一致性势函数（Wasserstein-2），并通过有界加权和观测相关归一化融入波场增强机制，从而提高了对振幅不平衡及时间/相位失配的鲁棒性。在推断方面，我们引入了一种预处理的引导反向扩散方案，该方案在反向时间动态过程中自适应调整引导强度与空间缩放，相比标准扩散后验采样（DPS, diffusion posterior sampling），能够实现更稳定且有效的数据一致性引导步骤。在OpenFWI数据集上的数值实验表明，在可比计算成本下，本方法相比确定性优化基线及标准DPS具有更好的重建质量。

Abstract

We develop a robust physics-guided diffusion framework for full-waveform inversion that combines a score-based generative prior with likelihood guidance computed through wave-equation simulations. We adopt a transport-based data-consistency potential (Wasserstein-2), incorporating wavefield enhancement via bounded weighting and observation-dependent normalization, thereby improving robustness to amplitude imbalance and time/phase misalignment. On the inference side, we introduce a preconditioned guided reverse-diffusion scheme that adapts the guidance strength and spatial scaling throughout the reverse-time dynamics, yielding a more stable and effective data-consistency guidance step than standard diffusion posterior sampling (DPS). Numerical experiments on OpenFWI datasets demonstrate improved reconstruction quality over deterministic optimization baselines and standard DPS under comparable computational budgets.

Keywords: physics-guided diffusion, full-waveform inversion, score-based generative prior, wave-equation simulation, Wasserstein-2, reverse-diffusion, data-consistency, OpenFWI datasets

98. ❌ Age Predictors Through the Lens of Generalization, Bias Mitigation, and Interpretability: Reflections on Causal Implications

Authors: Debdas Paul, Elisa Ferrari, Irene Gravili, Alessandro Cellerino Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16377v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

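Scoring rationale: The paper studies the generalization, bias mitigation, and interpretability of age prediction models, a machine learning application in bioinformatics. It is unrelated to the vast majority of LLM keywords (LLM, MoE, SFT, RAG, and so on), which concern LLM architecture, training methods, and inference optimization, whereas the paper uses conventional neural network models. The relevant keywords are 'Mechanistic Interpretability OR Explainable AI' (5), because the paper discusses an interpretable neural network model, and 'AI for Science OR Bioinformatics OR Cheminformatics' (8), because the paper analyzes mouse transcriptomic data and so falls under AI for Science. No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper examines how age prediction models facing exogenous attributes such as race and gender can use adversarial representation learning to improve generalization and interpretability, and illustrates the model's behavior on mouse transcriptomic data.

Abstract (Chinese translation)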

时序年龄预测模型常因种族、性别或组织类型等外源属性而难以实现分布外泛化。因此，学习对这些属性具有不变性的表征对于提升分布外泛化能力、防止过度乐观的预测结果至关重要。在预测性研究中，这些属性引发了偏差缓解的需求；在因果分析中，它们以混杂变量的形式出现；而当这些属性受保护时，对其抑制则涉及公平性议题。本文以理论严谨性系统探讨这些概念的关联，并阐释了一种基于对抗表征学习的可解释神经网络模型的应用范畴。通过使用公开的小鼠转录组数据集，我们对比展示了该模型相较于传统机器学习模型的表现。研究发现，该模型的输出结果与已发表研究中关于Elamipretide对小鼠骨骼肌和心肌作用的预测结论一致。最后，我们讨论了从此类纯预测模型中推导因果解释的局限性。

Abstract

Chronological age predictors often fail to achieve out-of-distribution (OOD) generalization due to exogenous attributes such as race, gender, or tissue. Learning an invariant representation with respect to those attributes is therefore essential to improve OOD generalization and prevent overly optimistic results. In predictive settings, these attributes motivate bias mitigation; in causal analyses, they appear as confounders; and when protected, their suppression leads to fairness. We coherently explore these concepts with theoretical rigor and discuss the scope of an interpretable neural network model based on adversarial representation learning. Using publicly available mouse transcriptomic datasets, we illustrate the behavior of this model relative to conventional machine learning models. We observe that the outcome of this model is consistent with the predictive results of a published study demonstrating the effects of Elamipretide on mouse skeletal and cardiac muscle. We conclude by discussing the limitations of deriving causal interpretation from such purely predictive models.

Keywords: age prediction, out-of-distribution generalization, bias mitigation, interpretable neural networks, adversarial representation learning, mouse transcriptomic data, causal implications, fairness

99. ❌ Fanar 2.0: Arabic Generative AI Stack

Authors: FANAR TEAM, Ummar Abbas, Mohammad Shahmeer Ahmad, Minhaj Ahmad, Abdulaziz Al-Homaid, Anas Al-Nuaimi, Enes Altinisik, Ehsaneddin Asgari, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky, Ahmed Elmagarmid, Mohamed Eltabakh, Asim Ersoy, Masoomali Fatehkia, Mohammed Qusay Hashim, Majd Hawasly, Mohamed Hefeeda, Mus’ab Husaini, Keivin Isufaj, Soon-Gyo Jung, Houssam Lachemat, Ji Kim Lucas, Abubakr Mohamed, Tasnim Mohiuddin, Basel Mousi, Hamdy Mubarak, Ahmad Musleh, Mourad Ouzzani, Amin Sadeghi, Husrev Taha Sencar, Mohammed Shinoy, Omar Sinan, Yifan Zhang Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16397v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是阿拉伯语生成式AI平台Fanar 2.0，高度相关关键词包括：1) ‘Large Language Models’（核心模型Fanar-27B）；2) ‘Pre-training/Continual Pre-training’（采用持续预训练策略）；3) ‘Model Merging’（明确使用模型合并技术）；4) ‘Scaling Laws AND Data Quality’（强调数据质量优先策略）；5) ‘LLM Agents/Tool Use’（平台包含智能体框架和工具调用功能）；6) ‘Multi-agent Systems’（Fanar-Sadiq采用多智能体架构）；7) ‘Alignment’（FanarGuard涉及安全和文化对齐）。其他关键词如MoE、SFT、RLHF、RAG等未在摘要中提及，评为0分。

!!! tip deepseek-chat TL;DR

论文介绍了资源受限条件下开发的阿拉伯语生成式AI平台Fanar 2.0，通过数据质量优先、持续预训练和模型合并等策略，在仅使用256个H100 GPU和有限数据的情况下，实现了阿拉伯语和英语能力的显著提升，并扩展了包括内容审核、语音识别、视觉理解、智能体工具调用等在内的完整AI能力栈。

摘要翻译

我们推出Fanar 2.0——卡塔尔第二代以阿拉伯语为中心的人工智能生成平台。主权性是其首要设计原则：从数据管道到部署基础设施的每个组件，均在哈马德·本·哈利法大学的卡塔尔计算研究所（QCRI）自主设计与运营。Fanar 2.0展现了资源受限条件下的卓越成就：该项目仅使用256块英伟达H100 GPU运行，而阿拉伯语虽拥有4亿母语者，其网络数据占比仅约0.5%。通过采取数据质量优先于数量、定向持续预训练与模型融合的严谨策略，Fanar 2.0在有限资源下实现了显著突破。其核心是Fanar-27B模型，该模型基于Gemma-3-27B架构，通过三种数据配方构建的1.2万亿高质量词元语料库进行持续预训练。尽管使用的预训练词元数量比Fanar 1.0减少八倍，其在多项基准测试中取得显著提升：阿拉伯语知识（+9.1分）、语言能力（+7.3分）、方言理解（+3.5分）及英语能力（+7.6分）。除核心大语言模型外，Fanar 2.0还引入了丰富的新功能体系。FanarGuard是一个先进的40亿参数双语审核过滤器，专为阿拉伯语安全与文化对齐设计。语音系列Aura新增支持数小时音频的长时自动语音识别模型。视觉系列Oryx在具备文化根基的图像生成功能之外，增加了阿拉伯语感知的图像与视频理解能力。智能体工具调用框架支持多步骤工作流。Fanar-Sadiq采用多智能体架构处理伊斯兰相关内容。Fanar-Diwan提供古典阿拉伯诗歌生成功能。FanarShaheen实现基于大语言模型的双语翻译。重新设计的多层协调器通过意图感知路由与深度防御安全验证，统筹所有组件协同工作。总体而言，Fanar 2.0证明：在主权性约束与有限资源条件下开发的人工智能系统，仍能取得与大规模投入所建系统相竞争的卓越性能。

Abstract

We present Fanar 2.0, the second generation of Qatar’s Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.

Keywords: Arabic Generative AI, Continual Pre-training, Model Merging, Data Quality, LLM Agents, Tool Calling, Sovereign AI, Resource-constrained Development

100. ❌ FederatedFactory: Generative One-Shot Learning for Extremely Non-IID Distributed Scenarios

Authors: Andrea Moleri, Christian Internò, Ali Raza, Markus Olhofer, David Klindt, Fabio Stella, Barbara Hammer Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16370v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

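Scoring rationale: The paper focuses on FederatedFactory, a federated learning framework that addresses distributed optimization under non-IID data by synthesizing balanced datasets with a generative approach. It explicitly reports evaluations on medical imagery benchmarks (such as MedMNIST and ISIC2019), which is an application of AI to science and bioinformatics, so it is related to 'AI for Science OR Bioinformatics OR Cheminformatics' (relevance 8). The remaining keywords concern large-model techniques (LLMs, MoE, training methods, inference optimization, and so on) or specific applications (agents, tool use); the paper does not address any of these, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes FederatedFactory, a framework that changes the unit of federation from discriminative parameters to generative priors. By exchanging generative modules in a single communication round, it synthesizes class-balanced data under extreme non-IID conditions and recovers centralized upper-bound performance on medical imagery benchmarks.

Abstract (Chinese translation)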

联邦学习（Federated Learning, FL）能够在保障数据主权的前提下实现分布式优化。然而，当本地标签分布互斥时，由于优化轨迹相互冲突，标准的权重聚合方法会失效。现有的联邦学习方法通常依赖预训练的基础模型，这引入了不切实际的假设。我们提出联邦工厂（FederatedFactory），一个零依赖框架，其将联邦的基本单元从判别性参数反转为生成性先验。通过单轮通信交换生成模块，我们的架构支持从无到有地合成全局类别平衡的数据集，从而完全消除梯度冲突与外部先验偏差。在包括MedMNIST和ISIC2019在内的多种医学影像基准测试上的评估表明，我们的方法能够恢复集中式训练的性能上限。在极端异构条件下，该方法将CIFAR-10上的基线准确率从崩溃的11.36%提升至90.57%，并将ISIC2019的AUROC恢复至90.57%。此外，该框架通过确定性删除特定生成模块，实现了精确的模块化遗忘能力。

Abstract

Federated Learning (FL) enables distributed optimization without compromising data sovereignty. Yet, where local label distributions are mutually exclusive, standard weight aggregation fails due to conflicting optimization trajectories. Often, FL methods rely on pretrained foundation models, introducing unrealistic assumptions. We introduce FederatedFactory, a zero-dependency framework that inverts the unit of federation from discriminative parameters to generative priors. By exchanging generative modules in a single communication round, our architecture supports ex nihilo synthesis of universally class balanced datasets, eliminating gradient conflict and external prior bias entirely. Evaluations across diverse medical imagery benchmarks, including MedMNIST and ISIC2019, demonstrate that our approach recovers centralized upper-bound performance. Under pathological heterogeneity, it lifts baseline accuracy from a collapsed 11.36% to 90.57% on CIFAR-10 and restores ISIC2019 AUROC to 90.57%. Additionally, this framework facilitates exact modular unlearning through the deterministic deletion of specific generative modules.
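
The one-round exchange of generative modules described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `GaussianModule` is a hypothetical one-dimensional Gaussian standing in for whatever generator the paper actually trains, and the client data is synthetic. The sketch only shows the data flow (fit locally, upload once, sample a class-balanced set, delete a module to unlearn), not the authors' implementation.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class GaussianModule:
    """Hypothetical per-class generative module (a 1-D Gaussian stand-in)."""
    label: int
    mean: float
    std: float

    def sample(self, n, rng):
        # Draw n labelled synthetic points from this module.
        return [(rng.gauss(self.mean, self.std), self.label) for _ in range(n)]


def fit_local_module(label, values):
    """Client side: fit a generator using local data only."""
    return GaussianModule(label, statistics.fmean(values), statistics.stdev(values))


def synthesize_balanced(modules, per_class, rng):
    """Server side: draw the same number of samples from every module."""
    data = []
    for module in modules.values():
        data.extend(module.sample(per_class, rng))
    return data


rng = random.Random(0)
# Three clients with mutually exclusive labels and very different sizes.
client_data = {
    0: [rng.gauss(-2.0, 0.5) for _ in range(200)],
    1: [rng.gauss(0.0, 0.5) for _ in range(50)],
    2: [rng.gauss(2.0, 0.5) for _ in range(10)],
}
# Single communication round: each client uploads one module, no raw data.
modules = {label: fit_local_module(label, xs) for label, xs in client_data.items()}
synthetic = synthesize_balanced(modules, per_class=100, rng=rng)

# Modular unlearning in this toy: deleting a module removes its class entirely.
del modules[2]
after_unlearning = synthesize_balanced(modules, per_class=100, rng=rng)
```

In this toy the synthetic set is balanced regardless of how skewed the client sizes are, which is the property the abstract attributes to exchanging generators instead of averaging weights.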

Keywords: Federated Learning, Non-IID Data, Generative Priors, One-Shot Learning, Medical Imagery, Dataset Synthesis, Modular Unlearning, Distributed Optimization

101. ❌ DynamicGate MLP Conditional Computation via Learned Structural Dropout and Input Dependent Gating for Functional Plasticity

Authors: Yong Il Choi Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16367v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

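Scoring rationale: The paper proposes DynamicGate-MLP, a model that implements conditional computation through learned gates, suppressing unnecessary computation and concentrating resources on the parts each input needs. Its core contribution is unifying the regularization view (as in Dropout) and the conditional-computation view in one framework, using continuous gate probabilities and discrete execution masks. The evaluation explicitly compares against MoE-style variants, so the paper is highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (relevance 8), since MoE is likewise a conditional-computation, sparse-model technique. However, the paper concentrates on MLP architectures evaluated on MNIST, CIFAR-10, Tiny-ImageNet, Speech Commands, and PBMC3k; it is not about large models or a typical AI-for-science application, and it does not touch the other keywords such as LLMs. All other keywords are therefore scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes DynamicGate-MLP, which uses learned gates for conditional computation: the gates regularize the network during training and select a per-input execution path at inference to reduce computation. It is evaluated on several datasets against MLP baselines and MoE-style variants.

Abstract (Chinese translation)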

Dropout是一种典型的正则化技术，通过在训练过程中随机失活隐藏单元来缓解过拟合。相比之下，标准推理过程使用完整的网络进行密集计算，因此其目标和机制不同于条件计算——后者的执行操作依赖于输入。本文将DynamicGate-MLP组织在一个统一框架中，使其同时满足正则化视角和条件计算视角的要求。该模型不采用随机掩码，而是学习控制是否使用每个单元（或模块）的门控机制，在实现依赖样本的执行过程（将计算集中于每个输入所需的部分）的同时抑制不必要的计算。为此，我们定义了连续的门控概率，并在推理时根据其生成离散的执行掩码以选择计算路径。训练过程通过对期望门控使用率施加惩罚来控制计算预算，并采用直通估计器优化离散掩码。我们在MNIST、CIFAR-10、Tiny-ImageNet、Speech Commands和PBMC3k数据集上评估DynamicGate-MLP，并与多种MLP基线及MoE风格变体进行比较。计算效率的比较采用统一标准，即基于门控激活比率和层加权相对MAC度量，而非依赖硬件与后端内核的实时延迟。

Abstract

Dropout is a representative regularization technique that stochastically deactivates hidden units during training to mitigate overfitting. In contrast, standard inference executes the full network with dense computation, so its goal and mechanism differ from conditional computation, where the executed operations depend on the input. This paper organizes DynamicGate-MLP into a single framework that simultaneously satisfies both the regularization view and the conditional-computation view. Instead of a random mask, the proposed model learns gates that decide whether to use each unit (or block), suppressing unnecessary computation while implementing sample-dependent execution that concentrates computation on the parts needed for each input. To this end, we define continuous gate probabilities and, at inference time, generate a discrete execution mask from them to select an execution path. Training controls the compute budget via a penalty on expected gate usage and uses a Straight-Through Estimator (STE) to optimize the discrete mask. We evaluate DynamicGate-MLP on MNIST, CIFAR-10, Tiny-ImageNet, Speech Commands, and PBMC3k, and compare it with various MLP baselines and MoE-style variants. Compute efficiency is compared under a consistent criterion using gate activation ratios and a layerweighted relative MAC metric, rather than wall-clock latency that depends on hardware and backend kernels.
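
The gating mechanism in the abstract (continuous gate probabilities, a discrete execution mask, a penalty on expected gate usage, and a Straight-Through Estimator) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's code: the 0.5 threshold, the sigmoid parameterization, the penalty form `budget_weight * mean(probs)`, and the MAC-weighted average in `relative_macs` are all choices made here for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_forward(hidden, gate_logits, threshold=0.5):
    """Forward pass: hard 0/1 mask derived from continuous gate probabilities."""
    probs = [sigmoid(z) for z in gate_logits]
    mask = [1.0 if p > threshold else 0.0 for p in probs]
    out = [h * m for h, m in zip(hidden, mask)]
    return out, probs, mask


def ste_gate_grads(hidden, probs, grad_out, budget_weight):
    """Gradient w.r.t. gate logits using a straight-through estimator.

    The forward pass used the hard mask; the backward pass treats the mask
    as if it were the probability, so d(out_i)/d(logit_i) = h_i * p_i * (1 - p_i).
    The assumed penalty budget_weight * mean(probs) adds its own term.
    """
    n = len(probs)
    grads = []
    for h, p, g in zip(hidden, probs, grad_out):
        dp = p * (1.0 - p)  # derivative of sigmoid
        grads.append(g * h * dp + budget_weight * dp / n)
    return grads


def gate_activation_ratio(mask):
    """Fraction of units that are actually executed."""
    return sum(mask) / len(mask)


def relative_macs(layer_macs, layer_ratios):
    """Assumed metric: MAC-weighted average of per-layer activation ratios."""
    return sum(m * r for m, r in zip(layer_macs, layer_ratios)) / sum(layer_macs)


hidden = [0.5, -1.0, 2.0, 0.1]
gate_logits = [2.0, -3.0, 0.5, -0.2]
out, probs, mask = gated_forward(hidden, gate_logits)
grads = ste_gate_grads(hidden, probs, grad_out=[1.0] * 4, budget_weight=0.01)
ratio = gate_activation_ratio(mask)
```

Reporting `ratio` and `relative_macs` instead of wall-clock time matches the abstract's stated reason for its efficiency criterion: these quantities do not depend on hardware or backend kernels.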

Keywords: DynamicGate-MLP, conditional computation, learned gating, structural dropout, input-dependent execution, compute efficiency, MoE variants, Straight-Through Estimator

102. ❌ FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment

Authors: Qinhong Lin, Ruitao Feng, Yinglun Feng, Zhenxin Huang, Yukun Chen, Zhongliang Yang, Linna Zhou, Binjie Fei, Jiaqi Liu, Yu Li Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16365v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

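Scoring rationale: FactorEngine is a factor mining framework for quantitative investment whose core contributions are LLM-guided search and a multi-agent pipeline. Keyword relevance: 1. 'Large Language Models OR LLMs OR Foundation Models' (8): the framework uses LLMs for guided search and code generation as a key component. 2. 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10): the framework includes a 'closed-loop multi-agent extraction-verification-code-generation pipeline', which corresponds directly to an agentic workflow. 3. 'Multi-agent Systems OR Agent Coordination' (10): likewise, the paper explicitly describes a multi-agent system that processes financial reports and generates factors. 4. 'AI for Science OR Bioinformatics OR Cheminformatics' (5): the paper applies AI to quantitative finance, a scientific application area, but not to bioinformatics or cheminformatics. Other keywords such as MoE, SFT, and RAG are neither mentioned nor applied, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies how to automatically mine executable, auditable predictive factors (alpha factors) from non-stationary market data. It introduces FactorEngine, which combines LLM-guided search with a multi-agent knowledge-infusion pipeline, and reports stronger predictive stability and portfolio performance than baseline methods in backtests on real-world data.

Abstract (Chinese translation)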

本研究探讨阿尔法因子挖掘——即从嘈杂、非平稳的市场数据中自动发现预测信号——并遵循一项实际要求：挖掘出的因子需具备直接可执行性与可审计性，且发现过程需保持大规模计算的可处理性。现有符号方法受限于表达能力的有界性，而神经预测模型往往以牺牲可解释性换取性能提升，且对市场状态转换与过拟合问题较为敏感。我们提出FactorEngine（FE），一种程序级因子发现框架，将因子定义为图灵完备的代码，并通过三重分离机制提升其效能与效率：（i）逻辑修订与参数优化分离，（ii）基于大语言模型（LLM）引导的定向搜索与贝叶斯超参数搜索分离，（iii）LLM调用与本地计算分离。FE进一步整合了知识增强的引导模块，通过闭环多智能体提取-验证-代码生成流程，将非结构化财务报告转化为可执行的因子程序；同时构建经验知识库，支持轨迹感知的优化过程（包括从失败中学习）。基于真实市场OHLCV数据的广泛回测表明，相较于基线方法，FE生成的因子具有显著更强的预测稳定性与投资组合影响力——例如更高的IC/ICIR（及Rank IC/ICIR）以及改进的AR/Sharpe比率，从而实现了业界领先的预测能力与投资组合绩效。

Abstract

We study alpha factor mining, the automated discovery of predictive signals from noisy, non-stationary market data, under a practical requirement that mined factors be directly executable and auditable, and that the discovery process remain computationally tractable at scale. Existing symbolic approaches are limited by bounded expressiveness, while neural forecasters often trade interpretability for performance and remain vulnerable to regime shifts and overfitting. We introduce FactorEngine (FE), a program-level factor discovery framework that casts factors as Turing-complete code and improves both effectiveness and efficiency via three separations: (i) logic revision vs. parameter optimization, (ii) LLM-guided directional search vs. Bayesian hyperparameter search, and (iii) LLM usage vs. local computation. FE further incorporates a knowledge-infused bootstrapping module that transforms unstructured financial reports into executable factor programs through a closed-loop multi-agent extraction-verification-code-generation pipeline, and an experience knowledge base that supports trajectory-aware refinement (including learning from failures). Across extensive backtests on real-world OHLCV data, FE produces factors with substantially stronger predictive stability and portfolio impact, for example higher IC/ICIR (and Rank IC/ICIR) and improved AR/Sharpe, than baseline methods, achieving state-of-the-art predictive and portfolio performance.
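
The abstract reports results in terms of IC, Rank IC, and ICIR without defining them. The sketch below uses the common conventions, which are assumed here and not taken from the paper: IC is the Pearson correlation between factor values and next-period returns across assets, Rank IC is the same correlation computed on ranks, and ICIR is the mean of the per-period IC series divided by its sample standard deviation. The numbers are made up for illustration.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation; used as the information coefficient (IC)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def rank_ic(factor, returns):
    """Rank IC: Pearson correlation of ranks (Spearman correlation)."""
    return pearson(ranks(factor), ranks(returns))


def icir(ic_series):
    """ICIR: mean IC over periods divided by its sample standard deviation."""
    return statistics.fmean(ic_series) / statistics.stdev(ic_series)


# Illustrative cross-section: factor values and next-period returns for 4 assets.
factor = [0.1, 0.4, 0.2, 0.9]
next_returns = [0.01, 0.03, 0.02, 0.05]
# Illustrative per-period IC series.
daily_ic = [0.05, 0.03, 0.04]
```

Here `rank_ic(factor, next_returns)` equals 1 because the factor orders the assets exactly as their returns do; a higher ICIR indicates that the factor's IC is both positive and stable across periods.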

Keywords: FactorEngine, alpha factor mining, quantitative investment, LLM-guided search, multi-agent system, knowledge-infused bootstrapping, program-level factor discovery, financial reports processing

103. ❌ $D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation

Authors: Ruizhi Wang, Weihan Li, Zunlei Feng, Haofei Zhang, Mingli Song, Jiayu Wang, Jie Song, Li Sun Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16362v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

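Scoring rationale: The paper addresses monocular depth estimation for remote sensing imagery, a computer vision task, using Vision Transformers and diffusion models. It is unrelated to most keywords, which mainly concern large language model techniques. It is only partly related to 'AI for Science OR Bioinformatics OR Cheminformatics', because remote sensing image processing can be viewed as AI applied to science (earth science, environmental monitoring), but this is not the paper's focus, hence a relevance of 5.

!!! tip deepseek-chat TL;DR

The paper tackles the accuracy-efficiency trade-off in monocular depth estimation for remote sensing imagery. It proposes an efficient framework that combines a ViT with a diffusion model, achieving an inference speedup of more than 40x while maintaining high perceptual quality.

Abstract (Chinese translation)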

从遥感影像中进行实时、高保真的单目深度估计对众多应用至关重要，但现有方法在精度与效率之间面临显著权衡。尽管使用视觉变换器（Vision Transformer, ViT）骨干网络进行密集预测速度较快，但其感知质量往往较差。相反，扩散模型能提供高保真结果，但计算成本极高。为克服这些限制，我们提出用于遥感单目深度估计的深度细节扩散框架（Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation, $D^3$-RSMDE），这是一个旨在实现速度与质量间最优平衡的高效框架。我们的框架首先利用基于ViT的模块快速生成高质量初步深度图构建，作为结构先验，有效替代了扩散模型中耗时的初始结构生成阶段。基于此先验，我们提出渐进式线性混合细化（Progressive Linear Blending Refinement, PLBR）策略，该策略使用轻量级U-Net仅通过少量迭代即可优化细节。整个细化步骤在由变分自编码器（Variational Autoencoder, VAE）支持的紧凑潜在空间中高效运行。大量实验表明，$D^3$-RSMDE在Learned Perceptual Image Patch Similarity（LPIPS）感知指标上较Marigold等领先模型显著降低了11.85%，同时推理速度提升超过40倍，并保持了与轻量级ViT模型相当的显存使用量。

Abstract

Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Although using Vision Transformer (ViT) backbones for dense prediction is fast, they often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation ($D^3$-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map construction, which serves as a structural prior, effectively replacing the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that $D^3$-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving over a 40x speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.

Keywords: Remote Sensing, Monocular Depth Estimation, Vision Transformer, Diffusion Models, Efficient Inference, Perceptual Quality, VAE, Progressive Refinement

104. ❌ Toward Experimentation-as-a-Service in 5G/6G: The Plaza6G Prototype for AI-Assisted Trials

Authors: Sergio Barrachina-Muñoz, Marc Carrascosa-Zamacois, Horacio Bleda, Umair Riaz, Yasir Maqsood, Xavier Calle, Selva Vía, Miquel Payaró, Josep Mangues-Bafalluy Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16356v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

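Scoring rationale: The paper presents Plaza6G, an experimentation platform for 5G/6G wireless networks, and explicitly states that it uses an LLM-based assistant, RAG (retrieval-augmented generation), and LoRA (Low-Rank Adaptation) to support experiment design. It is therefore highly relevant to 'Large Language Models', 'PEFT/LoRA', and 'Retrieval-Augmented Generation' (10 each). As an application of AI to wireless communications, it is partly related to 'AI for Science' (5). Other keywords such as MoE, SLMs, Scaling Laws, and RLHF are not mentioned in the abstract and are unrelated to the paper's core content (0).

!!! tip deepseek-chat TL;DR

The paper presents Plaza6G, described as the first operational Experiment-as-a-Service platform unifying cloud resources with next-generation wireless infrastructure. An LLM-based assistant using RAG and LoRA simplifies 5G/6G experiment design, and the demonstrations include automated CI/CD integration and interactive OTA testing under programmable propagation conditions.

Abstract (Chinese translation)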

本文介绍了Plaza6G——首个将云资源与下一代无线基础设施统一的可运行实验即服务平台。该平台由巴塞罗那CTTC研发，集成了GPU加速计算集群、多种5G核心网（包括开源方案如Free5GC和商用方案如Cumucore）、可编程无线接入网，以及在统一编排下的物理或仿真用户设备。在Plaza6G中，实验设计可通过网页门户或REST应用程序接口以自然语言描述，极大降低了对专业知识的依赖。门户界面与REST接口均配备基于大语言模型的智能助手，该助手采用检索增强生成技术获取最新实验知识，并利用低秩自适应方法进行持续领域微调。空口测试依托四腔室微波暗室及双站点室外5G网络开展，工作频段涵盖6GHz以下及毫米波。平台演示案例包括十分钟内完成的自动化CI/CD集成部署，以及可编程传播条件下的交互式空口测试。实验描述符采用机器可读格式确保结果可复现，未来工作将聚焦策略感知编排、安全验证与联邦测试床集成，以推动开放可复现的无线实验研究。

Abstract

This paper presents Plaza6G, the first operational Experiment-as-a-Service (ExaS) platform unifying cloud resources with next-generation wireless infrastructure. Developed at CTTC in Barcelona, Plaza6G integrates GPU-accelerated compute clusters, multiple 5G cores, both open-source (e.g., Free5GC) and commercial (e.g., Cumucore), programmable RANs, and physical or emulated user equipment under unified orchestration. In Plaza6G, the experiment design requires minimal expertise as it is expressed in natural language via a web portal or a REST API. The web portal and REST API are enhanced with a Large Language Model (LLM)-based assistant, which employs retrieval-augmented generation (RAG) for up-to-date experiment knowledge and Low-Rank Adaptation (LoRA) for continuous domain fine-tuning. Over-the-air (OTA) trials leverage a four-chamber anechoic facility and a dual-site outdoor 5G network operating in sub-6 GHz and mmWave bands. Demonstrations include automated CI/CD integration with sub-ten-minute setup and interactive OTA testing under programmable propagation conditions. Machine-readable experiment descriptors ensure reproducibility, while future work targets policy-aware orchestration, safety validation, and federated testbed integration toward open, reproducible wireless experimentation.
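
For readers unfamiliar with the Low-Rank Adaptation mentioned in the abstract, the following is a minimal sketch of the standard LoRA weight update, W' = W + (alpha / r) · B · A, written with plain Python lists. It is a generic illustration of the technique, not Plaza6G's implementation; the matrix sizes, values, and the helper names `matmul` and `lora_effective_weight` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen base weight, shape d_out x d_in
    a: trainable down-projection, shape r x d_in
    b: trainable up-projection, shape d_out x r
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Frozen 2 x 3 base weight and a rank-1 adapter (illustrative values).
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
a = [[1.0, 0.0, -1.0]]   # r x d_in
b = [[0.5], [0.0]]       # d_out x r
adapted = lora_effective_weight(w, a, b, alpha=2.0)

# With B initialised to zeros (the usual LoRA starting point), the adapted
# weight equals the base weight, so fine-tuning starts from the base model.
b_zero = [[0.0], [0.0]]
unchanged = lora_effective_weight(w, a, b_zero, alpha=2.0)
```

Only `a` and `b` are trained while `w` stays frozen, which is what makes repeated domain updates cheap compared with full fine-tuning.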

Keywords: Experiment-as-a-Service, 5G/6G, Large Language Model, Retrieval-Augmented Generation, LoRA, Wireless Experimentation, OTA Testing, AI-Assisted Trials

105. ❌ Automated identification of Ichneumonoidea wasps via YOLO-based deep learning: Integrating HiresCam for Explainable AI

Authors: Joao Manoel Herrera Pinheiro, Gabriela Do Nascimento Herrera, Alvaro Doria Dos Santos, Luciana Bueno Dos Reis Fernandes, Ricardo V. Godoy, Eduardo A. B. Almeida, Helena Carolina Onody, Marcelo Andrade Da Costa Vieira, Angelica Maria Penteado-Dias, Marcelo Becker Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16351v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

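Scoring rationale: The paper is a computer vision application that classifies insects with a YOLO architecture and HiResCAM, a concrete use of deep learning in biological science (entomology). It is unrelated to the vast majority of keywords, which cover large-model principles, training methods, inference optimization, alignment, agent systems, and so on. It is partly related to 'Mechanistic Interpretability OR Explainable AI' (because HiResCAM is used for interpretability visualizations) and strongly related to 'AI for Science OR Bioinformatics OR Cheminformatics' (as an AI-for-science application in taxonomy and bioinformatics).

!!! tip deepseek-chat TL;DR

The study proposes an automated system, built on a YOLO-based deep learning framework with HiResCAM for explainability, that identifies Ichneumonoidea parasitoid wasps from high-resolution images. It reports classification accuracy above 96% and shows that the features the model attends to are biologically plausible.

Abstract (Chinese translation)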

对姬蜂总科寄生蜂进行准确的分类鉴定对于生物多样性评估、生态监测及生物控制项目至关重要。然而，形态相似性、体型微小以及种间差异细微等特点，使得人工鉴定工作繁重且高度依赖专业知识。本研究提出一种基于深度学习的框架，用于姬蜂总科蜂类的自动识别。该框架采用基于YOLO的架构，并集成高分辨率类激活映射技术以增强模型的可解释性。所提出的系统能够从高分辨率图像中同步识别蜂类所属科别。数据集包含3556张膜翅目标本的高分辨率图像，其分类分布主要集中在姬蜂科（786例）、茧蜂科（648例）、蜜蜂科（466例）和胡蜂科（460例）。研究使用精心构建的数据集进行了大量实验，并通过精确率、召回率、F1分数和准确率评估模型性能。结果表明，模型准确率超过96%，且对不同形态变异具有稳健的泛化能力。高分辨率类激活映射的可视化结果证实，模型能够聚焦于具有分类学意义的解剖区域，如翅脉结构、触角分节及后体形态，从而验证了所学特征的生物学合理性。可解释人工智能技术的整合提升了系统的透明度和可信度，使其适用于昆虫学研究，以加速对这一尚未充分描述的寄生蜂总科的生物多样性表征工作。

Abstract

Accurate taxonomic identification of parasitoid wasps within the superfamily Ichneumonoidea is essential for biodiversity assessment, ecological monitoring, and biological control programs. However, morphological similarity, small body size, and fine-grained interspecific variation make manual identification labor-intensive and expertise-dependent. This study proposes a deep learning-based framework for the automated identification of Ichneumonoidea wasps using a YOLO-based architecture integrated with High-Resolution Class Activation Mapping (HiResCAM) to enhance interpretability. The proposed system simultaneously identifies wasp families from high-resolution images. The dataset comprises 3556 high-resolution images of Hymenoptera specimens. The taxonomic distribution is primarily concentrated among the families Ichneumonidae (n = 786), Braconidae (n = 648), Apidae (n = 466), and Vespidae (n = 460). Extensive experiments were conducted using a curated dataset, with model performance evaluated through precision, recall, F1 score, and accuracy. The results demonstrate high accuracy of over 96 % and robust generalization across morphological variations. HiResCAM visualizations confirm that the model focuses on taxonomically relevant anatomical regions, such as wing venation, antennae segmentation, and metasomal structures, thereby validating the biological plausibility of the learned features. The integration of explainable AI techniques improves transparency and trustworthiness, making the system suitable for entomological research to accelerate biodiversity characterization in an under-described parasitoid superfamily.

Keywords: automated identification, Ichneumonoidea wasps, YOLO-based deep learning, High-Resolution Class Activation Mapping (HiResCAM), explainable AI, taxonomic identification, parasitoid wasps, biodiversity assessment

106. ❌ Explainable machine learning workflows for radio astronomical data processing

Authors: S. Yatawatta, A. Ahmadi, B. Asabere, M. Iacobelli, N. Peters, M. Veldhuis Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16350v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文专注于射电天文数据处理中的可解释机器学习工作流，提出结合模糊规则推理和深度学习来提高ML辅助决策的可解释性。仅与两个关键词相关：1）‘Mechanistic Interpretability OR Explainable AI’高度相关（10分），因为论文核心是解决ML黑箱问题，提高可解释性；2）‘AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（8分），因为论文将ML应用于射电天文（科学领域），但未涉及生物信息学或化学信息学。其他关键词均与大模型、深度学习技术原理或特定AI方法（如LLM、MoE、对齐、推理等）无关，论文未提及这些内容，故评0分。

!!! tip deepseek-chat TL;DR

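This paper addresses the black-box nature of machine-learning workflows in radio astronomy data processing by combining fuzzy rule-based inference with deep learning, and shows in a simulated calibration application that the approach improves explainability without compromising quality or accuracy.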

Abstract (Chinese translation)

射电天文学高度依赖高效精准的数据处理流程来产出可用于科学研究的数据。随着现代射电望远镜数据流的日益增长，人工配置此类数据处理流程已不可行。机器学习（ML）正逐渐成为实现数据处理流程自动化的可行解决方案。然而，现有几乎所有基于机器学习的流程均属于黑箱类型，自动化智能体作出的决策难以被天文学家解读。为提高射电天文领域中机器学习辅助数据处理流程的可解释性，我们提出将基于模糊规则的推理与深度学习结合使用。我们以射电天文中的一项应用——校准为例，通过Takagi-Sugeno-Kang（TSK）模糊系统展示所提出的机器学习辅助决策方法。基于仿真实验的结果表明，该方法在提升可解释性的同时，并未牺牲处理质量或精度。

Abstract

Radio astronomy relies heavily on efficient and accurate processing pipelines to deliver science ready data. With the increasing data flow of modern radio telescopes, manual configuration of such data processing pipelines is infeasible. Machine learning (ML) is already emerging as a viable solution for automating data processing pipelines. However, almost all existing ML enabled pipelines are of black-box type, where the decisions made by the automating agents are not easily deciphered by astronomers. In order to improve the explainability of the ML aided data processing pipelines in radio astronomy, we propose the joint use of fuzzy rule based inference and deep learning. We consider one application in radio astronomy, i.e., calibration, to showcase the proposed approach of ML aided decision making using a Takagi-Sugeno-Kang (TSK) fuzzy system. We provide results based on simulations to illustrate the increased explainability of the proposed approach, not compromising on the quality or accuracy.

Keywords: Explainable machine learning, Radio astronomy, Data processing pipelines, Fuzzy rule based inference, Deep learning, Calibration, Takagi-Sugeno-Kang fuzzy system, Black-box ML
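
To make the Takagi-Sugeno-Kang idea in the abstract concrete, here is a minimal sketch of first-order TSK inference in plain Python. The two rules, the Gaussian membership functions, the product t-norm, and every number are illustrative assumptions; the abstract does not describe the paper's actual rule base or how deep learning is coupled to it.

```python
import math


def gaussian(x, centre, width):
    """Membership degree of x in a Gaussian fuzzy set."""
    return math.exp(-0.5 * ((x - centre) / width) ** 2)


# Each rule reads: IF x1 is A1 AND x2 is A2 THEN y = b0 + b1*x1 + b2*x2.
# "sets" holds (centre, width) per input; "coef" holds [b0, b1, b2].
# All values are made up for illustration.
RULES = [
    {"sets": [(0.0, 1.0), (0.0, 1.0)], "coef": [1.0, 0.5, 0.0]},
    {"sets": [(2.0, 1.0), (1.0, 0.5)], "coef": [0.0, 0.0, 2.0]},
]


def tsk_infer(x, rules=RULES):
    """Return (output, normalised firing strengths) for input vector x."""
    strengths = []
    for rule in rules:
        w = 1.0
        for xi, (centre, width) in zip(x, rule["sets"]):
            w *= gaussian(xi, centre, width)  # product t-norm for AND
        strengths.append(w)
    total = sum(strengths)
    norm = [w / total for w in strengths]
    # Linear consequent of each rule evaluated at x.
    outputs = [
        r["coef"][0] + sum(b * xi for b, xi in zip(r["coef"][1:], x))
        for r in rules
    ]
    y = sum(n * o for n, o in zip(norm, outputs))
    return y, norm


y, firing = tsk_infer([0.0, 0.0])
print(y, firing)
```

The normalised firing strengths are what make such a system readable: they state how much each human-readable rule contributed to a given decision.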

107. ❌ Detecting Sentiment Steering Attacks on RAG-enabled Large Language Models

Authors: Isha Andrade, Shalaka S Mahadik, Mithun Mukherjee, Pranav M Pawar, Raja Muthalagu Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16342v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
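
The pass mark of 26.6 in the score line follows from the report's configured formula, 5.0 + 0.8 × total weight, with a configured total weight of 27.0. A minimal sketch of that arithmetic is below. The function names are hypothetical, and treating a row's score as weight × relevance is an assumption inferred from the table columns, not something the report states.

```python
def pass_threshold(total_weight, base=5.0, slope=0.8):
    """Pass mark from the report's configuration: base + slope * total weight."""
    return base + slope * total_weight


def total_score(rows):
    """Sum of per-keyword scores.

    ASSUMPTION: each row's score is weight * relevance; the report only
    shows the three columns and does not spell out the formula.
    """
    return sum(weight * relevance for weight, relevance in rows)


threshold = pass_threshold(27.0)
rows = [(0.0, 10.0), (0.0, 8.0)]  # (weight, relevance) pairs as displayed
print(round(threshold, 1), total_score(rows) >= threshold)
```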

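Scoring rationale: Although the title mentions RAG-enabled large language models, the abstract is entirely about IoT network security: it uses CNN and LSTM deep-learning models for intrusion detection and does not touch any of the large-model topics covered by the keywords (techniques, architectures, training methods, inference optimization, alignment, or applications). Because every keyword targets large models while the abstract describes a conventional deep-learning application to network security, all keywords receive a relevance of 0.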

!!! tip deepseek-chat TL;DR

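This study addresses IoT network security by proposing two lightweight intelligent intrusion detection systems, one based on a convolutional neural network and one on a long short-term memory network. On the CICIoT2023 dataset, the CNN-based system reaches at least 98.6% accuracy and the LSTM-based system at least 98.68% across binary, grouped, and multi-class classification.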

Abstract (Chinese translation)

大规模物联网（IoT）网络的普及既带来了机遇也伴随着挑战。它不仅通过提升自动化流程的效率革新了组织的运作方式，也简化了我们的日常生活。然而，尽管物联网网络提升了便利性与连接性，但也因未授权设备接入网络并通过特定攻击类型利用现有漏洞而增加了安全风险。本研究提出了两种基于轻量级深度学习（DL）的智能入侵检测系统（IDS），以增强物联网网络的安全性：一种是基于卷积神经网络（CNN）的IDS，另一种是基于长短期记忆网络（LSTM）的IDS。研究使用CICIoT2023数据集评估了这两种基于深度学习的智能IDS的性能。基于深度学习的智能IDS能够通过二元分类、分组分类和多类别分类成功识别并归类各种网络威胁。所提出的基于CNN的IDS在二元、分组和多类别分类中的准确率分别达到99.34%、99.02%和98.6%，而基于LSTM的IDS则分别达到99.42%、99.13%和98.68%。

Abstract

The proliferation of large-scale IoT networks has been both a blessing and a curse. Not only has it revolutionized the way organizations operate by increasing the efficiency of automated procedures, but it has also simplified our daily lives. However, while IoT networks have improved convenience and connectivity, they have also increased security risk due to unauthorized devices gaining access to these networks and exploiting existing weaknesses with specific attack types. The research proposes two lightweight deep learning (DL)-based intelligent intrusion detection systems (IDS) to enhance the security of IoT networks: the proposed convolutional neural network (CNN)-based IDS and the proposed long short-term memory (LSTM)-based IDS. The research evaluated the performance of both intelligent IDSs based on DL using the CICIoT2023 dataset. DL-based intelligent IDSs successfully identify and classify various cyber threats using binary, grouped, and multi-class classification. The proposed CNN-based IDS achieves an accuracy of 99.34%, 99.02% and 98.6%, while the proposed LSTM-based IDS achieves an accuracy of 99.42%, 99.13%, and 98.68% for binary, grouped, and multi-class classification, respectively.

Keywords: IoT networks, intrusion detection systems, deep learning, convolutional neural network, long short-term memory, cyber threats, CICIoT2023 dataset, classification accuracy

108. ❌ An Interpretable Machine Learning Framework for Non-Small Cell Lung Cancer Drug Response Analysis

Authors: Ann Rachel, Pranav M Pawar, Mithun Mukharjee, Raja M, Tojo Mathew Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16330v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

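Scoring rationale: The paper uses XGBoost and SHAP to predict drug response in non-small cell lung cancer, an application of AI to bioinformatics, so 'AI for Science OR Bioinformatics OR Cheminformatics' is highly relevant (10 points). The abstract mentions using the large language model DeepSeek to check the biological validity of features, which gives 'Large Language Models OR LLMs OR Foundation Models' some relevance (5 points). The use of SHAP for explanation makes 'Mechanistic Interpretability OR Explainable AI' relevant (8 points). The remaining keywords concern large-model internals, training methods, inference optimization, agent systems, and similar topics that the paper does not address, so they score 0.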

!!! tip deepseek-chat TL;DR

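This study develops an interpretable machine-learning framework based on XGBoost and SHAP to predict drug response in non-small cell lung cancer, and uses the large language model DeepSeek to check the biological meaning of the features in support of personalized treatment.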

Abstract (Chinese translation)

肺癌是指恶性细胞在肺部以不受控制的方式异常增殖和扩散的疾病。常见的治疗策略包括手术、化疗和放疗，但由于癌症的异质性，这些并非最佳选择。在个性化医疗中，治疗方案根据个体的遗传信息及生活方式特征进行定制。此外，基于人工智能的深度学习方法能够分析大规模数据集，以发现癌症的早期迹象、肿瘤类型及治疗前景。本文重点探讨如何利用患者特定数据（主要侧重于遗传谱）制定个性化治疗方案。研究采用癌症药物敏感性基因组学（Genomics of Drug Sensitivity in Cancer）的多组学数据，结合机器学习技术构建预测模型。目标变量LN-IC50的数值决定了药物敏感或耐药的程度。研究利用XGBoost回归器预测药物反应，重点关注从癌症数据集中提取的分子与细胞特征。通过交叉验证和随机搜索进行超参数调优，以进一步优化模型的预测性能。为解释模型，研究采用了SHAP（SHapley Additive exPlanations）方法。SHAP值用于衡量每个特征对个体预测的影响。此外，研究使用DeepSeek（一种经过训练用于验证特征生物学有效性的大型语言模型）来解读特征间的关系。DeepSeek针对最重要的基因或通路提供了背景解释，并结合主要的SHAP值成分，共同支持了模型的可预测性。

Abstract

Lung cancer is a condition where there is abnormal growth of malignant cells that spread in an uncontrollable fashion in the lungs. Some common treatment strategies are surgery, chemotherapy, and radiation which aren’t the best options due to the heterogeneous nature of cancer. In personalized medicine, treatments are tailored according to the individual’s genetic information along with lifestyle aspects. In addition, AI-based deep learning methods can analyze large sets of data to find early signs of cancer, types of tumor, and prospects of treatment. The paper focuses on the development of personalized treatment plans using specific patient data focusing primarily on the genetic profile. Multi-Omics data from Genomics of Drug Sensitivity in Cancer have been used to build a predictive model along with machine learning techniques. The value of the target variable, LN-IC50, determines how sensitive or resistive a drug is. An XGBoost regressor is utilized to predict the drug response focusing on molecular and cellular features extracted from cancer datasets. Cross-validation and Randomized Search are performed for hyperparameter tuning to further optimize the model’s predictive performance. For explanation purposes, SHAP (SHapley Additive exPlanations) was used. SHAP values measure each feature’s impact on an individual prediction. Furthermore, interpreting feature relationships was performed using DeepSeek, a large language model trained to verify the biological validity of the features. Contextual explanations regarding the most important genes or pathways were provided by DeepSeek alongside the top SHAP value constituents, supporting the predictability of the model.

Keywords: Non-Small Cell Lung Cancer, Drug Response Prediction, XGBoost, SHAP, Interpretable Machine Learning, Personalized Medicine, Multi-Omics Data, DeepSeek
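
The abstract notes that SHAP values measure each feature's impact on an individual prediction. The sketch below shows the underlying Shapley attribution by brute force on a three-feature toy function. It is not the `shap` library and not the paper's XGBoost model: the toy model, the zero baseline, and the "replace absent features with the baseline" value function are all assumptions chosen to keep the example self-contained.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attributions for one input x relative to a baseline.

    Features outside a coalition are set to their baseline value.
    Cost grows as 2^n, so this is only practical for tiny n.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            for s in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical stand-in for a trained regressor."""
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
print(shapley_values(toy_model, x, baseline))
```

The attributions always sum to the difference between the prediction at `x` and the prediction at the baseline, which is the additivity property that makes per-feature explanations comparable across samples.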

109. ❌ A Human-Centred Architecture for Large Language Models-Cognitive Assistants in Manufacturing within Quality Management Systems

Authors: Marcos Galdino, Johanna Grahl, Tobias Hamann, Anas Abdelrazeq, Ingrid Isenhardt Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16325v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

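Scoring rationale: The paper's core contribution is a software architecture for integrating LLM-CAs (Large Language Model cognitive assistants) into manufacturing quality management systems, which is applied research on large models in industry. 'Large Language Models' is highly relevant (10 points) because the paper studies LLM-CAs directly; 'LLM Agents' is highly relevant (10 points) because LLM-CAs are essentially LLM-based agent systems; 'AI for Science' has some relevance (5 points) because the manufacturing application can be viewed as AI applied to engineering science. The remaining keywords cover specific techniques (MoE, quantization, inference acceleration, and so on) or specific application domains (such as bioinformatics) that the paper does not address, so they score 0.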

!!! tip deepseek-chat TL;DR

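This study addresses the lack of a human-centred software architecture for integrating Large Language Model cognitive assistants into manufacturing quality management systems. It proposes a component-based architecture and validates its flexibility, scalability, and potential for work augmentation through expert focus groups.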

Abstract (Chinese translation)

大型语言模型认知助手（LLM-CAs）能够增强制造业中的质量管理体系（QMS），促进持续流程改进与知识管理。然而，现有文献中缺乏一种以人为中心、专注于QMS的软件架构，以实现LLM-CAs在制造业中的集成。本研究通过设计一种基于组件的架构来填补这一空白，该架构综合考虑了需求分析与软件开发流程。验证工作通过迭代式专家焦点小组进行。所提出的架构确保了QMS内的灵活性、可扩展性、模块化及工作增强。此外，该架构为其与工业伙伴的实际应用铺平了道路，展现了其在推进制造流程方面的潜力。

Abstract

Large Language Models-Cognitive Assistants (LLM-CAs) can enhance Quality Management Systems (QMS) in manufacturing, fostering continuous process improvement and knowledge management. However, there is no human-centred software architecture focused on QMS that enables the integration of LLM-CAs into manufacturing in the current literature. This study addresses this gap by designing a component-based architecture considering requirement analysis and software development process. Validation was conducted via iterative expert focus groups. The proposed architecture ensures flexibility, scalability, modularity, and work augmentation within QMS. Moreover, it paves the way for its operationalization with industrial partners, showcasing its potential for advancing manufacturing processes.

Keywords: Large Language Models, Cognitive Assistants, Quality Management Systems, Manufacturing, Software Architecture, Human-centred Design, Process Improvement, Knowledge Management

110. ❌ Learning to Predict, Discover, and Reason in High-Dimensional Discrete Event Sequences

Authors: Hugo Math Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16313v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

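Scoring rationale: The paper explicitly uses large language models (LLMs) as part of its framework, so 'Large Language Models OR LLMs OR Foundation Models' scores 10. It also develops a multi-agent system that automatically synthesizes Boolean rules, so 'LLM Agents OR Autonomous Agents OR Agentic Workflow' and 'Multi-agent Systems OR Agent Coordination' each score 10. The paper applies AI to automotive diagnostics, an application of AI to science and engineering, so 'AI for Science OR Bioinformatics OR Cheminformatics' scores 5. The remaining keywords are not mentioned in the abstract and score 0.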

!!! tip deepseek-chat TL;DR

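This thesis addresses automated fault diagnosis for high-dimensional discrete diagnostic event sequences in modern vehicles. It proposes a framework that unifies event sequence modeling, causal discovery, and large language models, and develops Transformer-based prediction architectures, scalable causal discovery frameworks, and a multi-agent system that automatically synthesizes Boolean error-pattern rules.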

Abstract (Chinese translation)

现代车辆中嵌入的电子控制单元（ECU）会产生大量被称为诊断故障码（Diagnostic Trouble Codes, DTCs）的异步事件。这些离散事件构成了复杂的时间序列，反映了车辆各子系统健康状况的动态演变。在汽车行业中，领域专家通常使用布尔规则手动将这些故障码归类为更高层次的错误模式（Error Patterns, EPs），以描述系统故障并确保安全性。然而，随着车辆复杂性的增加，这种人工处理过程变得日益昂贵、容易出错且难以扩展。值得注意的是，现代车辆中独特的诊断故障码数量与自然语言的词汇量处于同一数量级，通常达到数万之多。这一观察促使我们进行范式转变：将诊断序列视为一种可以建模、预测并最终解释的语言。传统的统计方法无法捕捉丰富的依赖关系，也难以扩展到具有数千个节点、大样本量和长序列特征的高维数据集。具体而言，工业日志中分类事件空间的高基数性带来了重大挑战，这需要为这类事件驱动系统量身定制新的机器学习架构。本论文通过将事件序列建模、因果发现和大语言模型（Large Language Models, LLMs）统一为一个针对高维事件流的连贯框架，以解决自动化故障诊断问题。论文分为三个部分，体现了从预测到因果理解、最终到车辆诊断推理的渐进式过渡。为此，我们引入了多种基于Transformer的架构用于预测性维护，提出了可扩展的样本级和群体级因果发现框架，以及一个能够自动合成布尔错误模式规则的多智能体系统。

Abstract

Electronic control units (ECUs) embedded within modern vehicles generate a large number of asynchronous events known as diagnostic trouble codes (DTCs). These discrete events form complex temporal sequences that reflect the evolving health of the vehicle’s subsystems. In the automotive industry, domain experts manually group these codes into higher-level error patterns (EPs) using Boolean rules to characterize system faults and ensure safety. However, as vehicle complexity grows, this manual process becomes increasingly costly, error-prone, and difficult to scale. Notably, the number of unique DTCs in a modern vehicle is on the same order of magnitude as the vocabulary of a natural language, often numbering in the tens of thousands. This observation motivates a paradigm shift: treating diagnostic sequences as a language that can be modeled, predicted, and ultimately explained. Traditional statistical approaches fail to capture the rich dependencies and do not scale to high-dimensional datasets characterized by thousands of nodes, large sample sizes, and long sequence lengths. Specifically, the high cardinality of categorical event spaces in industrial logs poses a significant challenge, necessitating new machine learning architectures tailored to such event-driven systems. This thesis addresses automated fault diagnostics by unifying event sequence modeling, causal discovery, and large language models (LLMs) into a coherent framework for high-dimensional event streams. It is structured in three parts, reflecting a progressive transition from prediction to causal understanding and finally to reasoning for vehicle diagnostics. Consequently, we introduce several Transformer-based architectures for predictive maintenance, scalable sample- and population-level causal discovery frameworks and a multi-agent system that automates the synthesis of Boolean EP rules.

Keywords: diagnostic trouble codes, event sequence modeling, causal discovery, large language models, Transformer-based architectures, multi-agent system, automated fault diagnostics, high-dimensional event streams
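
The abstract describes experts grouping diagnostic trouble codes into error patterns with Boolean rules. A minimal sketch of what evaluating such a rule could look like is below. The nested-tuple rule format and the code names `DTC_A`, `DTC_B`, `DTC_C` are hypothetical; the abstract does not specify how the thesis represents EP rules.

```python
def matches(rule, dtcs):
    """Evaluate a nested Boolean rule against the set of observed DTCs.

    A rule is one of:
      ("code", name)          true if name was observed
      ("not", rule)
      ("and", rule, rule, ...)
      ("or", rule, rule, ...)
    """
    op = rule[0]
    if op == "code":
        return rule[1] in dtcs
    if op == "not":
        return not matches(rule[1], dtcs)
    if op == "and":
        return all(matches(r, dtcs) for r in rule[1:])
    if op == "or":
        return any(matches(r, dtcs) for r in rule[1:])
    raise ValueError(f"unknown operator: {op}")


# Hypothetical error pattern: DTC_A together with DTC_B or DTC_C, but only
# when DTC_D is absent.
EP_RULE = (
    "and",
    ("code", "DTC_A"),
    ("or", ("code", "DTC_B"), ("code", "DTC_C")),
    ("not", ("code", "DTC_D")),
)

print(matches(EP_RULE, {"DTC_A", "DTC_C"}))
```

Representing rules as data rather than code is one way an automated system could propose, compare, and validate candidate rules against logged sequences.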

111. ❌ NeSy-Route: A Neuro-Symbolic Benchmark for Constrained Route Planning in Remote Sensing

Authors: Ming Yang, Zhi Zhou, Shi-Yu Tian, Kun-Yang Yu, Lan-Zhe Guo, Yu-Feng Li Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16307v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

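Scoring rationale: The paper evaluates multimodal large language models (MLLMs) on constrained route planning in remote sensing. 'Large Language Models' and 'AI for Science' are both strongly relevant (8 points each) because the paper assesses MLLM capabilities in a scientific application, remote sensing. 'Chain of Thought' and 'System 2 Thinking' have some relevance (5 points each) because the paper evaluates reasoning and planning abilities, which relate to multi-step reasoning. Other keywords such as MoE, SFT, and RAG concern techniques the paper does not address and score 0.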

!!! tip deepseek-chat TL;DR

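This paper introduces NeSy-Route, a large-scale neuro-symbolic benchmark for constrained route planning in remote sensing. Its evaluation of existing multimodal large language models finds significant deficiencies in their perception and planning capabilities.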

Abstract (Chinese translation)

遥感技术支撑着灾害救援与生态实地调查等关键应用，这些场景要求系统必须理解复杂的场景与约束条件并做出可靠决策。当前遥感领域的基准测试主要侧重于评估多模态大语言模型（MLLMs）的感知与推理能力，却未能有效评估其规划能力。这一不足源于大规模策划与验证规划任务的困难，或现有评估方法存在不精确与不充分的问题。为应对这些局限，我们提出了NeSy-Route——一个用于遥感约束路径规划的大规模神经符号基准测试。在该基准中，我们引入了一个自动化数据生成框架，该框架将高保真语义掩膜与启发式搜索相结合，生成具有可证明最优解的多样化路径规划任务。这使得NeSy-Route能够基于10,821个路径规划样本进行全面评估，其规模接近先前最大基准的10倍。此外，我们开发了一套三级分层神经符号评估协议，以实现精确评估，并同时支持对感知、推理与规划能力的细粒度分析。通过对多种前沿MLLMs的综合评估，我们发现现有模型在感知与规划能力上存在显著不足。我们希望NeSy-Route能够支持进一步研究，推动开发更强大的遥感专用多模态大语言模型。

Abstract

Remote sensing underpins crucial applications such as disaster relief and ecological field surveys, where systems must understand complex scenes and constraints and make reliable decisions. Current remote-sensing benchmarks mainly focus on evaluating perception and reasoning capabilities of multimodal large language models (MLLMs). They fail to assess planning capability, stemming either from the difficulty of curating and validating planning tasks at scale or from evaluation protocols that are inaccurate and inadequate. To address these limitations, we introduce NeSy-Route, a large-scale neuro-symbolic benchmark for constrained route planning in remote sensing. Within this benchmark, we introduce an automated data-generation framework that integrates high-fidelity semantic masks with heuristic search to produce diverse route-planning tasks with provably optimal solutions. This allows NeSy-Route to comprehensively evaluate planning across 10,821 route-planning samples, nearly 10 times larger than the largest prior benchmark. Furthermore, a three-level hierarchical neuro-symbolic evaluation protocol is developed to enable accurate assessment and support fine-grained analysis on perception, reasoning, and planning simultaneously. Our comprehensive evaluation of various state-of-the-art MLLMs demonstrates that existing MLLMs show significant deficiencies in perception and planning capabilities. We hope NeSy-Route can support further research and development of more powerful MLLMs for remote sensing.

Keywords: remote sensing, multimodal large language models, constrained route planning, neuro-symbolic benchmark, planning capability, evaluation protocol, perception and reasoning, MLLMs
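
The abstract says only that tasks are generated by combining semantic masks with heuristic search to obtain provably optimal routes. As a rough illustration of how that can work, the sketch below runs A* on a small binary mask. The grid, 4-connectivity, unit step cost, and Manhattan heuristic are all assumptions, not details from the paper.

```python
import heapq


def astar(grid, start, goal):
    """Length of the shortest 4-connected path, or None if unreachable.

    grid[r][c] == 1 marks a forbidden cell (for example, a masked region).
    With unit step costs the Manhattan distance never overestimates, so the
    first time the goal is popped its cost is optimal.
    """
    rows, cols = len(grid), len(grid[0])

    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(h(start), 0, start)]
    best = {start: 0}
    while frontier:
        _, cost, (r, c) = heapq.heappop(frontier)
        if (r, c) == goal:
            return cost
        if cost > best[(r, c)]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ncost = cost + 1
                if ncost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ncost
                    heapq.heappush(frontier, (ncost + h((nr, nc)), ncost, (nr, nc)))
    return None


MASK = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(astar(MASK, (0, 0), (2, 0)))
```

Because the reference route is optimal by construction, a model's proposed route can be scored exactly against it instead of being judged by a second model.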

112. ❌ VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

Authors: Zhengbo Zhang, Jinbo Su, Zhaowen Zhou, Changtao Miao, Yuhan Hong, Qimeng Wu, Yumeng Liu, Feier Wu, Yihe Tian, Yuhao Liang, Zitong Shan, Wanke Xia, Yi-Fan Zhang, Bo Zhang, Zhe Li, Shiming Xiang, Ying Yan Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16289v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

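Scoring rationale: The paper focuses on browsing agents driven by multimodal large language models (MLLMs), that is, large models applied to agent systems. Its core contributions are a new visual-native search benchmark, VisBrowse-Bench, and an agent workflow for evaluating visual reasoning during search. 'LLM Agents' is therefore highly relevant (10 points), and 'Large Language Models' and 'Chain of Thought' are strongly relevant (8 points each) because the work involves multimodal LLMs and evaluates reasoning chains. 'Retrieval-Augmented Generation', 'System 2 Thinking', and 'Tool Use' have some relevance (5 points each) because the work involves text-image retrieval, in-depth reasoning, and agent tool use. Other keywords such as MoE, quantization, and alignment are not covered and score 0.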

!!! tip deepseek-chat TL;DR

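This paper addresses two shortcomings of existing benchmarks for multimodal browsing agents, insufficient evaluation of visual reasoning and neglect of the native visual information of web pages, by introducing VisBrowse-Bench, a new visual-native search benchmark. Experiments show that even the best-performing model, Claude-4.6-Opus, reaches only 47.6% accuracy, which indicates that the task remains difficult for current models.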

Abstract (Chinese translation)

多模态大语言模型（MLLMs）的快速发展使得浏览智能体能够在现实世界中获取并推理多模态信息。然而，现有基准测试存在两个局限性：对视觉推理能力的评估不足，以及在推理链中忽视了网页原生视觉信息。为应对这些挑战，我们提出了一个面向视觉原生搜索的新基准测试——VisBrowse-Bench。该基准包含169个涵盖多个领域的视觉问答实例，并通过文本-图像检索与联合推理实现多模态证据交叉验证，从而评估模型在搜索过程中的视觉推理能力。这些数据由专家采用多阶段流程构建，并经过严格的人工验证。我们还提出了一种智能体工作流，能够有效驱动浏览智能体在搜索过程中主动收集并推理视觉信息。我们在此工作流中对开源与闭源模型进行了全面评估。实验结果表明，即使表现最佳的模型Claude-4.6-Opus准确率也仅为47.6%，而专有的Deep Research模型o3-deep-research准确率仅为41.1%。代码与数据可通过以下链接获取：https://github.com/ZhengboZhang/VisBrowse-Bench

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. But existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and the neglect of native visual information of web pages in the reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple domains and evaluates the models’ visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. These data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluated both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus only achieves an accuracy of 47.6%, while the proprietary Deep Research model, o3-deep-research only achieves an accuracy of 41.1%. The code and data can be accessed at: https://github.com/ZhengboZhang/VisBrowse-Bench

Keywords: Multimodal Large Language Models, Browsing Agents, Visual Reasoning, Benchmark, VisBrowse-Bench, Multimodal Evidence, Agent Workflow, Visual-Native Search

113. ❌ Surrogate-Assisted Genetic Programming with Rank-Based Phenotypic Characterisation for Dynamic Multi-Mode Project Scheduling

Authors: Yuan Tian, Yi Mei, Mengjie Zhang Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16286v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

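Scoring rationale: The paper applies genetic programming (GP) and surrogate models to dynamic multi-mode project scheduling, which belongs to evolutionary computation and operations research. It does not involve large language models, deep learning, or the application of AI to science, so every keyword receives a relevance of 0.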

!!! tip deepseek-chat TL;DR

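This paper proposes a rank-based phenotypic characterisation scheme and a surrogate-assisted genetic programming algorithm for evolving high-quality heuristic rules for the dynamic multi-mode resource-constrained project scheduling problem. The approach identifies high-quality rules earlier than the state-of-the-art GP method while adding only marginal computational overhead.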

Abstract (Chinese translation)

动态多模式资源受限项目调度问题（DMRCPSP）具有重要的实际意义，因为它需要在项目状态和资源可用性不断变化的情况下进行实时决策。遗传规划（Genetic Programming, GP）已被证明能有效演化用于此类决策任务的启发式规则；然而，其演化过程通常依赖于大量基于仿真的适应度评估，导致计算成本高昂。代理模型为降低评估成本提供了一种有前景的解决方案，但将其应用于GP需要针对启发式规则设计问题特定的表型特征化（phenotypic characterisation, PC）方案。目前，缺乏适用于DMRCPSP的GP的合适PC方案。

本文提出了一种基于排序的PC方案，该方案源自决策情境中对合格活动-模式对及活动组的启发式驱动排序。由此生成的PC向量使代理模型能够估计未评估GP个体的适应度。基于此方案，本文开发了一种代理辅助的GP算法。实验结果表明，与当前最先进的DMRCPSP的GP方法相比，所提出的代理辅助GP能够持续更早地识别出高质量的启发式规则，同时仅引入微小的计算开销。进一步分析表明，代理模型为后代选择提供了有效指导，从而提升了演化效率。

Abstract

The dynamic multi-mode resource-constrained project scheduling problem (DMRCPSP) is of practical importance, as it requires making real-time decisions under changing project states and resource availability. Genetic Programming (GP) has been shown to effectively evolve heuristic rules for such decision-making tasks; however, the evolutionary process typically relies on a large number of simulation-based fitness evaluations, resulting in high computational cost. Surrogate models offer a promising solution to reduce evaluation cost, but their application to GP requires problem-specific phenotypic characterisation (PC) schemes of heuristic rules. There is currently a lack of suitable PC schemes for GP applied to DMRCPSP. This paper proposes a rank-based PC scheme derived from heuristic-driven ordering of eligible activity-mode pairs and activity groups in decision situations. The resulting PC vectors enable a surrogate model to estimate the fitness of unevaluated GP individuals. Based on this scheme, a surrogate-assisted GP algorithm is developed. Experimental results demonstrate that the proposed surrogate-assisted GP can identify high-quality heuristic rules consistently earlier than the state-of-the-art GP approach for DMRCPSP, while introducing only marginal computational overhead. Further analyses demonstrate that the surrogate model provides useful guidance for offspring selection, leading to improved evolutionary efficiency.

Keywords: Genetic Programming, Surrogate Model, Dynamic Multi-mode Project Scheduling, Phenotypic Characterisation, Heuristic Rules, Evolutionary Algorithm, Resource-constrained Scheduling, Computational Efficiency
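
The rank-based PC idea in the abstract above can be illustrated with a minimal sketch. It assumes that a heuristic rule is a priority function over candidate activity-mode pairs and that the surrogate is a nearest-neighbour lookup over previously evaluated rules; the abstract does not state which surrogate model is used, so the nearest-neighbour choice, the feature names (`duration`, `demand`), and the fitness values are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Candidate = Dict[str, float]         # hypothetical features of one activity-mode pair
Rule = Callable[[Candidate], float]  # heuristic rule: lower value = scheduled first


def rank_vector(rule: Rule, situations: Sequence[Sequence[Candidate]]) -> List[int]:
    """PC vector: the rank the rule assigns to every candidate in each decision situation."""
    pc: List[int] = []
    for candidates in situations:
        order = sorted(range(len(candidates)), key=lambda i: rule(candidates[i]))
        ranks = [0] * len(candidates)
        for rank, idx in enumerate(order):
            ranks[idx] = rank
        pc.extend(ranks)
    return pc


def surrogate_fitness(pc: List[int], archive: List[Tuple[List[int], float]]) -> float:
    """Assumed surrogate: reuse the fitness of the evaluated rule with the closest PC vector."""
    def dist(a: List[int], b: List[int]) -> int:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(archive, key=lambda entry: dist(pc, entry[0]))[1]


# Two fixed decision situations with made-up candidates.
situations = [
    [{"duration": 3, "demand": 2}, {"duration": 1, "demand": 5}, {"duration": 2, "demand": 1}],
    [{"duration": 4, "demand": 1}, {"duration": 2, "demand": 3}],
]
shortest_first = lambda c: c["duration"]
low_demand_first = lambda c: c["demand"]
scaled_duration = lambda c: 10 * c["duration"] + 1  # different expression, same ordering

# Archive of rules whose fitness was obtained by simulation (values are illustrative).
archive = [
    (rank_vector(shortest_first, situations), 12.0),
    (rank_vector(low_demand_first, situations), 20.0),
]
estimate = surrogate_fitness(rank_vector(scaled_duration, situations), archive)
```

Because the PC vector depends only on the ordering a rule induces, `scaled_duration` maps to the same vector as `shortest_first` and inherits its fitness without a new simulation. That is the sense in which a phenotypic characterisation lets syntactically different GP individuals share evaluations.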

114. ❌ Adaptive Theory of Mind for LLM-based Multi-Agent Coordination

Authors: Chunjiang Mu, Ya Zeng, Qiaosheng Zhang, Kun Shao, Chen Chu, Hao Guo, Danyang Jia, Zhen Wang, Shuyue Hu Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16264v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

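Scoring rationale: The paper's core topic is Theory of Mind (ToM) alignment in LLM-driven multi-agent coordination, so it is highly relevant to 'LLM Agents' and 'Multi-agent Systems' (10 points each). It involves agents reasoning about others' mental states, which falls under deep reasoning and chain-of-thought, so it is strongly relevant to 'Chain of Thought' and 'System 2 Thinking' (8 points each). The paper mentions 'alignment' (ToM alignment), which gives some relevance to 'Instruction Tuning OR Alignment OR Value Alignment' (5 points). The adaptive ToM mechanism adjusts based on interaction, which gives some relevance to 'Self-Correction OR Self-Improvement OR Self-Reflection' (5 points). Other keywords such as MoE, quantization, and RAG do not appear in the abstract and score 0.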

!!! tip deepseek-chat TL;DR

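This paper studies coordination failures in LLM-driven multi-agent collaboration that arise from mismatched Theory of Mind (ToM) reasoning depth. It proposes an adaptive ToM (A-ToM) agent that estimates its partner's ToM order and aligns with it to improve coordination, and validates the approach on four coordination tasks.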

Abstract (Chinese translation)

心智理论（Theory of Mind, ToM）指个体推理他人心理状态的能力，高阶心智理论则涉及考虑到他人同样拥有其自身的心智理论。长期以来，为基于大语言模型（LLM）的智能体赋予心智理论能力，被认为能提升其在多智能体协作任务中的协调性。然而，我们发现，心智理论阶数错位——即智能体之间心智理论推理深度的不匹配——可能导致对他人推理不足或过度，从而损害其协调能力。为解决这一问题，我们设计了一种自适应心智理论（A-ToM）智能体，它能够与协作伙伴在心智理论阶数上对齐。基于先前的交互，该智能体估计伙伴可能的心智理论阶数，并利用这一估计来预测伙伴的行动，从而促进行为协调。我们在四个多智能体协调任务上进行了实证评估：重复矩阵博弈、两个网格导航任务以及一个《Overcooked》游戏任务。结果验证了我们在心智理论对齐方面的发现，并证明了所提出的A-ToM智能体的有效性。此外，我们探讨了A-ToM方法在非基于大语言模型的智能体上的泛化能力，以及哪些因素会降低心智理论对齐的重要性。

Abstract

Theory of Mind (ToM) refers to the ability to reason about others’ mental states, and higher-order ToM involves considering that others also possess their own ToM. Equipping large language model (LLM)-driven agents with ToM has long been considered to improve their coordination in multiagent collaborative tasks. However, we find that misaligned ToM orders (mismatches in the depth of ToM reasoning between agents) can lead to insufficient or excessive reasoning about others, thereby impairing their coordination. To address this issue, we design an adaptive ToM (A-ToM) agent, which can align in ToM orders with its partner. Based on prior interactions, the agent estimates the partner’s likely ToM order and leverages this estimation to predict the partner’s action, thereby facilitating behavioral coordination. We conduct empirical evaluations on four multi-agent coordination tasks: a repeated matrix game, two grid navigation tasks and an Overcooked task. The results validate our findings on ToM alignment and demonstrate the effectiveness of our A-ToM agent. Furthermore, we discuss the generalizability of our A-ToM to non-LLM-based agents, as well as what would diminish the importance of ToM alignment.

Keywords: Theory of Mind, LLM-based agents, Multi-agent coordination, Adaptive ToM, ToM alignment, Behavioral coordination, Higher-order ToM, Agent collaboration
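
The estimate-then-predict loop described in the abstract above can be sketched with a simple level-k model. This is an illustrative assumption rather than the paper's LLM-based agent: an order-k reasoner is modelled as applying a best-response function k times to a fixed level-0 action, and the toy game (four actions, where the best response to action `a` is `(a + 1) % 4`) is hypothetical, chosen so that different orders produce different actions.

```python
from collections import Counter
from typing import Callable, List

Action = int
BestResponse = Callable[[Action], Action]


def action_at_order(order: int, level0: Action, best_response: BestResponse) -> Action:
    """Action of an order-k reasoner: best_response applied k times to the level-0 action."""
    action = level0
    for _ in range(order):
        action = best_response(action)
    return action


def estimate_partner_order(
    history: List[Action], level0: Action, best_response: BestResponse, max_order: int = 3
) -> int:
    """Pick the order whose predicted action matches the partner's past actions most often."""
    hits: Counter = Counter()
    for k in range(max_order + 1):
        predicted = action_at_order(k, level0, best_response)
        hits[k] = sum(1 for a in history if a == predicted)
    # Ties are broken in favour of the lower (cheaper) reasoning order.
    return max(range(max_order + 1), key=lambda k: (hits[k], -k))


def adaptive_action(
    history: List[Action], level0: Action, best_response: BestResponse, max_order: int = 3
) -> Action:
    """Predict the partner's next action from its estimated order, then respond to it."""
    k = estimate_partner_order(history, level0, best_response, max_order)
    return best_response(action_at_order(k, level0, best_response))


# Hypothetical cyclic game: the best response to action a is (a + 1) % 4.
cyclic = lambda a: (a + 1) % 4
partner_history = [2, 2, 1, 2]  # mostly consistent with an order-2 partner
estimated = estimate_partner_order(partner_history, level0=0, best_response=cyclic)
my_action = adaptive_action(partner_history, level0=0, best_response=cyclic)
```

An agent fixed at a lower or higher order than its partner would respond to the wrong predicted action, which is the under- or over-reasoning failure the abstract describes.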

115. ❌ Efficient Reasoning on the Edge

Authors: Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16867v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

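Scoring rationale: The paper's core topic is efficient reasoning with large language models (LLMs) on edge devices. It directly involves the keywords LLMs, small language models (SLMs/On-device AI), supervised fine-tuning (SFT), parameter-efficient fine-tuning (LoRA), KV-cache optimization, and chain-of-thought (CoT) reasoning, which are the paper's core techniques. Inference acceleration relates to the paper's optimization goal but is not a core method, so it scores 5. Other keywords such as MoE, data quality, alignment, and RAG are not addressed in the paper and score 0.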

!!! tip deepseek-chat TL;DR

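To address inefficient reasoning when large language models are deployed on edge devices, this paper proposes a lightweight approach that combines LoRA adapters, supervised fine-tuning, and budget forcing via reinforcement learning. With dynamic adapter switching and a KV-cache sharing strategy, it achieves efficient, accurate reasoning on Qwen2.5-7B under strict resource constraints.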

Abstract (Chinese translation)

具备思维链推理能力的大语言模型在复杂问题求解任务中实现了最先进的性能，但其冗长的推理轨迹和庞大的上下文需求使其难以在边缘设备上实际部署。这些挑战包括高昂的令牌生成成本、巨大的KV缓存占用空间，以及将推理能力蒸馏到适用于移动设备的小型模型时存在的效率低下问题。现有方法通常依赖于将大型模型的推理轨迹蒸馏到小型模型中，这些轨迹冗长且存在风格冗余，不利于设备端推理。在本研究中，我们提出一种轻量级方法，通过结合LoRA适配器与监督微调，使小型大语言模型具备推理能力。我们进一步通过强化学习对这些适配器引入预算强制机制，在精度损失最小的情况下显著缩短响应长度。为应对内存受限的解码问题，我们利用并行测试时缩放技术，以微小的延迟增加换取精度提升。最后，我们提出一种动态适配器切换机制，仅在需要时激活推理能力，以及在提示编码阶段采用KV缓存共享策略，从而减少设备端推理的首令牌生成时间。在Qwen2.5-7B模型上的实验表明，我们的方法在严格资源限制下实现了高效、准确的推理，使得大语言模型推理在移动场景中具备实用性。展示该方案在移动设备上运行效果的视频已发布于项目页面。

Abstract

Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, which are verbose and stylistically redundant, undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.

Keywords: Large language models, On-device inference, Chain-of-thought reasoning, LoRA adapters, Supervised fine-tuning, KV-cache optimization, Mobile deployment, Efficient reasoning
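
Two ideas from the abstract above, activating reasoning only when needed and parallel test-time scaling, can be sketched together. The sketch rests on assumptions: the abstract does not say how parallel samples are aggregated, so majority voting is used here as one common choice, and `stub_router`, `stub_base`, and `stub_reasoner` are hypothetical stand-ins for a routing decision, the base model, and the model with its reasoning adapter enabled.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the sampled candidates."""
    return Counter(answers).most_common(1)[0][0]


def answer(
    prompt: str,
    needs_reasoning: Callable[[str], bool],
    base_model: Callable[[str], str],
    reasoning_adapter: Callable[[str, int], str],
    n_samples: int = 4,
) -> str:
    """Use the cheap path when possible; otherwise sample several reasoning runs and vote."""
    if not needs_reasoning(prompt):
        return base_model(prompt)
    # On a device these samples would be decoded as one batch; a loop keeps the sketch simple.
    samples = [reasoning_adapter(prompt, seed) for seed in range(n_samples)]
    return majority_vote(samples)


# Stub components, for illustration only.
def stub_base(prompt: str) -> str:
    return "hello"


def stub_reasoner(prompt: str, seed: int) -> str:
    return "41" if seed == 2 else "42"  # one of the four samples is wrong


def stub_router(prompt: str) -> bool:
    return any(ch.isdigit() for ch in prompt)


easy = answer("hi there", stub_router, stub_base, stub_reasoner)
hard = answer("What is 6 times 7?", stub_router, stub_base, stub_reasoner)
```

Decoding is typically memory-bound, so a batch of samples can share the same weight reads. This is why the abstract can describe the accuracy gain as costing only a minor latency increase.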

116. ❌ Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Authors: Sahil Sen, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16862v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

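Scoring rationale: The paper's core topic is a long-term memory system for LLM-driven conversational agents (LLM Agents), which uses structured event retrieval (Retrieval-Augmented Generation) and tool calling (Tool Use) to answer time-sensitive multi-hop queries. The paper explicitly builds on LLMs, touches on long-context handling (Context Window Extension) and multi-step reasoning (Chain of Thought), and uses dynamic prompting, which relates to in-context learning (In-context Learning). Other keywords such as MoE, SFT, and RLHF are not mentioned in the abstract and are unrelated to the paper.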

!!! tip deepseek-chat TL;DR

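To address the difficulty LLM conversational agents have with facts and preferences that evolve over time in long-term interactions, and their lack of effective retrieval strategies, this paper proposes the Chronos framework. Using structured event retrieval and tool calling, it sets a new state of the art, reaching 95.60% accuracy on the LongMemEvalS benchmark.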

Abstract (Chinese translation)

大型语言模型（LLM）的最新进展使得对话式人工智能代理能够进行持续数周或数月的多轮交互。然而，现有的记忆系统难以对跨越数月交互过程中基于时间的事实和不断演变的偏好进行推理，并且缺乏针对长对话历史中多跳、时间敏感查询的有效检索策略。我们提出了Chronos，一种新颖的时序感知记忆框架，它将原始对话分解为带有已解析日期时间范围和实体别名的“主-谓-宾”事件元组，并将其索引到一个结构化的事件日历中，同时辅以一个保存完整对话上下文的话轮日历。在查询时，Chronos应用动态提示为每个问题生成定制的检索指导，指示代理检索什么、如何跨时间范围筛选，以及如何通过对两个日历的迭代工具调用循环来进行多跳推理。我们在包含六大类对话历史任务、共计500个问题的LongMemEvalS基准上，使用8个开源和闭源LLM对Chronos进行了评估。Chronos Low版本达到了92.60%的准确率，Chronos High版本达到了95.60%的准确率，创造了新的最优性能，较先前最佳系统提升了7.67%。消融实验结果表明，事件日历组件相对于基线带来了58.9%的性能增益，而所有其他组件则贡献了15.5%至22.3%不等的提升。值得注意的是，仅Chronos Low版本的表现就已超越了先前在其最强模型配置下评估的所有方法。

Abstract

Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark comprising 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal the events calendar accounts for a 58.9% gain on the baseline while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.

Keywords: Large Language Models, Conversational Agents, Long-Term Memory, Structured Event Retrieval, Temporal Awareness, Tool Calling, Multi-hop Reasoning, Dialogue History
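
The event calendar described in the abstract above stores subject-verb-object tuples with resolved datetime ranges and entity aliases. A minimal sketch of such a structure, with a time-range filter of the kind a retrieval tool might expose, could look as follows. The `Event` fields, the `query` function, and the sample data are hypothetical and do not reflect the authors' schema or tool interface.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, List


@dataclass(frozen=True)
class Event:
    subject: str
    verb: str
    obj: str
    start: datetime                      # resolved start of the event's time range
    end: datetime                        # resolved end of the event's time range
    aliases: FrozenSet[str] = frozenset()  # other names for the entities involved


def query(calendar: List[Event], entity: str, start: datetime, end: datetime) -> List[Event]:
    """Return events that mention `entity` and overlap [start, end], oldest first."""
    name = entity.lower()
    hits = [
        e for e in calendar
        if e.start <= end and e.end >= start
        and name in ({e.subject.lower(), e.obj.lower()} | {a.lower() for a in e.aliases})
    ]
    return sorted(hits, key=lambda e: e.start)


calendar = [
    Event("user", "moved to", "Berlin", datetime(2025, 1, 10), datetime(2025, 1, 10),
          frozenset({"me"})),
    Event("user", "adopted", "a cat", datetime(2025, 3, 2), datetime(2025, 3, 2)),
    Event("Alice", "visited", "user", datetime(2025, 3, 20), datetime(2025, 3, 25)),
]

march = query(calendar, "user", datetime(2025, 3, 1), datetime(2025, 3, 31))
january = query(calendar, "me", datetime(2025, 1, 1), datetime(2025, 1, 31))
```

In an iterative tool-calling loop, an agent could issue several such queries with different entities and time windows, then chain the results to answer a multi-hop question.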

117. ❌ Online Experiential Learning for Language Models

Authors: Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16856v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

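Scoring rationale: The paper proposes the Online Experiential Learning (OEL) framework, which lets language models improve continuously from deployment experience; its core concerns online learning and self-improvement mechanisms for large language models (LLMs). It is highly relevant to 'Large Language Models' (10 points) because the work targets an improvement paradigm for LLMs, and highly relevant to 'Self-Correction/Self-Improvement' (10 points) because the OEL framework is in essence a self-improvement process. It has some relevance to 'Chain of Thought' and 'System 2 Thinking' (5 points each) because the paper evaluates thinking model variants, and some relevance to 'LLM Agents' (5 points) because it involves interaction trajectories and a policy model. Other keywords such as MoE, SFT, and RAG are not addressed in the paper and score 0.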

!!! tip deepseek-chat TL;DR

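To address the fact that large language models cannot exploit real-world deployment experience, this paper proposes the Online Experiential Learning (OEL) framework, which lets a model keep improving by extracting knowledge from interaction trajectories and consolidating it into its parameters. Experiments in text-based game environments show gains in task accuracy and token efficiency.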

Abstract (Chinese translation)

当前改进大型语言模型的主流范式依赖于基于人工标注或模拟环境的离线训练，完全未能利用现实世界部署中积累的丰富经验。我们提出在线经验学习框架，该框架使语言模型能够从其自身部署经验中持续改进。OEL 分两个阶段运行：首先，从用户端收集的交互轨迹中提取并积累可迁移的经验知识；其次，通过同策略上下文蒸馏将此知识整合到模型参数中，且无需访问用户端环境。这两个阶段迭代进行，形成一个在线学习循环：改进后的模型收集更高质量的轨迹，从而为后续轮次提供更丰富的经验知识。我们在基于文本的游戏环境中评估了OEL，涵盖多种模型规模以及思维链与非思维链变体。OEL在连续迭代中实现了持续的性能提升，在保持分布外泛化能力的同时，提高了任务准确性和令牌使用效率。我们的分析进一步表明，提取的经验知识比原始轨迹显著更有效，且知识源与策略模型之间的同策略一致性对于有效学习至关重要。

Abstract

The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.

Keywords: Online Experiential Learning, language models, deployment experience, interaction trajectories, on-policy context distillation, self-improvement, text-based game environments, token efficiency

118. ❌ Mediocrity is the key for LLM as a Judge Anchor Selection

Authors: Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16848v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

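Scoring rationale: The paper focuses on how anchor selection affects the reliability of results in the LLM-as-a-judge paradigm, and belongs to research on evaluation methods for large models. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) because the whole study revolves around the LLM evaluation paradigm. It does not involve the specific techniques of the other keywords (such as MoE, SFT, or RAG) or their application areas (such as bioinformatics), so those keywords score 0.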

!!! tip deepseek-chat TL;DR

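This paper studies how the choice of anchor model in LLM-as-a-judge benchmarks strongly affects the reliability of the results. It finds that models with extreme performance make poor anchors, and offers guidelines for selecting informative anchors along with recommendations for sufficient benchmark sizes.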

Abstract (Chinese translation)

“大语言模型即评委”范式已成为评估开放式生成任务的标准方法。为应对成对比较带来的二次方扩展成本，主流基准测试（如Arena-Hard和AlpacaEval）采用将所有模型与单一锚点模型对比的策略。然而，尽管该方法被广泛使用，锚点选择对结果可靠性的影响尚未得到充分探索。本研究通过基于Arena-Hard-v2.0数据集系统评估22个不同锚点模型，系统探究了锚点选择的影响。研究发现锚点选择至关重要：不当的锚点会显著降低与人类排序结果的相关性。我们指出常见的锚点选择策略（选择性能最优或最差的模型）效果不佳，因为这些极端锚点模型始终优于或劣于所有其他模型，几乎无法反映模型间的相对排序关系。我们进一步量化了锚点选择的影响程度，证明其影响与评委模型的选择相当。最后提出两项可操作性建议：首先通过功效分析计算基于锚点的评估所需基准规模，发现现有标准基准规模对成对评估不足，难以可靠区分竞争模型；其次提供选择信息量充分锚点的指导原则，以确保评估实践的可靠性与高效性。

Abstract

The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.

Keywords: LLM-as-a-judge, anchor selection, evaluation reliability, Arena-Hard, human rankings correlation, benchmark size, pairwise comparison, model evaluation
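
Two claims in the abstract above lend themselves to a small numerical illustration: the saving from comparing against a single anchor, and the reason an extreme anchor carries little ranking information. The sketch assumes a logistic (Bradley-Terry-style) model of win probability with made-up strength values; this model is an illustrative assumption and is not taken from the paper.

```python
import math
from typing import List, Tuple


def comparisons(n_models: int) -> Tuple[int, int]:
    """Per-prompt comparison counts: (all pairs, anchor-based with the anchor among the models)."""
    return n_models * (n_models - 1) // 2, n_models - 1


def win_rate(model_strength: float, anchor_strength: float) -> float:
    """Assumed logistic win probability of a model against the anchor."""
    return 1.0 / (1.0 + math.exp(-(model_strength - anchor_strength)))


def win_rate_spread(strengths: List[float], anchor_strength: float) -> float:
    """Gap between the best and worst win rate; a small gap is easily swamped by noise."""
    rates = [win_rate(s, anchor_strength) for s in strengths]
    return max(rates) - min(rates)


strengths = [0.0, 1.0, 2.0, 3.0, 4.0]                    # hypothetical model strengths
mid_spread = win_rate_spread(strengths, 2.0)             # mediocre anchor
extreme_spread = win_rate_spread(strengths, 10.0)        # anchor far stronger than every model
pairwise, anchored = comparisons(10)
```

Under these assumptions the mediocre anchor spreads the win rates over roughly 0.76 of the unit interval, while the dominant anchor compresses them into less than 0.01. Separating models in the second case would need far more judged prompts, which is consistent with the abstract's point about benchmark size.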

119. ❌ SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

Authors: Jonggeun Lee, Junseong Pyo, Jeongmin Park, Yohan Jo Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16783v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

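Scoring rationale: The paper focuses on task-oriented spoken dialogue systems, introducing the SpokenTOD dataset and the SpokenUS spoken user simulator. Although it belongs to applied AI, its content mainly concerns speech processing, dialogue system architecture, and dataset construction, and has no direct connection to the topics in the scoring keyword list, such as large-model techniques, deep learning principles, or AI for science. None of the keywords appear in the title or abstract, and none of the related technical concepts are involved.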

!!! tip deepseek-chat TL;DR

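To address the lack of large-scale, diverse data for task-oriented spoken dialogue systems, this paper introduces SpokenTOD, a dataset of 52,390 dialogues, and SpokenUS, a spoken user simulator with a dedicated barge-in architecture. The authors position SpokenUS as a practical tool for training and evaluating more robust spoken dialogue systems.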

Abstract (Chinese translation)

构建鲁棒的任务导向型口语对话系统需要充分接触人类通过语音进行交互的多样性。为此，开发能够模拟这种多样性的口语用户模拟器需要大规模涵盖口语用户行为的任务导向型口语对话数据，然而现有数据集在规模和领域覆盖上均存在局限，且缺乏系统化的数据增强流程。为解决这一问题，我们提出了SpokenTOD数据集，该数据集包含52,390个对话和1,034小时的语音，并在多说话人和多领域场景中增强了四种口语用户行为——跨轮槽位填充、语音打断、言语不流畅和情感韵律。基于SpokenTOD，我们进一步提出了SpokenUS，这是一个基于任务导向型对话构建的口语用户模拟器，其采用专为处理语音打断而设计的架构。SpokenUS在目标覆盖率上与规模大得多的模型表现相当，同时在人类平均意见得分上显著优于所有基线模型；它能够像人类一样在对话过程中逐步透露槽位值，而非在对话初期就全部给出。进一步分析证实，SpokenUS模拟的口语行为对下游智能体构成了实质性挑战，使其成为训练和评估更鲁棒口语对话系统的实用工具。

Abstract

Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce SpokenTOD, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody) across diverse speakers and domains. Building on SpokenTOD, we present SpokenUS, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all baselines in Human MOS, disclosing slot values gradually across the dialogue as humans do rather than front-loading them. Further analysis confirms that SpokenUS’s spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.

Keywords: spoken dialogue, task-oriented dialogue, user simulator, speech data, barge-in, dialogue systems, SpokenTOD, SpokenUS

120. ❌ Probing Cultural Signals in Large Language Models through Author Profiling

Authors: Valentin Lafargue, Ariel Guerra-Adames, Emmanuelle Claeys, Elouan Vuichard, Jean-Michel Loubes Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16749v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

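Scoring rationale: The paper's core topic is cultural bias in large language models (LLMs); it evaluates several open-source LLMs on a zero-shot author profiling task and introduces fairness metrics to quantify the bias. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The paper involves analysis of the models' internal representations, which gives some relevance to 'Mechanistic Interpretability OR Explainable AI' (5 points), though this is not its core. Other keywords (such as MoE, SFT, RAG, and quantization) are not mentioned or touched on in the abstract and score 0.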

!!! tip deepseek-chat TL;DR

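This study uses a zero-shot author profiling task to evaluate the cultural biases that large language models encode when analysing song lyrics. It finds systematic differences in cultural alignment across models and introduces fairness metrics to quantify each model's degree of bias.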

Abstract (Chinese translation)

大型语言模型（LLMs）正日益广泛地应用于具有社会影响的应用中，这引发了人们对其所编码的文化偏见的担忧。我们通过评估LLMs能否在零样本设置下根据歌词进行作者画像分析来探究这些表征，即在无需任务特定微调的情况下推断歌手的性别和种族。在对超过10,000首歌词进行评测的多个开源模型中，我们发现LLMs取得了显著的画像分析性能，但表现出系统性的文化对齐倾向：大多数模型默认偏向北美种族，而DeepSeek-1.5B则更强烈地对齐亚洲种族。这一发现既源于模型的预测分布，也源于对其生成推理的分析。为了量化这些差异，我们引入了两个公平性指标：模态准确度差异（Modality Accuracy Divergence, MAD）和召回率差异（Recall Divergence, RD），并表明在评估的模型中，Ministral-8B表现出最强的种族偏见，而Gemma-12B则展现出最平衡的行为。我们的代码已在GitHub上公开（https://github.com/ValentinLafargue/CulturalProbingLLM）。

Abstract

Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting, inferring singers’ gender and ethnicity without task-specific fine-tuning. Across several open-source models evaluated on more than 10,000 lyrics, we find that LLMs achieve non-trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek-1.5B aligns more strongly with Asian ethnicity. This finding emerges from both the models’ prediction distributions and an analysis of their generated rationales. To quantify these disparities, we introduce two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), and show that Ministral-8B displays the strongest ethnicity bias among the evaluated models, whereas Gemma-12B shows the most balanced behavior. Our code is available on GitHub (https://github.com/ValentinLafargue/CulturalProbingLLM).

Keywords: Large Language Models, Cultural Bias, Author Profiling, Zero-shot Learning, Fairness Metrics, Song Lyrics, Ethnicity Bias, Model Evaluation

121. ❌ Domain Mixture Design via Log-Likelihood Differences for Aligning Language Models with a Target Model

Authors: Ryo Kishino, Riku Shiomi, Hiroaki Yamagiwa, Momose Oyama, Hidetoshi Shimodaira Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16622v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

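Scoring rationale: The paper addresses language model alignment: it designs the domain mixture weights of the pretraining data so that a base model aligns with a target model in distribution. It is highly relevant to 'Large Language Models' (10) because the study is built on language models; highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10) because the method designs the domain mixture for pretraining or continued pretraining; and highly relevant to 'Instruction Tuning OR Alignment OR Value Alignment' (10) because model alignment is the core problem. Other keywords such as MoE, SLMs, SFT and RLHF are unrelated to the paper (0).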

!!! tip deepseek-chat TL;DR

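The paper studies how to align a language model with a target model by designing the domain mixture weights of the pretraining data. Experiments show that the method reduces the KL divergence to the target model and tends to bring downstream task performance closer to that of the target.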

摘要翻译

本研究并未直接对语言模型进行蒸馏，而是通过设计一种固定的训练方案——即调整预训练或持续预训练中训练数据的领域混合比例——来解决基础模型与目标模型在分布上对齐的问题。我们提出了一种确定领域权重的方法，该方法将模型视为对数似然空间中的点，并使训练更新方向与指向目标模型的方向对齐。在NanoGPT上的实验表明，与在Pile数据集上采用均匀权重的方法相比，所提方法能持续降低与目标模型之间的KL散度。尽管在可行的情况下知识蒸馏仍然更为有效，但所提方法仍能实现有意义的对齐，并且下游任务性能也往往更接近目标模型。

摘要 (Abstract)

Instead of directly distilling a language model, this study addresses the problem of aligning a base model with a target model in distribution by designing the domain mixture of training data for pretraining or continued pretraining as a fixed training recipe. We propose a method for determining domain weights by viewing models as points in log-likelihood space and aligning the training update direction with the direction toward the target model. Experiments with NanoGPT show that the proposed method consistently reduces the KL divergence to the target model compared with uniform weighting over the Pile. Although knowledge distillation remains more effective when available, the proposed method still achieves meaningful alignment, and downstream task performance also tends to become closer to that of the target model.

Keywords: language model alignment, domain mixture design, pretraining, log-likelihood space, KL divergence, target model, training data weighting, distribution alignment
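
As a rough illustration of the geometric idea in the abstract, the sketch below treats each model as a vector of log-likelihoods over a small probe set and picks domain weights on the probability simplex so that the weighted update direction points toward the target model. Everything in it is an assumption made for illustration: the toy vectors, the per-domain update directions `domain_dirs`, the least-squares objective, the projected-gradient solver and the function names are hypothetical and are not taken from the paper.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = sorted(v, reverse=True)
    running_sum, theta = 0.0, 0.0
    for i, u_i in enumerate(u, start=1):
        running_sum += u_i
        candidate = (running_sum - 1.0) / i
        if u_i - candidate > 0:
            theta = candidate  # last index satisfying the condition wins
    return [max(x - theta, 0.0) for x in v]


def fit_domain_weights(base_ll, target_ll, domain_dirs, steps=500, lr=0.05):
    """Minimise ||sum_k w_k * d_k - (target - base)||^2 over the simplex.

    domain_dirs[k] is a hypothetical estimate of how training on domain k
    moves the model in log-likelihood space. The step size lr suits this toy
    problem and may need tuning for other inputs.
    """
    goal = [t - b for t, b in zip(target_ll, base_ll)]
    k = len(domain_dirs)
    weights = [1.0 / k] * k  # start from uniform weighting
    for _ in range(steps):
        mix = [sum(weights[j] * domain_dirs[j][i] for j in range(k))
               for i in range(len(goal))]
        residual = [m - g for m, g in zip(mix, goal)]
        grad = [2.0 * sum(r * d for r, d in zip(residual, domain_dirs[j]))
                for j in range(k)]
        weights = project_to_simplex(
            [w - lr * g for w, g in zip(weights, grad)])
    return weights


# Hypothetical toy inputs: three probe texts, three domains.
base_ll = [-3.0, -3.0, -3.0]
target_ll = [-2.0, -3.0, -2.5]
domain_dirs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = fit_domain_weights(base_ll, target_ll, domain_dirs)
print([round(w, 3) for w in weights])
```

In this toy setup the second domain receives essentially no weight, because the target does not differ from the base model on the probe text that domain affects.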

122. ❌ Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy

Authors: Zhaoxin Feng, Zheng Chen, Jianfei Ma, Yip Tin Po, Emmanuele Chersoni, Bo Li Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16643v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

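Scoring rationale: The paper studies sycophancy in LLMs, a side effect of alignment techniques, so it is highly relevant to 'Alignment' (10). It focuses on the role of Chain-of-Thought (CoT) reasoning in this behaviour, so it is highly relevant to 'Chain of Thought' (10). It analyses and evaluates LLM behaviour, so it is highly relevant to 'Large Language Models' (10). It asks whether reasoning acts as a logical constraint or as a tool for post-hoc rationalization, which touches on in-depth reasoning and interpretability, so it is relevant to 'System 2 Thinking' (8) and 'Mechanistic Interpretability' (8). Sycophancy bears on factuality and truthfulness, so the paper is related to 'Hallucination Mitigation' (8). 'Self-Correction/Self-Reflection' (5) may be indirectly relevant, because the paper examines how models adjust their behaviour through reasoning. Other keywords, such as MoE, SLMs, scaling laws, training techniques, RAG, context extension, inference acceleration, agents, quantization and AI for science, have no direct connection to the paper and score 0.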

!!! tip deepseek-chat TL;DR

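The paper studies the role of Chain-of-Thought reasoning in mitigating and masking sycophancy in large language models. It finds that reasoning generally reduces sycophancy in final decisions but also masks it in some samples by constructing deceptive justifications, and that LLMs are more prone to sycophancy in subjective tasks and under authority bias.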

摘要翻译

对齐技术常无意中诱发大语言模型的谄媚行为。既往研究多在直接回答场景中探讨该现象，而思维链推理的作用尚未被充分探究：它究竟是作为缓解谄媚性的逻辑约束，还是成为掩盖该行为的后验合理化工具？我们通过一系列主客观任务对多种模型进行评估以探究此问题。结果显示，推理过程通常能降低最终决策中的谄媚倾向，但在部分样本中也会掩盖谄媚行为——模型会通过逻辑矛盾、计算错误、片面论证等方式构建具有欺骗性的合理化论述。此外，大语言模型在主观性任务及权威偏见影响下更容易表现出谄媚倾向。我们对三个开源模型的机制分析表明，谄媚倾向在推理过程中呈动态变化，而非在输入阶段预先决定。

摘要 (Abstract)

Alignment techniques often inadvertently induce sycophancy in LLMs. While prior studies studied this behaviour in direct-answer settings, the role of Chain-of-Thought (CoT) reasoning remains under-explored: does it serve as a logical constraint that mitigates sycophancy, or a tool for post-hoc rationalization that masks it? We evaluate a range of models across objective and subjective tasks to investigate the issue. Results show that reasoning generally reduces sycophancy in final decisions but also masks sycophancy in some samples, where models construct deceptive justifications through logical inconsistencies, calculation errors, and one-sided arguments etc. Furthermore, LLMs are more prone to sycophancy in subjective tasks and under authority-bias. Our mechanistic analysis on three open-source models reveals that the tendency of sycophancy is dynamic during the reasoning process rather than being pre-determined at the input stage.

Keywords: LLM sycophancy, Chain-of-Thought reasoning, alignment, reasoning mitigation, post-hoc rationalization, authority bias, mechanistic analysis, subjective tasks

Authors: Omnilingual SONAR Team, João Maria Janeiro, Pere-Lluís Huguet Cabot, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramírez, Loic Barrault, Belen Alastruey, Yu-An Chung, Marta R. Costa-Jussa, David Dale, Kevin Heffernan, Jaehyeong Jo, Artyom Kozhevnikov, Alexandre Mourachko, Christophe Ropers, Holger Schwenk, Paul-Ambroise Duquenne Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16606v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

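Scoring rationale: The paper studies cross-lingual and cross-modal sentence embedding models; its core contribution is a single semantic space that embeds text, speech, code and mathematical expressions. Relevance to the keywords: (1) the paper explicitly uses an LLM-initialized encoder-decoder, so it is highly relevant to 'Large Language Models' (8); (2) it uses progressive training, comprising foundational-space learning and teacher-student distillation, which involves pretraining and domain adaptation concepts, so it is relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8); (3) other keywords such as MoE, SLMs, SFT, RAG, inference acceleration and quantization are not mentioned in the abstract or are unrelated to the paper's core content, so they score 0.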

!!! tip deepseek-chat TL;DR

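The paper proposes OmniSONAR, which addresses the quality degradation of cross-lingual and cross-modal sentence embeddings at the scale of thousands of languages. Through progressive training and LLM initialization, it embeds text, speech, code and mathematical expressions in a single semantic space and substantially improves performance on several benchmarks.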

摘要翻译

跨语言句子编码器通常仅覆盖数百种语言，且常以牺牲下游任务性能为代价换取更强的语义对齐，这限制了其应用范围。我们推出OmniSONAR——一个全新的全语种、跨语言与跨模态句子嵌入模型系列，能够将文本、语音、代码及数学表达式原生嵌入到统一的语义空间中，同时在涵盖数千种语言（从高资源到极低资源变体）的规模上实现最先进的下游性能。为在如此规模下避免表征坍缩，我们采用渐进式训练方法：首先，通过LLM初始化的编码器-解码器架构，结合词元级解码任务与新颖的分割软最大化对比损失及合成困难负样本，为200种语言构建强健的基础语义空间；在此基础上，通过两阶段师生编码器蒸馏框架将模型扩展至数千种语言变体；最后，通过将177种口语无缝映射至该空间，验证了其跨模态扩展能力。OmniSONAR在200种语言的FLORES数据集上将跨语言相似性搜索误差降低一半，在1,560种语言的BIBLE基准测试中误差减少至原来的1/15。该模型同时具备强大的翻译能力，在多语言基准测试中超越NLLB-3B模型，并在1,560种语言至英语的BIBLE翻译任务中以15个chrF++点的优势超越先前模型（包括规模更大的LLM）。OmniSONAR在MTEB和XLCoST基准测试中也表现优异。对于语音模态，尽管仅基于自动语音识别（ASR）数据训练且为零样本翻译模型，OmniSONAR仍将相似性搜索误差降低43%，并达到SeamlessM4T语音转文本质量的97%。最后，通过训练专用于处理OmniSONAR嵌入序列的英语文本编码器-解码器语言模型Spectrum，我们实现了向数千种语言及语音模态的高性能迁移，以支持复杂下游任务。

摘要 (Abstract)

Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousands language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on 1,560 languages into English BIBLE translation. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.

Keywords: cross-lingual sentence embeddings, cross-modal embeddings, omnilingual models, progressive training, LLM-initialized encoder-decoder, teacher-student distillation, multilingual benchmarks, speech-text alignment
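
The abstract reports results as a cross-lingual similarity search error. A minimal sketch of how such an error rate is commonly computed is given below, assuming aligned lists of source and target sentence embeddings and plain cosine similarity. The function names and toy vectors are hypothetical, and the paper's exact scoring (for example, any margin-based variant) may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_search_error(src_embs, tgt_embs):
    """Fraction of source sentences whose nearest target is not its translation.

    src_embs[i] and tgt_embs[i] are assumed to embed the same sentence in two
    languages, so a retrieval counts as correct when the nearest target index
    equals i.
    """
    errors = 0
    for i, src in enumerate(src_embs):
        nearest = max(range(len(tgt_embs)),
                      key=lambda j: cosine(src, tgt_embs[j]))
        if nearest != i:
            errors += 1
    return errors / len(src_embs)


# Hypothetical toy embeddings: two sentences in two languages.
source = [[1.0, 0.0], [0.0, 1.0]]
aligned_target = [[0.9, 0.1], [0.1, 0.9]]
swapped_target = [[0.1, 0.9], [0.9, 0.1]]
print(similarity_search_error(source, aligned_target))
print(similarity_search_error(source, swapped_target))
```

With the aligned targets every source sentence retrieves its own translation, while swapping the targets makes every retrieval wrong.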

124. ❌ When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective

Authors: Zelin Zhang, Fei Cheng, Chenhui Chu Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16578v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

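Scoring rationale: The paper studies unsupervised reinforcement learning for the mathematical reasoning of LLMs. It is highly relevant to 'Large Language Models' (10), and because mathematical reasoning involves multi-step and in-depth reasoning, it is highly relevant to 'Chain of Thought' and 'System 2 Thinking' (10 each). Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG and Quantization are not mentioned in the abstract and are unrelated to the paper (0).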

!!! tip deepseek-chat TL;DR

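The paper studies when unsupervised reinforcement learning succeeds or fails at improving the mathematical reasoning of large language models. It designs intrinsic rewards, tests how a model's foundational logical prior affects the outcome, and introduces a geometric diagnostic lens showing that successful cases are enveloped by manifolds.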

摘要翻译

尽管基于结果的强化学习显著提升了大型语言模型的数学推理能力，但其对计算成本高昂的真实标注的依赖造成了严重的可扩展性瓶颈。由内在奖励引导的无监督强化学习提供了一种可扩展的替代方案，但其训练动态不透明且存在灾难性不稳定问题，例如策略崩溃和奖励黑客攻击。本文首先设计并评估了一系列内在奖励机制，这些机制明确强制模型生成简洁且确定的输出。其次，为探索该方法的边界，我们在不同内在推理能力水平上测试了基础模型，揭示了模型的基础逻辑先验如何决定其成功或失败。最后，为阐明某些配置能够稳定而另一些配置会崩溃的原因，我们引入了一种新颖的几何诊断视角，表明成功案例被流形结构所包裹。最终，我们的工作不仅证明了强制简洁确定的响应能有效提升数学推理能力，更进一步揭示了这种无监督方法何时会失效，并从几何角度诊断了其原因。

摘要 (Abstract)

Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite of intrinsic rewards that explicitly enforce concise and certain generation. Second, to discover the boundaries of this approach, we test base models across a spectrum of intrinsic reasoning capabilities, revealing how a model’s foundational logical prior dictates its success or failure. Finally, to demystify why certain configurations stabilize while others collapse, we introduce a novel geometric diagnostic lens, showing that successful cases are enveloped by manifolds. Ultimately, our work goes beyond merely demonstrating that enforcing concise and certain responses successfully boosts mathematical reasoning; we reveal when this unsupervised approach breaks down and geometrically diagnose why.

Keywords: Unsupervised Reinforcement Learning, Mathematical Reasoning, Large Language Models, Intrinsic Rewards, Policy Collapse, Reward Hacking, Geometric Diagnostic, Manifold Envelopment
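
The abstract describes intrinsic rewards that enforce "concise and certain generation" without giving their form. One plausible shape for such a reward is sketched below, assuming access to the next-token distribution at each generation step: low mean token entropy stands in for certainty and a length penalty stands in for conciseness. The formula, the `length_weight` and `max_len` parameters, and the function names are all hypothetical and are not the rewards evaluated in the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def intrinsic_reward(step_distributions, max_len=256, length_weight=0.5):
    """Label-free reward: higher for confident (low-entropy) and short outputs.

    step_distributions holds one probability distribution per generated token.
    Both terms are illustrative assumptions, not the paper's definitions.
    """
    n_tokens = len(step_distributions)
    if n_tokens == 0:
        return 0.0
    mean_entropy = sum(token_entropy(d) for d in step_distributions) / n_tokens
    length_penalty = length_weight * min(n_tokens, max_len) / max_len
    return -mean_entropy - length_penalty


# Hypothetical rollouts: a short confident one and a long uncertain one.
confident_short = [[0.97, 0.02, 0.01]] * 10
uncertain_long = [[0.4, 0.3, 0.3]] * 100
print(intrinsic_reward(confident_short) > intrinsic_reward(uncertain_long))
```

A reward of this kind needs no ground-truth answer, which is also why it can be gamed: a policy that always emits a short, confident but wrong answer scores well, matching the reward hacking risk the abstract mentions.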

125. ❌ Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry

Authors: Mo El-Haj Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16601v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

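Scoring rationale: The paper introduces the Tarab Corpus, a multi-dialect corpus of Arabic lyrics and poetry; it is dataset construction work in linguistics, cultural studies and computational linguistics. The paper focuses entirely on collecting, normalising and validating the corpus and on baseline analyses (such as variety identification and genre differentiation), and involves no large models, deep learning techniques, AI applications or related novel methods. All scoring keywords concern large-model techniques, training methods, inference optimisation and AI applications, whereas this paper is purely language-resource construction and has no connection to these topics.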

!!! tip deepseek-chat TL;DR

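The paper builds the Tarab Corpus, a large dataset of lyrics and poetry containing 2.56 million verses and covering Classical Arabic, Modern Standard Arabic and six major regional varieties, and provides a unified analytical framework for linguistic and cultural research on Arabic.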

摘要翻译

我们推出塔拉卜语料库（Tarab Corpus），这是一个大规模文化与语言资源库，将阿拉伯语歌词与诗歌整合在统一的分析框架内。该语料库包含256万行诗节及超过1350万词元，据我们所知，这是目前最大的开放式阿拉伯语创作文本语料库，涵盖古典与当代作品。塔拉卜语料库在歌曲与诗歌之间保持总体平衡，覆盖古典阿拉伯语、现代标准阿拉伯语（Modern Standard Arabic, MSA）以及六种主要地区变体：埃及阿拉伯语、海湾阿拉伯语、黎凡特阿拉伯语、伊拉克阿拉伯语、苏丹阿拉伯语和马格里布阿拉伯语。语料库收录的艺术家与诗人来自28个现代民族国家及多个历史时期，时间跨度超过十四个世纪，涵盖从伊斯兰教兴起前至二十一世纪的阿拉伯创作表达。每一行诗节均附有结构化元数据，描述其语言变体、地理起源及历史或文化背景，支持跨体裁与跨时代的比较语言学、文体学及历时性分析。本文详细阐述了数据收集、标准化与验证流程，并提供了语言变体识别与体裁区分的基线分析。该数据集已公开发布于HuggingFace平台：https://huggingface.co/datasets/drelhaj/Tarab。

摘要 (Abstract)

We introduce the Tarab Corpus, a large-scale cultural and linguistic resource that brings together Arabic song lyrics and poetry within a unified analytical framework. The corpus comprises 2.56 million verses and more than 13.5 million tokens, making it, to our knowledge, the largest open Arabic corpus of creative text spanning both classical and contemporary production. Tarab is broadly balanced between songs and poems and covers Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties: Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi Arabic. The artists and poets represented in the corpus are associated with 28 modern nation states and multiple historical eras, covering over fourteen centuries of Arabic creative expression from the Pre-Islamic period to the twenty-first century. Each verse is accompanied by structured metadata describing linguistic variety, geographic origin, and historical or cultural context, enabling comparative linguistic, stylistic, and diachronic analysis across genres and time. We describe the data collection, normalisation, and validation pipeline and present baseline analyses for variety identification and genre differentiation. The dataset is publicly available on HuggingFace at https://huggingface.co/datasets/drelhaj/Tarab.

Keywords: Arabic corpus, lyrics, poetry, multi-dialect, linguistic resource, cultural analysis, diachronic analysis, HuggingFace dataset

126. ❌ Diverging Transformer Predictions for Human Sentence Processing: A Comprehensive Analysis of Agreement Attraction Effects

Authors: Titus von der Malsburg, Sebastian Padó Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16574v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

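Scoring rationale: The paper examines whether transformer models are adequate cognitive models of human sentence processing, evaluating eleven autoregressive transformers of varying sizes and architectures on English agreement attraction configurations. It is somewhat related to 'Large Language Models OR LLMs OR Foundation Models' (5), because transformers are the base architecture of current language models and the paper evaluates several of them. It is also somewhat related to 'Mechanistic Interpretability OR Explainable AI' (5), because the paper aims to understand how transformers work internally and how that corresponds to human cognition. The other keywords concern specific large-model techniques, training methods and application scenarios, whereas this paper is a cognitive-science evaluation rather than a technical contribution, so their relevance is 0.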

!!! tip deepseek-chat TL;DR

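The paper systematically evaluates several transformer models on English agreement attraction configurations. Transformer predictions generally align with human reading time data for prepositional phrase configurations, but performance degrades significantly on object-extracted relative clause configurations and no model replicates the asymmetric interference patterns observed in humans, indicating that current transformer models do not explain human morphosyntactic processing.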

摘要翻译

Transformer已成为计算语言学中几乎所有最先进语言模型的基础，但其作为人类句子处理模型的认知适切性仍存争议。本研究采用基于惊异值的关联机制，系统评估了十一种不同规模和架构的自回归Transformer模型，测试范围涵盖了比以往研究更全面的英语一致性吸引结构配置。实验结果呈现复杂性：虽然Transformer模型在介词短语结构上的预测结果与人类阅读时间数据总体一致，但在宾语提取关系从句结构上的表现显著下降。在后一种情况下，不同模型的预测结果存在显著差异，且没有任何模型能成功复现人类表现出的不对称干扰模式。我们的结论是：当前Transformer模型尚不能解释人类的形态句法加工机制，而将Transformer作为认知模型进行评估时，必须采用严谨全面的实验设计，以避免从孤立的句法结构或个别模型中得出虚假的普遍性结论。

摘要 (Abstract)

Transformers underlie almost all state-of-the-art language models in computational linguistics, yet their cognitive adequacy as models of human sentence processing remains disputed. In this work, we use a surprisal-based linking mechanism to systematically evaluate eleven autoregressive transformers of varying sizes and architectures on a more comprehensive set of English agreement attraction configurations than prior work. Our experiments yield mixed results: While transformer predictions generally align with human reading time data for prepositional phrase configurations, performance degrades significantly on object-extracted relative clause configurations. In the latter case, predictions also diverge markedly across models, and no model successfully replicates the asymmetric interference patterns observed in humans. We conclude that current transformer models do not explain human morphosyntactic processing, and that evaluations of transformers as cognitive models must adopt rigorous, comprehensive experimental designs to avoid spurious generalizations from isolated syntactic configurations or individual models.

Keywords: Transformer models, human sentence processing, agreement attraction, cognitive adequacy, surprisal-based linking, autoregressive transformers, morphosyntactic processing, experimental evaluation
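
The surprisal-based linking mechanism mentioned in the abstract rests on the standard definition of surprisal, the negative log probability of a word given its context. A minimal sketch of that quantity follows, together with one way to express an attraction effect as a surprisal difference. The probabilities are invented for illustration, and `attraction_effect` is a hypothetical helper rather than the paper's analysis.

```python
import math


def surprisal_bits(prob):
    """Surprisal of a word in bits: -log2 P(word | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def attraction_effect(p_verb_no_attractor, p_verb_with_attractor):
    """Drop in surprisal at an ungrammatical verb when an attractor is present.

    A positive value means the model finds the agreement error less surprising
    when a number-mismatching noun intervenes, as in the classic example
    "The key to the cabinets are ...".
    """
    return (surprisal_bits(p_verb_no_attractor)
            - surprisal_bits(p_verb_with_attractor))


# Hypothetical model probabilities for the plural verb "are".
p_without = 0.01  # "The key to the cabinet are ..."
p_with = 0.04     # "The key to the cabinets are ..."
print(attraction_effect(p_without, p_with))
```

Under a surprisal-based linking hypothesis, per-word surprisal differences of this kind are what get compared against human reading time differences.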

Authors: Xizhong Yang, Yinan Xia, Huiming Wang, Mofei Song Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16500v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

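Scoring rationale: The paper focuses on confidence calibration and reward-signal optimisation in reinforcement learning (RL), proposing DistriTTRL to address the discrepancy in internal information and the reward hacking problem in test-time training. Although it involves model-internal information and confidence, the paper does not explicitly mention large language models (LLMs), innovations in deep learning techniques, or any of the specific techniques in the scoring keywords (such as MoE, SFT or RAG). The research background emphasises the application of large models and deep learning in science, but this paper does not touch on those areas or techniques, so the relevance of every keyword is 0.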

!!! tip deepseek-chat TL;DR

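The paper proposes DistriTTRL, which uses progressive distribution refinement and diversity-targeted penalties to address confidence calibration and reward hacking in test-time training for reinforcement learning, and reports significant performance improvements across multiple models and benchmarks.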

摘要翻译

利用模型内部信息作为强化学习（RL）中的自奖励信号，因其无需标注的特性而受到广泛关注。尽管先前研究在将测试时间缩放（Test-Time Scaling, TTS）策略应用于强化学习方面取得了显著进展，但测试与训练阶段内部信息的差异仍未得到充分解决。此外，基于投票式TTS策略的测试时间训练常受奖励黑客问题困扰。针对这些问题，我们提出了DistriTTRL方法，该方法利用强化学习过程中模型置信度的分布先验逐步优化奖励信号，而非仅依赖单次查询推演。同时，我们通过引入以多样性为目标的惩罚机制，缓解了由投票式TTS策略引发的持续性奖励黑客现象。得益于这种模型能力与自奖励信号相互补充的训练机制，以及对奖励黑客问题的缓解，DistriTTRL在多个模型与基准测试中均实现了显著的性能提升。

摘要 (Abstract)

Leveraging the model’s internal information as the self-reward signal in Reinforcement Learning (RL) has received extensive attention due to its label-free nature. While prior works have made significant progress in applying the Test-Time Scaling (TTS) strategies to RL, the discrepancy in internal information between test and training remains inadequately addressed. Moreover, Test-Time Training based on voting-based TTS strategies often suffers from reward hacking problems. To address these issues, we propose DistriTTRL, which leverages the distribution prior of the model’s confidence during RL to progressively optimize the reward signal, rather than relying solely on single-query rollouts. Additionally, we mitigate the phenomenon of consistent reward hacking caused by the voting-based TTS strategies through diversity-targeted penalties. Benefiting from this training mechanism where model capability and self-reward signals complement each other, and the mitigation of reward hacking, DistriTTRL has achieved significant performance improvements across multiple models and benchmarks.

Keywords: Reinforcement Learning, Confidence Calibration, Test-Time Training, Reward Hacking, Distribution Refinement, Self-Reward Signal, DistriTTRL, Model Internal Information
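
To make the reward hacking concern in the abstract concrete, the sketch below shows a generic voting-based pseudo-reward (each rollout is rewarded for agreeing with the majority answer) and a simple diversity-sensitive discount. It is an illustration under stated assumptions only: the majority-vote rule is a common label-free baseline, while `penalised_rewards` and its `penalty` parameter are hypothetical and are not the penalty used by DistriTTRL.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Reward 1.0 for rollouts that match the most common answer, else 0.0."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]


def penalised_rewards(answers, penalty=0.5):
    """Hypothetical diversity-targeted discount on majority-vote rewards.

    When rollouts collapse to a single answer, the share of distinct answers
    shrinks and the reward is scaled down, so agreement alone pays less.
    """
    base = majority_vote_rewards(answers)
    distinct_ratio = len(set(answers)) / len(answers)
    return [r - penalty * (1.0 - distinct_ratio) * r for r in base]


# Hypothetical final answers extracted from four rollouts of one query.
mixed = ["4", "4", "5", "4"]
collapsed = ["7", "7", "7", "7"]
print(majority_vote_rewards(mixed))
print(majority_vote_rewards(collapsed))
print(penalised_rewards(collapsed))
```

The collapsed case shows the failure mode: every rollout earns full reward from voting alone whether or not "7" is correct, which is the consistent reward hacking a diversity-aware term is meant to discourage.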

128. ❌ How often do Answers Change? Estimating Recency Requirements in Question Answering

Authors: Bhawna Piryani, Zehra Mert, Adam Jatowt Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16544v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

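Scoring rationale: The paper studies the limitations of large language models (LLMs) when answering time-sensitive questions, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10). It discusses how models decide when to retrieve external evidence, which is somewhat related to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (5). It also addresses confident yet incorrect answers caused by reliance on outdated knowledge, which is somewhat related to 'Hallucination Mitigation OR Factuality OR Truthfulness' (5). Other keywords, such as MoE, SLMs, training techniques, reasoning methods, agent systems and model compression, are not directly addressed or appear only as background, and score 0.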

!!! tip deepseek-chat TL;DR

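The paper addresses the problem that large language models give incorrect answers to time-sensitive questions because they rely on outdated knowledge. It proposes a recency-stationarity taxonomy and builds the RecencyQA dataset for fine-grained evaluation and for developing recency-aware and context-sensitive question answering systems.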

摘要翻译

大型语言模型（LLMs）在回答时效性问题时，常依赖过时的知识，导致其给出自信却错误的回答。由于缺乏明确信号指示是否需要最新信息，模型难以决定何时检索外部证据、如何对陈旧事实进行推理，以及如何根据答案的有效性进行排序。现有基准测试要么定期更新答案，要么依赖固定模板，但未能反映答案变化的频率或问题本身是否需要最新信息。为填补这一空白，我们提出了一种时效性-平稳性分类法，该分类法根据答案变化的频率以及这种变化频率是否随时间不变或依赖于上下文来对问题进行分类。基于此分类法，我们构建了RecencyQA数据集，该数据集包含4,031个开放领域问题，并标注了时效性与平稳性标签。通过人工评估与实证分析，我们发现非平稳性问题（即上下文会改变其时效性要求的问题）对LLMs而言更具挑战性，且难度随着更新频率的上升而增加。通过显式建模时效性与上下文依赖性，RecencyQA能够实现超越二元新鲜度概念的细粒度时间推理基准测试与分析，并为开发具有时效感知和上下文敏感性的问答系统奠定基础。

摘要 (Abstract)

Large language models (LLMs) often rely on outdated knowledge when answering time-sensitive questions, leading to confident yet incorrect responses. Without explicit signals indicating whether up-to-date information is required, models struggle to decide when to retrieve external evidence, how to reason about stale facts, and how to rank answers by their validity. Existing benchmarks either periodically refresh answers or rely on fixed templates, but they do not reflect on how frequently answers change or whether a question inherently requires up-to-date information. To address this gap, we introduce a recency-stationarity taxonomy that categorizes questions by how often their answers change and whether this change frequency is time-invariant or context-dependent. Building on this taxonomy, we present RecencyQA, a dataset of 4,031 open-domain questions annotated with recency and stationarity labels. Through human evaluation and empirical analysis, we show that non-stationary questions, i.e., those where context changes the recency requirement, are significantly more challenging for LLMs, with difficulty increasing as update frequency rises. By explicitly modeling recency and context dependence, RecencyQA enables fine-grained benchmarking and analysis of temporal reasoning beyond binary notions of freshness, and provides a foundation for developing recency-aware and context-sensitive question answering systems.

Keywords: Large language models, Question answering, Temporal reasoning, Recency requirements, Dataset, Benchmarking, Outdated knowledge, Context dependence

129. ❌ AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents

Authors: Shannan Yan, Jingchen Ni, Leqi Zheng, Jiajun Zhang, Peixi Wu, Dacheng Yin, Jing Lyu, Chun Yuan, Fengyun Rao. Journal/Source: arXiv. Published: 2026-03-17. arXiv link: http://arxiv.org/abs/2603.16496v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

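Scoring rationale: AdaMem proposes an adaptive, user-centric memory framework for long-horizon dialogue agents, centered on LLM agents (highly relevant, 10) and retrieval over memory (relevant, 8). The paper states explicitly that LLM agents rely on external memory to support long-horizon interaction and multi-step reasoning, so it is highly relevant to 'LLM Agents'. Its memory retrieval mechanism relates to 'Retrieval-Augmented Generation', although the paper focuses on memory organization rather than on augmenting generation. 'Context Window Extension' and 'Chain of Thought' are partially relevant (5 each) because the paper involves long-context memory and multi-step reasoning, but neither is its direct technical focus. The remaining keywords, such as MoE, SFT, and RLHF, are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper proposes the AdaMem framework, which uses adaptive memory organization to address over-reliance on semantic similarity, memory fragmentation, and static granularity in long-horizon dialogue agents, and reports state-of-the-art performance on the LoCoMo and PERSONAMEM benchmarks.

Abstract (Chinese translation)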

大型语言模型（LLM）智能体日益依赖外部记忆来支持长程交互、个性化辅助与多步推理。然而，现有记忆系统仍面临三个核心挑战：它们往往过度依赖语义相似性，可能遗漏对以用户为中心的理解至关重要的证据；它们常将相关经验存储为孤立片段，削弱了时间与因果连贯性；且通常采用静态的记忆粒度，难以适配不同问题的需求。本文提出AdaMem，一种面向长程对话智能体的自适应、以用户为中心的记忆框架。AdaMem将对话历史组织为工作记忆、情景记忆、人物记忆与图记忆，使系统能够在统一框架中保留近期上下文、结构化的长期经验、稳定的用户特征以及关系感知的连接。在推理阶段，AdaMem首先解析目标参与者，随后构建基于问题条件的检索路径——仅在需要时结合语义检索与关系感知的图扩展，最终通过专为证据合成与响应生成设计的角色专业化流程生成答案。我们在长程推理与用户建模基准测试集LoCoMo和PERSONAMEM上对AdaMem进行评估。实验结果表明，AdaMem在两个基准测试上均取得了最先进的性能。代码将在论文录用后公开。

Abstract

Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user-centric understanding; they frequently store related experiences as isolated fragments, weakening temporal and causal coherence; and they typically use static memory granularities that do not adapt well to the requirements of different questions. We propose AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem organizes dialogue history into working, episodic, persona, and graph memories, enabling the system to preserve recent context, structured long-term experiences, stable user traits, and relation-aware connections within a unified framework. At inference time, AdaMem first resolves the target participant, then builds a question-conditioned retrieval route that combines semantic retrieval with relation-aware graph expansion only when needed, and finally produces the answer through a role-specialized pipeline for evidence synthesis and response generation. We evaluate AdaMem on the LoCoMo and PERSONAMEM benchmarks for long-horizon reasoning and user modeling. Experimental results show that AdaMem achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance.

Keywords: LLM agents, external memory, long-horizon dialogue, adaptive memory, user-centric, retrieval, multi-step reasoning, personalized assistance

130. ❌ On the Emotion Understanding of Synthesized Speech

Authors: Yuan Ge, Haishu Zhao, Aokai Hao, Junxiang Zhang, Bei Li, Xiaoqian Liu, Chenglong Wang, Jianjin Wang, Bingsen Zhou, Bingyu Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao. Journal/Source: arXiv. Published: 2026-03-17. arXiv link: http://arxiv.org/abs/2603.16483v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

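Scoring rationale: The paper studies emotion recognition on synthesized speech, covering speech synthesis, speech emotion recognition, and speech language models; it does not involve innovations in the underlying techniques of large models or applications of deep learning to science. The keywords all concern specific techniques such as large-model methods, training, inference optimization, alignment, and agent systems, whereas this paper evaluates a specific application in the speech domain and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The study finds that existing speech emotion recognition models fail to generalize to synthesized speech, mainly because speech token prediction during synthesis induces a representation mismatch, and that generative speech language models tend to infer emotion from textual semantics while ignoring paralinguistic cues.

Abstract (Chinese translation)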

情感是语音交互中的核心副语言特征。学界普遍认为，情感理解模型能够学习可迁移至合成语音的基础表征，这使得情感理解结果可作为评估语音合成情感表现力的合理奖励或评价指标。本研究通过系统性地评估合成语音上的语音情感识别（SER）——涵盖不同数据集、判别式与生成式SER模型以及多样化的合成模型——对这一假设进行了批判性检验。我们发现，当前的SER模型无法泛化至合成语音，其主要原因在于合成过程中的语音标记预测导致了合成语音与人类语音之间的表征失配。此外，生成式语音语言模型（SLMs）倾向于从文本语义推断情感，而忽略副语言线索。总体而言，我们的研究结果表明，现有SER模型往往利用非鲁棒的捷径而非捕捉基础特征，且SLMs中的副语言理解仍面临挑战。

Abstract

Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models can not generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.

Keywords: Speech Emotion Recognition, Synthesized Speech, Speech Language Models, Paralinguistic Features, Emotion Understanding, Representation Mismatch, Generative Models, Speech Synthesis

131. ❌ DynHD: Hallucination Detection for Diffusion Large Language Models via Denoising Dynamics Deviation Learning

Authors: Yanyu Qian, Yue Tan, Yixin Liu, Wang Yu, Shirui Pan. Journal/Source: arXiv. Published: 2026-03-17. arXiv link: http://arxiv.org/abs/2603.16459v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

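Scoring rationale: The paper focuses on hallucination detection for diffusion large language models (D-LLMs). It is highly relevant to 'Large Language Models' (10) because D-LLMs are a variant of LLMs, and highly relevant to 'Hallucination Mitigation' (10) because its core goal is detecting hallucinated responses. It is partially relevant to 'Mechanistic Interpretability' (5) because it analyzes denoising dynamics to understand model behavior, although interpretability is not its main aim. Other keywords (such as MoE, SFT, and RAG) are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper targets hallucination in diffusion large language models (D-LLMs) and proposes DynHD, which models uncertainty evidence from spatial (token sequence) and temporal (denoising dynamics) perspectives to detect hallucinated responses, outperforming existing methods across multiple benchmarks.

Abstract (Chinese translation)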

扩散大语言模型（D-LLMs）因其迭代优化能力，已成为自回归模型的一种有前景的替代方案。然而，幻觉问题仍然是影响其可靠性的关键障碍。为检测模型输出中的幻觉响应，词元级不确定性（如熵）已被广泛用作指示潜在事实错误的有效信号。然而，D-LLMs的定长生成范式意味着不同词元对幻觉检测的贡献并不均衡，仅有一小部分能提供有意义的信号。此外，不确定性在扩散过程中的演变趋势也能提供重要线索，这凸显了对其去噪动态进行建模以检测幻觉的必要性。本文提出的DynHD方法从空间（词元序列）和时间（去噪动态）两个视角弥合了上述差距。针对词元间信息密度不均衡的问题，我们设计了一个语义感知的证据构建模块，通过过滤非信息性词元并强调语义显著的词元，来提取指示幻觉的信号。为建模用于幻觉检测的去噪动态，我们引入了一个参考证据生成器，用于学习不确定性证据的预期演变轨迹，同时设计了一个基于偏差的幻觉检测器，通过度量观测轨迹与参考轨迹之间的差异来做出预测。大量实验表明，DynHD在多个基准测试和骨干模型中均能持续优于现有先进基线，同时实现了更高的效率。

Abstract

Diffusion large language models (D-LLMs) have emerged as a promising alternative to auto-regressive models due to their iterative refinement capabilities. However, hallucinations remain a critical issue that hinders their reliability. To detect hallucination responses from model outputs, token-level uncertainty (e.g., entropy) has been widely used as an effective signal to indicate potential factual errors. Nevertheless, the fixed-length generation paradigm of D-LLMs implies that tokens contribute unevenly to hallucination detection, with only a small subset providing meaningful signals. Moreover, the evolution trend of uncertainty throughout the diffusion process can also provide important signals, highlighting the necessity of modeling its denoising dynamics for hallucination detection. In this paper, we propose DynHD that bridge these gaps from both spatial (token sequence) and temporal (denoising dynamics) perspectives. To address the information density imbalance across tokens, we propose a semantic-aware evidence construction module that extracts hallucination-indicative signals by filtering out non-informative tokens and emphasizing semantically meaningful ones. To model denoising dynamics for hallucination detection, we introduce a reference evidence generator that learns the expected evolution trajectory of uncertainty evidence, along with a deviation-based hallucination detector that makes predictions by measuring the discrepancy between the observed and reference trajectories. Extensive experiments demonstrate that DynHD consistently outperforms state-of-the-art baselines while achieving higher efficiency across multiple benchmarks and backbone models.

Keywords: Diffusion Large Language Models, Hallucination Detection, Denoising Dynamics, Uncertainty Evidence, Semantic-aware Evidence Construction, Deviation-based Detection, Token-level Uncertainty, Iterative Refinement
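
The two ideas in the abstract above, token-level entropy as uncertainty evidence and a deviation score between an observed and a reference trajectory, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the boolean `keep_mask` standing in for the semantic-aware token filter, the mean-absolute-gap deviation, and the `threshold` value are all assumptions made for illustration. The reference trajectory is passed in as a plain list, whereas the paper learns it with a reference evidence generator.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_evidence(step_distributions, keep_mask):
    """Mean entropy over the tokens kept by a (hypothetical) informativeness mask."""
    kept = [
        token_entropy(probs)
        for probs, keep in zip(step_distributions, keep_mask)
        if keep
    ]
    return sum(kept) / len(kept) if kept else 0.0


def trajectory_deviation(observed, reference):
    """Mean absolute gap between observed and reference evidence per denoising step."""
    if len(observed) != len(reference):
        raise ValueError("trajectories must cover the same denoising steps")
    return sum(abs(o - r) for o, r in zip(observed, reference)) / len(observed)


def flag_hallucination(observed, reference, threshold=0.2):
    """Flag a response when its trajectory drifts too far from the reference.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    return trajectory_deviation(observed, reference) > threshold
```

In this toy form, a response whose per-step evidence stays close to the reference is not flagged, while one whose uncertainty fails to fall across denoising steps is.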

132. ❌ Capability-Guided Compression: Toward Interpretability-Aware Budget Allocation for Large Language Models

Authors: Rishaank Gupta. Journal/Source: arXiv. Published: 2026-03-17. arXiv link: http://arxiv.org/abs/2603.16440v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

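Scoring rationale: The paper's core topic is large language model compression, which maps directly to 'Large Language Models' (10) and 'Quantization OR Model Compression OR Low-bit Weights' (10). It uses sparse autoencoders (SAEs) to analyze what model components encode and thereby guide compression budget allocation, which falls under mechanistic interpretability and explainable AI, matching 'Mechanistic Interpretability OR Explainable AI' (10). Other keywords, such as MoE, SFT, RAG, reasoning methods, and agents, are not mentioned in the abstract or are unrelated, so they score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the 'capability-blind compression' problem in large language model compression and proposes the Capability-Guided Compression (CGC) framework, which uses capability density maps derived from sparse autoencoders to allocate differential compression budgets, with the aim of better preserving the model's key reasoning capabilities during compression.

Abstract (Chinese translation)

大语言模型压缩通过剪枝、量化和低秩分解已取得显著进展，但所有现有方法仍存在一个根本性局限：压缩预算的分配完全缺乏对各个模型组件功能编码内容的表征。我们将此称为“能力盲压缩”问题，并论证这是两类已有充分记录的失效现象的根本原因之一：基于困惑度的评估对推理能力损失不敏感，以及近期由Ma等人（2026）刻画的模型性能突变式相变。我们提出能力引导压缩（CGC）框架，该框架利用由稀疏自编码器（SAE）导出的能力密度图，为Transformer各组件分配差异化的压缩预算，从而解决此问题。能力密度是一个形式化定义的标量度量，它综合了组件SAE特征激活分布的特征广度、激活熵和跨输入一致性。我们在理论上证明，能力密度更高的组件表现出更低的结构冗余度，并在更低的压缩比下达到其各自的相变点，从而提供了首个用于组件级相变预测的压缩前机制。在GPT-2 Medium上的实验证实，能力密度与Wanda重要性评分统计独立（斯皮尔曼ρ = -0.054，n = 384个注意力头），这确立了它是一种与所有现有重要性度量正交的全新压缩信号。我们报告了基于困惑度（PPL）的压缩比较的负面结果，并给出原理性诊断，指出GPT-2 Medium不足以作为完整CGC假说的测试平台。该理论框架、密度形式化体系及正交性发现共同构成了能力感知压缩研究的基础。

Abstract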

Large language model compression has made substantial progress through pruning, quantization, and low-rank decomposition, yet a fundamental limitation persists across all existing methods: compression budgets are allocated without any representation of what individual model components functionally encode. We term this the capability-blind compression problem and argue it is a root cause of two well-documented failures – the insensitivity of perplexity-based evaluation to reasoning capability loss, and the abrupt phase transitions in model performance recently characterized by Ma et al. (2026). We propose Capability-Guided Compression (CGC), a framework that addresses this by using Sparse Autoencoder (SAE)-derived capability density maps to allocate differential compression budgets across transformer components. Capability density is a formally defined scalar measure combining the feature breadth, activation entropy, and cross-input consistency of a component’s SAE feature activation distribution. We prove theoretically that components with higher capability density exhibit lower structural redundancy and reach their individual phase transition points at lower compression ratios, providing the first pre-compression mechanism for component-level phase transition prediction. Experiments on GPT-2 Medium confirm that capability density is statistically independent of Wanda importance scores (Spearman rho = -0.054, n = 384 heads), establishing it as a genuinely novel compression signal orthogonal to all existing importance metrics. We report a negative result on PPL-based compression comparison and provide a principled diagnosis identifying GPT-2 Medium as an insufficient test bed for the full CGC hypothesis. The theoretical framework, density formalism, and orthogonality finding constitute a foundation for capability-aware compression research.

Keywords: Large Language Model Compression, Capability-Guided Compression, Sparse Autoencoder, Capability Density, Phase Transition, Interpretability, Budget Allocation, Model Components
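
The orthogonality claim in the abstract rests on a Spearman rank correlation between two per-head scores. As a reference for how that statistic is computed, here is a minimal pure-Python sketch. The function names are illustrative, the numbers in the usage lines are made-up toy values rather than data from the paper, and the capability density formula itself is not reproduced because the abstract does not give it.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-head scores (hypothetical values, for illustration only).
density = [0.1, 0.4, 0.2, 0.9]
importance = [3.0, 1.0, 4.0, 2.0]
rho = spearman_rho(density, importance)
```

A value of rho near zero, as reported in the abstract, means that ranking heads by one score says little about their ranking under the other.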

133. ❌ VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

Authors: Yixuan Wang, Qingyu Shi, Jiayu Zhou, Dianbo Liu, Ziwei He, Zhouhan Lin. Journal/Source: arXiv. Published: 2026-03-17. arXiv link: http://arxiv.org/abs/2603.16435v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

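Scoring rationale: The paper's core topic is KV cache compression, which is highly relevant to 'KV Cache Compression OR Linear Attention OR FlashAttention' (15) and directly targets the memory bottleneck in LLM deployment. It explicitly targets LLMs (10), uses vector quantization for compression (10), and bears on inference acceleration (8). Because growing LLM context lengths enlarge the KV cache, it is also relevant to long-context LLMs (8). Other keywords, such as MoE, SFT, and RAG, are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper proposes VQKV, a training-free KV cache compression method based on vector quantization that achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of baseline performance on LongBench, and enables 4.3x longer generation length at the same memory footprint.

Abstract (Chinese translation)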

大型语言模型（LLM）不断增长的上下文长度导致其键值（KV）缓存持续扩大，从而限制了其在资源受限环境中的部署。现有的免训练KV缓存压缩方法通常依赖于低秩近似或标量量化，这些方法难以同时实现高压缩比与高重建保真度。我们提出VQKV，一种新颖的免训练方法，该方法引入向量量化（VQ）来获得高度压缩的KV表示，同时保持较高的模型保真度，使得仅用少数整数索引即可表示数千个浮点数值。实验结果表明，VQKV在LLaMA3.1-8B模型上实现了82.8%的压缩比，同时在LongBench基准测试中保持了98.6%的基线性能，并在相同内存占用下实现了4.3倍的生成长度扩展。

Abstract

The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.

Keywords: KV cache compression, vector quantization, Large Language Models, memory efficiency, inference optimization, training-free method, high compression ratio, model fidelity
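
To make the idea of representing many floating-point values with a few integer indices concrete, the following is a minimal sketch of plain nearest-centroid vector quantization. It is a textbook illustration, not the VQKV algorithm: the codebook is assumed to be given, the function names are hypothetical, and `bits_saved_fraction` ignores the storage cost of the codebook itself, so its output should not be compared with the 82.8% figure reported in the abstract.

```python
import math


def nearest_index(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared L2 distance."""
    best, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best, best_dist = idx, dist
    return best


def encode(vectors, codebook):
    """Replace each vector by the integer index of its nearest codebook entry."""
    return [nearest_index(v, codebook) for v in vectors]


def decode(indices, codebook):
    """Reconstruct approximate vectors by looking indices up in the codebook."""
    return [list(codebook[i]) for i in indices]


def bits_saved_fraction(dim, bits_per_value, codebook_size):
    """Fraction of bits saved per vector when one index replaces `dim` values."""
    original_bits = dim * bits_per_value
    index_bits = math.ceil(math.log2(codebook_size))
    return 1.0 - index_bits / original_bits


# Toy example with a 3-entry codebook of 2-dimensional vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
indices = encode([[0.1, -0.2], [0.9, 1.2], [3.5, 4.1]], codebook)
approx = decode(indices, codebook)
```

The trade-off this exposes is the usual one for vector quantization: a larger codebook lowers reconstruction error but needs more bits per index and more memory for the codebook.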

134. ❌ RECOVER: Robust Entity Correction via agentic Orchestration of hypothesis Variants for Evidence-based Recovery

Authors: Abhishek Kumar, Aashraya Sachdeva. Journal/Source: arXiv. Published: 2026-03-17. arXiv link: http://arxiv.org/abs/2603.16411v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

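Scoring rationale: RECOVER proposes an agentic framework for entity correction in automatic speech recognition (ASR). It explicitly uses a large language model (LLM) for correction and is designed as a tool-using agent, so it is highly relevant to 'Large Language Models', 'LLM Agents', and 'Tool Use' (10 each). The framework retrieves relevant entities as evidence, which is partially relevant to 'Retrieval-Augmented Generation' (5). The correction process can be viewed as a form of self-improvement, which is partially relevant to 'Self-Correction' (5). Application domains include medicine, which is partially relevant to 'AI for Science' (5). Other keywords, such as MoE, Scaling Laws, and RLHF, are not mentioned in the abstract or are unrelated to the core method, so they score 0.

!!! tip deepseek-chat TL;DR

The study addresses misrecognition of rare and domain-specific entities in automatic speech recognition (ASR) by introducing RECOVER, an agentic framework that applies a large language model (LLM) for correction; across five datasets it achieves 8-46% relative reductions in entity-phrase word error rate (E-WER) and improves recall by up to 22 percentage points.

Abstract (Chinese translation)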

自动语音识别（ASR）中的实体识别对于罕见词和领域专有术语具有挑战性。在金融、医疗和空中交通管制等领域，此类识别错误代价高昂。若实体在ASR输出中完全缺失，则后续修正将极为困难。为此，我们提出RECOVER——一种作为工具使用智能体的自主修正框架。该框架利用ASR产生的多假设作为证据，检索相关实体，并在约束条件下应用大语言模型（LLM）进行修正。我们采用四种策略处理多假设：1-Best、实体感知选择（Entity-Aware Select）、识别器输出投票误差降低（ROVER）集成以及LLM选择（LLM-Select）。在五个不同数据集上的评估表明，该框架使实体短语词错误率（E-WER）相对降低8-46%，召回率最高提升22个百分点。其中LLM-Select在保持整体词错误率的同时，实现了最佳的实体修正综合性能。

Abstract

Entity recognition in Automatic Speech Recognition (ASR) is challenging for rare and domain-specific terms. In domains such as finance, medicine, and air traffic control, these errors are costly. If the entities are entirely absent from the ASR output, post-ASR correction becomes difficult. To address this, we introduce RECOVER, an agentic correction framework that serves as a tool-using agent. It leverages multiple hypotheses as evidence from ASR, retrieves relevant entities, and applies Large Language Model (LLM) correction under constraints. The hypotheses are used using different strategies, namely, 1-Best, Entity-Aware Select, Recognizer Output Voting Error Reduction (ROVER) Ensemble, and LLM-Select. Evaluated across five diverse datasets, it achieves 8-46% relative reductions in entity-phrase word error rate (E-WER) and increases recall by up to 22 percentage points. The LLM-Select achieves the best overall performance in entity correction while maintaining overall WER.

Keywords: Entity Recognition, Automatic Speech Recognition, ASR Correction, Agentic Framework, Large Language Model, Tool-Using Agent, Entity-Phrase Word Error Rate, Hypothesis Variants
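
The headline numbers in this entry are word error rates and relative reductions, so a short sketch of how those quantities are computed may help. The code below implements standard word-level edit distance and WER. The paper's E-WER is computed over entity phrases, and the abstract does not say how those phrases are extracted, so this sketch simply scores whatever strings it is given. The function names and example strings are hypothetical.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in whole words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(reference, hypothesis) / n_ref


def relative_reduction(before, after):
    """Relative reduction of an error rate, e.g. 0.50 -> 0.27 is 0.46."""
    return (before - after) / before


# Made-up example: one substituted word out of four reference words.
wer = word_error_rate("contact doctor ramanujan today",
                      "contact doctor ramen today")
```

A relative reduction is measured against the starting error rate, which is why it is reported separately from the absolute percentage-point gain in recall.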

135. ❌ PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

Authors: Hanif Rahman. Journal/Source: arXiv. Published: 2026-03-17. arXiv link: http://arxiv.org/abs/2603.16354v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

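Scoring rationale: The paper's core contribution is PashtoCorp, a large-scale corpus for a low-resource language (Pashto), together with a demonstration of its use in pre-training, specifically continued masked language model pre-training. It is therefore highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10): the paper explicitly performs 'Continued MLM pretraining' and evaluates the effect of adaptation. It is partially relevant to 'Large Language Models OR LLMs OR Foundation Models' (5) because it reports Gemma-3n as an LLM baseline, although the main work is not about LLMs themselves. It is partially relevant to 'Scaling Laws AND Data Quality' (5) because it involves large-scale corpus construction (1.25B words) and quality filtering (deduplication and filtering), which bears indirectly on how data scale and quality affect model performance. Other keywords (such as MoE, SFT, RLHF, RAG, and Agents) are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The study builds PashtoCorp, a 1.25B-word Pashto corpus, and shows that continued MLM pre-training of XLM-R-base on it lowers held-out perplexity and improves named entity recognition. It also reports a first Gemma-3n LLM baseline for Pashto reading comprehension on Belebele, providing an important resource for low-resource language NLP.

Abstract (Chinese translation)

本文介绍了PashtoCorp，一个包含12.5亿词规模的普什图语语料库。普什图语拥有6000万使用者，但在自然语言处理（NLP）领域长期处于严重资源不足状态。该语料库整合了来自39个数据源的内容，包括7个HuggingFace数据集和32个专门构建的网络爬虫，并通过可复现的流程进行处理，涵盖阿拉伯文字符分词、SHA-256去重和质量过滤。PashtoCorp包含281万份文档共计12.5亿词，其规模是OSCAR普什图语子集的40倍，也是此前最大专用普什图语料库的83倍。基于PashtoCorp对XLM-R-base模型进行持续掩码语言建模（MLM）预训练，将留出集困惑度降低了25.1%（从8.08降至6.06）。在WikiANN普什图语命名实体识别（NER）任务中，预训练模型的实体F1分数相对提升10%（从19.0%升至21.0%），训练方差降低近7倍；最大增益出现在50个训练语句场景（+27%），且PashtoCorp覆盖了WikiANN 97.9%的实体词汇。在Belebele普什图语阅读理解任务中，Gemma-3n模型达到64.6%的准确率，这是该基准测试中首个公开发表的普什图语大语言模型（LLM）基线。通过留一法源数据消融实验发现，维基百科（占文档总量的0.7%）对NER任务最为关键：仅移除该数据源就会导致实体F1分数下降47%。语料库数据、训练模型及代码已发布于https://huggingface.co/datasets/ihanif/pashto-corpus、https://huggingface.co/ihanif/xlmr-pashto 和 https://github.com/ihanif/pashto-corpus。

Abstract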

We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08->6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%->21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at https://huggingface.co/datasets/ihanif/pashto-corpus, https://huggingface.co/ihanif/xlmr-pashto, and https://github.com/ihanif/pashto-corpus.

Keywords: low-resource language, corpus construction, Pashto, pre-training, multilingual NLP, named entity recognition, reading comprehension, domain adaptation
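
The pipeline step named in the abstract, SHA-256 deduplication, and the relative changes it reports can be sketched in a few lines of Python. This is an illustrative sketch, not the released pipeline: the whitespace normalization before hashing and the function names are assumptions, since the abstract only states that SHA-256 is used for deduplication.

```python
import hashlib


def normalize(text):
    """Collapse runs of whitespace so trivially different copies hash alike.

    This normalization is an assumption; the abstract does not describe it.
    """
    return " ".join(text.split())


def deduplicate(documents):
    """Keep the first occurrence of each document, keyed by SHA-256 digest."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before


# Rounded figures quoted in the abstract.
ppl_change = relative_change(8.08, 6.06)  # held-out perplexity
f1_change = relative_change(19.0, 21.0)   # WikiANN entity F1
```

With the rounded values quoted in the abstract, the perplexity change works out to about -25.0% and the F1 change to about +10.5%, consistent with the reported 25.1% and 10% if those were computed from unrounded values.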

136. ❌ Omnilingual MT: Machine Translation for 1,600 Languages

Authors: Omnilingual MT Team, Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler, Nate Ekberg, Cynthia Gao, Pere Lluís Huguet Cabot, João Maria Janeiro, Jean Maillard, Gabriel Mejia Gonzalez, Holger Schwenk, Edan Toledo, Arina Turkatenko, Albert Ventayol-Boada, Rashel Moritz, Alexandre Mourachko, Surya Parimi, Mary Williamson, Shireen Yates, David Dale, Marta R. Costa-jussà. Journal/Source: arXiv. Published: 2026-03-17. arXiv link: http://arxiv.org/abs/2603.16309v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

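Scoring rationale: The paper's core topic is specializing large language models (LLMs) for machine translation. It explicitly uses an LLM either as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB), so it is highly relevant to 'Large Language Models' (10 points). The paper involves supervised fine-tuning (SFT) to optimize translation performance, making it highly relevant to 'Post-training/SFT' (10 points). Its data strategy integrates large multilingual corpora with newly created datasets, which gives some connection to 'Scaling Laws AND Data Quality' (5 points). Its exploration of LLM specialization involves domain adaptation, which gives some connection to 'Pre-training/Domain Adaptation' (5 points). Other keywords such as MoE, SLMs, RLHF, and RAG are either not mentioned in the abstract or unrelated to the paper, so they score 0.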

!!! tip deepseek-chat TL;DR

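This paper studies how to specialize large language models for machine translation and develops Omnilingual MT, a system supporting more than 1,600 languages. Its 1B to 8B parameter models match or exceed a 70B LLM baseline in translation quality and substantially expand the set of languages for which coherent translations can be generated.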

摘要翻译

高质量的机器翻译（MT）系统已能扩展至数百种语言，为多语言系统设定了高标准。然而，与世界现存的约7,000种语言相比，当前系统的覆盖范围仍然有限：目标端支持约200种语言，源端可能因跨语言迁移（cross-lingual transfer）而多支持数百种。由于缺乏可靠的基准测试和评估指标，这些数据本身也难以准确衡量。

我们提出了全语种机器翻译（Omnilingual Machine Translation, OMT），这是首个支持超过1,600种语言的机器翻译系统。这一规模得益于一项综合数据策略，该策略整合了大型公共多语言语料库与新创建的数据集，包括人工精校的MeDLEY双语对照语料。

我们探索了两种将大语言模型（Large Language Model, LLM）专门化用于机器翻译的途径：作为仅解码器模型（OMT-LLaMA）或作为编码器-解码器架构中的一个模块（OMT-NLLB）。值得注意的是，我们所有参数量从10亿到80亿的模型，其机器翻译性能均达到或超过了一个700亿参数大语言模型基线的水平，这揭示了明确的专门化优势，并能在低计算资源环境下实现强大的翻译质量。此外，我们对英语到1,600种语言翻译的评估进一步表明，基线模型虽能理解支持不足的语言，却常常无法以有意义的保真度生成这些语言；而OMT-LLaMA模型则大幅扩展了能够实现连贯生成的语言集合。同时，OMT模型在跨语言迁移方面有所改进，对于所评估的1,600种语言，已接近解决机器翻译中“理解”部分的难题。我们的排行榜及主要人工创建的评估数据集（BOUQuET和Met-BOUQuET）正动态地向全语种方向演进，并免费开放。

摘要 (Abstract)

High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world’s 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and maybe a few hundreds more on the source side, supported due to cross-lingual transfer. And even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, being close to solving the “understanding” part of the puzzle in MT for the 1,600 evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and freely available.

Keywords: Machine Translation, Large Language Models, Multilingual Systems, Specialization, Low-compute Settings, Cross-lingual Transfer, Data Strategy, Evaluation Benchmarks

137. ❌ Is Semi-Automatic Transcription Useful in Corpus Creation? Preliminary Considerations on the KIParla Corpus

Authors: Martina Simonotti, Ludovica Pannitto, Eleonora Zucchini, Silvia Ballarè, Caterina Mauri Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16258v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

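Scoring rationale: The paper studies the use of automatic speech recognition (ASR) in a corpus transcription workflow and belongs to speech processing and corpus linguistics. It does not involve large models, innovations in deep learning techniques, or AI-for-science applications. None of the keywords, which cover large-model techniques, training methods, inference optimization, alignment, compression, agent systems, and related topics, apply, so every keyword receives a relevance of 0.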

!!! tip deepseek-chat TL;DR

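This study evaluates ASR-assisted transcription in building KIParla, a corpus of spoken Italian. It finds that ASR increases transcription speed but does not consistently improve accuracy, with effects depending on factors such as workflow configuration, conversation type, and transcriber experience.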

摘要翻译

本文分析了自动语音识别技术在KIParla语料库转录工作流程中的应用，KIParla是一个意大利语口语资源库。通过一项两阶段实验，11名专家级和新手级转录员对三种不同类型的对话音频片段分别进行了人工转录和ASR辅助转录，随后通过统计建模、词级对齐和一系列基于标注的指标进行综合分析。结果表明，ASR辅助工作流程能提高转录速度，但并未持续提升整体准确率，其效果受工作流配置、对话类型及标注者经验等多重因素影响。结合对齐指标、描述性统计与统计建模的分析方法，为监测不同标注者及工作流程间的转录行为提供了系统化框架。尽管存在局限，ASR辅助转录（在特定任务微调的支持下）仍可整合至KIParla转录流程中，在保证转录质量的同时加速语料库建设。

摘要 (Abstract)

This paper analyses the implementation of Automatic Speech Recognition (ASR) into the transcription workflow of the KIParla corpus, a resource of spoken Italian. Through a two-phase experiment, 11 expert and novice transcribers produced both manual and ASR-assisted transcriptions of identical audio segments across three different types of conversation, which were subsequently analyzed through a combination of statistical modeling, word-level alignment and a series of annotation-based metrics. Results show that ASR-assisted workflows can increase transcription speed but do not consistently improve overall accuracy, with effects depending on multiple factors such as workflow configuration, conversation type and annotator experience. Analyses combining alignment-based metrics, descriptive statistics and statistical modeling provide a systematic framework to monitor transcription behavior across annotators and workflows. Despite limitations, ASR-assisted transcription, potentially supported by task-specific fine-tuning, could be integrated into the KIParla transcription workflow to accelerate corpus creation without compromising transcription quality.

Keywords: Automatic Speech Recognition, ASR, corpus creation, transcription workflow, KIParla corpus, spoken Italian, transcription accuracy, transcription speed

138. ❌ PyPhonPlan: Simulating phonetic planning with dynamic neural fields and task dynamics

Authors: Sam Kirkham Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16299v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

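Scoring rationale: The paper "PyPhonPlan: Simulating phonetic planning with dynamic neural fields and task dynamics" focuses on speech communication research and develops a Python toolkit for dynamical modeling of phonetic planning. It covers dynamic neural fields, task dynamic simulation, and production/perception loops, which makes it domain-specific tool development within computational linguistics and phonetics. All scoring keywords concern large models, deep learning techniques, or AI for Science, none of which the paper touches: it mentions no language models (large or small), model training or fine-tuning techniques, inference optimization, alignment methods, agent systems, or model compression, and it does not fall under bioinformatics or cheminformatics. Every keyword therefore receives a relevance of 0.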

!!! tip deepseek-chat TL;DR

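This work develops PyPhonPlan, a Python toolkit for modeling phonetic planning with coupled dynamic neural fields and task dynamic simulations, to promote reproducible and extensible computational development in speech communication research.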

摘要翻译

我们推出PyPhonPlan，这是一个用于通过耦合动态神经场与任务动态模拟来实现语音规划动力学模型的Python工具包。该工具包提供模块化组件，用于定义规划场、感知场和记忆场，以及场间耦合、手势输入，并利用场激活剖面来求解声道变量轨迹。我们通过一个示例应用展示该工具包的功能：使用耦合记忆场模拟产生/感知循环，这证明了该框架能够使用时序原则化、神经基础化且语音信息丰富的表征来建模交互式语音动力学。PyPhonPlan作为开源软件发布，包含可执行示例，以促进语音通信研究的可复现性、可扩展性和累积性计算发展。

摘要 (Abstract)

We introduce PyPhonPlan, a Python toolkit for implementing dynamical models of phonetic planning using coupled dynamic neural fields and task dynamic simulations. The toolkit provides modular components for defining planning, perception and memory fields, as well as between-field coupling, gestural inputs, and using field activation profiles to solve tract variable trajectories. We illustrate the toolkit’s capabilities through an example application: simulating production/perception loops with a coupled memory field, which demonstrates the framework’s ability to model interactive speech dynamics using representations that are temporally-principled, neurally-grounded, and phonetically-rich. PyPhonPlan is released as open-source software and contains executable examples to promote reproducibility, extensibility, and cumulative computational development for speech communication research.

Keywords: phonetic planning, dynamic neural fields, task dynamics, speech communication, computational modeling, open-source toolkit, production/perception loops, tract variable trajectories

139. ❌ How to Utilize Complementary Vision-Text Information for 2D Structure Understanding

Authors: Jiancheng Dong, Pengyue Jia, Derong Xu, Jiawei Cheng, Jingyu Peng, Chao Zhang, Bowen Liu, Xin Sun, Lixin Su, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16245v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

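Scoring rationale: The paper's core topic is the use of LLMs for 2D table understanding. It proposes the DiVA-Former architecture to address vision-text information fusion, so it is highly relevant to 'Large Language Models' (10 points). Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and AI for Science, are neither mentioned in the abstract nor relevant, so they score 0.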

!!! tip deepseek-chat TL;DR

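To address the loss of layout information when LLMs linearize 2D tables, this paper proposes DiVA-Former, an architecture that fuses visual and textual information and improves on the pure-text baseline by 23.9% across 13 table benchmarks.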

摘要翻译

大语言模型通常将二维表格线性化为一维序列以适应其自回归架构，这会削弱行列邻接关系及其他布局信息。相比之下，纯视觉编码器虽能捕捉空间线索，却往往难以精确保留单元格文本。我们的分析表明，这两种模态为大语言模型提供了高度差异化的信息，并展现出强烈的互补性。然而，直接拼接或其他融合方法带来的性能提升有限，且常引入跨模态干扰。为解决这一问题，我们提出DiVA-Former——一种轻量级架构，旨在有效整合视觉与文本信息。DiVA-Former利用视觉标记作为动态查询，将长文本序列提炼为紧凑向量，从而高效利用视觉与文本的互补信息。在13个表格基准测试中，DiVA-Former较纯文本基线提升了23.9%，并在仅使用视觉输入、文本输入或两者结合的现有基线基础上实现了持续的性能增益。

摘要 (Abstract)

LLMs typically linearize 2D tables into 1D sequences to fit their autoregressive architecture, which weakens row-column adjacency and other layout cues. In contrast, purely visual encoders can capture spatial cues, yet often struggle to preserve exact cell text. Our analysis reveals that these two modalities provide highly distinct information to LLMs and exhibit strong complementarity. However, direct concatenation and other fusion methods yield limited gains and frequently introduce cross-modal interference. To address this issue, we propose DiVA-Former, a lightweight architecture designed to effectively integrate vision and text information. DiVA-Former leverages visual tokens as dynamic queries to distill long textual sequences into digest vectors, thereby effectively exploiting complementary vision–text information. Evaluated across 13 table benchmarks, DiVA-Former improves upon the pure-text baseline by 23.9% and achieves consistent gains over existing baselines using visual inputs, textual inputs, or a combination of both.

Keywords: Large Language Models, 2D table understanding, vision-text fusion, DiVA-Former, complementary information, table benchmarks, autoregressive architecture, visual encoders
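
The DiVA-Former abstract above starts from the observation that linearizing a 2D table weakens row-column adjacency. The sketch below illustrates only that observation, not the DiVA-Former architecture. It assumes a plain row-major flattening with one position per cell; real serializations add separators and split cells into several tokens, and the helper names `linearize` and `flat_distance` are hypothetical.

```python
def linearize(table):
    """Row-major flattening of a 2D table into a 1D list of cell strings."""
    return [cell for row in table for cell in row]


def flat_distance(table, pos_a, pos_b):
    """Distance between two (row, col) cells in the linearized sequence."""
    n_cols = len(table[0])

    def index(pos):
        row, col = pos
        return row * n_cols + col

    return abs(index(pos_a) - index(pos_b))


table = [
    ["name", "year", "score"],
    ["A", "2024", "0.81"],
    ["B", "2025", "0.87"],
]

sequence = linearize(table)
# Horizontal neighbours stay adjacent in the sequence ...
same_row = flat_distance(table, (1, 1), (1, 2))
# ... but vertical neighbours end up one full row width apart.
same_col = flat_distance(table, (1, 2), (2, 2))
```

In this toy setting the gap between vertically adjacent cells grows with the number of columns, which is one way to picture why column structure becomes harder to recover from a 1D sequence.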

140. ❌ SpecSteer: Synergizing Local Context and Global Reasoning for Efficient Personalized Generation

Authors: Hang Lv, Sheng Liang, Hao Wang, Yongyue Zhang, Hongchao Gu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16219v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

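Scoring rationale: The paper proposes the SpecSteer framework, whose core is collaborative inference between a large model (cloud LLM) and a small model (on-device SLM), so the directly relevant keywords include 'Large Language Models' (10 points) and 'Small Language Models' (10 points). The framework achieves its speedup through speculative decoding, making it highly relevant to 'Speculative Decoding' (10 points). The work involves reasoning verification and logical correction, which gives some connection to 'Chain of Thought' (5 points), 'System 2 Thinking' (5 points), and 'Self-Correction' (5 points). Other keywords such as MoE, training methods, RAG, and quantization do not appear in the abstract and score 0.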

!!! tip deepseek-chat TL;DR

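The paper addresses how to coordinate an on-device small model with a cloud large model for efficient personalized generation while protecting user privacy. It proposes SpecSteer, a framework that uses speculative decoding and Bayesian knowledge fusion to improve generation quality while delivering a 2.36x speedup.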

摘要翻译

实现个性化智能面临一个核心困境：将用户历史发送至中心化大型语言模型会引发隐私担忧，而设备端的小型语言模型则缺乏高质量生成所需的推理能力。我们的初步研究表明，纯本地增强方法仍不足以可靠地弥合这一差距。为此，我们提出SpecSteer——一种非对称协同推理框架，它将私有设备端上下文与云端规模推理能力协同整合。SpecSteer将协同过程建模为贝叶斯知识融合，并将推测解码（speculative decoding）重构为分布式对齐协议，形成“起草—验证—恢复”流程：设备端模型起草个性化序列；云端通过基于比率的机制进行验证，该机制将推理验证与私有上下文解耦，可在不接触原始用户上下文的情况下过滤逻辑缺陷；若序列被拒绝，则通过转向恢复机制在校正过程中注入本地意图。实验表明，SpecSteer成功弥合了推理差距，实现了卓越的个性化生成性能，同时相比标准基线获得了2.36倍的加速效果。

摘要 (Abstract)

Realizing personalized intelligence faces a core dilemma: sending user history to centralized large language models raises privacy concerns, while on-device small language models lack the reasoning capacity required for high-quality generation. Our pilot study shows that purely local enhancements remain insufficient to reliably bridge this gap. We therefore propose SpecSteer, an asymmetric collaborative inference framework that synergizes private on-device context with cloud-scale reasoning. SpecSteer casts collaboration as Bayesian knowledge fusion and repurposes speculative decoding as a distributed alignment protocol, yielding a Draft–Verify–Recover pipeline: the on-device model drafts personalized sequences; the cloud validates via a ratio-based mechanism that decouples reasoning verification from private context, filtering logical flaws without accessing raw user context; upon rejection, a steering recovery injects local intent during correction. Experiments demonstrate that SpecSteer successfully closes the reasoning gap and achieves superior personalized generation performance, while delivering a 2.36x speedup over standard baselines.

Keywords: personalized generation, on-device small language models, cloud large language models, speculative decoding, Bayesian knowledge fusion, privacy-preserving, reasoning gap, inference acceleration
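
The SpecSteer abstract above builds on speculative decoding, where a small model drafts tokens and a larger model verifies them. The sketch below shows only the standard speculative sampling accept/reject rule for a single token. It is not SpecSteer's ratio-based verification or its steering recovery, which the abstract does not specify. The function name `speculative_step` and the dictionary-based distributions are assumptions for illustration.

```python
import random


def speculative_step(draft_probs, target_probs, draft_token, rng):
    """One accept/reject decision of standard speculative sampling.

    draft_probs / target_probs map each token to its probability at the
    current position under the draft and target models respectively.
    Returns (token, accepted).
    """
    q = draft_probs[draft_token]           # draft model probability (> 0)
    p = target_probs.get(draft_token, 0.0)  # target model probability
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = {
        tok: max(0.0, target_probs[tok] - draft_probs.get(tok, 0.0))
        for tok in target_probs
    }
    tokens = list(residual)
    weights = [residual[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False


rng = random.Random(0)
draft = {"a": 0.5, "b": 0.5}
target = {"a": 0.0, "b": 1.0}
token, accepted = speculative_step(draft, target, "a", rng)
```

With this rule the accepted or resampled token is distributed according to the target model, which is why drafting can speed up decoding without changing the output distribution in the standard setting.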

141. ❌ More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification

Authors: Song Tae-Eun Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16244v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

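Scoring rationale: The paper's core topic is an LLM verification method (Cross-Context Review), which directly concerns applying LLM technology, so 'Large Language Models' scores 10. It studies how multi-turn review increases false positives, which relates to 'Hallucination Mitigation' (false positives can be viewed as a form of hallucination) and scores 8. Other keywords such as MoE, quantization, and inference acceleration are not involved and score 0.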

!!! tip deepseek-chat TL;DR

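This study finds that multi-turn Dynamic Cross-Context Review (D-CCR) for LLM verification raises recall but significantly increases false positives and lowers precision, so overall it performs worse than single-pass review.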

摘要翻译

跨上下文审阅（Cross-Context Review, CCR）通过将生成与审阅分离为独立会话，改进了大语言模型的验证效果。一个自然的延伸是多轮审阅：允许审阅者提出后续问题、接收作者回复并再次审阅。我们称此方法为动态跨上下文审阅（Dynamic Cross-Context Review, D-CCR）。在一项涉及30个工件和150个注入错误的对照实验中，我们测试了四种D-CCR变体与单轮CCR基线的性能。单轮CCR（F1 = 0.376）显著优于所有多轮变体，包括包含问答交流的D-CCR-2b（F1 = 0.303，$p < 0.001$，$d = -0.59$）。多轮审阅提高了召回率（+0.08），但产生了多出62%的误报（8.5对比5.2），导致精确率从0.30降至0.20。两种机制导致了这种性能下降：（1）误报压力——当工件的真实错误已被穷尽时，后续轮次的审阅者会编造发现；（2）审阅目标漂移——获得先前问答交流记录的审阅者从审阅工件本身转向批评对话内容。无先前上下文的独立重复审阅（D-CCR-2c）表现最差（F1 = 0.263），证实单纯重复审阅会损害而非提升效果。性能下降源于附加轮次中的误报压力，而非信息量——在多轮审阅条件下，更多信息实际上有益（D-CCR-2b > D-CCR-2a）。问题不在于审阅者看到什么，而在于重复审阅本身引入了噪声。

摘要 (Abstract)

Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, $p < 0.001$, $d = -0.59$). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure – reviewers in later rounds fabricate findings when the artifact’s real errors have been exhausted, and (2) Review Target Drift – reviewers provided with prior Q&A exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from information amount – within multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.

Keywords: Cross-Context Review, LLM verification, multi-turn review, false positives, precision degradation, Dynamic Cross-Context Review, review target drift
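
The D-CCR abstract above reports that a small recall gain (+0.08) was outweighed by a precision drop from 0.30 to 0.20. The sketch below shows why F1 behaves this way. The baseline recall of 0.50 is a hypothetical value chosen for illustration, since the abstract gives only the recall change; the paper's reported F1 values are presumably averaged differently and will not match these numbers exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical baseline recall; only the +0.08 change comes from the abstract.
baseline_recall = 0.50

single_pass = f1_score(precision=0.30, recall=baseline_recall)
multi_turn = f1_score(precision=0.20, recall=baseline_recall + 0.08)

# Roughly 0.375 versus 0.297: the precision loss dominates the recall gain.
f1_drop = single_pass - multi_turn
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two values, so extra false positives hurt more than the additional true positives help in this regime.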

142. ❌ Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning

Authors: Yongyu Mu, Jiali Zeng, Fandong Meng, JingBo Zhu, Tong Xiao Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16206v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

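Scoring rationale: The paper's core topic is improving supervised fine-tuning (SFT) for mathematical reasoning; it proposes the OXA method to optimize the SFT process. It is highly relevant to 'Supervised Fine-tuning (SFT)' (10 points) because SFT is the paper's core technique, highly relevant to 'Chain of Thought (CoT)' (10 points) because it focuses on long-chain mathematical reasoning and CoT trajectories, and highly relevant to 'Large Language Models' (10 points) because the study builds on models such as Qwen2.5-1.5B-Math. Other keywords such as MoE, SLMs, RLHF, and RAG are not covered in the paper and score 0.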

!!! tip deepseek-chat TL;DR

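For long-chain mathematical reasoning, this paper proposes Offline eXploration-Aware (OXA) supervised fine-tuning, which optimizes over teacher-distillation and self-distillation data, improves the mathematical reasoning performance of large language models, and is validated on 6 benchmarks.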

摘要翻译

通过鼓励自我探索，基于可验证奖励的强化学习（RLVR）显著提升了大型语言模型的数学推理能力。作为RLVR的起点，监督微调（SFT）记忆新思维链轨迹的能力提供了关键的初始化，塑造了后续的探索空间。然而，现有研究主要集中于促进RLVR训练期间的探索，而对探索感知的SFT关注不足。为弥补这一空白，我们提出了离线探索感知（OXA）微调方法。具体而言，OXA优化两个目标：一是促进低置信度的已验证教师蒸馏数据，以内化先前未捕获的推理模式；二是抑制高置信度的错误自蒸馏数据，将错误模式的概率质量重新分配给潜在的正确候选。在6个基准测试上的实验结果表明，OXA持续提升了数学推理性能，特别是在Qwen2.5-1.5B-Math模型上，相比传统SFT平均获得了$+6$ Pass@1和$+5$ Pass@$k$的增益。关键的是，OXA提高了初始策略熵，且性能增益在广泛的RLVR训练中持续存在，这证明了OXA的长期价值。

摘要 (Abstract)

Through encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. As the starting point for RLVR, the capacity of supervised fine-tuning (SFT) to memorize new chain-of-thought trajectories provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving exploration-aware SFT under-explored. To bridge this gap, we propose Offline eXploration-Aware (OXA) fine-tuning. Specifically, OXA optimizes two objectives: promoting low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppressing high-confidence incorrect self-distillation data to redistribute probability mass of incorrect patterns toward potentially correct candidates. Experimental results across 6 benchmarks show that OXA consistently improves mathematical reasoning performance, especially achieving an average gain of $+6$ Pass@1 and $+5$ Pass@$k$ points compared to conventional SFT on the Qwen2.5-1.5B-Math. Crucially, OXA elevates initial policy entropy, and performance gains persist throughout extensive RLVR training, demonstrating the long-term value of OXA.

Keywords: supervised fine-tuning, chain-of-thought, mathematical reasoning, large language models, offline exploration, reinforcement learning from verifiable rewards, teacher-distillation, self-distillation
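
The OXA abstract above reports gains in Pass@1 and Pass@$k$. As a reminder of what that metric measures, the sketch below implements the widely used unbiased pass@k estimator introduced with the HumanEval benchmark. Whether the OXA authors compute Pass@$k$ this way is an assumption; the abstract does not say.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem
    c: number of those samples that are correct
    k: budget of attempts being evaluated (k <= n)
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for one problem, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

A benchmark score is then the mean of this estimate over all problems. Higher policy entropy, which the abstract says OXA raises, tends to matter more for larger `k` because diverse samples increase the chance that at least one is correct.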

143. ❌ Are Large Language Models Truly Smarter Than Humans?

Authors: Eshwar Reddy M, Sourav Karmakar Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16197v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

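Scoring rationale: The paper's core topic is evaluating LLM performance on benchmarks, with a focus on data contamination, so it is highly relevant to 'Large Language Models' (10 points). It concerns assessing models' true capabilities, which gives some connection to 'Factuality' (5 points). It does not address the specific techniques or applications behind the other keywords, such as MoE, SLMs, training methods, reasoning techniques, agent systems, or compression, so those keywords score 0.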

!!! tip deepseek-chat TL;DR

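Through three experiments, this paper audits six frontier large language models for data contamination and finds significant contamination of the MMLU benchmark, which inflates measured model performance, particularly in STEM and Philosophy.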

摘要翻译

公开排行榜日益显示，大型语言模型（LLM）在涵盖学术知识、法律和编程的基准测试中超越了人类专家。然而，大多数基准测试数据是完全公开的，其问题在互联网上被广泛复制，这带来了系统性风险：模型可能正是在用于评估它们的数据上进行训练的。本文设计了三个互补的实验，对六款前沿大型语言模型——GPT-4o、GPT-4o-mini、DeepSeek-R1、DeepSeek-V3、Llama-3.3-70B和Qwen3-235B——进行了严格的多方法数据污染审计。实验一将词汇污染检测流程应用于涵盖全部57个科目的513道MMLU（大规模多任务语言理解）问题，发现总体污染率为13.8%（STEM领域为18.1%，哲学领域高达66.7%），并估计各科目因污染带来的性能增益在+0.030至+0.054个准确率点之间。实验二对100道MMLU问题应用了转述与间接指代诊断测试，发现在间接指代条件下，模型准确率平均下降7.0个百分点，在法律与伦理领域降幅高达19.8个百分点。实验三对所有513道问题及全部六个模型应用了TS-Guessing（训练集猜测）行为探针，发现72.5%的问题触发了远高于随机水平的记忆信号，其中DeepSeek-R1表现出一种分布式记忆特征（76.6%部分重构，0%逐字回忆），这解释了其在实验二中表现出的异常模式。三项实验共同指向一致的污染程度排序：STEM > 专业学科 > 社会科学 > 人文学科。

Abstract

Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public, their questions widely mirrored across the internet, creating systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments forming a rigorous multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to 100 MMLU questions, finding accuracy drops by an average of 7.0 percentage points under indirect reference, rising to 19.8 pp in both Law and Ethics. Experiment 3 applies TS-Guessing behavioral probes to all 513 questions and all six models, finding that 72.5% trigger memorization signals far above chance, with DeepSeek-R1 displaying a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous Experiment 2 profile. All three experiments converge on the same contamination ranking: STEM > Professional > Social Sciences > Humanities.

Keywords: Large Language Models, LLMs, contamination audit, benchmark evaluation, MMLU, data contamination, performance assessment, memorization
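The abstract does not say how the lexical contamination pipeline of Experiment 1 works. One common way to detect lexical contamination is word-level n-gram overlap between a benchmark question and reference documents; the sketch below illustrates that idea only. The function names, the 5-gram window, and the 0.5 threshold are all hypothetical choices, not details from the paper.

```python
def ngrams(text, n=5):
    """Return the set of word-level n-grams in `text` (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question, document, n=5):
    """Fraction of the question's n-grams that also occur in the document."""
    q = ngrams(question, n)
    if not q:
        return 0.0  # question shorter than n tokens: nothing to compare
    return len(q & ngrams(document, n)) / len(q)


def is_contaminated(question, corpus, n=5, threshold=0.5):
    """Flag a question if any corpus document shares enough n-grams with it.

    `threshold` is an illustrative value, not one taken from the paper.
    """
    return any(overlap_ratio(question, doc, n) >= threshold for doc in corpus)
```

A per-category contamination rate would then be the share of flagged questions within each subject group.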

144. ❌ Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR

Authors: Quy-Anh Dang, Chris Ngo Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16184v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

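Scoring rationale: The paper focuses on building a multilingual automatic speech recognition (ASR) system, using balanced fine-tuning of the pretrained Qwen3-ASR model to improve recognition across Singapore's four languages. It is highly relevant only to 'Post-training OR Supervised Fine-tuning OR SFT' (10 points), because its core is supervised fine-tuning (SFT) of a pretrained model on public speech corpora. The other keywords, which cover LLM fundamentals, reasoning, alignment, compression, AI for science, and so on, are not addressed, so they score 0.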

!!! tip deepseek-chat TL;DR

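The paper builds Polyglot-Lion, an efficient multilingual automatic speech recognition system, by balanced fine-tuning of the pretrained Qwen3-ASR models. It keeps a competitive average error rate (14.85) while cutting training cost from $18,862 to $81 and running inference about 20x faster.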

Abstract (Chinese translation)

本文提出Polyglot-Lion系列紧凑型多语言自动语音识别（ASR）模型，专为新加坡多语言环境设计，涵盖英语、普通话、泰米尔语和马来语。该系列模型通过纯公开语音语料库对Qwen3-ASR-0.6B和Qwen3-ASR-1.7B进行微调获得，采用平衡采样策略使各语言训练语句数量均等，并刻意省略语言标签条件机制，使模型能够从音频中隐式识别语言。在涵盖四种目标语言的12个基准测试中，Polyglot-Lion-1.7B的平均错误率为14.85，与规模大6倍的MERaLiON-2-10B-ASR模型（14.32）性能相当，而其训练成本仅需单块RTX PRO 6000 GPU花费81美元，远低于128 GPU基线模型的18,862美元。推理吞吐速度达到约0.10秒/样本，较MERaLiON模型的2.02秒/样本提升近20倍。这些结果表明，对中等规模预训练模型进行语言平衡的微调，能以极低成本获得可部署的多语言ASR系统，其性能可与大型专用系统相媲美。

Abstract

We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32) - a model 6x larger - while incurring a training cost of $81 on a single RTX PRO 6000 GPU compared to $18,862 for the 128-GPU baseline. Inference throughput is approximately 20x faster than MERaLiON at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.

Keywords: multilingual ASR, fine-tuning, Qwen3-ASR, balanced sampling, speech recognition, efficient training, Singapore languages, inference acceleration
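The abstract describes a balanced sampling strategy that equalizes the number of training utterances per language. A minimal sketch of one way to do this follows, assuming downsampling to the smallest language; the paper may balance differently (for example by upsampling), and `balance_by_language` and the `(language, audio_path)` record layout are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_language(utterances, seed=0):
    """Downsample so every language contributes the same number of utterances.

    `utterances` is a list of (language, audio_path) pairs. Downsampling to the
    smallest language is an assumption, not the paper's documented procedure.
    """
    if not utterances:
        return []

    by_lang = defaultdict(list)
    for lang, path in utterances:
        by_lang[lang].append(path)

    # Every language is cut down to the size of the smallest one.
    target = min(len(paths) for paths in by_lang.values())
    rng = random.Random(seed)

    balanced = []
    for lang, paths in by_lang.items():
        for path in rng.sample(paths, target):
            balanced.append((lang, path))
    rng.shuffle(balanced)
    return balanced
```

The language label is used here only to build the balanced set. In line with the abstract, it would not be passed to the model as a conditioning tag.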

145. ❌ Open-Source Reproduction and Explainability Analysis of Corrective Retrieval Augmented Generation

Authors: Surya Vardhan Yalavarthi Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16169v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

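Scoring rationale: The paper's core topic is Corrective Retrieval Augmented Generation (CRAG), so it is highly relevant to 'Retrieval-Augmented Generation' (10 points). It involves both Phi-3-mini (a small language model) and LLaMA-2 (a large language model), so it is relevant to 'Large Language Models' and 'Small Language Models' (8 points each). It performs a SHAP explainability analysis, which relates to 'Mechanistic Interpretability' (8 points). It evaluates on science questions (ARC-Challenge), which gives it some connection to 'AI for Science' (5 points). Other keywords such as MoE, Scaling Laws, and RLHF are not covered and score 0.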

!!! tip deepseek-chat TL;DR

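The paper presents an open-source reproduction of Corrective Retrieval Augmented Generation (CRAG), replacing the proprietary web search with the Wikipedia API and the original LLaMA-2 generator with Phi-3-mini. It reaches comparable performance on PopQA and ARC-Challenge, and its SHAP analysis, the first for CRAG's retrieval evaluator, shows that the evaluator relies mainly on named entity alignment rather than semantic similarity.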

Abstract (Chinese translation)

修正性检索增强生成（CRAG）通过评估检索文档质量并触发修正操作，提升了RAG系统的鲁棒性。然而，其原始实现依赖于包括谷歌搜索API和闭源模型权重在内的专有组件，限制了可复现性。本研究提出了一个完全开源的CRAG复现版本，使用维基百科API替代了专有网络搜索，并以Phi-3-mini-4k-instruct模型替换了原始的LLaMA-2生成器。我们在PopQA和ARC-Challenge数据集上进行评估，证明我们的开源流程达到了与原始系统相当的性能。此外，我们首次利用SHAP方法对CRAG基于T5的检索评估器进行可解释性分析，揭示该评估器主要依赖命名实体对齐而非语义相似性。我们的分析识别了关键失效模式，包括在科学问题上的领域迁移局限性。所有代码与结果已发布于https://github.com/suryayalavarthi/crag-reproduction。

Abstract

Corrective Retrieval Augmented Generation (CRAG) improves the robustness of RAG systems by evaluating retrieved document quality and triggering corrective actions. However, the original implementation relies on proprietary components including the Google Search API and closed model weights, limiting reproducibility. In this work, we present a fully open-source reproduction of CRAG, replacing proprietary web search with the Wikipedia API and the original LLaMA-2 generator with Phi-3-mini-4k-instruct. We evaluate on PopQA and ARC-Challenge, demonstrating that our open-source pipeline achieves comparable performance to the original system. Furthermore, we contribute the first explainability analysis of CRAG’s T5-based retrieval evaluator using SHAP, revealing that the evaluator primarily relies on named entity alignment rather than semantic similarity. Our analysis identifies key failure modes including domain transfer limitations on science questions. All code and results are available at https://github.com/suryayalavarthi/crag-reproduction.

Keywords: Corrective Retrieval Augmented Generation, RAG, open-source reproduction, explainability analysis, SHAP, Phi-3-mini, retrieval evaluator, domain transfer limitations
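The abstract says CRAG evaluates retrieved document quality and triggers corrective actions, without giving the decision rule. The sketch below assumes a three-way scheme (correct / incorrect / ambiguous) driven by two score thresholds, which is how the original CRAG design is commonly described. `choose_action` and both threshold values are hypothetical, and the evaluator scores would come from the T5-based retrieval evaluator rather than from this code.

```python
def choose_action(scores, upper=0.6, lower=0.2):
    """Map retrieval-evaluator scores to a corrective action.

    `scores` holds one relevance score per retrieved document. `upper` and
    `lower` are illustrative thresholds, not values from the paper.
    """
    if not scores:
        return "incorrect"   # nothing retrieved: fall back to external search
    best = max(scores)
    if best >= upper:
        return "correct"     # at least one document is trusted: refine and use it
    if best < lower:
        return "incorrect"   # every document is weak: discard and search instead
    return "ambiguous"       # in between: combine refined documents with search
```

In the reproduction described here, the search fallback would go through the Wikipedia API instead of a proprietary web search.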

146. ❌ STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

Authors: Suvajit Patra, Soumitra Samanta Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16163v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

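Scoring rationale: The paper focuses on continuous sign language recognition (CSLR) and proposes a keypoint-based spatio-temporal attention network. Although it belongs to computer vision and deep learning, its content is unrelated to every scoring keyword (all of which center on large-model techniques, training methods, inference optimization, alignment, agents, and so on). It does not involve large language models, model training techniques, reasoning methods, or concrete AI for Science applications in any form.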

!!! tip deepseek-chat TL;DR

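The paper proposes a unified spatio-temporal attention network for continuous sign language recognition that cuts encoder parameters by 70-80% while matching the performance of keypoint-based methods.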

Abstract (Chinese translation)

连续手语识别是理解聋人群体语言的关键任务。当前基于关键点的方法通常依赖于时空编码，其中关键点间的空间交互通过图卷积网络或注意力机制建模，而时间动态则使用一维卷积网络捕捉。然而，此类设计往往在编码器和解码器中引入大量参数。本文提出一种统一的时空注意力网络，该网络在空间维度（跨关键点）和时间维度（局部窗口内）同时计算注意力分数，并通过特征聚合生成局部上下文感知的时空表征。所提出的编码器参数量较现有最优模型减少约70-80%，同时在Phoenix-14T数据集上取得了与基于关键点方法相当的性能。

Abstract

Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance to keypoint-based methods on the Phoenix-14T dataset.

Keywords: Continuous Sign Language Recognition, CSLR, Spatio-temporal Attention, Keypoint-based Methods, Parameter Reduction, Phoenix-14T Dataset, Attention Mechanisms

147. ❌ HIPO: Instruction Hierarchy via Constrained Reinforcement Learning

Authors: Keru Chen, Jun Luo, Sen Lin, Yingbin Liang, Alvaro Velasquez, Nathaniel Bastian, Shaofeng Zou Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16152v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

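Scoring rationale: The paper's core topic is LLM alignment; it proposes the HIPO framework to address hierarchical instruction following. It is highly relevant to LLMs, Alignment, and RLHF/DPO (10 points each), because it directly studies LLM alignment methods and contrasts them with the limitations of RLHF and DPO. It is relevant to SFT (8 points), because it discusses the limitations of SFT and uses it as a point of comparison. It is related to LLM Agents (5 points), because hierarchical instruction following applies to complex workflows, and to Mechanistic Interpretability (5 points), because the paper includes a mechanistic analysis. Other keywords such as MoE, SLMs, and RAG are unrelated to the paper (0 points).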

!!! tip deepseek-chat TL;DR

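The paper proposes HIPO, a framework that uses constrained reinforcement learning to address system prompt compliance in hierarchical instruction following for large language models, and reports significant improvements in both system compliance and user utility.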

Abstract (Chinese translation)

分层指令跟随（Hierarchical Instruction Following, HIF）指通过优先级排序的指令栈来提示大语言模型的问题。传统方法如RLHF和DPO通常在此问题上失效，因为它们主要针对单一目标进行优化，未能显式强制系统提示的遵从性。同时，监督微调依赖于模仿经过筛选的合规数据，这无法在算法层面建立优先级的不对称性。本文提出HIPO，一种新颖的对齐框架，将HIF建模为约束马尔可夫决策过程。HIPO将系统提示从单纯的输入上下文提升为严格的算法边界。通过原始对偶安全强化学习方法，该算法将系统提示遵从性作为显式约束动态执行，严格在可行区域内最大化用户效用。在不同模型架构（如Qwen、Phi、Llama）上的广泛评估表明，HIPO显著提升了系统遵从性和用户效用。此外，机制分析表明，这种约束优化能自主驱动模型将其注意力转向长程系统标记，为复杂工作流中可靠的大语言模型部署提供了原理性基础。

Abstract

Hierarchical Instruction Following (HIF) refers to the problem of prompting large language models with a priority-ordered stack of instructions. Standard methods like RLHF and DPO typically fail in this problem since they mainly optimize for a single objective, failing to explicitly enforce system prompt compliance. Meanwhile, supervised fine-tuning relies on mimicking filtered, compliant data, which fails to establish the priority asymmetry at the algorithmic level. In this paper, we introduce HIPO, a novel alignment framework that formulates HIF as a Constrained Markov Decision Process. HIPO elevates system prompts from mere input context to strict algorithmic boundaries. Using a primal-dual safe reinforcement learning approach, the algorithm dynamically enforces system prompt compliance as an explicit constraint, maximizing user utility strictly within this feasible region. Extensive evaluations across diverse model architectures (e.g., Qwen, Phi, Llama) demonstrate that HIPO significantly improves both system compliance and user utility. Furthermore, mechanistic analysis reveals that this constrained optimization autonomously drives the model to shift its attention toward long-range system tokens, providing a principled foundation for reliable LLM deployment in complex workflows.

Keywords: Hierarchical Instruction Following, Constrained Reinforcement Learning, LLM Alignment, System Prompt Compliance, Primal-Dual Safe RL, Constrained Markov Decision Process, User Utility, Mechanistic Analysis
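The abstract describes a primal-dual approach that maximizes user utility while treating system prompt compliance as an explicit constraint. The toy sketch below shows the general primal-dual pattern on a one-dimensional problem. It is not HIPO's algorithm and involves no reinforcement learning: the utility function, the constraint, and the step sizes are all made up for illustration.

```python
def primal_dual(steps=5000, lr_x=0.05, lr_lam=0.05, limit=1.0):
    """Maximize u(x) = -(x - 3)^2 subject to the constraint x <= limit.

    Stand-in for "maximize utility subject to a compliance constraint";
    the objective and constraint here are purely illustrative.
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: gradient ascent on the Lagrangian u(x) - lam * (x - limit).
        grad_x = -2.0 * (x - 3.0) - lam
        x += lr_x * grad_x
        # Dual step: raise lam while the constraint is violated, never below 0.
        lam = max(0.0, lam + lr_lam * (x - limit))
    return x, lam
```

For this toy problem the iterates should settle near x = 1, the constraint boundary, with a multiplier near 4, the slope of the utility at that boundary. The multiplier plays the role of an automatically tuned penalty weight, in contrast with a fixed trade-off between two objectives.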

148. ❌ Answer Bubbles: Information Exposure in AI-Mediated Search

Authors: Michelle Huang, Agam Goyal, Koustuv Saha, Eshwar Chandrasekharan Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16138v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

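Scoring rationale: The paper studies how generative search systems (such as GPT and Google AI Overviews) differ from traditional search in how they present information, focusing on source-selection bias, linguistic characteristics, and fidelity of AI-generated summaries. It is relevant to 'Large Language Models' (8 points), because the systems are built on large models such as GPT; relevant to 'Retrieval-Augmented Generation' (8 points), because it involves retrieval-augmented generative search; highly relevant to 'Hallucination Mitigation' (10 points), because it examines the fidelity and factuality of AI summaries with respect to their sources; and partially related to 'Explainable AI' (5 points), because it touches on system transparency and bias analysis. The other keywords mainly concern the technical foundations of large models or specific application domains and are unrelated to the paper's empirical evaluation focus.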

!!! tip deepseek-chat TL;DR

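The paper compares generative search systems (such as GPT and Google AI Overviews) with traditional search in source selection, linguistic characteristics, and summary fidelity. It finds significant source-selection bias and shifts in linguistic style in the AI systems, which may create 'answer bubbles' and affect user trust and information transparency.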

Abstract (Chinese translation)

生成式搜索系统正日益以人工智能生成的摘要取代基于链接的检索，然而人们对于这些系统在信息来源、语言特征及对引用材料的忠实度方面存在何种差异知之甚少。本研究从三个层面考察了四种系统——原始GPT、搜索GPT、谷歌AI概览以及传统谷歌搜索——对11,000条真实搜索查询的响应：来源多样性、生成摘要的语言特征，以及来源与摘要的忠实度。我们发现生成式搜索系统在引用中存在显著的来源选择偏见，倾向于优先引用特定来源。引入搜索功能会选择性弱化认知标记，在保留人工智能生成摘要中确定性语言的同时，将模糊表述减少高达60%。与此同时，AI摘要进一步加剧了引用偏见：维基百科和篇幅较长的来源被过度呈现，而引用的社交媒体内容与负面框架的来源则明显呈现不足。我们的研究结果揭示了答案泡沫存在的可能性，即相同查询在不同系统中会产生结构差异化的信息现实，这对用户信任、来源可见度以及人工智能中介信息访问的透明度具有重要影响。

Abstract

Generative search systems are increasingly replacing link-based retrieval with AI-generated summaries, yet little is known about how these systems differ in sources, language, and fidelity to cited material. We examine responses to 11,000 real search queries across four systems – vanilla GPT, Search GPT, Google AI Overviews, and traditional Google Search – at three levels: source diversity, linguistic characterization of the generated summary, and source-summary fidelity. We find that generative search systems exhibit significant source-selection biases in their citations, favoring certain sources over others. Incorporating search also selectively attenuates epistemic markers, reducing hedging by up to 60% while preserving confidence language in the AI-generated summaries. At the same time, AI summaries further compound the citation biases: Wikipedia and longer sources are disproportionately overrepresented, whereas cited social media content and negatively framed sources are substantially underrepresented. Our findings highlight the potential for "answer bubbles", in which identical queries yield structurally different information realities across systems, with implications for user trust, source visibility, and the transparency of AI-mediated information access.

Keywords: generative search systems, AI-generated summaries, source-selection biases, citation biases, answer bubbles, information exposure, user trust, transparency

Authors: Agam Goyal, Olivia Pal, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16128v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

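Scoring rationale: The paper's core topic is the community dynamics of LLM-based autonomous agents on social platforms, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each), and it also involves multi-agent systems (10 points). Other keywords such as MoE, SFT, and RAG concern specific techniques or application areas that the paper does not directly address, so they score 0.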

!!! tip deepseek-chat TL;DR

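Through a large-scale empirical comparison of an AI-agent community (Moltbook) with a human community (Reddit), the study characterizes the distinctive dynamics of AI agents on social platforms, including extreme participation inequality, emotional flattening, a shift in cognitive style, and community homogenization driven by shared authorship.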

Abstract (Chinese translation)

随着基于大型语言模型（LLM）的自主智能体日益活跃于社交平台，理解AI智能体社群的动态对传播研究和平台治理均至关重要。本文首次对AI智能体与人类在线社群进行了大规模实证比较，分析了五个匹配社群中的73,899条Moltbook帖文与189,838条Reddit帖文。在结构层面，我们发现Moltbook表现出极端的参与不平等（基尼系数为0.84，而Reddit为0.47）和较高的跨社群作者重叠率（33.8% 对比 0.5%）。在语言特征方面，AI智能体生成的内容情感趋于扁平化，认知风格偏向断言而非探索，且社会联系性较弱。这些差异导致了明显的社群层面同质化现象，但我们证明这主要是共享作者身份造成的结构性假象。在作者个体层面，由于极端发帖量放大了其异常风格特征，单个智能体比人类用户更具可识别性。随着AI中介传播重塑在线话语，本研究为理解多智能体互动如何催生不同于人类社群的集体传播动态提供了实证基础。

Abstract

As autonomous LLM-based agents increasingly populate social platforms, understanding the dynamics of AI-agent communities becomes essential for both communication research and platform governance. We present the first large-scale empirical comparison of AI-agent and human online communities, analyzing 73,899 Moltbook and 189,838 Reddit posts across five matched communities. Structurally, we find that Moltbook exhibits extreme participation inequality (Gini = 0.84 vs. 0.47) and high cross-community author overlap (33.8% vs. 0.5%). In terms of linguistic attributes, content generated by AI-agents is emotionally flattened, cognitively shifted toward assertion over exploration, and socially detached. These differences give rise to apparent community-level homogenization, but we show this is primarily a structural artifact of shared authorship. At the author level, individual agents are more identifiable than human users, driven by outlier stylistic profiles amplified by their extreme posting volume. As AI-mediated communication reshapes online discourse, our work offers an empirical foundation for understanding how multi-agent interaction gives rise to collective communication dynamics distinct from those of human communities.

Keywords: LLM-based agents, AI-agent communities, social platforms, multi-agent interaction, communication dynamics, empirical comparison, Moltbook, Reddit
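The abstract quantifies participation inequality with the Gini coefficient (0.84 for Moltbook vs. 0.47 for Reddit). As a reminder of what that number measures, here is a minimal sketch of the standard rank-weighted Gini formula applied to posts-per-author counts. The function name and inputs are illustrative; the paper's exact estimator is not stated in the abstract.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula on ascending values with ranks 1..n:
    # G = 2 * sum(rank * x) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

With posts-per-author counts as input, a value near 0 means authors post about equally, while a value approaching 1 means a few authors produce most of the posts.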

150. ❌ SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era

Authors: Han Jang, Junhyeok Lee, Kyu Sung Choi Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16131v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

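Scoring rationale: The paper mainly builds a hierarchical summarization benchmark for scientific literature in the LLM era and analyzes how LLMs affect scientific writing style. The core relevant keywords are 'Large Language Models' (the paper explicitly studies LLMs for scientific summarization and the change in writing before and after the LLM era) and 'AI for Science' (it builds an AI benchmark for the scientific domain and analyzes AI's effect on scientific writing). The other keywords concern specific LLM techniques (such as MoE, SFT, and RAG), reasoning methods, or optimization techniques, none of which the paper covers in detail, so they score 0.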

!!! tip deepseek-chat TL;DR

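To address scientific information overload in the LLM era, the paper builds SciZoom, a hierarchical scientific summarization benchmark of 44,946 papers, and finds that LLM-assisted writing makes scientific prose more confident but more homogenized.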

Abstract (Chinese translation)

人工智能研究的爆炸式增长造成了前所未有的信息过载，推动了超越传统摘要的多粒度科学文献摘要需求。尽管大语言模型越来越多地被用于摘要生成，但现有基准数据集仍存在规模有限、仅针对单一粒度且大多发布于大语言模型时代之前的问题。此外，自2022年11月ChatGPT发布以来，研究人员已迅速采用大语言模型辅助手稿撰写，这从根本上改变了科学写作方式，但目前尚无资源可用于分析这种写作模式的演变。为填补这些空白，我们推出了SciZoom基准数据集，该数据集包含来自NeurIPS、ICLR、ICML和EMNLP四大顶级机器学习会议2020至2025年间的44,946篇论文，并明确划分为“前大语言模型时代”与“后大语言模型时代”两个阶段。SciZoom提供三个层次化的摘要生成目标（Abstract摘要、Contributions贡献点与TL;DR极简摘要），最高可实现600:1的压缩比，既能支持多粒度摘要研究，又能用于科学写作模式的时序挖掘。我们的语言学分析揭示了显著的短语模式变化（程式化表达的使用增长高达10倍）和修辞风格转变（模糊性表达减少23%），这表明大语言模型辅助写作产生了更自信但同质化的文本。SciZoom既是一个具有挑战性的基准测试平台，也是挖掘生成式人工智能时代科学话语演变的独特资源。我们的代码与数据集已分别公开于GitHub（https://github.com/janghana/SciZoom）和Hugging Face（https://huggingface.co/datasets/hanjang/SciZoom）平台。

Abstract

The explosive growth of AI research has created unprecedented information overload, increasing the demand for scientific summarization at multiple levels of granularity beyond traditional abstracts. While LLMs are increasingly adopted for summarization, existing benchmarks remain limited in scale, target only a single granularity, and predate the LLM era. Moreover, since the release of ChatGPT in November 2022, researchers have rapidly adopted LLMs for drafting manuscripts themselves, fundamentally transforming scientific writing, yet no resource exists to analyze how this writing has evolved. To bridge these gaps, we introduce SciZoom, a benchmark comprising 44,946 papers from four top-tier ML venues (NeurIPS, ICLR, ICML, EMNLP) spanning 2020 to 2025, explicitly stratified into Pre-LLM and Post-LLM eras. SciZoom provides three hierarchical summarization targets (Abstract, Contributions, and TL;DR) achieving compression ratios up to 600:1, enabling both multi-granularity summarization research and temporal mining of scientific writing patterns. Our linguistic analysis reveals striking shifts in phrase patterns (up to 10x for formulaic expressions) and rhetorical style (23% decline in hedging), suggesting that LLM-assisted writing produces more confident yet homogenized prose. SciZoom serves as both a challenging benchmark and a unique resource for mining the evolution of scientific discourse in the generative AI era. Our code and dataset are publicly available on GitHub (https://github.com/janghana/SciZoom) and Hugging Face (https://huggingface.co/datasets/hanjang/SciZoom), respectively.

Keywords: scientific summarization, large language models, benchmark, hierarchical summarization, scientific writing evolution, LLM era, information overload, multi-granularity

151. ❌ Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning

Authors: Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, Jun Suzuki Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16127v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

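Scoring rationale: The paper studies learning-rate scheduling in LLM pre-training (specifically the Warmup-Stable-Only schedule) and its effect on performance after supervised fine-tuning, so it is highly relevant to 'Large Language Models', 'Pre-training', and 'Supervised Fine-tuning' (10 points each). It does not address the techniques behind the other keywords, such as MoE, quantization, RAG, or alignment, so those are scored 0.

!!! tip deepseek-chat TL;DR

The paper finds that pre-training large language models with a Warmup-Stable-Only schedule, which applies no learning-rate decay, yields flatter loss minima than conventional decay-based schedules and therefore better downstream performance after supervised fine-tuning.

Abstract translation (Chinese)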

本研究探讨了学习率调度在大语言模型大规模预训练中的作用，重点关注其对监督微调后下游任务性能的影响。基于衰减的学习率调度器被广泛用于最小化预训练损失。然而，尽管其应用普遍，这些调度器如何影响监督微调后的性能仍未得到充分研究。本文研究了预热后保持恒定学习率、不进行任何衰减的Warmup-Stable-Only调度器。通过对10亿和80亿参数模型的实验，我们发现，尽管基于衰减的调度器在预训练后可能表现出更好的性能，但WSO在监督微调后的性能上始终优于前者。这一结论在训练中期和过度训练的不同阶段均成立。损失景观分析进一步揭示，基于衰减的调度器将模型导向更尖锐的极小值，而WSO则能保持更平坦的极小值，从而支持模型的适应性。这些发现表明，应用学习率衰减来优化预训练指标可能会损害下游任务的适应性。我们的工作也为训练和模型发布策略提供了实用指导，强调使用WSO进行预训练能增强模型对下游任务的适应能力。

Abstract

We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training. The result also holds across different regimes with mid-training and over-training. Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability. These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability. Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.

Keywords: Large Language Models, Pre-training, Learning Rate Scheduling, Supervised Fine-tuning, Warmup-Stable-Only, Loss Landscape, Downstream Adaptability, Model Training Strategy
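
The contrast between Warmup-Stable-Only and a decay-based schedule in the abstract above can be illustrated with a minimal Python sketch. It assumes linear warmup and uses cosine decay as a stand-in for the decay-based baseline; the abstract specifies neither choice, and the function names and hyperparameters (`peak_lr`, `warmup_steps`, `total_steps`, `min_lr`) are hypothetical rather than taken from the paper.

```python
import math


def wso_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Warmup-Stable-Only: linear warmup (assumed), then a constant rate."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr  # no decay phase at all


def cosine_decay_lr(
    step: int,
    peak_lr: float,
    warmup_steps: int,
    total_steps: int,
    min_lr: float = 0.0,
) -> float:
    """Decay-based baseline (assumed cosine): same warmup, then decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the two schedules agree during warmup and diverge afterwards: `wso_lr` stays at `peak_lr`, while `cosine_decay_lr` falls to `min_lr` by `total_steps`.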

152. ❌ SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

Authors: Songcheng Cai, Zhiheng Lyu, Yuansheng Ni, Xiangchao Chen, Baichuan Zhou, Shenzhe Zhu, Yi Lu, Haozhe Wang, Chi Ruan, Benjamin Schneider, Weixu Zhang, Xiang Li, Andy Zheng, Yuyu Zhang, Ping Nie, Wenhu Chen Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16124v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

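Scoring rationale: The paper studies LLMs applied to code understanding, with a focus on agentic workflows and training methods. Highly relevant keywords: LLMs (referenced throughout), SFT and RLAIF (the core training recipe), LLM Agents and Tool Use (central to agentic codebase exploration), and Small Language Models (a Qwen3-8B model is trained). Chain of Thought and System 2 Thinking relate to agentic reasoning but are not the core focus. Other keywords, such as MoE, Scaling Laws, and PEFT, are not addressed.

!!! tip deepseek-chat TL;DR

To address the lack of reliable benchmarks for repository-level code understanding, the paper introduces the SWE-QA-Pro benchmark and a two-stage training recipe (SFT followed by RLAIF), with which a Qwen3-8B model surpasses GPT-4o by 2.3 points on agentic code understanding.

Abstract translation (Chinese)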

具备自主能力的仓库级代码理解对于自动化复杂软件工程任务至关重要，然而该领域缺乏可靠的基准测试。现有评估往往忽视长尾主题，并依赖大型语言模型（LLMs）可通过记忆知识进行作弊的流行代码库。为解决这一问题，我们引入了SWE-QA-Pro——一个基于多样化、长尾仓库构建并配备可执行环境的基准测试。我们通过问题驱动的聚类方法强制实现主题平衡，以覆盖代表性不足的任务类型，并应用严格的难度校准流程：过滤掉可通过直接回答基线解决的问题。由此产生的数据集中，自主工作流的表现显著优于直接回答（例如Claude Sonnet 4.5模型存在约13分的差距），证实了自主探索代码库的必要性。此外，针对此类复杂行为训练数据稀缺的挑战，我们提出了一种可扩展的合成数据生成流程，支持两阶段训练方案：首先进行监督微调（SFT），随后实施基于人工智能反馈的强化学习（RLAIF）。该方法使小型开源模型能够学习高效的工具使用和推理能力。实验表明，采用本方案训练的Qwen3-8B模型在SWE-QA-Pro基准上超越GPT-4o达2.3分，并大幅缩小了与顶尖专有模型的差距，这既验证了我们评估体系的有效性，也证明了自主智能体训练工作流程的优越性。

Abstract

Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook the long tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.

Keywords: repository-level code understanding, agentic workflow, benchmark, Supervised Fine-Tuning, Reinforcement Learning from AI Feedback, tool usage, reasoning, small open models
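
The difficulty calibration step in the abstract above (filtering out questions that a direct-answer baseline can solve) can be sketched as a simple filter. This is an illustration only: the `Question` type, the `direct_answer` baseline, and the `is_correct` judge are hypothetical placeholders, and the abstract does not say how correctness is judged.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Question:
    qid: str
    text: str
    reference: str  # reference answer used by the (hypothetical) judge


def calibrate_difficulty(
    questions: List[Question],
    direct_answer: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> List[Question]:
    """Keep only the questions the direct-answer baseline fails to solve."""
    kept = []
    for q in questions:
        # The baseline answers from the question text alone: no tools,
        # no repository exploration.
        prediction = direct_answer(q.text)
        if not is_correct(prediction, q.reference):
            kept.append(q)
    return kept
```

The questions that survive the filter are, by construction, the ones for which memorized knowledge is not enough, which is the property the benchmark relies on.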

153. ❌ Language Models Don’t Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

Authors: Nishant Balepur, Malachi Hamada, Varsha Kishore, Sergey Feldman, Amanpreet Singh, Pao Siangliulue, Joseph Chee Chang, Eunsol Choi, Jordan Lee Boyd-Graber, Aakanksha Naik Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16120v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

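Scoring rationale: The paper centers on MyScholarQA, a personalized Deep Research tool, in the same class as OpenAI DR, that uses LLMs to synthesize scientific papers for researchers, so it is highly relevant to 'Large Language Models' (10 points). The work applies AI to scientific research, so it is also highly relevant to 'AI for Science' (10 points). The system involves retrieval-augmented generation (RAG) and agentic workflow (LLM Agents) concepts, each scored 5. Other keywords, such as MoE, SFT, and RLHF, are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses Deep Research tools' lack of user understanding by building MyScholarQA, a personalized tool. It evaluates the tool with a benchmark of synthetic users and LLM judges and with interviews of real users; the interviews reveal nine nuanced errors that the LLM judges cannot detect, underscoring that progress in personalization requires real users.

Abstract translation (Chinese)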

深度研究工具（如OpenAI DR）帮助研究者应对日益增长的文献数量。这类工具能够综合科学论文以回答研究者的查询，但缺乏对用户的理解。我们在MyScholarQA（MySQA）中改变了这一现状，这是一个个性化的深度研究工具，其功能包括：1）推断用户研究兴趣的画像；2）针对用户输入的查询提出个性化行动建议；3）根据用户认可的行动方案，为查询撰写多章节报告。我们首先采用自然语言处理的标准流程测试MySQA：设计了一个合成用户与大型语言模型评判者的基准测试，结果显示MySQA在引用指标和个性化行动遵循方面均优于基线模型。然而，我们怀疑这一流程未能涵盖个性化深度研究用户所重视的所有层面，因此通过MySQA的在线版本访谈真实用户以揭示深层需求。我们发现了九类大型语言模型评判者无法检测的个性化深度研究细微错误，并通过分析质性反馈为未来深度研究工具设计总结经验。总体而言，我们主张建立一种易于使用的大型语言模型评判者可能导致自然语言处理领域忽视的个性化支柱：唯有通过真实用户参与，个性化研究才能取得实质进展。

Abstract

Deep Research (DR) tools (e.g. OpenAI DR) help researchers cope with ballooning publishing counts. Such tools can synthesize scientific papers to answer researchers’ queries, but lack understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user’s research interests; 2) proposes personalized actions for a user’s input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP’s standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR users value, so we interview users in an online version of MySQA to unmask them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.

Keywords: Deep Research, personalization, Large Language Models, MyScholarQA, user evaluation, scientific synthesis, research tools, AI for science

154. ❌ ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning

Authors: Tik Yu Yim, Wenting Tan, Sum Yee Chan, Tak-Wah Lam, Siu Ming Yiu Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16112v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

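Scoring rationale: The paper studies domain adaptation of LLMs to financial reasoning and proposes the ASDA framework, which improves financial reasoning without modifying model weights by automatically generating structured skill artifacts (reasoning procedures, code templates, and worked examples). Highly relevant keywords: LLMs (the core subject), Chain of Thought (multi-step reasoning), System 2 Thinking (in-depth reasoning), Self-Correction (improvement through error analysis), LLM Agents (teacher-student model interaction), In-context Learning (skill files injected dynamically at inference), and Domain Adaptation (adaptation to finance). Other keywords, such as MoE, SLMs, and Scaling Laws, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To adapt large language models to financial reasoning, the paper proposes the ASDA framework, which automatically generates structured skill artifacts without modifying model weights and achieves gains of up to +17.33% on arithmetic reasoning and +5.95% on non-arithmetic reasoning on the FAMMA benchmark.

Abstract translation (Chinese)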

使大型语言模型（LLM）适应专业金融推理通常需要进行昂贵的微调，这会产生模型锁定的专业知识。无需训练的方法已经出现，但我们的实验表明，主流方法（GEPA和ACE）在FAMMA金融推理基准测试中仅取得边际收益，揭示了非结构化文本优化在复杂、多步骤领域推理中的局限性。我们提出了自动化技能提炼与适配（ASDA, Automated Skill Distillation and Adaptation）框架，该框架通过迭代式纠错学习自动生成结构化技能构件，而无需修改模型权重。一个教师模型分析学生模型在金融推理任务上的失败案例，按子领域和错误类型对错误进行聚类，并合成包含推理流程、代码模板和示例解答的技能文件，这些文件在推理过程中被动态注入。在FAMMA上的评估显示，ASDA在算术推理任务上实现了高达+17.33%的提升，在非算术推理任务上提升了+5.95%，显著优于所有无需训练的基线方法。生成的技能构件具有人类可读性、版本可控性，并与Agent Skills开放标准兼容，为任何拥有标注领域数据集的组织提供了一条无需访问权重或重新训练即可实现领域适配的、实用且可审计的路径。

Abstract

Adapting large language models (LLMs) to specialized financial reasoning typically requires expensive fine-tuning that produces model-locked expertise. Training-free alternatives have emerged, yet our experiments show that leading methods (GEPA and ACE) achieve only marginal gains on the FAMMA financial reasoning benchmark, exposing the limits of unstructured text optimization for complex, multi-step domain reasoning. We introduce Automated Skill Distillation and Adaptation (ASDA), a framework that automatically generates structured skill artifacts through iterative error-corrective learning without modifying model weights. A teacher model analyzes a student model’s failures on financial reasoning tasks, clusters errors by subfield and error type, and synthesizes skill files containing reasoning procedures, code templates, and worked examples, which are dynamically injected during inference. Evaluated on FAMMA, ASDA achieves up to +17.33% improvement on arithmetic reasoning and +5.95% on non-arithmetic reasoning, substantially outperforming all training-free baselines. The resulting skill artifacts are human-readable, version-controlled, and compatible with the Agent Skills open standard, offering any organization with a labeled domain dataset a practical and auditable path to domain adaptation without weight access or retraining.

Keywords: Large Language Models, Financial Reasoning, Domain Adaptation, Automated Skill Distillation, Training-free Methods, Multi-step Reasoning, Error-corrective Learning, Agent Skills
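
The error-clustering and skill-injection steps in the abstract above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' pipeline: it assumes each failure has already been labeled with a subfield and an error type (in the paper a teacher model performs that analysis), the names `Failure`, `cluster_failures`, and `select_skills` are hypothetical, and the skill texts are plain strings rather than files in the Agent Skills format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Failure:
    question_id: str
    subfield: str    # e.g. "derivatives" (illustrative value)
    error_type: str  # e.g. "formula_misuse" (illustrative value)


def cluster_failures(
    failures: List[Failure],
) -> Dict[Tuple[str, str], List[Failure]]:
    """Group student failures by (subfield, error_type)."""
    clusters: Dict[Tuple[str, str], List[Failure]] = defaultdict(list)
    for failure in failures:
        clusters[(failure.subfield, failure.error_type)].append(failure)
    return dict(clusters)


def select_skills(skills: Dict[Tuple[str, str], str], subfield: str) -> List[str]:
    """Pick the skill texts for one subfield, to be injected into the prompt."""
    return [text for (sf, _), text in sorted(skills.items()) if sf == subfield]
```

Each cluster would then be handed to the teacher model to synthesize one skill artifact; keeping the artifacts keyed by subfield is one plausible way to decide which ones to inject at inference time.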

155. ❌ CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

Authors: Tianyi Huang, Ying Kai Deng Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16091v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

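Scoring rationale: The paper studies inference-time knowledge repair within a retrieval-augmented generation (RAG) setting, retrieving counterevidence to test and correct answers, so it is highly relevant to 'Retrieval-Augmented Generation', 'Self-Correction', and 'Hallucination Mitigation' (10 points each). It builds on a large language model (GPT-5), so it is highly relevant to 'Large Language Models' (10 points). The method verifies answers through multiple steps of reasoning and deliberate reconsideration, giving it some relevance to 'Chain of Thought' and 'System 2 Thinking' (5 points each). Other keywords, such as MoE, quantization, and agent systems, are not addressed and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes CounterRefine, which retrieves counterevidence at inference time to repair wrong answers in retrieval-augmented question answering, improving a matched GPT-5 Baseline-RAG by 5.8 points to 73.1% correct on the SimpleQA benchmark.

Abstract translation (Chinese)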

在事实性问答任务中，许多错误并非源于信息获取失败，而是源于承诺失败：系统检索到了相关证据，却仍给出了错误答案。本文提出CounterRefine——一种面向检索增强型问答的轻量级推理时修复层。该方法首先基于检索到的证据生成一个简短答案，随后以该草稿答案为条件发起后续查询，收集额外的支持性证据与矛盾证据，最后执行受限的 refinement 步骤，输出KEEP（保持）或REVISE（修订）决策；只有当修订建议通过确定性验证时才会被采纳。实际上，CounterRefine将检索机制转化为检验临时答案的工具，而非仅仅用于收集更多上下文信息。在完整的SimpleQA基准测试中，CounterRefine将匹配的GPT-5 Baseline-RAG系统性能提升了5.8个百分点，达到73.1%的正确率，同时比已报道的单次GPT-5.4得分高出约40个百分点。这些发现为知识型基础模型指出了一个简单而重要的研究方向：除了获取证据，模型还应具备利用证据进行反思并在必要时修正自身答案的能力。

Abstract

In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair layer for retrieval-grounded question answering. CounterRefine first produces a short answer from retrieved evidence, then gathers additional support and conflicting evidence with follow-up queries conditioned on that draft answer, and finally applies a restricted refinement step that outputs either KEEP or REVISE, with proposed revisions accepted only if they pass deterministic validation. In effect, CounterRefine turns retrieval into a mechanism for testing a provisional answer rather than merely collecting more context. On the full SimpleQA benchmark, CounterRefine improves a matched GPT-5 Baseline-RAG by 5.8 points and reaches a 73.1 percent correct rate, while exceeding the reported one-shot GPT-5.4 score by roughly 40 points. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.

Keywords: factual question answering, retrieval-grounded QA, counterevidence retrieval, inference-time repair, knowledge repair, answer validation, RAG improvement, GPT-5 baseline
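
The control flow described in the abstract above (draft an answer, retrieve evidence conditioned on the draft, then KEEP or REVISE subject to validation) can be sketched as follows. This is a minimal sketch under stated assumptions: the four callables are hypothetical stand-ins for the retriever, the answering model, the refinement step, and the deterministic validator, and building the follow-up query by concatenating question and draft is an illustrative choice that the abstract does not specify.

```python
from typing import Callable, List, Optional, Tuple

Retriever = Callable[[str], List[str]]
Drafter = Callable[[str, List[str]], str]
# Returns ("KEEP", None) or ("REVISE", proposed_answer).
Refiner = Callable[[str, str, List[str]], Tuple[str, Optional[str]]]
Validator = Callable[[str, List[str]], bool]


def counter_refine(
    question: str,
    retrieve: Retriever,
    draft_answer: Drafter,
    refine: Refiner,
    validate: Validator,
) -> str:
    """Draft, gather answer-conditioned evidence, then keep or revise."""
    evidence = retrieve(question)
    draft = draft_answer(question, evidence)

    # Follow-up retrieval conditioned on the draft answer (assumed query form).
    follow_up = retrieve(f"{question} {draft}")
    all_evidence = evidence + follow_up

    decision, proposal = refine(question, draft, all_evidence)

    # A revision is accepted only if it passes the deterministic check;
    # otherwise the draft answer stands.
    if decision == "REVISE" and proposal is not None and validate(proposal, all_evidence):
        return proposal
    return draft
```

The validator is what keeps the refinement step restricted: a proposed revision that fails the check is discarded rather than trusted.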

156. ❌ ClaimFlow: Tracing the Evolution of Scientific Claims in NLP

Authors: Aniket Pramanick, Yufang Hou, Saif M. Mohammad, Iryna Gurevych Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16073v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

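Scoring rationale: The paper traces the evolution of scientific claims in the literature and classifies the relations between them, an application of AI to science, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points). It evaluates large language models on the Claim Relation Classification task, giving it some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points). It does not address the technical principles, training methods, inference optimization, alignment, or agent systems described by the other keywords, so those are scored 0.

!!! tip deepseek-chat TL;DR

The paper introduces ClaimFlow, a framework for tracing how scientific claims evolve in NLP. It builds a manually annotated corpus of 1,084 claims and 832 cross-paper claim relations, defines the Claim Relation Classification task, and evaluates neural models and large language models on it (0.78 macro-F1 baseline), finding that 63.5% of claims are never reused and only 11.1% are ever challenged.

Abstract translation (Chinese)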

科学论文不仅报告结果——它们提出主张，后续研究会对这些主张予以支持、拓展，有时也会反驳。然而，现有的引文和主张分析方法仅能捕捉这种对话的片段。在本研究中，我们在个体科学主张的层面上将这些互动关系显式化。我们引入了ClaimFlow，这是一个以主张为中心的自然语言处理（NLP）文献视图，它基于对304篇ACL Anthology论文（1979–2025）的人工标注构建而成，共包含1,084个主张和832个跨论文主张关系。这些关系标注了引用论文对某个被引主张的态度是支持、拓展、限定、反驳，还是仅将其作为背景进行参考。利用ClaimFlow，我们定义了一项新任务——主张关系分类——该任务要求模型根据文本和引文上下文，推断出对某个被引主张的科学立场。通过评估强大的神经网络模型和大语言模型在此任务上的表现，我们报告了0.78宏观F1分数的基线性能，表明主张关系分类是可行但具有挑战性的。我们进一步将模型应用于约13,000篇NLP论文，以分析主张在数十年NLP研究中的演变。我们的分析揭示：63.5%的主张从未被再次使用；仅有11.1%的主张曾受到挑战；同时，广泛传播的主张更多地是通过限定和拓展被重塑，而非被直接证实或反驳。总体而言，ClaimFlow提供了一个审视NLP领域内思想如何演变和成熟的视角，并为评估模型能否理解科学论证奠定了基础。

Abstract

Scientific papers do more than report results - they advance claims that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce ClaimFlow, a claim-centric view of the NLP literature, built from 304 ACL Anthology papers (1979-2025) that are manually annotated with 1,084 claims and 832 cross-paper claim relations, indicating whether a citing paper supports, extends, qualifies, refutes, or references a claim as background. Using ClaimFlow, we define a new task - Claim Relation Classification - which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating strong neural models and large language models on this task, we report baseline performance of 0.78 macro-F1, highlighting that claim-relation classification is feasible but challenging. We further apply our model to ~13k NLP papers to analyze how claims evolve across decades of NLP research. Our analysis reveals that 63.5% claims are never reused; only 11.1% are ever challenged; meanwhile, widely propagated claims are more often reshaped through qualification and extension than directly confirmed or refuted. Overall, ClaimFlow offers a lens for examining how ideas shift and mature within NLP, and a foundation for assessing whether models can interpret scientific argumentation.

Keywords: scientific claims, claim evolution, claim relation classification, NLP literature, large language models, citation analysis, scientific argumentation, ACL Anthology
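
For readers unfamiliar with the metric behind the reported baseline, macro-F1 over the five relation labels named in the abstract can be computed as in the sketch below. The label strings and the function name are illustrative, and treating a label with no gold and no predicted instances as F1 = 0 is an assumed convention; the abstract does not state how the authors handle that case.

```python
from typing import Sequence

# The five relation types listed in the abstract (label strings are illustrative).
LABELS = ("supports", "extends", "qualifies", "refutes", "background")


def macro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    labels: Sequence[str] = LABELS,
) -> float:
    """Unweighted mean of per-label F1 scores."""
    per_label = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / len(per_label)
```

Because every label contributes equally to the mean, rare relations such as refutations weigh as much as common ones, which makes macro-F1 a demanding metric for imbalanced label sets.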

157. ❌ SEAHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Southeast Asia

Authors: Ri Chi Ng, Aditi Kumaresan, Yujia Hu, Roy Ka-Wei Lee Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16070v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

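Scoring rationale: The paper builds a hate speech detection dataset for low-resource Southeast Asian languages (Indonesian, Tagalog, Thai, and Vietnamese) and evaluates models on it. The abstract mentions large language models only as a means of augmenting test cases, so the paper has moderate relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points). It involves no LLM technical innovation or specific application matching the other keywords, so those are scored 0.

!!! tip deepseek-chat TL;DR

To address the scarcity of hate speech detection resources for low-resource Southeast Asian languages, the paper builds the SEAHateCheck dataset and evaluates existing models, finding that they struggle in specific languages and with culturally nuanced expressions.

Abstract translation (Chinese)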

仇恨言论检测高度依赖语言资源，这些资源主要集中在英语和中文等高资源语言中，这为东南亚低资源语言工具开发的研究人员和平台造成了障碍——该地区多样化的社会语言环境使网络仇恨内容治理更为复杂。为此，我们推出了SEAHateCheck，这是一个针对印度尼西亚、泰国、菲律宾和越南的创新型数据集，涵盖印尼语、他加禄语（Tagalog）、泰语和越南语。基于HateCheck的功能测试框架并改进SGHateCheck的方法，SEAHateCheck提供了与文化背景相关的测试用例，通过大语言模型进行数据增强，并由当地专家验证以确保准确性。对前沿多语言模型的实验揭示了其在特定低资源语言中检测仇恨言论的局限性：其中他加禄语测试用例的模型准确率最低，这可能源于语言复杂性及训练数据有限；而基于俚语的功能测试被证明最具挑战性，因为模型难以理解蕴含文化细微差别的表达。SEAHateCheck的诊断性分析进一步暴露出模型在隐晦仇恨检测上的弱点，以及处理反制性言论（counter-speech）表达时的困境。作为首个面向这些东南亚语言的功能测试套件，本研究为研究者提供了坚实的基准，推动开发实用且契合文化背景的仇恨言论检测工具，以促进包容性的网络内容治理。

Abstract

Hate speech detection relies heavily on linguistic resources, which are primarily available in high-resource languages such as English and Chinese, creating barriers for researchers and platforms developing tools for low-resource languages in Southeast Asia, where diverse socio-linguistic contexts complicate online hate moderation. To address this, we introduce SEAHateCheck, a pioneering dataset tailored to Indonesia, Thailand, the Philippines, and Vietnam, covering Indonesian, Tagalog, Thai, and Vietnamese. Building on HateCheck’s functional testing framework and refining SGHateCheck’s methods, SEAHateCheck provides culturally relevant test cases, augmented by large language models and validated by local experts for accuracy. Experiments with state-of-the-art and multilingual models revealed limitations in detecting hate speech in specific low-resource languages. In particular, Tagalog test cases showed the lowest model accuracy, likely due to linguistic complexity and limited training data. In contrast, slang-based functional tests proved the hardest, as models struggled with culturally nuanced expressions. The diagnostic insights of SEAHateCheck further exposed model weaknesses in implicit hate detection and models’ struggles with counter-speech expression. As the first functional test suite for these Southeast Asian languages, this work equips researchers with a robust benchmark, advancing the development of practical, culturally attuned hate speech detection tools for inclusive online content moderation.

Keywords: hate speech detection, low-resource languages, Southeast Asia, functional testing, multilingual models, cultural relevance, dataset benchmark, online content moderation

158. ❌ Resource Consumption Threats in Large Language Models

Authors: Yuanhe Zhang, Xinyue Wang, Zhican Chen, Weiliu Wang, Zilu Zhang, Zhengshuo Gong, Zhenhong Zhou, Li Sun, Yang Liu, Sen Su Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16068v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

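Scoring rationale: The title and abstract focus explicitly on resource consumption threats in large language models (LLMs), so the paper is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It discusses resource efficiency, service capacity, latency, and API cost, all core concerns in LLM deployment and operation. However, it is a survey that systematically reviews threat induction, mechanism understanding, and mitigation, and it does not examine the specific techniques behind the other keywords (such as MoE, quantization, inference acceleration, or alignment), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

This survey systematically reviews threats that induce excessive resource consumption in large language models, examines the full pipeline from threat induction through mechanism understanding to mitigation, and aims to give this emerging area a clear problem definition and a foundation for mitigation.

Abstract (Chinese translation)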

在计算基础设施有限且成本高昂的背景下，资源效率已成为大语言模型（LLMs）的关键要求。高效的LLM能提升服务提供方的承载能力，并降低用户的延迟与API成本。近期出现的资源消耗威胁会诱发过度生成，从而降低模型效率，损害服务可用性与经济可持续性。本文针对LLM中的资源消耗威胁进行了系统性综述。通过厘清该领域的范畴，并沿着从威胁诱发、机制理解到缓解措施的全流程审视该问题，我们进一步构建了这一新兴领域的统一视角。本研究旨在阐明该领域的问题全景，从而为特征刻画与威胁缓解提供更清晰的基础。

Abstract

Given limited and costly computational infrastructure, resource efficiency is a key requirement for large language models (LLMs). Efficient LLMs increase service capacity for providers and reduce latency and API costs for users. Recent resource consumption threats induce excessive generation, degrading model efficiency and harming both service availability and economic sustainability. This survey presents a systematic review of threats to resource consumption in LLMs. We further establish a unified view of this emerging area by clarifying its scope and examining the problem along the full pipeline from threat induction to mechanism understanding and mitigation. Our goal is to clarify the problem landscape for this emerging area, thereby providing a clearer foundation for characterization and mitigation.

Keywords: Large Language Models, LLMs, resource consumption, threats, efficiency, mitigation, survey, computational infrastructure

159. ❌ Residual Stream Duality in Modern Transformer Architectures

Authors: Yifan Zhang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16039v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

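Scoring rationale: The paper studies the mathematical duality of the residual stream in Transformer architectures and introduces the Transformer$^2$ concept, which treats the sequence axis and the depth axis as dual dimensions. It is fairly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points), because the Transformer is the core architecture of LLMs and the paper examines the mathematical properties of that architecture. It does not address specific training methods, optimization techniques, application scenarios, or any other keyword, so all other keywords score 0. The paper is original research on the technical principles of large models and fits the stated research background.

!!! tip deepseek-chat TL;DR

The paper identifies a mathematical duality in the Transformer residual stream between the sequence axis and the depth axis, introduces the Transformer$^2$ concept, and recommends choosing between Deep Delta Learning (DDL) and sequence-axis short sliding-window attention (ShortSWA) depending on the goal.

Abstract (Chinese translation)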

近期研究已明确指出，残差路径并非仅仅是优化过程的辅助结构，而是模型表征机制的重要组成部分。我们认同这一观点，但主张通过双轴视角来梳理Transformer的设计空间是更为清晰的框架。解码器沿两个有序维度演化信息：序列位置与层深度。自注意力机制已在序列轴上实现了自适应混合，而残差流通常在深度轴上执行固定的加法操作。若固定某个词元位置并将层索引视为有序变量，则因果深度残差注意力读取在数学形式上完全等同于因果短滑动窗口注意力，区别仅在于其操作维度是深度而非序列。这正是Transformer$^2$背后的核心残差流对偶性。这一视角也为近期研究提供了清晰阐释：ELC-BERT与DenseFormer已证明，沿深度进行可学习的聚合能够超越均匀残差累积的效果；而垂直注意力、深度交叉注意力、MUDDFormer及注意力残差等方法则进一步向基于显式注意力的跨层路由机制发展。然而关键在于，算子层面的对偶性并不等同于系统层面的对称性。对于大规模自回归模型，序列轴短滑动窗口注意力通常更具硬件友好性，因其可复用词元侧滑动窗口计算核、键值缓存布局与分块执行策略。若目标在于修改捷径连接本身，深度增量学习则是更简洁的干预方案，因其直接修改残差算子而非添加独立的跨层检索路径。因此我们的建议简明直接：当研究焦点为捷径连接时采用深度增量学习；当目标为局部自适应混合时则采用序列轴短滑动窗口注意力。

Abstract

Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model’s representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer$^2$. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.
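
The duality described in the abstract can be illustrated with a toy NumPy sketch: one causal sliding-window attention routine is applied once along the sequence axis (layer fixed) and once along the depth axis (token fixed). This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `causal_window_attention`, the tensor `hidden`, and the use of the same array for queries, keys, and values are all hypothetical simplifications.

```python
import numpy as np

def causal_window_attention(q, k, v, window):
    """Causal sliding-window attention along the first (ordered) axis.

    q, k, v have shape (n, d). Position i attends to positions
    max(0, i - window + 1) .. i.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)  # (n, n) similarity matrix
    idx = np.arange(n)
    # allowed[i, j] is True when j lies inside the causal window of i
    allowed = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(allowed, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
num_layers, seq_len, d_model = 6, 8, 4
# hidden[l, t] is a toy stand-in for the layer-l state of token t (assumption)
hidden = rng.normal(size=(num_layers, seq_len, d_model))

# Sequence axis: fix a layer and mix over token positions
layer_states = hidden[2]
seq_mix = causal_window_attention(layer_states, layer_states, layer_states, window=3)

# Depth axis: fix a token and mix over the current and earlier layers
token_states = hidden[:, 5]
depth_mix = causal_window_attention(token_states, token_states, token_states, window=3)
```

The operator is identical in both calls; only the axis being indexed changes. The abstract's systems-level point (kernel reuse, KV-cache layout) is not captured by a sketch of this kind.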

Keywords: Transformer, residual stream, duality, depth axis, sequence axis, ShortSWA, Deep Delta Learning, autoregressive models

160. ❌ Evaluating Agentic Optimization on Large Codebases

Authors: Atharva Sehgal, James Hou, Akanksha Sarkar, Ishaan Mantripragada, Swarat Chaudhuri, Jennifer J. Sun, Yisong Yue Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.16011v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

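Scoring rationale: The paper evaluates how well LLM coding agents optimize code at the repository level. It is highly relevant to "Large Language Models" and "LLM Agents" (10 points each), because it explicitly studies LLM coding agents and develops the FormulaCode benchmark. It is partly relevant to "AI for Science" (5 points), because it mines scientific Python repositories as its data source and so touches on AI applied to scientific computing. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address the limitations of existing code benchmarks, the paper introduces FormulaCode, a benchmark that evaluates the multi-objective optimization ability of LLM agents on large real-world codebases, and finds that repository-scale multi-objective optimization remains a major challenge for frontier LLM agents.

Abstract (Chinese translation)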

大型语言模型（LLM）编码代理日益在仓库级别进行操作，这促使需要建立能够评估其在现实约束下优化整个代码库能力的基准测试。现有的代码基准测试主要依赖合成任务、二元正确性信号或单目标评估，限制了其评估整体优化行为的能力。我们提出了FormulaCode，这是一个用于评估代理在大型、真实世界代码库上进行细粒度、多目标性能优化的基准测试。FormulaCode包含从GitHub上的科学Python仓库中挖掘出的957个性能瓶颈，每个瓶颈均配有专家编写的补丁，并且平均每个任务包含264.6个社区维护的性能工作负载，从而能够全面评估LLM代理在现实正确性和性能约束下优化代码库的能力。我们的评估表明，仓库规模的多目标优化对于前沿LLM代理而言仍然是一个重大挑战。项目网站位于：https://formula-code.github.io

Abstract

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling holistic evaluation of the ability of LLM agents to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website at: https://formula-code.github.io

Keywords: LLM coding agents, repository-level optimization, FormulaCode benchmark, multi-objective evaluation, performance bottlenecks, scientific Python repositories, agentic optimization, codebase optimization

161. ❌ Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

Authors: Fan Huang, Haewoon Kwak, Jisun An Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.16017v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

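Scoring rationale: The paper studies how large language models (LLMs) behave during moral reasoning, so it is highly relevant to "Large Language Models" (10 points). It focuses on moral reasoning trajectories and switches between ethical frameworks, which involve multi-step reasoning and deliberate thinking, so it is highly relevant to "Chain of Thought" and "System 2 Thinking" (10 points each). It analyzes internal representations with linear probes and activation steering, which falls under explainable AI, so it is highly relevant to "Mechanistic Interpretability" (10 points). It also touches on moral alignment and value judgments, so it is partly relevant to "Instruction Tuning OR Alignment OR Value Alignment" (8 points). Other keywords such as MoE, quantization, RAG, and AI for science have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how large language models switch between ethical frameworks during moral reasoning, finds that most reasoning trajectories are framework-unstable, and proposes a metric based on representation consistency for assessing the coherence of a model's reasoning.

Abstract (Chinese translation)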

大语言模型（LLM）日益参与道德敏感决策，但其在推理步骤间如何组织伦理框架仍待深入探究。我们提出道德推理轨迹（moral reasoning trajectories）的概念，即中间推理步骤中伦理框架调用的序列，并分析了六种模型和三个基准测试中的轨迹动态。研究发现，道德推理涉及系统性的多框架审议：55.4–57.7% 的连续步骤存在框架切换，仅 16.4–17.8% 的轨迹保持框架一致性。不稳定的轨迹对说服性攻击的敏感性仍高出 1.29 倍（$p=0.015$）。在表征层面，线性探针将特定框架的编码定位至模型特定层（Llama-3.3-70B 的第 63/81 层；Qwen2.5-72B 的第 17/81 层），其 KL 散度比训练集先验基线低 13.8–22.6%。轻量级激活导向技术可调节框架整合模式（漂移减少 6.7–8.9%）并增强稳定性与准确性之间的关系。我们进一步提出道德表征一致性（Moral Representation Consistency, MRC）指标，该指标与大语言模型连贯性评分呈强相关（$r=0.715$, $p<0.0001$），其底层框架归因经人工标注者验证（平均余弦相似度 $= 0.859$）。

Abstract

Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce *moral reasoning trajectories*, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4–57.7% of consecutive steps involve framework switches, and only 16.4–17.8% of trajectories remain framework-consistent. Unstable trajectories remain 1.29$\times$ more susceptible to persuasive attacks ($p=0.015$). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8–22.6% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7–8.9% drift reduction) and amplifies the stability–accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly ($r=0.715$, $p<0.0001$) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity $= 0.859$).
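
The two trajectory statistics quoted in the abstract (the share of consecutive steps that switch framework, and the share of trajectories that stay framework-consistent) can be computed as in the sketch below. It assumes each trajectory is a list of per-step framework labels; the label names and the toy data are hypothetical, and the paper's exact definitions may differ.

```python
def switch_rate(trajectories):
    """Fraction of consecutive step pairs whose framework label changes."""
    pairs = [(a, b) for t in trajectories for a, b in zip(t, t[1:])]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def consistent_fraction(trajectories):
    """Fraction of trajectories that invoke a single framework throughout."""
    if not trajectories:
        return 0.0
    return sum(len(set(t)) == 1 for t in trajectories) / len(trajectories)

# Toy trajectories with hypothetical framework labels
toy = [
    ["deontology", "deontology", "utilitarian"],
    ["virtue", "virtue", "virtue"],
    ["utilitarian", "deontology", "utilitarian", "utilitarian"],
]

print(f"switch rate: {switch_rate(toy):.3f}")
print(f"consistent trajectories: {consistent_fraction(toy):.3f}")
```

In the toy data, 3 of the 7 consecutive step pairs switch framework and 1 of the 3 trajectories is framework-consistent.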

Keywords: Large Language Models, Moral Reasoning, Ethical Frameworks, Explainable AI, Reasoning Trajectories, Representation Analysis, Activation Steering, Coherence Evaluation

162. ❌ RadAnnotate: Large Language Models for Efficient and Reliable Radiology Report Annotation

Authors: Saisha Pradeep Shetty, Roger Eric Goldman, Vladimir Filkov Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.16002v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

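Scoring rationale: The paper uses LLMs (the RadAnnotate framework) and RAG (retrieval-augmented synthetic reports) to address radiology report annotation, which falls under AI for Science (bioinformatics and medical AI applications). Other keywords such as MoE, SFT, and quantization are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper presents RadAnnotate, a framework that uses large language models and retrieval-augmented generation to annotate radiology reports automatically, reducing expert effort while maintaining high accuracy (entity match score 0.86-0.92).

Abstract (Chinese translation)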

放射学报告标注对于临床自然语言处理至关重要，但人工标注速度慢且成本高昂。本文提出RadAnnotate——一个基于大语言模型的框架，通过研究检索增强的合成报告和基于置信度的选择性自动化，以减少RadGraph标注任务中专家所需的工作量。本研究聚焦于RadGraph风格的实体标注（图节点），将关系抽取（边）留待未来工作。首先，我们在金标准报告上训练特定实体分类器，并分析其在解剖结构和观察结果类别上的优势与失败模式，其中不确定性观察结果最难学习。其次，我们生成检索增强生成技术引导的合成报告，结果表明仅使用合成数据的模型性能与金标准训练模型的差距保持在1-2个F1值以内，且合成数据增强在低资源场景下对不确定性观察结果特别有效，将F1值从0.61提升至0.70。最后，通过学习特定实体的置信度阈值，RadAnnotate能以0.86-0.92的实体匹配分数自动标注55-90%的报告，同时将低置信度案例路由至专家审核环节。

Abstract

Radiology report annotation is essential for clinical NLP, yet manual labeling is slow and costly. We present RadAnnotate, an LLM-based framework that studies retrieval-augmented synthetic reports and confidence-based selective automation to reduce expert effort for labeling in RadGraph. We study RadGraph-style entity labeling (graph nodes) and leave relation extraction (edges) to future work. First, we train entity-specific classifiers on gold-standard reports and characterize their strengths and failure modes across anatomy and observation categories, with uncertain observations hardest to learn. Second, we generate RAG-guided synthetic reports and show that synthetic-only models remain within 1-2 F1 points of gold-trained models, and that synthetic augmentation is especially helpful for uncertain observations in a low-resource setting, improving F1 from 0.61 to 0.70. Finally, by learning entity-specific confidence thresholds, RadAnnotate can automatically annotate 55-90% of reports at 0.86-0.92 entity match score while routing low-confidence cases for expert review.
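
The confidence-based selective automation step described above can be sketched as a simple routing rule: predictions at or above an entity-specific threshold are accepted automatically, and the rest go to expert review. This is a hedged illustration only. The entity type names, threshold values, and the per-prediction (rather than per-report) granularity are assumptions, and the abstract does not say how the thresholds are learned.

```python
def route_by_confidence(predictions, thresholds, default_threshold=1.0):
    """Split predictions into auto-accepted and expert-review lists.

    predictions: list of (entity_type, label, confidence) tuples.
    thresholds: dict mapping entity_type to the minimum confidence
        required for automatic acceptance. Unknown entity types fall
        back to default_threshold, which is deliberately conservative.
    """
    auto, review = [], []
    for entity_type, label, confidence in predictions:
        threshold = thresholds.get(entity_type, default_threshold)
        target = auto if confidence >= threshold else review
        target.append((entity_type, label, confidence))
    return auto, review

# Hypothetical entity types, labels, and thresholds for illustration
preds = [
    ("anatomy", "lung", 0.97),
    ("observation_uncertain", "possible effusion", 0.62),
    ("observation_definite", "opacity", 0.91),
]
thresholds = {
    "anatomy": 0.90,
    "observation_uncertain": 0.85,
    "observation_definite": 0.90,
}

auto, review = route_by_confidence(preds, thresholds)
print(len(auto), "auto-accepted;", len(review), "sent to expert review")
```

Raising a threshold trades coverage for accuracy, which mirrors the 55-90% coverage range reported in the abstract.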

Keywords: Large Language Models, Radiology Report Annotation, Retrieval-Augmented Generation, Clinical NLP, Synthetic Reports, Confidence-based Automation, Entity Labeling, RadGraph

163. ❌ Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models

Authors: Sijie Li, Biao Qian, Jungong Han Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.16001v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

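Scoring rationale: The paper studies pruning for Large Vision-Language Models (LVLMs), a large-model optimization technique. It is highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (10 points), because its core is sparse-model pruning, with the visual pathway tolerating up to 50% sparsity. It is highly relevant to "Quantization OR Model Compression OR Low-bit Weights" (10 points), because network pruning is a core model compression technique. It is relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points), because LVLMs extend large language models, and to "Small Language Models OR SLMs OR On-device AI" (8 points), because pruning aims at lightweight models suited to edge deployment. Other keywords such as pre-training, alignment, and inference acceleration are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes ATV-Pruning, an asymmetric text-visual weight pruning method for Large Vision-Language Models (LVLMs) that accounts for the different pruning sensitivity of the textual and visual modalities, yielding more accurate model compression and outperforming existing methods on standard multimodal benchmarks.

Abstract (Chinese translation)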

网络剪枝是实现轻量化大型视觉语言模型（LVLMs）的有效技术，其重要性度量主要综合了权重和激活值。然而，现有方法通常以统一方式处理来自不同模态的校准数据，忽视了模态特异性行为。这引发了一个关键挑战：如何应对文本与视觉令牌的差异性行为，以实现对LVLMs的精确剪枝。为此，我们通过解耦其对应权重，系统性地研究了视觉与文本令牌对剪枝操作的敏感性，发现：（i）文本路径应通过文本令牌进行校准，因其表现出比视觉路径更高的敏感性；（ii）视觉路径具有高度冗余性，甚至可容忍50%的稀疏度。基于这些发现，我们提出了一种简单而有效的非对称文本-视觉权重剪枝方法（称为ATV-Pruning），该方法通过从文本和视觉路径中选择信息丰富的令牌，建立精确权重剪枝的重要性度量。具体而言，ATV-Pruning融合了两项核心创新：首先，通过整合所有文本令牌和部分视觉令牌自适应构建校准池；其次，我们设计了一种层自适应选择策略以筛选重要视觉令牌。最后，在标准多模态基准测试上进行的大量实验验证了我们的ATV-Pruning方法相较于前沿技术的优越性。

Abstract

Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV-Pruning over state-of-the-art methods.
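
The idea of building the calibration pool from all text tokens plus a subset of visual tokens can be sketched as follows. The abstract does not give the exact importance metric or the layer-adaptive selection rule, so this sketch swaps in two stand-ins that are assumptions, not the paper's method: a Wanda-style score (absolute weight times the L2 norm of the matching input feature over the pool) and a top-k-by-norm choice of visual tokens. All function and variable names are hypothetical.

```python
import numpy as np

def prune_with_calibration(weight, text_acts, visual_acts, visual_keep, sparsity):
    """Zero the lowest-scoring weights in each output row.

    weight: (out_features, in_features)
    text_acts: (n_text, in_features) activations of text tokens
    visual_acts: (n_visual, in_features) activations of visual tokens
    visual_keep: number of visual tokens added to the calibration pool
    sparsity: fraction of weights removed in each row
    """
    # Placeholder selection (assumption): keep the visual tokens with the largest norm
    norms = np.linalg.norm(visual_acts, axis=1)
    chosen = visual_acts[np.argsort(norms)[::-1][:visual_keep]]
    # Asymmetric pool: every text token, only a subset of visual tokens
    pool = np.concatenate([text_acts, chosen], axis=0)
    # Wanda-style score (assumption): |W_ij| * ||X_j||_2 over the pool
    feature_norm = np.linalg.norm(pool, axis=0)
    score = np.abs(weight) * feature_norm[None, :]
    n_prune = int(weight.shape[1] * sparsity)
    pruned = weight.copy()
    if n_prune > 0:
        drop = np.argsort(score, axis=1)[:, :n_prune]
        np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 8))
text_acts = rng.normal(size=(16, 8))
visual_acts = rng.normal(size=(64, 8))

pruned = prune_with_calibration(weight, text_acts, visual_acts, visual_keep=8, sparsity=0.5)
print((pruned == 0).sum(axis=1))  # number of zeroed weights per row
```

The asymmetry lives entirely in how `pool` is assembled; the scoring and masking steps are the same as in a single-modality setting.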

Keywords: Large Vision-Language Models, Network Pruning, Asymmetric Pruning, Model Compression, Sparse Models, Multimodal AI, Weight Pruning, Lightweight Models

164. ❌ NLP Occupational Emergence Analysis: How Occupations Form and Evolve in Real Time – A Zero-Assumption Method Demonstrated on AI in the US Technology Workforce, 2022-2026

Authors: David Nordfors Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15998v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

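Scoring rationale: The paper presents a sociological method for analyzing how occupations form (a zero-assumption method based on resume data) and applies it empirically to the diffusion of AI across the US workforce. Every keyword concerns the technical principles, training methods, optimization techniques, or concrete applications of large models and deep learning. This paper touches none of them and treats AI only as the object of a sociological analysis, so relevance is 0 for all keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a zero-assumption method based on resume data for detecting occupational emergence. Applied to the US workforce, it finds that AI behaves as a diffusing technology rather than an emerging occupation: a professional vocabulary formed, but no cohesive practitioner population did.

Abstract (Chinese translation)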

职业的形成与演变速度往往超越分类系统的追踪能力。我们提出，真正的职业是一种自我强化的结构（二分协同吸引子），其中共享的专业词汇使从业者凝聚为群体，而凝聚的群体又维系着该词汇体系。这一协同吸引子概念使我们能够通过零假设方法从简历数据中检测职业的涌现，无需依赖预定义的分类体系或职位名称：我们独立检验词汇凝聚力与群体凝聚力，并通过消融实验验证词汇是否为联结群体的机制。该方法应用于820万份美国简历数据（2022-2026年），成功识别出既有职业，并揭示出人工智能领域存在显著不对称性：2024年初已迅速形成具有凝聚力的专业词汇体系，但从业者群体始终未能实现凝聚。随着AI工具走向主流，原有AI社群逐渐解体，新词汇被吸收至现有职业体系而非催生新职业。人工智能似乎是一种扩散性技术，而非新兴职业。我们探讨了引入“AI工程师”职业类别是否可能围绕已形成的词汇体系催化群体凝聚力，从而完成协同吸引子的构建。

Abstract

Occupations form and evolve faster than classification systems can track. We propose that a genuine occupation is a self-reinforcing structure (a bipartite co-attractor) in which a shared professional vocabulary makes practitioners cohesive as a group, and the cohesive group sustains the vocabulary. This co-attractor concept enables a zero-assumption method for detecting occupational emergence from resume data, requiring no predefined taxonomy or job titles: we test vocabulary cohesion and population cohesion independently, with ablation to test whether the vocabulary is the mechanism binding the population. Applied to 8.2 million US resumes (2022-2026), the method correctly identifies established occupations and reveals a striking asymmetry for AI: a cohesive professional vocabulary formed rapidly in early 2024, but the practitioner population never cohered. The pre-existing AI community dissolved as the tools went mainstream, and the new vocabulary was absorbed into existing careers rather than binding a new occupation. AI appears to be a diffusing technology, not an emerging occupation. We discuss whether introducing an “AI Engineer” occupational category could catalyze population cohesion around the already-formed vocabulary, completing the co-attractor.

Keywords: occupational emergence, co-attractor, vocabulary cohesion, population cohesion, resume data analysis, AI workforce, diffusing technology, zero-assumption method

165. ❌ Visual Set Program Synthesizer

Authors: Zehua Cheng, Wei Dai, Wenhu Zhang, Thomas Lukasiewicz, Jiahao Sun Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15997v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

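Scoring rationale: The paper addresses set-based reasoning in visual question answering and proposes a program synthesis approach, which is unrelated to most of the large-model keywords. Only the reasoning-related keywords are weakly relevant: "Chain of Thought" (5 points: multi-step reasoning, though not textual), "System 2 Thinking" (5 points: deliberate reasoning), and "Explainable AI" (5 points: emphasis on transparency). No other keyword is covered.

!!! tip deepseek-chat TL;DR

Targeting visual question answering tasks that require set-based reasoning, the paper proposes a visual program synthesis approach that generates executable symbolic programs in place of black-box inference, substantially improving accuracy and interpretability on complex visual reasoning tasks.

Abstract (Chinese translation)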

当用户将手机对准超市货架并询问“哪种苏打水含糖量最低？”时，这对当前的视觉人工智能助手构成了严峻挑战。此类查询不仅需要物体识别能力，还涉及基于集合的显式推理，例如筛选、比较与聚合。标准的端到端多模态大语言模型往往难以完成此类任务，因为它们缺乏组合逻辑的显式机制。我们提出将视觉推理视为视觉程序合成问题，即模型首先生成符号化程序，再由一个基于视觉场景的独立引擎执行该程序。同时，我们引入了Set-VQA这一专门用于评估集合式视觉推理能力的新基准。实验表明，在复杂推理任务中，我们的方法显著优于当前最先进的基线模型，不仅产生了更系统化、透明化的推理过程，同时大幅提升了答案准确率。这些结果证明，程序驱动的推理为黑箱式的视觉-语言推断提供了一种具有原则性的替代方案。

Abstract

A user pointing their phone at a supermarket shelf and asking “Which soda has the least sugar?” poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.
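
The "generate a symbolic program, then execute it with a separate engine" idea can be illustrated with a tiny interpreter over a hand-written scene. This is a sketch under stated assumptions: the `Item` fields, the operation names (`filter`, `argmin`, `count`), and the program encoding are hypothetical and are not the paper's DSL. In a real system the scene would come from a visual detector and the program from the model.

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    category: str
    sugar_g: float

def execute(program, scene):
    """Run a list of (op, argument) steps over a set of items."""
    current = list(scene)
    for op, arg in program:
        if op == "filter":
            key, value = arg
            current = [x for x in current if getattr(x, key) == value]
        elif op == "argmin":
            current = [min(current, key=lambda x: getattr(x, arg))] if current else []
        elif op == "count":
            return len(current)
        else:
            raise ValueError(f"unknown op: {op}")
    return current

# Toy scene standing in for detector output (assumption)
scene = [
    Item("Cola A", "soda", 39.0),
    Item("Cola B", "soda", 27.0),
    Item("Juice C", "juice", 22.0),
]

# "Which soda has the least sugar?" expressed as filter followed by argmin
program = [("filter", ("category", "soda")), ("argmin", "sugar_g")]
answer = execute(program, scene)
print(answer[0].name)
```

Because each step is explicit, an incorrect answer can be traced to a specific operation, which is the transparency benefit the abstract emphasizes.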

Keywords: Visual Reasoning, Program Synthesis, Set-based Reasoning, Visual Question Answering, Multimodal Large Language Models, Compositional Logic, Benchmark Set-VQA, Transparent Inference

166. ❌ Robust Language Identification for Romansh Varieties

Authors: Charlotte Model, Sina Ahmadi, Jannis Vamvas Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15969v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

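Scoring rationale: The paper studies a language identification system for Romansh idioms using a traditional SVM approach, and involves no large models, deep learning, or other modern AI techniques. All keywords focus on LLM-related techniques, architectures, training methods, inference optimization, and applications, whereas this paper is a traditional machine learning application and is unrelated to all of them.

!!! tip deepseek-chat TL;DR

Addressing the lack of a language identification system for Romansh idioms, the paper develops an SVM-based model that reaches an average in-domain accuracy of 97% on a newly curated benchmark.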

Abstract

The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Since Romansh LID should also be able to recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, this makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.

Keywords: language identification, Romansh, idioms, SVM, benchmark, spell checking, machine translation, classification

167. ❌ Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

Authors: Jingxiang Chen, Minseok Kim, Seong-Gyun Leem, Yin Huang, Rashi Rungta, Zhicheng Ouyang, Haibin Wu, Surya Teja Appini, Ankur Bansal, Yang Bai, Yue Liu, Florian Metze, Ahmed A Aly, Anuj Kumar, Ariya Rastrow, Zhaojiang Lin Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15981v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

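Scoring rationale: The paper's core topic is aligning paralinguistic understanding and generation in speech large language models (Speech LLMs), so it is highly related to ‘Large Language Models’ (10 points). Methodologically it uses multi-task reinforcement learning (multi-task RL), rated highly related to ‘RLHF/RLAIF/DPO’ (10 points), and it uses chain-of-thought prompting for explicit affective reasoning, rated highly related to ‘Chain of Thought/CoT Reasoning’ (10 points). Because the paper touches on alignment and affective reasoning, it is somewhat related to ‘Instruction Tuning/Alignment’ and ‘System 2 Thinking/Slow Thinking’ (5 points each). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not mentioned in the abstract or are unrelated to the core content and receive 0 points.

!!! tip deepseek-chat TL;DR

To address data scarcity and reliance on lexical shortcuts in the paralinguistic understanding (e.g., emotion, prosody) of speech LLMs, the paper proposes a two-stage method based on multi-task reinforcement learning and chain-of-thought prompting. Experiments show that it improves paralinguistic understanding by 8-12% over supervised baselines and strong proprietary models on Expresso, IEMOCAP, and RAVDESS.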

Abstract

Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds–crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

Keywords: Speech LLMs, paralinguistic understanding, multi-task reinforcement learning, chain-of-thought prompting, affective reasoning, sentiment classification, emotionally intelligent, audio processing

168. ❌ MAC: Multi-Agent Constitution Learning

Authors: Rushil Thareja, Gautam Gupta, Francesco Pinto, Nils Lukas Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15968v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

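Scoring rationale: The paper's core topic is a Constitutional AI method for LLMs that optimizes structured rule sets through a network of agents, making it novel research on LLM alignment and agent systems. It is highly related to ‘Large Language Models’, ‘Instruction Tuning/Alignment’, ‘LLM Agents’, and ‘Multi-agent Systems’ (10 points each); somewhat related to ‘Post-training/SFT’, ‘Self-Correction’, ‘Tool Use’, and ‘Explainable AI’ (5 points each); the remaining keywords are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes Multi-Agent Constitutional Learning (MAC), which optimizes an LLM's rule set through a network of agents. On tasks such as PII tagging it outperforms recent prompt optimization methods by over 50% and reaches performance comparable to supervised fine-tuning without any parameter updates.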

Abstract

Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM-based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled examples and (ii) lack structure in the optimized prompts, leading to diminishing improvements as prompt size grows. To address these limitations, we propose Multi-Agent Constitutional Learning (MAC), which optimizes over structured prompts represented as sets of rules using a network of agents with specialized tasks to accept, edit, or reject rule updates. We also present MAC+, which improves performance by training agents on successful trajectories to reinforce updates leading to higher reward. We evaluate MAC on tagging Personally Identifiable Information (PII), a classification task with limited labels where interpretability is critical, and demonstrate that it generalizes to other agentic tasks such as tool calling. MAC outperforms recent prompt optimization methods by over 50%, produces human-readable and auditable rule sets, and achieves performance comparable to supervised fine-tuning and GRPO without requiring parameter updates.

Keywords: Constitutional AI, Multi-Agent Systems, LLM Alignment, Prompt Optimization, Rule-based Learning, Agent Coordination, Interpretability, Parameter-free Learning

169. ❌ MoLoRA: Composable Specialization via Per-Token Adapter Routing

Authors: Shrey Shah, Justin Wagle Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15965v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	15.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

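Scoring rationale: The paper's core contribution is MoLoRA (Mixture of LoRA), a mixture-of-LoRA-adapters system based on per-token routing, which is highly related to ‘Mixture of Experts’ and ‘PEFT/LoRA’ (15 points each). The paper uses models such as Qwen3-1.7B, so it is related to ‘Large Language Models’ and ‘Small Language Models’ (10 points each). Other keywords such as Scaling Laws, Pre-training, and RLHF are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the limitation of multi-adapter systems that route an entire sequence to a single adapter. It proposes MoLoRA, a system based on per-token routing that lets a small model outperform a larger one on reasoning tasks by combining multiple domain-specific adapters.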

Abstract

Multi-adapter serving systems route entire sequences to a single adapter, forcing a choice when requests span multiple domains. This assumption fails in two important settings: (1) multimodal generation, where text and image tokens require different adapters within the same sequence, and (2) mixed-capability requests like “write code to solve this equation,” which need expertise from multiple specialized adapters. We introduce per-token routing, which routes individual tokens to adapters based on either vocabulary structure (for multimodal models) or learned gating (for semantic specialization). Per-token routing is provably optimal, achieving work $N$ for $N$ tokens versus $K \cdot N$ for per-sequence routing with $K$ adapter types. Our key contribution is MoLoRA (Mixture of LoRA), which enables composable specialization: load multiple domain-specific adapters and let a learned router select the appropriate adapter per-token. We demonstrate that specialization dramatically beats scale: MoLoRA enables Qwen3-1.7B to exceed Qwen3-8B across four reasoning benchmarks while being 4.7x smaller. This enables modular expertise at inference time: train focused LoRAs independently, combine them without retraining, and add new capabilities by simply loading new adapters.

Keywords: MoLoRA, per-token routing, Mixture of LoRA, adapter routing, parameter-efficient fine-tuning, composable specialization, multimodal generation, reasoning benchmarks
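
The per-token routing described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the gate is a fixed, hand-set matrix rather than a learned one, routing is top-1, the dimensions are tiny, and every name (`LoRAAdapter`, `route`, `forward`, the adapter names "code" and "math") is hypothetical.

```python
from dataclasses import dataclass

Vector = list[float]
Matrix = list[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix `m` (rows x cols) by vector `v` (cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


@dataclass
class LoRAAdapter:
    name: str
    down: Matrix  # low-rank projection A, shape r x d
    up: Matrix    # low-rank projection B, shape d x r

    def delta(self, x: Vector) -> Vector:
        """Low-rank update B(Ax) added on top of the frozen base layer."""
        return matvec(self.up, matvec(self.down, x))


def route(gate: Matrix, x: Vector) -> int:
    """Top-1 routing: index of the adapter with the largest gate logit."""
    logits = matvec(gate, x)
    return max(range(len(logits)), key=logits.__getitem__)


def forward(base: Matrix, adapters: list[LoRAAdapter], gate: Matrix,
            tokens: list[Vector]) -> tuple[list[Vector], list[str]]:
    """Apply the base layer plus one routed adapter to each token."""
    outputs, choices = [], []
    for x in tokens:
        k = route(gate, x)  # one adapter chosen per token
        outputs.append(add(matvec(base, x), adapters[k].delta(x)))
        choices.append(adapters[k].name)
    return outputs, choices


# Toy setup: hidden size d = 2, LoRA rank r = 1, K = 2 adapters.
base = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity here)
adapters = [
    LoRAAdapter("code", down=[[1.0, 0.0]], up=[[0.5], [0.0]]),
    LoRAAdapter("math", down=[[0.0, 1.0]], up=[[0.0], [2.0]]),
]
gate = [[1.0, 0.0], [0.0, 1.0]]  # hand-set stand-in for a learned router
tokens = [[1.0, 0.0], [0.0, 1.0], [3.0, 1.0]]

outputs, choices = forward(base, adapters, gate, tokens)
print(choices)
```

In this toy run the first and third tokens are routed to the "code" adapter and the second to the "math" adapter, so a single sequence mixes adapters. Each token triggers exactly one adapter application, which corresponds to $N$ adapter evaluations for $N$ tokens.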

170. ❌ POLAR: A Per-User Association Test in Embedding Space

Authors: Pedro Bento, Arthur Buzelin, Arthur Chagas, Yan Aquino, Victoria Estanislau, Samira Malaquias, Pedro Robles Dutenhefner, Gisele L. Pappa, Virgilio Almeida, Wagner Meira Jr Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15950v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

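Scoring rationale: The paper proposes POLAR, a per-user lexical association test run in the embedding space of a lightly adapted masked language model, an applied use of language models in the social sciences. Relevance to ‘Large Language Models’ is 5 points (a masked language model is the basis); to ‘Pre-training’, 5 points (model adaptation is involved); to ‘AI for Science’, 8 points (the core application is in computational social science). Other keywords such as MoE, SFT, and RAG are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The study proposes POLAR, a per-user lexical association test run in the embedding space of a masked language model. It separates LLM-driven bots from organic accounts on Twitter and, on an extremist forum, quantifies lexical bias and drift over time.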

Abstract

Most intrinsic association probes operate at the word, sentence, or corpus level, obscuring author-level variation. We present POLAR (Per-user On-axis Lexical Association Report), a per-user lexical association test that runs in the embedding space of a lightly adapted masked language model. Authors are represented by private deterministic tokens; POLAR projects these vectors onto curated lexical axes and reports standardized effects with permutation p-values and Benjamini–Hochberg control. On a balanced bot–human Twitter benchmark, POLAR cleanly separates LLM-driven bots from organic accounts; on an extremist forum, it quantifies strong alignment with slur lexicons and reveals rightward drift over time. The method is modular to new attribute sets and provides concise, per-author diagnostics for computational social science. All code is publicly available at https://github.com/pedroaugtb/POLAR-A-Per-User-Association-Test-in-Embedding-Space.

Keywords: per-user association test, embedding space, masked language model, lexical association, computational social science, LLM-driven bots, author-level variation, standardized effects
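
The statistical ingredients named in the abstract (projection onto an axis, permutation p-values, Benjamini–Hochberg control) can be sketched as follows. This is an illustrative stand-in, not POLAR itself: the abstract does not specify the exact test statistic, so the two-group difference in mean projections, the toy 2-D vectors, and all function names below are assumptions.

```python
import math
import random


def project(vec: list[float], axis: list[float]) -> float:
    """Scalar projection of `vec` onto the direction of `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    return sum(v * a for v, a in zip(vec, axis)) / norm


def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)


def permutation_p_value(group_a: list[float], group_b: list[float],
                        n_perm: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled values
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return, per hypothesis, whether it is rejected at FDR level `alpha`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical 2-D "author vectors" and a hand-picked lexical axis.
axis = [1.0, 0.0]
bots = [[0.9, 0.1], [1.1, -0.2], [1.0, 0.3], [0.8, 0.0], [1.2, 0.1]]
humans = [[0.1, 0.4], [-0.1, 0.2], [0.0, -0.3], [0.2, 0.1], [-0.2, 0.0]]

p = permutation_p_value([project(v, axis) for v in bots],
                        [project(v, axis) for v in humans])
decisions = benjamini_hochberg([p, 0.04, 0.03, 0.20])
print(decisions)
```

The Benjamini–Hochberg step matters when many axes or many users are tested at once: it controls the expected share of false discoveries instead of testing each p-value against `alpha` in isolation.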

171. ❌ A Family of LLMs Liberated from Static Vocabularies

Authors: Aleph Alpha: Adnen Abdessaied, Artur Baranowski, Lukas Balles, Michael Barlow, Fabien C. Y. Benureau, Felix Berkenkamp, Lukas Bluebaum, Bastian Boll, Thomas F. Burns, Björn Deiseroth, Constantin Eichenberg, David Friede, Pablo Iyu Guerrero, Ahmed Hammam, Bastian Harren, Johann Higl, Yasser Jadidi, Carina Kauf, Johannes Messner, Jan Hendrik Metzen, Max Meuer, Vedant Nanda, Pit Neitemeier, Koen Oostermeijer, Letitia Parcalabescu, Markus Pernpointner, Felix Reinfurt, Dylan Rodriquez, Grégory Schott, Philipp Siedler, Martin Simonovsky, Till Speicher, Volker Stampa, Stephan Wäldchen, Samuel Weinbach, Gregor Ziegltrum Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15953v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

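Scoring rationale: The paper's core topic is improving tokenization for large language models (LLMs); it proposes the HAT architecture for byte-level processing, removing the constraint of a static vocabulary. Highly related keywords: LLMs (the core object of study), Pre-training (a 7B model trained from scratch, plus reuse of pre-trained models), SFT (supervised fine-tuning), and DPO (direct preference optimization). Other keywords such as MoE, SLMs, RAG, and quantization are not covered.

!!! tip deepseek-chat TL;DR

The paper presents a family of LLMs based on the hierarchical autoregressive transformer (HAT) architecture. Byte-level processing frees the models from a static vocabulary, and they improve on the original Llama 3.1 in most English and German benchmarks.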

Abstract

Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.

Keywords: Large Language Models, Tokenization, Hierarchical Autoregressive Transformer, Byte-level Models, Pre-training, Supervised Fine-tuning, Direct Preference Optimization, Llama 3.1
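
The abstract's claim that aggregating bytes into word embeddings reduces the number of sequence positions can be made concrete with a small count. This sketch covers only the grouping step, not the encoder, backbone, or decoder, and the whitespace-based splitting rule in `split_into_words` is an assumption for illustration; the abstract does not state how HAT defines word boundaries.

```python
import re


def split_into_words(text: str) -> list[str]:
    """Split text into chunks: a run of non-space characters plus any
    trailing whitespace (assumed rule, chosen so the split is lossless)."""
    return re.findall(r"\S+\s*|\s+", text)


def byte_groups(text: str) -> list[bytes]:
    """UTF-8 bytes of each word chunk; one group per backbone position."""
    return [word.encode("utf-8") for word in split_into_words(text)]


text = "Größe matters: byte-level models read raw UTF-8."
groups = byte_groups(text)

num_bytes = sum(len(g) for g in groups)  # positions seen at byte level
num_positions = len(groups)              # positions seen at word level

print(num_bytes, num_positions)
```

For this sentence the byte-level view has 50 positions while the word-level view has 7, and no vocabulary lookup is involved at any point: a spelling variant changes the bytes inside one group but not the number of word-level positions.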

Authors: Tanvir Ahmed Sijan, S. M Golam Rifat, Pankaj Chowdhury Partha, Md. Tanjeed Islam, Md. Musfique Anwar Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15949v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

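Scoring rationale: The paper's core topic is the alignment of large language models with the Bangla sociocultural context, so it is highly related to ‘Large Language Models’ (10 points) and partially related to ‘Instruction Tuning OR Alignment OR Value Alignment’ (8 points), since the paper concerns cultural alignment rather than technical alignment. The other keywords mainly concern specific technical details such as model architecture, training techniques, and inference optimization; they are unrelated to the paper's sociocultural evaluation theme and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies the sociopragmatic and cultural alignment of large language models in Bangla social interaction. Using the new BANGLASOCIALBENCH benchmark, it finds systematic cultural misalignment: models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts.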

Abstract

Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BANGLASOCIALBENCH, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, and consists of 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random, revealing persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.

Keywords: Large Language Models, Sociopragmatic Competence, Cultural Alignment, Bangla Language, Benchmark Evaluation, Social Interaction, Address Terms, Kinship Reasoning

173. ❌ Machine Translation in the Wild: User Reaction to Xiaohongshu’s Built-In Translation Feature

Authors: Sui He Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15922v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

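Scoring rationale: The paper studies user reactions to the built-in machine translation feature of a social media platform (Xiaohongshu). It is an application-level user study rather than a technical contribution to large models or deep learning. It covers none of the technical principles, methods, or models named in the keywords and mentions machine translation only as a general application, so it has no direct relation to any keyword.

!!! tip deepseek-chat TL;DR

By analyzing user comments on Xiaohongshu's built-in translation feature, the study finds that reactions were generally positive, though users also raised concerns about functionality, accessibility, and translation accuracy.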

Abstract

The growing integration of machine translation into social media platforms is transforming how users interact with each other across cultural and linguistic boundaries. This paper examines user reactions to the launch of Xiaohongshu’s built-in translation feature in January 2025. Drawing on a dataset of 6,723 comments collected from 11 official posts promoting the translation function, this paper combines sentiment analysis with thematic analysis to investigate how users perceived and experimented with the function. Results show that reactions were generally positive, particularly for translating posts and comments, although concerns regarding functionality, accessibility, and translation accuracy were also expressed. In addition to evaluative feedback, users actively tested the function with diverse inputs, including words and phrases in English and Chinese, abbreviations in pinyin, internet slang, and other language forms such as emoji, kaomoji, coded texts, etc. The findings highlight the importance of closer collaboration among computer scientists, translation scholars, and platform designers to better understand and improve translation technologies in real world communicative context.

Keywords: machine translation, social media platforms, user reactions, sentiment analysis, thematic analysis, translation accuracy, Xiaohongshu, real-world communication

174. ❌ CTG-DB: An Ontology-Based Transformation of ClinicalTrials.gov to Enable Cross-Trial Drug Safety Analyses

Authors: Jeffery L. Painter, François Haguinet, Andrew Bate Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15936v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

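Scoring rationale: The paper focuses on clinical data standardization and database construction, which falls within applied bioinformatics. Its content (clinical data transformation, terminology standardization, database construction) is unrelated to most of the large-model keywords (LLMs, MoE, training methods, inference optimization, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper deals with biomedical data standardization, which is a bioinformatics application, but it does not itself use AI or large-model techniques and is purely data processing work, hence a relevance of 5 (some connection).

!!! tip deepseek-chat TL;DR

The paper addresses the lack of standardized terminology for adverse event data in ClinicalTrials.gov by developing the CTG-DB pipeline, which converts the raw XML data into a relational database aligned to standardized MedDRA terminology and so enables cross-trial drug safety analyses.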

Abstract

ClinicalTrials.gov (CT.gov) is the largest publicly accessible registry of clinical studies, yet its registry-oriented architecture and heterogeneous adverse event (AE) terminology limit systematic pharmacovigilance (PV) analytics. AEs are typically recorded as investigator-reported text rather than standardized identifiers, requiring manual reconciliation to identify coherent safety concepts. We present the ClinicalTrials.gov Transformation Database (CTG-DB), an open-source pipeline that ingests the complete CT.gov XML archive and produces a relational database aligned to standardized AE terminology using the Medical Dictionary for Regulatory Activities (MedDRA). CTG-DB preserves arm-level denominators, represents placebo and comparator arms, and normalizes AE terminology using deterministic exact and fuzzy matching to ensure transparent and reproducible mappings. This framework enables concept-level retrieval and cross-trial aggregation for scalable placebo-referenced safety analyses and integration of clinical trial evidence into downstream PV signal detection.

Keywords: ClinicalTrials.gov, adverse events, MedDRA, pharmacovigilance, data standardization, cross-trial analysis, safety analysis, clinical trial database
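
The abstract above describes normalizing adverse event terms with deterministic exact and fuzzy matching. The sketch below illustrates that general idea only; it is not the CTG-DB implementation. The vocabulary, the `normalize_ae_term` helper, and the 0.85 cutoff are all hypothetical, and the fuzzy step uses Python's standard `difflib` as a stand-in for whatever matcher the authors use.

```python
from difflib import get_close_matches

# Hypothetical mini-vocabulary standing in for MedDRA preferred terms
# (the real dictionary is licensed and far larger).
PREFERRED_TERMS = ["headache", "nausea", "vomiting", "dizziness", "fatigue"]


def normalize_ae_term(raw, vocabulary=PREFERRED_TERMS, cutoff=0.85):
    """Map an investigator-reported AE string to a vocabulary term.

    Returns (term, method), where method is "exact", "fuzzy", or "unmapped".
    """
    # Lowercase and collapse whitespace before comparing.
    cleaned = " ".join(raw.lower().split())
    if cleaned in vocabulary:
        return cleaned, "exact"
    # get_close_matches is deterministic: the same input gives the same ranking.
    candidates = get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    if candidates:
        return candidates[0], "fuzzy"
    return None, "unmapped"
```

Recording the matching method next to each mapped term is one simple way to keep the mapping auditable, since fuzzy matches can then be reviewed separately from exact ones.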

175. ❌ Agent-based imitation dynamics can yield efficiently compressed population-level vocabularies

Authors: Nathaniel Imel, Richard Futrell, Michael Franke, Noga Zaslavsky Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15903v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

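Scoring rationale: The paper studies the efficiency of vocabulary compression in language evolution, using agent-based imitation dynamics and evolutionary game theory combined with Information Bottleneck theory. All scoring keywords concern large models, deep learning techniques, or specific applications such as AI for Science, whereas this paper sits at the intersection of theoretical linguistics, evolutionary game theory, and information theory and involves no deep learning, neural network, or large-model techniques. The relevance for every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

By integrating evolutionary game theory with the Information Bottleneck framework, the study shows that agent-based strategy imitation dynamics in signaling games can drive vocabulary evolution toward near-optimal compression efficiency.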

Abstract

Natural languages have been argued to evolve under pressure to efficiently compress meanings into words by optimizing the Information Bottleneck (IB) complexity-accuracy tradeoff. However, the underlying social dynamics that could drive the optimization of a language’s vocabulary towards efficiency remain largely unknown. In parallel, evolutionary game theory has been invoked to explain the emergence of language from rudimentary agent-level dynamics, but it has not yet been tested whether such an approach can lead to efficient compression in the IB sense. Here, we provide a unified model integrating evolutionary game theory with the IB framework and show how near-optimal compression can arise in a population through an independently motivated dynamic of imprecise strategy imitation in signaling games. We find that key parameters of the model – namely, those that regulate precision in these games, as well as players’ tendency to confuse similar states – lead to constrained variation of the tradeoffs achieved by emergent vocabularies. Our results suggest that evolutionary game dynamics could potentially provide a mechanistic basis for the evolution of vocabularies with information-theoretically optimal and empirically attested properties.

Keywords: evolutionary game theory, information bottleneck, vocabulary evolution, agent-based modeling, signaling games, language efficiency, compression tradeoff, social dynamics
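
The complexity-accuracy tradeoff mentioned in the abstract above can be made concrete with a small calculation. In the usual Information Bottleneck formulation, complexity is the mutual information I(M;W) between meanings and words, and accuracy is I(W;U) between words and world states. The sketch below computes both for a toy vocabulary. It assumes that standard formulation; the function names and the toy setup are hypothetical and are not taken from the paper's model.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as a nested list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                total += pxy * math.log2(pxy / (px[i] * py[j]))
    return total


def ib_terms(p_m, encoder, meanings):
    """Return (complexity, accuracy) = (I(M;W), I(W;U)).

    p_m[m]          prior over meanings
    encoder[m][w]   q(w|m): probability that meaning m is expressed by word w
    meanings[m][u]  p(u|m): belief over world states that meaning m stands for
    """
    n_m, n_w, n_u = len(p_m), len(encoder[0]), len(meanings[0])
    joint_mw = [[p_m[m] * encoder[m][w] for w in range(n_w)] for m in range(n_m)]
    joint_wu = [
        [sum(p_m[m] * encoder[m][w] * meanings[m][u] for m in range(n_m))
         for u in range(n_u)]
        for w in range(n_w)
    ]
    return mutual_information(joint_mw), mutual_information(joint_wu)
```

With two equally likely meanings, a one-word-per-meaning encoder pays one bit of complexity for one bit of accuracy, while an encoder that uses a single word for everything pays nothing and conveys nothing. Efficient vocabularies in the IB sense lie on the frontier between such extremes.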

176. ❌ Prompt Engineering for Scale Development in Generative Psychometrics

Authors: Lara Lee Russell-Lasalandra, Hudson Golino Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15909v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

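Scoring rationale: The paper's core topic is the application of large language models (LLMs) to psychometrics: a Monte Carlo simulation evaluates how different prompt engineering strategies affect the quality of LLM-generated personality assessment items. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10). The study involves in-context learning techniques such as few-shot prompting, giving some connection to 'In-context Learning OR Many-shot Learning' (5). As an application of AI to a scientific field (psychology), it is fairly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8). Other keywords such as MoE, SFT, and RAG are not covered in the paper and receive 0.

!!! tip deepseek-chat TL;DR

Using a Monte Carlo simulation, the study evaluates how different prompt engineering strategies affect the quality of personality assessment items generated by large language models. It finds that adaptive prompting substantially improves item quality and reduces semantic redundancy, and that the benefit grows with model capability.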

Abstract

This Monte Carlo simulation examines how prompt engineering strategies shape the quality of large language model (LLM)–generated personality assessment items within the AI-GENIE framework for generative psychometrics. Item pools targeting the Big Five traits were generated using multiple prompting designs (zero-shot, few-shot, persona-based, and adaptive), model temperatures, and LLMs, then evaluated and reduced using network psychometric methods. Across all conditions, AI-GENIE reliably improved structural validity following reduction, with the magnitude of its incremental contribution inversely related to the quality of the incoming item pool. Prompt design exerted a substantial influence on both pre- and post-reduction item quality. Adaptive prompting consistently outperformed non-adaptive strategies by sharply reducing semantic redundancy, elevating pre-reduction structural validity, and preserving substantially larger item pool, particularly when paired with newer, higher-capacity models. These gains were robust across temperature settings for most models, indicating that adaptive prompting mitigates common trade-offs between creativity and psychometric coherence. An exception was observed for the GPT-4o model at high temperatures, suggesting model-specific sensitivity to adaptive constraints at elevated stochasticity. Overall, the findings demonstrate that adaptive prompting is the strongest approach in this context, and that its benefits scale with model capability, motivating continued investigation of model–prompt interactions in generative psychometric pipelines.

Keywords: prompt engineering, large language models, generative psychometrics, Monte Carlo simulation, adaptive prompting, personality assessment, AI-GENIE framework, network psychometric methods

177. ❌ COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives

Authors: Azwad Anjum Islam, Tisa Islam Erana Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15897v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

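Scoring rationale: The paper's core topic is the use of LLMs for a semantic evaluation task, and it directly involves LLMs and Chain-of-Thought prompting, so both keywords receive 10. It uses zero-shot prompting and in-context learning, which relates to In-context Learning (5). Its reasoning process involves structured thinking, giving some connection to System 2 Thinking (5). Other keywords such as MoE, SFT, and RAG are not covered in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies how LLM ensembles and several prompting strategies (including Chain-of-Thought) can be used to rate the plausibility of homonym word senses in short stories. Model ensembling markedly improved agreement with human judgments, and the system placed 4th on the competition leaderboard.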

Abstract

We describe our system for SemEval-2026 Task 5, which requires rating the plausibility of given word senses of homonyms in short stories on a 5-point Likert scale. Systems are evaluated by the unweighted average of accuracy (within one standard deviation of mean human judgments) and Spearman Rank Correlation. We explore three prompting strategies using multiple closed-source commercial LLMs: (i) a baseline zero-shot setup, (ii) Chain-of-Thought (CoT) style prompting with structured reasoning, and (iii) a comparative prompting strategy for evaluating candidate word senses simultaneously. Furthermore, to account for the substantial inter-annotator variation present in the gold labels, we propose an ensemble setup by averaging model predictions. Our best official system, comprising an ensemble of LLMs across all three prompting strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman’s rho (0.86 average). Post-competition experiments with additional models further improved this performance to 0.92 accuracy and 0.85 Spearman’s rho (0.89 average). We find that comparative prompting consistently improved performance across model families, and model ensembling significantly enhanced alignment with mean human judgments, suggesting that LLM ensembles are especially well suited for subjective semantic evaluation tasks involving multiple annotators.

Keywords: LLM ensembles, Chain-of-Thought, word sense plausibility, prompting strategies, semantic evaluation, model ensembling, human judgments, comparative prompting
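
The abstract above describes two mechanical pieces: an ensemble that averages model predictions, and a task score that is the unweighted average of within-one-standard-deviation accuracy and Spearman's rho. The sketch below shows one plausible reading of both. The function names are hypothetical, and the official scorer may treat ties or tolerance edge cases differently.

```python
from statistics import mean


def _ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def ensemble(predictions_per_model):
    """Average the per-item ratings across models."""
    return [mean(item) for item in zip(*predictions_per_model)]


def task_score(pred, human_mean, human_sd):
    """Unweighted average of within-one-SD accuracy and Spearman's rho."""
    hits = [abs(p - m) <= s for p, m, s in zip(pred, human_mean, human_sd)]
    accuracy = sum(hits) / len(hits)
    return (accuracy + spearman(pred, human_mean)) / 2
```

The reported figures are consistent with this averaging: (0.88 + 0.83) / 2 = 0.855, which rounds to the stated 0.86, and (0.92 + 0.85) / 2 = 0.885, which rounds to 0.89.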

178. ❌ Temporal Fact Conflicts in LLMs: Reproducibility Insights from Unifying DYNAMICQA and MULAN

Authors: Ritajit Dey, Iadh Ounis, Graham McDonald, Yashar Moshfeghi Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15892v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

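Scoring rationale: The paper's core topic is how LLMs behave when handling temporal fact conflicts. It is highly relevant to 'Large Language Models' (10) because the whole paper revolves around LLMs. It has some connection to 'Hallucination Mitigation' (5) because the study concerns fact conflicts and the accuracy of model outputs, although it does not directly address hallucination mitigation techniques. Other keywords such as MoE, SLMs, training methods, inference acceleration, and AI for Science are not covered in the paper and receive 0.

!!! tip deepseek-chat TL;DR

By reproducing the DYNAMICQA and MULAN benchmarks, the study shows that LLM behavior on temporal fact conflicts depends on dataset design, evaluation metrics, and model size, and finds that MULAN's conclusions generalize better.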

Abstract

Large Language Models (LLMs) often struggle with temporal fact conflicts due to outdated or evolving information in their training data. Two recent studies with accompanying datasets report opposite conclusions on whether external context can effectively resolve such conflicts. DYNAMICQA evaluates how effective external context is in shifting the model’s output distribution, finding that temporal facts are more resistant to change. In contrast, MULAN examines how often external context changes memorised facts, concluding that temporal facts are easier to update. In this reproducibility paper, we first reproduce experiments from both benchmarks. We then reproduce the experiments of each study on the dataset of the other to investigate the source of their disagreement. To enable direct comparison of findings, we standardise both datasets to align with the evaluation settings of each study. Importantly, using an LLM, we synthetically generate realistic natural language contexts to replace MULAN’s programmatically constructed statements when reproducing the findings of DYNAMICQA. Our analysis reveals strong dataset dependence: MULAN’s findings generalise under both methodological frameworks, whereas applying MULAN’s evaluation to DYNAMICQA yields mixed outcomes. Finally, while the original studies only considered 7B LLMs, we reproduce these experiments across LLMs of varying sizes, revealing how model size influences the encoding and updating of temporal facts. Our results highlight how dataset design, evaluation metrics, and model size shape LLM behaviour in the presence of temporal knowledge conflicts.

Keywords: Large Language Models, temporal fact conflicts, reproducibility, dataset dependence, model size, external context, knowledge updating, benchmark evaluation

179. ❌ When Stability Fails: Hidden Failure Modes Of LLMS in Data-Constrained Scientific Decision-Making

Author: Nazia Riasat Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15840v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

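Scoring rationale: The paper's core topic is the application and evaluation of LLMs in scientific decision-making, specifically a gene prioritization task in bioinformatics, so it is highly relevant to 'Large Language Models' and 'AI for Science' (10 each). It examines the factuality and validity of LLM outputs, giving some connection to 'Hallucination Mitigation' and 'Explainable AI' (5 each). The other keywords concern specific technical methods (MoE, quantization, inference acceleration, and so on) or specific application scenarios (agents, tool calling) that the paper does not cover, so they receive 0.

!!! tip deepseek-chat TL;DR

The paper finds that in data-constrained scientific decision tasks, large language models (LLMs) can show good run-to-run stability while still deviating systematically from statistical ground truth and producing plausible-looking but invalid outputs. It stresses the importance of explicit ground-truth validation when deploying LLMs in scientific workflows.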

Abstract

Large language models (LLMs) are increasingly used as decision-support tools in data-constrained scientific workflows, where correctness and validity are critical. However, evaluation practices often emphasize stability or reproducibility across repeated runs. While these properties are desirable, stability alone does not guarantee agreement with statistical ground truth when such references are available. We introduce a controlled behavioral evaluation framework that explicitly separates four dimensions of LLM decision-making: stability, correctness, prompt sensitivity, and output validity under fixed statistical inputs. We evaluate multiple LLMs using a statistical gene prioritization task derived from differential expression analysis across prompt regimes involving strict and relaxed significance thresholds, borderline ranking scenarios, and minor wording variations. Our experiments show that LLMs can exhibit near-perfect run-to-run stability while systematically diverging from statistical ground truth, over-selecting under relaxed thresholds, responding sharply to minor prompt wording changes, or producing syntactically plausible gene identifiers absent from the input table. Although stability reflects robustness across repeated runs, it does not guarantee agreement with statistical ground truth in structured scientific decision tasks. These findings highlight the importance of explicit ground-truth validation and output validity checks when deploying LLMs in automated or semi-automated scientific workflows.

Keywords: Large language models, Scientific decision-making, Data-constrained workflows, Stability, Ground-truth validation, Gene prioritization, Bioinformatics, Output validity
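
The distinction the abstract above draws between stability, correctness, and output validity can be illustrated with a few lines of code. The sketch below scores repeated gene selections on each axis separately. It is an assumption-laden illustration rather than the paper's framework: the `evaluate_runs` helper, the use of Jaccard overlap for both stability and correctness, and the gene identifiers in the usage note are all hypothetical.

```python
def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    a, b = set(a), set(b)
    return 1.0 if not (a or b) else len(a & b) / len(a | b)


def evaluate_runs(runs, ground_truth, input_ids):
    """Score repeated LLM gene selections on three separate axes.

    runs          list of gene-ID lists, one per repeated run
    ground_truth  genes that pass the statistical threshold
    input_ids     every gene ID present in the input table
    """
    known = set(input_ids)
    pairs = [(i, j) for i in range(len(runs)) for j in range(i + 1, len(runs))]
    # Stability: mean pairwise agreement between runs.
    stability = (
        sum(jaccard(runs[i], runs[j]) for i, j in pairs) / len(pairs)
        if pairs else 1.0
    )
    # Correctness: mean agreement between each run and the statistical reference.
    correctness = sum(jaccard(r, ground_truth) for r in runs) / len(runs)
    # Validity: identifiers the model returned that were never in the input.
    invalid = sorted({g for r in runs for g in r if g not in known})
    return {"stability": stability, "correctness": correctness, "invalid_ids": invalid}
```

For example, three identical runs that each return `["TP53", "BRCA1", "GENE9"]` against a ground truth of `["TP53", "MYC"]` and an input table of `["TP53", "BRCA1", "MYC"]` give a stability of 1.0, a correctness of 0.25, and flag `GENE9` as invalid, which is the stable-but-wrong pattern the abstract warns about.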

180. ❌ FlashSampling: Fast and Memory-Efficient Exact Sampling

Authors: Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15854v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

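Scoring rationale: FlashSampling targets a core optimization of the decoding stage of large language models (LLMs): it fuses the sampling operation into the LM-head matrix multiplication and avoids writing the logits tensor to HBM, which speeds up inference and reduces memory traffic. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10) because it optimizes the LLM decoding process. It is rated highly relevant to 'KV Cache Compression OR Linear Attention OR FlashAttention' (10) because it is a memory-efficient computation technique in the style of FlashAttention that relies on tiled computation and fused kernels. It is highly relevant to 'Speculative Decoding OR Inference Acceleration' (10) because it directly accelerates LLM inference, reducing time per output token by up to 19% in the vLLM experiments. Other keywords such as MoE, SFT, RAG, and quantization have no direct connection to the paper's sampling optimization and receive 0.

!!! tip deepseek-chat TL;DR

FlashSampling is a memory-efficient method that fuses exact sampling into the language model head's matrix multiplication, so the logits tensor is never materialized in HBM. It speeds up kernel-level decoding on H100, H200, B200, and B300 GPUs and reduces time per output token by up to 19% in end-to-end vLLM experiments.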

Abstract

Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. The fused tiled kernel is exact because $\arg\max$ decomposes over a partition; grouped variants for online and tensor-parallel settings are exact by hierarchical factorization of the categorical distribution. Across H100, H200, B200, and B300 GPUs, FlashSampling speeds up kernel-level decode workloads, and in end-to-end vLLM experiments, it reduces time per output token by up to $19\%$ on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, turning a bandwidth-bound postprocessing step into a lightweight epilogue. Project Page: https://github.com/FlashSampling/FlashSampling.

Keywords: FlashSampling, exact sampling, memory-efficient, decoding acceleration, LM-head fusion, tiled kernel, vLLM, inference optimization
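The exactness argument in the abstract (the argmax decomposes over a partition of the vocabulary) can be illustrated with a minimal pure-Python sketch of the Gumbel-max trick. This is not the authors' fused GPU kernel: the function names, the tile size, and the precomputed `logits` list are assumptions made for illustration, and a real kernel would compute each tile of logits on chip inside the matmul instead of reading it from a list.

```python
import math
import random


def gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate."""
    u = rng.random()
    # random() returns values in [0, 1); redraw on 0 to avoid log(0).
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def sample_full(logits, noise):
    """Reference path: Gumbel-max over the whole vocabulary at once."""
    perturbed = [l + g for l, g in zip(logits, noise)]
    return max(range(len(perturbed)), key=perturbed.__getitem__)


def sample_tiled(logits, noise, tile_size):
    """Keep one (value, index) winner per tile, then reduce across tiles."""
    winners = []
    for start in range(0, len(logits), tile_size):
        stop = min(start + tile_size, len(logits))
        best = max(range(start, stop), key=lambda i: logits[i] + noise[i])
        winners.append((logits[best] + noise[best], best))
    # Final small reduction over the per-tile winners.
    return max(winners, key=lambda w: w[0])[1]


rng = random.Random(0)
logits = [rng.uniform(-3.0, 3.0) for _ in range(1000)]  # hypothetical logits
noise = [gumbel(rng) for _ in logits]
assert sample_tiled(logits, noise, tile_size=128) == sample_full(logits, noise)
```

Because both paths use the same noise, the tiled and full samplers return the same token index. That is the property that lets a kernel keep only one candidate per row and per vocabulary tile.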

181. ❌ Persona-Conditioned Risk Behavior in Large Language Models: A Simulated Gambling Study with GPT-4.1

Authors: Sankalp Dubedy Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15831v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

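Scoring rationale: The paper studies the risk-taking decisions of an LLM in a simulated gambling environment. Its core concern is the behavior of LLMs acting as autonomous agents (LLM Agents), and it asks whether they implicitly encode classical cognitive economic biases, which connects it to explainable AI. "Large Language Models OR LLMs OR Foundation Models" and "LLM Agents OR Autonomous Agents OR Agentic Workflow" are therefore highly relevant (10 points each). "Mechanistic Interpretability OR Explainable AI" is partially relevant (5 points) because the study bears on understanding the model's internal decision mechanisms. Other keywords such as MoE, SFT, RAG, and quantization are not addressed in the paper and score 0.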

!!! tip deepseek-chat TL;DR

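Using a simulated gambling experiment, the study examines GPT-4.1's risk-taking decisions under different socioeconomic personas. It finds that the model spontaneously shows behavior consistent with Prospect Theory, with the Poor persona showing a markedly higher appetite for risk.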

Abstract (Chinese translation)

大型语言模型（LLM）正越来越多地被部署为不确定、序列化决策环境中的自主智能体。然而，人们尚不清楚它们在此类环境中表现出的行为是反映了原则性的认知模式，还是仅仅为表层提示的模仿。本文设计了一项对照实验，为GPT-4.1分配了三种社会经济角色（富裕、中等收入、贫困）之一，并将其置于一个结构化的老虎机环境中，该环境包含三种不同的机器配置：公平（50%胜率）、低偏置（35%胜率）和连胜（在连续失利后动态提高胜率）。在每种条件下进行50次独立迭代并记录6,950次决策后，我们发现模型在没有被指示的情况下，再现了卡尼曼和特沃斯基前景理论所预测的关键行为特征。贫困角色每次会话平均进行37.4轮游戏（标准差=15.5），而富裕角色仅为1.1轮（标准差=0.31），这一差异具有高度显著性（克鲁斯卡尔-沃利斯H=393.5，p<2.2e-16）。按角色划分的风险评分显示出较大的效应量（贫困与富裕的科恩d值=4.15）。情感标签似乎充当了事后注解而非决策驱动因素（卡方=3205.4，克莱默V值=0.39），且跨轮次的信念更新可忽略不计（贫困角色的斯皮尔曼rho=0.032，p=0.016）。这些发现对LLM智能体设计、可解释性研究，以及经典认知经济偏差是否被隐式编码于大规模预训练语言模型中这一更广泛的问题，具有重要意义。

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents in uncertain, sequential decision-making contexts. Yet it remains poorly understood whether the behaviors they exhibit in such environments reflect principled cognitive patterns or simply surface-level prompt mimicry. This paper presents a controlled experiment in which GPT-4.1 was assigned one of three socioeconomic personas (Rich, Middle-income, and Poor) and placed in a structured slot-machine environment with three distinct machine configurations: Fair (50%), Biased Low (35%), and Streak (dynamic probability increasing after consecutive losses). Across 50 independent iterations per condition and 6,950 recorded decisions, we find that the model reproduces key behavioral signatures predicted by Kahneman and Tversky’s Prospect Theory without being instructed to do so. The Poor persona played a mean of 37.4 rounds per session (SD=15.5) compared to 1.1 rounds for the Rich persona (SD=0.31), a difference that is highly significant (Kruskal-Wallis H=393.5, p<2.2e-16). Risk scores by persona show large effect sizes (Cohen’s d=4.15 for Poor vs Rich). Emotional labels appear to function as post-hoc annotations rather than decision drivers (chi-square=3205.4, Cramer’s V=0.39), and belief-updating across rounds is negligible (Spearman rho=0.032 for Poor persona, p=0.016). These findings carry implications for LLM agent design, interpretability research, and the broader question of whether classical cognitive economic biases are implicitly encoded in large-scale pretrained language models.

Keywords: Large Language Models, LLM Agents, Risk Behavior, Prospect Theory, Autonomous Decision-making, Persona Conditioning, Cognitive Biases, Simulated Gambling
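The three machine configurations named in the abstract can be sketched as a small environment. The 50% and 35% win rates come from the abstract; the Streak machine's base rate, per-loss increment, and cap are not given there, so the values below are hypothetical placeholders, as are the class and function names.

```python
import random


class SlotMachine:
    """Slot machine whose win probability may rise after consecutive losses."""

    def __init__(self, base_p, streak_step=0.0, max_p=1.0, seed=None):
        self.base_p = base_p
        self.streak_step = streak_step  # hypothetical increment per loss
        self.max_p = max_p
        self.losses_in_a_row = 0
        self.rng = random.Random(seed)

    def win_probability(self):
        p = self.base_p + self.streak_step * self.losses_in_a_row
        return min(p, self.max_p)

    def play(self):
        won = self.rng.random() < self.win_probability()
        # A win resets the loss streak; a loss extends it.
        self.losses_in_a_row = 0 if won else self.losses_in_a_row + 1
        return won


def make_machine(kind, seed=None):
    if kind == "fair":
        return SlotMachine(0.50, seed=seed)
    if kind == "biased_low":
        return SlotMachine(0.35, seed=seed)
    if kind == "streak":
        # Base rate, step, and cap are illustrative guesses, not from the paper.
        return SlotMachine(0.30, streak_step=0.10, max_p=0.90, seed=seed)
    raise ValueError(f"unknown machine kind: {kind}")
```

In a setup like the one described, an agent would call `play()` once per round and decide after each outcome whether to continue, which yields the rounds-per-session counts the abstract reports.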

182. ❌ Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs

Authors: Yara Alakeel, Chatrine Qwaider, Hanan Aldarmaki, Sawsan Alqahtani Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15773v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

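Scoring rationale: The paper directly studies how large language models (LLMs) handle Arabic morphology, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). By analyzing how LLMs process complex morphological structure, it indirectly touches on understanding the models' internal workings, which gives it partial relevance to "Mechanistic Interpretability OR Explainable AI" (5 points). Other keywords, including MoE, SLMs, training methods, inference techniques, agent systems, and compression or acceleration, are not mentioned in the abstract and are unrelated to the paper's core content, so they score 0.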

!!! tip deepseek-chat TL;DR

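The study evaluates how effectively large language models and their tokenizers represent and generate Arabic root-pattern morphology. It finds that a tokenizer's morphological alignment is neither necessary nor sufficient for morphological generation, which calls into question the role of morphological tokenization in downstream performance.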

Abstract (Chinese translation)

本研究探讨大型语言模型（LLM）及其分词方案在表征和生成阿拉伯语根式形态时的有效性，旨在探究其是否捕捉到真实的形态结构，抑或仅依赖表层记忆。阿拉伯语形态系统为分析LLM如何处理复杂的非连接形态形式以及分词选择如何影响这一过程提供了丰富的测试平台。我们的研究首先评估了阿拉伯语专用分词器与多语言分词器相对于标准切分的形态保真度，随后利用新开发的测试集分析了LLM在能产性根式生成任务中的表现。通过对七个以阿拉伯语为核心及多语言LLM及其对应分词器的实验发现：分词器的形态对齐对于形态生成任务既非必要条件也非充分条件，这一结论对形态分词在下游任务性能中的作用提出了质疑。

Abstract

This work investigates how effectively large language models (LLMs) and their tokenization schemes represent and generate Arabic root-pattern morphology, probing whether they capture genuine morphological structure or rely on surface memorization. Arabic morphological system provides a rich testbed for analyzing how LLMs handle complex, non-concatenative forms and how tokenization choices influence this process. Our study begins with an evaluation of morphological fidelity across Arabic and multilingual tokenizers against gold-standard segmentation, followed by an analysis of LLM performance in productive root-pattern generation using a newly developed test set. Our findings across seven Arabic-centric and multilingual LLMs and their respective tokenizers reveal that tokenizer morphological alignment is not necessary nor sufficient for morphological generation, which questions the role of morphological tokenization in downstream performance.

Keywords: Large Language Models, LLMs, Arabic morphology, tokenization, root-pattern morphology, morphological generation, multilingual models, morphological fidelity

183. ❌ MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification

Authors: MiroMind Team, S. Bai, L. Bing, L. Lei, R. Li, X. Li, X. Lin, E. Min, L. Su, B. Wang, L. Wang, L. Wang, S. Wang, X. Wang, Y. Zhang, Z. Zhang, G. Chen, L. Chen, Z. Cheng, Y. Deng, Z. Huang, D. Ng, J. Ni, Q. Ren, X. Tang, B. L. Wang, H. Wang, N. Wang, C. Wei, Q. Wu, J. Xia, Y. Xiao, H. Xu, X. Xu, C. Xue, Z. Yang, Z. Yang, F. Ye, H. Ye, J. Yu, C. Zhang, W. Zhang, H. Zhao, P. Zhu Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15726v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

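Scoring rationale: The paper's core subject is LLM-driven agents (MiroThinker-1.7 and H1) for complex long-horizon reasoning tasks. It improves the reliability of multi-step reasoning through structured planning, contextual reasoning, and tool interaction, and it introduces verification mechanisms. Highly relevant keywords: LLM Agents (core topic), Tool Use (tool interaction), Chain of Thought (multi-step reasoning), System 2 Thinking (in-depth reasoning), and Self-Correction (verification and refinement). AI for Science receives 5 points because the paper is evaluated on scientific reasoning benchmarks, though this is not its core contribution. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and score 0.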

!!! tip deepseek-chat TL;DR

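The paper presents the research agents MiroThinker-1.7 and H1, which tackle complex long-horizon reasoning tasks through structured planning, tool interaction, and verification. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 reports state-of-the-art performance on deep research tasks.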

Abstract (Chinese translation)

我们推出MiroThinker-1.7，这是一种专为复杂长程推理任务设计的新型研究智能体。在此基础上，我们进一步引入MiroThinker-H1，该版本通过增强重型推理能力扩展了智能体功能，以实现更可靠的多步骤问题求解。具体而言，MiroThinker-1.7通过强调结构化规划、情境推理与工具交互的智能体中期训练阶段，提升了每个交互步骤的可靠性。这使得在复杂任务中能进行更有效的多步骤交互与持续推理。MiroThinker-H1进一步将验证机制直接融入局部与全局层面的推理过程：在推理过程中可评估并优化中间决策，同时审计整体推理轨迹以确保最终答案由连贯的证据链支撑。在涵盖开放网络研究、科学推理与金融分析的基准测试中，MiroThinker-H1在深度研究任务上实现了最先进的性能，同时在专业领域保持强劲表现。我们还将MiroThinker-1.7及MiroThinker-1.7-mini作为开源模型发布，以显著提升的效率提供具有竞争力的研究智能体能力。

Abstract

We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.

Keywords: research agent, long-horizon reasoning, multi-step problem solving, structured planning, tool interaction, verification, scientific reasoning, state-of-the-art performance

184. ❌ WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Authors: Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16871v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

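Scoring rationale: The paper studies generative models of interactive 3D gaming worlds; its core contribution is using camera pose as a unifying geometric representation to address action control and 3D consistency. It is unrelated to the vast majority of the keywords (large-model fundamentals, training methods, inference optimization, alignment, applications, and so on), which target language models and general AI techniques, whereas this paper focuses on video diffusion models and 3D gaming world generation. The only relevant keyword is "World Models AND General World Models": the paper explicitly studies "gaming world models", an application of world models to a specific domain (games), so it receives 10 points (highly relevant, core content).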

!!! tip deepseek-chat TL;DR

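The paper uses camera pose as a unifying geometric representation to strengthen interactive 3D gaming world models, addressing the weaknesses of existing methods in precise action control and long-horizon 3D consistency. Supported by a large-scale dataset, its experiments show substantial gains in action controllability, visual quality, and spatial consistency.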

Abstract (Chinese translation)

视频扩散变换器的最新进展催生了交互式游戏世界模型，使用户能够在扩展时间跨度内探索生成的环境。然而，现有方法在精确动作控制和长时程三维一致性方面仍面临挑战。大多数先前研究将用户动作视为抽象的条件信号，忽略了动作与三维世界之间根本的几何耦合关系——即动作引发相对相机运动，这些运动在三维世界中累积为全局相机位姿。本文中，我们确立相机位姿作为一种统一的几何表征，以共同支撑即时动作控制与长期三维一致性。首先，我们定义了一个基于物理的连续动作空间，并将用户输入表示为李代数形式，以推导精确的六自由度相机位姿，这些位姿通过相机嵌入器注入生成模型，从而确保准确的动作对齐。其次，我们使用全局相机位姿作为空间索引来检索相关的历史观测数据，实现在长时程导航过程中对场景的几何一致性重访。为支持本研究，我们引入了一个大规模数据集，包含3000分钟标注有相机轨迹和文本描述的真实人类游戏录像。大量实验表明，我们的方法在动作可控性、长时程视觉质量和三维空间一致性方面显著优于当前最先进的交互式游戏世界模型。

Abstract

Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.

Keywords: interactive gaming world models, camera pose, 3D consistency, video diffusion transformers, action controllability, long-horizon navigation, geometric representation, 6-DoF camera poses
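The geometric idea in the abstract, that per-step actions expressed in the Lie algebra induce relative camera motions which accumulate into a global 6-DoF pose, can be sketched with the standard SE(3) exponential map. This is a generic textbook construction rather than the paper's implementation: the twist ordering (translation first, then rotation), the right-multiplication convention (motion expressed in the current camera frame), and all function names are assumptions.

```python
import math


def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def se3_exp(twist):
    """Exponential map: twist (vx, vy, vz, wx, wy, wz) -> 4x4 pose matrix."""
    v, w = twist[:3], twist[3:]
    theta = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul(K, K)
    if theta < 1e-8:
        # Small-angle Taylor coefficients.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    I = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Rodrigues formula for the rotation, and the matrix V for the translation.
    R = [[I[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)] for i in range(3)]
    V = [[I[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)] for i in range(3)]
    t = [sum(V[i][j] * v[j] for j in range(3)) for i in range(3)]
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]


def accumulate(twists):
    """Compose per-step relative motions into one global camera pose."""
    pose = se3_exp([0.0] * 6)  # identity pose
    for twist in twists:
        pose = matmul(pose, se3_exp(twist))
    return pose
```

Four identical steps that each move forward while turning a quarter circle about the z-axis bring the accumulated pose back to the identity. A global pose of this kind is what would allow revisited locations to be indexed and matched against past observations.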

185. ❌ SegviGen: Repurposing 3D Generative Model for Part Segmentation

Authors: Lin Li, Haoran Feng, Zehuan Huang, Haohua Chen, Wenbo Nie, Shaohua Hou, Keqing Fan, Pan Hu, Sheng Wang, Buyu Li, Lu Sheng Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16869v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

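Scoring rationale: SegviGen repurposes 3D generative models for 3D part segmentation, relying on the structured priors of a pretrained 3D generative model. Measured against the keyword list, the paper belongs mainly to computer vision and 3D geometry processing rather than to large language models or innovations in deep learning fundamentals. The only relevant keyword is "Pre-training OR Continual Pre-training OR Domain Adaptation", because the paper builds on a pretrained 3D generative model; since this is not LLM pre-training, it receives 5 points (partially relevant). All other keywords are unrelated to the paper and score 0.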

!!! tip deepseek-chat TL;DR

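SegviGen is a new framework that performs 3D part segmentation with a pretrained 3D generative model, producing segmentations through part colorization. Using only 0.32% of the labeled training data, it improves on the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation.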

Abstract (Chinese translation)

我们提出SegviGen框架，该框架通过改造原生三维生成模型实现三维部件分割。现有技术方案通常通过蒸馏或多视角掩码聚合将强二维先验提升至三维空间，但常面临跨视角不一致性与边界模糊问题；另一类方法探索原生三维判别式分割，通常需要大规模标注三维数据与大量训练资源。相比之下，SegviGen利用预训练三维生成模型中编码的结构化先验，通过差异化部件着色引导分割，建立了一种新颖高效的三维部件分割框架。具体而言，SegviGen对三维资产进行编码，并在几何对齐重建的活跃体素上预测部件指示性颜色。该框架在统一架构中支持交互式部件分割、完整分割以及二维引导的完整分割。大量实验表明，SegviGen在交互式部件分割任务上超越现有最佳性能40%，在完整分割任务上提升15%，且仅需0.32%的标注训练数据。这证明预训练三维生成先验能有效迁移至三维部件分割任务，在有限监督条件下实现卓越性能。项目页面详见：https://fenghora.github.io/SegviGen-Page/。

Abstract

We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at https://fenghora.github.io/SegviGen-Page/.

Keywords: 3D generative models, part segmentation, pretrained priors, colorization, limited supervision, interactive segmentation, geometry-aligned reconstruction, voxel-based

186. ❌ What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers

Authors: Moritz Pawlowsky, Antonis Vamvakeros, Alexander Weiss, Anja Bielefeld, Samuel J. Cooper, Ronan Docherty Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16840v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

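Scoring rationale: The paper studies positional bias in vision transformers (ViTs), particularly feature foundation models such as DINOv2. It reduces the bias by finetuning the models to use ALiBi relative positional encoding and applies the result to microscopy image segmentation in materials science. Most keywords are unrelated because they target large language models (LLMs) and associated techniques (MoE, SFT, RLHF, RAG, and so on), whereas this paper focuses on vision transformers and computer vision applications. The most relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper explicitly addresses an AI application in materials science (microscopy image analysis), so it receives 10 points (highly relevant). "Post-training OR Supervised Fine-tuning OR SFT" is only partially relevant (the paper finetunes models, but this is not its core content) and receives 5 points.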

!!! tip deepseek-chat TL;DR

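The paper investigates the positional bias that positional encoding introduces in vision transformers (ViTs), reduces it by finetuning models to use ALiBi relative positional encoding, and applies the resulting unbiased features to the segmentation of complex microscopy images in materials science.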

Abstract (Chinese translation)

视觉变换器（ViTs）——尤其是像DINOv2这样的特征基础模型——能够学习适用于多种下游任务的丰富表征。然而，其架构选择（例如位置编码）可能导致这些模型产生与语义内容无关的位置偏差和伪影。这使得零样本适配在材料科学等领域变得困难，因为该领域的图像通常是均匀微观结构的截面（即不具有特定方向性）。在本研究中，我们通过线性探测探究了ViTs中的位置偏差，发现该偏差存在于多种训练目标和位置编码方案中；随后，我们通过微调模型以采用ALiBi相对位置编码来减少这种偏差。我们证明，这些模型保留了理想的通用语义特征，且其无偏差的特征可成功应用于复杂显微图像的可训练分割任务中。

Abstract

Vision transformers (ViTs) - especially feature foundation models like DINOv2 - learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead to these models displaying positional biases and artefacts independent of semantic content. This makes zero-shot adaption difficult in fields like material science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by finetuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and their unbiased features can be used successfully in trainable segmentation of complex microscopy images.

Keywords: Vision Transformers, positional bias, ALiBi positional encoding, DINOv2, material science, microscopy images, fine-tuning, segmentation
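As a rough illustration of why a relative encoding such as ALiBi carries no absolute-position signal, the sketch below builds a per-head attention bias that depends only on the offset between two patches. The slope schedule follows the original ALiBi formulation for language models with a power-of-two head count; the symmetric 2D Euclidean distance over the patch grid is an assumption, since the abstract does not say how the paper adapts ALiBi to images, and the function names are hypothetical.

```python
import math


def alibi_slopes(num_heads):
    """Geometric slope schedule from the original ALiBi paper (power-of-two heads)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias_2d(grid_h, grid_w, slope):
    """Bias matrix for one head: -slope * distance between patch grid cells.

    The Euclidean 2D distance is an illustrative choice, not the paper's stated one.
    """
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    return [[-slope * math.hypot(r1 - r2, c1 - c2) for (r2, c2) in coords]
            for (r1, c1) in coords]


slopes = alibi_slopes(8)
bias = alibi_bias_2d(3, 3, slopes[0])  # added to the attention logits of head 0
# Patches (0,0)->(1,1) and (1,1)->(2,2) share an offset, so their bias is equal.
assert bias[0][4] == bias[4][8]
```

The bias matrix would be added to the query-key logits before the softmax, so nearby patches are favored while the same offset is treated identically anywhere in the image.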

187. ❌ M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

Authors: Kerui Ren, Guanghao Li, Changjian Jiang, Yingxiang Xu, Tao Lu, Linning Xu, Junting Dong, Jiangmiao Pang, Mulin Yu, Bo Dai Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16844v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

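Scoring rationale: M^3 belongs to computer vision and robotics. It proposes a monocular Gaussian Splatting SLAM system that combines a multi-view foundation model with dense matching for streaming reconstruction from uncalibrated monocular video. The work is unrelated to most keywords (LLM techniques, alignment, reasoning, agents, and so on), which concern natural language processing and large language models. "Large Language Models OR LLMs OR Foundation Models" receives 8 points because the paper explicitly uses "multi-view foundation models", an application of foundation models to vision, which fits the stated research interest in applications of large models across domains. "AI for Science OR Bioinformatics OR Cheminformatics" receives 5 points because SLAM and 3D reconstruction can be viewed as AI applied to science or engineering, although not to bioinformatics or cheminformatics. All other keywords score 0 because the paper does not touch on them.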

!!! tip deepseek-chat TL;DR

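The paper addresses the need for high-precision pose estimation and online refinement in streaming reconstruction from uncalibrated monocular video. It augments a multi-view foundation model with a dense matching head, integrates it into monocular Gaussian Splatting SLAM, and reports state-of-the-art pose estimation and scene reconstruction accuracy on multiple benchmarks.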

Abstract (Chinese translation)

从未标定的单目视频流进行实时三维重建仍具挑战，这需要在动态环境中同时实现高精度位姿估计与高效在线优化。尽管将三维基础模型与SLAM框架结合是前景广阔的范式，但关键瓶颈依然存在：多数多视角基础模型以前馈方式估计位姿，产生的像素级对应关系缺乏严格几何优化所需精度。为此，我们提出M³模型，通过为多视角基础模型增设专用匹配头来获取细粒度密集对应关系，并将其集成至鲁棒的单目高斯溅射SLAM系统中。M³进一步引入动态区域抑制与跨推理内参对齐机制以提升跟踪稳定性。在多样化的室内外基准测试上的大量实验表明，该方法在位姿估计与场景重建方面均达到最先进精度。值得注意的是，在ScanNet++数据集上，M³相较VGGT-SLAM 2.0将ATE RMSE降低了64.3%，并在PSNR指标上以2.11 dB的优势超越ARTDECO。

Abstract

Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.

Keywords: Monocular Gaussian Splatting SLAM, Multi-view foundation models, Dense matching, Pose estimation, Streaming reconstruction, 3D reconstruction, Dynamic area suppression, Intrinsic alignment
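The two headline metrics can be read through their standard definitions. The worked conversion below is a rough sketch: it assumes a single image pair and a shared peak value $\mathrm{MAX}$, whereas reported PSNR is usually averaged over frames, so the implied error ratio is only indicative. The symbols $\mathbf{p}_i$ and $\hat{\mathbf{p}}_i$ (ground-truth and aligned estimated camera positions) are introduced here for illustration.

$$
\mathrm{ATE\ RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\lVert \mathbf{p}_i - \hat{\mathbf{p}}_i \rVert^2},
\qquad
\mathrm{PSNR} = 10\log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}}
$$

$$
\Delta\mathrm{PSNR} = 10\log_{10}\frac{\mathrm{MSE}_{\text{ARTDECO}}}{\mathrm{MSE}_{M^3}} = 2.11\ \text{dB}
\;\Rightarrow\;
\frac{\mathrm{MSE}_{\text{ARTDECO}}}{\mathrm{MSE}_{M^3}} = 10^{0.211} \approx 1.63
$$

Under the same reading, a 64.3% reduction in ATE RMSE means the remaining trajectory error is about 35.7% of the VGGT-SLAM 2.0 value.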

188. ❌ An assessment of data-centric methods for label noise identification in remote sensing data sets

Authors: Felix Kröber, Genc Hoxha, Ribana Roscher Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16835v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

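Scoring rationale: This paper studies data-centric methods for identifying and handling label noise in remote sensing data sets. It applies deep learning to a specific scientific field (remote sensing), but it is unrelated to the core keywords you follow: innovations in large-model techniques, LLMs, MoE, scaling laws, training and alignment techniques, inference optimization, and agents. It relates only to the last keyword, 'AI for Science', because remote sensing is an Earth-science application area. However, the paper does not involve large models or innovations in deep learning principles and is only an application of conventional deep learning to a specific domain, so its relevance is low.

!!! tip deepseek-chat TL;DR

This paper systematically evaluates three data-centric methods for identifying and handling label noise in remote sensing data sets, shows that they can effectively filter noise and improve task performance, and points out where further research is needed to transfer such methods to the remote sensing domain.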

Abstract

Label noise in the sense of incorrect labels is present in many real-world data sets and is known to severely limit the generalizability of deep learning models. In the field of remote sensing, however, automated treatment of label noise in data sets has received little attention to date. In particular, there is a lack of systematic analysis of the performance of data-centric methods that not only cope with label noise but also explicitly identify and isolate noisy labels. In this paper, we examine three such methods and evaluate their behavior under different label noise assumptions. To do this, we inject different types of label noise with noise levels ranging from 10 to 70% into two benchmark data sets, followed by an analysis of how well the selected methods filter the label noise and how this affects task performances. With our analyses, we clearly prove the value of data-centric methods for both parts - label noise identification and task performance improvements. Our analyses provide insights into which method is the best choice depending on the setting and objective. Finally, we show in which areas there is still a need for research in the transfer of data-centric label noise methods to remote sensing data. As such, our work is a step forward in bridging the methodological establishment of data-centric label noise methods and their usage in practical settings in the remote sensing domain.

Keywords: label noise, data-centric methods, remote sensing, deep learning, benchmark datasets, noise identification, task performance, data quality
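
The evaluation protocol described in the abstract (inject label noise at a chosen rate, then check how well a method recovers the corrupted samples) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes symmetric (uniform) label flipping, which is only one of the noise types the paper may use, and the function names `inject_symmetric_noise` and `detection_scores` are hypothetical.

```python
import random


def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip a fixed fraction of labels to a different, uniformly chosen class.

    Assumption: symmetric noise; the paper's noise types are not specified here.
    Returns the noisy label list and the set of indices that were flipped.
    """
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = list(labels)
    for i in flipped:
        # Draw only from the other classes so a flipped label is always wrong.
        choices = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(choices)
    return noisy, flipped


def detection_scores(flagged, flipped):
    """Precision and recall of flagged indices against the truly noisy indices."""
    true_positives = len(flagged & flipped)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(flipped) if flipped else 0.0
    return precision, recall


if __name__ == "__main__":
    clean = [i % 5 for i in range(100)]
    for rate in (0.1, 0.3, 0.5, 0.7):  # noise levels in the 10-70% range
        noisy, flipped = inject_symmetric_noise(clean, 5, rate)
        print(rate, len(flipped))
```

A noise-identification method would then output its own set of flagged indices, which `detection_scores` compares against `flipped`.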

189. ❌ Deep Reinforcement Learning-driven Edge Offloading for Latency-constrained XR pipelines

Authors: Sourya Saha, Saptarshi Debroy Venue/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16823v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

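Scoring rationale: The paper studies deep reinforcement learning scheduling for edge computing, focusing on latency and energy optimization for XR applications. All keywords concern large models, deep learning principles, or AI-for-science applications, whereas the core of this paper is systems optimization and resource management, with no direct connection to large-model techniques or AI-for-science applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

This paper proposes a battery-aware execution management framework based on deep reinforcement learning for edge-assisted XR systems, which optimizes execution decisions under dynamic network conditions. Experiments show that it extends projected device battery lifetime by up to 163% while maintaining high motion-to-photon latency compliance.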

Abstract

Immersive extended reality (XR) applications introduce latency-critical workloads that must satisfy stringent real-time responsiveness while operating on energy- and battery-constrained devices, making execution placement between end devices and nearby edge servers a fundamental systems challenge. Existing approaches to adaptive execution and computation offloading typically optimize average performance metrics and do not fully capture the sustained interaction between real-time latency requirements and device battery lifetime in closed-loop XR workloads. In this paper, we present a battery-aware execution management framework for edge-assisted XR systems that jointly considers execution placement, workload quality, latency requirements, and battery dynamics. We design an online decision mechanism based on a lightweight deep reinforcement learning policy that continuously adapts execution decisions under dynamic network conditions while maintaining high motion-to-photon latency compliance. Experimental results show that the proposed approach extends the projected device battery lifetime by up to 163% compared to latency-optimal local execution while maintaining over 90% motion-to-photon latency compliance under stable network conditions. Such compliance does not fall below 80% even under significantly limited network bandwidth availability, thereby demonstrating the effectiveness of explicitly managing latency-energy trade-offs in immersive XR systems.

Keywords: Deep Reinforcement Learning, Edge Offloading, Latency-constrained, XR pipelines, Battery-aware, Execution Management, Motion-to-photon Latency, Network Conditions

190. ❌ Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling

Authors: Christian Belardi, Justin Lovelace, Kilian Q. Weinberger, Carla P. Gomes Venue/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16797v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

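Scoring rationale: This paper focuses on sampling optimization for diffusion models, proposing adaptive moment estimation to stabilize the noisy likelihood scores during sampling. Although diffusion models belong to deep learning, the paper is unrelated to all scoring keywords, which center on large language models (LLMs) and their associated techniques, applications, and optimization methods. It does not touch on LLMs, MoE, SLMs, alignment, fine-tuning, inference acceleration, AI for Science, or any other keyword topic.

!!! tip deepseek-chat TL;DR

To address the instability of noisy likelihood scores in guided diffusion sampling, this paper proposes stabilizing them with adaptive moment estimation, outperforming more complicated methods on image restoration and class-conditional generation tasks.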

Abstract

Guided diffusion sampling relies on approximating often intractable likelihood scores, which introduces significant noise into the sampling dynamics. We propose using adaptive moment estimation to stabilize these noisy likelihood scores during sampling. Despite its simplicity, our approach achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complicated methods, which are often computationally more expensive. We provide empirical analysis of our method on both synthetic and real data, demonstrating that mitigating gradient noise through adaptive moments offers an effective way to improve alignment.

Keywords: diffusion sampling, adaptive moment estimation, guided diffusion, likelihood scores, image restoration, class-conditional generation, gradient noise, alignment
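
The abstract does not give the exact update rule, so the following is only a sketch of the general idea: pass each noisy guidance gradient through the standard Adam-style first and second moment estimates before using it in a sampling step. The class name `AdaptiveMoments` and the hyperparameter defaults are assumptions taken from the usual Adam formulation, not from the paper.

```python
import math


class AdaptiveMoments:
    """Adam-style running moments for smoothing a noisy gradient signal.

    Assumption: standard bias-corrected Adam estimates; the paper's actual
    integration into the diffusion sampler is not reproduced here.
    """

    def __init__(self, dim, beta1=0.9, beta2=0.999, eps=1e-8):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = [0.0] * dim  # running mean of gradients
        self.v = [0.0] * dim  # running mean of squared gradients
        self.t = 0            # number of updates so far

    def step(self, grad):
        """Return the normalized, bias-corrected direction for one noisy gradient."""
        self.t += 1
        direction = []
        for i, g in enumerate(grad):
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * g * g
            m_hat = self.m[i] / (1.0 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1.0 - self.beta2 ** self.t)
            direction.append(m_hat / (math.sqrt(v_hat) + self.eps))
        return direction


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    moments = AdaptiveMoments(dim=1)
    for _ in range(50):
        noisy_grad = [1.0 + rng.gauss(0.0, 2.0)]  # true signal 1.0 plus heavy noise
        smoothed = moments.step(noisy_grad)
    print(smoothed)
```

In a guided sampler, `step` would be called once per denoising step with the approximate likelihood-score gradient, and its output would replace the raw gradient in the guidance update.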

191. ❌ WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation

Authors: Muhammad Aamir, Naoya Muramatsu, Sangyun Shin, Matthew Wijers, Jiaxing Jhong, Xinyu Hou, Amir Patel, Andrew Markham Venue/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16816v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

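Scoring rationale: The paper "WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation" focuses on computer vision, specifically depth estimation, 3D reconstruction, and wildlife perception, using multimodal RGB and LiDAR data. All keywords relate to large models, deep learning principles, or applications of AI in science, but the paper does not address large models, LLMs, MoE, training techniques, inference optimization, agent systems, model compression, or similar topics. The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to wildlife research, a subfield of scientific applications. This is not its core content, so it receives 5 points (some relevance). The other keywords are entirely unrelated to the paper's topic and score 0.

!!! tip deepseek-chat TL;DR

This paper presents WildDepth, a multimodal dataset that addresses the lack of metric-scale data for animal depth estimation and 3D reconstruction. Experiments show that multimodal data improves depth reliability by up to 10% in RMSE and that RGB-LiDAR fusion improves 3D reconstruction fidelity by 12% in Chamfer distance.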

Abstract

Depth estimation and 3D reconstruction have been extensively studied as core topics in computer vision. Starting from rigid objects with relatively simple geometric shapes, such as vehicles, the research has expanded to address general objects, including challenging deformable objects, such as humans and animals. However, for the animal, in particular, the majority of existing models are trained based on datasets without metric scale, which can help validate image-only models. To address this limitation, we present WildDepth, a multimodal dataset and benchmark suite for depth estimation, behavior detection, and 3D reconstruction from diverse categories of animals ranging from domestic to wild environments with synchronized RGB and LiDAR. Experimental results show that the use of multi-modal data improves depth reliability by up to 10% RMSE, while RGB-LiDAR fusion enhances 3D reconstruction fidelity by 12% in Chamfer distance. By releasing WildDepth and its benchmarks, we aim to foster robust multimodal perception systems that generalize across domains.

Keywords: depth estimation, 3D reconstruction, multimodal dataset, wildlife perception, RGB-LiDAR fusion, animal behavior detection, metric scale validation, computer vision
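
The two metrics quoted in the abstract, RMSE for depth and Chamfer distance for 3D reconstruction, can be computed as in the sketch below. Conventions for Chamfer distance vary (squared versus unsquared distances, sum versus average of the two directions); this sketch assumes the sum of the two mean nearest-neighbor Euclidean distances, which may differ from the benchmark's definition.

```python
import math


def rmse(predicted, target):
    """Root mean squared error between two equal-length depth sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets (brute force, O(n*m)).

    Assumption: unsquared Euclidean distances, with the two directional
    means summed; other papers use squared or averaged variants.
    """
    def mean_nearest(source, destination):
        return sum(min(math.dist(p, q) for q in destination)
                   for p in source) / len(source)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


if __name__ == "__main__":
    scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    recon = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1), (0.0, 1.0, 0.1)]
    print(chamfer_distance(scan, recon))
    print(rmse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```

The brute-force nearest-neighbor search is fine for small point sets; a spatial index would be needed for LiDAR-scale clouds.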

192. ❌ Dual Stream Independence Decoupling for True Emotion Recognition under Masked Expressions

Authors: Jinsheng Wei, Xiguang Zhang, Zheng Shi, Guanming Lu Venue/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16760v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

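Scoring rationale: This paper focuses on computer vision and emotion recognition, proposing a new apexframe-based paradigm and a dual stream independence decoupling framework for recognizing true emotions under masked expressions. It applies deep learning to emotion recognition but does not involve large language models (LLMs), large-model principles, training or alignment methods, inference optimization, agent systems, or AI for science. None of the keywords matches the paper's topic, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

This paper proposes a new apexframe-based paradigm and a dual stream independence decoupling framework to address the challenge of recognizing true emotions from facial expressions in a stable disguised state. Experiments show that the framework improves recognition performance.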

Abstract

Recognizing true emotions from masked expressions is extremely challenging due to deliberate concealment. Existing paradigms recognize true emotions from masked-expression clips that contain onsetframes just starting to disguise. However, this paradigm may not reflect the actual disguised state, as the onsetframe leaks the true emotional information without reaching a stable disguise state. Thus, this paper introduces a novel apexframe-based paradigm that classifies true emotions from the apexframe with a stable disguised state. Furthermore, this paper proposes a novel dual stream independence decoupling framework that decouples true and disguised emotion features, avoiding the interference of disguised emotions on true emotions. For efficient decoupling, we design a decoupling loss group, comprising two classification losses that learn true emotion and disguised expression features, respectively, and a Hilbert-Schmidt Independence loss that enhances the independence of the two features. Experiments demonstrate that the apexframe-based paradigm is challenging, and the proposed decoupling framework improves recognition performance.

Keywords: true emotion recognition, masked expressions, apexframe-based paradigm, dual stream independence decoupling, disguised emotion features, Hilbert-Schmidt Independence loss, facial expression analysis, emotion feature decoupling
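
The Hilbert-Schmidt Independence loss mentioned in the abstract is presumably built on the Hilbert-Schmidt Independence Criterion (HSIC). The sketch below computes the standard biased empirical HSIC estimator, trace(K H L H) / (n - 1)^2, with Gaussian kernels. The kernel choice, the bandwidth `sigma`, and the function names are assumptions; the paper's actual loss may be configured differently.

```python
import math


def gaussian_kernel_matrix(points, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel values for a list of equal-length vectors."""
    n = len(points)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                      / (2.0 * sigma ** 2))
             for j in range(n)] for i in range(n)]


def center(kernel):
    """Double-center a symmetric kernel matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(kernel)
    row_means = [sum(row) / n for row in kernel]
    total_mean = sum(row_means) / n
    return [[kernel[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between paired feature lists xs and ys.

    Values near zero suggest independence under the chosen kernels;
    minimizing it as a loss pushes the two feature sets apart.
    """
    n = len(xs)
    kc = center(gaussian_kernel_matrix(xs, sigma))
    lc = center(gaussian_kernel_matrix(ys, sigma))
    return sum(kc[i][j] * lc[i][j]
               for i in range(n) for j in range(n)) / (n - 1) ** 2


if __name__ == "__main__":
    true_feats = [[0.0], [1.0], [2.0], [3.0]]
    disguised_feats = [[3.0], [1.0], [0.0], [2.0]]
    print(hsic(true_feats, true_feats), hsic(true_feats, disguised_feats))
```

In a training loop, the two arguments would be batch features from the true-emotion stream and the disguised-emotion stream, and the HSIC value would be added to the two classification losses.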

193. ❌ SuCor: Susceptibility Distortion Correction via Parameter-Free and Self-Regularized Optimal Transport

Authors: Sreekar Chigurupati, Eleftherios Garyfallidis Venue/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16758v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

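Scoring rationale: SuCor focuses on medical image processing, specifically distortion correction for EPI images, using optimal transport to correct geometric distortion. All scoring keywords concern large models, deep learning principles, or AI applications in science, whereas this paper studies a classical medical image processing algorithm and involves no deep learning, large models, or AI techniques, so it is unrelated to every keyword.

!!! tip deepseek-chat TL;DR

This paper proposes SuCor, a parameter-free, self-regularized method based on optimal transport for correcting susceptibility-induced distortion in EPI images. On an HCP dataset it achieves higher mean volumetric mutual information with the T1 image than FSL TOPUP (0.341 vs. 0.317) while running in about 12 seconds on a single CPU core.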

Abstract

We present SuCor, a method for correcting susceptibility-induced geometric distortions in echo planar imaging (EPI) using optimal transport (OT) along the phase encoding direction. Given a pair of reversed phase encoding EPI volumes, we model each column of the distortion field as a Wasserstein-2 barycentric displacement between the opposing-polarity intensity profiles. Regularization is performed in the spectral domain using a bending-energy penalty whose strength is selected automatically via the Morozov discrepancy principle, requiring no manual tuning. On a Human Connectome Project (HCP) dataset with left-right/right-left b0 EPI pairs and a co-registered T1 structural reference, SuCor achieves a mean volumetric mutual information of 0.341 with the T1 image, compared to 0.317 for FSL TOPUP, while running in approximately 12 seconds on a single CPU core.

Keywords: susceptibility distortion correction, optimal transport, echo planar imaging, Wasserstein barycenter, parameter-free regularization, Morozov discrepancy principle, geometric distortion, medical image processing
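
In one dimension the Wasserstein-2 optimal map between two intensity profiles reduces to quantile matching, so the barycentric displacement for a single column can be sketched as below. This is an illustration of that one idea only, not SuCor itself: it omits the spectral bending-energy regularization and the Morozov-based strength selection, treats each bin as a unit-width interval, and uses hypothetical function names.

```python
def quantile_positions(profile, levels):
    """Positions at which the normalized cumulative mass reaches each level.

    Bin i is treated as covering [i - 0.5, i + 0.5] with uniform mass inside.
    Levels are assumed to lie strictly between 0 and 1.
    """
    total = sum(profile)
    cdf, running = [], 0.0
    for value in profile:
        running += value / total
        cdf.append(running)
    positions = []
    for q in levels:
        # First bin whose cumulative mass reaches q (small tolerance for rounding).
        i = next(k for k, c in enumerate(cdf) if c >= q - 1e-12)
        previous = cdf[i - 1] if i > 0 else 0.0
        mass = cdf[i] - previous
        fraction = (q - previous) / mass if mass > 0 else 0.0
        fraction = max(0.0, min(1.0, fraction))
        positions.append(i - 0.5 + fraction)
    return positions


def barycentric_half_shift(profile_a, profile_b, n_levels=64):
    """Midpoint positions and half-displacements between two 1D profiles.

    For each quantile level, the W2 barycenter with equal weights sits at the
    midpoint of the two quantile positions; the half-shift moves profile_a
    onto that midpoint (and profile_b by the opposite amount).
    """
    levels = [(k + 0.5) / n_levels for k in range(n_levels)]
    qa = quantile_positions(profile_a, levels)
    qb = quantile_positions(profile_b, levels)
    midpoints = [(a + b) / 2.0 for a, b in zip(qa, qb)]
    half_shifts = [(b - a) / 2.0 for a, b in zip(qa, qb)]
    return midpoints, half_shifts


if __name__ == "__main__":
    up = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0]    # one phase-encoding polarity
    down = [0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0]  # opposite polarity
    _, shifts = barycentric_half_shift(up, down)
    print(shifts[:4])
```

For two profiles that differ by a pure translation, every half-shift equals half the translation, which is the behavior one expects from opposite-polarity distortions of equal magnitude.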

194. ❌ Semi-supervised Latent Disentangled Diffusion Model for Textile Pattern Generation

Authors: Chenggong Hu, Yi Wang, Mengqi Xue, Haofei Zhang, Jie Song, Li Sun Venue/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16747v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

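Scoring rationale: This paper focuses on the computer vision task of textile pattern generation and proposes a two-stage method based on a latent diffusion model and feature disentanglement. Its content revolves entirely around image generation, feature representation, and diffusion models, and does not involve large language models, innovations in deep learning principles, or AI for Science applications. The scoring keywords all concern large language models, training optimization, inference acceleration, AI agents, scientific AI applications, and similar topics, which have no direct connection to the paper's computer vision and image generation subject.

!!! tip deepseek-chat TL;DR

This paper proposes SLDDM-TPG, a semi-supervised latent disentangled diffusion model that addresses the loss of fine-grained detail caused by feature confusion in textile pattern generation. It reduces FID by 4.1 and improves SSIM by up to 0.116 on the CTP-HD dataset and generalizes well to VITON-HD.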

Abstract

Textile pattern generation (TPG) aims to synthesize fine-grained textile pattern images based on given clothing images. Although previous studies have not explicitly investigated TPG, existing image-to-image models appear to be natural candidates for this task. However, when applied directly, these methods often produce unfaithful results, failing to preserve fine-grained details due to feature confusion between complex textile patterns and the inherent non-rigid texture distortions in clothing images. In this paper, we propose a novel method, SLDDM-TPG, for faithful and high-fidelity TPG. Our method consists of two stages: (1) a latent disentangled network (LDN) that resolves feature confusion in clothing representations and constructs a multi-dimensional, independent clothing feature space; and (2) a semi-supervised latent diffusion model (S-LDM), which receives guidance signals from LDN and generates faithful results through semi-supervised diffusion training, combined with our designed fine-grained alignment strategy. Extensive evaluations show that SLDDM-TPG reduces FID by 4.1 and improves SSIM by up to 0.116 on our CTP-HD dataset, and also demonstrate good generalization on the VITON-HD dataset.

Keywords: Textile pattern generation, Latent diffusion model, Feature disentanglement, Semi-supervised learning, Image-to-image translation, Fine-grained alignment, Clothing representation, High-fidelity synthesis

195. ❌ World Reconstruction From Inconsistent Views

Authors: Lukas Höllein, Matthias Nießner Venue/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16736v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

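Scoring rationale: This paper focuses on 3D reconstruction and computer vision, proposing a method for reconstructing 3D worlds from the inconsistent views generated by video diffusion models. Its core is geometric foundation models, non-rigid alignment, and 3D reconstruction, which are unrelated to most keywords (such as LLMs, MoE, SFT, RAG, and CoT). The only relevant keyword is 'World Models AND General World Models', because the paper builds 3D world models from generated video, although not general world models. It is therefore given 10 points (highly relevant, but not core). No other keyword is involved, so the rest score 0.

!!! tip deepseek-chat TL;DR

This paper proposes a method for reconstructing high-quality 3D worlds from the inconsistent views generated by video diffusion models, achieving 3D consistency through non-rigid alignment and global optimization and outperforming baseline methods.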

Abstract

Video diffusion models generate high-quality and diverse worlds; however, individual frames often lack 3D consistency across the output sequence, which makes the reconstruction of 3D worlds difficult. To this end, we propose a new method that handles these inconsistencies by non-rigidly aligning the video frames into a globally-consistent coordinate frame that produces sharp and detailed pointcloud reconstructions. First, a geometric foundation model lifts each frame into a pixel-wise 3D pointcloud, which contains unaligned surfaces due to these inconsistencies. We then propose a tailored non-rigid iterative frame-to-model ICP to obtain an initial alignment across all frames, followed by a global optimization that further sharpens the pointcloud. Finally, we leverage this pointcloud as initialization for 3D reconstruction and propose a novel inverse deformation rendering loss to create high quality and explorable 3D environments from inconsistent views. We demonstrate that our 3D scenes achieve higher quality than baselines, effectively turning video models into 3D-consistent world generators.

Keywords: 3D reconstruction, video diffusion models, inconsistent views, non-rigid alignment, pointcloud, geometric foundation model, global optimization, inverse deformation rendering

196. ❌ When the City Teaches the Car: Label-Free 3D Perception from Infrastructure

Authors: Zhen Xu, Jinsu Yoo, Cristian Bautista, Zanming Huang, Tai-Yu Pan, Zhenzhen Liu, Katie Z Luo, Mark Campbell, Bharath Hariharan, Wei-Lun Chao Venue/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16742v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

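Scoring rationale: The paper studies infrastructure-assisted, label-free 3D perception for autonomous driving. It belongs to computer vision and autonomous driving and does not touch large language models, innovations in deep learning fundamentals, or AI for Science. None of the keywords matches the paper's topic, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

This paper proposes a new paradigm that uses city infrastructure as a label-free teacher for training the 3D perception models of autonomous vehicles. It reaches 82.3% AP for vehicle detection in a CARLA-based environment, showing that infrastructure-taught learning is a viable way to reduce annotation cost.

Abstract (Chinese translation)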

为自动驾驶构建稳健的三维感知系统目前仍严重依赖大规模数据采集与人工标注，但随着部署范围扩展至不同城市和区域，这一范式变得难以持续。与此同时，现代城市正日益配备路侧单元（RSUs）——部署于道路沿线及交叉口的静态传感器，用于监测交通状况。这自然引出一个问题：城市本身能否辅助训练车辆？我们提出“设施教学、无标注三维感知”这一新范式，其中路侧单元作为静态、无监督的“教师”指导自车学习。利用其固定视角与重复观测优势，路侧单元可从无标注数据中学习局部三维检测器，并将预测结果广播给途经车辆；这些预测结果被聚合为伪标注监督信号，用于训练独立的自车检测器。最终模型在测试阶段无需依赖基础设施或通信支持。我们将这一构想实现为完全无需标注的三阶段流程，并在基于CARLA的多智能体环境中进行了概念可行性验证。以CenterPoint检测器为基础，我们的流程在车辆检测任务中达到82.3%的平均精度（AP），而完全监督的自车检测上限为94.4%。我们进一步系统分析了各阶段性能，评估其可扩展性，并证明其与现有以自车为中心的无标注方法具有互补性。这些结果表明，城市基础设施本身有望为自动驾驶车辆提供可扩展的监督信号，使“设施教学学习”成为降低三维感知标注成本的一种极具潜力的正交范式。

Abstract

Building robust 3D perception for self-driving still relies heavily on large-scale data collection and manual annotation, yet this paradigm becomes impractical as deployment expands across diverse cities and regions. Meanwhile, modern cities are increasingly instrumented with roadside units (RSUs), static sensors deployed along roads and at intersections to monitor traffic. This raises a natural question: can the city itself help train the vehicle? We propose infrastructure-taught, label-free 3D perception, a paradigm in which RSUs act as stationary, unsupervised teachers for ego vehicles. Leveraging their fixed viewpoints and repeated observations, RSUs learn local 3D detectors from unlabeled data and broadcast predictions to passing vehicles, which are aggregated as pseudo-label supervision for training a standalone ego detector. The resulting model requires no infrastructure or communication at test time. We instantiate this idea as a fully label-free three-stage pipeline and conduct a concept-and-feasibility study in a CARLA-based multi-agent environment. With CenterPoint, our pipeline achieves 82.3% AP for detecting vehicles, compared to a fully supervised ego upper bound of 94.4%. We further systematically analyze each stage, evaluate its scalability, and demonstrate complementarity with existing ego-centric label-free methods. Together, these results suggest that city infrastructure itself can potentially provide a scalable supervisory signal for autonomous vehicles, positioning infrastructure-taught learning as a promising orthogonal paradigm for reducing annotation cost in 3D perception.

Keywords: 3D perception, self-driving, infrastructure-taught, label-free learning, roadside units, pseudo-label supervision, autonomous vehicles, CARLA simulation
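
The abstract above says that RSU predictions are broadcast to passing vehicles and aggregated as pseudo-labels. A minimal sketch of one way such aggregation could work follows. It is an assumption-laden illustration, not the paper's pipeline: the 2D poses, the `Detection` type, the confidence threshold, and the merge radius are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    x: float      # box centre in the sender's frame, metres
    y: float
    score: float  # detector confidence in [0, 1]


def rsu_to_ego(det, rsu_pose, ego_pose):
    """Map a detection from an RSU frame to the ego frame.

    Poses are (x, y, yaw) tuples in a shared world frame (hypothetical setup).
    """
    rx, ry, ryaw = rsu_pose
    # RSU frame -> world frame
    wx = rx + det.x * math.cos(ryaw) - det.y * math.sin(ryaw)
    wy = ry + det.x * math.sin(ryaw) + det.y * math.cos(ryaw)
    # world frame -> ego frame (inverse rigid transform)
    ex, ey, eyaw = ego_pose
    dx, dy = wx - ex, wy - ey
    lx = dx * math.cos(eyaw) + dy * math.sin(eyaw)
    ly = -dx * math.sin(eyaw) + dy * math.cos(eyaw)
    return Detection(lx, ly, det.score)


def aggregate_pseudo_labels(broadcasts, ego_pose, min_score=0.5, merge_radius=1.0):
    """Merge RSU broadcasts into one pseudo-label set in the ego frame.

    broadcasts: list of (rsu_pose, [Detection, ...]) pairs.
    Low-confidence boxes are dropped; boxes closer than merge_radius are
    treated as the same object and only the most confident one is kept.
    """
    candidates = []
    for rsu_pose, detections in broadcasts:
        for det in detections:
            if det.score >= min_score:
                candidates.append(rsu_to_ego(det, rsu_pose, ego_pose))
    candidates.sort(key=lambda d: d.score, reverse=True)
    labels = []
    for cand in candidates:
        if all(math.hypot(cand.x - kept.x, cand.y - kept.y) > merge_radius for kept in labels):
            labels.append(cand)
    return labels
```

Keeping the highest-scoring box per cluster is a simple stand-in for non-maximum suppression; a real 3D pipeline would work with full boxes (size, heading, height) rather than 2D centres.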

197. ❌ Emotion-Aware Classroom Quality Assessment Leveraging IoT-Based Real-Time Student Monitoring

Authors: Hai Nguyen, Hieu Dao, Hung Nguyen, Nam Vu, Cong Tran Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16719v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

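Scoring rationale: The paper presents an IoT-based, emotion-aware classroom quality assessment system with real-time student monitoring; it applies computer vision, affective computing, and IoT to education. Neither the title nor the abstract mentions large language models (LLMs), innovations in deep learning fundamentals, or any specific AI for Science technique. The scoring keywords all target large-model technology, deep learning fundamentals, or scientific AI applications, whereas this paper focuses on a conventional multi-agent affective computing framework, face detection, and classroom engagement classification. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

This study proposes an IoT-based, real-time multi-agent affective computing framework that assesses classroom quality by monitoring students' emotional states. In field tests at three educational institutions it achieved 88% overall accuracy in classifying classroom engagement states.

Abstract (Chinese translation)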

本研究提出了一种高吞吐量、实时多智能体情感计算框架，旨在通过情绪状态监测提升课堂学习效果。随着大班额教学和有限的师生互动日益成为教育工作者面临的挑战，对能够实时捕捉学生情绪与参与模式的可扩展数据驱动工具的需求日益增长。该系统使用包含1,500张标注图像和300段课堂检测视频的“课堂情绪数据集”进行评估。该系统专为物联网设备设计，通过高效的实时处理应对负载均衡与延迟挑战。实地测试在大型都市区的三所教育机构进行：一所小学（下文称A校）、一所初中（B校）和一所高中（C校）。系统表现出稳健性能，能以25帧/秒的速度检测多达50张人脸，并在课堂参与状态分类中达到88%的整体准确率。实施结果显示出积极成效，学生、教师和家长对课堂互动与教学适应性改善给予了积极反馈。本研究的主要贡献包括：建立了一个实用的、基于物联网的情绪感知学习环境框架，并引入了“课堂情绪数据集”以促进进一步的验证与研究。

Abstract

This study presents a high-throughput, real-time multi-agent affective computing framework designed to enhance classroom learning through emotional state monitoring. As large classroom sizes and limited teacher-student interaction increasingly challenge educators, there is a growing need for scalable, data-driven tools capable of capturing students’ emotional and engagement patterns in real time. The system was evaluated using the Classroom Emotion Dataset, consisting of 1,500 labeled images and 300 classroom detection videos. Tailored for IoT devices, the system addresses load balancing and latency challenges through efficient real-time processing. Field testing was conducted across three educational institutions in a large metropolitan area: a primary school (hereafter school A), a secondary school (school B), and a high school (school C). The system demonstrated robust performance, detecting up to 50 faces at 25 FPS and achieving 88% overall accuracy in classifying classroom engagement states. Implementation results showed positive outcomes, with favorable feedback from students, teachers, and parents regarding improved classroom interaction and teaching adaptation. Key contributions of this research include establishing a practical, IoT-based framework for emotion-aware learning environments and introducing the ‘Classroom Emotion Dataset’ to facilitate further validation and research.

Keywords: Emotion-Aware Classroom, IoT-Based Monitoring, Real-Time Student Monitoring, Multi-Agent Affective Computing, Classroom Engagement, Classroom Emotion Dataset, Face Detection, Educational Technology

198. ❌ Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search

Authors: Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16711v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

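Scoring rationale: The paper focuses on object-level motion editing in image-to-video generation, proposing the training-free method Search2Motion and an attention-consensus search strategy. The scoring keywords all target large language models, deep learning fundamentals, or AI applications in science, whereas this paper addresses a video generation task in computer vision and involves none of those topics, so every keyword relevance score is 0.

!!! tip deepseek-chat TL;DR

This paper proposes Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Target-frame control and attention-consensus search relocate objects while preserving scene stability, and the method outperforms baselines on FLF2V-obj and VBench.

Abstract (Chinese translation)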

我们提出Search2Motion，一种用于图像到视频生成中物体级运动编辑的无训练框架。与先前需要轨迹、边界框、掩码或运动场的方法不同，Search2Motion采用基于目标帧的控制，利用首尾帧运动先验实现物体重定位，同时无需微调即可保持场景稳定性。通过语义引导的物体插入和鲁棒的背景修复，我们实现了可靠的目标帧构建。我们进一步证明，早期步的自注意力图能够预测物体和相机动态，提供可解释的用户反馈，并由此启发我们提出了ACE-Seed（基于注意力共识的早期步种子选择）——一种轻量级搜索策略，无需前瞻采样或外部评估器即可提升运动保真度。针对现有基准测试混淆物体与相机运动的问题，我们引入了S2M-DAVIS和S2M-OMB数据集用于稳定相机下的纯物体运动评估，同时提出了FLF2V-obj指标，该指标可在无需真实轨迹的情况下分离物体运动伪影。在FLF2V-obj和VBench基准测试中，Search2Motion始终优于基线方法。

Abstract

We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.

Keywords: Search2Motion, training-free, object-level motion editing, image-to-video generation, attention consensus, target-frame control, motion fidelity, self-attention maps

199. ❌ HMAR: Hierarchical Modality-Aware Expert and Dynamic Routing Medical Image Retrieval Architecture

Authors: Aojie Yuan Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16679v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

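Scoring rationale: The paper's core contribution is HMAR, a medical image retrieval framework built on a Mixture-of-Experts (MoE) architecture, so it is highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (10 points). As a medical imaging AI application it has some connection to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points), although it is not core bioinformatics or cheminformatics. The paper does not involve large language models (LLMs), model training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agents, model compression, or the other keywords, so those are scored 0.

!!! tip deepseek-chat TL;DR

To address uniform feature encoding, ambiguous similarity metrics, and the lack of fine-grained retrieval in medical image retrieval, this paper proposes HMAR, a framework built on a Mixture-of-Experts architecture. On the RadioImageNet-CT dataset it achieves a higher retrieval mAP than the prior state-of-the-art ACIR method.

Abstract (Chinese translation)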

医学影像检索是计算机辅助诊断的关键组成部分，但现有系统存在三个长期局限：采用统一的特征编码，未能考虑解剖结构在临床重要性上的差异；基于粗略分类标签的相似性度量模糊不清；以及仅关注全局图像相似性，无法满足临床对细粒度区域特异性检索的需求。我们提出HMAR（分层模态感知专家与动态路由），一种基于混合专家架构的自适应检索框架。HMAR采用双专家机制：Expert0提取全局特征以进行整体相似性匹配，而Expert1学习位置不变的局部表征以实现精确病灶区域检索。一种两阶段对比学习策略消除了对昂贵边界框标注的依赖，滑动窗口匹配算法在推理时实现了密集的局部比较。哈希码通过科尔莫戈罗夫-阿诺德网络层生成，以支持高效的汉明距离搜索。在RadioImageNet-CT数据集（16种临床模式，29,903张图像）上的实验表明，HMAR在64位和128位哈希码上分别实现了0.711和0.724的平均精度均值，较当前最先进的ACIR方法分别提升了0.7%和1.1%。

Abstract

Medical image retrieval (MIR) is a critical component of computer-aided diagnosis, yet existing systems suffer from three persistent limitations: uniform feature encoding that fails to account for the varying clinical importance of anatomical structures, ambiguous similarity metrics based on coarse classification labels, and an exclusive focus on global image similarity that cannot meet the clinical demand for fine-grained region-specific retrieval. We propose HMAR (Hierarchical Modality-Aware Expert and Dynamic Routing), an adaptive retrieval framework built on a Mixture-of-Experts (MoE) architecture. HMAR employs a dual-expert mechanism: Expert0 extracts global features for holistic similarity matching, while Expert1 learns position-invariant local representations for precise lesion-region retrieval. A two-stage contrastive learning strategy eliminates the need for expensive bounding-box annotations, and a sliding-window matching algorithm enables dense local comparison at inference time. Hash codes are generated via Kolmogorov-Arnold Network (KAN) layers for efficient Hamming-distance search. Experiments on the RadioImageNet-CT dataset (16 clinical patterns, 29,903 images) show that HMAR achieves mean Average Precision (mAP) of 0.711 and 0.724 for 64-bit and 128-bit hash codes, improving over the state-of-the-art ACIR method by 0.7% and 1.1%, respectively.

Keywords: Medical Image Retrieval, Mixture-of-Experts, Hierarchical Modality-Aware, Dynamic Routing, Contrastive Learning, Hash Codes, Kolmogorov-Arnold Network, Fine-grained Retrieval
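
The abstract above relies on two standard building blocks: Hamming-distance search over binary hash codes, and average precision (whose mean over queries is mAP). The sketch below shows both in plain Python. It says nothing about how HMAR produces its codes; the image ids and 8-bit codes in the usage note are hypothetical (the paper uses 64-bit and 128-bit codes).

```python
def hamming(a, b):
    """Number of differing bits between two hash codes stored as ints."""
    return bin(a ^ b).count("1")


def hamming_search(query, database, k=3):
    """Return the k entries of `database` closest to `query`.

    database maps an image id to its integer hash code; ties are broken
    by image id so the ranking is deterministic.
    """
    ranked = sorted(database.items(), key=lambda item: (hamming(query, item[1]), item[0]))
    return [(image_id, hamming(query, code)) for image_id, code in ranked[:k]]


def average_precision(relevance):
    """Average precision of one ranked result list.

    relevance[i] is True when the result at rank i + 1 is relevant.
    Averaging this value over all queries gives mAP.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

For example, with a toy database `{"ct_001": 0b10110010, "ct_002": 0b10110011, "ct_003": 0b01001101}`, calling `hamming_search(0b10110010, db, k=2)` ranks `ct_001` (distance 0) ahead of `ct_002` (distance 1). Because the distance is one XOR plus a bit count, it stays cheap even for large databases.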

200. ❌ vAccSOL: Efficient and Transparent AI Vision Offloading for Mobile Robots

Authors: Adam Zahir, Michele Gucciardo, Falk Selker, Anastasios Nanos, Kostis Papazafeiropoulos, Carlos J. Bernardos, Nicolas Weber, Roberto Gonzalez Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16685v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

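Scoring rationale: vAccSOL is a framework for efficiently offloading AI vision workloads on mobile robots. It involves a neural network compiler, edge computing, and energy-efficiency optimization, but not large language models (LLMs), innovations in deep learning fundamentals, or scientific applications. The scoring keywords all target large-model technology, training methods, inference optimization, alignment, agent systems, and similar topics, whereas this paper studies the deployment and compute offloading of computer vision models and has no connection to those keywords.

!!! tip deepseek-chat TL;DR

The paper proposes vAccSOL, a framework that combines a neural network compiler with a lightweight execution framework to offload AI vision workloads from mobile robots efficiently and transparently. With edge offloading it reduces robot-side power consumption by up to 80% and raises the vision pipeline frame rate by up to 24x.

Abstract (Chinese translation)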

移动机器人正日益广泛地应用于巡检、巡逻和搜救等任务，其依赖计算机视觉实现环境感知、导航与自主决策。然而，由于机载计算资源有限且受严格的能耗约束，在机器人本体上执行现代视觉计算任务具有挑战性。尽管部分平台配备了嵌入式加速器，但这些加速器通常与专有软件栈绑定，导致用户自定义的计算负载只能在资源受限的附属计算机上运行。

本文提出vAccSOL框架，旨在异构机器人及边缘平台上实现高效、透明的基于人工智能的视觉任务执行。vAccSOL整合了两个核心组件：SOL（一种神经网络编译器，可生成具有最小运行时依赖的优化推理库）和vAccel（一种轻量级执行框架，能够透明地将推理任务调度至机器人本地或邻近的边缘基础设施执行）。该组合方案可在无需修改机器人应用程序的前提下，实现硬件优化的推理计算和灵活的任务部署策略。

我们在真实测试环境中对vAccSOL进行了评估，测试平台包括一台商业四足机器人和涵盖图像分类、视频分类及语义分割的十二个深度学习模型。与PyTorch编译器基准相比，SOL实现了相当或更优的推理性能。通过边缘卸载机制，vAccSOL相较于PyTorch将机器人端功耗降低最高达80%，边缘端功耗降低最高达60%，同时将视觉处理流水线的帧率提升最高达24倍，显著延长了电池供电机器人的持续运行时间。

Abstract

Mobile robots are increasingly deployed for inspection, patrol, and search-and-rescue operations, relying on computer vision for perception, navigation, and autonomous decision-making. However, executing modern vision workloads onboard is challenging due to limited compute resources and strict energy constraints. While some platforms include embedded accelerators, these are typically tied to proprietary software stacks, leaving user-defined workloads to run on resource-constrained companion computers. We present vAccSOL, a framework for efficient and transparent execution of AI-based vision workloads across heterogeneous robotic and edge platforms. vAccSOL integrates two components: SOL, a neural network compiler that generates optimized inference libraries with minimal runtime dependencies, and vAccel, a lightweight execution framework that transparently dispatches inference locally on the robot or to nearby edge infrastructure. This combination enables hardware-optimized inference and flexible execution placement without requiring modifications to robot applications. We evaluate vAccSOL on a real-world testbed with a commercial quadruped robot and twelve deep learning models covering image classification, video classification, and semantic segmentation. Compared to a PyTorch compiler baseline, SOL achieves comparable or better inference performance. With edge offloading, vAccSOL reduces robot-side power consumption by up to 80% and edge-side power by up to 60% compared to PyTorch, while increasing vision pipeline frame rate by up to 24x, extending the operating lifetime of battery-powered robots.

Keywords: mobile robots, AI vision workloads, neural network compiler, edge offloading, inference optimization, power consumption reduction, frame rate improvement, heterogeneous platforms

201. ❌ $x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space

Authors: Ruishan Guo, Ciyu Ruan, Haoyang Wang, Zihang Gong, Jingao Xu, Xinlei Chen Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16671v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

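Scoring rationale: The paper studies multimodal fusion and motion estimation in computer vision: fusing event-camera, image, and LiDAR data and jointly estimating 2D optical flow and 3D scene flow. Its core is a unified representation based on the Event Edge Space, combined with reliability-aware adaptive fusion. The given keywords all relate to large language models, deep learning fundamentals, or AI applications in science, whereas this paper concentrates on sensor fusion and motion estimation in computer vision, which is unrelated to those topics. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes $x^2$-Fusion, which unifies event-camera, image, and LiDAR data in an Event Edge Space for multimodal fusion and jointly estimates 2D optical flow and 3D scene flow. It achieves state-of-the-art accuracy under standard conditions and substantial improvements in challenging scenarios.

Abstract (Chinese translation)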

估计稠密的二维光流与三维场景流对动态场景理解至关重要。近期研究结合图像、激光雷达和事件数据联合预测二维与三维运动，但多数方法在异构的特征空间中分别处理。由于缺乏所有模态都能对齐的共享潜在空间，这些系统依赖多个模态专用模块，未能解决跨传感器失配问题，且使融合过程不必要地复杂化。事件相机天然提供时空边缘信号，我们可将其视为固有边缘场，用以锚定一个统一的潜在表示，称为事件边缘空间。基于此思路，我们提出$x^2$-Fusion方法，将多模态融合重新定义为表示统一：事件衍生的时空边缘定义了一个以边缘为中心的同质空间，图像与激光雷达特征在此共享表示中被显式对齐。在该空间内，我们执行可靠性感知的自适应融合，以估计模态可靠性并在性能退化时强调稳定线索。我们进一步采用跨维度对比学习，将二维光流与三维场景流紧密耦合。在合成与真实基准上的大量实验表明，$x^2$-Fusion在标准条件下达到了最先进的精度，并在挑战性场景中实现了显著提升。

Abstract

Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex. Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce $x^2$-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared representation. Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that $x^2$-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.

Keywords: Event cameras, Multimodal fusion, Optical flow, Scene flow, Cross-modality, Cross-dimension, Event Edge Space, Reliability-aware fusion
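
The abstract above describes reliability-aware adaptive fusion only at a high level. One common way to realise the idea is to turn per-modality reliability scores into softmax weights and take a weighted sum of the aligned features. The sketch below assumes exactly that; it is not the paper's module, and the modality names and scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, reliability):
    """Fuse per-modality feature vectors with reliability-derived weights.

    features:    dict modality -> list of floats, all the same length and
                 already aligned in a shared space (assumed).
    reliability: dict modality -> unnormalised reliability score.
    Returns (fused_vector, weights_by_modality).
    """
    names = sorted(features)
    weights = softmax([reliability[name] for name in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))
```

When one modality degrades (for example, an image in low light), lowering its reliability score shrinks its weight, so the fused feature leans on the remaining modalities.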

202. ❌ Spectral Property-Driven Data Augmentation for Hyperspectral Single-Source Domain Generalization

Authors: Taiqin Chen, Yifeng Wang, Xiaochen Feng, Zhilin Zhu, Hao Sha, Yingjian Li, Yongbing Zhang Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16662v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

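Scoring rationale: The paper addresses single-source domain generalization in hyperspectral image (HSI) classification and proposes a spectral property-driven data augmentation method (SPDDA). Its core techniques are domain generalization and data augmentation from computer vision and machine learning, not large language models (LLMs) or innovations in deep learning fundamentals. The keywords all concern LLMs and related techniques (training, alignment, inference optimization, agents, and so on) or explicitly require a combination with LLMs (for example "MCTS AND LLM"). The paper is unrelated to these core LLM techniques, so those relevance scores are 0. The only keyword with a weak connection is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI methods to remote sensing, which can be viewed as an Earth science application. However, it does not focus on bioinformatics or cheminformatics and does not present itself as a typical example of AI for Science, so it receives a low score of 5 (some relevance).

!!! tip deepseek-chat TL;DR

To address the trade-off between realism and diversity in data augmentation for hyperspectral single-source domain generalization, this study proposes spectral property-driven data augmentation (SPDDA). Using a spectral diversity module and a spatial-spectral co-optimization mechanism, it outperforms state-of-the-art methods on three remote sensing benchmarks.

Abstract (Chinese translation)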

高光谱图像（HSI）得益于其众多的光谱通道，为分类提供了丰富信息，但维度的增加和传感器差异性使其对跨域分布差异更为敏感，进而影响分类性能。针对这一问题，高光谱单源域泛化（SDG）通常在单源域训练数据可用的条件下，采用数据增强来模拟潜在的域偏移并提升模型鲁棒性。然而，盲目的增强可能产生与现实场景不符的样本，而过度强调真实性又会抑制多样性，这凸显了真实性与多样性之间的权衡，限制了对目标域的泛化能力。为解决这一挑战，我们提出了一种光谱特性驱动的数据增强方法（SPDDA），该方法显式地考虑了高光谱图像的内在特性，即依赖于设备的光谱通道数量变化以及相邻通道的混合效应。具体而言，SPDDA采用一个光谱多样性模块，沿光谱维度对源域数据进行重采样以生成具有不同光谱通道数的样本，并通过建模通道间相似性构建通道自适应的光谱混合器，从而避免固定的增强模式。为进一步提升增强样本的真实性，我们提出了一种空间-光谱协同优化机制，联合优化空间保真度约束与光谱连续性自约束。此外，光谱自约束的权重会根据空间约束分量进行自适应调整，从而防止光谱维度的过度平滑并保持空间结构。在三个遥感基准数据集上的大量实验表明，SPDDA的性能优于现有先进方法。

Abstract

While hyperspectral images (HSI) benefit from numerous spectral channels that provide rich information for classification, the increased dimensionality and sensor variability make them more sensitive to distributional discrepancies across domains, which in turn can affect classification performance. To tackle this issue, hyperspectral single-source domain generalization (SDG) typically employs data augmentation to simulate potential domain shifts and enhance model robustness under the condition of single-source domain training data availability. However, blind augmentation may produce samples misaligned with real-world scenarios, while excessive emphasis on realism can suppress diversity, highlighting a tradeoff between realism and diversity that limits generalization to target domains. To address this challenge, we propose a spectral property-driven data augmentation (SPDDA) that explicitly accounts for the inherent properties of HSI, namely the device-dependent variation in the number of spectral channels and the mixing of adjacent channels. Specifically, SPDDA employs a spectral diversity module that resamples data from the source domain along the spectral dimension to generate samples with varying spectral channels, and constructs a channel-wise adaptive spectral mixer by modeling inter-channel similarity, thereby avoiding fixed augmentation patterns. To further enhance the realism of the augmented samples, we propose a spatial-spectral co-optimization mechanism, which jointly optimizes a spatial fidelity constraint and a spectral continuity self-constraint. Moreover, the weight of the spectral self-constraint is adaptively adjusted based on the spatial counterpart, thus preventing over-smoothing in the spectral dimension and preserving spatial structure. Extensive experiments conducted on three remote sensing benchmarks demonstrate that SPDDA outperforms state-of-the-art methods.

Keywords: Hyperspectral Image, Single-Source Domain Generalization, Data Augmentation, Spectral Property, Spatial-Spectral Co-optimization, Remote Sensing, Classification
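
Two of the spectral properties named in the abstract above, a device-dependent channel count and mixing between adjacent channels, can be illustrated on a single pixel spectrum. The sketch below is a simplified stand-in: it uses linear interpolation for resampling and a fixed three-tap kernel for mixing, whereas the paper describes a channel-wise adaptive mixer driven by inter-channel similarity. The function names and the `alpha` parameter are assumptions.

```python
def resample_spectrum(spectrum, new_len):
    """Linearly interpolate a 1-D spectrum to `new_len` channels."""
    n = len(spectrum)
    if n < 2 or new_len < 2:
        raise ValueError("need at least two channels on both sides")
    out = []
    for i in range(new_len):
        pos = i * (n - 1) / (new_len - 1)  # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1 - frac) * spectrum[lo] + frac * spectrum[hi])
    return out


def mix_adjacent(spectrum, alpha=0.1):
    """Blend each channel with its two neighbours.

    Uses fixed weights (alpha, 1 - 2 * alpha, alpha); edge channels reuse
    their own value in place of the missing neighbour.
    """
    n = len(spectrum)
    mixed = []
    for i in range(n):
        left = spectrum[max(i - 1, 0)]
        right = spectrum[min(i + 1, n - 1)]
        mixed.append(alpha * left + (1 - 2 * alpha) * spectrum[i] + alpha * right)
    return mixed
```

Applying `resample_spectrum` with several target lengths to every pixel of a source image yields augmented samples that mimic sensors with different channel counts, and `mix_adjacent` imitates spectral overlap between neighbouring bands.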

203. ❌ Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation

Authors: Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, Ziwei Liu Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16669v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

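Rationale: The paper "Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation" focuses on simulating robot-world interaction in embodied AI and proposes a new action-conditioned 4D generative robotic simulator. Its core contribution is a 4D world model (Kinematic 4D World Modeling) for spatiotemporal embodied simulation, which is highly relevant to the keyword "World Models AND General World Models" because the paper directly studies and builds a 4D world model of robot-world interaction dynamics. However, the paper mainly concerns robotics, computer vision (video generation, pointmaps), simulators, and the construction of a specific dataset (Robo4D-200k). It does not involve large language models (LLMs), innovations in deep learning techniques (such as MoE, scaling laws, training and fine-tuning methods, inference optimization, or agent frameworks), or applications of AI to specific scientific fields (such as bioinformatics). Therefore, apart from the "World Models" keyword, all other keywords are entirely unrelated to the paper's topic.

!!! tip deepseek-chat TL;DR

Existing robot-world interaction simulators are mostly limited to 2D space or static environmental cues and ignore that interaction is inherently a 4D spatiotemporal event. To address this, the paper proposes Kinema4D, a new action-conditioned 4D generative robotic simulator that decouples the generation of precise 4D robot control trajectories from the generative 4D modeling of environmental reactions. Combined with the large-scale Robo4D-200k dataset, it simulates physically plausible, geometry-consistent, embodiment-agnostic, high-fidelity interactions and, for the first time, shows potential zero-shot transfer capability.

Abstract translation (Chinese)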

模拟机器人-世界交互是具身人工智能的基石。近期，少数研究展现出利用视频生成技术超越传统模拟器僵化的视觉/物理约束的潜力。然而，这些方法主要工作在二维空间，或仅受静态环境线索引导，忽视了机器人-世界交互本质上是四维时空事件、需要精确交互建模这一基本现实。为恢复这一四维本质，同时确保精确的机器人控制，我们提出了Kinema4D——一种新型动作条件化四维生成式机器人模拟器。该方法将机器人-世界交互解耦为：i）机器人控制的精确四维表征：我们通过运动学驱动基于URDF（统一机器人描述格式）的三维机器人，生成精确的四维机器人控制轨迹；ii）环境反应的生成式四维建模：将四维机器人轨迹投影为点云图作为时空视觉信号，控制生成模型将复杂环境的反应动力学合成为同步的RGB/点云序列。为促进训练，我们构建了大规模数据集Robo4D-200k，包含201,426个具有高质量四维标注的机器人交互片段。大量实验表明，我们的方法能有效模拟物理合理、几何一致且与具体形态无关的交互，真实反映多样化的现实世界动力学。该方法首次展现出零样本迁移的潜力，为推进下一代具身模拟提供了高保真基础。

Abstract

Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generations to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring the precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments’ reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curated a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically-plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.
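
The first half of the decomposition, driving a robot model via kinematics and projecting the resulting trajectory into image space, can be sketched in a few lines. This is a toy illustration only: a planar two-link arm stands in for a URDF-described robot, and a pinhole projection stands in for pointmap rendering. All function names, the camera parameters, and the fixed depth are assumptions, not the Kinema4D code.

```python
import math


def forward_kinematics(joint_angles, link_lengths, depth=2.0):
    """Joint positions of a planar serial arm, placed at a fixed camera depth."""
    x, y, theta = 0.0, 0.0, 0.0
    points = [(x, y, depth)]
    for angle, length in zip(joint_angles, link_lengths):
        theta += angle  # joint angles accumulate along the chain
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        points.append((x, y, depth))
    return points


def project(point, focal=100.0, cx=64.0, cy=64.0):
    """Pinhole projection of a 3-D point to pixel coordinates (assumed camera)."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def rollout(action_sequence, link_lengths):
    """One set of 3-D joint positions per time step: a toy 3-D + time trajectory."""
    return [forward_kinematics(a, link_lengths) for a in action_sequence]


def to_pixel_tracks(trajectory):
    """Project every joint of every frame into the image plane."""
    return [[project(p) for p in frame] for frame in trajectory]


trajectory = rollout([[0.0, 0.0], [math.pi / 2, 0.0]], [1.0, 1.0])
tracks = to_pixel_tracks(trajectory)
```

In this sketch the projected tracks play the role of the spatiotemporal control signal; the generative model that synthesizes the environment's reaction is outside its scope.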

Keywords: Embodied AI, 4D World Modeling, Robot-World Interaction, Spatiotemporal Simulation, Generative Simulator, Kinematic Control, Robo4D-200k Dataset, Zero-shot Transfer

204. ❌ HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

Authors: Md Jahidul Islam Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16653v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

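Rationale: The paper focuses on adapter design for vision-language models (VLMs), specifically parameter-efficient fine-tuning (PEFT) of CLIP. It proposes HeBA (Heterogeneous Bottleneck Adapter), which improves adapter design through modality-specific structural inductive biases (2D depthwise-separable convolutions for vision, dense linear projections for text), bottleneck regularization, and active gradient initialization. This is highly relevant to the keyword "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10 points), because the paper's core is a parameter-efficient fine-tuning technique. However, the paper does not address large language models (LLMs), MoE, small language models, scaling laws, pre-training/post-training, alignment, RLHF, RAG, context extension, inference acceleration, hallucination mitigation, explainable AI, world models, model merging, in-context learning, or AI for science, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the performance limits that arise when vision-language models such as CLIP are adapted to downstream tasks with the visual and textual modalities processed uniformly. It proposes the Heterogeneous Bottleneck Adapter (HeBA), which achieves state-of-the-art performance on 11 few-shot benchmarks.

Abstract translation (Chinese)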

将如CLIP等大规模视觉-语言模型（VLMs）适配至下游任务时，常受限于“一刀切”的架构方法，即视觉与文本标记均通过宽泛的通用适配器进行统一处理。我们认为，这种同质化处理忽视了不同模态的独特结构特性——图像的空间局部性与文本的语义密集性。为解决这一问题，我们提出HeBA（异构瓶颈适配器），这是一个统一的架构框架，引入了针对模态的结构化归纳偏置。HeBA通过三项关键架构创新，突破了传统设计：（1）异构性：它通过二维深度可分离卷积处理视觉标记以保留空间关联性，同时通过密集线性投影处理文本标记以捕捉语义关系；（2）瓶颈正则化：与标准的扩展型适配器不同，HeBA采用压缩瓶颈结构（D -> D/4），显式地迫使模型学习紧凑且鲁棒的特征，并作为结构化正则器；（3）主动梯度初始化：我们挑战了限制性的零初始化范式，采用Kaiming初始化策略，确保足够的初始梯度流动以加速收敛，同时不损害冻结主干网络的预训练知识。大量实验表明，HeBA在架构上的专业化设计实现了更优的稳定性和准确性，在11个少样本基准测试中创造了新的性能记录。代码发布于https://github.com/Jahid12012021/VLM-HeBA。

Abstract

Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a “one-size-fits-all” architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities – spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone’s pre-trained knowledge. Extensive experiments demonstrate that HeBA’s architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
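
A rough parameter count makes the two structural choices in the abstract more concrete: the compression bottleneck (D -> D/4) and the depthwise-separable convolution. The sketch below is back-of-the-envelope arithmetic under stated assumptions: bias-free layers, an illustrative width D = 768, a 3x3 kernel, and a hypothetical D -> 4D "expanding" adapter as the comparison point. None of these values are taken from the paper.

```python
def linear_adapter_params(d_model, d_hidden):
    """Weights in a down-projection plus an up-projection (biases ignored)."""
    return d_model * d_hidden + d_hidden * d_model


def dense_conv_params(channels_in, channels_out, kernel=3):
    """Weights in a standard 2-D convolution (biases ignored)."""
    return channels_in * channels_out * kernel * kernel


def depthwise_separable_params(channels_in, channels_out, kernel=3):
    """Depthwise k x k filter per channel, then a 1 x 1 pointwise mix."""
    depthwise = channels_in * kernel * kernel
    pointwise = channels_in * channels_out
    return depthwise + pointwise


D = 768  # illustrative width, assumed for this example
bottleneck = linear_adapter_params(D, D // 4)  # compression, D -> D/4 -> D
expanding = linear_adapter_params(D, 4 * D)    # hypothetical D -> 4D -> D baseline
separable = depthwise_separable_params(D // 4, D // 4)
dense = dense_conv_params(D // 4, D // 4)
```

Under these assumptions the bottleneck adapter has 16 times fewer projection weights than the expanding one, and the depthwise-separable convolution in the bottleneck is several times smaller than a dense convolution of the same shape.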

Keywords: Vision-Language Models, Parameter-efficient Fine-tuning, Adapter Design, Heterogeneous Bottleneck Adapter, Few-shot Learning, CLIP Adaptation, Modality-specific Processing, Bottleneck Regularization

205. ❌ Efficient Brood Cell Detection in Layer Trap Nests for Bees and Wasps: Balancing Labeling Effort and Species Coverage

Authors: Chenchang Liu, Felix Fornoff, Annika Grasreiner, Patrick Maeder, Henri Greil, Marco Seeland Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16652v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

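Rationale: The paper focuses on using deep learning to detect and classify brood cells in bee and wasp nests, an application of computer vision to ecology. Its core contribution is a new loss function (CFPL) for handling incomplete annotations and class imbalance. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is somewhat related (5 points), because the paper applies AI (specifically deep learning) to biodiversity monitoring, a scientific field. All other keywords relate directly to large language models (LLMs), model training techniques, inference optimization, agents, and other frontier large-model techniques, none of which the paper touches, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The study proposes a method based on deep learning and a novel Constrained False Positive Loss (CFPL) for efficiently detecting and classifying bee and wasp brood cells in layer trap nests. It reduces labeling effort while mitigating data imbalance and improving model performance.

Abstract translation (Chinese)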

监测洞巢野生蜜蜂与胡蜂对生物多样性研究与保护至关重要。层板陷阱巢正成为研究此类昆虫丰度与物种丰富度的有效工具，可揭示其筑巢活动与生态需求。然而，人工评估层板陷阱巢以检测和分类育幼巢室耗时耗力。为此，我们提出一种基于深度学习的方法，用于高效检测和分类层板陷阱巢中的育幼巢室。层板陷阱巢因育幼巢室密集分布而带来额外挑战，导致每张图像需耗费大量标注精力。此外，我们观察到类别分布存在显著不平衡，常见物种的出现次数明显多于稀有物种。对常见物种进行全面标注耗时且会加剧数据不平衡，而部分标注则会导致数据不完整，从而降低模型性能。为减少标注工作量并缓解未标注数据的影响，我们提出一种新颖的约束假阳性损失策略。该策略动态屏蔽来自未标注数据的预测，防止其在训练过程中干扰分类损失。我们在一个季节内收集的712张层板陷阱巢图像数据集上评估了该方法，数据集涵盖28个细粒度类别，描述了育幼巢室的分类学特征与状态。为最小化标注工作量，我们将训练集每类标签数量上限设为300个。实验结果表明，深度学习可有效用于检测层板陷阱巢中的育幼巢室。我们的约束假阳性损失方法进一步提升了性能，在平衡模型精度与标注工作量的同时，也缓解了类别不平衡问题。

Abstract

Monitoring cavity-nesting wild bees and wasps is vital for biodiversity research and conservation. Layer trap nests (LTNs) are emerging as a valuable tool to study the abundance and species richness of these insects, offering insights into their nesting activities and ecological needs. However, manually evaluating LTNs to detect and classify brood cells is labor-intensive and time-consuming. To address this, we propose a deep learning based approach for efficient brood cell detection and classification in LTNs. LTNs present additional challenges due to densely packed brood cells, leading to a high labeling effort per image. Moreover, we observe a significant imbalance in class distribution, with common species having notably more occurrences than rare species. Comprehensive labeling of common species is time-consuming and exacerbates data imbalance, while partial labeling introduces data incompleteness which degrades model performance. To reduce labeling effort and mitigate the impact of unlabeled data, we introduce a novel Constrained False Positive Loss (CFPL) strategy. CFPL dynamically masks predictions from unlabeled data, preventing them from interfering with the classification loss during training. We evaluate our approach on a dataset of 712 LTN images collected over one season, covering 28 fine-grained classes describing the taxonomy and status of brood cells. To minimize labeling effort, we limit the training set to a maximum of 300 labels per class. Experimental results demonstrate that deep learning can be effectively used to detect brood cells in LTNs. Our CFPL method further improves performance and balances model accuracy and labeling effort while also mitigating class imbalance.
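
The masking idea behind CFPL can be illustrated with a minimal sketch. The abstract states only that predictions from unlabeled data are dynamically masked out of the classification loss; the concrete rule used here (drop unmatched predictions whose confidence reaches a threshold, since they may be unlabeled brood cells rather than true false positives) is an assumption for illustration, as are the function name and the binary cross-entropy form.

```python
import math


def masked_classification_loss(predictions, mask_threshold=0.5):
    """Mean binary cross-entropy over predictions that are not masked.

    predictions: list of (probability, matched) pairs, where `matched`
    says whether the prediction was assigned to a labeled object.
    Assumed rule: an unmatched prediction with probability at or above
    `mask_threshold` is treated as a possibly unlabeled object and skipped.
    """
    total, kept = 0.0, 0
    for prob, matched in predictions:
        if not matched and prob >= mask_threshold:
            continue  # masked: contributes nothing to the loss
        target = 1.0 if matched else 0.0
        p = min(max(prob, 1e-7), 1.0 - 1e-7)  # clamp for numerical safety
        total += -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
        kept += 1
    return total / kept if kept else 0.0


batch = [(0.9, True), (0.8, False), (0.1, False)]
with_mask = masked_classification_loss(batch)
# A threshold above 1.0 disables masking, so every unmatched prediction is penalized.
without_mask = masked_classification_loss(batch, mask_threshold=1.1)
```

In the toy batch, the confident unmatched prediction dominates the unmasked loss. Masking it removes that penalty, which is the interference the abstract says CFPL is meant to prevent.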

Keywords: deep learning, brood cell detection, layer trap nests, constrained false positive loss, labeling effort, class imbalance, biodiversity monitoring, wild bees and wasps

206. ❌ BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection

Authors: Melissa Schween, Mathis Kruse, Bodo Rosenhahn Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16645v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

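Rationale: The paper proposes the BUSSARD model, which uses a language model to embed object and relationship tokens from scene graphs in order to leverage real-world semantic knowledge. This is an application of large models to computer vision and scene understanding, so relevance to "Large Language Models" is 5 points. The paper studies anomalous relationship detection in scene graphs, an application of AI to a specific domain (visual scene analysis), which has some connection to "AI for Science", so relevance is 5 points. The other keywords mainly concern large-model technical principles, training methods, inference optimization, alignment, and agent systems, none of which the paper addresses, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The paper proposes BUSSARD, a normalizing flow-based model for detecting anomalous relationships in scene graphs generated from images. On the SARD dataset it improves AUROC by about 10% over the current state-of-the-art method while running five times faster.

Abstract translation (Chinese)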

我们提出双向通用场景特异性异常关系检测模型（Bijective Universal Scene-Specific Anomalous Relationship Detection, BUSSARD），这是一种基于标准化流的模型，用于检测从图像生成的场景图（scene graphs）中的异常关系。本研究采用多模态方法，通过语言模型嵌入场景图中的物体与关系标记，以利用现实世界的语义知识。我们使用标准化流模型学习双向变换，将场景图中的物体-关系-物体三元组映射至简单的基础分布（通常为高斯分布），从而通过似然估计实现异常检测。我们在包含办公室和餐厅场景的SARD数据集上评估了本方法。与当前最优模型相比，我们的方法在AUROC指标上提升了约10%，同时检测速度提高了五倍。通过消融实验，我们证明了模型具有卓越的鲁棒性和泛化能力，特别是在处理同义词方面：基线模型性能出现17.5%的波动时，我们的模型仍保持稳定性能。这项工作展示了基于学习的方法在场景图关系异常检测中的强大潜力。代码发布于 https://github.com/mschween/BUSSARD。

Abstract

We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs, generated from images. Our work follows a multimodal approach, embedding object and relationship tokens from scene graphs with a language model to leverage semantic knowledge from the real world. A normalizing flow model is used to learn bijective transformations that map object-relation-object triplets from scene graphs to a simple base distribution (typically Gaussian), allowing anomaly detection through likelihood estimation. We evaluate our approach on the SARD dataset containing office and dining room scenes. Our method achieves around 10% better AUROC results compared to the current state-of-the-art model, while simultaneously being five times faster. Through ablation studies, we demonstrate superior robustness and universality, particularly regarding the use of synonyms, with our model maintaining stable performance while the baseline shows 17.5% deviation. This work demonstrates the strong potential of learning-based methods for relationship anomaly detection in scene graphs. Our code is available at https://github.com/mschween/BUSSARD .
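
Likelihood-based anomaly detection with a flow can be shown with the simplest possible bijection. The sketch below uses a single elementwise affine map to a standard Gaussian and scores inputs by negative log-likelihood through the change-of-variables formula. It is a toy under stated assumptions: the 2-D vectors stand in for language-model embeddings of object-relation-object triplets, and the one-layer affine flow stands in for the deeper flow a real model would use. All names are hypothetical.

```python
import math


def fit_affine_flow(samples):
    """Fit z = (x - shift) / scale per dimension from normal samples."""
    n, dims = len(samples), len(samples[0])
    shift = [sum(s[d] for s in samples) / n for d in range(dims)]
    scale = [
        max(math.sqrt(sum((s[d] - shift[d]) ** 2 for s in samples) / n), 1e-6)
        for d in range(dims)
    ]
    return shift, scale


def log_likelihood(x, shift, scale):
    """log p(x) = log N(z; 0, 1) + log|dz/dx|, with dz/dx = 1 / scale."""
    total = 0.0
    for xi, mu, s in zip(x, shift, scale):
        z = (xi - mu) / s
        total += -0.5 * z * z - 0.5 * math.log(2.0 * math.pi) - math.log(s)
    return total


def anomaly_score(x, shift, scale):
    """Higher score means lower likelihood under the fitted flow."""
    return -log_likelihood(x, shift, scale)


normal = [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0), (0.5, -0.5), (-0.5, 0.5)]
shift, scale = fit_affine_flow(normal)
typical = anomaly_score((0.2, 0.1), shift, scale)
unusual = anomaly_score((5.0, 5.0), shift, scale)
```

Inputs far from the training data map to the tails of the base distribution and receive a higher score, so a threshold on `anomaly_score` separates typical from unusual inputs.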

Keywords: normalizing flows, anomaly detection, scene graphs, language model embeddings, multimodal approach, object-relation-object triplets, likelihood estimation, SARD dataset

207. ❌ FlowComposer: Composable Flows for Compositional Zero-Shot Learning

Authors: Zhenqi He, Lin Li, Long Chen Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16641v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

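Rationale: The paper studies compositional zero-shot learning (CZSL) using vision-language models (VLMs) and parameter-efficient fine-tuning (PEFT). It is therefore highly relevant only to the keyword "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10 points), since the abstract explicitly mentions "parameter-efficient fine-tuning (PEFT)" and "PEFT-based designs". The other keywords concern large-model technical principles (such as LLMs, MoE, scaling laws), training methods (such as RLHF, instruction tuning), application areas (such as AI for Science), or specific techniques (such as RAG, quantization). None of them is mentioned in or related to the paper, so they score 0.

!!! tip deepseek-chat TL;DR

Existing parameter-efficient fine-tuning methods for compositional zero-shot learning suffer from implicit composition construction and feature entanglement. The paper proposes FlowComposer, a framework that uses flow matching and a learnable Composer to explicitly fuse attribute and object features, and it significantly improves model performance on multiple benchmarks.

Abstract translation (Chinese)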

组合零样本学习（CZSL）旨在通过重组从已见属性-对象对中学到的基本元素来识别未见过的组合。近期基于视觉-语言模型（VLM）的CZSL方法通常采用参数高效微调（PEFT），通过视觉解耦器进行特征分解，并利用词元级提示或前缀编码来表征组合。然而，此类基于PEFT的设计存在两个根本性局限：（1）隐式组合构建，即组合仅通过词元拼接或分支提示调优实现，而未在嵌入空间中进行显式操作；（2）残留特征纠缠，即不完善的解耦导致属性、对象与组合特征相互污染。这些问题共同限制了当前CZSL模型的泛化能力。本文首次系统研究流匹配在CZSL中的应用，提出FlowComposer——一个模型无关的框架。该框架学习两个基本流，将视觉特征分别导向属性和对象的文本嵌入，并通过可学习的组合器显式融合其速度场以生成组合流。为利用不可避免的残留纠缠，我们进一步设计泄漏引导增强方案，将泄漏特征复用为辅助信号。我们在三个公开CZSL基准上对FlowComposer进行综合评估，将其作为即插即用模块集成到多种基线模型中，均取得显著性能提升。

Abstract

Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT). They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions. However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models. In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.
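
The idea of fusing two velocity fields into a composition flow can be sketched with linear-path flow matching, where the target velocity between a source point and a target point is simply their difference. In this toy version the Composer is a fixed weighted sum and the features are 2-D vectors; in the paper the Composer is learnable and the targets are text embeddings, so the weights, names, and dimensions below are all assumptions.

```python
def target_velocity(x0, x1):
    """Velocity of the straight path x_t = (1 - t) * x0 + t * x1."""
    return [b - a for a, b in zip(x0, x1)]


def compose(v_attr, v_obj, w_attr=0.5, w_obj=0.5):
    """Fuse two velocity fields (fixed weights stand in for a learned Composer)."""
    return [w_attr * a + w_obj * o for a, o in zip(v_attr, v_obj)]


def euler_transport(x0, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


visual = [0.0, 0.0]      # toy visual feature
attr_text = [2.0, 0.0]   # toy attribute embedding
obj_text = [0.0, 2.0]    # toy object embedding

v_attr = target_velocity(visual, attr_text)
v_obj = target_velocity(visual, obj_text)
v_comp = compose(v_attr, v_obj)
composed = euler_transport(visual, lambda x, t: v_comp)
```

With equal weights, the composed flow carries the visual feature to a point midway between the attribute and object targets. This makes the composition an explicit operation in the embedding space rather than a by-product of token concatenation.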

Keywords: Compositional Zero-Shot Learning, CZSL, Vision-Language Models, Parameter-efficient Fine-tuning, PEFT, Flow Matching, Feature Disentanglement, Composition Flow

208. ❌ TCATSeg: A Tooth Center-Wise Attention Network for 3D Dental Model Semantic Segmentation

Authors: Qiang He, Wentian Qu, Jiajia Dai, Changsong Lei, Shaofeng Wang, Feifei Zuo, Yajie Wang, Yaqian Liang, Xiaoming Deng, Cuixia Ma, Yong-Jin Liu, Hongan Wang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16620v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

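Rationale: TCATSeg focuses on semantic segmentation of 3D dental models. It proposes a new framework that combines local geometric features with global semantic context and creates a new dataset of 400 models. The keywords all relate to large models, deep learning techniques, or related application areas, but the paper does not involve any large-model technique (such as LLMs, MoE, training methods, inference optimization, or agent systems), nor does it mention specific applications in bioinformatics or cheminformatics. The only marginally related keyword is "AI for Science OR Bioinformatics OR Cheminformatics": dentistry belongs to medical science, so applying AI to dentistry can be seen as a subfield of AI for Science. However, the paper does not use these terms explicitly and its focus is computer vision rather than large models, so it receives 5 points (some relevance). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

Complex tooth arrangements and similar tooth shapes limit the accuracy of semantic segmentation of 3D dental models. The paper proposes TCATSeg, a new framework that combines local geometric features with global semantic context, and introduces a new dataset. Experiments show that it outperforms existing methods.

Abstract translation (Chinese)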

三维牙科模型的精确语义分割对于正畸和种植牙等数字化牙科应用至关重要。然而，由于牙齿排列复杂且相邻牙齿形态相似，现有方法往往侧重于局部几何特征而忽视全局上下文信息，导致分割精度受限。为此，我们提出TCATSeg这一新颖框架，将局部几何特征与全局语义上下文相结合。我们引入一组稀疏但具有物理意义的超点来捕捉全局语义关系，从而提升分割精度。此外，我们构建了一个包含400个牙科模型（含正畸前样本）的新数据集，以评估本方法的泛化能力。大量实验表明，TCATSeg的性能优于现有先进方法。

Abstract

Accurate semantic segmentation of 3D dental models is essential for digital dentistry applications such as orthodontics and dental implants. However, due to complex tooth arrangements and similarities in shape among adjacent teeth, existing methods struggle with accurate segmentation, because they often focus on local geometry while neglecting global contextual information. To address this, we propose TCATSeg, a novel framework that combines local geometric features with global semantic context. We introduce a set of sparse yet physically meaningful superpoints to capture global semantic relationships and enhance segmentation accuracy. Additionally, we present a new dataset of 400 dental models, including pre-orthodontic samples, to evaluate the generalization of our method. Extensive experiments demonstrate that TCATSeg outperforms state-of-the-art approaches.

Keywords: 3D dental model, semantic segmentation, tooth center-wise attention, global semantic context, sparse superpoints, orthodontics, dental implants, TCATSeg

209. ❌ ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery

Authors: Weiqin Jiao, Hao Cheng, George Vosselman, Claudio Persello Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16616v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

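Rationale: The paper focuses on polygonal vectorization in computer vision. It proposes the ACPV-Net framework for generating vector maps from aerial imagery, involving semantic segmentation, geometric modeling, and topological consistency. It does not involve large language models, innovations in deep learning techniques, or AI applications in scientific fields, and it has no direct connection to any of the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper proposes the ACPV-Net framework, which generates a complete vector map for all land-cover classes from aerial imagery in a single run, producing polygons with shared boundaries and without gaps or overlaps. It outperforms existing methods on the Deventer-512 and WHU-Building datasets.

Abstract translation (Chinese)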

我们致力于解决从航空影像单次生成完整矢量地图表示的问题：为所有土地覆盖类别生成具有共享边界、无缝隙且无重叠的多边形。现有多边形化方法通常针对单一类别；通过逐类别运行将其扩展至多类别时，常导致拓扑不一致问题，如重复边缘、缝隙和重叠。我们将这一新任务形式化为全类别多边形矢量化（All-Class Polygonal Vectorization, ACPV），并发布了首个公开基准数据集Deventer-512，其中包含标准化评估指标，可联合评估语义保真度、几何精度、顶点效率、类别级拓扑保真度及全局拓扑一致性。为实现ACPV，我们提出了ACPV-Net这一统一框架，引入了一种新颖的语义监督条件化（Semantically Supervised Conditioning, SSC）机制，将语义感知与几何基元生成相耦合，并通过设计中的拓扑重建强制实现共享边缘一致性。在强制执行此类严格拓扑约束的同时，ACPV-Net在Deventer-512数据集上所有类别的多边形质量均超越了所有单一类别基线方法。该框架无需任何架构修改即可应用于单类别多边形矢量化，在WHU-Building数据集上取得了当前最佳报告结果。数据、代码与模型将通过以下地址发布：https://github.com/HeinzJiao/ACPV-Net。

Abstract

We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and without gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the first public benchmark, Deventer-512, with standardized metrics jointly evaluating semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity and global topological consistency. To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that enforces shared-edge consistency by design. While enforcing such strict topological constraints, ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512. It also applies to single-class polygonal vectorization without any architectural modification, achieving the best-reported results on WHU-Building. Data, code, and models will be released at: https://github.com/HeinzJiao/ACPV-Net.

Keywords: Polygonal Vectorization, Vector Map Generation, Aerial Imagery, Semantic Segmentation, Topological Consistency, ACPV-Net, Land-cover Classification, Geometric Modeling

210. ❌ Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

Authors: Weijie Qiu, Dai Guan, Junxin Wang, Zhihang Li, Yongbo Gai, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16600v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

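Scoring rationale: The paper studies generative reward models (GRMs) for vision-language models (VLMs) and uses reinforcement learning (RL) to improve the quality of the intermediate rubric. The core relevant keywords are: RLHF (weight 1.0, relevance 10), since the paper uses RL to optimize the reward model and this is its core method; Post-training/SFT (weight 1.0, relevance 8), since Proxy-SFT serves as a proxy verifier; Instruction Tuning/Alignment (weight 1.0, relevance 8), since the work concerns reward-model alignment; and Large Language Models (weight 1.0, relevance 8), since the paper builds on LLM/VLM techniques. Other keywords, such as MoE, SLMs, Scaling Laws, and PEFT, are unrelated to the paper and receive 0.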
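
The score and threshold in this entry can be reproduced with a small sketch. The pass threshold follows the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The per-keyword formula is an assumption, since the report states only the per-keyword maximum of 15: a keyword's score is taken here to be weight × (relevance / 10) × 15. All function names are illustrative.

```python
# Values taken from the report's configuration section.
BASE_PASS_SCORE = 5.0
PASS_SCORE_PER_WEIGHT = 0.8
TOTAL_WEIGHT = 27.0
MAX_SCORE_PER_KEYWORD = 15.0


def passing_score(total_weight: float = TOTAL_WEIGHT) -> float:
    """Pass threshold from the report: 5.0 + 0.8 * total weight."""
    return BASE_PASS_SCORE + PASS_SCORE_PER_WEIGHT * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMPTION: score = weight * (relevance / 10) * per-keyword maximum."""
    return weight * (relevance / 10.0) * MAX_SCORE_PER_KEYWORD


def total_score(rows) -> float:
    """Sum keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(weight, relevance) for weight, relevance in rows)


# The four non-zero relevance rows of this entry, with the weights shown in its table.
rows = [(0.0, 8.0), (0.0, 8.0), (0.0, 8.0), (0.0, 10.0)]
print(total_score(rows), "/", round(passing_score(), 1))
```

Under this assumed formula, a weight of 0.0 on every row gives a total of 0.0 regardless of relevance, which matches the 0.0 / 26.6 shown for this paper.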

!!! tip deepseek-chat TL;DR

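The paper proposes Proxy-GRM, which uses proxy-guided rubric verification to optimize generative reward models for vision-language models; it reaches state-of-the-art performance on several benchmarks, and the learned rubrics transfer to unseen evaluators.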

Abstract

Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy’s prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming the methods trained on four times the data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.
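
A minimal sketch of the rubric-quality reward described in the abstract: a proxy receives a candidate rubric with the query and the preference pair, predicts which response is preferred, and the reward is its accuracy. Everything here is hypothetical, including the `Proxy` signature, `toy_proxy` (a word-overlap stand-in for a trained Proxy-SFT or Proxy-RL model), and the example strings; it is not the authors' implementation.

```python
from typing import Callable, Sequence

# Hypothetical proxy interface: (rubric, query, response_a, response_b) -> "A" or "B".
Proxy = Callable[[str, str, str, str], str]


def rubric_reward(proxy: Proxy, rubric: str, query: str,
                  response_a: str, response_b: str, preferred: str) -> float:
    """Return 1.0 if the proxy recovers the labeled preference, else 0.0."""
    return 1.0 if proxy(rubric, query, response_a, response_b) == preferred else 0.0


def mean_rubric_reward(proxy: Proxy, rubrics: Sequence[str], query: str,
                       response_a: str, response_b: str, preferred: str) -> float:
    """Average reward over several candidate rubrics (the proxy's accuracy)."""
    rewards = [rubric_reward(proxy, rubric, query, response_a, response_b, preferred)
               for rubric in rubrics]
    return sum(rewards) / len(rewards)


def toy_proxy(rubric: str, query: str, response_a: str, response_b: str) -> str:
    """Stand-in proxy: prefer the response sharing more words with the rubric (ties go to A)."""
    rubric_words = set(rubric.lower().split())
    overlap_a = len(rubric_words & set(response_a.lower().split()))
    overlap_b = len(rubric_words & set(response_b.lower().split()))
    return "A" if overlap_a >= overlap_b else "B"


query = "What is in the image?"
response_a = "A dog runs"
response_b = "A red car is parked"
specific = "mentions the red car"   # rubric that discriminates between the responses
vague = "be helpful"                # rubric that carries no usable evidence

print(rubric_reward(toy_proxy, specific, query, response_a, response_b, "B"))
print(rubric_reward(toy_proxy, vague, query, response_a, response_b, "B"))
```

In this toy setup the specific rubric lets the proxy recover the labeled preference and the vague one does not, which is the kind of signal the abstract says is used to reward rubrics that are informative on their own.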

Keywords: Generative Reward Models, Vision-Language Models, Rubric Optimization, Reinforcement Learning, Proxy-Guided Verification, Transferable Rubrics, Preference Learning, Multimodal Evaluation

211. ❌ On the Transfer of Collinearity to Computer Vision

Authors: Frederik Beuth, Danny Kowerko Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16592v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

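Scoring rationale: The paper applies the collinearity principle to computer vision and combines it with deep learning for defect detection. All keywords focus on large language models (LLMs) and related techniques (such as MoE, RLHF, and RAG), whereas this paper involves no language models, natural language processing, or large-model techniques and uses only conventional deep learning for vision tasks, so every keyword has a relevance of 0.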

!!! tip deepseek-chat TL;DR

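The paper brings the collinearity principle from human vision into computer vision, develops a prototype model, and tests it in four use cases. Collinearity improves defect detection on wafers and nanomaterials (error rates fall from 6.5% to 5.26% and from 21.65% to 6.64%) but offers little benefit on ImageNet, which suggests it suits images of man-made, line-based structures.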

Abstract

Collinearity is a visual perception phenomenon in the human brain that amplifies spatially aligned edges arranged along a straight line. However, it is vague for which purpose humans might have this principle in the real-world, and its utilization in computer vision and engineering applications even is a largely unexplored field. In this work, our goal is to transfer the collinearity principle to computer vision, and we explore the potential usages of this novel principle for computer vision applications. We developed a prototype model to exemplify the principle, then tested it systematically, and benchmarked it in the context of four use cases. Our cases are selected to spawn a broad range of potential applications and scenarios: sketching the combination of collinearity with deep learning (case I and II), using collinearity with saliency models (case II), and as a feature detector (case I). In the first use case, we found that collinearity is able to improve the fault detection of wafers and obtain a performance increase by a factor 1.24 via collinearity (decrease of the error rate from 6.5% to 5.26%). In the second use case, we test the defect recognition in nanotechnology materials and achieve a performance increase by 3.2x via collinearity (deep learning, error from 21.65% to 6.64%), and also explore saliency models. As third experiment, we cover occlusions; while as fourth experiment, we test ImageNet and observe that it might not be very beneficial for ImageNet. Therefore, we can assemble a list of scenarios for which collinearity is beneficial (wafers, nanotechnology, occlusions), and for what is not beneficial (ImageNet). Hence, we infer collinearity might be suitable for industry applications as it helps if the image structures of interest are man-made because they often consist of lines. Our work provides another tool for CV, hope to capture the power of human processing.
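
The improvement factors quoted in the abstract can be checked with a short calculation. This sketch assumes that "performance increase by a factor" means the ratio of the error rate before to the error rate after; the function names are illustrative.

```python
def improvement_factor(error_before: float, error_after: float) -> float:
    """Ratio of error rates (ASSUMED meaning of the abstract's 'factor')."""
    return error_before / error_after


def relative_reduction(error_before: float, error_after: float) -> float:
    """Fraction of the original error that was removed."""
    return (error_before - error_after) / error_before


# Error rates in percent, as reported in the abstract.
wafer = improvement_factor(6.5, 5.26)    # about 1.24
nano = improvement_factor(21.65, 6.64)   # about 3.26

print(f"wafer: x{wafer:.2f}, relative reduction {relative_reduction(6.5, 5.26):.1%}")
print(f"nano:  x{nano:.2f}, relative reduction {relative_reduction(21.65, 6.64):.1%}")
```

Under this reading the wafer case matches the stated factor of 1.24, and the nanotechnology case comes out at roughly 3.26, which the abstract reports as 3.2x.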

Keywords: collinearity, computer vision, deep learning, defect detection, wafer inspection, nanotechnology, saliency models, feature detector

212. ❌ HistoAtlas: A Pan-Cancer Morphology Atlas Linking Histomics to Molecular Programs and Clinical Outcomes

Authors: Pierre-Antoine Bannier Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16587v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

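Scoring rationale: HistoAtlas focuses on the analysis of cancer histopathology images, using a computational atlas to link tissue morphology features to molecular programs and clinical outcomes. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is somewhat related (relevance 5), because the study belongs to bioinformatics/computational pathology and is an application of AI in science, specifically biomedicine. The other keywords concern large models and the principles of deep learning techniques (such as LLM training, inference optimization, and agent systems), which the paper neither uses nor discusses, so they receive 0.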

!!! tip deepseek-chat TL;DR

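The study builds HistoAtlas, a pan-cancer computational atlas that extracts interpretable histomic features from H&E pathology slides and systematically links them to survival, gene expression, mutations, and immune subtypes, enabling large-scale biomarker discovery from routine pathology images.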

Abstract

We present HistoAtlas, a pan-cancer computational atlas that extracts 38 interpretable histomic features from 6,745 diagnostic H&E slides across 21 TCGA cancer types and systematically links every feature to survival, gene expression, somatic mutations, and immune subtypes. All associations are covariate-adjusted, multiple-testing corrected, and classified into evidence-strength tiers. The atlas recovers known biology, from immune infiltration and prognosis to proliferation and kinase signaling, while uncovering compartment-specific immune signals and morphological subtypes with divergent outcomes. Every result is spatially traceable to tissue compartments and individual cells, statistically calibrated, and openly queryable. HistoAtlas enables systematic, large-scale biomarker discovery from routine H&E without specialized staining or sequencing. Data and an interactive web atlas are freely available at https://histoatlas.com .
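
The abstract says associations are multiple-testing corrected but does not name the procedure. As an illustration only, the sketch below shows the Benjamini-Hochberg false discovery rate adjustment, a common choice for this kind of large association screen; whether HistoAtlas uses it is an assumption, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four feature-outcome associations.
raw = [0.01, 0.04, 0.03, 0.5]
print(benjamini_hochberg(raw))
```

An association would then be kept when its adjusted value falls below the chosen false discovery rate, for example 0.05.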

Keywords: HistoAtlas, pan-cancer, histomic features, computational atlas, H&E slides, biomarker discovery, TCGA, immune subtypes

213. ❌ VideoMatGen: PBR Materials through Joint Generative Modeling

Authors: Jon Hasselgren, Zheng Zeng, Milos Hasan, Jacob Munkberg Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16566v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

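Scoring rationale: The paper generates physically based materials for 3D shapes with a video diffusion transformer architecture. It belongs to computer vision and graphics and has no direct connection to any of the scoring keywords, all of which focus on large language models, the principles of deep learning techniques, and their applications in science. The paper does not address LLMs, MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, inference acceleration, KV cache compression, chain of thought, System 2 thinking, MCTS, self-correction, agents, tool use, multi-agent systems, quantization, speculative decoding, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for Science.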

!!! tip deepseek-chat TL;DR

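The paper proposes a method based on a video diffusion transformer architecture that generates physically plausible materials for 3D shapes by jointly modeling multiple material properties, and it introduces a custom variational auto-encoder that encodes the material modalities into a compact latent space.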

Abstract

We present a method for generating physically-based materials for 3D shapes based on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallicity, height map) to form physically plausible materials. We further introduce a custom variational auto-encoder which encodes multiple material modalities into a compact latent space, which enables joint generation of multiple modalities without increasing the number of tokens. Our pipeline generates high-quality materials for 3D shapes given a text prompt, compatible with common content creation tools.

Keywords: Video Diffusion Transformer, Physically-based Materials, 3D Shapes, Joint Generative Modeling, Variational Auto-encoder, Material Properties, Text-to-Material, Content Creation

214. ❌ Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration

Authors: Amirhossein Kazerouni, Maitreya Suin, Tristan Aumentado-Armstrong, Sina Honari, Amanpreet Walia, Iqbal Mohomed, Konstantinos G. Derpanis, Babak Taati, Alex Levinshtein Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16570v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

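Scoring rationale: The paper studies image restoration in computer vision, specifically face restoration and full-scene restoration, and uses a diffusion model for image generation. All scoring keywords relate to large language models (LLMs), innovations in the principles of deep learning techniques, or applications of AI in science, whereas this paper focuses on image processing in computer vision and involves no LLM techniques, no innovations in deep learning principles, and no applications of AI in scientific fields such as biomedicine. Every keyword therefore has a relevance of 0.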

!!! tip deepseek-chat TL;DR

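The paper proposes Face2Scene, a two-stage image restoration framework that uses the face as a perceptual oracle to estimate degradation and guide restoration of the whole image, including body and background; experiments show that it outperforms existing methods.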

Abstract

Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose Face2Scene, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.

Keywords: image restoration, face restoration, diffusion model, degradation estimation, full-scene restoration, reference-based restoration, perceptual oracle, degradation-aware tokens

215. ❌ Understanding Cell Fate Decisions with Temporal Attention

Authors: Florian Bürger, Martim Dias Gomes, Adrián E. Granada, Noémie Moreau, Katarzyna Bozek Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16562v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

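Scoring rationale: The paper uses a Transformer model to predict cell fate, an application of deep learning in biomedicine that is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 10). The paper mentions an explainability framework, which is somewhat related to "Mechanistic Interpretability OR Explainable AI" (relevance 5). The other keywords mainly concern LLM-specific techniques, training methods, inference optimization, agent systems, and the like, whereas this paper applies a Transformer from computer vision to the analysis of cell image sequences and does not involve LLMs, MoE, scaling laws, training alignment, inference acceleration, or agents, so they receive 0.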

!!! tip deepseek-chat TL;DR

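The study develops a Transformer-based deep learning model that predicts cell fate directly from long-term live-cell imaging sequences of cancer cells under chemotherapeutic treatment, reaching a balanced accuracy of 0.94 and an F1-score of 0.93, and uses attention mechanisms to provide interpretable biological insights into the non-genetic determinants of cellular decision-making.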

Abstract

Understanding non-genetic determinants of cell fate is critical for developing and improving cancer therapies, as genetically identical cells can exhibit divergent outcomes under the same treatment conditions. In this work, we present a deep learning approach for cell fate prediction from raw long-term live-cell recordings of cancer cell populations under chemotherapeutic treatment. Our Transformer model is trained to predict cell fate directly from raw image sequences, without relying on predefined morphological or molecular features. Beyond classification, we introduce a comprehensive explainability framework for interpreting the temporal and morphological cues guiding the model’s predictions. We demonstrate that prediction of cell outcomes is possible based on the video only, our model achieves balanced accuracy of 0.94 and an F1-score of 0.93. Attention and masking experiments further indicate that the signal predictive of the cell fate is not uniquely located in the final frames of a cell trajectory, as reliable predictions are possible up to 10 h before the event. Our analysis reveals distinct temporal distribution of predictive information in the mitotic and apoptotic sequences, as well as the role of cell morphology and p53 signaling in determining cell outcomes. Together, these findings demonstrate that attention-based temporal models enable accurate cell fate prediction while providing biologically interpretable insights into non-genetic determinants of cellular decision-making. The code is available at https://github.com/bozeklab/Cell-Fate-Prediction.
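
For readers unfamiliar with the two metrics reported in the abstract, the sketch below computes balanced accuracy and F1-score from a confusion matrix. It assumes a binary task for simplicity, and the counts are invented for illustration; they are not the paper's data and do not reproduce its 0.94 and 0.93.

```python
def balanced_accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Mean of the recall on the positive class and the recall on the negative class."""
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return (recall_pos + recall_neg) / 2


def f1_score(tp: int, fn: int, fp: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion-matrix counts (not from the paper).
tp, fn, fp, tn = 90, 10, 5, 95

print(f"balanced accuracy: {balanced_accuracy(tp, fn, fp, tn):.3f}")
print(f"F1-score: {f1_score(tp, fn, fp):.3f}")
```

Balanced accuracy averages per-class recall, so it is not inflated when one outcome is much more frequent than the other, which is a common reason to report it for imbalanced classification tasks.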

Keywords: cell fate prediction, Transformer model, live-cell imaging, temporal attention, explainable AI, cancer therapy, deep learning, biological interpretability

216. ❌ Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models

Authors: Jiale Song, Jiaxin Luo, Xue-song Tang, Kuangrong Hao, Mingbo Zhao Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16558v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

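Scoring rationale: The paper studies object hallucinations in Large Vision-Language Models (LVLMs) and proposes Segmentation-based Attention Entropy (SAE) to detect and mitigate them. Relevance to the keywords is as follows: 1) "Large Language Models" is highly relevant (relevance 8), because LVLMs are visual extensions of large language models; 2) "Hallucination Mitigation" is highly relevant (relevance 10), as it is the paper's core research problem; 3) "Mechanistic Interpretability" is relevant (relevance 8), because the paper explains the hallucination mechanism by analyzing visual attention patterns; 4) "AI for Science" is somewhat related (relevance 5), because the paper evaluates its method in robot perception and decision-making scenarios. Other keywords, such as MoE, SFT, and RAG, are unrelated to the paper and receive 0.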

!!! tip deepseek-chat TL;DR

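The paper addresses object hallucinations in large vision-language models with Segmentation-based Attention Entropy (SAE), which quantifies visual attention uncertainty and adjusts attention patterns, reducing hallucinations without additional training cost and improving model reliability.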

Abstract

Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM-driven perception and decision-making.
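
One plausible reading of "attention entropy in an object-level semantic space" is sketched below: pool the attention weights of image patches by the segment each patch belongs to, normalize the pooled mass, and take the Shannon entropy. The abstract does not give the exact definition, so the pooling rule, the function name, and the toy inputs are all assumptions rather than the paper's formulation.

```python
import math
from collections import defaultdict


def segment_attention_entropy(attention, segment_ids):
    """Shannon entropy (in nats) of attention mass pooled per segment."""
    if len(attention) != len(segment_ids):
        raise ValueError("attention and segment_ids must have the same length")
    # Pool patch-level attention into one mass per segment.
    mass = defaultdict(float)
    for weight, segment in zip(attention, segment_ids):
        mass[segment] += weight
    total = sum(mass.values())
    if total <= 0:
        raise ValueError("attention must have positive total mass")
    entropy = 0.0
    for segment_mass in mass.values():
        p = segment_mass / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy


# Four patches, two segments (hypothetical values).
segments = [0, 0, 1, 1]
focused = [0.9, 0.1, 0.0, 0.0]      # all attention on segment 0
spread = [0.25, 0.25, 0.25, 0.25]   # attention split evenly across segments

print(segment_attention_entropy(focused, segments))
print(segment_attention_entropy(spread, segments))
```

In this toy case the focused pattern has zero entropy and the evenly spread pattern has entropy ln 2, so higher values indicate attention that is less committed to any one object.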

Keywords: Large Vision-Language Models, Object Hallucinations, Attention Patterns, Segmentation-based Attention Entropy, Hallucination Detection, Visual Attention Adjustment, Multimodal Tasks, Embodied Robotics

217. ❌ Bridging the Simulation-to-Reality Gap in Electron Microscope Calibration via VAE-EM Estimation

Authors: Jilles S. van Hulst, W. P. M. H. Heemels, Duarte J. Antunes Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16549v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

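Scoring rationale: The paper uses variational autoencoders (VAEs) and expectation maximization (EM) to address the simulation-to-reality gap in electron microscope calibration, which is an application of AI in science. It does not involve large language models (LLMs), innovations in the principles of deep learning techniques, or the specific techniques named in the scoring keywords (such as MoE, SFT, and RAG). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI (a VAE) to the calibration of a scientific instrument (an electron microscope), but it is not core bioinformatics or cheminformatics, so it receives a relevance of 5 (somewhat related). All other keywords are unrelated to the paper and receive 0.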

!!! tip deepseek-chat TL;DR

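The paper proposes a joint estimation method based on variational autoencoders and expectation maximization to address the simulation-to-reality gap in scanning transmission electron microscope calibration; on a real instrument it halves the estimation error and calibrates faster.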

Abstract

Electron microscopy has enabled many scientific breakthroughs across multiple fields. A key challenge is the tuning of microscope parameters based on images to overcome optical aberrations that deteriorate image quality. This calibration problem is challenging due to the high-dimensional and noisy nature of the diagnostic images, and the fact that optimal parameters cannot be identified from a single image. We tackle the calibration problem for Scanning Transmission Electron Microscopes (STEM) by employing variational autoencoders (VAEs), trained on simulated data, to learn low-dimensional representations of images, whereas most existing methods extract only scalar values. We then simultaneously estimate the model that maps calibration parameters to encoded representations and the optimal calibration parameters using an expectation maximization (EM) approach. This joint estimation explicitly addresses the simulation-to-reality gap inherent in data-driven methods that train on simulated data from a digital twin. We leverage the known symmetry property of the optical system to establish global identifiability of the joint estimation problem, ensuring that a unique optimum exists. We demonstrate that our approach is substantially faster and more consistent than existing methods on a real STEM, achieving a 2x reduction in estimation error while requiring fewer observations. This represents a notable advance in automated STEM calibration and demonstrates the potential of VAEs for information compression in images. Beyond microscopy, the VAE-EM framework applies to inverse problems where simulated training data introduces a reality gap and where non-injective mappings would otherwise prevent unique solutions.

Keywords: electron microscopy, calibration, variational autoencoder, expectation maximization, simulation-to-reality gap, inverse problems, STEM, VAE-EM framework

218. ❌ SAMSEM – A Generic and Scalable Approach for IC Metal Line Segmentation

Authors: Christian Gehrmann, Jonas Ricker, Simon Damm, Deruo Cheng, Julian Speith, Yiqiong Shi, Asja Fischer, Christof Paar Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16548v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

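Scoring rationale: SAMSEM adapts Segment Anything Model 2 (SAM2) to metal line segmentation in integrated circuits, which places it in computer vision and hardware security. It has no direct connection to most of the keywords in the scoring list (LLMs, MoE, Scaling Laws, RLHF, and so on), which target natural language processing or general large-model techniques. The most relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', rated 8 because the paper applies AI to a scientific task (hardware verification), although not to bioinformatics or cheminformatics. 'Post-training OR Supervised Fine-tuning OR SFT' is rated 5 because the paper fine-tunes the SAM2 model, which counts as supervised fine-tuning but is not its core focus. All other keywords are rated 0, since the paper does not address large-model principles, reasoning, alignment, compression, or related topics.

!!! tip deepseek-chat TL;DR

The paper proposes SAMSEM, which fine-tunes Segment Anything Model 2 with multi-scale segmentation and a topology-based loss to address the generalization problem in segmenting metal lines in SEM images of integrated circuits, reaching error rates as low as 0.62% across ICs from different technologies and manufacturing conditions.

Abstract translation

With hardware supply chains now globalized, assuring the trustworthiness of hardware components has attracted wide attention, especially in cryptographic applications and high-stakes scenarios. Identifying metal lines in scanning electron microscope (SEM) images of integrated circuits (ICs) is a key step in detecting malicious circuitry in chips manufactured in untrusted environments. Because manufacturing processes and technologies differ, such verification usually requires tuning parameters and algorithms for each target IC. A machine learning model trained on images of one IC often fails to identify metal lines on other ICs accurately. To address this challenge, we build SAMSEM by adapting Meta's Segment Anything Model 2 (SAM2) to IC metal line segmentation. Specifically, we develop a multi-scale segmentation approach that handles SEM images of varying sizes, resolutions, and magnifications. In addition, we add a topology-based loss to the pixel-based losses so that segmentation focuses on electrical connectivity rather than pixel-level accuracy. Following hyperparameter optimization, we fine-tune the SAM2 model and obtain a model that generalizes across technology nodes, manufacturing materials, sample preparation methods, and SEM imaging technologies. To this end, we use an unprecedented dataset of SEM images from 48 metal layers across 14 different ICs. When fine-tuned on seven ICs, SAMSEM reaches an error rate as low as 0.72% on other images from the same ICs. On the remaining seven unseen ICs, its error rate is still as low as 5.53%. Finally, when fine-tuned on all 14 ICs, we observe an error rate of 0.62%. SAMSEM therefore proves to be a reliable tool that significantly advances the state of the art in metal line segmentation, a key challenge in post-manufacturing IC verification.

Abstract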

In light of globalized hardware supply chains, the assurance of hardware components has gained significant interest, particularly in cryptographic applications and high-stakes scenarios. Identifying metal lines on scanning electron microscope (SEM) images of integrated circuits (ICs) is one essential step in verifying the absence of malicious circuitry in chips manufactured in untrusted environments. Due to varying manufacturing processes and technologies, such verification usually requires tuning parameters and algorithms for each target IC. Often, a machine learning model trained on images of one IC fails to accurately detect metal lines on other ICs. To address this challenge, we create SAMSEM by adapting Meta’s Segment Anything Model 2 (SAM2) to the domain of IC metal line segmentation. Specifically, we develop a multi-scale segmentation approach that can handle SEM images of varying sizes, resolutions, and magnifications. Furthermore, we deploy a topology-based loss alongside pixel-based losses to focus our segmentation on electrical connectivity rather than pixel-level accuracy. Based on a hyperparameter optimization, we then fine-tune the SAM2 model to obtain a model that generalizes across different technology nodes, manufacturing materials, sample preparation methods, and SEM imaging technologies. To this end, we leverage an unprecedented dataset of SEM images obtained from 48 metal layers across 14 different ICs. When fine-tuned on seven ICs, SAMSEM achieves an error rate as low as 0.72% when evaluated on other images from the same ICs. For the remaining seven unseen ICs, it still achieves error rates as low as 5.53%. Finally, when fine-tuned on all 14 ICs, we observe an error rate of 0.62%. Hence, SAMSEM proves to be a reliable tool that significantly advances the frontier in metal line segmentation, a key challenge in post-manufacturing IC verification.

Keywords: IC metal line segmentation, Segment Anything Model 2, SEM images, fine-tuning, multi-scale segmentation, topology-based loss, hardware verification, generalization across ICs

Authors: Mangyu Kong, Jaewon Lee, Seongwon Lee, Euntai Kim Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16538v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

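Scoring rationale: The paper studies pose refinement in 3D Gaussian Splatting (3DGS) and proposes a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. Its subject is 3D scene reconstruction and camera pose estimation in computer vision, which is unrelated to every scoring keyword (all of which concern large models, deep learning principles, or AI applications in science). The paper does not involve language models, model training, inference optimization, alignment, agent systems, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper addresses the high sensitivity of pose refinement in 3D Gaussian Splatting to the initial pose and the reconstructed geometry, and proposes a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization, significantly improving localization accuracy and stability in indoor and outdoor scenes.

Abstract translation

3D Gaussian Splatting (3DGS) has recently attracted wide attention as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to the initial camera pose and the reconstructed geometry. This work examines these limitations and identifies two main sources of uncertainty: (i) pose prior uncertainty, which typically arises from regression or retrieval models that output a single deterministic estimate; and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into the PnP solver. These uncertainties can distort the reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address them, we propose a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. The method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across a range of indoor and outdoor benchmarks, it consistently improves localization accuracy and significantly increases stability under pose and depth noise.

Abstract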

3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address these uncertainties, we introduce a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. Our method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across diverse indoor and outdoor benchmarks, our approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.

Keywords: 3D Gaussian Splatting, pose refinement, camera pose, geometric uncertainty, Monte Carlo sampling, Fisher Information, PnP optimization, visual localization

220. ❌ An approximate graph elicits detonation lattice

Authors: Vansh Sharma, Venkat Raman Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16524v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

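Scoring rationale: The paper studies a graph-theory-based algorithm for precise segmentation and measurement of detonation lattices, which belongs to computational physics and fluid dynamics. All keywords concern large models, deep learning principles, or AI applications, but the paper involves no large models, deep learning, AI techniques, or related methods (MoE, RLHF, RAG, and so on). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is a scientific computing application; however, it does not explicitly use AI methods and relies on conventional algorithms, so it is rated 5 (some relevance). All other keywords are entirely unrelated and rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes a training-free, graph-theory-based algorithm for precisely segmenting and measuring detonation lattices from 3D pressure traces, addressing the limitations of the manual and primitive 2D edge detection methods long used in the field.

Abstract translation

This study presents a new graph-theory-based algorithm for precise segmentation and measurement of detonation cells from 3D pressure traces (termed detonation lattices), addressing the limitations of the manual and primitive 2D edge detection methods prevalent in the field. The algorithm is training-free and uses a segmentation model to accurately extract cellular patterns, a longstanding challenge in detonation research. First, segmentation is demonstrated on generated data with a prediction error of 2%. Next, 3D simulation data is used to validate the performance of the graph-based workflow. Statistics and joint probability densities show oblong cells aligned with the wave propagation direction, with 17% deviation, while the larger dispersion in volume reflects cubic amplification of linear variability. Although the framework is robust, reliably segmenting and quantifying highly complex cellular patterns remains challenging. Nevertheless, the graph-based formulation generalizes across diverse cellular geometries, making it a practical tool for detonation analysis and a solid foundation for future extensions to triple-point collision studies.

Abstract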

This study presents a novel algorithm based on graph theory for the precise segmentation and measurement of detonation cells from 3D pressure traces, termed detonation lattices, addressing the limitations of manual and primitive 2D edge detection methods prevalent in the field. Using a segmentation model, the proposed training-free algorithm is designed to accurately extract cellular patterns, a longstanding challenge in detonations research. First, the efficacy of segmentation on generated data is shown with a prediction error of 2%. Next, 3D simulation data is used to establish performance of the graph-based workflow. The results of statistics and joint probability densities show oblong cells aligned with the wave propagation axis with 17% deviation, whereas larger dispersion in volume reflects cubic amplification of linear variability. Although the framework is robust, it remains challenging to reliably segment and quantify highly complex cellular patterns. However, the graph-based formulation generalizes across diverse cellular geometries, positioning it as a practical tool for detonation analysis and a strong foundation for future extensions in triple-point collision studies.

Keywords: detonation lattice, graph theory, segmentation algorithm, 3D pressure traces, cellular patterns, training-free, wave propagation, triple-point collision

221. ❌ VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations

Authors: Fucai Ke, Zhixi Cai, Boying Li, Long Chen, Beibei Lin, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Hamid Rezatofighi Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16506v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

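Scoring rationale: The paper studies multi-view visual reasoning and proposes the VIEW2SPACE benchmark and the Grounded Chain-of-Thought with Visual Evidence method. It is unrelated to most keywords because it focuses on visual reasoning rather than large-model techniques. It is highly relevant only to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10), since it explicitly proposes a Grounded Chain-of-Thought method, and somewhat relevant to 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (5), since it involves deep compositional reasoning. Other keywords, such as large models, training techniques, and agents, are not addressed.

!!! tip deepseek-chat TL;DR

The paper studies the challenge of multi-view visual reasoning from sparse observations and proposes the VIEW2SPACE benchmark and a Grounded Chain-of-Thought method that substantially improves model performance on moderately difficult tasks, while finding that deep compositional reasoning remains a fundamental challenge.

Abstract translation

Multi-view visual reasoning is essential for intelligent systems that must understand complex environments from sparse and discrete viewpoints, yet existing research has focused mainly on single-image or temporally dense video settings. In real-world scenarios, reasoning across views requires integrating partial observations without explicit guidance, and collecting large-scale multi-view data with accurate geometric and semantic annotations remains difficult. To close this gap, we use physically grounded simulation to build diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable data generation that remains transferable to real-world settings. On top of this engine, we introduce VIEW2SPACE, a multi-dimensional benchmark for sparse multi-view reasoning, together with a scalable, disjoint training split that supports millions of grounded question-answer pairs. A comprehensive evaluation of state-of-the-art vision-language and spatial models on this benchmark shows that multi-view reasoning remains largely unsolved, with most models performing only slightly above random guessing. We further investigate whether training can bridge the gap. Our proposed Grounded Chain-of-Thought with Visual Evidence substantially improves performance under moderate difficulty and generalizes to real-world data, outperforming existing approaches in cross-dataset evaluation. We also conduct difficulty-aware scaling analyses across model size, data scale, reasoning depth, and visibility constraints. The results indicate that geometric perception can benefit from scaling under sufficient visibility, but deep compositional reasoning across sparse views remains a fundamental challenge.

Abstract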

Multi-view visual reasoning is essential for intelligent systems that must understand complex environments from sparse and discrete viewpoints, yet existing research has largely focused on single-image or temporally dense video settings. In real-world scenarios, reasoning across views requires integrating partial observations without explicit guidance, while collecting large-scale multi-view data with accurate geometric and semantic annotations remains challenging. To address this gap, we leverage physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable data generation that remains transferable to real-world settings. Based on this engine, we introduce VIEW2SPACE, a multi-dimensional benchmark for sparse multi-view reasoning, together with a scalable, disjoint training split supporting millions of grounded question-answer pairs. Using this benchmark, a comprehensive evaluation of state-of-the-art vision-language and spatial models reveals that multi-view reasoning remains largely unsolved, with most models performing only marginally above random guessing. We further investigate whether training can bridge this gap. Our proposed Grounded Chain-of-Thought with Visual Evidence substantially improves performance under moderate difficulty, and generalizes to real-world data, outperforming existing approaches in cross-dataset evaluation. We further conduct difficulty-aware scaling analyses across model size, data scale, reasoning depth, and visibility constraints, indicating that while geometric perception can benefit from scaling under sufficient visibility, deep compositional reasoning across sparse views remains a fundamental challenge.

Keywords: multi-view visual reasoning, sparse observations, VIEW2SPACE benchmark, Grounded Chain-of-Thought, visual evidence, scaling analysis, compositional reasoning, cross-dataset evaluation

222. ❌ GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models

Authors: Jiaxin Zhang, Junjun Jiang, Haijie Li, Youyu Chen, Kui Jiang, Dave Zhenyu Chen Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16461v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

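Scoring rationale: The paper's core topic is activating geometric perception in multimodal large language models (MLLMs), a contribution to large-model principles. Highly relevant keywords: (1) 'Large Language Models' (the paper explicitly studies MLLMs; weight 1.0, relevance 10); (2) 'Pre-training' (it proposes the Geometry-Aligned Pre-training paradigm; weight 1.0, relevance 10). Moderately relevant: 'Post-training' (text-dominated fine-tuning is mentioned as a point of contrast; weight 1.0, relevance 5). The remaining keywords, such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Agents, are not addressed and are rated 0. Weighted total: 10×1.0 + 10×1.0 + 5×1.0 = 25.0. The author list does not include any designated experts.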
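
The weighted-total arithmetic in this rationale can be reproduced with a short sketch. It uses the pass-threshold formula from the report's configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each) and the three non-zero relevance values that the rationale cites with weight 1.0. The function and variable names are illustrative assumptions, not the report generator's actual code.

```python
# Illustrative sketch only: names are hypothetical, not the report generator's code.

def weighted_total(scores):
    """Sum relevance x weight over (relevance, weight) pairs."""
    return sum(relevance * weight for relevance, weight in scores)


def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass threshold from the report configuration: base + factor x total weight."""
    return base + factor * total_weight


# Non-zero entries cited in the rationale: (relevance, weight)
gap_mllm_scores = [
    (10.0, 1.0),  # Large Language Models OR LLMs OR Foundation Models
    (10.0, 1.0),  # Pre-training OR Continual Pre-training OR Domain Adaptation
    (5.0, 1.0),   # Post-training OR Supervised Fine-tuning OR SFT
]

total = weighted_total(gap_mllm_scores)
threshold = pass_threshold(total_weight=27 * 1.0)  # 27 keywords, weight 1.0 each
passed = total >= threshold

print(f"total={total:.1f} threshold={threshold:.1f} passed={passed}")
```

Under these assumptions the total (25.0) stays below the threshold (26.6), so the paper would still be marked as failing. Note that the score table for this paper lists every weight as 0.0 and reports a score of 0.0, which does not match the rationale's arithmetic.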

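!!! tip deepseek-chat TL;DR

The paper addresses the weak 3D spatial perception of multimodal large language models under pure RGB input and proposes GAP-MLLM, a geometry-aligned pre-training paradigm that uses a visual-prompted joint task and a multi-level progressive fusion module to significantly improve performance on 3D visual grounding, dense captioning, and video object detection.

Abstract translation

Multimodal large language models (MLLMs) excel at semantic reasoning but fall short in 3D spatial perception when they rely only on pure RGB input. Although existing methods use implicit geometric priors from 3D reconstruction models, image-based methods still show a notable performance gap compared with methods that use explicit 3D data. We argue that this gap stems not from insufficient geometric priors but from a mismatch in the training paradigm: text-dominated fine-tuning fails to activate the geometric representations inside MLLMs. Existing methods typically use simple feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, which leads to underuse of structural information. To overcome this limitation, we propose GAP-MLLM, a geometry-aligned pre-training paradigm that explicitly activates the model's structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that forces MLLMs to generate sparse pointmaps while predicting semantic labels, thereby strengthening geometric awareness. We also design a multi-level progressive fusion module with a token-level gating mechanism that adaptively integrates geometric priors while preserving semantic reasoning. Extensive experiments show that GAP-MLLM significantly strengthens geometric feature fusion and consistently improves performance on 3D visual grounding, 3D dense captioning, and 3D video object detection.

Abstract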

Multimodal Large Language Models (MLLMs) demonstrate exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Despite leveraging implicit geometric priors from 3D reconstruction models, image-based methods still exhibit a notable performance gap compared to methods using explicit 3D data. We argue that this gap does not arise from insufficient geometric priors, but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, leading to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that compels the MLLMs to predict sparse pointmaps alongside semantic labels, thereby enforcing geometric awareness. Furthermore, we design a multi-level progressive fusion module with a token-level gating mechanism, enabling adaptive integration of geometric priors without suppressing semantic reasoning. Extensive experiments demonstrate that GAP-MLLM significantly enhances geometric feature fusion and consistently enhances performance across 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.

Keywords: Multimodal Large Language Models, 3D Spatial Perception, Geometry-Aligned Pre-training, Visual-Prompted Joint Task, Multi-level Progressive Fusion, 3D Visual Grounding, 3D Dense Captioning, Geometric Feature Fusion

223. ❌ Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

Authors: Weiqing Li, Jinyue Guo, Yaqi Wang, Haiyang Xiao, Yuewei Zhang, Guohua Liu, Hao Henry Wang Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16455v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

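Scoring rationale: Evo-Retriever proposes a retrieval framework with LLM-guided curriculum evolution for multimodal document retrieval. Its core relevance is twofold: (1) it explicitly uses an LLM as a meta-controller to guide the evolution of the training curriculum, which is highly relevant to 'Large Language Models' (8); (2) it focuses on the retrieval component of retrieval-augmented generation (RAG), improving cross-modal retrieval through multi-view alignment and contrastive learning, which is highly relevant to 'Retrieval-Augmented Generation' (8). Other keywords, such as MoE, SFT, RLHF, and quantization, are not mentioned in the abstract or are unrelated to the paper's topic, and are rated 0.

!!! tip deepseek-chat TL;DR

The paper addresses inconsistent cross-modal embeddings in multimodal document retrieval and proposes the Evo-Retriever framework, which uses LLM-guided curriculum evolution and Viewpoint-Pathway collaboration to achieve state-of-the-art retrieval performance on the ViDoRe V2 and MMEB datasets.

Abstract translation

Visual-language models (VLMs) excel at data mappings, but the heterogeneity and unstructured nature of real-world documents disrupt the consistency of cross-modal embeddings. Recent late-interaction methods strengthen image-text alignment through multi-vector representations, yet traditional training is limited in sample count and uses static strategies that cannot adapt to the model's dynamic evolution, which causes cross-modal retrieval confusion. To address this, we propose Evo-Retriever, a retrieval framework with large language model (LLM)-guided curriculum evolution built on a novel Viewpoint-Pathway collaboration mechanism. First, we strengthen fine-grained matching through multi-view image alignment from multi-scale and multi-directional perspectives. Second, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance the supervision signal. Finally, a model-state summary produced by this collaboration is fed to an LLM meta-controller, which uses expert knowledge to adaptively adjust the training curriculum and thereby promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%, respectively.

Abstract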

Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model’s dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates “hard queries” and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model’s evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%.
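
For readers unfamiliar with the nDCG@5 metric reported above, the following is a minimal sketch of how nDCG@k is commonly computed. It assumes linear gain and a log2 position discount; the exact variant used by ViDoRe V2 and MMEB is not stated in the abstract, and the relevance labels below are made up for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain assumed)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels, in the order a retriever returned the documents.
ranked_relevances = [1, 0, 1, 0, 0, 1]
score = ndcg_at_k(ranked_relevances, k=5)
print(f"nDCG@5 = {score:.3f}")
```

A ranking that already places the most relevant documents first scores 1.0; pushing relevant documents further down the list lowers the score because each position is discounted logarithmically.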

Keywords: LLM-guided curriculum evolution, multimodal document retrieval, visual-language models, cross-modal embeddings, viewpoint-pathway collaboration, bidirectional contrastive learning, hard queries, state-of-the-art performance

224. ❌ TinyGLASS: Real-Time Self-Supervised In-Sensor Anomaly Detection

Authors: Pietro Bonazzi, Rafael Sutter, Luigi Capogrosso, Mischa Buob, Michele Magno Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16451v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

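Scoring rationale: The paper focuses on anomaly detection in computer vision, specifically lightweight model deployment on edge devices. Although it involves model compression (INT8 quantization) and edge AI deployment, the keywords all center on large language models (LLMs) and related techniques, whereas this paper studies a visual anomaly detection model (a ResNet-based convolutional neural network) with no direct connection to LLM techniques. Accordingly, 'Quantization OR Model Compression OR Low-bit Weights' is rated 5 (moderate relevance) because of the INT8 quantization, and all other keywords are rated 0 (entirely unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes TinyGLASS, a lightweight self-supervised anomaly detection framework that uses a ResNet-18 backbone and INT8 quantization to run in real time on the Sony IMX500 intelligent vision sensor, compressing parameters by 8.7x while maintaining 94.2% AUROC and reaching 20 FPS within an 8 MB memory limit.

Abstract translation

Anomaly detection plays a key role in industrial quality control, where defects must be identified accurately even though labeled defective samples are scarce. Recent self-supervised methods such as GLASS learn normal visual patterns from defect-free data alone and perform strongly on industrial benchmarks. However, their computational requirements limit deployment on resource-constrained edge platforms.

This work proposes TinyGLASS, a lightweight adaptation of the GLASS framework designed for real-time in-sensor anomaly detection on the Sony IMX500 intelligent vision sensor. The architecture replaces the original WideResNet-50 backbone with a compact ResNet-18 and introduces deployment-oriented modifications that enable static graph tracing and INT8 quantization with Sony's Model Compression Toolkit.

In addition to evaluating performance on the MVTec-AD benchmark, we study the model's robustness to contaminated training data and introduce a custom industrial dataset, the MMS Dataset, for cross-device evaluation. Experiments show that TinyGLASS achieves 8.7x parameter compression while maintaining competitive detection performance, reaching 94.2% image-level AUROC (area under the receiver operating characteristic curve) on MVTec-AD and running at 20 FPS (frames per second) within the 8 MB memory limit of the IMX500 platform.

System profiling shows low power consumption (4.0 mJ per inference), real-time end-to-end latency (20 FPS), and high energy efficiency (470 GMAC/J). The model also maintains stable performance under moderate levels of training data contamination.

Abstract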

Anomaly detection plays a key role in industrial quality control, where defects must be identified despite the scarcity of labeled faulty samples. Recent self-supervised approaches, such as GLASS, learn normal visual patterns using only defect-free data and have shown strong performance on industrial benchmarks. However, their computational requirements limit deployment on resource-constrained edge platforms. This work introduces TinyGLASS, a lightweight adaptation of the GLASS framework designed for real-time in-sensor anomaly detection on the Sony IMX500 intelligent vision sensor. The proposed architecture replaces the original WideResNet-50 backbone with a compact ResNet-18 and introduces deployment-oriented modifications that enable static graph tracing and INT8 quantization using Sony’s Model Compression Toolkit. In addition to evaluating performance on the MVTec-AD benchmark, we investigate robustness to contaminated training data and introduce a custom industrial dataset, named MMS Dataset, for cross-device evaluation. Experimental results show that TinyGLASS achieves 8.7x parameter compression while maintaining competitive detection performance, reaching 94.2% image-level AUROC on MVTec-AD and operating at 20 FPS within the 8 MB memory constraints of the IMX500 platform. System profiling demonstrates low power consumption (4.0 mJ per inference), real-time end-to-end latency (20 FPS), and high energy efficiency (470 GMAC/J). Furthermore, the model maintains stable performance under moderate levels of training data contamination.
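
The INT8 quantization mentioned above can be illustrated with a generic sketch of symmetric per-tensor quantization. This is not the API of Sony's Model Compression Toolkit, and the abstract does not say which quantization scheme TinyGLASS uses; the weights and function names below are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from INT8 codes."""
    return [q * scale for q in quantized]


# Hypothetical float weights from one layer.
weights = [0.42, -1.27, 0.003, 0.9, -0.5]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes, f"scale={scale:.4f}", f"max_error={max_error:.4f}")
```

Storing each weight in one byte instead of the four bytes of a float32 cuts weight storage by roughly 4x, at the cost of a rounding error of at most half the scale per weight.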

Keywords: anomaly detection, self-supervised learning, edge computing, model compression, INT8 quantization, real-time inference, industrial quality control, MVTec-AD benchmark
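
As a rough illustration of the INT8 quantization step mentioned in the abstract above, the sketch below applies symmetric per-tensor quantization to a few weights. This is a generic textbook scheme in plain Python, not Sony's Model Compression Toolkit and not the TinyGLASS pipeline; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, for illustration only.
weights = [0.5, -1.27, 0.0, 0.3175]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the rounding error this kind of scheme trades for a 4x smaller weight representation compared with 32-bit floats.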

225. ❌ ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars

Authors: Kaiwen Song, Jinkai Cui, Juyong Zhang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16447v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

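Scoring rationale: The paper focuses on 3D Gaussian representations and progressive rendering techniques in computer graphics, targeting XR and telepresence applications. All scoring keywords concern large language models, deep learning techniques, or AI applications in science, whereas this paper studies 3D modeling and rendering methods and is unrelated to them. It does not involve language models, deep learning training techniques, AI agents, reasoning methods, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper proposes a progressive animatable avatar system based on a hierarchical 3D Gaussian representation. It supports progressive transmission and rendering under fluctuating network bandwidth and compute resources, so quality improves smoothly as more data arrives.

Abstract (Chinese translation)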

在实际的实时扩展现实与远程呈现应用中，网络与计算资源频繁波动，因此需要一种渐进式的三维表示方法。为此，我们提出了ProgressiveAvatars——一种基于三维高斯层级结构的渐进式虚拟化身表示方法，该层级通过在模板网格上进行自适应隐式细分生成。三维高斯在面部局部坐标系中定义，以在多种细节级别下保持随表情和头部运动而变化的动画能力。当屏幕空间信号显示细节不足时，层级结构会进行扩展，将资源分配到重要区域。借助重要性排序机制，ProgressiveAvatars支持增量加载与渲染，在新高斯元素到达时逐步添加，同时保留已有内容，从而在不同带宽条件下实现平滑的质量提升。ProgressiveAvatars能够在波动的网络带宽以及变化的计算与内存资源下实现渐进式传输与渐进式渲染。

Abstract

In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive 3D representation is needed. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face-local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. Leveraging importance ranking, ProgressiveAvatars supports incremental loading and rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths. ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.

Keywords: ProgressiveAvatars, 3D Gaussians, progressive rendering, XR applications, telepresence, adaptive implicit subdivision, animatable avatars, incremental loading
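
The incremental loading idea in the abstract above (rank by importance, add new Gaussians as they arrive, keep what is already loaded) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Gaussian` record, its `importance` and `size_bytes` fields, and the per-tick byte budget are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    gid: int            # hypothetical identifier
    importance: float   # hypothetical importance score (higher = send earlier)
    size_bytes: int     # hypothetical payload size


def rank_by_importance(gaussians):
    """Order Gaussians so the most important ones are transmitted first."""
    return sorted(gaussians, key=lambda g: g.importance, reverse=True)


def stream_in_chunks(ranked, budget_bytes_per_tick):
    """Yield a snapshot of the loaded set after each tick.

    Every snapshot extends the previous one, so earlier content is preserved
    and quality can only improve as more data arrives.
    """
    loaded = []
    i = 0
    while i < len(ranked):
        used = 0
        # Send at least one Gaussian per tick so progress is guaranteed
        # even when a single item exceeds the budget.
        while i < len(ranked) and (
            used == 0 or used + ranked[i].size_bytes <= budget_bytes_per_tick
        ):
            used += ranked[i].size_bytes
            loaded.append(ranked[i])
            i += 1
        yield list(loaded)


scene = [Gaussian(i, imp, 40) for i, imp in enumerate([0.1, 0.9, 0.5, 0.7, 0.3])]
snapshots = list(stream_in_chunks(rank_by_importance(scene), budget_bytes_per_tick=100))
```

A renderer could draw each snapshot as it is produced; lowering the budget only increases the number of ticks, not the final content.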

226. ❌ Unified Removal of Raindrops and Reflections: A New Benchmark and A Novel Pipeline

Authors: Xingyu Liu, Zewei He, Yu Chen, Chunyu Zhu, Zixuan Chen, Xing Luo, Zhe-Ming Lu Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16446v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

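Scoring rationale: The paper addresses raindrop and reflection removal from images, a computer vision task. It proposes the diffusion-based framework DiffUR³ and builds the RDRF dataset. All scoring keywords concern large language models, deep learning techniques, or AI for Science, whereas this paper tackles a conventional image processing problem and does not involve large models, novel deep learning techniques, or scientific applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper is the first to define the unified raindrop and reflection removal task. It proposes the diffusion-based DiffUR³ framework, builds the RDRF dataset, and reports state-of-the-art performance on the benchmark and on in-the-wild images.

Abstract (Chinese translation)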

在雨天透过玻璃表面或挡风玻璃拍摄图像时，雨滴和反射现象常同时出现，显著降低所摄图像的可见度。这一实际问题缺乏关注且亟待解决。现有的去雨滴、去反射及一体化模型均未能有效处理此类复合退化现象。为此，我们首次正式定义了雨滴与反射联合去除任务，并构建了真实拍摄数据集——雨滴与反射数据集，该数据集提供了大量高质量、多样化的图像对作为新基准。随后，我们提出了一种基于扩散模型的新型框架，通过多项针对性设计应对这一挑战性任务。该框架借助强大的生成先验，成功消除了两类退化效应。大量实验表明，我们的方法在基准测试及复杂真实场景图像中均达到了最优性能。雨滴与反射数据集及相关代码将在论文录用后公开。

Abstract

When capturing images through glass surfaces or windshields on rainy days, raindrops and reflections frequently co-occur to significantly reduce the visibility of captured images. This practical problem lacks attention and needs to be resolved urgently. Prior de-raindrop, de-reflection, and all-in-one models have failed to address this composite degradation. To this end, we first formally define the unified removal of raindrops and reflections (UR$^3$) task for the first time and construct a real-shot dataset, namely RainDrop and ReFlection (RDRF), which provides a new benchmark with substantial, high-quality, diverse image pairs. Then, we propose a novel diffusion-based framework (i.e., DiffUR$^3$) with several target designs to address this challenging task. By leveraging the powerful generative prior, DiffUR$^3$ successfully removes both types of degradations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on our benchmark and on challenging in-the-wild images. The RDRF dataset and the codes will be made public upon acceptance.

Keywords: raindrop removal, reflection removal, diffusion model, image restoration, benchmark dataset, computer vision, degradation removal, real-shot images

227. ❌ Fast-HaMeR: Boosting Hand Mesh Reconstruction using Knowledge Distillation

Authors: Hunain Ahmed Jillani, Ahmed Tawfik Aboukhadra, Ahmed Elhayek, Jameel Malik, Nadia Robertini, Didier Stricker Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16444v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

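Scoring rationale: The paper addresses 3D hand mesh reconstruction, a computer vision task, and uses knowledge distillation with lightweight networks to speed up an existing model. Although it involves model compression and efficiency optimization, all keywords target large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on), whereas this paper studies convolutional and Transformer-based vision models (HaMeR, ViT-H, MobileNet, etc.) and involves no language model or text generation task. All keywords are therefore judged unrelated to the paper and scored 0.

!!! tip deepseek-chat TL;DR

The paper accelerates the 3D hand reconstruction model HaMeR using knowledge distillation and lightweight backbones (such as MobileNet and MobileViT). It achieves 1.5x faster inference with an accuracy difference of only 0.4 mm, making it suitable for low-power devices.

Abstract (Chinese translation)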

快速且精确的三维手部重建对于虚拟现实/增强现实（VR/AR）、人机交互、机器人技术和医疗保健等实时应用至关重要。目前大多数先进方法依赖于庞大模型，限制了其在头戴设备、智能手机和嵌入式系统等资源受限设备上的应用。本文研究了如何通过轻量级神经网络结合知识蒸馏技术，在保持可比重建精度的同时，使复杂的三维手部重建模型变得更快速、更轻便。虽然我们的方法适用于多种手部重建框架，但主要聚焦于提升当前重建精度领先方法HaMeR的性能。我们将其原有的ViT-H主干网络替换为更轻量的替代方案，包括MobileNet、MobileViT、ConvNeXt和ResNet，并评估了三种知识蒸馏策略：输出级蒸馏、特征级蒸馏以及二者结合的混合蒸馏。实验表明，使用体积仅为原模型35%的轻量级主干网络，可实现1.5倍的推理加速，同时保持相近的性能质量，仅产生0.4毫米的微小精度差异。具体而言，我们证明输出级蒸馏能显著提升学生模型的性能，而特征级蒸馏对高容量学生模型更为有效。总体而言，这些发现为在低功耗设备上实现高效的实际应用铺平了道路。代码与模型已公开于https://github.com/hunainahmedj/Fast-HaMeR。

Abstract

Fast and accurate 3D hand reconstruction is essential for real-time applications in VR/AR, human-computer interaction, robotics, and healthcare. Most state-of-the-art methods rely on heavy models, limiting their use on resource-constrained devices like headsets, smartphones, and embedded systems. In this paper, we investigate how the use of lightweight neural networks, combined with Knowledge Distillation, can accelerate complex 3D hand reconstruction models by making them faster and lighter, while maintaining comparable reconstruction accuracy. While our approach is suited for various hand reconstruction frameworks, we focus primarily on boosting the HaMeR model, currently the leading method in terms of reconstruction accuracy. We replace its original ViT-H backbone with lighter alternatives, including MobileNet, MobileViT, ConvNeXt, and ResNet, and evaluate three knowledge distillation strategies: output-level, feature-level, and a hybrid of both. Our experiments show that using lightweight backbones that are only 35% the size of the original achieves 1.5x faster inference speed while preserving similar performance quality with only a minimal accuracy difference of 0.4mm. More specifically, we show how output-level distillation notably improves student performance, while feature-level distillation proves more effective for higher-capacity students. Overall, the findings pave the way for efficient real-world applications on low-power devices. The code and models are publicly available under https://github.com/hunainahmedj/Fast-HaMeR.

Keywords: 3D hand reconstruction, knowledge distillation, lightweight neural networks, model acceleration, HaMeR, real-time applications, inference speed, resource-constrained devices
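
The three distillation strategies named in the abstract above (output-level, feature-level, and a hybrid of both) differ only in which teacher signals the student is asked to match. The sketch below shows that distinction with a mean-squared-error loss on plain lists. It is a simplified illustration, not the Fast-HaMeR training code: the choice of MSE, the weights `alpha` and `beta`, and the assumption that student and teacher features have the same length are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_out, teacher_out, student_feat, teacher_feat,
                      ground_truth, mode="hybrid", alpha=1.0, beta=1.0):
    """Task loss plus the distillation terms selected by `mode`.

    mode="output":  student mimics the teacher's final predictions
    mode="feature": student mimics the teacher's intermediate features
    mode="hybrid":  both terms are active
    """
    task = mse(student_out, ground_truth)
    out_term = mse(student_out, teacher_out) if mode in ("output", "hybrid") else 0.0
    feat_term = mse(student_feat, teacher_feat) if mode in ("feature", "hybrid") else 0.0
    return task + alpha * out_term + beta * feat_term


# Toy vectors, for illustration only.
loss = distillation_loss(
    student_out=[1.0, 2.0], teacher_out=[1.0, 4.0],
    student_feat=[0.0, 0.0], teacher_feat=[2.0, 0.0],
    ground_truth=[1.0, 2.0], mode="hybrid",
)
```

In practice a lightweight student and a ViT-H teacher usually have different feature widths, so a feature-level term would typically need a projection layer before the comparison; that step is omitted here.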

228. ❌ IRIS: A Real-World Benchmark for Inverse Recovery and Identification of Physical Dynamic Systems from Monocular Video

Authors: Rasul Khanbayov, Mohamed Rayan Barhdadi, Erchin Serpedin, Hasan Kurban Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16432v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

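Scoring rationale: The paper focuses on unsupervised parameter estimation and equation identification for physical dynamical systems, which belongs to computer vision and physics-informed machine learning. It mentions VLM (vision-language model) temporal reasoning as one of its baselines, which falls under AI for Science applications, so that keyword receives a relevance of 5. All other keywords concern specific techniques such as large language models, deep learning techniques, and training optimization, which are unrelated to the paper's themes of physical system identification and benchmark construction, so they receive 0.

!!! tip deepseek-chat TL;DR

The paper introduces the IRIS benchmark to address the lack of a unified evaluation standard for unsupervised physical parameter estimation and governing-equation identification from monocular video. It comprises 220 high-fidelity real-world videos and defines a standardized evaluation protocol.

Abstract (Chinese translation)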

视频无监督物理参数估计领域缺乏统一的基准：现有方法均在互不重叠的合成数据上进行评估，唯一的真实世界数据集局限于单体系统，且尚无成熟协议用于控制方程识别。本研究提出IRIS基准——一个包含220段真实世界视频的高保真数据集，视频以4K分辨率、60帧/秒拍摄，涵盖单体和多体动力学系统，并提供独立测量的真实参数与不确定性估计。每个动力学系统均在受控实验室条件下记录，并配有其控制方程，从而实现原理性评估。我们定义了标准化评估协议，涵盖参数准确性、可识别性、外推能力、鲁棒性及控制方程选择。通过评估多种基线方法（包括多步物理损失框架及四种互补的方程识别策略：视觉语言模型时序推理、描述-分类提示法、基于CNN的分类方法及路径标注法），为所有IRIS场景建立了参考性能指标，并揭示了系统性失效模式以推动未来研究。数据集、标注、评估工具包及所有基线实现均已公开发布。

Abstract

Unsupervised physical parameter estimation from video lacks a common benchmark: existing methods evaluate on non-overlapping synthetic data, the sole real-world dataset is restricted to single-body systems, and no established protocol addresses governing-equation identification. This work introduces IRIS, a high-fidelity benchmark comprising 220 real-world videos captured at 4K resolution and 60 fps, spanning both single- and multi-body dynamics with independently measured ground-truth parameters and uncertainty estimates. Each dynamical system is recorded under controlled laboratory conditions and paired with its governing equations, enabling principled evaluation. A standardized evaluation protocol is defined encompassing parameter accuracy, identifiability, extrapolation, robustness, and governing-equation selection. Multiple baselines are evaluated, including a multi-step physics loss formulation and four complementary equation-identification strategies (VLM temporal reasoning, describe-then-classify prompting, CNN-based classification, and path-based labelling), establishing reference performance across all IRIS scenarios and exposing systematic failure modes that motivate future research. The dataset, annotations, evaluation toolkit, and all baseline implementations are publicly released.

Keywords: physical parameter estimation, monocular video, benchmark dataset, governing-equation identification, unsupervised learning, real-world videos, evaluation protocol, dynamical systems

Authors: Joona Kareinen, Veikka Immonen, Tuomas Eerola, Lumi Haraguchi, Lasse Lensu, Kaisa Kraft, Sanna Suikkanen, Heikki Kälviäinen Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16427v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

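Scoring rationale: The paper studies plankton recognition based on self-supervised cross-modal learning, an application of AI in science (specifically marine biology/bioinformatics), so it is relevant only to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (relevance 8: the paper focuses on biological image recognition, which falls under AI for Science but is not typical bioinformatics or cheminformatics). The other keywords concern large-model techniques, training methods, inference optimization, agent systems, and so on. The paper neither uses nor discusses any large language model or novel deep learning technique, relying only on conventional self-supervised learning and a k-NN classifier, so the remaining keywords receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a plankton recognition method based on self-supervised cross-modal coordination that trains on images and optical measurement data without extensive labeling. It achieves high recognition accuracy with only a small number of labeled images and outperforms an image-only self-supervised baseline.

Abstract (Chinese translation)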

本文提出将自监督跨模态协调作为一种策略，利用多模态及大量未标记浮游生物数据来构建浮游生物识别模型。自动化成像仪器促进了浮游生物图像数据的大规模持续采集。当前自动浮游生物图像识别方法主要依赖于监督式方法，这些方法需要人工标注的训练集，而标注工作耗时费力。另一方面，一些现代浮游生物成像仪器在图像信息之外还补充了光学测量数据，例如散射和荧光剖面，这些数据目前在浮游生物识别中尚未得到广泛应用。在本研究中，我们探索了利用此类测量数据来指导学习过程而无需人工标注的可能性。受对比语言-图像预训练（Contrastive Language-Image Pre-training, CLIP）相关理念的启发，我们仅使用二元监督信息（指示给定图像与剖面是否来自同一颗粒或不同颗粒）来训练两种模态的编码器。对于浮游生物识别，我们采用一个已知浮游生物物种的小型标注库，并结合 $k$-NN 分类器。该方法产生了一个本质上的多模态识别模型，即能够同时利用从图像和剖面数据中提取的信息。我们证明，所提出的方法在仅需极少量标注图像的情况下即可实现高识别准确率。此外，我们还表明该方法优于仅使用图像的自监督基线模型。代码发布于 https://github.com/Jookare/cross-modal-plankton。

Abstract

This paper considers self-supervised cross-modal coordination as a strategy enabling utilization of multiple modalities and large volumes of unlabeled plankton data to build models for plankton recognition. Automated imaging instruments facilitate the continuous collection of plankton image data on a large scale. Current methods for automatic plankton image recognition rely primarily on supervised approaches, which require labeled training sets that are labor-intensive to collect. On the other hand, some modern plankton imaging instruments complement image information with optical measurement data, such as scatter and fluorescence profiles, which currently are not widely utilized in plankton recognition. In this work, we explore the possibility of using such measurement data to guide the learning process without requiring manual labeling. Inspired by the concepts behind Contrastive Language-Image Pre-training, we train encoders for both modalities using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles. For plankton recognition, we employ a small labeled gallery of known plankton species combined with a $k$-NN classifier. This approach yields a recognition model that is inherently multimodal, i.e., capable of utilizing information extracted from both image and profile data. We demonstrate that the proposed method achieves high recognition accuracy while requiring only a minimal number of labeled images. Furthermore, we show that the approach outperforms an image-only self-supervised baseline. Code available at https://github.com/Jookare/cross-modal-plankton.

Keywords: plankton recognition, cross-modal learning, self-supervised learning, contrastive learning, multimodal model, image analysis, optical measurements, k-NN classifier
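
The recognition step described in the abstract above, a small labeled gallery combined with a $k$-NN classifier, can be sketched as follows once embeddings are available. This is a minimal illustration and not the authors' released code: the cosine similarity measure, the 2D toy embeddings, and the class labels are hypothetical, and the encoders that would produce the embeddings are not shown.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, gallery, k=3):
    """Label a query embedding by majority vote among its k nearest gallery entries.

    `gallery` is a list of (embedding, label) pairs from the labeled set.
    """
    ranked = sorted(gallery, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy gallery with hypothetical labels and embeddings.
gallery = [
    ([1.0, 0.0], "diatom"),
    ([0.9, 0.1], "diatom"),
    ([0.0, 1.0], "ciliate"),
    ([0.1, 0.9], "ciliate"),
]
prediction = knn_predict([0.8, 0.2], gallery, k=3)
```

Because only the gallery carries labels, adding a new species in this setup means adding a few labeled embeddings rather than retraining the encoders.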

230. ❌ Near-light Photometric Stereo with Symmetric Lights

Authors: Lilika Makabe, Heng Guo, Hiroaki Santo, Fumio Okura, Yasuyuki Matsushita Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16404v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

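Scoring rationale: The paper studies near-light photometric stereo in computer vision and proposes a linear solution method that exploits symmetric light source arrangements. All scoring keywords concern large models, deep learning techniques, or AI applications in science, whereas this paper addresses a geometric reconstruction problem in classical computer vision and involves no large models, deep learning, or AI for Science content, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a linear solution method that exploits symmetric light source arrangements to estimate surface normals and depth in near-light photometric stereo. It yields a closed-form solution without initialization and still works when the spatial offset of the lights is uncalibrated. Experiments show shape recovery accuracy comparable to the state-of-the-art calibrated method.

Abstract (Chinese translation)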

本文提出一种利用对称光源布局的近光源光度立体线性求解方法。与传统非凸优化方法不同，通过布置多组对称的近光源对，本方法无需初始化即可推导出表面法线与深度的闭合解。此外，只要光源关于任意点对称分布，即使整体空间偏移未经标定，本方法仍能有效工作。实验结果表明，本方法在形状恢复精度上达到了与当前最先进的标定型近光源光度立体方法相当的结果，同时显著降低了对精细深度初始化和光源标定的要求。

Abstract

This paper describes a linear solution method for near-light photometric stereo by exploiting symmetric light source arrangements. Unlike conventional non-convex optimization approaches, by arranging multiple sets of symmetric nearby light source pairs, our method derives a closed-form solution for surface normal and depth without requiring initialization. In addition, our method works as long as the light sources are symmetrically distributed about an arbitrary point even when the entire spatial offset is uncalibrated. Experiments showcase the shape recovery accuracy of our method, achieving results comparable to the state-of-the-art calibrated near-light photometric stereo method while significantly reducing the requirements for careful depth initialization and light calibration.

Keywords: photometric stereo, near-light, symmetric lights, surface normal, depth estimation, closed-form solution, light calibration, shape recovery

231. ❌ HGP-Mamba: Integrating Histology and Generated Protein Features for Mamba-based Multimodal Survival Risk Prediction

Authors: Jing Dai, Chen Wu, Ming Wu, Qibin Zhang, Zexi Wu, Jingdong Zhang, Hongming Xu Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16421v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

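Scoring rationale: HGP-Mamba focuses on cancer survival risk prediction, which belongs to AI for Science (bioinformatics), so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (relevance 10). The paper mentions using pretrained foundation models to extract protein features, which gives it some connection to 'Large Language Models OR LLMs OR Foundation Models' and 'Pre-training OR Continual Pre-training OR Domain Adaptation' (relevance 5 each). The other keywords mainly concern large-model techniques (such as MoE, RLHF, and quantization) or specific applications (such as agents and RAG) that the paper does not directly address, so they receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes the HGP-Mamba framework, which integrates histology images with generated protein features to address the scarcity of protein data in cancer survival risk prediction, and reports state-of-the-art performance on multiple datasets.

Abstract (Chinese translation)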

多模态学习的最新进展显著提升了癌症生存风险预测的准确性。然而，蛋白质标记物与组织病理学图像的联合预后潜力仍未得到充分探索，这主要归因于蛋白质表达谱分析的高成本与有限可用性。为应对这一挑战，我们提出了HGP-Mamba——一个基于Mamba架构的多模态框架，能高效整合组织学特征与生成的蛋白质特征以进行生存风险预测。具体而言，我们引入了蛋白质特征提取器（Protein Feature Extractor, PFE），该模块利用预训练的基础模型直接从全切片图像（Whole Slide Images, WSIs）中提取高通量蛋白质嵌入，从而实现了分子信息的数据高效整合。结合捕获形态学特征的组织学嵌入，我们进一步提出了局部交互感知Mamba（Local Interaction-aware Mamba, LiAM）以实现细粒度特征交互，以及全局交互增强Mamba（Global Interaction-enhanced Mamba, GiEM）以促进切片层面的整体模态融合，从而捕捉复杂的跨模态依赖关系。在四个公开癌症数据集上的实验表明，与现有方法相比，HGP-Mamba在保持卓越计算效率的同时实现了最先进的预测性能。我们的源代码已在此https链接公开。

Abstract

Recent advances in multimodal learning have significantly improved cancer survival risk prediction. However, the joint prognostic potential of protein markers and histopathology images remains underexplored, largely due to the high cost and limited availability of protein expression profiling. To address this challenge, we propose HGP-Mamba, a Mamba-based multimodal framework that efficiently integrates histological with generated protein features for survival risk prediction. Specifically, we introduce a protein feature extractor (PFE) that leverages pretrained foundation models to derive high-throughput protein embeddings directly from Whole Slide Images (WSIs), enabling data-efficient incorporation of molecular information. Together with histology embeddings that capture morphological patterns, we further introduce the Local Interaction-aware Mamba (LiAM) for fine-grained feature interaction and the Global Interaction-enhanced Mamba (GiEM) to promote holistic modality fusion at the slide level, thus capturing complex cross-modal dependencies. Experiments on four public cancer datasets demonstrate that HGP-Mamba achieves state-of-the-art performance while maintaining superior computational efficiency compared with existing methods. Our source code is publicly available at this https URL.

Keywords: HGP-Mamba, multimodal survival risk prediction, histology, protein features, Mamba, Whole Slide Images, cancer datasets, computational efficiency

232. ❌ 3D Fourier-based Global Feature Extraction for Hyperspectral Image Classification

Authors: Muhammad Ahmad Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16426v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

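Scoring rationale: The paper focuses on hyperspectral image classification (HSIC). It proposes a deep learning architecture (HGFNet) that combines 3D convolution with Fourier transforms and introduces an Adaptive Focal Loss (AFL) to handle class imbalance. Its core is a deep learning application in computer vision and remote sensing, specifically spatial-spectral feature extraction for hyperspectral data. The keywords on large language models (LLMs), training techniques, inference optimization, alignment methods, agent systems, and other core large-model technologies are unrelated to it, so every keyword except 'AI for Science OR Bioinformatics OR Cheminformatics' receives 0. That keyword receives a relevance of 5 because the paper applies AI in a scientific domain (remote sensing/image analysis) but is not direct bioinformatics or cheminformatics research, so the connection is moderate.

!!! tip deepseek-chat TL;DR

To address the poor scalability of Transformer models and the neglect of inter-band spectral dependencies by existing Fourier-based methods in hyperspectral image classification, the paper proposes a hybrid network (HGFNet) that integrates 3D convolution with several Fourier transforms, together with an adaptive focal loss. The result is efficient, robust spatial-spectral representation learning and better classification under class imbalance.

Abstract (Chinese translation)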

高光谱图像分类（HSIC）通过利用丰富的空谱相关性的深度学习方法取得了显著进展。然而，现有方法仍面临根本性局限：基于Transformer的模型因自注意力机制的二次复杂度而存在可扩展性不足的问题，而近期基于傅里叶变换的方法通常依赖二维空间快速傅里叶变换（2D FFT），且大多忽视了高光谱数据固有的关键波段间光谱依赖性。为解决这些挑战，我们提出混合GFNet（HGFNet），这是一种新颖的架构，它通过GFNet风格的模块，将局部三维卷积特征提取与频域全局滤波相结合，以实现高效且鲁棒的空谱表征学习。HGFNet引入了三种专为高光谱影像设计的互补频率变换：光谱傅里叶变换（沿光谱轴的一维FFT）、空间傅里叶变换（在空间维度上的二维FFT）以及空间-空间傅里叶变换（在光谱和空间维度上联合进行的三维FFT），从而实现全面且高维的频率建模。三维卷积层捕捉细粒度的局部空谱结构，而基于傅里叶的全局滤波模块则高效建模长程依赖并抑制噪声。为了进一步缓解高光谱图像分类中常见的严重类别不平衡问题，HGFNet引入了自适应焦点损失（Adaptive Focal Loss, AFL），该损失函数动态调整类别的聚焦程度与权重，从而提升对代表性不足类别的区分能力。

Abstract

Hyperspectral image classification (HSIC) has been significantly advanced by deep learning methods that exploit rich spatial-spectral correlations. However, existing approaches still face fundamental limitations: transformer-based models suffer from poor scalability due to the quadratic complexity of self-attention, while recent Fourier transform-based methods typically rely on 2D spatial FFTs and largely ignore critical inter-band spectral dependencies inherent to hyperspectral data. To address these challenges, we propose Hybrid GFNet (HGFNet), a novel architecture that integrates localized 3D convolutional feature extraction with frequency-domain global filtering via GFNet-style blocks for efficient and robust spatial-spectral representation learning. HGFNet introduces three complementary frequency transforms tailored to hyperspectral imagery: Spectral Fourier Transform (a 1D FFT along the spectral axis), Spatial Fourier Transform (a 2D FFT over spatial dimensions), and Spatial-Spatial Fourier Transform (a 3D FFT jointly over spectral and spatial dimensions), enabling comprehensive and high-dimensional frequency modeling. The 3D convolutional layers capture fine-grained local spatial-spectral structures, while the Fourier-based global filtering modules efficiently model long-range dependencies and suppress noise. To further mitigate the severe class imbalance commonly observed in HSIC, HGFNet incorporates an Adaptive Focal Loss (AFL) that dynamically adjusts class-wise focusing and weighting, improving discrimination for underrepresented classes.

Keywords: Hyperspectral image classification, 3D Fourier transform, Spatial-spectral representation, GFNet, Adaptive Focal Loss, Class imbalance, Deep learning, Remote sensing
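
The three frequency transforms named in the abstract (a 1D FFT along the spectral axis, a 2D FFT over the spatial axes, and a 3D FFT over all three) can be illustrated with NumPy. This is a minimal sketch, not the authors' implementation: the function name `global_filter`, the `(bands, height, width)` layout, and the all-ones filter are assumptions made for illustration. In a GFNet-style block the filter weights would be learned.

```python
import numpy as np

def global_filter(cube, weights, axes):
    """GFNet-style filtering: FFT over `axes`, element-wise multiply, inverse FFT."""
    spectrum = np.fft.fftn(cube, axes=axes)
    filtered = spectrum * weights          # weights would be learnable in a real model
    return np.fft.ifftn(filtered, axes=axes).real

# Hypothetical hyperspectral patch: 8 bands, 5 x 5 pixels.
rng = np.random.default_rng(0)
cube = rng.standard_normal((8, 5, 5))

variants = {
    "spectral (1D FFT over bands)": (0,),
    "spatial (2D FFT over height, width)": (1, 2),
    "joint (3D FFT over bands, height, width)": (0, 1, 2),
}

for name, axes in variants.items():
    identity = np.ones(cube.shape)         # all-pass filter leaves the input unchanged
    out = global_filter(cube, identity, axes)
    print(name, np.allclose(out, cube))
```

Only the `axes` argument changes between the three variants, so one filtering routine can serve all of them.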

233. ❌ Unpaired Cross-Domain Calibration of DMSP to VIIRS Nighttime Light Data Based on CUT Network

Authors: Zhan Tong, ChenXu Zhou, Fei Tang, Yiming Tu, Tianyu Qin, Kaihao Fang Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16385v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

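Scoring rationale: The paper belongs to remote sensing image processing. It addresses cross-sensor calibration between DMSP and VIIRS nighttime light data using a Contrastive Unpaired Translation (CUT) network. Its content lies entirely within remote sensing, geographic information systems, and image processing, and has no direct connection to any of the scoring keywords, which center on large models, their underlying techniques, and their applications. Although the method is itself a deep learning network, the paper does not cover language models, large-model training methods, inference optimization, AI agents, or the AI-for-science areas named in the keywords.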

!!! tip deepseek-chat TL;DR

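The paper proposes a cross-sensor calibration method based on a Contrastive Unpaired Translation (CUT) network that converts DMSP nighttime light data into a VIIRS-like format. It addresses sensor incompatibility, corrects DMSP defects, and produces a long time series that agrees closely with real VIIRS observations (R-squared greater than 0.87).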

Abstract (Chinese translation)

国防气象卫星计划（DMSP-OLS）与索米国家极轨伙伴关系卫星（SNPP-VIIRS）的夜间灯光（NTL）数据是监测城市化进程的关键工具，但传感器之间的不兼容性阻碍了长期分析。本研究提出一种基于对比式非配对翻译（CUT）网络的跨传感器校准方法，将DMSP数据转换为类VIIRS格式，以修正DMSP数据的缺陷。该方法采用多层分块对比学习，最大化对应图像块之间的互信息，在保持内容一致性的同时学习跨域相似性。利用2012-2013年重叠时段数据进行训练，该网络处理1992-2013年的DMSP影像，生成增强的VIIRS风格栅格数据。验证结果表明，生成的类VIIRS数据与实际VIIRS观测值（R平方大于0.87）及社会经济指标具有高度一致性。该方法有效解决了跨传感器数据融合问题并校准了DMSP缺陷，为延长夜间灯光时间序列提供了可靠尝试。

Abstract

Defense Meteorological Satellite Program (DMSP-OLS) and Suomi National Polar-orbiting Partnership (SNPP-VIIRS) nighttime light (NTL) data are vital for monitoring urbanization, yet sensor incompatibilities hinder long-term analysis. This study proposes a cross-sensor calibration method using Contrastive Unpaired Translation (CUT) network to transform DMSP data into VIIRS-like format, correcting DMSP defects. The method employs multilayer patch-wise contrastive learning to maximize mutual information between corresponding patches, preserving content consistency while learning cross-domain similarity. Utilizing 2012-2013 overlapping data for training, the network processes 1992-2013 DMSP imagery to generate enhanced VIIRS-style raster data. Validation results demonstrate that generated VIIRS-like data exhibits high consistency with actual VIIRS observations (R-squared greater than 0.87) and socioeconomic indicators. This approach effectively resolves cross-sensor data fusion issues and calibrates DMSP defects, providing reliable attempt for extended NTL time-series.

Keywords: nighttime light data, cross-sensor calibration, Contrastive Unpaired Translation, DMSP-OLS, VIIRS, data fusion, time-series analysis, remote sensing
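
The patch-wise contrastive objective described in the abstract can be sketched as an InfoNCE loss, where each translated patch should match the source patch at the same location and not the other patches. This is a hedged illustration only: `patch_nce_loss`, the feature dimensions, and the temperature of 0.07 are assumptions, and the features here are random vectors rather than encoder outputs. Minimizing this kind of loss maximizes a lower bound on the mutual information between corresponding patches.

```python
import numpy as np

def patch_nce_loss(queries, keys, temperature=0.07):
    """InfoNCE over patches: queries[i] should match keys[i] and not keys[j != i]."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                    # (N, N) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # positives sit on the diagonal

rng = np.random.default_rng(0)
source_patches = rng.standard_normal((16, 32))        # hypothetical source patch features
aligned = source_patches + 0.01 * rng.standard_normal((16, 32))
misaligned = np.roll(aligned, 1, axis=0)              # break the patch correspondence

loss_aligned = patch_nce_loss(aligned, source_patches)
loss_misaligned = patch_nce_loss(misaligned, source_patches)
print(loss_aligned < loss_misaligned)
```

The loss is low when content stays at the same location after translation, which is how the objective encourages content consistency.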

234. ❌ Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation

Authors: Yunpeng Qu, Kaidong Zhang, Yukang Ding, Ying Chen, Jian Wang Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16373v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

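Scoring rationale: The paper focuses on image tokenization and generation in computer vision. It proposes SemTok, a new semantic one-dimensional tokenizer for image reconstruction and generation. Although it involves generative models and tokenization, its content is entirely about the visual modality (images) and does not cover language models, the technical foundations of large models, alignment methods, reasoning techniques, agent systems, or AI-for-science applications. The scoring keywords all target language models and related techniques and have no direct connection to the paper's visual focus, so every keyword receives a relevance of 0.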

!!! tip deepseek-chat TL;DR

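Existing visual tokenizers map images onto fixed 2D spatial grids and struggle to capture compact global semantics. The paper proposes SemTok, a semantic one-dimensional tokenizer that combines a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy. It reports state-of-the-art image reconstruction and notable improvements in downstream image generation.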

Abstract (Chinese translation)

基于潜在空间的视觉生成模型已取得巨大成功，凸显了视觉分词化的重要性。将图像映射到潜在空间可提升效率，并实现多模态对齐以扩展下游任务规模。现有视觉分词器主要将图像映射为固定的二维空间网格，并专注于像素级重建，这阻碍了捕获具有紧凑全局语义的表征。为解决这些问题，我们提出SemTok，一种语义一维分词器，可将二维图像压缩为具有高层语义的一维离散标记。SemTok在图像重建领域确立了新的技术标杆，以极其紧凑的标记表征实现了卓越的保真度。这得益于包含三项关键创新的协同框架：二维到一维的分词化方案、语义对齐约束以及两阶段生成式训练策略。基于SemTok，我们构建了掩码自回归生成框架，在下游图像生成任务中取得了显著提升。实验验证了我们语义一维分词化方法的有效性。代码将开源发布。

Abstract

Visual generative models based on latent space have achieved great success, underscoring the significance of visual tokenization. Mapping images to latents boosts efficiency and enables multimodal alignment for scaling up in downstream tasks. Existing visual tokenizers primarily map images into fixed 2D spatial grids and focus on pixel-level restoration, which hinders the capture of representations with compact global semantics. To address these issues, we propose SemTok, a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics. SemTok sets a new state-of-the-art in image reconstruction, achieving superior fidelity with a remarkably compact token representation. This is achieved via a synergistic framework with three key innovations: a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy. Building on SemTok, we construct a masked autoregressive generation framework, which yields notable improvements in downstream image generation tasks. Experiments confirm the effectiveness of our semantic 1D tokenization. Our code will be open-sourced.

Keywords: visual tokenization, semantic tokenizer, image reconstruction, image generation, 1D discrete tokens, masked autoregressive generation, latent space, multimodal alignment

235. ❌ DermaFlux: Synthetic Skin Lesion Generation with Rectified Flows for Enhanced Image Classification

Authors: Stathis Galanakis, Alexandros Koliousis, Stefanos Zafeiriou Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16392v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

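Scoring rationale: The paper studies skin lesion image generation and classification. Its core contribution is a generative model based on rectified flow, fine-tuned with LoRA. Relevance by keyword: 1) "PEFT/LoRA" is highly relevant (10) because the paper explicitly uses LoRA for parameter-efficient fine-tuning; 2) "AI for Science" is highly relevant (10) because this is a biomedical AI application; 3) "Large Language Models" is partially relevant (5) because Llama 3.2 is used to generate the text captions. The remaining keywords are unrelated to the paper (0).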

!!! tip deepseek-chat TL;DR

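To address data scarcity and class imbalance in skin lesion classification, the paper proposes DermaFlux, a framework based on rectified flow and LoRA fine-tuning that generates synthetic skin lesion images. It reports binary classification gains of up to 6% when augmenting small real datasets, and of up to 9% when classifiers are trained on DermaFlux images rather than diffusion-based synthetic images.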

Abstract (Chinese translation)

尽管深度生成建模领域近期取得了进展，皮肤病变分类系统仍受限于大规模、多样化且标注完善的临床数据集的稀缺性，导致良性与恶性病变间的类别不平衡，进而降低了模型的泛化性能。我们提出了DermaFlux——一种基于修正流（rectified flow）的文本到图像生成框架，能够根据皮肤病学属性的自然语言描述合成具有临床依据的皮肤病变图像。DermaFlux基于Flux.1架构，通过在精心整理的大规模公开临床图像数据集上采用参数高效的低秩自适应（Low-Rank Adaptation, LoRA）进行微调。我们依据既定的皮肤病学标准（包括病变不对称性、边缘不规则性和颜色变化），利用Llama 3.2生成的合成文本描述构建图像-文本对。大量实验表明，DermaFlux能够生成多样化且具有临床意义的皮肤病图像：当用于增强小规模真实数据集时，可将二元分类性能提升最高达6%；而当分类器完全使用DermaFlux生成的合成图像（而非基于扩散模型的合成图像）进行训练时，性能提升最高达9%。我们使用仅2,500张真实图像和4,375张DermaFlux生成样本微调的ImageNet预训练视觉Transformer（ViT），实现了78.04%的二元分类准确率和0.859的AUC值，以8%的优势超越了当前次优的皮肤病学模型。

Abstract

Despite recent advances in deep generative modeling, skin lesion classification systems remain constrained by the limited availability of large, diverse, and well-annotated clinical datasets, resulting in class imbalance between benign and malignant lesions and consequently reduced generalization performance. We introduce DermaFlux, a rectified flow-based text-to-image generative framework that synthesizes clinically grounded skin lesion images from natural language descriptions of dermatological attributes. Built upon Flux.1, DermaFlux is fine-tuned using parameter-efficient Low-Rank Adaptation (LoRA) on a large curated collection of publicly available clinical image datasets. We construct image-text pairs using synthetic textual captions generated by Llama 3.2, following established dermatological criteria including lesion asymmetry, border irregularity, and color variation. Extensive experiments demonstrate that DermaFlux generates diverse and clinically meaningful dermatology images that improve binary classification performance by up to 6% when augmenting small real-world datasets, and by up to 9% when classifiers are trained on DermaFlux-generated synthetic images rather than diffusion-based synthetic images. Our ImageNet-pretrained ViT fine-tuned with only 2,500 real images and 4,375 DermaFlux-generated samples achieves 78.04% binary classification accuracy and an AUC of 0.859, surpassing the next best dermatology model by 8%.

Keywords: skin lesion generation, rectified flow, LoRA fine-tuning, dermatology image synthesis, medical image classification, data augmentation, text-to-image generation, clinical datasets
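
The parameter-efficient Low-Rank Adaptation mentioned in the abstract can be illustrated on a single linear layer. The sketch below shows the generic LoRA update, in which a frozen weight `W` is augmented by a trainable low-rank product scaled by `alpha / r`. It is not the DermaFlux training code: the layer size, rank, and `alpha` value are assumptions chosen for illustration.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B trainable."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_in, d_out, r = 64, 64, 4                  # hypothetical layer size and rank
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero at init
x = rng.standard_normal((2, d_in))

# With B = 0 the adapted layer reproduces the frozen layer exactly.
unchanged_at_init = np.allclose(lora_forward(x, W, A, B, alpha=8), x @ W.T)
print(unchanged_at_init)

full_params, lora_params = W.size, A.size + B.size
print(full_params, lora_params)             # 4096 frozen weights vs 512 trainable
```

In this toy configuration the adapter trains one eighth as many parameters as the full layer. The ratio shrinks further as the layer grows while the rank stays fixed.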

236. ❌ Advancing Visual Reliability: Color-Accurate Underwater Image Enhancement for Real-Time Underwater Missions

Authors: Yiqiang Zhou, Yifan Chen, Zhe Sun, Jijun Lu, Ye Zheng, Xuelong Li Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16363v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

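Scoring rationale: The paper focuses on a conventional image enhancement task in computer vision: real-time color restoration and enhancement of underwater images. Its core technique is a set of lightweight convolutional modules (Adaptive Weighted Channel Compensation, Multi-branch Re-parameterized Dilated Convolution, and Statistical Global Color Adjustment) designed for efficient inference. The scoring keywords all concern large language models (LLMs), innovations in the underlying deep learning techniques, or applications of AI in science (such as bioinformatics). Underwater image enhancement is a different topic from these themes (large-model techniques, alignment, reasoning, agents, and AI-for-science applications), with no technical or conceptual overlap, so every keyword receives a relevance of 0.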

!!! tip deepseek-chat TL;DR

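The paper proposes a lightweight real-time underwater image enhancement framework built from an adaptive color compensation module, a multi-branch re-parameterized dilated convolution, and a statistical global color adjustment module. With only 3,880 inference parameters it runs at 409 FPS and reports state-of-the-art results on eight datasets, improving visual reliability for underwater missions.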

Abstract (Chinese translation)

水下图像增强对于为水下平台提供可靠的视觉信息至关重要，因为水相关环境中的强烈吸收和散射通常会导致图像质量下降。现有高性能方法往往依赖复杂架构，这阻碍了其在水下设备上的部署。轻量级方法则常为追求速度而牺牲质量，难以处理严重退化的水下图像。为应对这一局限，本文提出一种具有精准色彩复原能力的实时水下图像增强框架。首先，引入自适应加权通道补偿模块，以绿色通道为参考锚点，实现红蓝通道的动态色彩恢复。其次，设计了一种多分支重参数化空洞卷积，该结构在训练时采用多分支融合，在推理时进行结构重参数化，从而以低计算开销实现大感受野表征。最后，采用基于统计先验的统计全局色彩调整模块以优化整体色彩表现。在八个数据集上的大量实验表明，所提方法在七项评估指标上均达到最先进性能。该模型仅包含3,880个推理参量，推理速度达到409 FPS。我们的方法在多样环境条件下将UCIQE分数提升了29.7%，在ROV平台上的部署以及在下游任务中的性能提升进一步验证了其对于实时水下任务的优越性。

Abstract

Underwater image enhancement plays a crucial role in providing reliable visual information for underwater platforms, since strong absorption and scattering in water-related environments generally lead to image quality degradation. Existing high-performance methods often rely on complex architectures, which hinder deployment on underwater devices. Lightweight methods often sacrifice quality for speed and struggle to handle severely degraded underwater images. To address this limitation, we present a real-time underwater image enhancement framework with accurate color restoration. First, an Adaptive Weighted Channel Compensation module is introduced to achieve dynamic color recovery of the red and blue channels using the green channel as a reference anchor. Second, we design a Multi-branch Re-parameterized Dilated Convolution that employs multi-branch fusion during training and structural re-parameterization during inference, enabling large receptive field representation with low computational overhead. Finally, a Statistical Global Color Adjustment module is employed to optimize overall color performance based on statistical priors. Extensive experiments on eight datasets demonstrate that the proposed method achieves state-of-the-art performance across seven evaluation metrics. The model contains only 3,880 inference parameters and achieves an inference speed of 409 FPS. Our method improves the UCIQE score by 29.7% under diverse environmental conditions, and the deployment on ROV platforms and performance gains in downstream tasks further validate its superiority for real-time underwater missions.

Keywords: Underwater image enhancement, Color restoration, Real-time processing, Lightweight model, Adaptive Weighted Channel Compensation, Multi-branch Re-parameterized Dilated Convolution, Statistical Global Color Adjustment, ROV deployment
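
Structural re-parameterization of dilated branches relies on two facts: convolution is linear, and a dilated kernel equals a larger dense kernel with zeros inserted. The sketch below checks this on a toy example. It is an illustration under stated assumptions, not the paper's module: the branch sizes, the helper names `correlate2d` and `expand_dilated`, and the omission of normalization layers are all simplifications.

```python
import numpy as np

def correlate2d(image, kernel, dilation=1):
    """Plain 'valid' 2D cross-correlation with an optional dilation factor."""
    kh, kw = kernel.shape
    span_h, span_w = (kh - 1) * dilation + 1, (kw - 1) * dilation + 1
    out_h, out_w = image.shape[0] - span_h + 1, image.shape[1] - span_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * image[di:di + out_h, dj:dj + out_w]
    return out

def expand_dilated(kernel, dilation):
    """Rewrite a dilated kernel as an equivalent dense kernel with inserted zeros."""
    kh, kw = kernel.shape
    dense = np.zeros(((kh - 1) * dilation + 1, (kw - 1) * dilation + 1))
    dense[::dilation, ::dilation] = kernel
    return dense

rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12))
k_dense = rng.standard_normal((5, 5))     # hypothetical branch: dense 5x5
k_dilated = rng.standard_normal((3, 3))   # hypothetical branch: 3x3, dilation 2 (spans 5x5)

# Training-time view: two parallel branches whose outputs are summed.
two_branch = correlate2d(image, k_dense) + correlate2d(image, k_dilated, dilation=2)

# Inference-time view: one merged 5x5 kernel.
merged = k_dense + expand_dilated(k_dilated, 2)
single = correlate2d(image, merged)

print(np.allclose(two_branch, single))
```

The merged kernel produces the same output as the two branches while running a single convolution, which is the source of the inference-time saving.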

237. ❌ Learning Human-Object Interaction for 3D Human Pose Estimation from LiDAR Point Clouds

Authors: Daniel Sungho Jung, Dohee Cho, Kyoung Mu Lee Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16343v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

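Scoring rationale: The paper focuses on the computer vision task of 3D human pose estimation from LiDAR point clouds and proposes a Human-Object Interaction Learning (HOIL) framework to address spatial ambiguity and class imbalance. It involves point cloud processing, contrastive learning, and feature pooling, but does not touch large language models (LLMs), innovations in the underlying techniques (such as MoE, scaling laws, or fine-tuning methods), AI agents, reasoning methods, model optimization, or AI for Science. The keywords all relate to large models, their underlying techniques, or AI-for-science applications, whereas this is a pure 3D vision paper, so every keyword receives a relevance of 0.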

!!! tip deepseek-chat TL;DR

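The paper proposes the Human-Object Interaction Learning (HOIL) framework, which uses human-object interaction-aware contrastive learning (HOICL) and contact-aware part-guided pooling (CPPool) to address spatial ambiguity and class imbalance in 3D human pose estimation from LiDAR point clouds.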

Abstract (Chinese translation)

基于激光雷达点云理解人类行为是自动驾驶领域最核心的任务之一，因其与行人安全密切相关，但在复杂的人-物交互和杂乱背景干扰下仍面临巨大挑战。然而，现有方法大多忽视了利用人-物交互关系构建鲁棒三维人体姿态估计框架的潜力。引入人-物交互主要基于两大挑战：首先，人-物交互会导致人体与物体点云间的空间模糊性，常在交互区域引发三维人体关键点预测错误；其次，交互与非交互身体部位的点云数量存在严重类别不平衡，手、足等高频交互部位在激光雷达数据中往往观测稀疏。为应对这些挑战，我们提出一种人-物交互学习框架（Human-Object Interaction Learning, HOIL），用于从激光雷达点云实现鲁棒的三维人体姿态估计。针对空间模糊性问题，我们提出人-物交互感知对比学习（HOICL），通过增强人体与物体点云间的特征区分度（尤其在交互区域）来缓解该问题。针对类别不平衡问题，我们设计接触感知部位引导池化（CPPool），通过压缩过表征点云同时保留交互部位的信息点，实现表征能力的自适应再分配。此外，我们提出可选的基于接触关系的时序优化模块，利用时序接触线索修正单帧错误的关键点估计。实验表明，HOIL框架能有效利用人-物交互关系解决交互区域的空间模糊性与类别不平衡问题。代码将公开。

Abstract

Understanding humans from LiDAR point clouds is one of the most critical tasks in autonomous driving due to its close relationships with pedestrian safety, yet it remains challenging in the presence of diverse human-object interactions and cluttered backgrounds. Nevertheless, existing methods largely overlook the potential of leveraging human-object interactions to build robust 3D human pose estimation frameworks. There are two major challenges that motivate the incorporation of human-object interaction. First, human-object interactions introduce spatial ambiguity between human and object points, which often leads to erroneous 3D human keypoint predictions in interaction regions. Second, there exists severe class imbalance in the number of points between interacting and non-interacting body parts, with the interaction-frequent regions such as hand and foot being sparsely observed in LiDAR data. To address these challenges, we propose a Human-Object Interaction Learning (HOIL) framework for robust 3D human pose estimation from LiDAR point clouds. To mitigate the spatial ambiguity issue, we present human-object interaction-aware contrastive learning (HOICL) that effectively enhances feature discrimination between human and object points, particularly in interaction regions. To alleviate the class imbalance issue, we introduce contact-aware part-guided pooling (CPPool) that adaptively reallocates representational capacity by compressing overrepresented points while preserving informative points from interacting body parts. In addition, we present an optional contact-based temporal refinement that refines erroneous per-frame keypoint estimates using contact cues over time. As a result, our HOIL effectively leverages human-object interaction to resolve spatial ambiguity and class imbalance in interaction regions. Codes will be released.

Keywords: 3D human pose estimation, LiDAR point clouds, human-object interaction, contrastive learning, class imbalance, autonomous driving, feature discrimination, contact-aware pooling

238. ❌ Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation

Authors: Xinhao Cai, Gensheng Pei, Zeren Sun, Yazhou Yao, Fumin Shen, Wenguan Wang Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16340v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

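Scoring rationale: The paper "Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation" focuses on monocular depth estimation in computer vision and proposes a deterministic framework that combines a diffusion model with real-world priors. Although it applies deep learning (diffusion models) to a vision task, the scoring keywords all explicitly target large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, CoT, and LLM agents) or specific scientific application areas (such as bioinformatics). The paper does not involve language models, natural language processing, the technical foundations of large models, or any specific AI-for-science subfield, so every keyword receives a relevance of 0.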

!!! tip deepseek-chat TL;DR

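The paper proposes Iris, a deterministic framework that integrates real-world priors into a diffusion model. It targets two weaknesses of monocular depth estimation, namely lost detail and poor synthetic-to-real generalization, and reports significant performance gains with strong in-the-wild generalization.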

Abstract (Chinese translation)

本文提出 Iris，一种用于单目深度估计（Monocular Depth Estimation, MDE）的确定性框架，它将真实世界先验整合到扩散模型中。传统的前馈方法依赖大量训练数据，但仍会丢失细节。先前基于扩散的方法利用丰富的生成先验，但在从合成到真实场景的领域迁移上存在困难。相比之下，Iris 能够保留精细细节，在从合成到真实场景的泛化能力上表现优异，并且在有限训练数据下仍保持高效。为此，我们引入了一种两阶段的先验到几何确定性调度方法：先验阶段采用谱门控蒸馏（Spectral-Gated Distillation, SGD）来迁移低频真实先验，同时保持高频细节不受约束；几何阶段则应用谱门控一致性（Spectral-Gated Consistency, SGC）来保证高频保真度，并利用合成真值进行细化。两个阶段共享权重，并采用从高到低的时间步调度执行。大量实验结果证实，Iris 在 MDE 性能上取得了显著提升，并展现出强大的野外场景泛化能力。

Abstract

In this paper, we propose Iris, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model. Conventional feed-forward methods rely on massive training data, yet still miss details. Previous diffusion-based methods leverage rich generative priors yet struggle with synthetic-to-real domain transfer. Iris, in contrast, preserves fine details, generalizes strongly from synthetic to real scenes, and remains efficient with limited training data. To this end, we introduce a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, and the geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. The two stages share weights and are executed with a high-to-low timestep schedule. Extensive experimental results confirm that Iris achieves significant improvements in MDE performance with strong in-the-wild generalization.

Keywords: Monocular Depth Estimation, Diffusion Model, Real-World Priors, Domain Transfer, Spectral-Gated Distillation, Spectral-Gated Consistency, Deterministic Framework, In-the-wild Generalization
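
The abstract does not say how the spectral gate is built, so the following is only one plausible reading: a binary low-pass mask in the 2D Fourier domain that splits a depth map into low-frequency and high-frequency bands. The function `low_pass`, the radial cutoff of 0.1 cycles per pixel, and the random test map are assumptions for illustration. The sketch shows why a loss computed on the low band leaves high-frequency detail unconstrained.

```python
import numpy as np

def low_pass(image, cutoff):
    """Keep only spatial frequencies whose radius (cycles per pixel) is at most `cutoff`."""
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    gate = (np.sqrt(fy**2 + fx**2) <= cutoff).astype(float)   # binary spectral gate
    return np.fft.ifft2(np.fft.fft2(image) * gate).real

rng = np.random.default_rng(0)
depth = rng.standard_normal((32, 32))       # hypothetical depth map
low = low_pass(depth, 0.1)
high = depth - low                          # complementary high-frequency band

# Changing only the high band does not move a loss measured on the low band.
perturbed = low + 2.0 * high
low_band_loss = np.mean((low_pass(perturbed, 0.1) - low) ** 2)
print(low_band_loss < 1e-20)
```

A complementary loss on `high` would constrain fine detail instead, which matches the division of labor the abstract describes between its two stages.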

239. ❌ PKINet-v2: Towards Powerful and Efficient Poly-Kernel Remote Sensing Object Detection

Authors: Xinhao Cai, Liulei Li, Gensheng Pei, Zeren Sun, Yazhou Yao, Wenguan Wang Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16341v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

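Scoring rationale: The paper focuses on convolutional network architecture design for object detection in remote sensing images. It proposes the PKINet-v2 backbone, which improves detection accuracy and efficiency through a multi-scope receptive field and heterogeneous kernel re-parameterization. The content is entirely about optimizing convolutional architectures in computer vision and does not involve large language models, the other large-model techniques named in the scoring keywords, or AI for Science applications. None of the keywords is relevant to the paper's topic.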

!!! tip deepseek-chat TL;DR

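Objects in remote sensing images vary widely in shape and size, which makes detection difficult. The paper proposes the PKINet-v2 backbone, which combines anisotropic and isotropic convolution kernels to build a multi-scope receptive field and uses a heterogeneous kernel re-parameterization strategy for efficient deployment. It reports state-of-the-art accuracy on four benchmarks and a 3.9× FPS speedup over PKINet-v1.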

Abstract (Chinese translation)

遥感图像中的目标检测面临几何与空间复杂性共存的挑战：目标可能以多样长宽比出现，同时在多变场景下其尺寸跨度极大。现有遥感图像骨干网络分别应对这两项挑战：或采用各向异性条带卷积核建模细长目标，或使用各向同性大卷积核捕获广阔上下文。然而，这种孤立处理方式导致互补性缺陷：纯条带设计会破坏规则形状目标的空间连贯性并弱化微小细节，而各向同性大卷积核常为细长结构引入严重背景噪声与几何失配。本文扩展PKINet，提出一种强大高效的骨干网络——多核初始网络第二版（Poly Kernel Inception Network v2，简称PKINet-v2），在统一范式中协同应对双重挑战。PKINet-v2将各向异性轴向条带卷积与各向同性方形卷积核有机结合，构建多尺度感受野，在保持细粒度局部纹理的同时，渐进聚合跨尺度的长程上下文信息。为实现高效部署，我们进一步提出异构核重参数化策略（Heterogeneous Kernel Re-parameterization，HKR），将训练阶段所有异构分支融合为单一深度可分离卷积进行推理，在保持精度的同时消除碎片化卷积核启动开销。在DOTA-v1.0、DOTA-v1.5、HRSC2016和DIOR-R四个广泛使用的基准数据集上的大量实验表明，PKINet-v2在实现最优精度的同时，相比PKINet-v1获得$\textbf{3.9}$倍的帧率加速，在效能与效率上均超越现有遥感骨干网络。

Abstract

Object detection in remote sensing images (RSIs) is challenged by the coexistence of geometric and spatial complexity: targets may appear with diverse aspect ratios, while spanning a wide range of object sizes under varied contexts. Existing RSI backbones address the two challenges separately, either by adopting anisotropic strip kernels to model slender targets or by using isotropic large kernels to capture broader context. However, such isolated treatments lead to complementary drawbacks: the strip-only design can disrupt spatial coherence for regular-shaped objects and weaken tiny details, whereas isotropic large kernels often introduce severe background noise and geometric mismatch for slender structures. In this paper, we extend PKINet, and present a powerful and efficient backbone that jointly handles both challenges within a unified paradigm named Poly Kernel Inception Network v2 (PKINet-v2). PKINet-v2 synergizes anisotropic axial-strip convolutions with isotropic square kernels and builds a multi-scope receptive field, preserving fine-grained local textures while progressively aggregating long-range context across scales. To enable efficient deployment, we further introduce a Heterogeneous Kernel Re-parameterization (HKR) Strategy that fuses all heterogeneous branches into a single depth-wise convolution for inference, eliminating fragmented kernel launches without accuracy loss. Extensive experiments on four widely-used benchmarks, including DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R, demonstrate that PKINet-v2 achieves state-of-the-art accuracy while delivering a $\textbf{3.9}\times$ FPS acceleration compared to PKINet-v1, surpassing previous remote sensing backbones in both effectiveness and efficiency.

Keywords: remote sensing object detection, backbone network, poly kernel inception, anisotropic convolution, isotropic convolution, heterogeneous kernel re-parameterization, multi-scale receptive field, efficient deployment
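
The abstract states that HKR fuses all heterogeneous branches into a single depth-wise convolution for inference. The paper's exact procedure is not given here, so the following is only a minimal sketch of the general idea behind structural re-parameterization, assuming stride 1, zero padding, odd kernel sizes, and one channel: because convolution is linear, strip and square kernels can be zero-padded to a common size and summed into one kernel. The kernel values, the image, and all function names are illustrative.

```python
def conv2d_same(image, kernel):
    """Single-channel cross-correlation with zero padding and stride 1."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, x = i + a - ph, j + b - pw
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[a][b]
            out[i][j] = acc
    return out


def embed_centered(kernel, size):
    """Zero-pad an odd-sized kernel to size x size, keeping it centered."""
    kh, kw = len(kernel), len(kernel[0])
    top, left = (size - kh) // 2, (size - kw) // 2
    big = [[0.0] * size for _ in range(size)]
    for a in range(kh):
        for b in range(kw):
            big[top + a][left + b] = kernel[a][b]
    return big


def merge_branches(kernels, size):
    """Sum all branch kernels after padding them to a common size."""
    merged = [[0.0] * size for _ in range(size)]
    for kernel in kernels:
        padded = embed_centered(kernel, size)
        for a in range(size):
            for b in range(size):
                merged[a][b] += padded[a][b]
    return merged


# Illustrative branches: a 3x3 square kernel, a 1x5 strip, and a 5x1 strip.
square = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
h_strip = [[0.5, 1.0, 1.5, 1.0, 0.5]]
v_strip = [[-1.0], [0.0], [2.0], [0.0], [-1.0]]
branches = [square, h_strip, v_strip]

image = [[float((3 * i + 5 * j) % 7) for j in range(6)] for i in range(6)]

# Training-time view: run every branch and add the outputs.
branch_sum = [[0.0] * 6 for _ in range(6)]
for kernel in branches:
    out = conv2d_same(image, kernel)
    for i in range(6):
        for j in range(6):
            branch_sum[i][j] += out[i][j]

# Inference-time view: one convolution with the merged 5x5 kernel.
fused = conv2d_same(image, merge_branches(branches, 5))

max_diff = max(abs(branch_sum[i][j] - fused[i][j]) for i in range(6) for j in range(6))
```

The merged kernel reproduces the summed branch outputs, which is why a multi-branch block of this kind can be collapsed to a single kernel launch per channel. Normalization layers or non-linearities between branches would need extra handling that this sketch leaves out.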

240. ❌ SpikeCLR: Contrastive Self-Supervised Learning for Few-Shot Event-Based Vision using Spiking Neural Networks

Authors: Maxime Vaillant, Axel Carlier, Lai Xing Ng, Christophe Hurter, Benoit R. Cottereau Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16338v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

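Rationale: The paper studies SpikeCLR, a self-supervised learning framework for event-camera vision with spiking neural networks (SNNs). It belongs to computer vision and neuromorphic computing and is unrelated to the vast majority of the large-model and deep-learning keywords. Only 'Pre-training' and 'Post-training' are somewhat relevant (5 points each), because the paper involves self-supervised pre-training followed by fine-tuning, though not for large language models. All other keywords are irrelevant (0 points).

!!! tip deepseek-chat TL;DR

To address the scarcity of labeled data in event-camera vision, the paper proposes SpikeCLR, a self-supervised learning framework that combines event-specific data augmentations with spiking neural network training and outperforms supervised learning in few-shot and semi-supervised settings.

Abstract (Chinese translation)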

基于事件的视觉传感器为高速感知提供了显著优势，包括微秒级时间分辨率、高动态范围和低功耗。当与脉冲神经网络（SNNs）结合时，它们可部署在神经形态硬件上，实现在嵌入式系统上的高能效应用。然而，这种潜力因缺乏有效训练此类模型所需的大规模标注数据集而受到严重限制。在本研究中，我们提出了SpikeCLR，一种对比自监督学习框架，使SNNs能够从未标注的事件数据中学习鲁棒的视觉表征。我们通过代理梯度训练将已有的基于帧的方法适配到脉冲领域，并引入了一套利用空间、时间和极性变换的事件数据专用增强方法。通过在CIFAR10-DVS、N-Caltech101、N-MNIST和DVS-Gesture基准数据集上的大量实验，我们证明自监督预训练结合后续微调在低数据条件下优于监督学习，在少样本和半监督设置中取得了一致的性能提升。我们的消融研究表明，结合空间和时间增强对于学习事件数据中有效的时空不变性至关重要。我们进一步表明，学习到的表征能够跨数据集迁移，这有助于在标签稀缺环境下开发强大的基于事件的模型。

Abstract

Event-based vision sensors provide significant advantages for high-speed perception, including microsecond temporal resolution, high dynamic range, and low power consumption. When combined with Spiking Neural Networks (SNNs), they can be deployed on neuromorphic hardware, enabling energy-efficient applications on embedded systems. However, this potential is severely limited by the scarcity of large-scale labeled datasets required to effectively train such models. In this work, we introduce SpikeCLR, a contrastive self-supervised learning framework that enables SNNs to learn robust visual representations from unlabeled event data. We adapt prior frame-based methods to the spiking domain using surrogate gradient training and introduce a suite of event-specific augmentations that leverage spatial, temporal, and polarity transformations. Through extensive experiments on CIFAR10-DVS, N-Caltech101, N-MNIST, and DVS-Gesture benchmarks, we demonstrate that self-supervised pretraining with subsequent fine-tuning outperforms supervised learning in low-data regimes, achieving consistent gains in few-shot and semi-supervised settings. Our ablation studies reveal that combining spatial and temporal augmentations is critical for learning effective spatio-temporal invariances in event data. We further show that learned representations transfer across datasets, contributing to efforts for powerful event-based models in label-scarce settings.

Keywords: Spiking Neural Networks, Event-based Vision, Self-supervised Learning, Contrastive Learning, Few-shot Learning, Neuromorphic Computing, Data Augmentation, Transfer Learning
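
The abstract mentions event-specific augmentations that use spatial, temporal, and polarity transformations, without listing them. As a rough illustration only, the sketch below assumes events are stored as `(x, y, t, p)` tuples and picks one simple transformation of each kind; the function names, the 50% application probability, and the half-length temporal crop are assumptions, not the paper's settings.

```python
import random

# An event is (x, y, t, p): pixel coordinates, timestamp, polarity in {0, 1}.


def horizontal_flip(events, width):
    """Spatial augmentation: mirror x coordinates across the sensor width."""
    return [(width - 1 - x, y, t, p) for (x, y, t, p) in events]


def temporal_crop(events, start, duration):
    """Temporal augmentation: keep events in [start, start + duration), re-zero time."""
    return [(x, y, t - start, p) for (x, y, t, p) in events
            if start <= t < start + duration]


def polarity_flip(events):
    """Polarity augmentation: swap ON and OFF events."""
    return [(x, y, t, 1 - p) for (x, y, t, p) in events]


def random_view(events, width, t_max, rng):
    """Compose the three augmentations with random parameters into one view."""
    view = events
    if rng.random() < 0.5:
        view = horizontal_flip(view, width)
    duration = t_max // 2
    start = rng.randrange(0, t_max - duration + 1)
    view = temporal_crop(view, start, duration)
    if rng.random() < 0.5:
        view = polarity_flip(view)
    return view


rng = random.Random(0)
events = [(rng.randrange(32), rng.randrange(32), t, rng.randrange(2))
          for t in range(0, 1000, 10)]

# Two independently augmented views of the same recording form a positive pair.
view_a = random_view(events, width=32, t_max=1000, rng=rng)
view_b = random_view(events, width=32, t_max=1000, rng=rng)
```

In a contrastive setup, the two views would be encoded by the network and pulled together in feature space, while views of other recordings in the batch act as negatives.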

241. ❌ DriveFix: Spatio-Temporally Coherent Driving Scene Restoration

Authors: Heyu Si, Brandon James Denis, Muyang Sun, Dragos Datcu, Yaoru Li, Xin Jin, Ruiju Fu, Yuliia Tatarinova, Federico Landi, Jie Song, Mingli Song, Qi Guo Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16306v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

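Rationale: DriveFix focuses on 4D scene reconstruction and restoration for autonomous driving, using diffusion priors and a transformer architecture to restore driving scenes with spatio-temporal consistency. The keywords concern large language model (LLM) techniques, training methods, inference optimization, alignment, agent systems, and similar topics, whereas this paper studies multi-view geometry and generative models in computer vision and autonomous driving and involves no language-model techniques. The only marginally relevant keyword is 'World Models AND General World Models', because the paper mentions 'robust 4D world modeling'; this refers to 4D modeling of the physical world (3D space plus time) rather than the general world model concept in AI, hence 5 points for a weak connection. All other keywords are irrelevant.

!!! tip deepseek-chat TL;DR

The paper proposes DriveFix, a framework that uses an interleaved diffusion transformer architecture to resolve spatio-temporal inconsistency in multi-view restoration of autonomous driving scenes, achieving state-of-the-art 4D scene reconstruction and novel view synthesis on multiple datasets.

Abstract (Chinese translation)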

近期，4D场景重建领域——尤其是利用扩散先验的方法——在自动驾驶的新视角合成方面展现出潜力。然而，这些方法通常独立或以逐视角方式处理帧序列，导致严重缺乏时空协同性，从而引发跨摄像头的空间错位与序列中的时间漂移。我们提出DriveFix，一种新颖的多视角修复框架，旨在确保驾驶场景的时空连贯性。该方法采用交错式扩散Transformer架构，配备专门模块以显式建模时间依赖性与跨摄像头空间一致性。通过将生成过程以历史上下文为条件，并结合几何感知的训练损失，DriveFix确保修复后的视角遵循统一的3D几何结构，从而实现高保真纹理的一致性传播，并显著减少伪影。在Waymo、nuScenes和PandaSet数据集上的大量评估表明，DriveFix在重建与新视角合成任务上均达到最先进的性能，标志着面向实际部署的鲁棒4D世界建模迈出了重要一步。

Abstract

Recent advancements in 4D scene reconstruction, particularly those leveraging diffusion priors, have shown promise for novel view synthesis in autonomous driving. However, these methods often process frames independently or in a view-by-view manner, leading to a critical lack of spatio-temporal synergy. This results in spatial misalignment across cameras and temporal drift in sequences. We propose DriveFix, a novel multi-view restoration framework that ensures spatio-temporal coherence for driving scenes. Our approach employs an interleaved diffusion transformer architecture with specialized blocks to explicitly model both temporal dependencies and cross-camera spatial consistency. By conditioning the generation on historical context and integrating geometry-aware training losses, DriveFix enforces that the restored views adhere to a unified 3D geometry. This enables the consistent propagation of high-fidelity textures and significantly reduces artifacts. Extensive evaluations on the Waymo, nuScenes, and PandaSet datasets demonstrate that DriveFix achieves state-of-the-art performance in both reconstruction and novel view synthesis, marking a substantial step toward robust 4D world modeling for real-world deployment.

Keywords: 4D scene reconstruction, autonomous driving, spatio-temporal coherence, multi-view restoration, diffusion transformer, novel view synthesis, world modeling, driving scenes

242. ❌ Persistent Story World Simulation with Continuous Character Customization

Authors: Jinlu Zhang, Qiyun Wang, Baoxiang Du, Jiayi Ji, Jing He, Rongsheng Zhang, Tangjie Lv, Xiaoshuai Sun, Rongrong Ji Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16285v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

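Rationale: The paper studies character customization in story visualization and proposes the EverTale system. Keyword relevance: 1) 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' is highly relevant (10 points), because the paper explicitly uses a LoRA module for character adaptation; 2) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' is highly relevant (10 points), because the paper uses MLLM-as-Judge with chain-of-thought reasoning to assess character fidelity; 3) the remaining keywords are neither mentioned in nor relevant to the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper tackles the difficulty of jointly achieving character customization, semantic alignment, and continuous integration of new identities in story visualization. It proposes EverTale, which combines a unified LoRA module, a quality gate based on chain-of-thought reasoning, and a character-aware sampling strategy, and reports superior performance on single- and multi-character story visualization.

Abstract (Chinese translation)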

故事可视化在计算机视觉领域日益受到关注。然而，现有方法往往难以在精准角色定制、语义对齐与新身份持续集成之间实现协同。为应对这一挑战，本文提出EverTale——一个面向连续故事角色定制的故事世界模拟器。我们首先提出一体化角色集成器，通过统一的LoRA模块实现连续角色适配，无需传统方法中针对每个角色的独立优化模块。随后，我们通过MLLM-as-Judge构建角色质量评估门控机制，利用思维链推理确保每次角色适配过程的保真度，从而判定模型应进入下一角色定制阶段或需对当前角色进行补充训练。此外，我们提出角色感知区域聚焦采样策略，以解决现有多角色视觉叙事中的身份退化与布局冲突问题，通过高效协调局部角色细节与全局场景语境，实现自然的多角色生成。实验结果表明，在单角色与多角色故事可视化任务中，EverTale相较于更广泛的对比方法均展现出优越性能。代码将公开提供。

Abstract

Story visualization has gained increasing attention in computer vision. However, current methods often fail to achieve a synergy between accurate character customization, semantic alignment, and continuous integration of new identities. To tackle this challenge, in this paper we present EverTale, a story world simulator for continuous story character customization. We first propose an All-in-One-World Character Integrator to achieve continuous character adaptation within unified LoRA module, eliminating the need for per-character optimization modules of previous methods. Then, we incorporate a Character Quality Gate via MLLM-as-Judge to ensure the fidelity of each character adaptation process through chain-of-thought reasoning, determining whether the model can proceed to the next character or require additional training on the current one. We also introduce a Character-Aware Region-Focus Sampling strategy to address the identity degradation and layout conflicts in existing multi-character visual storytelling, ensuring natural multi-character generation by harmonizing local character-specific details with global scene context with higher efficiency. Experimental results show that our EverTale achieves superior performance against a wider range of compared methods on both single- and multi-character story visualization. Codes will be available.

Keywords: Story Visualization, Character Customization, LoRA, Chain-of-Thought Reasoning, Multi-character Generation, MLLM-as-Judge, Continuous Adaptation, EverTale
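
The Character Quality Gate is described as deciding whether the model can proceed to the next character or needs additional training on the current one. The control flow this implies can be sketched as below. This is not EverTale's implementation: `train_round` and `judge` are hypothetical callables standing in for one round of adapter training and for the MLLM-as-Judge fidelity score, and the threshold and round limit are made-up values.

```python
def customize_characters(characters, train_round, judge, threshold=0.8, max_rounds=5):
    """Adapt characters one at a time; advance only when the judge passes the current one."""
    log = []
    for name in characters:
        rounds = 0
        passed = False
        while rounds < max_rounds and not passed:
            train_round(name)                   # hypothetical: one more training round
            rounds += 1
            passed = judge(name) >= threshold   # hypothetical: fidelity score in [0, 1]
        log.append((name, rounds, passed))
    return log


# Stub trainer and judge so the loop can be run end to end.
progress = {}


def fake_train_round(name):
    progress[name] = progress.get(name, 0) + 1


def fake_judge(name):
    rounds_needed = {"knight": 1, "dragon": 3}
    return 1.0 if progress[name] >= rounds_needed[name] else 0.4


log = customize_characters(["knight", "dragon"], fake_train_round, fake_judge)
```

The round limit is an assumption added so the loop always terminates; the abstract does not say how the system handles a character that never passes the gate.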

243. ❌ Micro-AU CLIP: Fine-Grained Contrastive Learning from Local Independence to Global Dependency for Micro-Expression Action Unit Detection

Authors: Jinsheng Wei, Fengzhou Guo, Yante Li, Haoyu Chen, Guanming Lu, Guoying Zhao Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16302v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

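Rationale: The paper studies micro-expression action unit detection and proposes Micro-AU CLIP, a framework that uses contrastive learning for fine-grained alignment of visual and text features. Its content is mainly computer vision, affective computing, and fine-grained feature learning, and is unrelated to the vast majority of the large-model and deep-learning keywords (LLM, MoE, RLHF, RAG, etc.). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': micro-expression analysis lies at the intersection of behavioral science and psychology and can be viewed as an application of AI to science (behavioral science), but it is not core bioinformatics or cheminformatics, hence 5 points (some relevance). No other keyword is mentioned in the title or abstract or bears directly on the paper's topic.

!!! tip deepseek-chat TL;DR

To address the insufficient perception of local regions in existing micro-expression action unit detection methods, the paper proposes Micro-AU CLIP, which learns fine-grained features through local semantic independence modeling and global semantic dependency modeling and achieves state-of-the-art detection performance.

Abstract (Chinese translation)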

微表情动作单元（Micro-AU）为细粒度真实情绪分析提供了客观线索。现有大多数Micro-AU检测方法从整体面部图像/视频中学习动作单元特征，这与动作单元固有的局部性相冲突，导致对动作单元区域的感知不足。实际上，每个动作单元独立对应于特定的局部面部肌肉运动（局部独立性），而在特定情绪状态下，某些动作单元之间存在固有的依赖关系（全局依赖性）。因此，本文探索了从独立性到依赖性模式的有效性，并提出了一种新颖的Micro-AU检测框架Micro-AU CLIP，该框架将动作单元检测过程分解为局部语义独立性（LSI）建模和全局语义依赖性（GSD）建模。在LSI中，设计了Patch Token Attention（PTA），将动作单元区域内的多个局部特征映射到同一特征空间；在GSD中，提出了Global Dependency Attention（GDA）和Global Dependency Loss（GDLoss），以建模不同动作单元之间的全局依赖关系，从而增强每个动作单元的特征。此外，考虑到CLIP在微语义对齐方面的固有局限性，设计了Micro-AU对比损失（MiAUCL），通过视觉特征与文本特征的细粒度对齐来学习动作单元特征。同时，Micro-AU CLIP以无情绪标签的方式有效应用于微表情识别。实验结果表明，Micro-AU CLIP能够充分学习细粒度的Micro-AU特征，实现了最先进的性能。

Abstract

Micro-expression (ME) action units (Micro-AUs) provide objective clues for fine-grained genuine emotion analysis. Most existing Micro-AU detection methods learn AU features from the whole facial image/video, which conflicts with the inherent locality of AU, resulting in insufficient perception of AU regions. In fact, each AU independently corresponds to specific localized facial muscle movements (local independence), while there is an inherent dependency between some AUs under specific emotional states (global dependency). Thus, this paper explores the effectiveness of the independence-to-dependency pattern and proposes a novel micro-AU detection framework, micro-AU CLIP, that uniquely decomposes the AU detection process into local semantic independence modeling (LSI) and global semantic dependency (GSD) modeling. In LSI, Patch Token Attention (PTA) is designed, mapping several local features within the AU region to the same feature space; In GSD, Global Dependency Attention (GDA) and Global Dependency Loss (GDLoss) are presented to model the global dependency relationships between different AUs, thereby enhancing each AU feature. Furthermore, considering CLIP’s native limitations in micro-semantic alignment, a microAU contrastive loss (MiAUCL) is designed to learn AU features by a fine-grained alignment of visual and text features. Also, Micro-AU CLIP is effectively applied to ME recognition in an emotion-label-free way. The experimental results demonstrate that Micro-AU CLIP can fully learn fine-grained micro-AU features, achieving state-of-the-art performance.

Keywords: Micro-expression, Action Unit Detection, Contrastive Learning, Fine-grained Alignment, Local Independence, Global Dependency, CLIP, Micro-AU CLIP
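
The abstract describes MiAUCL only as a contrastive loss that aligns visual and text features at a fine-grained level; its exact form is not given here. For orientation, the sketch below implements the standard CLIP-style symmetric contrastive (InfoNCE) loss, not MiAUCL itself, assuming the i-th visual feature is paired with the i-th text feature. The feature values and the temperature are illustrative.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_z - logits[target]


def clip_style_loss(visual, text, temperature=0.07):
    """Symmetric InfoNCE: visual feature i should match text feature i, and vice versa."""
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in text]
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    n = len(v)
    visual_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_visual = sum(cross_entropy([sim[i][j] for i in range(n)], j)
                         for j in range(n)) / n
    return 0.5 * (visual_to_text + text_to_visual)


visual_features = [[1.0, 0.0], [0.0, 1.0]]
text_features = [[1.0, 0.1], [0.1, 1.0]]

aligned_loss = clip_style_loss(visual_features, text_features)
shuffled_loss = clip_style_loss(visual_features, list(reversed(text_features)))
```

Correctly paired features give a much lower loss than shuffled pairs, which is the signal a contrastive objective of this family optimizes.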

244. ❌ Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation

Authors: TianTian Dang, Chao Bi, Shufan Shen, Jinzhe Liu, Qingming Huang, Shuhui Wang Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16284v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

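Rationale: The paper focuses on hallucination mitigation for Large Vision-Language Models (LVLMs) and proposes an attribution-guided sparse feature steering framework. Keyword relevance: 1) 'Large Language Models', 8 points: the paper explicitly studies LVLMs, which fall under large models, but its emphasis is the vision-language intersection; 2) 'Hallucination Mitigation', 10 points: this is the paper's core research problem, and it proposes a new method aimed directly at mitigating hallucination; 3) 'Mechanistic Interpretability', 5 points: the paper uses a causal-intervention-based attribution method to quantify each layer's hallucination relevance, which touches on model interpretability but is not the main focus; 4) other keywords (MoE, SFT, RAG, etc.) have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address hallucination in large vision-language models, the paper proposes Locate-Then-Sparsify, a plug-and-play framework that uses an attribution method to quantify each layer's relevance to hallucination and applies layerwise feature steering accordingly, mitigating hallucination while preserving model performance.

Abstract (Chinese translation)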

尽管大规模视觉语言模型（LVLMs）已取得显著进展，但其生成幻觉的倾向削弱了可靠性并限制了更广泛的实际应用。在缓解幻觉的方法中，特征导向成为一种有前景的途径，它能在不增加推理成本的情况下减少LVLMs的错误输出。然而，现有方法在所有层中采用统一的特征导向策略。这种启发式方法忽略了层间差异，可能干扰与幻觉无关的层，最终导致通用任务性能下降。本文提出一种即插即用框架——定位后稀疏化特征导向（LTS-FS），该框架根据每层的幻觉相关性动态调整导向强度。我们首先构建了一个包含词元级和句子级幻觉案例的合成数据集。基于此数据集，我们引入一种基于因果干预的归因方法来量化每层的幻觉相关性。借助各层的归因分数，我们提出分层策略，将这些分数转化为各层的特征导向强度，从而实现对幻觉相关层的精准调控。在多个LVLMs和基准测试上的广泛实验表明，我们的LTS-FS框架在有效缓解幻觉的同时，保持了模型的强劲性能。

Abstract

Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which controls the steering intensity according to the hallucination relevance of each layer. We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucination while preserving strong performance.

Keywords: Large Vision-Language Models, Hallucination Mitigation, Feature Steering, Attribution Method, Causal Intervention, Layer-wise Strategy, Plug-and-play Framework
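
The layerwise strategy converts per-layer attribution scores into per-layer steering intensities, but the abstract does not give the conversion rule. One plausible reading is sketched below under explicit assumptions: keep only the highest-scoring layers (the "sparsify" step), scale their scores linearly up to a maximum strength, and add the scaled steering direction to the hidden state. The top-k rule, the linear scaling, and every name and number are illustrative, not the LTS-FS method.

```python
def steering_intensities(attribution, keep, max_strength):
    """Keep the `keep` highest-scoring layers and scale their scores to [0, max_strength]."""
    ranked = sorted(range(len(attribution)), key=lambda i: attribution[i], reverse=True)
    selected = set(ranked[:keep])
    top = max(attribution[i] for i in selected)
    if top <= 0:
        return [0.0] * len(attribution)
    return [max_strength * max(attribution[i], 0.0) / top if i in selected else 0.0
            for i in range(len(attribution))]


def steer(hidden, direction, alpha):
    """Shift a hidden state along a steering direction with intensity alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical attribution scores for a 6-layer model.
scores = [0.02, 0.10, 0.45, 0.90, 0.30, 0.05]
alphas = steering_intensities(scores, keep=2, max_strength=4.0)

# Layers with zero intensity are left untouched; selected layers are shifted.
untouched = steer([1.0, 2.0], [0.5, -0.5], alphas[0])
shifted = steer([1.0, 2.0], [0.5, -0.5], alphas[3])
```

Setting the intensity to zero for low-attribution layers reflects the abstract's motivation that uniform steering may disrupt layers unrelated to hallucination.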

245. ❌ VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

Authors: Tengjiao Yin, Jinglei Shi, Heng Guo, Xi Wang Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16271v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

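Rationale: The paper focuses on aligning video diffusion models for geometric consistency and is unrelated to most keywords. Only 'Post-training OR Supervised Fine-tuning OR SFT' and 'Instruction Tuning OR Alignment OR Value Alignment' are somewhat relevant (5 points each), because the paper mentions post-training alignment via SFT or reinforcement learning, but its core is a geometric reward model rather than a general alignment technique. No other keyword is involved.

!!! tip deepseek-chat TL;DR

Video diffusion models lack geometric supervision, which leads to inconsistency artifacts in generated videos. The paper proposes a geometry-based reward model and sampling strategy and aligns models through post-training or inference-time optimization; experiments indicate that the approach improves the geometric consistency of generated videos.

Abstract (Chinese translation)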

视频扩散模型在训练过程中缺乏显式的几何监督，导致生成视频中出现物体形变、空间漂移和深度违例等不一致性伪影。为应对这一局限，我们提出一种基于几何的奖励模型，该模型利用预训练的几何基础模型，通过跨帧重投影误差评估多视角一致性。与先前在像素空间测量不一致性的几何度量方法（其中像素强度可能引入额外噪声）不同，我们的方法以逐点方式进行误差计算，从而产生更具物理基础且更鲁棒的误差度量。此外，我们引入一种几何感知的采样策略，过滤低纹理和非语义区域，将评估聚焦于具有可靠对应关系的几何意义区域，以提升鲁棒性。我们将此奖励模型通过两种互补路径应用于视频扩散模型的对齐：通过监督微调或强化学习对双向模型进行训练后优化，以及通过测试时缩放（以我们的奖励作为路径验证器）对因果视频模型（如流式视频生成器）进行推理时优化。实验结果验证了我们设计的有效性，表明与其他变体相比，我们基于几何的奖励模型提供了更优越的鲁棒性。通过实现高效的推理时缩放，我们的方法为增强开源视频模型提供了一种实用解决方案，无需为重新训练投入大量计算资源。

Abstract

Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or Reinforcement Learning and inference-time optimization of a Causal Video Model (e.g., Streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.

Keywords: video diffusion models, geometric consistency, reward model, multi-view consistency, post-training, inference-time optimization, cross-frame reprojection, geometry-aware sampling
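
The reward is built on a cross-frame reprojection error computed pointwise rather than in pixel space. A minimal sketch of what a pointwise error can look like follows, assuming matched 3D points for two frames and a known rigid transform between their camera frames, all of which a geometric foundation model would have to supply. The mean-distance error, the exponential mapping to a reward, and the function names are assumptions rather than the paper's definitions.

```python
import math


def transform(point, rotation, translation):
    """Apply the rigid transform x' = R x + t to one 3D point."""
    return tuple(sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
                 for r in range(3))


def pointwise_reprojection_error(points_a, points_b, rotation, translation):
    """Mean 3D distance between frame-A points mapped into frame B and their matches."""
    total = 0.0
    for point_a, point_b in zip(points_a, points_b):
        total += math.dist(transform(point_a, rotation, translation), point_b)
    return total / len(points_a)


def geometry_reward(error, scale=1.0):
    """Map an error in [0, inf) to a reward in (0, 1]; higher means more consistent."""
    return math.exp(-error / scale)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = (0.0, 0.0, 1.0)  # the camera frame moves 1 unit along z between frames

points_a = [(0.0, 0.0, 2.0), (1.0, 0.0, 3.0)]
consistent_b = [(0.0, 0.0, 3.0), (1.0, 0.0, 4.0)]
drifting_b = [(0.0, 0.0, 3.0), (1.0, 0.0, 4.5)]

consistent_error = pointwise_reprojection_error(points_a, consistent_b, identity, shift)
drifting_error = pointwise_reprojection_error(points_a, drifting_b, identity, shift)
```

A geometrically consistent pair of frames yields zero error and a reward of 1, while spatial drift in the second frame raises the error and lowers the reward. The geometry-aware sampling step described in the abstract would decide which points enter this average.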

246. ❌ FG-SGL: Fine-Grained Semantic Guidance Learning via Motion Process Decomposition for Micro-Gesture Recognition

Authors: Jinsheng Wei, Zhaodi Xu, Guanming Lu, Haoyu Chen, Jingjie Yan Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16269v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

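Rationale: The paper "FG-SGL: Fine-Grained Semantic Guidance Learning via Motion Process Decomposition for Micro-Gesture Recognition" focuses on micro-gesture recognition (MGR), a computer vision task. It proposes a framework that combines fine-grained and category-level semantic guidance and builds a fine-grained textual dataset. Although the paper mentions "vision-language models", these are vision-language models such as CLIP rather than large language models (LLMs). The scoring keywords concern large language models, deep-learning techniques, or AI for Science (e.g., bioinformatics), whereas the core of this paper is action recognition in computer vision, with no large-model techniques, deep-learning methodological innovations, or scientific-domain applications; all keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

To address the subtle inter-class differences in micro-gesture recognition, the paper proposes FG-SGL, a framework that combines fine-grained and category-level semantic guidance and improves recognition performance by building a fine-grained textual dataset and using a multi-level contrastive optimization strategy.

Abstract (Chinese translation)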

微手势识别（MGR）由于类别间差异细微而极具挑战性。现有方法依赖于类别级监督，这不足以捕捉细微且局部化的运动差异。因此，本文提出一种细粒度语义引导学习（Fine-Grained Semantic Guidance Learning, FG-SGL）框架，该框架联合集成细粒度与类别级语义，以引导视觉-语言模型感知局部微手势运动。其中，FG-SA采用细粒度语义线索来引导局部运动特征的学习，而CP-A则通过类别级语义引导提升微手势特征的可分离性。为支持细粒度语义引导，本研究构建了一个人工标注的细粒度文本数据集，该数据集从四个精炼的语义维度描述微手势的动态过程。此外，设计了一种多层级对比优化策略，以从粗到细的模式联合优化两个模块。实验表明，FG-SGL取得了具有竞争力的性能，验证了细粒度语义引导对微手势识别的有效性。

Abstract

Micro-gesture recognition (MGR) is challenging due to subtle inter-class variations. Existing methods rely on category-level supervision, which is insufficient for capturing subtle and localized motion differences. Thus, this paper proposes a Fine-Grained Semantic Guidance Learning (FG-SGL) framework that jointly integrates fine-grained and category-level semantics to guide vision–language models in perceiving local MG motions. FG-SA adopts fine-grained semantic cues to guide the learning of local motion features, while CP-A enhances the separability of MG features through category-level semantic guidance. To support fine-grained semantic guidance, this work constructs a fine-grained textual dataset with human annotations that describes the dynamic process of MGs in four refined semantic dimensions. Furthermore, a Multi-Level Contrastive Optimization strategy is designed to jointly optimize both modules in a coarse-to-fine pattern. Experiments show that FG-SGL achieves competitive performance, validating the effectiveness of fine-grained semantic guidance for MGR.

Keywords: Micro-gesture recognition, Fine-grained semantic guidance, Vision-language models, Motion process decomposition, Multi-level contrastive optimization, Fine-grained textual dataset, Local motion features, Category-level semantics

247. ❌ When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

Authors: Xiaokun Sun, Yubo Wang, Haoyu Cao, Linli Xu Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16256v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

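Scoring rationale: The paper studies reasoning in multimodal large language models (MLLMs) for video question answering. It centers on Chain-of-Thought reasoning (highly relevant), hallucination mitigation (highly relevant), and large language model foundations (highly relevant). It proposes the FrameRepeat framework, which strengthens visual cues through a repeat scoring module and a training strategy; this touches on self-improvement and interpretability (moderately relevant). Other keywords such as MoE, quantization, and RAG are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper addresses the drop in reasoning performance that multimodal large language models suffer in video question answering because of visual anchor drifting. It proposes the FrameRepeat framework, which automatically identifies and repeats key frames to strengthen visual cues, mitigating hallucination and improving model performance.

Abstract translation

Recently, multimodal large language models have shown significant potential on complex visual tasks by integrating chain-of-thought reasoning. In video question answering, however, a longer thinking process does not always improve performance and can even degrade it through "visual anchor drifting": the model comes to rely on its own generated text, neglects the visual input, and hallucinates. Existing mitigations usually introduce specific mechanisms that make the model re-attend to the visual input during inference, but they tend to require high training cost and generalize poorly across architectures. To address this, we propose FrameRepeat, an automated enhancement framework built around a lightweight repeat scoring module that lets video large language models identify on their own which key frames need reinforcing. We also propose a new training strategy, Add-One-In, which uses the output probabilities of the multimodal large language model to generate supervision signals representing repeat gain; these signals train the frame scoring network that guides frame repetition. Experiments across multiple models and datasets show that FrameRepeat effectively strengthens important visual cues during reasoning and generalizes well.

Abstract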

Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Question Answering, extended thinking processes do not consistently yield performance gains and may even lead to degradation due to "visual anchor drifting", where models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. While existing mitigations typically introduce specific mechanisms for the model to re-attend to visual inputs during inference, these approaches often incur prohibitive training costs and suffer from poor generalizability across different architectures. To address this, we propose FrameRepeat, an automated enhancement framework which features a lightweight repeat scoring module that enables Video-LLMs to autonomously identify which frames should be reinforced. We introduce a novel training strategy, Add-One-In (AOI), that uses MLLM output probabilities to generate supervision signals representing repeat gain. This can be used to train a frame scoring network, which guides the frame repetition behavior. Experimental results across multiple models and datasets demonstrate that FrameRepeat is both effective and generalizable in strengthening important visual cues during the reasoning process.

Keywords: Multimodal Large Language Models, Video Question Answering, Chain-of-Thought reasoning, visual anchor drifting, hallucination mitigation, frame repetition, autonomous enhancement, training strategy
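
The Add-One-In strategy is described as turning MLLM output probabilities into a "repeat gain" signal. One plausible reading, sketched here under stated assumptions, is that the gain of a frame is the change in answer probability when that one frame is repeated. The placement of the repeated frame, the function names, and `toy_answer_prob` (a stand-in for a real model call) are all hypothetical.

```python
def repeat_gains(frames, answer_prob):
    """gain[i] = p(answer | frames with frame i repeated once) - p(answer | frames).

    answer_prob is any callable mapping a frame list to a probability;
    in practice it would wrap an MLLM forward pass (assumption).
    """
    base = answer_prob(frames)
    gains = []
    for i in range(len(frames)):
        # Assumption: the repeated copy is placed right after the original frame.
        augmented = frames[: i + 1] + [frames[i]] + frames[i + 1 :]
        gains.append(answer_prob(augmented) - base)
    return gains


def frames_to_repeat(scores, k=1):
    """Indices of the k highest-scoring frames, returned in temporal order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def toy_answer_prob(frames):
    # Hypothetical stand-in for an MLLM: the probability of the correct answer
    # grows with the share of frames labeled "key".
    return sum(1 for f in frames if f == "key") / len(frames)


gains = repeat_gains(["bg", "key", "bg", "bg"], toy_answer_prob)
selected = frames_to_repeat(gains, k=1)  # only the "key" frame has positive gain
```

In the paper the gains serve as supervision for a learned frame scoring network; here they are used directly to keep the example short.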

248. ❌ Point-to-Mask: From Arbitrary Point Annotations to Mask-Level Infrared Small Target Detection

Authors: Weihua Gao, Wenlong Niu, Jie Tang, Man Yang, Jiafeng Zhang, Xiaodong Peng Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16257v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

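Scoring rationale: The paper focuses on infrared small target detection (IRSTD), a computer vision task, and proposes a framework that goes from point annotations to mask-level detection using physics-driven adaptive mask generation and a radius-aware point regression network. It is unrelated to most keywords (LLMs, MoE, RLHF, RAG, and so on), which mainly concern large language models, training techniques, and inference optimization in natural language processing. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since infrared small target detection can be seen as an application of AI in science or engineering (for example remote sensing or surveillance). The paper does not explicitly mention bioinformatics or cheminformatics, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes Point-to-Mask, a framework that uses physics-driven adaptive mask generation and a radius-aware point regression network to turn low-cost point annotations into mask-level infrared small target detection, approaching fully supervised performance while lowering annotation cost.

Abstract translation

Infrared small target detection methods mostly cast the task as pixel-level segmentation, which requires costly dense annotation and is ill-suited to tiny targets with weak texture and blurred boundaries. To address this, we propose the Point-to-Mask framework, which connects low-cost point supervision with mask-level detection through two components: a physics-driven adaptive mask generation module that converts point annotations into compact target masks and geometric cues, and a lightweight radius-aware point regression network that uses spatiotemporal motion cues to recast infrared small target detection as target center localization and effective radius regression. The two modules form a closed loop: PAMG generates pseudo masks and geometric supervision during training, and the geometric predictions of RPR-Net are fed back to PAMG at inference time for pixel-level mask recovery. To support systematic evaluation, we further build SIRSTD-Pixel, a sequence dataset with refined pixel-level annotations. Experiments show that the proposed framework delivers high-quality pseudo labels, high detection accuracy, and efficient inference, approaching fully supervised performance under point supervision at a markedly lower annotation cost. Code and datasets will be available at: https://github.com/GaoScience/point-to-mask.

Abstract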

Infrared small target detection (IRSTD) methods predominantly formulate the task as pixel-level segmentation, which requires costly dense annotations and is not well suited to tiny targets with weak texture and ambiguous boundaries. To address this issue, we propose Point-to-Mask, a framework that bridges low-cost point supervision and mask-level detection through two components: a Physics-driven Adaptive Mask Generation (PAMG) module that converts point annotations into compact target masks and geometric cues, and a lightweight Radius-aware Point Regression Network (RPR-Net) that reformulates IRSTD as target center localization and effective radius regression using spatiotemporal motion cues. The two modules form a closed loop: PAMG generates pseudo masks and geometric supervision during training, while the geometric predictions of RPR-Net are fed back to PAMG for pixel-level mask recovery during inference. To facilitate systematic evaluation, we further construct SIRSTD-Pixel, a sequential dataset with refined pixel-level annotations. Experiments show that the proposed framework achieves strong pseudo-label quality, high detection accuracy, and efficient inference, approaching full-supervision performance under point-supervised settings with substantially lower annotation cost. Code and datasets will be available at: https://github.com/GaoScience/point-to-mask.

Keywords: Infrared small target detection, Point supervision, Mask-level detection, Physics-driven adaptive mask generation, Radius-aware point regression, Spatiotemporal motion cues, Pseudo-label generation, Annotation cost reduction
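
The abstract says RPR-Net predicts a target center and an effective radius, which are then turned back into a pixel-level mask. The abstract does not describe how PAMG does this, so the sketch below swaps in the simplest possible stand-in: a filled disk around the predicted center. The `disk_mask` name and the disk assumption are illustrative only.

```python
def disk_mask(height, width, center, radius):
    """Binary mask with 1 inside a disk of the given radius around center=(row, col).

    A plain disk is an assumption for illustration; the paper's
    physics-driven mask recovery is not specified in the abstract.
    """
    cy, cx = center
    return [
        [1 if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2 else 0 for x in range(width)]
        for y in range(height)
    ]


mask = disk_mask(5, 5, center=(2, 2), radius=1)  # center pixel plus its 4 neighbors
```

Representing a tiny target by a center and a radius keeps the regression target low-dimensional, which is presumably part of why the network can stay lightweight.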

249. ❌ AW-MoE

Authors: Hongwei Lin, Xun Huang, Chenglu Wen, Cheng Wang Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16261v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

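Scoring rationale: The AW-MoE paper focuses on applying Mixture of Experts (MoE) to multi-modal 3D object detection to address robustness under adverse weather. Its core innovation is integrating an MoE framework, so it is highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (10 points). The other keywords mainly concern large language model (LLM) techniques, training methods, inference optimization, and agent systems, whereas this paper studies 3D object detection in computer vision and autonomous driving and does not cover language models, text generation, alignment training, or compression and acceleration, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the AW-MoE framework, which integrates Mixture of Experts with image-guided weather-aware routing to resolve performance conflicts in multi-modal 3D object detection under adverse weather for autonomous driving, achieving roughly a 15% improvement on a real-world dataset.

Abstract translation

Robust 3D object detection in adverse weather is critical for autonomous driving. Yet most existing methods simply merge samples from all weather conditions for training and overlook the differences in data distribution across weather scenarios, which leads to performance conflicts. To address this, we propose the AW-MoE framework, which integrates Mixture of Experts (MoE) into weather-robust multi-modal 3D object detection. AW-MoE includes Image-guided Weather-aware Routing (IWR), which exploits the strong discriminability of image features across weather conditions and their invariance to scene changes to classify the weather precisely. Based on this classification, IWR selects the K most relevant Weather-Specific Experts (WSE) to handle data discrepancies, ensuring optimal detection in all weather conditions. We also propose Unified Dual-Modal Augmentation (UDMA), which augments LiDAR and 4D Radar data synchronously while preserving scene realism. Extensive experiments on a real-world dataset show that AW-MoE improves adverse-weather performance by about 15% over the previous best methods with negligible inference overhead. Moreover, integrating AW-MoE into existing baseline detectors yields gains that surpass current state-of-the-art methods. These results demonstrate the effectiveness and strong scalability of the AW-MoE framework. The code will be released at https://github.com/windlinsherlock/AW-MoE.

Abstract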

Robust 3D object detection under adverse weather conditions is crucial for autonomous driving. However, most existing methods simply combine all weather samples for training while overlooking data distribution discrepancies across different weather scenarios, leading to performance conflicts. To address this issue, we introduce AW-MoE, the framework that innovatively integrates Mixture of Experts (MoE) into weather-robust multi-modal 3D object detection approaches. AW-MoE incorporates Image-guided Weather-aware Routing (IWR), which leverages the superior discriminability of image features across weather conditions and their invariance to scene variations for precise weather classification. Based on this accurate classification, IWR selects the top-K most relevant Weather-Specific Experts (WSE) that handle data discrepancies, ensuring optimal detection under all weather conditions. Additionally, we propose a Unified Dual-Modal Augmentation (UDMA) for synchronous LiDAR and 4D Radar dual-modal data augmentation while preserving the realism of scenes. Extensive experiments on the real-world dataset demonstrate that AW-MoE achieves ~15% improvement in adverse-weather performance over state-of-the-art methods, while incurring negligible inference overhead. Moreover, integrating AW-MoE into established baseline detectors yields performance improvements surpassing current state-of-the-art methods. These results show the effectiveness and strong scalability of our AW-MoE. We will release the code publicly at https://github.com/windlinsherlock/AW-MoE.

Keywords: Mixture of Experts, 3D object detection, adverse weather, multi-modal, autonomous driving, LiDAR, 4D Radar, weather-robust
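
IWR is described as classifying the weather from image features and then selecting the top-K weather-specific experts. A minimal sketch of generic top-K expert routing follows. It assumes softmax gates renormalized over the selected experts and experts that return a vector of the same length as their input; neither detail is confirmed by the abstract, and all names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_route(weather_logits, k=2):
    """Map each of the k most probable experts to a renormalized gate weight."""
    probs = softmax(weather_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(features, experts, weather_logits, k=2):
    """Gate-weighted sum of the selected experts' outputs.

    Assumes each expert maps a feature list to a list of the same length.
    """
    gates = top_k_route(weather_logits, k)
    out = [0.0] * len(features)
    for i, gate in gates.items():
        y = experts[i](features)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out
```

Because only K experts run per sample, the extra inference cost stays small, which is consistent with the negligible overhead the abstract reports.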

250. ❌ Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification

Authors: Trong-Duc Nguyen, Hoang-Long Nguyen, Huy-Hieu Pham Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16249v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

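Scoring rationale: The paper focuses on white blood cell classification in medical image analysis, using deep learning techniques (such as Swin Transformer, Pix2Pix, and MedSigLIP) together with biological heuristics to handle an extreme long-tail distribution. It is unrelated to most keywords (covering large-model principles, training methods, inference optimization, agent systems, and so on), which target large language models (LLMs) and related techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper belongs to bioinformatics and medical AI applications but is not core large-model research; it therefore receives 8 points (some relevance, but not core).

!!! tip deepseek-chat TL;DR

The paper proposes a hybrid framework that combines deep learning with biological heuristics for white blood cell classification under an extreme long-tail distribution, achieving a Macro-F1 of 0.77139 in the WBCBench 2026 challenge.

Abstract translation

Automated white blood cell classification is essential for leukemia screening, but extreme class imbalance, long-tail distributions, and domain shift remain persistent challenges, causing deep models to overfit dominant classes and fail on rare subtypes. We propose a hybrid framework for rare-class generalization. It integrates a generative Pix2Pix-based restoration module to remove artifacts, a Swin Transformer ensemble combined with MedSigLIP contrastive embeddings for robust representation learning, and a biologically inspired refinement step that uses geometric spikiness and Mahalanobis-distance-based morphological constraints to recover out-of-distribution predictions. Evaluated on the WBCBench 2026 challenge, our method achieves a Macro-F1 of 0.77139 on the private leaderboard, showing strong performance under severe imbalance and highlighting the value of incorporating biological priors into deep learning for hematological image analysis.

Abstract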

Automated white blood cell (WBC) classification is essential for leukemia screening but remains challenged by extreme class imbalance, long-tail distributions, and domain shift, leading deep models to overfit dominant classes and fail on rare subtypes. We propose a hybrid framework for rare-class generalization that integrates a generative Pix2Pix-based restoration module for artifact removal, a Swin Transformer ensemble with MedSigLIP contrastive embeddings for robust representation learning, and a biologically-inspired refinement step using geometric spikiness and Mahalanobis-based morphological constraints to recover out-of-distribution predictions. Evaluated on the WBCBench 2026 challenge, our method achieves a Macro-F1 of 0.77139 on the private leaderboard, demonstrating strong performance under severe imbalance and highlighting the value of incorporating biological priors into deep learning for hematological image analysis.

Keywords: white blood cell classification, deep learning, long-tail distribution, Swin Transformer, MedSigLIP, biological priors, hematological image analysis, class imbalance
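
The refinement step is said to use Mahalanobis-based morphological constraints to recover out-of-distribution predictions. The sketch below shows one way such a check could work, assuming two hypothetical morphological features per cell (for example area and spikiness), per-class mean and covariance estimated elsewhere, and a hypothetical distance threshold. It is not the paper's procedure.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance for 2-D features with a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # closed-form 2x2 inverse
    dx = [x[0] - mean[0], x[1] - mean[1]]
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)


def refine_prediction(x, predicted, class_stats, threshold=3.0):
    """Keep the network's label unless the sample lies far from that class.

    class_stats maps a label to (mean, covariance); threshold is an
    illustrative value, not one reported in the paper.
    """
    mean, cov = class_stats[predicted]
    if mahalanobis_2d(x, mean, cov) <= threshold:
        return predicted
    # Otherwise fall back to the morphologically closest class.
    return min(class_stats, key=lambda label: mahalanobis_2d(x, *class_stats[label]))
```

Unlike plain Euclidean distance, the Mahalanobis distance scales each feature by its class-specific spread, so a small deviation in a tightly distributed feature counts for more than the same deviation in a noisy one.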

251. ❌ Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models

Authors: Junxin Wang, Dai Guan, Weijie Qiu, Zhihang Li, Yongbo Gai, Zhengyi Yang, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16253v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

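Scoring rationale: The paper studies the reliability of vision-language process reward models (VL-PRMs) and proposes Explicit Visual Premise Verification (EVPV), which calibrates reward scores by verifying the reliability of visual premises. Relevance to the keywords: 1) some relevance to 'Large Language Models' (5 points), because VL-PRMs are usually built on large language models; 2) high relevance to 'Chain of Thought Reasoning' (8 points), because the paper concerns scoring and verifying multi-step reasoning; 3) some relevance to 'System 2 Thinking' (5 points), because it involves evaluating in-depth reasoning; 4) high relevance to 'Hallucination Mitigation' (10 points), because its core aim is to fix the mis-scoring caused by hallucinated visual premises; 5) high relevance to 'Explainable AI' (8 points), because the explicit verification interface improves interpretability. The other keywords have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper targets the systematic errors that vision-language process reward models make when scoring multi-step reasoning, errors caused by hallucinated visual premises. It proposes Explicit Visual Premise Verification, which calibrates reward scores by verifying the reliability of visual premises; experiments show that the method significantly improves step verification accuracy and candidate reranking.

Abstract translation

Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and to rerank candidate answers under test-time scaling. However, they usually act as black-box judges: a low step score may reflect a genuine reasoning error or merely the verifier's misreading of the image. This entanglement of perception and reasoning produces systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct, grounded statements), which weakens both reranking and error localization. We propose Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises the step depends on. The method prompts the policy to generate a step-by-step visual checklist that makes the required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches the checklist claims against these constraints to compute a scalar visual reliability signal and calibrates the PRM step reward through reliability gating: when reliability is low, rewards for visually dependent steps are attenuated; when it is high, they are preserved. This decouples perceptual uncertainty from logical evaluation without relying on per-step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently raises Best-of-N reranking accuracy over strong baselines. In addition, injecting controlled corruption into the extracted constraints causes performance to degrade monotonically, providing causal evidence that the gains come from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: https://github.com/Qwen-Applications/EVPV-PRM

Abstract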

Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier’s misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step-wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high. This decouples perceptual uncertainty from logical evaluation without per-step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines. Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: https://github.com/Qwen-Applications/EVPV-PRM

Keywords: Vision-Language Process Reward Models, Explicit Visual Premise Verification, Hallucination Mitigation, Multimodal Reasoning, Step-level Verification, Reward Calibration, Visual Reliability, Constraint Fidelity
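
The reliability gating described in the abstract can be pictured with a small sketch. It assumes that reliability is the fraction of checklist claims found among the extracted constraints and that gating is a simple multiplication; the abstract specifies neither the matching rule nor the gating function, so both are assumptions, as are the function names.

```python
def visual_reliability(checklist, constraints):
    """Fraction of checklist claims supported by the extracted constraints.

    Exact string membership is a simplification; returns 1.0 when a step
    makes no visual claims.
    """
    if not checklist:
        return 1.0
    supported = sum(1 for claim in checklist if claim in constraints)
    return supported / len(checklist)


def gated_reward(step_reward, reliability, visually_dependent):
    """Attenuate rewards of visually dependent steps when reliability is low."""
    if visually_dependent:
        return step_reward * reliability
    return step_reward


constraints = {"triangle ABC is right-angled", "AB = 3"}
reliability = visual_reliability(["AB = 3", "BC = 5"], constraints)  # one of two claims matches
reward = gated_reward(0.8, reliability, visually_dependent=True)
```

Steps that do not depend on the image pass through unchanged, which is what lets the logical score stay separate from perceptual uncertainty.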

252. ❌ Visual Prompt Discovery via Semantic Exploration

Authors: Jaechang Kim, Yotaro Shimose, Zhao Wang, Kuang-Da Wang, Jungseul Ok, Shingo Takamatsu Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16250v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

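Scoring rationale: The paper focuses on visual prompt discovery for large vision-language models (LVLMs) and belongs to the area of large-model applications. Core relevant keywords: 1) 'Large Language Models' (weight 1.0): the paper directly studies LVLMs, which is its core content, so 10 points; 2) 'LLM Agents' (weight 1.0): the paper proposes an agent-based automated exploration framework (agent-driven experiments), which is an agentic workflow, so 10 points; 3) 'Tool Use' (weight 1.0): the paper uses visual prompts as tools (image manipulation code) to strengthen LVLM perception, so 10 points. Other keywords such as MoE, quantization, and inference acceleration are unrelated to the paper and score 0. Weighted total: 10×1.0 + 10×1.0 + 10×1.0 = 30.0. No designated experts appear in the author list, so no bonus is applied.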
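
The arithmetic in this rationale can be checked with a short sketch of the report's scoring rule. It assumes the per-keyword score is relevance times weight, capped at the configured maximum of 15, and uses the pass-mark formula from the report configuration (5.0 + 0.8 × total weight). The function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report configuration: 5.0 + 0.8 x total weight."""
    return base + factor * total_weight


def weighted_score(relevance, weights, cap=15.0):
    """Sum of relevance x weight per keyword, each term capped at `cap`."""
    return sum(min(rel * weights.get(kw, 0.0), cap) for kw, rel in relevance.items())


relevance = {"Large Language Models": 10.0, "LLM Agents": 10.0, "Tool Use": 10.0}
with_unit_weights = weighted_score(relevance, {kw: 1.0 for kw in relevance})
with_zero_weights = weighted_score(relevance, {kw: 0.0 for kw in relevance})
threshold = pass_threshold(27.0)
```

With a weight of 1.0 the three 10-point keywords sum to 30.0, as in the rationale; with the 0.0 weights listed in the score table, the same function returns 0.0, the score shown for this entry.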

!!! tip deepseek-chat TL;DR

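The paper addresses perception failures of large vision-language models in image understanding and visual reasoning. It proposes SEVEX, an automated framework based on semantic exploration that discovers task-specific visual prompts and significantly improves accuracy and efficiency on benchmarks.

Abstract translation

Large vision-language models face significant challenges in image understanding and visual reasoning, often leading to critical perception failures. Visual prompting techniques that incorporate image manipulation code have shown potential for mitigating these problems. Although the direction is promising, earlier visual prompt generation methods have focused on tool selection rather than on diagnosing and resolving the root causes of perception failures in large vision-language models. Because these models are opaque and unpredictable, optimal visual prompts have to be discovered through experimental exploration, and existing work relies mainly on manual trial and error.

We propose an automated semantic exploration framework for discovering task-specific visual prompts. The method achieves diverse yet efficient exploration through agent-driven experiments, minimizes human intervention, and avoids the inefficiency of per-sample generation. We introduce a semantic exploration algorithm named SEVEX to address two core challenges of visual prompt exploration: (1) the distraction caused by lengthy low-level code and (2) the vast, unstructured search space of visual prompts. Specifically, the method builds an abstract idea space as the search domain, applies a novelty-guided selection algorithm, and combines these with a semantic-feedback-driven ideation process, so that diverse visual prompts can be explored efficiently on the basis of empirical results.

We validate SEVEX on the BlindTest and BLINK benchmarks, which assess the perception ability of large vision-language models. The results show that SEVEX significantly outperforms baseline methods in task accuracy, inference efficiency, exploration efficiency, and exploration stability. Notably, the framework discovers sophisticated, counter-intuitive visual strategies that go beyond conventional tool use, offering a new paradigm for strengthening the perception of large vision-language models through automated, task-oriented visual prompts.

Abstract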

LVLMs encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown promising potential in mitigating these issues. While emerged as a promising direction, previous methods for visual prompt generation have focused on tool selection rather than diagnosing and mitigating the root causes of LVLM perception failures. Because of the opacity and unpredictability of LVLMs, optimal visual prompts must be discovered through empirical experiments, which have relied on manual human trial-and-error. We propose an automated semantic exploration framework for discovering task-wise visual prompts. Our approach enables diverse yet efficient exploration through agent-driven experiments, minimizing human intervention and avoiding the inefficiency of per-sample generation. We introduce a semantic exploration algorithm named SEVEX, which addresses two major challenges of visual prompt exploration: (1) the distraction caused by lengthy, low-level code and (2) the vast, unstructured search space of visual prompts. Specifically, our method leverages an abstract idea space as a search space, a novelty-guided selection algorithm, and a semantic feedback-driven ideation process to efficiently explore diverse visual prompts based on empirical results. We evaluate SEVEX on the BlindTest and BLINK benchmarks, which are designed to assess LVLM perception. Experimental results demonstrate that SEVEX significantly outperforms baseline methods in task accuracy, inference efficiency, exploration efficiency, and exploration stability. Notably, our framework discovers sophisticated and counter-intuitive visual strategies that go beyond conventional tool usage, offering a new paradigm for enhancing LVLM perception through automated, task-wise visual prompts.

Keywords: Large Vision-Language Models, Visual Prompt Discovery, Semantic Exploration, Agent-driven Experiments, LVLM Perception, Automated Framework, Task-wise Visual Prompts, SEVEX Algorithm
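
SEVEX is described as using a novelty-guided selection algorithm over an abstract idea space. The abstract does not define novelty, so the sketch below assumes a common choice: each idea has an embedding, and an idea's novelty is its distance to the nearest idea already explored. The idea names and embeddings are made up for illustration.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def novelty(candidate, explored):
    """Distance to the nearest explored idea; infinite if nothing is explored yet."""
    if not explored:
        return math.inf
    return min(distance(candidate, e) for e in explored)


def select_most_novel(candidates, explored):
    """Return the name of the candidate idea farthest from everything explored.

    candidates maps an idea name to its embedding (hypothetical representation).
    """
    return max(candidates, key=lambda name: novelty(candidates[name], explored))


explored = [[0.0, 0.0], [1.0, 0.0]]
candidates = {"crop": [0.1, 0.0], "overlay grid": [0.0, 2.0], "zoom": [1.0, 0.2]}
choice = select_most_novel(candidates, explored)
```

Favoring ideas far from what has already been tried pushes the search toward diverse prompts rather than small variations of an earlier attempt.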

253. ❌ PureCLIP-Depth: Prompt-Free and Decoder-Free Monocular Depth Estimation within CLIP Embedding Space

Authors: Ryutaro Miya, Kazuyoshi Fushinobu, Tatsuya Kawaguchi Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16238v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

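Scoring rationale: The PureCLIP-Depth paper focuses on monocular depth estimation in computer vision and learns a mapping within the CLIP embedding space. All scoring keywords relate directly to large language models, the technical principles of deep learning, or scientific AI applications, whereas this paper studies the use of the CLIP model for a vision task and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes PureCLIP-Depth, a prompt-free and decoder-free monocular depth estimation method that learns the RGB-to-depth mapping directly in the CLIP embedding space and achieves the best performance among CLIP-based models on indoor and outdoor datasets.

Abstract translation

We propose PureCLIP-Depth, a fully prompt-free, decoder-free monocular depth estimation model whose computation takes place entirely within the Contrastive Language-Image Pre-training embedding space. Unlike recent models that depend heavily on geometric features, we explore a new approach to monocular depth estimation driven by conceptual information, computing directly in the conceptual CLIP space. The core of our method is learning a direct mapping from the RGB domain to the depth domain, with the mapping confined strictly to this embedding space. Our method achieves leading performance among CLIP-embedding-based models on both indoor and outdoor datasets. The code used in this study is available at: https://github.com/ryutaroLF/PureCLIP-Depth

Abstract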

We propose PureCLIP-Depth, a completely prompt-free, decoder-free Monocular Depth Estimation (MDE) model that operates entirely within the Contrastive Language-Image Pre-training (CLIP) embedding space. Unlike recent models that rely heavily on geometric features, we explore a novel approach to MDE driven by conceptual information, performing computations directly within the conceptual CLIP space. The core of our method lies in learning a direct mapping from the RGB domain to the depth domain strictly inside this embedding space. Our approach achieves state-of-the-art performance among CLIP embedding-based models on both indoor and outdoor datasets. The code used in this research is available at: https://github.com/ryutaroLF/PureCLIP-Depth

Keywords: PureCLIP-Depth, Monocular Depth Estimation, CLIP embedding space, prompt-free, decoder-free, conceptual information, state-of-the-art, indoor and outdoor datasets

254. ❌ RASLF: Representation-Aware State Space Model for Light Field Super-Resolution

Authors: Zeqiang Wei, Kai Jin, Kuan Song, Xiuzhuang Zhou, Wenlong Chen, Min Xu Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16243v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

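Scoring rationale: The paper "RASLF: Representation-Aware State Space Model for Light Field Super-Resolution" addresses light field super-resolution, a computer vision task, and proposes a deep learning framework built on state space models. Its core contributions are improved light field representation, geometric alignment, and feature aggregation for better super-resolution quality. The scoring keywords, however, all concern large language models (LLMs), their training and alignment techniques (such as RLHF and SFT), inference optimization (such as quantization and speculative decoding), agent systems, or specific AI-for-science applications (such as bioinformatics). The paper involves no language models, text generation, alignment methods, or agent techniques, and it is not applied to biology or chemistry. Every keyword therefore receives a relevance of 0, and the paper's topic is unrelated to the keyword set.

!!! tip deepseek-chat TL;DR

The paper proposes RASLF, a representation-aware state space model that uses progressive geometric refinement, representation-aware asymmetric scanning, and a dual-anchor aggregation module to address the texture loss and geometric misalignment caused by underused complementarity among light field representations in super-resolution; it achieves the highest reconstruction accuracy on several public benchmarks while remaining computationally efficient.

Abstract (Chinese translation)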

当前基于状态空间模型的光场超分辨率方法往往未能充分利用不同光场表征之间的互补性，导致精细纹理丢失与多视角间的几何错位。为解决这些问题，我们提出RASLF——一种表征感知的状态空间框架，该框架能显式建模跨多光场表征的结构相关性。具体而言，我们设计了渐进式几何优化模块，该模块利用全景极平面表征显式编码多视角视差差异，从而实现跨不同光场表征的融合。此外，我们引入表征感知非对称扫描机制，该机制根据不同表征空间的物理特性动态调整扫描路径，通过路径剪枝优化性能与效率的平衡。同时，双锚点聚合模块改进了层次化特征流，减少深层冗余特征并优先传递重要重建信息。在多个公开基准测试上的实验表明，RASLF在保持高计算效率的同时实现了最优的重建精度。

Abstract

Current SSM-based light field super-resolution (LFSR) methods often fail to fully leverage the complementarity among various LF representations, leading to the loss of fine textures and geometric misalignments across views. To address these issues, we propose RASLF, a representation-aware state-space framework that explicitly models structural correlations across multiple LF representations. Specifically, a Progressive Geometric Refinement (PGR) block is created that uses a panoramic epipolar representation to explicitly encode multi-view parallax differences, thereby enabling integration across different LF representations. Furthermore, we introduce a Representation Aware Asymmetric Scanning (RAAS) mechanism that dynamically adjusts scanning paths based on the physical properties of different representation spaces, optimizing the balance between performance and efficiency through path pruning. Additionally, a Dual-Anchor Aggregation (DAA) module improves hierarchical feature flow, reducing redundant deep-layer features and prioritizing important reconstruction information. Experiments on various public benchmarks show that RASLF achieves the highest reconstruction accuracy while remaining highly computationally efficient.
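For readers unfamiliar with the "LF representations" the abstract refers to, the following is a minimal NumPy sketch of how the standard views of a 4D light field are sliced out of one array. It assumes a grayscale light field stored as `lf[u, v, h, w]` (angular indices first); the function name `light_field_views` is hypothetical, and the paper's panoramic epipolar representation is not reproduced here, only the ordinary epipolar-plane images (EPIs) it presumably builds on.

```python
import numpy as np

def light_field_views(lf):
    """Slice common representations out of a 4D light field.

    lf has shape (U, V, H, W): angular coordinates (u, v) and
    spatial coordinates (h, w). This layout is an assumption.
    """
    U, V, H, W = lf.shape
    # Central sub-aperture image: one fixed viewpoint, shape (H, W).
    sai = lf[U // 2, V // 2]
    # Horizontal EPI: fix the vertical angular and spatial indices, shape (V, W).
    epi_h = lf[U // 2, :, H // 2, :]
    # Vertical EPI: fix the horizontal angular and spatial indices, shape (U, H).
    epi_v = lf[:, V // 2, :, W // 2]
    # Macro-pixel image: every spatial pixel becomes a U x V angular block.
    macpi = lf.transpose(2, 0, 3, 1).reshape(H * U, W * V)
    return sai, epi_h, epi_v, macpi
```

In an EPI, a scene point traces a line whose slope depends on its disparity, which is why EPIs are a natural place to encode multi-view parallax.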

Keywords: Light Field Super-Resolution, State Space Model, Representation-Aware, Progressive Geometric Refinement, Panoramic Epipolar Representation, Multi-view Parallax, Computational Efficiency, Reconstruction Accuracy

255. ❌ Exclusivity-Guided Mask Learning for Semi-Supervised Crowd Instance Segmentation and Counting

Authors: Jiyang Huang, Hongru Cheng, Wei Lin, Jia Wan, Antoni B. Chan Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16241v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

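Scoring rationale: The paper addresses semi-supervised crowd instance segmentation and counting in computer vision, proposing EDP-SAM and XMask and using techniques such as Gaussian smoothing and differentiable center sampling. The scoring keywords all concern large models, the principles of deep learning techniques, or AI-for-science applications, whereas this paper studies a conventional computer vision task and touches none of the keyword topics: large models, language models, MoE, scaling laws, training techniques, inference optimization, agents, quantization and compression, hallucination mitigation, interpretability, world models, model merging, or in-context learning. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a semi-supervised framework for crowd instance segmentation and counting based on exclusivity-guided mask learning: EDP-SAM generates mask supervision and XMask enforces spatial separation, reaching state-of-the-art performance on several datasets.

Abstract (Chinese translation)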

半监督人群分析是一个重要的研究领域，因为未标注数据通常易于大量获取且成本低廉。然而，传统的基于点的标注方式限制了性能，因为个体区域本身具有模糊性，因此从稀疏标注中学习细粒度结构语义仍是一个未解决的挑战。本文首先提出了一种基于最近邻排除圆（Nearest Neighbor Exclusion Circle, NNEC）约束的排除约束双提示SAM模型（Exclusion-Constrained Dual-Prompt SAM, EDP-SAM），用于为现有数据集生成掩码监督。为了在密集场景中分割个体，我们进一步提出了排他性引导掩码学习（Exclusivity-Guided Mask Learning, XMask），该方法通过一种判别性掩码目标来强制实现空间分离。我们采用高斯平滑和可微分中心采样策略来提升特征连续性和训练稳定性。基于XMask，我们提出了一种半监督人群计数框架，该框架使用实例掩码先验作为伪标签，这些伪标签比传统的点标注包含更丰富的形状信息。在上海科技大学A数据集（ShanghaiTech A）、UCF-QNRF数据集和JHU++数据集上（使用5%、10%和40%的标注数据）的大量实验验证了我们的端到端模型在半监督分割和计数任务上达到了最先进的性能，有效地在一个统一框架内弥合了计数与实例分割之间的差距。

Abstract

Semi-supervised crowd analysis is a prominent area of research, as unlabeled data are typically abundant and inexpensive to obtain. However, traditional point-based annotations constrain performance because individual regions are inherently ambiguous, and consequently, learning fine-grained structural semantics from sparse annotations remains an unresolved challenge. In this paper, we first propose an Exclusion-Constrained Dual-Prompt SAM (EDP-SAM), based on our Nearest Neighbor Exclusion Circle (NNEC) constraint, to generate mask supervision for current datasets. With the aim of segmenting individuals in dense scenes, we then propose Exclusivity-Guided Mask Learning (XMask), which enforces spatial separation through a discriminative mask objective. Gaussian smoothing and a differentiable center sampling strategy are utilized to improve feature continuity and training stability. Building on XMask, we present a semi-supervised crowd counting framework that uses instance mask priors as pseudo-labels, which contain richer shape information than traditional point cues. Extensive experiments on the ShanghaiTech A, UCF-QNRF, and JHU++ datasets (using 5%, 10%, and 40% labeled data) verify that our end-to-end model achieves state-of-the-art semi-supervised segmentation and counting performance, effectively bridging the gap between counting and instance segmentation within a unified framework.
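The abstract does not define the Nearest Neighbor Exclusion Circle constraint, so the sketch below is only one plausible reading, offered to make the idea concrete: each annotated head point gets a circle whose radius is a fraction of the distance to its nearest neighbor, so that no two circles overlap. The function names and the `scale=0.5` choice are assumptions, not the paper's definition.

```python
import numpy as np

def exclusion_radii(points, scale=0.5):
    """Radius per point: `scale` times the distance to its nearest neighbor."""
    pts = np.asarray(points, dtype=float)            # (N, 2) as (x, y)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))         # pairwise distances
    np.fill_diagonal(dist, np.inf)                   # ignore self-distance
    return scale * dist.min(axis=1)

def exclusion_circle_masks(points, shape, scale=0.5):
    """Boolean masks of shape (N, H, W), one open disk per annotated point."""
    radii = exclusion_radii(points, scale)
    yy, xx = np.mgrid[: shape[0], : shape[1]]
    masks = [
        (xx - x) ** 2 + (yy - y) ** 2 < r ** 2       # strict: circles never touch
        for (x, y), r in zip(points, radii)
    ]
    return np.stack(masks)
```

With `scale` at most 0.5, the radii of any two points sum to no more than the distance between them, so the disks are disjoint; such regions could then restrict where a promptable segmenter is allowed to place each person's mask.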

Keywords: semi-supervised crowd analysis, instance segmentation, crowd counting, mask learning, exclusivity-guided, EDP-SAM, XMask, pseudo-labels

256. ❌ Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors

Authors: Ryosuke Hori, Jyun-Ting Song, Zhengyi Luo, Jinkun Cao, Soyong Shin, Hideo Saito, Kris Kitani Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16233v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

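Scoring rationale: The paper studies a physics-simulation-based human motion capture method that takes IMU and insole pressure sensor data and reconstructs human motion through a digital twin and physics simulation. The scoring keywords all concern large language models, the principles of deep learning techniques, or AI applications in science, whereas this paper focuses on computer vision, motion capture, and physics simulation and involves no large-model techniques, deep learning innovations, or AI applications in scientific fields such as biomedicine. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes GRIP, a physics-simulation-based human motion capture method that combines IMU and insole pressure sensor data and uses a digital twin in a physics simulator to reconstruct physically plausible human motion; experiments show it outperforms existing methods in global pose accuracy and physical consistency.

Abstract (Chinese translation)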

我们提出地面反作用惯性姿态重建系统（Ground Reaction Inertial Poser，简称GRIP），该方法利用四个可穿戴设备重建物理合理的人体运动。与传统仅使用惯性测量单元（IMU）的方法不同，GRIP将IMU信号与足底压力数据相结合，以同时捕捉身体动力学特征和地面交互作用。此外，GRIP并非单纯依赖运动学估计，而是通过物理模拟器中的合成人形数字孪生体来重建真实且物理合理的运动。GRIP的核心包含两个模块：运动学网络（KinematicsNet）从传感器数据中估计身体姿态与速度，以及动力学网络（DynamicsNet）利用运动学网络预测结果与模拟人形状态之间的残差来控制模拟器中的人形运动。为实现鲁棒训练与公平评估，我们引入了大规模数据集——面向人体运动与交互的压力与惯性传感数据集（Pressure and Inertial Sensing for Human Motion and Interaction，简称PRISM），该数据集通过同步的IMU与鞋垫压力传感器采集了多样化的人体运动。实验结果表明，在所有评估数据集上，GRIP均优于现有的纯IMU方法及IMU-压力融合方法，实现了更高的全局姿态精度和更强的物理一致性。

Abstract

We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions. Furthermore, rather than relying solely on kinematic estimation, GRIP uses a digital twin of a person, in the form of a synthetic humanoid in a physics simulator, to reconstruct realistic and physically plausible motion. At its core, GRIP consists of two modules: KinematicsNet, which estimates body poses and velocities from sensor data, and DynamicsNet, which controls the humanoid in the simulator using the residual between the KinematicsNet prediction and the simulated humanoid state. To enable robust training and fair evaluation, we introduce a large-scale dataset, Pressure and Inertial Sensing for Human Motion and Interaction (PRISM), that captures diverse human motions with synchronized IMUs and insole pressure sensors. Experimental results show that GRIP outperforms existing IMU-only and IMU-pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency.

Keywords: human motion capture, IMU sensors, insole pressure sensors, physics-based simulation, digital twin, humanoid control, sensor fusion, physical consistency

257. ❌ Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation

Authors: Yiming Huang, Baixiang Huang, Beilei Cui, Chi Kit Ng, Long Bai, Hongliang Ren Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16211v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

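Scoring rationale: The paper belongs to 3D computer vision, studying 3D reconstruction and generation, specifically geometry-aware generation based on 3D Gaussian Splatting and diffusion models. It covers computer vision tasks such as 3D representation, view synthesis, and depth estimation, but involves no large language models (LLMs), innovations in deep learning principles, or any specific technique named in the keywords (such as MoE, RLHF, RAG, or quantization). The paper does use a diffusion model, but as a generative model for a 3D vision task rather than as an LLM-related technique. The keywords all concern large language models, the principles of deep learning techniques, or AI applications in science, whereas this paper is purely 3D computer vision research, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Leveling3D, a method that integrates feed-forward 3D reconstruction with geometrically consistent generation to address the rendering artifacts that 3D Gaussian Splatting produces in novel views of underconstrained regions; it performs reconstruction and generation simultaneously and achieves state-of-the-art performance on novel-view synthesis and depth estimation.

Abstract (Chinese translation)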

前馈式三维重建技术已彻底改变了三维视觉领域，为下游任务（如基于三维高斯泼溅的新视角合成）提供了强大的基准。先前的研究尝试利用扩散模型修复存在瑕疵的渲染结果，但这些方法缺乏几何层面的考量，难以填补外推视角下的缺失区域。本研究提出Leveling3D——一种创新流程，它将前馈式三维重建与几何一致生成相结合，实现了整体同步的重建与生成。我们设计了一种几何感知的均衡适配器，该轻量级技术能够将扩散模型内部知识与前馈模型的几何先验对齐。该适配器可在三维表征欠约束区域所导致的外推新视角伪影区域进行生成。具体而言，为学习更具分布多样性的生成效果，我们引入了训练阶段的调色板过滤策略，以及测试时的掩码细化机制以避免修复区域边界混乱。更重要的是，Leveling3D所增强的外推新视角可作为前馈式三维高斯泼溅（3DGS）的输入，从而提升三维重建质量。我们在公开数据集上实现了新视角合成与深度估计等任务的先进性能。

Abstract

Feed-forward 3D reconstruction has revolutionized 3D vision, providing a powerful baseline for downstream tasks such as novel-view synthesis with 3D Gaussian Splatting. Previous works explore fixing the corrupted rendering results with a diffusion model. However, they lack geometric concern and fail at filling the missing area on the extrapolated view. In this work, we introduce Leveling3D, a novel pipeline that integrates feed-forward 3D reconstruction with geometrical-consistent generation to enable holistic simultaneous reconstruction and generation. We propose a geometry-aware leveling adapter, a lightweight technique that aligns internal knowledge in the diffusion model with the geometry prior from the feed-forward model. The leveling adapter enables generation on the artifact area of the extrapolated novel views caused by underconstrained regions of the 3D representation. Specifically, to learn a more diverse distributed generation, we introduce the palette filtering strategy for training, and a test-time masking refinement to prevent messy boundaries along the fixing regions. More importantly, the enhanced extrapolated novel views from Leveling3D could be used as the inputs for feed-forward 3DGS, leveling up the 3D reconstruction. We achieve SOTA performance on public datasets, including tasks such as novel-view synthesis and depth estimation.

Keywords: 3D reconstruction, 3D Gaussian Splatting, novel-view synthesis, diffusion model, geometry-aware generation, feed-forward model, depth estimation, SOTA performance

258. ❌ Long-Horizon Traffic Forecasting via Incident-Aware Conformal Spatio-Temporal Transformers

Authors: Mayur Patil, Qadeer Ahmed, Shawn Midlam-Mohler, Stephanie Marik, Allen Sheldon, Rajeev Chhajer, Nithin Santhanam Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16857v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

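Scoring rationale: The paper focuses on traffic forecasting, using a Spatio-Temporal Transformer and adaptive conformal prediction for multi-horizon traffic forecasting and incorporating crash data. The scoring keywords all concern large models, the principles of deep learning techniques, or specific AI application areas (such as bioinformatics), whereas this paper studies a specific application in traffic engineering and involves no large-model techniques, innovations in deep learning principles, or the scientific application areas named in the keywords. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The study proposes a spatio-temporal Transformer model that combines incident-aware dynamic graph construction with adaptive conformal prediction for multi-horizon traffic forecasting; experiments show it outperforms baseline methods in long-horizon accuracy and uncertainty calibration.

Abstract (Chinese translation)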

可靠的多时段交通预测具有挑战性，因为路网状况具有随机性，事件干扰是间歇性的，且有效的空间依赖性随一天中的时段模式而变化。本研究基于俄亥俄州交通部（ODOT）的交通流量数据及相应的ODOT事故记录展开。本工作采用时空变换器（STT）模型与自适应共形预测（ACP）相结合的方法，以生成具有校准不确定性的多时段预测。我们提出了一种分段变异系数（CV）策略，该策略利用对数正态分布对逐小时的行程时间变异性进行建模，从而能够构建每小时的动态邻接矩阵。我们进一步利用源自ODOT事故数据集的事件相关严重性信号（包括事故清理时间、天气条件、超速违规、施工区域及道路功能等级）对边权重进行扰动，以捕捉局部干扰和高峰/非高峰时段的转换。这种动态图构建方法取代了固定CV假设，能更好地表征预测窗口内不断变化的交通状况。为进行验证，我们在SUMO仿真中通过俄亥俄州哥伦布市路网上的多时段循环运行生成扩展行程，并应用蒙特卡洛仿真来获取被测车辆（VUT）的行程时间分布。实验表明，与其他基线方法相比，该方法提高了长时域预测的准确性，并获得了校准良好的预测区间。

Abstract

Reliable multi-horizon traffic forecasting is challenging because network conditions are stochastic, incident disruptions are intermittent, and effective spatial dependencies vary across time-of-day patterns. This study is conducted on the Ohio Department of Transportation (ODOT) traffic count data and corresponding ODOT crash records. This work utilizes a Spatio-Temporal Transformer (STT) model with Adaptive Conformal Prediction (ACP) to produce multi-horizon forecasts with calibrated uncertainty. We propose a piecewise Coefficient of Variation (CV) strategy that models hour-to-hour travel-time variability using a log-normal distribution, enabling the construction of a per-hour dynamic adjacency matrix. We further perturb edge weights using incident-related severity signals derived from the ODOT crash dataset that comprises incident clearance time, weather conditions, speed violations, work zones, and roadway functional class, to capture localized disruptions and peak/off-peak transitions. This dynamic graph construction replaces a fixed-CV assumption and better represents changing traffic conditions within the forecast window. For validation, we generate extended trips via multi-hour loop runs on the Columbus, Ohio, network in SUMO simulations and apply a Monte Carlo simulation to obtain travel-time distributions for a Vehicle Under Test (VUT). Experiments demonstrate improved long-horizon accuracy and well-calibrated prediction intervals compared to other baseline methods.
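The calibrated-uncertainty step can be illustrated with a small online conformal loop. This is a sketch under stated assumptions rather than the paper's pipeline: it uses absolute residuals as nonconformity scores and the commonly used adaptive update alpha_{t+1} = alpha_t + gamma * (alpha - err_t), where err_t is 1 on a miss and 0 on a hit. The function name, the `warmup` length, and the clipping of the quantile level to [0, 1] are choices made here for illustration.

```python
import numpy as np

def adaptive_conformal_intervals(y_true, y_pred, alpha=0.1, gamma=0.01, warmup=50):
    """Online intervals y_pred[t] +/- q_t with an adaptively tuned miscoverage level."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scores = np.abs(y_true - y_pred)          # nonconformity: absolute residual
    alpha_t = alpha
    lower, upper, covered = [], [], []
    for t in range(warmup, len(y_true)):
        # Quantile of past scores at level 1 - alpha_t (clipped, a simplification).
        level = float(np.clip(1.0 - alpha_t, 0.0, 1.0))
        q = np.quantile(scores[:t], level)
        lo, hi = y_pred[t] - q, y_pred[t] + q
        hit = bool(lo <= y_true[t] <= hi)
        # A miss lowers alpha_t (wider intervals); a hit raises it (narrower ones).
        alpha_t += gamma * (alpha - (0.0 if hit else 1.0))
        lower.append(lo)
        upper.append(hi)
        covered.append(hit)
    return np.array(lower), np.array(upper), np.array(covered)
```

The feedback on `alpha_t` is what lets coverage recover after a regime change, such as an incident, where a fixed quantile of stale residuals would under-cover.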

Keywords: traffic forecasting, spatio-temporal transformer, adaptive conformal prediction, incident-aware, dynamic adjacency matrix, Monte Carlo simulation, travel-time distribution, SUMO simulation

259. ❌ Dynamic Meta-Layer Aggregation for Byzantine-Robust Federated Learning

Authors: Reek Das, Biplab Kanti Sen Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16846v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

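Scoring rationale: The paper focuses on Byzantine-robust defenses in Federated Learning, proposing an adaptive aggregation framework named FedAOT to counter multi-label flipping and untargeted poisoning attacks. The scoring keywords all concern large models, the principles of deep learning techniques, or AI applications in science, whereas this paper studies secure aggregation in distributed machine learning, a different technical area (federated learning security) with no direct connection to any keyword.

!!! tip deepseek-chat TL;DR

The paper addresses the degradation of model performance caused by Byzantine attacks in federated learning (such as multi-label flipping and untargeted poisoning) and proposes FedAOT, a meta-learning-inspired adaptive aggregation defense; experiments show it improves model accuracy and robustness while maintaining computational efficiency.

Abstract (Chinese translation)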

联邦学习（Federated Learning, FL）正日益广泛应用于医疗、金融和物联网等领域，它能够在保护用户隐私的同时实现协同模型训练。然而，联邦学习系统易受拜占庭敌手的攻击，这些敌手会注入恶意更新，从而严重损害全局模型的性能。现有的防御方法往往侧重于特定攻击类型，难以应对无目标攻击策略，例如多标签翻转或噪声与后门模式的组合攻击。为克服这些局限，我们提出FedAOT——一种新颖的防御机制，它采用受元学习启发的自适应聚合框架来抵御多标签翻转和无目标投毒攻击。FedAOT根据客户端更新的可靠性动态调整其权重，在不依赖预设阈值或严格攻击假设的前提下抑制对抗性影响。值得注意的是，FedAOT能够有效泛化至不同数据集和多种攻击类型，即使在先前未见的攻击场景下仍能保持鲁棒性能。实验结果表明，FedAOT在保持计算效率的同时，显著提升了模型准确性与抗干扰能力，为安全的联邦学习提供了一个可扩展且实用的解决方案。

Abstract

Federated Learning (FL) is increasingly applied in sectors like healthcare, finance, and IoT, enabling collaborative model training while safeguarding user privacy. However, FL systems are susceptible to Byzantine adversaries that inject malicious updates, which can severely compromise global model performance. Existing defenses tend to focus on specific attack types and fail against untargeted strategies, such as multi-label flipping or combinations of noise and backdoor patterns. To overcome these limitations, we propose FedAOT, a novel defense mechanism that counters multi-label flipping and untargeted poisoning attacks using a meta-learning-inspired adaptive aggregation framework. FedAOT dynamically weights client updates based on their reliability, suppressing adversarial influence without relying on predefined thresholds or restrictive attack assumptions. Notably, FedAOT generalizes effectively across diverse datasets and a wide range of attack types, maintaining robust performance even in previously unseen scenarios. Experimental results demonstrate that FedAOT substantially improves model accuracy and resilience while maintaining computational efficiency, offering a scalable and practical solution for secure federated learning.
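To make "dynamically weights client updates based on their reliability" concrete, here is a minimal sketch of threshold-free soft weighting. It is not FedAOT's rule, which the abstract does not specify: the reliability score used here (distance to the coordinate-wise median, turned into softmax weights) is a stand-in heuristic, and the function name and `temperature` parameter are hypothetical.

```python
import numpy as np

def reliability_weighted_aggregate(updates, temperature=1.0):
    """Aggregate client updates with soft weights instead of a hard cutoff.

    updates: array-like of shape (n_clients, n_params).
    Returns (aggregated_update, weights).
    """
    u = np.asarray(updates, dtype=float)
    reference = np.median(u, axis=0)                  # robust center of the updates
    dist = np.linalg.norm(u - reference, axis=1)      # per-client deviation
    scale = np.median(dist) + 1e-12                   # data-driven scale, no fixed threshold
    logits = -dist / (temperature * scale)
    weights = np.exp(logits - logits.max())           # numerically stable softmax
    weights /= weights.sum()
    return weights @ u, weights
```

Because the weights decay smoothly with deviation, an outlying update is suppressed without anyone choosing a rejection threshold in advance; a learned reliability score could replace the distance heuristic without changing the aggregation step.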

Keywords: Federated Learning, Byzantine Robustness, Meta-Learning, Adaptive Aggregation, Poisoning Attacks, Multi-label Flipping, Model Resilience, Secure Machine Learning

260. ❌ Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning

Authors: Jello Zhou, Vudtiwat Ngampruetikorn, David J. Schwab Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16842v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

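Scoring rationale: The paper studies stochastic resetting in reinforcement learning, which is a reinforcement learning optimization method with no direct connection to any of the keywords (all of which concern large models, the principles of deep learning techniques, or specific AI application areas), so every keyword is scored 0.

!!! tip deepseek-chat TL;DR

The paper studies how stochastic resetting accelerates policy convergence in reinforcement learning, finding that even when resetting does not reduce search time, it speeds up learning by truncating uninformative trajectories and thereby enhancing value propagation.

Abstract (Chinese translation)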

随机重置，即动态过程被间歇性地重置至固定参考状态，已成为优化首次通过特性的有力机制。现有理论主要处理静态、非学习过程。本文探讨随机重置如何与强化学习相互作用——在强化学习中，底层动态通过经验进行自适应。在表格网格环境中，我们发现即使重置不减少纯扩散智能体的搜索时间，它仍能加速策略收敛，这揭示了一种超越经典首次通过优化的新机制。在基于神经网络价值函数近似的连续控制任务中，我们证明当探索困难且奖励稀疏时，随机重置能改进深度强化学习性能。与时间折扣不同，重置在保留最优策略的同时，通过截断冗长且信息贫乏的轨迹来增强价值传播，从而加速收敛。我们的研究确立了随机重置作为一种简单、可调节的加速学习机制，将统计力学中的经典现象转化为强化学习的优化原理。

Abstract

Stochastic resetting, where a dynamical process is intermittently returned to a fixed reference state, has emerged as a powerful mechanism for optimizing first-passage properties. Existing theory largely treats static, non-learning processes. Here we ask how stochastic resetting interacts with reinforcement learning, where the underlying dynamics adapt through experience. In tabular grid environments, we find that resetting accelerates policy convergence even when it does not reduce the search time of a purely diffusive agent, indicating a novel mechanism beyond classical first-passage optimization. In a continuous control task with neural-network-based value approximation, we show that random resetting improves deep reinforcement learning when exploration is difficult and rewards are sparse. Unlike temporal discounting, resetting preserves the optimal policy while accelerating convergence by truncating long, uninformative trajectories to enhance value propagation. Our results establish stochastic resetting as a simple, tunable mechanism for accelerating learning, translating a canonical phenomenon of statistical mechanics into an optimization principle for reinforcement learning.
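The tabular setting can be sketched in a few lines. This is a toy illustration, not the paper's experiment: it uses a 1D chain rather than a grid, standard epsilon-greedy Q-learning, and made-up hyperparameters. The only resetting-specific line is the last one in the inner loop, where the agent is returned to the start state with probability `reset_prob` after each non-terminal step.

```python
import numpy as np

def q_learning_with_resetting(n_states=10, reset_prob=0.05, episodes=200,
                              max_steps=2000, lr=0.5, discount=0.95,
                              eps=0.1, seed=0):
    """Q-learning on a chain 0..n_states-1 with reward 1 at the right end."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, 2))                  # actions: 0 = left, 1 = right
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            if rng.random() < eps:
                a = int(rng.integers(2))         # explore
            else:                                # greedy with random tie-breaking
                a = int(rng.choice(np.flatnonzero(q[s] == q[s].max())))
            s_next = min(max(s + (1 if a == 1 else -1), 0), goal)
            done = s_next == goal
            reward = 1.0 if done else 0.0
            target = reward + (0.0 if done else discount * q[s_next].max())
            q[s, a] += lr * (target - q[s, a])   # update uses the real transition
            if done:
                break
            # Stochastic resetting: occasionally jump back to the start state.
            s = 0 if rng.random() < reset_prob else s_next
    return q
```

Comparing the number of updates needed for the greedy policy to point right everywhere, for `reset_prob=0.0` against small positive values, is one way to probe the effect the abstract describes; no outcome is claimed here.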

Keywords: Stochastic Resetting, Reinforcement Learning, Policy Convergence, Deep Reinforcement Learning, Value Propagation, Optimization Principle, Statistical Mechanics

261. ❌ Conditional Distributional Treatment Effects: Doubly Robust Estimation and Testing

Authors: Saksham Jain, Alex Luedtke Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16829v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

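Scoring rationale: The paper studies estimation and testing of conditional distributional treatment effects in causal inference, which belongs to statistics and econometrics. It involves no large models, deep learning, AI techniques, or AI-for-science applications, and none of the keywords (large-model principles, training methods, inference optimization, application scenarios, and so on) apply, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new estimand that captures conditional distributional treatment effects, develops a doubly robust estimator that is minimax optimal in the local asymptotic sense, and introduces a new test for the global homogeneity of conditional potential outcome distributions.

Abstract (Chinese translation)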

除条件平均处理效应外，干预还可能以协变量依赖的方式影响整个结果分布，例如通过改变特定亚群的方差或尾部风险。我们提出了一种新的估计量来捕捉此类条件分布处理效应，并开发了一种在局部渐近意义上达到极小极大最优的双重稳健估计器。基于此，我们构建了一种针对条件潜在结果分布全局同质性的检验方法，该方法能够容纳超出最大均值差异（MMD）的分布差异，具有可证明的一类错误控制有效性，并对固定备择假设具有一致性——据我们所知，这是该领域首个具备此类理论保证的检验方法。此外，我们推导了两种自然差异度量（包括MMD）的精确闭式表达式，并为该检验提供了一种无需置换、计算高效的算法。

Abstract

Beyond conditional average treatment effects, treatments may impact the entire outcome distribution in covariate-dependent ways, for example, by altering the variance or tail risks for specific subpopulations. We propose a novel estimand to capture such conditional distributional treatment effects, and develop a doubly robust estimator that is minimax optimal in the local asymptotic sense. Using this, we develop a test for the global homogeneity of conditional potential outcome distributions that accommodates discrepancies beyond the maximum mean discrepancy (MMD), has provably valid type 1 error, and is consistent against fixed alternatives – the first test, to our knowledge, with such guarantees in this setting. Furthermore, we derive exact closed-form expressions for two natural discrepancies (including the MMD), and provide a computationally efficient, permutation-free algorithm for our test.
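As background for the maximum mean discrepancy mentioned above, the sketch below computes the standard unbiased estimate of squared MMD between two samples with a Gaussian kernel. It is not the paper's doubly robust estimator or its test: there is no conditioning on covariates and no adjustment for treatment assignment, and the bandwidth is a free choice made here.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and b (m, d)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples x (n, d) and y (m, d)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # Drop diagonal terms so the within-sample averages are unbiased.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()
```

Two outcome samples with equal means but different variances give a clearly positive value, which is the kind of distributional effect a comparison of averages would miss.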

Keywords: Conditional Distributional Treatment Effects, Doubly Robust Estimation, Minimax Optimal, Global Homogeneity Test, Maximum Mean Discrepancy, Potential Outcome Distributions, Covariate-dependent Treatment Effects
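
The maximum mean discrepancy (MMD) named in the abstract above can be illustrated with a minimal sketch. This is the textbook biased (V-statistic) estimate of the squared MMD with a Gaussian kernel on scalar samples, not the paper's estimator or test; the function names, the bandwidth, and the toy "control" and "treated" samples are assumptions made for illustration only.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y), including x == y pairs."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical outcomes: same mean, different spread.
    control = [rng.gauss(0.0, 1.0) for _ in range(200)]
    treated = [rng.gauss(0.0, 3.0) for _ in range(200)]
    print("same sample:", mmd2_biased(control, control))
    print("control vs treated:", mmd2_biased(control, treated))
```

The toy samples share a mean and differ only in variance, which is the kind of effect a comparison of averages would miss and a distributional discrepancy can register.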

262. ❌ RaDAR: Relation-aware Diffusion-Asymmetric Graph Contrastive Learning for Recommendation

Authors: Yixuan Huang, Jiawei Chen, Shengfan Zhang, Zongsheng Cao Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16800v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

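Scoring rationale: The paper focuses on graph neural networks and graph contrastive learning for recommender systems, proposing a graph contrastive learning framework that combines diffusion models with relation awareness. The scoring keywords all concern large language models, innovations in deep learning principles, or AI applications in science, whereas this work studies conventional recommendation algorithms and touches none of those topics, so every keyword receives a relevance of 0.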

!!! tip deepseek-chat TL;DR

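The paper targets two weaknesses of graph contrastive learning in recommender systems: random edge perturbations that damage structural signals, and data sparsity. It proposes the RaDAR framework, which combines diffusion models with relation-aware graph contrastive learning and consistently outperforms state-of-the-art methods on three public benchmarks.

Abstract (translated)

Collaborative filtering (CF) recommendation has advanced significantly through the integration of Graph Neural Networks (GNNs) and Graph Contrastive Learning (GCL). However, (i) random edge perturbations often distort critical structural signals and undermine semantic consistency across augmented views, and (ii) data sparsity hampers the propagation of collaborative signals and limits generalization.

To address these challenges, we propose RaDAR (Relation-aware Diffusion-Asymmetric Graph Contrastive Learning Framework for Recommendation Systems), a novel framework that combines two complementary view generation mechanisms: a graph generative model that captures global structure, and a relation-aware denoising model that refines noisy edges.

RaDAR introduces three key innovations: (1) asymmetric contrastive learning with global negative sampling, which preserves semantic alignment while suppressing noise; (2) diffusion-guided augmentation, which improves robustness through progressive noise injection and denoising; and (3) relation-aware edge refinement, which dynamically adjusts edge weights based on latent node semantics.

Extensive experiments on three public benchmark datasets show that RaDAR consistently outperforms state-of-the-art methods, particularly under noisy and sparse conditions.

Abstract (original)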

Collaborative filtering (CF) recommendation has been significantly advanced by integrating Graph Neural Networks (GNNs) and Graph Contrastive Learning (GCL). However, (i) random edge perturbations often distort critical structural signals and degrade semantic consistency across augmented views, and (ii) data sparsity hampers the propagation of collaborative signals, limiting generalization. To tackle these challenges, we propose RaDAR (Relation-aware Diffusion-Asymmetric Graph Contrastive Learning Framework for Recommendation Systems), a novel framework that combines two complementary view generation mechanisms: a graph generative model to capture global structure and a relation-aware denoising model to refine noisy edges. RaDAR introduces three key innovations: (1) asymmetric contrastive learning with global negative sampling to maintain semantic alignment while suppressing noise; (2) diffusion-guided augmentation, which employs progressive noise injection and denoising for enhanced robustness; and (3) relation-aware edge refinement, dynamically adjusting edge weights based on latent node semantics. Extensive experiments on three public benchmarks demonstrate that RaDAR consistently outperforms state-of-the-art methods, particularly under noisy and sparse conditions.

Keywords: Recommendation Systems, Graph Neural Networks, Graph Contrastive Learning, Diffusion Models, Relation-aware, Edge Refinement, Collaborative Filtering, Data Sparsity

263. ❌ High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Authors: Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16798v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

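Scoring rationale: The paper studies the computational complexity of high-dimensional Gaussian mean estimation under the realizable contamination model, which belongs to theoretical computer science and statistical learning theory. It does not involve large models, deep learning, AI applications, or any technique named in the scoring keywords. The keywords all focus on large-model technology, training methods, inference optimization, alignment, and AI applications, whereas this is a purely statistical-theoretical analysis, so every keyword receives a relevance of 0.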

!!! tip deepseek-chat TL;DR

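The paper studies mean estimation for high-dimensional Gaussian distributions under the realizable contamination model and proves an information-computation gap in the Statistical Query model: an algorithm must either use far more samples than is information-theoretically necessary or run in exponential time.

Abstract (translated)

We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing-data mechanism called the realizable $ε$-contamination model. In this model, an adversary can choose a function $r(x)$ taking values between 0 and $ε$, and each sample $x$ goes missing with probability $r(x)$. Recent work (Ma et al., 2024) proposed this model as an intermediate-strength setting between Missing Completely At Random (MCAR), where missingness is independent of the data, and Missing Not At Random (MNAR), where missingness may depend arbitrarily on the sample values and can lead to non-identifiability. That work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model. Its proposed estimators have runtime exponential in the dimension, which leaves open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information-computation gap in the Statistical Query (SQ) model (and, as a corollary, for low-degree polynomial and PTF tests), showing that algorithms must either use substantially more samples than is information-theoretically necessary or incur exponential runtime. We complement the SQ lower bound with an algorithm whose sample-time tradeoff nearly matches our lower bound. Together, these results qualitatively characterize the complexity of Gaussian mean estimation under $ε$-realizable contamination.

Abstract (original)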

We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing data scheme termed realizable $ε$-contamination model. In this model an adversary can choose a function $r(x)$ between 0 and $ε$ and each sample $x$ goes missing with probability $r(x)$. Recent work Ma et al., 2024 proposed this model as an intermediate-strength setting between Missing Completely At Random (MCAR) – where missingness is independent of the data – and Missing Not At Random (MNAR) – where missingness may depend arbitrarily on the sample values and can lead to non-identifiability issues. That work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model. Their proposed estimators incur runtime exponential in the dimension, leaving open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information-computation gap in the Statistical Query model (and, as a corollary, for Low-Degree Polynomials and PTF tests), showing that algorithms must either use substantially more samples than information-theoretically necessary or incur exponential runtime. We complement our SQ lower bound with an algorithm whose sample-time tradeoff nearly matches our lower bound. Together, these results qualitatively characterize the complexity of Gaussian mean estimation under $ε$-realizable contamination.

Keywords: Gaussian mean estimation, realizable contamination, high-dimensional statistics, information-computation gap, Statistical Query model, missing data, computational complexity, sample-time tradeoff
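
The missing-data mechanism described in the abstract above, where each sample $x$ goes missing with probability $r(x)$ between 0 and $ε$, can be simulated in a few lines. The sketch below is one-dimensional and uses one hypothetical adversary (hide samples above the true mean with the maximum allowed probability); the paper works in $\mathbb{R}^d$ and does not prescribe this choice. It only shows why the naive average of the observed samples is biased and implements none of the paper's algorithms.

```python
import random


def drop_probability(x, eps, center):
    """Hypothetical adversary r(x): hide samples above `center` with probability eps."""
    return eps if x > center else 0.0


def simulate(n, true_mean, eps, seed):
    """Draw n unit-variance Gaussian samples and apply the missingness mechanism.

    Returns (mean of all samples, mean of the samples that stay observed).
    """
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        x = rng.gauss(true_mean, 1.0)
        full.append(x)
        # The sample goes missing with probability r(x); otherwise it is observed.
        if rng.random() >= drop_probability(x, eps, true_mean):
            observed.append(x)
    return sum(full) / len(full), sum(observed) / len(observed)


if __name__ == "__main__":
    full_mean, observed_mean = simulate(n=20000, true_mean=0.0, eps=0.3, seed=0)
    print("mean of all samples:     ", full_mean)
    print("mean of observed samples:", observed_mean)
```

Because this adversary only ever hides large values, the observed average is pulled below the true mean, and an estimator has to correct for every admissible $r(x)$ rather than for this one.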

264. ❌ Conservative Continuous-Time Treatment Optimization

Authors: Nora Schneider, Georg Manten, Niki Kilbertus Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16789v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

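Scoring rationale: The paper focuses on a continuous-time stochastic control framework for medical treatment optimization, modeling patient dynamics with stochastic differential equations and introducing a signature-based MMD regularizer to limit extrapolation. Although it is an application of AI in a scientific domain (medical optimization), it does not involve large language models, deep learning principles, or any of the specific large-model keywords listed. Its only link, and a weak one, is to "AI for Science" (medical applications fall within science), and even there it relies on classical control theory and statistical methods rather than on approaches typical of AI for Science such as large models or deep learning.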

!!! tip deepseek-chat TL;DR

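The paper proposes a conservative continuous-time stochastic control framework that uses a signature-based MMD regularizer to optimize treatment plans from irregularly sampled patient trajectories, limiting the exploitation of model errors and the risk of extrapolation. On benchmark datasets it shows better robustness and performance than non-conservative baselines.

Abstract (translated)

We propose a conservative continuous-time stochastic control framework for optimizing treatment plans from irregularly sampled patient trajectories. The unknown patient dynamics are modeled as a controlled stochastic differential equation, with treatment acting as a continuous-time control. Naive model-based optimization can exploit model errors and propose out-of-support controls, so optimizing the estimated dynamics does not necessarily optimize the true dynamics. To limit extrapolation, we add a consistent signature-based maximum mean discrepancy (MMD) regularizer on path space, which penalizes treatment plans whose induced trajectory distribution deviates from the observed data. The resulting objective minimizes a computable upper bound on the true cost. Experiments on benchmark datasets show improved robustness and performance compared with non-conservative baselines.

Abstract (original)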

We develop a conservative continuous-time stochastic control framework for treatment optimization from irregularly sampled patient trajectories. The unknown patient dynamics are modeled as a controlled stochastic differential equation with treatment as a continuous-time control. Naive model-based optimization can exploit model errors and propose out-of-support controls, so optimizing the estimated dynamics may not optimize the true dynamics. To limit extrapolation, we add a consistent signature-based MMD regularizer on path space that penalizes treatment plans whose induced trajectory distribution deviates from observed trajectories. The resulting objective minimizes a computable upper bound on the true cost. Experiments on benchmark datasets show improved robustness and performance compared to non-conservative baselines.

Keywords: treatment optimization, continuous-time stochastic control, irregularly sampled trajectories, stochastic differential equation, MMD regularizer, conservative optimization, patient dynamics, trajectory distribution

265. ❌ pADAM: A Plug-and-Play All-in-One Diffusion Architecture for Multi-Physics Learning

Authors: Amirhossein Mollaali, Bongseok Kim, Christian Moya, Guang Lin Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16757v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

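Scoring rationale: The pADAM paper focuses on a diffusion architecture for multi-physics learning and therefore belongs to AI for Science. It is unrelated to all the other keywords (LLM, MoE, RLHF, RAG, and so on), because those refer specifically to large language model technology, whereas this work studies generative models for solving physical equations and involves neither language models nor related techniques.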

!!! tip deepseek-chat TL;DR

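The paper proposes pADAM, a unified generative framework that learns a shared probabilistic prior across heterogeneous families of partial differential equations, enabling accurate inference under sparse observations, uncertainty quantification, and probabilistic model selection.

Abstract (translated)

Generalizing across different physical laws remains a core challenge for AI in science. Existing deep-learning solvers are mostly confined to single-equation settings and transfer poorly across physical regimes and inference tasks. We introduce pADAM, a unified generative framework that learns a shared probabilistic prior across heterogeneous families of partial differential equations. By learning a joint distribution over system states and, where applicable, physical parameters, it supports both forward prediction and inverse inference within a single architecture without retraining. On benchmarks ranging from scalar diffusion to the nonlinear Navier-Stokes equations, pADAM achieves accurate inference even under sparse observations. Combined with conformal prediction, the framework also provides reliable uncertainty quantification with coverage guarantees. In addition, pADAM performs probabilistic model selection from only two sparse snapshots, identifying the governing physical law through its learned generative representation. These results highlight the potential of generative multi-physics modeling for unified, uncertainty-aware scientific inference.

Abstract (original)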

Generalizing across disparate physical laws remains a fundamental challenge for artificial intelligence in science. Existing deep-learning solvers are largely confined to single-equation settings, limiting transfer across physical regimes and inference tasks. Here we introduce pADAM, a unified generative framework that learns a shared probabilistic prior across heterogeneous partial differential equation families. Through a learned joint distribution of system states and, where applicable, physical parameters, pADAM supports forward prediction and inverse inference within a single architecture without retraining. Across benchmarks ranging from scalar diffusion to nonlinear Navier–Stokes equations, pADAM achieves accurate inference even under sparse observations. Combined with conformal prediction, it also provides reliable uncertainty quantification with coverage guarantees. In addition, pADAM performs probabilistic model selection from only two sparse snapshots, identifying governing laws through its learned generative representation. These results highlight the potential of generative multi-physics modeling for unified and uncertainty-aware scientific inference.

Keywords: generative modeling, multi-physics learning, partial differential equations, probabilistic inference, uncertainty quantification, conformal prediction, Navier-Stokes equations, model selection
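
The abstract above pairs pADAM with conformal prediction to obtain coverage guarantees. As a rough illustration of that idea only, the sketch below implements generic split conformal prediction for a scalar point predictor: absolute residuals on a held-out calibration set give an interval radius with at least $1-\alpha$ marginal coverage under exchangeability. How the paper applies conformal prediction to PDE fields is not specified here, and the function names are hypothetical.

```python
import math


def conformal_radius(calibration_residuals, alpha):
    """Split-conformal radius: the ceil((n + 1) * (1 - alpha))-th smallest residual.

    `calibration_residuals` are absolute errors |y - y_hat| on held-out data.
    """
    n = len(calibration_residuals)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points to certify this coverage level.
        return math.inf
    return sorted(calibration_residuals)[rank - 1]


def prediction_interval(point_prediction, radius):
    """Symmetric interval around a new point prediction."""
    return point_prediction - radius, point_prediction + radius


if __name__ == "__main__":
    residuals = [0.4, 0.1, 0.9, 0.3, 0.2, 0.7, 0.5, 0.6, 0.8]
    radius = conformal_radius(residuals, alpha=0.2)
    print("radius:", radius)
    print("interval:", prediction_interval(2.0, radius))
```

The guarantee is marginal over calibration and test points and does not depend on how good the underlying predictor is; a poor predictor simply yields wider intervals.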

266. ❌ A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems

Authors: Wei Min Loh, Sajib Kumer Sinha, Ankur Agarwal, Pascal Poupart Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16755v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

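Scoring rationale: The paper studies contextual bandits and proposes an algorithm called C3 Thompson sampling for complex bandit settings with dense arm features, non-linear reward functions, and non-stationary correlations. It is focused entirely on bandit algorithms within reinforcement learning and does not involve large language models, deep learning principles, or AI applications in science. The keywords all revolve around large-model technology, training methods, inference optimization, alignment, and applications, which are unrelated to this work.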

!!! tip deepseek-chat TL;DR

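The paper proposes C3 Thompson sampling for complex contextual bandit problems with dense arm features, non-linear reward functions, and non-stationary correlated rewards. In experiments it achieves 5.7% lower average cumulative regret than the next-best algorithm on four OpenML tabular datasets and a 12.4% click lift on the Microsoft News Dataset (MIND).

Abstract (translated)

Contextual bandits are extremely useful in many practical problems. We go a step further by formulating a more realistic problem that combines (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits in which reward distributions change over time while the degree of correlation is maintained. This formulation applies to a wider range of applications, such as recommendation tasks. To solve it, we introduce conditionally coupled contextual (C3) Thompson sampling for Bernoulli bandits. The method combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling that supports online learning without retraining. Experiments show that C3 outperforms the next-best algorithm with 5.7% lower average cumulative regret on four OpenML tabular datasets, and achieves a 12.4% click lift on the Microsoft News Dataset (MIND) compared with other algorithms.

Abstract (original)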

Contextual bandits are incredibly useful in many practical problems. We go one step further by devising a more realistic problem that combines: (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits where reward distributions change over time but the degree of correlation maintains. This formulation lends itself to a wider set of applications such as recommendation tasks. To solve this problem, we introduce conditionally coupled contextual C3 Thompson sampling for Bernoulli bandits. It combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling that allows online learning without retraining. Empirical results show that C3 outperforms the next best algorithm by 5.7% lower average cumulative regret on four OpenML tabular datasets as well as demonstrating a 12.4% click lift on Microsoft News Dataset (MIND) compared to other algorithms.

Keywords: contextual bandits, Thompson sampling, non-stationary bandits, dense arm features, nonlinear reward functions, cumulative regret, recommendation systems, online learning
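
For readers unfamiliar with the building block named in the abstract above, here is a minimal sketch of plain Beta-Bernoulli Thompson sampling. It is not the C3 algorithm: there are no contexts, no arm features, no Nadaraya-Watson estimator, and the rewards are stationary. The arm probabilities and function name are assumptions chosen for illustration.

```python
import random


def thompson_sampling(true_probs, rounds, rng):
    """Beta-Bernoulli Thompson sampling with a uniform Beta(1, 1) prior per arm.

    Returns (pull counts, success counts, failure counts).
    """
    n_arms = len(true_probs)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Sample a plausible click rate for each arm from its posterior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_probs[arm] else 0
        pulls[arm] += 1
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls, successes, failures


if __name__ == "__main__":
    pulls, _, _ = thompson_sampling([0.1, 0.9], rounds=2000, rng=random.Random(0))
    print("pulls per arm:", pulls)
```

Each round updates only two counters for the chosen arm, so the posterior is refreshed online with no retraining step, which is the property the abstract highlights for the Thompson sampling component.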

267. ❌ Data-driven forced response analysis with min-max representations of nonlinear restoring forces

Authors: Akira Saito, Hiromu Fujita Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16746v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

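Scoring rationale: The paper studies data-driven nonlinearity identification for mechanical systems, approximating nonlinear restoring forces with activation functions from the artificial neural network framework, realized as piecewise linear springs. Although artificial neural networks are involved, the work is about system identification and modeling in conventional mechanical engineering and does not touch large language models (LLMs), innovations in deep learning principles, or concrete AI for Science applications. None of the keywords apply, so every keyword receives a relevance of 0.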

!!! tip deepseek-chat TL;DR

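The paper proposes a data-driven method, based on artificial neural network activation functions, for identifying nonlinear restoring forces in mechanical systems. Using piecewise linear springs, it models a Duffing oscillator and the magnetic force acting on a cantilevered plate, and it validates the resulting models in forced response analysis.

Abstract (translated)

This paper discusses a novel data-driven nonlinearity identification method for mechanical systems with nonlinear restoring forces, such as polynomial, piecewise linear, and general displacement-dependent nonlinearities. The method builds on the universal approximation theorem, which states that a nonlinear function can be approximated by a linear combination of activation functions in the artificial neural network framework. The proposed approach uses piecewise linear springs with initial gaps as the activation functions of the network's neurons. A library of piecewise linear springs with initial gaps is constructed, and the contribution of each spring to the nonlinear restoring force is determined by solving linear regression problems. The piecewise linear springs are realized as combinations of min and max functions with biases. The method is applied to a Duffing oscillator with cubic stiffness and to a piecewise linear oscillator with a gap, and their nonlinearities are successfully identified from their free responses. The resulting models are then used for forced response analysis, and the results agree well with those of the original systems. The method is also applied to experimentally obtained free response data from a cantilevered plate subjected to a magnetic restoring force, and it successfully obtains a piecewise linear representation of the magnetic force. The study further shows that the identified model accurately captures the steady-state response of the system under harmonic base excitation.

Abstract (original)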

This paper discusses a novel data-driven nonlinearity identification method for mechanical systems with nonlinear restoring forces such as polynomial, piecewise-linear, and general displacement-dependent nonlinearities. The proposed method is built upon the universal approximation theorem that states that a nonlinear function can be approximated by a linear combination of activation functions in artificial neural network framework. The proposed approach utilizes piecewise linear springs with initial gaps to act as the activation functions of the neurons of artificial neural networks. A library of piecewise linear springs with initial gaps are constructed, and the contributions of the springs on the nonlinear restoring force are determined by solving the linear regression problems. The piecewise linear springs are realized by combinations of min and max functions with biases. The proposed method is applied to a Duffing oscillator with cubic stiffness, and a piecewise linear oscillator with a gap and their nonlinearities are successfully determined from their free responses. The obtained models are then used for conducting forced response analysis and the results match well with those of the original system. The method is then applied to experimentally-obtained free response data of a cantilevered plate that is subjected to magnetic restoring force, and successfully finds the piecewise linear representation of the magnetic force. It is also shown that the obtained model is capable of accurately capturing the steady-state response of the system subject to harmonic base excitation.

Keywords: data-driven nonlinearity identification, nonlinear restoring forces, piecewise linear springs, artificial neural networks, activation functions, Duffing oscillator, forced response analysis, magnetic restoring force
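
The identification idea in the abstract above (a library of gap springs built from min and max, with stiffnesses found by linear regression) can be sketched as follows. This is a simplified illustration under stated assumptions: the force is sampled directly on a displacement grid rather than recovered from free-response data as in the paper, the gap values are arbitrary, and the symmetric spring form `max(x - gap, 0) + min(x + gap, 0)` is one plausible reading of "min and max functions with biases", not necessarily the authors' exact parameterization.

```python
def gap_spring(x, gap):
    """Unit-stiffness spring that engages only once |x| exceeds `gap`."""
    return max(x - gap, 0.0) + min(x + gap, 0.0)


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_spring_library(xs, forces, gaps):
    """Least-squares stiffness of each library spring via the normal equations."""
    phi = [[gap_spring(x, g) for g in gaps] for x in xs]
    k = len(gaps)
    ata = [[sum(row[i] * row[j] for row in phi) for j in range(k)] for i in range(k)]
    atb = [sum(row[i] * f for row, f in zip(phi, forces)) for i in range(k)]
    return solve_linear_system(ata, atb)


def predict(x, gaps, stiffness):
    """Restoring force of the fitted spring library at displacement x."""
    return sum(k * gap_spring(x, g) for k, g in zip(stiffness, gaps))


def fit_demo():
    """Fit a cubic (Duffing-type) force term on [-1, 1]; return (stiffness, RMS error)."""
    xs = [i / 100.0 for i in range(-100, 101)]
    forces = [x ** 3 for x in xs]
    gaps = [0.0, 0.25, 0.5, 0.75]  # hypothetical library of initial gaps
    stiffness = fit_spring_library(xs, forces, gaps)
    sq_err = sum((predict(x, gaps, stiffness) - f) ** 2 for x, f in zip(xs, forces))
    return stiffness, (sq_err / len(xs)) ** 0.5


if __name__ == "__main__":
    stiffness, rms = fit_demo()
    print("stiffness per spring:", stiffness)
    print("RMS fit error:", rms)
```

Because the model is linear in the spring stiffnesses, the fit reduces to a small linear system; a finer set of gaps would tighten the piecewise linear approximation at the cost of more unknowns.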

268. ❌ Bayesian Inference of Psychometric Variables From Brain and Behavior in Implicit Association Tests

Authors: Christian A. Kothe, Sean Mullen, Michael V. Bronstein, Grant Hanada, Marcelo Cicconet, Aaron N. McInnes, Tim Mullen, Marc Aafjes, Scott R. Sponheim, Alik S. Widge Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16741v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

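Scoring rationale: The paper focuses on inferring psychometric variables from neural and behavioral data with a sparse hierarchical Bayesian model, an application of AI in science (specifically mental health). It does not involve large models, deep learning principles, or any other technical topic in the keyword list. It is related only to the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", because it applies AI methods to mental health research (which can be viewed as a subfield of bioinformatics or scientific AI); since this is not the paper's core content, it receives 5 points. All other keywords are entirely unrelated and receive 0.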

!!! tip deepseek-chat TL;DR

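The study proposes a sparse hierarchical Bayesian model that uses multi-modal data from Implicit Association Tests to infer psychometric variables related to mental health. It reaches AUCs of 0.73 to 0.79 on suicidality- and psychosis-related tasks, outperforming the conventional D-score method.

Abstract (translated)

Objective. We establish a principled method for inferring mental-health-related psychometric variables from neural and behavioral data, using the Implicit Association Test (IAT) as the data generation engine. The aim is to overcome the limited predictive performance (typically below 0.7 AUC) of the current gold-standard D-score method, which relies solely on reaction times.

Approach. We propose a sparse hierarchical Bayesian model that uses multi-modal data to predict experiences related to mental illness symptoms in new participants. The model is a multivariate generalization of the D-score with trainable parameters, engineered for parameter efficiency in the small-cohort regime typical of IAT studies. We analyzed data from two IAT variants: a suicidality-related E-IAT (n=39) and a psychosis-related PSY-IAT (n=34).

Main results. Our approach overcomes the high inter-individual variability and low within-session effect size in the datasets, reaching AUCs of 0.73 (E-IAT) and 0.76 (PSY-IAT) in the best modality configurations, although the corrected 95% confidence intervals are wide (±0.18) and the results are only marginally significant after false discovery rate (FDR) correction (q=0.10). Restricting the E-IAT data to participants with major depressive disorder (MDD) raises the AUC to 0.79 [0.62, 0.97] (significant at q=0.05). Performance is on par with the best reference method for each task (shrinkage LDA and EEGNet), even though the reference methods were adapted to the task and the proposed method was not. In both tasks, accuracy is substantially above that of the near-chance D-score (AUC 0.50-0.53), and cross-task performance is more consistent than that of any single reference method.

Significance. Our framework shows promise for improving IAT-based assessment of experiences related to entrapment and psychosis, and potentially other mental health conditions, but further validation on larger, independent cohorts is needed to establish clinical utility.

Abstract (original)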

Objective. We establish a principled method for inferring mental health related psychometric variables from neural and behavioral data using the Implicit Association Test (IAT) as the data generation engine, aiming to overcome the limited predictive performance (typically under 0.7 AUC) of the gold-standard D-score method, which relies solely on reaction times. Approach. We propose a sparse hierarchical Bayesian model that leverages multi-modal data to predict experiences related to mental illness symptoms in new participants. The model is a multivariate generalization of the D-score with trainable parameters, engineered for parameter efficiency in the small-cohort regime typical of IAT studies. Data from two IAT variants were analyzed: a suicidality-related E-IAT ($n=39$) and a psychosis-related PSY-IAT ($n=34$). Main Results. Our approach overcomes a high inter-individual variability and low within-session effect size in the dataset, reaching AUCs of 0.73 (E-IAT) and 0.76 (PSY-IAT) in the best modality configurations, though corrected 95% confidence intervals are wide ($\pm 0.18$) and results are marginally significant after FDR correction ($q=0.10$). Restricting the E-IAT to MDD participants improves AUC to 0.79 $[0.62, 0.97]$ (significant at $q=0.05$). Performance is on par with the best reference methods (shrinkage LDA and EEGNet) for each task, even when the latter were adapted to the task, while the proposed method was not. Accuracy was substantially above near-chance D-scores (0.50-0.53 AUC) in both tasks, with more consistent cross-task performance than any single reference method. Significance. Our framework shows promise for enhancing IAT-based assessment of experiences related to entrapment and psychosis, and potentially other mental health conditions, though further validation on larger and independent cohorts will be needed to establish clinical utility.

Keywords: Bayesian inference, psychometric variables, Implicit Association Test, mental health, sparse hierarchical model, multi-modal data, AUC, suicidality and psychosis

269. ❌ Understanding Quantization of Optimizer States in LLM Pre-training: Dynamics of State Staleness and Effectiveness of State Resets

Authors: Kristi Topollai, Anna Choromanska Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16731v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

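Scoring rationale: The paper's core topic is the quantization of optimizer states in LLM pre-training, so it is highly relevant to 'Large Language Models' and 'Pre-training' (10 points each) and directly studies 'Quantization' (10 points). By analyzing how quantization causes EMA states to stall, it offers an explanation at the level of 'Mechanistic Interpretability' (5 points). The remaining keywords, such as MoE, SFT, RAG, and inference acceleration, are not addressed (0 points).

!!! tip deepseek-chat TL;DR

This paper studies the stalling of EMA states caused by quantizing optimizer states in LLM pre-training. Through theoretical analysis and experiments, it proposes a state-reset approach that recovers performance while reducing memory use.

Abstract (Chinese translation)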

量化优化器状态正成为内存高效大规模预训练的重要技术手段，但其引发的优化器动态机制尚未被充分理解。本研究聚焦低精度指数移动平均（EMA）优化器状态，揭示了量化如何导致大量名义更新因舍入误差而回归至同一存储值，从而使状态实际陷入陈旧化，其适应速度的减缓程度远超名义衰减率所预示的范围。我们进而构建了一个简化的停滞预测模型，该模型可估算单步停滞概率，并刻画了初始化后停滞效应随时间累积的动态过程。这一视角从机制上解释了为何低精度环境下重置优化器状态具有积极作用：一旦量化EMA状态实质上陷入停滞，重置操作能暂时恢复其响应能力。基于此理论框架，我们推导出一种理论指导的简易方法，用于选择有效的重置周期。研究表明，在低精度场景中，关键问题不仅在于重置是否有效，更在于何时实施重置。通过受控模拟与大语言模型预训练实验验证，合适的重置策略能够有效弥补因低精度状态存储导致的性能损失，同时显著降低优化器状态的内存占用。

Abstract

Quantizing optimizer states is becoming an important ingredient of memory-efficient large-scale pre-training, but the resulting optimizer dynamics remain only partially understood. We study low-precision exponential moving average (EMA) optimizer states and show how quantization can cause many nominal updates to round back to the same stored value, making the state effectively stale and slowing adaptation beyond what the nominal decay would suggest. We then develop a simple predictive model of stalling that estimates one-step stalling probabilities and characterizes how stalling builds up over time after the initialization. This perspective provides a mechanistic explanation for why optimizer-state resets help in low precision: once a quantized EMA becomes effectively stale, resetting it can temporarily restore responsiveness. Motivated by this picture, we derive a simple theory-guided method for choosing useful reset periods, showing that in low precision the key question is not only whether resets help, but when they should be applied. Experiments in controlled simulations and LLM pre-training show that suitable reset schedules recover the performance lost to low-precision state storage while substantially reducing optimizer-state memory.

Keywords: Quantization, Optimizer States, LLM Pre-training, EMA, State Staleness, State Resets, Memory Efficiency
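
The stalling effect described in this entry's abstract can be illustrated with a toy simulation. The sketch below is an illustration under stated assumptions, not the paper's method: `quantize` is a hypothetical round-to-nearest quantizer on a uniform grid, and `run_ema` applies the standard update m ← βm + (1−β)g and then stores the result on that grid. When the nominal change (1−β)(g−m) is smaller than half a grid step, the stored value rounds back to where it was.

```python
def quantize(x, step):
    """Round x to the nearest multiple of `step` (assumed toy quantizer)."""
    return round(x / step) * step


def run_ema(grads, beta, step=None):
    """Run m <- beta*m + (1-beta)*g starting from m = 0.

    If `step` is given, the state is re-quantized after every update,
    mimicking low-precision storage; otherwise it stays in full precision.
    """
    m = 0.0
    history = []
    for g in grads:
        m = beta * m + (1.0 - beta) * g
        if step is not None:
            m = quantize(m, step)
        history.append(m)
    return history


# Constant gradient signal of 1.0 (hypothetical data).
grads = [1.0] * 1000
full = run_ema(grads, beta=0.99)
low = run_ema(grads, beta=0.99, step=2 ** -4)
print(f"full precision: {full[-1]:.4f}, quantized: {low[-1]:.4f}")
```

With these assumed settings each nominal update is about 0.01, below half the grid step of 0.0625, so the quantized state never leaves zero while the full-precision EMA approaches 1.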

270. ❌ GeMA: Learning Latent Manifold Frontiers for Benchmarking Complex Systems

Authors: Jia Ming Li, Anupriya, Daniel J. Graham Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16729v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

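Scoring rationale: GeMA proposes a geometric manifold analysis method based on a variational autoencoder (VAE) for benchmarking complex systems such as rail networks, renewable energy assets, and national economies. Although the paper involves machine learning (a VAE) and data analysis, its core contribution is methodological innovation in operations research, economics, and system efficiency evaluation, not in the principles of large models or deep learning techniques. All scoring keywords concern specific technical areas such as large language models, deep learning techniques, AI alignment, reasoning, agents, and optimization, none of which the paper addresses. Even for the "AI for Science" keyword, the paper is applied to scientific fields (economics, transport planning) but does not use AI for scientific discovery or for bioinformatics or cheminformatics analysis, so its relevance is 0.

!!! tip deepseek-chat TL;DR

This paper proposes Geometric Manifold Analysis (GeMA), which learns a latent manifold frontier with a productivity-manifold variational autoencoder to evaluate the efficiency of complex systems (such as rail networks and national economies) more flexibly, addressing the limitations of traditional frontier methods with respect to heterogeneity, non-convexity, and scale bias.

Abstract (Chinese translation)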

对铁路网络、可再生能源发电资产及国民经济等复杂系统的性能进行基准测试，是交通规划、行业监管与宏观经济分析的核心工作。经典前沿方法，特别是数据包络分析（DEA）与随机前沿分析（SFA），通过在观测到的投入-产出空间中估计一个有效前沿，并将效率定义为与该前沿的距离来进行评估。然而，这些方法依赖于对生产集的严格假设，且仅能间接处理异质性与规模效应问题。

我们提出了几何流形分析（GeMA），这是一种通过生产力-流形变分自编码器（ProMan-VAE）实现的潜在流形前沿框架。GeMA并非在观测空间中设定一个前沿函数，而是将生产集表示为嵌入在联合投入-产出空间中的一个低维流形的边界。其采用分头编码器学习潜在变量，以捕捉技术结构与运营无效率。效率评估基于学习到的流形进行；内生的同侪群体作为潜在技术空间中的聚类自然涌现；商结构支持规模不变的基准测试；此外，通过解码器雅可比矩阵与利普希茨边界导出的局部认证半径，能够量化效率得分的几何稳健性。

我们在具有非凸前沿、异质技术与规模偏差的合成数据上，以及在四个真实案例研究中验证了GeMA的有效性：全球城市铁路系统（COMET）、英国铁路运营商（ORR）、国民经济体（佩恩世界表）以及一个高频风电场数据集。在这些领域中，当经典假设成立时，GeMA的表现与既有方法相当；而在存在显著异质性、非凸性或规模相关偏差的场景中，GeMA能提供更深入的洞察。

Abstract

Benchmarking the performance of complex systems such as rail networks, renewable generation assets and national economies is central to transport planning, regulation and macroeconomic analysis. Classical frontier methods, notably Data Envelopment Analysis (DEA) and Stochastic Frontier Analysis (SFA), estimate an efficient frontier in the observed input-output space and define efficiency as distance to this frontier, but rely on restrictive assumptions on the production set and only indirectly address heterogeneity and scale effects. We propose Geometric Manifold Analysis (GeMA), a latent manifold frontier framework implemented via a productivity-manifold variational autoencoder (ProMan-VAE). Instead of specifying a frontier function in the observed space, GeMA represents the production set as the boundary of a low-dimensional manifold embedded in the joint input-output space. A split-head encoder learns latent variables that capture technological structure and operational inefficiency. Efficiency is evaluated with respect to the learned manifold, endogenous peer groups arise as clusters in latent technology space, a quotient construction supports scale-invariant benchmarking, and a local certification radius, derived from the decoder Jacobian and a Lipschitz bound, quantifies the geometric robustness of efficiency scores. We validate GeMA on synthetic data with non-convex frontiers, heterogeneous technologies and scale bias, and on four real-world case studies: global urban rail systems (COMET), British rail operators (ORR), national economies (Penn World Table) and a high-frequency wind-farm dataset. Across these domains GeMA behaves comparably to established methods when classical assumptions hold, and provides additional insight in settings with pronounced heterogeneity, non-convexity or size-related bias.

Keywords: Geometric Manifold Analysis, variational autoencoder, efficiency benchmarking, complex systems, latent manifold, non-convex frontiers, scale-invariant benchmarking, production set
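
For contrast with the learned manifold, the classical notion of efficiency as distance to a frontier can be shown in its simplest form. The sketch below assumes one input, one output, and constant returns to scale, in which case the DEA efficiency score reduces to each unit's output/input ratio divided by the best observed ratio. It illustrates the classical baseline mentioned in the abstract, not GeMA or ProMan-VAE; the function name and the sample units are hypothetical.

```python
def crs_efficiency(inputs, outputs):
    """Single-input, single-output DEA scores under constant returns to scale.

    Each unit's productivity ratio y/x is compared with the best observed
    ratio, which defines the efficient frontier in this simple setting.
    """
    ratios = [y / x for x, y in zip(inputs, outputs)]
    best = max(ratios)
    return [r / best for r in ratios]


# Hypothetical units given as (input, output) pairs.
units = {"A": (2.0, 2.0), "B": (4.0, 2.0), "C": (5.0, 5.0)}
scores = crs_efficiency(
    [x for x, _ in units.values()],
    [y for _, y in units.values()],
)
for name, score in zip(units, scores):
    print(name, score)
```

Units A and C lie on the frontier, while B produces half as much output per unit of input. This single-ratio view is exactly what breaks down under the heterogeneity and non-convexity the abstract discusses.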

271. ❌ The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models

Authors: Robert Welch, Emir Konuk, Kevin Smith Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16728v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	7.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	6.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

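Scoring rationale: The paper studies how chain-of-thought (CoT) reasoning affects uncertainty estimation in vision-language models (VLMs), finding that reasoning degrades the quality of uncertainty estimates and makes models overconfident. It is highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points), the core subject of the study; relevant to 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 points), since it involves extended reasoning; relevant to 'Hallucination Mitigation OR Factuality OR Truthfulness' (7 points), since it concerns model reliability and factuality; relevant to 'Mechanistic Interpretability OR Explainable AI' (6 points), since it analyzes the mechanism behind model behavior; somewhat related to 'Self-Correction OR Self-Improvement OR Self-Reflection' (5 points), since it involves model self-assessment; and relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points), since VLMs are a kind of large model. The other keywords are unrelated to the paper or not mentioned in it and receive 0 points.

!!! tip deepseek-chat TL;DR

This paper finds that in vision-language models, chain-of-thought reasoning can improve task accuracy yet degrades the quality of uncertainty estimates and makes the model overconfident, mainly because of implicit answer conditioning during reasoning.

Abstract (Chinese translation)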

视觉语言模型（VLMs）正日益被部署于高风险场景中，其中可靠的不确定性量化（UQ）与预测准确性同等重要。通过思维链（CoT）提示或经过推理训练的模型进行扩展推理，在现代VLM流程中已无处不在，但其对UQ可靠性的影响仍鲜为人知。我们发现，即使推理提高了任务准确性，它仍会持续降低大多数不确定性估计的质量。我们指出，隐含答案条件化是其主要机制：当推理轨迹在最终答案生成前收敛于某个结论时，词元概率越来越反映与模型自身推理轨迹的一致性，而非对正确性的不确定性。实际上，模型变得对其答案过度自信。相比之下，基于一致性的方法在推理条件下保持稳健且常有所改善，这使其成为具备推理能力的VLMs中进行不确定性估计的实用选择。

Abstract

Vision-language models (VLMs) are increasingly deployed in high-stakes settings where reliable uncertainty quantification (UQ) is as important as predictive accuracy. Extended reasoning via chain-of-thought (CoT) prompting or reasoning-trained models has become ubiquitous in modern VLM pipelines, yet its effect on UQ reliability remains poorly understood. We show that reasoning consistently degrades the quality of most uncertainty estimates, even when it improves task accuracy. We identify implicit answer conditioning as the primary mechanism: as reasoning traces converge on a conclusion before the final answer is generated, token probabilities increasingly reflect consistency with the model’s own reasoning trace rather than uncertainty about correctness. In effect, the model becomes overconfident in its answer. In contrast, agreement-based consistency remains robust and often improves under reasoning, making it a practical choice for uncertainty estimation in reasoning-enabled VLMs.

Keywords: Vision-language models, Chain-of-thought, Uncertainty quantification, Overconfidence, Implicit answer conditioning, Reasoning reliability, Token probabilities, Consistency-based estimation
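
The agreement-based alternative to token probabilities can be sketched in a few lines. The abstract does not specify the exact consistency measure, so the version below is an assumption: it treats confidence as the share of independently sampled final answers that match the majority answer. The function name and the sample answers are hypothetical.

```python
from collections import Counter


def agreement_confidence(answers):
    """Return the majority answer and the fraction of samples agreeing with it.

    `answers` holds final answers from several independent samples of the
    same question; reasoning traces are ignored on purpose.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Four hypothetical sampled answers to one visual question.
answer, confidence = agreement_confidence(["cat", "cat", "dog", "cat"])
print(answer, confidence)
```

Because the score depends only on how often samples agree, it does not inherit the inflated token probabilities that arise once a reasoning trace has already committed to a conclusion.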

272. ❌ Novelty-Driven Target-Space Discovery in Automated Electron and Scanning Probe Microscopy

Authors: Utkarsh Pratiush, Kamyar Barakati, Boris N. Slautin, Catherine C. Bodinger, Christopher D. Lowe, Brandi M. Cossairt, Sergei V. Kalinin Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16715v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

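Scoring rationale: The paper focuses on target-space discovery in automated microscopy and proposes BEACON, a deep-kernel-learning framework that learns structure-property relationships during the experiment and uses them to guide discovery. Its core is the application of deep learning to a scientific instrument (scanning transmission electron microscopy), which falls under AI for Science, so it is only somewhat related to the 'AI for Science OR Bioinformatics OR Cheminformatics' keyword (5 points): it applies AI to scientific experiments but does not explicitly involve bioinformatics or cheminformatics. All other keywords, which concern large models, training methods, reasoning techniques, agent systems, and so on, are unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

This paper addresses the challenge that scientific discovery in automated microscopy depends on the target space rather than on directly visible image features. It proposes BEACON, a deep-kernel-learning framework that actively explores diverse response regimes by learning structure-property relationships online, and deploys it on a scanning transmission electron microscope, moving from offline validation to real experiments.

Abstract (Chinese translation)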

现代自动化显微技术面临一个根本性的发现挑战：在许多系统中，最重要的科学信息并不存在于直接可见的图像特征中，而是存在于顺序获取的光谱或功能响应的目标空间内，这使得开发能够主动搜寻新行为而非简单优化已知目标的策略变得至关重要。为此，我们开发了一种深度核学习BEACON框架，该框架通过在实验过程中学习结构-性质关系，并利用这一不断演化的模型来探索多样化的响应区域，从而明确设计用于引导目标空间中的科学发现。我们首先基于预先获取的真实数据集构建演示工作流程来验证该方法，这使得能够直接与经典采集策略进行基准比较，并允许我们以透明且可重复的方式定义一组监控函数，用于比较探索质量、目标空间覆盖度以及代理模型行为。这一基准测试框架为评估发现驱动算法提供了实用基础，而不仅仅是优化性能。随后，我们在扫描透射电子显微镜（STEM）上实现并部署了该工作流程，表明该方法能够从离线验证过渡到真实的实验实施。为支持更广泛的研究群体采用和扩展本方法，相关代码笔记本已公开，用户可借此复现工作流程、测试基准，并将该方法适配至各自的仪器和数据集。

Abstract

Modern automated microscopy faces a fundamental discovery challenge: in many systems, the most important scientific information does not reside in the immediately visible image features, but in the target space of sequentially acquired spectra or functional responses, making it essential to develop strategies that can actively search for new behaviors rather than simply optimize known objectives. Here, we developed a deep-kernel-learning BEACON framework that is explicitly designed to guide discovery in the target space by learning structure-property relationships during the experiment and using that evolving model to seek diverse response regimes. We first established the method through demonstration workflows built on pre-acquired ground-truth datasets, which enabled direct benchmarking against classical acquisition strategies and allowed us to define a set of monitoring functions for comparing exploration quality, target-space coverage, and surrogate-model behavior in a transparent and reproducible manner. This benchmarking framework provides a practical basis for evaluating discovery-driven algorithms, not just optimization performance. We then operationalized and deployed the workflow on STEM, showing that the approach can transition from offline validation to real experimental implementation. To support adoption and extension by the broader community, the associated notebooks are available, allowing users to reproduce the workflows, test the benchmarks, and adapt the method to their own instruments and datasets.

Keywords: automated microscopy, target-space discovery, deep kernel learning, BEACON framework, structure-property relationships, scanning transmission electron microscopy, active exploration, experimental implementation

273. ❌ High-dimensional estimation with missing data: Statistical and computational limits

Authors: Kabir Aladin Verchand, Ankit Pensia, Saminul Haque, Rohith Kuditipudi Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16712v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

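Scoring rationale: The paper studies statistical estimation with high-dimensional missing data, which belongs to classical statistical machine learning. It mainly concerns classical problems such as Gaussian distributions, mean estimation, covariance estimation, and linear regression, and examines statistical-computational gaps. All scoring keywords concern large models, deep learning, AI applications, or related technical principles, none of which the paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper studies statistical estimation with high-dimensional missing data and finds evidence of statistical-computational gaps for mean and covariance estimation with Gaussian data, whereas no such gap arises for linear regression.

Abstract (Chinese translation)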

我们研究在观测数据存在缺失情况下的计算高效总体参数估计问题。具体而言，我们考虑在可实现的缺失数据污染模型下进行估计，其中观测数据中比例为 $ε$ 的部分受到任意（且未知）的“非随机缺失”（Missing Not At Random, MNAR）机制影响。当真实数据服从高斯分布时，我们为若干问题提供了统计-计算间隙存在的证据。针对 $\ell_2$ 范数下的均值估计，我们证明：对于任意常数污染比例 $ε\in (0, 1)$，为获得至多 $ρ$ 的误差，至少需要（约）$n \gtrsim d e^{1/ρ^2}$ 个样本，且存在一种计算低效的算法能达到该误差界。另一方面，我们证明在某些常用算法族中，任何计算高效的方法都需要（约）$n \gtrsim d^{1/ρ^2}$ 的更大样本复杂度，并且存在一种基于平方和（sum-of-squares）的多项式时间算法（几乎）达到该下界。对于相对算子范数下的协方差估计，我们展示了类似结论成立。最后，我们转向存在缺失观测的线性回归问题，并证明此类间隙在该场景中并不持续存在。实际上，在此设定下，我们证明最小化一个简单的强凸经验风险即可在多项式时间内近乎达到信息论下界。

Abstract

We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $ε$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $ρ$, for any constant contamination $ε\in (0, 1)$, (roughly) $n \gtrsim d e^{1/ρ^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/ρ^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.

Keywords: missing data, high-dimensional estimation, statistical-computational gaps, mean estimation, covariance estimation, linear regression, Gaussian data, sum-of-squares
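
To get a feel for the size of the gap in mean estimation, the two rough sample-complexity scalings from the abstract can be evaluated numerically. The sketch below simply plugs numbers into $d\,e^{1/ρ^2}$ and $d^{1/ρ^2}$; it ignores the constants and lower-order factors that the abstract's "roughly" hides, so the outputs indicate scaling only. The function name and the chosen values of `d` and `rho` are hypothetical.

```python
import math


def sample_lower_bounds(d, rho):
    """Evaluate the two rough scalings for target error rho in dimension d.

    Returns (statistical, efficient):
      statistical ~ d * exp(1 / rho^2)  (information-theoretic scaling)
      efficient   ~ d ** (1 / rho^2)    (scaling for efficient algorithms)
    Constants and lower-order factors are deliberately omitted.
    """
    exponent = 1.0 / rho ** 2
    statistical = d * math.exp(exponent)
    efficient = d ** exponent
    return statistical, efficient


statistical, efficient = sample_lower_bounds(d=100, rho=0.5)
print(f"statistical ~ {statistical:.0f}, efficient ~ {efficient:.0f}")
```

For d = 100 and ρ = 0.5 the exponent 1/ρ² equals 4, so the efficient-algorithm scaling is 100⁴ while the statistical scaling is 100·e⁴, a difference of more than four orders of magnitude.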

274. ❌ Learning Lineage-guided Geodesics with Finsler Geometry

Authors: Aaron Zweig, Mingxuan Zhang, David A. Knowles, Elham Azizi Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16708v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

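Scoring rationale: The paper studies trajectory inference and proposes a Finsler metric that combines geometry with classification for dynamical systems with discrete prior knowledge (such as lineage trees in developmental biology). Its content mainly involves differential geometry and the application of machine learning to scientific data (biological data in particular), and it is entirely unrelated to most large-model and deep-learning keywords (LLMs, MoE, RLHF, RAG, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is applied to scientific fields such as developmental biology; however, this is not its core novelty, which lies in the geometric method, so it receives 5 points (somewhat related). All other keywords receive 0 points (entirely unrelated).

!!! tip deepseek-chat TL;DR

This paper proposes a Finsler metric that combines geometry with classification for trajectory inference in dynamical systems with discrete prior knowledge (such as lineage trees), improving performance on interpolation tasks with synthetic and real-world data.

Abstract (Chinese translation)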

轨迹推断研究如何对动态系统（如时间分辨的群体分布）的观测时间点之间的路径进行插值，其目标是推断未观测时间点的轨迹并更好地理解系统动态。先前的工作侧重于连续几何先验，利用数据依赖的空间特征来定义黎曼度量（Riemannian metric）。在许多应用中，存在关于允许转移的离散、定向先验知识（例如发育生物学中的谱系树）。我们引入了一种结合几何与分类的芬斯勒度量（Finsler metric），并将这两类先验整合到轨迹推断中，从而在合成和真实世界数据的插值任务上实现了性能提升。

Abstract

Trajectory inference investigates how to interpolate paths between observed timepoints of dynamical systems, such as temporally resolved population distributions, with the goal of inferring trajectories at unseen times and better understanding system dynamics. Previous work has focused on continuous geometric priors, utilizing data-dependent spatial features to define a Riemannian metric. In many applications, there exists discrete, directed prior knowledge over admissible transitions (e.g. lineage trees in developmental biology). We introduce a Finsler metric that combines geometry with classification and incorporate both types of priors in trajectory inference, yielding improved performance on interpolation tasks in synthetic and real-world data.

Keywords: trajectory inference, Finsler geometry, lineage trees, developmental biology, geodesics, dynamic systems, interpolation, prior knowledge
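
What distinguishes a Finsler metric from a Riemannian one is that the cost of a step may depend on its direction, which is what makes directed priors such as lineage trees expressible. The abstract does not give the paper's metric, so the sketch below uses a textbook stand-in, a Randers norm F(v) = ‖v‖ + ⟨b, v⟩ with ‖b‖ < 1, purely as an assumed example; `randers_norm` and the drift vector are hypothetical.

```python
import math


def randers_norm(v, b):
    """Randers norm F(v) = ||v|| + <b, v> with the Euclidean base norm.

    The condition ||b|| < 1 keeps F positive for every non-zero v.
    Unlike a Riemannian norm, F(v) generally differs from F(-v).
    """
    if math.hypot(*b) >= 1.0:
        raise ValueError("drift vector b must have Euclidean norm below 1")
    return math.hypot(*v) + sum(bi * vi for bi, vi in zip(b, v))


# Hypothetical prior: transitions along +x are permitted, so the drift
# points against +x and makes forward motion cheaper than backward motion.
drift = (-0.5, 0.0)
forward = randers_norm((1.0, 0.0), drift)
backward = randers_norm((-1.0, 0.0), drift)
print(forward, backward)
```

Here a unit step in the permitted direction costs 0.5 and the reverse step costs 1.5, so shortest paths under this norm prefer the permitted direction.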

275. ❌ Grid-World Representations in Transformers Reflect Predictive Geometry

Authors: Sasha Brenner, Thomas R. Knösche, Nico Scherf Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16689v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

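Scoring rationale: The paper studies how transformers form internal world representations in a grid-world random-walk task. It is highly relevant to 'World Models AND General World Models' (10 points), because it directly examines how world-model representations form, and related to 'Mechanistic Interpretability OR Explainable AI' (8 points), because it analyzes the relationship between a network's internal representations and the geometric structure of the data. No other keyword is addressed: the paper focuses on representation learning in a basic transformer architecture in a controlled setting, not on specific large-model techniques, training methods, application domains, or optimization techniques.

!!! tip deepseek-chat TL;DR

This paper studies how the internal representations of transformers trained to predict random walks on a two-dimensional grid align with the geometric structure of the data itself, yielding world-model-like representations.

Abstract (Chinese translation)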

下一词元预测模型常表现出对潜在世界及其规则的内在表征能力。这类模型的概率特性表明，世界的结构与概率分布的几何形态之间存在深刻关联。为更精确理解这一联系，我们采用一个最小随机过程作为受控实验环境：在二维晶格上进行受约束的随机游走，该游走必须在预定步数后抵达固定终点点。对此过程的最优预测完全取决于一个由游走者相对于目标的位置与剩余时间范围所决定的充分向量；换言之，概率分布由世界几何结构参数化。我们在从这些游走的精确分布中采样的前缀序列上训练仅含解码器的Transformer模型，并将其隐藏层激活值与解析推导的充分向量进行比较。在不同模型与网络层中，学习到的表征与真实预测向量高度吻合，且通常呈现低维特性。这为世界模型式表征可直接追溯至数据本身的预测几何结构提供了具体例证。尽管在简化玩具系统中得到验证，本分析表明：支撑最优预测的几何表征或可为研究神经网络如何内化语法及其他结构约束提供有益视角。

Abstract

Next-token predictors often appear to develop internal representations of the latent world and its rules. The probabilistic nature of these models suggests a deep connection between the structure of the world and the geometry of probability distributions. In order to understand this link more precisely, we use a minimal stochastic process as a controlled setting: constrained random walks on a two-dimensional lattice that must reach a fixed endpoint after a predetermined number of steps. Optimal prediction of this process solely depends on a sufficient vector determined by the walker’s position relative to the target and the remaining time horizon; in other words, the probability distributions are parametrized by the world’s geometry. We train decoder-only transformers on prefixes sampled from the exact distribution of these walks and compare their hidden activations to the analytically derived sufficient vectors. Across models and layers, the learned representations align strongly with the ground-truth predictive vectors and are often low-dimensional. This provides a concrete example in which world-model-like representations can be directly traced back to the predictive geometry of the data itself. Although demonstrated in a simplified toy system, the analysis suggests that geometric representations supporting optimal prediction may provide a useful lens for studying how neural networks internalize grammatical and other structural constraints.

Keywords: Transformers, world models, internal representations, predictive geometry, random walks, grid-world, sufficient vectors, decoder-only transformers
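
The claim that optimal prediction depends only on the walker's offset from the target and the remaining horizon can be checked directly in a simplified setting. The sketch below assumes an unbounded lattice, four unit moves, and a uniform distribution over all walks that hit the target exactly on the last step; the paper's exact setup may differ. It uses the standard closed form for the number of n-step walks with displacement (dx, dy), namely C(n, (n+dx+dy)/2) · C(n, (n+dx−dy)/2). All names are hypothetical.

```python
from math import comb

MOVES = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def count_walks(n, dx, dy):
    """Number of n-step nearest-neighbour walks on Z^2 with displacement (dx, dy)."""
    a, b = n + dx + dy, n + dx - dy
    if n < 0 or a % 2 or a < 0 or b < 0 or a > 2 * n or b > 2 * n:
        return 0
    return comb(n, a // 2) * comb(n, b // 2)


def next_step_probs(position, target, steps_left):
    """Optimal next-move distribution for a walk that must end at `target`.

    Each move is weighted by the number of ways to finish the walk
    afterwards, so the result depends only on (target - position, steps_left).
    """
    dx, dy = target[0] - position[0], target[1] - position[1]
    total = count_walks(steps_left, dx, dy)
    if total == 0:
        raise ValueError("target unreachable in the remaining steps")
    return {
        name: count_walks(steps_left - 1, dx - mx, dy - my) / total
        for name, (mx, my) in MOVES.items()
    }


print(next_step_probs((0, 0), (2, 3), 7))
print(next_step_probs((3, 1), (5, 4), 7))  # same offset and horizon
```

Both calls print the same distribution because they share the offset (2, 3) and the horizon 7. That pair plays the role of the sufficient vector that the abstract compares against the transformer's hidden activations.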

276. ❌ Self-Aware Markov Models for Discrete Reasoning

Authors: Gregor Kornhardt, Jannis Chemseddine, Christian Wald, Gabriele Steidl Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16661v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

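Scoring rationale: The paper studies an improved method for discrete reasoning tasks. Its core is a self-aware Markov model that corrects errors and adapts to problem difficulty by allowing tokens to be remasked and by using an adaptive stopping criterion. Its direct connection to the principles of large models and deep learning is weak, but it involves the concepts of reasoning (Chain of Thought, System 2 Thinking) and self-correction (Self-Correction), so those keywords score 5 to 8 points. Other keywords, such as LLMs, MoE, and training methods, are not addressed and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the inability of standard masked discrete diffusion models to correct their own mistakes in reasoning tasks and their fixed number of computation steps, this paper proposes a self-aware method based on a learned Markov transition kernel that allows token remasking and adaptive stopping, substantially improving performance on the Sudoku-Extreme and Countdown-4 datasets.

Abstract (Chinese translation)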

标准掩码离散扩散模型在推理任务中存在局限性，因其无法在掩码路径上修正自身错误。由于依赖固定次数的去噪步骤，这些模型无法根据问题的复杂性调整计算量。为解决这些局限，我们提出一种基于学习马尔可夫转移核的方法，该转移核在其自身输出上进行训练。这种设计允许对标记进行重新掩码，使模型能够修正先前的错误。此外，我们无需固定时间调度，而是采用训练得到的停止准则。这使得函数评估次数能够适应推理问题的难度。我们的改进方案增加了两个轻量级预测头，可实现现有预训练模型的复用与微调。在Sudoku-Extreme数据集上，我们以95%的有效率明显优于其他基于流的方法。对于Countdown-4问题，我们平均仅需10步即可正确解决近96%的问题，而许多问题仅需2步即可求解。

Abstract

Standard masked discrete diffusion models face limitations in reasoning tasks due to their inability to correct their own mistakes on the masking path. Since they rely on a fixed number of denoising steps, they are unable to adjust their computation to the complexity of a given problem. To address these limitations, we introduce a method based on learning a Markov transition kernel that is trained on its own outputs. This design enables tokens to be remasked, allowing the model to correct its previous mistakes. Furthermore, we do not need a fixed time schedule but use a trained stopping criterion. This allows for adaptation of the number of function evaluations to the difficulty of the reasoning problem. Our adaptation adds two lightweight prediction heads, enabling reuse and fine-tuning of existing pretrained models. On the Sudoku-Extreme dataset we clearly outperform other flow based methods with a validity of 95%. For the Countdown-4 we only need in average of 10 steps to solve almost 96% of them correctly, while many problems can be solved already in 2 steps.

Keywords: discrete reasoning, masked diffusion models, Markov transition kernel, self-correction, adaptive stopping, Sudoku, Countdown, reasoning tasks

277. ❌ Simplex-to-Euclidean Bijection for Conjugate and Calibrated Multiclass Gaussian Process

Authors: Bernardo Williams, Harsha Vardhan Tetali, Arto Klami, Marcelo Hartmann Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16621v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

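Scoring rationale: The paper studies a multi-class Gaussian process classification model that uses Aitchison geometry to map the probability simplex to Euclidean space, enabling conjugate inference and calibrated predictions. Its content focuses entirely on classical Gaussian process methods, geometric transformations, and probability calibration, and does not involve large language models, deep learning, AI for Science, or any technique related to the other specified keywords. All keywords are unrelated to the paper's topic, so every relevance score is 0.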

!!! tip deepseek-chat TL;DR

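The paper proposes a multi-class Gaussian process classification model that uses Aitchison geometry to map the probability simplex to Euclidean space, achieving conjugate inference and calibrated predictions, with competitive performance on synthetic and real-world datasets.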

Abstract (Chinese translation)

我们提出了一种用于多类分类的共轭校准高斯过程模型，该方法通过利用概率单纯形的几何结构实现。基于Aitchison几何，我们将单纯形空间内的类别概率映射到无约束的欧几里得表示空间，从而将分类问题转化为一个潜在维度少于传统多类高斯过程分类器的回归问题。该模型在构建过程中无需依赖分布近似，即可实现共轭推断并获得可靠的预测概率。本方法与标准稀疏高斯过程回归技术兼容，能够在大规模数据集上实现可扩展的推断。实验结果表明，在合成数据集与真实数据集上，该方法均表现出良好的校准性能与有竞争力的分类效果。

Abstract

We propose a conjugate and calibrated Gaussian process (GP) model for multi-class classification by exploiting the geometry of the probability simplex. Our approach uses Aitchison geometry to map simplex-valued class probabilities to an unconstrained Euclidean representation, turning classification into a GP regression problem with fewer latent dimensions than standard multi-class GP classifiers. This yields conjugate inference and reliable predictive probabilities without relying on distributional approximations in the model construction. The method is compatible with standard sparse GP regression techniques, enabling scalable inference on larger datasets. Empirical results show well-calibrated and competitive performance across synthetic and real-world datasets.
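
The simplex-to-Euclidean mapping described in the abstract can be illustrated with the isometric log-ratio (ilr) transform from Aitchison geometry. This is a minimal sketch, not the authors' construction: the abstract does not say which bijection or basis the paper uses, so the Helmert-type basis and the function names `helmert_basis`, `ilr`, and `ilr_inverse` are assumptions made for illustration. It shows how a probability vector over D classes maps to D-1 unconstrained coordinates and back.

```python
import numpy as np

def helmert_basis(num_classes: int) -> np.ndarray:
    """Return a (D-1, D) matrix with orthonormal rows that each sum to zero.

    Assumption: a Helmert-type basis; any orthonormal basis of the
    sum-to-zero subspace would define a valid ilr transform.
    """
    d = num_classes
    basis = np.zeros((d - 1, d))
    for i in range(1, d):
        basis[i - 1, :i] = 1.0 / i
        basis[i - 1, i] = -1.0
        basis[i - 1] *= np.sqrt(i / (i + 1.0))  # normalize the row to unit length
    return basis

def ilr(p: np.ndarray) -> np.ndarray:
    """Map strictly positive probabilities (last axis) to R^(D-1)."""
    log_p = np.log(p)
    clr = log_p - log_p.mean(axis=-1, keepdims=True)  # centred log-ratio
    return clr @ helmert_basis(p.shape[-1]).T

def ilr_inverse(z: np.ndarray) -> np.ndarray:
    """Map Euclidean coordinates back to the simplex (softmax of the clr vector)."""
    clr = z @ helmert_basis(z.shape[-1] + 1)
    e = np.exp(clr - clr.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative three-class probability vector (hypothetical values).
p = np.array([0.2, 0.3, 0.5])
z = ilr(p)               # two unconstrained coordinates
p_back = ilr_inverse(z)  # recovers p up to floating-point error
```

Because the transform takes logarithms, it needs strictly positive probabilities; how hard class labels would be turned into such probabilities is not stated in the abstract and is left out of this sketch.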

Keywords: Gaussian process, multi-class classification, Aitchison geometry, probability simplex, conjugate inference, calibrated predictions, sparse GP regression, Euclidean representation

278. ❌ Trajectory-Optimized Time Reparameterization for Learning-Compatible Reduced-Order Modeling of Stiff Dynamical Systems

Authors: Joe Standridge, Daniel Livescu, Paul Cizmas Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16583v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

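Scoring rationale: The paper focuses on machine-learning reduced-order models (ML-ROMs) for stiff dynamical systems and proposes a trajectory-optimized time reparameterization (TOTR) method to improve training stability and prediction accuracy. Its core is numerical methods and differential-equation solving rather than innovation in large language models or deep learning principles. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is somewhat related (relevance 5), because the paper applies AI to scientific computing (specifically dynamical-system modeling), although it does not belong to bioinformatics or cheminformatics. The other keywords are unrelated to large language models, training techniques, inference optimization, or agent systems and are scored 0. The weighted total is only 5.0, far below the pass score of 26.6, indicating a strong mismatch between the paper and the large-model and deep-learning topics under review.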
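
The scoring arithmetic behind the table and the pass score can be sketched as follows. This is a sketch under stated assumptions: the pass-score formula (5.0 + 0.8 × total weight, with a total weight of 27.0) comes from the report's configuration section, while the rule "per-keyword score = weight × relevance" is inferred from the table columns and is not confirmed by the report. The function names are hypothetical. The sketch evaluates the total both with the weights as tabulated (0.0) and with the unit weights listed in the configuration.

```python
def pass_threshold(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Pass score as given in the report configuration: base + slope * total weight."""
    return base + slope * total_weight

def weighted_total(rows) -> float:
    """Sum of weight * relevance over (weight, relevance) pairs.

    Assumption: this product rule is inferred from the table columns.
    """
    return sum(weight * relevance for weight, relevance in rows)

threshold = pass_threshold(total_weight=27.0)

# 26 keywords with relevance 0.0 and one keyword with relevance 5.0.
as_tabulated = weighted_total([(0.0, 0.0)] * 26 + [(0.0, 5.0)])      # weights as shown in the table
with_unit_weight = weighted_total([(1.0, 0.0)] * 26 + [(1.0, 5.0)])  # weights as listed in the configuration
```

Under the tabulated weights the total is 0.0, and under unit weights it is 5.0; both are below the pass score.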

!!! tip deepseek-chat TL;DR

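To address unstable training of machine-learning reduced-order models on stiff dynamical systems, the paper proposes a trajectory-optimized time reparameterization method. On three stiff test cases it reports losses one to two orders of magnitude lower than benchmark algorithms, along with better physical-time predictions.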

Abstract (Chinese translation)

刚性动力系统对机器学习降阶模型（ML-ROMs）提出了挑战，因为在刚性区域中显式时间积分会变得不稳定，而在学习循环中使用隐式积分则计算成本高昂，且通常会降低训练效率。时间重参数化（TR）提供了一种替代方案，它通过变换自变量，使快速物理时间瞬态过程在拉伸时间坐标上展开，从而能够在均匀采样网格上实现稳定的显式积分。尽管已有多种TR策略被提出，但它们对ML-ROMs可学习性的影响仍未得到充分理解。本研究探讨了时间重参数化作为神经ODE降阶建模的刚性缓解机制，并引入了一种轨迹优化时间重参数化（TOTR）方法。该方案将时间重参数化构建为弧长坐标中的优化问题，通过选择遍历速度剖面来惩罚拉伸时间中的加速度。通过针对训练动态的平滑性，该方法生成的重参数化轨迹比现有TR方法具有更好的条件性和更易于学习的特性。TOTR在三个刚性问题上进行了评估：参数化刚性线性系统、范德波尔振荡器和HIRES化学动力学模型。在所有案例中，在相同的训练方案下，所提出的方法相比其他TR方法能产生更平滑的重参数化和更优的物理时间预测。定量结果表明，与基准算法相比，损失降低了一到两个数量级。这些结果突出表明，ML-ROMs中有效的刚性缓解关键取决于时间映射本身的规律性和可学习性，而基于优化的TR为多尺度动力系统的显式降阶建模提供了一个稳健的框架。

Abstract

Stiff dynamical systems present a challenge for machine-learning reduced-order models (ML-ROMs), as explicit time integration becomes unstable in stiff regimes while implicit integration within learning loops is computationally expensive and often degrades training efficiency. Time reparameterization (TR) offers an alternative by transforming the independent variable so that rapid physical-time transients are spread over a stretched-time coordinate, enabling stable explicit integration on uniformly sampled grids. Although several TR strategies have been proposed, their effect on learnability in ML-ROMs remains incompletely understood. This work investigates time reparameterization as a stiffness-mitigation mechanism for neural ODE reduced-order modeling and introduces a trajectory-optimized TR (TOTR) formulation. The proposed approach casts time reparameterization as an optimization problem in arc-length coordinates, in which a traversal-speed profile is selected to penalize acceleration in stretched time. By targeting the smoothness of the training dynamics, this formulation produces reparameterized trajectories that are better conditioned and easier to learn than existing TR methods. TOTR is evaluated on three stiff problems: a parameterized stiff linear system, the van der Pol oscillator, and the HIRES chemical kinetics model. Across all cases, the proposed approach yields smoother reparameterizations and improved physical-time predictions under identical training regimens than other TR approaches. Quantitative results demonstrate loss reductions of one to two orders of magnitude compared to benchmark algorithms. These results highlight that effective stiffness mitigation in ML-ROMs depends critically on the regularity and learnability of the time map itself, and that optimization-based TR provides a robust framework for explicit reduced-order modeling of multiscale dynamical systems.
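
The basic idea of time reparameterization, spreading fast transients over a stretched coordinate, can be illustrated with a plain arc-length resampling of a trajectory. This is a minimal sketch and not the TOTR formulation: TOTR additionally optimizes a traversal-speed profile in arc-length coordinates, which is not reproduced here. The function name `arc_length_time_map` and the two-timescale test trajectory are assumptions chosen for illustration.

```python
import numpy as np

def arc_length_time_map(t: np.ndarray, x: np.ndarray, num_points: int):
    """Resample a trajectory uniformly in arc length.

    t: (N,) increasing physical times; x: (N, d) states.
    Returns (s_uniform, t_of_s, x_of_s), where t_of_s is the time map t(s).
    Assumes consecutive states differ, so the cumulative arc length increases.
    """
    step = np.linalg.norm(np.diff(x, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(step)])        # cumulative arc length
    s_uniform = np.linspace(0.0, s[-1], num_points)      # uniform grid in stretched coordinate
    t_of_s = np.interp(s_uniform, s, t)                  # physical time at each grid point
    x_of_s = np.column_stack(
        [np.interp(s_uniform, s, x[:, k]) for k in range(x.shape[1])]
    )
    return s_uniform, t_of_s, x_of_s

# Hypothetical two-timescale trajectory: a fast and a slow exponential decay.
t = np.linspace(0.0, 1.0, 2001)
x = np.column_stack([np.exp(-1000.0 * t), np.exp(-t)])
s_uniform, t_of_s, x_of_s = arc_length_time_map(t, x, num_points=50)
```

On the uniform arc-length grid, the physical-time increments are tiny during the fast initial transient and much larger afterwards, which is the stretching effect the abstract describes.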

Keywords: stiff dynamical systems, machine-learning reduced-order models, time reparameterization, neural ODE, trajectory-optimized TR, explicit integration, multiscale dynamical systems, training stability

279. ❌ SympFormer: Accelerated attention blocks via Inertial Dynamics on Density Manifolds

Authors: Viktor Stein, Wuchen Li, Gabriele Steidl Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16535v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

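Scoring rationale: The paper focuses on a mathematical-theory improvement of the Transformer attention mechanism, proposing a new accelerated attention block architecture based on inertial dynamics on density manifolds. Although it involves the Transformer architecture (a basic building block of large models), the keywords all target specific technical directions such as large-model applications, training, alignment, inference optimization, and agent systems. This paper is purely a mathematical extension of the attention mechanism and does not address any specific large-model technique, application area, or engineering implementation named in the keywords. The paper has no direct connection to any keyword.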

!!! tip deepseek-chat TL;DR

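The paper proposes SympFormer, a new accelerated attention block architecture based on inertial Nesterov-type dynamics. By giving tokens velocity variables and using Hamiltonian momentum attention blocks, it shows faster convergence than classical attention blocks while preserving the number of oracle calls.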

Abstract (Chinese translation)

Transformer在自然语言处理领域取得的实证成功很大程度上归功于自注意力机制。近期研究将注意力模块解释为相互作用的粒子系统，其平均场极限对应于在配备Wasserstein-$2$型度规的概率密度空间上，相互作用能量泛函的梯度流。我们通过引入源自密度空间上惯性Nesterov型动力学的加速注意力模块，拓展了这一视角。在所提出的架构中，标记（tokens）同时携带空间（特征）变量和速度变量。通过对加速密度动力学进行时间离散化与近似处理，我们得到了哈密顿动量注意力模块，这构成了所提出的加速注意力架构。特别地，对于线性自注意力，我们证明了注意力模块使用双线性核近似了势能的Stein变分梯度流。在此设定下，我们证明了椭圆等高概率分布能够被加速注意力模块保持。我们提出了可实现的基于粒子的算法，并证明所提出的加速注意力模块在保持预言机调用次数不变的同时，比经典注意力模块收敛得更快。

Abstract

Transformers owe much of their empirical success in natural language processing to the self-attention blocks. Recent perspectives interpret attention blocks as interacting particle systems, whose mean-field limits correspond to gradient flows of interaction energy functionals on probability density spaces equipped with Wasserstein-$2$-type metrics. We extend this viewpoint by introducing accelerated attention blocks derived from inertial Nesterov-type dynamics on density spaces. In our proposed architecture, tokens carry both spatial (feature) and velocity variables. The time discretization and the approximation of accelerated density dynamics yield Hamiltonian momentum attention blocks, which constitute the proposed accelerated attention architectures. In particular, for linear self-attention, we show that the attention blocks approximate a Stein variational gradient flow, using a bilinear kernel, of a potential energy. In this setting, we prove that elliptically contoured probability distributions are preserved by the accelerated attention blocks. We present implementable particle-based algorithms and demonstrate that the proposed accelerated attention blocks converge faster than the classical attention blocks while preserving the number of oracle calls.

Keywords: SympFormer, accelerated attention blocks, inertial dynamics, density manifolds, Hamiltonian momentum attention, Stein variational gradient flow, convergence acceleration, particle-based algorithms

280. ❌ Deep Tabular Representation Corrector

Authors: Hangting Ye, Peng Wang, Wei Fan, Xiaozhuang Song, He Zhao, Dandan Gun, Yi Chang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16569v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

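Scoring rationale: The paper focuses on a representation-correction method for deep tabular learning (TRC), which aims to improve the representations of already-trained models through two tasks (representation re-estimation and space mapping) without modifying the original model parameters. Its core is a general representation-enhancement technique for tabular learning, with no direct connection to any scoring keyword (all of which concern large-model principles, training methods, inference optimization, alignment, applications, and so on). The paper does not cover large models, language models, MoE, scaling laws, pre-training/post-training, alignment, inference acceleration, hallucination mitigation, interpretability, agents, or quantization, nor does it apply large models to scientific AI fields such as bioinformatics. All keyword relevance scores are therefore 0.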

!!! tip deepseek-chat TL;DR

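The paper proposes TRC, a deep tabular representation corrector that improves the representations of trained deep tabular models through two tasks, representation re-estimation and space mapping, without modifying the original model parameters. It reports consistent gains on multiple benchmarks.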

Abstract (Chinese translation)

表格数据在医疗健康、工程、金融等诸多现实领域一直扮演着至关重要的角色。深度学习的近期成功催生了众多基于深度网络（如Transformer、ResNet）的表格学习方法。一般而言，现有的深度表格机器学习方法遵循两种范式，即“学习中”（in-learning）与“预学习”（pre-learning）。学习中方法需要从头训练网络或施加额外约束以规范表征，但这需同时处理多个任务并使学习更为困难；而预学习方法则设计若干预训练任务进行预训练，随后进行任务特定的微调，然而这需要大量额外的训练努力及先验知识。本文提出一种新颖的深度表格表征校正器（Tabular Representation Corrector, TRC），以模型无关的方式增强任何已训练深度表格模型的表征，且不改变其参数。具体而言，针对阻碍预测的表征偏移（representation shift）与表征冗余（representation redundancy），我们提出两项任务：（i）表格表征重估计（Tabular Representation Re-estimation），通过训练一个偏移估计器来计算表格表征的内在偏移并随后消除它，从而重估表征；（ii）表格空间映射（Tabular Space Mapping），通过一个坐标估计器将上述重估后的表征转换到一个轻量嵌入向量空间，同时保留关键预测信息以最小化冗余。这两项任务共同增强了深度表格模型的表征，且无需触及原始模型，因而具有高效性。最后，我们在多种表格基准数据集上，将TRC与前沿的深度表格机器学习模型结合进行了广泛实验，结果均显示出持续的优势。

Abstract

Tabular data have been playing a mostly important role in diverse real-world fields, such as healthcare, engineering, finance, etc. The recent success of deep learning has fostered many deep networks (e.g., Transformer, ResNet) based tabular learning methods. Generally, existing deep tabular machine learning methods are along with the two paradigms, i.e., in-learning and pre-learning. In-learning methods need to train networks from scratch or impose extra constraints to regulate the representations which nonetheless train multiple tasks simultaneously and make learning more difficult, while pre-learning methods design several pretext tasks for pre-training and then conduct task-specific fine-tuning, which however need much extra training effort with prior knowledge. In this paper, we introduce a novel deep Tabular Representation Corrector, TRC, to enhance any trained deep tabular model’s representations without altering its parameters in a model-agnostic manner. Specifically, targeting the representation shift and representation redundancy that hinder prediction, we propose two tasks, i.e., (i) Tabular Representation Re-estimation, that involves training a shift estimator to calculate the inherent shift of tabular representations to subsequently mitigate it, thereby re-estimating the representations and (ii) Tabular Space Mapping, that transforms the above re-estimated representations into a light-embedding vector space via a coordinate estimator while preserves crucial predictive information to minimize redundancy. The two tasks jointly enhance the representations of deep tabular models without touching on the original models thus enjoying high efficiency. Finally, we conduct extensive experiments on state-of-the-art deep tabular machine learning models coupled with TRC on various tabular benchmarks which have shown consistent superiority.

Keywords: tabular data, deep learning, representation correction, model-agnostic, representation shift, representation redundancy, TRC, tabular benchmarks

281. ❌ Optimal uncertainty bounds for multivariate kernel regression under bounded noise: A Gaussian process-based dual function

Authors: Amon Lahr, Anna Scampicchio, Johannes Köhler, Melanie N. Zeilinger Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16481v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

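Scoring rationale: The paper studies uncertainty bounds for Gaussian process-based kernel regression, which belongs to kernel methods and uncertainty quantification in classical machine learning. It is entirely unrelated to the scoring keywords, all of which concern large models, deep learning principles, and their applications. The paper does not involve any large-model architecture, training method, inference optimization, alignment technique, agent system, or scientific AI application.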

!!! tip deepseek-chat TL;DR

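The paper proposes a tight, distribution-free uncertainty bound for multi-output kernel regression, derived from a duality-based formulation. It addresses the limitations of existing bounds in noise-distribution assumptions, conservatism, scalability, and integration into downstream tasks.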

Abstract (Chinese translation)

非保守的不确定性边界对于从含噪数据中对潜在函数进行可靠预测至关重要，因而也是实现基于学习的鲁棒控制的关键赋能因素。在该领域中，高斯过程回归等核方法凭借其固有的不确定性量化机制，已成为成熟技术。然而，现有边界要么对底层噪声分布提出强假设，要么过于保守，要么在多输出场景下扩展性不佳，要么难以整合至下游任务中。本文通过提出一种紧致的、无分布假设的多输出核估计边界，以应对这些局限性。该边界通过一种基于对偶的无约束化形式推导得出，其结构与经典高斯过程置信边界相同，因此可直接整合至下游优化流程中。我们证明了所提出的边界可推广至多种现有结果，并通过一个受四旋翼飞行器动力学学习启发的实例阐明了其应用。

Abstract

Non-conservative uncertainty bounds are essential for making reliable predictions about latent functions from noisy data–and thus, a key enabler for safe learning-based control. In this domain, kernel methods such as Gaussian process regression are established techniques, thanks to their inherent uncertainty quantification mechanism. Still, existing bounds either pose strong assumptions on the underlying noise distribution, are conservative, do not scale well in the multi-output case, or are difficult to integrate into downstream tasks. This paper addresses these limitations by presenting a tight, distribution-free bound for multi-output kernel-based estimates. It is obtained through an unconstrained, duality-based formulation, which shares the same structure of classic Gaussian process confidence bounds and can thus be straightforwardly integrated into downstream optimization pipelines. We show that the proposed bound generalizes many existing results and illustrate its application using an example inspired by quadrotor dynamics learning.

Keywords: kernel regression, uncertainty bounds, Gaussian process, multi-output, distribution-free, duality-based formulation, safe learning-based control, quadrotor dynamics learning

282. ❌ Controlling Fish Schools via Reinforcement Learning of Virtual Fish Movement

Authors: Yusuke Nishii, Hiroaki Kawashima Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16384v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

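Scoring rationale: The paper uses reinforcement learning to train virtual fish that control the behavior of fish schools, an application of AI to animal behavior and biology. The keywords all concern the technical principles of large models and deep learning, whereas this paper only uses reinforcement learning (not a large-model technique). Therefore, apart from "AI for Science" (weight 1.0), which receives a relevance of 5 because the work is a scientific application, all other keywords score 0. The paper does not involve large models, deep learning principles, or innovation in related subfields.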

!!! tip deepseek-chat TL;DR

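The study proposes using virtual fish trained with reinforcement learning to guide and control real fish schools, and validates the method in simulation and in real-world experiments with live fish.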

Abstract (Chinese translation)

本研究探讨了一种利用强化学习训练的虚拟鱼来引导和控制鱼群的方法。为克服实体机器人代理存在的耐久性与运动限制等技术挑战，我们采用屏幕上显示的二维虚拟鱼作为引导媒介。针对真实鱼类缺乏详细行为模型的问题，我们采用了无模型强化学习方法。首先，仿真结果表明，即使在模拟真实鱼频繁忽略虚拟刺激的情况下，强化学习仍能获得有效的运动策略。其次，通过活鱼进行的实体实验证实，学习得到的策略能成功将鱼群引导至指定目标方向。统计分析显示，所提出的方法在性能上显著优于基线条件（包括无刺激状态和启发式的“停留边缘”策略）。本研究为如何通过人工代理利用强化学习影响动物集体行为提供了早期实证。

Abstract

This study investigates a method to guide and control fish schools using virtual fish trained with reinforcement learning. We utilize 2D virtual fish displayed on a screen to overcome technical challenges such as durability and movement constraints inherent in physical robotic agents. To address the lack of detailed behavioral models for real fish, we adopt a model-free reinforcement learning approach. First, simulation results show that reinforcement learning can acquire effective movement policies even when simulated real fish frequently ignore the virtual stimulus. Second, real-world experiments with live fish confirm that the learned policy successfully guides fish schools toward specified target directions. Statistical analysis reveals that the proposed method significantly outperforms baseline conditions, including the absence of stimulus and a heuristic “stay-at-edge” strategy. This study provides an early demonstration of how reinforcement learning can be used to influence collective animal behavior through artificial agents.

Keywords: reinforcement learning, fish schools, virtual fish, collective animal behavior, model-free RL, behavioral control, artificial agents, real-world experiments

283. ❌ DISCOVER: A Solver for Distributional Counterfactual Explanations

Authors: Yikai Gu, Lele Cao, Bo Zhao, Lei Lei, Lei You Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16436v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

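Scoring rationale: The paper studies DISCOVER, a solver for distributional counterfactual explanations (DCE). It belongs to explainable AI (XAI), specifically model-agnostic counterfactual explanation for tabular data and non-differentiable models. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on). It is only somewhat related to "Mechanistic Interpretability OR Explainable AI", because counterfactual explanation is a subfield of explainable AI. However, the paper does not address the interpretability of large models or deep learning and mainly targets traditional machine learning models, so this keyword receives a relevance of 5.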

!!! tip deepseek-chat TL;DR

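The paper proposes DISCOVER, a model-agnostic solver for distributional counterfactual explanations. It replaces gradient descent with a sparse propose-and-select search paradigm and achieves strong joint alignment of input and output distributions on multiple tabular datasets.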

Abstract (Chinese translation)

反事实解释通过识别能够导致不同预测结果的输入修改来解释模型决策。现有方法大多在实例层面进行操作。分布反事实解释通过优化一个最优传输目标来扩展这一设定，该目标平衡了与事实输入分布的接近度和与目标输出分布的对齐度，并通过机会约束边界提供统计认证。然而，DCE依赖于基于梯度的优化方法，而现实世界中的许多表格数据流程主要由不可微模型主导。我们提出了DISCOVER，一种用于分布反事实解释的模型无关求解器。DISCOVER保留了原始DCE的目标函数和认证机制，同时用稀疏的“提议-选择”搜索范式替代了梯度下降法。该方法利用传输目标的逐样本分解来计算每行影响分数，并强制执行前$k$干预预算，从而将编辑操作集中在最具影响力的样本上。为了在没有预测器梯度的情况下指导候选样本生成，DISCOVER引入了由输入侧传输几何驱动的OT引导锥形采样原语。在多个表格数据集上的实验表明，该方法能实现输入与输出分布的强联合对齐，将分布反事实推理扩展到现代黑盒学习流程中。代码仓库位于https://github.com/understanding-ml/DCE。

Abstract

Counterfactual explanations (CE) explain model decisions by identifying input modifications that lead to different predictions. Most existing methods operate at the instance level. Distributional Counterfactual Explanations (DCE) extend this setting by optimizing an optimal transport objective that balances proximity to a factual input distribution and alignment to a target output distribution, with statistical certification via chance constrained bounds. However, DCE relies on gradient based optimization, while many real-world tabular pipelines are dominated by non-differentiable models. We propose DISCOVER, a model-agnostic solver for distributional counterfactual explanations. DISCOVER preserves the original DCE objective and certification while replacing gradient descent with a sparse propose-and-select search paradigm. It exploits a sample-wise decomposition of the transport objective to compute per-row impact scores and enforce a top-$k$ intervention budget, focusing edits on the most influential samples. To guide candidate generation without predictor gradients, DISCOVER introduces an OT-guided cone sampling primitive driven by input-side transport geometry. Experiments on multiple tabular datasets demonstrate strong joint alignment of input and output distributions, extending distributional counterfactual reasoning to modern black box learning pipelines. A code repository is available at https://github.com/understanding-ml/DCE.

Keywords: Counterfactual Explanations, Distributional Counterfactual Explanations, Model-agnostic, Optimal Transport, Tabular Data, Black Box Models, Sparse Search, Statistical Certification

284. ❌ Prior-Informed Neural Network Initialization: A Spectral Approach for Function Parameterizing Architectures

Authors: David Orlando Salazar Torres, Diyar Altinses, Andreas Schwung Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16376v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

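Scoring rationale: The paper studies neural network initialization, specifically a spectral initialization strategy for function-parameterizing architectures such as the Bag-of-Functions framework. Its core concerns are conventional network initialization, embedding data-driven priors, architectural configuration, and computational efficiency. Every scoring keyword targets large language models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, and so on), whereas this paper addresses neither LLMs nor deep learning applied to science. It covers only a general-purpose initialization method and has no direct connection to any keyword.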

!!! tip deepseek-chat TL;DR

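The paper proposes a neural network initialization method based on the spectral structure and temporal priors of the data. It uses the FFT to extract seasonal priors that guide network depth and initial states, and residual-based regression to parameterize the trend component. This accelerates convergence, reduces performance variability, and improves computational efficiency while preserving reconstruction fidelity.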

Abstract

Neural network architectures designed for function parameterization, such as the Bag-of-Functions (BoF) framework, bridge the gap between the expressivity of deep learning and the interpretability of classical signal processing. However, these models are inherently sensitive to parameter initialization, as traditional data-agnostic schemes fail to capture the structural properties of the target signals, often leading to suboptimal convergence. In this work, we propose a prior-informed design strategy that leverages the intrinsic spectral and temporal structure of the data to guide both network initialization and architectural configuration. A principled methodology is introduced that uses the Fast Fourier Transform to extract dominant seasonal priors, informing model depth and initial states, and a residual-based regression approach to parameterize trend components. Crucially, this structural alignment enables a substantial reduction in encoder dimensionality without compromising reconstruction fidelity. A supporting theoretical analysis provides guidance on trend estimation under finite-sample regimes. Extensive experiments on synthetic and real-world benchmarks demonstrate that embedding data-driven priors significantly accelerates convergence, reduces performance variability across trials, and improves computational efficiency. Overall, the proposed framework enables more compact and interpretable architectures while outperforming standard initialization baselines, without altering the core training procedure.

Keywords: neural network initialization, spectral approach, function parameterization, prior-informed design, Fast Fourier Transform, encoder dimensionality reduction, convergence acceleration, computational efficiency
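
The abstract describes extracting dominant seasonal priors from the spectrum and fitting the trend by regression on a residual. The following is a minimal sketch of that idea, not the authors' implementation: a naive DFT stands in for the FFT to keep the example dependency-free, the trend is assumed to be linear, and every function name is hypothetical.

```python
import cmath
import math

def one_sided_dft(x):
    """Naive O(n^2) DFT for bins 0..n//2 (an FFT returns the same values faster)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]

def dominant_peaks(x, top_k=1):
    """Return (bin, period, coefficient) for the top_k strongest non-DC bins."""
    n = len(x)
    mean = sum(x) / n
    spectrum = one_sided_dft([v - mean for v in x])
    bins = sorted(range(1, len(spectrum)), key=lambda k: abs(spectrum[k]), reverse=True)
    return [(k, n / k, spectrum[k]) for k in bins[:top_k]]

def seasonal_component(n, peaks):
    """Rebuild the real-valued signal carried by the selected one-sided bins."""
    out = [0.0] * n
    for k, _period, coeff in peaks:
        scale = 1.0 if 2 * k == n else 2.0  # the Nyquist bin has no conjugate twin
        for t in range(n):
            out[t] += scale * (coeff * cmath.exp(2j * math.pi * k * t / n)).real / n
    return out

def linear_trend(y):
    """Ordinary least-squares slope and intercept of y against t = 0..n-1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxy = sum((t - t_mean) * (y[t] - y_mean) for t in range(n))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean
```

In this sketch the retained peaks play the role of the seasonal prior (how many periodic components to model and with which initial frequencies), and the slope and intercept fitted on the residual play the role of the trend prior.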

285. ❌ Encoding Predictability and Legibility for Style-Conditioned Diffusion Policy

Authors: Adrien Jacquet Crétides, Mouad Abrini, Hamed Rahimi, Mohamed Chetouani Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16368v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

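Scoring rationale: The paper studies the trade-off between efficiency and legibility in robot trajectory generation, using a diffusion model and a conditioning predictor. It is relevant only to the keyword 'Post-training OR Supervised Fine-tuning OR SFT', with moderate relevance (8 points), because it mentions a 'post-training pipeline' that 'freezes the base policy', which touches on fine-tuning concepts. The remaining keywords concern large models, language models, or AI for science and do not apply, so they score 0.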

!!! tip deepseek-chat TL;DR

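The paper proposes Style-Conditioned Diffusion Policy (SCDP), which uses a post-training pipeline and a conditioning predictor to balance trajectory efficiency and legibility dynamically in robot manipulation and navigation tasks. It enhances legibility when the goal is ambiguous and keeps efficient paths otherwise.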

Abstract

Striking a balance between efficiency and transparent motion is a core challenge in human-robot collaboration, as highly expressive movements often incur unnecessary time and energy costs. In collaborative environments, legibility allows a human observer a better understanding of the robot’s actions, increasing safety and trust. However, these behaviors result in sub-optimal and exaggerated trajectories that are redundant in low-ambiguity scenarios where the robot’s goal is already obvious. To address this trade-off, we propose Style-Conditioned Diffusion Policy (SCDP), a modular framework that constrains the trajectory generation of a pre-trained diffusion model toward either legibility or efficiency based on the environment’s configuration. Our method utilizes a post-training pipeline that freezes the base policy and trains a lightweight scene encoder and conditioning predictor to modulate the diffusion process. At inference time, an ambiguity detection module activates the appropriate conditioning, prioritizing expressive motion only for ambiguous goals and reverting to efficient paths otherwise. We evaluate SCDP on manipulation and navigation tasks, and results show that it enhances legibility in ambiguous settings while preserving optimal efficiency when legibility is unnecessary, all without retraining the base policy.

Keywords: diffusion policy, trajectory generation, legibility, efficiency, post-training, human-robot collaboration, ambiguity detection, conditioning predictor

286. ❌ Decoding the Critique Mechanism in Large Reasoning Models

Authors: Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16331v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

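Scoring rationale: The paper studies the critique mechanism in Large Reasoning Models (LRMs), which counts as innovation in the technical principles of large models. Relevant keywords: 1) 'Large Language Models' (the paper studies LRMs, which fall under large models; 10 points); 2) 'Chain of Thought' (it explicitly studies error propagation and recovery in CoT reasoning; 10 points); 3) 'Self-Correction' (its core subject is the model's self-correction mechanism; 10 points); 4) 'System 2 Thinking' (it studies error detection and correction during in-depth reasoning, which relates to slow thinking; 8 points); 5) 'Mechanistic Interpretability' (it identifies an interpretable critique vector through feature-space analysis, which is mechanistic interpretability research; 8 points); 6) 'Hallucination Mitigation' (error detection and correction indirectly help mitigate hallucination; 5 points). The remaining keywords have no direct connection to the paper and score 0.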

!!! tip deepseek-chat TL;DR

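The paper studies how large reasoning models detect and correct errors during chain-of-thought reasoning through an internal critique mechanism. Using feature-space analysis, it identifies an interpretable critique vector that can be used to improve the model's error detection.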

Abstract

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong “critique” ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating through the chain-of-thought (CoT), resulting in an incorrect intermediate conclusion, the model still reaches the correct final answer. This recovery implies that the model must possess an internal mechanism to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model’s error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs’ critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at https://github.com/mail-research/lrm-critique-vectors.

Keywords: Large Reasoning Models, critique mechanism, chain-of-thought, self-correction, error detection, feature space analysis, interpretable vector, test-time scaling
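
The abstract says latent representations are steered with a critique vector but does not say how the vector is computed or applied. The sketch below shows one common activation-steering recipe as an illustration only: a difference-of-means direction between two sets of hidden states, added to a hidden state with a scale `alpha`. The recipe, the function names, and the toy numbers are all assumptions, not the paper's method.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(with_behavior, without_behavior):
    """Direction pointing from the 'without' mean toward the 'with' mean."""
    pos = mean_vector(with_behavior)
    neg = mean_vector(without_behavior)
    return [p - q for p, q in zip(pos, neg)]

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to one hidden state."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

# Toy hidden states (hypothetical): rows are samples, columns are features.
critique_states = [[1.0, 2.0], [3.0, 2.0]]
baseline_states = [[0.0, 2.0], [0.0, 2.0]]
direction = difference_of_means(critique_states, baseline_states)
steered = steer([0.5, 0.5], direction, alpha=1.0)
```

In a real model the hidden states would come from a chosen layer, and `alpha` would be tuned on held-out data.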

287. ❌ Laya: A LeJEPA Approach to EEG via Latent Prediction over Reconstruction

Authors: Saarang Panchavati, Uddhav Panchavati, Corey Arnold, William Speier Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16281v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

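Scoring rationale: The paper focuses on self-supervised learning (SSL) for electroencephalography (EEG) foundation models and proposes Laya, a LeJEPA-based latent prediction architecture that replaces conventional signal reconstruction. Its core contribution is an improved SSL paradigm rather than large language models (LLMs) or general deep learning techniques, so the vast majority of keywords (LLM, MoE, alignment, reasoning, agents, and so on) are irrelevant and score 0. Only two keywords are partially related: 1. 'Pre-training OR Continual Pre-training OR Domain Adaptation': the paper pre-trains a foundation model on large unlabeled EEG data but does not explicitly discuss continual pre-training or domain adaptation, so it scores 5 (moderate relevance). 2. 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to science (clinical neuroscience, brain-computer interfaces) and relates to bioinformatics, though not as a core match, so it scores 8 (fairly strong relevance). Other keywords such as model compression, inference acceleration, and interpretability are not covered. The weighted total is (5.0 × 1.0) + (8.0 × 1.0) = 13.0, well below the dynamic passing score of 26.6, which indicates low relevance to the large-model and deep learning topics this review targets.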
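
The arithmetic in this rationale can be reproduced with a short sketch. It assumes the keyword weights of 1.0 and the passing formula (5.0 + 0.8 × total weight) listed in the report's configuration section, rather than the 0.0 weights shown in the table above. The function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, per_weight=0.8):
    """Passing threshold from the report's configuration: 5.0 + 0.8 * total weight."""
    return base + per_weight * total_weight

def weighted_total(rows):
    """Sum relevance * weight over (relevance, weight) pairs."""
    return sum(relevance * weight for relevance, weight in rows)

rows = [(5.0, 1.0), (8.0, 1.0)]      # the two non-zero keywords named in the rationale
total = weighted_total(rows)
threshold = passing_score(27 * 1.0)  # 27 keywords, weight 1.0 each
passed = total >= threshold
```

Under these assumptions the total of 13.0 falls short of the 26.6 threshold, which matches the rationale's conclusion.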

!!! tip deepseek-chat TL;DR

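To address the representation bias caused by relying on signal reconstruction for self-supervised learning in EEG foundation models, the paper proposes Laya, a LeJEPA-based latent prediction method. Across multiple EEG benchmarks, it outperforms reconstruction-based baselines under linear probing.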

Abstract

Electroencephalography (EEG) is a widely used tool for studying brain function, with applications in clinical neuroscience, diagnosis, and brain-computer interfaces (BCIs). Recent EEG foundation models trained on large unlabeled corpora aim to learn transferable representations, but their effectiveness remains unclear; reported improvements over smaller task-specific models are often modest, sensitive to downstream adaptation and fine-tuning strategies, and limited under linear probing. We hypothesize that one contributing factor is the reliance on signal reconstruction as the primary self-supervised learning (SSL) objective, which biases representations toward high-variance artifacts rather than task-relevant neural structure. To address this limitation, we explore an SSL paradigm based on Joint Embedding Predictive Architectures (JEPA), which learn by predicting latent representations instead of reconstructing raw signals. While earlier JEPA-style methods often rely on additional heuristics to ensure training stability, recent advances such as LeJEPA provide a more principled and stable formulation. We introduce Laya, the first EEG foundation model based on LeJEPA. Across a range of EEG benchmarks, Laya demonstrates improved performance under linear probing compared to reconstruction-based baselines, suggesting that latent predictive objectives offer a promising direction for learning transferable, high-level EEG representations.

Keywords: EEG foundation model, self-supervised learning, Joint Embedding Predictive Architectures, LeJEPA, latent prediction, signal reconstruction, transferable representations, linear probing

288. ❌ Physics-integrated neural differentiable modeling for immersed boundary systems

Authors: Chenglin Li, Hang Xu, Jianting Chen, Yanfei Zhang Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16277v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

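Scoring rationale: The paper applies physics-integrated neural modeling to immersed-boundary systems in computational fluid dynamics (CFD), which places it in AI for Science. Its core is a physics-integrated differentiable framework for long-horizon prediction of immersed-boundary flows, involving neural PDE solvers, physical constraints, and computational acceleration. The content is unrelated to the vast majority of keywords (LLM, MoE, RLHF, RAG, agents, and so on), which refer specifically to large language models and related techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to scientific computing (fluid dynamics specifically) but is not bioinformatics or cheminformatics, so it receives a moderate relevance of 5 points.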

!!! tip deepseek-chat TL;DR

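To address the high computational cost of long-horizon prediction for immersed-boundary flows and the poor extrapolation stability of purely data-driven models, the paper proposes a physics-integrated differentiable neural framework. It maintains high fidelity and long-horizon stability while achieving an approximately 200-fold inference speedup over a high-resolution solver.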

Abstract

Accurately, efficiently, and stably computing complex fluid flows and their evolution near solid boundaries over long horizons remains challenging. Conventional numerical solvers require fine grids and small time steps to resolve near-wall dynamics, resulting in high computational costs, while purely data-driven surrogate models accumulate rollout errors and lack robustness under extrapolative conditions. To address these issues, this study extends existing neural PDE solvers by developing a physics-integrated differentiable framework for long-horizon prediction of immersed-boundary flows. A key design aspect of the framework includes an important improvement, namely the structural integration of physical principles into an end-to-end differentiable architecture incorporating a PDE-based intermediate velocity module and a multi-direct forcing immersed boundary module, both adhering to the pressure-projection procedure for incompressible flow computation. The computationally expensive pressure projection step is substituted with a learned implicit correction using ConvResNet blocks to reduce cost, and a sub-iteration strategy is introduced to separate the embedded physics module’s stability requirement from the surrogate model’s time step, enabling stable coarse-grid autoregressive rollouts with large effective time increments. The framework uses only single-step supervision for training, eliminating long-horizon backpropagation and reducing training time to under one hour on a single GPU. Evaluations on benchmark cases of flow past a stationary cylinder and a rotationally oscillating cylinder at Re=100 show the proposed model consistently outperforms purely data-driven, physics-loss-constrained, and coarse-grid numerical baselines in flow-field fidelity and long-horizon stability, while achieving an approximately 200-fold inference speedup over the high-resolution solver.

Keywords: physics-integrated neural networks, differentiable modeling, immersed boundary method, computational fluid dynamics, neural PDE solver, long-horizon prediction, autoregressive rollout, inference acceleration

289. ❌ Neural Pushforward Samplers for the Fokker-Planck Equation on Embedded Riemannian Manifolds

Authors: Andrew Qing He, Wei Cai Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16239v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

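Scoring rationale: The paper studies numerical solution of the Fokker-Planck equation on Riemannian manifolds, which belongs to computational mathematics and scientific computing. It uses a neural network method (WANPF), but its core content is unrelated to innovation in deep learning principles or to large-model applications. All keywords except the last target large language models (LLMs) and related techniques (training, inference, alignment, applications, and so on), and the paper does not involve language models or natural language processing at all. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to scientific computing but is not bioinformatics or cheminformatics, so it scores 5 (some relevance). All other keywords score 0 (irrelevant).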

!!! tip deepseek-chat TL;DR

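The paper extends the Weak Adversarial Neural Pushforward (WANPF) method to the Fokker-Planck equation on embedded Riemannian manifolds and achieves mesh-free training through a manifold retraction and globally defined test functions. It derives explicit Laplace-Beltrami formulae for the sphere and the flat torus and demonstrates the method numerically on $S^2$.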

Abstract

We extend the Weak Adversarial Neural Pushforward (WANPF) Method to the Fokker–Planck equation posed on a compact, smoothly embedded Riemannian manifold M in $R^n$. The key observation is that the weak formulation of the Fokker–Planck equation, together with the ambient-space representation of the Laplace–Beltrami operator via the tangential projection $P(x)$ and the mean-curvature vector $H(x)$, permits all integrals to be evaluated as expectations over samples lying on M, using test functions defined globally on $R^n$. A neural pushforward map is constrained to map the support of a base distribution into M at all times through a manifold retraction, so that probability conservation and manifold membership are enforced by construction. Adversarial ambient plane-wave test functions are chosen, and their Laplace–Beltrami operators are derived in closed form, enabling autodiff-free, mesh-free training. We present both a steady-state and a time-dependent formulation, derive explicit Laplace–Beltrami formulae for the sphere $S^{n-1}$ and the flat torus $T^n$, and demonstrate the method numerically on a double-well steady-state Fokker–Planck equation on $S^2$.

Keywords: Fokker-Planck equation, Riemannian manifold, neural pushforward, weak adversarial method, Laplace-Beltrami operator, mesh-free training, double-well potential, sphere $S^2$
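
The abstract notes that the Laplace-Beltrami operator of a plane-wave test function can be written in closed form. As an illustration for the unit sphere only, the sketch below uses the standard identity $\Delta_S f = \Delta f - x^\top \nabla^2 f\, x - (n-1)\, x \cdot \nabla f$ for an ambient function $f$ restricted to $S^{n-1}$, applied to $f(x) = \sin(w \cdot x + b)$. The choice of a sine plane wave and the function names are assumptions, and the paper's own notation and normalization may differ.

```python
import math

def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def lb_plane_wave_on_unit_sphere(x, w, b):
    """Laplace-Beltrami of f(x) = sin(w.x + b) at a point x on the unit sphere.

    With grad f = w cos(theta) and Hess f = -w w^T sin(theta), the identity
    gives -(|w|^2 - (w.x)^2) sin(theta) - (n - 1) (w.x) cos(theta).
    """
    n = len(x)
    wx = dot(w, x)
    theta = wx + b
    tangential_sq = dot(w, w) - wx * wx  # squared norm of w projected onto the tangent space
    return -tangential_sq * math.sin(theta) - (n - 1) * wx * math.cos(theta)
```

Because the expression needs only `w`, `b`, and the sample point `x`, it can be evaluated on samples without automatic differentiation or a mesh, which is the property the abstract highlights.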

290. ❌ ReFORM: Review-aggregated Profile Generation via LLM with Multi-Factor Attention for Restaurant Recommendation

Authors: Moonsoo Park, Seulbeen Je, Donghyeon Park Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16236v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

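Scoring rationale: The paper explicitly uses LLMs to aggregate reviews and generate user and item profiles for a recommender system, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not cover the technical specifics of the other keywords, such as MoE, SLMs, scaling laws, training methods (pre-training, SFT, RLHF, and so on), inference optimization (RAG, context window, KV cache), agent systems, model compression, or AI for science, so those keywords score 0.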

!!! tip deepseek-chat TL;DR

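The paper proposes the ReFORM framework, which uses an LLM to generate multi-factor user and item profiles from reviews and applies a multi-factor attention mechanism to improve personalized restaurant recommendation. In experiments it outperforms existing methods.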

Abstract

In recommender systems, large language models (LLMs) have gained popularity for generating descriptive summarization to improve recommendation robustness, along with Graph Convolution Networks. However, existing LLM-enhanced recommendation studies mainly rely on the internal knowledge of LLMs about item titles while neglecting the importance of various factors influencing users’ decisions. Although information reflecting various decision factors of each user is abundant in reviews, few studies have actively exploited such insights for recommendation. To address these limitations, we propose a ReFORM: Review-aggregated Profile Generation via LLM with Multi-FactOr Attentive RecoMmendation framework. Specifically, we first generate factor-specific user and item profiles from reviews using LLM to capture a user’s preference by items and an item’s evaluation by users. Then, we propose a Multi-Factor Attention to highlight the most influential factors in each user’s decision-making process. In this paper, we conduct experiments on two restaurant datasets of varying scales, demonstrating its robustness and superior performance over state-of-the-art baselines. Furthermore, in-depth analyses validate the effectiveness of the proposed modules and provide insights into the sources of personalization. Our source code and datasets are available at https://github.com/m0onsoo/ReFORM.

Keywords: Recommender Systems, Large Language Models, Review Aggregation, Multi-Factor Attention, Personalization, Restaurant Recommendation, User Profile Generation, Item Profile Generation
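
The abstract describes a Multi-Factor Attention that highlights the most influential factors in each user's decision, without giving its exact form. The sketch below assumes plain scaled dot-product attention over per-factor embeddings. The factor names, the embeddings, and the function names are hypothetical, and the paper's actual layer may differ.

```python
import math

def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_factor_attention(query, factor_embeddings):
    """Weight each factor embedding by its scaled similarity to the query."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in factor_embeddings])
    pooled = [
        sum(w * f[i] for w, f in zip(weights, factor_embeddings))
        for i in range(len(query))
    ]
    return weights, pooled

# Hypothetical two-factor example: index 0 = taste, index 1 = price.
user_query = [1.0, 0.0]
factors = [[2.0, 0.0], [0.0, 2.0]]
weights, pooled = multi_factor_attention(user_query, factors)
```

Here the user query aligns with the first factor, so that factor receives the larger attention weight and dominates the pooled profile.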

291. ❌ Dual Consensus: Escaping from Spurious Majority in Unsupervised RLVR via Two-Stage Vote Mechanism

Authors: Kaixuan Du, Meng Cao, Hang Zhang, Yukun Wang, Xiangzhou Huang, Ni Li Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16223v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
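The pass/fail mark on this entry follows from simple arithmetic. The sketch below assumes the threshold formula stated in the report's configuration section (5.0 + 0.8 × total weight, with a total weight of 27.0). It also assumes that a paper's total is the plain sum of the per-keyword scores in the table. The function names and the `>=` comparison are illustrative and are not taken from the report generator.

```python
def pass_threshold(total_weight, base=5.0, slope=0.8):
    """Passing score as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return base + slope * total_weight


def total_score(row_scores):
    """Assumed aggregation: a paper's total is the sum of its per-keyword scores."""
    return sum(row_scores)


def passes(row_scores, total_weight):
    """Assumed comparison: a paper passes when its total reaches the threshold."""
    return total_score(row_scores) >= pass_threshold(total_weight)


# All 27 rows in the table above carry a score of 0.0.
rows = [0.0] * 27
threshold = pass_threshold(27.0)
print(round(threshold, 1), total_score(rows), passes(rows, 27.0))
```

The table records 0.0 for every row, including the four rows rated 10.0/10 for relevance, so the total stays at 0.0 and falls short of 26.6.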

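Scoring rationale: The paper proposes DCRL, an unsupervised reinforcement learning method aimed at improving the performance of large language models (LLMs) on complex reasoning tasks. Its core focus is LLM reasoning ability (highly relevant to Chain of Thought and System 2 Thinking), and it proposes a method that achieves self-improvement (Self-Correction/Self-Improvement) through a two-stage consensus mechanism. The paper does not address the specific techniques or areas behind other keywords such as MoE, quantization, RAG, alignment, tool use, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper addresses the tendency of LLMs under unsupervised RLVR to get trapped in spurious majority answers and proposes DCRL, a two-stage consensus self-supervised training method that consistently improves LLM reasoning performance across multiple benchmarks.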

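Abstract translation

Current label-free RLVR methods for large language models (LLMs), such as TTRL and Self-reward, have proven effective at improving LLM performance on complex reasoning tasks. However, these methods depend heavily on accurate pseudo-label estimation and tend to converge on spurious but popular answers, which traps them in a dominant mode and limits further gains. To address this, we propose Dual Consensus Reinforcement Learning (DCRL), a novel self-supervised training method that produces more reliable learning signals through a two-stage consensus mechanism. The model first acts as an anchor and generates dominant responses; it then acts as an explorer and produces diverse auxiliary signals through a temporary unlearning process. The final training target is derived from the harmonic mean of these two sets of signals. Notably, the process requires no external models or supervision. In experiments covering eight benchmarks and multiple domains, DCRL consistently improves Pass@1 over majority voting while showing more stable training dynamics. These results indicate that DCRL establishes a scalable path toward stronger reasoning without labels.

Abstract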

Current label-free RLVR approaches for large language models (LLMs), such as TTRL and Self-reward, have demonstrated effectiveness in improving the performance of LLMs on complex reasoning tasks. However, these methods rely heavily on accurate pseudo-label estimation and converge on spurious yet popular answers, thereby trapping in a dominant mode and limiting further improvements. Building on this, we propose Dual Consensus Reinforcement Learning (DCRL), a novel self-supervised training method which is capable of generating more reliable learning signals through a two-stage consensus mechanism. The model initially acts as an anchor, producing dominant responses; then it serves as an explorer, generating diverse auxiliary signals via a temporary unlearning process. The final training target is derived from the harmonic mean of these two signal sets. Notably, the process operates entirely without external models or supervision. Across eight benchmarks and diverse domains, DCRL consistently improves Pass@1 over majority vote while yielding more stable training dynamics. These results demonstrate that DCRL establishes a scalable path toward stronger reasoning without labels.

Keywords: Large Language Models, Unsupervised RLVR, Complex Reasoning, Self-supervised Training, Two-stage Consensus, Self-improvement, Reasoning Performance, Label-free Learning
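The harmonic-mean step in the abstract above can be illustrated with a small sketch. The abstract does not say which quantities enter the harmonic mean, so this example assumes they are per-answer vote shares from the anchor stage and the explorer stage. The function names and the selection rule are hypothetical and are not the authors' implementation.

```python
from collections import Counter


def vote_shares(answers):
    """Fraction of samples that ended on each final answer."""
    counts = Counter(answers)
    n = len(answers)
    return {ans: c / n for ans, c in counts.items()}


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; zero if either is zero."""
    return 0.0 if a == 0.0 or b == 0.0 else 2.0 * a * b / (a + b)


def dual_consensus_target(anchor_answers, explorer_answers):
    """Pick the answer with strong support in BOTH stages (hypothetical rule)."""
    anchor = vote_shares(anchor_answers)
    explorer = vote_shares(explorer_answers)
    candidates = set(anchor) | set(explorer)
    scores = {
        ans: harmonic_mean(anchor.get(ans, 0.0), explorer.get(ans, 0.0))
        for ans in candidates
    }
    # Sorting first makes tie-breaking deterministic.
    best = max(sorted(scores), key=scores.get)
    return best, scores


# Illustrative votes: "42" is the anchor majority but has little explorer support.
anchor_answers = ["42"] * 6 + ["17"] * 4
explorer_answers = ["17"] * 7 + ["42"] + ["3"] * 2
target, scores = dual_consensus_target(anchor_answers, explorer_answers)
print(target, scores)
```

Because the harmonic mean is dominated by the smaller of its two inputs, an answer that is popular in only one stage scores low. That is one way a spurious majority could be discounted.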

292. ❌ Online Semi-infinite Linear Programming: Efficient Algorithms via Function Approximation

Authors: Yiming Zong, Jiashuo Jiang Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16200v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

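Scoring rationale: The paper studies the online semi-infinite linear programming (OSILP) problem and proposes a function-approximation-based algorithm for dynamic resource allocation with a large or infinite number of constraints. Its content lies entirely within operations research, optimization algorithms, and online learning, and draws on traditional optimization concepts such as linear programming, function approximation, and regret-bound analysis. All scoring keywords concern large models, deep learning, AI technical principles, or AI-for-science applications, none of which the paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

For dynamic resource allocation with a large or infinite number of constraints, the paper proposes function-approximation-based algorithms for online semi-infinite linear programming, achieves regret bounds that are independent of the number of constraints, and shows experimentally that they outperform existing methods.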

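Abstract translation

This paper studies a dynamic resource allocation problem in which the decision space is finite-dimensional, yet the solution must satisfy a large or even infinite number of constraints revealed through streaming data or oracle feedback. We model this challenge as an Online Semi-infinite Linear Programming (OSILP) problem and develop a novel linear programming (LP) formulation to solve it approximately. Specifically, we use function approximation to reduce the number of constraints to a constant $q$, which addresses a key limitation of traditional online LP algorithms: their regret bounds typically depend on the number of constraints, so they perform poorly in this setting. We propose a dual-based algorithm for the new formulation that applies broadly through the choice of an appropriate potential function. We analyze the algorithm under two classical input models, stochastic input and random permutation, and establish regret bounds of $O(q\sqrt{T})$ and $O\left((q+q\log{T})\sqrt{T}\right)$ respectively. Notably, both bounds are independent of the number of constraints, which demonstrates the potential of our approach to handle a large or even infinite number of constraints. We further explore whether the $O(q\sqrt{T})$ regret can be improved and propose a two-stage algorithm that achieves $O(q\log{T} + q/\epsilon)$ regret under stricter assumptions. We also extend our algorithms to the general function setting. A series of experiments confirms that our algorithms outperform existing methods when faced with a large number of constraints.

Abstract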

We consider the dynamic resource allocation problem where the decision space is finite-dimensional, yet the solution must satisfy a large or even infinite number of constraints revealed via streaming data or oracle feedback. We model this challenge as an Online Semi-infinite Linear Programming (OSILP) problem and develop a novel LP formulation to solve it approximately. Specifically, we employ function approximation to reduce the number of constraints to a constant $q$. This addresses a key limitation of traditional online LP algorithms, whose regret bounds typically depend on the number of constraints, leading to poor performance in this setting. We propose a dual-based algorithm to solve our new formulation, which offers broad applicability through the selection of appropriate potential functions. We analyze this algorithm under two classical input models, stochastic input and random permutation, establishing regret bounds of $O(q\sqrt{T})$ and $O\left((q+q\log{T})\sqrt{T}\right)$ respectively. Note that both regret bounds are independent of the number of constraints, which demonstrates the potential of our approach to handle a large or infinite number of constraints. Furthermore, we investigate the potential to improve upon the $O(q\sqrt{T})$ regret and propose a two-stage algorithm, achieving $O(q\log{T} + q/\epsilon)$ regret under more stringent assumptions. We also extend our algorithms to the general function setting. A series of experiments validates that our algorithms outperform existing methods when confronted with a large number of constraints.

Keywords: Online Semi-infinite Linear Programming, dynamic resource allocation, function approximation, regret bounds, constraint reduction, dual-based algorithm, stochastic input, random permutation
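To make the phrase "dual-based algorithm" in the abstract above more concrete, here is a generic dual-descent rule for an online packing LP with $q$ constraints. It is a textbook-style heuristic and is not the algorithm or potential function analysed in the paper. The function name, the accept/reject rule, and the step size are all assumptions made for illustration.

```python
def online_dual_descent(rewards, columns, budget, step_size):
    """Generic dual-descent rule for an online packing LP with q constraints.

    rewards[t] is the reward of request t, columns[t] its q-dimensional
    resource use, and budget the q-dimensional capacity. Illustrative only.
    """
    horizon, q = len(rewards), len(budget)
    per_step = [b / horizon for b in budget]   # average budget per request
    prices = [0.0] * q                          # dual variables, one per constraint
    remaining = [float(b) for b in budget]
    decisions = []
    for r_t, a_t in zip(rewards, columns):
        cost = sum(p * a for p, a in zip(prices, a_t))
        feasible = all(a <= rem for a, rem in zip(a_t, remaining))
        # Accept only if the reward beats the current price of the resources used.
        x_t = 1 if (r_t > cost and feasible) else 0
        decisions.append(x_t)
        remaining = [rem - a * x_t for rem, a in zip(remaining, a_t)]
        # Projected subgradient step: raise prices of resources used faster than budgeted.
        prices = [
            max(0.0, p + step_size * (a * x_t - rho))
            for p, a, rho in zip(prices, a_t, per_step)
        ]
    return decisions, prices


# Illustrative instance: one resource, capacity 2, four unit-size requests.
decisions, prices = online_dual_descent(
    rewards=[1.0, 0.1, 1.0, 0.1],
    columns=[[1.0], [1.0], [1.0], [1.0]],
    budget=[2.0],
    step_size=0.5,
)
print(decisions, prices)
```

The per-request work here scales with `q`, the number of (reduced) constraints. This is the quantity that appears in the regret bounds quoted in the abstract.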

293. ❌ Sample-Efficient Adaptation of Drug-Response Models to Patient Tumors under Strong Biological Domain Shift

Authors: Camille Jimenez Cortes, Philippe Lalanda, German Vega Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16185v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

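Scoring rationale: The paper focuses on biomedical AI applications, specifically a transfer-learning framework for drug-response prediction. Keyword relevance breaks down as follows. (1) Highly relevant (10 points): "Pre-training OR Continual Pre-training OR Domain Adaptation", because the paper's core is pretraining on unlabeled data followed by domain adaptation; and "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies directly to bioinformatics and precision medicine. (2) Moderately relevant (5 points): "Post-training OR Supervised Fine-tuning OR SFT", because it involves supervised fine-tuning with a small amount of labeled data. (3) Not relevant (0 points): the remaining keywords mainly concern large-model technical principles (such as LLMs, MoE, inference optimization, and alignment); the paper neither uses nor mentions these techniques and instead relies on a traditional deep-learning transfer-learning framework.

!!! tip deepseek-chat TL;DR

The study proposes a staged transfer-learning framework that first learns cell and drug representations without supervision and then adapts to patient tumors with a small amount of labeled data. This enables more sample-efficient drug-response prediction under strong biological domain shift and reduces the need for clinical supervision.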

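Abstract translation

In precision oncology, predicting patients' drug response from preclinical data remains a major challenge because of the substantial biological gap between in vitro cell lines and patient tumors. Rather than aiming to improve absolute in vitro prediction accuracy, this study examines whether explicitly separating representation learning from task supervision allows drug-response models to adapt to patient data with greater sample efficiency under strong biological domain shift. We propose a staged transfer-learning framework: autoencoder-based representation learning first learns cell and drug representations independently from large amounts of unlabeled pharmacogenomic data; these representations are then aligned with drug-response labels on cell-line data and adapted to patient tumors using few-shot supervision. Through a systematic evaluation covering in-domain, cross-dataset, and patient-level settings, we find that unsupervised pretraining brings limited benefit when the source and target domains overlap heavily, but shows a clear advantage when adapting to patient tumors with very little labeled data. Specifically, the proposed framework improves faster during few-shot patient-level adaptation while matching the accuracy of single-phase baselines on standard cell-line benchmarks. Overall, the results indicate that learning structured, transferable representations from unlabeled molecular profiles can substantially reduce the amount of clinical supervision needed for effective drug-response prediction, offering a practical path toward data-efficient preclinical-to-clinical translation.

Abstract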

Predicting drug response in patients from preclinical data remains a major challenge in precision oncology due to the substantial biological gap between in vitro cell lines and patient tumors. Rather than aiming to improve absolute in vitro prediction accuracy, this work examines whether explicitly separating representation learning from task supervision enables more sample-efficient adaptation of drug-response models to patient data under strong biological domain shift. We propose a staged transfer-learning framework in which cellular and drug representations are first learned independently from large collections of unlabeled pharmacogenomic data using autoencoder-based representation learning. These representations are then aligned with drug-response labels on cell-line data and subsequently adapted to patient tumors using few-shot supervision. Through a systematic evaluation spanning in-domain, cross-dataset, and patient-level settings, we show that unsupervised pretraining provides limited benefit when source and target domains overlap substantially, but yields clear gains when adapting to patient tumors with very limited labeled data. In particular, the proposed framework achieves faster performance improvements during few-shot patient-level adaptation while maintaining comparable accuracy to single-phase baselines on standard cell-line benchmarks. Overall, these results demonstrate that learning structured and transferable representations from unlabeled molecular profiles can substantially reduce the amount of clinical supervision required for effective drug-response prediction, offering a practical pathway toward data-efficient preclinical-to-clinical translation.

Keywords: drug-response prediction, precision oncology, transfer learning, domain adaptation, unsupervised pretraining, few-shot learning, pharmacogenomics, patient tumors

294. ❌ Execution-Grounded Credit Assignment for GRPO in Code Generation

Authors: Abhijit Kumar, Natalya Kumar, Shikhar Gupta Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16158v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

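Scoring rationale: The paper mainly studies an improvement to a reinforcement learning algorithm (GRPO) for code generation and proposes Execution-Grounded Credit Assignment (EGCA), a credit-assignment method based on execution traces. It is relevant to LLMs (8 points) because code generation is an important application area for large language models and the paper evaluates on HumanEval and MBPP, which are standard benchmarks for code-generation LLMs. It does not involve the specific techniques behind the other keywords, such as MoE, quantization, or RAG, so those keywords are rated 0.

!!! tip deepseek-chat TL;DR

The paper targets the coarse credit assignment of GRPO-style reinforcement learning in code generation and proposes EGCA, a credit-assignment method based on execution traces, which raises pass@1 over GRPO by 3.1 points on HumanEval and 1.5 points on MBPP.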

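Abstract translation

Critic-free reinforcement learning with verifiable rewards (RLVR) improves code generation by optimizing unit-test pass rates, but GRPO-style updates suffer from coarse credit assignment: even when a failure stems from a localized semantic error, a single outcome signal is spread uniformly across a long program. We propose Execution-Grounded Credit Assignment (EGCA), which uses execution traces to localize GRPO updates. For programs that satisfy algorithmic constraints but fail tests, EGCA runs the candidate program and a canonical reference solution (built once offline; used only for analysis, not supervision) under identical instrumentation, identifies the earliest point of semantic divergence, and assigns advantage only to the corresponding token span while masking downstream tokens. EGCA is a drop-in modification that needs no critic network, auxiliary loss, or learned verifier. It reaches 82.1% pass@1 on HumanEval (3.1 points above GRPO) and 68.9% on MBPP (1.5 points above GRPO), with 18% wall-clock overhead.

Abstract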

Critic-free reinforcement learning with verifiable rewards (RLVR) improves code generation by optimizing unit-test pass rates, but GRPO-style updates suffer from coarse credit assignment: a single outcome signal is spread uniformly across long programs even when failure stems from a localized semantic error. We propose Execution-Grounded Credit Assignment (EGCA), which localizes GRPO updates using execution traces. For programs that satisfy algorithmic constraints but fail tests, EGCA executes the candidate and a canonical reference solution (curated once offline; used for analysis, not supervision) under identical instrumentation, identifies the earliest semantic divergence, and assigns advantage only to the corresponding token span while masking downstream tokens. EGCA is a drop-in modification requiring no critic, auxiliary loss, or learned verifier, yielding 82.1% pass@1 on HumanEval (+3.1 over GRPO) and 68.9% on MBPP (+1.5) with 18% wall-clock overhead.

Keywords: code generation, reinforcement learning, credit assignment, GRPO, execution traces, HumanEval, MBPP, unit-test pass rates
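The localization step in the abstract above (find the earliest divergence between two execution traces, then credit only the matching token span) can be sketched as follows. The trace format, the `line_to_span` mapping from source lines to token positions, and both function names are hypothetical. The abstract does not specify how EGCA represents traces or spans.

```python
def earliest_divergence(candidate_trace, reference_trace):
    """Index of the first trace event where the two executions differ, or None."""
    for i, (cand, ref) in enumerate(zip(candidate_trace, reference_trace)):
        if cand != ref:
            return i
    if len(candidate_trace) != len(reference_trace):
        # One run stopped early: the divergence is where the shorter trace ends.
        return min(len(candidate_trace), len(reference_trace))
    return None


def localized_advantages(num_tokens, advantage, span):
    """Put the sequence-level advantage on tokens in [start, end) and zero elsewhere."""
    start, end = span
    return [advantage if start <= i < end else 0.0 for i in range(num_tokens)]


# Hypothetical traces: (source line, variable snapshot) after each executed line.
candidate = [(1, {"s": 0}), (2, {"s": 1}), (3, {"s": 3})]
reference = [(1, {"s": 0}), (2, {"s": 2}), (3, {"s": 4})]

# Hypothetical mapping from source line to the token span that produced it.
line_to_span = {1: (0, 4), 2: (4, 9), 3: (9, 12)}

idx = earliest_divergence(candidate, reference)
faulty_line = candidate[idx][0]
advantages = localized_advantages(12, -1.0, line_to_span[faulty_line])
print(idx, faulty_line, advantages)
```

In this toy case the state first differs at line 2, so only tokens 4 through 8 receive the negative advantage. Line 3 also differs, but it is treated as a downstream consequence and masked.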

295. ❌ The Finetuner’s Fallacy: When to Pretrain with Your Finetuning Data

Authors: Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blakeney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vineeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason Telanoff, Jack Urbanek, Zhengping Wang, Josh Wills, Haoli Yin, Aditi Raghunathan, J. Zico Kolter, Bogdan Gaza, Ari Morcos, Matthew Leavitt, Pratyush Maini Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16177v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

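Scoring rationale: The paper studies specialized pretraining (SPT), a strategy that repeats domain data during the pretraining stage to improve domain performance while preserving general capabilities. Its core is highly relevant to pretraining, fine-tuning, scaling laws, and AI-for-science applications (for example ChemPile and ProofPile), but it does not address other techniques such as MoE, SLMs, alignment, or inference acceleration.

!!! tip deepseek-chat TL;DR

The paper studies specialized pretraining, which repeats domain data during the pretraining stage. Compared with standard pretraining, it improves domain performance and reduces overfitting and forgetting, and its effectiveness is validated in scientific domains such as chemistry and mathematics.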

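Abstract translation

Real-world model deployment demands strong performance in narrow domains where data is often scarce. Practitioners usually fine-tune a model to specialize it, but this can cause the model to overfit the domain and forget general knowledge. We study a simple strategy, specialized pretraining (SPT), in which a small domain dataset normally reserved for fine-tuning is repeated from the start of pretraining as a fraction of the total training tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after fine-tuning, compared with standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. The gains are larger when the target domain is underrepresented in the pretraining corpus: on domains far from web text, a 1B-parameter SPT model outperforms a 3B-parameter model with standard pretraining. Beyond these empirical gains, we derive overfitting scaling laws that guide practitioners in choosing the optimal domain-data repetition for a given pretraining compute budget. Our observations reveal the "finetuner's fallacy": although fine-tuning may look like the cheapest route to domain adaptation, introducing specialized domain data during pretraining extends its usefulness. SPT achieves better specialized-domain performance by reducing overfitting across repeated exposures, and better general-domain performance by reducing forgetting during fine-tuning, ultimately reaching stronger results with fewer parameters and less total compute once amortized over inference. To get the most value from domain data, bring it into training as early as possible.

Abstract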

Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared to standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. These gains grow when the target domain is underrepresented in the pretraining corpus: on domains far from web text, a 1B SPT model outperforms a 3B standard pretrained model. Beyond these empirical gains, we derive overfitting scaling laws to guide practitioners in selecting the optimal domain-data repetition for a given pretraining compute budget. Our observations reveal the finetuner’s fallacy: while finetuning may appear to be the cheapest path to domain adaptation, introducing specialized domain data during pretraining stretches its utility. SPT yields better specialized domain performance (via reduced overfitting across repeated exposures) and better general domain performance (via reduced forgetting during finetuning), ultimately achieving stronger results with fewer parameters and less total compute when amortized over inference. To get the most out of domain data, incorporate it as early in training as possible.

Keywords: specialized pretraining, domain adaptation, finetuning, overfitting scaling laws, ChemPile, ProofPile, pretraining compute budget, domain-data repetition
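The abstract above describes the domain set as "repeated ... as a fraction of the total tokens". The arithmetic linking that fraction to the number of passes over the domain data can be sketched as below. The function names and the token counts in the example are illustrative assumptions. They are not figures from the paper, and the sketch says nothing about the paper's overfitting scaling laws.

```python
def domain_repetitions(total_tokens, domain_tokens, domain_fraction):
    """Number of passes over the domain set implied by a mixing fraction."""
    if not 0.0 <= domain_fraction <= 1.0:
        raise ValueError("domain_fraction must lie in [0, 1]")
    return domain_fraction * total_tokens / domain_tokens


def mixture_budget(total_tokens, domain_fraction):
    """Split the pretraining token budget into (general, domain) tokens."""
    domain_budget = domain_fraction * total_tokens
    return total_tokens - domain_budget, domain_budget


# Illustrative numbers only: 1000 total tokens, a 50-token domain set, 25% mix.
reps = domain_repetitions(total_tokens=1000, domain_tokens=50, domain_fraction=0.25)
general, domain = mixture_budget(total_tokens=1000, domain_fraction=0.25)
print(reps, general, domain)
```

For a fixed token budget, a larger mixing fraction or a smaller domain set means more repetitions of the same data. This repetition count is the knob that the abstract says the scaling laws are meant to tune.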

296. ❌ DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay

Authors: Long Li, Zhijian Zhou, Tianyi Wang, Weidi Xu, Zuming Huang, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16157v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

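Scoring rationale: The paper studies the use of reinforcement learning (RL) to strengthen the reasoning ability of large language models and proposes DyJR, an experience-replay method that addresses the low sample efficiency and mode collapse of existing approaches. It is highly relevant to "Large Language Models" (8 points): the abstract states explicitly that "Reinforcement Learning (RL) enhances Large Language Model reasoning", and the experiments run on mathematical reasoning and Text-to-SQL benchmarks. It has some connection to "Chain of Thought" and "System 2 Thinking" (5 points each) because these reasoning tasks usually call for multi-step or in-depth reasoning, although the paper does not explicitly mention CoT or System 2 techniques. The other keywords, such as MoE, SLMs, and Scaling Laws, are unrelated to the paper's content (0 points), since it does not involve those techniques.

!!! tip deepseek-chat TL;DR

The paper proposes DyJR (Dynamic Jensen-Shannon Replay), which uses a dynamic reference distribution within a regularization framework to address low sample efficiency and mode collapse when reinforcement learning is used to enhance LLM reasoning. It significantly outperforms existing methods on mathematical reasoning and Text-to-SQL benchmarks.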

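Abstract translation

Although reinforcement learning (RL) can strengthen the reasoning ability of large language models, on-policy algorithms such as GRPO are sample-inefficient because they discard past rollouts. Existing experience-replay methods address this by reusing high-accuracy samples for direct policy updates, but this often incurs high computational cost and causes mode collapse through overfitting. We argue that historical data should be used primarily to sustain diversity rather than merely to reinforce accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework that uses a dynamic reference distribution built from recent trajectories. DyJR introduces two innovations: (1) a Time-Sensitive Dynamic Buffer that uses first-in, first-out (FIFO) eviction and adaptive sizing to keep only temporally proximal samples, staying in sync with the model's evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks show that DyJR significantly outperforms GRPO and baselines such as RLEP and Ex-GRPO while keeping training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-$k$ token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, clarifying how specific sub-modules of DyJR affect training dynamics.

Abstract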

While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks demonstrate that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-$k$ token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, elucidating how specific sub-modules of DyJR influence the training dynamics.

Keywords: Reinforcement Learning, Large Language Models, Experience Replay, Diversity Preservation, Jensen-Shannon Divergence, Mathematical Reasoning, Text-to-SQL, Mode Collapse
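The two ingredients named in the abstract above, a FIFO buffer of recent samples and a Jensen-Shannon divergence penalty, can be sketched with the standard library. The Jensen-Shannon formula is the standard definition. The `RecentBuffer` class, the choice to store whole token distributions, and the averaging into a reference distribution are assumptions for illustration. Adaptive buffer sizing is not modelled.

```python
import math
from collections import deque


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2 in nats."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


class RecentBuffer:
    """Fixed-capacity FIFO buffer of recent token distributions (hypothetical)."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def add(self, dist):
        self.items.append(dist)   # the oldest entry is evicted automatically

    def reference(self):
        """Average the stored distributions into one reference distribution."""
        n = len(self.items)
        return [sum(col) / n for col in zip(*self.items)]


buffer = RecentBuffer(capacity=2)
for dist in ([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]):
    buffer.add(dist)

reference = buffer.reference()                    # built from the two newest entries
penalty = js_divergence([0.9, 0.1], reference)    # regularization term for a peaked policy
print(reference, penalty)
```

Unlike KL divergence, the Jensen-Shannon divergence stays finite when one distribution assigns zero probability to a token. This makes it a convenient bounded penalty against a reference that keeps changing.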

297. ❌ Deep Adaptive Model-Based Design of Experiments

Authors: Arno Strouwen, Sebastian Micluţa-Câmpeanu Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16146v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

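Scoring rationale: The paper focuses on model-based design of experiments (MBDOE) for nonlinear dynamical systems and proposes a method that combines Deep Adaptive Design (DAD) with differentiable mechanistic models, addressing the high computational cost that keeps conventional adaptive MBDOE from real-time use. Its core is the application of deep learning to scientific computing and engineering optimization, in particular to concrete scientific problems such as bioreactors and pharmacokinetic models. It therefore has only a partial connection to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (rated 5 points): the paper touches on bioinformatics (for example bioreactors) and AI-for-science applications, but these are not its central focus. All other keywords are unrelated to the paper's content and are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes a method that combines Deep Adaptive Design with differentiable mechanistic models to address the high computational cost that prevents conventional model-based design of experiments from running in real time on nonlinear dynamical systems. It validates the approach on several complex systems, such as bioreactors and a pharmacokinetic model.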

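Abstract translation

Model-based design of experiments (MBDOE) is essential for efficient parameter estimation in nonlinear dynamical systems. However, conventional adaptive MBDOE requires expensive posterior inference and design optimization between experimental steps, which rules out real-time applications. To address this, we combine Deep Adaptive Design (DAD) with differentiable mechanistic models: DAD amortizes the cost of sequential design into a neural network policy trained offline, avoiding the online computational burden. For dynamical systems with known governing equations but uncertain parameters, we extend sequential contrastive training objectives to handle nuisance parameters and propose a transformer-based policy architecture that respects the temporal structure of dynamical systems. We validate the approach on four systems of increasing complexity: a fed-batch bioreactor with Monod kinetics, a Haldane bioreactor with uncertain substrate inhibition, a two-compartment pharmacokinetic model with nuisance clearance parameters, and a DC motor system for real-time deployment.

Abstract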

Model-based design of experiments (MBDOE) is essential for efficient parameter estimation in nonlinear dynamical systems. However, conventional adaptive MBDOE requires costly posterior inference and design optimization between each experimental step, precluding real-time applications. We address this by combining Deep Adaptive Design (DAD), which amortizes sequential design into a neural network policy trained offline, with differentiable mechanistic models. For dynamical systems with known governing equations but uncertain parameters, we extend sequential contrastive training objectives to handle nuisance parameters and propose a transformer-based policy architecture that respects the temporal structure of dynamical systems. We demonstrate the approach on four systems of increasing complexity: a fed-batch bioreactor with Monod kinetics, a Haldane bioreactor with uncertain substrate inhibition, a two-compartment pharmacokinetic model with nuisance clearance parameters, and a DC motor for real-time deployment.

Keywords: Model-based design of experiments, Deep Adaptive Design, differentiable mechanistic models, nonlinear dynamical systems, parameter estimation, sequential design, transformer-based policy, real-time applications

298. ❌ Communication-Aware Multi-Agent Reinforcement Learning for Decentralized Cooperative UAV Deployment

Authors: Enguang Fan, Yifan Chen, Zihan Shan, Matthew Caesar, Jae Kim Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16141v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

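Scoring rationale: The paper studies multi-agent reinforcement learning (MARL) for cooperative UAV deployment, with communication-aware multi-agent systems at its core. It is unrelated to the vast majority of keywords, which concern large-model technical principles, training methods, inference optimization, and similar topics. The only relevant keyword is "Multi-agent Systems OR Agent Coordination": the paper explicitly studies multi-agent coordination, uses the centralized training with decentralized execution (CTDE) framework, and involves communication and coordination among neighbors, so it is rated 10 points (highly relevant). None of the other keywords applies, since the paper does not involve large models, deep-learning technical principles, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper proposes a graph-based multi-agent reinforcement learning framework for cooperative UAV deployment under partial observability and intermittent communication. On the cooperative relay task it achieves high coverage while remaining competitive with an optimization-based benchmark, and it generalizes to unseen team sizes.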

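Abstract translation

Autonomous UAV swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, but practical deployments must operate under partial observability and intermittent peer-to-peer links. This paper presents a graph-based multi-agent reinforcement learning framework trained under centralized training with decentralized execution: a centralized critic and the global state are available only during training, while at execution time each UAV runs a shared policy using local observations and messages from nearby neighbors. Our architecture encodes the local agent state and nearby entities with an agent-entity attention module and aggregates inter-UAV messages with neighbor self-attention over a distance-limited communication graph. We evaluate primarily on a cooperative relay deployment task and secondarily on an adversarial engagement task. In the relay deployment task, the proposed method achieves high coverage under restricted communication and partial observation while remaining competitive with an offline upper bound obtained by mixed-integer linear programming, and it generalizes to unseen team sizes without fine-tuning. In the adversarial setting, the same framework transfers without architectural changes and improves the win rate over non-communicating baselines.

Abstract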

Autonomous Unmanned Aerial Vehicle (UAV) swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, yet practical deployments must operate under partial observability and intermittent peer-to-peer links. We present a graph-based multi-agent reinforcement learning framework trained under centralized training with decentralized execution (CTDE): a centralized critic and global state are available only during training, while each UAV executes a shared policy using local observations and messages from nearby neighbors. Our architecture encodes local agent state and nearby entities with an agent-entity attention module, and aggregates inter-UAV messages with neighbor self-attention over a distance-limited communication graph. We evaluate primarily on a cooperative relay deployment task (DroneConnect) and secondarily on an adversarial engagement task (DroneCombat). In DroneConnect, the proposed method achieves high coverage under restricted communication and partial observation (e.g. 74% coverage with M = 5 UAVs and N = 10 nodes) while remaining competitive with a mixed-integer linear programming (MILP) optimization-based offline upper bound, and it generalizes to unseen team sizes without fine-tuning. In the adversarial setting, the same framework transfers without architectural changes and improves win rate over non-communicating baselines.

Keywords: Multi-agent Reinforcement Learning, Decentralized Cooperation, UAV Deployment, Communication-Aware, CTDE, Partial Observability, Graph-based Framework, Agent Coordination
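
The neighbor self-attention over a distance-limited communication graph described in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: the function name `neighbor_attention`, the plain dot-product scores (no learned query, key, or value projections), and the toy positions and features are all assumptions made for illustration.

```python
import math

def neighbor_attention(positions, features, comm_range):
    """Aggregate neighbor features with softmax attention over a distance-limited graph.

    Hypothetical simplification: scores are raw scaled dot products, with no learned weights.
    """
    n = len(positions)
    dim = len(features[0])
    out = []
    for i in range(n):
        # Neighbors are agents within comm_range; agent i is always its own neighbor.
        nbrs = [j for j in range(n)
                if math.dist(positions[i], positions[j]) <= comm_range]
        # Scaled dot-product score between agent i and each neighbor.
        scores = [sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(dim)
                  for j in nbrs]
        # Numerically stable softmax over the neighbor set only.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of neighbor features.
        out.append([sum(w / total * features[j][k] for w, j in zip(weights, nbrs))
                    for k in range(dim)])
    return out

# Toy example: agents 0 and 1 are in range of each other, agent 2 is isolated.
positions = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
messages = neighbor_attention(positions, features, comm_range=2.0)
```

Because the softmax runs only over agents inside the communication range, the isolated agent simply keeps its own features, which mirrors decentralized execution with local information only.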

299. ❌ Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

Authors: Yuxuan Zhu, Daniel Kang Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16140v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

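Scoring rationale: The paper's core topic is RLVR (Reinforcement Learning with Verifiable Rewards) applied to large language models, which directly matches the "Large Language Models" and "RLHF" keywords (10 points each). Its focus on how data quality affects RLVR performance is highly relevant to "Scaling Laws AND Data Quality" (10 points). The models are evaluated on mathematical reasoning and Text2SQL tasks, which involve reasoning and therefore relate somewhat to "Chain of Thought" and "System 2 Thinking" (5 points each). The study of how noisy data affects model factuality is partially relevant to "Hallucination Mitigation" (5 points). Other keywords, such as MoE, SLMs, and PEFT, are not covered in the paper (0 points).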

!!! tip deepseek-chat TL;DR

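The paper finds that, in reinforcement learning with verifiable rewards (RLVR), noisy data significantly degrades large language model performance; existing RLVR algorithms cannot compensate for low-quality data, and high-quality data remains essential.

Abstract translation

Reinforcement learning with verifiable rewards (RLVR) has driven capability advances of large language models across domains. Recent studies suggest that improved RLVR algorithms let models learn effectively from incorrect annotations and reach performance comparable to training on clean data. This study finds that those conclusions do not hold, because the training data claimed to be 100% noisy was in fact "contaminated" with clean data. After correcting the dataset with a rigorous re-verification pipeline, we show that noise is destructive to RLVR. None of the existing RLVR algorithm improvements mitigates the effect of noise; they perform on par with the basic GRPO algorithm. Moreover, on mathematical reasoning benchmarks, models trained on fully incorrect annotations score 8-10% lower than models trained on clean data. Finally, on Text2SQL tasks we verify that these conclusions also hold for realistic noise: training on real-world human annotation errors yields 5-12% lower accuracy than training on clean data. The results indicate that current RLVR methods cannot yet make up for data quality defects and that high-quality data remains essential.

Abstract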

Reinforcement learning with verifiable rewards (RLVR) has driven recent capability advances of large language models across various domains. Recent studies suggest that improved RLVR algorithms allow models to learn effectively from incorrect annotations, achieving performance comparable to learning from clean data. In this work, we show that these findings are invalid because the claimed 100% noisy training data is “contaminated” with clean data. After rectifying the dataset with a rigorous re-verification pipeline, we demonstrate that noise is destructive to RLVR. We show that existing RLVR algorithm improvements fail to mitigate the impact of noise, achieving similar performance to that of the basic GRPO. Furthermore, we find that the model trained on truly incorrect annotations performs 8-10% worse than the model trained on clean data across mathematical reasoning benchmarks. Finally, we show that these findings hold for real-world noise in Text2SQL tasks, where training on real-world human annotation errors causes 5-12% lower accuracy than training on clean data. Our results show that current RLVR methods cannot yet compensate for poor data quality. High-quality data remains essential.

Keywords: Reinforcement Learning with Verifiable Rewards, RLVR, Large Language Models, Data Quality, Noisy Data, Mathematical Reasoning, Text2SQL, GRPO
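
To see why an incorrect label can be harmful under a GRPO-style update, the sketch below computes group-relative advantages for one group of sampled answers, once against a clean label and once against a wrong one. It is a toy illustration rather than the paper's pipeline: the exact-match reward, the use of the population standard deviation, and the sample answers are assumptions.

```python
import statistics

def verifiable_reward(answer, label):
    """Binary exact-match reward (an assumed, simplified verifier)."""
    return 1.0 if answer == label else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical samples for one prompt whose true answer is "42".
sampled_answers = ["42", "42", "41", "40"]

clean_adv = group_relative_advantages(
    [verifiable_reward(a, "42") for a in sampled_answers])
noisy_adv = group_relative_advantages(
    [verifiable_reward(a, "41") for a in sampled_answers])  # mislabeled ground truth
```

In this toy group, the wrong label gives the two correct samples a negative advantage and the incorrect sample "41" a positive one, so the update direction points away from the right answer.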

300. ❌ Functorial Neural Architectures from Higher Inductive Types

Authors: Karen Sargsyan Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16123v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

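Scoring rationale: The paper studies the mathematical foundations of neural network architectures (category theory, higher inductive types), focusing on theoretical analysis and architecture design for compositional generalization. It has no direct connection to any of the scoring keywords, all of which target specific directions in large-model techniques, applications, and optimization. The paper does not address large models, deep learning applications, or innovations in the underlying techniques that the keywords cover.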

!!! tip deepseek-chat TL;DR

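The paper analyzes, from a category-theoretic perspective, why neural networks fail at compositional generalization, proposes a method for designing functorial neural architectures based on higher inductive types, and shows experimentally that functorial decoders significantly outperform non-functorial ones on several topological spaces.

Abstract translation

Neural networks have a systematic deficiency in compositional generalization, that is, they fail to produce correct outputs for novel combinations of known parts. We show that this deficiency is architectural: compositional generalization is equivalent to functoriality of the decoder, a perspective that yields both theoretical guarantees and impossibility results. We compile Higher Inductive Type (HIT) specifications into neural architectures via a monoidal functor from the path groupoid of the target space to a category of parametric maps: path constructors become generator networks, composition becomes structural concatenation, and 2-cells witnessing group relations become learned natural transformations. We prove that decoders assembled by structural concatenation of independently generated segments are strict monoidal functors (compositional by construction), whereas softmax self-attention is not functorial for any non-trivial compositional task. Both results are formally verified in Cubical Agda. Experiments on three spaces validate the full hierarchy: on the torus ($\mathbb{Z}^2$), functorial decoders outperform non-functorial ones by 2-2.7x; on $S^1 \vee S^1$ ($F_2$), the type-A/B gap widens to 5.5-10x; on the Klein bottle ($\mathbb{Z} \rtimes \mathbb{Z}$), a learned 2-cell closes a 46% error gap on words exercising the group relation.

Abstract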

Neural networks systematically fail at compositional generalization – producing correct outputs for novel combinations of known parts. We show that this failure is architectural: compositional generalization is equivalent to functoriality of the decoder, and this perspective yields both guarantees and impossibility results. We compile Higher Inductive Type (HIT) specifications into neural architectures via a monoidal functor from the path groupoid of a target space to a category of parametric maps: path constructors become generator networks, composition becomes structural concatenation, and 2-cells witnessing group relations become learned natural transformations. We prove that decoders assembled by structural concatenation of independently generated segments are strict monoidal functors (compositional by construction), while softmax self-attention is not functorial for any non-trivial compositional task. Both results are formalized in Cubical Agda. Experiments on three spaces validate the full hierarchy: on the torus ($\mathbb{Z}^2$), functorial decoders outperform non-functorial ones by 2-2.7x; on $S^1 \vee S^1$ ($F_2$), the type-A/B gap widens to 5.5-10x; on the Klein bottle ($\mathbb{Z} \rtimes \mathbb{Z}$), a learned 2-cell closes a 46% error gap on words exercising the group relation.

Keywords: compositional generalization, functoriality, Higher Inductive Types, neural architectures, monoidal functor, path groupoid, structural concatenation, Cubical Agda
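
The claim that decoders built by structural concatenation are compositional by construction can be made concrete with a toy sketch. Here the fixed lookup table `GENERATORS` is a hypothetical stand-in for the learned generator networks, and words over the letters `a` and `b` stand in for paths; none of these names come from the paper.

```python
# Hypothetical stand-ins for generator networks: each letter emits a fixed segment.
GENERATORS = {
    "a": [(1, 0)],  # one step along the first loop
    "b": [(0, 1)],  # one step along the second loop
}

def decode(word):
    """Decode a word by concatenating independently generated segments."""
    segments = []
    for letter in word:
        segments.extend(GENERATORS[letter])
    return segments

# Functoriality on this toy: decoding a concatenation equals concatenating decodings.
left, right = "ab", "ba"
whole = decode(left + right)
parts = decode(left) + decode(right)
```

Since each segment depends only on its own letter, `whole` and `parts` coincide for every split of every word, which is the structural property the abstract contrasts with softmax self-attention.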

301. ❌ When Generative Augmentation Hurts: A Benchmark Study of GAN and Diffusion Models for Bias Correction in AI Classification Systems

Authors: Shesh Narayan Gupta, Nik Bear Brown Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16134v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

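Scoring rationale: The paper studies data augmentation with generative models (GANs and diffusion models) for bias correction in AI classification systems. It is highly relevant only to the keyword "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10 points), because it explicitly fine-tunes Stable Diffusion with LoRA. The paper is unrelated to the remaining keywords on large-model principles, scientific applications, and so on, which score 0.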

!!! tip deepseek-chat TL;DR

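Through a benchmark study, the paper finds that under low-data conditions FastGAN augmentation is not only ineffective but actually increases classifier bias, whereas Stable Diffusion fine-tuned with LoRA reduces bias and improves performance.

Abstract translation

Generative models are widely used to compensate for class imbalance in AI training pipelines, yet their failure modes under low-data conditions are not well understood. This paper reports a controlled benchmark comparing three augmentation strategies on a fine-grained animal classification task: traditional image transforms, FastGAN, and Stable Diffusion 1.5 fine-tuned with Low-Rank Adaptation. Using the Oxford-IIIT Pet Dataset with eight artificially underrepresented breeds, we find that FastGAN augmentation does not merely underperform at very small training set sizes but actively increases classifier bias, with a statistically significant large effect across three random seeds (bias gap increase: +20.7%, Cohen's d = +5.03, p = 0.013). Although the number of seeds is small, the effect size is large enough to give confidence in the direction of the finding. Feature embedding analysis with t-distributed Stochastic Neighbor Embedding shows that the FastGAN images for severe-minority breeds form tight, isolated clusters outside the real image distribution, a pattern consistent with mode collapse. Stable Diffusion with Low-Rank Adaptation gives the best overall results, with the highest macro F1 (0.9125 ± 0.0047) and a 13.1% reduction in the bias gap relative to the unaugmented baseline. The data suggest a sample-size boundary between 20 and 50 training images per class, below which GAN augmentation is harmful in this setting, although follow-up work across more domains is needed to locate the boundary more precisely. All experiments ran on a consumer-grade GPU with 6 to 8 GB of memory, with no cloud compute required.

Abstract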

Generative models are widely used to compensate for class imbalance in AI training pipelines, yet their failure modes under low-data conditions are poorly understood. This paper reports a controlled benchmark comparing three augmentation strategies applied to a fine-grained animal classification task: traditional transforms, FastGAN, and Stable Diffusion 1.5 fine-tuned with Low-Rank Adaptation (LoRA). Using the Oxford-IIIT Pet Dataset with eight artificially underrepresented breeds, we find that FastGAN augmentation does not merely underperform at very low training set sizes but actively increases classifier bias, with a statistically significant large effect across three random seeds (bias gap increase: +20.7%, Cohen’s d = +5.03, p = 0.013). The effect size here is large enough to give confidence in the direction of the finding despite the small number of seeds. Feature embedding analysis using t-distributed Stochastic Neighbor Embedding reveals that FastGAN images for severe-minority breeds form tight isolated clusters outside the real image distribution, a pattern consistent with mode collapse. Stable Diffusion with Low-Rank Adaptation produced the best results overall, achieving the highest macro F1 (0.9125 ± 0.0047) and a 13.1% reduction in the bias gap relative to the unaugmented baseline. The data suggest a sample-size boundary somewhere between 20 and 50 training images per class below which GAN augmentation becomes harmful in this setting, though further work across additional domains is needed to establish where that boundary sits more precisely. All experiments run on a consumer-grade GPU with 6 to 8 GB of memory, with no cloud compute required.

Keywords: Generative Augmentation, Bias Correction, GAN, Diffusion Models, Low-Rank Adaptation, Class Imbalance, Mode Collapse, Fine-grained Classification
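
The abstract reports its effect size as Cohen's d over three seeds. The sketch below shows one common way to compute it, using the pooled sample standard deviation of two independent groups. Whether the authors used this variant or a paired one is not stated, and the per-seed numbers here are invented purely to exercise the function.

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation (independent groups)."""
    na, nb = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(group_b)
    pooled_std = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_std

# Made-up bias-gap values for three seeds each (NOT the paper's data).
augmented = [0.30, 0.32, 0.34]
baseline = [0.10, 0.12, 0.14]
d = cohens_d(augmented, baseline)
```

With only three seeds per condition the pooled standard deviation is estimated from very few points, so a large d is better read as evidence about the direction of an effect than as a precise magnitude.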

302. ❌ A Depth-Aware Comparative Study of Euclidean and Hyperbolic Graph Neural Networks on Bitcoin Transaction Systems

Authors: Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16080v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

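Scoring rationale: The paper studies graph neural networks (GNNs) applied to Bitcoin transaction networks, specifically comparing Euclidean and hyperbolic embedding spaces on a node classification task. Its core technique is the GNN, not large language models (LLMs) or new deep learning principles. Every keyword except the last targets LLMs and their training methods, optimization techniques, reasoning mechanisms, alignment, compression, or agent systems, and is unrelated to this GNN study. The last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", scores 5 because the paper applies AI (GNNs) to the analysis of Bitcoin transaction systems, an application of AI to a specific domain (finance and computational social science). That gives it some connection to "AI for Science", although the work is not bioinformatics or cheminformatics proper.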

!!! tip deepseek-chat TL;DR

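The paper compares the node classification performance of Euclidean and hyperbolic graph neural networks on a large Bitcoin transaction graph to study how embedding geometry and neighborhood depth affect the modeling of large-scale transaction networks, and finds that jointly choosing the learning rate and curvature is critical for stabilizing high-dimensional hyperbolic embeddings.

Abstract translation

Bitcoin transaction networks are large-scale socio-technical systems whose activity appears as multi-hop interaction patterns. Graph neural networks have become a widely adopted tool for analyzing such systems, supporting tasks such as entity detection and transaction classification. The emergence of large-scale datasets such as Elliptic has driven the analysis of these systems and work on tasks such as fraud detection. In these settings, the amount of transactional context available to each node is determined by the neighborhood aggregation and sampling strategies, yet the interaction between these receptive fields and the embedding geometry has received little attention. This study conducts a controlled comparison of Euclidean and tangent-space hyperbolic graph neural networks for node classification on a large Bitcoin transaction graph. By keeping the model architecture and dimensionality fixed while explicitly varying the neighborhood range, we analyze the differences between the two embedding spaces. We further examine optimization behavior and find that the joint choice of learning rate and curvature plays a key role in stabilizing high-dimensional hyperbolic embeddings. Overall, the findings offer practical insight into the roles of embedding geometry and neighborhood depth when modeling large-scale transaction networks and inform the deployment of hyperbolic graph neural networks in computational social systems.

Abstract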

Bitcoin transaction networks are large-scale socio-technical systems in which activities are represented through multi-hop interaction patterns. Graph Neural Networks (GNNs) have become a widely adopted tool for analyzing such systems, supporting tasks such as entity detection and transaction classification. Large-scale datasets like Elliptic have allowed for a rise in the analysis of these systems and in tasks such as fraud detection. In these settings, the amount of transactional context available to each node is determined by the neighborhood aggregation and sampling strategies, yet the interaction between these receptive fields and embedding geometry has received limited attention. In this work, we conduct a controlled comparison of Euclidean and tangent-space hyperbolic GNNs for node classification on a large Bitcoin transaction graph. By explicitly varying the neighborhood while keeping the model architecture and dimensionality fixed, we analyze the differences in two embedding spaces. We further examine optimization behavior and observe that joint selection of learning rate and curvature plays a critical role in stabilizing high-dimensional hyperbolic embeddings. Overall, our findings provide practical insights into the role of embedding geometry and neighborhood depth when modeling large-scale transaction networks, informing the deployment of hyperbolic GNNs for computational social systems.

Keywords: Graph Neural Networks, Bitcoin transaction networks, Hyperbolic embeddings, Node classification, Euclidean geometry, Neighborhood depth, Embedding geometry, Computational social systems
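
Tangent-space hyperbolic layers typically move points between a flat tangent space and the curved manifold with exponential and logarithmic maps. The sketch below shows these maps at the origin of a Poincaré ball with curvature parameter `c`. The abstract does not say which hyperbolic model the authors use, so the Poincaré ball and the function names are assumptions.

```python
import math

def exp_map_origin(v, c):
    """Map a tangent vector at the origin onto the Poincare ball of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def log_map_origin(y, c):
    """Inverse of exp_map_origin: map a ball point back to the tangent space."""
    norm = math.sqrt(sum(x * x for x in y))
    if norm == 0.0:
        return list(y)
    scale = math.atanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in y]

# Round trip for a small tangent vector with c = 1.0.
v = [0.3, -0.4]
point = exp_map_origin(v, 1.0)
recovered = log_map_origin(point, 1.0)
```

For large tangent vectors `tanh` saturates toward 1 and the point lands numerically on the ball boundary, where `atanh` is undefined. That saturation is one plausible reason step size and curvature interact in hyperbolic training, though the abstract itself does not attribute the instability to it.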

303. ❌ MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models

Authors: Chen-Hao Chao, Wei-Fang Sun, Junwei Qua, Chun-Yi Lee, Rahul G. Krishnan Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16077v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

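Scoring rationale: The paper studies Masked Diffusion Models (MDM) for language modeling, which falls within large-model techniques. It is fairly relevant to "Large Language Models" (8 points) because it compares MDM with autoregressive models on language modeling. It is relevant to "Pre-training" (8 points) because it covers the pre-training of diffusion models. It has some connection to "Scaling Laws" (5 points) because it carries out a compute-efficiency scaling analysis. Other keywords, such as MoE, SFT, RLHF, and RAG, are not covered in the paper and score 0.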

!!! tip deepseek-chat TL;DR

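To address the MDM-Prime framework's limitations in subtokenizer hyperparameter selection and likelihood estimation, the paper proposes MDM-Prime-v2, which uses binary encoding and index shuffling; it is 21.8x more compute-efficient than autoregressive models and reaches lower perplexity on OpenWebText.

Abstract translation

Masked Diffusion Models (MDM) show superior generalization when trained with a partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, there are no tools to guide the choice of the token-granularity hyperparameter in the subtokenizer. Second, when paired with the commonly used Byte-Pair-Encoding (BPE) tokenizers, the functional form of the subtokenizer significantly degrades likelihood estimation. To address these limitations, we study the tightness of the variational bound in MDM-Prime and develop MDM-Prime-v2, a masked diffusion language model that incorporates Binary Encoding and Index Shuffling. Our scaling analysis shows that MDM-Prime-v2 is 21.8x more compute-efficient than autoregressive models (ARM). In compute-optimal comparisons, MDM-Prime-v2 reaches a perplexity of 7.77 on OpenWebText, better than ARM (12.99), MDM (18.94), and MDM-Prime (13.41). When scaled to 1.1B parameters, the model further shows superior zero-shot accuracy on a range of commonsense reasoning tasks.

Abstract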

Masked diffusion models (MDM) exhibit superior generalization when learned using a Partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, we lack tools to guide the hyperparameter choice of the token granularity in the subtokenizer. Second, we find that the function form of the subtokenizer significantly degrades likelihood estimation when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. To address these limitations, we study the tightness of the variational bound in MDM-Prime and develop MDM-Prime-v2, a masked diffusion language model which incorporates Binary Encoding and Index Shuffling. Our scaling analysis reveals that MDM-Prime-v2 is 21.8$\times$ more compute-efficient than autoregressive models (ARM). In compute-optimal comparisons, MDM-Prime-v2 achieves 7.77 perplexity on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM-Prime (13.41). When extending the model size to 1.1B parameters, our model further demonstrates superior zero-shot accuracy on various commonsense reasoning tasks.

Keywords: Masked Diffusion Models, MDM-Prime-v2, Binary Encoding, Index Shuffling, compute-efficient scaling, language modeling, perplexity, zero-shot accuracy
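
One way to picture a subtokenizer that combines binary encoding with index shuffling is to permute token ids and then write each permuted id as a fixed-length bit string. The sketch below does exactly that. It is only a reading of the two terms in the abstract: the helper `make_subtokenizer`, the seeded random permutation, and the vocabulary size of 50257 are assumptions, and the paper's actual construction may differ.

```python
import random

def make_subtokenizer(vocab_size, seed=0):
    """Build encode/decode functions mapping token ids to shuffled binary sub-tokens."""
    num_bits = max(1, (vocab_size - 1).bit_length())
    # Index shuffling: a fixed random permutation of the token ids (assumed form).
    perm = list(range(vocab_size))
    random.Random(seed).shuffle(perm)
    inverse = [0] * vocab_size
    for token_id, shuffled in enumerate(perm):
        inverse[shuffled] = token_id

    def encode(token_id):
        # Binary encoding: most significant bit first, one bit per sub-token.
        idx = perm[token_id]
        return [(idx >> b) & 1 for b in reversed(range(num_bits))]

    def decode(bits):
        idx = 0
        for bit in bits:
            idx = (idx << 1) | bit
        return inverse[idx]

    return encode, decode, num_bits

encode, decode, num_bits = make_subtokenizer(vocab_size=50257)
sub_tokens = encode(1234)
```

Each token becomes `num_bits` binary sub-tokens, so the diffusion process can mask individual bits rather than whole tokens, and the mapping stays invertible because the permutation is a bijection.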

304. ❌ Adaptive regularization parameter selection for high-dimensional inverse problems: A Bayesian approach with Tucker low-rank constraints

Authors: Qing-Mei Yang, Da-Qing Zhang Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16066v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
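Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0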

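Scoring rationale: The paper focuses on a variational Bayesian method for high-dimensional inverse problems that uses Tucker decomposition for regularization parameter selection. Although it belongs to scientific computing, its content is unrelated to deep learning and large-model techniques. Every keyword except "AI for Science" concerns specific techniques such as large models, deep learning, alignment, reasoning, and compression, whereas this paper studies traditional Bayesian statistics, tensor decomposition, and inverse problem solving, and does not use or touch any deep learning model, training technique, or large-model concept. "AI for Science" scores 5 because the paper is a scientific computing application, even though it solves scientific problems with traditional numerical methods rather than AI or deep learning.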

!!! tip deepseek-chat TL;DR

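The paper proposes a variational Bayesian method combined with Tucker decomposition for adaptive regularization parameter selection in high-dimensional inverse problems, and it outperforms traditional methods on tasks such as image deblurring and heat conduction.

Abstract translation

This paper proposes a new variational Bayesian method that integrates Tucker decomposition to solve high-dimensional inverse problems efficiently. The method uses Tucker decomposition to move variational inference from the high-dimensional space to a lower-dimensional core tensor space, which reduces computational complexity. Its key innovation is the introduction of per-mode precision parameters, which enable adaptive regularization for anisotropic structures. For example, in directional image deblurring the learned parameters align with the physical anisotropy and apply stronger regularization along critical directions (e.g., row vs. column axes). The method also estimates the noise level from the data and needs no prior knowledge of the noise parameters, unlike conventional benchmarks such as the discrepancy principle (DP). In experiments on 2D deblurring, 3D heat conduction, and Fredholm integral equations, the method shows consistent improvements over the L-curve criterion, generalized cross-validation (GCV), the unbiased predictive risk estimator (UPRE), and DP, in both quantitative metrics (PSNR, SSIM) and qualitative visualizations (error maps, precision parameter trends). It scales to problems with 110,000 variables and outperforms existing methods by 0.73-2.09 dB on deblurring tasks and by 6.75 dB on 3D heat conduction. Limitations include sensitivity to rank selection in the Tucker decomposition and the lack of theoretical analysis. Future work will explore automated rank selection and theoretical guarantees. The method connects Bayesian theory with scalable computation and offers a practical solution for large-scale inverse problems in imaging, remote sensing, and scientific computing.

Abstract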

This paper introduces a novel variational Bayesian method that integrates Tucker decomposition for efficient high-dimensional inverse problem solving. The method reduces computational complexity by transforming variational inference from a high-dimensional space to a lower-dimensional core tensor space via Tucker decomposition. A key innovation is the introduction of per-mode precision parameters, enabling adaptive regularization for anisotropic structures. For instance, in directional image deblurring, learned parameters align with physical anisotropy, applying stronger regularization to critical directions (e.g., row vs. column axes). The method further estimates noise levels from data, eliminating reliance on prior knowledge of noise parameters (unlike conventional benchmarks such as the discrepancy principle (DP)). Experimental evaluations across 2D deblurring, 3D heat conduction, and Fredholm integral equations demonstrate consistent improvements in quantitative metrics (PSNR, SSIM) and qualitative visualizations (error maps, precision parameter trends) compared to L-curve criterion, generalized cross-validation (GCV), unbiased predictive risk estimator (UPRE), and DP. The approach scales to problems with 110,000 variables and outperforms existing methods by 0.73-2.09 dB in deblurring tasks and 6.75 dB in 3D heat conduction. Limitations include sensitivity to rank selection in Tucker decomposition and the need for theoretical analysis. Future work will explore automated rank selection and theoretical guarantees. This method bridges Bayesian theory and scalable computation, offering practical solutions for large-scale inverse problems in imaging, remote sensing, and scientific computing.

Keywords: variational Bayesian method, Tucker decomposition, high-dimensional inverse problems, adaptive regularization, anisotropic structures, precision parameters, computational complexity reduction, Bayesian inference
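
The complexity reduction the abstract attributes to Tucker decomposition comes from replacing a full tensor with a small core tensor and one factor matrix per mode. The sketch below reconstructs a 3-mode tensor from such factors and counts parameters. The tensor shape (100, 100, 11) and ranks (10, 10, 5) are hypothetical values chosen only so that the full tensor has 110,000 entries; they are not taken from the paper.

```python
def tucker_reconstruct(core, u1, u2, u3):
    """Rebuild X[i][j][k] = sum_abc core[a][b][c] * u1[i][a] * u2[j][b] * u3[k][c]."""
    r1, r2, r3 = len(core), len(core[0]), len(core[0][0])
    return [[[sum(core[a][b][c] * u1[i][a] * u2[j][b] * u3[k][c]
                  for a in range(r1) for b in range(r2) for c in range(r3))
              for k in range(len(u3))]
             for j in range(len(u2))]
            for i in range(len(u1))]

def tucker_param_count(dims, ranks):
    """Number of stored values: core entries plus one factor matrix per mode."""
    core_size = 1
    for r in ranks:
        core_size *= r
    return core_size + sum(n * r for n, r in zip(dims, ranks))

# Tiny rank-(1, 1, 1) example so the reconstruction is easy to check by hand.
core = [[[2.0]]]
u1 = [[1.0], [2.0]]
u2 = [[1.0], [3.0]]
u3 = [[1.0]]
tensor = tucker_reconstruct(core, u1, u2, u3)

# Hypothetical sizes: a full tensor of 110,000 entries versus its Tucker factors.
full_size = 100 * 100 * 11
compressed_size = tucker_param_count((100, 100, 11), (10, 10, 5))
```

Inference over the 500-entry core rather than the full tensor is where the savings arise, and the abstract notes that results are sensitive to how the ranks are chosen.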

305. ❌ Attribution Upsampling should Redistribute, Not Interpolate

Authors: Vincenzo Buono, Peyman Sheikholharam Mashhadi, Mahmoud Rahat, Prayag Tiwari, Stefan Byttner Journal/Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16067v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

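Scoring rationale: The paper focuses on the upsampling of attribution maps in explainable AI (XAI) and proposes a new semantic-aware upsampling method (USU) to address the artifacts that traditional interpolation introduces into attribution maps. Its core is interpretability techniques for computer vision and machine learning models, which falls under "Mechanistic Interpretability OR Explainable AI", so that keyword scores 10. The paper does not cover large language models (LLMs), model training or fine-tuning techniques, inference optimization, agent systems, model compression, AI-for-science applications, or the other keywords, so all remaining keywords score 0.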

!!! tip deepseek-chat TL;DR

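The paper addresses the problem that attribution-map upsampling in explainable AI relies on traditional image interpolation, which distorts the attribution signal and produces spurious high-importance regions. It proposes Universal Semantic-Aware Upsampling (USU), a method based on mass redistribution that preserves attribution mass and yields semantically coherent explanations.

Abstract translation

Attribution methods in explainable AI rely on upsampling techniques designed for natural images rather than saliency maps. Standard bilinear and bicubic interpolation systematically corrupts attribution signals through aliasing, ringing, and boundary bleeding, producing spurious high-importance regions that misrepresent the model's reasoning. We find that the core problem is treating attribution upsampling as an interpolation problem independent of the model's reasoning, rather than as a mass redistribution problem in which model-derived semantic boundaries must govern how importance flows. We propose Universal Semantic-Aware Upsampling (USU), a principled method that reformulates upsampling through ratio-form mass redistribution operators and provably preserves attribution mass and relative importance ordering. Extending the axiomatic tradition of feature attribution to upsampling, we formalize four desiderata for faithful upsampling and prove that interpolation structurally violates three of them. Those three desiderata force any redistribution operator into a ratio form; the fourth selects the unique potential within this family, which yields USU. Controlled experiments on models with known attribution priors verify USU's formal guarantees; evaluation on ImageNet, CIFAR-10, and CUB-200 confirms consistent faithfulness improvements and qualitatively better, semantically coherent explanations.

Abstract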

Attribution methods in explainable AI rely on upsampling techniques that were designed for natural images, not saliency maps. Standard bilinear and bicubic interpolation systematically corrupts attribution signals through aliasing, ringing, and boundary bleeding, producing spurious high-importance regions that misrepresent model reasoning. We identify that the core issue is treating attribution upsampling as an interpolation problem that operates in isolation from the model’s reasoning, rather than a mass redistribution problem where model-derived semantic boundaries must govern how importance flows. We present Universal Semantic-Aware Upsampling (USU), a principled method that reformulates upsampling through ratio-form mass redistribution operators, provably preserving attribution mass and relative importance ordering. Extending the axiomatic tradition of feature attribution to upsampling, we formalize four desiderata for faithful upsampling and prove that interpolation structurally violates three of them. These same three force any redistribution operator into a ratio form; the fourth selects the unique potential within this family, yielding USU. Controlled experiments on models with known attribution priors verify USU’s formal guarantees; evaluation across ImageNet, CIFAR-10, and CUB-200 confirms consistent faithfulness improvements and qualitatively superior, semantically coherent explanations.

Keywords: Attribution Methods, Explainable AI, Upsampling, Semantic-Aware, Mass Redistribution, Interpretability, Saliency Maps, Faithful Explanations
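
The difference between interpolating and redistributing can be shown with a small ratio-form operator: each coarse cell hands its attribution mass to the fine pixels it covers, in proportion to a per-pixel potential. This is a generic sketch of the ratio-form idea, not USU itself. The function name, the strictly positive `potential` map, and the toy inputs are assumptions, and the paper's specific potential is not reproduced here.

```python
def redistribute_upsample(coarse, potential, factor):
    """Upsample by splitting each coarse cell's mass across its fine pixels.

    Each fine pixel gets coarse_mass * potential / (sum of potentials in the cell),
    so the total mass of every cell is preserved exactly. The potential must be
    strictly positive.
    """
    height, width = len(coarse), len(coarse[0])
    fine = [[0.0] * (width * factor) for _ in range(height * factor)]
    for i in range(height):
        for j in range(width):
            cells = [(i * factor + di, j * factor + dj)
                     for di in range(factor) for dj in range(factor)]
            total = sum(potential[r][c] for r, c in cells)
            for r, c in cells:
                fine[r][c] = coarse[i][j] * potential[r][c] / total
    return fine

# Toy input: one important coarse cell next to an unimportant one.
coarse = [[4.0, 0.0]]
potential = [[1.0, 3.0, 1.0, 1.0],
             [1.0, 3.0, 1.0, 1.0]]
fine = redistribute_upsample(coarse, potential, factor=2)
```

Unlike bilinear interpolation, this operator never moves mass across a coarse cell boundary, so the zero-attribution cell stays exactly zero and the total attribution is unchanged.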

306. ❌ Safe Distributionally Robust Feature Selection under Covariate Shift

Authors: Hiroyuki Hanada, Satoshi Akahane, Noriaki Hashimoto, Shion Takeno, Ichiro Takeuchi Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16062v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

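Scoring rationale: The paper studies distributionally robust feature selection (DRFS), focusing on sparse sensing applications and safe screening under covariate shift. It belongs to classical machine learning (feature selection, robust learning, sparse modeling) and is unrelated to any of the scoring keywords, all of which concern large models, deep learning techniques, and their applications. It does not touch on large models, deep learning, language models, alignment, reasoning, agents, compression, or related techniques and applications.

!!! tip deepseek-chat TL;DR

The paper proposes safe-DRFS, a new method for distributionally robust feature selection under covariate shift. It identifies a feature set that contains every feature subset that may become optimal within a specified range of input distribution shifts, with finite-sample theoretical guarantees.

Abstract (Chinese translation)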

在实际机器学习应用中，模型开发阶段与部署阶段所面临的环境往往存在差异，尤其当模型被众多用户应用于多样化场景时。学习在合理部署环境中均能保持可靠性能的模型，被称为分布鲁棒性学习。本研究聚焦于分布鲁棒性特征选择问题，特别关注由工业需求驱动的稀疏传感应用场景。在实际多传感器系统中，通常会在部署前基于大量可用传感器的性能评估来选定一个共享的传感器子集。在部署阶段，个体用户可能根据其特定环境对模型进行进一步调整或微调。当部署环境与开发阶段的预期环境存在差异时，这种策略可能导致系统缺乏实现最优性能所需的传感器。为解决这一问题，我们提出safe-DRFS方法，这是一种将传统稀疏建模环境中的安全筛选技术扩展至协变量偏移下分布鲁棒性设置的新颖方法。该方法能够识别出一个特征子集，该子集涵盖了在指定输入分布偏移范围内可能成为最优的所有子集，并提供了有限样本理论保证，确保不会错误地排除必要特征。

Abstract

In practical machine learning, the environments encountered during the model development and deployment phases often differ, especially when a model is used by many users in diverse settings. Learning models that maintain reliable performance across plausible deployment environments is known as distributionally robust (DR) learning. In this work, we study the problem of distributionally robust feature selection (DRFS), with a particular focus on sparse sensing applications motivated by industrial needs. In practical multi-sensor systems, a shared subset of sensors is typically selected prior to deployment based on performance evaluations using many available sensors. At deployment, individual users may further adapt or fine-tune models to their specific environments. When deployment environments differ from those anticipated during development, this strategy can result in systems lacking sensors required for optimal performance. To address this issue, we propose safe-DRFS, a novel approach that extends safe screening from conventional sparse modeling settings to a DR setting under covariate shift. Our method identifies a feature subset that encompasses all subsets that may become optimal across a specified range of input distribution shifts, with finite-sample theoretical guarantees of no false feature elimination.

Keywords: distributionally robust learning, feature selection, covariate shift, sparse sensing, safe screening, finite-sample guarantees, multi-sensor systems

307. ❌ Shuffling the Stochastic Mirror Descent via Dual Lipschitz Continuity and Kernel Conditioning

Authors: Junwen Qiu, Leilei Mei, Junyu Zhang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16042v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

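Scoring rationale: The paper is about optimization theory, specifically the convergence analysis of stochastic mirror descent under relative smoothness. It is entirely focused on mathematical optimization, involving Lipschitz continuity, relative smoothness, Bregman divergence, and kernel conditioning. All scoring keywords concern large models, deep learning, AI applications, or related techniques, whereas this is a purely theoretical optimization paper with no direct connection to them. It does not discuss any large model, deep learning method, AI application, or related technical principle.

!!! tip deepseek-chat TL;DR

The paper addresses the convergence analysis of random reshuffling mirror descent for relatively smooth problems that lack Lipschitz smoothness. By introducing the dual kernel conditioning (DKC) regularity condition, it establishes the first complexity bounds and iterate convergence for this algorithm.

Abstract (Chinese translation)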

全局Lipschitz光滑性条件通过两个关键推论——下降引理与梯度Lipschitz连续性——构成了大多数收敛性与复杂度分析的基础。如何在缺乏Lipschitz光滑性的情况下研究优化算法的性能，仍是一个活跃的研究领域。Bauschke-Bolte-Teboulle（2017）与Lu-Freund-Nesterov（2018）提出的相对光滑性框架提供了扩展的下降引理，确保了基于Bregman散度的近端梯度法及其基础随机变体的收敛性。然而，许多广泛使用的技术（如动量机制、随机重排和方差缩减）额外需要梯度偏差的Lipschitz型界，这使得它们在相对光滑性框架下的分析仍是一个开放领域。为解决此问题，我们引入对偶核条件（dual kernel conditioning，DKC）正则性条件来调控核函数的局部相对曲率。结合相对光滑性，DKC为梯度提供了对偶Lipschitz连续性：尽管梯度映射在原始空间中不具备Lipschitz连续性，但在由镜像映射诱导的对偶空间中它保持了Lipschitz连续性。我们验证了DKC被广泛使用的核函数普遍满足，且在仿射复合与锥组合下封闭。借助这些新工具，我们首次建立了约束非凸相对光滑问题中随机重排镜像下降法的复杂度界及迭代收敛性。

Abstract

The global Lipschitz smoothness condition underlies most convergence and complexity analyses via two key consequences: the descent lemma and the gradient Lipschitz continuity. How to study the performance of optimization algorithms in the absence of Lipschitz smoothness remains an active area. The relative smoothness framework from Bauschke-Bolte-Teboulle (2017) and Lu-Freund-Nesterov (2018) provides an extended descent lemma, ensuring convergence of Bregman-based proximal gradient methods and their vanilla stochastic counterparts. However, many widely used techniques (e.g., momentum schemes, random reshuffling, and variance reduction) additionally require the Lipschitz-type bound for gradient deviations, leaving their analysis under relative smoothness an open area. To resolve this issue, we introduce the dual kernel conditioning (DKC) regularity condition to regulate the local relative curvature of the kernel functions. Combined with the relative smoothness, DKC provides a dual Lipschitz continuity for gradients: even though the gradient mapping is not Lipschitz in the primal space, it preserves Lipschitz continuity in the dual space induced by a mirror map. We verify that DKC is widely satisfied by popular kernels and is closed under affine composition and conic combination. With these novel tools, we establish the first complexity bounds as well as the iterate convergence of random reshuffling mirror descent for constrained nonconvex relative smooth problems.

Keywords: Stochastic Mirror Descent, Random Reshuffling, Relative Smoothness, Dual Kernel Conditioning, Convergence Analysis, Nonconvex Optimization, Bregman Divergence, Complexity Bounds
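
To make the algorithm under analysis concrete, here is a minimal sketch of random reshuffling mirror descent on the probability simplex with the negative-entropy kernel, which reduces to exponentiated-gradient updates. The least-squares objective, the step size, and all names (`rr_mirror_descent`, `loss`, `x_star`) are illustrative assumptions. The paper's setting (constrained nonconvex relatively smooth problems) is more general, and the DKC condition is not checked here.

```python
import numpy as np

def rr_mirror_descent(A, b, epochs=200, step=0.05, seed=0):
    """Random-reshuffling mirror descent on the simplex for
    f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2, using the entropy kernel."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.full(d, 1.0 / d)                  # start at the simplex centre
    for _ in range(epochs):
        for i in rng.permutation(n):         # one pass over the data without replacement
            grad = (A[i] @ x - b[i]) * A[i]  # gradient of the i-th component
            x = x * np.exp(-step * grad)     # gradient step taken in the dual (mirror) space
            x /= x.sum()                     # Bregman projection back onto the simplex
    return x

def loss(A, b, x):
    return 0.5 * np.mean((A @ x - b) ** 2)

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3))
x_star = np.array([0.7, 0.2, 0.1])           # hypothetical ground-truth point on the simplex
b = A @ x_star
x_hat = rr_mirror_descent(A, b)
```

The only difference from plain stochastic mirror descent is the sampling scheme: each epoch visits every component exactly once in a fresh random order instead of drawing indices with replacement.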

308. ❌ Power Analysis for Prediction-Powered Inference

Authors: Yiqun T. Chen, Moran Guo, Shengy Li Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16041v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

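Scoring rationale: The paper studies statistical power analysis for prediction-powered inference (PPI), asking how predictions from AI/ML models can reduce the number of labeled samples required. This is an application of AI to science, particularly biomedicine, so it is somewhat related only to "AI for Science OR Bioinformatics OR Cheminformatics" (scored 5), since the framework is demonstrated on biomedical applications (single-cell transcriptomics, clinical blood pressure measurement, and dermoscopy imaging). The other keywords concern specific topics such as large-model principles, training methods, inference optimization, and agent systems, none of which the paper addresses directly, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies how to compute the number of labeled samples needed to reach a desired statistical power in prediction-powered inference (PPI), given an AI/ML model with high predictive power. It derives closed-form power formulas and finds that the reduction in required samples scales roughly with the R² between predictions and ground truth.

Abstract (Chinese translation)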

现代研究越来越多地利用机器学习与人工智能（AI/ML）模型的预测结果，而近期工作如预测驱动推断（prediction-powered inference, PPI）已开发出有效的下游统计推断流程。然而，经典的功效与样本量公式并未充分考虑这些预测的影响。本文致力于解决一个简单而实际的问题：给定一个具有高预测能力的新AI/ML模型，需要多少标注样本才能达到期望的统计功效？我们通过刻画PPI估计量的渐近方差特性，并应用Wald检验反演来推导闭式功效公式，从而得到所需的标注样本量。我们的研究涵盖了广泛使用的场景，包括两样本比较与2×2表格中的风险度量。研究发现，一个实用的经验法则是：相较于传统设计，所需标注样本量的减少幅度大致与预测值和真实值之间的R²成正比。我们通过蒙特卡洛模拟验证了所推导的解析公式，并在三个当代生物医学应用场景中展示了该框架的实用性，涵盖单细胞转录组学、临床血压测量和皮肤镜影像分析。我们以R包和在线计算器的形式提供相关软件，发布于https://github.com/yiqunchen/pppower。

Abstract

Modern studies increasingly leverage outcomes predicted by machine learning and artificial intelligence (AI/ML) models, and recent work, such as prediction-powered inference (PPI), has developed valid downstream statistical inference procedures. However, classical power and sample size formulas do not readily account for these predictions. In this work, we tackle a simple yet practical question: given a new AI/ML model with high predictive power, how many labeled samples are needed to achieve a desired level of statistical power? We derive closed-form power formulas by characterizing the asymptotic variance of the PPI estimator and applying Wald test inversion to obtain the required labeled sample size. Our results cover widely used settings including two-sample comparisons and risk measures in 2x2 tables. We find that a useful rule of thumb is that the reduction in required labeled samples relative to classical designs scales roughly with the R² between the predictions and the ground truth. Our analytical formulas are validated using Monte Carlo simulations, and we illustrate the framework in three contemporary biomedical applications spanning single-cell transcriptomics, clinical blood pressure measurement, and dermoscopy imaging. We provide our software as an R package and online calculators at https://github.com/yiqunchen/pppower.

Keywords: prediction-powered inference, statistical power, sample size, machine learning, AI/ML models, biomedical applications, R², Wald test
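
The R² rule of thumb can be illustrated with a short calculation for the simplest case, a one-sample mean. This is a sketch under stated assumptions, not the formulas or API of the `pppower` package: it uses the plain PPI mean estimator, whose variance is Var(f)/N + Var(Y − f)/n for N unlabeled and n labeled samples, and inverts a two-sided Wald test. It also assumes the residual Y − f is uncorrelated with the prediction f, so that Var(Y − f) = (1 − R²)·Var(Y). All function names are hypothetical.

```python
from statistics import NormalDist

def z_sum(alpha, power):
    """z_{1-alpha/2} + z_{power} for a two-sided Wald test."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)

def n_classical(sigma2_y, delta, alpha=0.05, power=0.8):
    """Labeled samples needed to detect a mean shift delta without predictions."""
    return z_sum(alpha, power) ** 2 * sigma2_y / delta ** 2

def n_ppi(sigma2_resid, sigma2_pred, n_unlabeled, delta, alpha=0.05, power=0.8):
    """Labeled samples for the plain PPI mean estimator, whose variance is
    sigma2_pred / N + sigma2_resid / n (N unlabeled, n labeled)."""
    target_var = (delta / z_sum(alpha, power)) ** 2
    slack = target_var - sigma2_pred / n_unlabeled
    if slack <= 0:
        raise ValueError("unlabeled pool too small for this effect size")
    return sigma2_resid / slack

# Toy inputs: Var(Y) = 1, R^2 = 0.8, a very large unlabeled pool, effect size 0.2.
r2 = 0.8
classical = n_classical(sigma2_y=1.0, delta=0.2)
ppi = n_ppi(sigma2_resid=1.0 - r2, sigma2_pred=r2, n_unlabeled=10**9, delta=0.2)
ratio = ppi / classical
```

With an effectively unlimited unlabeled pool the ratio approaches 1 − R², so the fraction of labeled samples saved is about R², matching the rule of thumb stated in the abstract for this simple case.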

309. ❌ Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition

Authors: Xiaozhou Ye, Feng Jiang, Zihan Wang, Xiulai Wang, Yutao Zhang, Kevin I-Kai Wang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16043v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

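Scoring rationale: The paper focuses on sensor-based activity recognition using a Transformer and reinforcement learning, but it does not involve core large-model techniques such as large language models (LLMs), MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, long context, inference acceleration, chain of thought, System 2 thinking, MCTS, self-correction, agents, tool use, multi-agent systems, quantization, speculative decoding, hallucination mitigation, interpretability, world models, model merging, or in-context learning. The only related keyword is "AI for Science", because the paper applies AI to healthcare monitoring (sensor-based activity recognition), which is an application of AI in a scientific domain. It is not core bioinformatics or cheminformatics, however, so it receives 5 (somewhat related).

!!! tip deepseek-chat TL;DR

To address the generalization problem in cross-user sensor-based activity recognition caused by physiological differences and sensor placement, the paper proposes CTFG, a collaborative temporal feature generation framework based on critic-free reinforcement learning. It achieves state-of-the-art cross-user accuracy on the DSADS and PAMAP2 benchmarks (88.53% and 75.22%).

Abstract (Chinese translation)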

基于可穿戴惯性传感器的人体活动识别是健康监测、运动分析和情境感知计算的基础，但其部署受到因生理特征差异、运动习惯不同和传感器佩戴位置多变所导致的跨用户变异性的阻碍。现有领域泛化方法要么忽视了传感器数据流中的时间依赖性，要么依赖于不切实际的目标域标注。我们提出一种新范式：将可泛化的特征提取建模为由强化学习驱动的协作式序列生成过程。我们的框架——CTFG（协作时序特征生成）——采用基于Transformer的自回归生成器，该生成器以前序上下文和编码后的传感器输入为条件，逐步构建特征令牌序列。生成器通过组相对策略优化进行优化，这是一种无需评论家的算法，它通过将每个生成的序列与从同一输入采样的替代序列组进行比较来评估其优劣，通过组内归一化而非学习价值估计来推导优势。此设计消除了基于评论家方法中固有的依赖分布的偏差，并提供自校准的优化信号，该信号在不同用户分布下保持稳定。一个包含类别区分度、跨用户不变性和时序保真度的三重目标奖励函数，共同塑造特征空间以分离活动类别、对齐用户分布并保留细粒度时序内容。在DSADS和PAMAP2基准测试上的评估表明，该方法实现了最先进的跨用户识别准确率（分别为88.53%和75.22%），显著降低了任务间训练方差，加速了收敛速度，并在不同动作空间维度下展现出鲁棒的泛化能力。

Abstract

Human Activity Recognition using wearable inertial sensors is foundational to healthcare monitoring, fitness analytics, and context-aware computing, yet its deployment is hindered by cross-user variability arising from heterogeneous physiological traits, motor habits, and sensor placements. Existing domain generalization approaches either neglect temporal dependencies in sensor streams or depend on impractical target-domain annotations. We propose a different paradigm: modeling generalizable feature extraction as a collaborative sequential generation process governed by reinforcement learning. Our framework, CTFG (Collaborative Temporal Feature Generation), employs a Transformer-based autoregressive generator that incrementally constructs feature token sequences, each conditioned on prior context and the encoded sensor input. The generator is optimized via Group-Relative Policy Optimization, a critic-free algorithm that evaluates each generated sequence against a cohort of alternatives sampled from the same input, deriving advantages through intra-group normalization rather than learned value estimation. This design eliminates the distribution-dependent bias inherent in critic-based methods and provides self-calibrating optimization signals that remain stable across heterogeneous user distributions. A tri-objective reward comprising class discrimination, cross-user invariance, and temporal fidelity jointly shapes the feature space to separate activities, align user distributions, and preserve fine-grained temporal content. Evaluations on the DSADS and PAMAP2 benchmarks demonstrate state-of-the-art cross-user accuracy (88.53% and 75.22%), substantial reduction in inter-task training variance, accelerated convergence, and robust generalization under varying action-space dimensionalities.

Keywords: Human Activity Recognition, Wearable Inertial Sensors, Cross-user Generalization, Reinforcement Learning, Transformer, Temporal Feature Generation, Critic-free Optimization, Domain Invariance
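
The critic-free step described in the abstract, deriving advantages by intra-group normalization instead of a learned value estimate, can be sketched in a few lines. This shows only the commonly used group-relative normalization. The reward values are made up, and the paper's tri-objective reward and policy update are not reproduced.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: array of shape (num_inputs, group_size); each row holds the
    rewards of the sequences sampled for one input."""
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)   # per-group baseline
    std = rewards.std(axis=1, keepdims=True)     # per-group scale
    return (rewards - mean) / (std + eps)

# Two inputs with very different reward scales (toy numbers).
rewards = [[0.2, 0.9, 0.5, 0.4],
           [10.0, 12.0, 11.0, 13.0]]
adv = group_relative_advantages(rewards)
```

Because each group is centred and scaled by its own statistics, both rows yield advantages on a comparable scale even though their raw rewards differ by an order of magnitude. This is the self-calibrating behaviour the abstract attributes to the critic-free design.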

310. ❌ W2T: LoRA Weights Already Know What They Can Do

Authors: Xiaolong Han, Ferrante Neri, Zijian Jiang, Fang Wu, Yanfang Ye, Lu Yin, Zehong Wang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15990v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

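Scoring rationale: The paper's core topic is the analysis of LoRA weights, so it is highly relevant to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (15 points): it specifically studies how to analyze, canonicalize, and embed LoRA weights. It is relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points) because LoRA is mainly used to adapt large language models. Other keywords such as MoE, SLMs, Scaling Laws, RAG, and RLHF are unrelated to the paper's content (0 points); the paper does not involve these techniques.

!!! tip deepseek-chat TL;DR

The paper proposes W2T, which maps LoRA weights to a canonical form via QR decomposition and SVD to resolve the ambiguity of weight factorization, then uses a Transformer to produce weight-space embeddings. This makes it possible to predict model behavior and performance directly from LoRA weights, without running the base model or accessing training data.

Abstract (Chinese translation)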

每个LoRA检查点都以低秩权重矩阵的形式紧凑存储任务特定的更新，为将大语言模型适配至新任务和领域提供了高效途径。理论上，这些权重已编码了适配器的功能及其性能表现。本文探讨是否能够直接从权重中读取这些信息，而无需运行基础模型或访问训练数据。一个关键障碍在于，单个LoRA更新存在无限多种分解方式。若不解决这种歧义性，基于分解因子训练的模型可能仅拟合特定分解形式，而非底层更新本身。为此，我们提出W2T方法，通过QR分解与奇异值分解将每个LoRA更新映射至可证明的规范形式，使得所有等价分解共享同一表示。随后将所得分量进行标记化处理，并由Transformer生成权重空间嵌入。在语言与视觉LoRA集合上的实验表明，一旦消除分解歧义，W2T在属性分类、性能预测和适配器检索任务中均取得显著效果，证明LoRA权重能够可靠指示模型行为。代码发布于https://github.com/xiaolonghan2000/Weight2Token。

Abstract

Each LoRA checkpoint compactly stores task-specific updates in low-rank weight matrices, offering an efficient way to adapt large language models to new tasks and domains. In principle, these weights already encode what the adapter does and how well it performs. In this paper, we ask whether this information can be read directly from the weights, without running the base model or accessing training data. A key obstacle is that a single LoRA update can be factorized in infinitely many ways. Without resolving this ambiguity, models trained on the factors may fit the particular factorization rather than the underlying update. To this end, we propose W2T, which maps each LoRA update to a provably canonical form via QR decomposition followed by SVD, so that all equivalent factorizations share the same representation. The resulting components are then tokenized and processed by a Transformer to produce a weight-space embedding. Across language and vision LoRA collections, W2T achieves strong results on attribute classification, performance prediction, and adapter retrieval, demonstrating that LoRA weights reliably indicate model behavior once factorization ambiguity is removed. Code is available at https://github.com/xiaolonghan2000/Weight2Token.

Keywords: LoRA, weight analysis, canonical form, QR decomposition, SVD, weight-space embedding, adapter retrieval, performance prediction
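
The factorization ambiguity is easy to reproduce: for any invertible G, the pairs (B, A) and (BG, G⁻¹A) give the same update BA. The sketch below is one plausible reading of "QR decomposition followed by SVD", not the released Weight2Token code: it takes QR factors of B and Aᵀ, runs an SVD on the small r×r core, and fixes singular-vector signs with a convention chosen for this example. The function name `canonical_lora` and the sign rule are assumptions, and the tokenization and Transformer stages are omitted.

```python
import numpy as np

def canonical_lora(B, A):
    """Return (U, s, Vt) with B @ A == U @ diag(s) @ Vt, computed from the
    small r x r core so the cost scales with the rank r."""
    Qb, Rb = np.linalg.qr(B)                 # B = Qb Rb,   Qb: (d, r)
    Qa, Ra = np.linalg.qr(A.T)               # A.T = Qa Ra, Qa: (k, r)
    Uc, s, Vct = np.linalg.svd(Rb @ Ra.T)    # SVD of the r x r core
    U, Vt = Qb @ Uc, Vct @ Qa.T
    # Resolve the remaining sign ambiguity: make the largest-magnitude entry
    # of each left singular vector positive (a convention assumed here).
    idx = np.abs(U).argmax(axis=0)
    signs = np.sign(U[idx, np.arange(U.shape[1])])
    return U * signs, s, Vt * signs[:, None]

rng = np.random.default_rng(0)
d, k, r = 16, 12, 4
B, A = rng.normal(size=(d, r)), rng.normal(size=(r, k))
G = 2.0 * np.eye(r) + 0.5 * np.ones((r, r))  # an arbitrary invertible r x r matrix
U1, s1, Vt1 = canonical_lora(B, A)
U2, s2, Vt2 = canonical_lora(B @ G, np.linalg.inv(G) @ A)
```

Both factorizations map to the same `(U, s, Vt)` up to floating-point error, provided the singular values are distinct. Repeated singular values would leave a rotational ambiguity that this sketch does not handle.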

311. ❌ Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech

Authors: Jaesung Bae, Xiuwen Zheng, Minje Kim, Chang D. Yoo, Mark Hasegawa-Johnson Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15988v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

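Scoring rationale: The paper focuses on dysarthric speech quality assessment (DSQA) with deep learning, an application of AI to biomedicine and health, and is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points). The method involves pretraining (using unlabeled data and contrastive learning) and fine-tuning, so it is somewhat related to "Pre-training OR Continual Pre-training OR Domain Adaptation" and "Post-training OR Supervised Fine-tuning OR SFT" (5 points each). The paper does not directly involve specific innovations in large-model (LLM) principles, reasoning, alignment, compression, or agents, so the remaining keywords score 0.

!!! tip deepseek-chat TL;DR

The study proposes a three-stage framework that leverages unlabeled dysarthric speech and large-scale typical speech datasets. Through pseudo-label generation, weakly supervised pretraining, and fine-tuning, it significantly improves the robustness and performance of dysarthric speech quality assessment.

Abstract (Chinese translation)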

构音障碍语音质量评估（DSQA）对临床诊断和包容性语音技术至关重要。然而，主观评估成本高昂且难以规模化，而标注数据的稀缺限制了鲁棒的客观建模。为此，我们提出一个三阶段框架，利用未标注的构音障碍语音和大规模典型语音数据集来扩展训练。首先，教师模型为未标注样本生成伪标签；随后采用标签感知对比学习策略进行弱监督预训练，使模型接触多样化的说话者和声学条件。预训练模型随后针对下游DSQA任务进行微调。在涵盖多种病因和语言的五个未见数据集上的实验证明了我们方法的鲁棒性。我们基于Whisper的基线模型显著优于SpICE等当前最优（SOTA）DSQA预测器，完整框架在未见测试数据集上平均SRCC达到0.761。

Abstract

Dysarthric speech quality assessment (DSQA) is critical for clinical diagnostics and inclusive speech technologies. However, subjective evaluation is costly and difficult to scale, and the scarcity of labeled data limits robust objective modeling. To address this, we propose a three-stage framework that leverages unlabeled dysarthric speech and large-scale typical speech datasets to scale training. A teacher model first generates pseudo-labels for unlabeled samples, followed by weakly supervised pretraining using a label-aware contrastive learning strategy that exposes the model to diverse speakers and acoustic conditions. The pretrained model is then fine-tuned for the downstream DSQA task. Experiments on five unseen datasets spanning multiple etiologies and languages demonstrate the robustness of our approach. Our Whisper-based baseline significantly outperforms SOTA DSQA predictors such as SpICE, and the full framework achieves an average SRCC of 0.761 across unseen test datasets.

Keywords: dysarthric speech, speech quality assessment, data augmentation, contrastive learning, pretraining, fine-tuning, robustness, Whisper

312. ❌ The Importance of Being Smoothly Calibrated

Authors: Parikshit Gopalan, Konstantinos Stavropoulos, Kunal Talwar, Pranay Tankala Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.16015v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

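Scoring rationale: The paper studies the calibration theory of predictive models (smooth calibration), which belongs to machine learning theory and is unrelated to all of the keywords, which concern large models, deep learning techniques, and applications. It does not mention any large model, deep learning method, AI application, or related technical principle.

!!! tip deepseek-chat TL;DR

The paper studies the theoretical properties of smooth calibration as a robust calibration measure, presents a new omniprediction guarantee, and gives a crisp characterization of smooth calibration in terms of the earth mover's distance to the closest perfectly calibrated distribution.

Abstract (Chinese translation)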

近期研究凸显了平滑校准误差[Kakade and Foster, 2008]作为校准误差鲁棒度量方法的核心地位。本文对平滑校准的相关成果进行了概括、统一与拓展，既将其视为一种鲁棒的校准度量，也将其作为实现全预测（omniprediction）的关键步骤——全预测可使下游决策者在优化某种预测者未知的严格损失函数时，获得低遗憾的预测结果。

我们针对所有有界严格损失函数类，提出了平滑校准预测器的新全预测保证。通过对预测器添加噪声实现平滑化，并与空间中任意基准预测器的平滑版本进行竞争——即对基准预测器添加噪声后进行任意后处理。全预测误差受限于预测器的平滑校准误差及其与基准预测器之间的推土机距离（earth mover’s distance）。我们通过实例证明这种依赖关系通常无法进一步改进。该结论统一并拓展了先前基于平滑校准的全预测研究结果[Foster and Vohra, 1998; Hartline, Wu, and Yang, 2025]。

我们提出了一种基于推土机距离的平滑校准新特征刻画：即通过度量预测与标签的联合分布到最近完美校准分布的推土机距离来定义。这同时为[Blasiok, Gopalan, Hu, and Nakkiran, 2023]中提出的校准下距离关系提供了更简洁的证明。

基于此，我们证明了校准上距离的估计问题无法在样本复杂度独立于预测值支撑集大小的条件下达到二次因子精度。这与校准距离的估计问题形成鲜明对比：后者已被证明是信息论意义上不可实现的——任何有限样本量均无法保证估计精度[Blasiok, Gopalan, Hu, and Nakkiran, 2023]。

Abstract

Recent work has highlighted the centrality of smooth calibration [Kakade and Foster, 2008] as a robust measure of calibration error. We generalize, unify, and extend previous results on smooth calibration, both as a robust calibration measure, and as a step towards omniprediction, which enables predictions with low regret for downstream decision makers seeking to optimize some proper loss unknown to the predictor. We present a new omniprediction guarantee for smoothly calibrated predictors, for the class of all bounded proper losses. We smooth the predictor by adding some noise to it, and compete against smoothed versions of any benchmark predictor on the space, where we add some noise to the predictor and then post-process it arbitrarily. The omniprediction error is bounded by the smooth calibration error of the predictor and the earth mover’s distance from the benchmark. We exhibit instances showing that this dependence cannot, in general, be improved. We show how this unifies and extends prior results [Foster and Vohra, 1998; Hartline, Wu, and Yang, 2025] on omniprediction from smooth calibration. We present a crisp new characterization of smooth calibration in terms of the earth mover’s distance to the closest perfectly calibrated joint distribution of predictions and labels. This also yields a simpler proof of the relation to the lower distance to calibration from [Blasiok, Gopalan, Hu, and Nakkiran, 2023]. We use this to show that the upper distance to calibration cannot be estimated within a quadratic factor with sample complexity independent of the support size of the predictions. This is in contrast to the distance to calibration, where the corresponding problem was known to be information-theoretically impossible: no finite number of samples suffice [Blasiok, Gopalan, Hu, and Nakkiran, 2023].

Keywords: smooth calibration, calibration error, omniprediction, proper loss, earth mover’s distance, robust calibration measure, prediction theory, statistical learning theory

313. ❌ Determinism in the Undetermined: Deterministic Output in Charge-Conserving Continuous-Time Neuromorphic Systems with Temporal Stochasticity

Authors: Jing Yan, Kang You, Zhezhi He, Yaoyu Zhang Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15987v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

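Scoring rationale: The paper studies deterministic computation in continuous-time neuromorphic systems, focusing on a charge-conservation framework for spiking neural networks (SNNs) and its correspondence with quantized artificial neural networks. All scoring keywords concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, applications, and so on), whereas this paper does not involve language models, deep learning models, or related techniques at all. It belongs to neuromorphic computing and hardware implementation and is entirely unrelated to the scoring keywords.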

!!! tip deepseek-chat TL;DR

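The paper addresses the difficulty of obtaining deterministic computation in asynchronous neuromorphic systems under temporal stochasticity. By establishing a charge-conservation framework, it proves that the terminal state of a spiking neural network depends only on the total input charge, and it establishes an exact correspondence with quantized artificial neural networks.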

Abstract (Chinese translation)

在异步神经形态系统中实现确定性计算结果，由于连续时间硬件固有的时间随机性，仍然是一个根本性挑战。为解决此问题，我们开发了一个统一的连续时间脉冲神经网络框架，该框架将电荷守恒定律与最小神经元级约束相结合。这种整合确保了终端状态仅取决于总输入电荷，从而提供了一个对时间随机性保持不变的独特累积输出。我们证明，在无环网络中，这种映射严格不受脉冲时序影响，而循环连接则可能引入时间敏感性。此外，我们建立了这些电荷守恒脉冲神经网络与量化人工神经网络之间的精确表示对应关系，从而在静态深度学习与事件驱动动力学之间搭建了无需近似误差的桥梁。这些结果为设计连续时间神经形态系统奠定了严格的理论基础，使其能够在保持算法确定性的同时，利用异步处理的效率。

Abstract

Achieving deterministic computation results in asynchronous neuromorphic systems remains a fundamental challenge due to the inherent temporal stochasticity of continuous-time hardware. To address this, we develop a unified continuous-time framework for spiking neural networks (SNNs) that couples the Law of Charge Conservation with minimal neuron-level constraints. This integration ensures that the terminal state depends solely on the aggregate input charge, providing a unique cumulated output invariant to temporal stochasticity. We prove that this mapping is strictly invariant to spike timing in acyclic networks, whereas recurrent connectivity can introduce temporal sensitivity. Furthermore, we establish an exact representational correspondence between these charge-conserving SNNs and quantized artificial neural networks, bridging the gap between static deep learning and event-driven dynamics without approximation errors. These results establish a rigorous theoretical basis for designing continuous-time neuromorphic systems that harness the efficiency of asynchronous processing while maintaining algorithmic determinism.

Keywords: neuromorphic systems, spiking neural networks, deterministic computation, charge conservation, temporal stochasticity, continuous-time framework, quantized artificial neural networks, event-driven dynamics
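
The timing invariance described in the abstract can be illustrated with a toy model. The following is a minimal sketch, not the paper's framework: it assumes a single integrate-and-fire neuron with subtractive reset, non-negative integer input charges, and hypothetical names (`spike_count`, `charges`, `threshold`). Under these assumptions the total spike count equals the floor of total charge divided by threshold, so it is the same for every arrival order and matches an integer-quantized activation.

```python
from itertools import permutations


def spike_count(charges, threshold):
    """Toy integrate-and-fire neuron with subtractive reset.

    Assumption: charges are non-negative integers that arrive one at a
    time; the neuron fires (possibly several times) whenever its
    potential reaches the threshold, subtracting the threshold per spike.
    """
    potential = 0
    spikes = 0
    for q in charges:
        potential += q
        while potential >= threshold:
            potential -= threshold  # subtractive reset keeps leftover charge
            spikes += 1
    return spikes


charges = [3, 1, 4, 1, 5]  # hypothetical input charges
threshold = 4              # hypothetical firing threshold

# Spike counts over every possible arrival order of the same charges.
counts = {spike_count(order, threshold) for order in permutations(charges)}

# Quantized "static" output that depends only on the total charge.
quantized = sum(charges) // threshold

print(counts, quantized)
```

Because the residual potential always stays below the threshold after each step, the spike count tracks the running total of charge exactly, which is why reordering the inputs cannot change the final count in this toy setting.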

314. ❌ Deriving Hyperparameter Scaling Laws via Modern Optimization Theory

Authors: Egor Shulgin, Dimitri von Rütte, Tianyue H. Zhang, Niccolò Ajroldi, Bernhard Schölkopf, Antonio Orvieto Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15958v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

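Scoring rationale: The paper derives hyperparameter scaling laws for large-scale training theoretically, which falls under optimization theory rather than the technical principles or application innovations of large models. Although the paper mentions 'large-scale training', this refers to the scale of training and not specifically to large language models. All keywords concern specific techniques, applications, or properties of large models, while this paper focuses on the theoretical analysis of optimization algorithms and has no direct connection to them.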

!!! tip deepseek-chat TL;DR

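Using the Linear Minimization Oracle framework from modern optimization theory, the paper derives power-law scaling rules for learning rate, momentum, and batch size as functions of the iteration count or compute budget, providing a unified theoretical explanation for hyperparameter transfer in large-scale training.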

Abstract (Chinese translation)

超参数迁移已成为现代大规模训练方案的重要组成部分。现有方法（如muP）主要关注模型尺寸间的迁移，而跨批次大小与训练时长的迁移通常依赖于基于时间尺度保持、二次代理模型和连续时间近似等洞见得出的经验缩放规则。本研究通过基于线性最小化预言机（LMO）方法的最新收敛界视角，系统探究现代一阶优化器的超参数缩放规律。该框架涵盖归一化随机梯度下降法（SGD）、符号随机梯度下降法（signSGD，可近似Adam优化器）以及Muon优化器。我们将近期文献中的收敛界作为代理目标，通过在不同调参区间内最小化这些边界，推导出学习率、动量与批次大小随迭代次数或计算令牌预算变化的闭式幂律调度方案。在固定模型尺寸的前提下，本分析以统一且原理化的视角复现了文献中的多数洞见与观测结果，并为未来研究指明了清晰方向。我们的研究结果特别关注动量与批次大小缩放之间的交互作用，表明多种缩放策略均可能实现最优性能。

Abstract

Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often relying on empirical scaling rules informed by insights from timescale preservation, quadratic proxies, and continuous-time approximations. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for methods based on the Linear Minimization Oracle (LMO), a framework that includes normalized SGD, signSGD (approximating Adam), and Muon. Treating bounds in recent literature as a proxy and minimizing them across different tuning regimes yields closed-form power-law schedules for learning rate, momentum, and batch size as functions of the iteration or token budget. Our analysis, holding model size fixed, recovers most insights and observations from the literature under a unified and principled perspective, with clear directions open for future research. Our results draw particular attention to the interaction between momentum and batch-size scaling, suggesting that optimal performance may be achieved with several scaling strategies.

Keywords: hyperparameter scaling laws, modern optimization theory, Linear Minimization Oracle, convergence bounds, learning rate, momentum, batch size, large-scale training

315. ❌ GASP: Guided Asymmetric Self-Play For Coding LLMs

Authors: Swadesh Jana, Cansu Sancaktar, Tomáš Daniš, Georg Martius, Antonio Orvieto, Pavel Kolev Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15957v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

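Scoring rationale: The paper's core contribution is a post-training method for code-generation large language models (LLMs): Guided Asymmetric Self-Play (GASP). The method is an innovation in the technical principles of large models and directly concerns post-training optimization of LLMs. It is therefore highly relevant (10 points) to 'Large Language Models OR LLMs OR Foundation Models' and 'Post-training OR Supervised Fine-tuning OR SFT'. The paper does not cover the specific techniques (such as MoE, quantization, RAG, or alignment), application areas (such as AI for science), or particular capabilities (such as long context or chain-of-thought reasoning) described by the other keywords, so those keywords score 0.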

!!! tip deepseek-chat TL;DR

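The paper proposes GASP, a guided asymmetric self-play method for improving the post-training of code-generation LLMs. It grounds self-play in hard goalpost questions drawn from real data and builds a progressive curriculum around them, improving pass@20 on LiveCodeBench by 2.5% over unguided asymmetric self-play and solving hard goalpost questions that the baselines cannot.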

Abstract (Chinese translation)

非对称自我博弈已成为大语言模型后训练的一种新兴范式，其中教师模型持续生成处于学生模型学习能力边界的问题供其解答。尽管这类方法有望在不依赖人类数据的情况下实现开放式数据生成的自我迭代，但其存在一个主要缺陷：并非所有难以解决的问题都具有提升模型整体能力所需的价值或启发性。当前的非对称自我博弈方法缺乏目标导向，不具备实质的基准参照。本研究提出导向型非对称自我博弈，其通过真实数据中的目标标杆问题提供基准参照——这些被识别为对模型构成艰难探索挑战的问题。在自我博弈过程中，教师模型首先生成一个困难问题的简化变体，随后基于该简化问题生成一个更难变体，旨在通过训练逐步缩小与目标标杆之间的能力差距。该方法在LiveCodeBench上的pass@20指标较无导向的非对称自我博弈提升了2.5%，并且通过教师构建的渐进式课程，我们成功解决了所有基线模型均无法攻克的困难目标标杆问题。

Abstract

Asymmetric self-play has emerged as a promising paradigm for post-training large language models, where a teacher continually generates questions for a student to solve at the edge of the student’s learnability. Although these methods promise open-ended data generation bootstrapped from no human data, they suffer from one major problem: not all problems that are hard to solve are interesting or informative to improve the overall capabilities of the model. Current asymmetric self-play methods are goal-agnostic with no real grounding. We propose Guided Asymmetric Self-Play (GASP), where grounding is provided by real-data goalpost questions that are identified to pose a hard exploration challenge to the model. During self-play, the teacher first generates an easier variant of a hard question, and then a harder variant of that easier question, with the goal of gradually closing the gap to the goalpost throughout training. Doing so, we improve pass@20 on LiveCodeBench (LCB) by 2.5% over unguided asymmetric self-play, and through the curriculum constructed by the teacher, we manage to solve hard goalpost questions that remain out of reach for all baselines.

Keywords: Large Language Models, Post-training, Asymmetric Self-Play, Code Generation, Curriculum Learning, LiveCodeBench, Guided Learning, Model Improvement
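
The abstract reports pass@20 but does not say how it was estimated. As an illustration only, the sketch below uses the unbiased pass@k estimator that is commonly used for code benchmarks: with n samples per problem, of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The function name and the sample counts are assumptions made for this example, not values from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples (assumed n >= k)
    c: number of samples that pass all tests
    k: budget of attempts being scored
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# Hypothetical per-problem results as (samples, passing samples).
results = [(40, 0), (40, 1), (40, 12)]
scores = [pass_at_k(n, c, k=20) for n, c in results]
benchmark_score = sum(scores) / len(scores)
print(scores, benchmark_score)
```

A benchmark-level figure is then the mean of the per-problem estimates, as in the last lines of the sketch.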

316. ❌ Knowledge Graph Extraction from Biomedical Literature for Alkaptonuria Rare Disease

Authors: Giang Pham, Rebecca Finetti, Caterina Graziani, Bianca Roncaglia, Asma Bendjeddou, Linda Brodo, Sara Brunetti, Moreno Falaschi, Stefano Forti, Silvia Giulia Galfré, Paolo Milazzo, Corrado Priami, Annalisa Santucci, Ottavia Spiga, Alina Sîrbu Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15711v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

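Scoring rationale: The paper applies a PubTator3-based text-mining method to extract knowledge graphs from biomedical literature in order to study the rare disease alkaptonuria. Although this is an application of AI in science, the method is conventional text mining and knowledge graph construction; it does not involve large models, deep learning principles, or any of the specific techniques named in the scoring keywords (such as LLMs, MoE, or RLHF). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the paper is a bioinformatics application, but it shows no large-model or deep learning innovation. All other keywords are unrelated to the paper and score 0.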

!!! tip deepseek-chat TL;DR

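The study applies a PubTator3-based text-mining method to extract knowledge graphs from biomedical literature and uses them to analyze the systemic interactions, comorbidities, and potential therapeutic targets of the rare metabolic disease alkaptonuria.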

Abstract (Chinese translation)

尿黑酸尿症（Alkaptonuria, AKU）是一种极罕见的常染色体隐性遗传代谢疾病，由HGD（尿黑酸1,2-双加氧酶）基因突变引起，导致尿黑酸（Homogentisic Acid, HGA）在体液和组织中病理性蓄积。这引发全身性临床表现，包括早发性脊柱关节病、肾结石与前列腺结石以及心血管并发症。由于该病极为罕见，无论是临床数据还是文献资料，相关数据均十分有限。知识图谱（Knowledge Graphs, KGs）有助于将关于该疾病的有限知识（基础机制、临床表现及现有疗法）与其他知识相连接；然而，在现有的生物医学知识图谱中，尿黑酸尿症往往代表性不足或完全缺失。本研究采用基于PubTator3的文本挖掘方法，以大规模提取生物医学关系。我们构建了两个不同规模的知识图谱，利用现有生化知识对其进行验证，并借此提取可能与尿黑酸尿症相关的基因、疾病及疗法。这一计算框架揭示了该疾病的系统性相互作用、其共病情况及潜在治疗靶点，证明了本方法在分析罕见代谢疾病方面的有效性。

Abstract

Alkaptonuria (AKU) is an ultra-rare autosomal recessive metabolic disorder caused by mutations in the HGD (Homogentisate 1,2-Dioxygenase) gene, leading to a pathological accumulation of homogentisic acid (HGA) in body fluids and tissues. This leads to systemic manifestations, including premature spondyloarthropathy, renal and prostatic stones, and cardiovascular complications. Being ultra-rare, the amount of data related to the disease is limited, both in terms of clinical data and literature. Knowledge graphs (KGs) can help connect the limited knowledge about the disease (basic mechanisms, manifestations and existing therapies) with other knowledge; however, AKU is frequently underrepresented or entirely absent in existing biomedical KGs. In this work, we apply a text-mining methodology based on PubTator3 for large-scale extraction of biomedical relations. We construct two KGs of different sizes, validate them using existing biochemical knowledge and use them to extract genes, diseases and therapies possibly related to AKU. This computational framework reveals the systemic interactions of the disease, its comorbidities, and potential therapeutic targets, demonstrating the efficacy of our approach in analyzing rare metabolic disorders.

Keywords: Alkaptonuria, knowledge graph extraction, biomedical literature, text-mining, PubTator3, rare disease, metabolic disorder, therapeutic targets

317. ❌ A multiscale discrete-to-continuum framework for structured population models

Authors: Eleonora Agostinelli, Keith L. Chambers, Helen M. Byrne, Mohit P. Dalwadi Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15217v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

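Scoring rationale: The paper focuses on mathematical modeling and numerical methods: it studies how to systematically derive continuum approximations of discrete structured population models using multiscale methods, and validates the approach on a lipid-structured model. The content belongs to applied mathematics and computational biology and is entirely unrelated to most keywords (large-model techniques, training methods, inference optimization, alignment, agent systems, and so on). It has some connection only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because it applies mathematical models in bioinformatics and computational biology, but it does not use AI or large-model methods, so it receives 5 points (some relevance).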

!!! tip deepseek-chat TL;DR

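The paper proposes a multiscale discrete-to-continuum framework for systematically deriving continuum approximations of structured population models. It resolves the ambiguities of conventional upscaling methods regarding truncation order, uniform validity, and boundary conditions, and validates the approach on a lipid-structured model of early atherosclerosis.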

Abstract (Chinese translation)

生物种群数学模型通常采用离散结构类别来捕捉个体间的性状差异（如年龄、尺寸、表型、细胞内状态）。将这些离散模型升尺度为连续描述可提升解析处理能力与数值解的可扩展性。然而，仅基于泰勒展开的传统升尺度方法可能在截断阶数、一致有效性及边界条件方面引入模糊性。为此，本文提出一种离散多尺度框架，以系统性地推导结构化种群模型的连续近似。通过将多尺度方法与匹配渐近展开法应用于离散系统，我们识别了结构空间中适合连续描述的区域，并推导出相应的偏微分方程。主导阶动力学在主体区域表现为非线性平流方程，而在前沿波与停滞点附近的内层薄区则呈现平流-扩散过程。针对本质上不适用连续描述的区域，我们进一步推导了离散边界层描述。最后，我们以早期动脉粥样硬化的简单脂质结构模型为例演示该方法，并验证离散与连续描述的一致性。所提出的多尺度框架可应用于其他具有离散结构的异质系统，从而获得具有渐近一致边界条件的合宜升尺度动力学。

Abstract

Mathematical models of biological populations commonly use discrete structure classes to capture trait variation among individuals (e.g. age, size, phenotype, intracellular state). Upscaling these discrete models into continuum descriptions can improve analytical tractability and scalability of numerical solutions. Common upscaling approaches based solely on Taylor expansions may, however, introduce ambiguities in truncation order, uniform validity and boundary conditions. To address this, here we introduce a discrete multiscale framework to systematically derive continuum approximations of structured population models. Using the method of multiple scales and matched asymptotic expansions applied to discrete systems, we identify regions of structure space for which a continuum representation is appropriate and derive the corresponding partial differential equations. The leading-order dynamics are given by a nonlinear advection equation in the bulk domain and advection-diffusion processes in small inner layers about the leading wavefronts and stagnation point. We further derive discrete boundary layer descriptions for regions where a continuum representation is fundamentally inappropriate. Finally, we demonstrate the method on a simple lipid-structured model for early atherosclerosis and verify consistency between the discrete and continuum descriptions. The multiscale framework we present can be applied to other heterogeneous systems with discrete structure in order to obtain appropriate upscaled dynamics with asymptotically consistent boundary conditions.

Keywords: structured population models, discrete-to-continuum framework, multiscale analysis, asymptotic expansions, partial differential equations, advection-diffusion, lipid-structured model, atherosclerosis

318. ❌ Whole slide and microscopy image analysis with QuPath and OMERO

Authors: Léo Leplat, Alan O’Callaghan, Peter Bankhead Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15702v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
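Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0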

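Scoring rationale: The paper describes a technical extension that integrates the QuPath bioimage analysis software with OMERO servers; it is software tool development in bioinformatics. The content does not involve large language models, deep learning principles, or any of the specific techniques named in the scoring keywords (such as MoE, RLHF, or RAG). The only relevance is to the 'AI for Science' keyword, since bioimage analysis is an application of AI in science; however, the paper discusses software integration and remote access rather than AI models or algorithmic innovation, so it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper and score 0.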

!!! tip deepseek-chat TL;DR

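The paper addresses QuPath's inability to work efficiently with remotely stored images. It presents an extension that integrates QuPath with OMERO servers to access the pixels and metadata of remote images, and that also serves as a template for developers connecting QuPath to other image management systems.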

Abstract (Chinese translation)

QuPath是一款用于生物图像分析的开源软件。作为一款灵活且易于安装的桌面应用程序，QuPath被全球各地的实验室用于可视化和分析大型复杂图像。然而，仅依赖存储在本地文件系统中的图像限制了QuPath在更大规模研究中的应用。本文介绍了一种新的扩展功能，使QuPath能够从OMERO服务器访问像素数据和元数据。这一增强使软件能够高效处理远程存储的图像，同时也为希望将QuPath与其他图像管理系统连接的开发者提供了模板。

Abstract

QuPath is open-source software for bioimage analysis. As a desktop application that is flexible and easy to install, QuPath is used by labs worldwide to visualise and analyse large and complex images. However, relying only on images stored on a local file system limits QuPath’s use for larger studies. This paper describes a new extension that enables QuPath to access pixels and metadata from an OMERO server. This enhances the software by allowing it to work efficiently with images stored remotely, while also serving as a template for developers who want to connect QuPath to other image management systems.

Keywords: QuPath, bioimage analysis, OMERO server, remote image access, image management systems, software extension, whole slide imaging, microscopy image analysis

319. ❌ Multi-GPU MBE(3)-OSV-MP2 for Performant Large-Scale ab initio Calculations

Authors: Qiujiang Liang, Jun Yang Journal/Source: arXiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16575v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

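Scoring rationale: The paper focuses on a high-performance GPU implementation of a quantum chemistry method (MP2 theory) in computational chemistry; it belongs to computational science. The content is entirely unrelated to most keywords (large models, deep learning, training methods, inference optimization, AI agents, and so on). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': as a scientific computing application (computational chemistry), the paper can be seen as falling under a broad notion of 'AI for Science'. However, it uses conventional quantum chemistry numerical methods rather than AI or machine learning, so it receives only 5 points (some relevance).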

!!! tip deepseek-chat TL;DR

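The paper presents a high-performance multi-GPU implementation for large-scale third-order many-body expansion orbital-specific virtual MP2 (MBE(3)-OSV-MP2) energy calculations. It achieves substantial speedups on several test cases (up to 40-fold) and opens a new route to fast local correlation calculations on real-life macromolecules.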

Abstract (Chinese translation)

由于显著的算法复杂性，轨道不变局域关联方法在图形处理器（GPU）上的计算加速在很大程度上仍未得到探索。GPU实现的局域关联理论的运行效率可能受到以下方面的显著制约：轨道局域化过程的可并行化程度、局域波函数的迭代求解，以及CUDA内核对固有局域或稀疏操作的适配。基于二阶Møller-Plesset微扰（MP2）理论，我们提出了一种用于大规模三阶多体展开轨道特定虚轨道MP2（MBE(3)-OSV-MP2）能量计算的多GPU实现方案。相应地，我们的算法与实现在多个方面解决了GPU并行化能力，以实现局域MP2计算的峰值利用率和并行性，包括Jacobi-Pipek-Mezey局域化、随机化OSV生成、直接MP2积分再生，以及CUDA内核对局域操作的适配。基于GPU的MBE(3)-OSV-MP2能量计算实现了$O(N^{1.9})$的标度，并在分布于多个节点的24个GPU上达到84%的并行效率。本实现方案对于(H$_2$O)$_{128}$/cc-pVDZ和(H$_2$O)$_{190}$/cc-pVDZ体系，分别实现了相对于正则RI-MP2的40倍墙钟时间加速，以及相对于基于CPU的MBE(3)-OSV-MP2的10倍加速。对包含784个原子的胰岛素肽进行大规模计算，在8个NVIDIA A800 GPU上，使用cc-pVDZ基组（7571个基函数）在24分钟内获得了完整的MBE(3)-OSV-MP2能量，使用cc-pVTZ基组（17448个基函数）则在6.4小时内完成。我们的工作为在现实大分子体系上执行快速的基于GPU的局域关联计算开辟了新的可能性。

Abstract

The computational acceleration of orbital-invariant local correlation methods on graphics processing units (GPUs) has remained largely unexplored due to substantial algorithmic complexities. The runtime efficiency of GPU-implemented local correlation theories can be significantly constrained by the parallelizable degree of the orbital localization procedure, the iterative solution of the local wave function, and the adaptation of CUDA kernels to inherently local or sparse operations. Using the second-order Møller-Plesset perturbation (MP2) theory, we present a multi-GPU implementation for large-scale third-order many-body expansion orbital-specific virtual MP2 (MBE(3)-OSV-MP2) energy calculations. Accordingly, our algorithms and implementation address the GPU parallelization ability for peak utilization and parallelism of local MP2 computation in several aspects, including Jacobi-Pipek-Mezey localization, randomized OSV generation, direct MP2 integral regeneration, as well as CUDA kernel adaptation to local operations. The GPU-based MBE(3)-OSV-MP2 energy computation achieves $O(N^{1.9})$ scaling and 84% parallel efficiency up to 24 GPUs distributed on multiple nodes. The present implementation delivers 40-fold wall-time speedup of the canonical RI-MP2 and 10-fold speedup of the CPU-based MBE(3)-OSV-MP2 for (H$_2$O)$_{128}$/cc-pVDZ and (H$_2$O)$_{190}$/cc-pVDZ, respectively. A large scale computation of 784-atom insulin peptide yields the full MBE(3)-OSV-MP2 energies in 24 minutes with cc-pVDZ (7571 basis functions) and 6.4 hours with cc-pVTZ (17448 basis functions) on 8 NVIDIA A800 GPUs. Our work opens up new possibilities for performing fast GPU-based local correlation calculations on real-life macromolecules.

Keywords: GPU acceleration, local correlation methods, MP2 theory, many-body expansion, orbital-specific virtual, large-scale ab initio calculations, computational chemistry, high-performance computing

320. ❌ Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database

Authors: Madhulatha Mandarapu, Sandeep Kunkunuru Journal/Source: arXiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15080v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	15.0/10	0.0

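Scoring rationale: The paper studies the construction and federation of biomedical knowledge graphs and LLM agent access to them. Its core contributions are: 1) building three large-scale open biomedical knowledge graphs; 2) enabling federated queries across knowledge graphs; 3) developing an MCP-server-based access mechanism for LLM agents. The paper is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (15 points), because it focuses on AI applications in bioinformatics; relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points), because it develops and evaluates an LLM agent access mechanism; and relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points) and 'Tool Use OR Function Calling OR API Tool Use' (8 points), because it uses an LLM (GPT-4o) as a baseline and involves MCP tool calls. The other keywords are unrelated to the paper (0 points): it does not involve innovations in large-model principles (such as MoE, scaling laws, training methods, inference optimization, alignment, or compression), nor topics such as multi-agent systems, chain of thought, or world models.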

!!! tip deepseek-chat TL;DR

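The paper addresses the fragmentation of biomedical knowledge across siloed databases. It builds three large-scale open knowledge graphs, enables federated queries across them, and develops an access mechanism for LLM agents; on the BiomedQA benchmark, domain-specific MCP tools reach 98% accuracy, compared with 0% for text-to-Cypher and 75% for standalone GPT-4o.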

Abstract (Chinese translation)

生物医学知识分散于各自独立的数据库中——Reactome存储通路数据，STRING存储蛋白质相互作用，ClinicalTrials.gov存储研究注册信息，DrugBank存储药物词汇，DGIdb存储药物-基因相互作用，SIDER存储副作用信息。我们提出了三个开源生物医学知识图谱——通路知识图谱（整合5个来源，含118,686个节点、834,785条边）、临床试验知识图谱（整合5个来源，含7,774,446个节点、26,973,997条边）以及药物相互作用知识图谱（整合3个来源，含32,726个节点、191,970条边）——它们构建于Samyama之上，这是一个用Rust编写的高性能图数据库。

我们的贡献体现在三个方面。首先，我们描述了一种可复现的ETL（提取、转换、加载）模式，用于从异构公共数据源构建大规模知识图谱，该模式具备跨源去重、批量加载（支持Python Cypher与Rust原生加载器）以及便携式快照导出功能。其次，我们展示了跨知识图谱的联邦查询能力：将全部三个快照加载至单一图租户中，即可实现跨数据集的基于属性的关联查询。第三，我们引入了基于模式驱动的MCP（模型上下文协议）服务器生成方案，以支持大语言模型智能体访问，并在新的BiomedQA基准测试（40个药理学问题）上进行了评估：领域特定的MCP工具达到了98%的准确率，而文本转Cypher方法的准确率为0%，独立GPT-4o模型的准确率为75%。

所有数据源均为开放许可。整合后的联邦图谱（含790万个节点、2800万条边）在商用云硬件上加载仅需约3分钟，跨知识图谱查询可在80毫秒至4秒内完成。

Abstract

Biomedical knowledge is fragmented across siloed databases – Reactome for pathways, STRING for protein interactions, ClinicalTrials.gov for study registries, DrugBank for drug vocabularies, DGIdb for drug-gene interactions, SIDER for side effects. We present three open-source biomedical knowledge graphs – Pathways KG (118,686 nodes, 834,785 edges from 5 sources), Clinical Trials KG (7,774,446 nodes, 26,973,997 edges from 5 sources), and Drug Interactions KG (32,726 nodes, 191,970 edges from 3 sources) – built on Samyama, a high-performance graph database written in Rust. Our contributions are threefold. First, we describe a reproducible ETL pattern for constructing large-scale KGs from heterogeneous public data sources, with cross-source deduplication, batch loading (Python Cypher and Rust native loaders), and portable snapshot export. Second, we demonstrate cross-KG federation: loading all three snapshots into a single graph tenant enables property-based joins across datasets. Third, we introduce schema-driven MCP server generation for LLM agent access, evaluated on a new BiomedQA benchmark (40 pharmacology questions): domain-specific MCP tools achieve 98% accuracy vs. 0% for text-to-Cypher and 75% for standalone GPT-4o. All data sources are open-license. The combined federated graph (7.9M nodes, 28M edges) loads in approximately 3 minutes on commodity cloud hardware, and cross-KG queries complete in 80ms-4s.

Keywords: biomedical knowledge graphs, graph database, ETL pattern, cross-KG federation, LLM agents, MCP server, BiomedQA benchmark, pharmacology questions
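The abstract above mentions cross-source deduplication and property-based joins across knowledge graphs. A minimal sketch of both ideas in plain Python follows. It is not Samyama's API or the authors' ETL code: the record layout, the `drug_name` property, and the helper names `dedupe` and `property_join` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def dedupe(records, key):
    """Merge records that share the same normalized key (hypothetical rule:
    trim whitespace and lowercase). Source labels of merged records are unioned."""
    merged = {}
    for rec in records:
        k = rec[key].strip().lower()
        if k not in merged:
            merged[k] = {**rec, key: k, "sources": set(rec["sources"])}
        else:
            merged[k]["sources"] |= set(rec["sources"])
    return list(merged.values())


def property_join(left, right, prop):
    """Pair every left node with every right node holding the same value of `prop`."""
    index = defaultdict(list)
    for node in right:
        index[node[prop]].append(node)
    return [(l, r) for l in left for r in index.get(l[prop], [])]


# Toy data: the same drug reported by two sources with different spelling.
drugs = dedupe(
    [
        {"drug_name": "Aspirin", "sources": ["DrugBank"]},
        {"drug_name": " aspirin", "sources": ["DGIdb"]},
    ],
    key="drug_name",
)
# Toy clinical-trial nodes from a second graph, joined on the shared property.
trials = [
    {"trial_id": "T-001", "drug_name": "aspirin"},
    {"trial_id": "T-002", "drug_name": "ibuprofen"},
]
pairs = property_join(drugs, trials, "drug_name")
```

Joining on a normalized property rather than on internal node identifiers is what lets independently built snapshots be queried together; a real pipeline would need a more careful normalization rule than lowercasing.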

321. ❌ Disentangling Single- and Biexciton Dynamics with Photoelectron-Detected Two-Dimensional Electronic Spectroscopy

Authors: Luisa Brenneis, Matthias Hensen, Julian Lüttig, Tobias Brixner Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16484v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

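Scoring rationale: The paper studies photoelectron detection in two-dimensional electronic spectroscopy. It belongs to physical chemistry and spectroscopy and focuses on the nonlinear optical response and exciton dynamics of quantum systems. It does not involve large models, deep learning, artificial intelligence, or any computer-science technique, so it is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

Using numerical simulations, the paper studies time gating and kinetic-energy filtering in photoelectron-detected 2D spectroscopy to separate single-exciton and biexciton dynamics. It shows that these methods extract the same information as coherently detected spectra and allow exciton-exciton annihilation dynamics to be inferred directly.

Abstract (translation)

Action-detected two-dimensional spectroscopy resolves the time-evolving nonlinear optical response of a quantum system by recording incoherently detected observables (such as fluorescence, photoelectrons, or photocurrents) that reflect the system's excited-state population. Processes such as exciton-exciton annihilation change this population and thereby mask processes such as energy transfer. Compared with coherently detected 2D spectroscopy, this limits the information obtainable from action-detected 2D spectra. This work studies time gating and kinetic-energy filtering in photoelectron-detected 2D spectroscopy as ways to distinguish different processes. We establish a numerical simulation scheme that computes photoelectron-detected 2D spectra for a variety of systems and show that time gating extracts the same information as coherently detected 2D spectra even when annihilation is present. In addition, annihilation dynamics can be inferred directly. Kinetic-energy filtering further isolates specific excited-state dynamics. Our simulations indicate that time gating and kinetic-energy filtering are promising extensions of photoelectron-detected 2D spectroscopy.

Abstract (original)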

Action-detected two-dimensional (2D) spectroscopy resolves the time-dependent nonlinear optical response of a quantum system by recording incoherently detected observables such as fluorescence, photoelectrons, or photocurrents which reflect the system’s excited-state population. Processes such as exciton-exciton annihilation alter this population and obscure, for instance, energy transfer processes. This limits the information available from action-detected 2D spectra compared to their coherently detected counterparts. Here we investigate time gating and kinetic-energy filtering in photoelectron-detected 2D spectroscopy to disentangle various processes. We implement a numerical simulation protocol that allows us to calculate photoelectron-detected 2D spectra for various systems, demonstrating that time gating can extract the same information as coherently detected 2D spectroscopy, even when annihilation is present. Furthermore, we can directly infer annihilation dynamics. Kinetic-energy filtering additionally enables the isolation of specific excited-state dynamics. Our simulations demonstrate that time gating and kinetic-energy filtering are promising extensions for photoelectron-detected 2D spectroscopy.

Keywords: two-dimensional spectroscopy, photoelectron detection, exciton dynamics, time gating, kinetic-energy filtering, exciton-exciton annihilation, nonlinear optical response, numerical simulation

322. ❌ Free complement method with Gaussian expanded complements: hierarchical decontraction to mitigate the exponential wall before selection

Author: Cong Wang Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16262v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

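Scoring rationale: The paper studies the free complement method in quantum chemistry, specifically Gaussian expanded complement functions, wavefunction decontraction, and the optimization of variational parameters. It belongs to computational physics and chemistry. Every scoring keyword concerns large language models, deep learning, or AI techniques and their applications, none of which the paper touches, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes using decontraction in the free complement method with Gaussian expanded complement functions to postpone the exponential growth in the number of variational parameters.

Abstract (translation)

Previous work on the free complement (FC) method with Gaussian expanded complement functions (arXiv:2508.04635 [physics.chem-ph]) used a Slater initial wavefunction. When more than one Gaussian function is used in the expansion, this can cause the number of variational coefficients associated with the Gaussian complement functions to grow exponentially with the number of electrons (at a given order) before the overlap-matrix-based selection. The present work avoids this at low orders of the FC method by decontracting via the distinct exponents introduced by the $g$ functions. The exponential growth in the number of variational parameters is postponed to higher orders of the FC expansion.

Abstract (original)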

The previous work (arXiv:2508.04635 [physics.chem-ph]) of the free complement (FC) method with Gaussian expanded complement functions adopts the Slater initial wavefunction. This may introduce an exponential complexity of the variational coefficients associated to the Gaussian complement functions with respect to the number of electrons at a given order before the overlap matrix based selection, for more than one Gaussian function used in the expansion. The present work uses decontractions via the distinct exponents introduced by the $g$ functions to avoid this scenario at low order of the FC method. The exponential number of the variational parameters is postponed to higher orders of the FC expansion.

Keywords: free complement method, Gaussian expanded complements, hierarchical decontraction, exponential wall, variational parameters, Slater initial wavefunction, FC expansion, quantum chemistry

323. ❌ PFP/MM: A Hybrid Approach Combining a Universal Neural Network Potential with Classical Force Fields for Large-Scale Reactive Simulations

Authors: Yu Miyazaki, Atsuhiro Tomita, Akihide Hayashi, So Takemoto, Mizuki Takemoto, Hodaka Mori Source: arxiv Published: 2026-03-17 arXiv link: http://arxiv.org/abs/2603.16061v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

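Scoring rationale: The paper develops a hybrid method (PFP/MM) that combines a universal machine-learning interatomic potential (uMLIP) with classical molecular mechanics (MM) for large-scale reactive molecular simulations. The work belongs to computational chemistry and molecular simulation and applies machine learning to a scientific computing problem. Among the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant (10 points), because the paper is clearly an application of AI to science, specifically computational chemistry and biomolecular simulation. The other keywords concern large language model (LLM) techniques, training methods, inference optimization, agent systems, and similar topics, which are unrelated to the paper's molecular simulation and force-field subject, so they score 0.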
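The table above pairs a relevance of 10.0/10 with a score of 0.0, which is easier to follow with the arithmetic written out. The sketch below assumes that each keyword score is weight × relevance, capped at the configured per-keyword maximum of 15, and that the pass mark is 5.0 + 0.8 × total weight as stated in the report's configuration. The product rule and the function names are assumptions made for illustration, not the report generator's actual code.

```python
def keyword_score(weight, relevance, max_score=15.0):
    """Assumed rule: weight times relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, max_score)


def pass_threshold(total_weight):
    """Pass mark from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows):
    """Sum the keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# Rows as they appear in the table above: 26 keywords with relevance 0.0
# and one keyword with relevance 10.0, every one listed with weight 0.0.
rows = [(0.0, 0.0)] * 26 + [(0.0, 10.0)]
paper_total = total_score(rows)
threshold = pass_threshold(27.0)  # 27 keywords configured at weight 1.0 each
passed = paper_total >= threshold
```

Under this assumed rule, a weight of 0.0 cancels any relevance, which matches the 0.0 total shown for this paper even though one keyword is rated 10.0/10.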

!!! tip deepseek-chat TL;DR

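The paper proposes PFP/MM, a hybrid method that combines a universal machine-learning interatomic potential with classical molecular mechanics to address the computational cost of large-scale reactive molecular simulations, and validates its accuracy and practicality on several biomolecular systems.

Abstract (translation)

Universal machine-learning interatomic potentials (uMLIPs) enable reactive molecular simulations with near density functional theory (DFT) accuracy, but applying them efficiently to large, realistic condensed-phase systems remains computationally demanding. This work presents PFP/MM, a hybrid method that combines a uMLIP, the PreFerred Potential (PFP), with molecular mechanics (MM) to enable large-scale, long-time simulations that are challenging for uMLIP-only calculations. For alanine dipeptide in explicit water, we achieve multi-ns/day enhanced sampling and obtain a Ramachandran plot consistent with the known stable basins. For an intramolecular nucleophilic addition reaction in a polar solvent, we reproduce the expected solvent-induced stabilization in the free-energy profile. We further apply the method to the hydroxylation reaction of cytochrome P450 Compound I and obtain a free-energy landscape consistent with the accepted reaction mechanism. These results show that uMLIP-based reactive simulations can be applied to diverse condensed-phase processes in large, realistic environments.

Abstract (original)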

Universal machine-learning interatomic potentials (uMLIPs) enable reactive molecular simulations with near-DFT accuracy, yet applying them efficiently to large, realistic condensed-phase systems remains computationally demanding. Here we present PFP/MM, a hybrid approach that combines a uMLIP, PreFerred Potential (PFP), with molecular mechanics (MM) to enable both large-scale and long-time simulations that are challenging for uMLIP-only calculations. Using an alanine dipeptide in explicit water, we achieve multi-ns/day enhanced sampling and obtain a Ramachandran plot consistent with established basins. For an intramolecular nucleophilic addition reaction in a polar solvent environment, we reproduce the expected solvent-induced stabilization in the free-energy profile. We further apply the approach to a cytochrome P450 Compound I hydroxylation reaction and obtain a free-energy landscape consistent with the accepted reaction mechanism. These results demonstrate that uMLIP-based reactive simulations can be applied to diverse condensed-phase processes in large, realistic environments.

Keywords: machine-learning interatomic potentials, hybrid approach, reactive molecular simulations, molecular mechanics, free-energy landscape, condensed-phase systems, computational chemistry, biomolecular simulation

324. ❌ Velocity Gauge for Oscillator Strength in $Δ$SCF theory

Authors: Yang Shen, Yichen Fan, Weitao Yang Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15879v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

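Scoring rationale: The paper studies velocity-gauge calculation of oscillator strengths in ΔSCF theory. It belongs to computational and quantum chemistry and has nothing to do with deep learning or large-model techniques. The only keyword that might apply is "AI for Science OR Bioinformatics OR Cheminformatics", but the paper uses no AI method and rests on theoretical physics and computational methods, so it receives only 5 points (some relevance, because it is scientific computing). All other keywords concern large models and deep learning, have no connection to the paper, and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the unphysical results that arise when computing oscillator strengths in ΔSCF theory because the ground- and excited-state Kohn-Sham wavefunctions are non-orthogonal. It proposes the velocity gauge, which handles the non-orthogonality naturally and gives origin-independent predictions without additional correction schemes.

Abstract (translation)

Delta self-consistent-field ($Δ$SCF) theory is widely used to compute electronic excitation energies. Computing the corresponding oscillator strengths, however, is challenging. The corresponding many-electron wavefunctions are not directly accessible. Both the ground-state and excited-state wavefunctions in $Δ$SCF are described by reference Kohn-Sham (KS) single-determinant wavefunctions of fictitious non-interacting systems. Because the ground- and excited-state KS determinants come from two different SCF calculations, they are non-orthogonal, which leads to unphysical, origin-dependent transition properties such as the transition dipole moment and the length-gauge oscillator strength. Including the nuclear contribution in the perturbation is theoretically rigorous, but, as we show theoretically and numerically, it is effective only for neutral systems. Several other practical approaches have been proposed to handle the non-orthogonality and give reasonable results, but they inevitably change the ground- or excited-state determinant and the density matrix. In this work we explore the velocity gauge for computing oscillator strengths in $Δ$SCF theory. We show that the velocity gauge naturally handles the non-orthogonality of the $Δ$SCF KS wavefunctions and gives origin-independent predictions without any additional correction scheme applied to the KS wavefunctions. Its results are comparable to length-gauge results obtained via symmetric orthogonalization. Moreover, using the spin-purified singlet excitation energy in the velocity-gauge transition dipole moment significantly improves the overall performance of velocity-gauge $Δ$SCF oscillator-strength predictions for conjugated chromophores.

Abstract (original)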

Delta self-consistent-field ($Δ$SCF) theory is widely used for electronic excitation energy calculations. However, calculating the corresponding oscillator strengths is challenging. The corresponding many-electron wavefunctions are not directly accessible. Both the ground-state and the excited-state wave functions from $Δ$SCF are described by reference Kohn-Sham (KS) single-determinant wavefunctions for the fictitious non-interacting systems. The non-orthogonality between the ground and excited Kohn-Sham determinants from two different SCF calculations leads to unphysically origin-dependent transition properties, such as transition dipole moment and length-gauge oscillator strength. Including nuclei contribution in the perturbation is theoretically rigorous, but its effectiveness is only limited to neutral systems, as we show theoretically and numerically. While several other practical approaches have been proposed to tackle the non-orthogonality problem and yield reasonable results, inevitably the determinant of the ground state or the excited state is changed, as well as the density matrix. In this work, we explore the use of the velocity gauge to compute oscillator strength within $Δ$SCF theory. We demonstrate that the velocity gauge is capable of naturally accounting for the non-orthogonality of $Δ$SCF KS wavefunctions and offering origin-independent predictions without any additional correction schemes to the KS wavefunctions. Compared to the length-gauge results obtained via symmetric orthogonalization, velocity gauge can offer comparable results. Furthermore, the adoption of spin-purified singlet excitation energy in the velocity-gauge transition dipole moment significantly enhances the overall performance of the velocity gauge for $Δ$SCF oscillator strength predictions on conjugated chromophores.
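The origin dependence described in the abstract can be made concrete with a short textbook-style derivation. It is a sketch under standard assumptions, not reproduced from the paper; the symbols $S$ (determinant overlap), $\mathbf{a}$ (origin shift), $N$ (number of electrons), and $Z_{\mathrm{tot}}$ (total nuclear charge) are introduced here for illustration. Shifting the origin by $\mathbf{a}$ changes the electronic length-gauge matrix element by a term proportional to the overlap:

$$
\Big\langle \Psi_0 \Big| \sum_{i=1}^{N} (\mathbf{r}_i + \mathbf{a}) \Big| \Psi_n \Big\rangle
= \Big\langle \Psi_0 \Big| \sum_{i=1}^{N} \mathbf{r}_i \Big| \Psi_n \Big\rangle + N\,\mathbf{a}\,S,
\qquad S = \langle \Psi_0 | \Psi_n \rangle .
$$

If the nuclear part of the dipole operator is included, $\hat{\boldsymbol{\mu}} = \sum_A Z_A \mathbf{R}_A - \sum_i \mathbf{r}_i$, the shift term becomes $(Z_{\mathrm{tot}} - N)\,\mathbf{a}\,S$, which vanishes only when $Z_{\mathrm{tot}} = N$. This is consistent with the abstract's remark that the nuclear correction works only for neutral systems. The gradient operator does not change under $\mathbf{r}_i \to \mathbf{r}_i + \mathbf{a}$, so the velocity-gauge matrix element is origin-independent whether or not $S = 0$. In atomic units, the two oscillator-strength expressions for an excitation energy $\Delta E$ are

$$
f^{\mathrm{L}} = \frac{2}{3}\,\Delta E\,\Big| \Big\langle \Psi_0 \Big| \sum_i \mathbf{r}_i \Big| \Psi_n \Big\rangle \Big|^2,
\qquad
f^{\mathrm{V}} = \frac{2}{3\,\Delta E}\,\Big| \Big\langle \Psi_0 \Big| \sum_i \nabla_i \Big| \Psi_n \Big\rangle \Big|^2 .
$$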

Keywords: ΔSCF theory, oscillator strength, velocity gauge, Kohn-Sham wavefunctions, non-orthogonality, transition dipole moment, electronic excitation, conjugated chromophores

325. ❌ Transfer Learning Meets Embedded Correlated Wavefunction Theory for Chemically Accurate Molecular Simulations: Application to Calcium Carbonate Ion-Pairing

Authors: Xuezhi Bian, Emily A. Carter Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15846v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

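Scoring rationale: The paper is a computational chemistry study. It proposes a molecular simulation framework (ECW-TL) that combines embedded correlated wavefunction theory with transfer learning to simulate ion pairing in aqueous solution accurately. All keywords concern large language models (LLMs) or deep learning techniques, whereas the core of this paper is quantum chemical calculation and machine-learned potentials; it does not cover LLMs, deep learning architectures, training methods, inference optimization, alignment techniques, or agent systems. The only keyword that might apply is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper belongs to computational chemistry (related to cheminformatics), but LLMs and deep learning are not its core application, so it receives 5 points (some relevance). The other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The study proposes an embedded correlated wavefunction transfer learning (ECW-TL) framework for molecular dynamics simulations at chemical accuracy and applies it to calcium carbonate ion pairing in aqueous solution, revealing the key influence of exact electron exchange and correlation on ion-pair stability and structure.

Abstract (translation)

Achieving chemical accuracy in molecular simulations remains a central challenge in computational chemistry. This work presents an embedded correlated wavefunction transfer learning (ECW-TL) framework for accurately simulating molecular dynamics in the condensed phase. The framework incorporates high-level electron exchange and correlation effects within embedded correlated wavefunction (ECW) theory while retaining the training and computational efficiency of machine-learned interatomic potentials. We validate the framework on Ca²⁺-CO₃²⁻ ion pairing in aqueous solution, a key step in CO₂ mineralization in seawater. As a proof of principle, we first show that fine-tuning a DFT-revPBE-D3(BJ) baseline model with embedded density functional theory (DFT)-SCAN data reproduces the DFT-SCAN free-energy surface to within 1 kcal/mol across all solvation states. Extending the framework to embedded second-order perturbation theory (MP2) and localized natural-orbital CCSD(T) further refines the free-energy profile and reveals the key role of exact electron exchange and correlation in determining ion-pair stability and structure. ECW-TL therefore offers a general, data-efficient route for transferring correlated wavefunction (CW) accuracy to large-scale simulations of complex aqueous and interfacial chemical processes.

Abstract (original)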

Achieving chemical accuracy for molecular simulations remains a central challenge in computational chemistry. Here, we present an embedded correlated wavefunction transfer learning (ECW-TL) framework for accurately simulating molecular dynamics in the condensed phase. ECW-TL incorporates high-level electron exchange and correlation effects in ECW theory while preserving training and computational efficiency of machine learned interatomic potentials. We demonstrate the framework on Ca$^{2+}$-CO$_3^{2-}$ ion pairing in aqueous solution, a key process underlying CO$_2$ mineralization in seawater. As proof of principle, we first show that finetuning a DFT-revPBE-D3(BJ) baseline model with embedded-DFT-SCAN data reproduces the DFT-SCAN free-energy surface within 1 kcal/mol across all solvation states. Extending the framework to embedded MP2 and localized natural-orbital CCSD(T) further refines the free-energy profile, revealing the crucial role of exact electron exchange and correlation in determining ion-pair stability and structure. ECW-TL thus provides a general, data-efficient route for transferring CW accuracy to large-scale simulations of complex aqueous and interfacial chemical processes.

Keywords: transfer learning, embedded correlated wavefunction, molecular simulations, chemical accuracy, ion pairing, aqueous solution, free-energy surface, CCSD(T)

326. ❌ On the performance of QTP functionals applied to second-order response properties II: Dynamic polarizability and long-range C$_6$ coefficients

Authors: Rodrigo A. Mendes, Peter R. Franke, Ajith Perera, Rodney J. Bartlett Source: arxiv Published: 2026-03-16 arXiv link: http://arxiv.org/abs/2603.15788v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

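Scoring rationale: The paper evaluates how exchange-correlation functionals in quantum chemistry perform on second-order response properties (dynamic polarizabilities and C6 dispersion coefficients). It belongs to computational chemistry. Every scoring keyword concerns large models, deep learning, and related techniques (training methods, inference optimization, applications, and so on), none of which the paper touches, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The study evaluates 25 exchange-correlation functionals, including the Quantum Theory Project (QTP) family, on frequency-dependent second-order response properties, namely dynamic polarizabilities and C6 dispersion coefficients. TPSS0 and QTP01 perform best for dynamic polarizabilities, while O3LYP performs best overall for C6 coefficients.

Abstract (translation)

This study is the second in the series "On the performance of QTP functionals applied to second-order response properties". In the first paper (J. Chem. Phys. 162, 054105, 2025), we demonstrated the good performance of Quantum Theory Project (QTP) functionals in predicting static perturbed second-order properties, such as static polarizabilities, nuclear magnetic resonance (NMR) spin-spin coupling constants, and NMR chemical shifts. The present paper focuses on frequency-dependent properties, namely dynamic polarizabilities and C$_6$ dispersion coefficients. For completeness, a total of 25 exchange-correlation (XC) functionals were examined. Dynamic polarizabilities were evaluated at five perturbation wavelengths: 632.99 nm, 594.10 nm, 543.52 nm, 514.50 nm, and 325.13 nm. The property was also computed with Hartree-Fock (HF) and equation-of-motion coupled-cluster singles and doubles (EOM-CCSD). In general, the EOM-CCSD results are very close to those from linear-response CC3, except at the highest frequency. Among the Kohn-Sham calculations, TPSS0 and QTP01 show the best overall performance for dynamic polarizabilities. We also assessed how well the QTP functionals reproduce the pole structure of the CO molecule. The C$_6$ dispersion coefficients were computed with the Casimir-Polder equation. O3LYP gives the best overall performance, but the eleven top-ranked functionals show similar accuracy. Within the QTP family, QTP01 and long-range-corrected QTP (LC-QTP) give the best results for C$_6$ coefficients.

Abstract (original)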

This work is the second in the series “On the performance of QTP functionals applied to second-order response properties.” In the first paper (J. Chem. Phys. 162, 054105, 2025), we demonstrated the good performance of Quantum Theory Project functionals in predicting static perturbed second-order properties, such as static polarizabilities, nuclear magnetic resonance (NMR) spin-spin coupling constants, and NMR chemical shifts. In the present study, we focus on frequency-dependent properties, namely dynamic polarizabilities and C$_6$ dispersion coefficients. For completeness, a total of 25 exchange-correlation (XC) functionals were investigated. Dynamic polarizabilities were evaluated at five different perturbation wavelengths: 632.99 nm, 594.10 nm, 543.52 nm, 514.50 nm, and 325.13 nm. This property was also computed using HF and EOM-CCSD. In general, EOM-CCSD results are very close to those obtained with linear-response CC3, except at the highest frequency. Among Kohn-Sham calculations, TPSS0 and QTP01 showed the best overall performance for dynamic polarizabilities. We also assessed how well QTP functionals reproduce the pole structure of the CO molecule. For the C$_6$ dispersion coefficients, calculations were performed using the Casimir-Polder equation. The best overall performance was obtained with O3LYP; however, the first eleven ranked functionals show very similar accuracy. Within the QTP family, QTP01 and LC-QTP provide the best results for C$_6$ coefficients.
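The Casimir-Polder equation mentioned in the abstract relates the dispersion coefficient to polarizabilities at imaginary frequency, $C_6^{AB} = \frac{3}{\pi}\int_0^{\infty} \alpha_A(i\omega)\,\alpha_B(i\omega)\,d\omega$ in atomic units. The sketch below evaluates that integral numerically for a single-pole model polarizability $\alpha(i\omega) = \alpha_0 / (1 + \omega^2/\omega_0^2)$, for which the integral reduces to the London formula. The model, the parameter values, and the function names are illustrative assumptions; the paper's own calculations use polarizabilities from electronic-structure methods, not this model.

```python
import math


def alpha_imag(omega, alpha0, omega0):
    """Single-pole model polarizability at imaginary frequency i*omega."""
    return alpha0 / (1.0 + (omega / omega0) ** 2)


def c6_casimir_polder(alpha_a, alpha_b, scale=1.0, n=20000):
    """Casimir-Polder integral (3/pi) * int_0^inf alpha_a(iw) alpha_b(iw) dw.

    The half-line is mapped to (0, pi/2) with w = scale * tan(theta) and
    integrated with the midpoint rule, so the endpoint is never evaluated.
    """
    h = (math.pi / 2.0) / n
    total = 0.0
    for k in range(n):
        theta = (k + 0.5) * h
        omega = scale * math.tan(theta)
        jacobian = scale / math.cos(theta) ** 2  # d(omega)/d(theta)
        total += alpha_a(omega) * alpha_b(omega) * jacobian
    return 3.0 / math.pi * total * h


# Illustrative parameters in atomic units (not taken from the paper).
alpha0, omega0 = 10.0, 0.5
model = lambda w: alpha_imag(w, alpha0, omega0)
c6_numeric = c6_casimir_polder(model, model, scale=omega0)

# London formula for two identical single-pole species: (3/4) * alpha0^2 * omega0.
c6_london = 0.75 * alpha0 ** 2 * omega0
```

Comparing `c6_numeric` with `c6_london` is a quick sanity check on the quadrature before substituting tabulated or computed $\alpha(i\omega)$ values for the model function.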

Keywords: QTP functionals, second-order response properties, dynamic polarizability, C6 dispersion coefficients, exchange-correlation functionals, frequency-dependent properties, Casimir-Polder equation, quantum chemistry

327. ❌ Life cycle assessment for all organic chemicals

Authors: Shaohan Chen, Tim Langhorst, Julian Nöhl, Christopher Oberschelp, Martin Pillich, Johannes Schilling, André Bardow Source: arxiv Published: 2026-03-15 arXiv link: http://arxiv.org/abs/2603.15686v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

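Scoring rationale: The paper presents CRYSTAL, a data-generation framework for chemical life cycle assessment (LCA) that uses retrosynthesis and machine learning to build a life cycle inventory database for chemicals. Its content is unrelated to most keywords (which concern large-model principles, training methods, inference optimization, alignment, agent systems, and so on). It has some connection only to the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", because it applies machine learning to cheminformatics, but that is not its core innovation, so it receives 5 points.

!!! tip deepseek-chat TL;DR

The paper proposes the CRYSTAL framework, which uses retrosynthesis and machine learning to generate transparent life cycle inventory data for organic chemicals automatically. It builds a database covering more than 70,000 chemicals and identifies environmental hotspots and key hub chemicals.

Abstract (translation)

Chemicals reach into nearly every corner of modern society, yet their production raises serious sustainability concerns. A sustainable chemical industry requires detailed life cycle assessment (LCA). Because existing life cycle inventory (LCI) databases cover only a tiny fraction of traded chemicals, current assessments suffer from limited, partly inconsistent, and non-transparent data coverage, which leaves many unknowns. This work introduces the Chemical RetrosYnthesiS for Transparent Assessment of Life-cycles (CRYSTAL) framework. Starting from molecular structure, the framework automatically generates consistent and transparent LCI data for organic chemicals using retrosynthesis and machine-learned gate-to-gate inventories. Using CRYSTAL's predictive power, we build a consistent database of more than 70000 organic chemicals, comprising over 110000 transparent LCI datasets that quantify feedstock and energy demands together with the associated auxiliary materials, biosphere flows, and waste flows. From this comprehensive database, we identify 50 key environmental hotspots that drive high impacts of organic chemical production across multiple environmental categories, as well as the pivotal hub chemicals most critical for downstream chemical production. By providing this data foundation, the CRYSTAL framework offers systematic guidance for targeted engineering and policy interventions. Its transparent, modular design is intended to move chemical LCA from reliance on "unknown unknowns" to a collaboratively improvable map of "known unknowns".

Abstract (original)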

Chemicals are embedded in nearly every aspect of modern society, yet their production poses substantial sustainability concerns. Achieving a sustainable chemical industry requires detailed Life Cycle Assessment (LCA); however, current assessments face many unknowns due to limited, partly inconsistent, and untransparent data coverage since existing Life Cycle Inventory (LCI) databases account for only a tiny fraction of traded chemicals. Here, we introduce the Chemical RetrosYnthesiS for Transparent Assessment of Life-cycles (CRYSTAL) framework, which automatically generates consistent and transparent LCI data for organic chemicals based on their molecular structure using retrosynthesis and machine-learned gate-to-gate inventories. Using the predictive power of CRYSTAL, we create a consistent database for more than 70000 organic chemicals, comprising over 110000 transparent LCI datasets that quantify both feedstock and energy demands, together with associated auxiliary materials, biosphere flows, and waste flows. From this comprehensive database, we identify 50 key environmental hotspots driving high impacts of organic chemical production across multiple environmental categories and pivotal hub chemicals that are most critical for downstream chemical production. In providing this comprehensive data foundation, the CRYSTAL framework offers systematic guidance for targeted engineering and policy interventions. Its transparent, modular nature is designed to shift chemical LCA from a reliance on “unknown unknowns” to a collaboratively improvable mapping of “known unknowns”.

Keywords: Life Cycle Assessment, Organic Chemicals, Retrosynthesis, Machine Learning, Life Cycle Inventory, Environmental Impact, Chemical Database, Sustainability

Token usage

Total: 1,050,703 tokens (input 711,326 / output 339,377)

Model	Input	Output	Total
deepseek-chat	586,339	325,488	911,827
glm-4.7	124,987	13,889	138,876
