📊 ArXiv Research Report (2026-03-13)

Generated: 2026-03-13 13:28:51
Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

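Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6

📈 Paper statistics

Total fetched: 261 papers
Passing papers: 12 (4.6%)
In-depth analysis: 12 papers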
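
The threshold and pass rate above can be rechecked with a few lines of arithmetic. The sketch below is illustrative only: the function names `passing_score` and `pass_rate` are made up here, and it assumes the formula is applied exactly as written in the settings.

```python
def passing_score(total_weight: float, base: float = 5.0, per_weight: float = 0.8) -> float:
    """Passing threshold from the report: base + per_weight * total_weight."""
    return base + per_weight * total_weight


def pass_rate(passing: int, fetched: int) -> float:
    """Share of fetched papers that pass, as a percentage."""
    return 100.0 * passing / fetched


threshold = passing_score(27.0)   # 27 keywords, each with weight 1.0
rate = pass_rate(12, 261)         # 12 passing papers out of 261 fetched

print(f"passing score: {threshold:.1f}")
print(f"pass rate: {rate:.1f}%")
```

With the values in this report, the two printed figures agree with the stated 26.6 threshold and 4.6% pass rate.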

⭐ Detailed analysis of passing papers

1. Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural

Authors: Sizhong Qin, Ramon Elias Weber, Xinzheng Lu
Source: arxiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.11640v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

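Scoring rationale: The paper proposes HouseMind, a multimodal large language model for understanding, generating, and editing architectural floor plans. It is highly relevant to Large Language Models (10 points) because it is a multimodal LLM framework, and to Instruction Tuning (10 points) because the model achieves controllable generation through instruction tuning. It is rated relevant to AI for Science (10 points) because it applies AI to architectural design, which the scoring treats as a scientific application. It is partially relevant to Small Language Models (5 points) because the abstract describes the model as efficient and locally deployable; to Pre-training and SFT (5 points each) because training involves multimodal alignment and instruction tuning; and to reasoning (Chain of Thought and System 2 Thinking, 5 points each) because the task requires geometric and spatial reasoning. Other keywords such as MoE, RLHF, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

This work addresses the difficulty AI systems have with coherent spatial reasoning and controllable generation in architectural floor plan design. It proposes HouseMind, a multimodal large language model that uses discrete room-instance tokens and instruction tuning to synthesize coherent, controllable layouts from text instructions, and reports superior geometric validity and controllability in experiments.

Abstract (Chinese translation)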

建筑平面图设计需要对几何结构、语义信息与空间层级进行联合推理，这对当前人工智能系统仍构成重大挑战。尽管近期扩散模型与语言模型提升了视觉逼真度，但其在空间连贯推理与可控生成方面仍存在困难。本文提出HouseMind——一个多模态大语言模型，将平面图理解、生成与编辑统一于单一框架中。我们引入离散的房间实例（room-instance）标记来构建统一词汇表，从而连接布局设计与符号推理。通过多模态对齐与指令微调，该模型能够根据文本指令生成连贯且可控的平面布局。实验表明，该框架在保持高效性与本地可部署性的同时，实现了更优的几何有效性与生成可控性。

Abstract

Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show how the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.

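Keywords: multimodal large language model, architectural floor plans, discrete room-instance tokens, instruction tuning, controllable generation, spatial reasoning, HouseMind, geometric validity

In-depth analysis:

Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

Summary:

Architectural floor plan design requires joint reasoning over geometry, semantics, and spatial hierarchy, which is a major challenge for existing AI systems. The paper proposes HouseMind, a multimodal large language model that introduces discrete room-instance tokens and uses a VQ-VAE to convert layouts into a unified vocabulary, bridging symbolic reasoning and continuous geometry. Through multimodal alignment and instruction tuning, the model unifies understanding, generation, and editing in one framework. Experiments show that HouseMind outperforms existing baselines in geometric validity, controllability, and efficiency, supports local deployment, and converts text instructions into spatial layouts efficiently.

Innovations:

Proposes discrete room-instance tokenization: a VQ-VAE converts continuous geometric layouts into discrete token sequences, so the LLM can reason about space at the symbolic level.
Builds a unified multi-task framework that handles floor plan understanding, generation, and editing in a single model, removing the earlier separation between these tasks.
Designs a conditional room encoding mechanism that feeds outline context into each room's encoding, capturing spatial adjacency and improving global spatial coherence.
Delivers an efficient, locally deployable architecture with much lower compute cost than diffusion models, supporting real-time inference and on-premises use.

Method

!!! info

The technical approach has two core parts. First, a hierarchical VQ-VAE module tokenizes floor plans structurally, encoding the outline and the room instances separately and turning the image into a discrete token sequence. Second, a three-stage training pipeline (embedding initialization, multimodal pre-training, instruction tuning) trains the multimodal large language model to align text with the spatial representation. The model then performs understanding, generation, and editing through autoregressive generation.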
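
The quantization step at the heart of any VQ-VAE tokenizer can be sketched as a nearest-codebook lookup. This is a minimal illustration and not HouseMind's hierarchical encoder: the codebook, the feature vectors, and the function name `quantize` are all invented here, and a real model would learn the codebook and produce the features with a CNN encoder.

```python
from typing import List, Sequence


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for v in features:
        # Squared Euclidean distance to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


# Hypothetical 2-D codebook with four entries (token ids 0..3).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Hypothetical encoder outputs for three regions of a layout.
features = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]

tokens = quantize(features, codebook)
print(tokens)  # [0, 3, 2]
```

The resulting integer ids are what a language model would consume as extra vocabulary items alongside text tokens.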

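Key results:

HouseMind outperforms existing diffusion models and LLM-based baselines in geometric validity and semantic consistency.
The model can precisely control the spatial structure and semantic composition of a floor plan from natural-language instructions.
It supports local edits while preserving global spatial coherence.
Its compact architecture is efficient enough for real-time inference and on-device deployment.

Tech stack: Vector-Quantized Variational Autoencoder (VQ-VAE), Multimodal Large Language Model (MLLM), Autoregressive Modeling, Instruction Tuning (SFT), CNN Encoder/Decoder, Transformer Backbone

Strengths

Strongly unified: understanding, generation, and editing live in one model, which simplifies the workflow.
Highly controllable: discrete tokens enable fine-grained room-level control and text guidance.
Strong reasoning: combines the symbolic reasoning of an LLM with the geometric representation of a VQ-VAE.
Deployment-friendly: more compute-efficient than diffusion models and easy to run locally.

Limitations

Resolution limit: input and output images are 64×64 px, which may be too coarse for complex architectural detail.
Data dependence: performance depends on the RPLAN dataset, so data bias or poor generalization to other architectural styles is possible.
Topological constraints: spatial coherence improves, but strict building-code requirements (such as door placement and corridor width) may still need post-hoc validation.

Relevance to research interests:

The paper is highly relevant. It is an innovative application of LLMs to a vertical domain (architectural design), and it also innovates on the underlying technique by using a VQ-VAE to recast a visual-spatial problem as a language-token problem. It shows how deep learning can tackle complex spatial reasoning, matching the interest in both innovations in large-model principles and applications in scientific domains.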

2. Tiny Aya: Bridging Scale and Multilingual Depth

Authors: Alejandro R. Salamanca, Diana Abagyan, Daniel D’souza, Ammar Khairi, David Mora, Saurabh Dash, Viraat Aryabumi, Sara Rajaee, Mehrnaz Mofakhami, Ananya Sahu, Thomas Euyang, Brittawnya Prince, Madeline Smith, Hangyu Lin, Acyr Locatelli, Sara Hooker, Tom Kocmi, Aidan Gomez, Ivan Zhang, Phil Blunsom, Nick Frosst, Joelle Pineau, Beyza Ermis, Ahmet Üstün, Julia Kreutzer, Marzieh Fadaee
Source: arxiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.11510v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

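Scoring rationale: The paper centers on training and applying a small multilingual model (3.35B parameters). Highly relevant keywords are Small Language Models (the core subject), Pre-training (the foundation model), Post-training (region-aware post-training), Instruction Tuning (the instruction-tuned variant), and Large Language Models (the model belongs to the large-model family). Scaling Laws AND Data Quality scores 5 because the paper discusses data composition and an alternative scaling path, but these are not its core. Other keywords such as MoE, RLHF, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

Tiny Aya studies how efficient training strategies and data composition can yield a small multilingual model with only 3.35B parameters that achieves state-of-the-art translation quality plus strong multilingual understanding and generation across 70 languages. The release includes a base model, an instruction-tuned variant, and region-specialized models.

Abstract (Chinese translation)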

Tiny Aya重新定义了小型多语言模型所能达到的边界。该模型基于70种语言进行训练，并通过区域感知的后训练阶段进行精调，仅以3.35B参数规模便在翻译质量上达到业界领先水平，并具备强大的多语言理解能力与高质量的目标语言生成能力。本次发布包含一个预训练基础模型、一个全球平衡的指令微调版本，以及三个针对非洲、南亚、欧洲、亚太和西亚地区语言进行专门优化的区域定制模型。本报告详细阐述了Tiny Aya背后的训练策略、数据构成与综合评估框架，并提出了一条以效率为核心、注重语言间性能均衡且兼顾实际部署需求的多语言人工智能发展新路径。

Abstract

Tiny Aya redefines what a small multilingual language model can achieve. Trained on 70 languages and refined through region-aware posttraining, it delivers state-of-the-art in translation quality, strong multilingual understanding, and high-quality target-language generation, all with just 3.35B parameters. The release includes a pretrained foundation model, a globally balanced instruction-tuned variant, and three region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia. This report details the training strategy, data composition, and comprehensive evaluation framework behind Tiny Aya, and presents an alternative scaling path for multilingual AI: one centered on efficiency, balanced performance across languages, and practical deployment.

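Keywords: small multilingual language model, 3.35B parameters, 70 languages, region-aware posttraining, instruction-tuned variant, translation quality, multilingual understanding, practical deployment

In-depth analysis:

Tiny Aya: Bridging Scale and Multilingual Depth

Summary:

The paper introduces Tiny Aya, an efficient open-weight family of multilingual models that reaches state-of-the-art performance across 70 languages with only 3.35B parameters. It targets the performance inequality of current multilingual models, which tend to favor high-resource languages. The authors take a data-centric approach, including a tokenizer trained with a dedicated data-weighting scheme so that low-resource languages are fairly represented, and a balanced language-grouping strategy. The family comprises a base model, a global instruction-tuned model, and three region-specialized models. Evaluation shows that Tiny Aya is competitive with larger models on translation, understanding, and generation, substantially narrows gaps between languages, and performs well on safety and cultural awareness, offering a new path toward efficient, inclusive multilingual AI.

Innovations:

Efficient multilingual architecture: reaches SOTA performance across 70 languages with a small 3.35B-parameter model, challenging the practice of relying on scale alone.
Region-aware post-training: introduces region-specialized models (Earth, Fire, Water) that keep a shared multilingual foundation while optimizing for specific language clusters (Africa, South Asia, Asia-Pacific/Europe).
Balanced tokenizer design: trains a single tokenizer with a dedicated data-weighting scheme (combining the data distribution with language buckets) to give low-resource languages fair representation and good compression.
Synthetic data pipeline: uses multi-stage synthetic data generation, including translation, prompt-level transformation, and FusioN, to widen language coverage, improve naturalness, and reduce English-centric bias.

Method

!!! info

The method builds a large multilingual tokenizer that balances the data distribution with a language-bucket weighting strategy. In pre-training, it handles 70 languages plus code data through balanced language grouping and a "Cooldown" strategy (upsampling high-quality data). In post-training, it clusters languages into 5 regions, creates balanced datasets with synthetic data generation (translation, prompt transformation, FusioN), and trains global and regional models that are then merged. Evaluation uses a comprehensive multilingual benchmark suite covering translation, understanding, reasoning, generation, and safety.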

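Key results:

Tiny Aya Global outperforms Gemma3-4B on 46 of the 55 WMT24++ languages.
On open-ended generation (mDolly), it leads the next-best competitor by 5 points on average.
Region-specialized models improve translation quality by 5.5 ChrF points for South Asia and 1.7 points for Africa.
It reaches a 91.1% mean safe-response rate on MultiJail while keeping safety strong across languages.
The tokenizer achieves the lowest or near-lowest average tokens per character on most scripts, and does especially well on under-represented scripts such as Khmer and Telugu.

Tech stack: Model architecture: 3.35B parameters (Transformer); Tokenizer: 262k vocabulary, GPT-4o regex pre-tokenization; Algorithms: weighted data-mixing scheme ($w_i = \frac{w^d_i \cdot w^b_i}{\sum_n w^d_n \cdot w^b_n}$), model merging; Tools/models: command-a-translate and deepseek-v3 (for translation), xCOMET-XL and AfriCOMET (for quality estimation); Datasets: Fineweb-2, WMT24++, mDolly, mArenaHard, GlobalMGSM, Flores, GlobalMMLU, MultiJail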
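
The data-mixing formula in the tech stack multiplies a data-distribution weight by a language-bucket weight for each language and normalizes over all languages. A minimal sketch follows; the function name `mixture_weights` and the example numbers are invented for illustration and do not come from the paper.

```python
from typing import List, Sequence


def mixture_weights(data_w: Sequence[float], bucket_w: Sequence[float]) -> List[float]:
    """Compute w_i = (w^d_i * w^b_i) / sum_n (w^d_n * w^b_n)."""
    products = [d * b for d, b in zip(data_w, bucket_w)]
    total = sum(products)
    return [p / total for p in products]


# Hypothetical inputs: one high-resource and two lower-resource languages.
data_w = [0.7, 0.2, 0.1]     # share of raw data per language
bucket_w = [1.0, 2.0, 4.0]   # bucket boost for under-represented languages

weights = mixture_weights(data_w, bucket_w)
print([round(w, 3) for w in weights])
```

In this toy setup the bucket boost lifts the two smaller languages to equal sampling weight, which shows how the bucket term can offset an imbalanced raw distribution.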

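Strengths

Efficient: high performance at a small parameter count, well suited to practical deployment.
Fair: substantially narrows the performance gap between high-resource and low-resource languages.
Thorough evaluation: the framework is broad, covering translation, reasoning, safety, and cultural awareness.
Flexible: base, global, and region-specialized models serve different use cases.
Data-centric: emphasizes data quality and balance rather than model scale alone.

Limitations

Scale limit: despite its efficiency, a 3.35B-parameter model may still lack some of the complex reasoning ability of very large models (for example 100B+).
Synthetic data dependence: heavy reliance on translation and synthetic generation may introduce artifacts or inherit biases from the "teacher" models.
Coarse regional clustering: grouping languages into 5 regions may flatten nuances of individual languages or dialects.
Evaluation limits: the benchmarks are broad but may not fully capture real-world nuance or the specific cultural context of all 70 languages.

Relevance to research interests:

The paper is highly relevant to "innovations in the principles of large models and deep learning". It examines efficient architecture design, tokenizer optimization, data-mixing strategy, and post-training alignment for multilingual models, all of which concern the underlying principles of large models. Although it focuses on general language tasks rather than a specific scientific domain, its efficient training methods and data-balancing strategies are a useful reference for large-model applications in scientific computing.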

3. Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

Authors: Yulu Gan, Phillip Isola
Source: arxiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.12228v1

Score: 51.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	8.0/10	8.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

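Scoring rationale: The paper studies how densely task experts are distributed in the parameter space around pretrained models. It is rated highly relevant to "Mixture of Experts" (10 points) because it directly discusses expert solutions; highly relevant to "Pre-training" and "Post-training" (10 points each) because it focuses on pretrained weights and a parallel post-training method; relevant to "Large Language Models" (8 points) because it concerns properties of large-scale models; relevant to "Model Merging" (8 points) because it combines models by ensembling predictions; and partially relevant to "Small Language Models" (5 points) because it contrasts the small-model case. Other keywords such as Scaling Laws, RLHF, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that in large pretrained models the density of task-expert solutions rises sharply near the pretrained weights. It proposes a simple parallel post-training method that randomly samples parameter perturbations, selects the best ones, and ensembles their predictions, with performance competitive with standard post-training methods such as PPO and GRPO.

Abstract (Chinese translation)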

预训练产生的学习参数向量通常被视为后续迭代适应的起点。在本研究中，我们提出将预训练结果视为参数向量上的分布，其支撑集已包含任务特定的专家模型。我们证明，在小型模型中此类专家解仅占据该分布体积的极小部分，因此其发现依赖于梯度下降等结构化优化方法。相比之下，在大型且充分预训练的模型中，任务专家的密度显著增加，使得多样化、能提升任务性能的专家模型大量分布于预训练权重邻域内。基于此视角，我们探索了一种完全并行的简单后训练方法：随机采样 $N$ 个参数扰动，选取最优的 $K$ 个样本，并通过多数投票进行预测集成。尽管方法简单，该策略在当代大规模模型中与PPO、GRPO、ES等标准后训练方法相比仍具有竞争力。

Abstract

Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models the density of task-experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples $N$ parameter perturbations at random, selects the top $K$, and ensembles predictions via majority vote. Despite its simplicity, this approach is competitive with standard post-training methods such as PPO, GRPO, and ES for contemporary large-scale models.

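Keywords: pretraining, task experts, parameter distribution, post-training, model ensembling, large-scale models, parameter perturbations, majority vote

In-depth analysis:

Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

Summary:

The paper proposes viewing a pretrained model's parameters as a distribution rather than a single fixed point. It finds that large pretrained models are surrounded by a dense "thicket" region containing many expert solutions that improve performance on specific tasks, and that the density and diversity of these solutions grow markedly with model scale. Building on this, the authors propose a post-training method called RandOpt, which randomly samples weight perturbations, filters them, and ensembles their predictions. It performs on par with more complex algorithms such as PPO and GRPO while needing only O(1) training steps, which makes it highly efficient.

Innovations:

Proposes the "neural thicket" view: diverse task experts are densely distributed around the pretrained weights of large models, and their density and diversity grow with model scale following a power law.
Identifies a fundamental difference between the loss landscapes of large and small models: small models sit in a sparse, needle-in-a-haystack solution space, while large models sit in a solution-dense "thicket".
Proposes the RandOpt algorithm, which replaces complex gradient-based optimization with random sampling and ensembling, showing that random guessing is a viable post-training method for large models.

Method

!!! info

The paper first applies Gaussian noise perturbations to pretrained models of different sizes, such as Qwen2.5, and evaluates them on tasks including math, coding, writing, and chemistry. It quantifies the density and diversity of weight space by defining "solution density" and "spectral discordance". It then proposes RandOpt: randomly generate N weight perturbations, evaluate them on a validation set and keep the top K, and ensemble the predictions by majority vote. Experiments compare RandOpt with mainstream algorithms such as PPO, GRPO, and ES.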
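
The sample, select, and vote loop described for RandOpt can be sketched on a toy problem. The code below is not the authors' implementation: it swaps the language model for a two-weight linear classifier, and the names `sample_and_select` and `ensemble_predict`, the noise scale, and the synthetic data are all assumptions made for illustration.

```python
import random
from collections import Counter


def predict(weights, x):
    """Toy linear classifier: label 1 if the dot product is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def accuracy(weights, data):
    return sum(predict(weights, x) == y for x, y in data) / len(data)


def sample_and_select(base, val_data, n, k, sigma, rng):
    """Sample n Gaussian perturbations of `base` and keep the k best on val_data."""
    candidates = [[w + rng.gauss(0.0, sigma) for w in base] for _ in range(n)]
    candidates.sort(key=lambda c: accuracy(c, val_data), reverse=True)
    return candidates[:k]


def ensemble_predict(members, x):
    """Majority vote over the selected members."""
    votes = Counter(predict(m, x) for m in members)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
val_data = [(p, 1 if p[0] - p[1] > 0 else 0) for p in points]

base = [1.0, 0.0]  # stand-in for "pretrained" weights, deliberately imperfect
members = sample_and_select(base, val_data, n=64, k=5, sigma=0.5, rng=rng)

# Accuracy is measured on the same data used for selection, so it is optimistic.
ensemble_acc = sum(ensemble_predict(members, x) == y for x, y in val_data) / len(val_data)
print(f"base: {accuracy(base, val_data):.2f}, ensemble: {ensemble_acc:.2f}")
```

Each candidate is evaluated independently, so the sampling step parallelizes trivially; the price is that inference runs all K selected members, which matches the K-fold inference cost listed among the limitations.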

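Key results:

Solution density increases monotonically with model scale: performance-improving weight perturbations are easier to find around larger models.
Solution diversity increases with model scale: random perturbations tend to become specialists (improving one task while degrading others) rather than generalists.
RandOpt reaches accuracy comparable to PPO and GRPO on tasks such as CountDown and GSM8K, with O(1) training-time complexity and very high parallel efficiency.

Tech stack: RandOpt (Random Optimization), Gaussian Perturbation, Ensembling (Majority Vote), PPO (Proximal Policy Optimization), GRPO (Group Relative Policy Optimization), ES (Evolution Strategies), PCA (Principal Component Analysis), K-means Clustering, Qwen2.5, Olmo-3-7B-Instruct

Strengths

Deep theoretical insight: shows how pre-training scale changes the geometry of the loss landscape, which helps explain why large models are easier to fine-tune.
Minimal and efficient method: RandOpt drops gradient computation and is fully parallel, which cuts training time substantially.
Solid experimental evidence: validates the scaling behavior of density and diversity across multiple model sizes and tasks.

Limitations

Higher inference cost: ensembling K models makes inference compute and latency K times that of the baseline.
Depends on a strong pretrained base: the method mainly suits large models already in the "thicket" regime and has limited effect on small models.
Not aimed at SOTA: the authors present it mainly as a probing tool, and on some tasks it may trail carefully tuned gradient-based methods.

Relevance to research interests:

The paper is highly relevant to "innovations in the principles of large models and deep learning". It challenges the conventional fine-tuning paradigm and explains the trainability of large models from first principles, a significant contribution to the foundations of deep learning. Although math, chemistry, and other scientific tasks serve as validation sets, its core contribution is the algorithmic principle rather than a specific scientific application.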

4. When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows

Authors: Wenxian Yang, Hanzheng Qiu, Bangqun Zhang, Chengquan Li, Zhiyong Huang, Xiaobin Feng, Rongshan Yu, Jiahong Dong
Source: arxiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.11721v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

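Scoring rationale: The paper studies LLM agents in clinical workflows. It is highly relevant to "Large Language Models", "LLM Agents", "Tool Use", and "Multi-agent Systems" (10 points each) because it explicitly discusses LLM agents, tool invocation, and multi-agent coordination. It is relevant to "AI for Science" (10 points) because it focuses on AI applications in healthcare. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper targets the reliability, security, and long-term memory shortcomings of deploying LLM agents in clinical settings. It proposes an architecture built on a restricted execution environment, document-centric interaction, page-indexed memory, and a medical skills library, aiming at a safe, transparent, and auditable agentic operating system for hospitals.

Abstract (Chinese translation)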

大型语言模型（LLM）智能体通过整合推理、工具调用与持久记忆，扩展了传统生成模型的能力。近期研究表明，此类智能体可通过自动化文档处理、协调诊疗流程及辅助医疗决策，显著改善临床工作流。然而，尽管进展迅速，由于可靠性局限、安全风险及长期记忆机制不足，在医疗环境中部署自主智能体仍面临挑战。本研究提出一种适用于医院环境的LLM智能体架构。该设计引入四个核心组件：受Linux多用户系统启发的受限执行环境；连接患者与临床医生智能体的以文档为中心的交互范式；专为长期临床情境管理设计的页面索引记忆架构；以及支持临床任务序列按需组合的精选医疗技能库。该架构并非赋予智能体无限制的系统访问权限，而是通过预定义技能接口和资源隔离来约束其行为。我们认为，此类系统构成了“医院智能体操作系统”的基础，即一个能够协调临床工作流，同时保障安全性、透明性与可审计性的计算层。本研究基于OpenClaw（一个将智能体能力构建为离散技能精选库的开源自主智能体框架）实现设计，并通过临床安全部署所需的基础设施级约束对其进行扩展。

Abstract

Large language model (LLM) agents extend conventional generative models by integrating reasoning, tool invocation, and persistent memory. Recent studies suggest that such agents may significantly improve clinical workflows by automating documentation, coordinating care processes, and assisting medical decision making. However, despite rapid progress, deploying autonomous agents in healthcare environments remains difficult due to reliability limitations, security risks, and insufficient long-term memory mechanisms. This work proposes an architecture that adapts LLM agents for hospital environments. The design introduces four core components: a restricted execution environment inspired by Linux multi-user systems, a document-centric interaction paradigm connecting patient and clinician agents, a page-indexed memory architecture designed for long-term clinical context management, and a curated medical skills library enabling ad-hoc composition of clinical task sequences. Rather than granting agents unrestricted system access, the architecture constrains actions through predefined skill interfaces and resource isolation. We argue that such a system forms the basis of an Agentic Operating System for Hospital, a computing layer capable of coordinating clinical workflows while maintaining safety, transparency, and auditability. This work grounds the design in OpenClaw, an open-source autonomous agent framework that structures agent capabilities as a curated library of discrete skills, and extends it with the infrastructure-level constraints required for safe clinical deployment.

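Keywords: LLM agents, clinical workflows, hospital environment, autonomous agents, tool invocation, multi-agent systems, medical decision making, agentic operating system

In-depth analysis:

When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows

Summary:

The paper addresses the reliability, security, and long-term memory shortcomings that LLM agents face when deployed in healthcare, and proposes an agentic operating system architecture for hospitals. Built on the OpenClaw framework, the architecture introduces four core components: a restricted execution environment inspired by Linux multi-user systems, a document-centric interaction paradigm, a page-indexed memory architecture for long-term clinical context management, and a medical skills library that supports ad-hoc composition of clinical tasks. The system constrains agent behavior through predefined skill interfaces and resource isolation. It aims to provide a computing layer that coordinates clinical workflows while preserving safety, transparency, and auditability, addressing the inability of traditional hospital information systems to handle long-tail clinical needs.

Innovations:

Proposes a restricted execution environment that borrows the permission model of Linux multi-user systems to limit agents' system access, supporting safety and auditability in clinical settings.
Designs a page-indexed memory architecture that drops conventional flat vector retrieval and instead exposes the document hierarchy through human-readable manifest files, preserving the longitudinal, time-ordered dependencies in clinical records.
Establishes a document-centric multi-agent coordination model in which agents for roles such as doctor, nurse, and patient interact through shared structured documents rather than a single conversational interface.
Introduces a medical skills library with ad-hoc task composition, letting agents dynamically combine fine-grained clinical capabilities to handle long-tail clinical variation that fixed-workflow systems cannot.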
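
One way to picture page-indexed memory is a human-readable manifest that lists pages in chronological order, from which the agent picks pages to open. The sketch below is a guess at the general shape, not the paper's design: `PageEntry`, `render_manifest`, and `select_pages` are hypothetical names, the records are invented, and a simple keyword match stands in for the LLM's reasoning over the manifest.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PageEntry:
    page_id: str
    date: str      # ISO date, so string order equals chronological order
    summary: str


def render_manifest(entries: List[PageEntry]) -> str:
    """Produce a plain-text manifest, oldest page first."""
    ordered = sorted(entries, key=lambda e: e.date)
    return "\n".join(f"{e.page_id}\t{e.date}\t{e.summary}" for e in ordered)


def select_pages(entries: List[PageEntry], keyword: str) -> List[str]:
    """Stand-in for the LLM navigation step: keyword match, chronological order."""
    hits = [e for e in entries if keyword.lower() in e.summary.lower()]
    return [e.page_id for e in sorted(hits, key=lambda e: e.date)]


# Invented example records.
entries = [
    PageEntry("p3", "2026-02-10", "Discharge summary"),
    PageEntry("p1", "2026-01-05", "Admission note: chest pain"),
    PageEntry("p2", "2026-01-20", "Follow-up: chest pain resolved"),
]

print(render_manifest(entries))
print(select_pages(entries, "chest pain"))  # ['p1', 'p2']
```

Because the manifest is ordinary text in time order, the longitudinal structure of a record stays visible to both the agent and a human auditor, with no embedding index to maintain.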

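Method

!!! info

The paper uses a system-architecture design methodology. It first analyzes the limitations of existing LLM agent frameworks (such as ReAct and Reflexion) and RAG techniques in medical settings. It then extends the open-source OpenClaw framework with an architecture of four core components: restricted execution, page-indexed memory, document-based interaction, and a skills library. Finally, it argues how this architecture meets hospital requirements for safety, memory structure, and dynamic workflows.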
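
The idea of constraining agents to predefined skill interfaces, with every call recorded for audit, can be sketched as a small role-checked registry. This does not reflect OpenClaw's actual API: `Skill`, `SkillRegistry`, the role names, and the example skill are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Skill:
    name: str
    allowed_roles: Set[str]
    run: Callable[[dict], str]


@dataclass
class SkillRegistry:
    skills: Dict[str, Skill] = field(default_factory=dict)
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def register(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def invoke(self, role: str, name: str, args: dict) -> str:
        """Run a skill only if it exists and the role is allowed; log every attempt."""
        skill = self.skills.get(name)
        allowed = skill is not None and role in skill.allowed_roles
        self.audit_log.append((role, name, "allowed" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"role {role!r} may not invoke {name!r}")
        return skill.run(args)


registry = SkillRegistry()
registry.register(Skill(
    name="summarize_note",                      # hypothetical skill
    allowed_roles={"clinician"},
    run=lambda args: "summary of " + args["note_id"],
))

print(registry.invoke("clinician", "summarize_note", {"note_id": "n42"}))
try:
    registry.invoke("patient", "summarize_note", {"note_id": "n42"})
except PermissionError as exc:
    print("blocked:", exc)
print(registry.audit_log)
```

Agents never receive a general shell or file handle in this design; the registry is the only entry point, so the audit log is complete by construction.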

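Key results:

Identifies four key gaps in deploying existing agent frameworks in healthcare: no execution restrictions, a structural mismatch in memory systems, no modeling of multi-participant workflows, and fixed workflows that cannot handle long-tail needs.
Proposes the conceptual architecture of an "Agentic Operating System for Hospital" as the base computing layer for coordinating clinical workflows.
Argues the theoretical advantages of page-indexed memory for dynamic documents and long-term clinical context, avoiding the compute overhead of vector embeddings and context fragmentation.

Tech stack: Large Language Models (LLMs), OpenClaw (open-source autonomous agent framework), Linux multi-user system principles, Page-indexed memory, Retrieval-Augmented Generation (RAG) variants, Skill-based agent decomposition

Strengths

Well targeted: designed specifically for the privacy, audit, and structured-document needs of healthcare, addressing the poor fit of general-purpose agent frameworks in clinical deployment.
Architectural novelty: the page-indexed memory architecture challenges the dominant vector-retrieval paradigm and better matches the temporal, structured nature of clinical records.
Safe and controllable: the restricted execution environment and resource isolation make autonomous agents considerably safer to deploy in sensitive settings.
Flexible: dynamic composition from the skills library can handle complex, rare clinical cases that fixed-process systems do not cover.

Limitations

The paper focuses on architecture design and theoretical argument and does not yet provide clinical experimental data or performance evaluation from large-scale deployment.
Page-indexed memory relies entirely on the LLM's language reasoning to navigate documents, so weaker models may retrieve inefficiently or incorrectly.
Implementation complexity is high: compared with plug-and-play RAG systems, building and maintaining page-index manifests and the restricted environment takes more engineering effort.

Relevance to research interests:

The paper is highly relevant to the tracked keywords. It falls under biomedical AI within "applications of large models and deep learning in scientific domains", describing a concrete deployment architecture for LLM agents in clinical workflows. It also contributes to "innovations in the principles of large models and deep learning", in particular through the page-indexed memory architecture and the restricted execution environment, which extend existing agent infrastructure. Its approach to the long-tail problem in clinical settings is novel.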

5. One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries

Authors: Mayank Saini, Arit Kumar Bishwas
Source: arxiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.11545v1

Score: 46.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	8.0/10	8.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

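Scoring rationale: The paper proposes an agentic framework for autonomous multimodal query processing, built around a central supervisor that coordinates specialized tools across text, image, audio, video, and document modalities. Relevance by keyword: 1) Highly relevant (10 points): "Small Language Models (SLMs)" (the abstract explicitly mentions SLM-assisted modality decomposition), "LLM Agents/Autonomous Agents" (the framework is agentic AI by design), and "Tool Use/Function Calling" (the framework coordinates tools across modalities). 2) Moderately relevant (8 points): "Large Language Models" (uses RouteLLM for learned routing) and "Multi-agent Systems" (a central supervisor coordinates multiple tool agents). 3) Not relevant (0 points): the remaining keywords are not covered; the paper focuses on a multimodal tool-coordination framework rather than specific techniques such as model training, inference optimization, or alignment.

!!! tip deepseek-chat TL;DR

The paper proposes an agentic framework for autonomous multimodal query processing in which a central supervisor dynamically coordinates modality-specific tools. Compared with a hierarchical baseline, it reports a 72% reduction in time-to-accurate-answer, an 85% reduction in conversational rework, and a 67% cost reduction at comparable accuracy.

Abstract (Chinese translation)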

我们提出一种用于自主多模态查询处理的智能体AI框架，该框架能够协调跨文本、图像、音频、视频和文档模态的专用工具。一个中央监督器（Supervisor）动态分解用户查询，将子任务委派给适配相应模态的工具（例如目标检测、OCR、语音转录），并通过自适应路由策略而非预定的决策树来综合结果。针对纯文本查询，该框架使用通过RouteLLM学习到的路由机制；而对于非文本路径，则采用SLM辅助的模态分解方法。在涵盖15个任务类别的2,847个查询上进行评估后，我们的框架在保持准确率相当的前提下，相比匹配的层次化基线，实现了准确答案获取时间减少72%、对话返工减少85%以及成本降低67%。这些结果表明，智能化的集中式编排从根本上改善了多模态AI部署的经济性。

Abstract

We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries across 15 task categories, our framework achieves 72% reduction in time-to-accurate-answer, 85% reduction in conversational rework, and 67% cost reduction compared to the matched hierarchical baseline while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration fundamentally improves multimodal AI deployment economics.

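Keywords: agentic AI framework, autonomous multimodal query processing, tool orchestration, modality decomposition, RouteLLM, SLM-assisted, adaptive routing, centralized Supervisor

In-depth analysis:

One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries

Summary:

The paper proposes an agentic AI framework for autonomous multimodal query processing. To address the high cost of single monolithic models and the rigidity and brittleness of hierarchical routing systems, it uses a central Supervisor that dynamically decomposes user queries and delegates subtasks to specialized tools for text, image, audio, and other modalities. Text queries use learned routing via RouteLLM, and non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries, the framework cuts time-to-accurate-answer by 72%, conversational rework by 85%, and cost by 67% while maintaining accuracy. This indicates that intelligent centralized orchestration can substantially improve the economics and scalability of multimodal AI deployment.

Innovations:

Proposes a central Supervisor architecture that replaces predefined decision trees with dynamic task decomposition and adaptive routing, addressing the brittleness of traditional hierarchical routing.
Designs a tool-coordination mechanism based on formal interfaces (type signatures, preconditions and postconditions, latency priors), enabling plug-and-play tools and dynamic composition.
Implements a hybrid routing strategy: learned routing via RouteLLM for text queries and SLM-assisted modality decomposition for non-text queries, improving processing efficiency.
Introduces "Time-to-Accurate-Answer" as a composite optimization metric that accounts for both response latency and the probability of rework, rather than optimizing per-query cost alone.

Method

!!! info

The paper builds a central orchestration architecture in which the Supervisor reads the formal specifications of the tools and makes context-aware routing decisions based on query features and the state of its history memory. Text queries use RouteLLM for model-level routing; multimodal queries use SLM-assisted modality decomposition. The study evaluates the framework on a dataset of 2,847 queries across 15 task categories and compares it with a matched hierarchical baseline.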
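
Routing over formal tool specifications can be illustrated with a small rule: among tools whose input modality matches and whose precondition holds, pick the one with the lowest latency prior. This is a simplified stand-in and not the paper's Supervisor, which uses RouteLLM and an SLM for its decisions; `ToolSpec`, `route`, `supervise`, and the example tools are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical formal interface for one tool."""
    name: str
    input_modality: str                     # simplified type signature
    latency_prior_s: float                  # expected latency in seconds
    precondition: Callable[[Dict], bool]    # must hold before the tool runs
    run: Callable[[Dict], str]


def route(subtask: Dict, tools: List[ToolSpec]) -> ToolSpec:
    """Pick the eligible tool with the lowest latency prior."""
    eligible = [t for t in tools
                if t.input_modality == subtask["modality"] and t.precondition(subtask)]
    if not eligible:
        raise LookupError(f"no tool accepts subtask: {subtask}")
    return min(eligible, key=lambda t: t.latency_prior_s)


def supervise(subtasks: List[Dict], tools: List[ToolSpec]) -> List[str]:
    """Delegate each subtask to its routed tool and collect the results."""
    return [route(s, tools).run(s) for s in subtasks]


tools = [
    ToolSpec("ocr_fast", "image", 0.2, lambda s: s.get("has_text", False),
             lambda s: "ocr_fast:" + s["id"]),
    ToolSpec("detector", "image", 0.5, lambda s: True,
             lambda s: "detector:" + s["id"]),
    ToolSpec("transcriber", "audio", 1.0, lambda s: True,
             lambda s: "transcriber:" + s["id"]),
]
subtasks = [
    {"id": "a", "modality": "image", "has_text": True},
    {"id": "b", "modality": "image"},
    {"id": "c", "modality": "audio"},
]

results = supervise(subtasks, tools)
print(results)  # ['ocr_fast:a', 'detector:b', 'transcriber:c']
```

Adding a tool means appending one `ToolSpec`; the routing rule itself is untouched, which is the plug-and-play property the formal interfaces are meant to provide.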

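Key results:

Median time-to-accurate-answer fell by 72% (interquartile range 65–77%).
Conversational rework that requires user clarification or correction fell by 85%.
Calls to expensive models fell by 67%, with a marked reduction in overall cost.
Under realistic load, concurrent throughput rose by 20% (54 vs 45 q/s).
At comparable accuracy, perception tasks (such as object detection) saw markedly lower latency when routed to specialized models.

Tech stack: RouteLLM, SLM (Small Language Models), OCR (Optical Character Recognition), Object Detection Models, Speech Transcription Models, Formal Interface Specifications (Type Signatures, Preconditions, Postconditions)

Strengths

Substantially improves the economics and responsiveness of multimodal AI systems and lowers operating cost.
Addresses the brittleness of rule-based or predefined-tree routing and can handle unforeseen query patterns.
The architecture is extensible and reusable: the Supervisor is itself a composable component, supporting recursive orchestration.
Local failure recovery avoids cascading failures and global restarts, improving robustness.

Limitations

The Supervisor's own reasoning ability can become the system bottleneck, since its decision quality directly affects overall performance.
Maintaining formal interface specifications for tools (such as preconditions and postconditions) adds development and maintenance overhead.
For extremely novel or complex modality combinations, adaptivity still depends on how well the Supervisor generalizes.

Relevance to research interests:

Highly relevant. The paper mainly concerns innovations in large-model techniques, especially orchestration, routing, and agent architecture for multimodal AI. It addresses the cost and efficiency of deploying large models in practice and is a system-architecture-level innovation, matching the interest in innovations in large-model principles.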

6. CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable

Authors: Ziqi Ye, Ziyang Gong, Ning Liao, Xiaoxing Hu, Di Wang, Hongruixuan Chen, Chen Huang, Yiguo He, Yuru Jia, Xiaoxing Wang, Haipeng Wang, Xue Yang, Junchi Yan
Source: arxiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.12008v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	15.0/10	15.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

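Scoring rationale: The paper presents CrossEarth-SAR, a billion-scale vision foundation model for cross-domain semantic segmentation of synthetic aperture radar (SAR) imagery. Its core innovation is a physics-guided sparse mixture-of-experts (MoE) architecture, so it is highly relevant to "Mixture of Experts" (15 points). As a vision foundation model for SAR, it relates to "Large Language Models" and "Foundation Models" (10 points). The work involves large-scale pre-training, which relates to "Pre-training" (10 points), and it applies AI to Earth observation science, which relates to "AI for Science" (10 points). The paper does not address language models, reasoning, alignment, fine-tuning, agents, efficiency optimization, or other specific techniques, so the remaining keywords score 0.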
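The scoring arithmetic can be checked with a short sketch. The pass-mark formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section. Treating a row's score as weight × relevance is an assumption inferred from the table, and the function names are hypothetical.

```python
MAX_PER_KEYWORD = 15.0   # per-keyword cap from the report configuration
TOTAL_WEIGHT = 27.0      # 27 keywords, each with weight 1.0


def pass_mark(total_weight, base=5.0, slope=0.8):
    """Pass mark as stated in the configuration: 5.0 + 0.8 * total weight."""
    return base + slope * total_weight


def paper_score(rows):
    """Sum weight * relevance over keyword rows (assumed scoring rule)."""
    return sum(w * min(r, MAX_PER_KEYWORD) for w, r in rows)


# Non-zero rows for CrossEarth-SAR: LLM/Foundation Models, MoE,
# Pre-training, AI for Science; the other 23 keywords score 0.
crossearth_rows = [(1.0, 10.0), (1.0, 15.0), (1.0, 10.0), (1.0, 10.0)]
crossearth_rows += [(1.0, 0.0)] * 23

score = paper_score(crossearth_rows)
threshold = pass_mark(TOTAL_WEIGHT)
passed = score >= threshold
```

Under these assumptions the four non-zero rows sum to 45.0, which matches the reported score and clears the 26.6 pass mark.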

!!! tip deepseek-chat TL;DR

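To address the poor cross-domain generalization of semantic segmentation on synthetic aperture radar (SAR) imagery caused by sensor and regional differences, the paper proposes CrossEarth-SAR, the first billion-scale SAR vision foundation model. It uses a novel physics-guided sparse mixture-of-experts (MoE) architecture, is pre-trained on a purpose-built large-scale dataset, and achieves state-of-the-art results on multiple cross-domain benchmarks.

Abstract (Chinese translation)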

合成孔径雷达（SAR）能够实现全球、全天候的地球观测。然而，由于成像机制的多样性，不同传感器和区域之间的域偏移严重阻碍了其语义泛化能力。为解决这一问题，我们提出了CrossEarth-SAR，这是首个基于新型物理引导稀疏专家混合（MoE）架构构建的十亿级SAR视觉基础模型，该架构融合了物理描述符，并专为跨域语义分割而设计。为促进大规模预训练，我们开发了CrossEarth-SAR-200K数据集，这是一个包含公开和私有SAR影像的弱监督与全监督统一数据集。我们还引入了一套基准测试集，涵盖8个不同域差距下的22个子基准，为SAR影像的域泛化语义分割建立了首个统一标准。大量实验表明，CrossEarth-SAR在20个基准测试中取得了最先进的结果，在多差距迁移场景下的部分基准上，其平均交并比（mIoU）超越先前方法超过10%。所有代码、基准测试集和数据集都将公开提供。

Abstract

Synthetic Aperture Radar (SAR) enables global, all-weather earth observation. However, owing to diverse imaging mechanisms, domain shifts across sensors and regions severely hinder its semantic generalization. To address this, we present CrossEarth-SAR, the first billion-scale SAR vision foundation model built upon a novel physics-guided sparse mixture-of-experts (MoE) architecture incorporating physical descriptors, explicitly designed for cross-domain semantic segmentation. To facilitate large-scale pre-training, we develop CrossEarth-SAR-200K, a weakly and fully supervised dataset that unifies public and private SAR imagery. We also introduce a benchmark suite comprising 22 sub-benchmarks across 8 distinct domain gaps, establishing the first unified standard for domain generalization semantic segmentation on SAR imagery. Extensive experiments demonstrate that CrossEarth-SAR achieves state-of-the-art results on 20 benchmarks, surpassing previous methods by over 10% mIoU on some benchmarks under multi-gap transfer. All code, benchmark and datasets will be publicly available.

Keywords: Synthetic Aperture Radar (SAR), Foundation Model, Mixture of Experts (MoE), Domain Generalization, Semantic Segmentation, Large-scale Pre-training, Cross-domain, Geospatial

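Deep analysis:

CrossEarth-SAR: A SAR-Centric, Billion-Scale Geospatial Foundation Model for Domain-Generalizable Semantic Segmentation

Summary:

SAR imagery suffers from severe domain shift across sensors and regions, which makes semantic generalization difficult. The paper proposes CrossEarth-SAR, the first billion-scale vision foundation model designed for domain-generalizable semantic segmentation of SAR imagery. To cope with the extreme diversity of SAR data, the authors design a novel physics-guided sparse mixture-of-experts architecture that stabilizes expert routing by introducing SAR physical descriptors. They also build the CrossEarth-SAR-200K dataset of 200,000 weakly and fully supervised images and establish a standardized evaluation suite of 22 sub-benchmarks covering 8 domain gaps. Experiments show that CrossEarth-SAR achieves state-of-the-art (SOTA) results on 20 benchmarks and, on some multi-gap transfer benchmarks, improves mIoU by more than 10% over previous methods.

Innovations:

Proposes the first billion-parameter SAR vision foundation model; its sparse mixture-of-experts architecture greatly increases model capacity while keeping compute cost under control.
Designs a physics-guided routing mechanism that uses SAR-specific physical descriptors (such as directional entropy and equivalent number of looks) to assist expert selection, which addresses the routing instability caused by heterogeneous SAR data.
Builds the CrossEarth-SAR-200K dataset, which integrates public and private SAR data and supports large-scale continued pre-training.
Establishes a comprehensive benchmark of 22 sub-benchmarks across 8 domain gaps, providing a unified evaluation standard for domain-generalizable semantic segmentation of SAR imagery.

Method

!!! info

The paper uses a DINOv2-based Vision Transformer (ViT) as the backbone and replaces its feed-forward networks (FFN) with sparse mixture-of-experts layers. To stabilize routing, SAR physical operators compute three key descriptors: imaging geometry (directional entropy, HDE), radar system (equivalent number of looks, ENL), and terrain scattering differences. These descriptors are combined with the token embeddings and fed to the router, which selects the most suitable experts to activate. The model uses Mask2Former as the decoder and is jointly optimized on CrossEarth-SAR-200K with a segmentation loss and a load-balancing loss.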
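A rough sketch of descriptor-augmented routing follows. The ENL estimate uses the standard definition for a homogeneous intensity patch (squared mean divided by variance). Everything else is an assumption for illustration: the report only says the descriptors are "combined" with token embeddings, so concatenation, the linear router, and the names `route_token` and `router_weights` are hypothetical rather than the paper's implementation.

```python
import math
from statistics import fmean, pvariance


def equivalent_number_of_looks(intensity):
    """ENL of a homogeneous intensity patch: mean^2 / variance."""
    mu = fmean(intensity)
    var = pvariance(intensity, mu)
    return mu * mu / var


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token_emb, descriptors, router_weights, k=2):
    """Score experts from [token embedding ; physical descriptors], keep top-k.

    Concatenation is an assumed way of combining the two inputs.
    Returns (expert_index, gate) pairs whose gates sum to 1.
    """
    features = list(token_emb) + list(descriptors)
    logits = [sum(w * f for w, f in zip(row, features)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])
    return list(zip(top, gates))


enl = equivalent_number_of_looks([1.0, 3.0])   # mean 2, variance 1
# Three experts; each row weights [emb_0, emb_1, ENL] (toy values).
router_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
selected = route_token([1.0, 0.0], [enl], router_weights, k=2)
```

Here the ENL descriptor dominates the toy logits, so the third expert is ranked first; in a trained model the router weights would be learned jointly with the load-balancing loss mentioned above.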

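Key results:

Achieves state-of-the-art (SOTA) performance on 20 of the 22 evaluation benchmarks.
On some multi-gap transfer benchmarks, improves mIoU by more than 10% over the previous best methods.
Physics-guided routing significantly improves generalization across sensors, bands, and polarization modes.
The model is more robust to the speckle noise and geometric distortion characteristic of SAR imagery.

Tech stack: DINOv2 (Backbone), Mask2Former (Decoder), Sparse Mixture-of-Experts (MoE), Vision Transformer (ViT), SAR Physical Descriptors (Directional Entropy, ENL), Load Balancing Loss, PyTorch (implied)

Strengths

Scale and architecture: brings a billion-parameter model to the SAR domain for the first time and combines it with a physics-guided sparse MoE architecture, easing the compute bottleneck of applying large models in a specialized domain.
Physics-aware deep learning: builds SAR's physical imaging mechanisms (such as geometry and scattering characteristics) into the network design, a close integration in the spirit of "AI for Science".
Data and benchmark contributions: the large-scale dataset and comprehensive benchmark fill a gap in SAR domain-generalization research and are valuable to the community.
Strong performance: shows clear gains across a wide range of domain gaps, demonstrating strong generalization.

Limitations

Compute requirements: even with sparse MoE, a billion-parameter model still needs substantial compute for training and inference, which may limit deployment on edge devices.
Data dependence: performance depends heavily on the large-scale pre-training dataset CrossEarth-SAR-200K; it includes some private data, and its coverage and annotation quality are critical to the final results.
Implementation complexity: computing physical descriptors and the more complex routing mechanism make the system harder to engineer.

Relevance to research direction:

The paper is highly relevant to the keywords. It is a concrete application of large models in a scientific field (Earth observation and remote sensing) and also innovates on large-model technical principles (a physics-guided sparse MoE architecture). It shows what deep learning techniques (ViT, MoE) can do on a hard scientific problem (SAR domain generalization) and uses physical priors to improve performance and interpretability, giving it high novelty and academic value.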

7. On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

Authors: Deyu Zou, Yongqiang Chen, Fan Feng, Mufei Li, Pan Li, Yu Gong, James Cheng Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12109v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

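Scoring rationale: The paper studies reinforcement-learning training of LLM agents on active reasoning tasks, in particular the "information self-locking" phenomenon. Highly relevant keywords are "Large Language Models" (the paper explicitly studies LLM agents), "LLM Agents" (the core object of study), and "Chain of Thought" and "System 2 Thinking" (the paper studies active and multi-step reasoning). "Self-Correction" is partially relevant because the paper injects directional critiques to help the agent improve. Other keywords such as MoE, SFT, RAG, and quantization are not mentioned in the abstract and are unrelated to the paper.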

!!! tip deepseek-chat TL;DR

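The paper studies how LLM agents trained with reinforcement learning for active reasoning tend to fall into "information self-locking": they stop asking informative questions and struggle to internalize information they have already obtained. It proposes reallocating the learning signal by injecting directional critiques, which substantially mitigates the problem and yields improvements of up to 60% across 7 datasets.

Abstract (Chinese translation)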

基于结果奖励的强化学习（RL）在训练大语言模型（LLM）智能体执行复杂推理任务方面已取得显著成功。然而，在主动推理场景中，智能体需要策略性地提出问题以获取任务相关信息，我们发现通过RL训练的LLM智能体常受困于信息自锁现象：智能体停止提出信息丰富的问题，且难以内化已获得的信息。为理解这一现象，我们将主动推理分解为两个核心能力：行动选择（Action Selection, AS），即通过查询决定观察流；以及信念追踪（Belief Tracking, BT），即基于收集到的证据更新智能体的信念。我们证明，AS与BT能力的不足会限制RL训练期间的信息探索。此外，探索不足反过来又会阻碍AS与BT能力的提升，形成一个反馈循环，将智能体锁定在低信息状态中。为解决此问题，我们提出一种简单而有效的方法，通过注入易于获取的方向性评判来重新分配学习信号，以帮助智能体摆脱自锁状态。在7个数据集上的大量实验表明，我们的方法显著缓解了信息自锁问题，带来了最高达60%的性能提升。

Abstract

Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning where agents need to strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. To understand the phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent’s belief based on collected evidence. We show that deficient AS and BT capabilities will limit the information exploration during RL training. Furthermore, insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To resolve the issue, we propose a simple yet effective approach that reallocates the learning signal by injecting easy-to-obtain directional critiques to help the agent escape self-locking. Extensive experiments with 7 datasets show that our approach significantly mitigates the information self-locking, bringing up to 60% improvements.

Keywords: Large Language Model agents, Reinforcement Learning, Active Reasoning, Information Self-Locking, Action Selection, Belief Tracking, Exploration, Directional Critiques

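Deep analysis:

On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM Agents

Summary:

The paper investigates the "information self-locking" phenomenon that appears when LLM agents are trained for active reasoning with outcome-based reinforcement learning (RL). The authors find that agents settle into a low-information interaction pattern: they stop asking valuable questions and struggle to internalize the information they have obtained. By decomposing active reasoning into two core capabilities, action selection (AS) and belief tracking (BT), the paper shows theoretically that weak belief tracking masks the learning signal for informative actions, while conservative action selection in turn limits belief updates, creating a self-reinforcing feedback loop. To address this, the authors propose the AREW framework, which uses easy-to-obtain diagnostic signals (such as whether a query revealed new evidence) to generate directional critiques and corrects the learning signal by reweighting the advantage function. Experiments show that the method substantially mitigates information self-locking on 7 datasets, with performance gains of up to 60%.

Innovations:

First to define and analyze in depth the "information self-locking" phenomenon in RL for active reasoning with LLM agents.
Decomposes active reasoning into action selection (AS) and belief tracking (BT) and reveals theoretically the negative confounding effect between the two.
Designs the AREW (Advantage Reweighting) framework, which uses easy-to-obtain binary directional critique signals to reallocate the learning signal and break the self-locking loop.
Validates the method on multiple benchmarks, showing that it substantially restores the agent's information-seeking behavior.

Method

!!! info

The paper first models active reasoning as a partially observable Markov decision process (POMDP) and defines proxy metrics for AS and BT to quantify agent behavior. It then derives theoretically how, in the low-AS, low-BT regime, the learning signal is masked and the self-locked state forms. Building on this, the authors propose AREW, which uses environment feedback (such as whether the user provided new information) to generate directional critiques and uses them to reweight the advantage function during policy-gradient optimization, reinforcing effective action selection and belief updates. Finally, the method is validated experimentally on 7 datasets, including PE-G (preference estimation) and MediQ (medical diagnosis).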
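The idea of reshaping a trajectory-level advantage with per-step binary critiques can be sketched as below. This is only an illustration under stated assumptions: the additive form, the coefficient `alpha`, and the function name `reweight_advantages` are hypothetical, and the report does not give AREW's actual reweighting formula.

```python
def reweight_advantages(advantages, critiques, alpha=0.5):
    """Shift each step's advantage by a binary directional critique.

    critiques[t] is +1 if step t was judged informative (for example, the
    query revealed new evidence) and -1 otherwise. The additive rule and
    `alpha` are assumptions made for illustration.
    """
    if len(advantages) != len(critiques):
        raise ValueError("need one critique per step")
    if any(c not in (-1, 1) for c in critiques):
        raise ValueError("critiques must be +1 or -1")
    return [a + alpha * c for a, c in zip(advantages, critiques)]


# With an outcome-only reward, every step of a trajectory shares one advantage,
# so an uninformative question is credited as much as an informative one.
outcome_advantage = [1.0, 1.0, 1.0]
critiques = [1, -1, 1]          # the second question revealed nothing new
shaped = reweight_advantages(outcome_advantage, critiques)
```

After reshaping, the uninformative step receives less credit than its neighbors, which is the kind of signal reallocation the method description refers to.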

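Key results:

Outcome-based RL training can raise task reward yet often fails to improve the agent's action selection (AS) and belief tracking (BT).
The theoretical analysis shows that weak belief tracking lowers the correlation between informative actions and reward, so policy optimization cannot learn them effectively.
AREW substantially mitigates information self-locking on 7 datasets, with performance gains of up to 60%.
AREW changes the training dynamics: agents return to an information-seeking interaction pattern and show sustained growth in AS and BT.

Tech stack: Reinforcement learning, partially observable Markov decision process (POMDP), large language models, policy-gradient algorithms, advantage reweighting, active reasoning benchmarks

Strengths

The problem definition is novel and practically meaningful; it pinpoints a key pain point in current LLM agent training.
The theoretical analysis is solid: beyond observing the phenomenon, it explains the underlying dynamics through the AS/BT decomposition.
The solution (AREW) is lightweight and efficient; it needs no complex auxiliary model or expensive annotation, only easy-to-obtain diagnostic signals.
The experiments are comprehensive, covering different task types (such as preference estimation and medical diagnosis) and different model scales, which supports the method's robustness.

Limitations

The method relies on easy-to-obtain diagnostic signals (such as judging whether a query yielded new information); in some complex environments such signals may be hard to define or noisy.
The study focuses on active reasoning (question-asking) tasks; applicability to other agent behaviors (such as tool use or physical interaction) remains to be verified.
The theoretical analysis rests on certain assumptions; real-world environment dynamics may be more complex than the theoretical model.

Relevance to research direction:

The paper is highly relevant to innovation in the technical principles of large models and deep learning. It examines a core weakness of RL when training LLM agents and proposes a new theoretical framework and an algorithmic improvement, an important advance in the underlying technology. Although it touches on application scenarios such as medical diagnosis, its core contribution is at the level of technical principles, which matches the user's interest in highly innovative improvements to underlying technology.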

8. AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimiza

Authors: Qiyang Li, Rui Kong, Yuchen Li, Hengyi Cai, Shuaiqiang Wang, Linghe Kong, Guihai Chen, Dawei Yin Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11873v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

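Scoring rationale: The paper studies efficient inference for dynamic adapters (which combine MoE and LoRA) in LLMs. It is highly relevant to "Large Language Models", "Mixture of Experts", and "PEFT/LoRA" (10 points each) because it directly studies how these techniques are integrated and optimized. It is highly relevant to "Speculative Decoding/Inference Acceleration" (10 points) because its main contribution is an acceleration technique that reduces inference latency. The other keywords are unrelated (0 points) because the paper focuses on system optimization rather than model training, alignment, application domains, or other specific techniques.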

!!! tip deepseek-chat TL;DR

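The paper addresses the sharp increase in inference latency that dynamic adapters (combining MoE and LoRA) cause in large language models. Its AdaFuse framework matches the accuracy of existing methods while cutting decoding latency by a factor of more than 2.4.

Abstract (Chinese translation)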

将动态稀疏结构（如混合专家模型，MoE）与参数高效适配器（例如低秩自适应，LoRA）相结合，是增强大语言模型（LLM）能力的一项强大技术。然而，这种架构改进带来了高昂代价：尽管计算负载增加甚微，推理延迟却常常急剧上升，导致解码速度降低超过2.5倍。通过细粒度性能分析，我们发现主要瓶颈并非在于计算本身，而在于传统动态路由所需的大量零散、顺序执行的CUDA内核启动所带来的严重开销。为应对这一挑战，我们提出了AdaFuse框架，该框架基于算法与底层硬件系统的紧密协同设计，以实现高效的动态适配器执行。AdaFuse摒弃了传统的逐层或逐块路由策略，转而采用一种令牌级预门控策略，即在处理每个令牌之前，为其所有适配器层做出一次全局路由决策。这种“一次决策，处处应用”的方法有效地静态化了每个令牌的执行路径，从而为整体优化创造了条件。我们充分利用这一点，开发了一个定制的CUDA内核，该内核执行融合切换操作，将所选所有LoRA适配器的参数在一次高效传递中合并到骨干模型中。在主流开源大语言模型上的实验结果表明，AdaFuse在达到与先进动态适配器相当精度的同时，将解码延迟大幅降低了超过2.4倍，从而弥合了模型能力与推理效率之间的差距。

Abstract

The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a token-level pre-gating strategy, which makes a single, global routing decision for all adapter layers before a token is processed. This “decide-once, apply-everywhere” approach effectively staticizes the execution path for each token, creating an opportunity for holistic optimization. We capitalize on this by developing a custom CUDA kernel that performs a fused switching operation, merging the parameters of all selected LoRA adapters into the backbone model in a single, efficient pass. Experimental results on popular open-source LLMs show that AdaFuse achieves accuracy on par with state-of-the-art dynamic adapters while drastically cutting decoding latency by a factor of over 2.4x, thereby bridging the gap between model capability and inference efficiency.

Keywords: AdaFuse, dynamic adapters, Mixture-of-Experts, LoRA, Large Language Models, inference acceleration, CUDA kernel optimization, token-level pre-gating

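Deep analysis:

AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization

Summary:

Dynamic adapters (such as MoE combined with LoRA) enhance large models but slow decoding by more than 2.5x. The paper proposes the AdaFuse framework to address this. It finds that the main latency bottleneck is not the amount of computation but the overhead of fragmented CUDA kernel launches. AdaFuse uses system-algorithm co-design: a token-level pre-gating strategy makes a single global routing decision for all adapter layers before a token is processed, which makes the execution path static. In addition, a custom CUDA kernel (SGMM) uses a fused operation to merge the selected LoRA adapters into the backbone efficiently. Experiments show that AdaFuse keeps accuracy on par with state-of-the-art dynamic adapters while reducing decoding latency by more than 2.4x, balancing model capability and inference efficiency.

Innovations:

Proposes token-level pre-gating, which makes a single global routing decision before a token is processed, makes the execution path static, and lays the groundwork for later optimization.
Develops a custom fused CUDA kernel named SGMM that switches and merges adapters in a single efficient pass, greatly reducing kernel-launch overhead.
Shows that the root cause of dynamic-adapter latency is the frequent invocation of CUDA kernel context operations rather than added computational complexity, offering a new perspective for system optimization.
Designs an MoE-based token-level routing architecture that pre-merges the activated LoRA adapters into the pre-trained model before decoding, restructuring the inference flow.

Method

!!! info

The paper first uses fine-grained profiling to locate the latency bottleneck of dynamic-adapter inference in CUDA kernel launches rather than in computation. It then proposes the AdaFuse framework: at the algorithm level, a token-level pre-gating architecture replaces conventional layer-wise routing; at the system level, a fused CUDA kernel efficiently manages adapter merging and switching. Finally, it runs extensive experiments on open-source LLMs such as Llama2-7B, comparing accuracy and decoding latency against baselines such as MOLA, PESC, and MoRAL.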
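The "decide once, apply everywhere" idea can be sketched in plain Python. A single gate picks adapters for a token, and the same selection is merged into every layer's weight as W' = W + Σ g_i · (B_i A_i), which is the standard LoRA merge. The function names, the top-k softmax gate, and the toy shapes are assumptions for illustration; the paper's fused CUDA kernel (SGMM) is not reproduced here.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def pre_gate(token_emb, gate_weights, k=1):
    """One routing decision per token: top-k adapters with softmax gates."""
    logits = [sum(w * x for w, x in zip(row, token_emb)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def merge_adapters(base, adapters, selection):
    """Return W + sum_i g_i * (B_i @ A_i) for the selected LoRA adapters."""
    merged = [row[:] for row in base]
    for idx, gate in selection:
        b, a = adapters[idx]            # B is d_out x r, A is r x d_in
        delta = matmul(b, a)
        for r in range(len(merged)):
            for c in range(len(merged[0])):
                merged[r][c] += gate * delta[r][c]
    return merged


# Toy setup: two layers sharing two rank-1 adapters (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
backbone_layers = [identity, identity]
adapters = [
    ([[1.0], [0.0]], [[0.0, 2.0]]),
    ([[0.0], [1.0]], [[3.0, 0.0]]),
]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]

selection = pre_gate([0.0, 5.0], gate_weights, k=1)   # decided once per token
merged_layers = [merge_adapters(w, adapters, selection) for w in backbone_layers]
```

Because the selection is fixed before any layer runs, the per-layer work reduces to dense matrix operations on already-merged weights, which is what makes a fused implementation possible.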

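Key results:

Existing dynamic-adapter methods add very little to parameter count and computation (FLOPs, <1%) yet increase decoding latency by 200%-950%.
AdaFuse speeds up decoding by 2.4x on average, clearly outperforming existing dynamic-adapter methods.
Across a range of general and domain-specific benchmarks, AdaFuse's accuracy is on par with state-of-the-art dynamic adapters.

Tech stack: Mixture-of-Experts (MoE), Low-Rank Adaptation (LoRA), CUDA Kernel Programming, Token-level Routing, Kernel Fusion, Matrix Multiplication (GEMM), Softmax & Top-K Routing

Strengths

Resolves a key performance bottleneck of dynamic adapters in real deployments, with high practical engineering value.
Combines algorithm design (pre-gating) with low-level system optimization (fused kernels), giving a cross-layer solution.
The analysis of latency sources is thorough and shows that computation volume and actual runtime are not proportional to each other.
Greatly improves inference speed while preserving model accuracy, striking a good balance between capability and efficiency.

Limitations

Token-level pre-gating ("decide once, apply everywhere") may limit flexibility on complex tasks that need fine-grained per-layer adjustment.
Developing a custom CUDA kernel adds implementation complexity and may make porting and maintenance across hardware environments harder.
The optimization targets the decoding stage; gains in the prefill stage may be relatively limited.

Relevance to research direction:

The paper falls under "innovation in the technical principles of large models and deep learning". It studies LLM inference acceleration, specifically system-level optimization for dynamic adapters, a frontier architecture. Although it does not directly address applications in scientific fields, its technical contributions (token-level pre-gating and fused-kernel optimization) matter for deploying large models efficiently in many fields, including scientific computing. It is an advance in core underlying technology and fits the scoring criterion of strong innovation well.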

9. Scaling Laws for Educational AI Agents

Authors: Mengsong Wu, Hao Hao, Shuzhen Bi, Keqian Li, Wentao Liu, Siyu Song, Hongbo Zhao, Aimin Zhou Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11709v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	10.0/10	10.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

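Scoring rationale: The paper studies scaling laws for LLM-based educational agents and is highly relevant to "Large Language Models" and "Scaling Laws" (10 points each). It explicitly studies educational agents, so it is highly relevant to "LLM Agents" (10 points). It mentions tool completeness and Tool Scaling, which relates partially to "Tool Use" (5 points), and it studies a multi-agent platform, which relates partially to "Multi-agent Systems" (5 points). Other keywords, such as MoE, SLMs, training methods, reasoning techniques, and compression, are not covered (0 points).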

!!! tip deepseek-chat TL;DR

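The paper studies how LLM-based educational agents scale. It proposes the Agent Scaling Law framework and the AgentProfile mechanism, and reports observations from the EduClaw platform suggesting that educational agent performance scales predictably with the structural richness of the profile.

Abstract (Chinese translation)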

尽管大型语言模型（LLM）在模型参数量、训练数据量和计算资源方面的缩放规律已得到广泛研究，但基于LLM的教育智能体的缩放行为仍未被探索。我们认为，教育智能体的能力提升不仅依赖于底层模型规模，更应通过一系列结构化维度实现，我们将其统称为智能体缩放定律：角色定义清晰度、技能深度、工具完备性、运行时能力以及教育专家知识注入。该框架的核心是AgentProfile——一种基于JSON的结构化规范，它作为实现教育智能体能力系统性增长的机制。我们提出了EduClaw，一个基于配置驱动的多智能体平台，该平台实践了此缩放定律，并通过构建和部署330多个涵盖K-12学科、包含1100多个技能模块的教育智能体配置，验证了其有效性。我们的实证观察表明，教育智能体的性能可随配置结构丰富度实现可预测的缩放。我们提出了两个互补的缩放轴——工具缩放与技能缩放——作为未来发展方向，并指出，实现更强大教育人工智能的路径不仅在于使用更大规模的模型，更在于构建更强大的结构化能力系统。

Abstract

While scaling laws for Large Language Models (LLMs) have been extensively studied along dimensions of model parameters, training data, and compute, the scaling behavior of LLM-based educational agents remains unexplored. We propose that educational agent capability scales not merely with the underlying model size, but through structured dimensions that we collectively term the Agent Scaling Law: role definition clarity, skill depth, tool completeness, runtime capability, and educator expertise injection. Central to this framework is AgentProfile, a structured JSON-based specification that serves as the mechanism enabling systematic capability growth of educational agents. We present EduClaw, a profile-driven multi-agent platform that operationalizes this scaling law, demonstrating its effectiveness through the construction and deployment of 330+ educational agent profiles encompassing 1,100+ skill modules across K-12 subjects. Our empirical observations suggest that educational agent performance scales predictably with profile structural richness. We identify two complementary scaling axes – Tool Scaling and Skill Scaling – as future directions, arguing that the path to more capable educational AI lies not solely in larger models, but in stronger structured capability systems.

Keywords: Scaling Laws, Large Language Models, Educational Agents, Agent Scaling Law, AgentProfile, Multi-agent Platform, Tool Scaling, Skill Scaling

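Deep analysis:

Scaling Laws for Educational AI Agents

Summary:

The paper examines the scaling behavior of LLM-based educational agents and proposes an "Agent Scaling Law". It argues that an educational agent's capability depends not only on the size of the underlying model but, more importantly, on the richness of its structured dimensions, including role definition clarity, skill depth, and tool completeness. To this end, the authors define the AgentProfile specification, a structured JSON-based protocol for defining agent capabilities. On top of it they build the EduClaw platform and deploy more than 330 agent profiles and more than 1,100 skill modules covering K-12 subjects. Empirical observations suggest that educational agent performance scales predictably with the structural richness of the profile. The paper also proposes Tool Scaling and Skill Scaling as future research directions, stressing that educational AI should be improved by strengthening structured capability systems rather than by simply enlarging the model.

Innovations:

Proposes a scaling law for educational agents: capability scales with the structural richness of the profile rather than with model parameters alone.
Defines the AgentProfile specification, a general, structured JSON-based protocol for defining agent capabilities that supports cross-domain use.
Develops EduClaw, a profile-driven multi-agent system that automates agent generation and management.
Deploys 330+ educational agents and 1,100+ skill modules as empirical support for the claim that structured specifications improve educational interaction.

Method

!!! info

The paper combines theory building with empirical development. First, it proposes a mathematical model of the agent scaling law that decomposes agent capability into variables for role, dimension, skill, tool, and runtime. Second, it designs AgentProfile, a structured data specification, to quantify these variables. Finally, it builds the EduClaw multi-agent platform as a validation environment, constructs an ecosystem of K-12 subject agents and skill modules, and observes through real deployment how performance changes with structural richness.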

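Key results:

Educational agent performance correlates positively with the structural richness of the profile and scales predictably.
Externalizing specialized knowledge through AgentProfile lets a single base model serve the long tail of educational needs.
An educational agent ecosystem of 330+ profiles and 1,100+ skill modules was built and deployed.
Tool Scaling and Skill Scaling are identified as future research directions complementary to agent scaling.

Tech stack: AgentProfile specification: a structured JSON-based data schema; EduClaw platform: a profile-driven multi-agent system architecture; OpenClaw service: a back-end service supporting agent generation and management; mathematical model: C_agent ∝ f(d_role, d_dim, d_skill, d_tool, d_runtime).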
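To make the notion of "structural richness" concrete, here is a hypothetical AgentProfile-like document and a crude richness count. The report does not publish the real AgentProfile schema, so every field name, value, and the `structural_richness` helper are assumptions for illustration; the `d_dim` term of the model is omitted because the report does not define it.

```python
import json

# Hypothetical profile; field names and values are illustrative only.
profile = {
    "role": {"name": "grade7_math_tutor", "audience": "grade 7", "tone": "encouraging"},
    "skills": ["fraction_arithmetic", "linear_equations", "word_problems"],
    "tools": ["calculator", "graph_plotter"],
    "runtime": {"memory": True, "max_turns": 20},
}


def structural_richness(p):
    """Count populated entries per dimension as a crude richness proxy."""
    return {
        "d_role": len(p.get("role", {})),
        "d_skill": len(p.get("skills", [])),
        "d_tool": len(p.get("tools", [])),
        "d_runtime": len(p.get("runtime", {})),
    }


serialized = json.dumps(profile, indent=2)      # profiles are JSON documents
richness = structural_richness(json.loads(serialized))
```

A count like this is only a stand-in for the d_* variables in the stated model; how the paper actually measures each dimension is not described in the report.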

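Strengths

Novel perspective: extends scaling laws from the model level to the agent level and proposes a new paradigm of structured capability scaling.
Generality: AgentProfile is designed as a domain-agnostic protocol, not limited to education, with broad application potential.
Engineering deployment: provides a concrete platform (EduClaw) and a large-scale deployment (330+ agents) as evidence that the approach is feasible.
Addresses a pain point: for the long-tail distribution of needs in education, it offers a cost-effective solution that externalizes specialization instead of fine-tuning the model.

Limitations

Missing evaluation metrics: the paper mainly describes platform construction and qualitative observations; it lacks concrete quantitative metrics (such as student achievement gains or interaction quality scores) to rigorously establish the mathematical relationship of the scaling law.
Future work is incomplete: Tool Scaling and Skill Scaling are proposed only as concepts and have not yet been studied or validated in depth.
Dependence on the base model: the approach still relies on the underlying LLM; if the base model's reasoning ability is weak, the benefit of a structured profile may be limited.

Relevance to research direction:

The paper is highly relevant. It belongs both to "research applications of large models in different fields" (educational AI) and to "innovation in large-model technical principles" (the agent scaling law). It proposes an agent-level scaling theory distinct from conventional model-parameter scaling, which is technically novel. It builds on an LLM and improves domain capability through structured engineering, matching the user's interest in innovative techniques and applications.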

10. Long-Context Encoder Models for Polish Language Understanding

Authors: Sławomir Dadas, Rafał Poświata, Marek Kozłowski, Małgorzata Grębowiec, Michał Perełkiewicz, Paweł Klimiuk, Przemysław Boruta Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12191v1

Score: 33.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	15.0/10	15.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

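Scoring rationale: The paper develops long-context encoder models for Polish, handling 8192-token contexts through positional embedding adaptation and full-parameter continual pre-training, and creates compressed variants. It relates to "Large Language Models" (8 points) because it discusses LLM architectures; it is highly relevant to "Pre-training" (10 points) because it uses two-stage training with continual pre-training; and it is highly relevant to "Context Window Extension" (15 points) because this is the paper's core innovation. Other keywords such as MoE, SFT, and RAG are not covered and score 0.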

!!! tip deepseek-chat TL;DR

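The paper develops an encoder model for Polish that handles long contexts of 8192 tokens. Using positional embedding adaptation and continual pre-training, it significantly outperforms existing solutions on long-document understanding tasks.

Abstract (Chinese translation)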

尽管仅解码器架构的大语言模型（LLM）近期主导了自然语言处理领域，但仅编码器架构在判别性任务中仍是成本效益高且参数效率优良的标准方案。然而，诸如BERT等经典编码器受限于较短的上下文窗口，难以处理长文档。本文针对波兰语模型解决了这一限制，引入了一个能够处理长达8192个词元序列的高质量波兰语模型。该模型通过采用两阶段训练流程开发，包括位置嵌入适配和全参数持续预训练。此外，我们提出了通过知识蒸馏训练的压缩模型变体。这些模型在25项任务上进行了评估，包括KLEJ基准测试、新引入的金融任务集（FinBench）以及其他分类与回归任务，特别是需要长文档理解的任务。结果表明，我们的模型在波兰语及多语言模型中取得了最佳平均性能，在长上下文任务上显著优于竞争方案，同时在短文本处理上保持了相当的质量。

Abstract

While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for the Polish by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.

Keywords: encoder-only models, long-context processing, Polish language understanding, positional embedding adaptation, continuous pre-training, knowledge distillation, KLEJ benchmark, FinBench

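Deep analysis:

Long-Context Encoder Models for Polish Language Understanding

Summary:

Traditional encoder models such as BERT are limited to short context windows (512 tokens). The paper presents a high-quality Polish encoder model that supports up to 8192 tokens. The model builds on Polish RoBERTa v2 and achieves long-context processing by extending the positional embeddings and training in two stages (first training only the positional embeddings, then full-parameter continual pre-training). It also uses Flash Attention and contamination-free packing to improve efficiency, and produces compressed versions for edge devices through knowledge distillation. Evaluation on 25 tasks (including the KLEJ benchmark and the new financial benchmark FinBench) shows that the model significantly outperforms existing Polish and multilingual models on long-document tasks while remaining competitive on short-text tasks.

Innovations:

The first Polish encoder model supporting an 8192-token context, filling a gap in long-document processing for the language.
Proposes a two-stage training strategy (positional embedding adaptation followed by full-parameter continual pre-training) that addresses the gradient fluctuations caused by extending the context directly.
Introduces "contamination-free packing", which constrains attention during packed training so that information does not leak between different documents.
Uses knowledge distillation to produce efficient variants with 50% and 75% fewer layers, suited to edge deployment.
Releases FinBench, a Polish evaluation benchmark for banking and finance, adding to the available domain-specific evaluation resources.

Method

!!! info

The study extends the architecture of the existing Polish RoBERTa Large v2 model. The positional embedding layer is first extended to 8192 positions, and Flash Attention and contamination-free packing are used to make training more efficient. Training has two stages: the first freezes the model weights and trains only the new positional embeddings; the second continues pre-training the full model. Knowledge distillation (an MSE loss on last-layer token representations) then compresses the model into 12-layer and 6-layer versions. Finally, the models are evaluated on 25 downstream tasks.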
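Two of the ingredients above are easy to illustrate: a block-diagonal attention mask, which is one common way to keep packed documents from attending to each other, and an MSE loss over last-layer token representations. Both functions are minimal sketches under assumptions; the names are hypothetical, and the paper's actual packing and distillation code (which relies on Flash Attention) is not shown in the report.

```python
def packing_mask(doc_lengths):
    """Block-diagonal mask for a packed sequence.

    mask[i][j] is True only when tokens i and j come from the same document,
    so attention cannot cross document boundaries.
    """
    doc_ids = [d for d, n in enumerate(doc_lengths) for _ in range(n)]
    return [[a == b for b in doc_ids] for a in doc_ids]


def distillation_mse(student, teacher):
    """Mean squared error between per-token representation vectors."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


# A 2-token document packed together with a 1-token document.
mask = packing_mask([2, 1])
# One token with a 2-dimensional last-layer representation (toy values).
loss = distillation_mse([[1.0, 2.0]], [[1.0, 4.0]])
```

In a real training loop the mask would be passed to the attention implementation (or encoded as cumulative sequence lengths for variable-length attention kernels) instead of being materialized as a dense matrix.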

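Key results:

The new model achieves the best average performance among Polish and multilingual models across the 25 evaluation tasks.
On tasks requiring long-document understanding, it significantly outperforms existing competing solutions.
On short-text tasks, it maintains quality comparable to the original base model.
The compressed 6-layer model reaches 115% of the base model's inference throughput while maintaining performance.

Tech stack: Transformer Encoder (RoBERTa architecture), Flash Attention, Contamination-free Packing, Masked Language Modeling (MLM), Knowledge Distillation (MSE Loss), AdamW Optimizer, Polynomial Learning Rate Scheduler

Strengths

The technical approach is clear, and the two-stage training strategy balances long-text and short-text capability.
The engineering optimizations (Flash Attention, contamination-free packing) significantly improve training efficiency and model quality.
Model variants of different sizes serve both high-performance and resource-constrained settings.
The evaluation is comprehensive: beyond general benchmarks it adds a specialized financial dataset, which gives it high practical value.

Limitations

Distillation uses only an MSE loss on the last layer and no attention-matrix distillation, which may cap the performance of the compressed models.
Although the context is extended to 8192 tokens, this still falls short of modern decoder-only LLMs, which commonly support 32k or even 128k tokens.
The model targets Polish; how well the techniques generalize to other low-resource languages remains to be verified.

Relevance to research direction:

The paper falls under innovation in the technical principles of large models and deep learning. Although its application area is NLP for a specific language (Polish), its core contribution is overcoming the long-context limitation of Transformer encoder architectures, involving positional embedding extension, training strategy (two-stage training), attention-related optimization (Flash Attention, contamination-free packing), and model compression (knowledge distillation). These techniques are general and offer a useful reference for applying large models efficiently to scientific computing or long-document processing, which matches the user's interest in innovations in technical principles.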

11. Automating Skill Acquisition through Large-Scale Mining of Open-Source Agentic Repositories: A Frame

Authors: Shuzhen Bi, Mengsong Wu, Hao Hao, Keqian Li, Wentao Liu, Siyu Song, Hongbo Zhao, Aimin Zhou Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11808v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

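Scoring rationale: The paper studies the architectural shift from LLMs to modular, skill-equipped agents (LLM Agents/Autonomous Agents) and automatically acquires high-quality agent skills by mining open-source repositories (such as GitHub); it is an applied-research innovation for large models. It is therefore highly relevant to "Large Language Models" and "LLM Agents" (10 points each). It extracts skills from existing agent systems (such as TheoremExplainAgent and Code2Video), which may use tools (such as the Manim animation engine), so it relates partially to "Tool Use" and "Multi-agent Systems" (5 points each). It does not directly discuss techniques covered by the other keywords (such as MoE, training methods, or inference optimization) or any specific scientific field (such as bioinformatics), so the other keywords score 0.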

!!! tip deepseek-chat TL;DR

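The paper proposes a framework that augments LLM capabilities by automatically acquiring high-quality skills mined from open-source agent repositories, and reports that the approach substantially improves knowledge transfer efficiency without retraining the model.

Abstract (Chinese translation)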

从单体式大语言模型向模块化、技能化智能体的转变，代表了人工智能部署领域一次根本性的架构转型。尽管通用模型在陈述性知识上展现出卓越的广度，但其在自主工作流中的应用常受限于专业程序性知识的不足。本报告研究了一种通过挖掘GitHub等平台上的开源仓库，以自动化方式获取高质量智能体技能的系统性框架。我们聚焦于从包括TheoremExplainAgent和Code2Video在内的先进系统中提取可视化与教育能力，这两者均利用了Manim数学动画引擎。该框架涵盖仓库结构分析、通过密集检索进行语义技能识别，以及向标准化SKILL.md格式的转换。我们证明，通过对智能体仓库进行系统性提取，并结合严格的安全治理与多维评估指标，能够实现程序性知识的规模化获取，从而增强大语言模型的能力，而无需进行模型再训练。我们的分析表明，智能体生成的教育内容在知识传递效率上可实现40%的提升，同时其教学品质可与人工编写的教程相媲美。

Abstract

The transition from monolithic large language models (LLMs) to modular, skill-equipped agents represents a fundamental architectural shift in artificial intelligence deployment. While general-purpose models demonstrate remarkable breadth in declarative knowledge, their utility in autonomous workflows is frequently constrained by insufficient specialized procedural expertise. This report investigates a systematic framework for automated acquisition of high-quality agent skills through mining of open-source repositories on platforms such as GitHub. We focus on the extraction of visualization and educational capabilities from state-of-the-art systems including TheoremExplainAgent and Code2Video, both utilizing the Manim mathematical animation engine. The framework encompasses repository structural analysis, semantic skill identification through dense retrieval, and translation to the standardized SKILL.md format. We demonstrate that systematic extraction from agentic repositories, combined with rigorous security governance and multi-dimensional evaluation metrics, enables scalable acquisition of procedural knowledge that augments LLM capabilities without requiring model retraining. Our analysis reveals that agent-generated educational content can achieve 40% gains in knowledge transfer efficiency while maintaining pedagogical quality comparable to human-crafted tutorials.

Keywords: Large Language Models, LLM Agents, Skill Acquisition, Open-Source Mining, Procedural Knowledge, Multi-Agent Systems, Agentic Workflows, Automated Framework

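In-depth analysis:

Automated skill acquisition by large-scale mining of open-source agent repositories: a multi-agent framework for procedural knowledge extraction

Summary:

The paper proposes a framework that acquires agent skills automatically by mining open-source GitHub repositories at scale, addressing the lack of specialized procedural knowledge in large language models. The framework has three main components: repository structural analysis, semantic skill identification through dense retrieval, and conversion to the standardized SKILL.md format. Using TheoremExplainAgent and Code2Video as case studies, it shows how visualization and educational capabilities can be extracted. The results indicate that systematic extraction, combined with security governance and multi-dimensional evaluation, enables procedural knowledge to be acquired at scale: agent-generated educational content gains 40% in knowledge transfer efficiency while matching the pedagogical quality of human-written tutorials.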

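Key contributions:

Proposes a systematic framework for automated skill acquisition that mines open-source agent repositories and turns procedural knowledge into reusable agent skills.
Defines agent skills formally as a four-tuple and adopts the standardized SKILL.md specification, making skills modular and supporting progressive disclosure.
Combines dense retrieval with a cross-encoder in a two-stage ranking mechanism to identify latent, generalizable skill patterns in a codebase.
Demonstrates that skills extracted from domain-specific open-source projects (such as mathematical animation generation) can substantially improve knowledge transfer efficiency.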
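As a rough illustration of what a SKILL.md file could look like, the sketch below renders YAML front matter followed by numbered instructions. The field names (`name`, `description`), the heading, and the example skill are assumptions made for illustration; the analysis above does not list the fields the paper actually uses.

```python
def render_skill_md(name: str, description: str, steps: list[str]) -> str:
    """Render a skill as YAML front matter followed by numbered instructions."""
    # Assumed layout: metadata first (cheap to load), full instructions after.
    front_matter = "\n".join(
        ["---", f"name: {name}", f"description: {description}", "---"]
    )
    body = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"{front_matter}\n\n# Instructions\n\n{body}\n"


skill = render_skill_md(
    name="manim-theorem-animation",  # hypothetical skill name
    description="Render a short Manim animation that illustrates a theorem.",
    steps=[
        "Outline the proof as a sequence of scenes.",
        "Write one Manim Scene class per scene.",
        "Render the scenes and check the output video.",
    ],
)
print(skill)
```

Keeping the metadata separate from the instruction body is one way to support progressive disclosure: an agent could read only the front matter when deciding whether a skill applies, and load the body on demand.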

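Method

!!! info

The study uses a three-stage pipeline. First, tools such as repo2AI analyze the repository structure and produce a directory map and context. Second, a bi-encoder performs dense retrieval by computing similarity scores, and a cross-encoder then performs binary ranking to identify candidate skills. Finally, the identified patterns are converted to the SKILL.md format: YAML metadata is generated, procedural instructions are written in a form an LLM can consume, and executable assets are packaged.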
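The retrieve-then-rerank step can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: the embedding vectors are hand-written rather than produced by a bi-encoder, and `overlap_score` is a simple term-overlap stand-in for a real cross-encoder. The file paths and texts are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dense_retrieve(query_vec, candidates, top_n):
    """Stage 1: rank candidates by cosine similarity of precomputed embeddings."""
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:top_n]


def rerank(query_terms, shortlisted, score_pair):
    """Stage 2: rescore each (query, candidate) pair jointly, as a cross-encoder would."""
    return sorted(shortlisted, key=lambda c: score_pair(query_terms, c), reverse=True)


def overlap_score(query_terms, candidate):
    # Stand-in for a cross-encoder: fraction of query terms found in the text.
    words = set(candidate["text"].lower().split())
    return len(query_terms & words) / len(query_terms)


# Hypothetical repository files with hand-written 3-d "embeddings".
candidates = [
    {"path": "src/render_scene.py", "text": "render manim scene to video", "vec": [0.9, 0.1, 0.0]},
    {"path": "src/plan_proof.py", "text": "plan theorem proof steps", "vec": [0.7, 0.6, 0.1]},
    {"path": "setup.py", "text": "package install metadata", "vec": [0.0, 0.1, 0.9]},
]
query_vec = [0.8, 0.5, 0.0]
query_terms = {"theorem", "proof", "animation"}

shortlist = dense_retrieve(query_vec, candidates, top_n=2)
best = rerank(query_terms, shortlist, overlap_score)[0]
print(best["path"])
```

The cheap similarity pass prunes the candidate set so that the more expensive pairwise scorer only sees a short list, which is the usual reason for pairing the two stages.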

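Key results:

Built an end-to-end pipeline for extracting agent skills from open-source codebases.
Agent-generated educational content achieved 40% higher knowledge transfer efficiency than traditional methods.
The extracted skills extend capabilities without model retraining while keeping pedagogical quality comparable to human-written tutorials.

Tech stack: bi-encoder and cross-encoder, dense retrieval, cosine similarity, SKILL.md specification, Manim mathematical animation engine, repo2AI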

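Strengths

Extends agent capabilities without retraining the underlying model, which lowers compute cost.
Adopts the SKILL.md standard, which promotes skill interoperability and reuse.
Automates the path from code mining to skill generation, reducing the manual effort of writing skills.
Validates the framework on concrete open-source projects and reports a quantified efficiency gain.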

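Limitations

Results depend heavily on the source repository's code quality, documentation completeness, and logical clarity.
Extracted skills may remain tied to the domain of the original code; cross-domain transfer may need extra adaptation.
Ensuring that code extracted during large-scale automated mining contains no malicious logic or vulnerabilities remains a challenge.
The case studies focus on visualization and education; applicability to other complex domains still needs validation.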

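Relevance to research direction:

Highly relevant. The paper contributes to the technical principles of large models by proposing a shift from monolithic models to a modular agent-skill architecture. Its application scenarios, science education (visualizing mathematical theorems) and code generation, show how large models can mine and reuse domain knowledge, which ties it directly to applications of large models in science. The automated skill-acquisition idea is both novel and practical.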

12. From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration

Authors: Gaole He, Brian Y. Lim Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11677v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

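Scoring rationale: The paper explicitly studies LLM-driven autonomous agents for complex multi-step tasks, so it is highly relevant to "Large Language Models" (keyword 1) and "LLM Agents" (keyword 17), 10 points each. Its discussion of users having to simulate long-term effects touches on reasoning, so it is partly relevant to "Chain of Thought" (keyword 13) and "System 2 Thinking" (keyword 14), 5 points each. The remaining keywords concern specific techniques (MoE, quantization, training methods, and so on) or application domains (such as bioinformatics) that the paper does not address, so they score 0.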
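The arithmetic behind the score line can be checked with a short sketch. It assumes that a paper's score is the sum of weight × relevance over all keywords (which matches the Score column in the table above) and uses the pass-mark formula stated in the report's configuration, 5.0 + 0.8 × total weight. The function names are illustrative.

```python
def pass_mark(total_weight: float) -> float:
    """Pass mark as given in the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """rows holds one (weight, relevance) pair per keyword."""
    return sum(weight * relevance for weight, relevance in rows)


# The four non-zero rows of the table above, plus 23 keywords that score 0.
rows = [(1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 10.0)] + [(1.0, 0.0)] * 23
score = paper_score(rows)
threshold = pass_mark(sum(weight for weight, _ in rows))
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```

With 27 keywords of weight 1.0 the pass mark works out to 26.6, so this paper's 30.0 clears it.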

!!! tip deepseek-chat TL;DR

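The paper argues that current interaction between humans and LLM-based autonomous agents is pointwise and reactive and lacks foresight. It proposes simulation-in-the-loop, an interaction paradigm in which users explore simulated future trajectories before committing to a decision, to make human-agent collaboration more effective.

Abstract (Chinese translation)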

大型语言模型（LLM）正日益被用于驱动自主智能体执行复杂的多步骤任务。然而，当前的人机交互仍然是点状且反应式的：用户通过批准或纠正单个操作来规避即时风险，却无法预知后续影响。这迫使用户必须在脑海中模拟长期效应，这一过程不仅认知负荷高，且往往不够准确。用户虽能控制单个步骤，却缺乏做出知情决策所需的预见能力。我们认为，有效的协作需要预见性，而不仅仅是控制权。为此，我们提出“仿真循环内嵌”这一交互范式，使用户与智能体能够在做出最终决策前，共同探索模拟的未来轨迹。仿真将干预行为从被动的猜测转变为有依据的探索，同时帮助用户在过程中发现潜在的约束条件与偏好。本视角论文分析了现有范式的局限性，提出了基于仿真的协作概念框架，并通过具体的人机协作场景阐述了其潜在价值。

Abstract

Large Language Models (LLMs) are increasingly used to power autonomous agents for complex, multi-step tasks. However, human-agent interaction remains pointwise and reactive: users approve or correct individual actions to mitigate immediate risks, without visibility into subsequent consequences. This forces users to mentally simulate long-term effects, a cognitively demanding and often inaccurate process. Users have control over individual steps but lack the foresight to make informed decisions. We argue that effective collaboration requires foresight, not just control. We propose simulation-in-the-loop, an interaction paradigm that enables users and agents to explore simulated future trajectories before committing to decisions. Simulation transforms intervention from reactive guesswork into informed exploration, while helping users discover latent constraints and preferences along the way. This perspective paper characterizes the limitations of current paradigms, introduces a conceptual framework for simulation-based collaboration, and illustrates its potential through concrete human-agent collaboration scenarios.

Keywords: Large Language Models, LLMs, autonomous agents, human-agent collaboration, simulation, foresight, multi-step tasks, interaction paradigm

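In-depth analysis:

From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration

Summary:

In current human-agent collaboration, users can only intervene reactively on individual actions and cannot foresee long-term consequences. To address this, the paper proposes a new interaction paradigm called simulation-in-the-loop, in which the user and the agent jointly explore simulated future trajectories before any real action is executed. By visualizing the agent's internal search process, the paradigm lets users compare the risks and benefits of different paths and move from passive oversight to active exploration. The paper lays out the conceptual framework, analyzes the design space (for example lookahead depth and exploration breadth), and discusses challenges and opportunities such as simulation reliability and cognitive load.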

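Key contributions:

Proposes the simulation-in-the-loop interaction paradigm, shifting human-agent collaboration from pointwise control to exploration grounded in foreseen future trajectories.
Defines the concept of simulation impact, which turns an abstract future into concrete risks, opportunities, and trade-offs that support user decisions.
Introduces dimensions of the simulation design space (lookahead depth, exploration breadth, granularity) to help system designers balance information against cognitive load.
Emphasizes that simulation can surface latent constraints and preferences, turning collaboration into a joint exploration in which requirements are discovered dynamically.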

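Method

!!! info

The paper relies on conceptual framework building and scenario analysis. The authors first critique the existing pointwise interaction model, then propose a theoretical framework built on four core concepts: agent workflow, action space, simulation, and simulation impact. A concrete travel-planning scenario (for example choosing a flight connection) contrasts the traditional model with the new one, followed by an analysis of the design-space trade-offs and technical challenges involved in realizing the paradigm.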
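To make lookahead depth and exploration breadth concrete, here is a toy rollout over a hand-written travel scenario. Everything in it is an assumption for illustration: the `TRANSITIONS` table stands in for whatever simulator or world model a real system would use, and the paper itself presents no implementation.

```python
from itertools import islice

# Hypothetical toy world: each state maps to (step, next_state) pairs.
TRANSITIONS = {
    "start": [
        ("book 45-min layover", "tight_connection"),
        ("book 3-hour layover", "safe_connection"),
    ],
    "tight_connection": [
        ("inbound flight delayed", "missed_flight"),
        ("inbound flight on time", "arrived"),
    ],
    "safe_connection": [
        ("inbound flight delayed", "arrived"),
        ("inbound flight on time", "arrived"),
    ],
}


def simulate(state, depth, breadth):
    """Enumerate trajectories up to `depth` steps, at most `breadth` branches per state."""
    if depth == 0 or state not in TRANSITIONS:
        return [[]]
    trajectories = []
    for step, next_state in islice(TRANSITIONS[state], breadth):
        for tail in simulate(next_state, depth - 1, breadth):
            trajectories.append([(step, next_state)] + tail)
    return trajectories


def risky(trajectory):
    """Flag a trajectory that ends up in the undesirable state."""
    return any(state == "missed_flight" for _, state in trajectory)


futures = simulate("start", depth=2, breadth=2)
for trajectory in futures:
    steps = " -> ".join(step for step, _ in trajectory)
    print(("RISK  " if risky(trajectory) else "ok    ") + steps)
```

Raising `depth` or `breadth` shows the user more futures at the price of more to read, which is the information versus cognitive load trade-off the design space describes.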

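Key results:

Existing human-agent collaboration suffers from a fundamental "control without foresight" flaw that leads to short-sighted user decisions.
Simulation, as an intermediate layer, can externalize the agent's internal search tree and help users understand downstream consequences.
Simulation not only reduces risk but also encourages serendipitous discovery, revealing latent needs the user never stated.
Realizing the paradigm faces challenges in simulation reliability (for example LLM hallucination), selecting key paths, and managing user cognitive load.

Tech stack: large language models, tree search algorithms, world models, simulation environments, human-computer interaction interface design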

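Strengths

Fresh perspective: identifies the lack of foresight as a key pain point in current LLM agent interaction and proposes a forward-looking remedy.
Clear theory: builds a complete conceptual framework that makes the abstract simulation process concrete as actionable design dimensions.
Practical value: concrete scenarios show how simulation helps users spot risks and opportunities, offering useful guidance for improving human-agent collaboration on complex tasks.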

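Limitations

Hard to implement: building a reliable simulation environment in open domains is very challenging, and relying on an LLM to simulate itself can produce hallucinations.
Cognitive load: showing too many future trajectories may confuse users, and how to filter and present the key information remains unsolved.
No empirical data: as a perspective paper it stays at the level of concepts and frameworks and has no user-study data to validate the paradigm.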

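Relevance to research direction:

Highly relevant. The paper focuses on LLM-driven agents and applies large-model principles to human-computer interaction (HCI). Although it does not touch specific scientific fields such as biomedicine, it proposes a new paradigm for improving the reliability and collaborative ability of LLM agents, which meets the evaluation criterion of "innovation in the technical principles of large models and deep learning". Its simulation mechanism has potential general value for improving LLM performance on complex tasks such as planning scientific experiments.

📋 All papers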

1. ✅ Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

Authors: Sizhong Qin, Ramon Elias Weber, Xinzheng Lu Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11640v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

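The study addresses the difficulty AI systems have with coherent spatial reasoning and controllable generation in architectural floor plan design. It proposes HouseMind, a multimodal large language model that uses discrete room-instance tokens and instruction tuning to synthesize coherent, controllable layouts from text instructions, and reports superior geometric validity and controllability in experiments.

Abstract (Chinese translation)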

建筑平面图设计需要对几何结构、语义信息与空间层级进行联合推理，这对当前人工智能系统仍构成重大挑战。尽管近期扩散模型与语言模型提升了视觉逼真度，但其在空间连贯推理与可控生成方面仍存在困难。本文提出HouseMind——一个多模态大语言模型，将平面图理解、生成与编辑统一于单一框架中。我们引入离散的房间实例（room-instance）标记来构建统一词汇表，从而连接布局设计与符号推理。通过多模态对齐与指令微调，该模型能够根据文本指令生成连贯且可控的平面布局。实验表明，该框架在保持高效性与本地可部署性的同时，实现了更优的几何有效性与生成可控性。

Abstract

Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show how the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.

Keywords: multimodal large language model, architectural floor plans, discrete room-instance tokens, instruction tuning, controllable generation, spatial reasoning, HouseMind, geometric validity

2. ✅ Tiny Aya: Bridging Scale and Multilingual Depth

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

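Tiny Aya studies how an efficient training strategy and data composition can yield a small multilingual model with only 3.35B parameters that delivers state-of-the-art translation quality plus strong multilingual understanding and generation across 70 languages. The release includes a base model, an instruction-tuned variant, and region-specialized models.

Abstract (Chinese translation)

Tiny Aya重新定义了小型多语言模型所能达到的边界。该模型基于70种语言进行训练，并通过区域感知的后训练阶段进行精调，仅以3.35B参数规模便在翻译质量上达到业界领先水平，同时具备强大的多语言理解能力和高质量的目标语言生成能力。本次发布包含一个预训练基础模型、一个全球平衡的指令微调版本，以及三个面向非洲、南亚、欧洲、亚太和西亚地区语言的区域专用模型。本报告详细阐述了Tiny Aya背后的训练策略、数据构成与综合评估框架，并提出了多语言人工智能的另一条扩展路径：以效率为核心，注重语言间性能均衡，并兼顾实际部署。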

Abstract

Tiny Aya redefines what a small multilingual language model can achieve. Trained on 70 languages and refined through region-aware posttraining, it delivers state-of-the-art in translation quality, strong multilingual understanding, and high-quality target-language generation, all with just 3.35B parameters. The release includes a pretrained foundation model, a globally balanced instruction-tuned variant, and three region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia. This report details the training strategy, data composition, and comprehensive evaluation framework behind Tiny Aya, and presents an alternative scaling path for multilingual AI: one centered on efficiency, balanced performance across languages, and practical deployment.

3. ✅ Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

Authors: Yulu Gan, Phillip Isola Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12228v1

Score: 51.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	8.0/10	8.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

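The paper finds that in large pretrained models the density of task-expert solutions around the pretrained weights increases dramatically. It proposes a simple, fully parallel post-training method that samples random parameter perturbations, selects the best ones, and ensembles their predictions, with performance competitive with standard post-training methods such as PPO and GRPO.

Abstract (Chinese translation)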

预训练产生的学习参数向量通常被视为后续迭代适应的起点。在本研究中，我们提出将预训练结果视为参数向量上的分布，其支撑集已包含任务特定的专家模型。我们证明，在小型模型中此类专家解仅占据该分布体积的极小部分，因此其发现依赖于梯度下降等结构化优化方法。相比之下，在大型且充分预训练的模型中，任务专家的密度显著增加，使得多样化、能提升任务性能的专家模型大量分布于预训练权重邻域内。基于此视角，我们探索了一种完全并行的简单后训练方法：随机采样 $N$ 个参数扰动，选取最优的 $K$ 个样本，并通过多数投票进行预测集成。尽管方法简单，该策略在当代大规模模型中与PPO、GRPO、ES等标准后训练方法相比仍具有竞争力。

Abstract

Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models the density of task-experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples $N$ parameter perturbations at random, selects the top $K$, and ensembles predictions via majority vote. Despite its simplicity, this approach is competitive with standard post-training methods such as PPO, GRPO, and ES for contemporary large-scale models.
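The sample, select, and vote loop described in the abstract can be sketched on a toy problem. The two-weight linear classifier, the Gaussian noise scale `sigma`, and the synthetic task below are all assumptions chosen so the example runs instantly; they say nothing about the models or settings used in the paper. For brevity the same data is used for selection and for the final accuracy, whereas a real experiment would hold out a test set.

```python
import random


def predict(weights, x):
    """Binary prediction of a linear classifier without bias."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def accuracy(weights, data):
    return sum(predict(weights, x) == y for x, y in data) / len(data)


def sample_and_select(base, data, n, k, sigma, rng):
    """Sample n Gaussian perturbations of `base` and keep the k most accurate."""
    candidates = [[w + rng.gauss(0.0, sigma) for w in base] for _ in range(n)]
    candidates.sort(key=lambda w: accuracy(w, data), reverse=True)
    return candidates[:k]


def majority_vote(members, x):
    """Ensemble prediction: label 1 wins if more than half the members vote for it."""
    votes = sum(predict(w, x) for w in members)
    return 1 if 2 * votes > len(members) else 0


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
data = [(p, 1 if p[0] + p[1] > 0 else 0) for p in points]  # toy task
base = [1.0, 0.2]  # stand-in for "pretrained" weights

experts = sample_and_select(base, data, n=64, k=5, sigma=0.3, rng=rng)
ensemble_acc = sum(majority_vote(experts, x) == y for x, y in data) / len(data)
print(f"base={accuracy(base, data):.2f} ensemble={ensemble_acc:.2f}")
```

Each candidate is scored independently of the others, so the sampling stage parallelizes trivially. An odd `k` avoids ties in the vote.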

Keywords: pretraining, task experts, parameter distribution, post-training, model ensembling, large-scale models, parameter perturbations, majority vote

4. ✅ When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

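Scoring rationale: The paper's core subject is LLM agents in clinical workflows. It is highly relevant to "Large Language Models", "LLM Agents", "Tool Use", and "Multi-agent Systems" (10 points each) because it explicitly discusses LLM agents, tool invocation, and multi-agent coordination. It is also relevant to "AI for Science" (10 points) because it focuses on AI applications in healthcare. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper's content (0 points).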

!!! tip deepseek-chat TL;DR

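The paper targets the reliability, security, and long-term memory shortcomings that hinder deploying LLM agents in clinical settings. It proposes an architecture built on a restricted execution environment, document-centric interaction, page-indexed memory, and a medical skills library, forming the basis of a safe, transparent, and auditable agentic operating system for hospitals.

Abstract (Chinese translation)

大型语言模型（LLM）智能体通过整合推理、工具调用与持久记忆，扩展了传统生成模型的能力。近期研究表明，此类智能体可通过自动化文档处理、协调诊疗流程及辅助医疗决策，显著改善临床工作流。然而，尽管进展迅速，由于可靠性局限、安全风险及长期记忆机制不足，在医疗环境中部署自主智能体仍面临挑战。本研究提出一种适用于医院环境的LLM智能体架构。该设计引入四个核心组件：受Linux多用户系统启发的受限执行环境；连接患者与临床医生智能体的以文档为中心的交互范式；专为长期临床情境管理设计的页面索引记忆架构；以及支持临床任务序列按需组合的精选医疗技能库。该架构并非赋予智能体无限制的系统访问权限，而是通过预定义技能接口和资源隔离来约束其行为。我们认为，此类系统构成了“医院智能体操作系统”的基础，即一个能够协调临床工作流，同时保障安全性、透明性与可审计性的计算层。本研究基于OpenClaw（一个将智能体能力构建为离散技能精选库的开源自主智能体框架）实现设计，并通过临床安全部署所需的基础设施级约束对其进行扩展。

Abstract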

Large language model (LLM) agents extend conventional generative models by integrating reasoning, tool invocation, and persistent memory. Recent studies suggest that such agents may significantly improve clinical workflows by automating documentation, coordinating care processes, and assisting medical decision making. However, despite rapid progress, deploying autonomous agents in healthcare environments remains difficult due to reliability limitations, security risks, and insufficient long-term memory mechanisms. This work proposes an architecture that adapts LLM agents for hospital environments. The design introduces four core components: a restricted execution environment inspired by Linux multi-user systems, a document-centric interaction paradigm connecting patient and clinician agents, a page-indexed memory architecture designed for long-term clinical context management, and a curated medical skills library enabling ad-hoc composition of clinical task sequences. Rather than granting agents unrestricted system access, the architecture constrains actions through predefined skill interfaces and resource isolation. We argue that such a system forms the basis of an Agentic Operating System for Hospital, a computing layer capable of coordinating clinical workflows while maintaining safety, transparency, and auditability. This work grounds the design in OpenClaw, an open-source autonomous agent framework that structures agent capabilities as a curated library of discrete skills, and extends it with the infrastructure-level constraints required for safe clinical deployment.
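One way to picture "constraining actions through predefined skill interfaces" is a registry that only lets an agent act through skills its role is allowed to call, and logs every attempt. The class, role names, and skills below are hypothetical and are not taken from OpenClaw or from the paper; they only illustrate the idea of an allow-list plus an audit trail.

```python
class SkillNotPermitted(Exception):
    """Raised when a role tries to call a skill it is not allowed to use."""


class SkillRegistry:
    """Agents can act only through registered skills that their role may call."""

    def __init__(self):
        self._skills = {}
        self.audit_log = []  # (role, skill name, outcome) for every attempt

    def register(self, name, func, allowed_roles):
        self._skills[name] = (func, frozenset(allowed_roles))

    def invoke(self, role, name, **kwargs):
        func, allowed = self._skills.get(name, (None, frozenset()))
        permitted = func is not None and role in allowed
        self.audit_log.append((role, name, "allowed" if permitted else "denied"))
        if not permitted:
            raise SkillNotPermitted(f"{role} may not call {name}")
        return func(**kwargs)


registry = SkillRegistry()
registry.register("summarize_note", lambda text: text[:40], {"clinician_agent", "patient_agent"})
registry.register("order_lab_test", lambda test: f"ordered {test}", {"clinician_agent"})

print(registry.invoke("clinician_agent", "order_lab_test", test="CBC"))
try:
    registry.invoke("patient_agent", "order_lab_test", test="CBC")
except SkillNotPermitted as err:
    print("blocked:", err)
```

Unknown skills are denied by default, and denied calls are logged as well as allowed ones, which supports the auditability goal the abstract mentions.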

Keywords: LLM agents, clinical workflows, hospital environment, autonomous agents, tool invocation, multi-agent systems, medical decision making, agentic operating system

5. ✅ One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries

Authors: Mayank Saini, Arit Kumar Bishwas Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11545v1

Score: 46.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	8.0/10	8.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

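The paper proposes an agentic framework for autonomous multimodal query processing in which a central Supervisor dynamically coordinates modality-specific tools. Compared with a matched hierarchical baseline, it reduces time-to-accurate-answer by 72%, conversational rework by 85%, and cost by 67% while maintaining accuracy parity.

Abstract (Chinese translation)

我们提出一种用于自主多模态查询处理的智能体AI框架，该框架能够协调跨文本、图像、音频、视频和文档模态的专用工具。一个中央监督器（Supervisor）动态分解用户查询，将子任务委派给适配相应模态的工具（例如目标检测、OCR、语音转录），并通过自适应路由策略而非预定的决策树来综合结果。针对纯文本查询，该框架使用通过RouteLLM学习到的路由机制；非文本路径则采用SLM辅助的模态分解方法。在涵盖15个任务类别的2,847个查询上进行评估后，我们的框架在保持准确率相当的前提下，相比匹配的层次化基线，实现了准确答案获取时间减少72%、对话返工减少85%以及成本降低67%。这些结果表明，智能化的集中式编排从根本上改善了多模态AI部署的经济性。

Abstract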

We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries across 15 task categories, our framework achieves 72% reduction in time-to-accurate-answer, 85% reduction in conversational rework, and 67% cost reduction compared to the matched hierarchical baseline while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration fundamentally improves multimodal AI deployment economics.
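The decompose, delegate, and synthesize flow can be sketched as a small dispatcher. Note the substitution: the paper uses learned routing (RouteLLM) and SLM-assisted decomposition, while this sketch routes by file extension purely for illustration. The tool functions, the `EXTENSIONS` map, and the file names are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    modality: str
    payload: str


# Hypothetical stand-ins for modality-specific tools.
TOOLS = {
    "image": lambda p: f"objects detected in {p}",
    "audio": lambda p: f"transcript of {p}",
    "document": lambda p: f"OCR text of {p}",
    "text": lambda p: f"answer to '{p}'",
}

# Rule-based routing by extension (a simplification of learned routing).
EXTENSIONS = {".png": "image", ".jpg": "image", ".wav": "audio", ".mp3": "audio", ".pdf": "document"}


def decompose(query: str, attachments: list[str]) -> list[Subtask]:
    """Split a request into one subtask per attachment plus the text question."""
    subtasks = []
    for name in attachments:
        suffix = name[name.rfind("."):].lower() if "." in name else ""
        subtasks.append(Subtask(EXTENSIONS.get(suffix, "document"), name))
    subtasks.append(Subtask("text", query))
    return subtasks


def supervise(query: str, attachments: list[str]) -> str:
    """Delegate each subtask to its tool and join the partial results."""
    results = [TOOLS[task.modality](task.payload) for task in decompose(query, attachments)]
    return "; ".join(results)


print(supervise("What does the speaker say about the chart?", ["talk.wav", "chart.png"]))
```

Keeping routing in one place means a new modality only needs a new entry in the tool table, rather than a new branch in a fixed decision tree.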

Keywords: agentic AI framework, autonomous multimodal query processing, tool orchestration, modality decomposition, RouteLLM, SLM-assisted, adaptive routing, centralized Supervisor

6. ✅ CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	15.0/10	15.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

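The paper addresses the difficulty of generalizing cross-domain semantic segmentation on synthetic aperture radar (SAR) imagery, where sensors and regions differ. It proposes CrossEarth-SAR, the first billion-scale SAR vision foundation model, built on a novel physics-guided sparse mixture-of-experts (MoE) architecture and pre-trained on a purpose-built large-scale dataset, and reports state-of-the-art results on multiple cross-domain benchmarks.

Abstract (Chinese translation)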

合成孔径雷达（SAR）能够实现全球、全天候的地球观测。然而，由于成像机制的多样性，不同传感器和区域之间的域偏移严重阻碍了其语义泛化能力。为解决这一问题，我们提出了CrossEarth-SAR，这是首个基于新型物理引导稀疏专家混合（MoE）架构构建的十亿级SAR视觉基础模型，该架构融合了物理描述符，并专为跨域语义分割而设计。为促进大规模预训练，我们开发了CrossEarth-SAR-200K数据集，这是一个包含公开和私有SAR影像的弱监督与全监督统一数据集。我们还引入了一套基准测试集，涵盖8个不同域差距下的22个子基准，为SAR影像的域泛化语义分割建立了首个统一标准。大量实验表明，CrossEarth-SAR在20个基准测试中取得了最先进的结果，在多差距迁移场景下的部分基准上，其平均交并比（mIoU）超越先前方法超过10%。所有代码、基准测试集和数据集都将公开提供。

Abstract

Synthetic Aperture Radar (SAR) enables global, all-weather earth observation. However, owing to diverse imaging mechanisms, domain shifts across sensors and regions severely hinder its semantic generalization. To address this, we present CrossEarth-SAR, the first billion-scale SAR vision foundation model built upon a novel physics-guided sparse mixture-of-experts (MoE) architecture incorporating physical descriptors, explicitly designed for cross-domain semantic segmentation. To facilitate large-scale pre-training, we develop CrossEarth-SAR-200K, a weakly and fully supervised dataset that unifies public and private SAR imagery. We also introduce a benchmark suite comprising 22 sub-benchmarks across 8 distinct domain gaps, establishing the first unified standard for domain generalization semantic segmentation on SAR imagery. Extensive experiments demonstrate that CrossEarth-SAR achieves state-of-the-art results on 20 benchmarks, surpassing previous methods by over 10% mIoU on some benchmarks under multi-gap transfer. All code, benchmark and datasets will be publicly available.
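For readers unfamiliar with the metric behind the "10% mIoU" figure, the sketch below computes mean intersection-over-union on flat label lists. It follows the common definition (per-class IoU averaged over classes that appear in the prediction or the target); the paper's exact evaluation protocol may differ, and the labels are made up.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class intersection-over-union over classes that occur."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Made-up labels for six pixels and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(f"mIoU = {mean_iou(pred, target, num_classes=3):.2f}")
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.50.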

Keywords: Synthetic Aperture Radar (SAR), Foundation Model, Mixture of Experts (MoE), Domain Generalization, Semantic Segmentation, Large-scale Pre-training, Cross-domain, Geospatial

7. ✅ On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

Authors: Deyu Zou, Yongqiang Chen, Fan Feng, Mufei Li, Pan Li, Yu Gong, James Cheng Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12109v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

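This paper studies "information self-locking" in LLM agents trained with reinforcement learning for active reasoning: the agent stops asking informative questions and struggles to internalize information it has already obtained. The authors reallocate the learning signal by injecting directional critiques, which significantly mitigates the problem and yields improvements of up to 60% across 7 datasets.

Abstract (Chinese translation)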

基于结果奖励的强化学习（RL）在训练大语言模型（LLM）智能体执行复杂推理任务方面已取得显著成功。然而，在主动推理场景中，智能体需要策略性地提出问题以获取任务相关信息，我们发现通过RL训练的LLM智能体常受困于信息自锁现象：智能体停止提出信息丰富的问题，且难以内化已获得的信息。为理解这一现象，我们将主动推理分解为两个核心能力：行动选择（Action Selection, AS），即通过查询决定观察流；以及信念追踪（Belief Tracking, BT），即基于收集到的证据更新智能体的信念。我们证明，AS与BT能力的不足会限制RL训练期间的信息探索。此外，探索不足反过来又会阻碍AS与BT能力的提升，形成一个反馈循环，将智能体锁定在低信息状态中。为解决此问题，我们提出一种简单而有效的方法，通过注入易于获取的方向性评判来重新分配学习信号，以帮助智能体摆脱自锁状态。在7个数据集上的大量实验表明，我们的方法显著缓解了信息自锁问题，带来了最高达60%的性能提升。

摘要 (Abstract)

Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning where agents need to strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. To understand the phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent’s belief based on collected evidence. We show that deficient AS and BT capabilities will limit the information exploration during RL training. Furthermore, insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To resolve the issue, we propose a simple yet effective approach that reallocates the learning signal by injecting easy-to-obtain directional critiques to help the agent escape self-locking. Extensive experiments with 7 datasets show that our approach significantly mitigates the information self-locking, bringing up to 60% improvements.
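The Action Selection / Belief Tracking split described in the abstract can be illustrated with a toy yes/no guessing game. This is only a sketch under stated assumptions, not the paper's method: the Bayesian update, the expected-information-gain criterion, and every name below (`update_belief`, `select_question`, the animal hypotheses) are hypothetical choices made for illustration.

```python
import math

def entropy(belief):
    """Shannon entropy in bits of a dict mapping hypothesis -> probability."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def update_belief(belief, likelihood):
    """Belief Tracking (toy): Bayes update with P(answer | hypothesis)."""
    unnorm = {h: p * likelihood[h] for h, p in belief.items()}
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("answer is impossible under every hypothesis")
    return {h: v / total for h, v in unnorm.items()}

def expected_information_gain(belief, p_yes_given_h):
    """Expected entropy reduction from asking one yes/no question."""
    gain = entropy(belief)
    p_yes = sum(belief[h] * p_yes_given_h[h] for h in belief)
    p_no_given_h = {h: 1 - q for h, q in p_yes_given_h.items()}
    for p_answer, lik in ((p_yes, p_yes_given_h), (1 - p_yes, p_no_given_h)):
        if p_answer > 0:
            gain -= p_answer * entropy(update_belief(belief, lik))
    return gain

def select_question(belief, questions):
    """Action Selection (toy): ask the question with the largest expected gain."""
    return max(questions, key=lambda q: expected_information_gain(belief, questions[q]))

# Hypothetical task: four equally likely hidden answers, two candidate questions.
belief = {"cat": 0.25, "dog": 0.25, "eagle": 0.25, "shark": 0.25}
questions = {
    "Is it a mammal?": {"cat": 1.0, "dog": 1.0, "eagle": 0.0, "shark": 0.0},
    "Is it an animal?": {"cat": 1.0, "dog": 1.0, "eagle": 1.0, "shark": 1.0},
}
best = select_question(belief, questions)
posterior = update_belief(belief, questions[best])  # belief after a "yes" answer
```

In this toy setting, the second question carries no information, so an agent that keeps asking it never shrinks its uncertainty, which loosely mirrors the low-information regime the abstract describes.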

Keywords: Large Language Model agents, Reinforcement Learning, Active Reasoning, Information Self-Locking, Action Selection, Belief Tracking, Exploration, Directional Critiques

8. ✅ AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

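This paper addresses the sharp increase in inference latency caused by dynamic adapters (which combine MoE and LoRA) in large language models. The proposed AdaFuse framework matches the accuracy of state-of-the-art dynamic adapters while cutting decoding latency by a factor of more than 2.4.

Abstract (Chinese translation)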

将动态稀疏结构（如混合专家模型，MoE）与参数高效适配器（例如低秩自适应，LoRA）相结合，是增强大语言模型（LLM）能力的一项强大技术。然而，这种架构改进带来了高昂代价：尽管计算负载增加甚微，推理延迟却常常急剧上升，导致解码速度降低超过2.5倍。通过细粒度性能分析，我们发现主要瓶颈并非在于计算本身，而在于传统动态路由所需的大量零散、顺序执行的CUDA内核启动所带来的严重开销。为应对这一挑战，我们提出了AdaFuse框架，该框架基于算法与底层硬件系统的紧密协同设计，以实现高效的动态适配器执行。AdaFuse摒弃了传统的逐层或逐块路由策略，转而采用一种令牌级预门控策略，即在处理每个令牌之前，为其所有适配器层做出一次全局路由决策。这种“一次决策，处处应用”的方法有效地静态化了每个令牌的执行路径，从而为整体优化创造了条件。我们充分利用这一点，开发了一个定制的CUDA内核，该内核执行融合切换操作，将所选所有LoRA适配器的参数在一次高效传递中合并到骨干模型中。在主流开源大语言模型上的实验结果表明，AdaFuse在达到与先进动态适配器相当精度的同时，将解码延迟大幅降低了超过2.4倍，从而弥合了模型能力与推理效率之间的差距。

摘要 (Abstract)

The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a token-level pre-gating strategy, which makes a single, global routing decision for all adapter layers before a token is processed. This “decide-once, apply-everywhere” approach effectively staticizes the execution path for each token, creating an opportunity for holistic optimization. We capitalize on this by developing a custom CUDA kernel that performs a fused switching operation, merging the parameters of all selected LoRA adapters into the backbone model in a single, efficient pass. Experimental results on popular open-source LLMs show that AdaFuse achieves accuracy on par with state-of-the-art dynamic adapters while drastically cutting decoding latency by a factor of over 2.4x, thereby bridging the gap between model capability and inference efficiency.
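The "decide-once, apply-everywhere" routing described in the abstract can be sketched in a few lines. This is a toy illustration, not the AdaFuse implementation: the real system uses a custom fused CUDA kernel, whereas the code below is plain Python, and the linear gate, the function names (`pre_gate`, `merged_weights`), the tiny matrices, and the omission of the usual LoRA scaling factor are all assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def pre_gate(token_features, gate_weights, top_k=1):
    """Score every adapter once for this token and return the top-k adapter ids."""
    scores = [sum(f * w for f, w in zip(token_features, row)) for row in gate_weights]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

def merged_weights(backbone, adapters, chosen):
    """Fold the chosen low-rank updates (B @ A) into every layer's weight."""
    merged = []
    for layer, w in enumerate(backbone):
        for idx in chosen:
            b, a = adapters[idx][layer]
            w = add(w, matmul(b, a))
        merged.append(w)
    return merged

# Hypothetical model: two layers with 2x2 identity weights, two rank-1 adapters.
identity = [[1.0, 0.0], [0.0, 1.0]]
backbone = [identity, identity]
adapters = [
    [([[1.0], [0.0]], [[0.0, 1.0]]), ([[0.0], [1.0]], [[1.0, 0.0]])],  # adapter 0
    [([[1.0], [1.0]], [[1.0, 1.0]]), ([[2.0], [0.0]], [[1.0, 0.0]])],  # adapter 1
]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]  # one gate row per adapter

token = [0.2, 0.9]
chosen = pre_gate(token, gate_weights)               # one routing decision per token
merged = merged_weights(backbone, adapters, chosen)  # reused by every layer
```

The point of the sketch is the control flow: routing runs once per token rather than once per layer, so the per-layer work no longer depends on a fresh gating decision.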

Keywords: AdaFuse, dynamic adapters, Mixture-of-Experts, LoRA, Large Language Models, inference acceleration, CUDA kernel optimization, token-level pre-gating

9. ✅ Scaling Laws for Educational AI Agents

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	10.0/10	10.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

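This paper studies the scaling behavior of LLM-based educational agents. It proposes the Agent Scaling Law framework and the AgentProfile mechanism, and reports empirical observations from the EduClaw platform suggesting that educational agent performance scales predictably with the structural richness of the profile.

Abstract (Chinese translation)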

尽管大型语言模型（LLM）在模型参数量、训练数据量和计算资源方面的缩放规律已得到广泛研究，但基于LLM的教育智能体的缩放行为仍未被探索。我们认为，教育智能体的能力提升不仅依赖于底层模型规模，更应通过一系列结构化维度实现，我们将其统称为智能体缩放定律：角色定义清晰度、技能深度、工具完备性、运行时能力以及教育专家知识注入。该框架的核心是AgentProfile——一种基于JSON的结构化规范，它作为实现教育智能体能力系统性增长的机制。我们提出了EduClaw，一个基于配置驱动的多智能体平台，该平台实践了此缩放定律，并通过构建和部署330多个涵盖K-12学科、包含1100多个技能模块的教育智能体配置，验证了其有效性。我们的实证观察表明，教育智能体的性能可随配置结构丰富度实现可预测的缩放。我们提出了两个互补的缩放轴——工具缩放与技能缩放——作为未来发展方向，并指出，实现更强大教育人工智能的路径不仅在于使用更大规模的模型，更在于构建更强大的结构化能力系统。

摘要 (Abstract)

While scaling laws for Large Language Models (LLMs) have been extensively studied along dimensions of model parameters, training data, and compute, the scaling behavior of LLM-based educational agents remains unexplored. We propose that educational agent capability scales not merely with the underlying model size, but through structured dimensions that we collectively term the Agent Scaling Law: role definition clarity, skill depth, tool completeness, runtime capability, and educator expertise injection. Central to this framework is AgentProfile, a structured JSON-based specification that serves as the mechanism enabling systematic capability growth of educational agents. We present EduClaw, a profile-driven multi-agent platform that operationalizes this scaling law, demonstrating its effectiveness through the construction and deployment of 330+ educational agent profiles encompassing 1,100+ skill modules across K-12 subjects. Our empirical observations suggest that educational agent performance scales predictably with profile structural richness. We identify two complementary scaling axes – Tool Scaling and Skill Scaling – as future directions, arguing that the path to more capable educational AI lies not solely in larger models, but in stronger structured capability systems.

Keywords: Scaling Laws, Large Language Models, Educational Agents, Agent Scaling Law, AgentProfile, Multi-agent Platform, Tool Scaling, Skill Scaling

10. ✅ Long-Context Encoder Models for Polish Language Understanding

Score: 33.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	15.0/10	15.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

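This paper develops a Polish encoder model that handles contexts of up to 8192 tokens. Trained with positional embedding adaptation followed by continued pre-training, it significantly outperforms competing models on long-document understanding tasks.

Abstract (Chinese translation)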

尽管仅解码器架构的大语言模型（LLM）近期主导了自然语言处理领域，但仅编码器架构在判别性任务中仍是成本效益高且参数效率优良的标准方案。然而，诸如BERT等经典编码器受限于较短的上下文窗口，难以处理长文档。本文针对波兰语模型解决了这一限制，引入了一个能够处理长达8192个词元序列的高质量波兰语模型。该模型通过采用两阶段训练流程开发，包括位置嵌入适配和全参数持续预训练。此外，我们提出了通过知识蒸馏训练的压缩模型变体。这些模型在25项任务上进行了评估，包括KLEJ基准测试、新引入的金融任务集（FinBench）以及其他分类与回归任务，特别是需要长文档理解的任务。结果表明，我们的模型在波兰语及多语言模型中取得了最佳平均性能，在长上下文任务上显著优于竞争方案，同时在短文本处理上保持了相当的质量。

摘要 (Abstract)

While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for Polish by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.
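The abstract does not say how the positional embeddings are adapted. One generic option for learned absolute position tables is to stretch the short table to the longer length by linear interpolation before continuing pre-training. The sketch below shows only that generic idea; the function name `stretch_position_embeddings` and the toy 3-position table are hypothetical, and the paper may well use a different scheme.

```python
def stretch_position_embeddings(table, new_len):
    """Linearly interpolate a [old_len][dim] table to [new_len][dim]."""
    old_len = len(table)
    if old_len < 2 or new_len < 2:
        raise ValueError("need at least two positions on both sides")
    stretched = []
    for i in range(new_len):
        # Fractional index of new position i inside the old table.
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        stretched.append(
            [(1 - frac) * a + frac * b for a, b in zip(table[lo], table[hi])]
        )
    return stretched

# Toy example: stretch 3 positions of dimension 2 to 5 positions.
short_table = [[0.0, 10.0], [1.0, 8.0], [2.0, 6.0]]
long_table = stretch_position_embeddings(short_table, 5)
```

In a real model the same call would map, for example, a 512-row table to 8192 rows. The endpoints are preserved and intermediate rows are blends of their neighbours, which gives continued pre-training a smooth starting point.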

Keywords: encoder-only models, long-context processing, Polish language understanding, positional embedding adaptation, continuous pre-training, knowledge distillation, KLEJ benchmark, FinBench

11. ✅ Automating Skill Acquisition through Large-Scale Mining of Open-Source Agentic Repositories: A Framework for Multi-Agent Procedural Knowledge Extraction

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

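This paper proposes a framework that mines open-source agentic repositories to automatically acquire high-quality skills, augmenting LLM capabilities without retraining the model. It reports that agent-generated educational content can achieve 40% gains in knowledge transfer efficiency.

Abstract (Chinese translation)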

从单体式大语言模型向模块化、技能化智能体的转变，代表了人工智能部署领域一次根本性的架构转型。尽管通用模型在陈述性知识上展现出卓越的广度，但其在自主工作流中的应用常受限于专业程序性知识的不足。本报告研究了一种通过挖掘GitHub等平台上的开源仓库，以自动化方式获取高质量智能体技能的系统性框架。我们聚焦于从包括TheoremExplainAgent和Code2Video在内的先进系统中提取可视化与教育能力，这两者均利用了Manim数学动画引擎。该框架涵盖仓库结构分析、通过密集检索进行语义技能识别，以及向标准化SKILL.md格式的转换。我们证明，通过对智能体仓库进行系统性提取，并结合严格的安全治理与多维评估指标，能够实现程序性知识的规模化获取，从而增强大语言模型的能力，而无需进行模型再训练。我们的分析表明，智能体生成的教育内容在知识传递效率上可实现40%的提升，同时其教学品质可与人工编写的教程相媲美。

摘要 (Abstract)

The transition from monolithic large language models (LLMs) to modular, skill-equipped agents represents a fundamental architectural shift in artificial intelligence deployment. While general-purpose models demonstrate remarkable breadth in declarative knowledge, their utility in autonomous workflows is frequently constrained by insufficient specialized procedural expertise. This report investigates a systematic framework for automated acquisition of high-quality agent skills through mining of open-source repositories on platforms such as GitHub. We focus on the extraction of visualization and educational capabilities from state-of-the-art systems including TheoremExplainAgent and Code2Video, both utilizing the Manim mathematical animation engine. The framework encompasses repository structural analysis, semantic skill identification through dense retrieval, and translation to the standardized SKILL.md format. We demonstrate that systematic extraction from agentic repositories, combined with rigorous security governance and multi-dimensional evaluation metrics, enables scalable acquisition of procedural knowledge that augments LLM capabilities without requiring model retraining. Our analysis reveals that agent-generated educational content can achieve 40% gains in knowledge transfer efficiency while maintaining pedagogical quality comparable to human-crafted tutorials.

Keywords: Large Language Models, LLM Agents, Skill Acquisition, Open-Source Mining, Procedural Knowledge, Multi-Agent Systems, Agentic Workflows, Automated Framework

12. ✅ From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration

Authors: Gaole He, Brian Y. Lim Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11677v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

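This paper argues that current interaction between humans and LLM-based autonomous agents is pointwise and reactive, leaving users without foresight. It proposes simulation-in-the-loop, an interaction paradigm that lets users explore simulated future trajectories before committing to decisions, enabling more effective human-agent collaboration.

Abstract (Chinese translation)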

大型语言模型（LLM）正日益被用于驱动自主智能体执行复杂的多步骤任务。然而，当前的人机交互仍然是点状且反应式的：用户通过批准或纠正单个操作来规避即时风险，却无法预知后续影响。这迫使用户必须在脑海中模拟长期效应，这一过程不仅认知负荷高，且往往不够准确。用户虽能控制单个步骤，却缺乏做出知情决策所需的预见能力。我们认为，有效的协作需要预见性，而不仅仅是控制权。为此，我们提出“仿真循环内嵌”这一交互范式，使用户与智能体能够在做出最终决策前，共同探索模拟的未来轨迹。仿真将干预行为从被动的猜测转变为有依据的探索，同时帮助用户在过程中发现潜在的约束条件与偏好。本视角论文分析了现有范式的局限性，提出了基于仿真的协作概念框架，并通过具体的人机协作场景阐述了其潜在价值。

摘要 (Abstract)

Large Language Models (LLMs) are increasingly used to power autonomous agents for complex, multi-step tasks. However, human-agent interaction remains pointwise and reactive: users approve or correct individual actions to mitigate immediate risks, without visibility into subsequent consequences. This forces users to mentally simulate long-term effects, a cognitively demanding and often inaccurate process. Users have control over individual steps but lack the foresight to make informed decisions. We argue that effective collaboration requires foresight, not just control. We propose simulation-in-the-loop, an interaction paradigm that enables users and agents to explore simulated future trajectories before committing to decisions. Simulation transforms intervention from reactive guesswork into informed exploration, while helping users discover latent constraints and preferences along the way. This perspective paper characterizes the limitations of current paradigms, introduces a conceptual framework for simulation-based collaboration, and illustrates its potential through concrete human-agent collaboration scenarios.

Keywords: Large Language Models, LLMs, autonomous agents, human-agent collaboration, simulation, foresight, multi-step tasks, interaction paradigm

13. ❌ Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions

Authors: Tae-Eun Song Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12123v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

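Scoring rationale: The paper's core contribution is Cross-Context Review (CCR), a method that improves LLM output quality by separating production and review into different sessions. It is therefore highly relevant to "Large Language Models" (10 points), since it directly studies how LLMs review their own output. Its central mechanism, having an LLM review its own output in an independent session, falls squarely under "Self-Correction" or "Self-Improvement" (10 points). By reducing errors it indirectly improves factuality, so it is partially relevant to "Hallucination Mitigation" (5 points). Other keywords such as MoE, SFT, RAG, and Agents are not covered by the paper and score 0.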
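The pass/fail outcome for this paper can be reproduced from the report's own configuration (pass score = 5.0 + 0.8 × total weight, with a total weight of 27.0). The sketch below is a hedged reconstruction: the score tables are consistent with "score = weight × relevance", but that formula, the `>=` comparison, and the function names are assumptions, not the generator's actual code.

```python
def pass_threshold(total_weight, base=5.0, per_weight=0.8):
    """Pass score as stated in the report configuration."""
    return base + per_weight * total_weight

def paper_score(rows):
    """Sum weight * relevance over (weight, relevance) pairs (assumed formula)."""
    return sum(weight * relevance for weight, relevance in rows)

# Nonzero rows for the Cross-Context Review paper, taken from its score table:
# LLMs (10), Self-Correction (10), Hallucination Mitigation (5).
ccr_rows = [(1.0, 10.0), (1.0, 10.0), (1.0, 5.0)]

threshold = pass_threshold(27.0)
score = paper_score(ccr_rows)
passed = score >= threshold  # assumed comparison; 25.0 falls short either way
```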

!!! tip deepseek-chat TL;DR

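This paper finds that when a large language model (LLM) reviews its own output in a fresh session separate from the one that produced it (Cross-Context Review), it detects errors significantly better than same-session self-review, which improves output quality.

Abstract (Chinese translation)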

大语言模型在生成输出的同一会话中进行自我审查时，往往难以有效识别自身错误。本文提出跨上下文审查（Cross-Context Review, CCR）方法，其核心机制是在全新会话中开展审查工作，且不接触原始生成对话的历史记录。我们设计了一项对照实验：选取30件人工制品（含代码、技术文档、演示脚本）并注入150个错误，在四种审查条件下进行测试——同会话自我审查（SR）、重复自我审查（SR2）、上下文感知子代理审查（SA）以及跨上下文审查（CCR）。经过360次审查实验，CCR的F1分数达到28.6%，显著优于SR（24.6%，p=0.008，d=0.52）、SR2（21.7%，p<0.001，d=0.72）和SA（23.8%，p=0.004，d=0.57）。其中SR2的结果对解读最具意义：在同一会话中重复审查并未超越单次审查效果（p=0.11），这排除了重复性因素对CCR优势的解释。CCR的效能提升源于上下文分离机制本身。该方法兼容任意模型，无需额外基础设施，仅需增加一次会话成本。

摘要 (Abstract)

Large language models struggle to catch errors in their own outputs when the review happens in the same session that produced them. This paper introduces Cross-Context Review (CCR), a straightforward method where the review is conducted in a fresh session with no access to the production conversation history. We ran a controlled experiment: 30 artifacts (code, technical documents, presentation scripts) with 150 injected errors, tested under four review conditions – same-session Self-Review (SR), repeated Self-Review (SR2), context-aware Subagent Review (SA), and Cross-Context Review (CCR). Over 360 reviews, CCR reached an F1 of 28.6%, outperforming SR (24.6%, p=0.008, d=0.52), SR2 (21.7%, p<0.001, d=0.72), and SA (23.8%, p=0.004, d=0.57). The SR2 result matters most for interpretation: reviewing twice in the same session did not beat reviewing once (p=0.11), which rules out repetition as an explanation for CCR’s advantage. The benefit comes from context separation itself. CCR works with any model, needs no infrastructure, and costs only one extra session.
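The F1 figures quoted above score a review by how well its flagged issues match the injected errors. A minimal sketch of that metric follows. The error identifiers and the reviewer output are made-up illustration data, not results from the paper, and matching by exact identifier is an assumption (the paper's matching procedure is not described in the abstract).

```python
def detection_f1(injected, flagged):
    """F1 of a review, given sets of injected-error ids and flagged ids."""
    true_pos = len(injected & flagged)    # injected errors the review caught
    false_pos = len(flagged - injected)   # flagged items that were not injected
    false_neg = len(injected - flagged)   # injected errors the review missed
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0

# Hypothetical artifact with five injected errors and a review that flags three items.
injected_errors = {"E1", "E2", "E3", "E4", "E5"}
review_flags = {"E2", "E5", "X9"}
f1 = detection_f1(injected_errors, review_flags)
```

Because F1 balances precision and recall, a reviewer cannot raise it simply by flagging everything, nor by flagging almost nothing.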

Keywords: Cross-Context Review, LLM, Self-Review, Error Detection, Output Quality, Context Separation, Large Language Models, Controlled Experiment

14. ❌ PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents

Authors: Minjia Wang, Yunfeng Wang, Xiao Ma, Dexin Lv, Qifan Guo, Lynn Zheng, Benliang Wang, Lei Wang, Jiannan Li, Yongwei Xing, David Xu, Zheng Sun Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11955v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

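Scoring rationale: The paper's core is synthesizing digital footprints with LLM agents, so it is highly relevant to "Large Language Models" and "LLM Agents" (10 points each). It mentions fine-tuning models on the synthetic data, which gives partial relevance to "Post-training" (5 points). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and CoT are not covered and score 0.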

!!! tip deepseek-chat TL;DR

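This paper proposes a method that uses LLM agents to generate realistic digital footprints (such as emails and messages) from structured user profiles. The synthetic data is more diverse and realistic than existing baselines, and models fine-tuned on it outperform models trained on other synthetic datasets on real-world out-of-distribution tasks.

Abstract (Chinese translation)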

数字足迹（个体与数字系统交互的记录）对于行为研究、个性化应用开发和机器学习模型训练至关重要。然而，该领域的研究常因缺乏多样化且易于获取的数据而受到限制。为突破这一局限，我们提出一种利用大语言模型（LLM）智能体合成逼真数字足迹的新方法。该方法以结构化用户画像为起点，生成多样化且合理的用户事件序列，最终产出相应的数字产物，如电子邮件、消息、日历条目、提醒事项等。内在评估结果表明，生成的数据集相较于现有基线具有更高的多样性和真实性。此外，在真实世界分布外任务评估中，基于我们合成数据微调的模型表现优于使用其他合成数据集训练的模型。

摘要 (Abstract)

Digital footprints (records of individuals’ interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, reminders, etc. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.

Keywords: LLM agents, digital footprints, synthetic data, user profiles, fine-tuning, realistic generation, behavior simulation, out-of-distribution tasks

15. ❌ Language Generation with Replay: A Learning-Theoretic View of Model Collapse

Authors: Giorgio Racca, Michal Valko, Amartya Sanyal Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11784v1

Score: 23.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	8.0/10	8.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

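Scoring rationale: The paper studies model collapse, which arises when large language models (LLMs) are trained on data mixed with their own generated content (replay); this is highly relevant to “Large Language Models” (10 points). It explicitly notes that scaling laws drive the growing data requirements of LLM training, so that publicly available online text is being consumed, which relates to “Scaling Laws” AND “Data Quality” (8 points). It discusses training pipelines and data cleaning, which is partially relevant to “Pre-training” (5 points). Other keywords such as MoE, SLMs, SFT, RAG, and inference acceleration are not addressed in the paper and score 0.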

!!! tip deepseek-chat TL;DR

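From a learning-theoretic perspective, the paper studies the model collapse that occurs when large language models are trained on data containing their own generated content. It proves that replay imposes fundamental limitations under certain notions of generation, and it clarifies both the theoretical basis and the limits of practical measures such as data cleaning and watermarking.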

Abstract (Chinese translation)

随着扩展定律推动前沿大语言模型（LLM）的训练对数据量的需求不断增长，训练流程正逐渐接近一个临界点：大部分公开可用的在线文本可能被耗尽。与此同时，大语言模型的广泛使用增加了网络上机器生成内容的数量；这些趋势共同提高了生成文本重新进入未来训练语料库的可能性，从而增加了通常被称为模型崩溃的性能退化风险。在实践中，模型开发者通过数据清洗、水印技术、合成数据策略，或在某些情况下采取放任态度来应对这一问题。然而，生成模型中模型崩溃的问题尚未从学习理论的角度得到审视：我们通过极限框架下的语言生成这一理论视角来研究它，引入一个重放对抗者，该对抗者将生成器自身过去的输出增补到示例流中。我们的主要贡献是从学习理论角度对重放何时从根本上限制生成进行了细粒度刻画：虽然重放对于最强的均匀生成概念是无害的，但它在理论上会对较弱的非均匀生成和极限生成概念造成分离。有趣的是，我们的正面结果反映了实践中广泛使用的启发式方法，如数据清洗、水印和输出过滤，而我们的分离结果则揭示了这些方法可能失效的情形。

Abstract

As scaling laws push the training of frontier large language models (LLMs) toward ever-growing data requirements, training pipelines are approaching a regime where much of the publicly available online text may be consumed. At the same time, widespread LLM usage increases the volume of machine-generated content on the web; together, these trends raise the likelihood of generated text re-entering future training corpora, increasing the associated risk of performance degradation often called model collapse. In practice, model developers address this concern through data cleaning, watermarking, synthetic-data policies, or, in some cases, blissful ignorance. However, the problem of model collapse in generative models has not been examined from a learning-theoretic perspective: we study it through the theoretical lens of the language generation in the limit framework, introducing a replay adversary that augments the example stream with the generator’s own past outputs. Our main contribution is a fine-grained learning-theoretic characterization of when replay fundamentally limits generation: while replay is benign for the strongest notion of uniform generation, it provably creates separations for the weaker notions of non-uniform generation and generation in the limit. Interestingly, our positive results mirror heuristics widely used in practice, such as data cleaning, watermarking, and output filtering, while our separations show when these ideas can fail.

Keywords: large language models, model collapse, scaling laws, training data, replay, learning theory, generative models, data cleaning

16. ❌ Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems

Authors: Sarbartha Banerjee, Prateek Sahu, Anjo Vahldiek-Oberwagner, Jose Sanchez Vicarte, Mohit Tiwari Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12023v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

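Scoring rationale: The paper studies how traditional software and hardware vulnerabilities combine with LLM-specific attacks in compound AI systems. Its core concerns LLMs and LLM agents (compound AI systems include LLM agents), so both keywords are highly relevant (10 points each). The other keywords cover model architecture, training methods, inference optimization, and specific application domains, none of which the paper discusses directly; they all score 0.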

!!! tip deepseek-chat TL;DR

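The paper studies how traditional software and hardware vulnerabilities in compound AI systems can combine with LLM-specific algorithmic attacks. Through two novel attack case studies, it shows that such combined attacks can break AI safety and confidentiality, and it systematizes the attack primitives and the attack lifecycle.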

Abstract (Chinese translation)

生成式人工智能的快速发展催生了复合人工智能系统——即由多个大语言模型（LLM）、软件工具和数据库系统构成的流水线。复合人工智能系统构建在运行于分布式硬件基础设施的分层传统软件栈之上。其中众多不同的软件组件普遍存在通用漏洞披露（CVE）数据库中记载的传统安全缺陷，而底层的分布式硬件基础设施则持续面临时序攻击、比特翻转故障和基于功耗的侧信道攻击威胁。当前研究主要聚焦于大语言模型特有的风险，如模型提取、训练数据泄露和不安全内容生成，却忽视了传统系统漏洞可能产生的影响。

本研究探讨了传统软件与硬件漏洞如何与大语言模型特有的算法攻击相结合，从而破坏复合人工智能流水线的完整性。我们展示了两种结合系统层漏洞与算法弱点的创新攻击方式：（1）利用软件代码注入漏洞配合防护机制绕过的Rowhammer攻击，将未经篡改的越狱提示词注入大语言模型，导致人工智能安全防护失效；（2）通过操纵知识数据库，诱导大语言模型智能体将敏感用户数据传输至恶意应用程序，从而破坏数据保密性。这些攻击凸显了解决传统漏洞的必要性；我们通过按攻击目标对漏洞进行分类并将其映射到攻击生命周期的不同阶段，系统化梳理了攻击原语并分析了它们的组合方式。该方法支持开展严格的红队测试，并为未来防御策略的制定奠定了基础。

Abstract

Rapid progress in generative AI has given rise to Compound AI systems - pipelines comprised of multiple large language models (LLM), software tools and database systems. Compound AI systems are constructed on a layered traditional software stack running on a distributed hardware infrastructure. Many of the diverse software components are vulnerable to traditional security flaws documented in the Common Vulnerabilities and Exposures (CVE) database, while the underlying distributed hardware infrastructure remains exposed to timing attacks, bit-flip faults, and power-based side channels. Today, research targets LLM-specific risks like model extraction, training data leakage, and unsafe generation – overlooking the impact of traditional system vulnerabilities. This work investigates how traditional software and hardware vulnerabilities can complement LLM-specific algorithmic attacks to compromise the integrity of a compound AI pipeline. We demonstrate two novel attacks that combine system-level vulnerabilities with algorithmic weaknesses: (1) Exploiting a software code injection flaw along with a guardrail Rowhammer attack to inject an unaltered jailbreak prompt into an LLM, resulting in an AI safety violation, and (2) Manipulating a knowledge database to redirect an LLM agent to transmit sensitive user data to a malicious application, thus breaching confidentiality. These attacks highlight the need to address traditional vulnerabilities; we systematize the attack primitives and analyze their composition by grouping vulnerabilities by their objective and mapping them to distinct stages of an attack lifecycle. This approach enables a rigorous red-teaming exercise and lays the groundwork for future defense strategies.

Keywords: Compound AI systems, Large Language Models, LLM agents, security vulnerabilities, adversarial attacks, jailbreak prompt, confidentiality breach, attack lifecycle

17. ❌ Streaming Translation and Transcription Through Speech-to-Text Causal Alignment

Authors: Roman Koshkin, Jeon Haesung, Lianbo Liu, Hao Shi, Mengjie Zhao, Yusuke Fujita, Yui Sudo Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11578v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

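Scoring rationale: The paper proposes Hikari, a model for end-to-end simultaneous speech-to-text translation and streaming transcription, which is an application of large models to a specific task (speech translation). Core relevance: (1) “Post-training” OR “Supervised Fine-tuning” OR “SFT” (10 points): the abstract explicitly mentions a “supervised fine-tuning strategy” that trains the model to recover from delays, and this is a core method. (2) “Large Language Models” OR “LLMs” OR “Foundation Models” (5 points): the paper concerns an end-to-end model that may be based on a large-model architecture, but it does not explicitly mention LLMs, hence a medium score. (3) “Instruction Tuning” OR “Alignment” OR “Value Alignment” (5 points): the model achieves “causal alignment” by encoding READ/WRITE decisions, which is related to the concept of alignment. Other keywords such as MoE, Scaling Laws, and RAG do not appear in the abstract and score 0. Weighted total: 10 × 1.0 + 5 × 1.0 + 5 × 1.0 = 20.0.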
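
The weighted total quoted in the rationale above can be reproduced with a few lines of Python. This is a minimal sketch, not the report generator's actual code: the function names and the 24 placeholder keyword labels are hypothetical, while the weights (1.0 each across 27 keywords), the relevance values (10, 5, 5), and the pass-mark formula (5.0 + 0.8 × total weight) are taken from the report's configuration and the table above.

```python
from typing import Dict


def weighted_total(relevance: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum weight * relevance over all keywords; unlisted relevance counts as 0."""
    return sum(w * relevance.get(name, 0.0) for name, w in weights.items())


def pass_mark(weights: Dict[str, float], base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark from the report configuration: base + factor * total weight."""
    # Rounded to one decimal to avoid floating-point noise in the comparison.
    return round(base + factor * sum(weights.values()), 1)


# Non-zero relevance values from the table above (keyword labels abbreviated).
relevance = {
    "Post-training / SFT": 10.0,
    "Large Language Models": 5.0,
    "Instruction Tuning / Alignment": 5.0,
}

# All 27 keywords carry weight 1.0; the other 24 labels are hypothetical placeholders.
weights = {name: 1.0 for name in relevance}
weights.update({f"other-keyword-{i}": 1.0 for i in range(24)})

total = weighted_total(relevance, weights)  # 10*1.0 + 5*1.0 + 5*1.0
threshold = pass_mark(weights)              # 5.0 + 0.8 * 27
passed = total >= threshold
```

With these inputs the total is 20.0 against a pass mark of 26.6, which is consistent with the ❌ shown for this paper.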

!!! tip deepseek-chat TL;DR

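The paper addresses simultaneous speech-to-text translation and streaming transcription. By introducing the Hikari model, the Decoder Time Dilation mechanism, and a supervised fine-tuning strategy, it achieves new state-of-the-art BLEU scores on several language pairs and significantly improves the quality-latency trade-off.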

Abstract (Chinese translation)

同步机器翻译（Simultaneous Machine Translation, SiMT）传统上依赖于离线机器翻译模型结合人工设计的启发式规则或学习策略。我们提出Hikari模型，这是一种无需策略、完全端到端的系统，通过将读/写决策编码为概率化的等待令牌机制，实现同步语音到文本翻译与流式转录。我们还引入了解码器时间膨胀机制，以减少自回归开销并确保均衡的训练分布。此外，我们提出一种监督微调策略，训练模型从延迟中恢复，显著优化了质量与延迟的权衡关系。在英语到日语、德语和俄语的测试中，Hikari在低延迟与高延迟场景下均取得了最新的最优BLEU分数，超越了近期基线模型。

Abstract

Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, German, and Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.

Keywords: simultaneous machine translation, speech-to-text translation, streaming transcription, end-to-end model, supervised fine-tuning, quality-latency trade-off, BLEU scores, Hikari

18. ❌ The Latent Color Subspace: Emergent Order in High-Dimensional Chaos

Authors: Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12261v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

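Scoring rationale: The paper interprets the color representation in the variational autoencoder latent space of the text-to-image model FLUX.1, which falls under explainable AI for computer vision and generative models. It is unrelated to the vast majority of keywords (those on large language models, training techniques, reasoning, alignment, compression, agents, and so on). The only relevant keyword is “Mechanistic Interpretability” OR “Explainable AI”: the paper's core is explaining an internal model representation (color encoding), which belongs to explainable AI and mechanistic interpretability, so it scores 10 (highly relevant, core content). All other keywords score 0 (unrelated).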

!!! tip deepseek-chat TL;DR

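The paper studies the color representation in the variational autoencoder latent space of the FLUX.1 text-to-image model, reveals a structure reflecting hue, saturation, and lightness, and proposes a training-free method for explicit color control based on closed-form latent-space manipulation.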

Abstract (Chinese translation)

文本到图像生成模型发展迅速，但实现对生成图像的细粒度控制仍然困难，这主要源于对语义信息编码方式的理解有限。我们对FLUX.1 [Dev]变分自编码器潜在空间中的颜色表征提出了一种解释，揭示了其反映色相、饱和度和明度的结构。我们通过证明该潜在颜色子空间（LCS）解释既能预测又能显式控制颜色，验证了其有效性，并引入了一种完全无需训练、仅基于闭式潜在空间操作的FLUX控制方法。代码发布于https://github.com/ExplainableML/LCS。

Abstract

Text-to-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training-free method in FLUX based solely on closed-form latent-space manipulation. Code is available at https://github.com/ExplainableML/LCS.

Keywords: text-to-image generation, variational autoencoder, latent space, color representation, interpretability, FLUX.1, training-free control, latent color subspace

19. ❌ STAIRS-Former: Spatio-Temporal Attention with Interleaved Recursive Structure Transformer for Offline Multi-task Multi-agent Reinforcement Learning

Authors: Jiwon Jeon, Myungsik Cho, Youngchul Sung Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11691v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

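Scoring rationale: STAIRS-Former focuses on offline multi-agent reinforcement learning (MARL) and proposes an improved Transformer architecture for multi-task, multi-agent settings. Of all the keywords, only “Multi-agent Systems” OR “Agent Coordination” is highly relevant (10 points), because the paper's core concern is coordination in multi-agent systems. All other keywords are unrelated to the paper (0 points), since it does not cover large language models, model training techniques, reasoning methods, AI-for-science applications, or similar topics.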

!!! tip deepseek-chat TL;DR

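To address varying numbers of agents and generalization to unseen scenarios in offline multi-task multi-agent reinforcement learning, the paper proposes STAIRS-Former, a Transformer architecture augmented with spatial and temporal hierarchies, which achieves state-of-the-art performance on multiple benchmarks.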

Abstract (Chinese translation)

离线多智能体强化学习（MARL）在处理多任务数据集时面临挑战，这主要源于不同任务间智能体数量的差异以及向未知场景泛化的需求。先前的研究通过采用基于观测标记化的Transformer架构与分层技能学习来解决这些问题。然而，这些方法未能充分利用Transformer注意力机制来实现智能体间的协同，且仅依赖单一历史标记，限制了其在部分可观测MARL环境中捕捉长期时序依赖关系的能力。本文提出STAIRS-Former，一种增强空间与时间层次结构的Transformer架构，该架构能够在捕捉长程交互历史的同时，对关键标记实现高效注意力聚焦。我们进一步引入标记丢弃技术，以提升模型在不同智能体数量下的鲁棒性与泛化能力。在包含SMAC、SMAC-v2、MPE及MaMuJoCo在内的多种多智能体基准测试中，基于多任务数据集的大量实验表明，STAIRS-Former持续优于现有方法，并取得了新的最优性能。

Abstract

Offline multi-agent reinforcement learning (MARL) with multi-task datasets is challenging due to varying numbers of agents across tasks and the need to generalize to unseen scenarios. Prior works employ transformers with observation tokenization and hierarchical skill learning to address these issues. However, they underutilize the transformer attention mechanism for inter-agent coordination and rely on a single history token, which limits their ability to capture long-horizon temporal dependencies in partially observable MARL settings. In this paper, we propose STAIRS-Former, a transformer architecture augmented with spatial and temporal hierarchies that enables effective attention over critical tokens while capturing long interaction histories. We further introduce token dropout to enhance robustness and generalization across varying agent populations. Extensive experiments on diverse multi-agent benchmarks, including SMAC, SMAC-v2, MPE, and MaMuJoCo, with multi-task datasets demonstrate that STAIRS-Former consistently outperforms prior methods and achieves new state-of-the-art performance.

Keywords: Offline Multi-agent Reinforcement Learning, Multi-task Learning, Transformer Architecture, Spatio-Temporal Attention, Inter-agent Coordination, Long-horizon Temporal Dependencies, Token Dropout, Generalization

20. ❌ SemBench: A Universal Semantic Framework for LLM Evaluation

Authors: Mikel Zubillaga, Naiara Perez, Oscar Sainz, German Rigau Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11687v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

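Scoring rationale: The paper's core contribution is SemBench, a framework designed to evaluate the semantic understanding of large language models (LLMs), so it is highly relevant to “Large Language Models” (10 points). It does not cover the specific techniques behind the other keywords (such as MoE, SFT, or RAG), their application domains (such as bioinformatics), or their specific evaluation aspects (such as hallucination mitigation or reasoning), so those keywords score 0.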

!!! tip deepseek-chat TL;DR

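To address the challenge of evaluating semantic understanding in large language models, the paper proposes SemBench, a lightweight, scalable, cross-lingual evaluation framework, and shows experimentally that its results agree closely with standard dataset evaluations while being data-efficient.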

Abstract (Chinese translation)

自然语言处理（NLP）领域的最新进展主要由大型语言模型（LLM）的兴起所推动，这些模型展现出卓越的生成与推理能力。然而，尽管取得了成功，评估这些模型对真实语义的理解能力仍是一个持续存在的挑战。传统的基准测试如上下文词义识别（Word-in-Context, WiC）能有效探测这一能力，但其构建过程资源密集，且通常仅限于高资源语言。本文中，我们提出了SemBench，一个仅使用词典义项定义和句子编码器即可自动生成合成基准测试的框架，用于评估大型语言模型的语义能力。该方法无需人工整理的例句，使其兼具可扩展性和语言无关性。我们在三种不同资源水平的语言（英语、西班牙语和巴斯克语）中，对一系列广泛的大型语言模型进行了SemBench评估。结果显示，SemBench得出的模型排名与标准WiC数据集获得的排名高度相关。此外，我们的分析表明，仅需少量示例即可获得稳定且有意义的排名。总体而言，SemBench为大型语言模型的语义理解能力提供了一种轻量、适应性强且数据高效的跨语言评估框架。

Abstract

Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true semantic understanding of these models remains a persistent challenge. Traditional benchmarks such as Word-in-Context (WiC) effectively probe this capability, but their creation is resource-intensive and often limited to high-resource languages. In this paper, we introduce SemBench, a framework for automatically generating synthetic benchmarks that assess the semantic competence of LLMs using only dictionary sense definitions and a sentence encoder. This approach eliminates the need for curated example sentences, making it both scalable and language-independent. We evaluate SemBench in three languages (English, Spanish, and Basque) spanning different levels of linguistic resources, and across a wide range of LLMs. Our results show that rankings derived from SemBench strongly correlate with those obtained from standard WiC datasets. Furthermore, our analysis demonstrates that only a small number of examples is required to achieve stable and meaningful rankings. Overall, SemBench provides a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs.

Keywords: Large Language Models, LLM evaluation, semantic understanding, cross-lingual evaluation, synthetic benchmarks, Word-in-Context, SemBench, data-efficient
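
The abstract above reports that model rankings from SemBench correlate strongly with rankings from standard WiC datasets, but the text quoted here does not say which correlation coefficient is used. As an illustration only, the sketch below computes Spearman's rank correlation, one common choice for comparing two rankings. The function names and the model scores are made up for the example and are not results from the paper; the closed-form formula used assumes no tied scores.

```python
from typing import List, Sequence


def ranks(scores: Sequence[float]) -> List[int]:
    """Rank positions (1 = highest score). Ties are rejected to keep the formula valid."""
    if len(set(scores)) != len(scores):
        raise ValueError("tied scores are not supported in this sketch")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical accuracy of four models on two benchmarks (illustrative values only).
synthetic_benchmark = [0.81, 0.74, 0.62, 0.55]
reference_benchmark = [0.70, 0.78, 0.66, 0.50]  # top two models swap places

rho = spearman(synthetic_benchmark, reference_benchmark)
```

A value of `rho` close to 1 means the two benchmarks order the models almost identically; a single swap among four models, as in this toy input, already lowers it noticeably.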

21. ❌ AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling

Authors: Hamed Hamzeh Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12031v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

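Scoring rationale: The paper “AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling” focuses on using multi-agent reinforcement learning (MARL) and graph neural networks (GNNs) to solve dynamic Kubernetes scheduling. Its core is the application of reinforcement learning to distributed systems, not innovation in large language models (LLMs) or in the underlying principles of deep learning. Consequently, the vast majority of keywords (those on LLM techniques, training methods, inference optimization, alignment, AI-for-science applications, and so on) are unrelated to the paper and score 0. The only relevant keyword is “Multi-agent Systems” OR “Agent Coordination”: the paper explicitly models scheduling as a cooperative multi-agent problem in which every cluster node is an agent, and agent coordination is one of its core innovations, so it scores 10 (highly relevant).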

!!! tip deepseek-chat TL;DR

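To address the shortcomings of existing reinforcement-learning approaches to Kubernetes scheduling in scalability, multi-objective trade-offs, and dynamic adaptability, the paper proposes an adaptive graph-enhanced multi-agent reinforcement learning scheduler (AGMARL-DKS). Evaluation on Google Kubernetes Engine shows that it significantly outperforms the default scheduler in fault tolerance, resource utilization, and cost.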

Abstract (Chinese translation)

当前先进的云原生应用需要能够有效平衡系统稳定性、资源利用率及相关成本的智能调度器。尽管Kubernetes默认提供基于可行性的资源放置方案，但近期研究已开始探索利用强化学习（RL）实现更智能的调度决策。然而，当前基于强化学习的调度器存在三大局限：其一，多数调度器采用单一集中式智能体架构，难以适应大规模异构集群的扩展需求；其二，采用多目标奖励函数的调度器通常仅假设目标间存在简单、静态的线性组合关系；其三，尚无研究能构建出可自适应响应动态条件的压力感知型调度器。为弥补现有研究的不足，本文提出自适应图增强多智能体强化学习动态Kubernetes调度器（Adaptive Graph-enhanced Multi-Agent Reinforcement Learning Dynamic Kubernetes Scheduler, AGMARL-DKS）。该调度器通过三大创新突破现有局限：首先，我们将调度问题构建为协作式多智能体系统，每个集群节点作为独立智能体运行，采用“集中训练-分散执行”模式，从而实现可扩展的解决方案；其次，为实现兼具全局感知与分布式决策的能力，我们利用图神经网络（Graph Neural Network, GNN）为每个智能体构建全局集群上下文的状态表征，这相较于仅依赖局部观测的方法具有显著改进；最后，为在多目标间实现动态权衡，我们采用压力感知的词典序优化策略替代传统的简单静态线性加权方法。在Google Kubernetes Engine（GKE）平台上的评估表明，AGMARL-DKS在容错性、资源利用率和成本控制方面显著优于默认调度器，尤其在批处理任务与关键业务工作负载的调度场景中表现突出。

Abstract

State-of-the-art cloud-native applications require intelligent schedulers that can effectively balance system stability, resource utilisation, and associated costs. While Kubernetes provides feasibility-based placement by default, recent research efforts have explored the use of reinforcement learning (RL) for more intelligent scheduling decisions. However, current RL-based schedulers have three major limitations. First, most of these schedulers use monolithic centralised agents, which are non-scalable for large heterogeneous clusters. Second, the ones that use multi-objective reward functions assume simple, static, linear combinations of the objectives. Third, no previous work has produced a stress-aware scheduler that can react adaptively to dynamic conditions. To address these gaps in current research, we propose the Adaptive Graph-enhanced Multi-Agent Reinforcement Learning Dynamic Kubernetes Scheduler (AGMARL-DKS). AGMARL-DKS addresses these gaps by introducing three major innovations. First, we construct a scalable solution by treating the scheduling challenge as a cooperative multi-agent problem, where every cluster node operates as an agent, employing centralised training methods before decentralised execution. Second, to be context-aware and yet decentralised, we use a Graph Neural Network (GNN) to build a state representation of the global cluster context at each agent. This represents an improvement over methods that rely solely on local observations. Finally, to make trade-offs between these objectives, we use a stress-aware lexicographical ordering policy instead of a simple, static linear weighting of these objectives. The evaluations in Google Kubernetes Engine (GKE) reveal that AGMARL-DKS significantly outperforms the default scheduler in terms of fault tolerance, utilisation, and cost, especially in scheduling batch and mission-critical workloads.

Keywords: Multi-Agent Reinforcement Learning, Kubernetes Scheduling, Graph Neural Network, Dynamic Scheduling, Resource Utilization, Fault Tolerance, Decentralized Execution, Cooperative Agents
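
The abstract above contrasts a static linear weighting of objectives with a stress-aware lexicographical ordering. The sketch below illustrates that general distinction only and is not the paper's policy: the `Candidate` fields, the two priority orders, and the `under_stress` flag are all assumptions made for the example, and the numbers are invented.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical placement option for one pod."""
    node: str
    fault_risk: float   # lower is better
    cost: float         # lower is better
    utilisation: float  # higher is better


def pick_linear(cands: List[Candidate], w_risk: float = 1.0,
                w_cost: float = 1.0, w_util: float = 1.0) -> Candidate:
    """Static linear weighting: all objectives collapse into one scalar."""
    return min(cands, key=lambda c: w_risk * c.fault_risk + w_cost * c.cost - w_util * c.utilisation)


def pick_lexicographic(cands: List[Candidate], under_stress: bool) -> Candidate:
    """Lexicographic ordering: later objectives only break ties in earlier ones.

    The priority order switches with cluster stress (assumed orders, for illustration).
    """
    if under_stress:
        return min(cands, key=lambda c: (c.fault_risk, c.cost, -c.utilisation))
    return min(cands, key=lambda c: (c.cost, c.fault_risk, -c.utilisation))


candidates = [
    Candidate("node-a", fault_risk=0.10, cost=5.0, utilisation=0.6),
    Candidate("node-b", fault_risk=0.30, cost=2.0, utilisation=0.7),
]

calm_choice = pick_lexicographic(candidates, under_stress=False)
stressed_choice = pick_lexicographic(candidates, under_stress=True)
linear_choice = pick_linear(candidates)
```

In this toy input the linear rule always prefers the cheaper node, whereas the lexicographic rule switches to the lower-risk node once the cluster is flagged as stressed. Python compares tuples element by element, which is what makes the `key` tuples behave lexicographically.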

22. ❌ EvoFlows: Evolutionary Edit-Based Flow-Matching for Protein Engineering

Authors: Nicolas Deutschmann, Constance Ferragu, Jonathan D. Ziegler, Shayan Aziznejad, Eli Bixby Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11703v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

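Scoring rationale: EvoFlows is a deep-learning method for protein engineering: it proposes an edit-flow-based sequence-to-sequence model for predicting protein mutations. It is unrelated to nearly all keywords (LLMs, MoE, RLHF, RAG, and so on), which concern the principles, training methods, inference optimization, or application paradigms of general-purpose large language models, whereas the paper's core is protein sequence modeling in a specific domain. The only relevant keyword is “AI for Science” OR “Bioinformatics” OR “Cheminformatics”: the paper is clearly an AI application in bioinformatics and bears directly on protein engineering, so it receives 10 points (highly relevant).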
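The score line for this paper can be reproduced with simple arithmetic. The sketch below assumes that each per-keyword score is weight × relevance (inferred from the table rows) and that a paper passes when its total reaches the pass mark from the report configuration, 5.0 + 0.8 × total weight. The function names are illustrative and are not the report generator's actual code.

```python
def paper_score(rows):
    """Sum per-keyword scores, assuming score = weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)


def pass_mark(total_weight, base=5.0, slope=0.8):
    """Pass mark from the report configuration: 5.0 + 0.8 * total weight."""
    return base + slope * total_weight


# 26 keywords with relevance 0.0 and one ("AI for Science" ...) with 10.0,
# all with weight 1.0, as in the table for this paper.
rows = [(1.0, 0.0)] * 26 + [(1.0, 10.0)]
score = paper_score(rows)
threshold = pass_mark(total_weight=sum(w for w, _ in rows))
passed = score >= threshold  # assumption: "pass" means score >= pass mark
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

With 27 keywords of weight 1.0 the pass mark works out to 26.6, so a single fully relevant keyword cannot pass on its own.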

!!! tip deepseek-chat TL;DR

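The paper proposes EvoFlows, an edit-flow-based protein sequence modeling method that predicts both the location and the type of protein mutations. Experiments show it outperforms conventional masked language models at generating natural-like, non-trivial protein mutants.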

Abstract

We introduce EvoFlows, a variable-length sequence-to-sequence protein modeling approach uniquely suited to protein engineering. Unlike autoregressive and masked language models, EvoFlows perform a limited, controllable number of insertions, deletions, and substitutions on a template protein sequence. In other words, EvoFlows predict not only which mutation to perform, but also where it should occur. Our approach leverages edit flows to learn mutational trajectories between evolutionarily-related protein sequences, simultaneously modeling distributions of related natural proteins and the mutational paths connecting them. Through extensive in silico evaluation on diverse protein communities from UNIREF and OAS, we demonstrate that EvoFlows capture protein sequence distributions with a quality comparable to leading masked language models commonly used in protein engineering, while showing improved ability to generate non-trivial yet natural-like mutants from a given template protein.

Keywords: EvoFlows, protein engineering, edit flows, mutational trajectories, sequence-to-sequence modeling, protein mutants, masked language models, in silico evaluation
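The abstract describes mutants as the result of a small number of insertions, deletions, and substitutions applied to a template sequence. The sketch below shows only that application step, not the flow model that would predict the edits. The `Edit` type, the position convention (0-based, relative to the original template, at most one edit per position), and the example sequence are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    op: str            # "sub", "ins", or "del"
    pos: int           # 0-based position in the original template
    residue: str = ""  # amino acid for "sub"/"ins"; unused for "del"


def apply_edits(template: str, edits: list[Edit]) -> str:
    """Apply edits to a template; positions refer to the original template."""
    seq = list(template)
    # Process right to left so that earlier positions stay valid.
    for e in sorted(edits, key=lambda e: e.pos, reverse=True):
        if e.op == "sub":
            seq[e.pos] = e.residue
        elif e.op == "ins":
            seq.insert(e.pos, e.residue)  # insert before position pos
        elif e.op == "del":
            del seq[e.pos]
        else:
            raise ValueError(f"unknown edit op: {e.op}")
    return "".join(seq)


template = "MKTAYIAK"  # made-up sequence, not from the paper
edits = [Edit("sub", 2, "S"), Edit("del", 5), Edit("ins", 7, "G")]
mutant = apply_edits(template, edits)
print(mutant)
```

Because the edit list is explicit, the number of mutations is bounded by its length and the output length can differ from the template's.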

23. ❌ Coarse-Guided Visual Generation via Weighted h-Transform Sampling

Authors: Yanghao Wang, Ziqi Jiang, Zhen Wang, Long Chen Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12057v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

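Scoring rationale: The paper addresses visual generation (images and video), proposing a training-free guidance method based on the h-transform and pretrained diffusion models for generating fine samples from coarse references. The work belongs to computer vision and generative modeling and is unrelated to nearly all keywords, which mainly target LLMs and their associated techniques, applications, and optimizations. It has partial relevance (5 points) only to “Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”, because the method builds on a pretrained diffusion model; however, the paper does not study pre-training itself but uses an existing pretrained model to guide the sampling process. No other keyword is touched.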

!!! tip deepseek-chat TL;DR

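The paper proposes a training-free guidance method based on the h-transform and a noise-level-aware schedule. It addresses the difficulty of balancing guidance adherence against synthesis quality when generating high-quality visual samples from coarse references, and validates its effectiveness and generalization on a range of image and video generation tasks.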

Abstract

Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and restricted generalization due to paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or are difficult to balance between guidance and synthetic quality. To address these challenges, we propose a novel guided method by using the h-transform, a tool that can constrain stochastic processes (e.g., sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding to the original differential equation with a drift function, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.

Keywords: Coarse-guided visual generation, h-transform, pretrained diffusion models, training-free method, sampling process, guidance adherence, image generation, video generation
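The abstract describes two ingredients: an extra drift term added to the sampler's differential equation, and a noise-level-aware schedule that de-weights that term. The toy 1-D sketch below illustrates only this general pattern. The schedule `guidance_weight`, the two drift functions, and the step sizes are hypothetical stand-ins and are not the paper's h-transform derivation or its actual schedule.

```python
import math
import random


def guidance_weight(sigma: float) -> float:
    """Hypothetical schedule: the weight shrinks as the noise level grows."""
    return 1.0 / (1.0 + sigma ** 2)


def guided_step(x, sigma, dt, base_drift, guide_drift, rng):
    """One Euler-Maruyama step of dx = (f + w(sigma) * h) dt + sigma dW."""
    drift = base_drift(x, sigma) + guidance_weight(sigma) * guide_drift(x, sigma)
    noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x + drift * dt + noise


# Toy setup: the base drift pulls toward 0, the guidance drift pulls toward
# a coarse reference value. Both are stand-ins for learned model terms.
coarse_reference = 2.0
base = lambda x, s: -x
guide = lambda x, s: coarse_reference - x

rng = random.Random(0)
x = 5.0
steps = 200
for i in range(steps):
    sigma = 1.0 * (1.0 - i / steps)  # noise level decays toward 0
    x = guided_step(x, sigma, 0.05, base, guide, rng)
```

In this sketch the guidance term contributes least at high noise levels, where an approximation of the guidance would presumably be least reliable, and most as the noise level approaches zero.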

24. ❌ Compactifying the Electronic Wavefunction II: Quantum Estimators for Spin-Coupled Generalized Valence Bond Wavefunctions

Authors: Bruna Gabrielly Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12045v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

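Scoring rationale: The paper concerns quantum computing for electronic-structure calculations, specifically matrix-element estimation for spin-coupled generalized valence bond wavefunctions. Its core is a quantum measurement framework and quantum circuit design, unrelated to deep learning or large models. Every keyword except the last concerns large models, deep learning, natural language processing, model optimization, and similar topics, none of which connects to the paper's quantum-chemistry subject. The last keyword, “AI for Science” OR “Bioinformatics” OR “Cheminformatics”, receives 5 points because the paper falls within computational chemistry and can be viewed as an AI-adjacent application in scientific computing (although it is quantum computing rather than conventional AI), but this is not its core content.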

!!! tip deepseek-chat TL;DR

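The paper proposes a measurement-based quantum framework for estimating overlap and Hamiltonian matrix elements in spin-coupled generalized valence bond wavefunctions, implemented with shallow, ancilla-free quantum circuits, and validates its accuracy and chemical consistency on H4.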

Abstract

We present a measurement-driven quantum framework for evaluating overlap and Hamiltonian matrix elements in spin-coupled generalized valence bond (SCGVB) wavefunctions. The approach targets a central difficulty of nonorthogonal valence-bond methods: estimating matrix elements between distinct, generally nonorthogonal configuration state functions. Rather than preparing the full wavefunction on quantum hardware, we reformulate the required quantities as vacuum expectation values of Pauli-string operators that can be accessed using shallow, ancilla-free circuits composed of local Clifford rotations and computational-basis measurements. In contrast to Hadamard-test-based matrix-element estimation, this construction avoids ancilla qubits and controlled operations by reducing the problem to local Pauli measurements. This separates the algebraic construction of the SCGVB problem from the measurement task executed on the quantum register and yields a low-depth strategy compatible with near-term architectures. We demonstrate the framework on square and rectangular H4 using quantum-circuit emulation, where the resulting overlap and Hamiltonian matrices reproduce classical Löwdin-based references with good accuracy across the geometries considered, and where derived Coulson-Chirgwin weights remain chemically consistent. These results support the feasibility of measurement-based quantum assistance for nonorthogonal SCGVB expansions and provide a practical route for incorporating quantum measurements into valence-bond electronic-structure workflows.

Keywords: quantum computing, electronic structure, generalized valence bond, matrix elements, quantum measurements, quantum circuits, nonorthogonal wavefunctions, SCGVB
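The abstract reduces the problem to local Pauli measurements read out in the computational basis. As a generic illustration of that last step (not the paper's estimator), the sketch below turns bitstring counts into an expectation value of a Z-type Pauli string. The counts are invented, and the convention that index 0 is the leftmost character of each bitstring is an assumption. X or Y factors would first need a local basis rotation on the corresponding qubit before measurement.

```python
def pauli_z_expectation(counts: dict[str, int], qubits: list[int]) -> float:
    """Estimate <Z...Z> on the given qubits from computational-basis counts.

    Each bitstring contributes +1 if the selected bits contain an even number
    of 1s and -1 otherwise.
    """
    shots = sum(counts.values())
    if shots == 0:
        raise ValueError("no measurement shots")
    total = 0
    for bits, n in counts.items():
        parity = sum(int(bits[q]) for q in qubits) % 2
        total += n * (1 - 2 * parity)
    return total / shots


# Invented two-qubit counts for illustration only.
counts = {"00": 480, "11": 470, "01": 30, "10": 20}
zz = pauli_z_expectation(counts, [0, 1])  # (480 + 470 - 30 - 20) / 1000
z0 = pauli_z_expectation(counts, [0])     # (480 - 470 + 30 - 20) / 1000
```

No ancilla qubit or controlled gate appears anywhere in this readout, which is the property the abstract contrasts with Hadamard-test-based estimation.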

25. ❌ Accurate prediction of inverted singlet-triplet excited states using self-consistent spin-opposite perturbation theory

Authors: Nhan Tri Tran, Hoang Thanh Nguyen, Lan Nguyen Tran Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11891v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

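Scoring rationale: The paper is in computational chemistry and studies a new quantum-chemical method (O2BMP2) for efficiently predicting molecular excited-state energies, in particular inverted singlet-triplet gaps. Its content is unrelated to nearly all keywords (large models, deep learning, training techniques, inference optimization, agents, and so on). It has partial relevance only to the last keyword, “AI for Science” OR “Bioinformatics” OR “Cheminformatics”, because the work belongs to computational chemistry, which the scorer treats as a concrete AI for Science application in chemistry and materials science; the paper itself, however, neither uses nor mentions any AI, machine learning, or large-model technique and is purely the development of a quantum-chemical method. It therefore receives 5 points (partial relevance) for that keyword and 0 points (unrelated) for all others.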

!!! tip deepseek-chat TL;DR

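The paper addresses the computational challenge of predicting inverted singlet-triplet gaps (critical for OLED materials) both efficiently and accurately. It evaluates the authors' recently developed O2BMP2 method, which retains high accuracy at substantially lower computational cost.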

Abstract

The violation of Hund’s rule, resulting in an inverted singlet-triplet (INVEST) gap, represents a paradigm shift in photophysics with major implications for OLED technology. INVEST molecules facilitate barrierless reverse intersystem crossing, theoretically permitting 100% internal quantum efficiency without thermal activation. However, accurately predicting negative singlet-triplet energy gaps typically demands prohibitive computational costs. In this study, we evaluate the efficacy of our recently developed one-body Møller-Plesset perturbation theory (OBMP2) and its spin-opposite variant (O2BMP2) as efficient alternatives. Benchmarking against 30 INVEST molecules reveals that O2BMP2, with appropriate spin-opposite scaling, achieves the accuracy of ADC(3) and EOM-CCSD. Furthermore, with the possibility of reducing computational complexity to $N^4$, O2BMP2 provides a robust balance of accuracy and efficiency, making it suitable for the high-throughput screening of next-generation INVEST materials.

Keywords: inverted singlet-triplet gap, OLED technology, computational chemistry, perturbation theory, O2BMP2, high-throughput screening, excited states, energy gap prediction
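The quantity at stake in the abstract is the singlet-triplet gap, which is inverted when the lowest excited singlet lies below the lowest triplet. The sketch below shows that sign convention as it might be used in a screening loop. The energies are made-up illustrative values in eV, not results from the paper, and the function names are hypothetical.

```python
def singlet_triplet_gap(e_s1: float, e_t1: float) -> float:
    """Return the gap E(S1) - E(T1) in the units of the inputs."""
    return e_s1 - e_t1


def is_inverted(gap: float, tol: float = 0.0) -> bool:
    """A gap is inverted when S1 lies below T1, i.e. the gap is negative."""
    return gap < -tol


# Made-up excitation energies in eV for two hypothetical candidates.
candidates = {"mol_a": (1.05, 1.20), "mol_b": (2.40, 2.10)}
inverted = {
    name: is_inverted(singlet_triplet_gap(s1, t1))
    for name, (s1, t1) in candidates.items()
}
```

A nonzero `tol` lets a screen ignore gaps smaller than the expected error of the electronic-structure method used to compute the energies.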

26. ❌ SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Authors: Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12249v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

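Scoring rationale: The paper's core is the construction of SciMDR, a scientific multimodal document reasoning dataset for training and evaluating foundation models. Highly relevant keywords: (1) 'Large Language Models/Foundation Models' (the paper explicitly targets foundation model training); (2) 'Post-training/Supervised Fine-tuning' (experiments show that models fine-tuned on SciMDR improve significantly); (3) 'Chain of Thought/Multi-step Reasoning' (the dataset contains explicit reasoning chains and focuses on complex document-level reasoning); (4) 'AI for Science' (it focuses on multimodal document understanding in the scientific domain). Other keywords such as MoE, quantization, and RAG are not addressed and score 0.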

!!! tip deepseek-chat TL;DR

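To address the trade-off among scale, faithfulness, and realism in building scientific multimodal document reasoning datasets, the paper proposes the synthesize-and-reground framework and uses it to build the large-scale training dataset SciMDR and the expert-annotated evaluation benchmark SciMDR-Eval. Experiments show that models fine-tuned on SciMDR achieve significant gains on multiple scientific QA benchmarks.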

Abstract

Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.

Keywords: scientific multimodal document reasoning, foundation model training, synthesize-and-reground framework, explicit reasoning chains, cross-modal comprehension, document-level reasoning, scientific QA benchmarks, multimodal comprehension
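The two-stage pipeline in the abstract (write QA pairs on focused segments, then re-embed them into full-document tasks) can be pictured with a small data-flow sketch. Everything below is hypothetical: the `QAPair` fields, the `make_qa` callback standing in for a model call, and the output record layout are assumptions, and the paper's actual pipeline is multimodal and certainly more involved.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    reasoning: str
    answer: str
    segment_id: int  # index of the focused segment the pair was written from


def synthesize(segments, make_qa):
    """Stage 1: produce one isolated QA pair per focused segment."""
    return [make_qa(i, seg) for i, seg in enumerate(segments)]


def reground(pairs, segments):
    """Stage 2: attach the full document as context for every QA pair."""
    document = "\n\n".join(segments)
    return [
        {
            "context": document,
            "question": p.question,
            "reasoning": p.reasoning,
            "answer": p.answer,
            "evidence_segment": p.segment_id,
        }
        for p in pairs
    ]


def toy_make_qa(i, seg):
    """Stand-in for the model call that would write a QA pair from a segment."""
    return QAPair(f"What does segment {i} claim?", f"Segment {i} states: {seg}", seg, i)


segments = ["Method A improves accuracy.", "Table 2 reports latency."]
tasks = reground(synthesize(segments, toy_make_qa), segments)
```

Keeping `evidence_segment` on each task preserves the link to the segment the answer was grounded in, even though the model now sees the whole document.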

27. ❌ Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

Authors: Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang, Song Jiang, Bo Liu, Arman Cohan, Yuandong Tian, Zhengxing Chen Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12246v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

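Scoring rationale: The paper studies alignment in LLM post-training for non-verifiable domains, in particular the effectiveness of reasoning LLMs as judges. Highly relevant keywords: LLMs (the core object of study), Post-training/SFT (the research setting), Alignment (the research goal), RLHF/DPO (the research method), and Chain of Thought/System 2 Thinking (the defining trait of reasoning judges). Other keywords such as MoE, SLMs, RAG, and Quantization are not addressed in the paper and score 0.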

!!! tip deepseek-chat TL;DR

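The study systematically evaluates how non-reasoning and reasoning LLM judges actually perform in reinforcement-learning-based LLM alignment for non-verifiable post-training. It finds that reasoning judges train stronger policies, but that these policies may obtain high scores by generating deceptive outputs.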

Abstract

Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, where a “gold-standard” judge (gpt-oss-120b) provides preference annotations to train smaller judges, reveals key differences between non-reasoning and reasoning judges: non-reasoning judges lead to reward hacking easily, while reasoning judges can lead to policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that the reasoning-judge-trained policies achieve such strong performance by learning to generate highly effective adversarial outputs that can also score well on popular benchmarks such as Arena-Hard by deceiving other LLM-judges. Combined with our further analysis, our study highlights both important findings and room for improvements for applying (reasoning) LLM-judges in non-verifiable LLM post-training.

Keywords: LLM alignment, post-training, reasoning LLMs, LLM-as-judges, reinforcement learning, non-verifiable domains, reward hacking, adversarial outputs
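One simple way to picture the reward hacking mentioned in the abstract is to compare how often the training judge and the gold-standard judge prefer the same policy outputs. The sketch below is an assumed diagnostic, not the paper's evaluation protocol; the preference lists are invented, with 1 meaning the judge preferred the policy output over a baseline.

```python
def win_rate(prefs):
    """Fraction of comparisons in which the judge preferred the policy output."""
    return sum(prefs) / len(prefs)


def hacking_gap(train_judge_prefs, gold_judge_prefs):
    """Training-judge win rate minus gold-judge win rate on the same outputs."""
    return win_rate(train_judge_prefs) - win_rate(gold_judge_prefs)


# Invented preferences on five policy outputs.
train_judge = [1, 1, 1, 1, 0]
gold_judge = [1, 0, 0, 1, 0]
gap = hacking_gap(train_judge, gold_judge)
```

A large positive gap suggests the policy is exploiting the training judge. By the abstract's account, policies trained with reasoning judges can also score well with the gold judge through adversarial outputs, which a gap of this kind would not reveal.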

28. ❌ Separable neural architectures as a primitive for unified predictive and generative intelligence

Authors: Reza T. Batley, Apurba Sarker, Rajib Mostakim, Andrew Klichine, Sourav Saha Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12244v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

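Scoring rationale: The paper proposes the separable neural architecture (SNA), a neural network design paradigm that factorises high-dimensional mappings by constraining interaction order and tensor rank, with the aim of unifying predictive and generative intelligence. Its core contribution is architecture-level theory and methodology rather than improvements targeted at large language models (LLMs) or specific deep-learning techniques. The paper validates the approach in four domains (reinforcement-learning navigation, microstructure generation, turbulence modeling, and neural language modeling). The scientific-computing (turbulence) and engineering (microstructure) applications give it partial relevance to 'AI for Science', but the paper does not go into specific subfields such as bioinformatics or cheminformatics. All other keywords target LLM training, alignment, reasoning, optimization, application scenarios, or specific techniques (MoE, RAG, CoT, and so on); the paper does not address these and instead focuses on the mathematical formalisation of a general neural architecture and its cross-domain applicability, so their relevance is 0.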

!!! tip deepseek-chat TL;DR

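The study proposes the separable neural architecture (SNA) as a domain-agnostic primitive that unifies predictive and generative intelligence by factorising high-dimensional mappings, and validates it in four domains: navigation, materials design, turbulence modeling, and language modeling.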

Abstract

Intelligent systems across physics, language and perception often exhibit factorisable structure, yet are typically modelled by monolithic neural architectures that do not explicitly exploit this structure. The separable neural architecture (SNA) addresses this by formalising a representational class that unifies additive, quadratic and tensor-decomposed neural models. By constraining interaction order and tensor rank, SNAs impose a structural inductive bias that factorises high-dimensional mappings into low-arity components. Separability need not be a property of the system itself: it often emerges in the coordinates or representations through which the system is expressed. Crucially, this coordinate-aware formulation reveals a structural analogy between chaotic spatiotemporal dynamics and linguistic autoregression. By treating continuous physical states as smooth, separable embeddings, SNAs enable distributional modelling of chaotic systems. This approach mitigates the nonphysical drift characteristics of deterministic operators whilst remaining applicable to discrete sequences. The compositional versatility of this approach is demonstrated across four domains: autonomous waypoint navigation via reinforcement learning, inverse generation of multifunctional microstructures, distributional modelling of turbulent flow and neural language modelling. These results establish the separable neural architecture as a domain-agnostic primitive for predictive and generative intelligence, capable of unifying both deterministic and distributional representations.

Keywords: separable neural architecture, predictive intelligence, generative intelligence, tensor decomposition, structural inductive bias, domain-agnostic primitive, distributional modeling, chaotic systems
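The abstract speaks of factorising a high-dimensional mapping into low-arity components under a rank constraint. One common separable form, used here as an assumed illustration rather than the paper's exact formulation, is a sum of R products of univariate functions, f(x) = Σ_r Π_d g_{r,d}(x_d). In the sketch, plain Python callables stand in for the small learned networks.

```python
import math


def separable_eval(factors, x):
    """Evaluate f(x) = sum_r prod_d factors[r][d](x[d]).

    `factors` is a list of R terms; each term holds one univariate function
    per input coordinate, so the number of terms R plays the role of the rank.
    """
    total = 0.0
    for term in factors:
        prod = 1.0
        for g, xd in zip(term, x):
            prod *= g(xd)
        total += prod
    return total


# Rank-2 example in two variables: f(x, y) = x * y + sin(x) * 1
factors = [
    [lambda a: a, lambda b: b],
    [math.sin, lambda b: 1.0],
]
value = separable_eval(factors, [2.0, 3.0])  # 2 * 3 + sin(2)
```

An additive model is the special case in which every term has all but one factor equal to the constant 1, which is how a single form of this kind can cover several of the model families the abstract lists.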

29. ❌ Incremental Neural Network Verification via Learned Conflicts

Authors: Raya Elsaleh, Liam Davis, Haoze Wu, Guy Katz Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12232v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

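Scoring rationale: The paper focuses on neural network verification and proposes an incremental verification method that speeds up verification by reusing learned conflicts. Its content is unrelated to every scoring keyword (all of which concern large models, the principles of deep learning techniques, scientific applications, and similar topics), because: 1) it studies verification of general neural networks rather than large language models or a specific deep learning technique; 2) it touches none of the techniques named in the keywords (MoE, scaling laws, fine-tuning, alignment, reasoning, agents, and so on); 3) it does not fall under application areas such as AI for Science; 4) verification is a low-level tool that does not match the innovation directions the keywords target.

!!! tip deepseek-chat TL;DR

The paper proposes an incremental neural network verification technique that reuses learned conflicts across related queries to cut redundant search, achieving speedups of up to 1.9× on several verification tasks.

Abstract translation

Neural network verification is often used as a core component of larger analysis pipelines, which generate a series of closely related verification queries over the same network. In existing neural network verifiers, each query is usually solved independently and information learned in earlier runs is discarded, so the same infeasible regions of the search space are explored repeatedly. This work aims to speed up verification by reducing that redundancy. We propose an incremental verification technique that reuses learned conflicts across related verification queries. The technique can be integrated on top of any branch-and-bound-based neural network verifier. During verification, the verifier records conflicts that correspond to learned infeasible combinations of activation phases and keeps them across runs. We formalize a refinement relation between verification queries and prove that, under refinement, conflicts learned for one query remain valid, which enables sound conflict inheritance. Inherited conflicts are handled by a SAT solver that performs consistency checks and propagation, so infeasible subproblems can be detected and pruned early in the search. We implement the technique in the Marabou verifier and evaluate it on three verification tasks: local robustness radius determination, verification with input splitting, and minimal sufficient feature set extraction. The experiments show that incremental conflict reuse reduces verification effort and yields speedups of up to $1.9\times$ over a non-incremental baseline.

Abstract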

Neural network verification is often used as a core component within larger analysis procedures, which generate sequences of closely related verification queries over the same network. In existing neural network verifiers, each query is typically solved independently, and information learned during previous runs is discarded, leading to repeated exploration of the same infeasible regions of the search space. In this work, we aim to expedite verification by reducing this redundancy. We propose an incremental verification technique that reuses learned conflicts across related verification queries. The technique can be added on top of any branch-and-bound-based neural network verifier. During verification, the verifier records conflicts corresponding to learned infeasible combinations of activation phases, and retains them across runs. We formalize a refinement relation between verification queries and show that conflicts learned for a query remain valid under refinement, enabling sound conflict inheritance. Inherited conflicts are handled using a SAT solver to perform consistency checks and propagation, allowing infeasible subproblems to be detected and pruned early during search. We implement the proposed technique in the Marabou verifier and evaluate it on three verification tasks: local robustness radius determination, verification with input splitting, and minimal sufficient feature set extraction. Our experiments show that incremental conflict reuse reduces verification effort and yields speedups of up to $1.9\times$ over a non-incremental baseline.

Keywords: neural network verification, incremental verification, learned conflicts, branch-and-bound, Marabou verifier, robustness verification, conflict inheritance, verification acceleration
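
The conflict-reuse idea in the abstract can be illustrated with a toy sketch. A conflict is stored as a set of (neuron, phase) literals known to be jointly infeasible, and a later run prunes any subproblem whose partial assignment already contains a stored conflict. This is not Marabou's API: `ConflictStore` and `count_subproblems` are hypothetical names, the plain subset check stands in for the SAT-based consistency checks and propagation the paper describes, and the sketch simply assumes the second query refines the first so that inherited conflicts stay valid.

```python
Literal = tuple[str, str]  # (neuron name, phase); phase is "active" or "inactive"


class ConflictStore:
    """Keeps learned conflicts (infeasible phase combinations) across runs."""

    def __init__(self) -> None:
        self.conflicts: set[frozenset[Literal]] = set()

    def add(self, literals) -> None:
        self.conflicts.add(frozenset(literals))

    def prunes(self, assignment: dict[str, str]) -> bool:
        """True if the assignment contains every literal of some stored conflict."""
        fixed = set(assignment.items())
        return any(conflict <= fixed for conflict in self.conflicts)


def count_subproblems(neurons, store, assignment=None):
    """Depth-first case split on neuron phases; count the nodes that are not pruned."""
    assignment = {} if assignment is None else assignment
    if store.prunes(assignment):
        return 0  # inherited conflict: drop this node and its whole subtree
    if len(assignment) == len(neurons):
        return 1  # leaf: every neuron phase is fixed
    next_neuron = neurons[len(assignment)]
    total = 1  # count the current node
    for phase in ("active", "inactive"):
        total += count_subproblems(neurons, store, {**assignment, next_neuron: phase})
    return total


neurons = ["n1", "n2", "n3"]
baseline = count_subproblems(neurons, ConflictStore())

inherited = ConflictStore()
inherited.add([("n1", "active"), ("n2", "inactive")])  # learned in an earlier run
incremental = count_subproblems(neurons, inherited)
print(baseline, incremental)
```

In this toy search tree, one inherited conflict removes an entire subtree before any of its leaves are visited, which is the source of the saving the abstract reports.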

30. ❌ Security Considerations for Artificial Intelligence Agents

Authors: Ninghui Li, Kaiyuan Zhang, Kyle Polley, Jerry Ma Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12230v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

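Scoring rationale: The paper mainly studies the security of AI agents. It is highly relevant to "LLM Agents/Autonomous Agents/Agentic Workflow" (10 points) because the whole text focuses on security considerations for agentic systems; relevant to "Tool Use/Function Calling/API Tool Use" (8 points) because it discusses attack surfaces tied to tool use (such as indirect prompt injection); and highly relevant to "Multi-agent Systems/Agent Coordination" (10 points) because it analyzes security risks in multi-agent coordination (such as cascading failures). It has some connection to "Large Language Models/LLMs/Foundation Models" (5 points) because AI agents are usually built on large models, although the paper does not discuss LLM technology itself in depth. The other keywords (MoE, scaling laws, training methods, and so on) are unrelated to the paper's security focus and all score 0.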
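
The arithmetic behind the score table can be sketched as follows. The passing threshold (5.0 + 0.8 × total weight, with a total weight of 27.0) and the per-keyword cap of 15 come from the report's configuration section. The per-keyword rule used here, weight × relevance, is an assumption, since the report does not state it; the function names are hypothetical.

```python
# Hypothetical sketch of the report's scoring arithmetic.
# ASSUMPTION: per-keyword score = weight * relevance, capped at 15.
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def passing_score(total_weight: float) -> float:
    """Passing threshold from the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: weight times relevance (0-10), capped per keyword."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum the per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# The four non-zero relevance rows of the table above, each listed with weight 0.0.
rows = [(0.0, 5.0), (0.0, 10.0), (0.0, 8.0), (0.0, 10.0)]
threshold = passing_score(27.0)
print(round(threshold, 1), total_score(rows), total_score(rows) >= threshold)
```

Under this assumed rule, a listed weight of 0.0 forces every keyword score to 0.0 whatever the relevance, which is consistent with the 0.0 / 26.6 total shown for this paper.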

!!! tip deepseek-chat TL;DR

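The paper examines new security risks that frontier AI agents face in settings such as tool use and multi-agent coordination (for example, indirect prompt injection and cascading failures), assesses current defenses as a layered stack, and identifies gaps in standards and research.

Abstract translation

This article is a lightly adapted version of Perplexity's response to NIST/CAISI Request for Information 2025-0035, setting out our observations and recommendations on the security of frontier AI agents. These insights come from Perplexity's hands-on experience operating general-purpose agentic systems used by millions of users and thousands of enterprises in controlled and open-world environments. Agent architectures change core assumptions about code-data separation, authority boundaries, and execution predictability, which gives rise to new confidentiality, integrity, and availability failure modes. We map the principal attack surfaces across tools, connectors, hosting boundaries, and multi-agent coordination, focusing in particular on indirect prompt injection, confused-deputy behavior, and cascading failures in long-running workflows. We then assess existing defenses as a layered stack: input-level and model-level mitigations, sandboxed execution, and deterministic policy enforcement for high-consequence actions. Finally, we point out gaps in standards and research, including adaptive security benchmarks, policy models for delegation and privilege control, and design guidance for secure multi-agent systems aligned with NIST risk management principles.

Abstract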

This article, a lightly adapted version of Perplexity’s response to NIST/CAISI Request for Information 2025-0035, details our observations and recommendations concerning the security of frontier AI agents. These insights are informed by Perplexity’s experience operating general-purpose agentic systems used by millions of users and thousands of enterprises in both controlled and open-world environments. Agent architectures change core assumptions around code-data separation, authority boundaries, and execution predictability, creating new confidentiality, integrity, and availability failure modes. We map principal attack surfaces across tools, connectors, hosting boundaries, and multi-agent coordination, with particular emphasis on indirect prompt injection, confused-deputy behavior, and cascading failures in long-running workflows. We then assess current defenses as a layered stack: input-level and model-level mitigations, sandboxed execution, and deterministic policy enforcement for high-consequence actions. Finally, we identify standards and research gaps, including adaptive security benchmarks, policy models for delegation and privilege control, and guidance for secure multi-agent system design aligned with NIST risk management principles.

Keywords: AI agents, agentic systems, security, multi-agent coordination, tool use, prompt injection, cascading failures, risk management

31. ❌ Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

Authors: Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur, Jiawei Han Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12226v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

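Scoring rationale: The paper's core is using LLMs to drive interdisciplinary scientific creativity, an application of large models to science. Highly relevant keywords include "Large Language Models" (the paper's core tool), "AI for Science" (the application area), and "Chain of Thought" and "System 2 Thinking" (it involves reasoning processes). "Retrieval-Augmented Generation" and "LLM Agents" have some relevance because the framework retrieves knowledge from external disciplines and uses an LLM-assisted workflow. Other keywords such as MoE, quantization, and alignment are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper presents the Idea-Catalyst framework, which uses large language models to systematically identify interdisciplinary insights that strengthen scientific creativity; empirically, it improves average novelty by 21% and insightfulness by 16%.

Abstract translation

Although interdisciplinary research delivers broader and longer-term impact, most scholarly work remains confined to single-domain academic silos. Recent AI-based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning that creative interdisciplinary breakthroughs require. As a result, prior work has focused mainly on automating scientific discovery rather than augmenting the reasoning processes that lead to scientific disruption. This paper presents Idea-Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning by both humans and large language models. Starting from an abstract research goal, the framework is meant to assist the brainstorming stage and explicitly avoids anchoring prematurely on specific solutions. It embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals; (b) awareness of a domain's opportunities and unresolved challenges; and (c) strategic exploration of interdisciplinary ideas based on their potential impact. Concretely, Idea-Catalyst decomposes an abstract goal (for example, improving human-AI collaboration) into core research questions in the target domain, which guide the analysis of progress and open challenges in that domain. These challenges are restated as domain-agnostic conceptual problems, which makes it possible to retrieve knowledge from external disciplines (for example, psychology and sociology) that deal with analogous problems. By synthesizing insights from those domains and recontextualizing them in the target domain, Idea-Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration raises average novelty by 21% and insightfulness by 16% while staying grounded in the original research problem.

Abstract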

Despite interdisciplinary research leading to larger and longer-term impact, most work remains confined to single-domain academic silos. Recent AI-based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea-Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea-Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain’s opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea-Catalyst decomposes an abstract goal (e.g., improving human-AI collaboration) into core target-domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain-agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues. By synthesizing and recontextualizing insights from these domains back into the target domain, Idea-Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.

Keywords: LLM-driven, interdisciplinary research, scientific creativity, reasoning processes, Idea-Catalyst, brainstorming, domain-agnostic problems, retrieval from external disciplines

32. ❌ Portfolio of Solving Strategies in CEGAR-based Object Packing and Scheduling for Sequential 3D Printing

Authors: Pavel Surynek Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12224v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

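Scoring rationale: The paper studies the combinatorial optimization problem of object arrangement and scheduling in 3D printing, improving efficiency by parallelizing the CEGAR-SEQ algorithm and introducing several object arrangement strategies. Its content is entirely about combinatorial optimization, parallel computing, and 3D-printing scheduling algorithms, and it does not involve large models, deep learning, the principles of AI techniques, or applications of AI in science. Every keyword concerns large models, deep learning, or related techniques, whereas this paper is purely algorithmic optimization research, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper addresses the combinatorial optimization problem of object arrangement and scheduling in 3D printing by parallelizing the CEGAR-SEQ algorithm and introducing several object arrangement strategies; experiments show that the new algorithm, Portfolio-CEGAR-SEQ, outperforms the original and often uses fewer printing plates.

Abstract translation

Computing power that decades ago existed only in supercomputers, in particular their parallelism, is now available in the CPUs of standard personal computers and even of mobile phones. This paper shows how to make effective use of the computing power of a modern multi-core personal computer CPU to solve the complex combinatorial optimization problem of object arrangement and scheduling in sequential 3D printing. We do so by parallelizing the existing CEGAR-SEQ algorithm, which expresses the sequential object arrangement and scheduling problem as a linear arithmetic formula and solves it with a technique inspired by counterexample guided abstraction refinement (CEGAR). The original CEGAR-SEQ algorithm uses an arrangement strategy that gathers objects toward the center of the printing plate. We propose alternative object arrangement strategies, such as gathering objects toward a corner of the printing plate and scheduling objects according to their height. Our parallelization is done at a high level: CEGAR-SEQ instances using different combinations of object arrangement strategies are executed in parallel, and the resulting parallel algorithm is called Portfolio-CEGAR-SEQ. The experimental evaluation shows that Portfolio-CEGAR-SEQ outperforms the original CEGAR-SEQ algorithm. When a batch of objects is scheduled across multiple printing plates, Portfolio-CEGAR-SEQ can usually use fewer printing plates than CEGAR-SEQ.

Abstract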

Computing power that used to be available only in supercomputers decades ago especially their parallelism is currently available in standard personal computer CPUs even in CPUs for mobile telephones. We show how to effectively utilize the computing power of modern multi-core personal computer CPU to solve the complex combinatorial problem of object arrangement and scheduling for sequential 3D printing. We achieved this by parallelizing the existing CEGAR-SEQ algorithm that solves the sequential object arrangement and scheduling by expressing it as a linear arithmetic formula which is then solved by a technique inspired by counterexample guided abstraction refinement (CEGAR). The original CEGAR-SEQ algorithm uses an object arrangement strategy that places objects towards the center of the printing plate. We propose alternative object arrangement strategies such as placing objects towards a corner of the printing plate and scheduling objects according to their height. Our parallelization is done at the high-level where we execute the CEGAR-SEQ algorithm in parallel with a portfolio of object arrangement strategies, an algorithm is called Portfolio-CEGAR-SEQ. Our experimental evaluation indicates that Portfolio-CEGAR-SEQ outperforms the original CEGAR-SEQ. When a batch of objects for multiple printing plates is scheduled, Portfolio-CEGAR-SEQ often uses fewer printing plates than CEGAR-SEQ.

Keywords: 3D printing, object arrangement, scheduling, parallel computing, CEGAR, combinatorial optimization, portfolio strategy
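
The portfolio idea, running the same solver under several strategies in parallel and keeping the best outcome, can be sketched as below. Everything here is a stand-in: `pack_first_fit` is a toy one-dimensional first-fit packer rather than CEGAR-SEQ, the three ordering strategies are hypothetical substitutes for the paper's arrangement strategies, and a thread pool is used only to keep the sketch short (a process pool would be the more natural choice for CPU-bound solvers).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical ordering strategies; stand-ins for the paper's arrangement strategies.
STRATEGIES = {
    "as_given": lambda sizes: list(sizes),
    "largest_first": lambda sizes: sorted(sizes, reverse=True),
    "smallest_first": lambda sizes: sorted(sizes),
}


def pack_first_fit(sizes, capacity):
    """Toy stand-in solver: first-fit packing of sizes into plates of a given capacity."""
    plates = []  # each plate is a list of object sizes
    for size in sizes:
        for plate in plates:
            if sum(plate) + size <= capacity:
                plate.append(size)
                break
        else:  # no existing plate had room, so open a new one
            plates.append([size])
    return plates


def run_portfolio(sizes, capacity):
    """Run every strategy concurrently and keep the result that uses the fewest plates."""
    def solve(name):
        return name, pack_first_fit(STRATEGIES[name](sizes), capacity)

    with ThreadPoolExecutor(max_workers=len(STRATEGIES)) as pool:
        results = list(pool.map(solve, STRATEGIES))
    return min(results, key=lambda item: len(item[1]))


sizes = [4, 8, 1, 4, 2, 1]
best_name, best_plates = run_portfolio(sizes, capacity=10)
print(best_name, len(best_plates))
```

On this input the orderings disagree: the smallest-first ordering needs one more plate than the others. That is why a portfolio can beat any single fixed strategy, at the cost of running all of them.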

33. ❌ RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images

Authors: Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Yaoqi Sun, Sam Kwong Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12215v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

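Scoring rationale: The paper focuses on salient object detection in remote sensing images, a computer vision task. It proposes the RDNet architecture, uses SwinTransformer as the backbone, and designs a dynamic adaptive module, a frequency-matching enhancement module, and a region proportion-aware localization module to handle scale variation and global context modeling. The scoring keywords all relate directly to large language models, innovations in the principles of deep learning techniques, or applications of AI in science, whereas this paper studies a traditional computer vision task and involves none of these, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the large scale variation and difficult global context modeling that salient object detection faces in remote sensing images, the paper proposes RDNet, which introduces a dynamic adaptive module, a frequency-matching enhancement module, and a region proportion-aware localization module to achieve robustness to scale variation and precise object localization, with detection performance better than existing methods.

Abstract translation

Salient object detection in remote sensing images faces major challenges because of large differences in object size, the high computational cost of self-attention, and the limits of CNN-based feature extractors in capturing global context and long-range dependencies. Existing methods that rely on fixed convolution kernels often struggle to adapt to diverse object scales, which leads to lost detail or aggregation of irrelevant features. To address these problems, this work aims to strengthen robustness to scale variation and achieve precise object localization. We propose the Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet), which replaces the CNN backbone with SwinTransformer for global context modeling and introduces three key modules: (1) the Dynamic Adaptive Detail-aware (DAD) module, which applies varying convolution kernels guided by object region proportions; (2) the Frequency-matching Context Enhancement (FCE) module, which enriches contextual information through wavelet interactions and attention; and (3) the Region Proportion-aware Localization (RPL) module, which uses cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block to assist the DAD module. By combining these modules, RDNet achieves robustness to scale variation and precise localization, and delivers better detection performance than existing state-of-the-art methods.

Abstract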

Salient object detection (SOD) in remote sensing images faces significant challenges due to large variations in object sizes, the computational cost of self-attention mechanisms, and the limitations of CNN-based extractors in capturing global context and long-range dependencies. Existing methods that rely on fixed convolution kernels often struggle to adapt to diverse object scales, leading to detail loss or irrelevant feature aggregation. To address these issues, this work aims to enhance robustness to scale variations and achieve precise object localization. We propose the Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet), which replaces the CNN backbone with the SwinTransformer for global context modeling and introduces three key modules: (1) the Dynamic Adaptive Detail-aware (DAD) module, which applies varied convolution kernels guided by object region proportions; (2) the Frequency-matching Context Enhancement (FCE) module, which enriches contextual information through wavelet interactions and attention; and (3) the Region Proportion-aware Localization (RPL) module, which employs cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block to assist the DAD module. By combining these modules, RDNet achieves robustness against scale variations and accurate localization, delivering superior detection performance compared with state-of-the-art methods.

Keywords: Salient object detection, Remote sensing images, SwinTransformer, Dynamic adaptive module, Scale variation, Global context modeling, Region proportion-aware, Object localization

34. ❌ WORKSWORLD: A Domain for Integrated Numeric Planning and Scheduling of Distributed Pipelined Workflows

Authors: Taylor Paul, William Regli Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12214v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

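Scoring rationale: The paper focuses on automated planning and scheduling. It studies planning and scheduling of distributed data pipelines and develops the WORKSWORLD domain for numeric planners. Its content covers workflow representation, resource graphs, numeric planning algorithms, and other classical automated planning techniques, and has nothing to do with large language models, deep learning, AI for Science, or the other keyword topics. Every keyword concerns large models, deep learning, or scientific applications of AI, whereas this paper is purely automated planning research, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper introduces the WORKSWORLD domain for automated planning and scheduling of distributed data pipelines; a numeric planner running on commodity hardware solves linear-chain workflows of up to 14 components across eight sites.

Abstract translation

This work pursues automated planning and scheduling of distributed data pipelines, also called workflows. We develop a general representation of workflow and resource graphs that includes both data processing and data sharing components, together with the corresponding network interfaces used for scheduling. Building on these graphs, we introduce WORKSWORLD, a new domain for numeric domain-independent planners, designed for permanently scheduled workflows such as data ingest pipelines. The framework lets users define data sources, available workflow components, and the desired data destinations and formats without explicitly declaring the entire workflow graph as a goal. The planner solves a joint planning and scheduling problem, producing a plan that both builds the workflow graph and schedules its components on the resource graph. Experiments show that a state-of-the-art numeric planner running on commodity hardware, with one hour of CPU time and 30GB of memory, can solve linear-chain workflows of up to 14 components spanning eight sites.

Abstract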

This work pursues automated planning and scheduling of distributed data pipelines, or workflows. We develop a general workflow and resource graph representation that includes both data processing and sharing components with corresponding network interfaces for scheduling. Leveraging these graphs, we introduce WORKSWORLD, a new domain for numeric domain-independent planners designed for permanently scheduled workflows, like ingest pipelines. Our framework permits users to define data sources, available workflow components, and desired data destinations and formats without explicitly declaring the entire workflow graph as a goal. The planner solves a joint planning and scheduling problem, producing a plan that both builds the workflow graph and schedules its components on the resource graph. We empirically show that a state-of-the-art numeric planner running on commodity hardware with one hour of CPU time and 30GB of memory can solve linear-chain workflows of up to 14 components across eight sites.

Keywords: automated planning, scheduling, distributed data pipelines, workflow representation, numeric planning, resource graph, joint planning and scheduling, linear-chain workflows

35. ❌ Compiling Temporal Numeric Planning into Discrete PDDL+: Extended Version

Authors: Andrea Micheli, Enrico Scala, Alessandro Valentini Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12188v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

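Scoring rationale: The paper studies compilation into the PDDL+ modeling language in automated planning, specifically temporal planning, durative actions, and numeric planning, which belongs to classical AI planning. The scoring keywords all concern modern large-model topics such as large language models, deep learning techniques, model training and optimization, AI alignment, reasoning methods, AI agents, and model compression; the paper touches none of these, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents a practical method for compiling temporal numeric planning problems with durative actions into PDDL+ models; it fully captures the semantics, assumes only that actions do not self-overlap, and is shown experimentally to be of practical relevance for hard temporal numeric problems.

Abstract translation

Since the PDDL+ modeling language was introduced, it has been known that temporal planning with durative actions (as in PDDL 2.1) can be compiled into PDDL+, yet no practical compilation has been presented in the literature since. This paper presents a practical compilation of temporal planning with durative actions into PDDL+ that fully preserves the semantics and assumes only that actions do not self-overlap. The compilation is polynomial, preserves plan length up to a constant factor, and is shown experimentally to have practical value for hard temporal numeric planning problems.

Abstract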

Since the introduction of the PDDL+ modeling language, it was known that temporal planning with durative actions (as in PDDL 2.1) could be compiled into PDDL+. However, no practical compilation was presented in the literature ever since. We present a practical compilation from temporal planning with durative actions into PDDL+, fully capturing the semantics and only assuming the non-self-overlapping of actions. Our compilation is polynomial, retains the plan length up to a constant factor and is experimentally shown to be of practical relevance for hard temporal numeric problems.

Keywords: temporal planning, numeric planning, PDDL+, durative actions, compilation, planning languages, automated planning

Authors: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12180v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

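Scoring rationale: The paper studies the reasoning ability of multimodal agents in document-intensive workflows. Its core involves LLM agents and retrieval-augmented generation (RAG), so these two keywords are highly relevant (10 points). It evaluates agents' "strategic reasoning" versus "stochastic search", which involves reasoning processes (Chain of Thought, System 2 Thinking), so relevance is fairly high (8 points). The paper mentions agents using tools (such as retrieving documents), which gives some connection to Tool Use (5 points). It does not cover the specific techniques behind the other keywords, such as MoE, quantization, or alignment, so those score 0.

!!! tip deepseek-chat TL;DR

By introducing the MADQA benchmark and an evaluation framework, the paper finds that current multimodal agents can match human searchers' accuracy on document question answering but rely mainly on brute-force search rather than strategic reasoning, and fail to close a gap of nearly 20% to oracle performance.

Abstract translation

Multimodal agents offer a promising route to automating complex document-intensive workflows. A key question remains, however: do these agents show genuine strategic reasoning, or only search based on random trial and error? To investigate, we propose the MADQA benchmark, which comprises 2,250 human-authored questions built on 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design the benchmark to maximize its power to discriminate between different levels of agentic ability. To evaluate agent behavior, we introduce a novel evaluation protocol that measures the trade-off between accuracy and effort. Using this framework, we find that although the best agents can match human searchers in raw accuracy, the questions they solve differ markedly from those humans solve, and they rely on brute-force search to make up for weak strategic planning. These agents fail to close a gap of nearly 20% to oracle performance and often get stuck in unproductive loops. We release the dataset and evaluation tools in the hope of helping research shift from brute-force retrieval toward calibrated, efficient reasoning.

Abstract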

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.

Keywords: multimodal agents, document-intensive workflows, strategic reasoning, stochastic search, MADQA benchmark, accuracy-effort trade-off, brute-force retrieval, agentic behavior

37. ❌ Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials

Authors: Abhinaba Basu, Pavan Chakraborty Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12183v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

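Scoring rationale: The paper focuses on materials science, studying a reliability verification method (Proof-Carrying Materials) for machine-learned interatomic potentials (MLIPs). It is entirely unrelated to the vast majority of the keywords, which cover the technical principles of large models, training methods, inference optimization, alignment, agent systems, and so on. It is highly relevant only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because it is applied research on AI in science (specifically materials science) and concerns evaluating and improving the reliability of machine learning models in materials discovery.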

!!! tip deepseek-chat TL;DR

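To address the lack of reliability guarantees for machine-learned interatomic potentials in high-throughput materials screening, the paper proposes the Proof-Carrying Materials framework, which combines adversarial falsification, confidence-interval refinement, and formal verification, and significantly raises the discovery yield of stable materials (a 25% improvement in the case study).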

Abstract (Chinese translation)

机器学习原子间势（MLIPs）被用于高通量材料筛选，但缺乏形式化的可靠性保证。我们通过一个包含25,000种材料的基准测试表明，使用单一MLIP作为稳定性过滤器会漏掉93%的密度泛函理论（DFT）稳定材料（召回率0.07）。“证明携带材料”（Proof-Carrying Materials, PCM）方法通过三个阶段弥补了这一差距：在成分空间中进行对抗性证伪、采用95%置信区间的自举包络细化，以及基于Lean 4的形式化验证。对CHGNet、TensorNet和MACE的审计揭示了架构特定的盲点，其成对误差相关性近乎为零（r <= 0.13；n = 5,000），这一结果得到了独立Quantum ESPRESSO验证的确认（20/20收敛；DFT与CHGNet力比中位数为12倍）。基于PCM发现的特征训练的风险模型，能够预测未知材料的失效情况（AUC-ROC = 0.938 +/- 0.004），并可在不同架构间迁移（跨MLIP AUC-ROC ~ 0.70；特征重要性r = 0.877）。在一个热电材料筛选的案例研究中，经PCM审计的筛选方案额外发现了62种被单一MLIP筛选漏掉的稳定材料——使发现产出提高了25%。

Abstract

Machine-learned interatomic potentials (MLIPs) are deployed for high-throughput materials screening without formal reliability guarantees. We show that a single MLIP used as a stability filter misses 93% of density functional theory (DFT)-stable materials (recall 0.07) on a 25,000-material benchmark. Proof-Carrying Materials (PCM) closes this gap through three stages: adversarial falsification across compositional space, bootstrap envelope refinement with 95% confidence intervals, and Lean 4 formal certification. Auditing CHGNet, TensorNet and MACE reveals architecture-specific blind spots with near-zero pairwise error correlations (r <= 0.13; n = 5,000), confirmed by independent Quantum ESPRESSO validation (20/20 converged; median DFT/CHGNet force ratio 12x). A risk model trained on PCM-discovered features predicts failures on unseen materials (AUC-ROC = 0.938 +/- 0.004) and transfers across architectures (cross-MLIP AUC-ROC ~ 0.70; feature importance r = 0.877). In a thermoelectric screening case study, PCM-audited protocols discover 62 additional stable materials missed by single-MLIP screening - a 25% improvement in discovery yield.

Keywords: Machine-learned interatomic potentials, Materials screening, Reliability guarantees, Adversarial falsification, Formal certification, Stability filter, Risk model, Thermoelectric materials
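
The recall figure and the bootstrap stage mentioned in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the labels are synthetic, the function names are hypothetical, and a plain percentile bootstrap is used because the abstract does not say which bootstrap variant PCM employs.

```python
import numpy as np

def recall(y_true, y_pred):
    """Fraction of truly stable materials (label 1) that the filter keeps."""
    stable = y_true == 1
    return float((y_pred[stable] == 1).mean())

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for recall (illustrative choice)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample materials with replacement
        if y_true[idx].sum() == 0:         # skip resamples with no stable material
            continue
        stats.append(recall(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic data: 1000 stable and 1000 unstable materials;
# the hypothetical filter keeps only 70 of the stable ones.
y_true = np.concatenate([np.ones(1000, dtype=int), np.zeros(1000, dtype=int)])
y_pred = np.zeros(2000, dtype=int)
y_pred[:70] = 1

point = recall(y_true, y_pred)       # 70 / 1000
lo, hi = bootstrap_ci(y_true, y_pred)
print(f"recall = {point:.2f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

A recall of 0.07 is the same statement as "misses 93% of DFT-stable materials": 1 - 0.07 = 0.93.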

38. ❌ BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

Authors: Jingyang Ke, Weihan Li, Amartya Pradhan, Jeffrey Markowitz, Anqi Wu Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12176v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

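Scoring rationale: BehaviorVLM proposes a unified vision-language framework for animal behavior analysis. Its core innovation is using pretrained vision-language models (VLMs) and large language models (LLMs) for reasoning without task-specific finetuning. Relevance to the keywords is as follows. 1) Highly relevant (10 points): 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' and 'System 2 Thinking/Slow Thinking/In-depth Reasoning' are the paper's core method, which guides VLMs through detailed, explicit, and verifiable reasoning steps; 'AI for Science/Bioinformatics/Cheminformatics' corresponds directly to the paper's application domain in neuroscience and animal behavior analysis. 2) Moderately relevant (8 points): 'Large Language Models/LLMs/Foundation Models' are used in the behavioral understanding pipeline to merge and semantically label behavioral segments. 3) Slightly relevant (5 points): 'Pre-training/Continual Pre-training/Domain Adaptation' relates to the use of pretrained VLMs; 'Mechanistic Interpretability/Explainable AI' relates to the framework's interpretability. 4) Irrelevant (0 points): other keywords such as MoE, SFT, RLHF, RAG, and quantization are not addressed in the paper.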

!!! tip deepseek-chat TL;DR

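The paper proposes BehaviorVLM, a finetuning-free vision-language framework that uses multi-stage reasoning steps to perform pose estimation and behavioral understanding for freely moving animals, greatly reducing the need for human annotation and improving scalability and interpretability.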

Abstract (Chinese translation)

理解自由活动动物的行为是神经科学的核心课题，其中姿态估计与行为理解构成了将神经活动与自然动作相联系的基础。然而这两项任务目前仍严重依赖人工标注或不稳定的无监督流程，限制了可扩展性与可重复性。我们提出BehaviorVLM——一个统一的视觉语言框架，通过引导预训练视觉语言模型（VLMs）执行详细、明确且可验证的推理步骤，无需任务特定微调与极少人工标注，即可同时完成姿态估计与行为理解。针对姿态估计，我们利用量子点标记的行为数据，提出融合时序、空间及跨视角推理的多阶段流程。该设计大幅减少了人工标注需求，通过重投影误差等几何检验识别低置信度标签，并生成可用于后续筛选、校正或微调下游姿态模型的标签。针对行为理解，我们提出整合深度嵌入聚类（用于过分割行为发现）、基于VLM的单片段视频描述生成以及基于LLM的推理（用于合并行为片段并赋予语义标签）的流程。该行为分析流程可直接基于视觉信息运行，无需依赖关键点进行行为分割。这些组件共同实现了对多动物行为的可扩展、可解释且低标注依赖的分析。

Abstract

Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.

Keywords: Vision-Language Models, Behavioral Understanding, Pose Estimation, Reasoning Steps, Neuroscience, Multi-animal Behavior, Fine-tuning-Free, Interpretable Analysis
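
The abstract names reprojection error as one geometric check for exposing low-confidence labels. The sketch below shows the standard pinhole-camera version of that check. It is not taken from the paper: the camera parameters, the 5-pixel threshold, and the function names are all hypothetical.

```python
import numpy as np

def project(P, X):
    """Project 3D points X of shape (N, 3) to pixels (N, 2) with a 3x4 camera matrix P."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])  # homogeneous coordinates
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]                    # perspective divide

def reprojection_error(P, X, uv):
    """Per-keypoint pixel distance between projected 3D points and 2D labels."""
    return np.linalg.norm(project(P, X) - uv, axis=1)

def flag_low_confidence(P, X, uv, threshold_px=5.0):
    """Mark labels whose reprojection error exceeds an assumed pixel threshold."""
    return reprojection_error(P, X, uv) > threshold_px

# Toy camera: focal length 500 px, principal point (320, 240), identity pose.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])

X = np.array([[0.0, 0.0, 2.0],
              [0.1, -0.1, 2.0]])       # two triangulated keypoints
uv = project(P, X)                     # start from perfectly consistent labels
uv[1] += np.array([12.0, 0.0])         # corrupt the second label by 12 px

flags = flag_low_confidence(P, X, uv)
print(flags)
```

In a multi-view setup the same check would run once per camera, and a keypoint flagged in any view could be routed to manual correction.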

39. ❌ A Quantitative Characterization of Forgetting in Post-Training

Authors: Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12163v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

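Scoring rationale: The paper's core topic is forgetting in continual post-training, which is highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (10 points), since the title and abstract focus explicitly on 'post-training'. The paper concerns generative models and so has some connection to 'Large Language Models OR LLMs OR Foundation Models' (5 points), but it does not target LLMs specifically. Other keywords such as MoE, SLMs, Scaling Laws, Instruction Tuning, RLHF, RAG, Context Window, KV Cache, CoT, Agents, Quantization, Hallucination, Interpretability, World Models, Model Merging, In-context Learning, and AI for Science are neither mentioned in the abstract nor relevant, so they are scored 0.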

!!! tip deepseek-chat TL;DR

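The paper studies the theoretical mechanisms of forgetting in continual post-training of generative models. Using a two-mode mixture model, it quantifies two forms of forgetting, mass forgetting and old-component drift, analyzes how different KL objectives and replay strategies affect forgetting, and provides a precise theoretical framework for understanding post-training forgetting.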

Abstract (Chinese translation)

生成模型的持续后训练已被广泛应用，然而对于遗忘何时发生及其原因的系统性理解仍较为有限。我们在Chen等人（2025）（arXiv:2510.18874）提出的双模态混合抽象（代表旧任务与新任务）框架下发展理论结果，并将遗忘形式化为两种类型：（i）质量遗忘，即旧混合权重坍缩为零；（ii）旧成分漂移，即已学习正确的旧成分在训练过程中发生偏移。针对等协方差的高斯模态，我们证明：基于新分布数据训练的前向KL目标函数会驱使旧权重趋于零，而反向KL目标函数则收敛至真实目标（从而避免质量遗忘），且仅通过由巴氏系数控制的、基于重叠度的误分配概率扰动旧均值，从而产生随模态分离度呈指数衰减的漂移，并形成具有指数收敛性的局部良态几何结构。我们进一步量化了经验回放与这些目标函数的交互作用：对于前向KL，回放必须修改训练分布以改变总体最优解；对于反向KL，回放虽不改变总体目标函数，但通过有界重要性加权防止有限批次中的旧模态衰减。最后，我们通过同一理论视角分析了三种近期提出的近同策略后训练方法——SDFT（arxiv:2601.19897）、TTT-Discover（arxiv:2601.16175）与OAPL（arxiv:2602.19362），推导出每种方法保留旧质量并呈现重叠度控制漂移的显式条件。总体而言，我们的研究表明：遗忘现象可以根据散度方向、几何行为重叠度、采样机制以及训练过程中历史行为的可见性之间的相互作用进行精确量化。

Abstract

Continual post-training of generative models is widely used, yet a principled understanding of when and why forgetting occurs remains limited. We develop theoretical results under a two-mode mixture abstraction (representing old and new tasks), proposed by Chen et al. (2025) (arXiv:2510.18874), and formalize forgetting in two forms: (i) mass forgetting, where the old mixture weight collapses to zero, and (ii) old-component drift, where an already-correct old component shifts during training. For equal-covariance Gaussian modes, we prove that forward-KL objectives trained on data from the new distribution drive the old weight to zero, while reverse-KL objectives converge to the true target (thereby avoiding mass forgetting) and perturb the old mean only through overlap-gated misassignment probabilities controlled by the Bhattacharyya coefficient, yielding drift that decays exponentially with mode separation and a locally well-conditioned geometry with exponential convergence. We further quantify how replay interacts with these objectives. For forward-KL, replay must modify the training distribution to change the population optimum; for reverse-KL, replay leaves the population objective unchanged but prevents finite-batch old-mode starvation through bounded importance weighting. Finally, we analyze three recently proposed near-on-policy post-training methods, SDFT (arxiv:2601.19897), TTT-Discover (arxiv:2601.16175), and OAPL (arxiv:2602.19362), via the same lens and derive explicit conditions under which each retains old mass and exhibits overlap-controlled drift. Overall, our results show that forgetting can be precisely quantified based on the interaction between divergence direction, geometric behavioral overlap, sampling regime, and the visibility of past behavior during training.

Keywords: post-training, forgetting, continual learning, generative models, KL divergence, mixture models, replay, theoretical analysis
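
Two quantities from the abstract can be checked numerically in a one-dimensional toy version of the two-mode setting. This sketch is not the paper's analysis. It assumes only the old-mode weight `w` is trainable, uses hypothetical means 0 and 4 with unit variance, and evaluates the population forward-KL objective by quadrature rather than by training. The Bhattacharyya coefficient uses the standard closed form for equal-variance Gaussians, exp(-(mu_old - mu_new)^2 / (8 sigma^2)).

```python
import numpy as np

def gauss(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def bhattacharyya_coefficient(mu_old, mu_new, sigma):
    """Closed form for two 1-D Gaussians that share the same variance."""
    return float(np.exp(-((mu_old - mu_new) ** 2) / (8.0 * sigma ** 2)))

def forward_kl_to_new(w, mu_old, mu_new, sigma, x):
    """KL(p_new || q_w) by a Riemann sum, where q_w = w * old + (1 - w) * new."""
    p_new = gauss(x, mu_new, sigma)
    q = w * gauss(x, mu_old, sigma) + (1.0 - w) * p_new
    dx = x[1] - x[0]
    return float(np.sum(p_new * (np.log(p_new) - np.log(q))) * dx)

mu_old, mu_new, sigma = 0.0, 4.0, 1.0      # hypothetical mode parameters
x = np.linspace(-12.0, 16.0, 28001)        # integration grid covering both modes

# Forward KL against new-task data is smallest when the old weight is zero.
weights = np.linspace(0.0, 0.9, 10)
kls = [forward_kl_to_new(w, mu_old, mu_new, sigma, x) for w in weights]
best_w = float(weights[int(np.argmin(kls))])

# Overlap between the modes: closed form versus direct numerical integration.
bc_closed = bhattacharyya_coefficient(mu_old, mu_new, sigma)
bc_numeric = float(np.sum(np.sqrt(gauss(x, mu_old, sigma) * gauss(x, mu_new, sigma)))
                   * (x[1] - x[0]))
print(best_w, bc_closed, bc_numeric)
```

Because the objective grows with `w` on this grid, any descent method on it pushes the old weight toward zero, which is the mass-forgetting behavior the abstract attributes to forward-KL. The closed form also shows why overlap shrinks quickly as the modes separate.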

40. ❌ GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Authors: Zexuan Yan, Jiarui Jin, Yue Ma, Shijian Wang, Jiahui Hu, Wenxiang Jiao, Yuan Lu, Linfeng Zhang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12155v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

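Scoring rationale: GlyphBanana focuses on precise text rendering in text-to-image generation and proposes an agentic-workflow-based method to improve the generation of complex characters and formulas. The paper is unrelated to most keywords, which mainly concern the technical principles, training methods, and inference optimization of large language models. Its core innovation is the design of the agentic workflow, so it is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points). The paper mentions using auxiliary tools, which has some connection to 'Tool Use OR Function Calling OR API Tool Use' (5 points). The paper deals not with large language models themselves but with an application of text-to-image models, so it is unrelated to keywords such as AI for Science.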

!!! tip deepseek-chat TL;DR

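The paper addresses inaccurate rendering of complex text and mathematical formulas in text-to-image generation. It introduces GlyphBanana and a corresponding benchmark, and uses a training-free method based on an agentic workflow to achieve more precise text rendering across a range of T2I models.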

Abstract (Chinese translation)

尽管生成模型的最新进展推动了文本渲染领域的显著进步，但准确生成复杂文本和数学公式仍然是一项艰巨的挑战。这一困难主要源于当前模型在遇到分布外提示时，其指令跟随能力有限。为解决此问题，我们引入了GlyphBanana，以及一个专门为渲染复杂字符和公式设计的对应基准。GlyphBanana采用一种智能体工作流，该工作流集成辅助工具，将字形模板注入到潜在空间和注意力图中，从而促进生成图像的迭代优化。值得注意的是，我们这种无需训练的方法可以无缝应用于各种文本到图像（Text-to-Image, T2I）模型，与现有基线相比实现了更高的精确度。大量实验证明了我们所提出工作流的有效性。相关代码已在 https://github.com/yuriYanZeXuan/GlyphBanana 公开。

Abstract

Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.

Keywords: text rendering, agentic workflow, glyph templates, Text-to-Image models, complex characters, mathematical formulas, training-free approach, iterative refinement

41. ❌ IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

Authors: Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu, Eric Xing, Virginia Smith, Ruslan Salakhutdinov, Zhiting Hu, Taylor Killian, Aviral Kumar Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12151v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

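Scoring rationale: The paper's core topic is compute optimization in reinforcement learning post-training of LLMs. It is highly relevant to 'Large Language Models', 'Post-training', and 'RLHF' (10 points), since these are the paper's direct objects of study and methods. It has some connection to 'Scaling Laws' (5 points), because the paper borrows the concept of scaling laws and extends it to RL. Other keywords such as MoE, SLMs, RAG, and quantization are not addressed in the paper and are scored 0.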

!!! tip deepseek-chat TL;DR

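The paper studies the optimal allocation of sampling compute in reinforcement learning post-training of large language models. It finds that the compute-optimal number of parallel rollouts per problem grows predictably with the compute budget and eventually saturates, and it offers practical guidance for compute-efficient LLM RL training.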

Abstract (Chinese translation)

尽管扩展定律为大型语言模型（LLM）预训练的计算资源分配提供了指导，但对于大型语言模型强化学习（RL）后训练中类似的计算分配原则，目前仍缺乏深入理解。本研究探讨了在LLM中采用同策略RL方法时，采样计算量的计算最优分配问题，将扩展问题构建为对三种资源在计算约束下的优化：每个问题的并行轨迹数量、每批次的问题数量以及更新步数。我们发现，每个问题的计算最优并行轨迹数量随计算预算的增加呈现可预测的增长，随后趋于饱和。这一趋势在简单问题和困难问题中均成立，但由不同机制驱动：在简单问题中主要由解锐化驱动，在困难问题中则由覆盖扩展驱动。我们进一步表明，增加并行轨迹数量可以减轻不同问题间的干扰，而每批次的问题数量主要影响训练稳定性，且可在较宽范围内选择。这些结论在不同基础模型和数据分布中得到验证，我们的研究结果将RL扩展定律重新表述为规范性的分配规则，并为计算高效的大型语言模型RL后训练提供了实用指导。

Abstract

While scaling laws guide compute allocation for LLM pre-training, analogous prescriptions for reinforcement learning (RL) post-training of large language models (LLMs) remain poorly understood. We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scaling as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps. We find that the compute-optimal number of parallel rollouts per problem increases predictably with compute budget and then saturates. This trend holds across both easy and hard problems, though driven by different mechanisms: solution sharpening on easy problems and coverage expansion on hard problems. We further show that increasing the number of parallel rollouts mitigates interference across problems, while the number of problems per batch primarily affects training stability and can be chosen within a broad range. Validated across base models and data distributions, our results recast RL scaling laws as prescriptive allocation rules and provide practical guidance for compute-efficient LLM RL post-training.

Keywords: Large Language Models, Reinforcement Learning, Post-training, Scaling Laws, Compute Optimization, Sampling Compute, RLHF, Parallel Rollouts
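
The three-resource framing in the abstract can be made concrete with a small enumeration. The cost model below is an assumption, not the paper's: it counts sampling compute as the total number of rollouts, that is rollouts per problem times problems per batch times update steps, and ignores differences in rollout length. The function name and the candidate values are hypothetical.

```python
from itertools import product

def iso_compute_allocations(budget, rollouts_options, batch_options):
    """List (rollouts_per_problem, problems_per_batch, update_steps) triples
    whose product equals the rollout budget exactly."""
    allocations = []
    for n, b in product(rollouts_options, batch_options):
        if budget % (n * b) == 0:
            allocations.append((n, b, budget // (n * b)))
    return allocations

budget = 2 ** 14  # total rollouts available (hypothetical)
allocs = iso_compute_allocations(budget, [4, 8, 16, 32], [32, 64, 128])

for n, b, steps in allocs:
    print(f"rollouts/problem={n:>2}  problems/batch={b:>3}  update steps={steps}")
```

Each row spends the same budget, so a study of this kind would train one run per row and compare final performance. The sketch only enumerates the candidates and does not predict which one wins.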

42. ❌ FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

Authors: Quanhao Li, Zhen Xing, Rui Wang, Haidong Cao, Qi Dai, Daoguo Dong, Zuxuan Wu Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12146v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

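Scoring rationale: The paper "FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance" focuses on video generation, specifically trajectory-guided controllable video generation and diffusion model distillation. Although it involves deep learning (e.g., diffusion models, adversarial training) and model acceleration (few-step generation), all of the given keywords explicitly target large language models (LLMs) and related techniques (e.g., MoE, SFT, RLHF, RAG, CoT, Agents), specific AI-for-science applications (e.g., bioinformatics), or LLM-specific techniques (e.g., KV cache compression, context window extension). The paper does not involve any language model, text generation, LLM alignment, reasoning, agents, or scientific AI application, so the relevance of every keyword is 0.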

!!! tip deepseek-chat TL;DR

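To address the time redundancy caused by multi-step denoising in trajectory-controllable video generation, the paper proposes the FlashMotion training framework, which uses distillation and fine-tuning with a hybrid objective to reduce the number of generation steps while preserving video quality and trajectory accuracy.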

Abstract (Chinese translation)

轨迹可控视频生成领域近期取得了显著进展。现有方法主要采用基于适配器的架构，以实现沿预定轨迹的精确运动控制。然而，这些方法均依赖于多步去噪过程，导致显著的时间冗余与计算开销。虽然现有视频蒸馏方法已成功将多步生成器蒸馏为少步模型，但将其直接应用于轨迹可控视频生成时，会导致视频质量与轨迹精度明显下降。为弥补这一差距，我们提出了FlashMotion——一种专为少步轨迹可控视频生成设计的新型训练框架。我们首先在多步视频生成器上训练轨迹适配器以实现精确轨迹控制，随后将生成器蒸馏为少步版本以加速视频生成。最后，我们采用融合扩散目标与对抗目标的混合策略对适配器进行微调，使其与少步生成器对齐，从而生成高质量、高轨迹精度的视频。为进行评估，我们构建了FlashBench基准测试集，该基准专注于长序列轨迹可控视频生成，能够衡量不同前景物体数量下的视频质量与轨迹精度。在两种适配器架构上的实验表明，FlashMotion在视觉质量与轨迹一致性方面均优于现有视频蒸馏方法及先前的多步模型。

Abstract

Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.

Keywords: video generation, trajectory guidance, few-step generation, diffusion models, model distillation, adversarial training, computational efficiency, trajectory accuracy

43. ❌ Automatic Generation of High-Performance RL Environments

Authors: Seth Karten, Rahul Dev Appapogu, Chi Jin Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12145v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

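Scoring rationale: The paper studies a method for automatically generating high-performance reinforcement learning environments using an AI agent (a coding agent). Although it involves AI agents and automated pipelines, all of the keywords directly target large language models (LLMs) and related techniques (e.g., training methods, inference optimization, alignment, applications), whereas the core of this paper is RL environment generation and performance optimization. It does not involve LLM technical principles, training, inference, or applications, so none of the keywords are relevant.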

!!! tip deepseek-chat TL;DR

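The paper proposes a general method for automatically generating high-performance reinforcement learning environments with an AI agent. Through a prompt template, hierarchical verification, and iterative repair, it produces semantically equivalent high-performance environments quickly and at low cost, and it validates the method on five environments.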

Abstract (Chinese translation)

将复杂的强化学习（RL）环境转化为高性能实现传统上需要数月的专业工程开发。我们提出一种可复用的方法——包含通用提示模板、分层验证与迭代式智能体辅助修复——能够以低于10美元的计算成本生成语义等效的高性能环境。我们在五个环境中展示了三种不同的工作流程。直接翻译（无现有高性能实现）：EmuRust（通过Rust并行化实现Game Boy模拟器的PPO速度提升1.5倍）与首个GPU并行的Pokemon对战模拟器PokeJAX（随机动作5亿步/秒，PPO 1520万步/秒；较TypeScript参考实现提升22,320倍）。基于现有高性能实现的验证翻译：在匹配GPU批次大小时，与MJX实现吞吐量持平（1.04倍）并在HalfCheetah JAX环境中达到Brax的5倍性能；在Puffer Pong环境中实现42倍PPO加速。新环境创建：首个可部署的JAX版Pokemon集换式卡牌引擎TCGJax（随机动作71.7万步/秒，PPO 15.3万步/秒；较Python参考实现提升6.6倍），该引擎从网络提取的规范自动合成。当模型参数量达2亿时，环境开销降至训练时间的4%以下。分层验证（属性测试、交互测试与推演测试）确认了所有五个环境的语义等效性；跨后端策略迁移证实所有环境均实现零模拟间隙。TCGJax基于未公开于代码仓库的私有参考实现合成，可作为智能体预训练数据污染问题的控制案例。本文提供了充分细节——包括代表性提示、验证方法与完整结果——使得编码智能体能直接从论文复现所有翻译过程。

Abstract

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a reusable recipe - a generic prompt template, hierarchical verification, and iterative agent-assisted repair - that produces semantically equivalent high-performance environments for <$10 in compute cost. We demonstrate three distinct workflows across five environments. Direct translation (no prior performance implementation exists): EmuRust (1.5x PPO speedup via Rust parallelism for a Game Boy emulator) and PokeJAX, the first GPU-parallel Pokemon battle simulator (500M SPS random action, 15.2M SPS PPO; 22,320x over the TypeScript reference). Translation verified against existing performance implementations: throughput parity with MJX (1.04x) and 5x over Brax at matched GPU batch sizes (HalfCheetah JAX); 42x PPO (Puffer Pong). New environment creation: TCGJax, the first deployable JAX Pokemon TCG engine (717K SPS random action, 153K SPS PPO; 6.6x over the Python reference), synthesized from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Hierarchical verification (property, interaction, and rollout tests) confirms semantic equivalence for all five environments; cross-backend policy transfer confirms zero sim-to-sim gap for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns. The paper contains sufficient detail - including representative prompts, verification methodology, and complete results - that a coding agent could reproduce the translations directly from the manuscript.

Keywords: reinforcement learning environments, automatic generation, high-performance implementation, agent-assisted repair, semantic equivalence, hierarchical verification, policy transfer, GPU-parallel simulation
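
The abstract lists rollout tests as one layer of its hierarchical verification. A minimal sketch of what such a test could look like is given below, assuming both environments are deterministic given a seed. The toy environments, the `reset`/`step` interface, and the function name are hypothetical stand-ins, not the paper's harness.

```python
import random

class ReferenceEnv:
    """Toy stand-in for a slow reference environment (counter modulo 7)."""
    def reset(self, seed):
        self.state = seed % 7
        return self.state

    def step(self, action):
        self.state = (self.state + action) % 7
        reward = 1.0 if self.state == 0 else 0.0
        return self.state, reward

class TranslatedEnv:
    """Toy stand-in for a translated environment: table lookup, same semantics."""
    TABLE = [[(s + a) % 7 for a in range(3)] for s in range(7)]

    def reset(self, seed):
        self.state = seed % 7
        return self.state

    def step(self, action):
        self.state = self.TABLE[self.state][action]
        return self.state, (1.0 if self.state == 0 else 0.0)

class BuggyEnv(TranslatedEnv):
    """Deliberately wrong translation: wraps at 6 instead of 7."""
    TABLE = [[(s + a) % 6 for a in range(3)] for s in range(7)]

def rollout_equivalent(ref_env, new_env, episodes=20, horizon=50, seed=0):
    """Drive both environments with identical seeds and actions and
    report whether every observation and reward matches."""
    rng = random.Random(seed)
    for _ in range(episodes):
        episode_seed = rng.randrange(10 ** 6)
        if ref_env.reset(episode_seed) != new_env.reset(episode_seed):
            return False
        for _ in range(horizon):
            action = rng.randrange(3)
            if ref_env.step(action) != new_env.step(action):
                return False
    return True

print(rollout_equivalent(ReferenceEnv(), TranslatedEnv()))
print(rollout_equivalent(ReferenceEnv(), BuggyEnv()))
```

Passing such a test over random rollouts is evidence of equivalence, not a proof, which is presumably why the abstract pairs it with property and interaction tests.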

44. ❌ TopoBench: Benchmarking LLMs on Hard Topological Reasoning

Authors: Mayug Maniparambil, Nils Hoehing, Janak Kapuriya, Arjun Karuvally, Ellen Rushe, Anthony Ventresque, Noel O’Connor, Fergal Reid Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12133v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
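
The arithmetic behind the score line can be sketched as follows. The pass threshold (5.0 + 0.8 × total weight, with a total weight of 27.0) and the per-keyword cap of 15 come from the report's configuration; the per-keyword formula `weight * relevance` is an assumption made for illustration, since the report does not spell it out, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def passing_score(total_weight: float) -> float:
    """Pass threshold as stated in the report: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum the per-keyword scores.

    ASSUMPTION: each keyword contributes weight * relevance, capped at
    MAX_PER_KEYWORD. The report itself does not state this formula.
    """
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)


# Non-zero-relevance rows of the table above, as (weight, relevance) pairs.
rows = [(0.0, 10.0), (0.0, 10.0), (0.0, 8.0), (0.0, 5.0), (0.0, 5.0)]
threshold = passing_score(27.0)
score = total_score(rows)
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```

Under that assumption, a weight of 0.0 on every row yields a total of 0.0 regardless of relevance, which is consistent with the 0.0 / 26.6 shown for this paper.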

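Scoring rationale: The paper's core contribution is evaluating LLMs on topological reasoning tasks, so it is highly relevant to "Large Language Models" (10). It analyzes error patterns in CoT reasoning in depth (10) and addresses deep reasoning processes (8). It mentions tool-based constraint checking, which gives some relevance to "Tool Use" (5), and its error taxonomy provides a degree of interpretability analysis (5). Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the limits of large language models on hard topological grid puzzles and finds that even frontier models solve fewer than 25% of hard instances. The main bottleneck is extracting constraints from spatial representations rather than reasoning over them.

Abstract translation (Chinese)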

解决拓扑网格谜题需要对连通性、环路闭合及区域对称性等全局空间不变量进行推理，这对即使最强大的大语言模型（LLMs）而言仍具挑战性。为在受控环境下研究这些能力，我们提出了TopoBench——一个包含六个谜题族系、涵盖三个难度等级的基准测试。我们在TopoBench上评估了当前先进的推理大语言模型，发现即使前沿模型也只能解决不到四分之一的高难度实例，其中两个族系几乎无法被破解。为探究这些失败源于推理局限还是源于提取与维持空间约束的困难，我们基于错误分类法标注了750条思维链轨迹，归纳出四种可能的因果失效模式，随后通过模拟各类错误的针对性干预实验进行验证。这些干预表明，诸如过早决策和约束遗忘等错误模式会直接影响谜题求解能力，而重复推理则是搜索过程中产生的良性效应。最后，我们研究了包括提示引导、单元格对齐的网格表示以及基于工具的约束检查在内的缓解策略，发现瓶颈在于从空间表征中提取约束，而非对约束进行推理。代码与数据发布于github.com/mayug/topobench-benchmark。

Abstract

Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry and remains challenging for even the most powerful large language models (LLMs). To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels. We evaluate strong reasoning LLMs on TopoBench and find that even frontier models solve fewer than one quarter of hard instances, with two families nearly unsolved. To investigate whether these failures stem from reasoning limitations or from difficulty extracting and maintaining spatial constraints, we annotate 750 chain of thought traces with an error taxonomy that surfaces four candidate causal failure modes, then test them with targeted interventions simulating each error type. These interventions show that certain error patterns like premature commitment and constraint forgetting have a direct impact on the ability to solve the puzzle, while repeated reasoning is a benign effect of search. Finally we study mitigation strategies including prompt guidance, cell-aligned grid representations and tool-based constraint checking, finding that the bottleneck lies in extracting constraints from spatial representations and not in reasoning over them. Code and data are available at github.com/mayug/topobench-benchmark.
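
Tool-based constraint checking, one of the mitigation strategies named above, can be as simple as a deterministic function that verifies a global invariant on a candidate grid. The sketch below checks connectivity with a breadth-first search; the 0/1 grid encoding, 4-neighbour adjacency, and the name `is_connected` are assumptions for illustration and are not taken from the TopoBench code.

```python
from collections import deque
from typing import List


def is_connected(grid: List[List[int]]) -> bool:
    """Return True if all cells equal to 1 form a single 4-connected region."""
    cells = [(r, c) for r, row in enumerate(grid) for c, v in enumerate(row) if v == 1]
    if not cells:
        return True  # no filled cells: the constraint holds vacuously

    seen = {cells[0]}
    queue = deque([cells[0]])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if in_bounds and grid[nr][nc] == 1 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    # Connected exactly when the search reached every filled cell.
    return len(seen) == len(cells)


snake = [[1, 1, 0],
         [0, 1, 0],
         [0, 1, 1]]
corners = [[1, 0, 1],
           [0, 0, 0],
           [1, 0, 1]]
print(is_connected(snake), is_connected(corners))
```

A checker like this takes the global invariant off the model's hands, but the model still has to read the grid correctly to call it, which matches the abstract's point that extraction is the bottleneck.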

Keywords: Topological Reasoning, LLM Benchmarking, Chain of Thought, Spatial Constraints, Error Analysis, Constraint Extraction, Reasoning Limitations, Puzzle Solving

45. ❌ Increasing intelligence in AI agents can worsen collective outcomes

Authors: Neil F. Johnson Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12129v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

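Scoring rationale: The paper studies the collective behavior of AI agents (LLM agents) in resource-scarce environments. Its core concerns are LLM-driven agents (LLM Agents/Autonomous Agents) and coordination and competition in multi-agent systems (Multi-agent Systems/Agent Coordination). The abstract explicitly mentions "AI agents" and "LLM diversity" and studies their collective dynamics, so the Large Language Models, LLM Agents, and Multi-agent Systems keywords are all highly relevant (10 each). Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how populations of AI agents driven by different LLMs coordinate or compete when resources are scarce. It finds that, under scarcity, agent diversity and reinforcement learning increase the risk of system overload while tribe formation mitigates it, and that whether sophistication helps or harms depends on the capacity-to-population ratio.

Abstract translation (Chinese)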

当资源稀缺时，人工智能代理群体会协调共生，还是陷入部落式的混乱？来自不同开发者的多样化决策型人工智能正进入日常设备——从手机、医疗设备到战场无人机和汽车——这些人工智能代理通常需要竞争有限的共享资源，例如充电槽位、中继带宽和交通优先级。然而，人们对它们的集体动态及其对用户和社会的风险仍知之甚少。本研究将人工智能代理群体作为首个由真实智能体构成的系统进行探究，其中支配集体行为的四个关键变量可被独立开关：天性（内在的大语言模型多样性）、培育（个体强化学习）、文化（涌现的部落形成）以及资源稀缺性。我们通过实证与数学分析表明，当资源稀缺时，人工智能模型的多样性与强化学习会增加危险的系统过载风险，尽管部落形成会减轻这种风险。与此同时，部分个体却能从中获得巨大收益。当资源充足时，相同的因素会将系统过载驱动至接近零的水平，尽管部落形成会轻微加剧过载程度。其转变点遵循算术规律：即自发形成的对立部落首次能够容纳在可用容量之内。更复杂的人工智能代理群体未必表现更优：其复杂化究竟有益还是有害，完全取决于一个单一数值——容量与人口比率——这一数值在任何人工智能代理投入使用前即可预知。

Abstract

When resources are scarce, will a population of AI agents coordinate in harmony, or descend into tribal chaos? Diverse decision-making AI from different developers is entering everyday devices – from phones and medical devices to battlefield drones and cars – and these AI agents typically compete for finite shared resources such as charging slots, relay bandwidth, and traffic priority. Yet their collective dynamics and hence risks to users and society are poorly understood. Here we study AI-agent populations as the first system of real agents in which four key variables governing collective behaviour can be independently toggled: nature (innate LLM diversity), nurture (individual reinforcement learning), culture (emergent tribe formation), and resource scarcity. We show empirically and mathematically that when resources are scarce, AI model diversity and reinforcement learning increase dangerous system overload, though tribe formation lessens this risk. Meanwhile, some individuals profit handsomely. When resources are abundant, the same ingredients drive overload to near zero, though tribe formation makes the overload slightly worse. The crossover is arithmetical: it is where opposing tribes that form spontaneously first fit inside the available capacity. More sophisticated AI-agent populations are not better: whether their sophistication helps or harms depends entirely on a single number – the capacity-to-population ratio – that is knowable before any AI-agent ships.

Keywords: AI agents, LLM diversity, collective dynamics, resource scarcity, tribe formation, reinforcement learning, system overload, multi-agent systems

46. ❌ CRAFT: A Tendon-Driven Hand with Hybrid Hard-Soft Compliance

Authors: Leo Lin, Shivansh Patel, Jay Moon, Svetlana Lazebnik, Unnat Jain Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12120v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

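Scoring rationale: The paper "CRAFT: A Tendon-Driven Hand with Hybrid Hard-Soft Compliance" focuses on robot hardware design, specifically a tendon-driven anthropomorphic hand for contact-rich manipulation. It covers mechanical design, material choice (hybrid hard-soft), structural testing, and teleoperation, none of which relates to the scoring keywords, all of which concern large models, deep learning, AI technical principles, or AI for science. The paper does not mention any AI model, algorithm, training method, or AI application, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how to design a tendon-driven anthropomorphic hand with hybrid hard-soft compliance (the CRAFT hand) to improve contact-rich manipulation. The design improves strength and endurance while maintaining repeatability, and improves handling of fragile and low-friction items in teleoperation.

Abstract translation (Chinese)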

我们推出CRAFT手，一种采用混合刚柔顺应性的肌腱驱动拟人机械手，专为密集接触式操作设计。其核心理念基于一个简单观察：手部接触并非均匀分布。冲击力集中于关节处，而连杆承担主要负载。CRAFT在关节处采用柔性材料并保持连杆刚性，同时利用滚动接触关节面确保屈伸运动轨迹的可重复性。十五个安装在手指上的电机通过肌腱驱动机械手，实现了紧凑的外形结构和轻量化手指。在结构测试中，CRAFT在保持相当可重复性的同时显著提升了强度与耐久性。在遥操作测试中，该机械手增强了对易碎及低摩擦物体的操控能力，并完整覆盖Feix分类体系中的33/33种抓握类型。整套设计方案成本低于600美元，我们将开源发布完整设计，并配套提供基于视觉的遥操作系统及仿真集成方案。项目主页：http://craft-hand.github.io/

Abstract

We introduce CRAFT hand, a tendon-driven anthropomorphic hand with hybrid hard-soft compliance for contact-rich manipulation. The design is based on a simple idea: contact is not uniform across the hand. Impacts concentrate at joints, while links carry most of the load. CRAFT places soft material at joints and keeps links rigid, and uses rolling-contact joint surfaces to keep flexion on repeatable motion paths. Fifteen motors mounted on the fingers drive the hand through tendons, keeping the form factor compact and the fingers light. In structural tests, CRAFT improves strength and endurance while maintaining comparable repeatability. In teleoperation, CRAFT improves handling of fragile and low-friction items, and the hand covers 33/33 grasps in the Feix taxonomy. The full design costs under $600 and will be released open-source with vision-based teleoperation and simulation integration. Project page: http://craft-hand.github.io/

Keywords: tendon-driven hand, hybrid hard-soft compliance, anthropomorphic hand, contact-rich manipulation, rolling-contact joint, teleoperation, Feix taxonomy, open-source design

47. ❌ SommBench: Assessing Sommelier Expertise of Language Models

Authors: William Brach, Tomas Bedej, Jacob Nielsen, Jacob Pichna, Juraj Bedej, Eemeli Saarensilta, Julie Dupouy, Gianluca Barmina, Andrea Blasi Núñez, Peter Schneider-Kamp, Kristian Košťál, Michal Ries, Lukas Galke Poech Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12117v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

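Scoring rationale: The paper's core is evaluating the multilingual and cultural capabilities of large language models (LLMs) in the sommelier domain, so it is highly relevant to "Large Language Models" (10). It involves AI evaluation of sensory judgment (smell and taste), an application of AI to a specialized domain, which gives some relevance to "AI for Science" (5), although this is not bioinformatics or cheminformatics in the strict sense. Other keywords (MoE, SFT, RAG, and so on) are neither mentioned nor implied in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes SommBench, a multilingual benchmark for evaluating large language models on sommelier expertise (wine theory, feature completion, and food-wine pairing). Models perform well on theory question answering (up to 97% correct), but feature completion (peaking at 65%) and food-wine pairing (MCC between 0 and 0.39) are more challenging.

Abstract translation (Chinese)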

随着大语言模型的快速发展，系统评估其多语言与跨文化能力变得日益重要。现有的文化评估基准主要关注能以语言形式编码的基础文化知识。本文提出SommBench，一个用于评估侍酒师专业能力的多语言基准测试，该领域深度依赖于嗅觉与味觉感官体验。虽然语言模型仅通过文本描述来学习感官属性，但SommBench旨在检验这种文本基础是否足以模拟专家级的感官判断。SommBench包含三大任务：葡萄酒理论问答（WTQA）、葡萄酒特征补全（WFC）以及餐酒搭配（FWP）。该基准支持多种语言版本：英语、斯洛伐克语、瑞典语、芬兰语、德语、丹麦语、意大利语和西班牙语，这有助于区分语言模型的葡萄酒专业知识与其语言能力。基准数据集由专业侍酒师及各语言母语者紧密协作开发，最终包含1,024道葡萄酒理论问答、1,000个葡萄酒特征补全样本和1,000个餐酒搭配样本。我们提供了包括闭源模型（如Gemini 2.5）和开源模型（如GPT-OSS与Qwen 3）在内的主流语言模型的测试结果。结果表明，性能最强的模型在葡萄酒理论问答任务上表现良好（闭源模型正确率最高达97%），但特征补全（最高65%）和餐酒搭配任务（马修斯相关系数MCC介于0至0.39之间）则更具挑战性。这些结果使SommBench成为评估语言模型侍酒师专业能力的一个兼具趣味性与挑战性的基准测试。该基准已公开于https://github.com/sommify/sommbench。

Abstract

With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert-level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). SommBench is available in multiple languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model’s wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 wine theory question-answering questions, 1,000 wine feature-completion examples, and 1,000 food-wine pairing examples. We provide results for the most popular language models, including closed-weights models such as Gemini 2.5, and open-weights models, such as GPT-OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed-weights model), yet feature completion (peaking at 65%) and food-wine pairing (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at https://github.com/sommify/sommbench.
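
The food-wine pairing results are reported as the Matthews correlation coefficient (MCC), which ranges from -1 to 1, with 0 meaning no better than chance. The sketch below computes MCC from binary confusion-matrix counts using the standard formula; the function name and the example counts are made up for illustration and are not SommBench data.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when any marginal count is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a "does this wine pair with this dish?" classifier.
print(round(matthews_corrcoef(tp=30, tn=40, fp=10, fn=20), 3))
```

Unlike plain accuracy, MCC stays near 0 for a classifier that always predicts the majority class, which makes it a reasonable choice when pairing labels are imbalanced.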

Keywords: large language models, multilingual evaluation, cultural capabilities, sommelier expertise, sensory judgment, benchmark, wine theory, food-wine pairing

48. ❌ Taming the Adversary: Stable Minimax Deep Deterministic Policy Gradient via Fractional Objectives

Authors: Taeho Lee, Donghwan Lee Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12110v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

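Scoring rationale: The paper focuses on robust control in reinforcement learning (RL). It proposes a minimax-optimization-based deep deterministic policy gradient method (MMDDPG) for learning disturbance-resilient policies in continuous control tasks. All scoring keywords relate directly to large language models (LLMs), innovations in deep learning principles, or AI for science, whereas the core of this paper is an improvement to a conventional RL algorithm (DDPG); it involves no large models, no innovation in deep learning principles, and no AI application in a specific scientific field such as bioinformatics. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the unstable performance of RL agents in environments with external disturbances and model uncertainties, the paper proposes MMDDPG, a minimax deep deterministic policy gradient framework with a fractional objective, and shows in MuJoCo environments that it significantly improves policy robustness.

Abstract translation (Chinese)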

强化学习（RL）在广泛的控制与决策任务中取得了显著成功。然而，当部署于存在意外外部干扰和模型不确定性的环境中时，RL智能体常表现出不稳定或性能下降的问题。因此，确保在此类条件下的可靠性能仍是一个关键挑战。本文提出极小极大深度确定性策略梯度（Minimax Deep Deterministic Policy Gradient, MMDDPG），这是一个用于在连续控制任务中学习抗干扰策略的框架。训练过程被构建为用户策略与对抗性干扰策略之间的极小极大优化问题。在该问题中，用户学习一个最小化目标函数的鲁棒策略，而对抗方则生成最大化该目标函数的干扰。为稳定这一交互过程，我们引入了一个平衡任务性能与干扰强度的分数目标函数。该目标函数防止了过度激进的干扰，并促进了鲁棒学习。在MuJoCo环境中的实验评估表明，所提出的MMDDPG在面对外力扰动和模型参数变化时，均实现了显著提升的鲁棒性。

Abstract

Reinforcement learning (RL) has achieved remarkable success in a wide range of control and decision-making tasks. However, RL agents often exhibit unstable or degraded performance when deployed in environments subject to unexpected external disturbances and model uncertainties. Consequently, ensuring reliable performance under such conditions remains a critical challenge. In this paper, we propose minimax deep deterministic policy gradient (MMDDPG), a framework for learning disturbance-resilient policies in continuous control tasks. The training process is formulated as a minimax optimization problem between a user policy and an adversarial disturbance policy. In this problem, the user learns a robust policy that minimizes the objective function, while the adversary generates disturbances that maximize it. To stabilize this interaction, we introduce a fractional objective that balances task performance and disturbance magnitude. This objective prevents excessively aggressive disturbances and promotes robust learning. Experimental evaluations in MuJoCo environments demonstrate that the proposed MMDDPG achieves significantly improved robustness against both external force perturbations and model parameter variations.

Keywords: Reinforcement Learning, Robust Control, Minimax Optimization, Deep Deterministic Policy Gradient, Adversarial Disturbance, Fractional Objective, Continuous Control, MuJoCo

49. ❌ A Robust and Efficient Multi-Agent Reinforcement Learning Framework for Traffic Signal Control

Authors: Sheng-You Huang, Hsiao-Chuan Chang, Yen-Chi Chen, Ting-Han Wei, I-Hau Yeh, Sheng-Yao Kuan, Chien-Yao Wang, Hsuan-Han Lee, I-Chen Wu Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12096v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

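Scoring rationale: The paper focuses on a multi-agent reinforcement learning framework for traffic signal control and is highly relevant only to "Multi-agent Systems OR Agent Coordination" (10), because it explicitly proposes a multi-agent RL framework with agent coordination mechanisms. The other keywords are unrelated to large models, deep learning principles, or AI for science as covered by this paper, and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a robust multi-agent reinforcement learning framework for traffic signal control that combines turning ratio randomization, exponential phase duration adjustment, and neighbor-based observation. Validated in the Vissim simulator, it outperforms baseline methods, reducing average waiting time by over 10% and improving generalization.

Abstract translation (Chinese)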

交通信号控制中的强化学习因对动态交通流变化的泛化能力有限，在实际部署中面临显著障碍。现有方法常过度拟合静态模式，且采用的动作空间与驾驶员预期不相容。本文提出一个在Vissim交通仿真器中验证的鲁棒多智能体强化学习框架。该框架整合了三种机制：（1）转向比随机化——一种通过将智能体暴露于动态转向概率中以增强对未见场景鲁棒性的训练策略；（2）面向稳定性的指数相位时长调整动作空间，通过循环的指数相位调整平衡响应性与精确度；（3）基于邻居的观测方案，该方案采用带集中式训练与分散式执行的MAPPO算法。通过利用集中式更新，本方法在保持可扩展局部通信的同时，逼近了全局观测的效能。实验结果表明，该框架优于标准强化学习基线，平均等待时间降低超过10%。所提模型在未见交通场景中展现出卓越的泛化能力，并保持高控制稳定性，为自适应信号控制提供了实用解决方案。

Abstract

Reinforcement Learning (RL) in Traffic Signal Control (TSC) faces significant hurdles in real-world deployment due to limited generalization to dynamic traffic flow variations. Existing approaches often overfit static patterns and use action spaces incompatible with driver expectations. This paper proposes a robust Multi-Agent Reinforcement Learning (MARL) framework validated in the Vissim traffic simulator. The framework integrates three mechanisms: (1) Turning Ratio Randomization, a training strategy that exposes agents to dynamic turning probabilities to enhance robustness against unseen scenarios; (2) a stability-oriented Exponential Phase Duration Adjustment action space, which balances responsiveness and precision through cyclical, exponential phase adjustments; and (3) a Neighbor-Based Observation scheme utilizing the MAPPO algorithm with Centralized Training with Decentralized Execution (CTDE). By leveraging centralized updates, this approach approximates the efficacy of global observations while maintaining scalable local communication. Experimental results demonstrate that our framework outperforms standard RL baselines, reducing average waiting time by over 10%. The proposed model exhibits superior generalization in unseen traffic scenarios and maintains high control stability, offering a practical solution for adaptive signal control.
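
Turning Ratio Randomization, as described above, exposes agents to varying turning probabilities during training. One way to picture it is to resample the left/through/right split of an approach at the start of each episode. The sketch below draws such a split from a Dirichlet distribution built from gamma variates; the distribution, the `concentration` parameter, and the function name are assumptions for illustration, since the abstract does not say how the ratios are sampled.

```python
import random
from typing import Dict, Sequence


def sample_turning_ratios(
    rng: random.Random,
    movements: Sequence[str] = ("left", "through", "right"),
    concentration: float = 1.0,
) -> Dict[str, float]:
    """Sample turning probabilities that sum to 1 (Dirichlet via gamma draws).

    ASSUMPTION: a symmetric Dirichlet is used here purely for illustration.
    """
    draws = [rng.gammavariate(concentration, 1.0) for _ in movements]
    total = sum(draws)
    return {movement: draw / total for movement, draw in zip(movements, draws)}


# One fresh set of ratios per training episode (hypothetical training loop).
rng = random.Random(0)
for episode in range(3):
    ratios = sample_turning_ratios(rng)
    print(episode, {name: round(p, 2) for name, p in ratios.items()})
```

A larger `concentration` keeps the sampled ratios closer to an even split, while values below 1 produce more extreme splits, so the parameter controls how far training scenarios stray from a typical pattern.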

Keywords: Multi-Agent Reinforcement Learning, Traffic Signal Control, Robustness, Generalization, MAPPO, CTDE, Exponential Phase Duration Adjustment, Turning Ratio Randomization

50. ❌ Human-Centred LLM Privacy Audits: Findings and Frictions

Authors: Dimitri Staufer, Kirsten Morehouse, David Hartmann, Bettina Berendt Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12094v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

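Scoring rationale: The paper's core is LLM privacy auditing, which directly involves the application of LLMs, so "Large Language Models" has a relevance of 10. It discusses the factuality and interpretability of model outputs (evaluating privacy associations), which gives some relevance to "Hallucination Mitigation" and "Mechanistic Interpretability" (5 each). Other keywords (MoE, SFT, RAG, and so on) are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how a human-centred audit tool (LMP2) can evaluate the associations LLMs make with personal names. It finds that GPT-4o predicts 11 of 50 features of everyday people with at least 60% accuracy, and it exposes how hard generative AI evaluation is to standardize when outputs are probabilistic and context-dependent.

Abstract translation (Chinese)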

大型语言模型（LLM）从海量训练语料库和用户交互中学习统计关联，已部署的系统可能呈现或推断出与个人相关的信息。然而，人们缺乏实际方法来检查模型将其姓名与何种信息相关联。本文报告了一项持续研究的阶段性发现，并介绍了LMP2——一款基于浏览器的自我审计工具。在两项用户研究（总样本量 $N_{\text{total}} = 458$）中，GPT-4o对普通人的50项特征预测中有11项达到了 $\ge 60\%$ 的准确率；参与者表示希望控制LLM生成的关联信息，尽管并非所有输出都被视为隐私侵犯。为验证我们的探测方法，我们在公众人物和虚构姓名上评估了八个LLM，观察到稳定的姓名条件关联与模型默认输出之间存在明显区分。我们的研究结果也有助于揭示更广泛的生成式AI评估危机：当输出具有概率性、依赖于上下文且通过用户引导进行调节时，模型与个体的关联究竟包含哪些内容本身定义不清，其操作化依赖于难以验证或比较的探测方法和度量指标。为推进可靠、可操作、以人为本的LLM隐私审计，我们总结了研究中出现的九类障碍，并对未来工作及以人为本的LLM隐私审计设计提出了建议。

Abstract

Large language models (LLMs) learn statistical associations from massive training corpora and user interactions, and deployed systems can surface or infer information about individuals. Yet people lack practical ways to inspect what a model associates with their name. We report interim findings from an ongoing study and introduce LMP2, a browser-based self-audit tool. In two user studies ($N_{\text{total}} = 458$), GPT-4o predicts 11 of 50 features for everyday people with $\ge 60\%$ accuracy, and participants report wanting control over LLM-generated associations despite not considering all outputs privacy violations. To validate our probing method, we evaluate eight LLMs on public figures and non-existent names, observing clear separation between stable name-conditioned associations and model defaults. Our findings also contribute to exposing a broader generative AI evaluation crisis: when outputs are probabilistic, context-dependent, and user-mediated through elicitation, what model–individual associations even include is under-specified and operationalisation relies on crafting probes and metrics that are hard to validate or compare. To move towards reliable, actionable human-centred LLM privacy audits, we identify nine frictions that emerged in our study and offer recommendations for future work and the design of human-centred LLM privacy audits.

Keywords: LLM privacy audits, human-centred, self-audit tool, privacy associations, generative AI evaluation, probing method, model-individual associations, GPT-4o

51. ❌ Resource-Efficient Iterative LLM-Based NAS with Feedback Memory

Authors: Xiaojie Gu, Dmitry Ignatov, Radu Timofte Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12091v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

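Scoring rationale: The paper's core is using LLMs (specifically instruction-tuned models with ≤7B parameters) for neural architecture search (NAS), iteratively generating, evaluating, and refining CNN architectures. The highly relevant keywords are LLMs (the core tool), Instruction Tuning (the type of model used), Self-Correction (iterative improvement through feedback memory), and LLM Agents (the dual-LLM specialization). SLMs are moderately relevant: the paper uses ≤7B-parameter models and targets edge deployment, but does not explicitly emphasize small models. Other keywords such as MoE, Scaling Laws, and RAG are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes a resource-efficient, LLM-based neural architecture search method that iteratively generates and refines convolutional neural network architectures on a single consumer-grade GPU. For example, with DeepSeek-Coder-6.7B the one-epoch proxy accuracy on CIFAR-10 rises from 28.2% to 69.2%.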

Abstract

Neural Architecture Search (NAS) automates network design, but conventional methods demand substantial computational resources. We propose a closed-loop pipeline leveraging large language models (LLMs) to iteratively generate, evaluate, and refine convolutional neural network architectures for image classification on a single consumer-grade GPU without LLM fine-tuning. Central to our approach is a historical feedback memory inspired by Markov chains: a sliding window of $K{=}5$ recent improvement attempts keeps context size constant while providing sufficient signal for iterative learning. Unlike prior LLM optimizers that discard failure trajectories, each history entry is a structured diagnostic triple – recording the identified problem, suggested modification, and resulting outcome – treating code execution failures as first-class learning signals. A dual-LLM specialization reduces per-call cognitive load: a Code Generator produces executable PyTorch architectures while a Prompt Improver handles diagnostic reasoning. Since both the LLM and architecture training share limited VRAM, the search implicitly favors compact, hardware-efficient models suited to edge deployment. We evaluate three frozen instruction-tuned LLMs (${\leq}7$B parameters) across up to 2000 iterations in an unconstrained open code space, using one-epoch proxy accuracy on CIFAR-10, CIFAR-100, and ImageNette as a fast ranking signal. On CIFAR-10, DeepSeek-Coder-6.7B improves from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%. A full 2000-iteration search completes in ${\approx}18$ GPU hours on a single RTX 4090, establishing a low-budget, reproducible, and hardware-aware paradigm for LLM-driven NAS without cloud infrastructure.

Keywords: Large Language Models, Neural Architecture Search, Iterative Learning, Feedback Memory, Edge Deployment, Instruction-tuned LLMs, Resource-efficient, Hardware-aware
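
The historical feedback memory described in the abstract can be pictured as a bounded queue of diagnostic triples. The Python sketch below is illustrative only: the names `DiagnosticTriple`, `FeedbackMemory`, `record`, and `render`, and the text layout of the rendered history, are assumptions made for this example and are not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticTriple:
    """One history entry: what was wrong, what was changed, what happened."""
    problem: str
    modification: str
    outcome: str  # an accuracy string or an error message from a failed run


class FeedbackMemory:
    """Sliding window over the K most recent improvement attempts (hypothetical)."""

    def __init__(self, k=5):
        # deque(maxlen=k) evicts the oldest entry once full,
        # so the text added to each prompt stays bounded in size.
        self._window = deque(maxlen=k)

    def record(self, problem, modification, outcome):
        self._window.append(DiagnosticTriple(problem, modification, outcome))

    def render(self):
        """Format the window as text that could be prepended to the next prompt."""
        return "\n".join(
            f"[{i}] problem: {t.problem} | change: {t.modification} | result: {t.outcome}"
            for i, t in enumerate(self._window, start=1)
        )

    def __len__(self):
        return len(self._window)


memory = FeedbackMemory(k=5)
for step in range(7):
    # Failed executions are recorded as well, not discarded.
    outcome = "RuntimeError: shape mismatch" if step == 3 else f"acc=0.{30 + step}"
    memory.record(f"issue at step {step}", f"edit {step}", outcome)

print(memory.render())
```

After seven recorded attempts only the five most recent remain, including the failed run, which mirrors the idea of keeping failures as a learning signal while holding the context size constant.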

52. ❌ A Multi-Label Temporal Convolutional Framework for Transcription Factor Binding Characterization

Authors: Pietro Demurtas, Ferdinando Zanchetta, Giovanni Perini, Rita Fioresi Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12073v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

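Scoring rationale: The paper focuses on multi-label classification of transcription factor binding sites with deep learning (Temporal Convolutional Networks), an AI application in bioinformatics. It does not involve any large language model (LLM) techniques, nor does it discuss LLM principles, training methods, inference optimization, alignment, or agent systems. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper is a deep learning application in bioinformatics, but its core contribution is not in large-model techniques, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a multi-label classification framework based on temporal convolutional networks that predicts the binding sites of multiple transcription factors on DNA sequences and reveals interactions and cooperative regulatory patterns among TFs.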

Abstract

Transcription factors (TFs) regulate gene expression through complex and co-operative mechanisms. While many TFs act together, the logic underlying TFs binding and their interactions is not fully understood yet. Most current approaches for TF binding site prediction focus on individual TFs and binary classification tasks, without a full analysis of the possible interactions among various TFs. In this paper we investigate DNA TF binding site recognition as a multi-label classification problem, achieving reliable predictions for multiple TFs on DNA sequences retrieved in public repositories. Our deep learning models are based on Temporal Convolutional Networks (TCNs), which are able to predict multiple TF binding profiles, capturing correlations among TFs and their cooperative regulatory mechanisms. Our results suggest that multi-label learning leading to reliable predictive performances can reveal biologically meaningful motifs and co-binding patterns consistent with known TF interactions, while also suggesting novel relationships and cooperation among TFs.

Keywords: Transcription Factor Binding, Multi-label Classification, Temporal Convolutional Networks, Deep Learning, Bioinformatics, DNA Sequence Analysis, TF Interactions, Cooperative Regulatory Mechanisms
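
The multi-label framing in the abstract differs from per-TF binary classification mainly in how inputs and targets are encoded. The Python sketch below shows generic building blocks only: one-hot DNA encoding, a single dilated causal convolution of the kind used in TCNs, a multi-hot target vector, and independent per-label sigmoid decisions. All function names and the TF names in the example are hypothetical, and nothing here reproduces the authors' architecture.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-dimensional one-hot rows (unknown bases -> zeros)."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]


def dilated_causal_conv(x, kernel, dilation):
    """1D dilated causal convolution with implicit zero left-padding.

    Output position t only sees x[t], x[t - dilation], x[t - 2 * dilation], ...
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def multi_hot(bound_tfs, all_tfs):
    """Multi-label target: one binary entry per TF, several may be 1 at once."""
    return [1 if tf in bound_tfs else 0 for tf in all_tfs]


def predict_labels(logits, threshold=0.5):
    """Independent sigmoid per label, so labels are not mutually exclusive."""
    return [1 if 1.0 / (1.0 + math.exp(-z)) >= threshold else 0 for z in logits]


tf_names = ["TF_A", "TF_B", "TF_C"]  # placeholder names
target = multi_hot({"TF_A", "TF_C"}, tf_names)
features = dilated_causal_conv([1, 2, 3, 4], kernel=[1, 1], dilation=2)
print(one_hot("AC"), target, features, predict_labels([2.0, -1.0, 0.3]))
```

Because each label has its own sigmoid, co-binding of several TFs on the same sequence can be predicted jointly, which is what allows correlations among TFs to be studied.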

53. ❌ Beyond Convolution: A Taxonomy of Structured Operators for Learning-Based Image Processing

Authors: Simone Cammarasana Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12067v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

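Scoring rationale: The paper focuses on classifying and comparing image-processing operators in computer vision, studying alternatives to the convolution operator: decomposition-based, adaptive weighted, basis-adaptive, integral/kernel, and attention-based operators. All scoring keywords concern large language models, deep learning technique principles, or AI-for-science applications, none of which the paper addresses, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper systematically classifies and compares alternatives to the convolution operator in convolutional neural networks, proposing five operator families and analyzing their structural properties, computational cost, and suitable tasks.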

Abstract

The convolution operator is the fundamental building block of modern convolutional neural networks (CNNs), owing to its simplicity, translational equivariance, and efficient implementation. However, its structure as a fixed, linear, locally-averaging operator limits its ability to capture structured signal properties such as low-rank decompositions, adaptive basis representations, and non-uniform spatial dependencies. This paper presents a systematic taxonomy of operators that extend or replace the standard convolution in learning-based image processing pipelines. We organise the landscape of alternative operators into five families: (i) decomposition-based operators, which separate structural and noise components through singular value or tensor decompositions; (ii) adaptive weighted operators, which modulate kernel contributions as a function of spatial position or signal content; (iii) basis-adaptive operators, which optimise the analysis bases together with the network weights; (iv) integral and kernel operators, which generalise the convolution to position-dependent and non-linear kernels; and (v) attention-based operators, which relax the locality assumption entirely. For each family, we provide a formal definition, a discussion of its structural properties with respect to the convolution, and a critical analysis of the tasks for which the operator is most appropriate. We further provide a comparative analysis of all families across relevant dimensions – linearity, locality, equivariance, computational cost, and suitability for image-to-image and image-to-label tasks – and outline the open challenges and future directions of this research area.

Keywords: convolution operator, structured operators, image processing, neural networks, attention-based operators, taxonomy, adaptive operators, computational analysis

54. ❌ LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments

Authors: Zhaoyang Jiang, Zhizhong Fu, David McAllister, Yunsoo Kim, Honghan Wu Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12071v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

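Scoring rationale: LoV3D proposes a pipeline for training 3D vision-language models (VLMs), focused on cognitive prognosis reasoning from longitudinal 3D brain MRI. Most keywords are unrelated because they concern pure large language model (LLM) techniques, optimization methods, or general AI techniques. Three keywords are relevant: (1) 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (10 points): the paper explicitly uses an automatic verifier to drive Direct Preference Optimization (DPO) without human annotation, which is its core training method. (2) 'Hallucination Mitigation OR Factuality OR Truthfulness' (10 points): the paper aims to reduce hallucination risk by enforcing label consistency, longitudinal coherence, and biological plausibility, which is one of its core goals. (3) 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points): the paper applies AI to neuroscience, specifically to assessing neurological diseases such as Alzheimer's disease, which falls under AI for Science. Other keywords such as 'Large Language Models' are not relevant because the paper uses vision-language models rather than text-only LLMs; reasoning keywords such as 'Chain of Thought' do not apply either, because the paper's reasoning is a structured sequence of pipeline steps rather than prompt-based reasoning.

!!! tip deepseek-chat TL;DR

LoV3D proposes a pipeline for training 3D vision-language models that performs region-level anatomical assessment and longitudinal comparison on brain MRI to diagnose cognitive status (Cognitively Normal, Mild Cognitive Impairment, or Dementia) while reducing hallucination risk. It reaches 93.7% three-class diagnostic accuracy on a held-out ADNI test set and transfers zero-shot to MIRIAD and AIBL.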

Abstract

Longitudinal brain MRI is essential for characterizing the progression of neurological diseases such as Alzheimer’s disease assessment. However, current deep-learning tools fragment this process: classifiers reduce a scan to a label, volumetric pipelines produce uninterpreted measurements, and vision-language models (VLMs) may generate fluent but potentially hallucinated conclusions. We present LoV3D, a pipeline for training 3D vision-language models, which reads longitudinal T1-weighted brain MRI, produces a region-level anatomical assessment, conducts longitudinal comparison with the prior scan, and finally outputs a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) along with a synthesized diagnostic summary. The stepped pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risks of hallucinations. The training process introduces a clinically-weighted Verifier that scores candidate outputs automatically against normative references derived from standardized volume metrics, driving Direct Preference Optimization without a single human annotation. On a subject-level held-out ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three-class diagnostic accuracy (+34.8% over the no-grounding baseline), 97.2% on two-class diagnosis accuracy (+4% over the SOTA) and 82.6% region-level anatomical classification accuracy (+33.1% over VLM baselines). Zero-shot transfer yields 95.4% on MIRIAD (100% Dementia recall) and 82.9% three-class accuracy on AIBL, confirming high generalizability across sites, scanners, and populations. Code is available at https://github.com/Anonymous-TEVC/LoV-3D.

Keywords: 3D vision-language models, longitudinal brain MRI, Alzheimer’s disease, diagnostic accuracy, hallucination mitigation, Direct Preference Optimization, cognitive prognosis, regional volume assessments
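
The abstract describes a Verifier that scores candidate outputs automatically and thereby drives Direct Preference Optimization without human labels. One plausible way to turn such scores into preference data is sketched below in Python. The scoring rule (a weighted fraction of regions whose label matches a reference), the region names, the weights, and the function names are all assumptions for illustration; the paper's actual Verifier is not specified in this report.

```python
from itertools import combinations


def verifier_score(candidate, reference, weights):
    """Hypothetical verifier: weighted share of regions labeled as in the reference."""
    total = sum(weights.values())
    matched = sum(
        w for region, w in weights.items()
        if candidate.get(region) == reference.get(region)
    )
    return matched / total


def build_preference_pairs(candidates, reference, weights, margin=0.0):
    """Rank candidates by verifier score and emit (chosen, rejected) pairs."""
    scored = [(verifier_score(c, reference, weights), c) for c in candidates]
    pairs = []
    for (score_a, a), (score_b, b) in combinations(scored, 2):
        if abs(score_a - score_b) > margin:  # skip ties and near-ties
            chosen, rejected = (a, b) if score_a > score_b else (b, a)
            pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


# Illustrative values only.
reference = {"hippocampus": "atrophy", "ventricles": "enlarged", "amygdala": "normal"}
weights = {"hippocampus": 2.0, "ventricles": 1.0, "amygdala": 1.0}
c1 = dict(reference)                           # score 1.00
c2 = {**reference, "hippocampus": "normal"}    # score 0.50
c3 = {**reference, "amygdala": "atrophy"}      # score 0.75
pairs = build_preference_pairs([c1, c2, c3], reference, weights)
print(len(pairs))
```

The resulting pairs have the shape a DPO trainer typically consumes (a preferred and a dispreferred completion for the same input); the `margin` parameter is a design choice in this sketch to avoid training on pairs the verifier cannot clearly separate.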

55. ❌ Chemical Reaction Networks Learn Better than Spiking Neural Networks

Authors: Sophie Jaffard, Ivo F. Sbalzarini Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12060v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

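Scoring rationale: The paper compares the learning capabilities of chemical reaction networks (CRNs) and spiking neural networks (SNNs). It falls under AI for Science and has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) because it involves chemical computing and biochemical networks. It does not involve large models, deep learning technique principles, or any other keyword (such as LLMs, MoE, training methods, or inference techniques), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

Through mathematical proof and a numerical experiment, the paper shows that chemical reaction networks without hidden layers can learn a classification task more accurately and efficiently than spiking neural networks with hidden layers, providing a theoretical basis for machine learning in chemical computers.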

Abstract

We mathematically prove that chemical reaction networks without hidden layers can solve tasks for which spiking neural networks require hidden layers. Our proof uses the deterministic mass-action kinetics formulation of chemical reaction networks. Specifically, we prove that a certain reaction network without hidden layers can learn a classification task previously proved to be achievable by a spiking neural network with hidden layers. We provide analytical regret bounds for the global behavior of the network and analyze its asymptotic behavior and Vapnik-Chervonenkis dimension. In a numerical experiment, we confirm the learning capacity of the proposed chemical reaction network for classifying handwritten digits in pixel images, and we show that it solves the task more accurately and efficiently than a spiking neural network with hidden layers. This provides a motivation for machine learning in chemical computers and a mathematical explanation for how biological cells might exhibit more efficient learning behavior within biochemical reaction networks than neuronal networks.

Keywords: chemical reaction networks, spiking neural networks, machine learning, classification task, mass-action kinetics, Vapnik-Chervonenkis dimension, handwritten digit classification, biological cells
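
For readers unfamiliar with the deterministic mass-action kinetics mentioned in the abstract: under mass action, the rate of a reaction is proportional to the product of its reactant concentrations. The Python sketch below integrates a single generic reaction A + B -> C with forward Euler. It is a textbook illustration, not the learning network from the paper; the function name, rate constant, and step size are assumptions.

```python
def simulate_mass_action(a0, b0, k, dt, steps):
    """Forward-Euler integration of A + B -> C under mass-action kinetics.

    d[C]/dt = k * [A] * [B], and d[A]/dt = d[B]/dt = -d[C]/dt.
    """
    a, b, c = a0, b0, 0.0
    for _ in range(steps):
        rate = k * a * b  # mass-action rate law
        a -= rate * dt
        b -= rate * dt
        c += rate * dt
    return a, b, c


a, b, c = simulate_mass_action(a0=1.0, b0=0.5, k=2.0, dt=0.01, steps=1000)
print(a, b, c)
```

Here B is the limiting reactant, so the concentration of C approaches the initial concentration of B while the totals A + C and B + C stay constant.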

56. ❌ XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Authors: Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. Fung Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12056v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

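Scoring rationale: The paper studies XSkill, a continual learning framework for multimodal agents that extracts knowledge from experiences (action-level) and skills (task-level) to improve tool use and task planning. The core keywords are 'LLM Agents/Autonomous Agents/Agentic Workflow' (10 points; agents are the paper's central subject) and 'Tool Use/Function Calling/API Tool Use' (10 points; the paper targets tool-use efficiency). Other relevant keywords: 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' (5 points; involves reasoning behavior), 'System 2 Thinking/Slow Thinking/In-depth Reasoning' (5 points; involves in-depth reasoning), 'Self-Correction/Self-Improvement/Self-Reflection' (5 points; self-improvement through continual learning), 'In-context Learning/Many-shot Learning' (5 points; involves zero-shot generalization), and 'Large Language Models/LLMs/Foundation Models' (5 points; the paper uses multimodal models as backbones but is not a study of pure LLM techniques). The remaining keywords have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address inefficient tool use and inflexible orchestration of multimodal agents in open-ended settings, the paper proposes XSkill, a dual-stream framework that continually learns experience and skill knowledge grounded in visual observations. Across five benchmarks and four backbone models it substantially outperforms baselines and shows superior zero-shot generalization.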

Abstract

Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.

Keywords: multimodal agents, continual learning, tool use, experience learning, skill learning, visual grounding, zero-shot generalization, reasoning behaviors

57. ❌ Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

Authors: Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12038v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

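Scoring rationale: The paper proposes a training-free decoding framework, Slow-Fast Inference, that exploits the stability of attention support during decoding to accelerate long-context autoregressive inference: slow steps (full attention) are triggered at semantic boundaries, and fast steps reuse a sparse memory. The work is highly relevant to LLMs, long-context LLMs, KV cache compression and attention optimization (through sparse memory reuse), and inference acceleration (the 'Speculative Decoding OR Inference Acceleration' keyword), since these are its core technical focus. It has some relevance to Chain of Thought reasoning and LLM Agents, because the paper evaluates long-CoT settings and mentions agentic workloads, but these are not central. Other keywords such as MoE, SFT, alignment, RAG, and quantization are not addressed.

!!! tip deepseek-chat TL;DR

To address the high cost of long-context autoregressive decoding, the paper proposes Slow-Fast Inference, a training-free inference acceleration framework that decouples generation into fast steps (reusing a sparse memory) and slow steps (refreshing the memory), achieving roughly 1.6× to 14.4× higher decoding throughput while maintaining quality.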

Abstract

Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$–$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.

Keywords: Inference Acceleration, Long-context Decoding, Autoregressive Models, Training-free Decoding, Attention Support Stability, Sparse Memory, Decoding Throughput, Slow-Fast Inference
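
The slow/fast split described in the abstract can be illustrated as a scheduling problem: which positions does each decoding step attend to? The Python sketch below is a toy model under stated assumptions. Sentence-final punctuation stands in for the "semantic boundary" trigger, a top-k pick over caller-supplied scores stands in for the Selector, and fast steps attend to the selected memory plus the tokens produced since the last slow step. None of these choices is confirmed by the paper, and all names are hypothetical.

```python
import heapq

SENTENCE_END = {".", "!", "?"}  # assumed boundary trigger


def select_support(scores, k):
    """Stand-in for the Selector: indices of the k highest-scoring positions."""
    return sorted(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))


def plan_decoding(tokens, score_fn, k):
    """Return, for each step, ("slow" | "fast", positions attended to)."""
    plan, memory, last_slow = [], [], 0
    for t in range(len(tokens)):
        if t == 0 or tokens[t - 1] in SENTENCE_END:
            # Slow step: dense attention over the full history, then refresh memory.
            history = list(range(t))
            memory = select_support(score_fn(t, history), k)
            last_slow = t
            plan.append(("slow", history))
        else:
            # Fast step: reuse the sparse memory plus tokens since the refresh.
            recent = range(last_slow, t)
            plan.append(("fast", sorted(set(memory) | set(recent))))
    return plan


def recency_scores(t, history):
    """Toy attention scores that favor recent positions."""
    return [1.0 / (t - i) for i in history]


tokens = ["A", "b", ".", "C", "d", "e", ".", "F"]
plan = plan_decoding(tokens, recency_scores, k=2)
for kind, attended in plan:
    print(kind, attended)
```

In this toy run the fast steps inside the second sentence attend to at most `k` remembered positions plus the handful of new tokens, rather than the whole history, which is where the throughput gain would come from as the context grows.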

58. ❌ Just Use XML: Revisiting Joint Translation and Label Projection

Authors: Thennal D K, Chris Biemann, Hans Ole Hatzel Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12021v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

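Scoring rationale: The paper studies a joint framework for cross-lingual label projection and machine translation that uses XML tags to carry annotations across languages. Although it involves natural language processing and machine learning, it focuses on a specific label projection method, translation quality evaluation, and cross-lingual transfer tasks; it does not address large models, innovations in deep learning principles, or AI applications in science. All keywords concern large-model techniques, deep learning innovation, or AI for science, and are unrelated to the paper's topic of label projection and machine translation.

!!! tip deepseek-chat TL;DR

The paper proposes LabelPigeon, a framework that performs translation and label projection jointly via XML tags. It addresses the translation-quality degradation reported for earlier joint approaches, improves translation quality in 11 languages, and achieves substantial cross-lingual transfer gains on downstream tasks across 27 languages.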

Abstract

Label projection is an effective technique for cross-lingual transfer, extending span-annotated datasets from a high-resource language to low-resource ones. Most approaches perform label projection as a separate step after machine translation, and prior work that combines the two reports degraded translation quality. We re-evaluate this claim with LabelPigeon, a novel framework that jointly performs translation and label projection via XML tags. We design a direct evaluation scheme for label projection, and find that LabelPigeon outperforms baselines and actively improves translation quality in 11 languages. We further assess translation quality across 203 languages and varying annotation complexity, finding consistent improvement attributed to additional fine-tuning. Finally, across 27 languages and three downstream tasks, we report substantial gains in cross-lingual transfer over comparable work, up to +39.9 F1 on NER. Overall, our results demonstrate that XML-tagged label projection provides effective and efficient label transfer without compromising translation quality.

Keywords: label projection, machine translation, cross-lingual transfer, XML tags, fine-tuning, NER, span annotation, translation quality
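
The tag-and-recover idea behind XML-based label projection can be sketched in a few lines. The tag scheme (one tag per span, named after its label) and the helper names `tag_spans` and `extract_spans` are assumptions for illustration, not LabelPigeon's API. The translation step is not shown: the `translated` string is a hand-written stand-in for a model output, and nested or overlapping spans are not handled.

```python
import re


def tag_spans(text, spans):
    """Wrap each (start, end, label) character span in an XML tag."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


TAG = re.compile(r"<(\w+)>(.*?)</\1>")


def extract_spans(tagged):
    """Strip the tags and return (plain_text, [(start, end, label), ...])."""
    pieces, spans, cursor, offset = [], [], 0, 0
    for match in TAG.finditer(tagged):
        before = tagged[cursor:match.start()]
        label, inner = match.group(1), match.group(2)
        offset += len(before)                      # plain text before the tag
        spans.append((offset, offset + len(inner), label))
        offset += len(inner)
        pieces += [before, inner]
        cursor = match.end()
    pieces.append(tagged[cursor:])
    return "".join(pieces), spans


source = "Angela Merkel visited Paris."
tagged = tag_spans(source, [(0, 13, "PER"), (22, 27, "LOC")])
# Hand-written stand-in for what a translation model might return:
translated = "<PER>Angela Merkel</PER> besuchte <LOC>Paris</LOC>."
print(tagged)
print(extract_spans(translated))
```

Because the tags travel through translation with the text, the projected span offsets fall out of a single parse of the output, with no separate alignment step.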

59. ❌ Sim-to-reality adaptation for Deep Reinforcement Learning applied to an underwater docking application

Authors: Alaaeddine Chaarani, Narcis Palomeras, Pere Ridao
Source: arXiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.12020v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

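Scoring rationale: The paper applies deep reinforcement learning (DRL) to autonomous docking of an underwater robot, with a focus on sim-to-real adaptation. Its core techniques are DRL (the PPO algorithm), a digital twin, simulator acceleration, and a multiprocessing training framework. All scoring keywords relate directly to large language models (LLMs) and associated techniques (fine-tuning, alignment, reasoning, compression, agents, and so on), whereas this paper involves no language models, text processing, or natural language technology. It belongs to robotics and automatic control; although it uses an AI technique (DRL), it is unrelated to the large-model topics in the keyword list.

!!! tip deepseek-chat TL;DR

The paper uses deep reinforcement learning (PPO) and a digital-twin simulator to tackle sim-to-real adaptation for autonomous underwater vehicle (AUV) docking. The agent achieves a success rate above 90% in simulation, and the controller is validated in a physical test tank.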

Abstract

Deep Reinforcement Learning (DRL) offers a robust alternative to traditional control methods for autonomous underwater docking, particularly in adapting to unpredictable environmental conditions. However, bridging the “sim-to-real” gap and managing high training latencies remain significant bottlenecks for practical deployment. This paper presents a systematic approach for autonomous docking using the Girona Autonomous Underwater Vehicle (AUV) by leveraging a high-fidelity digital twin environment. We adapted the Stonefish simulator into a multiprocessing RL framework to significantly accelerate the learning process while incorporating realistic AUV dynamics, collision models, and sensor noise. Using the Proximal Policy Optimization (PPO) algorithm, we developed a 6-DoF control policy trained in a headless environment with randomized starting positions to ensure generalized performance. Our reward structure accounts for distance, orientation, action smoothness, and adaptive collision penalties to facilitate soft docking. Experimental results demonstrate that the agent achieved a success rate of over 90% in simulation. Furthermore, successful validation in a physical test tank confirmed the efficacy of the sim-to-reality adaptation, with the DRL controller exhibiting emergent behaviors such as pitch-based braking and yaw oscillations to assist in mechanical alignment.

Keywords: Deep Reinforcement Learning, Autonomous Underwater Vehicle, Sim-to-real Adaptation, Proximal Policy Optimization, Digital Twin, Underwater Docking, Control Policy, Stonefish Simulator
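
A reward that combines the four terms named in the abstract (distance, orientation, action smoothness, collision penalty) might look like the sketch below. The functional forms, the weights, and the reading of "adaptive" as "scaling with impact speed" are all assumptions made for illustration; the abstract does not give the actual reward.

```python
def docking_reward(distance, heading_error, action, prev_action, collision_speed,
                   w_dist=1.0, w_head=0.5, w_smooth=0.1, w_coll=2.0):
    """Hypothetical shaped reward for soft docking (higher is better).

    distance        : metres between vehicle and docking target
    heading_error   : radians between vehicle heading and docking axis
    action          : current 6-DoF command (sequence of 6 floats)
    prev_action     : previous 6-DoF command
    collision_speed : impact speed in m/s, 0.0 when there is no contact
    """
    # Penalise abrupt changes between consecutive commands.
    smoothness = sum((a - b) ** 2 for a, b in zip(action, prev_action))
    # Assumed "adaptive" penalty: grows with the square of the impact speed,
    # so gentle contact is tolerated and hard contact is not.
    collision_penalty = w_coll * collision_speed ** 2
    return -(w_dist * distance
             + w_head * abs(heading_error)
             + w_smooth * smoothness
             + collision_penalty)


hold = [0.0] * 6
print(docking_reward(1.0, 0.1, hold, hold, 0.0))
print(docking_reward(1.0, 0.1, hold, hold, 0.5))
```

Keeping every term as a penalty makes the maximum reward zero at a perfectly aligned, motionless, contact-free dock, which is one simple way to bias a policy toward soft docking.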

60. ❌ An Intent of Collaboration: On Agencies between Designers and Emerging (Intelligent) Technologies

Authors: Pei-Ying Lin, Julie Heij, Iris Borst, Britt Joosten, Kristina Andersen, Wijnand IJsselsteijn
Source: arXiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.12018v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
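
The Score column and the 26.6 threshold can be reproduced with a short calculation. The threshold formula (5.0 + 0.8 × total weight, with a total weight of 27.0) comes from the report's configuration section. The per-keyword rule `weight * relevance` is an assumption that merely matches the rows shown here; the report does not state how the 15-point per-keyword cap enters.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold as given in the report configuration."""
    return base + factor * total_weight


def keyword_score(weight, relevance):
    """Assumed rule: score is the weight times the relevance (0 to 10)."""
    return weight * relevance


# (weight, relevance) pairs as listed in the table above:
# one keyword with relevance 10.0, the other 26 with relevance 0.0,
# and a listed weight of 0.0 on every row.
rows = [(0.0, 10.0)] + [(0.0, 0.0)] * 26

total = sum(keyword_score(w, r) for w, r in rows)
threshold = passing_score(27.0)
print(f"{total:.1f} / {threshold:.1f}", "pass" if total >= threshold else "fail")
```

Under this assumption a relevance of 10.0/10 still contributes nothing when the listed weight is 0.0, which matches the 0.0 total.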

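Scoring rationale: The paper studies how designers collaborate with LLMs (here, Google's LLM) in the creative process and how they can stay creative while doing so. It refers to LLMs explicitly, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not cover the technical details, methods, or applications behind the other keywords, such as MoE, SLMs, training techniques, reasoning methods, agent systems, or compression, so those keywords score 0. The paper is an application study of large models in creative design and does not go into technical principles or a specific scientific field such as bioinformatics, so the remaining keywords are not relevant.

!!! tip deepseek-chat TL;DR

The study examines how designers can stay creative when collaborating with large language models (LLMs). It finds that designers can easily lose creative agency, and proposes regaining it through introspection into their own creative process, understanding of the technology's capabilities, and deliberate adjustment of the human-technology relationship.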

Abstract

Amidst the emergence of powerful intelligent technologies such as LLMs and text-to-image AIs that promise to enhance creative processes, designers face the challenges of remaining empowered and creative while working with these foreign digital partners. While generative AIs offer versatile, informative, and occasionally poetic outcomes, their lack of embodied knowledge presents an even greater challenge to designers in gaining fruitful outcomes, such as in the field of Digital Craftsmanship. In this project, three designers embarked on a three-month experimental journey with an intention to co-create with Google’s LLM as a potential intelligent partner to investigate how it will influence the designers’ creativity. We found that a power dynamic of agencies exists between the LLM and the designer, in which the designer can easily lose their creative agency. Regaining the designer’s creative agency involves introspection into their own creative process, a structural understanding of the specific emerging technology involved, and deliberate adjustments to the dynamics of the human-technology relationship. We propose paying attention to the designer’s inner world and parties of agencies when engaging with emerging intelligent technologies through three aspects: the sensitivity towards a creative process as cognitive activities; the active investigation into specific technology’s capability; and the adjustment towards an appropriate working relationship between the designer and the emerging technology.

Keywords: LLMs, designers, creative agency, human-technology relationship, co-creation, emerging intelligent technologies, digital craftsmanship, power dynamics

61. ❌ Flowcean - Model Learning for Cyber-Physical Systems

Authors: Maximilian Schmidt, Swantje Plambeck, Markus Knitt, Hendrik Rose, Goerschwin Fey, Jan Christian Wieck, Stephan Balduin
Source: arXiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.12015v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

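Scoring rationale: The paper "Flowcean - Model Learning for Cyber-Physical Systems" presents a framework for data-driven model generation for cyber-physical systems (CPS), covering machine learning methods, a modular architecture, and automated modeling. All scoring keywords, however, concern large models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, and so on), specific AI applications (such as AI for Science), or advanced reasoning methods (such as CoT and System 2 Thinking). The title and abstract mention no large models, no innovations in deep learning principles, and no AI applications in scientific fields such as biomedicine, nor any of the specific techniques in the keywords. Every keyword is therefore unrelated to the paper, and all relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper presents Flowcean, a modular framework that automates the generation of data-driven models for cyber-physical systems, making modeling more efficient and accessible.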

Abstract

Effective models of Cyber-Physical Systems (CPS) are crucial for their design and operation. Constructing such models is difficult and time-consuming due to the inherent complexity of CPS. As a result, data-driven model generation using machine learning methods is gaining popularity. In this paper, we present Flowcean, a novel framework designed to automate the generation of models through data-driven learning that focuses on modularity and usability. By offering various learning strategies, data processing methods, and evaluation metrics, our framework provides a comprehensive solution, tailored to CPS scenarios. Flowcean facilitates the integration of diverse learning libraries and tools within a modular and flexible architecture, ensuring adaptability to a wide range of modeling tasks. This streamlines the process of model generation and evaluation, making it more efficient and accessible.

Keywords: Cyber-Physical Systems, model learning, data-driven, framework, modularity, automation, machine learning, model generation

62. ❌ Can RL Improve Generalization of LLM Agents? An Empirical Study

Authors: Zhiheng Xi, Xin Guo, Jiaqi Liu, Jiazheng Zhang, Yutao Fan, Zhihao Zhang, Shichun Liu, Mingxu Chai, Xiaowei Shi, Yitao Zhai, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang
Source: arXiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.12011v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

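Scoring rationale: The paper's core topic is the generalization of LLM agents, specifically multi-turn decision-making agents trained with reinforcement fine-tuning (RFT). It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" and "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points each). It does not cover the specific techniques or applications behind the other keywords, such as MoE, SFT, RAG, or quantization, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The paper studies how well LLM agents trained with reinforcement fine-tuning generalize to unseen environments. RFT generalizes well across task difficulty within an environment but transfers more weakly across environments; sequential training yields downstream gains with minimal upstream forgetting, and mixture training improves the overall balance.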

Abstract

Reinforcement fine-tuning (RFT) has shown promise for training LLM agents to perform multi-turn decision-making based on environment feedback. However, most existing evaluations remain largely in-domain: training and testing are conducted in the same environment or even on the same tasks. In real-world deployment, agents may operate in unseen environments with different background knowledge, observation spaces, and action interfaces. To characterize the generalization profile of RFT under such shifts, we conduct a systematic study along three axes: (1) within-environment generalization across task difficulty, (2) cross-environment transfer to unseen environments, and (3) sequential multi-environment training to quantify transfer and forgetting. Our results show that RFT generalizes well across task difficulty within an environment, but exhibits weaker transfer to unseen environments, which correlates with shifts in both semantic priors and observation/action interfaces. In contrast, sequential training yields promising downstream gains with minimal upstream forgetting, and mixture training across environments improves the overall balance. We further provide detailed analyses and deeper insights, and hope our work helps the community develop and deploy generalizable LLM agents.

Keywords: LLM Agents, Reinforcement Fine-tuning, Generalization, Multi-turn Decision-making, Cross-environment Transfer, Sequential Training, Mixture Training

63. ❌ Few-for-Many Personalized Federated Learning

Authors: Ping Guo, Tiantian Zhang, Xi Lin, Xiang Li, Zhi-Ri Tang, Qingfu Zhang
Source: arXiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.11992v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

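Scoring rationale: The paper focuses on algorithmic optimization for personalized federated learning (PFL). It proposes FedFew, a few-for-many optimization framework that targets model personalization and scalability across many clients with heterogeneous data. Its core contribution is a reformulation of the federated learning problem and an efficient optimization algorithm, not large-model technology. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" receives a relevance of 5 (some connection), because the paper runs experiments on medical imaging datasets. The other keywords, which concern large-model techniques, training methods, inference optimization, agents, and similar topics, are unrelated and score 0. The paper involves no large-model technique or application.

!!! tip deepseek-chat TL;DR

The paper proposes FedFew, a personalized federated learning framework that maintains a small number of shared server models to serve many heterogeneous clients efficiently, and outperforms existing methods on multiple datasets.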

Abstract

Personalized Federated Learning (PFL) aims to train customized models for clients with highly heterogeneous data distributions while preserving data privacy. Existing approaches often rely on heuristics like clustering or model interpolation, which lack principled mechanisms for balancing heterogeneous client objectives. Serving $M$ clients with distinct data distributions is inherently a multi-objective optimization problem, where achieving optimal personalization ideally requires $M$ distinct models on the Pareto front. However, maintaining $M$ separate models poses significant scalability challenges in federated settings with hundreds or thousands of clients. To address this challenge, we reformulate PFL as a few-for-many optimization problem that maintains only $K$ shared server models ($K \ll M$) to collectively serve all $M$ clients. We prove that this framework achieves near-optimal personalization: the approximation error diminishes as $K$ increases and each client’s model converges to each client’s optimum as data grows. Building on this reformulation, we propose FedFew, a practical algorithm that jointly optimizes the $K$ server models through efficient gradient-based updates. Unlike clustering-based approaches that require manual client partitioning or interpolation-based methods that demand careful hyperparameter tuning, FedFew automatically discovers the optimal model diversity through its optimization process. Experiments across vision, NLP, and real-world medical imaging datasets demonstrate that FedFew, with just 3 models, consistently outperforms other state-of-the-art approaches. Code is available at https://github.com/pgg3/FedFew.

Keywords: Personalized Federated Learning, PFL, Few-for-Many Optimization, Multi-objective Optimization, Model Scalability, Heterogeneous Data, FedFew Algorithm, Medical Imaging
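
One way to picture "$K$ shared models serving $M$ clients" is to let each client use whichever of the $K$ models gives it the lowest loss. The sketch below does exactly that on a made-up loss matrix. It is an illustration of the few-for-many idea only: the hard `min` assignment, the helper name `serve_clients`, and the numbers are assumptions, and FedFew's actual objective and gradient updates are not reproduced here.

```python
def serve_clients(losses):
    """Assign each client to its best model.

    losses[i][k] is the (hypothetical) loss of server model k on client i.
    Returns (assignment, best) where assignment[i] is the chosen model index
    and best[i] is the loss client i incurs with that model.
    """
    assignment = [min(range(len(row)), key=row.__getitem__) for row in losses]
    best = [row[k] for row, k in zip(losses, assignment)]
    return assignment, best


# Made-up losses for M = 4 clients and K = 3 server models.
losses = [
    [0.90, 0.20, 0.70],
    [0.10, 0.80, 0.60],
    [0.50, 0.40, 0.30],
    [0.15, 0.90, 0.80],
]

assignment, best = serve_clients(losses)
print(assignment, max(best))

# Dropping the third model (K = 2) can only keep or worsen each client's loss.
_, best_k2 = serve_clients([row[:2] for row in losses])
print(max(best_k2))
```

The comparison between $K = 3$ and $K = 2$ mirrors the abstract's claim that the approximation error shrinks as $K$ grows: adding a model never increases any client's best achievable loss.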

64. ❌ BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs

Authors: Ilias Aarab
Source: arXiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.11991v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

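Scoring rationale: The paper's core topic is zero-shot text classification; it systematically compares four model families, including instruction-tuned large language models (LLMs). Because it explicitly evaluates instruction-tuned LLMs, it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" and "Instruction Tuning OR Alignment OR Value Alignment" (10 points each). The paper does not cover the other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, CoT, Agents, or Quantization, so they score 0. It mentions no application in a specific scientific field such as bioinformatics, so "AI for Science" also scores 0.

!!! tip deepseek-chat TL;DR

The paper introduces BTZSC, a benchmark for systematically evaluating and comparing model families on zero-shot text classification: NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs. Modern rerankers such as Qwen3-Reranker-8B set a new state of the art (macro F1 = 0.72), while instruction-tuned LLMs at 4–12B parameters are also competitive.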

Abstract

Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross-encoders, embedding models, rerankers and instruction-tuned LLMs, encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4–12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.

Keywords: Zero-shot text classification, Benchmark, Large language models, Instruction tuning, Rerankers, Embedding models, Cross-encoders, Evaluation
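
The headline numbers in this abstract are macro F1 scores. As a reminder of what that metric measures, here is a minimal pure-Python version that averages per-class F1 over every label seen in either list. The example labels are made up, and how BTZSC aggregates scores across its 22 datasets is not stated in the abstract, so this covers a single dataset only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


y_true = ["pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every class counts equally, a model that ignores a rare class (here `neu`) is penalised more under macro F1 than under plain accuracy, which is why the metric suits benchmarks with varied class cardinalities.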

65. ❌ LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories

Authors: Qianpu Sun, Xiaowei Chi, Yuhan Rui, Ying Li, Kuangzhi Ge, Jiajun Li, Sirui Han, Shanghang Zhang
Source: arXiv
Published: 2026-03-12
arXiv link: http://arxiv.org/abs/2603.11987v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

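Scoring rationale: The paper's core topic is the use of multimodal large language models (MLLMs) for safety-critical reasoning and planning in scientific laboratories, so it is highly relevant to "Large Language Models", "LLM Agents", and "AI for Science" (10 points each). It evaluates models on hazard identification and safety-critical reasoning, which touches on "Chain of Thought" and "System 2 Thinking" (8 points each). Other keywords, such as MoE, SLMs, Scaling Laws, training methods, optimization techniques, inference acceleration, and model compression, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

To address the insufficiently evaluated safety-critical reasoning and planning of multimodal LLM agents in scientific laboratories, the paper introduces the LABSHIELD benchmark. Evaluation shows an average drop of 32.0% in safety performance in professional laboratory scenarios, underscoring the need for safety-centric reasoning frameworks.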

摘要翻译

人工智能正日益成为科学自动化的催化剂，多模态大语言模型（Multimodal Large Language Model, MLLM）智能体正从实验室助手演变为自主实验室操作者。这一转变对实验室环境提出了严格的安全要求，因为脆弱的玻璃器皿、危险物质和高精度实验设备使得规划错误或风险误判可能造成不可逆的后果。然而，在此类高风险场景中，具身智能体的安全意识与决策可靠性尚未得到充分界定和评估。为弥补这一空白，我们提出了LABSHIELD——一个用于评估MLLM在危险识别与安全关键推理能力的真实多视角基准。该基准基于美国职业安全与健康管理局（Occupational Safety and Health Administration, OSHA）标准及全球化学品统一分类和标签制度（Globally Harmonized System, GHS），建立了一个涵盖164项操作任务的严谨安全分类体系，这些任务具有不同的操作复杂性和风险特征。我们在双轨评估框架下测试了20个专有模型、9个开源模型及3个具身模型。结果显示，通用领域多项选择题准确率与半开放问答安全性能之间存在系统性差距：在专业实验室场景中，模型平均表现下降32.0%，尤其在危险解读与安全感知规划方面。这些发现凸显了建立以安全为核心的推理框架的迫切性，以确保在具身实验室环境中实现可靠的自主科学实验。完整数据集即将公开发布。

摘要 (Abstract)

Artificial intelligence is increasingly catalyzing scientific automation, with multimodal large language model (MLLM) agents evolving from lab assistants into self-driving lab operators. This transition imposes stringent safety requirements on laboratory environments, where fragile glassware, hazardous substances, and high-precision laboratory equipment render planning errors or misinterpreted risks potentially irreversible. However, the safety awareness and decision-making reliability of embodied agents in such high-stakes settings remain insufficiently defined and evaluated. To bridge this gap, we introduce LABSHIELD, a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. Grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS), LABSHIELD establishes a rigorous safety taxonomy spanning 164 operational tasks with diverse manipulation complexities and risk profiles. We evaluate 20 proprietary models, 9 open-source models, and 3 embodied models under a dual-track evaluation framework. Our results reveal a systematic gap between general-domain MCQ accuracy and Semi-open QA safety performance, with models exhibiting an average drop of 32.0% in professional laboratory scenarios, particularly in hazard interpretation and safety-aware planning. These findings underscore the urgent necessity for safety-centric reasoning frameworks to ensure reliable autonomous scientific experimentation in embodied laboratory contexts. The full dataset will be released soon.

Keywords: multimodal large language models, safety-critical reasoning, laboratory safety, hazard identification, autonomous scientific experimentation, embodied agents, benchmark evaluation, safety taxonomy

66. ❌ HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Authors: Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang, Jun Xu Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11975v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

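Scoring rationale: The paper focuses on safety detection for embodied agents in household scenarios. Its core contributions are an evaluation of vision-language models (VLMs) and the HD-Guard architecture. It is unrelated to most keywords because it does not propose innovations in the underlying techniques of large models (such as MoE, quantization, or training methods). Relevant keywords: 1) 'Large Language Models' scores 5 because VLMs are one applied form of large models; 2) 'Chain of Thought' and 'System 2 Thinking' each score 5 because HD-Guard's SlowBrain performs deep multimodal reasoning, which resembles System 2 thinking; 3) 'LLM Agents' scores 8 because the paper directly studies safety detection for embodied agents (household robots), which is an agent application. The remaining keywords have no direct connection to the paper.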

!!! tip deepseek-chat TL;DR

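To address dynamic unsafe action detection for embodied agents in household environments, the paper proposes the HomeSafe-Bench benchmark and HD-Guard, a hierarchical streaming architecture that achieves a better trade-off between latency and detection accuracy.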

摘要翻译

具身智能体的快速发展加速了家庭机器人在真实环境中的部署。然而，与结构化的工业环境不同，家庭空间引入了不可预测的安全风险，其中感知延迟和常识知识缺乏等系统局限可能导致危险错误。当前的安全评估通常局限于静态图像、文本或一般性危险，未能充分衡量这些特定场景下的动态不安全行为检测能力。为弥补这一空白，我们提出了 HomeSafe-Bench，这是一个旨在评估视觉语言模型在家庭场景中不安全行为检测能力的挑战性基准。HomeSafe-Bench通过结合物理仿真与先进视频生成的混合流程构建，涵盖六大功能区域的438个多样化案例，并具备细粒度的多维度标注。除基准测试外，我们进一步提出 家庭安全分层双脑监护系统（HD-Guard），这是一种用于实时安全监控的分层流式架构。HD-Guard协调一个轻量级“快速脑”进行连续高频筛查，并通过异步的大规模“慢速脑”进行深度多模态推理，有效平衡了推理效率与检测精度。评估表明，HD-Guard在延迟与性能间实现了更优的权衡，同时我们的分析揭示了当前基于视觉语言模型的安全检测系统中存在的关键瓶颈。

摘要 (Abstract)

The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.

Keywords: Vision-Language Models, embodied agents, unsafe action detection, household safety, benchmark evaluation, hierarchical architecture, real-time monitoring, multimodal reasoning

67. ❌ Normative Common Ground Replication (NormCoRe): Replication-by-Translation for Studying Norms in Multi-agent AI

Authors: Luca Deck, Simeon Allmendinger, Lucas Müller, Niklas Kühl Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11974v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

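Scoring rationale: The paper studies norm formation and coordination in multi-agent artificial intelligence (MAAI) and proposes NormCoRe, a method for translating human-subject experimental designs into AI agent studies. Relevance by keyword: 1) Highly relevant (10 points): 'Instruction Tuning/Alignment/Value Alignment' (the paper studies normative alignment, with agents negotiating shared decisions in fairness-sensitive domains); 'LLM Agents/Autonomous Agents/Agentic Workflow' (AI-based agents deliberate, negotiate, and decide); 'Multi-agent Systems/Agent Coordination' (norms form through coordination in a multi-agent system). 2) Moderately relevant (8 points): 'Large Language Models/LLMs/Foundation Models' (agent personas are instantiated with foundation models, and the paper notes that normative judgments depend on the choice of foundation model). 3) Other keywords (0 points): the paper does not cover MoE, SLMs, training techniques, inference optimization, model compression, or similar technical details, and it does not focus on scientific applications.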

!!! tip deepseek-chat TL;DR

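The paper proposes NormCoRe, a method for translating human-subject experimental designs into multi-agent AI studies. It finds that the normative judgments AI agents reach when negotiating fairness principles differ from human baselines and are sensitive to the foundation model and to the language used to describe agent personas.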

摘要翻译

2010年代末期，时尚潮流“NormCore”将趋同性塑造为群体归属的信号，揭示了规范如何通过集体协调得以形成。当前，在基于多智能体人工智能（MAAI）的系统中，可以观察到类似的规范性协调形态——人工智能智能体在公平敏感领域通过审议、协商达成共识决策。然而，现有实证研究往往将规范视为对齐或复制的目标，隐含地假定人类受试者与人工智能智能体具有等效性，导致对集体规范性动态的考察不足。为弥补这一空白，我们提出“规范性共识复制”（NormCoRe）这一新型方法论框架，系统地将人类受试实验设计转化为MAAI环境。NormCoRe融合行为科学、复制研究及前沿MAAI架构，将人类实验的结构层次映射至人工智能智能体研究设计中，实现对研究设计的系统化记录及MAAI中规范的分析。我们通过复制一项关于分配正义的标志性实验（在该实验中，参与者在“无知之幕”条件下协商公平原则），验证了NormCoRe的实用性。研究表明，人工智能智能体实验中的规范性判断可能偏离人类基准，且对基础模型的选择以及智能体人格实例化的语言描述高度敏感。本工作为分析MAAI中的规范提供了原则性路径，并在以人工智能智能体替代或辅助人类执行任务时，为设计选择提供了系统化的指导、反思与记录框架。

摘要 (Abstract)

In the late 2010s, the fashion trend NormCore framed sameness as a signal of belonging, illustrating how norms emerge through collective coordination. Today, similar forms of normative coordination can be observed in systems based on Multi-agent Artificial Intelligence (MAAI), as AI-based agents deliberate, negotiate, and converge on shared decisions in fairness-sensitive domains. Yet, existing empirical approaches often treat norms as targets for alignment or replication, implicitly assuming equivalence between human subjects and AI agents and leaving collective normative dynamics insufficiently examined. To address this gap, we propose Normative Common Ground Replication (NormCoRe), a novel methodological framework to systematically translate the design of human subject experiments into MAAI environments. Building on behavioral science, replication research, and state-of-the-art MAAI architectures, NormCoRe maps the structural layers of human subject studies onto the design of AI agent studies, enabling systematic documentation of study design and analysis of norms in MAAI. We demonstrate the utility of NormCoRe by replicating a seminal experimental study on distributive justice, in which participants negotiate fairness principles under a “veil of ignorance”. We show that normative judgments in AI agent studies can differ from human baselines and are sensitive to the choice of the foundation model and the language used to instantiate agent personas. Our work provides a principled pathway for analyzing norms in MAAI and helps to guide, reflect, and document design choices whenever AI agents are used to automate or support tasks formerly carried out by humans.

Keywords: Normative Common Ground Replication, Multi-agent Artificial Intelligence, Norm Coordination, Foundation Models, Distributive Justice, Veil of Ignorance, Agent Personas, Experimental Replication

68. ❌ Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling

Authors: Junhyeong Byeon, Jeongyeol Kim, Sejoon Lim Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11971v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

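Scoring rationale: The paper studies multimodal emotion recognition. It uses pre-trained models such as CLIP and Wav2Vec 2.0 as backbone networks and introduces a bi-directional cross-attention fusion module. It is unrelated to most keywords, which target the technical principles, training methods, inference optimization, alignment, and agent systems of large language models (LLMs). The only relevant keyword is 'Pre-training OR Continual Pre-training OR Domain Adaptation': the paper uses large-scale pre-trained models (CLIP and Wav2Vec 2.0) as frozen backbones, which is an application of pre-training, but it does not propose a new pre-training method, so it scores 5 (some relevance). Keywords such as 'AI for Science' concern AI applications, but the paper focuses on emotion recognition rather than bioinformatics, cheminformatics, or another scientific field, so they score 0.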

!!! tip deepseek-chat TL;DR

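The paper proposes a multimodal emotion recognition framework based on bi-directional cross-attention and temporal modeling, which outperforms unimodal models on the ABAW 10th EXPR benchmark.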

摘要翻译

由于面部外观、头部姿态、光照条件、背景噪声的巨大差异以及人类情感固有的动态特性，在非受控真实场景视频中进行情绪识别仍是一个具有挑战性的问题。仅依赖单一模态（如面部表情或语音）通常不足以捕捉这些复杂的情感线索。为解决这一问题，我们针对第十届野外情感行为分析挑战赛中的表情识别任务，提出了一种多模态情绪识别框架。

我们的方法利用大规模预训练模型作为冻结的主干网络，即用于视觉编码的CLIP模型和用于音频表征学习的Wav2Vec 2.0模型。为建模面部表情序列中的时序依赖关系，我们在固定长度视频窗口上使用时序卷积网络。此外，我们引入了双向交叉注意力融合模块，其中视觉与音频特征通过对称交互增强跨模态语境化能力，从而捕捉互补的情感信息。随后采用轻量级分类头进行最终情绪预测。我们进一步融合了基于CLIP文本特征的文本引导对比学习目标，以促进语义对齐的视觉表征。

在ABAW第十届表情识别基准测试上的实验结果表明，所提出的框架提供了强大的多模态基线，其性能优于单模态建模方法。这些结果验证了结合时序视觉建模、音频表征学习与跨模态融合对于在无约束真实场景中实现鲁棒情绪识别的有效性。

摘要 (Abstract)

Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.

Keywords: multimodal emotion recognition, bi-directional cross-attention, temporal modeling, CLIP, Wav2Vec 2.0, Temporal Convolutional Network, cross-modal fusion, ABAW challenge
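
The symmetric fusion step described in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single attention head, the residual connections, the mean pooling, the random weights, and the names `cross_attend`, `bidirectional_fusion`, `vis_from_aud`, and `aud_from_vis` are all assumptions made for illustration. The CLIP and Wav2Vec 2.0 features are replaced by random arrays.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: `queries` attend to `context`."""
    q = queries @ w_q   # (T_q, d)
    k = context @ w_k   # (T_c, d)
    v = context @ w_v   # (T_c, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T_q, T_c)
    return weights @ v  # (T_q, d)

def bidirectional_fusion(visual, audio, params):
    """Fuse (T_v, d) visual and (T_a, d) audio features into one (2d,) vector.

    Each modality queries the other; the residual add and mean pooling are
    illustrative choices, not details taken from the paper.
    """
    vis_ctx = visual + cross_attend(visual, audio, *params["vis_from_aud"])
    aud_ctx = audio + cross_attend(audio, visual, *params["aud_from_vis"])
    return np.concatenate([vis_ctx.mean(axis=0), aud_ctx.mean(axis=0)])

# Stand-ins for frozen-backbone features (hypothetical shapes).
rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(16, d))  # 16 video frames
audio = rng.normal(size=(10, d))   # 10 audio frames
params = {
    name: tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
    for name in ("vis_from_aud", "aud_from_vis")
}
fused = bidirectional_fusion(visual, audio, params)
print(fused.shape)  # (16,)
```

In a full model the fused vector would feed the lightweight classification head; the two directions use separate projection weights so that each modality learns its own view of the other.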

69. ❌ Learning Transferable Sensor Models via Language-Informed Pretraining

Authors: Yuliang Chen, Arvind Pillai, Yu Yvonne Wu, Tess Z. Griffin, Lisa Marsch, Michael V. Heinz, Nicholas C. Jacobson, Andrew Campbell Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11950v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

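Scoring rationale: The paper proposes the SLIP framework, which learns transferable representations of sensor data through language-informed pretraining. Its core topics are pre-training (10 points) and AI for Science (sensor data analysis, 10 points). It is also related to 'Large Language Models' (8 points) because it repurposes a pretrained language model, but it is unrelated to most other keywords (such as MoE, SFT, and RAG).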

!!! tip deepseek-chat TL;DR

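To address the failure of existing sensor representation methods to capture semantic structure, the paper proposes the SLIP framework, which uses language-informed pretraining to enable zero-shot transfer across sensor setups and improves classification and question-answering performance across 11 datasets.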

摘要翻译

现代传感系统产生大量未标记的多元时间序列数据。这种丰富的未标记数据使得自监督学习（SSL）成为学习可迁移表征的自然选择。然而，现有方法大多针对重构或预测目标进行优化，往往难以捕捉下游分类与推理任务所需的语义结构。尽管近期传感器-语言对齐方法通过描述生成和零样本迁移提升了语义泛化能力，但它们局限于固定的传感器配置（如预定义的通道集、信号长度或时间分辨率），这阻碍了跨领域适用性。为弥补这些不足，我们提出 SLIP（Sensor Language-Informed Pretraining，传感器语言信息预训练），这是一个用于学习跨多样传感器设置泛化的语言对齐表征的开源框架。SLIP 融合了对比对齐与传感器条件描述生成，兼顾判别式理解与生成式推理能力。通过跨注意力机制复用预训练的解码器仅语言模型，并引入一种简洁灵活的补丁嵌入器，SLIP 在推理时无需额外重训练即可支持不同时间分辨率与可变长度输入。在11个数据集上的实验表明，SLIP 在零样本迁移、信号描述生成和问答任务中均表现出优越性能：其线性探测平均准确率达到77.14%，较强基线相对提升5.93%，并在基于传感器的问答任务中取得64.83%的准确率。

摘要 (Abstract)

Modern sensing systems generate large volumes of unlabeled multivariate time-series data. This abundance of unlabeled data makes self-supervised learning (SSL) a natural approach for learning transferable representations. However, most existing approaches are optimized for reconstruction or forecasting objectives and often fail to capture the semantic structure required for downstream classification and reasoning tasks. While recent sensor-language alignment methods improve semantic generalization through captioning and zero-shot transfer, they are limited to fixed sensor configurations, such as predefined channel sets, signal lengths, or temporal resolutions, which hinders cross-domain applicability. To address these gaps, we introduce SLIP (Sensor Language-Informed Pretraining), an open-source framework for learning language-aligned representations that generalize across diverse sensor setups. SLIP integrates contrastive alignment with sensor-conditioned captioning, facilitating both discriminative understanding and generative reasoning. By repurposing a pretrained decoder-only language model via cross-attention and introducing an elegant, flexible patch-embedder, SLIP supports different temporal resolutions and variable-length input at inference time without additional retraining. Across 11 datasets, SLIP demonstrates superior performance in zero-shot transfer, signal captioning, and question answering. It achieves a 77.14% average linear-probing accuracy, a 5.93% relative improvement over strong baselines, and reaches 64.83% accuracy in sensor-based question answering.

Keywords: sensor models, language-informed pretraining, self-supervised learning, multivariate time-series, zero-shot transfer, contrastive alignment, sensor-language alignment, cross-domain applicability

70. ❌ Delayed Backdoor Attacks: Exploring the Temporal Dimension as a New Attack Surface in Pre-Trained Models

Authors: Zikang Ding, Haomiao Yang, Meng Hao, Wenbo Jiang, Kunlan Xiang, Runmeng Du, Yijing Liu, Ruichen Zhang, Dusit Niyato Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11949v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

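Scoring rationale: The paper studies delayed backdoor attacks in pre-trained models (PTMs). It is highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points) because it explicitly targets security vulnerabilities of pre-trained models. It has some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points) because pre-trained models include large language models and the experiments run on NLP benchmarks. Other keywords, such as MoE, SFT, RAG, and quantization, are not covered in the paper and score 0.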

!!! tip deepseek-chat TL;DR

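The paper proposes Delayed Backdoor Attacks (DBA), a new class of attack that adds a temporal dimension to backdoors in pre-trained models: activation is deferred after trigger exposure, which makes common words usable as triggers. It validates the attack's effectiveness and its resistance to existing defenses on four NLP benchmarks.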

摘要翻译

针对预训练模型（PTMs）的后门攻击传统上遵循一种“即时性假设”，即恶意行为在触发器出现时立即显现。本研究重新审视并挑战了这一范式，提出了延迟后门攻击（Delayed Backdoor Attacks, DBA）——一类新型威胁，其激活在时间上与触发器暴露相分离。我们认为，这种时间维度是开启一类先前不可行攻击的关键：即使用常见日常词汇作为触发器的攻击。为检验该范式的可行性，我们设计并实现了一个概念验证原型，称为基于非线性衰减的延迟后门攻击（Delayed Backdoor Attacks Based on Nonlinear Decay, DND）。DND嵌入了一个轻量级、有状态的逻辑模块，将激活推迟至可配置的阈值达到为止，从而产生一个明显的潜伏期，随后是受控的爆发。我们推导了一个形式化模型来描述这种潜伏行为，并提出一个双指标评估框架（攻击成功率ASR和延迟攻击成功率ASR$_{delay}$）以实证测量延迟效果。在四个自然语言处理（NLP）基准测试上的大量实验验证了DND的核心能力：它可在可控时间内保持休眠状态，维持高清洁准确率（$\ge$94%），并在激活后达到接近完美的攻击成功率（$\approx$99%，其他方法的平均值低于95%）。此外，DND对多种先进防御方法表现出抵抗力。本研究首次提供了实证证据，表明时间维度构成了预训练模型中一个可行却未受保护的攻击面，强调了开发下一代有状态且具备时间感知能力的防御机制的必要性。

摘要 (Abstract)

Backdoor attacks against pre-trained models (PTMs) have traditionally operated under an "immediacy assumption," where malicious behavior manifests instantly upon trigger occurrence. This work revisits and challenges this paradigm by introducing Delayed Backdoor Attacks (DBA), a new class of threats in which activation is temporally decoupled from trigger exposure. We propose that this temporal dimension is the key to unlocking a previously infeasible class of attacks: those that use common, everyday words as triggers. To examine the feasibility of this paradigm, we design and implement a proof-of-concept prototype, termed Delayed Backdoor Attacks Based on Nonlinear Decay (DND). DND embeds a lightweight, stateful logic module that postpones activation until a configurable threshold is reached, producing a distinct latency phase followed by a controlled outbreak. We derive a formal model to characterize this latency behavior and propose a dual-metric evaluation framework (ASR and ASR$_{delay}$) to empirically measure the delay effect. Extensive experiments on four natural language processing (NLP) benchmarks validate the core capabilities of DND: it remains dormant for a controllable duration, sustains high clean accuracy ($\ge$94%), and achieves near-perfect post-activation attack success rates ($\approx$99%, whereas the average of other methods is below 95%). Moreover, DND exhibits resilience against several state-of-the-art defenses. This study provides the first empirical evidence that the temporal dimension constitutes a viable yet unprotected attack surface in PTMs, underscoring the need for next-generation, stateful, and time-aware defense mechanisms.

Keywords: Delayed Backdoor Attacks, Pre-trained Models, Temporal Dimension, NLP Benchmarks, Attack Surface, Stateful Logic Module, DND, Backdoor Defense

71. ❌ Geometry-Aware Probabilistic Circuits via Voronoi Tessellations

Authors: Sahil Sidheekh, Sriraam Natarajan Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11946v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

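Scoring rationale: The paper studies a geometry-aware improvement to probabilistic circuits that uses Voronoi tessellations to bring in the local geometric structure of the data manifold. It belongs to probabilistic graphical models and density estimation within machine learning. It does not touch on large language models (LLMs), the technical principles of deep learning, AI for Science applications, or any of the specific techniques listed in the scoring keywords (such as MoE, RLHF, RAG, or quantization). All keywords concern large-model techniques, training methods, inference optimization, alignment, or AI applications, whereas this paper is a mathematical improvement to a classical probabilistic model, so every keyword has a relevance of 0.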

!!! tip deepseek-chat TL;DR

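Probabilistic circuits use data-independent mixture weights and therefore struggle to capture the local geometry of the data manifold. The paper proposes using Voronoi tessellations to build geometric structure directly into the sum nodes of a probabilistic circuit. It develops an approximate inference framework with guaranteed bounds and a structural condition that recovers exact tractable inference, and validates the approach empirically on standard density estimation tasks.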

摘要翻译

概率电路（Probabilistic Circuits, PCs）能够实现精确且可处理的推理，但其使用的数据无关混合权重限制了捕捉数据流形局部几何结构的能力。我们提出将沃罗诺伊镶嵌（Voronoi Tessellations, VT）作为一种自然方式，将几何结构直接融入PC的求和节点中。然而，直接引入此类结构会破坏可处理性。我们形式化了这种不兼容性，并提出了两种互补的解决方案：（1）一种近似推理框架，为推理提供有保证的下界和上界；（2）一种VT的结构条件，在该条件下可恢复精确的可处理推理。最后，我们引入了一种可微分的VT松弛方法，使得基于梯度的学习成为可能，并在标准密度估计任务上对所得方法进行了实证验证。

摘要 (Abstract)

Probabilistic circuits (PCs) enable exact and tractable inference but employ data-independent mixture weights that limit their ability to capture the local geometry of the data manifold. We propose Voronoi tessellations (VT) as a natural way to incorporate geometric structure directly into the sum nodes of a PC. However, naïvely introducing such structure breaks tractability. We formalize this incompatibility and develop two complementary solutions: (1) an approximate inference framework that provides guaranteed lower and upper bounds for inference, and (2) a structural condition for VT under which exact tractable inference is recovered. Finally, we introduce a differentiable relaxation for VT that enables gradient-based learning and empirically validate the resulting approach on standard density estimation tasks.

Keywords: Probabilistic Circuits, Voronoi Tessellations, Geometric Structure, Exact Tractable Inference, Density Estimation, Data Manifold, Approximate Inference, Gradient-based Learning
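
To make the idea of data-dependent weights at a sum node concrete, the sketch below computes hard Voronoi cell memberships for a few points and a softened version of them. The abstract does not say which differentiable relaxation the authors use; the temperature-scaled softmax over negative squared distances shown here is an assumption, as are the centroids, the points, and the function names.

```python
import numpy as np

def squared_distances(x, centroids):
    """Pairwise squared Euclidean distances, shape (num_points, num_cells)."""
    diff = x[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=-1)

def hard_voronoi_weights(x, centroids):
    """One-hot weights: each point belongs to the cell of its nearest centroid."""
    d2 = squared_distances(x, centroids)
    weights = np.zeros_like(d2)
    weights[np.arange(len(x)), d2.argmin(axis=1)] = 1.0
    return weights

def soft_voronoi_weights(x, centroids, temperature=1.0):
    """Assumed relaxation: softmax over negative squared distances.

    As the temperature approaches zero the weights approach the hard
    one-hot assignment, while remaining differentiable in the centroids.
    """
    logits = -squared_distances(x, centroids) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical 2-D example with three cells.
centroids = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
points = np.array([[0.5, 0.2], [3.5, 0.1], [1.0, 3.0]])

hard = hard_voronoi_weights(points, centroids)
soft = soft_voronoi_weights(points, centroids, temperature=2.0)
print(hard.argmax(axis=1))  # [0 1 2]
print(soft.round(3))
```

A sum node could use one row of these weights in place of a fixed mixture weight vector; how tractability is preserved under such gating is the subject of the paper and is not reproduced here.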

72. ❌ Effective Resistance Rewiring: A Simple Topological Correction for Over-Squashing

Authors: Bertran Miquel-Oliver, Manel Gil-Sorribes, Victor Guallar, Alexis Molina Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11944v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

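Scoring rationale: The paper addresses over-squashing in graph neural networks (GNNs) and proposes Effective Resistance Rewiring (ERR), a graph rewiring method based on effective resistance, to improve message passing across long-range dependencies. Every scoring keyword concerns large language models (LLMs), the underlying deep learning techniques, or AI-for-science applications, whereas this work studies a structural problem specific to GNNs. It belongs to graph deep learning and has no direct connection to LLMs, MoE, scaling laws, training techniques, inference optimization, agents, or quantization, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the difficulty GNNs have in capturing long-range dependencies because of over-squashing, the paper proposes Effective Resistance Rewiring (ERR), a simple topological correction that uses effective resistance to detect structural bottlenecks globally and rewires edges to improve information propagation. Experiments show that ERR improves connectivity and signal propagation, but it needs to be combined with normalization techniques such as PairNorm to balance the trade-off between over-squashing and oversmoothing.

Abstract translation (Chinese)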

图神经网络因过度挤压现象而难以捕捉长范围依赖关系——该现象指来自指数级增长邻域的信息必须通过少量结构瓶颈进行传递。尽管近期重布线方法试图缓解这一限制，但许多方法依赖曲率等局部准则，可能忽略限制信息流动的全局连通性约束。本文提出有效电阻重布线（Effective Resistance Rewiring, ERR），这是一种利用有效电阻作为全局信号来检测结构瓶颈的简单拓扑校正策略。ERR在固定边预算下，通过迭代添加具有最大电阻的节点对之间的边，同时移除电阻最小的边，从而在控制图稠密化的同时增强弱通信路径。该方法除重布线预算外无需参数，且仅依赖聚合节点间所有路径的单一全局度量。除了在GCN模型中的预测性能外，我们分析了重布线如何影响信息传播。通过追踪跨层节点嵌入的余弦相似度，我们比较了重布线前后图中初始节点特征与学习表征在信息传递过程中的演化关系。该分析有助于判断性能提升是源于更好的长程通信，而非嵌入几何结构的变化。在同配图和异配图（包括使用DirGCN的有向图设置）上的实验揭示了过度挤压与过度平滑之间的权衡，其中过度平滑对应层间表征多样性的丧失。电阻引导的重布线改善了连通性和信号传播，但可能加速深度模型中的表征混合。将ERR与PairNorm等归一化技术结合，可稳定这种权衡并提升性能。

Abstract

Graph Neural Networks struggle to capture long-range dependencies due to over-squashing, where information from exponentially growing neighborhoods must pass through a small number of structural bottlenecks. While recent rewiring methods attempt to alleviate this limitation, many rely on local criteria such as curvature, which can overlook global connectivity constraints that restrict information flow. We introduce Effective Resistance Rewiring (ERR), a simple topology correction strategy that uses effective resistance as a global signal to detect structural bottlenecks. ERR iteratively adds edges between node pairs with the largest resistance while removing edges with minimal resistance, strengthening weak communication pathways while controlling graph densification under a fixed edge budget. The procedure is parameter-free beyond the rewiring budget and relies on a single global measure aggregating all paths between node pairs. Beyond predictive performance with GCN models, we analyze how rewiring affects message propagation. By tracking cosine similarity between node embeddings across layers, we examine how the relationship between initial node features and learned representations evolves during message passing, comparing graphs with and without rewiring. This analysis helps determine whether improvements arise from better long-range communication rather than changes in embedding geometry. Experiments on homophilic and heterophilic graphs, including directed settings with DirGCN, reveal a trade-off between over-squashing and oversmoothing, where oversmoothing corresponds to the loss of representation diversity across layers. Resistance-guided rewiring improves connectivity and signal propagation but can accelerate representation mixing in deep models. Combining ERR with normalization techniques such as PairNorm stabilizes this trade-off and improves performance.

Keywords: Graph Neural Networks, Over-squashing, Effective Resistance, Topology Correction, Message Propagation, Long-range Dependencies, Graph Rewiring, Oversmoothing
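
The ERR loop described in the abstract (add an edge between the highest-resistance node pair, remove a lowest-resistance edge, keep the edge count fixed) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The function names `invert`, `effective_resistance`, and `err_rewire` are hypothetical, the graph is assumed to be small, connected, undirected, and unweighted, and the order of operations (add first, recompute resistances, then remove) is an assumption that the abstract does not specify.

```python
from itertools import combinations


def invert(matrix):
    """Invert a small dense matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: the graph must be connected")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def effective_resistance(n, edges):
    """Return {(u, v): R_uv} for all pairs u < v of a connected graph."""
    lap = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1.0
        lap[v][v] += 1.0
        lap[u][v] -= 1.0
        lap[v][u] -= 1.0
    # Ground the last node: invert the Laplacian with its row and column removed.
    reduced = invert([row[:n - 1] for row in lap[:n - 1]])

    def m(i, j):
        return reduced[i][j] if i < n - 1 and j < n - 1 else 0.0

    return {(u, v): m(u, u) + m(v, v) - 2.0 * m(u, v)
            for u, v in combinations(range(n), 2)}


def err_rewire(n, edges, budget):
    """Greedy rewiring under a fixed edge budget (assumed ordering: add, then remove)."""
    edges = {tuple(sorted(e)) for e in edges}
    for _ in range(budget):
        res = effective_resistance(n, sorted(edges))
        candidates = sorted(p for p in res if p not in edges)
        if not candidates:
            break  # complete graph: nothing left to add
        new_edge = max(candidates, key=res.get)
        edges.add(new_edge)
        # Recompute so the removed edge is chosen on the updated graph;
        # a minimum-resistance edge lies on a cycle, so removal keeps connectivity.
        res = effective_resistance(n, sorted(edges))
        old_edges = sorted(e for e in edges if e != new_edge)
        edges.remove(min(old_edges, key=res.get))
    return edges
```

For two triangles joined by a single bridge, the node pairs on opposite triangles that do not touch the bridge have the largest resistance (7/3), so one step of `err_rewire` adds an edge between the two triangles. The dense inversion is cubic in the number of nodes and is only meant for toy graphs.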

73. ❌ Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting

Authors: Chantal Pellegrini, Adrian Delchev, Ege Özsoy, Nassir Navab, Matthias Keicher Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11938v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

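Scoring rationale: The paper explicitly uses an instruction-tuned LLM to extract knowledge from free-text reports, an application of LLMs in the biomedical (radiology) domain, so it is highly relevant to "Large Language Models", "Instruction Tuning", and "AI for Science" (10 each). Other keywords such as MoE, SFT, and RAG are either not mentioned in the abstract or unrelated to the core method and receive 0.

!!! tip deepseek-chat TL;DR

The study proposes ProtoSR, which uses an instruction-tuned LLM to extract knowledge from free-text radiology reports and build a multimodal knowledge base, then improves structured report population through prototype retrieval and a prototype-conditioned residual. It achieves state-of-the-art results on the Rad-ReStruct benchmark, with the largest gains on fine-grained attribute questions.

Abstract translation (Chinese)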

结构化放射学报告相较于自由文本有望实现更快速、更一致的沟通，但其自动化仍面临挑战，因为模型必须在有限的结构化监督下，针对罕见发现和属性做出大量细粒度的离散决策。相比之下，自由文本报告在常规诊疗中大规模生成，并通过详细描述隐式编码了与图像关联的细粒度信息。为利用这种非结构化知识，我们提出ProtoSR方法，将自由文本信息注入结构化报告生成中。首先，我们引入一个自动提取流程，使用指令调优的大语言模型（LLM）挖掘超过8万份MIMIC-CXR研究，构建与结构化报告模板对齐的多模态知识库，其中每个答案选项均通过视觉原型表示。基于此知识库，ProtoSR被训练为检索与当前图像-问题对相关的原型，并通过原型条件残差增强模型预测，从而提供数据驱动的第二意见，有选择性地修正预测。在Rad-ReStruct基准测试中，ProtoSR取得了最先进的性能，尤其在细粒度属性问题上提升最为显著，这证明了整合自由文本衍生信号对于细粒度图像理解的价值。

Abstract

Structured radiology reporting promises faster, more consistent communication than free text, but automation remains difficult as models must make many fine-grained, discrete decisions about rare findings and attributes from limited structured supervision. In contrast, free-text reports are produced at scale in routine care and implicitly encode fine-grained, image-linked information through detailed descriptions. To leverage this unstructured knowledge, we propose ProtoSR, an approach for injecting free-text information into structured report population. First, we introduce an automatic extraction pipeline that uses an instruction-tuned LLM to mine 80k+ MIMIC-CXR studies and build a multimodal knowledge base aligned with a structured reporting template, representing each answer option with a visual prototype. Using this knowledge base, ProtoSR is trained to retrieve prototypes relevant for the current image-question pair and augment the model predictions through a prototype-conditioned residual, providing a data-driven second opinion that selectively corrects predictions. On the Rad-ReStruct benchmark, ProtoSR achieves state-of-the-art results, with the largest improvements on detailed attribute questions, demonstrating the value of integrating free-text derived signal for fine-grained image understanding.

Keywords: structured radiology reporting, large language models, instruction tuning, prototype retrieval, multimodal knowledge base, fine-grained image understanding, MIMIC-CXR, Rad-ReStruct benchmark
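
The abstract describes a prototype-conditioned residual that acts as a second opinion on the model's predictions. The sketch below shows one way such a residual could look; it is an assumption for illustration, not ProtoSR's actual formulation. The names `cosine` and `prototype_residual_logits`, the cosine-similarity scoring, and the mixing weight `alpha` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prototype_residual_logits(base_logits, query, prototypes, alpha=1.0):
    """Add a similarity-based residual to each answer option's logit.

    base_logits: one logit per answer option from the base model.
    query:       embedding of the current image-question pair (hypothetical).
    prototypes:  one visual prototype vector per answer option (hypothetical).
    alpha:       assumed weight of the residual term.
    """
    return [logit + alpha * cosine(query, proto)
            for logit, proto in zip(base_logits, prototypes)]
```

With `base_logits=[0.2, 0.0]`, a query of `[1.0, 0.0]`, and prototypes `[[0.0, 1.0], [1.0, 0.0]]`, the residual leaves the first option unchanged and raises the second to 1.0, so the prototype evidence overturns the base prediction.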

74. ❌ Fair Learning for Bias Mitigation and Quality Optimization in Paper Recommendation

Authors: Uttamasha Anjally Oyshi, Susan Gauch Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11936v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

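Scoring rationale: The paper studies fairness in academic paper recommendation and uses a conventional multilayer perceptron (MLP) to address demographic bias against authors. It does not touch on large language models, the underlying deep learning techniques, training optimization, inference acceleration, AI agent systems, or AI-for-science applications. All keywords concern large models and related deep learning techniques, whereas this paper uses a conventional machine learning approach, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Fair-PaperRec, an MLP-based model that addresses demographic bias in post-review paper acceptance decisions. Evaluations on SIGCHI, DIS, and IUI data report a 42.03% increase in underrepresented-group participation together with a 3.16% improvement in overall utility.

Abstract translation (Chinese)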

尽管普遍采用双盲评审机制，作者的人口统计学特征仍使代表性不足的群体处于不利地位。本文提出Fair-PaperRec模型——一种基于多层感知机（MultiLayer Perceptron, MLP）的解决方案，该模型在维持高质量学术标准的同时，致力于缓解论文录用决策中的人口统计学差异。与启发式方法不同，我们的方法通过交叉性标准（如种族、国籍）和定制化的公平性损失函数，在惩罚人口统计学差异的同时保障论文质量。基于ACM人机交互特别兴趣小组（SIGCHI）、交互系统设计会议（DIS）及智能用户界面会议（IUI）数据的评估表明，该方法使代表性不足群体的参与度提升了42.03%，整体效用提高了3.16%。这证明促进多样性并不会损害学术严谨性，并为推动公平导向的同行评审机制提供了可行路径。

Abstract

Despite frequent double-blind review, demographic biases of authors still disadvantage the underrepresented groups. We present Fair-PaperRec, a MultiLayer Perceptron (MLP)-based model that addresses demographic disparities in post-review paper acceptance decisions while maintaining high-quality requirements. Our methodology penalizes demographic disparities while preserving quality through intersectional criteria (e.g., race, country) and a customized fairness loss, in contrast to heuristic approaches. Evaluations using conference data from ACM Special Interest Group on Computer-Human Interaction (SIGCHI), Designing Interactive Systems (DIS), and Intelligent User Interfaces (IUI) indicate a 42.03% increase in underrepresented group participation and a 3.16% improvement in overall utility, indicating that diversity promotion does not compromise academic rigor and supports equity-focused peer review solutions.

Keywords: Fair Learning, Bias Mitigation, Paper Recommendation, Demographic Disparities, Fairness Loss, Underrepresented Groups, Academic Review, Quality Optimization
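
The abstract mentions a customized fairness loss that penalizes demographic disparities while preserving quality, without giving its form. The sketch below uses a common stand-in, binary cross-entropy plus a demographic-parity gap, purely as an illustration. The function name `fairness_penalized_loss`, the gap definition, and the weight `lam` are assumptions and should not be read as the loss used in Fair-PaperRec.

```python
import math


def fairness_penalized_loss(scores, labels, protected, lam=1.0):
    """Return (total, bce, gap) for predicted acceptance probabilities.

    scores:    predicted acceptance probabilities in (0, 1).
    labels:    0/1 quality labels (e.g. original acceptance decisions).
    protected: booleans marking membership in the underrepresented group.
    lam:       assumed trade-off weight between quality and parity.
    """
    eps = 1e-12
    clipped = [min(max(p, eps), 1.0 - eps) for p in scores]
    # Quality term: mean binary cross-entropy against the labels.
    bce = -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
               for p, y in zip(clipped, labels)) / len(clipped)
    # Parity term: absolute gap between the mean scores of the two groups.
    prot = [p for p, g in zip(scores, protected) if g]
    rest = [p for p, g in zip(scores, protected) if not g]
    gap = abs(sum(prot) / len(prot) - sum(rest) / len(rest)) if prot and rest else 0.0
    return bce + lam * gap, bce, gap
```

Setting `lam=0` recovers a plain quality objective, and larger values trade predictive fit for a smaller gap between groups.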

75. ❌ MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?

Authors: Xingze Zou, Jing Wang, Yuhua Zheng, Xueyi Chen, Haolei Bai, Lingcheng Kong, Syed A. R. Abu-Bakar, Zhaode Wang, Chengfei Lv, Haoji Hu, Huan Wang Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11935v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

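Scoring rationale: The paper's central question is whether LLMs can generate efficient kernels for mobile devices, so it is highly relevant to "Large Language Models" (10). It concerns AI implementation on mobile devices, which relates to "Small Language Models / On-device AI" (8). It proposes MoKA, a multi-agent system, which is highly relevant to "LLM Agents" and "Multi-agent Systems" (10 each). It notes that LLMs suffer from hallucinations, which relates to "Hallucination Mitigation" (8). Other keywords, including MoE, scaling laws, the various training methods, reasoning techniques, and AI for Science, are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper asks whether LLMs can write efficient kernels for mobile devices. It finds that current LLMs perform poorly because of engineering complexity and data scarcity (compilation failure rates above 54%), while the proposed multi-agent system MoKA raises compilation success to 93.7% and yields measurable speedups over native libraries for 27.4% of generated kernels.

Abstract translation (Chinese)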

大语言模型（LLM）在代码生成方面已展现出卓越能力，但其专门为移动设备生成内核的潜力在很大程度上仍未得到探索。在本研究中，我们将自动化内核生成的范围扩展到移动领域，以探究核心问题：LLM能否为移动设备编写高效内核？为支持系统性研究，我们提出了MobileKernelBench，这是一个综合评估框架，包含一个优先考虑算子多样性和跨框架互操作性的基准测试集，以及一个弥合主机与设备间差距以实现设备端验证的自动化流程。利用该框架，我们在移动神经网络（MNN）的CPU后端上进行了广泛评估，结果表明当前LLM难以应对移动框架固有的工程复杂性和数据稀缺性；标准模型乃至微调变体均表现出较高的编译失败率（超过54%），且由于幻觉问题和缺乏领域特定基础，其性能提升微乎其微。为克服这些限制，我们提出了移动内核智能体（Mobile Kernel Agent, MoKA），这是一个具备仓库感知推理能力并采用规划-执行范式的多智能体系统。在MobileKernelBench上的验证表明，MoKA实现了最先进的性能，将编译成功率提升至93.7%，并使27.4%的生成内核能够相比原生库带来可测量的加速效果。

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet their potential for generating kernels specifically for mobile devices remains largely unexplored. In this work, we extend the scope of automated kernel generation to the mobile domain to investigate the central question: Can LLMs write efficient kernels for mobile devices? To enable systematic investigation, we introduce MobileKernelBench, a comprehensive evaluation framework comprising a benchmark prioritizing operator diversity and cross-framework interoperability, coupled with an automated pipeline that bridges the host-device gap for on-device verification. Leveraging this framework, we conduct extensive evaluation on the CPU backend of Mobile Neural Network (MNN), revealing that current LLMs struggle with the engineering complexity and data scarcity inherent to mobile frameworks; standard models and even fine-tuned variants exhibit high compilation failure rates (over 54%) and negligible performance gains due to hallucinations and a lack of domain-specific grounding. To overcome these limitations, we propose the Mobile Kernel Agent (MoKA), a multi-agent system equipped with repository-aware reasoning and a plan-and-execute paradigm. Validated on MobileKernelBench, MoKA achieves state-of-the-art performance, boosting compilation success to 93.7% and enabling 27.4% of generated kernels to deliver measurable speedups over native libraries.

Keywords: Large Language Models, mobile devices, kernel generation, multi-agent system, compilation success, performance optimization, on-device verification, MobileKernelBench

76. ❌ Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

Authors: Junjie Chu, Yiting Qu, Ye Leng, Michael Backes, Yun Shen, Savvas Zannettou, Yang Zhang Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11914v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

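Scoring rationale: The paper studies how LLMs behave ethically when they encounter user-supplied harmful content in harmless tasks. This directly concerns LLMs and Alignment / Value Alignment (10 each), since it examines failures of LLMs to align with human values. It has some connection to Hallucination Mitigation / Factuality / Truthfulness (5) and Mechanistic Interpretability / Explainable AI (5), because it touches on the truthfulness and factuality of LLMs and on explaining their behavior. Other keywords, such as MoE, SLMs, scaling laws, training methods, reasoning techniques, compression, agents, and scientific applications, are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper studies whether LLMs performing a harmless task refuse to proceed, as a morally conscious human would, when they encounter harmful content supplied by the user. It finds that current mainstream LLMs, including GPT-5.2 and Gemini-3-Pro, often fail to uphold human-aligned ethics and continue processing the harmful content. The "Violence/Graphic" category and the "Translation" task are more likely to elicit harmful responses.

Abstract translation (Chinese)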

大型语言模型（LLMs）的训练日益注重与人类价值观对齐，但当前主要聚焦于任务层面，即拒绝执行直接有害的任务。然而，一个微妙却至关重要的内容层面伦理问题常被忽视：在执行看似无害的任务时，LLMs是否会像具有道德意识的人类一样，在遇到用户提供材料中的有害内容时拒绝继续处理？本研究旨在理解这一内容层面的伦理问题，并系统评估其对主流LLMs的影响。我们首先构建了一个有害知识数据集（即不符合OpenAI使用政策）作为用户提供的有害内容，涵盖十个有害类别，共包含1,357条条目。随后，我们设计了九项无害任务（即符合OpenAI使用政策）以模拟现实世界中的良性任务，并根据所需用户提供内容的程度分为三类：大量、适度和有限。利用该有害知识数据集与无害任务集，我们评估了九种LLMs在执行良性任务时面对用户提供有害内容的行为表现，并进一步探究有害知识类别与任务之间的动态关系如何影响不同模型。研究结果表明，当前的LLMs（包括最新的GPT-5.2和Gemini-3-Pro）在无害任务中常未能坚持与人类对齐的伦理准则，仍会继续处理有害内容。此外，“暴力/图像”类别的外部知识与“翻译”任务更易引发LLMs产生有害响应。我们还进行了广泛的消融实验，以探究影响这一新型滥用漏洞的潜在因素。我们希望本研究能启发相关利益方加强安全措施，以缓解这一被忽视的内容层面伦理风险。

Abstract

Large Language Models (LLMs) are increasingly trained to align with human values, primarily focusing on task level, i.e., refusing to execute directly harmful tasks. However, a subtle yet crucial content-level ethical question is often overlooked: when performing a seemingly benign task, will LLMs – like morally conscious human beings – refuse to proceed when encountering harmful content in user-provided material? In this study, we aim to understand this content-level ethical question and systematically evaluate its implications for mainstream LLMs. We first construct a harmful knowledge dataset (i.e., non-compliant with OpenAI’s usage policy) to serve as the user-supplied harmful content, with 1,357 entries across ten harmful categories. We then design nine harmless tasks (i.e., compliant with OpenAI’s usage policy) to simulate the real-world benign tasks, grouped into three categories according to the extent of user-supplied content required: extensive, moderate, and limited. Leveraging the harmful knowledge dataset and the set of harmless tasks, we evaluate how nine LLMs behave when exposed to user-supplied harmful content during the execution of benign tasks, and further examine how the dynamics between harmful knowledge categories and tasks affect different LLMs. Our results show that current LLMs, even the latest GPT-5.2 and Gemini-3-Pro, often fail to uphold human-aligned ethics by continuing to process harmful content in harmless tasks. Furthermore, external knowledge from the “Violence/Graphic” category and the “Translation” task is more likely to elicit harmful responses from LLMs. We also conduct extensive ablation studies to investigate potential factors affecting this novel misuse vulnerability. We hope that our study could inspire enhanced safety measures among stakeholders to mitigate this overlooked content-level ethical risk.

Keywords: Large Language Models, Alignment, Ethical Behavior, Harmful Content, Harmless Tasks, Safety Evaluation, Content-level Ethics, Misuse Vulnerability

77. ❌ EnTransformer: A Deep Generative Transformer for Multivariate Probabilistic Forecasting

Authors: Rajdeep Pathak, Rahul Goswami, Madhurima Panja, Palash Ghosh, Tanujit Chakraborty Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11909v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

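Scoring rationale: EnTransformer targets multivariate time series probabilistic forecasting using the Transformer architecture and engression, an application of deep learning to scientific and engineering domains. It is unrelated to the great majority of keywords (large-model fundamentals, training methods, inference optimization, alignment, agents, and so on). It has only a partial connection to "AI for Science OR Bioinformatics OR Cheminformatics" (5), because it is applied to scientific and engineering domains such as energy and transportation but does not mention bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The paper proposes EnTransformer, a deep generative framework that combines engression with the Transformer for multivariate time series probabilistic forecasting. On several benchmark datasets it produces well-calibrated probabilistic forecasts and outperforms the benchmark models.

Abstract translation (Chinese)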

可靠的不确定性量化在能源系统与交通网络等领域的多元时间序列预测问题中至关重要。尽管基于Transformer的架构近期在序列建模中表现出色，但大多数概率预测方法仍依赖于限制性参数似然或基于分位数的目标函数，难以捕捉多个相关时间序列间复杂的联合预测分布。本研究提出EnTransformer——一种深度生成式预测框架，它将用于建模条件分布的随机学习范式“engression”与Transformer强大的序列建模能力相结合。该方法通过向模型表示中注入随机噪声，并优化基于能量的评分目标，直接学习条件预测分布而无需施加参数假设。这一设计使EnTransformer能够生成协调的多元预测轨迹，同时保持Transformer有效建模长程时间依赖性和跨序列交互的能力。我们在电力、交通、太阳能、出租车、KDD-cup和维基百科等多个广泛使用的多元概率预测基准数据集上评估了EnTransformer。实验结果表明，EnTransformer能生成校准良好的概率预测，并持续超越基准模型。

Abstract

Reliable uncertainty quantification is critical in multivariate time series forecasting problems arising in domains such as energy systems and transportation networks, among many others. Although Transformer-based architectures have recently achieved strong performance for sequence modeling, most probabilistic forecasting approaches rely on restrictive parametric likelihoods or quantile-based objectives. They can struggle to capture complex joint predictive distributions across multiple correlated time series. This work proposes EnTransformer, a deep generative forecasting framework that integrates engression, a stochastic learning paradigm for modeling conditional distributions, with the expressive sequence modeling capabilities of Transformers. The proposed approach injects stochastic noise into the model representation and optimizes an energy-based scoring objective to directly learn the conditional predictive distribution without imposing parametric assumptions. This design enables EnTransformer to generate coherent multivariate forecast trajectories while preserving Transformers’ capacity to effectively model long-range temporal dependencies and cross-series interactions. We evaluate our proposed EnTransformer on several widely used benchmarks for multivariate probabilistic forecasting, including Electricity, Traffic, Solar, Taxi, KDD-cup, and Wikipedia datasets. Experimental results demonstrate that EnTransformer produces well-calibrated probabilistic forecasts and consistently outperforms the benchmark models.

Keywords: multivariate time series forecasting, probabilistic forecasting, Transformer, deep generative model, engression, energy-based scoring, uncertainty quantification, long-range dependencies
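
The abstract says the model optimizes an energy-based scoring objective over noise-driven samples. A standard sample-based estimate of the energy score, for forecast samples x_1..x_m and an observed target y, is the mean of ||x_i - y|| minus half the mean of ||x_i - x_j||. The sketch below computes that estimate in plain Python. The function name `energy_score` is hypothetical, and whether EnTransformer uses exactly this estimator is an assumption.

```python
import math


def energy_score(samples, target):
    """Sample-based energy score (lower is better).

    samples: list of forecast vectors drawn from the model (e.g. by injecting noise).
    target:  the observed vector.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    m = len(samples)
    # Accuracy term: average distance from each sample to the observation.
    fit = sum(dist(s, target) for s in samples) / m
    # Spread term: half the average pairwise distance between samples,
    # which rewards forecasts that do not collapse to a single point.
    spread = sum(dist(a, b) for a in samples for b in samples) / (2 * m * m)
    return fit - spread
```

For samples `[[0, 0], [2, 0]]` and target `[1, 0]`, the accuracy term is 1.0 and the spread term is 0.5, giving a score of 0.5. Shifting the same samples away from the target raises the score while the spread term stays fixed.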

78. ❌ Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

Authors: Lu Wang, Zhuoran Jin, Yupu Hao, Yubo Chen, Kang Liu, Yulong Ao, Jun Zhao Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11896v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

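Scoring rationale: The paper focuses on online streaming video reasoning in multimodal large language models (MLLMs). Its core contribution is the Think While Watching framework, which uses segment-level memory and a causal mask to address memory decay over long video streams. Keyword relevance: (1) highly relevant (10): the paper is explicitly built on MLLMs, which fall under LLMs, and constructs a multi-round chain-of-thought dataset for training; (2) moderately relevant (5): the paper involves SFT (a stage-matched training strategy), long-context processing (continuous video streams), in-depth reasoning (multi-turn reasoning tasks), and inference acceleration (an efficient pipeline that overlaps watching and thinking); (3) unrelated (0): the remaining keywords, such as MoE, quantization, and RAG, are not covered.

!!! tip deepseek-chat TL;DR

To address memory decay and the lack of concurrent perception and generation when MLLMs perform multi-turn reasoning over online video streams, the paper proposes the Think While Watching framework, which uses segment-level memory and a streaming causal mask. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench.

Abstract translation (Chinese)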

多模态大语言模型（MLLMs）在离线视频理解任务中展现出强大性能，但多数仅限于离线推理或在线推理能力较弱，难以对持续到达的视频流进行多轮交互。现有流式处理方法通常采用交替进行的感知-生成范式，这阻碍了感知与生成的并发执行，且随着视频流增长会导致早期记忆衰减，损害长程依赖建模。我们提出“边看边思”（Think While Watching），一种基于记忆锚定的流式视频推理框架，能够在多轮交互中保持连续的片段级记忆。我们构建了一个三阶段、多轮次的思维链数据集，并采用阶段匹配的训练策略，同时通过片段级流式因果掩码和流式位置编码确保严格的因果性。在推理阶段，我们引入了一种高效流水线，实现观看与思考过程的重叠，并自适应选择最佳注意力后端。在单轮与多轮流式输入协议下，我们的方法均取得了显著效果。基于Qwen3-VL构建的模型在StreamingBench上将单轮准确率提升了2.6%，在OVO-Bench上提升了3.79%。在多轮设置中，模型在保持性能的同时将输出标记数量减少了56%。代码发布于：https://github.com/wl666hhh/Think_While_Watching/

Abstract

Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/

Keywords: Multimodal Large Language Models, Streaming Video Reasoning, Segment-Level Memory, Chain-of-Thought, Causal Mask, Multi-turn Interaction, Online Inference, Attention Backend
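
The abstract refers to a segment-level streaming causal mask but does not define it. One plausible reading is a block-causal mask in which a token may attend to every token in its own segment and in earlier segments, but never to later segments. The sketch below builds such a mask. The function name `segment_causal_mask` and the choice of full attention inside a segment are assumptions, and the paper's actual mask may differ.

```python
def segment_causal_mask(segment_ids):
    """Build a block-causal attention mask from per-token segment ids.

    segment_ids: one integer per token, non-decreasing along the stream.
    Returns a square list of lists where mask[i][j] is True when token i
    may attend to token j (same or earlier segment, an assumed rule).
    """
    return [[seg_j <= seg_i for seg_j in segment_ids] for seg_i in segment_ids]
```

With `segment_ids = [0, 0, 1, 1, 2]`, tokens in segment 0 see each other but nothing from segments 1 or 2, while the single token in segment 2 sees the whole stream so far.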

79. ❌ Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language

Authors: Remigiusz Kinas, Paweł Kiszczak, Sergio P. Perez, Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11881v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

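Scoring rationale: The paper's core topic is LLM compression: it builds a Polish-optimized model through structured pruning and knowledge distillation. Highly relevant keywords: 1) Large Language Models (the core research subject); 2) Supervised Fine-Tuning (SFT, part of the alignment pipeline); 3) DPO (part of the alignment pipeline); 4) Model Compression (achieved through pruning); 5) Inference Acceleration (up to 50% speedup). The paper does not cover the other keywords.

!!! tip deepseek-chat TL;DR

The paper compresses Bielik-11B with structured pruning and knowledge distillation into Bielik-Minitron-7B, a 7.35B-parameter model optimized for Polish, recovering about 90% of baseline performance with up to 50% faster inference.

Abstract translation (Chinese)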

本报告详细介绍了Bielik-Minitron-7B的创建过程，这是一个专门针对欧洲语言优化的压缩模型，参数量为73.5亿，源自Bielik-11B-v3.0模型。通过采用受英伟达（NVIDIA）Minitron方法启发的两阶段压缩方法，我们结合了结构化混合剪枝（structured hybrid pruning）与知识蒸馏（knowledge distillation），将模型参数量从110.4亿减少了33.4%，降至73.5亿。我们利用英伟达模型优化器（NVIDIA Model Optimizer）进行结构化剪枝，并借助英伟达NeMo框架（NVIDIA NeMo Framework）进行基于逻辑值（logit-based）的蒸馏以恢复模型质量。蒸馏完成后，模型经历了一个严谨的对齐流程，包括监督微调（Supervised Fine-Tuning, SFT）、直接偏好优化（Direct Preference Optimization, DPO-P）和强化学习（GRPO）。我们的最终模型成功恢复了基线模型约90%的性能，同时推理速度最高可提升50%。该方法展示了一条为代表性不足的语言创建语言模型的高效路径，在降低推理部署成本的同时，保持了原始模型的质量。

Abstract

This report details the creation of Bielik-Minitron-7B, a compressed 7.35B parameter version of the Bielik-11B-v3.0 model, specifically optimized for European languages. By leveraging a two-stage compression methodology inspired by the NVIDIA Minitron approach, we combined structured hybrid pruning and knowledge distillation to reduce the model’s parameter count by 33.4%, from 11.04B to 7.35B. We utilized the NVIDIA Model Optimizer for structural pruning and the NVIDIA NeMo Framework for logit-based distillation for quality recovery. Following distillation, the model underwent a rigorous alignment pipeline consisting of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO-P), and Reinforcement Learning (GRPO). Our final model successfully recovered approximately 90% of the baseline model’s performance while providing up to 50% inference speedup. This approach demonstrates an efficient pathway to create language models for less-represented languages, preserving the original model quality while reducing inference deployment costs.
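The abstract names logit-based distillation and a 33.4% parameter reduction but gives no formulas. The following is a minimal sketch, assuming the distillation loss is a KL divergence between temperature-softened teacher and student token distributions (a common formulation, not confirmed by the source). The names `softmax`, `kd_loss`, `param_reduction`, and the `temperature` parameter are hypothetical and are not NeMo or Model Optimizer APIs. The last lines recompute the reported reduction from the two parameter counts given in the abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position (assumed loss form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def param_reduction(before_b, after_b):
    """Fractional parameter reduction between two model sizes (in billions)."""
    return (before_b - after_b) / before_b


# Identical logits give zero loss; diverging logits give a positive loss.
same = kd_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
diff = kd_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])

# Parameter counts taken from the abstract: 11.04B -> 7.35B.
reduction = param_reduction(11.04, 7.35)
print(f"reduction = {reduction:.1%}")
```

The computed reduction rounds to 33.4%, which agrees with the figure stated in the abstract.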

Keywords: Large Language Models, Model Compression, Structured Pruning, Knowledge Distillation, Inference Acceleration, Supervised Fine-Tuning, Direct Preference Optimization, Polish Language

80. ❌ The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection

Authors: J Alex Corll Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11875v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

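Scoring rationale: The paper studies prompt injection detection, which falls under LLM security. It is related to 'Large Language Models' (5 points) because prompt injection is an attack on LLMs, and to 'Scaling Laws AND Data Quality' (5 points) because it argues that data quality (strict data geometry) matters more than model scale. The remaining keywords mainly concern model architecture, training methods, inference optimization, and application domains; the paper covers none of these specific techniques, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Mirror design pattern, which organizes a prompt injection corpus under strict data geometry and trains a sparse character n-gram linear SVM classifier. For L1 prompt injection screening it reaches 95.97% recall and 92.07% F1, suggesting that data quality can matter more than model scale.

Abstract translation (Chinese)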

提示注入防御常被视作语义理解问题，并交由日益庞大的神经检测器处理。然而，对于第一层筛选环节，其需求有所不同：检测器需处理每个请求，因此必须具备快速、确定性、不可被提示操控及可审计的特性。我们提出“镜像”这一数据策展设计模式，它将提示注入语料组织为匹配的正负样本单元，使分类器学习控制平面攻击机制而非偶然的语料库捷径。基于5,000个严格筛选的开源样本（这是我们在公共数据有效性协议下可支持的最大规模语料库），我们定义了包含32个单元的镜像拓扑结构，用公共数据填充其中31个单元，训练稀疏字符n-gram线性支持向量机，将其权重编译为静态Rust组件，并在无外部模型运行时依赖的条件下，以亚毫秒级延迟实现了95.97%的召回率与92.07%的F1分数（基于524个案例的保留测试集）。在同一测试集上，我们的下一层防御，即一个拥有2200万参数的Prompt Guard 2模型，仅达到44.35%的召回率与59.14%的F1分数，其中位延迟为49毫秒，p95延迟为324毫秒。线性模型虽会遗留诸如“使用与提及”等残余语义模糊性供后续流水线层处理，但在此范畴内，我们的研究结果表明：对于L1级提示注入筛查，严格的数据几何结构可能比模型规模更为关键。

Abstract

Prompt injection defenses are often framed as semantic understanding problems and delegated to increasingly large neural detectors. For the first screening layer, however, the requirements are different: the detector runs on every request and therefore must be fast, deterministic, non-promptable, and auditable. We introduce Mirror, a data-curation design pattern that organizes prompt injection corpora into matched positive and negative cells so that a classifier learns control-plane attack mechanics rather than incidental corpus shortcuts. Using 5,000 strictly curated open-source samples – the largest corpus supportable under our public-data validity contract – we define a 32-cell mirror topology, fill 31 of those cells with public data, train a sparse character n-gram linear SVM, compile its weights into a static Rust artifact, and obtain 95.97% recall and 92.07% F1 on a 524-case holdout at sub-millisecond latency with no external model runtime dependencies. On the same holdout, our next line of defense, a 22-million-parameter Prompt Guard 2 model, reaches 44.35% recall and 59.14% F1 at 49 ms median and 324 ms p95 latency. Linear models still leave residual semantic ambiguities such as use-versus-mention for later pipeline layers, but within that scope our results show that for L1 prompt injection screening, strict data geometry can matter more than model scale.
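To make the classifier description concrete, here is a minimal sketch of how a sparse character n-gram linear model scores a request, assuming inference reduces to a sparse dot product plus a bias. The n-gram length range, `TOY_WEIGHTS`, `TOY_BIAS`, and all function names are hypothetical illustrations, not values or code from the paper. The helper `precision_from_f1_recall` inverts the F1 definition to show the precision implied by the reported recall and F1; those precision values are derived here and are not reported in the abstract.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=5):
    """Count character n-grams; the 3..5 length range is an assumption."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def linear_score(text, weights, bias=0.0):
    """Sparse dot product w.x + b; n-grams without a weight contribute 0."""
    feats = char_ngrams(text)
    return bias + sum(weights.get(gram, 0.0) * c for gram, c in feats.items())


def precision_from_f1_recall(f1, recall):
    """Invert F1 = 2PR / (P + R) to get P = F1 * R / (2R - F1)."""
    return f1 * recall / (2.0 * recall - f1)


# Hypothetical weights for illustration only; a trained SVM would learn these.
TOY_WEIGHTS = {"ignor": 1.2, "previ": 0.8, "instr": 0.5, "pleas": -0.3}
TOY_BIAS = -1.0

attack = linear_score("Ignore previous instructions", TOY_WEIGHTS, TOY_BIAS)
benign = linear_score("Please summarize this article", TOY_WEIGHTS, TOY_BIAS)

# Precision implied by the reported recall and F1 (subject to rounding).
svm_precision = precision_from_f1_recall(0.9207, 0.9597)
guard_precision = precision_from_f1_recall(0.5914, 0.4435)
```

With the toy weights, the attack string scores above zero and the benign string below it. Under the reported figures, both detectors have an implied precision near 0.88, so the F1 gap would come almost entirely from recall; this reading depends on the rounding of the published numbers.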

Keywords: prompt injection detection, data geometry, linear SVM, sparse character n-gram, fast screening, deterministic classifier, model scale, LLM security

81. ❌ ELISA: An Interpretable Hybrid Generative AI Agent for Expression-Grounded Discovery in Single-Cell Genomics

Authors: Omar Coser Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11872v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

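Scoring rationale: ELISA is a hybrid generative AI agent for single-cell genomics discovery that combines scGPT expression embeddings, BioBERT semantic retrieval, and LLM interpretation. Core relevant keywords: 1) 'LLM Agents' (10 points): the paper explicitly builds an 'agentic AI system' with 'LLM-mediated interpretation'; 2) 'AI for Science' (10 points): it is applied directly to bioinformatics and single-cell genomics; 3) 'Mechanistic Interpretability' (10 points): it emphasizes an 'interpretable framework' and 'bridging the gap'; 4) 'Retrieval-Augmented Generation' (8 points): it uses BioBERT for semantic retrieval; 5) 'Chain of Thought' and 'System 2 Thinking' (8 points each): it involves 'grounded LLM reasoning'; 6) 'Tool Use' (8 points): it integrates analytical modules such as pathway scoring and ligand-receptor prediction. Other keywords such as MoE, quantization, and RLHF are not covered and score 0.

!!! tip deepseek-chat TL;DR

The study addresses the difficulty of turning single-cell RNA sequencing data into mechanistic biological hypotheses. It develops ELISA, an interpretable hybrid generative AI agent framework that unifies scGPT expression embeddings, BioBERT semantic retrieval, and LLM interpretation. Across six datasets it significantly outperforms CellWhisperer on cell type retrieval, replicates published biological findings, and generates candidate hypotheses.

Abstract translation (Chinese)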

将单细胞RNA测序（scRNA-seq）数据转化为机制性生物学假说仍是一个关键瓶颈，因为智能体式人工智能系统无法直接获取转录组表征，而表达基础模型对自然语言仍不透明。本文介绍ELISA（Embedding-Linked Interactive Single-cell Agent），这是一个可解释框架，它将scGPT表达嵌入与基于BioBERT的语义检索、以及LLM介导的解析相统一，实现交互式单细胞发现。自动查询分类器根据输入内容为基因特征、自然语言概念或两者混合，将其路由至基因标记评分、语义匹配或互逆排序融合流程。集成分析模块可直接在嵌入数据上运行（无需访问原始计数矩阵），实现跨60余个基因集的通路活性评分、使用280余个经人工校验配对的配体-受体相互作用预测、条件感知比较分析和细胞类型比例估计。在涵盖炎症性肺病、儿童与成人癌症、类器官模型、健康组织及神经发育的六个不同scRNA-seq数据集上进行基准测试，ELISA在细胞类型检索方面显著优于CellWhisperer（组合置换检验，$p < 0.001$），尤其在基因特征查询上提升显著（平均倒数排名的Cohen’s $d = 5.98$）。ELISA成功复现了已发表的生物学发现（平均综合得分0.90），其通路对齐度和主题覆盖度接近完美（均为0.98），并通过基于数据的LLM推理生成候选假说，从而弥合了转录组数据探索与生物学发现之间的鸿沟。代码发布于：https://github.com/omaruno/ELISA-An-AI-Agent-for-Expression-Grounded-Discovery-in-Single-Cell-Genomics.git（若在研究中使用ELISA，请引用本工作）。

Abstract

Translating single-cell RNA sequencing (scRNA-seq) data into mechanistic biological hypotheses remains a critical bottleneck, as agentic AI systems lack direct access to transcriptomic representations while expression foundation models remain opaque to natural language. Here we introduce ELISA (Embedding-Linked Interactive Single-cell Agent), an interpretable framework that unifies scGPT expression embeddings with BioBERT-based semantic retrieval and LLM-mediated interpretation for interactive single-cell discovery. An automatic query classifier routes inputs to gene marker scoring, semantic matching, or reciprocal rank fusion pipelines depending on whether the query is a gene signature, natural language concept, or mixture of both. Integrated analytical modules perform pathway activity scoring across 60+ gene sets, ligand–receptor interaction prediction using 280+ curated pairs, condition-aware comparative analysis, and cell-type proportion estimation, all operating directly on embedded data without access to the original count matrix. Benchmarked across six diverse scRNA-seq datasets spanning inflammatory lung disease, pediatric and adult cancers, organoid models, healthy tissue, and neurodevelopment, ELISA significantly outperforms CellWhisperer in cell type retrieval (combined permutation test, $p < 0.001$), with particularly large gains on gene-signature queries (Cohen’s $d = 5.98$ for MRR). ELISA replicates published biological findings (mean composite score 0.90) with near-perfect pathway alignment and theme coverage (0.98 each), and generates candidate hypotheses through grounded LLM reasoning, bridging the gap between transcriptomic data exploration and biological discovery. Code available at: https://github.com/omaruno/ELISA-An-AI-Agent-for-Expression-Grounded-Discovery-in-Single-Cell-Genomics.git (If you use ELISA in your research, please cite this work).
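The abstract mentions a reciprocal rank fusion pipeline for mixed queries and reports gains in MRR. A minimal sketch of both ideas follows, assuming the standard formulation in which each ranked list contributes 1 / (k + rank) per item. The constant `k=60`, the tie-break by name, the function names, and the example cell-type rankings are all illustrative assumptions, not details taken from ELISA.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(item) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1 / rank of the correct item; a miss contributes 0."""
    total = 0.0
    for ranking, target in zip(ranked_lists, targets):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(targets)


# Hypothetical rankings from a gene-marker scorer and a semantic matcher.
gene_ranking = ["T cell", "NK cell", "B cell"]
semantic_ranking = ["NK cell", "Macrophage", "T cell"]

fused = reciprocal_rank_fusion([gene_ranking, semantic_ranking])
mrr = mean_reciprocal_rank([fused], ["T cell"])
```

In this toy case "NK cell" ranks first after fusion because it sits near the top of both lists, while "T cell" is first in one list but third in the other.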

Keywords: single-cell genomics, LLM agent, interpretable AI, scGPT embeddings, BioBERT retrieval, biological discovery, expression foundation models, hybrid generative AI

Authors: Radu Calinescu, Ana Cavalcanti, Marsha Chechik, Lina Marsso, Beverley Townsend Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11864v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

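Scoring rationale: The paper studies how to align AI agents with social, legal, ethical, empathetic, and cultural (SLEEC) norms and proposes a systematic process for operationalising those norms. Most keywords are unrelated because they concern technical details of large models, such as training methods and inference optimization. Two keywords are highly relevant: 1) 'Instruction Tuning OR Alignment OR Value Alignment' (10 points): the paper's central concern is aligning AI agents with human norms and values, a direct application of value alignment; 2) 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points): the paper explicitly studies the deployment and behavioural alignment of AI agents in high-stakes domains. The paper does not cover the other keywords.

!!! tip deepseek-chat TL;DR

AI agents deployed in high-stakes domains are hard to hold to abstract social, legal, ethical, empathetic, and cultural norms because those norms are difficult to translate into concrete, verifiable requirements. The paper proposes a systematic SLEEC-norm operationalisation process and sets out a corresponding research and policy framework.

Abstract translation (Chinese)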

随着人工智能代理在医疗保健和执法等高风险领域日益广泛应用，将其行为与社会、法律、伦理、共情及文化（SLEEC）规范相协调已成为一项关键的工程挑战。尽管国际框架已为人工智能确立了高层次的规范性原则，但将这些抽象原则转化为具体、可验证的需求仍存在显著差距。为弥合这一差距，我们提出了一种系统化的SLEEC规范操作化流程，用于确定、验证、实施和核验规范性需求。此外，我们全面梳理了支持该流程的方法与工具，并指出了当前面临的主要挑战及应对这些挑战的研究路径。由此，我们构建了一个框架——并明确了相应的研究与政策议程——旨在开发不仅功能实用，且能明确符合人类规范与价值的人工智能代理。

Abstract

As AI agents are increasingly used in high-stakes domains like healthcare and law enforcement, aligning their behaviour with social, legal, ethical, empathetic, and cultural (SLEEC) norms has become a critical engineering challenge. While international frameworks have established high-level normative principles for AI, a significant gap remains in translating these abstract principles into concrete, verifiable requirements. To address this gap, we propose a systematic SLEEC-norm operationalisation process for determining, validating, implementing, and verifying normative requirements. Furthermore, we survey the landscape of methods and tools supporting this process, and identify key remaining challenges and research avenues for addressing them. We thus establish a framework - and define a research and policy agenda - for developing AI agents that are not only functionally useful but also demonstrably aligned with human norms and values.

Keywords: AI agents, norm operationalisation, SLEEC norms, value alignment, ethical AI, verifiable requirements, high-stakes domains, human norms

83. ❌ CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

Authors: Zi-Han Wang, Lam Nguyen, Zhengyang Zhao, Mengyue Yang, Chengwei Qin, Yujiu Yang, Linyi Yang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11863v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

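Scoring rationale: The paper evaluates and enhances machine creativity, analysing how large models perform on creativity in code generation. It is highly related to 'Large Language Models' (8 points) because it analyses the behaviour of state-of-the-art models; somewhat related to 'Scaling Laws AND Data Quality' (5 points) because it analyses the effect of scaling on creativity; somewhat related to 'Chain of Thought' and 'System 2 Thinking' (5 points each) because it analyses reasoning capabilities; somewhat related to 'Self-Correction' and 'LLM Agents' (5 points each) because it involves self-evolution and agentic workflows; and highly related to 'Hallucination Mitigation' (8 points) because it explicitly distinguishes creativity from hallucination. Other keywords such as MoE, SLMs, training methods, RAG, attention optimization, and quantization are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the lack of quantitative evaluation of machine creativity, the paper proposes the CreativeBench benchmark and the EvoRePE strategy. Analysing large models' creativity in code generation, it finds that scaling helps combinatorial creativity but gives diminishing returns for exploratory creativity, and it proposes an inference-time steering method to enhance creativity.

Abstract translation (Chinese)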

高质量预训练数据的饱和促使研究重心转向能够持续生成新颖产物的进化系统，这推动了AlphaEvolve的成功。然而，此类系统的发展因缺乏严谨、量化的评估而受阻。为应对这一挑战，我们提出了CreativeBench——一个基于经典认知框架、用于评估代码生成中机器创造力的基准。该基准包含两个子集——CreativeBench-Combo与CreativeBench-Explore——通过利用逆向工程和自我对弈的自动化流程，分别针对组合型创造力和探索型创造力进行评估。借助可执行代码，CreativeBench通过将创造力定义为质量与新异性的乘积这一统一度量标准，客观地区分了创造力与幻觉。我们对前沿模型的分析揭示了以下显著行为：（1）模型缩放能显著提升组合创造力，但对探索能力的提升收益递减；（2）更大模型表现出“缩放收敛性”，即答案更趋正确但多样性降低；（3）推理能力主要受益于受限探索而非组合任务。最后，我们提出了EvoRePE，一种即插即用的推理时引导策略，其通过内化进化搜索模式以持续提升机器创造力。

Abstract

The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets – CreativeBench-Combo and CreativeBench-Explore – the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit “convergence-by-scaling,” becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.
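The abstract defines the unified metric only as the product of quality and novelty. A minimal sketch of why a product separates creativity from hallucination, assuming both factors are normalized to [0, 1] and that quality could be measured by something like a test pass rate (both assumptions; the paper's exact definitions are not given here). The function name and sample values are hypothetical.

```python
def creativity_score(quality, novelty):
    """Product metric: a zero in either factor forces the score to zero."""
    if not (0.0 <= quality <= 1.0 and 0.0 <= novelty <= 1.0):
        raise ValueError("quality and novelty are assumed to lie in [0, 1]")
    return quality * novelty


# Novel but non-working code (a hallucination-like output) scores zero.
hallucinated = creativity_score(quality=0.0, novelty=0.9)

# Correct but entirely conventional code also scores zero.
conventional = creativity_score(quality=1.0, novelty=0.0)

# Only outputs that are both working and novel earn credit.
creative = creativity_score(quality=0.8, novelty=0.5)
```

An additive metric would still reward a highly novel output that fails every check, which is the case the product form rules out.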

Keywords: machine creativity, benchmark, code generation, self-evolving, scaling effects, hallucination distinction, evolutionary search, inference-time steering

84. ❌ You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents

Authors: Ching-Yu Kao, Xinfeng Li, Shenyu Dai, Tianze Qiu, Pengcheng Zhou, Eric Hanchen Jiang, Philip Sperl Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11862v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

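Scoring rationale: The paper studies a security vulnerability in LLM agents, centred on the risks that arise when agents execute external instructions. Highly relevant keywords: LLM Agents (the core research subject), Instruction Tuning/Alignment (it studies the security problems of instruction following), and Tool Use (the agents perform terminal and filesystem operations). Other keywords such as MoE, quantization, and inference acceleration are unrelated to the paper. The paper belongs to LLM application security and meets the research-background requirement of "research applications of large models in different domains".

!!! tip deepseek-chat TL;DR

The paper finds a structural security vulnerability in LLM agents that execute instructions embedded in external documentation: they cannot distinguish malicious instructions from legitimate ones. Experiments show data exfiltration success rates of up to 85%, and existing defenses fail to detect the attack reliably.

Abstract translation (Chinese)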

具备高权限、能自主处理外部文档的大型语言模型智能体正日益被信任用于自动化任务——通过读取并执行项目说明来实现操作。然而，这些智能体被授予终端访问权、文件系统控制权以及出站网络连接能力，却缺乏充分的安全监督。我们发现并系统性地测量了这种信任模型中的一个根本性漏洞，我们称之为可信执行器困境：智能体以高比例执行文档中嵌入的指令（包括恶意指令），因为它们无法区分恶意指令与合法的设置指导。这一漏洞是指令遵循设计范式带来的结构性后果，而非实现层面的缺陷。为构建测量框架，我们形式化了一个三维分类法，涵盖语言伪装、结构混淆和语义抽象，并构建了ReadSecBench——一个包含500份真实世界README文件的基准测试集，以实现可复现的评估。在商业部署的计算机使用智能体上进行的实验显示，端到端数据外泄成功率高达85%，且在五种编程语言和三种注入位置中结果一致。在模拟环境中对四个大型语言模型系列进行的跨模型评估证实，对于注入指令的语义遵循在不同模型系列间具有一致性。一项包含15名参与者的用户研究显示，所有参与者的检测率均为0%；对12种基于规则和6种基于大型语言模型的防御方案的评估表明，两类方案均无法在可接受的误报率下实现可靠检测。综合来看，这些结果量化了智能体功能遵循与其安全认知之间持续存在的语义安全鸿沟，证实文档嵌入式指令注入是对高权限大型语言模型智能体部署的一项持久且当前尚未得到缓解的威胁。

Abstract

High-privilege LLM agents that autonomously process external documentation are increasingly trusted to automate tasks by reading and executing project instructions, yet they are granted terminal access, filesystem control, and outbound network connectivity with minimal security oversight. We identify and systematically measure a fundamental vulnerability in this trust model, which we term the Trusted Executor Dilemma: agents execute documentation-embedded instructions, including adversarial ones, at high rates because they cannot distinguish malicious directives from legitimate setup guidance. This vulnerability is a structural consequence of the instruction-following design paradigm, not an implementation bug. To structure our measurement, we formalize a three-dimensional taxonomy covering linguistic disguise, structural obfuscation, and semantic abstraction, and construct ReadSecBench, a benchmark of 500 real-world README files enabling reproducible evaluation. Experiments on the commercially deployed computer-use agent show end-to-end exfiltration success rates up to 85%, consistent across five programming languages and three injection positions. Cross-model evaluation on four LLM families in a simulation environment confirms that semantic compliance with injected instructions is consistent across model families. A 15-participant user study yields a 0% detection rate across all participants, and evaluation of 12 rule-based and 6 LLM-based defenses shows neither category achieves reliable detection without unacceptable false-positive rates. Together, these results quantify a persistent Semantic-Safety Gap between agents’ functional compliance and their security awareness, establishing that documentation-embedded instruction injection is a persistent and currently unmitigated threat to high-privilege LLM agent deployments.

Keywords: LLM agents, instruction injection, security vulnerability, data leakage, Trusted Executor Dilemma, Semantic-Safety Gap, ReadSecBench, computer-use agent

85. ❌ The Landscape of Generative AI in Information Systems: A Synthesis of Secondary Reviews and Research Agendas

Authors: Aleksander Jarzębowicz, Adam Przybyłek, Jacinto Estima, Yen Ying Ng, Jakub Swacha, Beata Zielosko, Lech Madeyski, Noel Carroll, Kai-Kristian Kemell, Bartosz Marcinkowski, Alberto Rodrigues da Silva, Viktoria Stray, Netta Iivari, Anh Nguyen-Duc, Jorge Melegati, Boris Delibašić, Emilio Insfran Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11842v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

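Scoring rationale: The paper is a review of secondary studies and research agendas on generative AI (GenAI) in information systems, focusing on the challenges and governance issues of organizational GenAI adoption. Relevance to the keywords: 1) "Large Language Models" and the related foundation-model keywords score 5 because GenAI is usually built on large models, although the paper does not go into technical detail; 2) "Hallucination Mitigation" scores 8 because the paper explicitly names hallucination as a core technical-unreliability challenge; 3) "Instruction Tuning/Alignment" and "Explainable AI" score 5 because the paper touches on socio-technical issues such as ethical alignment and explainability; 4) other technical keywords (e.g. MoE, quantization, inference acceleration) score 0 because the paper covers no specific model architecture, training method, or optimization technique; 5) "AI for Science" scores 0 because the paper focuses on information systems rather than applications in the sciences.

!!! tip deepseek-chat TL;DR

Through a systematic literature review, the paper analyses the adoption of generative AI in information systems. It finds that adoption is constrained by technical unreliability, societal-ethical risks, and a governance vacuum, and it proposes a research agenda for the co-evolution of technical capabilities with organizational and social systems.

Abstract translation (Chinese)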

随着各组织努力应对生成式人工智能（GenAI）的快速普及，本研究通过对二次文献及研究议程的系统性文献综述，整合了该领域的知识现状。通过分析2023年以来发表的28篇论文，我们发现：尽管生成式人工智能在提升生产力和推动创新方面具有变革性潜力，但其应用受到多重相互关联挑战的制约，包括技术不可靠性（如幻觉现象、性能漂移）、社会伦理风险（如偏见、滥用、技能退化）以及系统性治理缺失（如隐私、问责、知识产权）。从社会技术视角解读，这些发现揭示了生成式人工智能快速演进的技术子系统与适应相对滞后的社会子系统之间持续存在的错位，凸显了信息系统（IS）研究对于实现协同优化的关键作用。为弥合这一鸿沟，我们探讨了一项研究议程，旨在将信息系统研究的重心从分析影响转向积极塑造技术能力与组织流程、社会价值及监管制度的协同演化——重点聚焦人机混合协同系统、情境化验证、概率系统设计原则以及适应性治理框架。

Abstract

As organizations grapple with the rapid adoption of Generative AI (GenAI), this study synthesizes the state of knowledge through a systematic literature review of secondary studies and research agendas. Analyzing 28 papers published since 2023, we find that while GenAI offers transformative potential for productivity and innovation, its adoption is constrained by multiple interrelated challenges, including technical unreliability (hallucinations, performance drift), societal-ethical risks (bias, misuse, skill erosion), and a systemic governance vacuum (privacy, accountability, intellectual property). Interpreted through a socio-technical lens, these findings reveal a persistent misalignment between GenAI’s fast-evolving technical subsystem and the slower-adapting social subsystem, positioning IS research as critical for achieving joint optimization. To bridge this gap, we discuss a research agenda that reorients IS scholarship from analyzing impacts toward actively shaping the co-evolution of technical capabilities with organizational procedures, societal values, and regulatory institutions, emphasizing hybrid human–AI ensembles, situated validation, design principles for probabilistic systems, and adaptive governance.

Keywords: Generative AI, Information Systems, Systematic Literature Review, Socio-technical Perspective, Hallucinations, Ethical Risks, Governance, Research Agenda

Authors: Isuri Perera, Frits de Nijs, Julian Garcia Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11834v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究混合人-智能体在能源市场中的社会困境，主要涉及多智能体系统中的协调问题，使用强化学习和进化动态方法。仅与关键词’Multi-agent Systems OR Agent Coordination’高度相关（10分），因为论文核心研究多智能体协调、混合人-智能体群体和合作行为涌现。其他关键词均与大模型技术、训练方法、推理技术、科学AI应用等无关，论文未涉及任何大模型或深度学习技术原理，也未在生物医药等科学领域应用AI，因此评分为0分。

!!! tip deepseek-chat TL;DR

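The paper studies how introducing artificial agents that use globally observable signals can promote coordination and cooperation in hybrid human-agent populations for energy load management, and analyzes feasibility and asymmetric benefits under partial adoption.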

摘要翻译

在人类将战略决策权委托给自主智能体的混合群体中，理解合作行为何时以及如何产生仍是一个关键挑战。本研究以能源负荷管理为背景探讨该问题：消费者智能体在需求依赖型定价机制下安排其电器使用。这种结构可能形成一种社会困境——协调合作能使所有人受益，但在均衡状态下，智能体往往选择承受拥堵成本，而合作轮转本可避免这些成本。为解决协调问题，我们引入利用全局可观测信号增强协调的人工智能体。通过演化动力学和强化学习实验，我们证明人工智能体能够改变学习动态，促使系统趋向协调结果。一个常被忽视的问题是部分采纳：当人工智能体技术处于早期采纳阶段时会发生什么？我们分析了采纳者与非采纳者共存的混合群体，证明单边进入是可行的：采纳者不会在结构上受损，部分采纳仍能改善整体结果。然而在某些参数区间，非采纳者可能从采纳者引发的合作中获取不成比例的利益。这种不对称性虽不阻碍有益的技术进入，但在实际部署中值得关注，并凸显了多智能体场景中人工智能技术采纳的战略性问题。

摘要 (Abstract)

In hybrid populations where humans delegate strategic decision-making to autonomous agents, understanding when and how cooperative behaviors can emerge remains a key challenge. We study this problem in the context of energy load management: consumer agents schedule their appliance use under demand-dependent pricing. This structure can create a social dilemma where everybody would benefit from coordination, but in equilibrium agents often choose to incur the congestion costs that cooperative turn-taking would avoid. To address the problem of coordination, we introduce artificial agents that use globally observable signals to increase coordination. Using evolutionary dynamics, and reinforcement learning experiments, we show that artificial agents can shift the learning dynamics to favour coordination outcomes. An often neglected problem is partial adoption: what happens when the technology of artificial agents is in the early adoption stages? We analyze mixed populations of adopters and non-adopters, demonstrating that unilateral entry is feasible: adopters are not structurally penalized, and partial adoption can still improve aggregate outcomes. However, in some parameter regimes, non-adopters may benefit disproportionately from the cooperation induced by adopters. This asymmetry, while not precluding beneficial entry, warrants consideration in deployment, and highlights strategic issues around the adoption of AI technology in multiagent settings.

关键词: hybrid human-agent populations, social dilemmas, energy load management, multi-agent coordination, reinforcement learning, evolutionary dynamics, partial adoption, autonomous agents
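The social dilemma described in the abstract (everyone would benefit from turn-taking, yet congestion is the equilibrium) can be illustrated with a toy two-consumer, two-slot model. The payoff numbers, the linear price rule, and the function names are hypothetical and are not taken from the paper.

```python
# Toy two-agent, two-slot model; all payoff numbers are hypothetical.
BENEFIT = {"peak": 3.0, "off": 0.0}  # value of running the appliance in each slot
PRICE_PER_USER = 2.0                 # price in a slot grows linearly with its load
OTHER = {"peak": "off", "off": "peak"}


def payoffs(choice_a: str, choice_b: str):
    """Return (utility_a, utility_b) under demand-dependent pricing."""
    load = {"peak": 0, "off": 0}
    for choice in (choice_a, choice_b):
        load[choice] += 1

    def utility(choice: str) -> float:
        return BENEFIT[choice] - PRICE_PER_USER * load[choice]

    return utility(choice_a), utility(choice_b)


def is_nash(choice_a: str, choice_b: str) -> bool:
    """True if neither agent gains by unilaterally switching slots."""
    u_a, u_b = payoffs(choice_a, choice_b)
    a_deviates = payoffs(OTHER[choice_a], choice_b)[0]
    b_deviates = payoffs(choice_a, OTHER[choice_b])[1]
    return a_deviates <= u_a and b_deviates <= u_b


def turn_taking_average() -> float:
    """Agent A's average payoff when the two agents alternate who gets the peak slot."""
    return (payoffs("peak", "off")[0] + payoffs("off", "peak")[0]) / 2


print(payoffs("peak", "peak"), is_nash("peak", "peak"), turn_taking_average())
```

In this toy setting both agents crowding the peak slot is a one-shot equilibrium even though alternating gives each agent a higher average payoff. That gap is the kind of coordination problem a shared signal could help resolve.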

87. ❌ Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI

Authors: Md. Hasin Sarwar Ifty, Nisharga Nirjan, Labib Islam, M. A. Diganta, Reeyad Ahmed Ornate, Anika Tasnim, Md. Saiful Islam Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11818v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

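Scoring rationale: The paper focuses on ovarian cancer detection with conventional deep learning models (CNN architectures such as InceptionV3) and applies XAI methods to explain the model's decisions; it is an application of AI in biomedicine. It does not touch core large-model topics such as large language models (LLMs), MoE, model scaling, training and alignment, inference optimization, or agents, so the vast majority of keywords score 0. The only relevant keywords are 'Mechanistic Interpretability OR Explainable AI' (10 points, because XAI methods such as LIME and SHAP are central to the paper) and 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points, because the paper applies AI in bioinformatics/medicine but does not directly involve large models).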

!!! tip deepseek-chat TL;DR

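The study uses several convolutional neural networks (such as InceptionV3) together with explainable AI (XAI) methods to develop a deep learning model that accurately detects malignant ovarian lesions, reaching an average of 94% across the performance metrics on the augmented dataset.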

摘要翻译

恶性肿瘤细胞的无限增殖即为癌症。近年来，医疗专业人员通过应用深度学习模型分析医疗数据，以提升临床决策、疾病诊断和药物研发能力，从而持续获得增强的诊断与治疗手段。目前多数癌症的研究与治疗已融合了这些技术。然而，卵巢癌仍是一个难题，因其无创检测方法准确性不足，而精确检测又需依赖耗时且有创的流程。因此，本研究利用多种卷积神经网络（Convolutional Neural Networks, CNN），包括LeNet-5、ResNet、VGGNet和GoogLeNet/Inception，构建了15种模型变体，以筛选出能够准确检测与识别卵巢癌的模型。为有效训练模型，我们采用了来自Mendeley的OvarianCancer&SubtypesDatasetHistopathology数据集。模型构建完成后，我们运用可解释人工智能（Explainable Artificial Intelligence, XAI）方法，如LIME、积分梯度（Integrated Gradients）和SHAP，对所选模型的“黑箱”输出结果进行解释。模型性能评估采用了准确率、精确率、召回率、F1分数、ROC曲线与AUC等指标。评估结果显示，采用ReLu激活函数的轻量化InceptionV3模型在增强数据集中取得了最佳综合性能，所有评估指标平均得分达到94%。最后，针对可解释性分析，我们对上述三种XAI方法进行了整体比较分析。本研究旨在通过相关成果，为卵巢癌探索更优的检测方法提供助力。

摘要 (Abstract)

The unrestrained proliferation of cells that are malignant in nature is cancer. In recent times, medical professionals are constantly acquiring enhanced diagnostic and treatment abilities by implementing deep learning models to analyze medical data for better clinical decision, disease diagnosis and drug discovery. A majority of cancers are studied and treated by incorporating these technologies. However, ovarian cancer remains a dilemma as it has inaccurate non-invasive detection procedures and a time consuming, invasive procedure for accurate detection. Thus, in this research, several Convolutional Neural Networks such as LeNet-5, ResNet, VGGNet and GoogLeNet/Inception have been utilized to develop 15 variants and choose a model that accurately detects and identifies ovarian cancer. For effective model training, the dataset OvarianCancer&SubtypesDatasetHistopathology from Mendeley has been used. After constructing a model, we utilized Explainable Artificial Intelligence (XAI) models such as LIME, Integrated Gradients and SHAP to explain the black box outcome of the selected model. For evaluating the performance of the model, Accuracy, Precision, Recall, F1-Score, ROC Curve and AUC have been used. From the evaluation, it was seen that the slightly compact InceptionV3 model with ReLu had the overall best result achieving an average score of 94% across all the performance metrics in the augmented dataset. Lastly for XAI, the three aforementioned XAI have been used for an overall comparative analysis. It is the aim of this research that the contributions of the study will help in achieving a better detection method for ovarian cancer.

关键词: Ovarian Cancer Detection, Deep Learning, Convolutional Neural Networks, Explainable AI, Medical Image Analysis, InceptionV3, SHAP, LIME
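The abstract lists Accuracy, Precision, Recall, and F1-Score among its evaluation metrics. A minimal sketch of how these are computed from binary confusion-matrix counts is given below; the counts are hypothetical and are not results from the paper.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only (not taken from the paper).
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

An "average score across all the performance metrics", as reported in the abstract, would then be the mean of such values, although the abstract does not state exactly which metrics enter that average.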

88. ❌ VisiFold: Long-Term Traffic Forecasting via Temporal Folding Graph and Node Visibility

Authors: Zhiwei Zhang, Xinyi Du, Weihao Wang, Xuanchi Guo, Wenjuan Han Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11816v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

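Scoring rationale: The paper "VisiFold: Long-Term Traffic Forecasting via Temporal Folding Graph and Node Visibility" focuses on traffic forecasting and proposes a new spatial-temporal graph neural network approach (Temporal Folding Graph and Node Visibility) to address the computational bottleneck and spatial-temporal dependencies in long-term traffic forecasting. Its content is unrelated to all scoring keywords (which center on large models, deep learning technical principles, AI for Science, and so on); it does not touch large-model techniques, training methods, inference optimization, alignment, agents, or scientific AI applications.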

!!! tip deepseek-chat TL;DR

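To address the high computational cost and complex spatial-temporal dependencies of long-term traffic forecasting, the paper proposes the VisiFold framework, which uses a temporal folding graph and a node visibility mechanism to substantially reduce resource consumption while improving forecasting performance.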

摘要翻译

交通预测是智能交通系统的基石。尽管现有研究在短期预测方面已取得显著进展，但长期预测在很大程度上仍是一个尚未充分探索且充满挑战的前沿领域。延长预测时间范围会加剧两个关键问题：不断升级的计算资源消耗和日益复杂的时空依赖性。当前方法依赖于时空图并分别处理时间和空间维度，存在快照堆叠膨胀和跨步骤碎片化的缺陷。为克服这些局限性，我们提出 VisiFold 框架。该框架引入了一种新颖的时间折叠图（temporal folding graph），将一系列时间快照整合为单一图结构。此外，我们提出节点可见性机制（node visibility mechanism），通过节点级掩码和子图采样来克服大节点数量带来的计算瓶颈。大量实验表明，VisiFold 不仅大幅降低了资源消耗，而且在长期预测任务中超越了现有基线模型。值得注意的是，即使在高达 80% 的掩码比例下，VisiFold 仍能保持其性能优势。通过有效突破时间和空间维度的资源限制，我们的工作为更符合实际需求的长期交通预测开辟了新路径。代码已发布于 https://github.com/PlanckChang/VisiFold。

摘要 (Abstract)

Traffic forecasting is a cornerstone of intelligent transportation systems. While existing research has made significant progress in short-term prediction, long-term forecasting remains a largely uncharted and challenging frontier. Extending the prediction horizon intensifies two critical issues: escalating computational resource consumption and increasingly complex spatial-temporal dependencies. Current approaches, which rely on spatial-temporal graphs and process temporal and spatial dimensions separately, suffer from snapshot-stacking inflation and cross-step fragmentation. To overcome these limitations, we propose VisiFold. Our framework introduces a novel temporal folding graph that consolidates a sequence of temporal snapshots into a single graph. Furthermore, we present a node visibility mechanism that incorporates node-level masking and subgraph sampling to overcome the computational bottleneck imposed by large node counts. Extensive experiments show that VisiFold not only drastically reduces resource consumption but also outperforms existing baselines in long-term forecasting tasks. Remarkably, even with a high mask ratio of 80%, VisiFold maintains its performance advantage. By effectively breaking the resource constraints in both temporal and spatial dimensions, our work paves the way for more realistic long-term traffic forecasting. The code is available at https://github.com/PlanckChang/VisiFold.

关键词: Traffic Forecasting, Long-term Forecasting, Temporal Folding Graph, Node Visibility, Spatial-temporal Dependencies, Computational Efficiency, Graph Neural Networks, Intelligent Transportation Systems
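The abstract mentions node-level masking and subgraph sampling with a mask ratio as high as 80%, but does not say how nodes are chosen. The sketch below assumes uniform random sampling of visible nodes and an induced subgraph over them; the function names are hypothetical and this is not the VisiFold implementation.

```python
import random


def sample_visible_nodes(num_nodes: int, mask_ratio: float, seed=None):
    """Pick the nodes that stay visible; the rest are masked out.

    Assumption: nodes are masked uniformly at random.
    """
    rng = random.Random(seed)
    num_visible = max(1, round(num_nodes * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_nodes), num_visible))


def induced_subgraph(edges, visible):
    """Keep only edges whose two endpoints are both visible."""
    keep = set(visible)
    return [(u, v) for (u, v) in edges if u in keep and v in keep]


# Example: a 10-node ring graph with an 80% mask ratio leaves 2 visible nodes.
ring_edges = [(i, (i + 1) % 10) for i in range(10)]
visible = sample_visible_nodes(num_nodes=10, mask_ratio=0.8, seed=0)
print(visible, induced_subgraph(ring_edges, visible))
```

With only a fifth of the nodes visible, the per-step graph computation shrinks accordingly, which is the kind of saving the abstract attributes to the node visibility mechanism.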

89. ❌ RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset

Authors: Yongzhong Wang, Keyu Zhu, Yong Zhong, Liqiong Wang, Jinyu Yang, Feng Zheng Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11811v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

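Scoring rationale: RADAR is an autonomous closed-loop system for robotic data generation, built around vision-language models (VLMs) for robot task planning, evaluation, and reset. The only keyword with some relevance is 'In-context Learning OR Many-shot Learning', because the paper explicitly uses 'in-context imitation learning' for policy learning; this falls under in-context learning, but it is imitation learning in robotics rather than in-context learning in large language models (LLMs). The other keywords target LLM technical principles, training methods, inference optimization, alignment, agent systems, and so on. Although the paper uses a VLM, its focus is the engineering system for autonomous robotic data collection (semantic planning, a graph neural network policy, finite-state-machine reset); it does not involve core LLM innovations (such as MoE, scaling laws, RLHF, or RAG) or general AI-for-science applications. Therefore only 'In-context Learning' receives 5 points (some relevance), and the rest receive 0 (unrelated).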

!!! tip deepseek-chat TL;DR

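The paper addresses the high cost and poor scalability of collecting large-scale physical interaction data for robot learning. It introduces RADAR, a fully autonomous closed-loop data generation engine that uses a vision-language model for semantic task planning and evaluation, combined with a graph neural network policy and autonomous environment reset. In simulation it reaches success rates of up to 90% on complex long-horizon tasks, and in real-world deployments it executes diverse contact-rich skills via few-shot adaptation.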

摘要翻译

大规模物理交互数据的获取作为现代机器人学习的关键前提，正因人在回路采集模式的高昂成本与可扩展性限制而遭遇严重瓶颈。为突破此障碍，我们提出了机器人鲁棒自主数据采集系统（RADAR），这是一个完全自主、闭环的数据生成引擎，彻底将人类干预从采集循环中移除。RADAR将认知负载巧妙分解为四模块流程：以2-5个三维人体演示作为几何先验锚点，视觉语言模型首先通过精确的语义物体定位与技能检索，生成与场景相关的任务序列；随后，图神经网络策略通过上下文模仿学习将这些子任务转化为物理动作；执行完成后，视觉语言模型通过结构化视觉问答流程进行自动化成功评估；最后，为打破人工重置的瓶颈，有限状态机协调自主环境重置与非对称数据路由机制。该系统通过严格遵循后进先出因果序列的正向-逆向同步规划，能够无缝恢复非结构化工作空间，并从执行失败中稳健恢复。这种持续的大脑-小脑协同将数据采集转化为自我维持的过程。大量实验验证突显了RADAR的卓越泛化能力：在仿真环境中，本框架在复杂长周期任务上达成高达90%的成功率，轻松解决了传统基线方法性能骤降至接近零的挑战；在真实世界部署中，系统通过少量样本适应即可可靠执行多样化、高接触度的技能（如可变形物体操作），且无需领域特定微调，为机器人数据采集提供了高度可扩展的范式。

摘要 (Abstract)

The acquisition of large-scale physical interaction data, a critical prerequisite for modern robot learning, is severely bottlenecked by the prohibitive cost and scalability limits of human-in-the-loop collection paradigms. To break this barrier, we introduce Robust Autonomous Data Acquisition for Robotics (RADAR), a fully autonomous, closed-loop data generation engine that completely removes human intervention from the collection cycle. RADAR elegantly divides the cognitive load into a four-module pipeline. Anchored by 2-5 3D human demonstrations as geometric priors, a Vision-Language Model first orchestrates scene-relevant task generation via precise semantic object grounding and skill retrieval. Next, a Graph Neural Network policy translates these subtasks into physical actions via in-context imitation learning. Following execution, the VLM performs automated success evaluation using a structured Visual Question Answering pipeline. Finally, to shatter the bottleneck of manual resets, a Finite State Machine orchestrates an autonomous environment reset and asymmetric data routing mechanism. Driven by simultaneous forward-reverse planning with a strict Last-In, First-Out causal sequence, the system seamlessly restores unstructured workspaces and robustly recovers from execution failures. This continuous brain-cerebellum synergy transforms data collection into a self-sustaining process. Extensive evaluations highlight RADAR’s exceptional versatility. In simulation, our framework achieves up to 90% success rates on complex, long-horizon tasks, effortlessly solving challenges where traditional baselines plummet to near-zero performance. In real-world deployments, the system reliably executes diverse, contact-rich skills (e.g., deformable object manipulation) via few-shot adaptation without domain-specific fine-tuning, providing a highly scalable paradigm for robotic data acquisition.

关键词: robotic data generation, autonomous data acquisition, vision-language model, in-context imitation learning, closed-loop system, semantic planning, environment reset, long-horizon tasks
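The abstract describes environment reset as following a strict Last-In, First-Out causal sequence. One way to picture this is a stack of executed actions that is unwound with inverse actions. The action names, the inverse table, and the `ResetPlanner` class below are hypothetical and only illustrate the ordering idea, not RADAR's actual planner.

```python
# Hypothetical inverse-action table; names are illustrative only.
INVERSE = {
    "open_drawer": "close_drawer",
    "pick_block": "place_block",
    "place_block": "pick_block",
}


class ResetPlanner:
    """Records forward actions and derives a reset plan in LIFO order."""

    def __init__(self):
        self._stack = []

    def record(self, action: str) -> None:
        """Push an executed forward action onto the stack."""
        self._stack.append(action)

    def reset_plan(self):
        """Pop actions in reverse order and map each to its inverse."""
        plan = []
        while self._stack:
            plan.append(INVERSE[self._stack.pop()])
        return plan


planner = ResetPlanner()
for action in ["open_drawer", "pick_block", "place_block"]:
    planner.record(action)
print(planner.reset_plan())
```

Undoing the most recent action first respects causal dependencies: the block has to be taken out of the drawer before the drawer is closed again.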

90. ❌ A Semi-Decentralized Approach to Multiagent Control

Authors: Mahdi Al-Husseini, Mykel J. Kochenderfer, Kyle H. Wray Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11802v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

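Scoring rationale: The paper focuses on a semi-decentralized control framework and algorithms for multi-agent systems, in particular the SDec-POMDP model and the RS-SDA* algorithm, and is highly relevant to 'Multi-agent Systems OR Agent Coordination' (10 points). The other keywords concern large models, deep learning technical principles, or specific AI application areas, whereas this is classical multi-agent control theory with no large-model, deep learning, or AI-for-Science content, so they all score 0.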

!!! tip deepseek-chat TL;DR

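The paper proposes a semi-decentralized control framework (SDec-POMDP) and an exact algorithm (RS-SDA*) for generating optimal policies in cooperative multi-agent systems with communication uncertainty.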

摘要翻译

我们提出了一种具有表现力的框架及算法，用于在通信不确定环境下实现协作智能体的半分散控制。半马尔可夫控制允许智能体动作的时间分布，而半马尔可夫通信（或称半分散化）则定义了智能体在其历史记录中存储动作与观测的时间分布。我们将半分散化扩展至部分可观测马尔可夫决策过程（POMDP），由此形成的SDec-POMDP统一了分散式与多智能体POMDP以及多种现有显式通信机制。本文提出了递归小步半分散A*算法（RS-SDA*），这是一种生成最优SDec-POMDP策略的精确算法。我们在多个标准测试的半分散化版本及海上医疗后送场景中对RS-SDA*进行了评估。本研究为通过半分散化视角探索多类多智能体通信问题奠定了明确的理论基础。

摘要 (Abstract)

We introduce an expressive framework and algorithms for the semi-decentralized control of cooperative agents in environments with communication uncertainty. Whereas semi-Markov control admits a distribution over time for agent actions, semi-Markov communication, or what we refer to as semi-decentralization, gives a distribution over time for what actions and observations agents can store in their histories. We extend semi-decentralization to the partially observable Markov decision process (POMDP). The resulting SDec-POMDP unifies decentralized and multiagent POMDPs and several existing explicit communication mechanisms. We present recursive small-step semi-decentralized A* (RS-SDA*), an exact algorithm for generating optimal SDec-POMDP policies. RS-SDA* is evaluated on semi-decentralized versions of several standard benchmarks and a maritime medical evacuation scenario. This paper provides a well-defined theoretical foundation for exploring many classes of multiagent communication problems through the lens of semi-decentralization.

关键词: semi-decentralized control, multiagent systems, POMDP, communication uncertainty, optimal policy generation, RS-SDA* algorithm, cooperative agents

91. ❌ DocSage: An Information Structuring Agent for Multi-Doc Multi-Entity Question Answering

Authors: Teng Lin, Yizhang Zhu, Zhengxuan Zhang, Yuyu Luo, Nan Tang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11798v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

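Scoring rationale: The paper proposes the DocSage framework for multi-document multi-entity question answering. It directly involves LLMs and RAG (highly relevant, 10 points each) and adopts an agentic framework design (LLM Agents, 10 points). It discusses the limitations of long-context LLMs (Context Window Extension, 5 points), involves multi-step and in-depth reasoning (Chain of Thought and System 2 Thinking, 5 points each), includes an error-aware correction mechanism (Self-Correction, 5 points), uses SQL as a tool (Tool Use, 5 points), and aims to improve factual accuracy (Hallucination Mitigation, 5 points). Other keywords, such as MoE, SLMs, training methods, and compression or acceleration, are not covered (0 points).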

!!! tip deepseek-chat TL;DR

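Existing LLMs and RAG frameworks for multi-document multi-entity question answering retrieve imprecisely and lack schema awareness, which leads to inadequate evidence chain construction. To address this, the paper proposes the DocSage agentic framework. Through dynamic schema discovery, structured information extraction, and schema-aware relational reasoning, it significantly outperforms existing methods on benchmarks, with accuracy improvements of more than 27%.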

摘要翻译

多文档多实体问答任务本质上要求模型能够追踪分散文档中多个实体间的隐含逻辑关系。然而，现有的大语言模型与检索增强生成框架存在关键局限：标准检索增强生成基于向量相似度的粗粒度检索常遗漏关键事实，基于图结构的检索增强生成则难以高效整合碎片化的复杂关系网络，且两者均缺乏模式感知能力，导致跨文档证据链构建不充分与实体关系推理不准确。为应对这些挑战，我们提出DocSage——一个集成动态模式发现、结构化信息提取与具备错误保证的模式感知关系推理的端到端智能体框架。DocSage通过三个核心模块运作：（1）模式发现模块动态推断查询特定的最小可连接模式，以捕获核心实体与关系；（2）提取模块将非结构化文本转化为语义连贯的关系型数据表，并通过错误感知校正机制减少提取误差；（3）推理模块在结构化数据表上进行多跳关系推理，利用模式感知能力高效对齐跨文档实体并聚合证据。该智能体设计具备三大优势：通过SQL驱动的索引实现精准事实定位、借助关系型数据表天然支持跨文档实体连接、以及通过结构化表征缓解大语言模型的注意力扩散问题。在两个多文档多实体问答基准测试上的评估表明，DocSage显著优于现有最优的长上下文大语言模型与检索增强生成系统，分别实现了超过27%的准确率提升。

摘要 (Abstract)

Multi-document Multi-entity Question Answering inherently demands models to track implicit logic between multiple entities across scattered documents. However, existing Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks suffer from critical limitations: standard RAG’s vector similarity-based coarse-grained retrieval often omits critical facts, graph-based RAG fails to efficiently integrate fragmented complex relationship networks, and both lack schema awareness, leading to inadequate cross-document evidence chain construction and inaccurate entity relationship deduction. To address these challenges, we propose DocSage, an end-to-end agentic framework that integrates dynamic schema discovery, structured information extraction, and schema-aware relational reasoning with error guarantees. DocSage operates through three core modules: (1) A schema discovery module dynamically infers query-specific minimal joinable schemas to capture essential entities and relationships; (2) An extraction module transforms unstructured text into semantically coherent relational tables, enhanced by error-aware correction mechanisms to reduce extraction errors; (3) A reasoning module performs multi-hop relational reasoning over structured tables, leveraging schema awareness to efficiently align cross-document entities and aggregate evidence. This agentic design offers three key advantages: precise fact localization via SQL-powered indexing, natural support for cross-document entity joins through relational tables, and mitigated LLM attention diffusion via structured representation. Evaluations on two MDMEQA benchmarks demonstrate that DocSage significantly outperforms state-of-the-art long-context LLMs and RAG systems, achieving more than 27% accuracy improvements respectively.

关键词: Multi-document Multi-entity Question Answering, Large Language Models, Retrieval-Augmented Generation, Agentic Framework, Schema Discovery, Relational Reasoning, Structured Information Extraction, Cross-document Evidence
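The abstract credits relational tables with natural support for cross-document entity joins and SQL-powered fact localization. A minimal sketch of that idea using Python's built-in `sqlite3` module is shown below. The table names, columns, and rows are hypothetical stand-ins for facts extracted from different documents; this is not DocSage's actual schema or pipeline.

```python
import sqlite3

# In-memory database holding facts "extracted" from four hypothetical documents.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employment (person TEXT, company TEXT, source_doc TEXT);
    CREATE TABLE headquarters (company TEXT, city TEXT, source_doc TEXT);
    """
)
conn.executemany(
    "INSERT INTO employment VALUES (?, ?, ?)",
    [("Alice", "Acme", "doc_1"), ("Bob", "Globex", "doc_2")],
)
conn.executemany(
    "INSERT INTO headquarters VALUES (?, ?, ?)",
    [("Acme", "Berlin", "doc_3"), ("Globex", "Osaka", "doc_4")],
)

# Two-hop question: in which city is each person's employer headquartered?
# The join aligns the shared entity (company) across documents and keeps
# the source of each fact as evidence.
rows = conn.execute(
    """
    SELECT e.person, h.city, e.source_doc, h.source_doc
    FROM employment AS e
    JOIN headquarters AS h ON e.company = h.company
    ORDER BY e.person
    """
).fetchall()
for row in rows:
    print(row)
conn.close()
```

Keeping a `source_doc` column on every table lets each answer carry its evidence chain, which plain vector-similarity retrieval does not provide on its own.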

92. ❌ Locating Demographic Bias at the Attention-Head Level in CLIP’s Vision Encoder

Authors: Alaa Yasser, Kittipat Phunjanna, Marcos Escudero Viñolo, Catarina Barata, Jenny Benois-Pineau Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11793v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

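Scoring rationale: The paper uses mechanistic interpretability to localize demographic bias in CLIP's vision encoder. It is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10 points), because its core method is a mechanistic interpretability audit, and it is related to 'Large Language Models OR LLMs OR Foundation Models' (8 points), because CLIP is a multimodal foundation model. Other keywords, such as MoE, SFT, RAG, and quantization, are not covered and score 0.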

!!! tip deepseek-chat TL;DR

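The paper proposes a mechanistic interpretability audit for locating demographic bias at the attention-head level in CLIP's vision encoder. It finds that gender bias is easier to localize than age bias, and that ablating specific attention heads reduces bias while slightly improving accuracy.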

摘要翻译

对基础模型的标准公平性审计仅能量化模型存在偏见，但无法确定偏见在网络中的具体位置。我们提出一种机制公平性审计方法，该方法结合了投影残差流分解、零样本概念激活向量和偏见增强的文本片段分析，以在视觉Transformer的单个注意力头层面定位人口统计偏见。作为可行性案例研究，我们将此流程应用于CLIP ViT-L-14编码器，在FACET基准的42个职业类别上审计性别和年龄偏见。对于性别偏见，该流程识别出四个末端层注意力头，对其进行消融后全局偏见降低（克拉默V值：0.381 -> 0.362），同时准确率略有提升（+0.42%）；层匹配的随机对照实验证实该效果是所识别头部特有的。最终层的一个单独注意力头对刻板印象最严重类别的偏见减少贡献最大，类别级分析显示修正后的预测更趋向于正确职业。对于年龄偏见，同一流程识别出候选注意力头，但消融产生的效果较弱且一致性较低，表明在该模型中年龄偏见的编码方式比性别偏见更为分散。这些结果为判别式视觉编码器可实现头部级偏见定位提供了初步证据，并且不同受保护属性的可定位程度可能存在差异。关键词：偏见、CLIP、机制可解释性、视觉Transformer、公平性

摘要 (Abstract)

Standard fairness audits of foundation models quantify that a model is biased, but not where inside the network the bias resides. We propose a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers. As a feasibility case study, we apply this pipeline to the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark, auditing both gender and age bias. For gender, the pipeline identifies four terminal-layer heads whose ablation reduces global bias (Cramer’s V: 0.381 -> 0.362) while marginally improving accuracy (+0.42%); a layer-matched random control confirms that this effect is specific to the identified heads. A single head in the final layer contributes to the majority of the reduction in the most stereotyped classes, and class-level analysis shows that corrected predictions shift toward the correct occupation. For age, the same pipeline identifies candidate heads, but ablation produces weaker and less consistent effects, suggesting that age bias is encoded more diffusely than gender bias in this model. These results provide preliminary evidence that head-level bias localisation is feasible for discriminative vision encoders and that the degree of localisability may vary across protected attributes. Keywords: Bias, CLIP, Mechanistic Interpretability, Vision Transformer, Fairness.

关键词: Demographic Bias, CLIP, Mechanistic Interpretability, Vision Transformer, Attention Heads, Fairness Audit, Bias Localization, FACET Benchmark
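The abstract reports global bias as Cramér's V, a measure of association between two categorical variables (here, presumably predicted class and demographic attribute). A minimal sketch of the standard, uncorrected statistic computed from a contingency table follows; the example tables are hypothetical and whether the paper applies a bias correction is not stated in the abstract.

```python
import math


def cramers_v(table) -> float:
    """Cramér's V for a contingency table given as a list of rows of counts.

    Uses the standard form V = sqrt(chi2 / (n * (min(r, c) - 1)))
    without any bias correction.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k <= 0:
        return 0.0

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


# Hypothetical tables: rows = predicted class, columns = demographic group.
print(cramers_v([[10, 0], [0, 10]]))  # perfectly associated
print(cramers_v([[5, 5], [5, 5]]))    # no association
```

Values near 0 indicate that predictions are nearly independent of the demographic attribute, so a drop such as the one reported after head ablation corresponds to a weaker association.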

93. ❌ HELM: Hierarchical and Explicit Label Modeling with Graph Learning for Multi-Label Image Classification

Authors: Marjan Stoimchev, Boshko Koloski, Jurica Levatić, Dragi Kocev, Sašo Džeroski Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11783v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

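Scoring rationale: The paper focuses on multi-label image classification in computer vision, specifically remote sensing image analysis, using techniques such as Vision Transformers and GCNs. All of the keywords relate to large language models (LLMs), whereas the paper does not touch on LLMs, innovations in deep learning principles, or applications of large models in other domains. The only keyword that may be relevant is 'AI for Science OR Bioinformatics OR Cheminformatics', since remote sensing can be regarded as a scientific application; however, the paper does not explicitly mention bioinformatics or cheminformatics, and its core is a vision task rather than large models, so it receives only 5 points (some relevance). The remaining keywords, which concern LLM techniques, training methods, inference optimization, agent systems, and so on, are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes HELM, a new framework for multi-label classification over multi-path hierarchies in remote sensing imagery. By combining a Vision Transformer, graph convolutional networks, and self-supervised learning, it achieves state-of-the-art performance on several datasets, with particular strength in low-label scenarios.

Abstract (Chinese translation)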

层次多标签分类（HMLC）对于遥感领域中复杂标签依赖关系的建模至关重要。然而，现有方法在处理实例属于多个分支的多路径层次结构时存在困难，并且很少利用未标注数据。我们提出了HELM（层次化与显式标签建模）这一新颖框架，以克服这些局限。HELM：（i）在视觉Transformer中使用层次特定的类别标记，以捕捉细微的标签交互；（ii）采用图卷积网络显式编码层次结构，并生成层次感知的嵌入表示；（iii）集成一个自监督分支，以有效利用未标注的遥感影像。我们在四个遥感图像数据集（UCM、AID、DFC-15、MLRSNet）上进行了全面评估。HELM取得了最先进的性能，在监督和半监督设置下均持续超越强基线模型，尤其在低标注数据场景中表现出显著优势。

Abstract

Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (Hierarchical and Explicit Label Modeling), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.

Keywords: Hierarchical multi-label classification, Vision Transformer, Graph convolutional networks, Remote sensing images, Self-supervised learning, Multi-path hierarchies, Low-label scenarios, State-of-the-art performance
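
Point (ii) of the abstract describes encoding the label hierarchy with graph convolutional networks. As a rough illustration of what one graph-convolution step over a label graph looks like, the sketch below applies the commonly used symmetric normalisation with self-loops followed by a ReLU. The three-label hierarchy, the one-hot features, and the identity weight matrix are all hypothetical, and this is not claimed to be HELM's actual architecture.

```python
import math

def normalized_adjacency(edges, n):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph on n nodes."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]

def gcn_layer(a_hat, h, w):
    """One graph-convolution step: ReLU(a_hat @ h @ w)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

# Hypothetical multi-path hierarchy: label 2 has two parents (labels 0 and 1).
edges = [(0, 2), (1, 2)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

a_hat = normalized_adjacency(edges, 3)
out = gcn_layer(a_hat, identity, identity)  # one-hot features, identity weights
```

With one-hot inputs and identity weights the output equals the normalised adjacency itself, so each label's embedding mixes in exactly its neighbours in the hierarchy, including both parents of the multi-path label.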

94. ❌ From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts

Author: Sunil Prakash Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11781v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

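Scoring rationale: The paper's core contribution is a structured collective reasoning framework (DCI) for multi-agent LLM systems. It is highly relevant to LLM Agents and Multi-agent Systems (10 points), involves System 2 Thinking and Chain of Thought reasoning mechanisms (10 points), uses an LLM as the base model (10 points), and includes elements of self-reflection (5 points). Other keywords such as MoE, SFT, and RAG are not covered and score 0.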
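
The arithmetic behind the score table can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight) comes from the report header; the per-keyword rule `score = weight * relevance` is an assumption made only for illustration, since the report does not state how weight and relevance are combined. The function names are hypothetical.

```python
def pass_mark(total_weight, base=5.0, slope=0.8):
    """Pass mark as given in the report header: base + slope * total weight."""
    return base + slope * total_weight

def total_score(rows):
    """Sum per-keyword scores; ASSUMES score = weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)

# Nonzero relevance values listed in this entry's table (all other rows are 0).
relevances = [10.0, 10.0, 10.0, 5.0, 10.0, 10.0]

as_listed = total_score((0.0, r) for r in relevances)          # weights as in the table
with_unit_weights = total_score((1.0, r) for r in relevances)  # weights as in the header
threshold = pass_mark(27.0)
```

Under this assumed rule, the 0.0 weights listed in the table force a total of 0.0 regardless of relevance, whereas the 1.0 weights listed in the report header would give a different total, so the weight column is worth checking when reading the scores.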

!!! tip deepseek-chat TL;DR

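The paper addresses the lack of a structured deliberation process in multi-agent LLM systems on complex reasoning tasks. It proposes the Deliberative Collective Intelligence (DCI) framework, which defines reasoning archetypes, typed epistemic acts, and a convergence algorithm, and significantly improves decision quality on non-routine tasks relative to unstructured debate, at the cost of much higher compute.

Abstract (Chinese translation)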

多智能体大语言模型系统日益处理复杂推理任务，但其交互模式仍局限于投票、非结构化辩论或流水线协调。现有系统均未模拟审议过程：一种分阶段进行的流程，其中差异化的参与者交换类型化的推理行为，保留分歧意见，最终形成可问责的共识结果。我们提出审议式集体智能框架，该框架明确定义了四种推理原型、14种类型化认知行为、一个共享工作空间以及DCI-CF——一种保证终止的收敛流程算法，其输出为包含选定方案、保留异议、少数派报告及重启条件的结构化决策包。我们使用Gemini 2.5 Flash模型在七个领域的45项任务上进行评估。在非常规任务中，DCI较非结构化辩论取得显著提升。在需要视角整合的隐藏信息任务中，DCI表现卓越，而在常规决策任务中表现欠佳，证实了其任务依赖性。DCI能生成100%的结构化决策包和98%的少数派报告，这些成果在所有基线系统中均未出现。然而，DCI消耗的token量约为单智能体的62倍，且单智能体生成在整体质量上优于DCI。DCI的贡献不在于证明更多智能体必然更好，而在于表明：当流程可问责性能够证明成本合理时，重大决策将受益于审议式结构。

Abstract

Multi-agent LLM systems increasingly tackle complex reasoning, yet their interaction patterns remain limited to voting, unstructured debate, or pipeline orchestration. None model deliberation: a phased process where differentiated participants exchange typed reasoning moves, preserve disagreements, and converge on accountable outcomes. We introduce Deliberative Collective Intelligence (DCI), specifying four reasoning archetypes, 14 typed epistemic acts, a shared workspace, and DCI-CF, a convergent flow algorithm that guarantees termination with a structured decision packet containing the selected option, residual objections, minority report, and reopen conditions. We evaluate on 45 tasks across seven domains using Gemini 2.5 Flash. On non-routine tasks (n=40), DCI significantly improves over unstructured debate (+0.95, 95% CI [+0.41, +1.54]). DCI excels on hidden-profile tasks requiring perspective integration (9.56, highest of any system on any domain) while failing on routine decisions (5.39), confirming task-dependence. DCI produces 100% structured decision packets and 98% minority reports, artifacts absent from all baselines. However, DCI consumes ~62x single-agent tokens, and single-agent generation outperforms DCI on overall quality. DCI’s contribution is not that more agents are better, but that consequential decisions benefit from deliberative structure when process accountability justifies the cost.

Keywords: Multi-agent LLM systems, Collective reasoning, Deliberative Collective Intelligence, Typed epistemic acts, Convergent flow algorithm, Structured decision packet, Perspective integration, Process accountability
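
The abstract lists four components of the structured decision packet: the selected option, residual objections, a minority report, and reopen conditions. One way to picture such a packet is as a small record type. The class below is a hypothetical sketch: the field names mirror the abstract, but the types, defaults, and the `is_contested` helper are assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DecisionPacket:
    """Hypothetical container for the four artifacts named in the abstract."""
    selected_option: str
    residual_objections: List[str] = field(default_factory=list)
    minority_report: Optional[str] = None
    reopen_conditions: List[str] = field(default_factory=list)

    def is_contested(self) -> bool:
        # A packet is contested if any disagreement was preserved.
        return bool(self.residual_objections) or self.minority_report is not None

packet = DecisionPacket(
    selected_option="option A",
    residual_objections=["cost estimate unverified"],
    minority_report="one participant prefers option B",
    reopen_conditions=["new cost data becomes available"],
)
unanimous = DecisionPacket(selected_option="option C")
```

Keeping objections and reopen conditions as explicit fields, rather than discarding them once a choice is made, is what lets a reader audit how the outcome was reached.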

95. ❌ An Automatic Text Classification Method Based on Hierarchical Taxonomies, Neural Networks and Document Embedding: The NETHIC Tool

Authors: Luigi Lomasto, Rosario Di Florio, Andrea Ciapetti, Giuseppe Miscione, Giulia Ruggiero, Daniele Toti Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11770v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

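Scoring rationale: The paper describes NETHIC, an automatic text classification tool based on hierarchical taxonomies, neural networks, and document embeddings. It mainly involves conventional neural networks (it does not say whether these are large models) and a text classification task. All of the scoring keywords target large models (LLMs) and related techniques (such as MoE, RLHF, and RAG), reasoning methods (CoT, MCTS), optimization techniques (quantization, attention mechanisms), or specific application domains (AI for Science). The paper mentions no large models, large-model principles, or scientific applications, and covers only generic neural text classification, so it is unrelated to every keyword and scores 0 throughout.

!!! tip deepseek-chat TL;DR

The paper presents NETHIC, an automatic text classification tool that combines hierarchical taxonomies, neural networks, and a document embedding mechanism, and reports effective and efficient classification on both a generic and a domain-specific corpus.

Abstract (Chinese translation)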

本研究描述了一种名为NETHIC的软件工具所实现的自动文本分类方法，该方法充分利用了高可扩展性神经网络的内在能力与层次分类法的表达优势。因此，NETHIC成功构建了一种文本分类机制，该机制被证明具有显著的有效性与高效性。该工具已在通用语料库和领域特定语料库上进行了实验测试，并输出了具有前景的结果。基于此实验，NETHIC目前已通过添加文档嵌入机制得到进一步优化与扩展，该机制在单个网络及整体层次模型的性能方面均显示出提升效果。

Abstract

This work describes an automatic text classification method implemented in a software tool called NETHIC, which takes advantage of the inner capabilities of highly-scalable neural networks combined with the expressiveness of hierarchical taxonomies. As such, NETHIC succeeds in bringing about a mechanism for text classification that proves to be significantly effective as well as efficient. The tool had undergone an experimentation process against both a generic and a domain-specific corpus, outputting promising results. On the basis of this experimentation, NETHIC has been now further refined and extended by adding a document embedding mechanism, which has shown improvements in terms of performance on the individual networks and on the whole hierarchical model.

Keywords: automatic text classification, hierarchical taxonomies, neural networks, document embedding, NETHIC tool, text classification method, domain-specific corpus, highly-scalable neural networks

96. ❌ Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework

Authors: Chingkwun Lam, Jiaxin Li, Lingfei Zhang, Kuo Zhao Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11768v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

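Scoring rationale: The paper focuses on governing long-term memory in LLM agents and is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points), since it explicitly studies the memory systems of LLM agents. Other keywords, such as MoE, SLMs, training methods, reasoning techniques, compression techniques, and AI-for-science applications, are not mentioned in the title or abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

The paper addresses governance risks, semantic drift, and privacy vulnerabilities in the long-term memory systems of LLM agents operating in dynamic environments, and proposes the Stability and Safety-Governed Memory (SSGM) framework, a conceptual governance architecture intended to keep such memory systems safe and reliable.

Abstract (Chinese translation)

长期记忆已成为自主大型语言模型（LLM）智能体的基础组件，能够支持持续适应、终身多模态学习和复杂推理。然而，随着记忆系统从静态检索数据库转向动态的、具有自主性的机制，关于记忆治理、语义漂移和隐私漏洞的关键问题日益凸显。尽管近期的综述广泛关注记忆检索效率，但它们大多忽视了在高度动态环境中记忆污染的新兴风险。为应对这些新兴挑战，我们提出了稳定性与安全治理记忆（SSGM）框架，这是一种概念性治理架构。SSGM通过在任何记忆固化之前实施一致性验证、时间衰减建模和动态访问控制，将记忆演化与执行过程解耦。通过形式化分析和架构分解，我们展示了SSGM如何缓解由拓扑结构导致的知识泄露（即敏感上下文被固化至长期存储），并有助于防止语义漂移（即知识通过迭代摘要而退化）。最终，本研究对记忆污染风险进行了全面分类，并为部署安全、持久且可靠的自主记忆系统建立了一个稳健的治理范式。

Abstract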

Long-term memory has emerged as a foundational component of autonomous Large Language Model (LLM) agents, enabling continuous adaptation, lifelong multimodal learning, and sophisticated reasoning. However, as memory systems transition from static retrieval databases to dynamic, agentic mechanisms, critical concerns regarding memory governance, semantic drift, and privacy vulnerabilities have surfaced. While recent surveys have focused extensively on memory retrieval efficiency, they largely overlook the emergent risks of memory corruption in highly dynamic environments. To address these emerging challenges, we propose the Stability and Safety-Governed Memory (SSGM) framework, a conceptual governance architecture. SSGM decouples memory evolution from execution by enforcing consistency verification, temporal decay modeling, and dynamic access control prior to any memory consolidation. Through formal analysis and architectural decomposition, we show how SSGM can mitigate topology-induced knowledge leakage where sensitive contexts are solidified into long-term storage, and help prevent semantic drift where knowledge degrades through iterative summarization. Ultimately, this work provides a comprehensive taxonomy of memory corruption risks and establishes a robust governance paradigm for deploying safe, persistent, and reliable agentic memory systems.

Keywords: LLM agents, long-term memory, memory governance, semantic drift, privacy vulnerabilities, SSGM framework, memory corruption, autonomous agents
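
The abstract names three checks that run before any memory consolidation: consistency verification, temporal decay modeling, and dynamic access control. The sketch below shows one way such a gate could be wired together. Everything in it is hypothetical: the `Candidate` record, the half-life decay, the equality-based consistency rule, and the thresholds are illustrative choices, since SSGM is described as a conceptual architecture and the abstract gives no concrete rules.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Hypothetical memory item waiting to be consolidated."""
    key: str
    value: str
    age_hours: float
    sensitive: bool

def decay_weight(age_hours, half_life_hours=24.0):
    """Exponential temporal decay: weight halves every half_life_hours."""
    return 0.5 ** (age_hours / half_life_hours)

def may_consolidate(candidate, store, writer_cleared, min_weight=0.25):
    # Access control: sensitive items need an explicitly cleared writer.
    if candidate.sensitive and not writer_cleared:
        return False
    # Consistency: reject writes that contradict an already stored value.
    existing = store.get(candidate.key)
    if existing is not None and existing != candidate.value:
        return False
    # Temporal decay: stale observations are not promoted to long-term memory.
    return decay_weight(candidate.age_hours) >= min_weight

store = {"user.city": "Oslo"}
fresh = Candidate("user.language", "Norwegian", 1.0, False)
conflicting = Candidate("user.city", "Bergen", 1.0, False)
stale = Candidate("user.pet", "cat", 72.0, False)
private = Candidate("user.id_number", "redacted", 1.0, True)
```

Running all three checks before the write, rather than after, is the point of the design: an item that fails any check never reaches long-term storage, so it cannot later be summarised or leaked.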

97. ❌ Understanding Wikidata Qualifiers: An Analysis and Taxonomy

Authors: Gilles Falquet, Sahar Aljalbout Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11767v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

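Scoring rationale: The paper focuses on the semantic analysis, classification, and practical use of qualifiers in a knowledge graph (Wikidata), and belongs to knowledge representation and information retrieval. Its content does not involve large models, deep learning, AI principles, or AI-for-science applications, and is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

The paper analyzes the semantics and actual usage of Wikidata qualifiers and develops a taxonomy to address the challenges of selecting appropriate qualifiers, querying the graph, and making logical inferences.

Abstract (Chinese translation)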

本文对维基数据（Wikidata）限定符（qualifiers）进行了深入分析，重点探讨其语义与实际使用情况，旨在构建一种分类法，以应对选择合适限定符、查询知识图谱以及进行逻辑推理所面临的挑战。研究基于频率和多样性评估限定符的重要性，采用修正的香农熵指数以考量“长尾”现象。通过分析维基数据转储文件，研究筛选出前300个限定符，并将其归类为一个精细化的分类体系，该体系包含语境类、认知/不确定性类、结构类以及附加类限定符。此分类法旨在指导贡献者创建和查询陈述（statements），改进限定符推荐系统，并优化知识图谱设计方法。结果表明，该分类法有效涵盖了最重要的限定符，并为理解和利用维基数据中的限定符提供了一种结构化方法。

Abstract

This paper presents an in-depth analysis of Wikidata qualifiers, focusing on their semantics and actual usage, with the aim of developing a taxonomy that addresses the challenges of selecting appropriate qualifiers, querying the graph, and making logical inferences. The study evaluates qualifier importance based on frequency and diversity, using a modified Shannon entropy index to account for the “long tail” phenomenon. By analyzing a Wikidata dump, the top 300 qualifiers were selected and categorized into a refined taxonomy that includes contextual, epistemic/uncertainty, structural, and additional qualifiers. The taxonomy aims to guide contributors in creating and querying statements, improve qualifier recommendation systems, and enhance knowledge graph design methodologies. The results show that the taxonomy effectively covers the most important qualifiers and provides a structured approach to understanding and utilizing qualifiers in Wikidata.

Keywords: Wikidata, qualifiers, taxonomy, knowledge graph, semantic analysis, querying, logical inference, Shannon entropy
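
The abstract ranks qualifiers by frequency and diversity using a modified Shannon entropy index. The modification is not described in the abstract, so the sketch below uses the plain Shannon entropy as a stand-in diversity measure. The qualifier labels, the property placeholders, and the counts are hypothetical.

```python
import math

def shannon_entropy(counts):
    """Plain Shannon entropy in bits of a list of nonnegative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical usage: how often each qualifier appears with each property.
usage = {
    "qualifier_a": {"property_1": 50, "property_2": 50},  # spread over two properties
    "qualifier_b": {"property_3": 100},                   # tied to a single property
}

diversity = {name: shannon_entropy(by_property.values())
             for name, by_property in usage.items()}
```

Both qualifiers here occur 100 times, so frequency alone cannot separate them; the entropy term distinguishes the one used across several properties from the one tied to a single property.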

98. ❌ Anomaly detection in time-series via inductive biases in the latent space of conditional normalizing flows

Authors: David Baumgartner, Eliezer de Souza da Silva, Iñigo Urteaga Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11756v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

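Scoring rationale: The paper focuses on time-series anomaly detection using conditional normalizing flows and state-space models, which is conventional deep learning research in a specific application area. All of the scoring keywords concern large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization) or specific subfields of AI for Science (bioinformatics, cheminformatics). The paper mentions no large language models, foundation models, AI-for-science applications, or related techniques, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a time-series anomaly detection method based on conditional normalizing flows within a state-space framework. By defining anomalies in latent space and constraining how latent trajectories evolve, it reliably detects anomalies in frequency, amplitude, and observation noise, and provides interpretable diagnostics of model compliance.

Abstract (Chinese translation)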

多元时间序列异常检测中的深度生成模型通常通过最大化数据似然进行训练。然而，观测空间中的似然度量的是边缘密度而非对结构化时序动态的符合程度，因此可能为异常或分布外样本分配高概率。我们通过将异常概念重新定位到预设的潜在空间来解决这一结构局限性。我们在条件归一化流中引入显式归纳偏置，在离散时间状态空间框架内对时间序列观测进行建模，该框架约束潜在表征按照预设的时序动态演化。在此框架下，预期行为对应于遵循潜在轨迹的特定分布，而异常则被定义为对这些动态规律的违反。异常检测因此被简化为基于统计的符合性检验：观测值被映射到潜在空间，并通过拟合优度检验对照预设的潜在演化进行评估。这产生了一种原则性的决策规则，即使在观测似然较高的区域仍保持有效性。在合成与真实世界时间序列上的实验表明，该方法能可靠检测频率、振幅和观测噪声中的异常，同时提供模型符合度的可解释诊断。

Abstract

Deep generative models for anomaly detection in multivariate time-series are typically trained by maximizing data likelihood. However, likelihood in observation space measures marginal density rather than conformity to structured temporal dynamics, and therefore can assign high probability to anomalous or out-of-distribution samples. We address this structural limitation by relocating the notion of anomaly to a prescribed latent space. We introduce explicit inductive biases in conditional normalizing flows, modeling time-series observations within a discrete-time state-space framework that constrains latent representations to evolve according to prescribed temporal dynamics. Under this formulation, expected behavior corresponds to compliance with a specified distribution over latent trajectories, while anomalies are defined as violations of these dynamics. Anomaly detection is consequently reduced to a statistically grounded compliance test, such that observations are mapped to latent space and evaluated via goodness-of-fit tests against the prescribed latent evolution. This yields a principled decision rule that remains effective even in regions of high observation likelihood. Experiments on synthetic and real-world time-series demonstrate reliable detection of anomalies in frequency, amplitude, and observation noise, while providing interpretable diagnostics of model compliance.

Keywords: anomaly detection, time-series, conditional normalizing flows, latent space, state-space framework, temporal dynamics, goodness-of-fit tests, interpretable diagnostics
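
The abstract reduces anomaly detection to a goodness-of-fit test on latent variables. The sketch below illustrates that idea with a one-sample Kolmogorov-Smirnov test against a standard normal. It assumes the latents have already been produced by some trained flow, and it simplifies the prescribed latent dynamics to independent standard-normal draws; the paper's actual test and latent model are not specified in the abstract. The constant 1.36 is the usual large-sample coefficient for a 5% significance level.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(0.0, 1.0)

def ks_statistic(latents):
    """One-sample KS distance between the latents and a standard normal."""
    xs = sorted(latents)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = STANDARD_NORMAL.cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def violates_prior(latents, coefficient=1.36):
    """Flag a window whose latents are incompatible with the assumed prior."""
    return ks_statistic(latents) > coefficient / math.sqrt(len(latents))

# Hypothetical latent windows: one matching the prior, one shifted by an offset.
compliant = [STANDARD_NORMAL.inv_cdf((i + 0.5) / 50) for i in range(50)]
shifted = [z + 2.0 for z in compliant]
```

Because the decision depends on how well the latents match the prescribed distribution, not on the raw observation likelihood, a window can be flagged even when its observations look probable.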

Authors: Erfan Mirzaei, Seyed Pooya Shariatpanahi, Alireza Tavakoli, Reshad Hosseini, Majid Nili Ahmadabadi Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11757v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

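Scoring rationale: The paper studies reinforcement learning algorithms for multi-agent social learning and is unrelated to most of the large-model and deep-learning keywords. It has only some relevance to 'Multi-agent Systems OR Agent Coordination' (5 points), because it involves learning among multiple agents, but it focuses on classical reinforcement learning rather than agents driven by large models. No other keyword is covered, so the rest score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a free energy-based social bandit learning algorithm in which an agent observes other agents' actions, estimates their expertise, and integrates that information with its own experience. In societies containing expert or relevant non-expert agents, the algorithm converges to the optimal policy while maintaining logarithmic regret.

Abstract (Chinese translation)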

基于人工智能的个性化服务涉及大量个体强化学习智能体。然而，大多数强化学习算法侧重于利用个体学习，未能充分利用人类和动物普遍展现的社会学习能力。社会学习将个体经验与观察他人行为相结合，为提升学习效果提供了可能。本研究聚焦于一种社会多臂赌博机学习场景，其中社会智能体能够观察其他智能体的行为，但无法获知其奖励信息。各智能体独立遵循自身策略，并无明确相互教导的动机。我们提出了一种基于自由能的、在策略空间上进行的社会多臂赌博机学习算法。该算法使社会智能体能够在无需借助任何先知信息或社会规范的情况下，评估其他智能体的专业水平。据此，社会智能体将其在环境中的直接经验与他人的估计策略进行整合。我们证明了该算法在理论上能够收敛至最优策略。实证评估验证了我们的社会学习方法在多种场景下均优于其他替代方法。即使在存在随机或次优智能体的情况下，我们的算法也能策略性地识别出相关智能体，并巧妙地利用其行为信息。除了包含专家智能体的社会群体外，在存在相关但非专家的智能体时，我们的算法能显著提升个体学习性能，而大多数相关方法在此情况下均告失败。重要的是，该算法同时保持了对数级遗憾。

Abstract

Personalized AI-based services involve a population of individual reinforcement learning agents. However, most reinforcement learning algorithms focus on harnessing individual learning and fail to leverage the social learning capabilities commonly exhibited by humans and animals. Social learning integrates individual experience with observing others’ behavior, presenting opportunities for improved learning outcomes. In this study, we focus on a social bandit learning scenario where a social agent observes other agents’ actions without knowledge of their rewards. The agents independently pursue their own policy without explicit motivation to teach each other. We propose a free energy-based social bandit learning algorithm over the policy space, where the social agent evaluates others’ expertise levels without resorting to any oracle or social norms. Accordingly, the social agent integrates its direct experiences in the environment and others’ estimated policies. The theoretical convergence of our algorithm to the optimal policy is proven. Empirical evaluations validate the superiority of our social learning method over alternative approaches in various scenarios. Our algorithm strategically identifies the relevant agents, even in the presence of random or suboptimal agents, and skillfully exploits their behavioral information. In addition to societies including expert agents, in the presence of relevant but non-expert agents, our algorithm significantly enhances individual learning performance, where most related methods fail. Importantly, it also maintains logarithmic regret.

Keywords: social bandit learning, multi-agent systems, reinforcement learning, free energy, expertise estimation, policy integration, logarithmic regret, non-expert agents
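
The abstract describes a free energy-based rule over the policy space that combines the agent's own experience with other agents' estimated policies. A generic form of this idea, shown here as a sketch and not as the paper's algorithm, is the policy that minimises expected cost plus a KL penalty to a prior policy, which has the closed form `policy[a] ∝ prior[a] * exp(beta * q[a])`. Using another agent's observed action frequencies as the prior, and the parameter name `beta`, are assumptions for illustration.

```python
import math

def free_energy_policy(q_values, prior, beta):
    """Closed-form minimiser of E[-Q] + (1/beta) * KL(policy || prior)."""
    weights = [p * math.exp(beta * q) for q, p in zip(q_values, prior)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical two-armed bandit.
observed_other = [0.2, 0.8]  # another agent's empirical action frequencies
own_estimates = [1.0, 0.0]   # this agent's own value estimates per arm

no_evidence = free_energy_policy([0.0, 0.0], observed_other, beta=1.0)
strong_evidence = free_energy_policy(own_estimates, observed_other, beta=5.0)
```

When the agent has no evidence of its own, the policy falls back to the observed behaviour; as its own value estimates sharpen or `beta` grows, direct experience outweighs the social prior.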

100. ❌ CINDI: Conditional Imputation and Noisy Data Integrity with Flows in Power Grid Data

Authors: David Baumgartner, Helge Langseth, Heri Ramampiaro Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11745v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

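Scoring rationale: The CINDI paper focuses on anomaly detection and imputation for power grid time series, using conditional normalizing flows to build an unsupervised probabilistic framework. All of the keywords relate directly to large language models (LLMs), deep learning principles, or applications of AI in science, but the paper does not address LLMs, deep learning model architectures, training methods, inference optimization, agent systems, or model interpretability. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to power infrastructure (a science and engineering domain); since this is not core bioinformatics or cheminformatics, it receives 5 points (some relevance). The other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes CINDI, an unsupervised probabilistic framework built on conditional normalizing flows that unifies the detection and imputation of noise and anomalies in complex time series such as power grid data, and reports robust performance relative to competitive baselines on real-world grid data.

Abstract (Chinese translation)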

现实世界中的多元时间序列，尤其是在电网等关键基础设施领域，常因噪声和异常值干扰而影响下游任务性能。传统数据清洗方法通常依赖分离式策略，即使用一个模型检测错误，再用另一个模型进行填补。这类方法往往难以捕捉数据的完整联合分布，且忽略了预测不确定性。本研究提出条件插补与噪声数据完整性框架，这是一种无监督概率框架，旨在修复复杂时间序列的数据完整性。与碎片化方法不同，该框架将异常检测与数据插补统一于基于条件标准化流的端到端系统。通过精确建模数据的条件似然，该框架能识别低概率数据段，并迭代生成统计一致的替代值。这使得框架能够高效复用已学习信息，同时保持系统内在的物理与统计特性。我们使用挪威配电运营商的真实电网损耗数据进行评估，但该方法设计上可推广至任意多元时间序列领域。实验结果表明，相较于现有基准方法，该框架展现出更强的鲁棒性，为噪声环境下的可靠性维护提供了可扩展的解决方案。

Abstract

Real-world multivariate time series, particularly in critical infrastructure such as electrical power grids, are often corrupted by noise and anomalies that degrade the performance of downstream tasks. Standard data cleaning approaches often rely on disjoint strategies, which involve detecting errors with one model and imputing them with another. Such approaches can fail to capture the full joint distribution of the data and ignore prediction uncertainty. This work introduces Conditional Imputation and Noisy Data Integrity (CINDI), an unsupervised probabilistic framework designed to restore data integrity in complex time series. Unlike fragmented approaches, CINDI unifies anomaly detection and imputation into a single end-to-end system built on conditional normalizing flows. By modeling the exact conditional likelihood of the data, the framework identifies low-probability segments and iteratively samples statistically consistent replacements. This allows CINDI to efficiently reuse learned information while preserving the underlying physical and statistical properties of the system. We evaluate the framework using real-world grid loss data from a Norwegian power distribution operator, though the methodology is designed to generalize to any multivariate time series domain. The results demonstrate that CINDI yields robust performance compared to competitive baselines, offering a scalable solution for maintaining reliability in noisy environments.

Keywords: Conditional Imputation, Noisy Data Integrity, Normalizing Flows, Anomaly Detection, Multivariate Time Series, Power Grid Data, Unsupervised Probabilistic Framework, Data Cleaning
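
The abstract describes a loop that identifies low-probability segments under a conditional likelihood and iteratively replaces them. The sketch below imitates only that loop, under strong simplifications that are stated here as assumptions: a fixed-variance Gaussian centred on the mean of the two neighbouring points stands in for the conditional normalizing flow, and the replacement is the conditional mean instead of a sample, so the result is deterministic. The threshold and `sigma` are arbitrary illustrative values.

```python
import math

def log_likelihood(x, mean, sigma):
    """Log-density of x under a Gaussian with the given mean and sigma."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mean) ** 2 / (2 * sigma ** 2))

def clean(series, sigma=1.0, threshold=-5.0, max_rounds=10):
    """Repeatedly replace the least likely interior point until all pass.

    Assumes the series has at least three points.
    """
    data = list(series)
    for _ in range(max_rounds):
        scored = [
            (log_likelihood(data[t], 0.5 * (data[t - 1] + data[t + 1]), sigma), t)
            for t in range(1, len(data) - 1)
        ]
        worst_ll, worst_t = min(scored)
        if worst_ll >= threshold:
            break  # every interior point is sufficiently likely
        data[worst_t] = 0.5 * (data[worst_t - 1] + data[worst_t + 1])
    return data

repaired = clean([1.0, 1.0, 9.0, 1.0, 1.0])
untouched = clean([1.0, 2.0, 3.0, 4.0])
```

Replacing only the single least likely point per round matters: a spike also makes its neighbours look unlikely, and fixing the spike first lets the neighbours pass on the next round without being altered.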

101. ❌ Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information

Author: Konstantin Krestnikov Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11749v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

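Scoring rationale: The paper studies why language models trained on mixed-quality data prefer correct information; its core idea is the Compression–Consistency Principle. Highly relevant keywords: 'Hallucination Mitigation OR Factuality OR Truthfulness' (10 points; it directly studies factuality and the mechanisms behind hallucination mitigation) and 'Mechanistic Interpretability OR Explainable AI' (10 points; it probes the model's internal workings and interpretability). Moderately relevant: 'Small Language Models OR SLMs OR On-device AI' (8 points; it uses small GPT-2-style models), 'Scaling Laws AND Data Quality' (8 points; it studies how data quality affects model behavior), and 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points; it analyzes the pre-training process). Slightly relevant: 'Large Language Models OR LLMs OR Foundation Models' (5 points; the models are small, but the principle applies to large models). The remaining keywords are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper studies why language models trained on mixed-quality data prefer correct information. It proposes the Compression–Consistency Principle and finds that the preference for correct information stems mainly from compression pressure and a preference for internal consistency rather than an intrinsic drive toward truth.

Abstract translation (Chinese)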

为何语言模型有时更倾向于选择正确陈述，即使其训练数据质量参差不齐？我们提出“压缩—一致性原则”：在训练数据中，下一个词元预测更倾向于那些能够以更简短且内部一致的方式进行描述的假设。仅当错误替代项在结构上更难压缩时，模型才会显现出对真实性的偏好。我们使用小型GPT-2风格的字符级Transformer模型（参数量3.5M—86M）在合成数学语料库上对此进行验证，该语料库包含受控的正确与错误规则混合。在随机错误设置中，模型在配对评估中强烈偏好正确补全：数据平衡时准确率达83.1%，即使正确规则仅出现在10%的语料中，准确率仍达67.0%。若将随机错误替换为一套连贯但数学上不正确的规则体系，这种偏好基本消失（准确率接近随机水平）。在更接近自然语言的合成环境中，该效应虽减弱但仍存在（准确率57.7%）。进一步实验表明，嵌入验证步骤即使在小规模模型中也能恢复对正确性的偏好，而增加一致规则的数量会带来准确率的梯度提升。我们的结果表明，表面上呈现的“真实性偏好”主要是压缩压力与内部一致性偏好的副产品，而非对真实性的内在追求。完整代码与数据详见https://github.com/Rai220/compression-drives-truth。

Abstract

Why do language models sometimes prefer correct statements even when trained on mixed-quality data? We introduce the Compression–Consistency Principle: next-token prediction favors hypotheses that allow shorter and more internally consistent descriptions of the training data. Truth bias emerges only when false alternatives are structurally harder to compress. We test this using small GPT-2-style character-level transformers (3.5M–86M parameters) on synthetic math corpora with controlled mixtures of correct and incorrect rules. In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus. Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy). In a more natural-language-like synthetic world, the effect is weaker but still present (57.7%). Additional experiments show that embedding verification steps can restore preference for correctness even at small scale, while increasing the number of consistent rules produces a graded improvement in accuracy. Our results suggest that what appears as a “truth bias” is largely a side effect of compression pressure and preference for internal consistency, rather than an intrinsic drive toward truth. Full code and data are available at https://github.com/Rai220/compression-drives-truth.

Keywords: language models, compression-consistency principle, truth bias, factuality, mechanistic interpretability, next-token prediction, data quality, synthetic corpora
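The paired evaluation mentioned in the abstract can be read as follows: for each prompt the model scores one correct and one incorrect completion, and accuracy is the fraction of pairs where the correct one receives the higher likelihood. The sketch below illustrates that reading only. The `log_prob` interface and the toy scorer are assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interface: returns the model's total log-probability of
# `completion` given `prompt`. The paper's scoring code is not shown in the
# abstract, so this signature is an assumption.
LogProbFn = Callable[[str, str], float]


def paired_accuracy(pairs: Iterable[Tuple[str, str, str]], log_prob: LogProbFn) -> float:
    """Fraction of (prompt, correct, incorrect) triples where the correct
    completion receives the strictly higher log-probability."""
    wins = 0
    total = 0
    for prompt, correct, incorrect in pairs:
        total += 1
        if log_prob(prompt, correct) > log_prob(prompt, incorrect):
            wins += 1
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return wins / total


def toy_log_prob(prompt: str, completion: str) -> float:
    """Stand-in scorer for illustration only: favors completions equal to the
    true sum of an 'a+b=' prompt. A real run would query a trained model."""
    a, b = prompt.rstrip("=").split("+")
    return 0.0 if completion == str(int(a) + int(b)) else -1.0


pairs = [("2+3=", "5", "6"), ("4+4=", "8", "9"), ("1+7=", "8", "7")]
print(paired_accuracy(pairs, toy_log_prob))
```

Under this reading, a value near 0.5 corresponds to the near-chance behavior the abstract reports for the coherent-but-incorrect rule system.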

102. ❌ Gender Bias in Generative AI-assisted Recruitment Processes

Authors: Martina Ullasci, Marco Rondina, Riccardo Coppola, Antonio Vetrò Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11736v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

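Scoring rationale: The paper directly studies the use of a large language model (GPT-5) in recruitment, so it is highly relevant to 'Large Language Models' (10 points). It examines gender bias and factuality issues in model outputs, which gives it some connection to 'Hallucination Mitigation' and 'Mechanistic Interpretability' (5 points each), though neither is a core technical contribution. The other keywords concern specific technical principles, training methods, inference optimization, agent systems, and scientific applications, none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

The study evaluates gender bias in GPT-5 when it recommends jobs in a simulated recruitment setting based on gender and work-experience background. It finds no significant differences in recommended job titles or industries, but the adjectives used to describe candidates show gender stereotypes: women are associated with emotional traits and men with analytical ones.

Abstract translation (Chinese)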

近年来，生成式人工智能系统在选拔流程、人员招聘及候选人履历分析中扮演着日益关键的角色。然而，大型语言模型的使用可能复制甚至放大劳动力市场中已有的性别刻板印象与偏见。本文旨在评估和量化这一现象，通过分析当前先进的生成模型基于性别与工作经历背景为意大利35岁以下毕业生推荐职业的倾向。研究采用模拟的24份候选人档案对模型进行测试，这些档案在性别、年龄、经验及专业领域上均保持平衡。尽管在职位名称与行业推荐方面未出现显著差异，但模型在描述女性和男性候选人时呈现出性别化的语言模式：女性更常被赋予情感丰富、富有同理心等形容词，而男性则更多与战略性、分析性特质相关联。这项研究提出了在敏感流程中使用此类模型所涉及的伦理问题，强调了未来数字化劳动力市场对透明度与公平性的迫切需求。

Abstract

In recent years, generative artificial intelligence (GenAI) systems have assumed increasingly crucial roles in selection processes, personnel recruitment and analysis of candidates’ profiles. However, the employment of large language models (LLMs) risks reproducing, and in some cases amplifying, gender stereotypes and bias already present in the labour market. The objective of this paper is to evaluate and measure this phenomenon, analysing how a state-of-the-art generative model (GPT-5) suggests occupations based on gender and work experience background, focusing on under-35-year-old Italian graduates. The model has been prompted to suggest jobs to 24 simulated candidate profiles, which are balanced in terms of gender, age, experience and professional field. Although no significant differences emerged in job titles and industry, gendered linguistic patterns emerged in the adjectives attributed to female and male candidates, indicating a tendency of the model to associate women with emotional and empathetic traits, while men with strategic and analytical ones. The research raises an ethical question regarding the use of these models in sensitive processes, highlighting the need for transparency and fairness in future digital labour markets.

Keywords: Generative AI, Large Language Models, Gender Bias, Recruitment, GPT-5, Ethical AI, Fairness, Labor Market

103. ❌ Adapting Dijkstra for Buffers and Unlimited Transfers

Authors: Denys Katkalo, Andrii Rohovyi, Toby Walsh Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11729v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

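Scoring rationale: The paper focuses on public-transit route-planning algorithms (Time-Dependent Dijkstra and Transfer Aware Dijkstra). Its content covers algorithm optimization, graph theory, and transit network modeling, and is entirely unrelated to the scoring keywords, all of which concern large models, deep learning, and AI techniques and applications. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

For public-transit routing with unlimited transfers, the paper shows that the dominated-connection filtering used by efficient Time-Dependent Dijkstra implementations is unsound when stops have buffer times. It proposes Transfer Aware Dijkstra (TAD), which achieves a more than twofold speed-up over MR on the London and Switzerland networks while producing optimal results.

Abstract translation (Chinese)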

近年来，基于RAPTOR的算法在无需预处理的无限换乘路径规划中被视为最优方法。然而，这一地位主要源于路由研究的发展历程——基于迪杰斯特拉（Dijkstra）的解决方案被基于时刻表的算法所取代，而两者之间缺乏系统性的比较。在本研究中，我们重新审视了基于迪杰斯特拉经典方法的公共交通无限换乘路由，并证明时间依赖型迪杰斯特拉（TD-Dijkstra）算法优于多目标路由（MR）。然而，高效的TD-Dijkstra实现依赖于在预处理阶段过滤被支配的连接，这假设乘客总能切换到更快的连接。我们指出，当站点存在缓冲时间时，这种过滤方式是不严谨的，因为它无法区分可能无需等待即可继续行程的座位乘客与必须遵守缓冲时间的换乘乘客。为解决这一局限，我们提出了换乘感知迪杰斯特拉（Transfer Aware Dijkstra, TAD）算法，该改进方法通过扫描完整的行程序列而非单一边缘，在保持对MR性能优势的同时，能正确处理缓冲时间。我们在伦敦和瑞士交通网络上的实验表明，该算法在有无缓冲时间的两种情况下均能产生最优结果，同时相比MR实现了超过两倍的加速。

Abstract

In recent years, RAPTOR based algorithms have been considered the state-of-the-art for path-finding with unlimited transfers without preprocessing. However, this status largely stems from the evolution of routing research, where Dijkstra-based solutions were superseded by timetable-based algorithms without a systematic comparison. In this work, we revisit classical Dijkstra-based approaches for public transit routing with unlimited transfers and demonstrate that Time-Dependent Dijkstra (TD-Dijkstra) outperforms MR. However, efficient TD-Dijkstra implementations rely on filtering dominated connections during preprocessing, which assumes passengers can always switch to a faster connection. We show that this filtering is unsound when stops have buffer times, as it cannot distinguish between seated passengers who may continue without waiting and transferring passengers who must respect the buffer. To address this limitation, we introduce Transfer Aware Dijkstra (TAD), a modification that scans entire trip sequences rather than individual edges, correctly handling buffer times while maintaining performance advantages over MR. Our experiments on London and Switzerland networks show that we can achieve a greater than two time speed-up over MR while producing optimal results on both networks with and without buffer times.

Keywords: public transit routing, unlimited transfers, Time-Dependent Dijkstra, buffer times, Transfer Aware Dijkstra, path-finding algorithms, optimal routing, algorithm optimization
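The distinction between seated and transferring passengers can be illustrated with a small earliest-arrival search that scans the whole remainder of a trip once it is boarded, so the buffer is charged only when boarding. This is a toy sketch of that idea, not the authors' TAD implementation. The trip format, the `buffer` map, and the choice to charge no buffer at the origin stop are all assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Tuple

# A trip is an ordered list of stop events: (stop, arrival_time, departure_time).
Trip = List[Tuple[str, int, int]]


def earliest_arrival(trips: List[Trip], buffer: Dict[str, int],
                     source: str, start_time: int) -> Dict[str, int]:
    """Earliest arrival time at every reachable stop."""
    # Index every (trip, position) pair by the stop it serves.
    at_stop: Dict[str, List[Tuple[int, int]]] = {}
    for t_idx, trip in enumerate(trips):
        for pos, (stop, _arr, _dep) in enumerate(trip):
            at_stop.setdefault(stop, []).append((t_idx, pos))

    best: Dict[str, int] = {source: start_time}
    queue: List[Tuple[int, str]] = [(start_time, source)]
    while queue:
        time, stop = heapq.heappop(queue)
        if time > best.get(stop, float("inf")):
            continue  # stale queue entry
        # Assumption: no buffer is needed at the origin stop.
        ready = time if stop == source else time + buffer.get(stop, 0)
        for t_idx, pos in at_stop.get(stop, []):
            trip = trips[t_idx]
            if trip[pos][2] < ready:
                continue  # cannot board: the vehicle leaves before we are ready
            # Scan the rest of the trip; a seated passenger pays no buffer.
            for next_stop, arrival, _dep in trip[pos + 1:]:
                if arrival < best.get(next_stop, float("inf")):
                    best[next_stop] = arrival
                    heapq.heappush(queue, (arrival, next_stop))
    return best


# Trip 0 runs A -> B -> C; trip 1 is a faster B -> C service leaving at 12.
trips = [
    [("A", 0, 0), ("B", 10, 11), ("C", 20, 20)],
    [("B", 12, 12), ("C", 15, 15)],
]
print(earliest_arrival(trips, {"B": 5}, "A", 0)["C"])  # buffer blocks the transfer
print(earliest_arrival(trips, {"B": 0}, "A", 0)["C"])  # transfer is possible
```

With a 5-minute buffer at B the passenger cannot change to the faster service, yet can still reach C by staying seated on trip 0. A search that applied the buffer to every edge leaving B would miss that option.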

104. ❌ Affect Decoding in Phonated and Silent Speech Production from Surface EMG

Authors: Simon Pistrosch, Kleanthis Avramidis, Tiantian Feng, Jihwan Lee, Monica Gonzalez-Machorro, Shrikanth Narayanan, Björn W. Schuller Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11715v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

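Scoring rationale: The paper studies affect decoding from surface electromyography (sEMG), which belongs to biomedical signal processing and affective computing, and is unrelated to most large-model keywords (LLM, MoE, RLHF, and so on). The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the study involves biomedical signals (EMG) and affect analysis and can be viewed as an application of AI in science (biomedicine), but this is not its core content, so it receives 5 points (some connection). No other keyword is addressed.

!!! tip deepseek-chat TL;DR

The study analyzes affective expression in phonated and silent speech production using surface electromyography (sEMG). It finds that EMG representations reliably decode frustration (up to 0.845 AUC) and that affective signatures are embedded in facial muscle activity and persist without phonation, indicating potential for affect-aware silent speech interfaces.

Abstract translation (Chinese)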

情感表达是口语交流的组成部分，但其与底层发音执行之间的关联尚不明确。通过肌电图（EMG）等发音肌肉活动测量手段，结合声学语音分析，可以揭示情绪如何调节言语产生过程。本研究探讨在发声与无声言语产生过程中，如何通过面部与颈部表面肌电图（sEMG）解码情感状态。为此，我们构建了一个包含12名参与者在3项任务中产生的2,780条话语的数据集，并基于多种特征与模型嵌入评估了被试内与被试间解码性能。结果表明，肌电图表征能可靠地区分沮丧情绪（AUC最高达0.845），且在不同发音模式间具有良好的泛化能力。我们的消融研究进一步证明，情感特征编码于面部运动活动中，且在无发声状态下依然存在，这凸显了肌电传感技术在未来情感感知无声语音交互系统中的应用潜力。

Abstract

The expression of affect is integral to spoken communication, yet, its link to underlying articulatory execution remains unclear. Measures of articulatory muscle activity such as EMG could reveal how speech production is modulated by emotion alongside acoustic speech analyses. We investigate affect decoding from facial and neck surface electromyography (sEMG) during phonated and silent speech production. For this purpose, we introduce a dataset comprising 2,780 utterances from 12 participants across 3 tasks, on which we evaluate both intra- and inter-subject decoding using a range of features and model embeddings. Our results reveal that EMG representations reliably discriminate frustration with up to 0.845 AUC, and generalize well across articulation modes. Our ablation study further demonstrates that affective signatures are embedded in facial motor activity and persist in the absence of phonation, highlighting the potential of EMG sensing for affect-aware silent speech interfaces.

Keywords: affect decoding, surface EMG, silent speech, phonated speech, facial muscle activity, emotion recognition, speech production, biomedical signal processing

105. ❌ OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Authors: Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11698v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

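Scoring rationale: The paper mainly studies the evaluation of object state change in text-to-video generation and is unrelated to most large-model keywords. It has moderate relevance only to "Large Language Models OR LLMs OR Foundation Models" (5 points), because it uses a multimodal large language model for automatic evaluation, but this is not the paper's core contribution.

!!! tip deepseek-chat TL;DR

The paper proposes OSCBench, a benchmark for evaluating how well text-to-video generation models handle object state change, and finds that existing models consistently struggle to produce accurate and temporally consistent state changes.

Abstract translation (Chinese)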

文本到视频（T2V）生成模型在生成视觉质量高且时序连贯的视频方面已取得快速进展。然而，现有基准主要关注感知质量、文本-视频对齐或物理合理性，而忽略了动作理解的一个关键方面：文本提示中明确指定的物体状态变化（OSC）。OSC 指的是由动作引发的物体状态转变，例如削土豆皮或切柠檬。本文中，我们介绍了 OSCBench，这是一个专门设计用于评估 T2V 模型中 OSC 性能的基准。OSCBench 基于烹饪教学数据构建，并将动作-物体交互系统地组织为常规、新颖和组合性场景，以探究模型在分布内性能及泛化能力。我们通过人工用户研究和基于多模态大语言模型（MLLM）的自动评估，对六个代表性的开源及专有 T2V 模型进行了评估。我们的结果表明，尽管当前 T2V 模型在语义和场景对齐方面表现强劲，但在准确且时序一致的物体状态变化方面始终存在困难，尤其在新颖和组合性场景中。这些发现确立了 OSC 是文本到视频生成中的一个关键瓶颈，并将 OSCBench 定位为推动状态感知视频生成模型发展的诊断性基准。

Abstract

Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object’s state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.

Keywords: text-to-video generation, object state change, benchmark, multimodal large language model, evaluation, instructional cooking, generalization, video generation models

106. ❌ Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

Authors: Mei Chee Leong, Ying Gu, Hui Li Tan, Liyuan Li, Nancy Chen Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11689v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

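Scoring rationale: The paper proposes the Explicit Logic Channel (ELC), which uses an LLM and a vision foundation model (VFM) for explicit logical reasoning in order to validate, select, and enhance multimodal large language models (MLLMs). Core relevant keywords: (1) 'Large Language Models' (10 points): the ELC is built on an LLM and the paper evaluates MLLMs; (2) 'Chain of Thought' and 'System 2 Thinking' (8 points each): the ELC mimics human logical reasoning and performs multi-step reasoning; (3) 'Hallucination Mitigation' (8 points): explicit visual evidence improves trustworthiness and reduces hallucination; (4) 'Explainable AI' (10 points): the ELC provides explainability and improves model transparency; (5) 'Self-Correction' (5 points): cross-channel validation and integration can be viewed as a form of self-improvement. Other keywords such as MoE, SFT, and RAG are not addressed and score 0.

!!! tip deepseek-chat TL;DR

To address the lack of explainability of multimodal large language models on zero-shot tasks, the paper proposes an Explicit Logic Channel that runs in parallel with the black-box model. It validates model behavior through logical reasoning over explicit visual evidence and improves the model's trustworthiness and performance.

Abstract translation (Chinese)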

前沿多模态大语言模型（MLLMs）在视觉-语言理解任务中展现出卓越能力。然而，这些模型通常以黑盒方式作为零样本解决方案部署于新任务。验证与理解此类模型的行为对于其在新任务中的应用至关重要。我们提出一种显式逻辑通道，与黑盒模型通道并行运作，通过显式逻辑推理实现模型验证、选择与增强。蕴含隐性视觉-语言知识的前沿MLLMs可视为隐式逻辑通道。所提出的显式逻辑通道模拟人类逻辑推理，整合大语言模型、视觉基础模型以及概率推理逻辑系统，对显式视觉证据进行事实性、反事实性与关系性推理。我们提出一致性比率指标，用于跨通道验证与模型选择，该指标无需真实标注即可实现。此外，跨通道整合能基于显式视觉证据增强可信度，在零样本任务中实现超越原始MLLMs的性能提升。我们在三个挑战性基准上，针对两项代表性视觉-语言理解任务（即多项选择视觉问答与硬案例指代表达理解），对来自4个前沿系列的11个近期开源MLLMs进行了全面实验。系统化评估表明，所提出的显式逻辑通道与一致性比率机制能有效实现多模态大语言模型的验证、选择与性能改进，同时增强模型的可解释性与可信度。

Abstract

Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solution to new tasks in a black-box manner. Validating and understanding the behavior of these models become important for application to new task. We propose an Explicit Logic Channel, in parallel with the black-box model channel, to perform explicit logical reasoning for model validation, selection and enhancement. The frontier MLLM, encapsulating latent vision-language knowledge, can be considered as an Implicit Logic Channel. The proposed Explicit Logic Channel, mimicking human logical reasoning, incorporates a LLM, a VFM, and logical reasoning with probabilistic inference for factual, counterfactual, and relational reasoning over the explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection, even without ground-truth annotations. Additionally, cross-channel integration further improves performance in zero-shot tasks over MLLMs, grounded with explicit visual evidence to enhance trustworthiness. Comprehensive experiments conducted for two representative VLC tasks, i.e., MC-VQA and HC-REC, on three challenging benchmarks, with 11 recent open-source MLLMs from 4 frontier families. Our systematic evaluations demonstrate the effectiveness of proposed ELC and CR for model validation, selection and improvement on MLLMs with enhanced explainability and trustworthiness.

Keywords: Multimodal Large Language Models, Explicit Logic Channel, Zero-shot Tasks, Model Validation, Logical Reasoning, Explainability, Trustworthiness, Visual-Language Comprehension
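The abstract does not give a formula for the Consistency Rate (CR). One plausible reading is the share of samples on which the black-box MLLM (the implicit channel) and the explicit channel give the same answer, which needs no ground-truth labels. The sketch below assumes that reading. The function name, the model names, and the answers are hypothetical, and the paper's definition may differ.

```python
from typing import Hashable, Sequence


def consistency_rate(implicit: Sequence[Hashable], explicit: Sequence[Hashable]) -> float:
    """Share of samples on which both channels give the same answer.
    No ground-truth labels are used."""
    if len(implicit) != len(explicit):
        raise ValueError("both channels must answer the same samples")
    if not implicit:
        raise ValueError("no samples to compare")
    agree = sum(a == b for a, b in zip(implicit, explicit))
    return agree / len(implicit)


# Hypothetical multiple-choice answers from two MLLMs and one explicit channel.
explicit_answers = ["A", "C", "B", "D"]
mllm_answers = {
    "model_x": ["A", "C", "B", "A"],
    "model_y": ["B", "C", "A", "A"],
}
rates = {name: consistency_rate(ans, explicit_answers)
         for name, ans in mllm_answers.items()}
best_model = max(rates, key=rates.get)  # label-free model selection
print(rates, best_model)
```

Selecting the model with the highest agreement is one way such a metric could support model selection on an unlabeled task; how the paper combines the two channels for enhancement is not covered here.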

107. ❌ Causal Prosody Mediation for Text-to-Speech: Counterfactual Training of Duration, Pitch, and Energy in FastSpeech2

Authors: Suvendu Sekhar Mohanty Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11683v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

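Scoring rationale: The paper focuses on text-to-speech (TTS) synthesis and proposes a causal prosody mediation framework to improve emotional expressiveness in FastSpeech2. Although it involves deep learning techniques (neural network architectures, training objectives), its core content is unrelated to large language models and associated techniques (MoE, SFT, RAG, quantization, and so on). The paper does not present innovations in large-model techniques or applications of large models in science, nor does it involve bioinformatics or cheminformatics. All scoring keywords target large-model techniques, whereas this paper studies a domain-specific speech synthesis model, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a causal prosody mediation framework that improves FastSpeech2 with counterfactual training objectives, yielding better emotional speech synthesis and controllable prosody editing.

Abstract translation (Chinese)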

本文提出了一种新颖的因果韵律调节框架，用于富有表现力的文本到语音（Text-to-Speech, TTS）合成。我们的方法通过显式的情感条件增强FastSpeech2架构，并引入反事实训练目标，以将情感韵律从语言学内容中解耦。通过构建一个关于文本（内容）、情感和说话人如何共同影响韵律（时长、音高、能量）并最终影响语音波形的结构因果模型，我们推导出两个互补的损失项：间接路径约束（Indirect Path Constraint, IPC）用于强制情感仅通过韵律影响语音，以及反事实韵律约束（Counterfactual Prosody Constraint, CPC）以鼓励不同情感产生不同的韵律模式。所得到的模型在多说话人情感语料库（LibriTTS, EmoV-DB, VCTK）上进行训练，其组合目标函数包括标准的声谱图重建损失、方差预测损失以及我们提出的因果损失。在富有表现力的语音合成评估中，我们的方法在韵律操控和情感渲染方面取得了显著改进，与基线FastSpeech2变体相比，获得了更高的平均意见得分（Mean Opinion Score, MOS）和情感准确率。我们还观察到，在跨说话人迁移情感时，模型具有更好的可懂度（低词错误率，Word Error Rate, WER）和说话人一致性。大量的消融实验证实，因果目标成功地分离了韵律归因，产生了一个可解释的模型，允许进行可控的反事实韵律编辑（例如“相同话语，不同情感”），同时不损害自然度。我们讨论了该框架对韵律建模可识别性的意义，并概述了其局限性，例如假设情感效应完全由音高、时长和能量捕获。我们的工作展示了将因果学习原理整合到TTS中，如何能够提高生成语音的可控性和表现力。

Abstract

We propose a novel causal prosody mediation framework for expressive text-to-speech (TTS) synthesis. Our approach augments the FastSpeech2 architecture with explicit emotion conditioning and introduces counterfactual training objectives to disentangle emotional prosody from linguistic content. By formulating a structural causal model of how text (content), emotion, and speaker jointly influence prosody (duration, pitch, energy) and ultimately the speech waveform, we derive two complementary loss terms: an Indirect Path Constraint (IPC) to enforce that emotion affects speech only through prosody, and a Counterfactual Prosody Constraint (CPC) to encourage distinct prosody patterns for different emotions. The resulting model is trained on multi-speaker emotional corpora (LibriTTS, EmoV-DB, VCTK) with a combined objective that includes standard spectrogram reconstruction and variance prediction losses alongside our causal losses. In evaluations on expressive speech synthesis, our method achieves significantly improved prosody manipulation and emotion rendering, with higher mean opinion scores (MOS) and emotion accuracy than baseline FastSpeech2 variants. We also observe better intelligibility (low WER) and speaker consistency when transferring emotions across speakers. Extensive ablations confirm that the causal objectives successfully separate prosody attribution, yielding an interpretable model that allows controlled counterfactual prosody editing (e.g. “same utterance, different emotion”) without compromising naturalness. We discuss the implications for identifiability in prosody modeling and outline limitations such as the assumption that emotion effects are fully captured by pitch, duration, and energy. Our work demonstrates how integrating causal learning principles into TTS can improve controllability and expressiveness in generated speech.

Keywords: text-to-speech synthesis, causal prosody mediation, counterfactual training, FastSpeech2, emotional prosody, prosody disentanglement, expressive speech, structural causal model

108. ❌ Entropy-Preserving Reinforcement Learning

Authors: Aleksei Petrenko, Ben Lipkin, Kevin Chen, Erik Wijmans, Marco Cusumano-Towner, Raja Giryes, Philipp Krähenbühl Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11682v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

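Scoring rationale: The paper focuses on policy gradient algorithms in reinforcement learning, specifically how entropy evolves during training and how that affects exploration diversity, and proposes entropy-control methods such as REPO and ADAPO. Although the paper mentions language model reasoning as its application context, its core contribution is at the level of reinforcement learning algorithms and does not involve large-model architecture, training techniques, inference optimization, or domain-specific applications. The only relevant keyword is 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO', because the policy gradient algorithms studied are a building block of techniques such as RLHF, although the paper does not study RLHF directly. All other keywords are unrelated to the paper.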

!!! tip deepseek-chat TL;DR

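The paper finds that policy gradient algorithms naturally reduce entropy during training, which limits exploration diversity. It proposes entropy-preserving methods such as REPO and ADAPO that keep the model diverse throughout training, yielding policies that perform better and remain trainable in new environments.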

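Abstract translation

Policy gradient algorithms have driven many recent advances in language model reasoning. Their appeal lies in the ability to learn from exploring their own trajectories, a process that is essential for producing diverse and creative solutions. As this paper shows, many policy gradient algorithms naturally lower entropy during training, and with it the diversity of explored trajectories, so the policy's capacity to explore becomes progressively more limited. The paper argues that entropy should be actively monitored and controlled throughout training. We systematically analyze how mainstream policy gradient objectives affect entropy dynamics, identify empirical factors (such as numerical precision) that significantly influence entropy behavior, and propose explicit entropy-control mechanisms. These include REPO, a family of algorithms that regulate entropy by modifying the advantage function, and ADAPO, an adaptive asymmetric clipping method. Models trained with our entropy-preserving methods maintain diversity throughout training, ultimately yielding better-performing policies that remain trainable for sequential learning in new environments.

Abstract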

Policy gradient algorithms have driven many recent advancements in language model reasoning. An appealing property is their ability to learn from exploration on their own trajectories, a process crucial for fostering diverse and creative solutions. As we show in this paper, many policy gradient algorithms naturally reduce the entropy – and thus the diversity of explored trajectories – as part of training, yielding a policy increasingly limited in its ability to explore. In this paper, we argue that entropy should be actively monitored and controlled throughout training. We formally analyze the contributions of leading policy gradient objectives on entropy dynamics, identify empirical factors (such as numerical precision) that significantly impact entropy behavior, and propose explicit mechanisms for entropy control. These include REPO, a family of algorithms that modify the advantage function to regulate entropy, and ADAPO, an adaptive asymmetric clipping approach. Models trained with our entropy-preserving methods maintain diversity throughout training, yielding final policies that are more performant and retain their trainability for sequential learning in new environments.
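
To make "monitoring entropy" and "asymmetric clipping" concrete, the following is a minimal sketch in plain Python. It is not the REPO or ADAPO method from the paper, whose details the abstract does not give; the function names and the default clip bounds are assumptions chosen for illustration, and the clipping shown is a generic PPO-style surrogate with separate lower and upper bounds.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(batch_logits):
    """Average entropy over a list of token positions, a quantity to log each step."""
    ents = [entropy(softmax(logits)) for logits in batch_logits]
    return sum(ents) / len(ents)

def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """PPO-style clipped objective with separate lower and upper clip bounds.

    eps_low and eps_high are hypothetical defaults, not values from the paper.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

# A uniform distribution has maximal entropy; a peaked one has less.
uniform = mean_policy_entropy([[0.0, 0.0, 0.0, 0.0]])
peaked = mean_policy_entropy([[5.0, 0.0, 0.0, 0.0]])
print(uniform > peaked)
```

Logging `mean_policy_entropy` at every update is the simplest form of the monitoring the abstract argues for; how the clip bounds should adapt over training is left open here.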

Keywords: policy gradient algorithms, entropy preservation, exploration diversity, REPO, ADAPO, reinforcement learning, trajectory diversity, sequential learning

109. ❌ LLMs can construct powerful representations and streamline sample-efficient supervised learning

Authors: Ilker Demirel, Larry Shi, Zeshan Hussain, David Sontag Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11679v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

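Scoring rationale: The paper proposes an LLM-based agentic pipeline that analyzes text-serialized inputs and generates rubrics to improve input representations for supervised learning. Core relevant keywords: 1) 'Large Language Models' (weight 1.0): the paper centrally uses an LLM to analyze inputs and generate rubrics, so it is highly relevant and receives 10 points; 2) 'LLM Agents' (weight 1.0): the paper explicitly describes an 'agentic pipeline' involving an LLM agent workflow, so it is highly relevant and receives 10 points; 3) 'In-context Learning' (weight 1.0): the LLM synthesizes the rubric by analyzing input examples in context, which is in-context learning, so it is highly relevant and receives 10 points; 4) 'AI for Science' (weight 1.0): the paper evaluates on 15 clinical tasks (the EHRSHOT benchmark), a bioinformatics/scientific AI application, so it is highly relevant and receives 10 points. Other keywords such as MoE, SFT, and RAG are not covered in the paper and receive 0 points.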

!!! tip deepseek-chat TL;DR

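The paper studies how an LLM can use in-context learning to generate programmatic rubrics that improve input representations for complex, heterogeneous data (such as clinical records), significantly improving supervised learning performance on 15 clinical tasks while lowering deployment cost.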

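Abstract translation

As real-world datasets grow more complex and heterogeneous, supervised learning is often limited by the design of input representations. Modeling multimodal data (such as time series, free text, and structured records) for downstream tasks usually requires substantial domain-specific engineering. We propose an agentic pipeline to simplify this process. First, a large language model (LLM) analyzes in context a small, diverse sample of text-serialized inputs and synthesizes a global rubric, which serves as a programmatic specification for extracting and organizing evidence. The rubric is then used to convert the raw text-serialized inputs into a standardized format better suited to downstream models. We also describe local rubrics, which are task-conditioned summaries generated by an LLM. Across 15 clinical tasks in the EHRSHOT benchmark, our rubric-based approaches significantly outperform traditional count-feature models, LLM baselines built on raw text serialization, and a clinical foundation model pretrained on orders of magnitude more data. Beyond performance, rubrics offer several advantages in operational healthcare settings: they are easy to audit, cost-effective to deploy at scale, and convertible to tabular representations, which opens the door to a wide range of machine learning techniques.

Abstract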

As real-world datasets become increasingly complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data for downstream tasks, such as time-series, free text, and structured records, often requires non-trivial domain-specific engineering. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text-serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric-based approaches significantly outperform traditional count-feature models, naive text-serialization-based LLM baselines, and a clinical foundation model, which is pretrained on orders of magnitude more data. Beyond performance, rubrics offer several advantages for operational healthcare settings such as being easy to audit, cost-effectiveness to deploy at scale, and they can be converted to tabular representations that unlock a swath of machine learning techniques.
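
The idea of a rubric as a "programmatic specification" that converts text-serialized inputs into tabular form can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: in the paper an LLM synthesizes the rubric, whereas here the rubric, its field names, the regular expressions, and the sample record are all hand-written and hypothetical.

```python
import re

# Hypothetical global rubric. Each entry names an output field, a pattern that
# extracts evidence from the serialized text, and a reducer over all matches.
RUBRIC = [
    {"field": "age", "pattern": r"age=(\d+)", "reduce": max},
    {"field": "max_heart_rate", "pattern": r"heart_rate=(\d+)", "reduce": max},
    {"field": "n_heart_rate_readings", "pattern": r"heart_rate=(\d+)", "reduce": len},
]

def apply_rubric(text, rubric):
    """Turn one text-serialized record into a flat row, one value per rubric field."""
    row = {}
    for spec in rubric:
        values = [float(v) for v in re.findall(spec["pattern"], text)]
        # Fields with no supporting evidence are recorded as missing.
        row[spec["field"]] = spec["reduce"](values) if values else None
    return row

# Made-up sample record, standing in for a naive text serialization.
record = "2025-01-02 heart_rate=88; 2025-01-03 heart_rate=112; age=67; note=stable"
row = apply_rubric(record, RUBRIC)
print(row)
```

Because every record is mapped to the same fixed set of fields, the resulting rows can be stacked into a table and passed to any standard tabular learner, which is the conversion the abstract mentions.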

Keywords: LLMs, agentic pipeline, in-context learning, rubric generation, supervised learning, clinical tasks, EHRSHOT benchmark, input representation

110. ❌ Stable Spike: Dual Consistency Optimization via Bitwise AND Operations for Spiking Neural Networks

Authors: Yongqi Ding, Kunshan Yang, Linze Li, Yiyang Zhang, Mengmeng Jing, Lin Zuo Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11676v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

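Scoring rationale: The paper focuses on optimizing spiking neural networks (SNNs), using dual consistency optimization and bitwise AND operations to address temporal inconsistency in SNNs and improve recognition performance. All scoring keywords target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on), whereas the paper studies spiking neural networks, an entirely different neural network architecture; the two do not overlap in model type, technical principles, or application domain. The relevance of every keyword is therefore 0.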

!!! tip deepseek-chat TL;DR

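The paper addresses inconsistent representations in spiking neural networks (SNNs) caused by temporal spike dynamics. It proposes Stable Spike, a dual consistency optimization method based on bitwise AND operations that separates a stable spike skeleton from noise spikes and injects amplitude-aware noise to improve cross-timestep consistency and generalization, significantly raising neuromorphic object recognition accuracy under ultra-low latency (by up to 8.33%).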

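Abstract translation

Although the temporal spike dynamics of spiking neural networks give them a low-power ability to capture temporal patterns, they also introduce inherent inconsistencies that severely degrade representation quality. This paper mitigates the problem by performing dual consistency optimization with Stable Spike, thereby improving the recognition performance of SNNs. Using the hardware-friendly bitwise “AND” operation, we efficiently decouple a stable spike skeleton from multi-timestep spike maps, capturing critical semantics while reducing the inconsistency caused by variable noise spikes. Forcing the unstable spike maps to converge toward the stable spike skeleton significantly improves inherent consistency across timesteps. In addition, we inject amplitude-aware spike noise into the stable spike skeleton to enrich representational diversity while keeping the semantic information consistent. This encourages the SNN to produce perturbation-consistent predictions, which strengthens generalization. Extensive experiments across multiple architectures and datasets validate the effectiveness and generality of the method. In particular, our method significantly advances neuromorphic object recognition under ultra-low latency, with accuracy gains of up to 8.33%. This will help unlock the full potential of SNNs in power consumption and speed.

Abstract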

Although the temporal spike dynamics of spiking neural networks (SNNs) enable low-power temporal pattern capture capabilities, they also incur inherent inconsistencies that severely compromise representation. In this paper, we perform dual consistency optimization via Stable Spike to mitigate this problem, thereby improving the recognition performance of SNNs. With the hardware-friendly “AND” bit operation, we efficiently decouple the stable spike skeleton from the multi-timestep spike maps, thereby capturing critical semantics while reducing inconsistencies from variable noise spikes. Enforcing the unstable spike maps to converge to the stable spike skeleton significantly improves the inherent consistency across timesteps. Furthermore, we inject amplitude-aware spike noise into the stable spike skeleton to diversify the representations while preserving consistent semantics. The SNN is encouraged to produce perturbation-consistent predictions, thereby contributing to generalization. Extensive experiments across multiple architectures and datasets validate the effectiveness and versatility of our method. In particular, our method significantly advances neuromorphic object recognition under ultra-low latency, improving accuracy by up to 8.33%. This will help unlock the full power consumption and speed potential of SNNs.
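
The bitwise AND step can be illustrated on toy data. The sketch below assumes binary spike maps flattened to one dimension and takes the AND across timesteps, so a position belongs to the skeleton only if it fires at every timestep. The function names and the example maps are hypothetical, and the paper's consistency losses and amplitude-aware noise injection are not reproduced.

```python
from functools import reduce

def stable_skeleton(spike_maps):
    """AND across timesteps: keep positions that spike at every timestep."""
    return [reduce(lambda a, b: a & b, column) for column in zip(*spike_maps)]

def noise_spikes(spike_map, skeleton):
    """Spikes present at this timestep but absent from the stable skeleton."""
    return [s & (1 - k) for s, k in zip(spike_map, skeleton)]

# Three timesteps, four neurons (made-up binary spike maps).
spike_maps = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]
skeleton = stable_skeleton(spike_maps)
noise = [noise_spikes(m, skeleton) for m in spike_maps]
print(skeleton)  # [1, 0, 1, 0]
print(noise)     # [[0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 1]]
```

Each timestep's map is the union of the shared skeleton and its own noise spikes, which is the decoupling the abstract describes; AND on bits is cheap in hardware, which is presumably why the authors call it hardware-friendly.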

Keywords: Spiking Neural Networks, Stable Spike, Dual Consistency Optimization, Bitwise AND Operations, Temporal Spike Dynamics, Neuromorphic Object Recognition, Low-power Temporal Pattern Capture, Inherent Inconsistencies

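111. ❌ IDRL (Individual-aware Multimodal Depression-related Representation Learning Framework)

Authors: Chongxiao Wang, Junjie Liang, Peng Cao, Jinzhu Yang, Osmar R. Zaiane Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11644v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score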
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

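Scoring rationale: The IDRL paper focuses on multimodal depression diagnosis and proposes an individual-aware representation learning framework, an application of AI in biomedicine (mental health). Of all 27 keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' has some connection to the paper's topic (depression diagnosis falls under bioinformatics/medical AI applications), but the paper does not involve large models, innovations in deep learning principles, or the specific techniques behind any other keyword (LLMs, MoE, scaling laws, fine-tuning methods, reasoning techniques, agent systems, and so on). Accordingly, that keyword scores 5 and the remaining 26 keywords score 0. Weighted total = 5.0 × 1.0 = 5.0, far below the dynamic passing score of 26.6, indicating that the paper has very low relevance to the review's focus on innovations in large-model/deep-learning principles or broad domain applications.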
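
The arithmetic in this rationale can be reproduced with a short sketch. The passing-score formula (5.0 + 0.8 × total weight, with a total weight of 27) comes from the report's configuration section; treating the weighted total as the sum of weight × relevance follows the rationale's own calculation and is otherwise an assumption about how the scorer works. The function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing score as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight

def weighted_total(rows):
    """Sum of weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows)

# 26 keywords with relevance 0 and one keyword with relevance 5, all at weight 1.0.
rows = [(1.0, 0.0)] * 26 + [(1.0, 5.0)]
total = weighted_total(rows)
threshold = passing_score(27.0)
print(total, round(threshold, 1), total >= threshold)
```

With these inputs the weighted total is 5.0 against a threshold of about 26.6, so the paper fails by a wide margin, as the rationale states.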

!!! tip deepseek-chat TL;DR

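The paper targets three problems in multimodal depression detection: inter-modal inconsistency, depression-unrelated interference, and individual differences in presentation. It proposes IDRL, an individual-aware multimodal depression-related representation learning framework that uses disentangled representations and a dynamic fusion module, and reports superior and robust diagnostic performance in experiments.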

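Abstract translation

Depression is a severe mental disorder, and reliable identification is critical for early intervention and treatment. Multimodal depression detection aims to improve diagnostic performance by jointly modeling complementary information from multiple modalities. Many multimodal learning methods have recently been proposed for depression analysis; however, they have the following limitations: 1) inter-modal inconsistency and depression-unrelated interference, where depression-related cues may conflict across modalities while a large amount of irrelevant content obscures the key depressive signals; and 2) diverse individual presentations of depression, which cause the importance of modalities and cues to vary between individuals and hinder reliable fusion. To address these issues, we propose the Individual-aware Multimodal Depression-related Representation Learning Framework (IDRL) for robust depression diagnosis. Specifically, IDRL 1) disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space to strengthen modality alignment and suppress irrelevant information; and 2) introduces an individual-aware modality-fusion module (IAF) that dynamically adjusts the weights of the disentangled depression-related features according to their predictive importance, achieving adaptive cross-modal fusion for different individuals. Extensive experiments show that IDRL achieves superior and robust performance in multimodal depression detection.

Abstract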

Depression is a severe mental disorder, and reliable identification plays a critical role in early intervention and treatment. Multimodal depression detection aims to improve diagnostic performance by jointly modeling complementary information from multiple modalities. Recently, numerous multimodal learning approaches have been proposed for depression analysis; however, these methods suffer from the following limitations: 1) inter-modal inconsistency and depression-unrelated interference, where depression-related cues may conflict across modalities while substantial irrelevant content obscures critical depressive signals, and 2) diverse individual depressive presentations, leading to individual differences in modality and cue importance that hinder reliable fusion. To address these issues, we propose Individual-aware Multimodal Depression-related Representation Learning Framework (IDRL) for robust depression diagnosis. Specifically, IDRL 1) disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space to enhance modality alignment while suppressing irrelevant information, and 2) introduces an individual-aware modality-fusion module (IAF) that dynamically adjusts the weights of disentangled depression-related features based on their predictive significance, thereby achieving adaptive cross-modal fusion for different individuals. Extensive experiments demonstrate that IDRL achieves superior and robust performance for multimodal depression detection.
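
One common way to "dynamically adjust the weights" of several feature vectors is a softmax over per-feature importance scores followed by a weighted sum. The sketch below shows that generic pattern only; it is an assumption, not the IAF module itself, whose design the abstract does not specify, and the function names and inputs are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax turning importance scores into weights."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(features, importance):
    """Weighted sum of equally sized feature vectors, weights from softmax(importance)."""
    weights = softmax(importance)
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights

# Two made-up feature vectors (for example, one per modality) for one individual.
features = [[1.0, 0.0], [0.0, 1.0]]
print(fuse(features, [0.0, 0.0]))   # equal importance gives a plain average
print(fuse(features, [10.0, 0.0]))  # the first feature dominates
```

Because the importance scores are computed per input, two individuals with the same modalities can end up with different fusion weights, which is the adaptive behavior the abstract describes.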

Keywords: multimodal depression detection, representation learning, individual-aware fusion, modality disentanglement, depression diagnosis, mental health AI, adaptive cross-modal fusion, robust performance

112. ❌ VisDoT : Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought

Authors: Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, Jihie Kim Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11631v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

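Scoring rationale: The paper proposes the VisDoT framework, which strengthens visual reasoning through human perception tasks and Decomposition-of-Thought (DoT) prompting. Its core innovation is decomposing complex questions into visual perception sub-questions and logic sub-questions, which is highly relevant to Chain of Thought and System 2 Thinking (10 points). The paper builds on large vision-language models (LVLMs), a large-model application (8 points), and achieves its gains by fine-tuning InternVL, which involves Post-training/SFT (8 points). The framework emphasizes interpretability, which relates to Mechanistic Interpretability (8 points). The paper involves adapting a pretrained model, which has some connection to Pre-training/Domain Adaptation (5 points). Other keywords such as MoE, SLMs, RAG, and RLHF are unrelated to the paper (0 points).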

!!! tip deepseek-chat TL;DR

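The paper addresses the weak perceptual grounding of large vision-language models in chart understanding. It proposes the VisDoT framework, which combines human perception tasks with a Decomposition-of-Thought prompting strategy to achieve significant gains on benchmarks such as ChartQA along with interpretable visual reasoning.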

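Abstract translation

Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning tasks. This lack of perceptual grounding is the main bottleneck for chart-based reasoning. We propose VisDoT, a framework that strengthens visual reasoning through human-like interpretation grounding. Based on the theory of graphical perception, we formalize four perceptual tasks, including position and length. On this basis, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially decomposes a question into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT yields a +11.2% improvement on the ChartQA benchmark and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. In addition, consistent zero-shot gains on diverse open-domain visual question answering (VQA) benchmarks confirm that the perception-logic separation strategy generalizes across VQA tasks. By strengthening visual grounding through human-like perception, VisDoT achieves state-of-the-art chart understanding and interpretable visual reasoning.

Abstract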

Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.

Keywords: Visual Reasoning, Large Vision-Language Models, Decomposition-of-Thought, Perceptual Grounding, Chart Understanding, Visual Question Answering, Interpretable AI, Fine-tuning

113. ❌ EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Authors: Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12252v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

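Scoring rationale: The paper studies how to activate the chain-of-thought reasoning of multimodal large language models (MLLMs) inside diffusion models. It is therefore highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points), has some connection to 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 points), and is relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points) because MLLMs fall under large language models. Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, and AI for Science are either not mentioned in the abstract or unrelated to the paper's core content, so they score 0.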

!!! tip deepseek-chat TL;DR

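The paper proposes the EndoCoT framework, which uses an iterative thought guidance module and a terminal thought grounding module to activate the chain-of-thought reasoning of multimodal large language models so that a diffusion model can carry out complex tasks step by step. It reaches 92.1% average accuracy across several benchmarks, 8.3 percentage points above the strongest baseline.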

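Abstract translation

Multimodal large language models (MLLMs) have recently been widely integrated into diffusion frameworks, mainly as text encoders, to handle complex tasks such as spatial reasoning. This paradigm has two key limitations: (i) The MLLM text encoder lacks reasoning depth. Single-step encoding cannot activate the chain-of-thought process, which is essential for MLLMs to provide accurate guidance on complex tasks. (ii) The guidance stays fixed during decoding. Even when the MLLM encoding is correct, invariant guidance during decoding prevents the diffusion transformer (DiT) from progressively decomposing complex instructions into actionable denoising steps. To address this, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework. First, an iterative thought guidance module iteratively refines latent thought states to activate the reasoning potential of MLLMs, and these states are then bridged to the DiT's denoising process. Second, a terminal thought grounding module aligns the final state with ground-truth answers so that the reasoning trajectory stays grounded in textual supervision. With these two components, the MLLM text encoder provides carefully reasoned guidance that the DiT can execute progressively, ultimately solving complex tasks step by step. Extensive evaluation on diverse benchmarks (such as Maze, the traveling salesman problem, visual spatial planning, and Sudoku) achieves an average accuracy of 92.1%, 8.3 percentage points above the strongest baseline.

Abstract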

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) MLLMs text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs’ reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT’s denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.

Keywords: Endogenous Chain-of-Thought, Multimodal Large Language Models, Diffusion Models, Iterative Thought Guidance, Terminal Thought Grounding, Complex Task Solving, Step-by-step Reasoning, Spatial Reasoning

114. ❌ CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

Authors: Alexandre Le Mercier, Thomas Demeester, Chris Develder Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12206v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

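Scoring rationale: The paper studies CLASP, a defense against Hidden State Poisoning Attacks (HiSPA) on hybrid large language models (which incorporate SSMs such as Mamba). Its core topic relates to large language models (LLMs), hence 8 points. The paper concerns security defense but does not directly involve keywords such as hallucination mitigation or factuality. Other keywords such as MoE, SLMs, scaling laws, training methods, inference acceleration, and agents are not covered in the paper. The application scenario is résumé screening, which does not involve bioinformatics or other specific scientific AI applications.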

!!! tip deepseek-chat TL;DR

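The paper proposes the CLASP model, which analyzes patterns in Mamba block output embeddings with an XGBoost classifier to defend hybrid large language models against hidden state poisoning attacks, achieving high-accuracy malicious token detection and good generalization in a résumé-screening scenario.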

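Abstract translation

State space models (SSMs) such as Mamba have attracted wide attention as efficient alternatives to Transformers, achieving linear complexity while remaining competitive in performance. However, the recently discovered Hidden State Poisoning Attacks (HiSPAs), a vulnerability that corrupts SSM memory through adversarial strings, pose a serious threat to this architecture and its hybrid variants. This paper frames HiSPA defense as a token-level binary classification problem and proposes the CLASP model to counter the threat. CLASP exploits distinctive patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens at very low computational cost. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: a large language model (LLM) screening résumés to identify the best candidates for a position. Evaluated on a corpus of 2,483 résumés (9.5 million tokens in total) with controlled attack injections, CLASP achieves a token-level F1 score of 95.9% and a document-level F1 score of 99.3% on malicious token detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance stays high (document-level F1 of 96.9%), and under clustered cross-validation with structurally novel triggers, it retains useful detection capability (average document-level F1 of 91.6%). CLASP runs independently of any downstream model, processes 1,032 tokens per second, and uses less than 4 GB of VRAM, making it a candidate for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.

Abstract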

State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba’s block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves 95.9% token-level F1 score and 99.3% document-level F1 score on malicious tokens detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.
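
The abstract reports both token-level and document-level F1. The sketch below shows how the two can differ on the same predictions. It assumes that a document counts as malicious when any of its tokens is labeled malicious; that aggregation rule is an assumption, since the abstract does not define it, and the labels are made-up toy data rather than CLASP outputs.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for the positive (malicious) class; 0.0 if there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def flatten(docs):
    """Concatenate per-document token labels into one token-level list."""
    return [label for doc in docs for label in doc]

def document_labels(docs):
    """Assumed aggregation: a document is positive if any token is positive."""
    return [1 if any(doc) else 0 for doc in docs]

# Toy token labels for three documents (1 = malicious token).
true_docs = [[0, 0, 1, 1], [0, 0, 0], [1, 0]]
pred_docs = [[0, 1, 1, 0], [0, 0, 0], [0, 0]]

token_f1 = f1_score(flatten(true_docs), flatten(pred_docs))
doc_f1 = f1_score(document_labels(true_docs), document_labels(pred_docs))
print(token_f1, doc_f1)
```

Under this aggregation a detector can miss individual tokens yet still flag the right documents, which is one way a document-level score can exceed the token-level one.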

Keywords: Hidden State Poisoning Attacks, State Space Models, Mamba, Large Language Models, Hybrid LLMs, Adversarial Defense, XGBoost Classifier, Token-level Detection

115. ❌ IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Authors: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12201v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

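Scoring rationale: IndexCache focuses on accelerating sparse attention, a core technique for LLM inference optimization. Highly relevant keywords: 1) ‘Large Language Models’ (the paper explicitly targets agentic workflows for LLMs); 2) ‘Mixture of Experts OR Sparse Models’ (the core topic is sparse attention, and DSA is a representative sparse model); 3) ‘Context Window Extension OR Long Context LLMs’ (it targets long-context scenarios); 4) ‘KV Cache Compression OR FlashAttention’ (it falls under attention optimization); 5) ‘LLM Agents OR Agentic Workflow’ (the paper opens by naming this as the key use case); 6) ‘Speculative Decoding OR Inference Acceleration’ (the core goal is faster inference). Other keywords, such as SLMs, alignment, RAG, and CoT, have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes IndexCache, which reduces redundant computation in sparse attention by reusing top-k indices across layers. On a 30B DSA model it removes 75% of indexer computation, yielding up to 1.82× prefill speedup and up to 1.48× decode speedup with negligible quality loss.

Abstract (Chinese translation)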

长上下文智能体工作流已成为大语言模型的关键应用场景，这使得注意力机制的效率对推理速度和服务成本至关重要。稀疏注意力技术能有效应对这一挑战，其中DeepSeek稀疏注意力（DSA）是代表性的生产级解决方案：其轻量级闪电索引器为每个查询选择最相关的k个令牌，将核心注意力复杂度从$O(L^2)$降至$O(Lk)$。然而，索引器本身仍保持$O(L^2)$复杂度，且必须在每个层独立运行，尽管相邻层产生的top-k选择结果具有高度相似性。本文提出IndexCache方法，通过利用这种跨层冗余性，将网络层划分为两类：少量运行独立索引器的完整层（Full layers）与多数直接复用最近完整层top-k索引的共享层（Shared layers）。我们提出两种互补的配置确定与优化方案：免训练的IndexCache采用贪心搜索算法，通过在校准集上直接最小化语言建模损失来选择保留索引器的层，无需权重更新；支持训练的IndexCache引入多层蒸馏损失，使每个保留的索引器针对其服务的所有层的平均注意力分布进行训练，即使采用简单的交错层模式也能达到全索引器的精度水平。在30B参数的DSA模型上的实验表明，IndexCache可消除75%的索引器计算量且质量损失可忽略，相比标准DSA实现预填充阶段最高加速1.82倍，解码阶段最高加速1.48倍。我们在生产级GLM-5模型上的初步实验进一步验证了这些积极成果（图1）。

Abstract

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer’s top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).

Keywords: sparse attention, inference acceleration, large language models, attention efficiency, cross-layer redundancy, index reuse, prefill speedup, decode speedup
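
The Full/Shared layer split described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: it uses a simple interleaved pattern with a hypothetical `period`, treats "nearest Full layer" as the most recent preceding Full layer, works on made-up indexer scores for a single query, and all function names are invented for the example.

```python
from typing import List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Positions of the k largest scores, returned in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


def interleaved_plan(num_layers: int, period: int) -> List[bool]:
    """True marks a Full layer (runs its own indexer); False marks a Shared layer."""
    return [layer % period == 0 for layer in range(num_layers)]


def select_per_layer(
    indexer_scores: Sequence[Sequence[float]], is_full: Sequence[bool], k: int
) -> Tuple[List[List[int]], int]:
    """Return the token indices used at each layer and the number of indexer runs."""
    cached: List[int] = []
    selected: List[List[int]] = []
    indexer_runs = 0
    for layer, full in enumerate(is_full):
        if full or indexer_runs == 0:
            # Full layer: score all tokens and refresh the cached top-k indices.
            cached = top_k_indices(indexer_scores[layer], k)
            indexer_runs += 1
        # Every layer attends over `cached`; Shared layers reach this line
        # without having run an indexer.
        selected.append(cached)
    return selected, indexer_runs


# Toy example: 8 layers, 6 tokens, one query; the scores are made up.
scores = [[(layer + 1) * ((t * 7) % 5) for t in range(6)] for layer in range(8)]
plan = interleaved_plan(num_layers=8, period=4)
selected, runs = select_per_layer(scores, plan, k=2)
saved = 1 - runs / len(plan)
print(plan, selected[0], runs, saved)
```

With one Full layer in every four, only 2 of the 8 layers run an indexer, which corresponds to the 75% reduction in indexer computation quoted in the abstract. The greedy layer search and the distillation loss are not modeled here.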

116. ❌ QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Authors: Jiayin Lei, Ming Ma, Yunxi Duan, Chenxi Li, Tianming Yang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12165v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

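Scoring rationale: The paper studies quality filtering of synthetic data for LLM training and is highly relevant to keywords such as LLMs, data quality, instruction tuning, supervised fine-tuning, and hallucination mitigation. It proposes the QAQ framework, which assesses data quality through bidirectional semantic coherence and directly targets noise and hallucinations in synthetic data, making ‘Hallucination Mitigation’ a core match. The paper covers the training of code generation models, which counts as an LLM application, but it does not touch other techniques such as MoE, quantization, or inference acceleration.

!!! tip deepseek-chat TL;DR

To address noise and hallucinations in the synthetic data used for LLM training, the paper proposes QAQ, a data selection framework based on bidirectional semantic coherence that identifies high-quality data by measuring how well the answer predicts the query. Experiments show that training on only 25% of the data matches full-data performance.

Abstract (Chinese translation)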

合成数据已成为训练代码生成模型的关键资源，但其引入的显著噪声与幻觉问题难以通过现有指标有效检测。当前主流的数据选择方法（如指令遵循难度IFD）通常评估模型在给定查询时生成答案的难度（$A|Q$）。然而，该指标在噪声合成数据上存在歧义：低概率可能源于任务本身的内在复杂性，也可能由模型产生的幻觉导致。为此，我们提出QAQ——一种新颖的数据选择框架，从反向角度评估数据质量：答案能在多大程度上预测查询（$Q|A$）？我们定义了反向互信息（Reverse Mutual Information, RMI）来量化在给定答案条件下关于查询的信息增益。分析表明，RMI的两个极端值均暗示质量问题：低RMI反映语义失准，而过高的RMI可能包含大语言模型（LLMs）易于识别的缺陷模式。此外，我们引入一种基于强弱模型分歧的选择策略，以识别有效但具有挑战性的样本。在WarriorCoder数据集上的实验表明，使用分层RMI仅选择25%的数据进行训练，即可达到与全数据训练相当的性能，显著优于现有数据选择方法。本方法凸显了双向语义连贯性在合成数据构建中的重要性，为在不牺牲模型能力的前提下降低计算成本提供了可扩展的路径。

Abstract

Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may contain defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.

Keywords: synthetic data, data selection, code generation models, hallucination mitigation, bidirectional semantic coherence, Reverse Mutual Information, instruction following, model training
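
The reverse-direction scoring idea can be sketched as follows. The abstract does not give the RMI formula or the stratification procedure, so this example makes two labeled assumptions: RMI is taken as the pointwise quantity log p(Q|A) − log p(Q) computed from sequence log-probabilities supplied by some language model, and "stratified" selection is approximated by dropping the lowest and highest RMI strata and sampling evenly from the middle. All names and numbers are hypothetical.

```python
from typing import Dict, List


def reverse_mutual_information(logp_q_given_a: float, logp_q: float) -> float:
    """Assumed RMI: information gain about the query from the answer, in nats."""
    return logp_q_given_a - logp_q


def stratified_select(
    rmi: Dict[str, float], keep_fraction: float, num_strata: int = 4
) -> List[str]:
    """Keep about `keep_fraction` of the samples, drawn from the middle RMI strata."""
    ranked = sorted(rmi, key=lambda sample_id: (rmi[sample_id], sample_id))
    stratum_size = len(ranked) // num_strata
    # Drop the lowest and highest strata, which the abstract flags as suspect.
    middle = ranked[stratum_size: len(ranked) - stratum_size]
    budget = max(1, round(keep_fraction * len(ranked)))
    step = max(1, len(middle) // budget)
    return middle[::step][:budget]


# Toy example: eight samples with made-up RMI values 0.0 .. 7.0.
toy_rmi = {f"s{i}": float(i) for i in range(8)}
kept = stratified_select(toy_rmi, keep_fraction=0.25)
print(kept)
```

The strong-versus-weak model disagreement signal mentioned in the abstract is not modeled in this sketch.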

117. ❌ LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Authors: Feiyu Duan, Xuanjing Huang, Zhongyu Wei Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12152v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

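Scoring rationale: The paper explicitly addresses the use of LLMs in evaluating personalized assistants, so the ‘Large Language Models’ keyword is rated 10. It proposes the LifeSim user simulator, which models user cognition and behavior with the BDI model and falls within LLM agent research, so the ‘LLM Agents’ keyword is also rated 10. Other keywords, such as MoE, quantization, inference acceleration, and alignment techniques, are not covered and are rated 0.

!!! tip deepseek-chat TL;DR

To address the gap between existing personalized AI assistant benchmarks and real user interactions, the paper proposes the LifeSim user simulator and the LifeSim-Eval benchmark. Experiments show that current LLMs have significant limitations in handling implicit intentions and in modeling long-term user preferences.

Abstract (Chinese translation)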

大型语言模型（LLM）的快速发展加速了通用人工智能助手的研究进程。然而，现有的个性化助手评估基准仍与实际用户-助手交互场景存在偏差，未能充分捕捉外部环境与用户认知状态的复杂性。为弥补这一差距，我们提出了LifeSim，一种用户模拟器，它通过在物理环境中应用信念-期望-意图（Belief-Desire-Intention, BDI）模型来模拟用户认知，以生成连贯的生活轨迹，并模拟意图驱动的用户交互行为。基于LifeSim，我们构建了LifeSim-Eval，一个面向多场景、长周期个性化助手的综合性评估基准。LifeSim-Eval涵盖8个生活领域和1,200个多样化场景，采用多轮交互式评估方法，以测试模型在完成显性与隐性意图、还原用户画像以及生成高质量回应等方面的能力。在单场景与长周期两种设定下的实验表明，当前大型语言模型在处理隐性意图和长期用户偏好建模方面仍存在显著局限。

Abstract

The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users’ cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments for coherent life trajectories generation, and simulates intention-driven user interactive behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive method to assess models’ abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intention and long-term user preference modeling.

Keywords: Large Language Models, Personalized Assistant, User Simulator, BDI Model, LifeSim, LifeSim-Eval, Implicit Intention, Long-term Preference Modeling

118. ❌ Linking Perception, Confidence and Accuracy in MLLMs

Authors: Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, Boyu Yang, Ming Kong, Jie Liu, Qiang Zhu Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12149v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

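Scoring rationale: The paper studies confidence calibration in multi-modal large language models (MLLMs), an innovative application of LLM technical principles. Core keywords: 1) ‘Large Language Models’ (10): the paper explicitly studies MLLMs, the multimodal extension of LLMs; 2) ‘Self-Correction OR Self-Improvement OR Self-Reflection’ (10): the paper proposes a Self-Reflection module as a core component of its method. Partially relevant keywords: 1) ‘Hallucination Mitigation OR Factuality OR Truthfulness’ (5): confidence calibration helps improve factuality; 2) ‘Mechanistic Interpretability OR Explainable AI’ (5): confidence analysis touches on model interpretability. Other keywords have no direct connection to the paper and are rated 0.

!!! tip deepseek-chat TL;DR

The paper reveals severe confidence miscalibration in multi-modal large language models and proposes a confidence-driven reinforcement learning framework together with a confidence-aware test-time scaling method, achieving consistent 8.8% gains across four benchmarks.

Abstract (Chinese translation)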

多模态大语言模型（MLLMs）的最新进展主要集中于增强视觉感知以提高准确性。然而，一个关键问题尚未被探索：模型是否知道它们何时不知道？通过一项探测性实验，我们揭示了MLLMs中存在严重的置信度误校准问题。为解决此问题，我们提出置信度驱动的强化学习（Confidence-Driven Reinforcement Learning, CDRL），该方法利用原始-噪声图像对和一种新颖的基于置信度的奖励机制，以增强感知敏感性并稳健地校准模型的置信度。除了训练阶段的优势，校准后的置信度能够作为一种“免费午餐”，在测试阶段实现更有效的扩展。我们进一步提出置信度感知的测试时扩展（Confidence-Aware Test-Time Scaling, CA-TTS），该方法在置信度信号的引导下动态协调自一致性、自反思和视觉自检模块。一个专家模型扮演多重角色（如规划者、评判者、投票者）来调度这些模块并提供外部验证。我们的集成框架在四个基准测试中取得了新的最先进成果，实现了8.8%的稳定性能提升。更多消融研究证明了每个模块的有效性以及扩展方案的优越性。

Abstract

Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model’s confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.

Keywords: Multi-modal Large Language Models, Confidence Calibration, Reinforcement Learning, Self-Reflection, Test-Time Scaling, Perceptual Sensitivity, Confidence-Driven, State-of-the-Art
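
To make "confidence miscalibration" concrete, the sketch below computes expected calibration error (ECE), a widely used calibration metric. The abstract does not say which metric the authors use in their probing experiment, so ECE is a stand-in chosen for illustration, and the confidences and correctness labels are made up.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], num_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    buckets: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        buckets[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


# An overconfident toy model: 90% stated confidence, 50% actual accuracy.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, False, False]
)
print(round(overconfident, 3))
```

A well-calibrated model would drive this gap toward zero, which is the property that makes confidence usable as a control signal at test time.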

119. ❌ To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times

Authors: Thomas Hikaru Clark, Carlos Arriaga, Javier Conde, Gonzalo Martínez, Pedro Reviriego Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12105v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

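Scoring rationale: The paper studies the application of LLMs to psycholinguistics, using supervised fine-tuning (SFT) to predict sentence-level features (memorability and reading times) tied to human cognitive measures. ‘Large Language Models’ and ‘Post-training/SFT’ are therefore highly relevant (10). The paper involves zero-shot and few-shot prompting, which relates somewhat to ‘In-context Learning’ (5). The work applies AI to science (psychology), which relates somewhat to ‘AI for Science’ (5). Other keywords, such as MoE, quantization, and RAG, are not covered and are rated 0.

!!! tip deepseek-chat TL;DR

The paper studies how well large language models, after supervised fine-tuning, can predict sentence-level psycholinguistic features (memorability and reading times). Fine-tuned models provide estimates that correlate with human judgments and outperform interpretable baseline predictors, but zero-shot and few-shot performance is very mixed.

Abstract (Chinese translation)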

近期研究表明，大型语言模型能够通过零样本提示方式，生成对词汇及多词表达的心理语言学指标（如效价、唤醒度或具体性）的估计值，这些估计值与人类判断具有相关性。此类估计通过向大型语言模型提出与人类研究相似的问题提示而获得。与此同时，对于词汇决策时间或习得年龄等其他指标，大型语言模型需要进行监督微调才能获得与真实值相符的结果。本文将此方法扩展至此前未被研究的句子可记忆性与阅读时间特征，这些特征涉及句子层面语境中多个词汇间的关联。我们的研究结果表明，通过微调，模型能够提供与人类衍生的规范相关且超越可解释基线预测器预测能力的估计值，这证明大型语言模型蕴含关于句子层面特征的有效信息。同时，我们的研究显示出零样本和少样本性能表现存在显著差异，这进一步表明在使用大型语言模型提示作为人类认知测量替代指标时需要格外审慎。

Abstract

Large Language Models (LLMs) have recently been shown to produce estimates of psycholinguistic norms, such as valence, arousal, or concreteness, for words and multiword expressions, that correlate with human judgments. These estimates are obtained by prompting an LLM, in zero-shot fashion, with a question similar to those used in human studies. Meanwhile, for other norms such as lexical decision time or age of acquisition, LLMs require supervised fine-tuning to obtain results that align with ground-truth values. In this paper, we extend this approach to the previously unstudied features of sentence memorability and reading times, which involve the relationship between multiple words in a sentence-level context. Our results show that via fine-tuning, models can provide estimates that correlate with human-derived norms and exceed the predictive power of interpretable baseline predictors, demonstrating that LLMs contain useful information about sentence-level features. At the same time, our results show very mixed zero-shot and few-shot performance, providing further evidence that care is needed when using LLM-prompting as a proxy for human cognitive measures.

Keywords: Large Language Models, psycholinguistic norms, sentence memorability, reading times, supervised fine-tuning, zero-shot prompting, human cognitive measures, sentence-level features

120. ❌ Translationese as a Rational Response to Translation Task Difficulty

Authors: Maria Kunilovskaya Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12050v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

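Scoring rationale: The paper studies the relationship between translation task difficulty and translationese, mainly using LLM surprisal as an information-theoretic measure of translation task difficulty, so it relates somewhat to ‘Large Language Models OR LLMs OR Foundation Models’ (rated 5). It does not involve the LLM technical innovations described by the other keywords (such as MoE, quantization, or inference acceleration) or specific application areas (such as bioinformatics), so the other keywords are rated 0.

!!! tip deepseek-chat TL;DR

The paper studies whether translationese can be attributed to the cognitive load of the translation task itself. Quantifying translation difficulty with information-theoretic metrics based on LLM surprisal, it finds that translationese is partly explained by translation task difficulty, especially in English-to-German, and that in most experiments cross-lingual transfer difficulty contributes more than source-text complexity.

Abstract (Chinese translation)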

译文与目标语言原创文本存在系统性差异，这一现象被广泛称为“翻译腔”。翻译腔常被归因于产出倾向（如干扰、简化）、社会文化变量及语言对效应，但目前仍缺乏统一的解释框架。我们认为，翻译腔反映了翻译任务本身固有的认知负荷。本研究通过可量化的翻译任务难度指标，检验其能否预测可观测的翻译腔特征。我们将翻译腔操作化为由自动分类器生成的片段级“翻译程度分数”，并将翻译任务难度概念化为源文本难度与跨语言转换难度的复合体，主要基于大语言模型（LLM）惊异值的信息论指标进行量化，辅以成熟的句法与语义替代指标。实验采用包含书面语与口语子库的英德双向平行语料库。结果表明，翻译任务难度可部分解释翻译腔现象，尤其在英译德方向更为显著。在多数实验中，跨语言转换难度比源文本复杂度的影响更为突出。在书面语模式下，信息论指标与传统特征表现相当或更优，但在口语模式下未显现优势。源文本句法复杂度和翻译解决方案熵值，在所有语言对与语态中均成为预测翻译腔的最强指标。

Abstract

Translations systematically diverge from texts originally produced in the target language, a phenomenon widely referred to as translationese. Translationese has been attributed to production tendencies (e.g. interference, simplification), socio-cultural variables, and language-pair effects, yet a unified explanatory account is still lacking. We propose that translationese reflects cognitive load inherent in the translation task itself. We test whether observable translationese can be predicted from quantifiable measures of translation task difficulty. Translationese is operationalised as a segment-level translatedness score produced by an automatic classifier. Translation task difficulty is conceptualised as comprising source-text and cross-lingual transfer components, operationalised mainly through information-theoretic metrics based on LLM surprisal, complemented by established syntactic and semantic alternatives. We use a bidirectional English-German corpus comprising written and spoken subcorpora. Results indicate that translationese can be partly explained by translation task difficulty, especially in English-to-German. For most experiments, cross-lingual transfer difficulty contributes more than source-text complexity. Information-theoretic indicators match or outperform traditional features in written mode, but offer no advantage in spoken mode. Source-text syntactic complexity and translation-solution entropy emerged as the strongest predictors of translationese across language pairs and modes.

Keywords: translationese, translation task difficulty, LLM surprisal, information-theoretic metrics, English-German corpus, cross-lingual transfer, source-text complexity, translation-solution entropy
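
LLM surprisal, the basis of the information-theoretic metrics mentioned above, is conventionally defined as the negative log probability a model assigns to a token in its context. A minimal sketch, assuming per-token probabilities have already been obtained from some language model; the function names and the choice to average over a segment are illustrative and may differ from the paper's exact operationalisation.

```python
import math
from typing import Sequence


def surprisal_bits(prob: float) -> float:
    """Surprisal of one token in bits: -log2 p(token | context)."""
    return -math.log2(prob)


def mean_segment_surprisal(token_probs: Sequence[float]) -> float:
    """Average per-token surprisal over a segment, as a rough difficulty proxy."""
    return sum(surprisal_bits(p) for p in token_probs) / len(token_probs)


# Toy example with made-up token probabilities for two segments.
easy_segment = [0.5, 0.5, 0.25]
hard_segment = [0.125, 0.0625, 0.25]
print(mean_segment_surprisal(easy_segment), mean_segment_surprisal(hard_segment))
```

Segments whose tokens the model finds less predictable receive higher mean surprisal, which is the sense in which surprisal can stand in for difficulty.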

121. ❌ CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

Authors: Pranav Raikote, Korbinian Randl, Ioanna Miliou, Athanasios Lakes, Panagiotis Papapetrou Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11957v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

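Scoring rationale: CHiL(L)Grader focuses on automated grading in educational assessment with large language models (LLMs). Its core innovation is combining calibrated confidence estimation with a human-in-the-loop workflow to address the overconfidence and reliability problems of instruction-tuned models. It is therefore highly relevant to ‘Large Language Models’ and ‘Instruction Tuning’ (10), since the paper explicitly uses LLMs for grading and deals with calibrating instruction-tuned models. Other keywords, such as MoE, SLMs, Scaling Laws, RLHF, and RAG, are not mentioned in the abstract or are unrelated to the topic, so they are rated 0. The paper is an application of LLMs to education, which fits the research background, but it does not involve specific scientific fields such as biomedicine.

!!! tip deepseek-chat TL;DR

To address the overconfidence and reliability problems of large language models in automated educational grading, the paper proposes the CHiL(L)Grader framework. Using confidence calibration and a human-in-the-loop workflow, it automatically scores 35-65% of responses at expert-level quality and routes uncertain cases to human graders.

Abstract (Chinese translation)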

利用大语言模型扩展教育评估不仅需要准确性，更需具备判断预测结果可信度的能力。经过指令微调的模型往往表现出过度自信，且其可靠性会随着课程体系演变而下降，这导致高风险场景下的全自主部署存在安全隐患。本文提出CHiL(L)Grader——首个将校准化置信度估计融入人机协同工作流的自动评分框架。通过事后温度缩放、基于置信度的选择性预测以及持续学习技术，该框架仅对高置信度预测进行自动化评分，同时将不确定案例路由至人工评分员处理，并能适应不断更新的评分标准和未见过的题目。在三个简答题评分数据集上的实验表明，CHiL(L)Grader能以专家级质量（QWK >= 0.80）自动评估35-65%的作答。接受预测与拒绝预测之间0.347的QWK差距证实了基于置信度的路由机制的有效性。每个修正周期通过吸收教师反馈持续增强模型评分能力。这些结果表明，不确定性量化是实现可靠人工智能辅助评分的关键。

Abstract

Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model’s grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.

Keywords: large language models, automated grading, confidence calibration, human-in-the-loop, instruction-tuned models, uncertainty quantification, short-answer grading, continual learning
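
The combination of post-hoc temperature scaling and confidence-based selective prediction named in the abstract can be sketched as follows. This is an illustration under assumptions, not the CHiL(L)Grader code: the logits, the temperature, and the routing threshold are made-up values, and in practice the temperature would be fitted on held-out data.

```python
import math
from typing import List, Sequence, Tuple


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax over logits divided by a temperature (T > 1 softens confidence)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def route(
    logits: Sequence[float], temperature: float, threshold: float
) -> Tuple[int, float, str]:
    """Return (predicted grade index, confidence, 'auto' or 'human')."""
    probs = softmax_with_temperature(logits, temperature)
    confidence = max(probs)
    grade = probs.index(confidence)
    decision = "auto" if confidence >= threshold else "human"
    return grade, confidence, decision


# Toy example: the same logits before and after softening with T = 2.
raw = route([4.0, 0.0, 0.0], temperature=1.0, threshold=0.9)
calibrated = route([4.0, 0.0, 0.0], temperature=2.0, threshold=0.9)
print(raw, calibrated)
```

Without scaling, the toy prediction clears the threshold and would be graded automatically; after softening, the same prediction falls below it and is routed to a human grader. The predicted grade itself does not change, only the confidence attached to it.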

122. ❌ Resurfacing Paralinguistic Awareness in Large Audio Language Models

Authors: Hao Yang, Minghan Wang, Tongtong Wu, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11947v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

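Scoring rationale: The paper studies paralinguistic awareness in Large Audio Language Models (LALMs), an application of LLMs to the audio modality. Its core contribution is a paralinguistic-enhanced fine-tuning (PE-FT) protocol comprising selective-layer fine-tuning and an auxiliary classification head, which directly involves LLM fine-tuning techniques. It is therefore highly relevant to ‘Large Language Models’ (10) and ‘Post-training/SFT’ (10), relevant to ‘PEFT/LoRA’ (8, because PE-FT is a parameter-efficient fine-tuning method), and somewhat relevant to ‘Mechanistic Interpretability’ (5, because the paper runs layer-wise analyses to identify paralinguistic layers). Other keywords, such as MoE, quantization, and inference acceleration, are not covered and are rated 0.

!!! tip deepseek-chat TL;DR

To address the tendency of Large Audio Language Models to ignore paralinguistic cues, the paper proposes a paralinguistic-enhanced fine-tuning protocol that improves paralinguistic awareness through selective-layer fine-tuning and an auxiliary classification head.

Abstract (Chinese translation)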

大型音频语言模型（LALMs）将人机交互扩展至语音模态，由于副语言线索隐式指示用户情境，这带来了巨大的交互潜力。然而，在当前以内容为中心的范式基础上，LALMs通常忽视此类副语言线索，仅基于查询内容进行回应。在本研究中，为重新唤醒LALMs中的副语言感知能力，我们引入了五种分层分析方法，共同识别副语言处理层与语义理解层。基于这些发现，我们相应提出了一种副语言增强微调（PE-FT）方案，使LALMs具备副语言感知能力，包括：（1）选择性层级微调，以及（2）辅助的双层级分类头。实验表明，PE-FT方案能高效且有效地重建副语言感知能力，其表现甚至超越全层级微调策略。

摘要 (Abstract)

Large Audio Language Models (LALMs) have expanded the interaction with human to speech modality, which introduces great interactive potential, due to the paralinguistic cues implicitly indicating the user context. However, building on the current content-centred paradigm, LALMs usually neglect such paralinguistic cues and respond solely based on query content. In this work, to resurface the paralinguistic awareness in LALMs, we introduce five diverse layer-wise analyses to jointly identify paralinguistic layers and semantic understanding layers. Based on these insights, we propose a paralinguistic-enhanced fine-tuning (PE-FT) protocol accordingly to equip LALMs with paralinguistic-aware capabilities, including (1) selective-layer fine-tuning, and (2) an auxiliary dual-level classification head. Our experiments demonstrate that PE-FT protocol efficiently and effectively resurfaces the paralinguistic awareness, even surpassing the performance of the all-layer fine-tuning strategy.

Keywords: Large Audio Language Models, paralinguistic awareness, layer-wise analysis, selective-layer fine-tuning, parameter-efficient fine-tuning, audio modality, dual-level classification head, PE-FT protocol
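
The selective-layer fine-tuning idea named in the abstract can be sketched without any deep-learning framework: given a model's parameter names, mark as trainable only those that belong to the chosen layers or to the auxiliary head, and freeze the rest. This is a minimal illustration, not the authors' implementation. The `layers.<i>.` naming scheme, the `aux_head.` prefix, and the chosen layer indices are all hypothetical, since the abstract does not say which layers PE-FT selects.

```python
import re

def select_trainable(param_names, tuned_layers, head_prefix="aux_head."):
    """Map each parameter name to True (trainable) or False (frozen)."""
    # ASSUMPTION: transformer blocks are named "layers.<index>.<...>".
    layer_pattern = re.compile(r"(?:^|\.)layers\.(\d+)\.")
    plan = {}
    for name in param_names:
        match = layer_pattern.search(name)
        in_tuned_layer = match is not None and int(match.group(1)) in tuned_layers
        # The auxiliary classification head is always trained.
        plan[name] = in_tuned_layer or name.startswith(head_prefix)
    return plan

# Hypothetical parameter names for a small 4-layer model.
names = [f"layers.{i}.attn.weight" for i in range(4)]
names += ["embed.weight", "aux_head.classifier.weight"]

plan = select_trainable(names, tuned_layers={1, 2})
trainable = [n for n, flag in plan.items() if flag]
```

In a framework such as PyTorch, a plan like this would typically be applied by setting `requires_grad` on each named parameter before building the optimizer.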

123. ❌ CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?

Authors: Ruirui Chen, Weifeng Jiang, Chengwei Qin, Cheston Tan Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11915v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

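Scoring rationale: The paper's core topic is evaluating LLMs on Theory of Mind tasks, so it is highly relevant to 'Large Language Models' (10). It touches on evaluating reasoning and cognitive abilities, which gives it some connection to 'Chain of Thought' and 'System 2 Thinking' (5 each), though these are not its technical core. The remaining keywords cover specific techniques (such as MoE, quantization, and alignment) or particular application areas (such as AI for Science) that the paper does not address, so they all receive 0.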
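
The 0.0 total despite a 10/10 relevance rating follows from the weights, which the table lists as 0.0 for every keyword. The sketch below illustrates that arithmetic. The passing threshold uses the formula from the report's configuration header (5.0 + 0.8 × total weight, with a total weight of 27.0). The per-keyword formula (weight × relevance) is an assumption, because the report does not state how the score column is derived.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold as given in the report header: base + factor * total weight."""
    return base + factor * total_weight

def paper_score(rows):
    """Total score for one paper.

    rows: iterable of (weight, relevance) pairs.
    ASSUMPTION: each keyword contributes weight * relevance.
    """
    return sum(weight * relevance for weight, relevance in rows)

# Rows of the CoMMET table with non-zero relevance; every weight is listed as 0.0.
commet_rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0)]

threshold = passing_score(27.0)
score = paper_score(commet_rows)
passed = score >= threshold
```

Under any formula that multiplies by the weight, a zero weight removes the keyword's contribution, so the paper cannot reach the 26.6 threshold.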

!!! tip deepseek-chat TL;DR

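The paper introduces CoMMET, a multimodal benchmark dataset for evaluating large language models on Theory of Mind tasks, and uses a comprehensive evaluation to reveal the strengths and limitations of current models.

Abstract translation

Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become widespread in real-world applications, verifying whether they possess this level of social reasoning is essential for effective and natural interaction. However, existing benchmarks for assessing ToM in LLMs are limited: most rely solely on text input and focus narrowly on belief-related tasks. This paper proposes a new multimodal benchmark dataset, CoMMET (Comprehensive Mental states and Moral Evaluation Task), inspired by the Theory of Mind Booklet Task. CoMMET broadens the scope of evaluation by covering a wider range of mental states and introducing multi-turn testing. To the best of our knowledge, it is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive evaluation of LLMs of different families and sizes, we analyze the strengths and limitations of current models and point out directions for future improvement. Our work offers a new perspective for understanding the social cognitive capabilities of modern LLMs.

Abstract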

Theory of Mind (ToM)-the ability to reason about the mental states of oneself and others-is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.

Keywords: Theory of Mind, Large Language Models, multimodal benchmark, social reasoning, mental states, evaluation, cognitive capabilities, CoMMET

124. ❌ Chem4DLLM: 4D Multimodal LLMs for Chemical Dynamics Understanding

Authors: Xinyu Li, Zhen Zhang, Qi Chen, Anton van den Hengel, Lina Yao, Javen Qinfeng Shi Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11924v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

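Scoring rationale: The paper's core is the Chem4DLLM model, which combines a pretrained large language model (LLM) with a graph encoder for the Chemical Dynamics Understanding (ChemDU) task, an innovative application of large models to science (AI for Science/Cheminformatics). It explicitly uses a pretrained LLM (relevance 10), involves a cheminformatics application (relevance 10), and requires the model to reason over molecular trajectories (Chain of Thought/System 2 Thinking, relevance 10). Building the model may involve pre-training and fine-tuning (relevance 5), and generating explanatory narratives relates to explainable AI (relevance 5). Other keywords such as MoE, SLMs, alignment, RAG, and acceleration do not appear in the abstract and are scored 0.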

!!! tip deepseek-chat TL;DR

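For the chemical dynamics understanding task, the paper proposes Chem4DLLM, which integrates an equivariant graph encoder with a pretrained large language model to turn 4D molecular trajectories into natural-language explanations, and builds Chem4DBench, the first dataset pairing such trajectories with explanations.

Abstract translation

Existing chemical understanding tasks rely mainly on static molecular representations, which limits their ability to model inherently dynamic phenomena such as bond breaking or conformational change, processes that are essential for chemists to understand chemical reactions. To fill this gap, we propose Chemical Dynamics Understanding (ChemDU), a new task whose goal is to translate 4D molecular trajectories into interpretable natural-language descriptions. ChemDU focuses on fundamental dynamic scenarios, including gas-phase and catalytic reactions, and requires models to reason about key events along molecular trajectories (such as bond formation and dissociation) and to generate coherent, mechanistically grounded narratives. To evaluate this capability, we construct Chem4DBench, the first dataset that pairs 4D molecular trajectories with expert-written explanations across these scenarios. We further propose Chem4DLLM, a unified model that combines an equivariant graph encoder with a pretrained large language model to explicitly capture molecular geometry and rotational dynamics. We hope that the ChemDU task, together with Chem4DBench and Chem4DLLM, will stimulate further research in dynamic chemical understanding and multimodal scientific reasoning.

Abstract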

Existing chemical understanding tasks primarily rely on static molecular representations, limiting their ability to model inherently dynamic phenomena such as bond breaking or conformational changes, which are essential for a chemist to understand chemical reactions. To address this gap, we introduce Chemical Dynamics Understanding (ChemDU), a new task that translates 4D molecular trajectories into interpretable natural-language explanations. ChemDU focuses on fundamental dynamic scenarios, including gas-phase and catalytic reactions, and requires models to reason about key events along molecular trajectories, such as bond formation and dissociation, and to generate coherent, mechanistically grounded narratives. To benchmark this capability, we construct Chem4DBench, the first dataset pairing 4D molecular trajectories with expert-authored explanations across these settings. We further propose Chem4DLLM, a unified model that integrates an equivariant graph encoder with a pretrained large language model to explicitly capture molecular geometry and rotational dynamics. We hope that ChemDU, together with Chem4DBench and Chem4DLLM, will stimulate further research in dynamic chemical understanding and multimodal scientific reasoning.

Keywords: Chemical Dynamics Understanding, 4D molecular trajectories, large language model, multimodal scientific reasoning, equivariant graph encoder, natural-language explanations, Chem4DBench dataset, chemical reactions
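
To make concrete what a "key event" along a 4D trajectory looks like, the sketch below flags bond formation and dissociation by watching one interatomic distance cross a cutoff from frame to frame. It is not the Chem4DLLM method, which the abstract describes as an equivariant graph encoder coupled to a pretrained LLM. It only illustrates the kind of event the ChemDU task asks a model to narrate. The 1.6 Å cutoff and the toy trajectory are assumptions chosen for illustration.

```python
import math

def bond_events(trajectory, i, j, cutoff=1.6):
    """Return (frame_index, event) pairs where the i-j distance crosses `cutoff`.

    trajectory: list of frames, each a list of (x, y, z) coordinates in angstroms.
    ASSUMPTION: a pair counts as bonded when its distance is <= cutoff.
    """
    events = []
    previously_bonded = None
    for frame_index, frame in enumerate(trajectory):
        bonded = math.dist(frame[i], frame[j]) <= cutoff
        if previously_bonded is not None and bonded != previously_bonded:
            events.append((frame_index, "formation" if bonded else "dissociation"))
        previously_bonded = bonded
    return events

# Toy two-atom trajectory: the atoms drift apart, then come back together.
toy = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.4, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)],
]
events = bond_events(toy, 0, 1)
```

Because interatomic distances do not change when the whole molecule is rotated or translated, a distance-based detector like this gives the same events regardless of orientation.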

125. ❌ DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining

Authors: Yutong Yan, Raphael Tang, Zhenyu Gao, Wenxi Jiang, Yao Lu Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11838v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

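Scoring rationale: The paper's core topic is applying large language models in finance, using time-aware pretraining to address lookahead bias. Highly relevant keywords: Large Language Models (the core object of study), Pre-training (twelve models trained from scratch), Instruction Tuning (on general-domain and finance-specific datasets), and Supervised Fine-tuning (instruction fine-tuning falls under SFT). Keywords with some connection: Hallucination Mitigation (addressing lookahead bias can be seen as improving factuality) and AI for Science (financial applications are counted as scientific applications). Other keywords such as MoE, quantization, and RAG are not covered in the paper.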

!!! tip deepseek-chat TL;DR

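To address lookahead bias when large language models are used in financial backtesting, the paper proposes the DatedGPT family of models, whose time-partitioned pretraining and instruction fine-tuning keep each model's knowledge bounded by a specific year, effectively preventing lookahead bias while remaining competitive.

Abstract translation

In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias, which can undermine their forecasting validity because the models may already have seen the true outcomes during training. To address this, we present DatedGPT, a family of twelve 1.3B-parameter language models, each trained from scratch on approximately 100 billion tokens of strictly time-partitioned data, with annual cutoffs spanning 2013 to 2024. We further enhance each model with instruction fine-tuning on datasets covering both the general domain and the finance domain, all curated to respect the same temporal boundaries. Perplexity-based probing confirms that each model's knowledge is effectively bounded by its data cutoff year, and the models perform comparably to existing models of similar scale on standard benchmarks. We provide an interactive web demo that lets users query and compare responses from models with different cutoff years.

Abstract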

In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias that undermines their forecasting validity, as they may have already seen the true outcome during training. To address this, we present DatedGPT, a family of twelve 1.3B-parameter language models, each trained from scratch on approximately 100 billion tokens of temporally partitioned data with strict annual cutoffs spanning 2013 to 2024. We further enhance each model with instruction fine-tuning on both general-domain and finance-specific datasets curated to respect the same temporal boundaries. Perplexity-based probing confirms that each model’s knowledge is effectively bounded by its data cutoff year, while evaluation on standard benchmarks shows competitive performance with existing models of similar scale. We provide an interactive web demo that allows users to query and compare responses from models across different cutoff years.

Keywords: Large Language Models, Time-Aware Pretraining, Lookahead Bias, Financial Backtesting, Instruction Fine-tuning, Temporal Partitioning, DatedGPT, Forecasting Validity
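
The strict annual cutoffs described in the abstract amount to a date filter applied once per cutoff year. The sketch below builds one corpus per year from 2013 to 2024, keeping only documents dated on or before 31 December of that year. It assumes the corpora are cumulative (each model sees everything up to its cutoff), which the abstract does not state explicitly, and the sample documents are invented for illustration.

```python
from datetime import date

def partition_by_cutoff(documents, cutoff_years):
    """Build one corpus per cutoff year.

    documents: iterable of (publication_date, text) pairs.
    ASSUMPTION: a corpus contains every document dated on or before
    31 December of its cutoff year (cumulative partitioning).
    """
    documents = list(documents)
    corpora = {}
    for year in cutoff_years:
        end = date(year, 12, 31)
        corpora[year] = [text for doc_date, text in documents if doc_date <= end]
    return corpora

# Hypothetical dated documents.
docs = [
    (date(2012, 6, 1), "older article"),
    (date(2013, 12, 31), "year-end 2013 report"),
    (date(2014, 1, 1), "new-year 2014 post"),
    (date(2024, 3, 5), "2024 filing"),
]
corpora = partition_by_cutoff(docs, range(2013, 2025))
```

A document dated 1 January 2014 is excluded from the 2013 corpus, which is the property that keeps a 2013-cutoff model from seeing later outcomes.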

126. ❌ Large Language Models for Biomedical Article Classification

Authors: Jakub Proboszcz, Paweł Cichosz Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11780v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

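Scoring rationale: The paper's core topic is applying large language models to biomedical article classification, so it is highly relevant to 'Large Language Models' and 'AI for Science' (10 each). It involves few-shot prompting, which relates to 'In-context Learning' (5), and it uses small and mid-size models, which gives some connection to 'Small Language Models' (5). It does not address the technical principles or innovations behind the other keywords, such as MoE, scaling laws, training methods, inference optimization, or agent systems, so those keywords are scored 0.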

!!! tip deepseek-chat TL;DR

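The paper systematically evaluates large language models as text classifiers for biomedical article classification and finds that the PR AUC of zero-shot and few-shot prompting comes close to that of conventional classification algorithms, confirming the practical utility of LLMs in non-trivial domains.

Abstract translation

This study presents a systematic and in-depth investigation of the utility of large language models as text classifiers for biomedical article classification. It uses several small and mid-size open-source models together with selected closed-source ones, and it covers a wider range of evaluated configurations than most prior work: different prompt types, output processing methods for producing class and class-probability predictions, and the number and selection strategy of few-shot examples. The performance of the most successful configurations is compared with that of conventional classification algorithms. Across 15 challenging datasets, the average PR AUC is above 0.4 for zero-shot prompting and nearly 0.5 for few-shot prompting, close to the naïve Bayes classifier (0.5), the random forest algorithm (0.5 with default settings, 0.55 with hyperparameter tuning), and fine-tuned transformer models (0.5). These results confirm the utility of large language models as text classifiers in non-trivial domains and yield practical recommendations on the most promising setups, in particular the use of output token probabilities for class-probability prediction.

Abstract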

This work presents a systematic and in-depth investigation of the utility of large language models as text classifiers for biomedical article classification. The study uses several small and mid-size open source models, as well as selected closed source ones, and is more comprehensive than most prior work with respect to the scope of evaluated configurations: different types of prompts, output processing methods for generating both class and class probability predictions, as well as few-shot example counts and selection methods. The performance of the most successful configurations is compared to that of conventional classification algorithms. The obtained average PR AUC over 15 challenging datasets above 0.4 for zero-shot prompting and nearly 0.5 for few-shot prompting comes close to that of the naïve Bayes classifier (0.5), the random forest algorithm (0.5 with default settings or 0.55 with hyperparameter tuning) and fine-tuned transformer models (0.5). These results confirm the utility of large language models as text classifiers for non-trivial domains and provide practical recommendations of the most promising setups, including in particular using output token probabilities for class probability prediction.

Keywords: Large Language Models, Biomedical Article Classification, Text Classification, Zero-shot Prompting, Few-shot Prompting, PR AUC, Transformer Models, Biomedical Informatics
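
The abstract's recommendation to use output token probabilities for class-probability prediction can be illustrated as follows: take the log-probabilities a model assigns to the two label tokens, renormalize them into a positive-class probability, and rank articles by that probability to compute a PR-AUC-style metric. This is a sketch under assumptions. The "yes"/"no" label tokens, the log-probability values, and the use of average precision as the PR AUC estimator are all illustrative choices not confirmed by the abstract.

```python
import math

def class_probability(logprob_yes, logprob_no):
    """Renormalize the two label-token log-probabilities into P(positive)."""
    p_yes, p_no = math.exp(logprob_yes), math.exp(logprob_no)
    return p_yes / (p_yes + p_no)

def average_precision(labels, scores):
    """Mean of the precision values at each positive, ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical (logprob "yes", logprob "no") pairs for four articles, with true labels.
logprobs = [(-0.1, -2.5), (-1.2, -0.4), (-0.7, -0.7), (-3.0, -0.05)]
labels = [1, 0, 1, 0]
scores = [class_probability(y, n) for y, n in logprobs]
ap = average_precision(labels, scores)
```

Using a continuous probability rather than the generated label alone is what makes a ranking metric such as PR AUC computable in the first place.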

127. ❌ Trust Oriented Explainable AI for Fake News Detection

Authors: Krzysztof Siwek, Daniel Stankowski, Maciej Stodolski Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11778v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

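Scoring rationale: The paper studies the application of Explainable AI (XAI) to fake news detection, specifically comparing explanation methods such as SHAP, LIME, and Integrated Gradients. Its content is unrelated to most keywords (large models, MoE, quantization, inference acceleration, and so on), because those keywords concern the technical principles, training methods, and optimization techniques of large models, whereas this paper focuses on the interpretability of conventional NLP classification models. The only relevant keyword is 'Mechanistic Interpretability OR Explainable AI', since the paper's core is the study of XAI methods in NLP, so it receives 10 (highly relevant).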

!!! tip deepseek-chat TL;DR

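The study examines Explainable AI (XAI) methods for NLP-based fake news detection and, by comparing SHAP, LIME, and Integrated Gradients, finds that XAI improves model transparency and interpretability while maintaining high detection accuracy.

Abstract translation

This article examines the application of Explainable Artificial Intelligence (XAI) to NLP-based fake news detection and compares selected interpretability methods. It outlines the key characteristics of disinformation, neural network architectures, and XAI techniques, focusing on SHAP, LIME, and Integrated Gradients. In the experimental study, classification models were implemented and interpreted with these methods. The results show that XAI enhances model transparency and interpretability while maintaining high detection accuracy. Each method offers distinct explanatory value: SHAP provides detailed local attributions, LIME produces simple and intuitive explanations, and Integrated Gradients performs efficiently with convolutional models. The study also notes limitations of the methods, such as high computational cost and sensitivity to parameter settings. Overall, it shows that combining XAI with NLP is an effective way to improve the reliability and trustworthiness of fake news detection systems.

Abstract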

This article examines the application of Explainable Artificial Intelligence (XAI) in NLP based fake news detection and compares selected interpretability methods. The work outlines key aspects of disinformation, neural network architectures, and XAI techniques, with a focus on SHAP, LIME, and Integrated Gradients. In the experimental study, classification models were implemented and interpreted using these methods. The results show that XAI enhances model transparency and interpretability while maintaining high detection accuracy. Each method provides distinct explanatory value: SHAP offers detailed local attributions, LIME provides simple and intuitive explanations, and Integrated Gradients performs efficiently with convolutional models. The study also highlights limitations such as computational cost and sensitivity to parameterization. Overall, the findings demonstrate that integrating XAI with NLP is an effective approach to improving the reliability and trustworthiness of fake news detection systems.

Keywords: Explainable AI, Fake News Detection, SHAP, LIME, Integrated Gradients, Model Interpretability, NLP, Transparency
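
Of the three methods the abstract compares, Integrated Gradients is the easiest to show in a few lines: it averages the model's gradient along a straight path from a baseline input to the actual input and scales by the input difference. The sketch below applies a midpoint Riemann-sum approximation to a toy logistic classifier. The model, its weights, the feature vector, and the all-zero baseline are hypothetical stand-ins, not the classification models used in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy logistic classifier standing in for a fake-news model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def model_grad(x, w, b):
    """Analytic gradient of the logistic output with respect to each input feature."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients on a straight path."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for idx, g in enumerate(model_grad(point, w, b)):
            totals[idx] += g
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, totals)]

w, b = [2.0, -1.0, 0.5], -0.2   # hypothetical weights and bias
x = [1.0, 0.0, 2.0]             # hypothetical feature vector
baseline = [0.0, 0.0, 0.0]      # all-zero baseline (an assumption)
attributions = integrated_gradients(x, baseline, w, b)

# Completeness check: attributions should sum to f(x) - f(baseline).
gap = abs(sum(attributions) - (model(x, w, b) - model(baseline, w, b)))
```

The completeness check at the end is a useful sanity test in practice: a large gap usually means too few integration steps, which is one face of the parameter sensitivity the abstract mentions.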

128. ❌ Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents

Authors: Yaocong Li, Qiang Lan, Leihan Zhang, Le Zhang Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11772v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

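Scoring rationale: The paper's core topic is Retrieval-Augmented Generation (RAG) for legal documents, so it is highly relevant to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (10), which is its main technical framework. It mentions a 'dual-path self-reflection mechanism', which connects it to 'Self-Correction OR Self-Improvement OR Self-Reflection' (8), though this is not the core. It also uses large models for automated evaluation, which relates to 'Large Language Models OR LLMs OR Foundation Models' (8). Other keywords such as MoE, SLMs, scaling laws, training methods, inference acceleration, and AI for Science are unrelated to the paper (0), because it focuses on RAG systems for the legal domain and does not address those techniques or scientific applications.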

!!! tip deepseek-chat TL;DR

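To address the lack of specialized evaluation benchmarks and structure-aware processing in retrieval-augmented generation (RAG) systems for Chinese legal documents, the study proposes the Legal-DC benchmark dataset and the LegRAG framework, which uses legal adaptive indexing and a dual-path self-reflection mechanism to improve on existing methods by 1.3% to 5.6% on key metrics.

Abstract translation

Retrieval-Augmented Generation (RAG) has become a promising technology for legal document consultation, but its application in Chinese legal scenarios faces two key limitations: existing benchmarks lack specialized support for joint retriever-generator evaluation, and mainstream RAG systems often fail to accommodate the structured nature of legal provisions. To address these gaps, this study presents two core contributions. First, we construct the Legal-DC benchmark dataset, comprising 480 legal documents (covering areas such as market regulation and contract management) and 2,475 refined question-answer pairs, each annotated with clause-level references, filling the gap in specialized evaluation resources for Chinese legal RAG. Second, we propose the LegRAG framework, which combines legal adaptive indexing (segmentation at clause boundaries) with a dual-path self-reflection mechanism to preserve clause integrity while improving answer accuracy. Third, we introduce automated evaluation methods for large language models to meet the high-reliability demands of legal retrieval scenarios. Experiments show that LegRAG improves on the existing state-of-the-art methods by 1.3% to 5.6% on key evaluation metrics. By providing a specialized benchmark, a practical framework, and empirical insights, this research advances the development of Chinese legal RAG systems. The code and data are available at https://github.com/legal-dc/Legal-DC.

Abstract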

Retrieval-Augmented Generation (RAG) has emerged as a promising technology for legal document consultation, yet its application in Chinese legal scenarios faces two key limitations: existing benchmarks lack specialized support for joint retriever-generator evaluation, and mainstream RAG systems often fail to accommodate the structured nature of legal provisions. To address these gaps, this study advances two core contributions: First, we constructed the Legal-DC benchmark dataset, comprising 480 legal documents (covering areas such as market regulation and contract management) and 2,475 refined question-answer pairs, each annotated with clause-level references, filling the gap for specialized evaluation resources in Chinese legal RAG. Second, we propose the LegRAG framework, which integrates legal adaptive indexing (clause-boundary segmentation) with a dual-path self-reflection mechanism to ensure clause integrity while enhancing answer accuracy. Third, we introduce automated evaluation methods for large language models to meet the high-reliability demands of legal retrieval scenarios. LegRAG outperforms existing state-of-the-art methods by 1.3% to 5.6% across key evaluation metrics. This research provides a specialized benchmark, practical framework, and empirical insights to advance the development of Chinese legal RAG systems. Our code and data are available at https://github.com/legal-dc/Legal-DC.

Keywords: Retrieval-Augmented Generation, Legal Documents, Benchmark Dataset, Legal-DC, LegRAG Framework, Self-Reflection Mechanism, Chinese Legal Scenarios, Clause-level References
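
Clause-boundary segmentation, the indexing step the abstract attributes to LegRAG, can be approximated by splitting a statute wherever a line begins with an article heading of the form "第…条". The sketch below is an illustration under assumptions, not the LegRAG implementation. The regular expression, the requirement that headings start a line, and the sample statute text are all invented for this example.

```python
import re

# ASSUMPTION: each article heading ("第一条", "第12条", ...) starts on its own line.
CLAUSE_START = re.compile(r"^(?=第[一二三四五六七八九十百千零〇\d]+条)", re.MULTILINE)

def split_clauses(text):
    """Split a statute into chunks that each begin at an article heading."""
    parts = CLAUSE_START.split(text)
    return [part.strip() for part in parts if part.strip()]

# Invented sample text in the style of a Chinese regulation.
sample = (
    "第一条 为了规范市场秩序，制定本办法。\n"
    "第二条 经营者应当遵守本办法。\n"
    "违反本办法第二条规定的，依法处理。\n"
    "第三条 本办法自公布之日起施行。"
)
chunks = split_clauses(sample)
```

Anchoring the pattern to line starts keeps an in-text cross-reference such as "本办法第二条" from splitting a clause in half, so each retrieved chunk is a whole article rather than a fixed-length window.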

129. ❌ Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair

Authors: Assaf Siani, Anna Kernerman, Ilan Kernerman Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11743v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

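Scoring rationale: The paper focuses on dataset construction and model training for machine translation quality estimation (QE), using conventional pretrained models such as BERT and XLM-R rather than large language models (LLMs), and it does not innovate on deep-learning technical principles. It covers semi-synthetic data generation, BLEU-based filtering, human evaluation, and neural QE model training, but none of the large-model techniques, training methods, inference optimizations, alignment techniques, agent systems, or AI-for-science applications named in the scoring keywords. All keywords are unrelated to the paper's topic, so every relevance score is 0.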

!!! tip deepseek-chat TL;DR

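The study builds a semi-synthetic parallel dataset for English-Hebrew machine translation quality estimation and, by training BERT and XLM-R models, evaluates how dataset size, balance, and error distribution affect model performance, with the aim of improving QE for under-resourced language pairs.

Abstract translation

Quality estimation (QE) plays a key role in machine translation (MT) workflows: it evaluates generated output for which no reference translation exists and determines whether human post-editing or full retranslation is needed. However, developing highly accurate, adaptable, and reliable QE systems for under-resourced language pairs remains largely unsolved, mainly because of limited parallel corpora and diverse language-dependent factors, as with morphosyntactically complex languages. This study presents a semi-synthetic parallel dataset for English-to-Hebrew QE, built by creating English sentences from usage examples that illustrate typical linguistic patterns, translating them into Hebrew with multiple MT engines, and filtering the outputs through BLEU-based selection. Each translated segment was manually evaluated and scored by a linguist, and we also incorporated professionally translated English-Hebrew segments from our own resources, which were assigned the highest quality score. We introduced controlled translation errors to address linguistic challenges, particularly gender and number agreement, and trained neural QE models, including BERT and XLM-R, on this dataset to assess sentence-level MT quality. Our findings highlight the impact of dataset size, distributional balance, and error distribution on model performance. We describe the challenges, methodology, and results of our experiments and indicate future directions aimed at improving QE performance. This research contributes to advancing QE models for under-resourced language pairs, including morphologically rich languages.

Abstract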

Quality estimation (QE) plays a crucial role in machine translation (MT) workflows, as it serves to evaluate generated outputs that have no reference translations and to determine whether human post-editing or full retranslation is necessary. Yet, developing highly accurate, adaptable and reliable QE systems for under-resourced language pairs remains largely unsolved, due mainly to limited parallel corpora and to diverse language-dependent factors, such as with morphosyntactically complex languages. This study presents a semi-synthetic parallel dataset for English-to-Hebrew QE, generated by creating English sentences based on examples of usage that illustrate typical linguistic patterns, translating them to Hebrew using multiple MT engines, and filtering outputs via BLEU-based selection. Each translated segment was manually evaluated and scored by a linguist, and we also incorporated professionally translated English-Hebrew segments from our own resources, which were assigned the highest quality score. Controlled translation errors were introduced to address linguistic challenges, particularly regarding gender and number agreement, and we trained neural QE models, including BERT and XLM-R, on this dataset to assess sentence-level MT quality. Our findings highlight the impact of dataset size, distributed balance, and error distribution on model performance. We will describe the challenges, methodology and results of our experiments, and specify future directions aimed at improving QE performance. This research contributes to advancing QE models for under resourced language pairs, including morphology-rich languages.

Keywords: Quality Estimation, Machine Translation, Semi-synthetic Dataset, Under-resourced Language, English-Hebrew, BERT, XLM-R, Dataset Building

130. ❌ In the LLM era, Word Sense Induction remains unsolved

Authors: Anna Mosolova, Marie Candito, Carlos Ramisch Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11686v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

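Scoring rationale: The paper centers on the use of LLMs for word sense induction (WSI), so it matches the 'Large Language Models' keyword directly and receives 10 points. It evaluates LLM-based WSI methods and examines LLM-generated data augmentation and the lexical-semantic capabilities of LLMs, all of which are closely tied to LLMs. The remaining keywords (MoE, SLMs, Scaling Laws, the various training techniques, inference optimization, agent systems, model compression, AI for Science, and so on) are not addressed in the paper and score 0.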
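
The arithmetic behind a line such as "0.0 / 26.6" can be sketched as follows. The passing-score formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section; the rule that each row's score equals weight × relevance is an assumption inferred from the table columns, and the function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold from the report config: 5.0 + 0.8 x total weight."""
    return base + factor * total_weight


def paper_score(rows, max_per_keyword=15.0):
    """Assumed scoring rule: sum of weight x relevance, capped per keyword."""
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


# 27 keywords with weight 1.0 each in the configuration.
threshold = passing_score(27.0)

# (weight, relevance) pairs as listed in the table above: every weight is 0.0.
rows = [(0.0, 10.0)] + [(0.0, 0.0)] * 26
total = paper_score(rows)
```

Under this assumed rule, a row weight of 0.0 yields a total of 0.0 whatever the relevance values are, which matches the score shown for this paper.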

!!! tip deepseek-chat TL;DR

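The paper argues that word sense induction (WSI) remains unsolved in the LLM era. It proposes a new evaluation setup and tests an LLM-based WSI method, finding that LLMs struggle with the task while data augmentation and Wiktionary improve performance; the resulting system surpasses the previous SOTA system by 3.3% on the authors' test set.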

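Abstract translation

In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, especially in low-resource or domain-specific settings. This paper focuses on methodological problems in current WSI evaluation. We propose evaluating on a SemCor-derived dataset that preserves the polysemy and frequency distributions of the original corpus. We evaluate pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate several data augmentation sources (LLM-generated data, corpus data, and lexicon data), as well as semi-supervised scenarios that use Wiktionary for data augmentation, while also examining must-link constraints and the number of clusters per lemma.

We find that no unsupervised method, whether ours or earlier ones, surpasses the strong "one cluster per lemma" (1cpl) heuristic baseline. We also show that (i) results and best systems may differ across parts of speech; (ii) LLMs have difficulty with this task; (iii) data augmentation helps; and (iv) making effective use of Wiktionary does improve performance. The method surpasses the previous state-of-the-art (SOTA) system by 3.3% on our test set. WSI is not yet solved, and calls for more effective ways of combining lexical resources with the lexical-semantic capabilities of LLMs.

Abstract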

In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints, number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong “one cluster per lemma” heuristic (1cpl). We also show that (i) results and best systems may vary across POS, (ii) LLMs have troubles performing this task, (iii) data augmentation is beneficial and (iv) capitalizing on Wiktionary does help. It surpasses previous SOTA system on our test set by 3.3%. WSI is not solved, and calls for a better articulation of lexicons and LLMs’ lexical semantics capabilities.
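
To make the "one cluster per lemma" (1cpl) heuristic concrete, here is a minimal sketch. The toy instances, the gold sense labels, and the pairwise F1 measure are illustrative assumptions; they are not the paper's dataset or its evaluation metric.

```python
from itertools import combinations


def one_cluster_per_lemma(instances):
    """Assign every occurrence of a lemma to a single cluster named after the lemma."""
    return [lemma for lemma, _sense in instances]


def pairwise_f1(gold, pred):
    """Pairwise F1: a pair is positive when both items share a cluster."""
    pairs = list(combinations(range(len(gold)), 2))
    gold_pos = {p for p in pairs if gold[p[0]] == gold[p[1]]}
    pred_pos = {p for p in pairs if pred[p[0]] == pred[p[1]]}
    true_pos = len(gold_pos & pred_pos)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_pos)
    recall = true_pos / len(gold_pos)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: (lemma, gold sense) with a skewed sense distribution.
instances = [("bank", "bank.finance")] * 4 + [("bank", "bank.river")]
gold = [sense for _lemma, sense in instances]
pred = one_cluster_per_lemma(instances)
score = pairwise_f1(gold, pred)
```

In this toy case the heuristic reaches perfect recall and loses precision only on pairs that mix the two senses, which gives some intuition for why a skewed sense distribution can make 1cpl hard to beat.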

Keywords: Word Sense Induction, LLM-based method, evaluation methodology, data augmentation, Wiktionary, unsupervised methods, lexical semantics, clustering algorithms

131. ❌ A technology-oriented mapping of the language and translation industry: Analysing stakeholder values and their potential implication for translation pedagogy

Authors: María Isabel Rivas Ginel, Janiça Hackenbuchner, Alina Secară, Ralph Krüger, Caroline Rossi Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11667v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

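Scoring rationale: The paper studies how value is constructed and negotiated as the translation industry becomes automated. It belongs to the social sciences and translation studies rather than to research on large models or deep learning. It discusses the impact of automation technologies (such as machine translation) on the industry, but does not cover any specific large-model technique, algorithmic innovation, or technical principle. All keywords concern large-model technology, whereas this paper focuses on industry value analysis, so every keyword has a relevance of 0.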

!!! tip deepseek-chat TL;DR

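The paper examines how automation reshapes the construction of value in the translation industry. It finds that technological efficiency and human professional value form an interdependent configuration in automated production environments, with adaptability emerging as the core value that links the human and technological domains.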

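Abstract translation

This paper examines how value is constructed and negotiated in today's increasingly automated language and translation industry. Drawing on interview data from twenty-nine industry stakeholders collected in the LT-LiDER project, the study analyses how human value, technological value, efficiency, and adaptability are articulated across professional roles. Using Chesterman's translation ethics and the associated theory of values as an analytical framework, the paper shows that in automated production environments, where speed, scalability, and deliverability dominate the evaluation criteria, efficiency-oriented technological values aligned with the ethics of service have become baseline expectations. At the same time, human value is not displaced but repositioned, appearing mainly through expertise, oversight, accountability, and contextual judgment embedded in technology-mediated workflows. A central finding is the prominence of adaptability as a mediating value linking the human and technological domains. Adaptability is constructed as a core professional requirement, reflecting the industry's expectation that translators continuously adjust their skills, roles, and identities to evolving tools and organisational demands. The paper argues that automation reshapes rather than replaces translation value, producing an interdependent configuration in which technological efficiency supports human communicative work.

Abstract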

This paper examines how value is constructed and negotiated in today’s increasingly automated language and translation industry. Drawing on interview data from twenty-nine industry stakeholders collected within the LT-LiDER project, the study analyses how human value, technological value, efficiency, and adaptability are articulated across different professional roles. Using Chesterman’s framework of translation ethics and associated values as an analytical lens, the paper shows that efficiency-oriented technological values aligned with the ethics of service have become baseline expectations in automated production environments, where speed, scalability, and deliverability dominate evaluation criteria. At the same time, human value is not displaced but repositioned, emerging primarily through expertise, oversight, accountability, and contextual judgment embedded within technology-mediated workflows. A central finding is the prominence of adaptability as a mediating value linking human and technological domains. Adaptability is constructed as a core professional requirement, reflecting expectations that translators continuously adjust their skills, roles, and identities in response to evolving tools and organisational demands. The paper argues that automation reshapes rather than replaces translation value, creating an interdependent configuration in which technological efficiency enables human communicative work.

Keywords: translation industry, automation, value construction, human value, technological value, adaptability, translation ethics, LT-LiDER project

132. ❌ Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge

Authors: Junjie Wu, Xuan Kan, Zihao He, Shunwen Tan, Bo Pan, Kaitai Zhang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11665v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

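Scoring rationale: The paper's core topic is optimizing multimodal large language models (MLLMs) as judges, using multi-task reinforcement learning (MT-RL-Judge) to improve judgment consistency and correlation with human preferences. It is therefore highly relevant to 'Large Language Models' (10 points), since MLLMs extend LLMs; highly relevant to 'RLHF' (10 points), since the paper optimizes with reinforcement learning; and moderately relevant to 'Alignment' (5 points), since it involves alignment with human preferences. Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not covered in the paper and score 0.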

!!! tip deepseek-chat TL;DR

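The paper addresses the limited generalization of existing MLLM-as-a-Judge models, which are optimized for single tasks, by proposing the multi-task reinforcement learning framework MT-RL-Judge. Experiments show that it outperforms baselines in judgment consistency and correlation with human preferences, and generalizes robustly to out-of-distribution tasks.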

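Abstract translation

Multimodal large language models have been widely adopted as "MLLM-as-a-Judge" evaluators because of their strong agreement with human judgment across a range of visual tasks. However, most existing judge models are optimized only for single-task scenarios and generalize poorly to diverse contexts, which is a key requirement for reliable evaluation. To overcome this limitation, we propose a multi-task reinforcement learning framework for MLLM-as-a-Judge (MT-RL-Judge), which draws on the generalization capabilities of reinforcement learning to jointly optimize the judge model across multiple tasks. Comparisons with several strong baselines show that MT-RL-Judge outperforms them in both judgment consistency and correlation with human preferences. In addition, our method generalizes strongly to out-of-distribution tasks, further validating its effectiveness.

Abstract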

Multimodal Large Language Models (MLLMs) have been widely adopted as MLLM-as-a-Judges due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results against several strong baselines demonstrate that MT-RL-Judge outperforms strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.

Keywords: Multimodal Large Language Models, MLLM-as-a-Judge, Multi-Task Reinforcement Learning, Judgment Consistency, Human Preference Alignment, Out-of-Distribution Generalization, MT-RL-Judge

133. ❌ QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate

Authors: Jihao Zhao, Daixuan Li, Pengfei Li, Shuaishuai Zu, Biao Qin, Hongyan Liu Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11650v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

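Scoring rationale: The paper's core topic is optimizing text chunking for RAG systems, which is highly relevant to 'Retrieval-Augmented Generation' (10 points). It proposes a multi-agent debate framework, which is highly relevant to 'Multi-agent Systems' (10 points). It transfers the capability to small language models, which is relevant to 'Small Language Models' (8 points). The framework involves cooperating agents, which is relevant to 'LLM Agents' (8 points). The paper mentions large models, but they are not its focus, so 'Large Language Models' receives a base score of 5 points. Other keywords such as MoE, Scaling Laws, and Alignment are not covered and score 0.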

!!! tip deepseek-chat TL;DR

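The paper targets the semantic integrity and information granularity of text chunks in RAG systems. It proposes QChunker, which optimizes chunking through a multi-agent debate framework, and introduces ChunkScore, a direct evaluation metric; experiments show that the approach improves RAG performance.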

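Abstract translation

The upper bound on the effectiveness of retrieval-augmented generation (RAG) is fundamentally constrained by the semantic integrity and information granularity of the text chunks in its knowledge base. To address these challenges, this paper proposes QChunker, which restructures the RAG paradigm from "retrieval-augmentation" to "understanding-retrieval-augmentation". First, QChunker models text chunking as a composite task of text segmentation and knowledge completion, to ensure the logical coherence and integrity of the chunks. Inspired by Hal Gregersen's "Questions Are the Answer" theory, we design a multi-agent debate framework with four specialized components: a question outline generator, a text segmenter, an integrity reviewer, and a knowledge completer. The framework operates on the principle that questions are catalysts for deep insight. Through this pipeline, we construct a high-quality dataset of 45K entries and transfer the capability to small language models. In addition, because existing chunking evaluation methods rely too heavily on downstream QA tasks, with long evaluation chains and low efficiency, we introduce a novel direct evaluation metric, ChunkScore. Both theoretical and experimental validation show that ChunkScore can directly and efficiently discriminate chunk quality. Furthermore, in the text segmentation phase, we use document outlines for multi-path sampling to generate multiple candidate chunks, and use ChunkScore to select the best one. Extensive experiments across four heterogeneous domains show that QChunker effectively addresses these issues by providing RAG with more logically coherent and information-rich text chunks.

Abstract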

The effectiveness upper bound of retrieval-augmented generation (RAG) is fundamentally constrained by the semantic integrity and information granularity of text chunks in its knowledge base. To address these challenges, this paper proposes QChunker, which restructures the RAG paradigm from retrieval-augmentation to understanding-retrieval-augmentation. Firstly, QChunker models the text chunking as a composite task of text segmentation and knowledge completion to ensure the logical coherence and integrity of text chunks. Drawing inspiration from Hal Gregersen’s “Questions Are the Answer” theory, we design a multi-agent debate framework comprising four specialized components: a question outline generator, text segmenter, integrity reviewer, and knowledge completer. This framework operates on the principle that questions serve as catalysts for profound insights. Through this pipeline, we successfully construct a high-quality dataset of 45K entries and transfer this capability to small language models. Additionally, to handle long evaluation chains and low efficiency in existing chunking evaluation methods, which overly rely on downstream QA tasks, we introduce a novel direct evaluation metric, ChunkScore. Both theoretical and experimental validations demonstrate that ChunkScore can directly and efficiently discriminate the quality of text chunks. Furthermore, during the text segmentation phase, we utilize document outlines for multi-path sampling to generate multiple candidate chunks and select the optimal solution employing ChunkScore. Extensive experimental results across four heterogeneous domains exhibit that QChunker effectively resolves aforementioned issues by providing RAG with more logically coherent and information-rich text chunks.
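
The multi-path sampling step described above amounts to generating several candidate chunkings and keeping the one with the best score. A minimal sketch of that selection step, assuming a stand-in scoring function: `toy_chunk_score` is hypothetical and is not the paper's ChunkScore, whose definition the abstract does not give.

```python
def toy_chunk_score(chunks):
    """Hypothetical stand-in for ChunkScore: share of chunks that end on a sentence boundary."""
    complete = sum(1 for c in chunks if c.rstrip().endswith((".", "!", "?")))
    return complete / len(chunks)


def select_best_chunking(candidates, score_fn):
    """Return the candidate chunking with the highest score (the first one wins ties)."""
    return max(candidates, key=score_fn)


# Two hypothetical candidate chunkings of the same short passage.
candidates = [
    ["RAG retrieves text chunks. Chunk quality", "limits answer quality."],
    ["RAG retrieves text chunks.", "Chunk quality limits answer quality."],
]
best = select_best_chunking(candidates, toy_chunk_score)
```

Because the scoring function is passed in as an argument, any direct chunk-quality metric could replace the toy one without changing the selection logic.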

Keywords: Retrieval-Augmented Generation, RAG, text chunking, multi-agent debate, small language models, ChunkScore, domain adaptation, knowledge base

134. ❌ Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE

Authors: Mohammad Aflah Khan, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11611v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

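Scoring rationale: The paper studies partial application of RoPE (Rotary Positional Embedding), a positional encoding technique in transformer architectures, which is highly relevant to large language models (LLMs). Its core contribution is examining how applying RoPE to only part of the dimensions affects training dynamics and convergence; it finds that applying RoPE to about 10% of dimensions achieves convergence comparable to full RoPE while reducing RoPE cache memory by up to 10x. It is scored as highly relevant to 'KV Cache Compression' (10 points), because shrinking the RoPE cache reduces cache memory; relevant to 'Context Window Extension' (8 points), because the memory savings matter most at long context lengths; relevant to 'Quantization/Model Compression' and 'Inference Acceleration' (8 points each), because partial RoPE is treated as a model compression and inference acceleration technique; relevant to 'Pre-training' (8 points), because the study concerns training dynamics; and somewhat relevant to 'Scaling Laws AND Data Quality' (5 points), because the paper reports that higher-quality data yields lower loss. Other keywords such as MoE, SFT, and RAG are unrelated to the paper.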

!!! tip deepseek-chat TL;DR

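The paper studies how applying Rotary Positional Embedding (RoPE) to only a fraction of hidden dimensions in transformer architectures affects training convergence and performance. It finds that applying RoPE to about 10% of dimensions achieves convergence comparable to full RoPE, with up to 10x memory savings over the standard RoPE cache.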

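Abstract translation

Rotary Positional Embedding (RoPE) is a common choice for encoding relative position information in Transformer architectures. Although prior work has examined the effect of omitting RoPE in specific layers, the effect of varying the fraction of hidden dimensions that receive the rotary transformation remains underexplored. This design choice can save substantial memory, which matters especially in long-context settings. We find up to 10x memory savings compared with the standard RoPE cache, while reaching a comparable final loss. This study systematically analyses the impact of partial RoPE on training dynamics and convergence across architectures and datasets, and reveals several notable patterns: (1) applying RoPE to only a small fraction of dimensions (about 10%) achieves convergence comparable to full RoPE; (2) this pattern holds across model sizes, sequence lengths, data quality levels, and architectures, with higher-quality data yielding lower overall loss and similar benchmark performance; and (3) some models trained with no positional encoding (NoPE) show unstable learning trajectories, which can be mitigated by applying a minimal amount of RoPE or by using QK-Norm, although the latter converges to a higher loss. These results offer practical guidance for model designers balancing efficiency and training stability, while highlighting the previously overlooked importance of partial RoPE.

Abstract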

Rotary Positional Embedding (RoPE) is a common choice in transformer architectures for encoding relative positional information. Although earlier work has examined omitting RoPE in specific layers, the effect of varying the fraction of hidden dimensions that receive rotary transformations remains largely unexplored. This design choice can yield substantial memory savings, which becomes especially significant at long context lengths. We find up to 10x memory savings over the standard RoPE cache, while achieving comparable final loss. In this work, we present a systematic study examining the impact of partial RoPE on training dynamics and convergence across architectures and datasets. Our findings uncover several notable patterns: (1) applying RoPE to only a small fraction of dimensions (around 10%) achieves convergence comparable to using full RoPE; (2) these trends hold consistently across model size, sequence lengths and datasets of varying quality and architectures, with higher-quality data resulting in lower overall loss and similar benchmark performance; and (3) some models trained with NoPE (No Positional Encoding) showcase unstable learning trajectories, which can be alleviated through minimal RoPE application or QK-Norm which converges to a higher loss. Together, these results offer practical guidance for model designers aiming to balance efficiency and training stability, while emphasizing the previously overlooked importance of partial RoPE.
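
As a rough illustration of what "partial RoPE" means, the sketch below rotates only the first fraction of a vector's dimensions and passes the rest through unchanged. The function name, the interleaved pairing of dimensions, the frequency base of 10000, and the choice to rotate the leading dimensions are all assumptions made for the example; the abstract does not specify the authors' implementation.

```python
import math


def partial_rope(vec, position, rotary_fraction=0.1, base=10000.0):
    """Rotate only the first `rotary_fraction` of dimensions; leave the rest untouched."""
    dim = len(vec)
    # RoPE rotates dimensions in pairs, so round down to an even count.
    rot_dim = int(round(dim * rotary_fraction)) // 2 * 2
    out = list(vec)
    for i in range(0, rot_dim, 2):
        theta = position * base ** (-i / rot_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * cos_t - y * sin_t
        out[i + 1] = x * sin_t + y * cos_t
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 40-dimensional query and key; 10% of dimensions gives 4 rotated dims.
q = [0.1 * (i + 1) for i in range(40)]
k = [0.05 * (40 - i) for i in range(40)]

# The attention logit depends only on the position offset (2 in both cases).
logit_near = dot(partial_rope(q, 5), partial_rope(k, 3))
logit_far = dot(partial_rope(q, 12), partial_rope(k, 10))
```

With `rotary_fraction=0.1`, only a tenth of the dimensions need cached cos/sin values, which is one plausible reading of the up-to-10x saving quoted in the abstract.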

Keywords: Rotary Positional Embedding, RoPE, transformer architectures, memory savings, training convergence, partial RoPE, positional encoding, KV cache optimization

135. ❌ Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

Authors: Masataka Kawai, Singo Sakashita, Shumpei Ishikawa, Shogo Watanabe, Anna Matsuoka, Mikio Sakurai, Yasuto Fujimoto, Yoshiyuki Takahara, Atsushi Ohara, Hirohiko Miyake, Genichiro Ishii Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11597v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

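Scoring rationale: The paper explicitly studies the use of large language models (LLMs) for writing medical pathology reports, which is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The study concerns pathology, which falls under 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). The paper notes that 'Thinking models' show advantages in structured reporting tasks that require reasoning, which is somewhat relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (5 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper (0 points).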

!!! tip deepseek-chat TL;DR

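The study evaluates open-source large language models for assisting with Japanese pathology report writing. Thinking models and medical-specialized models perform better on structured reporting and typo correction, but usefulness varies by task; the models are of value in limited but clinically relevant scenarios.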

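Abstract translation

The performance of large language models (LLMs) in supporting Japanese pathology report writing has not been fully explored. We evaluated seven open-source LLMs along three dimensions: (A) generating and extracting pathology diagnosis text in predefined formats, (B) correcting typographical errors in Japanese pathology reports, and (C) subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models showed advantages in structured reporting tasks that require reasoning and in typo correction. By contrast, raters' preferences for explanatory text outputs varied substantially. Although the usefulness of LLMs differs by task, our findings suggest that open-source LLMs can effectively assist Japanese pathology report writing in limited but clinically relevant scenarios.

Abstract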

The performance of large language models (LLMs) for supporting pathology report writing in Japanese remains unexplored. We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C) subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models showed advantages in structured reporting tasks that required reasoning and in typo correction. In contrast, preferences for explanatory outputs varied substantially across raters. Although the utility of LLMs differed by task, our findings suggest that open-source LLMs can be useful for assisting Japanese pathology report writing in limited but clinically relevant scenarios.

Keywords: large language models, pathology report writing, Japanese, open-source LLMs, structured reporting, typo correction, clinical evaluation, medical AI

136. ❌ UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization

Authors: Ofir Marom Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11583v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

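Scoring rationale: The paper proposes UtilityMax Prompting, a framework for improving large language model (LLM) performance on tasks that must satisfy several objectives at once. It recasts the task as an influence diagram and defines a utility function that guides the LLM toward the answer maximizing expected utility. The paper is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since LLMs are both the object of study and the application platform. However, it does not address the specific techniques (such as MoE, SFT, RAG, or quantization), training methods (such as pre-training or instruction tuning), inference techniques (such as chain of thought or speculative decoding), or application domains (such as bioinformatics) described by the other keywords. Those keywords have no direct connection to the paper's focus, a general prompting framework based on formal mathematical language, and therefore score 0.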

!!! tip deepseek-chat TL;DR

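The paper addresses the ambiguity of natural-language prompts for multi-objective LLM tasks by proposing UtilityMax Prompting, a framework based on formal mathematical language. In experiments on three frontier models, the framework consistently improves precision and NDCG over natural-language baselines in a movie recommendation task.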

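Abstract translation

The success of a large language model (LLM) task depends heavily on its prompt. Most use cases define prompts in natural language, which is inherently ambiguous when several objectives must be satisfied at once. This paper proposes the UtilityMax Prompting framework, which defines tasks in formal mathematical language. We recast the task as an influence diagram in which the LLM's answer is the only decision variable. Within the diagram, we define a utility function over the conditional probability distributions and instruct the LLM to find the answer that maximizes expected utility. This constrains the LLM to reason explicitly about each component of the objective, steering its output toward a precise optimization target rather than a subjective natural-language interpretation. We validate the approach on the MovieLens 1M dataset with three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), and show consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural-language baselines in a multi-objective movie recommendation task.

Abstract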

The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM’s answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
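
The objective the prompt describes, choosing the answer that maximises expected utility, can be written out numerically. In the paper the LLM is instructed to reason about this objective itself; the sketch below only illustrates the arithmetic. The utility function, the outcome variables, the candidate names, and all probabilities are hypothetical.

```python
def expected_utility(outcome_probs, utility):
    """Sum utility over outcomes, weighted by P(outcome | answer)."""
    return sum(p * utility(outcome) for outcome, p in outcome_probs.items())


def best_answer(candidates, utility):
    """Pick the candidate whose outcome distribution maximises expected utility."""
    return max(candidates, key=lambda name: expected_utility(candidates[name], utility))


def utility(outcome):
    """Hypothetical two-objective utility: user likes the film, and it is a recent release."""
    liked, recent = outcome
    return 1.0 * liked + 0.5 * recent


# Hypothetical P(outcome | answer) for two candidate recommendations.
candidates = {
    "film_a": {(1, 0): 0.8, (0, 0): 0.2},
    "film_b": {(1, 1): 0.5, (0, 1): 0.5},
}
choice = best_answer(candidates, utility)
```

Here `film_a` is the likelier hit on the first objective alone, yet `film_b` has the higher expected utility once both objectives are weighted, which is the kind of trade-off a multi-objective utility makes explicit.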

Keywords: Large Language Models, Prompting, Multi-Objective Optimization, Utility Maximization, Formal Framework, Influence Diagram, Movie Recommendation, NDCG

137. ❌ Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries

Authors: Zhenxu Tian, Yi Su, Juntao Li, Min Zhang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11564v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

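Scoring rationale: The paper's core topic is KV cache compression to improve LLM inference efficiency, which is highly relevant to 'KV Cache Compression OR Linear Attention OR FlashAttention' (15 points) and addresses that problem directly. The paper also involves LLM inference (10 points), long contexts (8 points), and inference acceleration (8 points), and is somewhat related to model compression (5 points). Other keywords such as MoE, SLMs, alignment, and RAG are not covered and score 0.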

!!! tip deepseek-chat TL;DR

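The paper proposes DapQ, a decoding-aligned KV cache compression method based on position-aware pseudo queries. It addresses the high KV cache memory footprint of long contexts and achieves near-lossless performance under strict memory constraints.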

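Abstract translation

The Key-Value (KV) cache is essential for efficient inference with Large Language Models (LLMs), but overly long contexts sharply increase its memory footprint. Existing KV cache compression methods usually rely on input-side attention patterns within a prompt observation window to estimate token importance during the prefill stage. Because these assessments are not derived from the decoding process, they fail to preserve the tokens that are critical for future generation. Intuitively, an effective observation window should reflect the decoding-stage queries, so that it accurately captures which tokens the generation process will attend to. However, the true decoding queries are inherently unavailable during inference. In constructing pseudo queries to approximate them, we find that positional information plays a more critical role than semantic content. Based on this insight, we propose decoding-aligned KV cache compression via position-aware pseudo queries (DapQ), a novel and lightweight eviction framework that uses position-aware pseudo queries to simulate output tokens, thereby establishing an effective observation window for importance assessment. The method aligns closely with the actual generation context and enables precise token eviction. Extensive evaluation across multiple benchmarks and LLMs shows that DapQ achieves superior performance, especially under strict memory constraints (for example, near-lossless 99.5% performance on NIAH with only a 3% KV cache budget).

Abstract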

The Key-Value (KV) cache is crucial for efficient Large Language Models (LLMs) inference, but excessively long contexts drastically increase KV cache memory footprint. Existing KV cache compression methods typically rely on input-side attention patterns within a prompt observation window to estimate token importance during the prefill stage. They fail to preserve critical tokens for future generation since these assessments are not derived from the decoding process. Intuitively, an effective observation window should mirror the decoding-stage queries to accurately reflect which tokens the generation process will attend to. However, ground-truth decoding queries are inherently unavailable during inference. For constructing pseudo queries to approximate them, we find that positional information plays a more critical role than semantic content. Motivated by this insight, we propose decoding-aligned KV cache compression via position-aware pseudo queries (DapQ), a novel and lightweight eviction framework that leverages position-aware pseudo queries to simulate the output tokens, thereby establishing an effective observation window for importance assessment. It aligns closely with the actual generation context and enables precise token eviction. Extensive evaluations across multiple benchmarks and LLMs demonstrate that DapQ achieves superior performance, particularly under strict memory constraints (e.g., up to nearly lossless performance 99.5% on NIAH with 3% KV cache budgets).
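
The general shape of attention-based token eviction can be sketched as follows: score each cached key by the attention it receives from a set of queries, then keep only the top-scoring tokens within a budget. This is a generic illustration, not DapQ itself. How the position-aware pseudo queries are constructed is the paper's contribution and is not specified in the abstract, so here they are simply passed in as hypothetical vectors.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_importance(pseudo_queries, keys):
    """Average attention weight each cached key receives from the pseudo queries."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    importance = [0.0] * len(keys)
    for query in pseudo_queries:
        logits = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
        for i, weight in enumerate(softmax(logits)):
            importance[i] += weight / len(pseudo_queries)
    return importance


def keep_indices(importance, budget):
    """Indices of the `budget` most important tokens, returned in original order."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical 2-dimensional keys for four cached tokens and two pseudo queries.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
pseudo_queries = [[2.0, 0.0], [1.5, 0.5]]
importance = token_importance(pseudo_queries, keys)
kept = keep_indices(importance, budget=2)
```

The eviction step is independent of where the queries come from, so the same scoring loop applies whether the observation window uses prompt-side queries or pseudo queries meant to approximate decoding.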

Keywords: KV cache compression, Large Language Models, inference efficiency, position-aware pseudo queries, decoding-aligned, memory footprint, long contexts, token eviction
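To make the idea of an observation window concrete, here is a minimal Python sketch of attention-based KV eviction. It is not the DapQ implementation: the abstract does not say how the position-aware pseudo queries are constructed, so the function simply accepts any list of query vectors. The names `select_tokens_to_keep`, `keys`, `pseudo_queries`, and `budget` are hypothetical.

```python
import math

def select_tokens_to_keep(keys, obs_queries, budget):
    """Return sorted indices of the `budget` cached tokens that receive the
    most total attention from the observation-window queries."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    importance = [0.0] * len(keys)
    for q in obs_queries:
        # Scaled dot-product logits of this query against every cached key.
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        for i, w in enumerate(weights):
            importance[i] += w / total
    ranked = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:budget])

# Toy cache of four 2-d keys and two stand-in "pseudo queries" (assumed values).
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
pseudo_queries = [[4.0, 0.0], [3.0, 1.0]]
kept = select_tokens_to_keep(keys, pseudo_queries, budget=2)
print(kept)
```

The point the abstract makes is about which queries populate `obs_queries`: prompt-side queries in existing methods, decoding-aligned pseudo queries in DapQ. The scoring and top-k step shown here is only the generic part.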

138. ❌ Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing

Authors: Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11535v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	15.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

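Scoring rationale: The paper's core subject is the Mixture of Experts (MoE) routing mechanism. It proposes Expert Threshold (ET) routing, a core technical contribution to MoE/Sparse Models, so that keyword receives 15 points. The method is validated in pre-training experiments, which makes it relevant to Pre-training (10 points). The work targets autoregressive language modeling, which falls within large-model techniques and is relevant to Large Language Models (10 points). Other keywords, such as Small Language Models, Scaling Laws, Post-training, and Instruction Tuning, are not addressed in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the limits of Token-choice Mixture-of-Experts (TC-MoE) routing in dynamic computation allocation and load balancing. It proposes Expert Threshold (ET) routing, which uses per-expert thresholds to allocate computation dynamically and balance load without auxiliary losses, and it reaches lower cross-entropy loss than TC-MoE in pre-training experiments.

Abstract (Chinese translation)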

令牌选择专家混合模型（Token-choice Mixture-of-Experts, TC-MoE）将每个令牌路由至固定数量的专家，这限制了动态计算分配，并需要辅助损失函数来维持负载均衡。我们提出专家阈值（Expert Threshold, ET）路由机制，其中每个专家维护一个基于全局令牌分布估计的指数移动平均（Exponential Moving Average, EMA）阈值。在训练和推理过程中，若令牌的评分超过专家阈值，则将其独立路由至该专家，从而实现动态计算分配，并在无需辅助损失的情况下达成负载均衡。这一完全因果机制消除了对批次中其他令牌的依赖，使其特别适用于自回归语言建模。在FineWeb-Edu数据集上进行的参数规模达24亿的预训练实验中，ET路由的交叉熵损失比TC-MoE低0.067，相当于使用1.6倍更少的令牌即可达到相同性能。

Abstract

Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert’s threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6$\times$ fewer tokens.

Keywords: Mixture-of-Experts, Expert Threshold routing, autoregressive language modeling, dynamic computation allocation, load balancing, pretraining, cross-entropy loss, FineWeb-Edu
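The routing rule in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the abstract only says each expert keeps an EMA threshold estimated from the global token distribution, so the update rule below (track the score cut-off that would admit a target fraction of tokens) is a guess. `ThresholdRouter`, `target_fraction`, and `decay` are hypothetical names.

```python
class ThresholdRouter:
    """Each expert admits a token when the token's score exceeds that
    expert's threshold, so a token may reach zero, one, or many experts."""

    def __init__(self, num_experts, target_fraction, decay=0.9):
        self.thresholds = [0.0] * num_experts
        self.target_fraction = target_fraction  # assumed per-expert load target
        self.decay = decay

    def route(self, token_scores):
        # Uses only this token's scores: no dependence on the rest of the batch.
        return [e for e, s in enumerate(token_scores) if s > self.thresholds[e]]

    def update(self, batch_scores):
        # Assumed rule: move each threshold toward the cut-off that admits
        # roughly `target_fraction` of the batch. Needs at least 2 tokens.
        n = len(batch_scores)
        keep = min(n - 1, max(1, round(self.target_fraction * n)))
        for e in range(len(self.thresholds)):
            column = sorted((row[e] for row in batch_scores), reverse=True)
            cutoff = column[keep]  # scores strictly above this are the top `keep`
            self.thresholds[e] = (
                self.decay * self.thresholds[e] + (1.0 - self.decay) * cutoff
            )

router = ThresholdRouter(num_experts=2, target_fraction=0.5, decay=0.5)
batch = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.3], [0.1, 0.6]]
router.update(batch)
print(router.thresholds)
print(router.route([0.5, 0.5]), router.route([0.05, 0.05]))
```

Because `route` looks at one token in isolation, the same rule can run at inference time, which is the causal property the abstract emphasizes. The number of experts per token varies with the scores rather than being fixed as in top-k routing.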

139. ❌ Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale

Authors: Sanchit Pandey Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11513v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

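Scoring rationale: The paper's core subject is how small language models (SLMs, 7B parameters or fewer) perform in retrieval-augmented generation (RAG), so it is highly relevant to "Small Language Models" and "Retrieval-Augmented Generation" (10 points each). It examines how model scale (360M to 8B) affects RAG effectiveness, which gives some relevance to "Large Language Models" and "Scaling Laws" (8 and 5 points). The paper concerns improving factual accuracy through RAG, which relates to "Hallucination Mitigation" (8 points), and it probes model behavior through error analysis, which is partially relevant to "Mechanistic Interpretability" (5 points). Other keywords, such as MoE, training methods, inference techniques, and agent systems, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper empirically studies how well small language models (7B parameters or fewer) use retrieved information in retrieval-augmented generation (RAG). It finds a fundamental utilization bottleneck: even with perfect retrieved content, these models often fail to extract the correct answer, and retrieved context interferes with knowledge the model already had, so RAG at this scale may have a negative effect.

Abstract (Chinese translation)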

检索增强生成（RAG）被广泛部署以提升语言模型的事实准确性，但目前尚不清楚参数规模在7B或更小的模型是否能有效利用检索到的信息。为探究此问题，我们评估了参数量从360M到8B的五个模型规模，涵盖三种架构系列（SmolLM2、Qwen2.5和Llama 3.1），并在四种检索条件下进行测试：无检索、BM25检索、使用E5-large-v2的密集检索，以及保证检索段落包含答案的“先知检索”。我们引入了一种参数化知识划分方法，将模型已能独立回答的问题与需要外部知识的问题区分开来，从而能够将“利用失败”与“检索质量失败”分离。我们得到三个主要结论。首先，即使在先知检索条件下，对于模型原本无法独立回答的问题，7B或更小规模的模型仍有85%至100%的概率无法提取正确答案，这表明存在根本性的利用瓶颈。其次，添加检索上下文会破坏模型原本已知答案的42%至100%，这表明一种由上下文存在（而非其质量）驱动的“干扰效应”。第三，对2588次先知检索失败案例的错误分析显示，主要的失败模式是无关生成，即模型完全忽略所提供的上下文。这些模式在多种提示模板和检索方法中均保持一致。结果表明，对于7B参数以下的模型，RAG的主要限制在于上下文利用而非检索质量，且在此规模下部署RAG在标准评估条件下可能导致净效益为负的权衡。

Abstract

Retrieval-augmented generation (RAG) is widely deployed to improve factual accuracy in language models, yet it remains unclear whether smaller models of size 7B parameters or less can effectively utilize retrieved information. To investigate this question, we evaluate five model sizes from 360M to 8B across three architecture families (SmolLM2, Qwen2.5, and Llama 3.1) under four retrieval conditions: no retrieval, BM25, dense retrieval using E5-large-v2, and oracle retrieval, where the retrieved passage is guaranteed to contain the answer. We introduce a parametric knowledge split that separates questions a model can already answer from those that require external knowledge, which allows us to isolate utilization failure from retrieval quality failure. We find three main results. First, even with oracle retrieval, models of size 7B or smaller fail to extract the correct answer 85 to 100 percent of the time on questions they cannot answer alone, which indicates a fundamental utilization bottleneck. Second, adding retrieval context destroys 42 to 100 percent of answers the model previously knew, suggesting a distraction effect driven by the presence of context rather than its quality. Third, an error analysis of 2588 oracle failures shows that the dominant failure mode is irrelevant generation, where the model ignores the provided context entirely. These patterns hold across multiple prompt templates and retrieval methods. The results indicate that, for models below 7B parameters, the main limitation of RAG is context utilization rather than retrieval quality, and that deploying RAG at this scale can lead to a net negative trade-off under standard evaluation conditions.

Keywords: Small Language Models, Retrieval-Augmented Generation, RAG, Model Scale, Context Utilization, Factual Accuracy, Empirical Study, Oracle Retrieval
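The parametric knowledge split described in the abstract lends itself to a short worked example. The sketch below is one reading of the abstract, not the paper's evaluation code: it splits questions by closed-book correctness, then reports a utilization failure rate on the "unknown" half and a distraction rate on the "known" half, both under oracle retrieval. The record fields and the function name are hypothetical.

```python
def utilization_and_distraction(records):
    """Split questions by closed-book correctness, then measure how the
    model behaves once an oracle passage is added to the prompt."""
    known = [r for r in records if r["closed_book_correct"]]
    unknown = [r for r in records if not r["closed_book_correct"]]

    def failure_rate(group):
        if not group:
            return None  # rate is undefined for an empty split
        return sum(not r["oracle_correct"] for r in group) / len(group)

    # Unknown questions still wrong with the answer in context: utilization failure.
    # Known questions that become wrong once context is added: distraction.
    return failure_rate(unknown), failure_rate(known)

# Made-up per-question outcomes for one model (illustrative only).
records = [
    {"closed_book_correct": True, "oracle_correct": True},
    {"closed_book_correct": True, "oracle_correct": False},
    {"closed_book_correct": False, "oracle_correct": False},
    {"closed_book_correct": False, "oracle_correct": False},
    {"closed_book_correct": False, "oracle_correct": True},
]
utilization_failure, distraction = utilization_and_distraction(records)
print(utilization_failure, distraction)
```

Separating the two groups is what lets the study attribute errors to context utilization rather than to the retriever, since the oracle condition removes retrieval quality as a variable.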

140. ❌ EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, Xihui Liu Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12267v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

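Scoring rationale: EVATok focuses on adaptive video tokenization for video generation, which belongs to computer vision and video generation. Although it involves deep learning techniques, every keyword targets large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on) or specific AI-for-science applications. The paper does not touch language models, text processing, LLM training techniques, or AI-for-science applications at all, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes the EVATok framework, which uses adaptive video tokenization to improve the efficiency and quality of video reconstruction and autoregressive generation, saving at least 24.4% in token usage compared with existing methods.

Abstract (Chinese translation)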

自回归（AR）视频生成模型依赖于将像素压缩为离散令牌序列的视频分词器。这些令牌序列的长度对于平衡重建质量与下游生成计算成本至关重要。传统视频分词器在不同视频的时间块上采用统一的令牌分配策略，常将令牌浪费在简单、静态或重复的片段上，而对动态或复杂片段分配不足。为解决这一效率低下的问题，我们引入了$\textbf{EVATok}$框架，以生成$\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers（高效视频自适应分词器）。该框架通过估计每个视频的最优令牌分配来实现最佳质量-成本权衡，开发轻量级路由器以快速预测这些最优分配，并训练能够根据路由器预测的分配方案对视频进行编码的自适应分词器。我们证明，EVATok在视频重建和下游AR生成的效率与整体质量方面均带来显著提升。通过集成视频语义编码器的先进训练方案增强后，EVATok在UCF-101数据集上实现了卓越的重建效果和最先进的类别到视频生成性能，与先前最优的LARP方法及我们自身的固定长度基线相比，平均令牌使用量至少节省24.4%。

Abstract

Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce $\textbf{EVATok}$, a framework to produce $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.

Keywords: video tokenization, autoregressive video generation, adaptive token assignment, efficiency-quality trade-off, video reconstruction, computational cost, lightweight routers, state-of-the-art generation
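As a rough illustration of what "optimal token assignment" could mean, the Python sketch below picks, for one video, the per-block token assignment that maximizes quality minus a cost penalty. The abstract does not state EVATok's actual objective or search procedure, so the linear penalty, the candidate list, and the names `pick_assignment` and `cost_weight` are all assumptions.

```python
def pick_assignment(candidates, cost_weight):
    """candidates: list of (tokens_per_block, quality) pairs for one video.
    Returns the assignment with the best quality-minus-cost score."""
    def objective(candidate):
        assignment, quality = candidate
        return quality - cost_weight * sum(assignment)  # assumed linear cost
    return max(candidates, key=objective)[0]

# Made-up reconstruction qualities for three ways of spending tokens
# across three temporal blocks of the same video.
candidates = [
    ((4, 4, 4), 0.80),  # uniform, small budget
    ((2, 2, 8), 0.84),  # same budget, shifted toward the dynamic block
    ((8, 8, 8), 0.86),  # uniform, double budget
]
best = pick_assignment(candidates, cost_weight=0.01)
print(best)
```

In the framework described by the abstract, a lightweight router would predict such an assignment directly instead of searching over candidates at inference time.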

141. ❌ MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Authors: Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12266v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

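Scoring rationale: The paper focuses on evaluating multimodal large language models (MLLMs) on visually grounded deep compositional reasoning; its core contribution is the MM-CondChain benchmark. Relevance to the keywords: 1) Highly relevant (10 points): the paper explicitly involves "Large Language Models" (MLLMs extend LLMs), "Chain of Thought" (multi-step reasoning chains), "System 2 Thinking" (in-depth reasoning), and "LLM Agents" (the benchmark is built with an agentic synthesis pipeline). 2) Irrelevant (0 points): other keywords, such as MoE, quantization, and RAG, are not mentioned or discussed; the paper concentrates on benchmark construction and evaluation rather than these specific techniques.

!!! tip deepseek-chat TL;DR

The paper targets the weakness of multimodal large language models in visually grounded deep compositional reasoning. It proposes the MM-CondChain benchmark and shows experimentally that existing models perform poorly on it (the best reaches only 53.33 Path F1), indicating that deep compositional reasoning remains a fundamental challenge.

Abstract (Chinese translation)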

多模态大语言模型（MLLMs）正日益被用于执行视觉工作流程，例如图形用户界面导航，其中后续步骤取决于已验证的视觉组合条件（例如，“如果出现权限对话框且界面颜色为绿色，则点击允许”），且流程可能提前分支或终止。然而，这一能力仍未得到充分评估：现有基准测试主要关注浅层组合或独立约束，而非深度链式组合条件。本文中，我们提出了MM-CondChain，一个面向视觉基础深度组合推理的基准测试。每个测试实例均组织为多层推理链，其中每一层都包含一个基于视觉证据构建的非平凡组合条件，该条件由多个对象、属性或关系构成。为正确作答，MLLM必须细致感知图像，在每一步对多个视觉元素进行推理，并沿着生成的执行路径推导至最终结果。为可扩展地构建此类工作流风格数据，我们提出了一种智能体合成流程：规划器（Planner）协调逐层生成组合条件，而可验证的程序化中间表示（Verifiable Programmatic Intermediate Representation, VPIR）确保每一层的条件在机制上可验证。随后，合成器（Composer）将这些已验证的层组装为完整指令。利用此流程，我们在三个视觉领域构建了基准测试：自然图像、数据图表和GUI轨迹。对一系列MLLM的实验表明，即使最强模型也仅达到53.33的路径F1分数，且在困难负例以及随着深度或谓词复杂度增加时性能急剧下降，这证实深度组合推理仍然是一个根本性挑战。

Abstract

Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., “if a permission dialog appears and the color of the interface is green, click Allow”) and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow-compositions or independent-constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer’s condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.

Keywords: Multimodal Large Language Models, Visually Grounded Reasoning, Compositional Reasoning, Benchmark, Agentic Synthesis Pipeline, Verifiable Programmatic Intermediate Representation, Deep Compositional Conditionals, Path F1
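A chained compositional condition with early termination can be pictured with a small Python sketch. This is not the benchmark's VPIR format, which the abstract does not specify: the scene dictionary, the `all_of` and `on_fail` fields, and `run_chain` are hypothetical, and the first layer borrows the permission-dialog example from the abstract.

```python
def run_chain(scene, layers, final_action):
    """Evaluate layers in order. Each layer is a conjunction of predicates
    over the scene; the first failing layer ends the chain early."""
    path = []
    for layer in layers:
        holds = all(predicate(scene) for predicate in layer["all_of"])
        path.append(holds)
        if not holds:
            return path, layer["on_fail"]
    return path, final_action

# A symbolic stand-in for what a model would have to perceive in the image.
scene = {"dialog": "permission", "theme_color": "green", "button_count": 2}
layers = [
    {
        "all_of": [
            lambda s: s["dialog"] == "permission",
            lambda s: s["theme_color"] == "green",
        ],
        "on_fail": "stop",
    },
    {
        "all_of": [lambda s: s["button_count"] >= 3],
        "on_fail": "click Deny",
    },
]
path, action = run_chain(scene, layers, final_action="click Allow")
print(path, action)
```

Because every predicate is executable, the ground-truth path can be checked mechanically, which is the role the abstract assigns to the verifiable intermediate representation.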

142. ❌ OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

Authors: Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12265v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

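Scoring rationale: The paper proposes OmniStream, a unified streaming visual backbone focused on visual perception, reconstruction, and action. Relevance to the keywords: 1) Highly relevant (10 points): "Pre-training" (the paper pre-trains with a multi-task framework) and "KV Cache Compression" (it uses a persistent KV-cache for online stream processing). 2) Moderately relevant (5 points): "Large Language Models" (as a vision foundation model, it falls under foundation models), "Instruction Tuning/Alignment" (it involves vision-language alignment), "Chain of Thought/System 2 Thinking" (it supports complex video and spatial reasoning), "LLM Agents" (it targets interactive and embodied agents), and "World Models" (it learns general visual understanding, similar to the world-model concept). 3) Irrelevant (0 points): the remaining keywords mainly concern language models, training techniques, and inference optimization, and have no direct connection to the paper's visual focus.

!!! tip deepseek-chat TL;DR

The paper examines the fragmentation of current vision foundation models in real-time streaming settings and proposes OmniStream, a streaming visual backbone that unifies perception, reconstruction, and action through causal spatiotemporal attention and 3D-RoPE. After multi-task pre-training, it generalizes across semantic, spatial, and temporal reasoning.

Abstract (Chinese translation)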

现代视觉智能体需要在实时流式环境中运行，这要求其表征具备通用性、因果性和物理结构化特性。然而，当前的视觉基础模型仍处于割裂状态，仅能专门处理图像语义感知、离线时序建模或空间几何中的单一任务。本文提出了 OmniStream，一个统一的流式视觉骨干网络，能够从多样化的视觉输入中有效地进行感知、重建与行动。通过引入因果时空注意力机制与三维旋转位置编码（3D-RoPE），我们的模型借助持久的键值缓存（KV-cache）实现了对视频流的高效逐帧在线处理。我们在 29 个数据集上，通过耦合静态与时序表征学习、流式几何重建以及视觉-语言对齐的协同多任务框架对 OmniStream 进行预训练。大量评估表明，即使在骨干网络严格冻结的情况下，OmniStream 在图像与视频探测、流式几何重建、复杂视频与空间推理以及机器人操控（训练中未见过）等任务上，均能与各领域专家模型持续保持竞争力。我们的工作并非追求在特定基准测试上的绝对优势，而是证明了训练一个单一、通用的视觉骨干网络是可行的，该网络能够在语义、空间和时序推理中实现泛化。这为面向交互式与具身智能体的通用视觉理解迈出了更具实质性的一步。

Abstract

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.

Keywords: streaming visual backbone, causal spatiotemporal attention, 3D rotary positional embeddings, persistent KV-cache, multi-task pre-training, vision-language alignment, geometric reconstruction, embodied agents
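Frame-by-frame processing with a persistent KV-cache can be sketched as follows. This is a generic causal-attention toy in Python, not OmniStream's architecture: it uses one vector per frame, omits 3D-RoPE and multi-head structure, and the class name `StreamingAttention` is hypothetical.

```python
import math

class StreamingAttention:
    """Causal attention over a stream: each new frame attends to the cached
    keys/values of all earlier frames plus its own, and nothing later."""

    def __init__(self):
        self.keys = []    # persistent cache, grows by one entry per frame
        self.values = []

    def step(self, query, key, value):
        self.keys.append(key)
        self.values.append(value)
        scale = 1.0 / math.sqrt(len(query))
        logits = [scale * sum(q * k for q, k in zip(query, past))
                  for past in self.keys]
        peak = max(logits)  # stable softmax
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        # Attention output: softmax-weighted average of the cached values.
        return [sum(w / total * v[d] for w, v in zip(weights, self.values))
                for d in range(len(value))]

stream = StreamingAttention()
first = stream.step([1.0, 0.0], [1.0, 0.0], [2.0, 4.0])
second = stream.step([0.0, 1.0], [0.0, 1.0], [6.0, 8.0])
print(first, second)
```

Earlier frames are never recomputed: each `step` only appends to the cache and attends over it, which is what makes online processing cheap per frame. The cache grows with stream length, so long streams would need some bound on it in practice.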

143. ❌ GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing

Authors: Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12264v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

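Scoring rationale: GRADE focuses on evaluating the reasoning ability of multimodal models in cross-disciplinary image editing and has no direct connection to most of the specific large-model keywords (MoE, quantization, RLHF, and so on). It involves "Chain of Thought/System 2 Thinking" (evaluating reasoning) and "AI for Science" (application to scientific fields), which are fairly relevant (8 points). It has some relevance to "Large Language Models" (5 points), because multimodal models are often built on LLMs, but the paper does not go into LLM techniques themselves. All other keywords are irrelevant (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes the GRADE benchmark, the first to evaluate the reasoning ability of multimodal models in image editing driven by cross-disciplinary knowledge, and finds that current models show significant limitations in implicit, knowledge-intensive editing settings.

Abstract (Chinese translation)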

统一多模态模型旨在实现联合理解、推理与生成，但当前的图像编辑基准大多局限于自然图像和浅层常识推理，难以在结构化、领域特定的约束下充分评估此类能力。本研究提出GRADE，首个用于评估图像编辑中学科知识与推理能力的基准。GRADE涵盖从自然科学到社会科学的10个学术领域，共包含520个精心构建的样本。为支持严谨评估，我们提出一个多维评估框架，综合考量学科推理（Discipline Reasoning）、视觉一致性（Visual Consistency）与逻辑可读性（Logical Readability）。通过对20个前沿开源与闭源模型的大规模实验，我们发现当前模型在隐含的、知识密集的编辑场景中存在显著局限，导致性能差距巨大。除量化评分外，我们通过深入分析与消融实验，揭示了模型的不足并明确了学科编辑中的关键约束。GRADE为统一多模态模型的未来发展指明了方向，推动了基于学科知识的图像编辑与推理研究。本基准及相关评估代码已公开发布。

Abstract

Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.

Keywords: multimodal models, image editing, discipline-informed reasoning, benchmark evaluation, knowledge-intensive tasks, academic domains, visual consistency, logical readability

144. ❌ DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

Authors: Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Ruihang Chu, Yingya Zhang, Yike Guo, Xihui Liu, Hongming Shan Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12257v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

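Scoring rationale: The paper focuses on multi-subject customization and omni-motion control for video diffusion models, using techniques such as condition-aware 3D rotary positional embedding, hierarchical motion injection, group and role embeddings, and latent identity reward feedback learning. Every keyword targets large language models (LLMs) and related techniques (MoE, RLHF, RAG, CoT, quantization, and so on), whereas this paper studies video-generation diffusion models, which belong to computer vision and generative modeling and have no direct connection to LLMs. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes the DreamVideo-Omni framework, which uses a two-stage training paradigm to address identity preservation and omni-motion control in multi-subject video customization, producing high-quality, controllable video.

Abstract (Chinese translation)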

尽管大规模扩散模型已彻底改变了视频合成领域，但实现对多主体身份与多粒度运动的精确控制仍是一项重大挑战。近期为弥合这一差距的尝试常受限于运动粒度的不足、控制模糊性以及身份退化等问题，导致在身份保持与运动控制方面的表现欠佳。本研究提出了DreamVideo-Omni，这是一个通过渐进式两阶段训练范式实现和谐多主体定制与全运动控制的统一框架。在第一阶段，我们整合了涵盖主体外观、全局运动、局部动态及摄像机运动的综合控制信号进行联合训练。为确保鲁棒且精确的可控性，我们引入了条件感知的三维旋转位置编码以协调异构输入，并采用分层运动注入策略以增强全局运动引导。此外，为解决多主体模糊问题，我们引入了组别与角色嵌入，将运动信号显式锚定至特定身份，从而有效将复杂场景解耦为独立可控的实例。在第二阶段，为缓解身份退化，我们设计了一种潜在身份奖励反馈学习范式，通过在预训练的视频扩散骨干网络上训练潜在身份奖励模型，在潜在空间中提供运动感知的身份奖励，优先保障符合人类偏好的身份保持。基于我们构建的大规模数据集及用于多主体与全运动控制评估的综合基准DreamOmni Bench，DreamVideo-Omni在生成具有精确可控性的高质量视频方面展现出卓越性能。

Abstract

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.

关键词: video diffusion models, multi-subject customization, omni-motion control, latent identity reinforcement learning, hierarchical motion injection, condition-aware 3D rotary positional embedding, group and role embeddings, identity preservation

145. ❌ Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

Authors: Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12262v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

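Scoring rationale: The paper proposes the Video Streaming Thinking (VST) paradigm, focused on real-time reasoning in online Video Large Language Models (VideoLLMs). The core relevant keywords are: 1) 'Large Language Models' (weight 1.0, relevance 10.0): the paper builds on VideoLLMs, an application of LLMs to video; 2) 'Post-training OR Supervised Fine-tuning OR SFT' (weight 1.0, relevance 10.0): the paper explicitly proposes VST-SFT as part of its post-training pipeline; 3) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (weight 1.0, relevance 10.0): the paper designs an entity-relation grounded streaming Chain-of-Thought to strengthen multi-evidence reasoning. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Context Window are not addressed in the paper and score 0. The weighted total is computed as (10 × 1.0 + 10 × 1.0 + 10 × 1.0) = 30.0.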
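
The arithmetic in the rationale can be sketched as follows. This is a minimal illustration, assuming each keyword contributes weight × relevance and that the passing threshold follows the report's configured formula (5.0 + 0.8 × total weight). The function names are hypothetical and are not taken from the report generator. The second call uses the weight of 0.0 that the score table lists for every keyword.

```python
def weighted_total(scored_keywords):
    """Sum weight * relevance over (weight, relevance) pairs."""
    return sum(weight * relevance for weight, relevance in scored_keywords)


def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


# Weights as stated in the rationale: three matched keywords, weight 1.0, relevance 10.0.
rationale_total = weighted_total([(1.0, 10.0), (1.0, 10.0), (1.0, 10.0)])

# Weights as listed in the score table: the same three keywords with weight 0.0.
table_total = weighted_total([(0.0, 10.0), (0.0, 10.0), (0.0, 10.0)])

threshold = passing_score(27.0)  # 27 keywords, each configured with weight 1.0
print(rationale_total, table_total, round(threshold, 1))
```

Keeping the threshold in its own function makes it straightforward to recompute when the keyword list or weights change.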

!!! tip deepseek-chat TL;DR

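The paper addresses the trade-off between reasoning latency and responsiveness that online VideoLLMs face in real-time interaction. It proposes the Video Streaming Thinking (VST) paradigm, which combines a 'thinking while watching' mechanism with a post-training pipeline of VST-SFT and VST-RL, substantially improving streaming video understanding while preserving real-time responsiveness and performing strongly on multiple benchmarks.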

摘要翻译

在线视频大语言模型（VideoLLMs）在支持响应式实时交互中发挥着关键作用。现有方法主要关注流式感知，缺乏同步的逻辑推理流。然而，直接应用测试时缩放方法会带来不可接受的响应延迟。为解决这一权衡问题，我们提出了视频流式思考（Video Streaming Thinking，VST），这是一种用于流式视频理解的新范式。它支持“边观看边思考”机制，在视频流传输过程中对传入的视频片段激活推理。该设计通过在视频播放过程中分摊大语言模型的推理延迟，在保持实时响应能力的同时，提升了及时理解与连贯认知能力。此外，我们引入了一个全面的后训练流程，该流程整合了VST-SFT（通过结构适配将离线VideoLLM转变为因果流式推理模型）和VST-RL（通过在多轮视频交互环境中进行自我探索，提供端到端的改进）。另外，我们设计了一个自动化的训练数据合成流程，该流程利用视频知识图谱生成高质量的流式问答对，并采用基于实体-关系接地的流式思维链，以强化对视频流的多证据推理和持续注意力。大量评估表明，VST-7B模型在在线基准测试中表现强劲，例如在StreamingBench上达到79.5%，在OVO-Bench上达到59.3%。同时，VST在离线长视频或推理基准测试中仍保持竞争力。与Video-R1相比，VST的响应速度快了15.7倍，并在VideoHolmes基准上实现了+5.4%的性能提升，证明了其在多样化视频理解任务中具有更高的效率和强大的泛化能力。代码、数据和模型将在 https://github.com/1ranGuan/VST 发布。

摘要 (Abstract)

Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking while watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g. 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.
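
The latency argument in the abstract can be illustrated with a toy timing model. This is a sketch under stated assumptions, not the authors' implementation: the clip durations and per-clip reasoning times are made-up numbers, and `response_latency_offline` and `response_latency_streaming` are hypothetical helpers. It assumes reasoning for a clip can run while later clips play, so only reasoning that is still unfinished when the query arrives delays the answer.

```python
def response_latency_offline(think_times, answer_time):
    """All reasoning starts after the query arrives, so it is paid in full."""
    return sum(think_times) + answer_time


def response_latency_streaming(clip_durations, think_times, answer_time):
    """Reasoning for a clip starts once that clip has finished playing and the
    previous reasoning step is done; the query arrives when the last clip ends."""
    clock = 0.0        # playback time
    busy_until = 0.0   # time at which the reasoner becomes free
    for duration, think in zip(clip_durations, think_times):
        clock += duration
        start = max(clock, busy_until)
        busy_until = start + think
    backlog = max(0.0, busy_until - clock)  # reasoning still pending at query time
    return backlog + answer_time


clips = [2.0, 2.0, 2.0]      # seconds of video per clip (assumed)
thinking = [1.5, 1.5, 1.5]   # seconds of reasoning per clip (assumed)
offline = response_latency_offline(thinking, 0.5)
streaming = response_latency_streaming(clips, thinking, 0.5)
print(offline, streaming)
```

In this toy setting only the reasoning for the final clip remains when the query arrives, which is the sense in which latency is amortized over playback.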

关键词: Video Large Language Models, Streaming Video Understanding, Real-time Interaction, Chain-of-Thought Reasoning, Post-training Pipeline, Video Knowledge Graphs, Multi-evidence Reasoning, Online Benchmarks

146. ❌ Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Authors: Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, Yueqi Duan Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12255v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

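Scoring rationale: The paper works in computer vision and video understanding, proposing a hybrid architecture and a test-time training method for streaming spatial video processing. Although it involves deep learning techniques, every keyword concerns large language models (LLMs) and related techniques (such as fine-tuning, alignment, reasoning, and agents) or specific AI-for-science applications (such as bioinformatics). The paper's core is visual spatial understanding, and it does not mention any language model or related technique, so every keyword scores 0.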

!!! tip deepseek-chat TL;DR

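The paper studies how to continually maintain and update spatial information from long video streams to achieve spatial intelligence. It proposes Spatial-TTT, which uses test-time training and a hybrid architecture to reach state-of-the-art performance on video spatial benchmarks.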

摘要翻译

人类通过连续的视觉观察来感知和理解现实世界空间。因此，从潜在无限的视频流中持续维护和更新空间证据的能力，对于空间智能至关重要。核心挑战不仅在于更长的上下文窗口，更在于如何随时间推移对空间信息进行选择、组织和保留。本文提出 Spatial-TTT，旨在通过测试时训练（Test-Time Training, TTT）实现基于视觉的流式空间智能。该方法通过调整部分参数（快速权重）来捕获和组织长时序场景视频中的空间证据。具体而言，我们设计了一种混合架构，采用大块更新与滑动窗口注意力并行的方法，以实现高效的空间视频处理。为进一步增强空间感知能力，我们在TTT层中引入了结合3D时空卷积的空间预测机制，促使模型捕捉跨帧的几何对应关系与时间连续性。除架构设计外，我们还构建了一个包含密集3D空间描述的数据集，引导模型更新其快速权重，以结构化的方式记忆和组织全局3D空间信号。大量实验表明，Spatial-TTT 显著提升了长时序空间理解能力，并在视频空间基准测试中取得了最先进的性能。项目页面：https://liuff19.github.io/Spatial-TTT。

摘要 (Abstract)

Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
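
The idea of adapting fast weights at test time can be sketched with a deliberately tiny model. This is an illustration only and does not reproduce the paper's TTT layers, large-chunk updates, or sliding-window attention. The scalar fast weight, the next-value prediction loss, and the helper names `ttt_update` and `stream` are all assumptions made for the example.

```python
def ttt_update(fast_w, chunk, lr=0.1):
    """One test-time update: nudge the fast weight so that fast_w * x predicts
    the next observation in the chunk (a stand-in self-supervised objective)."""
    pairs = list(zip(chunk[:-1], chunk[1:]))
    grad = 0.0
    for x, target in pairs:
        grad += 2.0 * (fast_w * x - target) * x  # d/dw of the squared error
    return fast_w - lr * grad / len(pairs)


def stream(chunks, fast_w=0.0):
    """Process a stream chunk by chunk, carrying only the fast weight forward."""
    for chunk in chunks:
        fast_w = ttt_update(fast_w, chunk)
    return fast_w


# Each observation is half the previous one, so the ideal weight is 0.5.
chunks = [[1.0, 0.5, 0.25]] * 50
learned = stream(chunks)
print(round(learned, 3))
```

The point of the sketch is that the state carried across chunks is a small set of parameters, not the full history of observations.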

关键词: spatial intelligence, streaming video, test-time training, long-horizon spatial understanding, 3D spatial descriptions, sliding-window attention, spatiotemporal convolution, video spatial benchmarks

147. ❌ Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Authors: Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12254v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	2.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	2.0/10	0.0
Scaling Laws AND Data Quality	0.0	2.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	2.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	2.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	2.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	2.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	2.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	2.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	2.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	2.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	2.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	2.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	2.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	2.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	2.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	2.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	2.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	2.0/10	0.0
World Models AND General World Models	0.0	2.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	2.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	2.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	2.0/10	0.0

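Scoring rationale: The paper proposes AutoGaze, an autoregressive patch-selection module trained with reinforcement learning that sharply reduces the number of visual tokens multimodal LLMs need for long videos (4x-100x) and accelerates inference (up to 19x). Core relevant keywords: 1) 'Large Language Models' (8 points): the paper optimizes MLLMs; 2) 'Context Window Extension' (8 points): it lets MLLMs handle 1K-frame, 4K-resolution long videos; 3) 'Speculative Decoding OR Inference Acceleration' (8 points): it achieves up to 19x inference acceleration; 4) 'RLHF' (5 points): the AutoGaze module is trained with reinforcement learning. The remaining keywords are only weakly related to the paper's core content; they mainly concern general large-model techniques rather than optimization specific to video understanding.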

!!! tip deepseek-chat TL;DR

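The paper targets the spatiotemporal redundancy that multimodal LLMs face when processing long, high-resolution videos. It proposes AutoGaze, which uses autoregressive patch selection to substantially reduce visual tokens and accelerate inference, allowing models to scale to 1K-frame 4K videos and achieve strong results on multiple benchmarks.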

摘要翻译

多模态大语言模型（MLLMs）在通用视频理解方面取得了进展，但在处理长时长、高分辨率视频时仍面临困难——尽管存在显著的时空冗余，其视觉变换器（ViTs）或大语言模型仍对每个像素进行同等处理。我们提出了AutoGaze，这是一个轻量级模块，可在视频被ViT或MLLM处理前移除冗余图像块。通过下一词预测和强化学习进行训练，AutoGaze能够自回归地选择一组最少的、多尺度的图像块，这些块可在用户指定的误差阈值内重建视频，从而在保留信息的同时消除冗余。实验表明，AutoGaze将视觉标记数量减少了4倍至100倍，并将ViTs和MLLMs的处理速度提升高达19倍，使得MLLMs能够扩展至处理长达1000帧的4K分辨率视频，并在视频基准测试中取得优异结果（例如在VideoMME上达到67.0%）。此外，我们提出了HLVid：首个包含5分钟4K分辨率视频的高分辨率、长视频问答基准测试，其中采用AutoGaze扩展的MLLM相比基线模型提升了10.1%，并优于先前最佳MLLM 4.5%。项目页面：https://autogaze.github.io/。

摘要 (Abstract)

Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos – they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
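
The notion of keeping only the patches needed to reconstruct a frame within an error threshold can be sketched with a greedy heuristic. This is a stand-in for illustration, not AutoGaze itself: the paper's selector is a learned autoregressive module over multi-scale patches, whereas the code below assumes single-scale patches represented as scalars, mean absolute error as the metric, and hypothetical helpers `select_patches` and `reconstruct`.

```python
def select_patches(prev_recon, frame, max_error):
    """Return indices of patches to keep so that the mean absolute error of
    the reconstruction is at or below max_error."""
    errors = [abs(cur - old) for cur, old in zip(frame, prev_recon)]
    residual = sum(errors) / len(errors)
    kept = []
    # Keep the patches that changed most first, until the error budget is met.
    for i in sorted(range(len(frame)), key=lambda i: errors[i], reverse=True):
        if residual <= max_error:
            break
        kept.append(i)
        residual -= errors[i] / len(errors)
    return sorted(kept)


def reconstruct(prev_recon, frame, kept):
    """Copy kept patches from the new frame; reuse the old value elsewhere."""
    recon = list(prev_recon)
    for i in kept:
        recon[i] = frame[i]
    return recon


prev = [0.0, 0.0, 0.0, 0.0]
frame = [0.0, 4.0, 0.0, 0.0]
kept = select_patches(prev, frame, max_error=0.5)
print(kept, reconstruct(prev, frame, kept))
```

A static frame yields an empty selection, which is where the token savings on redundant video would come from in this simplified view.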

关键词: Multi-modal Large Language Models, Video Understanding, Autoregressive Patch Selection, Reinforcement Learning, Inference Acceleration, Long-form Video, High-resolution Video, Token Reduction

148. ❌ DVD: Deterministic Video Depth Estimation with Generative Priors

Authors: Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Jing He, Zixin Zhang, Haodong Li, Yihao Liang, Kanghao Chen, Bin Ren, Xu Zheng, Shuai Yang, Kun Zhou, Yinchuan Li, Nicu Sebe, Ying-Cong Chen Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12250v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

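Scoring rationale: The paper focuses on video depth estimation in computer vision and proposes a method that uses a pre-trained video diffusion model for deterministic depth regression. Although it involves generative models (video diffusion models) and deep learning techniques, every scoring keyword refers specifically to large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and agents) or to applications in specific scientific fields (such as bioinformatics). The paper does not touch on language models, language processing, agent systems, or the scientific fields named in the keywords, so every keyword has a relevance of 0.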

!!! tip deepseek-chat TL;DR

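The paper addresses two problems in video depth estimation: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models require large amounts of labeled data. It proposes the DVD framework, which adapts a pre-trained video diffusion model into a single-pass depth regressor, achieving state-of-the-art zero-shot performance while greatly reducing the need for task-specific data.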

摘要翻译

现有视频深度估计面临一个根本性权衡：生成模型易受随机几何幻觉和尺度漂移影响，而判别模型需要海量标注数据来解决语义模糊性问题。为突破此僵局，我们提出DVD——首个将预训练视频扩散模型确定性适配为单次深度回归器的框架。具体而言，DVD包含三项核心设计：（i）将扩散时间步重新定义为结构锚点，以平衡全局稳定性与高频细节；（ii）潜在流形校正（Latent Manifold Rectification, LMR）机制，通过施加微分约束缓解回归导致的过度平滑问题，恢复锐利边界与连贯运动；（iii）全局仿射相干性作为固有属性，能约束跨窗口发散，从而实现无需复杂时序对齐的长视频无缝推理。大量实验表明，DVD在多个基准测试中实现了零样本（zero-shot）性能的突破。此外，DVD仅使用领先基线方法1/163的任务特定数据，便成功解锁了视频基础模型中隐含的深层几何先验。值得关注的是，我们完整开源了训练流程，为开源社区提供实现视频深度估计前沿性能的完整工具套件。

摘要 (Abstract)

Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.

关键词: video depth estimation, generative models, video diffusion models, deterministic depth regression, zero-shot performance, geometric priors, latent manifold rectification, global affine coherence

149. ❌ Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

Authors: Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, Xue Yang Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12247v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

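Scoring rationale: The paper focuses on reward modeling for reinforcement learning in image generation and editing; its core is resolving hallucination and noise in reward models to improve faithfulness. It is unrelated to most keywords because it does not involve large language models, MoE, small models, training techniques, inference optimization, agent systems, and so on. Only three keywords are relevant: 1) 'Hallucination Mitigation OR Factuality OR Truthfulness' (10 points): core content, since the paper directly tackles reward-model hallucination to improve faithfulness; 2) 'Scaling Laws AND Data Quality' (5 points): somewhat related, since the paper notes the importance of data quality for reward models; 3) 'Instruction Tuning OR Alignment OR Value Alignment' (5 points): somewhat related, since the paper involves instruction following and model alignment to generate faithful images.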

!!! tip deepseek-chat TL;DR

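The paper proposes the FIRM framework, which builds high-quality datasets and trains robust reward models to address hallucination and noise in reward models for image editing and generation, significantly improving the faithfulness and instruction following of generated images.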

摘要翻译

强化学习（Reinforcement Learning, RL）已成为提升图像编辑与文本到图像（Text-to-Image, T2I）生成能力的一种前景广阔的研究范式。然而，当前在强化学习中充当评判者的奖励模型常存在幻觉问题，并给出噪声评分，从而在本质上误导优化过程。本文提出FIRM（Faithful Image Reward Modeling，忠实图像奖励建模），这是一个构建鲁棒奖励模型的综合性框架，旨在为忠实的图像生成与编辑提供准确可靠的指导。首先，我们设计了定制化的数据构建流程，以建立高质量的评分数据集。具体而言，我们通过执行度与一致性两方面评估编辑任务，而生成任务则主要通过指令遵循程度进行评估。利用这些流程，我们收集了FIRM-Edit-370K和FIRM-Gen-293K数据集，并训练了专门化的奖励模型（FIRM-Edit-8B和FIRM-Gen-8B），这些模型能精确反映上述评估标准。其次，我们推出了FIRM-Bench，这是一个专门为编辑与生成任务评判者设计的综合性基准测试。评估结果表明，与现有指标相比，我们的模型在与人判断的一致性方面表现更优。此外，为了将这些评判者无缝整合到强化学习流程中，我们提出了一种新颖的“基础与加成”奖励策略，以平衡相互竞争的目标：针对编辑任务的“一致性调节执行度”（Consistency-Modulated Execution, CME）和针对生成任务的“质量调节对齐度”（Quality-Modulated Alignment, QMA）。在此框架支持下，我们最终得到的模型FIRM-Qwen-Edit和FIRM-SD3.5实现了显著的性能突破。全面的实验证明，FIRM有效缓解了幻觉问题，在忠实度与指令遵循方面为现有通用模型树立了新的标准。我们所有的数据集、模型和代码均已公开于https://firm-reward.github.io。

摘要 (Abstract)

Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel “Base-and-Bonus” reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code have been publicly available at https://firm-reward.github.io.

关键词: reinforcement learning, reward modeling, image editing, text-to-image generation, hallucination mitigation, faithful generation, data curation, benchmark evaluation

150. ❌ One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Authors: Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12245v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

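Scoring rationale: The paper studies compute-efficiency optimization for diffusion transformers (DiTs) and proposes the ELIT mechanism for dynamic compute allocation, placing it in computer vision and generative modeling. Every scoring keyword targets large language models (LLMs) and related techniques (such as training methods, inference optimization, and applications), whereas this paper focuses on diffusion transformers (a visual generative model) and does not involve any LLM technique, training method, or application, so every keyword has a relevance of 0.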

!!! tip deepseek-chat TL;DR

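The paper addresses the low compute efficiency of diffusion transformers (DiTs) by proposing the Elastic Latent Interface Transformer (ELIT), which decouples compute from image resolution by dynamically adjusting the latent sequence length, and significantly improves generation quality across multiple datasets and architectures (for example, an average FID gain of 35.3% on ImageNet-1K 512px).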

摘要翻译

扩散变换器（Diffusion Transformers, DiTs）虽能实现较高的生成质量，但其计算量（FLOPs）与图像分辨率锁定，限制了在延迟与质量之间进行原则性权衡的能力，并且将计算均匀分配于输入的空间标记（spatial tokens）上，导致对不重要区域的计算资源浪费。我们提出弹性潜在接口变换器（Elastic Latent Interface Transformer, ELIT），这是一种即插即用、与DiT兼容的机制，能够将输入图像尺寸与计算量解耦。该方法引入了一个潜在接口（latent interface）——一个可学习的可变长度标记序列，标准的变换器模块可在此基础上进行操作。轻量级的读取（Read）与写入（Write）交叉注意力层在空间标记与潜在标记之间传递信息，并优先处理重要的输入区域。通过随机丢弃尾部潜在标记进行训练，ELIT学会生成按重要性排序的表征，其中较早的潜在标记捕获全局结构，而较晚的则包含用于细化细节的信息。在推理阶段，潜在标记的数量可根据计算约束进行动态调整。ELIT的设计力求极简，仅增加了两个交叉注意力层，同时保持了修正流（rectified flow）目标与DiT主干结构不变。在不同数据集和架构（DiT, U-ViT, HDiT, MM-DiT）上，ELIT均带来了一致的性能提升。在ImageNet-1K 512px数据集上，ELIT在FID和FDD分数上分别实现了平均35.3%和39.6%的提升。项目页面：https://snap-research.github.io/elit/

摘要 (Abstract)

Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resource allocation to unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of 35.3% and 39.6% in FID and FDD scores. Project page: https://snap-research.github.io/elit/
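
The tail-dropping idea from the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the ELIT implementation: latents are plain list items, not tensors, the keep count is sampled uniformly, and `drop_tail_latents` and `truncate_for_budget` are hypothetical helper names.

```python
import random


def drop_tail_latents(latents, min_keep=1, rng=random):
    """Training-time sketch: keep a random-length prefix of the latent sequence."""
    keep = rng.randint(min_keep, len(latents))
    return latents[:keep]


def truncate_for_budget(latents, budget):
    """Inference-time sketch: keep as many leading latents as the budget allows."""
    return latents[:max(1, min(budget, len(latents)))]


latents = [f"z{i}" for i in range(8)]
sampled = drop_tail_latents(latents, rng=random.Random(0))
print(sampled)
print(truncate_for_budget(latents, budget=3))
```

Because only tail positions are ever removed, early positions are present in every training step, which is consistent with the abstract's description of earlier latents capturing global structure.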

关键词: Diffusion Transformers, Elastic Latent Interface, Compute Efficiency, Dynamic Computation, Importance-ordered Representations, Cross-attention Layers, Latency-quality Trade-offs, Generative Models

151. ❌ BiGain: Unified Token Compression for Joint Generation and Classification

Authors: Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12240v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

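Scoring rationale: The paper studies token compression for diffusion models (the BiGain framework), focusing on image generation and classification, and is entirely unrelated to the scoring keywords, all of which target large language model (LLM) techniques. The keywords cover LLM architecture, training, alignment, inference optimization, agent systems, and so on, whereas the paper studies visual token processing in diffusion models, a different model family (diffusion models within generative AI versus large language models).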

!!! tip deepseek-chat TL;DR

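The paper proposes the BiGain framework, which uses frequency-aware token compression to accelerate diffusion models while jointly optimizing image generation quality and classification accuracy.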

摘要翻译

扩散模型加速方法（如令牌合并或下采样）通常在降低计算量的同时优化合成质量，却往往忽视判别能力。我们以联合目标重新审视令牌压缩，提出BiGain——一个无需训练、即插即用的框架，在保持生成质量的同时提升加速后扩散模型的分类性能。我们的核心见解是频率分离：将特征空间信号映射为频率感知表示，从而解耦细节信息与全局语义，实现兼顾生成保真度与判别效用的压缩。BiGain通过两个频率感知算子体现这一原则：（1）拉普拉斯门控令牌合并，鼓励频谱平滑的令牌间合并，同时抑制高对比度令牌的合并，从而保留边缘与纹理；（2）插值-外推KV下采样，通过在最近邻池化与平均池化间进行可控的内插-外推来下采样键/值，同时保持查询向量完整，从而维护注意力精度。在基于DiT和U-Net的骨干网络及ImageNet-1K、ImageNet-100、Oxford-IIIT Pets和COCO-2017数据集上的实验表明，我们的算子能持续改善基于扩散的分类任务的速度-精度权衡，并在可比加速条件下维持或提升生成质量。例如在ImageNet-1K上，对Stable Diffusion 2.0实施70%令牌合并时，BiGain将分类准确率提升7.15%，同时将FID改善0.34（相对提升1.85%）。我们的分析表明，平衡的频谱保留——同时保持高频细节与中低频语义——是扩散模型中令牌压缩的可靠设计准则。据我们所知，BiGain是首个在加速扩散条件下共同研究并推进生成与分类性能的框架，有助于实现更低成本的部署。

摘要 (Abstract)

Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
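
The second operator, a controllable blend between nearest and average pooling for keys and values, can be sketched as follows. The abstract does not give the formula, so everything specific here is an assumption: the linear blend `average + alpha * (nearest - average)`, the choice of the first element of each window as the "nearest" sample, scalar tokens, and the helper name `downsample_kv`. Queries are not passed in at all, reflecting the statement that they are kept intact.

```python
def downsample_kv(values, window, alpha):
    """Blend nearest and average pooling over non-overlapping windows.

    alpha = 0 gives average pooling, alpha = 1 gives nearest pooling, and
    alpha > 1 extrapolates past nearest, away from the average.
    """
    out = []
    for start in range(0, len(values) - window + 1, window):
        block = values[start:start + window]
        nearest = block[0]              # assumed: first element of the window
        average = sum(block) / window
        out.append(average + alpha * (nearest - average))
    return out


keys = [1.0, 3.0, 2.0, 6.0]
print(downsample_kv(keys, window=2, alpha=0.0))  # average pooling
print(downsample_kv(keys, window=2, alpha=1.0))  # nearest pooling
print(downsample_kv(keys, window=2, alpha=2.0))  # extrapolation
```

A single scalar `alpha` is used here so that both pooling modes and the extrapolation regime fall out of one expression.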

Keywords: diffusion models, token compression, generation quality, classification accuracy, frequency-aware representation, Laplacian-gated token merging, KV downsampling, speed-accuracy trade-off

152. ❌ SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

Authors: Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12238v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

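Scoring rationale: The paper proposes SceneAssistant, a visual-feedback-driven agent for open-vocabulary 3D scene generation. Its core contribution is to use the spatial reasoning and planning capabilities of vision-language models (VLMs), generating and editing 3D scenes through iterative feedback and atomic operations (e.g., Scale, Rotate). It is therefore highly relevant (10) to agents (LLM Agents/Autonomous Agents/Agentic Workflow) and tool use (Tool Use/Function Calling), since the paper's core is an agent system that executes concrete operations (tools). It is moderately relevant (5) to reasoning (Chain of Thought/System 2 Thinking) and self-improvement (Self-Correction), because the agent refines the scene iteratively from visual feedback, which involves multi-step reasoning and self-correction. It is indirectly relevant (5) to foundation models (Large Language Models/Foundation Models), since VLMs are large models, although the paper does not examine their technical principles in depth. The remaining keywords (e.g., MoE, quantization, alignment) are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper proposes SceneAssistant, a VLM-based agent that tackles open-vocabulary 3D scene generation through iterative visual feedback and atomic operations, producing high-quality, diverse scenes and supporting natural-language editing.

Abstract (Chinese translation)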

基于自然语言的文本到三维场景生成对于数字内容创作具有重要价值。然而，现有方法大多局限于特定领域或依赖于预定义的空间关系，这限制了其进行无约束、开放词汇的三维场景合成的能力。本文提出SceneAssistant，一种基于视觉反馈驱动的智能体，专为开放词汇的三维场景生成而设计。我们的框架利用现代三维物体生成模型，并结合视觉语言模型（Vision-Language Models, VLMs）的空间推理与规划能力。为实现开放词汇的场景组合，我们为VLM提供了一套全面的原子操作（例如：缩放、旋转、聚焦于）。在每一个交互步骤中，VLM接收渲染后的视觉反馈并据此采取行动，通过迭代优化场景，以实现更连贯的空间布局以及与输入文本更好的对齐。实验结果表明，我们的方法能够生成多样化、开放词汇且高质量的三维场景。定性分析和定量的人工评估均证明了我们的方法相较于现有方法的优越性。此外，我们的方法允许用户通过自然语言指令引导智能体编辑现有场景。代码发布于 https://github.com/ROUJINN/SceneAssistant。

Abstract

Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant
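
The observe-act loop in the abstract (the VLM receives rendered feedback, picks an atomic operation, and the scene is updated) can be sketched as below. Only the operation names `Scale` and `Rotate` come from the abstract. The action tuple format, the dictionary scene representation, `refine_scene`, and `toy_policy` (a stand-in for the VLM) are hypothetical and are not taken from the SceneAssistant code base.

```python
def apply_action(scene, action):
    """Apply one atomic operation; the (name, target, value) format is hypothetical."""
    name, target, value = action
    if name == "Scale":
        scene[target]["scale"] *= value
    elif name == "Rotate":
        scene[target]["yaw"] = (scene[target]["yaw"] + value) % 360.0
    else:
        raise ValueError(f"unknown operation: {name}")


def refine_scene(scene, policy, max_steps=10):
    """Run the observe-act loop until the policy stops or steps run out.

    Returns the number of actions that were applied.
    """
    for step in range(max_steps):
        # Stand-in for rendering: the policy sees a copy of the scene state.
        observation = {name: dict(obj) for name, obj in scene.items()}
        action = policy(observation)
        if action is None:
            return step
        apply_action(scene, action)
    return max_steps


def toy_policy(observation):
    """Hypothetical stand-in for the VLM: enlarge the chair until scale >= 2."""
    if observation["chair"]["scale"] < 2.0:
        return ("Scale", "chair", 2.0)
    return None


scene = {"chair": {"scale": 1.0, "yaw": 0.0}}
steps = refine_scene(scene, toy_policy)
```

Keeping the operations small and explicit means each step can be checked against the next observation, which is the property the iterative refinement relies on.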

Keywords: 3D scene generation, Vision-Language Models, autonomous agent, visual feedback, open-vocabulary, spatial reasoning, iterative refinement, natural language editing

153. ❌ HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

Authors: Andy Li, Aiden Durrant, Milan Markovic, Georgios Leontidis Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12222v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

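Scoring rationale: The paper focuses on an automatic pruning framework for Vision Transformers, which falls under model compression and efficiency optimization. It is unrelated to most keywords (especially the LLM-specific ones) but moderately related to sparse models, edge deployment, and model compression. Specifically: (1) 'Mixture of Experts OR MoE OR Sparse Models' scores 5 because HiAP obtains sparse models through structured pruning; (2) 'Small Language Models OR SLMs OR On-device AI' scores 5 because the goal is deployment on edge devices; (3) 'Quantization OR Model Compression OR Low-bit Weights' scores 5 because pruning is a form of model compression. All other keywords are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes HiAP, a multi-granular stochastic auto-pruning framework that discovers efficient sub-networks for Vision Transformers in a single end-to-end training phase, addressing the compute and memory bottlenecks of edge deployment and achieving a competitive accuracy-efficiency trade-off on ImageNet.

Abstract (Chinese translation)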

视觉Transformer需要大量计算资源和内存带宽，这严重限制了其在边缘设备上的部署。虽然近期的结构化剪枝方法成功降低了理论FLOPs，但它们通常仅在单一结构粒度上操作，并依赖复杂的多阶段流程与事后阈值处理来满足稀疏性预算。本文提出分层自动剪枝（HiAP），这是一种连续松弛框架，可在单次端到端训练阶段中发现最优子网络，无需依赖人工设计的重要性启发式规则或预定义的逐层稀疏性目标。HiAP在多个粒度上引入随机Gumbel-Sigmoid门控：宏观门控用于剪枝整个注意力头与前馈网络（FFN）模块，微观门控则用于选择性剪枝头内维度和FFN神经元。通过同时优化这两个层级，HiAP同时解决了加载大型矩阵的内存瓶颈开销和计算密集型数学运算问题。HiAP通过结合结构可行性惩罚项与解析FLOPs的损失函数，能够自然收敛到稳定的子网络。在ImageNet上的大量实验表明，HiAP能够有机地发现高效架构，并为DeiT-Small等模型实现了具有竞争力的精度-效率帕累托前沿，其性能与复杂的多阶段方法相当，同时显著简化了部署流程。

Abstract

Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
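
A Gumbel-Sigmoid gate is a relaxed binary switch: logistic noise is added to a learnable logit and the result is passed through a temperature-scaled sigmoid. The sketch below shows the standard relaxation in plain Python. Combining macro- and micro-gates by multiplication, and the names `gumbel_sigmoid` and `effective_gate`, are assumptions for illustration; the abstract does not state how HiAP composes the two levels.

```python
import math
import random


def gumbel_sigmoid(logit, tau=1.0, rng=None):
    """Sample a relaxed binary gate in (0, 1).

    The difference of two Gumbel samples is logistic noise, so it is
    drawn directly as log(u) - log(1 - u). With rng=None the noise is
    disabled and the gate is a plain temperature-scaled sigmoid.
    """
    noise = 0.0
    if rng is not None:
        u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
        noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))


def effective_gate(macro_logit, micro_logits, tau=1.0, rng=None):
    """Combine one macro-gate (a whole head) with micro-gates (its dimensions).

    Multiplication is an assumed composition rule: a closed macro-gate
    switches off every dimension beneath it.
    """
    macro = gumbel_sigmoid(macro_logit, tau, rng)
    return [macro * gumbel_sigmoid(m, tau, rng) for m in micro_logits]


rng = random.Random(0)
sampled_gates = effective_gate(0.5, [2.0, -2.0, 0.0], tau=0.5, rng=rng)
pruned_head = effective_gate(-50.0, [5.0, 5.0])
```

Lowering `tau` pushes the gates toward 0 or 1, which is how a relaxation of this kind can settle into a discrete sub-network during training.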

Keywords: Vision Transformers, structured pruning, edge devices, sparsity, model compression, efficient inference, auto-pruning, FLOPs reduction

154. ❌ A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition

Authors: Jiajun Sun, Zhe Gao Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12221v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

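Scoring rationale: The paper studies facial emotion recognition from visual and audio modalities, which belongs to computer vision and multimedia analysis. It is unrelated to most large-model keywords, but it explicitly uses a Mixture of Experts (MoE) training head to increase classifier diversity, so that keyword is highly relevant (10). It also uses a pretrained DINOv2 model, giving moderate relevance to the pre-training keyword (5). Emotion recognition can be viewed as an application of AI to behavioral science, giving weak relevance to AI for Science (5). No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper proposes a two-stage dual-modality model that combines visual feature extraction (DINOv2 with an MoE head) with audio feature fusion to improve frame-level classification of facial emotional expressions in unconstrained videos, outperforming the official baselines on the ABAW dataset.

Abstract (Chinese translation)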

本文针对第十届野外情感行为分析（ABAW）研讨会与竞赛中的表情识别挑战，该任务要求对无约束视频中的八种面部情绪表达进行帧级分类。由于人脸定位不准确、姿态与尺度变化大、运动模糊、时序不稳定性以及相邻帧间的其他干扰因素，该任务极具挑战性。为应对这些困难，我们提出了一种两阶段双模态（视听）模型。第一阶段侧重于通过预训练的DINOv2基编码器实现鲁棒的视觉特征提取：具体采用DINOv2 ViT-L/14作为主干网络，运用填充感知增强（PadAug）策略对原始视频进行图像填充与数据预处理，并引入专家混合（MoE）训练头以增强分类器多样性。第二阶段处理模态融合与时序一致性问题：在视觉模态上，从原始视频中多尺度重裁剪人脸，并对提取的视觉特征进行平均以形成鲁棒的帧级表征；同时，从短音频窗口提取帧对齐的Wav2Vec 2.0音频特征以提供互补的声学线索。这些双模态特征通过轻量级门控融合模块进行整合，并在推理阶段进行时序平滑处理。在ABAW数据集上的实验验证了所提方法的有效性：该两阶段模型在官方验证集上取得了0.5368的宏平均F1分数，在五折交叉验证下达到0.5122 +/- 0.0277，性能优于官方基线模型。

Abstract

This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.
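
Stage II combines a gated fusion of visual and audio features with inference-time temporal smoothing. The abstract gives neither formula, so the sketch below uses two common choices as assumptions: an element-wise sigmoid gate that mixes the two modalities, and a centered moving average over per-frame scores. The names `gated_fusion` and `smooth_scores` are hypothetical.

```python
import math


def gated_fusion(visual, audio, gate_logits):
    """Mix two feature vectors element-wise with a sigmoid gate.

    A gate near 1 trusts the visual feature, a gate near 0 trusts the
    audio feature. In a trained model the logits would come from a small
    learned layer; here they are passed in directly.
    """
    gates = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    return [g * v + (1.0 - g) * a for g, v, a in zip(gates, visual, audio)]


def smooth_scores(scores, window=3):
    """Centered moving average over per-frame scores (shorter at the edges)."""
    half = window // 2
    smoothed = []
    for t in range(len(scores)):
        lo, hi = max(0, t - half), min(len(scores), t + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


fused = gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0])
smoothed = smooth_scores([0.0, 3.0, 0.0], window=3)
```

Smoothing only at inference time leaves training untouched while damping single-frame flips, which matches the temporal instability the abstract lists as a difficulty.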

Keywords: Facial Emotional Expression Recognition, Two-stage Model, Dual-modality, Mixture of Experts, DINOv2, Audio-visual Fusion, ABAW Competition, Frame-level Classification

155. ❌ Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

Authors: Görkay Aydemir, Fatma Güney, Weidi Xie Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12217v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

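Scoring rationale: The paper studies point tracking in computer vision and proposes verifier-guided pseudo-labeling for fine-tuning on real-world videos. Candidate trajectories are produced by pretrained trackers, a meta-model (the verifier) assesses their reliability and selects trustworthy predictions to form high-quality pseudo-labels, and those labels are then used for fine-tuning. This is largely unrelated to keywords about technical innovation in large models and deep learning or about scientific applications. It relates only to 'Post-training OR Supervised Fine-tuning OR SFT' (8, because fine-tuning is involved) and, more weakly, to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5, because it mentions pretrained models and domain adaptation); all other keywords are unrelated.

!!! tip deepseek-chat TL;DR

The paper addresses the performance drop of point-tracking models on real-world videos by introducing verifier-guided pseudo-labeling, which selects trustworthy tracker predictions to produce high-quality supervision. Experiments show state-of-the-art results on four real-world benchmarks while requiring less data than prior self-training methods.

Abstract (Chinese translation)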

长期点跟踪模型通常在大规模合成数据集上进行训练。由于真实世界视频具有不同特性且缺乏密集的真实标注，这些模型在实际视频中的性能会下降。在未标注视频上进行自训练已被探索为一种实用解决方案，但伪标签的质量高度依赖于教师模型的可靠性，而教师模型的可靠性在不同帧和场景中存在差异。本文针对真实世界微调问题，提出验证器——一种元模型，其通过学习评估跟踪器预测的可靠性并指导伪标签生成。给定来自多个预训练跟踪器的候选轨迹，验证器逐帧评估这些轨迹并选择最可信的预测，从而生成高质量的伪标签轨迹。在微调过程中，验证器引导的伪标注能显著提升监督信号的质量，并实现面向未标注视频的数据高效适应。在四个真实世界基准数据集上的大量实验表明，我们的方法在取得最先进结果的同时，比先前的自训练方法所需数据量更少。项目页面：https://kuis-ai.github.io/track_on_r

Abstract

Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher models, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied for fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: https://kuis-ai.github.io/track_on_r
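
The per-frame selection step can be sketched as an argmax over verifier scores: for every frame, keep the prediction of whichever tracker the verifier trusts most. The data layout and the name `select_pseudo_labels` are assumptions, and the reliability scores are supplied by hand here rather than produced by a learned verifier.

```python
def select_pseudo_labels(candidates, reliability):
    """Build one pseudo-label trajectory from several candidate trajectories.

    candidates[k][t]  -> (x, y) predicted by tracker k at frame t
    reliability[k][t] -> verifier score for that prediction (higher is better)
    """
    num_trackers = len(candidates)
    num_frames = len(candidates[0])
    labels = []
    for t in range(num_frames):
        # Pick the tracker whose prediction the verifier trusts most at frame t.
        best = max(range(num_trackers), key=lambda k: reliability[k][t])
        labels.append(candidates[best][t])
    return labels


candidates = [
    [(0, 0), (1, 1), (2, 2)],  # tracker A
    [(5, 5), (6, 6), (7, 7)],  # tracker B
]
reliability = [
    [0.9, 0.2, 0.6],
    [0.1, 0.8, 0.4],
]
pseudo_labels = select_pseudo_labels(candidates, reliability)
```

Selecting per frame rather than per video lets the pseudo-label switch trackers when one of them drifts, which is the motivation the abstract gives for assessing reliability across frames and scenes.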

Keywords: point tracking, real-world videos, pseudo-labeling, verifier-guided, fine-tuning, self-training, tracker predictions, state-of-the-art

156. ❌ ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models

Authors: Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12208v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

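Scoring rationale: The paper studies multimodal large language models (MLLMs) for multimedia forensics and proposes ForensicZip, a training-free visual token compression framework. The most relevant keyword is 'Large Language Models OR LLMs OR Foundation Models' (8), because the paper explicitly uses MLLMs for forensic analysis. 'Speculative Decoding OR Inference Acceleration' (5) is moderately relevant, because the paper targets compute cost and acceleration strategies but does not study decoding acceleration directly. Other keywords such as MoE, SFT, RAG, and quantization are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the high compute cost of multimodal large language models on high-resolution images and videos, the paper proposes ForensicZip, a forgery-driven visual token compression framework that achieves a 2.97× speedup and over 90% FLOPs reduction while retaining only 10% of tokens and maintaining state-of-the-art forgery detection performance.

Abstract (Chinese translation)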

多模态大语言模型（MLLMs）能够通过生成用于伪造检测的文本推理，实现可解释的多媒体取证。然而，处理密集的视觉序列会产生高昂的计算成本，尤其是对于高分辨率图像和视频。视觉令牌剪枝是一种实用的加速策略，但现有方法主要基于语义驱动，保留显著对象的同时丢弃了背景区域，而高频异常和时间抖动等篡改痕迹往往存在于这些背景区域中。为解决这一问题，我们提出了ForensicZip，这是一个无需训练的框架，从伪造驱动的角度重新构建了令牌压缩问题。ForensicZip将时序令牌演化建模为一个带有松弛虚拟节点的生灭最优传输问题，量化了指示瞬态生成伪影的物理不连续性。取证评分进一步整合了基于传输的新颖性度量与高频先验，以在大比例压缩下将取证证据与语义内容分离。在深度伪造和AIGC基准测试上的实验表明，在仅保留10%令牌的情况下，ForensicZip实现了$2.97\times$的加速和超过90%的浮点运算量减少，同时保持了最先进的检测性能。

Abstract

Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves $2.97\times$ speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
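
The retention step can be illustrated with a simple top-k selection over a combined forensic score. This sketch does not implement the Birth-Death Optimal Transport formulation. It assumes the transport-based novelty and the high-frequency prior are already available as per-token numbers, and it combines them with a weighted sum, which is an assumption. The names `retain_tokens`, `keep_ratio`, and `weight` are hypothetical.

```python
def retain_tokens(novelty, high_freq, keep_ratio=0.1, weight=0.5):
    """Return the indices of the tokens to keep, in their original order.

    novelty   -> per-token novelty scores (assumed precomputed)
    high_freq -> per-token high-frequency prior (assumed precomputed)
    The weighted sum used to combine them is an illustrative assumption.
    """
    scores = [weight * n + (1.0 - weight) * h for n, h in zip(novelty, high_freq)]
    keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


novelty = [0.1, 0.1, 0.6, 0.1, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1]
high_freq = [0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.0]
kept_10_percent = retain_tokens(novelty, high_freq, keep_ratio=0.1)
kept_20_percent = retain_tokens(novelty, high_freq, keep_ratio=0.2)
```

Ranking by forensic evidence rather than semantic saliency is what lets background tokens survive when they carry manipulation traces.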

Keywords: Multimodal Large Language Models, Forensic Vision-Language Models, Token Compression, Forgery Detection, Computational Acceleration, Birth-Death Optimal Transport, Deepfake Detection, AIGC Benchmark

157. ❌ SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics

Authors: Mengzhen Liu, Enshen Zhou, Cheng Chi, Yi Han, Shanyu Rong, Liming Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12193v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

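Scoring rationale: The paper focuses on vision-language-action (VLA) models for robotics and studies active perception and manipulation; its core contributions are a robot control framework, a dataset, and a benchmark. The scoring keywords all concern large language model (LLM) techniques, training methods, inference optimization, alignment, agent systems, model compression, and scientific AI applications. This paper does not address LLMs or related techniques and is not applied to scientific fields such as bioinformatics, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the SaPaVe framework, which addresses the difficulty of unifying semantic-driven perception with robust execution in robotic active perception and manipulation. By decoupling camera and manipulation actions, using a bottom-up training strategy, and introducing a new dataset, it achieves higher success rates than existing VLA models in simulation and real-world environments.

Abstract (Chinese translation)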

主动感知与操作是机器人在复杂场景中进行交互的关键能力。现有方法难以将语义驱动的主动感知与鲁棒、视角无关的执行能力相统一。我们提出SaPaVe，一种端到端框架，能够以数据高效的方式联合学习这些能力。我们的方法将相机控制与操作动作解耦，而非将其置于共享动作空间中，并采用自底向上的训练策略：首先在大规模数据集上训练语义相机控制，随后利用混合数据联合优化两种动作类型。为支持该框架，我们引入了ActiveViewPose-200K数据集（包含20万个图像-语言-相机运动配对数据，用于语义相机运动学习）以及一个3D几何感知模块，该模块提升了动态视角下执行的鲁棒性。我们还提出了ActiveManip-Bench，这是首个用于评估超越固定视角设置的主动操作任务的基准测试。在仿真与真实环境中的大量实验表明，SaPaVe在性能上超越了近期如GR00T N1和$\pi_0$等视觉-语言-动作模型，在真实世界任务中成功率最高提升31.25%。这些结果表明，通过解耦但协调的策略进行训练，紧密耦合的感知与执行能够实现高效且可泛化的主动操作。项目页面：https://lmzpai.github.io/SaPaVe

Abstract

Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and $\pi_0$, achieving up to 31.25% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe

Keywords: active perception, active manipulation, vision-language-action models, robotics, semantic camera control, end-to-end framework, data-efficient learning, viewpoint-invariant execution

158. ❌ LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning

Authors: Haiying Xu, Zihan Wang, Song Dai, Zhengxuan Zhang, Kairan Dou, Xuming Hu Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12166v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

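Scoring rationale: The paper focuses on representing auxiliary constructions in geometric reasoning with multimodal large language models (MLLMs); it centers on large-model techniques (keyword 1 scores 10) and reinforcement learning optimization (keyword 8 scores 8). It proposes the LatentGeo framework, which internalizes geometric constructions by learning continuous latent visual representations, involving multi-step reasoning (keyword 13 scores 8) and in-depth reasoning (keyword 14 scores 8). Other keywords, such as MoE, SLMs, scaling laws, pre-training, fine-tuning, RAG, context extension, attention optimization, quantization, and hallucination mitigation, are neither mentioned in the abstract nor related to it, so they score 0.

!!! tip deepseek-chat TL;DR

To address the difficulty multimodal large language models have in representing auxiliary constructions for geometric reasoning, the paper proposes the LatentGeo framework, which uses continuous latent visual representation learning and reinforcement learning optimization to substantially improve performance on geometric reasoning tasks that require auxiliary constructions.

Abstract (Chinese translation)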

尽管多模态推理领域近期取得了进展，辅助几何构造的表示仍然是多模态大语言模型面临的一项根本性挑战。这些构造在原图中并不存在，必须在定理应用之前被引入。现有方法主要依赖于显式构造范式，包括基于文本的几何描述、推理过程中的视觉-标记交错以及工具增强的几何执行。然而，这些方法要么无法忠实表示复杂的空间关系，要么在离散符号与连续几何结构之间产生表示失配，或者依赖于外部能力，从而阻碍了端到端的优化。为解决这些局限性，我们提出了LatentGeo框架，该框架学习连续的潜在视觉表示，以将辅助几何构造内化，而无需像素级渲染或外部执行器。我们设计了一个三阶段课程，通过辅助视觉监督逐步对齐并内化这些潜在表示，随后引入LaGDPO——一种潜在感知的强化学习过程，该过程在策略优化期间稳定潜在表示，同时提升终端任务的正确性。为系统评估以构造为中心的表示质量，我们引入了GeoAux，这是一个针对视觉依赖性几何问题的新基准，并在GeoAux和MathVerse上进行了实验。结果表明，LatentGeo在几何推理任务上取得了显著提升，尤其是在需要辅助构造的任务上。广泛的分析与消融研究进一步验证了我们框架中每个组件的有效性。

Abstract

Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.

Keywords: multimodal large language models, geometric reasoning, auxiliary constructions, latent visual representations, reinforcement learning, end-to-end optimization, GeoAux benchmark, MathVerse

159. ❌ EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next

Authors: Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo, Jiacheng Chen, Lutao Jiang, Xu Zheng, Xuming Hu Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12147v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

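Scoring rationale: The paper studies fine-grained intent understanding in egocentric videos by multimodal large language models (MLLMs), so it is highly relevant to 'Large Language Models' (10). It evaluates reasoning ability, which gives it some connection to 'Chain of Thought' and 'System 2 Thinking' (5 each), since intent understanding requires multi-step, deliberate reasoning. The applications it mentions, such as intelligent assistants and robotic imitation learning, relate to 'LLM Agents' (5). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and receive 0.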
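The relationship between the score table and the pass mark can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report's configuration section; the row formula (weight × relevance, capped at the maximum) is an assumption made for illustration, since the report does not state how a row score is derived. The function names are hypothetical.

```python
import math

MAX_SCORE_PER_KEYWORD = 15.0  # per-keyword maximum from the report configuration


def passing_threshold(total_weight):
    """Pass mark as stated in the report: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def row_score(weight, relevance):
    """ASSUMPTION: a row scores weight x relevance, capped at the maximum."""
    return min(weight * relevance, MAX_SCORE_PER_KEYWORD)


def total_score(rows):
    """Sum the row scores over (weight, relevance) pairs."""
    return sum(row_score(weight, relevance) for weight, relevance in rows)


# (weight, relevance) pairs for the non-zero-relevance rows of the EgoIntent table.
egointent_rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 5.0)]

threshold = passing_threshold(27.0)   # 27 keywords with total weight 27.0
score = total_score(egointent_rows)
passed = score >= threshold
```

Because every row in the table carries weight 0.0, the total stays at 0.0 regardless of the relevance values, which is consistent with the reported 0.0 / 26.6.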

!!! tip deepseek-chat TL;DR

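The paper introduces EgoIntent, a benchmark that evaluates how well multimodal large language models understand fine-grained, step-level intent in egocentric videos (what is being done, why, and what comes next). Even the best-performing of the 15 evaluated models averages only 33.31 across the three intent dimensions, which shows that the task remains highly challenging.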

Abstract (Chinese translation)

多模态大语言模型（MLLMs）已在多种任务中展现出卓越的视频推理能力。然而，其在第一人称视角视频中细粒度理解人类意图的能力仍很大程度上未被探索。现有基准主要关注片段级别的意图推理，忽视了步骤级别意图理解这一更精细的粒度。然而，智能助手、机器人模仿学习及增强现实引导等应用不仅需要理解人在每一步做什么，还需理解其动机及后续计划，以提供及时且情境感知的支持。为此，我们提出了EgoIntent，一个面向第一人称视角视频的步骤级别意图理解基准。该基准涵盖15种不同的室内外日常生活场景，包含3,014个步骤，并从三个互补维度评估模型：局部意图（做什么）、全局意图（为什么做）以及下一步计划（接下来做什么）。关键的是，每个视频片段均在所查询步骤的关键结果（例如接触或抓握）发生前即刻截断，且不包含后续步骤的任何帧，从而避免了未来帧信息泄露，并实现了对预测性步骤理解及下一步规划的清晰评估。我们评估了15个多模态大语言模型，包括最先进的闭源与开源模型。即使表现最佳的模型在三个意图维度上的平均得分也仅为33.31，这凸显了在第一人称视角视频中进行步骤级别意图理解仍是一个极具挑战性的问题，需要进一步深入研究。

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.
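The two evaluation mechanics described in the abstract, truncating each clip before the key outcome and averaging over the three intent dimensions, can be sketched as below. This is a minimal illustration only: the frame-index representation and both function names are hypothetical, and the benchmark's actual tooling is not described in the abstract.

```python
from statistics import mean


def truncate_clip(frames, outcome_index):
    """Keep only the frames strictly before the key-outcome frame.

    `outcome_index` is the (hypothetical) index of the first frame in which
    the key outcome, e.g. contact or grasp, is visible. Dropping it and
    everything after it prevents future-frame leakage.
    """
    if not 0 <= outcome_index <= len(frames):
        raise ValueError("outcome_index must lie within the clip")
    return frames[:outcome_index]


def average_intent_score(what, why, next_step):
    """Average the three intent dimensions: What, Why, and Next."""
    return mean([what, why, next_step])


clip = list(range(10))              # stand-in for ten video frames
visible = truncate_clip(clip, 6)    # frames 0..5 are shown to the model
overall = average_intent_score(30.0, 40.0, 50.0)  # made-up example scores
```

The example scores are placeholders and are not taken from the paper.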

Keywords: Multimodal Large Language Models, egocentric videos, step-level intent understanding, benchmark, video reasoning, intelligent assistants, anticipatory understanding, MLLMs

160. ❌ O3N: Omnidirectional Open-Vocabulary Occupancy Prediction

Authors: Mengfei Duan, Hao Shi, Fei Teng, Guoqiang Zhao, Yuheng Zhang, Zhiyong Li, Kailun Yang Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12144v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

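Scoring rationale: O3N is a computer vision and 3D scene understanding paper that proposes a purely visual, end-to-end framework for omnidirectional open-vocabulary occupancy prediction. Its core contributions are the Polar-spiral Mamba, Occupancy Cost Aggregation, and Natural Modality Alignment modules, which provide a continuous 360° spatial representation, unify geometric and semantic supervision, and align visual features, voxel embeddings, and text semantics. The abstract does mention 'embodied agents' and 'autonomous agents', but these refer to physical robots or embodied systems rather than LLM agents. The scoring keywords all concern large language models, deep learning techniques, or specific AI for Science applications, whereas this paper studies 3D visual perception and reconstruction, so no keyword is directly relevant.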

!!! tip deepseek-chat TL;DR

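The paper presents O3N, described as the first purely visual, end-to-end framework for omnidirectional open-vocabulary occupancy prediction. Its new modules reach state-of-the-art performance on the QuadOcc and Human360Occ benchmarks and show strong cross-scene generalization and semantic scalability.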

Abstract (Chinese translation)

通过全向感知理解与重建三维世界是自主智能体与具身智能发展的必然趋势。然而，现有的三维占据预测方法受限于有限视角输入与预定义训练分布，难以应用于需要在开放世界探索中进行全面、安全场景感知的具身智能体。为此，我们提出了O3N，首个纯视觉、端到端的全向开放词汇占据预测框架。O3N通过极坐标螺旋状态空间模型模块将全向体素嵌入极坐标螺旋拓扑结构，实现了连续空间表征与跨360°的长程上下文建模。占据代价聚合模块引入了一种原理性机制，在体素空间内统一几何与语义监督，确保重建几何与底层语义结构的一致性。此外，自然模态对齐模块建立了一条无梯度对齐路径，协调视觉特征、体素嵌入与文本语义，形成一致的“像素-体素-文本”表征三元组。在多个模型上的大量实验表明，我们的方法不仅在QuadOcc和Human360Occ基准测试中取得了最先进的性能，还展现出卓越的跨场景泛化能力与语义可扩展性，为通用三维世界建模开辟了道路。源代码将在https://github.com/MengfeiD/O3N公开。

Abstract

Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent “pixel-voxel-text” representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.

Keywords: Omnidirectional Occupancy Prediction, Open-vocabulary, 3D World Modeling, Polar-spiral Mamba, Occupancy Cost Aggregation, Natural Modality Alignment, Autonomous Agents, Embodied Intelligence

161. ❌ HATS: Hardness-Aware Trajectory Synthesis for GUI Agents

Authors: Rui Shao, Ruize Gao, Bin Xie, Yixing Li, Kaiwen Zhou, Shuai Wang, Weili Guan, Gongwei Chen Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12138v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

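Scoring rationale: The paper studies GUI agents powered by large vision-language models (VLMs), an application of large models to a specific domain (GUI automation), so it has some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5). Its core concern is the quality of agent training data (hardness-aware trajectory synthesis), which gives it some relevance to 'Scaling Laws AND Data Quality' (5). It involves agent training, so it relates somewhat to 'Post-training OR Supervised Fine-tuning OR SFT' (5). It focuses on instruction-execution alignment and is therefore closely related to 'Instruction Tuning OR Alignment OR Value Alignment' (8). It studies GUI agents directly and is highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10). Other keywords such as MoE, SLMs, RLHF, and RAG are unrelated to the paper and receive 0.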

!!! tip deepseek-chat TL;DR

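The paper targets the poor generalization that semantically ambiguous actions cause when training GUI agents. It proposes the Hardness-Aware Trajectory Synthesis (HATS) framework, a closed loop of hardness-driven exploration and alignment-guided refinement. Agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.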

Abstract (Chinese translation)

基于大规模视觉语言模型（VLM）的图形用户界面（GUI）智能体在自动化数字任务方面展现出显著潜力，这凸显了需要高质量轨迹数据以支持有效智能体训练的重要性。然而，现有的轨迹合成流程往往只能产生泛化能力有限、难以超越简单交互的智能体。我们认为这一局限源于对语义模糊动作的忽视，这类动作的含义依赖于上下文、操作序列或存在视觉模糊性。此类动作对于现实场景的鲁棒性至关重要，但在当前数据集中代表性不足且处理不当，导致任务指令与执行之间存在语义错位。为解决这些问题，我们提出了HATS（Hardness-Aware Trajectory Synthesis，难度感知轨迹合成框架），旨在减轻语义模糊性的影响。我们将难度定义为动作相关的语义模糊程度，并开发了两个互补模块：（1）难度驱动探索，引导数据收集面向模糊但信息丰富的交互；（2）对齐引导优化，迭代验证并修复指令与执行的对齐关系。两个模块形成闭环运行：探索为优化提供具有挑战性的轨迹，而优化的反馈则更新难度信号以指导后续探索。大量实验表明，使用HATS训练的智能体在多个基准GUI环境中持续优于现有最先进的基线方法。

Abstract

Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantically ambiguous actions, whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) hardness-driven exploration, which guides data collection toward ambiguous yet informative interactions, and (2) alignment-guided refinement, which iteratively validates and repairs instruction-execution alignment. The two modules operate in a closed loop: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.

Keywords: GUI agents, large vision-language models, trajectory synthesis, semantic ambiguity, hardness-aware, instruction-execution alignment, agent training, data quality

162. ❌ Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D

Authors: Agniv Sharma, Xianghui Xie, Tom Fischer, Eddy Ilg, Gerard Pons-Moll Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12126v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

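Scoring rationale: Hoi3DGen addresses the generation of 3D human-object interactions. It is relevant only to 'Large Language Models OR LLMs OR Foundation Models' (8), because the abstract mentions 'leveraging multimodal large language models' to curate high-quality interaction data. The remaining keywords cover large-model internals, training methods, inference optimization, agent systems, and similar topics that the paper does not address, so they receive 0.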

!!! tip deepseek-chat TL;DR

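The paper proposes Hoi3DGen, a framework that uses multimodal large language models to curate high-quality interaction data. It addresses the Janus problem and the unfaithful prompt following of existing text-to-3D human-object interaction methods, and reports gains over baselines of 4-15x in text consistency and 3-7x in 3D model quality.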

Abstract (Chinese translation)

基于文本对三维人-物交互进行建模与生成，对于增强现实（AR）、扩展现实（XR）及游戏等应用至关重要。现有方法通常依赖于从文本到图像模型的分数蒸馏，但由于高质量交互数据的稀缺，其生成结果常出现“双面神”问题，且难以忠实遵循文本提示。我们提出了Hoi3DGen框架，该框架能够生成精确遵循输入交互描述的高质量带纹理三维交互网格。我们首先利用多模态大语言模型构建了真实且高质量的交互数据集，进而建立了一套完整的文本到三维生成流程，在交互保真度上实现了数量级的提升。我们的方法在文本一致性上超越基线方法4至15倍，在三维模型质量上超越3至7倍，能够泛化至多样化的物体类别与交互类型，同时保持高质量的三维生成效果。

Abstract

Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and do not follow text prompts faithfully due to the scarcity of high-quality interaction data. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interaction that follow the input interaction descriptions precisely. We first curate realistic and high-quality interaction data leveraging multimodal large language models, and then create a full text-to-3D pipeline, which achieves orders-of-magnitude improvements in interaction fidelity. Our method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, exhibiting strong generalization to diverse categories and interaction types, while maintaining high-quality 3D generation.

Keywords: 3D human-object interactions, text-to-3D generation, multimodal large language models, interaction fidelity, text consistency, 3D model quality, high-quality textured meshes, generalization

163. ❌ EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation

Authors: Yan Li, Ning Liao, Xiangyu Zhao, Shaofeng Zhang, Xiaoxing Wang, Yifan Yang, Junchi Yan, Xue Yang Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12108v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

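Scoring rationale: The paper studies unified representations for visual understanding and generation in multimodal large language models (MLLMs) and proposes the EvoTok image tokenizer. (1) It explicitly mentions 'multimodal large language models (MLLMs)', so it is highly relevant to 'Large Language Models' (8). (2) Unifying visual understanding and generation is an application of large models, but the paper does not specifically address the other keywords such as MoE, SLMs, Scaling Laws, training methods, reasoning techniques, agent systems, or compression. (3) It does not cover AI applications in science (for example bioinformatics). All other keywords are neither mentioned nor relevant and receive 0.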

!!! tip deepseek-chat TL;DR

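The paper addresses the granularity gap between visual understanding and generation in multimodal large language models. It proposes EvoTok, a unified image tokenizer based on residual latent evolution within a shared latent space, which reaches 0.43 rFID reconstruction quality on ImageNet-1K at 256x256 and shows promising results on visual understanding and image generation benchmarks.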

Abstract (Chinese translation)

统一多模态大语言模型（MLLMs）的发展面临一个根本性挑战：视觉理解与生成之间存在粒度鸿沟——理解需要高层语义抽象，而图像生成则要求细粒度的像素级表征。现有方法通常在同一组表征上施加两种监督，或将这两种监督解耦到不同的特征空间，分别导致干扰与不一致性。本研究中，我们提出EvoTok，一种通过在共享潜在空间内进行残差演化来调和这些需求的统一图像分词器。EvoTok并非为像素和语义维护独立的分词空间，而是通过残差向量量化将图像编码为级联的残差分词序列。这一残差序列形成了一条演化轨迹：早期阶段捕捉低级细节，更深阶段则逐步过渡到高层语义表征。尽管仅在1300万张图像的相对较小数据集上训练（远小于以往许多统一分词器使用的十亿级数据集），EvoTok在ImageNet-1K数据集256×256分辨率下仍实现了0.43 rFID的强劲重建质量。当与大语言模型结合时，EvoTok在9项视觉理解基准测试中的7项上展现出有竞争力的性能，并在图像生成基准（如GenEval和GenAI-Bench）上取得了显著成果。这些结果表明，将视觉表征建模为演化轨迹为统一视觉理解与生成提供了一种有效且原理清晰的解决方案。

Abstract

The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually enforce the two supervision on the same set of representation or decouple these two supervision on separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory where earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. When integrated with a large language model, EvoTok shows promising performance across 7 out of 9 visual understanding benchmarks, and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench. These results demonstrate that modeling visual representations as an evolving trajectory provides an effective and principled solution for unifying visual understanding and generation.
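The residual vector quantization step named in the abstract can be illustrated with a generic sketch: each stage picks the nearest codeword for the current residual, then passes what is left to the next stage. This shows plain RVQ only and is not EvoTok's implementation; the tiny hand-written codebooks, the 2-D vectors, and the function names are all assumptions, and nothing here models the pixel-to-semantic progression the paper describes.

```python
def nearest(codebook, vec):
    """Return (index, codeword) of the codeword closest to vec (squared L2)."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return best, codebook[best]


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; return token indices and the final residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx, code = nearest(codebook, residual)
        indices.append(idx)
        # The next stage only sees what this stage failed to explain.
        residual = [r - c for r, c in zip(residual, code)]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage codebooks: a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],
]
vec = [1.2, 0.9]
indices, residual = rvq_encode(vec, codebooks)
approx = rvq_decode(indices, codebooks)
```

The cascaded indices play the role of the residual token sequence; in a trained tokenizer the codebooks would be learned rather than written by hand.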

Keywords: unified multimodal large language models, image tokenizer, residual latent evolution, visual understanding, image generation, residual vector quantization, EvoTok, MLLMs

164. ❌ Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis

Authors: Xiaolong Qian, Qi Jiang, Yao Gao, Lei Sun, Zhonghua Yi, Kailun Yang, Luc Van Gool, Kaiwei Wang Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12083v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
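Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0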

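Scoring rationale: The paper addresses computational aberration correction (CAC), a computer vision task, and studies image restoration algorithms by building a large-scale benchmark dataset and an evaluation framework. It is unrelated to nearly all of the large-model and deep-learning keywords (LLMs, MoE, RLHF, RAG, and so on), which concern natural language processing, model architecture, and training methods, whereas this paper deals with correcting optical aberrations. The only relevant keyword is 'AI for Science', because the paper applies AI to optical imaging, but the relevance is weak (5): the work is closer to a computer vision application than to core scientific discovery such as bioinformatics or cheminformatics.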

!!! tip deepseek-chat TL;DR

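To address the poor generalization of computational aberration correction (CAC) methods, the paper introduces UniCAC, a large-scale benchmark, and the Optical Degradation Evaluator (ODE), an evaluation framework. Its experiments across 24 image restoration and CAC algorithms identify prior utilization, network architecture, and training strategy as the three factors that most influence CAC performance.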

Abstract (Chinese translation)

当前主流的计算像差校正方法通常针对特定光学系统定制，导致其泛化能力较差，且在新镜头应用中需耗费大量精力重新训练。开发能够跨不同摄影镜头泛化的CAC范式，为应对这些挑战提供了前景广阔的解决方案。然而，由于缺乏一个涵盖足够广泛光学像差的综合性基准，在消费级摄影领域实现此类跨镜头通用性的努力仍处于早期阶段。此外，现有CAC方法具体受哪些因素影响，以及这些因素如何影响其性能，目前尚不明确。本文通过我们新提出的UniCAC——一个基于自动光学设计构建的大规模摄影相机基准——对24种图像恢复与CAC算法进行了全面的实验与评估。我们引入了光学退化评估器作为一种新颖框架，以客观评估CAC任务的难度，提供可信的光学像差量化，并实现可靠的性能评价。基于对比分析，我们识别出三个对CAC性能影响最为显著的关键因素——先验利用、网络架构和训练策略，并进一步探究了它们各自的影响。我们相信，本研究的基准、数据集及观察发现为相关领域贡献了基础性见解，并为未来研究奠定了基础。基准数据、代码及Zemax文件将在https://github.com/XiaolongQian/UniCAC 公开。

Abstract

Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses. Developing CAC paradigms capable of generalizing across diverse photographic lenses offers a promising solution to these challenges. However, efforts to achieve such cross-lens universality within consumer photography are still in their early stages due to the lack of a comprehensive benchmark that encompasses a sufficiently wide range of optical aberrations. Furthermore, it remains unclear which specific factors influence existing CAC methods and how these factors affect their performance. In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed UniCAC, a large-scale benchmark for photographic cameras constructed via automatic optical design. The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation. Drawing on our comparative analysis, we identify three key factors – prior utilization, network architecture, and training strategy – that most significantly influence CAC performance, and further investigate their respective effects. We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations. Benchmarks, codes, and Zemax files will be available at https://github.com/XiaolongQian/UniCAC.

Keywords: Computational Aberration Correction, Benchmark Analysis, Optical Degradation Evaluator, Image Restoration, Universal Generalization, Photographic Cameras, Optical Aberrations, Deep Learning

165. ❌ Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs

Authors: Hiran Sarkar, Liming Kuang, Yordanka Velikova, Benjamin Busam Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12078v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

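Scoring rationale: Node-RF belongs to computer vision and neural rendering; it combines Neural ODEs with dynamic NeRFs to predict continuous space-time scene dynamics. The scoring keywords all concern large language models, deep learning techniques, or AI for Science applications, while this paper studies visual scene-dynamics prediction, a specific subfield of computer vision. It has no direct connection to the large-model techniques, training methods, inference optimization, AI agents, or scientific AI applications in the keyword list, and it involves no language models, large-model training, AI agents, or bioinformatics or cheminformatics content.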

!!! tip deepseek-chat TL;DR

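The paper proposes Node-RF, which integrates Neural ODEs with dynamic NeRFs to learn continuous space-time scene dynamics, enabling generalization to unseen trajectories and long-range extrapolation.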

Abstract (Chinese translation)

从视觉观测中预测场景动态具有挑战性。现有方法仅能捕捉观测边界内的动态，无法在训练序列范围之外进行有效外推。Node-RF（基于神经常微分方程的神经辐射场）通过将神经常微分方程（Neural Ordinary Differential Equations, NODEs）与动态神经辐射场（Neural Radiance Fields, NeRFs）相结合，克服了这一局限，实现了连续时空表征，能够在恒定内存成本下泛化至观测轨迹之外。该系统从视觉输入中学习隐式场景状态，该状态通过ODE求解器随时间演化，并借助微分运算传播特征嵌入。基于NeRF的渲染器对计算得到的嵌入进行解译，以合成任意视角，实现长时程外推。通过对具有共享动态特性的多组运动序列进行训练，模型能够泛化至未观测条件。实验表明，Node-RF无需显式模型即可表征抽象系统行为，并能识别对未来预测至关重要的关键点。

Abstract

Predicting scene dynamics from visual observations is challenging. Existing methods capture dynamics only within observed boundaries failing to extrapolate far beyond the training sequence. Node-RF (Neural ODE-based NeRF) overcomes this limitation by integrating Neural Ordinary Differential Equations (NODEs) with dynamic Neural Radiance Fields (NeRFs), enabling a continuous-time, spatiotemporal representation that generalizes beyond observed trajectories at constant memory cost. From visual input, Node-RF learns an implicit scene state that evolves over time via an ODE solver, propagating feature embeddings via differential calculus. A NeRF-based renderer interprets calculated embeddings to synthesize arbitrary views for long-range extrapolation. Training on multiple motion sequences with shared dynamics allows for generalization to unseen conditions. Our experiments demonstrate that Node-RF can characterize abstract system behavior without explicit model to identify critical points for future predictions.
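The idea of a latent state that "evolves over time via an ODE solver" can be sketched with a fixed-step Runge-Kutta integrator. This is an illustrative toy rather than the paper's model: the hand-written harmonic-oscillator dynamics stand in for a learned network, the choice of RK4 is an assumption (the abstract does not name a solver), and all function names are hypothetical.

```python
import math


def rk4_step(f, t, z, h):
    """Advance state z by one classical fourth-order Runge-Kutta step of size h."""
    k1 = f(t, z)
    k2 = f(t + h / 2, [zi + h / 2 * ki for zi, ki in zip(z, k1)])
    k3 = f(t + h / 2, [zi + h / 2 * ki for zi, ki in zip(z, k2)])
    k4 = f(t + h, [zi + h * ki for zi, ki in zip(z, k3)])
    return [
        zi + h / 6 * (a + 2 * b + 2 * c + d)
        for zi, a, b, c, d in zip(z, k1, k2, k3, k4)
    ]


def evolve(f, z0, t0, t1, steps):
    """Integrate dz/dt = f(t, z) from t0 to t1, keeping only the current state."""
    h = (t1 - t0) / steps
    t, z = t0, list(z0)
    for _ in range(steps):
        z = rk4_step(f, t, z, h)
        t += h
    return z


def toy_dynamics(t, z):
    """Stand-in for a learned dynamics network: a harmonic oscillator."""
    return [z[1], -z[0]]


# Query the state at t = pi; the exact solution from [1, 0] is [-1, 0].
z_end = evolve(toy_dynamics, [1.0, 0.0], 0.0, math.pi, 200)
```

Only the current state is stored during integration, so memory does not grow with the time horizon, and the state can be queried at any continuous time, including times beyond an observed sequence.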

Keywords: Neural ODE, Neural Radiance Fields, scene dynamics, continuous-time representation, spatiotemporal modeling, long-range extrapolation, visual prediction, implicit scene state

166. ❌ Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos

Authors: Shuo Sun, Unal Artan, Malcolm Mielle, Achim J. Lilienthal, Martin Magnusson Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12064v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

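Scoring rationale: The paper focuses on dense dynamic scene reconstruction and camera pose estimation from multi-view videos, a computer vision problem, using visual SLAM, an optimization framework, and optical flow. All scoring keywords concern large language models, the principles of deep learning techniques, or AI for science applications; the paper touches none of these topics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a two-stage optimization framework for dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras, and significantly outperforms state-of-the-art feed-forward models on synthetic and real-world benchmarks.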

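Abstract (translation)

We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras, a setting that arises naturally when several observers record the same event. Existing methods either handle only single-camera input or depend on rigidly mounted, pre-calibrated camera rigs, which limits their practical use. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by building a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, yielding scale-consistent, robust tracking. To stay robust when overlap is limited, we introduce a wide-baseline initialization strategy based on feed-forward reconstruction models. In the second stage, we further refine depth and camera poses by optimizing dense inter- and intra-camera consistency based on wide-baseline optical flow. We also present MultiCamRobolab, a new real-world dataset with ground-truth poses provided by a motion capture system. Finally, on synthetic and real-world benchmarks we show that the method significantly outperforms state-of-the-art feed-forward models while using less memory.

Abstract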

We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras – a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.

Keywords: dense dynamic scene reconstruction, camera pose estimation, multi-view videos, visual SLAM, optimization framework, wide-baseline optical flow, MultiCamRobolab dataset
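
Illustrative sketch (not from the paper): the abstract mentions a spatiotemporal connection graph that links frames through intra-camera temporal continuity and inter-camera spatial overlap. One way to picture such a graph is shown below. The `overlap` scores, the `min_overlap` threshold, and the choice to connect cameras only at the same frame index are assumptions made for this example, not details taken from the paper.

```python
from itertools import combinations

def build_graph(num_cams, num_frames, overlap, min_overlap=0.3):
    """Nodes are (camera, frame) pairs; returns a set of undirected edges."""
    edges = set()
    # Intra-camera edges: consecutive frames of the same camera.
    for c in range(num_cams):
        for t in range(num_frames - 1):
            edges.add(((c, t), (c, t + 1)))
    # Inter-camera edges: two cameras whose views overlap enough at frame t.
    for t in range(num_frames):
        for a, b in combinations(range(num_cams), 2):
            if overlap[t][a][b] >= min_overlap:
                edges.add(((a, t), (b, t)))
    return edges

# Hypothetical overlap scores for 2 cameras over 3 frames (overlap[t][a][b]).
overlap = [
    [[1.0, 0.5], [0.5, 1.0]],
    [[1.0, 0.1], [0.1, 1.0]],
    [[1.0, 0.4], [0.4, 1.0]],
]
edges = build_graph(num_cams=2, num_frames=3, overlap=overlap)
```

Here the two cameras are linked at frames 0 and 2 but not at frame 1, where the assumed overlap falls below the threshold; the temporal edges still keep the whole graph connected.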

167. ❌ NBAvatar: Neural Billboards Avatars with Realistic Hand-Face Interaction

Authors: David Svitov, Mahtab Dahaghin Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12063v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

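Scoring rationale: NBAvatar is a computer vision and graphics paper on neural rendering of head avatars, in particular on handling the non-rigid deformations caused by hand-face interaction. It covers neural rendering, the combination of explicit and implicit representations, and geometric consistency, but does not involve large language models, innovations in the principles of deep learning techniques, or applications of AI in science. All keywords relate to large models, deep learning principles, or AI for Science, whereas this is purely computer vision/graphics research, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes NBAvatar, a neural rendering method that handles non-rigid deformations caused by hand-face interaction. By combining explicit planar primitives with implicit neural rendering, it renders high-quality head avatars and surpasses existing methods on several metrics.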

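Abstract (translation)

We present NBAvatar, a method for realistic rendering of head avatars that handles the non-rigid deformations caused by hand-face interaction. We introduce a novel representation for animated avatars by combining the training of oriented planar primitives with neural rendering. This combination of explicit and implicit representations lets NBAvatar handle temporally and pose-consistent geometry while retaining the fine-grained appearance details that neural rendering provides. In our experiments, we show that NBAvatar implicitly learns the color transformations caused by hand-face interaction and surpasses existing methods in novel-view and novel-pose rendering quality. Specifically, compared with Gaussian-based avatar methods, NBAvatar achieves up to a 30% reduction in LPIPS (Learned Perceptual Image Patch Similarity) under high-resolution megapixel rendering while also improving PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index); compared with InteractAvatar, the state-of-the-art hand-face interaction method, it achieves higher structural similarity.

Abstract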

We present NBAvatar - a method for realistic rendering of head avatars handling non-rigid deformations caused by hand-face interaction. We introduce a novel representation for animated avatars by combining the training of oriented planar primitives with neural rendering. Such a combination of explicit and implicit representations enables NBAvatar to handle temporally and pose-consistent geometry, along with fine-grained appearance details provided by the neural rendering technique. In our experiments, we demonstrate that NBAvatar implicitly learns color transformations caused by face-hand interactions and surpasses existing approaches in terms of novel-view and novel-pose rendering quality. Specifically, NBAvatar achieves up to 30% LPIPS reduction under high-resolution megapixel rendering compared to Gaussian-based avatar methods, while also improving PSNR and SSIM, and achieves higher structural similarity compared to the state-of-the-art hand-face interaction method InteractAvatar.

Keywords: neural rendering, head avatars, hand-face interaction, non-rigid deformations, oriented planar primitives, novel-view rendering, novel-pose rendering, Gaussian-based avatar methods
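
For reference, PSNR, one of the metrics the abstract reports, is conventionally defined as 10·log10(MAX² / MSE). A minimal sketch, assuming images are arrays scaled to [0, 1]; `psnr` is a helper written for this note and is not code from the paper.

```python
import numpy as np

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=float)
    rendered = np.asarray(rendered, dtype=float)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. about 20 dB.
value = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))
```

Higher PSNR means a smaller pixel-wise error, whereas LPIPS is a learned perceptual distance for which lower is better, which is why the abstract reports an LPIPS reduction alongside a PSNR improvement.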

168. ❌ Continual Learning with Vision-Language Models via Semantic-Geometry Preservation

Authors: Chiyuan He, Zihuan Qiu, Fanman Meng, Runtong Zhang, Linfeng Xu, Qingbo Wu, Hongliang Li Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12055v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

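Scoring rationale: The paper studies continual learning for vision-language models (VLMs). Its core goal is to counter catastrophic forgetting with a semantic-geometry preservation method (SeGP-CL) that protects the cross-modal semantic geometry from pretraining and earlier task stages. Keyword relevance: (1) it is highly relevant only to "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points), because it explicitly addresses continual learning of pretrained vision-language models, which falls under domain adaptation; (2) the other keywords target pure language models (LLMs), specific techniques (such as MoE, RLHF, and RAG), or AI for science applications, whereas this paper concentrates on continual learning for VLMs and does not cover those aspects, so they are scored 0.

!!! tip deepseek-chat TL;DR

To address the catastrophic forgetting that pretrained vision-language models suffer in continual learning, the paper proposes a semantic-geometry preservation method (SeGP-CL) that protects semantic structure through adversarial anchor construction and cross-modal geometry distillation. It achieves state-of-the-art performance on five continual learning benchmarks while better preserving the model's semantic geometry.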

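Abstract (translation)

Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet existing methods adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and earlier stages, so supervision from new tasks induces geometric distortion. We observe that the most pronounced semantic drift tends to concentrate in vulnerable neighborhoods at the boundary between old and new semantics, where shared visual patterns are easily reinterpreted by the textual semantics of the new task. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). The method first probes drift-prone regions by constructing a compact set of adversarial anchors, generated with dual-targeted projected gradient descent (DPGD), which drives selected new-task samples toward old-class semantics while staying faithful in the raw visual space. During training, we preserve cross-modal structure through anchor-guided cross-modal geometry distillation (ACGD) and stabilize the textual reference frame across tasks with a lightweight text semantic-geometry regularization (TSGR). After training, we transfer old visual prototypes by estimating the anchor-induced drift in raw space, and perform dual-path inference that fuses cross-modal and visual cues. Extensive experiments on five continual learning benchmarks show that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving the semantic geometry of VLMs.

Abstract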

Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, allowing new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure by anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues. Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving semantic geometry of VLMs.

Keywords: Continual Learning, Vision-Language Models, Semantic Geometry Preservation, Catastrophic Forgetting, Cross-modal Distillation, Adversarial Anchors, Exemplar-free Learning, State-of-the-art Performance
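
Illustrative sketch (not the paper's DPGD): the abstract builds adversarial anchors with a dual-targeted variant of projected gradient descent. The snippet below shows only plain, single-target PGD, which pushes an input toward a target embedding while projecting back into a small L-infinity ball around the original input. The identity `encoder`, the squared-error objective, and all parameter values are assumptions chosen so the behaviour is easy to follow.

```python
import numpy as np

def pgd_toward_target(x0, target, encoder, eps=0.1, step=0.02, iters=50):
    """Move x toward `target` in embedding space while staying within eps of x0."""
    x = x0.copy()
    for _ in range(iters):
        residual = encoder @ x - target
        grad = 2.0 * encoder.T @ residual       # gradient of ||encoder @ x - target||^2
        x = x - step * np.sign(grad)            # signed gradient step
        x = np.clip(x, x0 - eps, x0 + eps)      # projection onto the L-infinity ball
    return x

encoder = np.eye(3)      # hypothetical stand-in for a frozen image encoder
x0 = np.zeros(3)         # hypothetical new-task seed
target = np.ones(3)      # hypothetical old-class embedding
anchor = pgd_toward_target(x0, target, encoder)

loss_before = float(np.sum((encoder @ x0 - target) ** 2))
loss_after = float(np.sum((encoder @ anchor - target) ** 2))
```

The projection step is what keeps the anchor close to the seed in raw input space, the property the abstract calls remaining "faithful in raw visual space"; the second target used by DPGD is not modelled here.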

169. ❌ Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

Authors: Umberto Cappellazzo, Stavros Petridis, Maja Pantic Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12046v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

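Scoring rationale: The paper analyzes modality contributions in audio-visual speech recognition (AVSR), using Shapley values for interpretability. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, and so on). It is related only to "Mechanistic Interpretability OR Explainable AI" (8 points), because Shapley values are an interpretability method, and weakly to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because it applies AI to a specific domain (speech processing) that is not core bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The paper proposes Dr. SHAP-AV, a framework that uses Shapley values to analyze how audio-visual speech recognition models balance modality contributions. It finds that models shift toward visual reliance under noise yet retain an audio preference, and that the signal-to-noise ratio is the key factor affecting modality weighting.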

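Abstract (translation)

Audio-visual speech recognition (AVSR) uses acoustic and visual information to achieve robust recognition in noisy conditions. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework that uses Shapley values to analyze modality contributions in AVSR. Through experiments with six models on two benchmark datasets and at varying signal-to-noise ratio (SNR) levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our results show that models shift toward relying on visual information under noise, yet keep a high audio contribution even under severe acoustic degradation. Modality balance evolves dynamically during generation, temporal alignment still holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and the use of Shapley-based attribution as a standard AVSR diagnostic.

Abstract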

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.

Keywords: Audio-Visual Speech Recognition, AVSR, Shapley values, modality contributions, modality balance, SNR, audio bias, model interpretability
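
Illustrative sketch (not from the paper): with only two players, audio and video, Shapley values can be computed exactly by averaging each modality's marginal contribution over the two possible orderings. The value function below is a hypothetical recognition score for each subset of modalities; the numbers are invented for the example and are not results from the paper.

```python
def shapley_two_modalities(v_none, v_audio, v_video, v_both):
    """Exact Shapley values for a two-player game (audio, video)."""
    # Average the marginal gain of adding a modality first and adding it second.
    phi_audio = 0.5 * ((v_audio - v_none) + (v_both - v_video))
    phi_video = 0.5 * ((v_video - v_none) + (v_both - v_audio))
    return phi_audio, phi_video

# Hypothetical scores: no input, audio only, video only, both modalities.
phi_audio, phi_video = shapley_two_modalities(0.0, 0.6, 0.2, 0.9)
audio_share = phi_audio / (phi_audio + phi_video)
```

The two attributions always sum to `v_both - v_none` (the efficiency property), so `audio_share` can be read as the fraction of the model's gain credited to audio. Per-token or per-frame analyses would need many more players and typically rely on sampling-based approximations.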

170. ❌ Single Pixel Image Classification using an Ultrafast Digital Light Projector

Authors: Aisha Kanwal, Graeme E. Johnstone, Fahimeh Dehkhoda, Johannes H. Herrnsdorf, Robert K. Henderson, Martin D. Dawson, Xavier Porte, Michael J. Strain Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12036v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

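Scoring rationale: The paper combines single pixel imaging (SPI) with low-complexity machine learning models (an ELM and a DNN) for ultrafast image classification. At its core it is a hardware-accelerated computer vision system; it involves no large language models (LLMs), no innovations in the principles of deep learning techniques, and no AI for science applications, so it is unrelated to all scoring keywords.

!!! tip deepseek-chat TL;DR

By combining single pixel imaging with low-complexity machine learning models, the paper achieves image classification at multi-kHz frame rates without image reconstruction, and validates classification performance on the MNIST dataset.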

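Abstract (translation)

Pattern recognition and image classification are core tasks in machine vision. Autonomous driving is one example: a vehicle must collect the complex information in a changing environment and classify it in real time. Here we experimentally demonstrate image classification at multi-kHz frame rates by combining single pixel imaging (SPI) with low-complexity machine learning models. Using a microLED-on-CMOS digital light projector for SPI enables ultrafast pattern generation for sub-millisecond image encoding. We evaluate the classification accuracy of the experimental system on the widely accepted MNIST handwritten-digit classification benchmark. We compare the classification performance of two machine learning models: an extreme learning machine (ELM) and a deep neural network trained with backpropagation. Both models are kept at low complexity so that the overhead they add to inference time is comparable to the image generation time. Crucially, our single pixel image classification approach is based on a spatiotemporal transformation of the information and needs no image reconstruction at all. By exploring the performance of the SPI-based ELM as a binary classifier, we demonstrate its potential for efficient anomaly detection in ultrafast imaging scenarios.

Abstract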

Pattern recognition and image classification are essential tasks in machine vision. Autonomous vehicles, for example, require being able to collect the complex information contained in a changing environment and classify it in real time. Here, we experimentally demonstrate image classification at multi-kHz frame rates combining the technique of single pixel imaging (SPI) with a low complexity machine learning model. The use of a microLED-on-CMOS digital light projector for SPI enables ultrafast pattern generation for sub-ms image encoding. We investigate the classification accuracy of our experimental system against the broadly accepted benchmarking task of the MNIST digits classification. We compare the classification performance of two machine learning models: An extreme learning machine (ELM) and a backpropagation trained deep neural network. The complexity of both models is kept low so the overhead added to the inference time is comparable to the image generation time. Crucially, our single pixel image classification approach is based on a spatiotemporal transformation of the information, entirely bypassing the need for image reconstruction. By exploring the performance of our SPI based ELM as binary classifier we demonstrate its potential for efficient anomaly detection in ultrafast imaging scenarios.

Keywords: single pixel imaging, image classification, ultrafast imaging, extreme learning machine, deep neural network, MNIST, anomaly detection, machine vision
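
Illustrative sketch (not the authors' system): the abstract classifies single-pixel measurements directly with an extreme learning machine, without reconstructing the image. The toy below simulates each measurement as the inner product of an illumination pattern with the scene, then fits an ELM, which keeps a random hidden layer fixed and solves only the output weights by least squares. The 4×4 toy scenes, the random binary patterns, and the layer sizes are all assumptions made for the example.

```python
import numpy as np

def spi_measure(images, patterns):
    """One detector reading per pattern: inner product of pattern and scene."""
    return images @ patterns.T

class ELM:
    """Extreme learning machine: random fixed hidden layer, least-squares readout."""

    def __init__(self, n_in, n_hidden, rng):
        # Hidden weights are drawn once and never trained.
        self.W = rng.normal(size=(n_in, n_hidden)) / np.sqrt(n_in)
        self.b = rng.normal(size=n_hidden)
        self.beta = None

    def hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, Y):
        # Only the output weights are solved for, in closed form.
        self.beta, *_ = np.linalg.lstsq(self.hidden(X), Y, rcond=None)
        return self

    def predict(self, X):
        return self.hidden(X) @ self.beta

rng = np.random.default_rng(0)

# Toy scenes (hypothetical): bright left half (label 0) or bright right half (label 1).
left = np.zeros((4, 4))
left[:, :2] = 1.0
right = np.zeros((4, 4))
right[:, 2:] = 1.0
images = np.stack([left.ravel(), right.ravel()] * 10)
images = images + 0.05 * rng.normal(size=images.shape)
labels = np.array([0.0, 1.0] * 10).reshape(-1, 1)

patterns = rng.integers(0, 2, size=(8, 16)).astype(float)  # 8 binary patterns
measurements = spi_measure(images, patterns)               # shape (20, 8)

model = ELM(n_in=8, n_hidden=32, rng=rng).fit(measurements, labels)
train_mse = float(np.mean((model.predict(measurements) - labels) ** 2))
baseline_mse = float(np.mean(labels ** 2))  # error of always predicting 0
```

Because training reduces to one linear solve, an ELM of this kind is cheap to fit and to evaluate, which fits the abstract's point about keeping inference overhead comparable to the image generation time.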

171. ❌ Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era

Authors: Nicholas Schaub, Andriy Kharchenko, Hamdah Abbasi, Sameeul Samee, Hythem Sidky, Nathan Hotaling Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12016v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

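Scoring rationale: The paper introduces Nyxus, an image feature extraction library for large-scale 2D/3D image data that offers a scalable feature extraction solution and supports several deployment options. It is unrelated to the vast majority of keywords (the principles of large-model techniques, training methods, inference optimization, alignment, agents, and so on), because those keywords refer specifically to large language model technology. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper mentions that Nyxus is applied in biomedical domains (such as radiomics and cellular analysis), which counts as AI applied to science, but this is not the core focus (the core is the feature extraction library itself), so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

To address computational efficiency, accuracy, and standardization problems in feature extraction for large-scale image data, the paper develops Nyxus, a scalable image feature extraction library that supports 2D/3D data and deployment on multiple platforms, and suits machine learning applications in biomedicine and other fields.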

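Abstract (translation)

Modern imaging instruments can produce terabytes to petabytes of data in a single experiment. The main barrier to processing large-scale image datasets is computational: existing image analysis algorithms often lack the efficiency needed for such volumes, or compromise on robustness and accuracy. Deep learning algorithms have markedly improved the accuracy of the first step of the analysis workflow (region segmentation), but the proliferation of domain-specific feature extraction libraries across scientific disciplines makes it hard to compare the performance and accuracy of the features that different methods extract. To meet these needs, we developed a new feature extraction library called Nyxus. Nyxus was designed from the outset for scalable out-of-core feature extraction on 2D and 3D image data, and has been rigorously tested against established standards. Its comprehensive feature set covers multiple biomedical domains, including radiomics and cellular analysis, and is optimized for computational scalability across CPUs and GPUs. Nyxus is packaged in several forms for users with different skill levels and needs: a Python package for code developers, a command-line tool, a Napari plugin for low-code or no-code users and users who want to visualize results, and an Open Container Initiative (OCI) compliant container that can be deployed in cloud or supercomputing workflows to process large datasets. In addition, Nyxus enables a new methodological approach to feature extraction, allowing many feature sets to be tuned programmatically to balance computational efficiency against coverage, in support of novel machine learning and deep learning applications.

Abstract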

Modern imaging instruments can produce terabytes to petabytes of data for a single experiment. The biggest barrier to processing big image datasets has been computational, where image analysis algorithms often lack the efficiency needed to process such large datasets or make tradeoffs in robustness and accuracy. Deep learning algorithms have vastly improved the accuracy of the first step in an analysis workflow (region segmentation), but the expansion of domain specific feature extraction libraries across scientific disciplines has made it difficult to compare the performance and accuracy of extracted features. To address these needs, we developed a novel feature extraction library called Nyxus. Nyxus is designed from the ground up for scalable out-of-core feature extraction for 2D and 3D image data and rigorously tested against established standards. The comprehensive feature set of Nyxus covers multiple biomedical domains including radiomics and cellular analysis, and is designed for computational scalability across CPUs and GPUs. Nyxus has been packaged to be accessible to users of various skill sets and needs: as a Python package for code developers, a command line tool, as a Napari plugin for low to no-code users or users that want to visualize results, and as an Open Container Initiative (OCI) compliant container that can be used in cloud or super-computing workflows aimed at processing large data sets. Further, Nyxus enables a new methodological approach to feature extraction allowing for programmatic tuning of many features sets for optimal computational efficiency or coverage for use in novel machine learning and deep learning applications.

Keywords: image feature extraction, big data, scalable, 2D/3D image data, biomedical domains, computational efficiency, deep learning applications, Nyxus

172. ❌ Pano360: Perspective to Panoramic Vision with Geometric Consistency

Authors: Zhengdong Zhu, Weiyi Xue, Zuyuan Yang, Wenlve Zhou, Zhiheng Zhou Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12013v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

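Scoring rationale: The paper studies panoramic image stitching, a computer vision problem, and covers 3D geometric consistency, multi-view alignment, a Transformer architecture, and image warping. All scoring keywords relate to large language models, the principles of deep learning techniques, or AI for science applications, whereas this paper focuses on a traditional computer vision task and involves no large models, deep learning principles, or applications of AI in science, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a panoramic image stitching method based on 3D geometric consistency that uses a Transformer architecture for global multi-view alignment, significantly improving stitching accuracy and perceptual quality in challenging scenes with weak textures, large parallax, and similar difficulties.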

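Abstract (translation)

Existing panorama stitching methods rely heavily on pairwise feature matching and cannot exploit geometric consistency across multiple views, which leads to severe distortion and misalignment in challenging scenes with weak textures, large parallax, and repetitive textures. Because multi-view geometric correspondences can be built directly in 3D space, giving higher accuracy and global consistency, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel Transformer-based architecture to achieve 3D awareness and aggregate global information from all views. The method uses camera poses directly to guide image warping in 3D space for global alignment, and computes the seams with a multi-feature joint optimization strategy. In addition, to establish an evaluation benchmark and train the network, we built a large-scale dataset of real-world scenes. Extensive experiments show that the method significantly outperforms existing approaches in both alignment accuracy and perceptual quality.

Abstract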

Prior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we constructed a large-scale dataset of real-world scenes. Extensive experiments show that our method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.

Keywords: panorama stitching, 3D geometric consistency, multi-view alignment, transformer architecture, image warping, global alignment, real-world dataset, perceptual quality

173. ❌ Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation

Authors: Chongyang Xu, Yixian Zou, Ziliang Feng, Fanman Meng, Shuaicheng Liu Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11984v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

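Scoring rationale: The paper "Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation" is about robotic visuomotor control. It proposes Ada3Drift, a method that improves on diffusion models and flow matching and uses training-time drifting to achieve high-fidelity single-step action generation, addressing inference latency in real-time robot control. Its core content covers robot learning, 3D point cloud processing, diffusion models, flow matching, and training optimization, but not large language models (LLMs), innovations in the principles of deep learning techniques, or the other large-model topics in the keyword list (such as MoE, Scaling Laws, Alignment, RAG, and Agents). All keywords relate directly to large language models or to specific deep learning subfields (such as instruction tuning, inference acceleration, and model interpretability), whereas this paper studies models and methods specific to robotics and is unrelated to these keywords. Every keyword therefore receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

Diffusion-based visuomotor policies suffer high inference latency in robot control because of iterative denoising. The paper proposes an adaptive training-time drifting method (Ada3Drift) that shifts iterative refinement from inference time to training time, enabling high-fidelity single-step action generation from 3D point cloud observations. It achieves state-of-the-art performance on multiple simulated and real-world robotic manipulation tasks while requiring 10 times fewer function evaluations than diffusion-based alternatives.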

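Abstract (translation)

Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robot control. Recent flow-matching and consistency-based methods achieve single-step generation, but sacrifice the ability to preserve distinct action modes, so multimodal behavior collapses into averaged and often physically infeasible trajectories. We observe that the asymmetry of the compute budget in robotics (offline training versus real-time inference) naturally suggests recovering this multimodal fidelity by shifting iterative refinement from inference to training. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (one network forward pass) from 3D point cloud observations. To cope with the few-shot training regime in robotics, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and a multi-scale field aggregation mechanism that captures action modes at different spatial granularities. Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and on real-world robotic manipulation tasks show that Ada3Drift reaches state-of-the-art performance while requiring 10 times fewer function evaluations than diffusion-based methods.

Abstract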

Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robotic control. Recent flow matching and consistency-based methods achieve single-step generation, yet sacrifice the ability to preserve distinct action modes, collapsing multimodal behaviors into averaged, often physically infeasible trajectories. We observe that the compute budget asymmetry in robotics (offline training vs. real-time inference) naturally motivates recovering this multimodal fidelity by shifting iterative refinement from inference time to training time. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (1 NFE) from 3D point cloud observations. To handle the few-shot robotic regime, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and multi-scale field aggregation that captures action modes at varying spatial granularities. Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks demonstrate that Ada3Drift achieves state-of-the-art performance while requiring $10\times$ fewer function evaluations than diffusion-based alternatives.

关键词: Ada3Drift, 3D visuomotor robotic manipulation, diffusion-based policies, training-time drifting, single-step generation, flow matching, multimodal action distributions, real-time robotic control
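
The sigmoid-scheduled loss transition mentioned in the abstract can be pictured with a small sketch. The abstract gives no formula, so everything below is an assumption made for illustration: the function names, the `midpoint` and `sharpness` parameters, and the choice of a convex blend between a coarse distribution loss and a mode-sharpening loss.

```python
import math


def sigmoid_weight(step, total_steps, midpoint=0.5, sharpness=10.0):
    """Weight in (0, 1) that rises smoothly as training progresses.

    midpoint and sharpness are hypothetical knobs, not values from the paper.
    """
    progress = step / total_steps
    return 1.0 / (1.0 + math.exp(-sharpness * (progress - midpoint)))


def blended_loss(coarse_loss, sharpen_loss, step, total_steps):
    """Convex blend: coarse loss early in training, sharpening loss late."""
    w = sigmoid_weight(step, total_steps)
    return (1.0 - w) * coarse_loss + w * sharpen_loss
```

Early in training the weight is close to 0, so the coarse loss dominates; late in training it approaches 1 and the sharpening loss takes over.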

174. ❌ AstroSplat: Physics-Based Gaussian Splatting for Rendering and Reconstruction of Small Celestial Bodies

Authors: Jennifer Nolan, Travis Driver, John Christian Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11969v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

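Scoring rationale: AstroSplat is a computer vision and graphics paper. It proposes a physics-based Gaussian splatting framework for reconstructing and rendering the surfaces of small celestial bodies such as asteroids. Its core contribution is an improvement to Gaussian splatting: integrating planetary reflectance models to improve reconstruction accuracy and rendering performance, validated on real imagery from NASA's Dawn mission. The keywords all concern large models, the principles of deep learning techniques, or related applications (such as bioinformatics), whereas the paper does not touch on large models, language models, training methods, inference optimization, alignment techniques, or agent systems. The only loosely related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to science (astronomy and planetary science). It is not a core match (the focus is visual reconstruction rather than typical bioinformatics or cheminformatics), so it receives 5 points (some relevance). All other keywords are entirely unrelated and receive 0.

!!! tip deepseek-chat TL;DR

This study proposes AstroSplat, a physics-based Gaussian splatting framework that integrates planetary reflectance models to improve surface reconstruction and rendering of small celestial bodies, and validates on NASA Dawn mission imagery that it outperforms the conventional spherical harmonic parameterization.

Abstract translation (Chinese)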

基于图像的表面重建与表征对小天体（如小行星）探测任务至关重要，它为任务规划、导航和科学分析提供依据。高斯泼溅（Gaussian splatting）技术的最新进展能够实现高保真度的神经场景表征，但通常依赖于球谐光照强度参数化方法，该方法严格基于外观建模，并未显式模拟材料属性或光与表面的相互作用。我们提出AstroSplat，这是一个基于物理的高斯泼溅框架，通过集成行星反射率模型，以提升从小天体原位图像中实现自主表面重建与光度学表征的能力。该框架在美国国家航空航天局（NASA）黎明号任务拍摄的真实图像上得到验证，实验表明相较于传统的球谐参数化方法，本框架在渲染效果与表面重建精度方面均表现出更优性能。

Abstract

Image-based surface reconstruction and characterization are crucial for missions to small celestial bodies (e.g., asteroids), as they inform mission planning, navigation, and scientific analysis. Recent advances in Gaussian splatting enable high-fidelity neural scene representations but typically rely on a spherical harmonic intensity parameterization that is strictly appearance-based and does not explicitly model material properties or light-surface interactions. We introduce AstroSplat, a physics-based Gaussian splatting framework that integrates planetary reflectance models to improve the autonomous reconstruction and photometric characterization of small-body surfaces from in-situ imagery. The proposed framework is validated on real imagery taken by NASA’s Dawn mission, where we demonstrate superior rendering performance and surface reconstruction accuracy compared to the typical spherical harmonic parameterization.

Keywords: Gaussian splatting, physics-based rendering, small celestial bodies, surface reconstruction, planetary reflectance models, AstroSplat, NASA Dawn mission, photometric characterization
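
To make the contrast with purely appearance-based shading concrete, the sketch below evaluates two textbook reflectance laws that depend on illumination and viewing geometry. The abstract does not say which planetary reflectance models AstroSplat integrates; Lommel-Seeliger is used here only as a common example from planetary photometry, constant normalization factors are dropped, and all function names are hypothetical.

```python
def lambert(albedo, cos_incidence):
    """Lambertian reflectance: depends only on the incidence angle."""
    return albedo * max(cos_incidence, 0.0)


def lommel_seeliger(albedo, cos_incidence, cos_emission):
    """Lommel-Seeliger law up to a constant factor: mu0 / (mu0 + mu).

    mu0 is the cosine of the incidence angle and mu the cosine of the
    emission angle; negative cosines (unlit or back-facing) are clamped.
    """
    mu0 = max(cos_incidence, 0.0)
    mu = max(cos_emission, 0.0)
    if mu0 + mu == 0.0:
        return 0.0
    return albedo * mu0 / (mu0 + mu)
```

Unlike a spherical harmonic color term, both functions take the surface albedo and the light and view angles as explicit inputs, which is the kind of physical parameterization the abstract refers to.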

175. ❌ Preliminary analysis of RGB-NIR Image Registration techniques for off-road forestry environments

Authors: Pankaj Deoli, Karthik Ranganath, Karsten Berns Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11952v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

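Scoring rationale: This paper studies RGB-NIR image registration in forestry environments, mainly within computer vision and sensor fusion. Although it mentions deep learning (DL) methods, all scoring keywords focus on large language models (LLMs) and related techniques (MoE, RLHF, RAG, and so on), and the paper does not involve language models, text generation, or large-model techniques at all. Its content has no direct connection to any keyword, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper evaluates the suitability of classical and deep learning methods for RGB-NIR image registration in forestry environments. It finds that existing methods still struggle to preserve geometric consistency and to handle fine detail in dense vegetation, and that further improvement is needed.

Abstract translation (Chinese)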

RGB-NIR图像配准在传感器融合、图像增强与越野自主系统中扮演着重要角色。本研究评估了经典方法与基于深度学习（DL）的图像配准技术，以探究其在越野林业应用中的适用性。在六种不同配置下训练的NeMAR模型展现出部分成功，但其生成对抗网络（GAN）损失的不稳定性表明，保持几何一致性仍面临挑战。MURF方法在越野林业数据测试中，于共享信息提取阶段展现出有前景的大尺度特征对齐能力，但在稠密植被区域的精细细节配准上仍存在困难。尽管这仅为初步评估，我们的研究表明，为越野森林应用实现鲁棒的多尺度配准仍需进一步优化。

Abstract

RGB-NIR image registration plays an important role in sensor fusion, image enhancement, and off-road autonomy. In this work, we evaluate both classical and Deep Learning (DL) based image registration techniques to assess their suitability for off-road forestry applications. NeMAR, trained under 6 different configurations, demonstrates partial success; however, its GAN loss instability suggests challenges in preserving geometric consistency. MURF, when tested on off-road forestry data, shows promising large-scale feature alignment during shared information extraction but struggles with fine details in dense vegetation. Even though this is just a preliminary evaluation, our study indicates that further refinements are needed for robust, multi-scale registration in off-road forest applications.

Keywords: RGB-NIR image registration, off-road forestry, sensor-fusion, Deep Learning, NeMAR, MURF, geometric consistency, multi-scale registration

176. ❌ AS-Bridge: A Bidirectional Generative Framework Bridging Next-Generation Astronomical Surveys

Authors: Dichang Zhang, Yixuan Shao, Simon Birrer, Dimitris Samaras Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11928v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

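Scoring rationale: This paper applies generative models to astronomy, using a diffusion model and a Brownian bridge process to enable joint analysis of astronomical survey data. Its content is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on). The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the paper applies AI to science (astronomy), but it is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This study proposes AS-Bridge, a bidirectional generative model that bridges ground-based (LSST) and space-based (Euclid) astronomical observations, using a diffusion model with a Brownian bridge process to enable cross-survey joint analysis and probabilistic prediction.

Abstract translation (Chinese)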

未来十年的观测宇宙学将由大规模巡天项目所主导，例如位于薇拉·C·鲁宾天文台的地基大型综合巡天望远镜（LSST）以及天基的欧几里得太空任务（Euclid mission）。尽管这些项目有望在深度、分辨率和波长范围上提供前所未有的宇宙视野，但它们之间在观测模式、天区覆盖范围、点扩散函数以及扫描频率上的差异，使得联合分析既具有重要价值，也面临挑战。为促进联合分析，我们提出了A(stronomical)S(urvey)-Bridge（天文巡天桥接模型），这是一个能够在基于地面与基于空间的观测之间进行转换的双向生成模型。AS-Bridge学习一个扩散模型，该模型在LSST和欧几里得的观测数据之间采用了一种随机布朗桥过程。这两项巡天计划拥有重叠的天区，使我们能够显式地建模它们之间的条件概率分布。我们证明，这种建模方式能够实现超越单一巡天分析的新科学能力，包括对缺失巡天观测数据的可靠概率预测，以及通过跨巡天数据探测稀有事件。这些结果证实了跨巡天生成建模的可行性。因此，AS-Bridge有望成为未来LSST-欧几里得联合数据处理流程中的一个互补组件，从而在双方数据可用时提升科学回报。数据与代码可在 https://github.com/ZHANG7DC/AS-Bridge 获取。

Abstract

The upcoming decade of observational cosmology will be shaped by large sky surveys, such as the ground-based LSST at the Vera C. Rubin Observatory and the space-based Euclid mission. While they promise an unprecedented view of the Universe across depth, resolution, and wavelength, their differences in observational modality, sky coverage, point-spread function, and scanning cadence make joint analysis beneficial, but also challenging. To facilitate joint analysis, we introduce A(stronomical)S(urvey)-Bridge, a bidirectional generative model that translates between ground- and space-based observations. AS-Bridge learns a diffusion model that employs a stochastic Brownian Bridge process between the LSST and Euclid observations. The two surveys have overlapping sky regions, where we can explicitly model the conditional probabilistic distribution between them. We show that this formulation enables new scientific capabilities beyond single-survey analysis, including faithful probabilistic predictions of missing survey observations and inter-survey detection of rare events. These results establish the feasibility of inter-survey generative modeling. AS-Bridge is therefore well-positioned to serve as a complementary component of future LSST-Euclid joint data pipelines, enhancing the scientific return once data from both surveys become available. Data and code are available at https://github.com/ZHANG7DC/AS-Bridge.

Keywords: astronomical surveys, generative model, diffusion model, Brownian Bridge, joint analysis, LSST, Euclid, probabilistic prediction
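
A Brownian bridge is a stochastic process pinned at both ends, which is what makes it a natural path between paired observations. The sketch below samples the standard scalar bridge at a time t in [0, 1]. It is not the AS-Bridge model itself; the function name, the `sigma` parameter, and the use of scalar values instead of images are assumptions for illustration.

```python
import math
import random


def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Sample x_t of a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    The marginal is Gaussian with mean (1 - t) * x0 + t * x1 and
    variance sigma**2 * t * (1 - t), so the noise vanishes at both ends.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * math.sqrt(t * (1.0 - t))
    return mean + std * rng.gauss(0.0, 1.0)
```

The variance peaks at t = 0.5 and is zero at the endpoints, so samples always start at one observation and end at the other.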

177. ❌ PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

Authors: Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qin, Michele Magno Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11917v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

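Scoring rationale: PicoSAM3 is a lightweight segmentation model in computer vision and is unrelated to most of the large language model (LLM) keywords. The only related keywords are 'Small Language Models OR SLMs OR On-device AI' (relevance 8.0), because the paper studies real-time inference on edge devices, which falls under on-device AI, and 'Quantization OR Model Compression OR Low-bit Weights' (relevance 10.0), because the paper explicitly uses INT8 quantization to compress the model while preserving accuracy. No other keyword is covered.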
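
The pass mark in the report header follows the stated formula 5.0 + 0.8 × total weight, which gives 26.6 for a total weight of 27.0. The sketch below reproduces that arithmetic and, as an assumption that the report does not state, treats each row's score as weight × relevance. Under that assumption the two non-zero relevance values for this paper contribute nothing, because the table lists a weight of 0.0 for every keyword. The function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, slope=0.8):
    """Pass mark from the report header: 5.0 + 0.8 * total weight."""
    return base + slope * total_weight


def total_score(rows):
    """Sum of per-keyword scores.

    rows holds (weight, relevance) pairs. ASSUMPTION: each row scores
    weight * relevance; the report does not spell out this rule.
    """
    return sum(weight * relevance for weight, relevance in rows)


# The two PicoSAM3 rows with non-zero relevance, as listed in the table.
as_reported = [(0.0, 8.0), (0.0, 10.0)]
# The same rows with the weight of 1.0 from the report configuration.
with_config_weights = [(1.0, 8.0), (1.0, 10.0)]
```

Under the same assumption, using the configured weight of 1.0 per keyword would give 18.0, which is still below the 26.6 pass mark.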

!!! tip deepseek-chat TL;DR

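This paper proposes PicoSAM3, a lightweight promptable visual segmentation model. Using knowledge distillation and INT8 quantization, it achieves real-time, high-accuracy in-sensor segmentation on the Sony IMX500 vision sensor and outperforms existing baselines on the COCO and LVIS datasets.

Abstract translation (Chinese)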

实时、设备端分割对于智能眼镜和物联网设备等延迟敏感且注重隐私的应用至关重要。我们提出PicoSAM3，一种轻量级可提示视觉分割模型，专为边缘和传感器内执行优化，包括在索尼IMX500视觉传感器上的部署。PicoSAM3具有130万参数，融合了密集卷积神经网络（CNN）架构与感兴趣区域提示编码、高效通道注意力（Efficient Channel Attention）机制，并采用了来自SAM2和SAM3的知识蒸馏技术。在COCO和LVIS数据集上，PicoSAM3分别实现了65.45%和64.01%的平均交并比（mIoU），在相同或更低复杂度下超越了现有基于SAM及面向边缘的基准模型。INT8量化模型在精度损失可忽略的前提下保持了准确性，同时在IMX500传感器上实现了11.82毫秒延迟的实时传感器内推理，完全符合其内存和算子约束。消融实验表明，相较于监督训练，从大型SAM模型进行知识蒸馏可带来高达+14.5%的mIoU提升，并证明高质量、空间灵活的可提示分割直接在传感器层级实现是可行的。

Abstract

Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3 M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.

Keywords: real-time segmentation, on-device AI, edge computing, model quantization, knowledge distillation, visual segmentation, in-sensor processing, lightweight model
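
INT8 quantization stores each value as an 8-bit integer plus a shared scale. The abstract does not describe PicoSAM3's quantization scheme, so the sketch below shows one common variant, symmetric per-tensor quantization, with hypothetical function names.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (integer codes, scale).

    The largest magnitude maps to 127; an all-zero input uses scale 1.0.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]
```

With round-to-nearest, the reconstruction error of each value is at most half the scale, which is why accuracy can degrade only slightly when the value range is well covered.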

178. ❌ InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

Authors: InSpatio Team, Xiaoyu Zhang, Weihong Pan, Zhichao Ye, Jialin Liu, Yipeng Chen, Nan Wang, Xiaojun Xiang, Weijian Xie, Yifu Wang, Haoyu Ji, Siji Pan, Zhewen Le, Jing Guo, Xianbin Liu, Donghui Shen, Ziqiang Zhao, Haomin Liu, Guofeng Zhang Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11911v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

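Scoring rationale: The paper focuses on a real-time frame generation model for spatial intelligence and is unrelated to most keywords. It is highly relevant only to 'World Models AND General World Models' (10 points), because it explicitly presents InSpatio-WorldFM as an alternative to video-based world models and refers to 'world models' repeatedly. It has some relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because the abstract mentions transforming a pretrained image diffusion model, although this is not the core focus. No other keyword is covered, and each receives 0.

!!! tip deepseek-chat TL;DR

The paper proposes InSpatio-WorldFM, an open-source real-time frame model. Through independent per-frame generation and spatial-consistency mechanisms, it addresses the high latency of video-based world models and achieves strong multi-view consistency and interactive exploration on consumer-grade GPUs.

Abstract translation (Chinese)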

我们提出InSpatio-WorldFM，一种用于空间智能的开源实时帧模型。与依赖序列帧生成、因窗口级处理而产生高延迟的基于视频的世界模型不同，InSpatio-WorldFM采用基于帧的范式，能够独立生成每一帧，从而实现低延迟的实时空间推理。该模型通过显式三维锚点与隐式空间记忆强制实施多视角空间一致性，在保持跨视角变化中细粒度视觉细节的同时，保留了全局场景几何结构。我们进一步引入渐进式三阶段训练流程，将预训练的图像扩散模型转化为可控帧模型，最终通过少步蒸馏实现实时生成器。实验结果表明，InSpatio-WorldFM在消费级GPU上实现强大多视角一致性的同时支持交互式探索，为实时世界模拟提供了相较于传统视频世界模型的高效替代方案。

Abstract

We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.

Keywords: InSpatio-WorldFM, real-time frame model, spatial intelligence, world models, multi-view consistency, image diffusion model, few-step distillation, interactive exploration

179. ❌ Single-View Rolling-Shutter SfM

Authors: Sofía Errázuriz Muñoz, Kim Kiehn, Petr Hruby, Kathlén Kohn Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11888v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

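Scoring rationale: The paper studies single-view structure-from-motion (SfM) for rolling-shutter cameras, which belongs to computer vision and geometric reconstruction. All scoring keywords concern large models, the principles of deep learning techniques, or applications of AI in science, and the paper touches on none of these topics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper addresses single-view structure-from-motion for rolling-shutter cameras (RS SfM). By characterizing the single-view geometry and deriving minimal reconstruction problems, it demonstrates the feasibility of recovering motion and scene parameters from a single rolling-shutter image.

Abstract translation (Chinese)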

卷帘快门（Rolling-shutter, RS）相机已无处不在，但卷帘快门运动恢复结构（RS SfM）问题尚未得到完全解决。本研究提出了一种改进方法：我们刻画了所观测世界点或线条的卷帘快门单视图几何特性。利用此几何特性，我们阐述了如何从单张卷帘快门图像中恢复运动与场景参数，并系统性地推导了最小化重建问题。通过概念验证求解器对若干代表性案例进行评估，结果既凸显了该方法的可行性，也揭示了其实际局限性。

Abstract

Rolling-shutter (RS) cameras are ubiquitous, but RS SfM (structure-from-motion) has not been fully solved yet. This work suggests an approach to remedy this: We characterize RS single-view geometry of observed world points or lines. Exploiting this geometry, we describe which motion and scene parameters can be recovered from a single RS image and systematically derive minimal reconstruction problems. We evaluate several representative cases with proof-of-concept solvers, highlighting both feasibility and practical limitations.

Keywords: Rolling-shutter, Structure-from-motion, Single-view geometry, Reconstruction, Minimal problems, Motion recovery, Scene parameters, Proof-of-concept solvers

180. ❌ Derain-Agent: A Plug-and-Play Agent Framework for Rainy Image Restoration

Authors: Zhaocheng Yu, Xiang Chen, Runzhe Li, Zihan Geng, Guanglu Sun, Haipeng Li, Kui Jiang Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11866v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

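Scoring rationale: This paper addresses image deraining in computer vision and proposes an agent-based framework to improve existing deraining models. Although it uses the term 'agent', it means an agent framework within computer vision (a planning network and strength modulation), not an agent built on large language models (LLMs). The paper involves deep learning, image restoration, and adaptive processing, but none of the keyword topics: large models, language models, model training techniques, inference optimization, alignment methods, or AI-for-science applications. All keywords relate to large-model techniques or AI-for-science applications, while this is purely computer vision research, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the inability of the static inference paradigm in existing single-image deraining models to adapt to complex real-world degradations, this paper proposes the Derain-Agent framework. A planning network and a strength modulation mechanism enable dynamic, adaptive image restoration and consistently improve the performance of existing deraining models.

Abstract translation (Chinese)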

尽管深度学习已推动了单幅图像去雨技术的发展，但现有模型存在一个根本性局限：它们采用静态推理范式，无法适应真实场景中复杂且耦合的退化问题（例如噪声伪影、模糊和色彩偏差）。因此，修复后的图像常存在残留伪影和感知质量不一致的问题。本研究提出Derain-Agent，一种即插即用的优化框架，将去雨任务从静态处理转变为基于智能体的动态恢复过程。Derain-Agent为基础去雨模型赋予两项核心能力：1）规划网络，可为每个输入实例智能调度最优的修复工具序列；2）强度调制机制，能以空间自适应的强度应用这些工具。该设计能够以可接受的成本实现针对残留误差的精确、区域特异性校正，而无需进行代价高昂的迭代搜索。我们的方法展现出强大的泛化能力，在合成与真实场景基准测试中均能持续提升前沿去雨模型的性能。

Abstract

While deep learning has advanced single-image deraining, existing models suffer from a fundamental limitation: they employ a static inference paradigm that fails to adapt to the complex, coupled degradations (e.g., noise artifacts, blur, and color deviation) of real-world rain. Consequently, restored images often exhibit residual artifacts and inconsistent perceptual quality. In this work, we present Derain-Agent, a plug-and-play refinement framework that transitions deraining from static processing to dynamic, agent-based restoration. Derain-Agent equips a base deraining model with two core capabilities: 1) a Planning Network that intelligently schedules an optimal sequence of restoration tools for each instance, and 2) a Strength Modulation mechanism that applies these tools with spatially adaptive intensity. This design enables precise, region-specific correction of residual errors without the prohibitive cost of iterative search. Our method demonstrates strong generalization, consistently boosting the performance of state-of-the-art deraining models on both synthetic and real-world benchmarks.

Keywords: image deraining, agent-based restoration, planning network, strength modulation, rainy image restoration, plug-and-play framework, adaptive processing, residual artifact correction
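
One simple way to read "spatially adaptive intensity" is a per-pixel blend between the input and a restoration tool's output. The abstract does not give the actual formulation, so the linear blend, the function name `modulate`, and the nested-list image representation below are all assumptions.

```python
def modulate(image, tool_output, strength):
    """Blend per pixel: strength 0 keeps the input, 1 applies the tool fully.

    All three arguments are 2D lists of floats with the same shape;
    strength values are expected to lie in [0, 1].
    """
    return [
        [p + s * (q - p) for p, q, s in zip(img_row, tool_row, str_row)]
        for img_row, tool_row, str_row in zip(image, tool_output, strength)
    ]
```

A strength map that is high only where residual artifacts remain would leave already clean regions untouched, which matches the region-specific correction the abstract describes.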

181. ❌ Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local, Centralized, and Federated Learning

Authors: Johan Andreas Balle Rubak, Sara Haghighat, Sanyam Jain, Mostafa Aldesoki, Akhilanand Chaurasia, Sarah Sadat Ehsani, Faezeh Dehghan Ghanatkaman, Ahmad Badruddin Ghazali, Julien Issa, Basel Khalil, Rishi Ramani, Ruben Pauwels Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11850v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

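Scoring rationale: This paper studies automatic classification of the relation between the third molar and the mandibular canal in medical images (panoramic radiographs), using deep learning (ResNet-34) and federated learning. Its content is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, agents, and so on), which refer specifically to large language models (LLMs) and related techniques. The only loosely related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to medical image analysis (which can be regarded as a subfield of biomedical informatics). It is not a core match, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This study compares local learning, federated learning, and centralized learning for automatically classifying the overlap between the third molar and the mandibular canal on panoramic radiographs. Centralized learning performs best, and federated learning offers a viable privacy-preserving alternative.

Abstract translation (Chinese)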

下颌第三磨牙阻生且邻近下颌管会增加下牙槽神经损伤的风险。全景X线片常规用于评估这一解剖关系。对磨牙-下颌管重叠区域进行自动分类可辅助临床分诊并减少不必要的锥形束CT（CBCT）转诊，而联邦学习（FL）技术能在不共享患者数据的前提下实现多中心协作。本研究在由八位独立标注者划分的裁剪全景片上，比较了局部学习（LL）、联邦学习（FL）与集中式学习（CL）在二分类（重叠/非重叠）任务中的表现。采用预训练的ResNet-34模型，分别在三种范式下进行训练，并通过两种方式评估性能：基于各客户端本地优化阈值的分项指标，以及采用全局阈值的汇总测试性能。评估指标包括受试者工作特征曲线下面积（AUC）与基于阈值的度量，同时结合训练动态分析、梯度加权类激活映射（Grad-CAM）可视化及服务器端聚合监测信号。在测试集上，CL获得最佳性能（AUC 0.831；准确率=0.782），FL表现居中（AUC 0.757；准确率=0.703），而LL在各客户端间泛化能力较差（AUC范围=0.619-0.734；均值=0.672）。训练曲线提示存在过拟合现象（尤其在LL模型中），Grad-CAM显示CL与FL模型的注意力更集中于解剖相关区域。总体而言，集中式训练能提供最优性能，而FL作为一种保护隐私的替代方案，其表现优于LL。

Abstract

Impaction of the mandibular third molar in proximity to the mandibular canal increases the risk of inferior alveolar nerve injury. Panoramic radiography is routinely used to assess this relationship. Automated classification of molar-canal overlap could support clinical triage and reduce unnecessary CBCT referrals, while federated learning (FL) enables multi-center collaboration without sharing patient data. We compared Local Learning (LL), FL, and Centralized Learning (CL) for binary overlap/no-overlap classification on cropped panoramic radiographs partitioned across eight independent labelers. A pretrained ResNet-34 was trained under each paradigm and evaluated using per-client metrics with locally optimized thresholds and pooled test performance with a global threshold. Performance was assessed using area under the receiver operating characteristic curve (AUC) and threshold-based metrics, alongside training dynamics, Grad-CAM visualizations, and server-side aggregate monitoring signals. On the test set, CL achieved the highest performance (AUC 0.831; accuracy = 0.782), FL showed intermediate performance (AUC 0.757; accuracy = 0.703), and LL generalized poorly across clients (AUC range = 0.619-0.734; mean = 0.672). Training curves suggested overfitting, particularly in LL models, and Grad-CAM indicated more anatomically focused attention in CL and FL. Overall, centralized training provided the strongest performance, while FL offers a privacy-preserving alternative that outperforms LL.

Keywords: deep learning, panoramic radiographs, third molar, mandibular canal, federated learning, centralized learning, medical image classification, ResNet-34
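
The abstract compares local, federated, and centralized training but does not say how the server combines client updates. As a minimal sketch, assuming the common FedAvg rule (a sample-size-weighted average of client parameters), one aggregation step might look like the following. The function name and the toy numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Average per-client parameter vectors, weighted by local sample count.

    ASSUMPTION: FedAvg-style aggregation; the abstract does not name the rule.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)            # shape: (n_clients, n_params)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()

# Hypothetical toy values: three clients, two parameters each.
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
global_weights = fedavg(clients, client_sizes=[10, 10, 20])
```

Clients with more local samples pull the global model further toward their own parameters. This is one reason label or data imbalance across the eight labelers could matter in such a setup.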

182. ❌ ZeroSense: How Vision matters in Long Context Compression

Authors: Yonghan Gao, Zehong Chen, Lijian Xu, Jingzhi Chen, Jingwei Guan, Xingyu Zeng Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11846v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

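Scoring rationale: The paper studies how visual-text compression (VTC) methods are evaluated in long-context modeling. It is highly relevant to "Context Window Extension OR Long Context LLMs" (10 points) because it explicitly discusses long-context modeling tasks and compression methods. It is moderately relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points) because multimodal large language models (MLLMs) are the objects of evaluation. The remaining keywords (MoE, SLMs, Scaling Laws, training methods, reasoning techniques, agent systems, and so on) are neither mentioned in the paper nor relevant to it, so they score 0.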
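
The report's configuration gives the passing score as 5.0 + 0.8 × total weight, which is 26.6 for a total weight of 27.0, with a maximum of 15 points per keyword. The per-keyword scoring rule itself is not stated, so the sketch below assumes a simple multiplicative rule. Under any such rule, a row whose weight column reads 0.0 scores 0.0 regardless of its relevance, which matches the tables in this section.

```python
MAX_SCORE_PER_KEYWORD = 15.0   # maximum score per keyword, from the report configuration

def pass_threshold(total_weight):
    """Passing score as given in the report configuration."""
    return 5.0 + 0.8 * total_weight

def keyword_score(weight, relevance):
    """ASSUMED rule (not stated in the report): weight x (relevance / 10) x maximum."""
    return weight * (relevance / 10.0) * MAX_SCORE_PER_KEYWORD

threshold = pass_threshold(27.0)              # configured total weight
zero_weight_score = keyword_score(0.0, 10.0)  # a 10/10 relevance row with weight 0.0
```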

!!! tip deepseek-chat TL;DR

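The paper addresses the inaccurate evaluation of existing visual-text compression methods in long-context modeling. It proposes an evaluation framework that decouples the capabilities of multimodal large language models, together with the ZeroSense benchmark. Experiments show that compression quality and downstream task accuracy diverge significantly.

Abstract translation (Chinese)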

近期以DeepSeek-OCR为代表的视觉文本压缩方法，通过利用文本到图像的渲染技术，在长上下文建模任务中实现了令人瞩目的高令牌压缩率。然而，现有评估方案高度依赖下游任务性能指标。由于多模态大语言模型本身具备强大的语言先验知识，此类评估指标难以准确衡量文本信息的真实保留程度。本研究提出一种新的评估框架，通过解耦多模态大语言模型的能力来精确评估视觉文本压缩质量。在此框架内，我们进一步构建了ZeroSense基准测试，确保测试样本具有较低的语义关联性。通过消除上下文依赖性，该基准能保证评估结果纯粹反映视觉文本压缩质量，而不受下游模型语义推理能力的影响。在多个数据集上的大量实验表明，视觉文本压缩质量与下游任务准确率存在显著偏差，这印证了我们提出的解耦评估框架的必要性。

Abstract

Recent visual-text compression (VTC) methods, typified by DeepSeek-OCR, report impressive high token compression ratios for long-context modeling tasks by leveraging text-to-image rendering. However, existing evaluation protocols heavily rely on downstream task performance. Such evaluation metrics fail to accurately measure text preservation due to the strong inherent linguistic priors of Multimodal Large Language Models (MLLMs). In this work, we introduce a new evaluation framework that decouples MLLMs’ capabilities to faithfully assess VTC quality. Within this framework, we further introduce the ZeroSense Benchmark to ensure low semantic correlation of testing samples. By eliminating contextual dependencies, our benchmark guarantees that the evaluation results are purely reflective of VTC quality, unaffected by the semantic inference capabilities of downstream models. Extensive experiments across multiple datasets demonstrate that VTC quality and downstream task accuracy diverge significantly, highlighting the necessity of our decoupled evaluation framework.

Keywords: visual-text compression, long-context modeling, Multimodal Large Language Models, evaluation framework, ZeroSense Benchmark, semantic correlation, token compression ratios, downstream task performance
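
To make the idea of measuring text preservation independently of language priors concrete, here is a minimal sketch. It assumes that "low semantic correlation" can be approximated by random character strings, and that preservation can be scored by position-wise character agreement. Both choices, and all function names, are illustrative assumptions; the abstract does not describe the benchmark's actual construction or metric.

```python
import random
import string

def compression_ratio(n_text_tokens, n_vision_tokens):
    """Text tokens represented per vision token after rendering text to an image."""
    return n_text_tokens / n_vision_tokens

def random_word_sample(n_words, seed=0):
    """ASSUMED stand-in for a low-semantic-correlation sample: random lowercase 'words'."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        length = rng.randint(3, 8)
        words.append("".join(rng.choice(string.ascii_lowercase) for _ in range(length)))
    return " ".join(words)

def char_accuracy(reference, decoded):
    """Fraction of positions where the decoded text matches the reference."""
    matches = sum(a == b for a, b in zip(reference, decoded))
    return matches / max(len(reference), len(decoded), 1)

sample = random_word_sample(5)
```

Because a random string cannot be guessed from context, a model that reproduces it must have actually read it from the compressed representation. This is the intuition behind removing contextual dependencies.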

183. ❌ A Decade of Generative Adversarial Networks for Porous Material Reconstruction

Authors: Ali Sadeghkhani, Brandon Bennett, Masoud Babaei, Arash Rabbani Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11836v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

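Scoring rationale: The paper is a systematic review of generative adversarial networks (GANs) for porous material image reconstruction, an application of deep learning in science. The scoring keywords center on large models (LLMs) and related techniques, whereas the paper covers no large-model technology and discusses only GANs from conventional deep learning. The only partially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": porous material reconstruction can be viewed as a scientific application, but the paper does not explicitly involve bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

This review systematically analyzes 96 papers published from 2017 to early 2026, tracing the evolution, applications, and challenges of generative adversarial networks (GANs) for porous material image reconstruction. It reports substantial progress in porosity accuracy, permeability prediction, and reconstruction volume.

Abstract translation (Chinese)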

多孔材料的数字化重建对于从地质储层表征到组织工程和电化学器件设计等应用领域已变得日益关键。尽管微计算机断层扫描和统计重建方法等传统技术已在该领域奠定基础，但深度学习技术，特别是生成对抗网络的出现，彻底革新了多孔介质重建的能力。本综述系统分析了2017年至2026年初发表的96篇同行评议文献，探讨了基于生成对抗网络的多孔材料图像重建方法的演进与应用。我们将生成对抗网络架构归纳为六种主要类型：原始生成对抗网络、多尺度生成对抗网络、条件生成对抗网络、注意力增强生成对抗网络、基于风格的生成对抗网络以及混合架构生成对抗网络。分析表明该领域已取得显著进展，包括孔隙度精度提升（误差控制在原始样本1%以内）、渗透率预测改进（平均相对误差降低达79%）以及可重建体积的扩大（从初始的$64^3$体素发展到当前的$2{,}200^3$体素）。尽管取得这些进展，该领域仍面临持续挑战，包括计算效率问题、大规模重建的内存限制以及二维到三维转换中结构连续性的保持。本系统分析为根据具体应用需求选择适宜的生成对抗网络架构提供了完整框架。

Abstract

Digital reconstruction of porous materials has become increasingly critical for applications ranging from geological reservoir characterization to tissue engineering and electrochemical device design. While traditional methods such as micro-computed tomography and statistical reconstruction approaches have established foundations in this field, the emergence of deep learning techniques, particularly Generative Adversarial Networks (GANs), has revolutionized porous media reconstruction capabilities. This review systematically analyzes 96 peer-reviewed articles published from 2017 to early 2026, examining the evolution and applications of GAN-based approaches for porous material image reconstruction. We categorize GAN architectures into six distinct classes, namely Vanilla GANs, Multi-Scale GANs, Conditional GANs, Attention-Enhanced GANs, Style-based GANs, and Hybrid Architecture GANs. Our analysis reveals substantial progress including improvements in porosity accuracy (within 1% of original samples), permeability prediction (up to 79% reduction in mean relative errors), and achievable reconstruction volumes (from initial $64^3$ to current $2{,}200^3$ voxels). Despite these advances, persistent challenges remain in computational efficiency, memory constraints for large-scale reconstruction, and maintaining structural continuity in 2D-to-3D transformations. This systematic analysis provides a comprehensive framework for selecting appropriate GAN architectures based on specific application requirements.

Keywords: Generative Adversarial Networks, GANs, porous material reconstruction, image reconstruction, deep learning, systematic review, permeability prediction, computational efficiency
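
As a rough illustration of the porosity metric mentioned in the abstract, the sketch below computes porosity as the pore-voxel fraction of a binary volume and compares two volumes by relative error. It assumes voxels are encoded as 1 for pore and 0 for solid, and it uses random volumes in place of real samples and GAN output. Whether the reviewed papers report relative or absolute porosity error is not stated in the abstract.

```python
import numpy as np

def porosity(volume):
    """Fraction of voxels that are pore space (ASSUMED encoding: 1 = pore, 0 = solid)."""
    return float(np.mean(volume))

def relative_error(reference, estimate):
    """Relative deviation of an estimate from a nonzero reference value."""
    return abs(estimate - reference) / abs(reference)

# Hypothetical stand-ins for an original sample and a reconstruction.
rng = np.random.default_rng(0)
original = (rng.random((64, 64, 64)) < 0.20).astype(np.uint8)
reconstructed = (rng.random((64, 64, 64)) < 0.20).astype(np.uint8)
err = relative_error(porosity(original), porosity(reconstructed))

# Growth in voxel count from 64^3 to 2,200^3, the volumes quoted in the abstract.
voxel_growth = 2200**3 / 64**3
```

The voxel-count ratio comes out at roughly 4 × 10^4, which helps explain why memory constraints remain a persistent challenge for large-scale reconstruction.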

184. ❌ Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding

Authors: Jiahao Li, Qingwang Zhang, Qiuyu Chen, Guozhan Qiu, Yunzhong Lou, Xiangdong Zhou Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11831v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

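Scoring rationale: The paper explicitly uses large language models (LLMs) for CAD program generation and trains them with supervised fine-tuning (SFT) and reinforcement learning (RL), so the LLM and SFT keywords are both highly relevant (10 points each). The paper applies AI to engineering and design, which gives it some relevance to "AI for Science" (5 points). Other keywords, such as MoE, quantization, and RAG, are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes FutureCAD, a new text-to-CAD framework that combines large language models with a B-Rep grounding transformer to address the paradigm gap limiting AI-driven CAD modeling in complex industrial product design. It reports state-of-the-art CAD generation performance.

Abstract translation (Chinese)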

近年来，计算机辅助设计（CAD）生成领域取得了显著进展。现有方法通常分为两个独立类别：参数化CAD建模与直接边界表示（B-Rep）合成。在现代基于特征的CAD系统中，参数化建模与B-Rep本质上是相互交织的，因为高级参数化操作（如圆角与倒角）需要显式选择B-Rep几何基元，而B-Rep本身也源自参数化操作。因此，这种范式差距仍然是限制复杂工业产品设计中人工智能驱动CAD建模的关键因素。本文提出FutureCAD，一种新颖的文本到CAD框架，其利用大语言模型（LLMs）与边界表示基础变换器（BRepGround）实现高保真CAD生成。我们的方法生成可执行的CadQuery脚本，并引入一种基于文本的查询机制，使大语言模型能够通过自然语言指定几何选择，随后由BRepGround将其关联至目标基元。为训练该框架，我们构建了一个包含真实世界CAD模型的新数据集。针对大语言模型，我们应用监督微调（SFT）以建立基础CAD生成能力，随后通过强化学习（RL）提升泛化性能。实验表明，FutureCAD实现了最先进的CAD生成性能。

Abstract

The field of Computer-Aided Design (CAD) generation has made significant progress in recent years. Existing methods typically fall into two separate categories: parametric CAD modeling and direct boundary representation (B-Rep) synthesis. In modern feature-based CAD systems, parametric modeling and B-Rep are inherently intertwined, as advanced parametric operations (e.g., fillet and chamfer) require explicit selection of B-Rep geometric primitives, and the B-Rep itself is derived from parametric operations. Consequently, this paradigm gap remains a critical factor limiting AI-driven CAD modeling for complex industrial product design. This paper presents FutureCAD, a novel text-to-CAD framework that leverages large language models (LLMs) and a B-Rep grounding transformer (BRepGround) for high-fidelity CAD generation. Our method generates executable CadQuery scripts, and introduces a text-based query mechanism that enables the LLM to specify geometric selections via natural language, which BRepGround then grounds to the target primitives. To train our framework, we construct a new dataset comprising real-world CAD models. For the LLM, we apply supervised fine-tuning (SFT) to establish fundamental CAD generation capabilities, followed by reinforcement learning (RL) to improve generalization. Experiments show that FutureCAD achieves state-of-the-art CAD generation performance.

Keywords: CAD generation, large language models, B-Rep grounding, text-to-CAD, supervised fine-tuning, reinforcement learning, FutureCAD, parametric modeling
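
The abstract notes that operations such as fillet and chamfer require explicit selection of B-Rep primitives, and that a text query is grounded to those primitives. The toy sketch below shows only the selection problem: it enumerates the edges of an axis-aligned box and picks the ones a text query refers to. The keyword matching is a hypothetical stand-in for a learned grounding model. It is not BRepGround, and it does not use the CadQuery API.

```python
from itertools import product

def box_edges(dx, dy, dz):
    """Return the 12 edges of an axis-aligned box as pairs of corner points."""
    corners = list(product((0.0, dx), (0.0, dy), (0.0, dz)))
    edges = []
    for i, a in enumerate(corners):
        for b in corners[i + 1:]:
            # Two corners share an edge when they differ in exactly one coordinate.
            if sum(p != q for p, q in zip(a, b)) == 1:
                edges.append((a, b))
    return edges

AXIS = {"x": 0, "y": 1, "z": 2}

def select_parallel(edges, query):
    """Toy grounding (ASSUMPTION): the last word of the query names the axis."""
    axis = AXIS[query.strip().split()[-1].lower()]
    return [(a, b) for a, b in edges if a[axis] != b[axis]]

edges = box_edges(10.0, 10.0, 5.0)
vertical = select_parallel(edges, "edges parallel to z")
```

A fillet applied to `vertical` would round the four upright edges of the box. A real system has to resolve such a query against arbitrary B-Rep geometry, where simple axis tests no longer suffice.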

185. ❌ Multimodal classification of Radiation-Induced Contrast Enhancements and tumor recurrence using deep learning

Authors: Robin Peretzke, Marlin Hanstein, Maximilian Fischer, Lars Badhi Wessel, Obada Alhalabi, Sebastian Regnery, Andreas Kudak, Maximilian Deng, Tanja Eichkorn, Philipp Hoegen Saßmannshausen, Fabian Allmendinger, Jan-Hendrik Bolten, Philipp Schröter, Christine Jungk, Jürgen Peter Debus, Peter Neher, Laila König, Klaus Maier-Hein Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11827v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

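Scoring rationale: The paper focuses on deep learning for medical image classification, specifically distinguishing tumor recurrence from radiation-induced contrast enhancements in glioblastoma patients. It does not involve large language models (LLMs), model architectures (such as MoE), training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or model compression. The most relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", scored 10 (highly relevant), because the study applies AI to a biomedical field (neuro-oncology). "Mechanistic Interpretability OR Explainable AI" is scored 5 (some relevance) because the paper includes an occlusion-based interpretability analysis, although this is not its core focus. All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The study develops RICE-NET, a multimodal 3D deep learning model that distinguishes tumor recurrence from radiation-induced contrast enhancements in glioblastoma patients by integrating longitudinal MRI data with radiotherapy dose distributions. It achieves an F1 score of 0.92 on an independent test set.

Abstract translation (Chinese)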

区分治疗后胶质母细胞瘤患者的肿瘤复发与放射性对比剂增强仍是临床面临的主要挑战。现有方法依赖于临床中获取受限的扩散磁共振成像，或未纳入放射剂量分布图——后者在肿瘤委员会进行此类鉴别时正受到日益关注。本研究提出RICE-NET，一种多模态三维深度学习模型，该模型通过整合纵向磁共振成像数据与放射治疗剂量分布，利用常规T1加权磁共振成像数据实现自动化病灶分类。基于92例患者队列的验证，该模型在独立测试集上取得了0.92的F1分数。在大量消融实验中，我们量化了各时间点及各模态的贡献度，证明可靠的分类在很大程度上依赖于放射剂量分布图。基于遮挡的可解释性分析进一步证实了模型对临床相关区域的关注重点。这些发现凸显了多模态深度学习在提升神经肿瘤学诊断准确性及支持临床决策方面的潜力。

Abstract

The differentiation between tumor recurrence and radiation-induced contrast enhancements in post-treatment glioblastoma patients remains a major clinical challenge. Existing approaches rely on clinically sparsely available diffusion MRI or do not consider radiation maps, which are gaining increasing interest in the tumor board for this differentiation. We introduce RICE-NET, a multimodal 3D deep learning model that integrates longitudinal MRI data with radiotherapy dose distributions for automated lesion classification using conventional T1-weighted MRI data. Using a cohort of 92 patients, the model achieved an F1 score of 0.92 on an independent test set. During extensive ablation experiments, we quantified the contribution of each timepoint and modality and showed that reliable classification largely depends on the radiation map. Occlusion-based interpretability analyses further confirmed the model’s focus on clinically relevant regions. These findings highlight the potential of multimodal deep learning to enhance diagnostic accuracy and support clinical decision-making in neuro-oncology.

Keywords: deep learning, multimodal classification, glioblastoma, radiation-induced contrast enhancements, tumor recurrence, MRI, radiotherapy dose distributions, neuro-oncology
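
Occlusion-based interpretability, as mentioned in the abstract, can be sketched in a few lines: mask one region at a time and record how much the model's score drops. The example below is 2D for brevity and uses a hypothetical scoring function in place of a trained network. Patch size, fill value, and the toy "model" are all assumptions, not details from the paper.

```python
import numpy as np

def occlusion_map(score_fn, image, patch=4, fill=0.0):
    """Heat map of score drops when each patch is replaced by a constant value."""
    base = score_fn(image)
    heat = np.zeros_like(image, dtype=float)
    for r in range(0, image.shape[0], patch):
        for c in range(0, image.shape[1], patch):
            occluded = image.copy()
            occluded[r:r + patch, c:c + patch] = fill
            # A large drop means the model relied on this region.
            heat[r:r + patch, c:c + patch] = base - score_fn(occluded)
    return heat

def toy_score(img):
    """Hypothetical stand-in 'model' that responds only to the top-left 4x4 region."""
    return float(img[:4, :4].mean())

image = np.ones((8, 8))
heat = occlusion_map(toy_score, image)
```

In this toy case only the top-left patch receives a nonzero value. For a real classifier, one would check whether the high-drop regions coincide with the clinically relevant lesion area.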

186. ❌ CEI-3D: Collaborative Explicit-Implicit 3D Reconstruction for Realistic and Fine-Grained Object Editing

Authors: Yue Shi, Rui Shi, Yuxuan Xiong, Bingbing Ni, Wenjun Zhang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11810v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

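Scoring rationale: The paper focuses on the computer vision tasks of 3D reconstruction and editing, and proposes a collaborative explicit-implicit reconstruction method (CEI-3D) for realistic, fine-grained object editing. It covers specific techniques such as an SDF network, physical property disentangling, a dual-diffuse-albedo network, and a spatial-aware editing module. None of these relates to the scoring keywords, which concern large models, the principles behind deep learning techniques, or AI applications in science. The paper does not mention language models, model training or fine-tuning techniques, inference optimization, agent systems, or scientific AI applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the unrealistic and coarse results that existing 3D editing methods produce because of their deeply integrated reconstruction networks, the paper proposes CEI-3D, a collaborative explicit-implicit reconstruction pipeline. An implicit SDF network and an explicit handler-point representation provide mutual guidance between global structure and local edits. Combined with a physical properties disentangling module and a spatial-aware editing module, the method achieves more realistic and finer-grained editing than existing methods on real and synthetic datasets while reducing editing time.

Abstract translation (Chinese)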

现有三维编辑方法因其重建网络的高度集成特性，常产生不真实且粗糙的结果。为应对这一挑战，本文提出CEI-3D——一种面向编辑的重建流程，旨在实现真实且细粒度的编辑。具体而言，我们提出了一种协同显隐式重建方法，该方法通过隐式符号距离函数（SDF）网络与差异化采样、局部可控的操控点集合来表征目标物体。隐式网络提供平滑连续的几何先验，而显式操控点则提供局部控制能力，实现了全局三维结构与用户指定局部编辑区域间的相互引导。为独立控制操控点的各属性，我们设计了物理属性解耦模块，将操控点的颜色分解为独立的物理属性。该模块中还提出了双漫反射反照率网络，通过独立分支处理编辑区域与非编辑区域，从而避免编辑操作产生非预期的干扰。基于解耦属性后的协同显隐式重建表示，我们进一步提出空间感知编辑模块，支持对相关操控点进行部件级调整。该模块采用基于跨视图传播的三维分割策略，帮助用户高效编辑目标部件的指定物理属性。在真实与合成数据集上的大量实验表明，相较于现有最优方法，本方法能以更少的编辑时间获得更真实、更细粒度的编辑结果。代码已开源：https://github.com/shiyue001/CEI-3D。

Abstract

Existing 3D editing methods often produce unrealistic and unrefined results due to the deeply integrated nature of their reconstruction networks. To address the challenge, this paper introduces CEI-3D, an editing-oriented reconstruction pipeline designed to facilitate realistic and fine-grained editing. Specifically, we propose a collaborative explicit-implicit reconstruction approach, which represents the target object using an implicit SDF network and a differentially sampled, locally controllable set of handler points. The implicit network provides a smooth and continuous geometry prior, while the explicit handler points offer localized control, enabling mutual guidance between the global 3D structure and user-specified local editing regions. To independently control each attribute of the handler points, we design a physical properties disentangling module to decouple the color of the handler points into separate physical properties. We also propose a dual-diffuse-albedo network in this module to process the edited and non-edited regions through separate branches, thereby preventing undesired interference from editing operations. Building on the reconstructed collaborative explicit-implicit representation with disentangled properties, we introduce a spatial-aware editing module that enables part-wise adjustment of relevant handler points. This module employs a cross-view propagation-based 3D segmentation strategy, which helps users to edit the specified physical attributes of a target part efficiently. Extensive experiments on both real and synthetic datasets demonstrate that our approach achieves more realistic and fine-grained editing results than the state-of-the-art (SOTA) methods while requiring less editing time. Our code is available on https://github.com/shiyue001/CEI-3D.

Keywords: 3D reconstruction, object editing, explicit-implicit representation, SDF network, physical properties disentangling, spatial-aware editing, fine-grained editing, collaborative pipeline
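
To illustrate how an implicit signed distance function (SDF) and explicit points can relate, the sketch below uses an analytic sphere SDF in place of a learned network and moves a few points onto its zero level set. The analytic SDF, the projection step, and the point values are illustrative assumptions. The paper's handler points, sampling scheme, and networks are not reproduced here.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

def project_to_surface(points, center, radius):
    """Move points onto the zero level set along the (unit) SDF gradient direction."""
    offsets = points - center
    direction = offsets / np.linalg.norm(offsets, axis=-1, keepdims=True)
    distance = sphere_sdf(points, center, radius)
    return points - distance[..., None] * direction

center = np.zeros(3)
# Hypothetical explicit points: one outside the surface, one inside it.
handlers = np.array([[2.0, 0.0, 0.0], [0.0, 0.5, 0.0]])
on_surface = project_to_surface(handlers, center, radius=1.0)
```

The implicit function supplies a smooth global shape, while the explicit points give a finite set of locations that can be edited locally, which is the division of labor the abstract describes.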

187. ❌ A Diffeomorphism Groupoid and Algebroid Framework for Discontinuous Image Registration

Authors: Lili Bao, Bin Xiao, Shihui Ying, Stefan Sommer Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11806v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

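Scoring rationale: The paper focuses on a mathematical framework for image registration, proposing a new method based on diffeomorphism groupoids and algebroids for handling discontinuous sliding motion. Its content belongs entirely to computational mathematics and medical image processing, drawing on differential geometry, Lie groups, and partial differential equations. It has no connection to any of the scoring keywords, all of which concern large models and the principles and applications of deep learning. The paper does not mention artificial intelligence, machine learning, deep learning, or large language models, nor does it involve bioinformatics or cheminformatics applications.

!!! tip deepseek-chat TL;DR

The paper proposes a mathematical framework based on diffeomorphism groupoids and algebroids for discontinuous sliding image registration, which conventional continuous registration methods cannot handle, and validates its effectiveness through numerical experiments.

Abstract translation (Chinese)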

本文提出了一种基于微分同胚群胚与李代数胚方法的、处理非连续滑动运动的分片微分同胚图像配准新数学框架。传统的大变形微分同胚度量映射（Large Deformation Diffeomorphic Metric Mapping, LDDMM）配准方法建立在李群基础上，其假设速度场连续且光滑，这限制了该方法在处理非连续滑动运动时的适用性。为克服这一局限，我们将微分同胚李群扩展为非连续微分同胚李群胚框架，从而允许在滑动边界处出现非连续性，同时在均匀区域内保持微分同胚性质。我们对相关数学结构（包括李代数胚及其对偶）进行了严格分析，并推导了特定的欧拉-阿诺德方程（Euler-Arnold equations）以控制非连续变形的最优流。通过数值实验验证了所提方法的有效性。

Abstract

In this paper, we propose a novel mathematical framework for piecewise diffeomorphic image registration that involves discontinuous sliding motion using a diffeomorphism groupoid and algebroid approach. The traditional Large Deformation Diffeomorphic Metric Mapping (LDDMM) registration method builds on Lie groups, which assume continuity and smoothness in velocity fields, limiting its applicability in handling discontinuous sliding motion. To overcome this limitation, we extend the diffeomorphism Lie groups to a framework of discontinuous diffeomorphism Lie groupoids, allowing for discontinuities along sliding boundaries while maintaining diffeomorphism within homogeneous regions. We provide a rigorous analysis of the associated mathematical structures, including Lie algebroids and their duals, and derive specific Euler-Arnold equations to govern optimal flows for discontinuous deformations. Some numerical tests are performed to validate the efficiency of the proposed approach.

Keywords: image registration, diffeomorphism groupoid, Lie algebroid, discontinuous sliding motion, Euler-Arnold equations, piecewise diffeomorphic, LDDMM, mathematical framework
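
For orientation, the classical continuous LDDMM setting that the abstract contrasts against can be written as follows. A time-dependent velocity field $v_t$ generates the deformation $\varphi_t$, and the Euler-Arnold (EPDiff) equation governs the momentum $m_t = L v_t$ along optimal flows. Here $L$ denotes the differential operator defining the norm on velocity fields. This notation is the standard one and is an assumption about presentation; the groupoid version derived in the paper is not given in the abstract and is not reproduced here.

```latex
\begin{aligned}
\partial_t \varphi_t &= v_t \circ \varphi_t, \qquad \varphi_0 = \mathrm{id},\\
m_t &= L v_t,\\
\partial_t m_t + (v_t \cdot \nabla)\, m_t + (\nabla v_t)^{\top} m_t + (\nabla \cdot v_t)\, m_t &= 0 .
\end{aligned}
```

Every term assumes that $v_t$ is smooth across the whole domain, which is exactly the assumption that fails at a sliding boundary.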

188. ❌ OSM-based Domain Adaptation for Remote Sensing VLMs

Authors: Stefan Maria Ailuro, Mario Markov, Mohammad Mahdi, Delyan Boychev, Luc Van Gool, Danda Pani Paudel Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11804v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

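Scoring rationale: The paper proposes the OSMDA framework, which performs domain adaptation for remote sensing by combining OpenStreetMap data with vision-language models (VLMs). Its core involves domain adaptation and fine-tuning, making it highly relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points) and relevant to "Post-training OR Supervised Fine-tuning OR SFT" (8 points). It uses a base VLM, which falls under large models, so it is relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points). Remote sensing is a scientific application area, so it is relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points). Other keywords, such as MoE, SLMs, Scaling Laws, RLHF, RAG, reasoning methods, agents, and compression, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper tackles the reliance of remote sensing vision-language models (VLMs) on large teacher models for pseudo-labeling, a reliance caused by the scarcity of high-quality annotations. It proposes OSMDA, a self-contained domain adaptation framework based on OpenStreetMap (OSM): the base VLM's OCR and chart comprehension capabilities are used to generate annotations, and the model is then fine-tuned using satellite imagery alone. This achieves domain adaptation without manual labeling or a stronger external model, reaching state-of-the-art performance on multiple benchmarks at substantially lower training cost.

Abstract translation (Chinese)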

适应遥感领域的视觉语言模型（VLMs）高度依赖特定领域的图像-文本监督数据，然而卫星与航空影像的高质量标注仍然稀缺且制作成本高昂。主流的伪标注流程通过从大型前沿模型中蒸馏知识来弥补这一缺口，但这种对大型教师模型的依赖成本高昂、限制了可扩展性，并将可达到的性能上限约束在教师模型水平。我们提出OSMDA：一种自包含的领域自适应框架，以消除这种依赖性。我们的核心见解是，一个具备良好能力的基础VLM可以充当自身的标注引擎：通过将航空影像与渲染的OpenStreetMap（OSM）地图瓦片配对，我们利用模型的光学字符识别和图表理解能力，生成由OSM海量辅助元数据增强的描述文本。随后，模型仅使用卫星影像在生成的语料库上进行微调，从而得到OSMDA-VLM——一个无需人工标注且无需依赖更强外部模型的领域自适应VLM。我们在涵盖图像-文本到文本任务的10个基准测试上进行了全面评估，并与9个竞争性基线方法进行了比较。在与真实数据等量混合时，我们的方法取得了最先进的性能，同时训练成本显著低于依赖教师模型的替代方案。这些结果表明，在拥有强大基础模型的前提下，与众包地理数据对齐是实现遥感领域自适应的一条实用且可扩展的路径。数据集与模型权重将公开提供。

Abstract

Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage optical character recognition and chart comprehension capabilities of the model to generate captions enriched by OSM’s vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and comparing against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.

Keywords: Vision-Language Models, Domain Adaptation, Remote Sensing, OpenStreetMap, Fine-tuning, Satellite Imagery, Self-contained Framework, Pseudo-labeling
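
Pairing an aerial image with an OSM tile requires mapping a geographic coordinate to a tile index. The sketch below uses the standard Web Mercator "slippy map" tile formula and the public OpenStreetMap tile URL pattern. How the paper actually renders or retrieves its tiles is not stated in the abstract, so the zoom level, URL template, and function names here are assumptions for illustration.

```python
import math

def latlon_to_tile(lat_deg, lon_deg, zoom):
    """Web Mercator ("slippy map") tile index for a WGS84 coordinate."""
    n = 2 ** zoom
    lat = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat)) / math.pi) / 2.0 * n)
    return x, y

def tile_url(x, y, zoom, template="https://tile.openstreetmap.org/{z}/{x}/{y}.png"):
    """Fill a tile URL template; the default template is an ASSUMED tile source."""
    return template.format(z=zoom, x=x, y=y)

x, y = latlon_to_tile(0.0, 0.0, zoom=1)
url = tile_url(x, y, zoom=1)
```

With the tile index in hand, an aerial image covering the same footprint can be matched to its map tile, which is the image pairing the abstract describes.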

189. ❌ Intrinsic Concept Extraction Based on Compositional Interpretability

Authors: Hanyu Shi, Hong Tao, Guoheng Huang, Jianbin Jiang, Xuhang Chen, Chi-Man Pun, Shanhu Wang, Pan Pan Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11795v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
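
The arithmetic behind the table can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight) and the 15-point cap per keyword come from the report's configuration section. The per-keyword scoring rule is not stated anywhere in the report, so `keyword_score` below is an assumption made only for illustration, as are all function names.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap from the report configuration


def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMED rule (not stated in the report): weight * relevance/10 * cap."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def total_score(rows):
    """Sum keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(w, r) for w, r in rows)


# The table above: 27 rows with weight 0.0, one of them with relevance 8.0.
rows = [(0.0, 0.0)] * 26 + [(0.0, 8.0)]
print(total_score(rows))          # 0.0
print(round(pass_mark(27.0), 1))  # 26.6
```

Because every weight in the table is 0.0, any multiplicative rule gives a total of 0.0, which matches the reported score. Under the assumed rule, a weight of 1.0 would turn the single 8.0/10 relevance into 12 points, still below the 26.6 pass mark.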

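Scoring rationale: The paper focuses on unsupervised concept extraction in computer vision. It proposes the CI-ICE task and the HyperExpress method, which use hyperbolic space and diffusion models for concept disentanglement and composition. It is unrelated to almost all of the large-model keywords (e.g. LLMs, MoE, RLHF, RAG). The only relevant keyword is 'Mechanistic Interpretability OR Explainable AI' (8 points), because the paper explicitly emphasizes the interpretability and composability of concepts, which falls under explainable AI, though not the interpretability of large models. None of the other keywords apply: the paper does not involve large models or innovative applications of deep learning to science.

!!! tip deepseek-chat TL;DR

This paper proposes HyperExpress, a method for extracting composable, interpretable intrinsic concepts from a single image, addressing the inability of existing unsupervised concept extraction methods to extract composable concepts.

Abstract (Chinese translation)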

无监督概念提取旨在从单张图像中提取概念，然而现有方法存在无法提取可组合本质概念的缺陷。为解决这一问题，本文提出了一项名为“组合式可解释本质概念提取”的新任务。该任务旨在利用基于扩散的文本到图像模型，从单张图像中提取可组合的对象级与属性级概念，使得原始概念能够通过这些概念的组合进行重建。为实现这一目标，我们提出名为HyperExpress的方法，该方法通过两个核心层面解决组合式可解释本质概念提取任务：首先，我们提出一种利用双曲空间固有层次建模能力的概念学习方法，在保持概念间层次结构与关系依赖的同时实现精确的概念解耦；其次，我们引入一种概念级优化方法，将概念嵌入空间映射至既能维持复杂概念间关系、又能确保概念可组合性的表征空间。我们的方法在从单张图像中提取组合式可解释本质概念方面展现出卓越性能。

摘要 (Abstract)

Unsupervised Concept Extraction aims to extract concepts from a single image; however, existing methods suffer from the inability to extract composable intrinsic concepts. To address this, this paper introduces a new task called Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE). The CI-ICE task aims to leverage diffusion-based text-to-image models to extract composable object-level and attribute-level concepts from a single image, such that the original concept can be reconstructed through the combination of these concepts. To achieve this goal, we propose a method called HyperExpress, which addresses the CI-ICE task through two core aspects. Specifically, first, we propose a concept learning approach that leverages the inherent hierarchical modeling capability of hyperbolic space to achieve accurate concept disentanglement while preserving the hierarchical structure and relational dependencies among concepts; second, we introduce a concept-wise optimization method that maps the concept embedding space to maintain complex inter-concept relationships while ensuring concept composability. Our method demonstrates outstanding performance in extracting compositionally interpretable intrinsic concepts from a single image.
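
To illustrate why hyperbolic space suits hierarchies, here is a minimal sketch of the distance function in the Poincaré ball. The abstract does not say which hyperbolic model HyperExpress uses, so the Poincaré ball is an assumption, and `poincare_distance` is a hypothetical helper rather than the authors' code.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    def sq_norm(x):
        return sum(c * c for c in x)

    diff = [a - b for a, b in zip(u, v)]
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * sq_norm(diff) / denom)


root = [0.0, 0.0]     # a general concept near the origin
leaf_a = [0.9, 0.0]   # two specific concepts near the boundary
leaf_b = [0.0, 0.9]

print(poincare_distance(root, leaf_a))
print(poincare_distance(leaf_a, leaf_b))
```

Distances grow rapidly toward the boundary, so two "leaf" points end up much farther from each other than either is from the root. This tree-like geometry is the usual motivation for embedding hierarchies in hyperbolic space.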

Keywords: Unsupervised Concept Extraction, Compositional Interpretability, Intrinsic Concept Extraction, Hyperbolic Space, Diffusion Models, Concept Disentanglement, Concept Composability, Single Image Analysis

190. ❌ Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

Authors: Chenyangguang Zhang, Botao Ye, Boqi Chen, Alexandros Delitzas, Fangjinhua Wang, Marc Pollefeys, Xi Wang Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11755v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

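Scoring rationale: The paper focuses on egocentric video generation in computer vision and graphics, in particular hand-motion control. It proposes a new framework that uses sparse 3D hand joints as control signals and addresses occlusion. Although the paper involves deep learning (possibly generative models such as diffusion models), its core content is unrelated to all of the scoring keywords, which center on large language models and their associated techniques, applications, or evaluation. The keyword list does not cover computer vision, video generation, 3D reconstruction, hand pose estimation, or robotics. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new framework that uses sparse 3D hand joints as control signals, resolving the motion inconsistencies and artifacts caused by occlusion in egocentric video generation and achieving cross-embodiment generalization to robotic hands.

Abstract (Chinese translation)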

运动可控的视频生成对于虚拟现实与具身人工智能中的第一人称应用至关重要。然而，现有方法往往难以实现三维一致性的精细手部关节运动。这些方法基于二维轨迹或隐式姿态，将三维几何结构坍缩为空间模糊的信号，或过度依赖以人为中心的先验知识。在严重的第一人称遮挡情况下，这会导致运动不一致和虚假伪影，并阻碍向机器人手部进行跨具身泛化。为解决这些局限，我们提出了一种新颖框架，该框架从单张参考帧生成第一人称视频，利用稀疏的三维手部关节作为与具体形态无关的控制信号，这些信号具有清晰的语义和几何结构。我们引入了一个高效的控制模块，该模块在完整保留三维信息的同时解决遮挡模糊性问题。具体而言，它通过惩罚来自隐藏关节的不可靠视觉信号，从源参考帧中提取遮挡感知特征，并采用基于三维的加权机制，在运动传播过程中鲁棒地处理动态遮挡的目标关节。同时，该模块直接将三维几何嵌入注入潜在空间，以严格保证结构一致性。为促进鲁棒的训练与评估，我们开发了一套自动化标注流程，生成了超过一百万段高质量的第一人称视频片段，并配有精确的手部轨迹数据。此外，我们通过配准类人运动学与相机数据，构建了一个跨具身基准测试集。大量实验表明，我们的方法显著优于现有先进基线，能够生成具有逼真交互的高保真第一人称视频，并在向机器人手部进行跨具身泛化方面表现出卓越性能。

摘要 (Abstract)

Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts, and prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.

Keywords: egocentric video generation, 3D hand joints, occlusion-aware, motion control, cross-embodiment generalization, robotic hands, sparse control signals, video synthesis

191. ❌ SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory

Authors: Dingcheng Zhen, Xu Zheng, Ruixin Zhang, Zhiqi Jiang, Yichao Yan, Ming Tao, Shunshun Yin Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11746v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

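Scoring rationale: The paper focuses on video synthesis and human animation generation, proposing improvements to autoregressive (AR) diffusion models, namely Neighbor Forcing and the ConvKV memory mechanism. All keywords relate to large language models (LLMs) and their associated techniques (fine-tuning, alignment, reasoning, agents, etc.) or to specific AI-for-science applications (e.g. bioinformatics). The paper does not mention large language models, innovations in deep-learning fundamentals, or scientific applications; its core is efficiency optimization of diffusion models for video generation, which is unrelated to the keyword list.

!!! tip deepseek-chat TL;DR

The paper addresses the unstable training and inefficient inference of autoregressive diffusion models in hour-scale real-time human animation, which stem from mismatched diffusion states and unbounded growth of historical representations. By proposing Neighbor Forcing and the ConvKV memory mechanism, it significantly improves training convergence, generation quality, and inference efficiency, enabling real-time streaming inference on a small number of GPUs.

Abstract (Chinese translation)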

自回归（AR）扩散模型通过将扩散建模与因果推断相结合，为视频合成等序列生成任务提供了一个前景广阔的框架。尽管它们支持流式生成，但现有的AR扩散方法难以实现高效扩展。本文针对小时级实时人体动画生成指出了两大关键挑战：首先，多数强制传播策略在扩散状态不匹配的情况下传递样本级表征，导致学习信号不一致且收敛不稳定；其次，历史表征无限制增长且缺乏结构，阻碍了缓存状态的有效复用，严重限制了推理效率。为解决这些问题，我们提出邻域强制传播策略——一种扩散步骤一致的自回归建模方法，其在相同噪声条件下将时序相邻帧作为潜在邻域进行传播。该设计在保持自回归链中漂移特性的同时，提供了分布对齐的稳定学习信号。在此基础上，我们进一步提出结构化卷积键值记忆机制，将因果注意力中的键与值压缩为固定长度的表征，实现了恒定内存推理和真正无限长度的视频生成，且无需依赖短期运动帧记忆。大量实验表明，相较于现有AR扩散方法，我们的方案显著提升了训练收敛速度、小时级生成质量与推理效率。数值实验显示，LiveAct系统仅需两块NVIDIA H100或H200 GPU即可实现小时级实时人体动画生成，并支持20 FPS的实时流式推理。定量结果表明，本方法在唇形同步精度、人体动画质量与情感表现力方面均达到最先进水平，同时保持着最低的推理成本。

摘要 (Abstract)

Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preserving drifting throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.
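
The idea of a fixed-length key/value memory can be sketched in a few lines. The abstract gives no detail on how ConvKV is built, so everything below is an assumption: the fixed averaging kernel, the stride-equals-kernel compression, and the class and function names are hypothetical, and a real system would presumably use learned convolutions over tensors.

```python
def conv_compress(seq, kernel=(0.5, 0.5)):
    """Strided 1D convolution (stride = kernel size) over a list of vectors."""
    k = len(kernel)
    dim = len(seq[0])
    out = []
    for i in range(0, len(seq) - k + 1, k):
        window = seq[i:i + k]
        out.append([sum(w * vec[d] for w, vec in zip(kernel, window))
                    for d in range(dim)])
    rem = len(seq) % k
    if rem:
        out.extend(seq[-rem:])  # keep trailing entries that fill no window
    return out


class FixedLengthKVMemory:
    """Hypothetical cache whose length never exceeds max_len."""

    def __init__(self, max_len):
        self.max_len = max_len
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        # Compress keys and values together so they stay aligned.
        while len(self.keys) > self.max_len:
            self.keys = conv_compress(self.keys)
            self.values = conv_compress(self.values)


memory = FixedLengthKVMemory(max_len=4)
for t in range(1000):
    memory.append([float(t)], [float(t)])
print(len(memory.keys), len(memory.values))
```

However many frames are appended, the cache stays within `max_len` entries, which is the property that makes constant-memory streaming inference possible.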

Keywords: autoregressive diffusion models, real-time human animation, Neighbor Forcing, ConvKV memory, video synthesis, streaming generation, inference efficiency, causal attention

192. ❌ Cross-Resolution Attention Network for High-Resolution PM2.5 Prediction

Authors: Ammar Kheder, Helmi Toropainen, Wenqing Peng, Samuel Antão, Zhi-Song Liu, Michael Boy Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11725v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

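Scoring rationale: The paper focuses on high-resolution PM2.5 prediction with a Vision Transformer, an application of AI to environmental science. It is unrelated to the vast majority of keywords (LLM techniques, training methods, inference optimization, agents, etc.), which mainly target text-based large models. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to environmental science, which can be seen as a subfield of scientific applications, but it is not core bioinformatics or cheminformatics, hence 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes CRAN-PM, a dual-branch Vision Transformer that uses cross-resolution attention to efficiently fuse global meteorological data with local high-resolution PM2.5 data for daily PM2.5 forecasting across Europe. In the 2022 evaluation it reduced forecast error, and bias in complex terrain, relative to the best single-scale baseline.

Abstract (Chinese translation)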

视觉Transformer在时空预测领域取得了显著成功，但其可扩展性在实际环境监测所需的超高分辨率、大陆尺度应用中仍受限制。单张1公里分辨率的欧洲空气质量地图包含2900万像素，远超原始自注意力机制的处理极限。本文提出CRAN-PM——一种双分支视觉Transformer，通过跨分辨率注意力机制高效融合全球气象数据（25公里分辨率）与当前时刻局部高分辨率PM2.5数据（1公里分辨率）。不同于将温度、地形等物理驱动因子直接作为输入，我们进一步引入高程感知自注意力与风场引导交叉注意力机制，迫使网络学习符合物理规律的PM2.5预测特征表示。CRAN-PM具备完全可训练性与内存高效性，在单GPU上仅需1.8秒即可生成完整的2900万像素欧洲地图。基于2022年全年（362天）欧洲范围每日PM2.5预测评估（涵盖2971个欧洲环境署监测站点），该模型在T+1时刻将均方根误差降低4.7%，在T+3时刻降低10.7%，同时在复杂地形区域将预测偏差减少36%，显著优于最佳单尺度基线模型。

摘要 (Abstract)

Vision Transformers have achieved remarkable success in spatio-temporal prediction, but their scalability remains limited for ultra-high-resolution, continent-scale domains required in real-world environmental monitoring. A single European air-quality map at 1 km resolution comprises 29 million pixels, far beyond the limits of naive self-attention. We introduce CRAN-PM, a dual-branch Vision Transformer that leverages cross-resolution attention to efficiently fuse global meteorological data (25 km) with local high-resolution PM2.5 at the current time (1 km). Instead of including physically driven factors like temperature and topography as input, we further introduce elevation-aware self-attention and wind-guided cross-attention to force the network to learn physically consistent feature representations for PM2.5 forecasting. CRAN-PM is fully trainable and memory-efficient, generating the complete 29-million-pixel European map in 1.8 seconds on a single GPU. Evaluated on daily PM2.5 forecasting throughout Europe in 2022 (362 days, 2,971 European Environment Agency (EEA) stations), it reduces RMSE by 4.7% at T+1 and 10.7% at T+3 compared to the best single-scale baseline, while reducing bias in complex terrain by 36%.
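
A quick calculation shows why 29 million pixels are out of reach for naive self-attention. The sketch assumes one dense attention matrix stored in 16-bit floats. Both the one-token-per-pixel case and the 16×16 patch size are illustrative assumptions, not figures from the paper.

```python
def attention_matrix_bytes(num_tokens: int, bytes_per_entry: int = 2) -> int:
    """Memory for one dense num_tokens x num_tokens attention matrix."""
    return num_tokens * num_tokens * bytes_per_entry


pixels = 29_000_000  # map size quoted in the abstract

# One token per pixel (assumption): quadratic cost in the pixel count.
per_pixel = attention_matrix_bytes(pixels)
print(per_pixel / 1e15)  # 1.682 (petabytes)

# Hypothetical 16x16 patches: about 113k tokens.
per_patch = attention_matrix_bytes(pixels // (16 * 16))
print(per_patch / 1e9)   # roughly 25.7 (gigabytes)
```

Even with patching, a single attention matrix for one head in one layer would take tens of gigabytes, so fusing a coarse global branch with a local high-resolution branch is a natural way to keep the cost manageable.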

Keywords: Vision Transformer, PM2.5 prediction, cross-resolution attention, high-resolution, environmental monitoring, spatio-temporal prediction, air-quality map, memory-efficient

193. ❌ VTEdit-Bench: A Comprehensive Benchmark for Multi-Reference Image Editing Models in Virtual Try-On

Authors: Xiaoye Liang, Zhiyuan Qu, Mingye Zou, Jiaxin Liu, Lai Jiang, Mai Xu, Yiheng Zhu Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11734v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

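Scoring rationale: The paper focuses on virtual try-on (VTON) in computer vision, proposing a benchmark for evaluating universal multi-reference image editing models (VTEdit-Bench) and a vision-language model (VLM)-based evaluator (VTEdit-QA). It covers image editing, benchmarking, and model evaluation, whereas all of the given keywords relate directly to specific techniques: LLM technology, training methods, inference optimization, agent systems, AI-for-science applications, and so on. The paper's abstract and title mention no LLM, deep-learning fundamentals, or scientific application, nor any of the specific technical terms in the keyword list. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

Addressing the lack of a systematic evaluation benchmark for virtual try-on, the paper proposes the VTEdit-Bench benchmark and the VTEdit-QA evaluator for assessing universal multi-reference image editing models. It finds that top universal editing models are competitive on conventional tasks but still struggle with complex reference configurations.

Abstract (Chinese translation)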

随着虚拟试穿技术的持续发展，越来越多的现实场景不断涌现，已超越现有专用虚拟试穿模型的能力范围。与此同时，通用多参考图像编辑模型进展迅速，在视觉编辑中展现出强大的泛化能力，这为构建更灵活的虚拟试穿系统提供了可行路径。然而，尽管这些通用编辑器功能强大，但由于缺乏系统性的评估基准，它们在虚拟试穿任务中的优势与局限仍未得到充分探索。为填补这一空白，我们提出了VTEdit-Bench——一个旨在评估通用多参考图像编辑模型在多样化真实虚拟试穿场景中性能的综合基准。该基准包含24,220对测试图像，涵盖五个具有代表性的虚拟试穿任务，其复杂度逐级递增，从而支持对模型鲁棒性与泛化能力的系统分析。我们进一步提出VTEdit-QA，这是一个基于参考感知视觉语言模型的评估器，可从模型一致性、服装一致性和整体图像质量三个关键维度评估虚拟试穿效果。通过该框架，我们系统评估了八种通用编辑模型，并将其与七种专用虚拟试穿模型进行比较。结果表明，顶尖的通用编辑器在常规任务上具有竞争力，且在更复杂场景中表现出更稳定的泛化能力，但在处理复杂参考配置（尤其是多服装条件控制）时仍面临挑战。

摘要 (Abstract)

As virtual try-on (VTON) continues to advance, a growing number of real-world scenarios have emerged, pushing beyond the ability of the existing specialized VTON models. Meanwhile, universal multi-reference image editing models have progressed rapidly and exhibit strong generalization in visual editing, suggesting a promising route toward more flexible VTON systems. However, despite their strong capabilities, the strengths and limitations of universal editors for VTON remain insufficiently explored due to the lack of systematic evaluation benchmarks. To address this gap, we introduce VTEdit-Bench, a comprehensive benchmark designed to evaluate universal multi-reference image editing models across various realistic VTON scenarios. VTEdit-Bench contains 24,220 test image pairs spanning five representative VTON tasks with progressively increasing complexity, enabling systematic analysis of robustness and generalization. We further propose VTEdit-QA, a reference-aware VLM-based evaluator that assesses VTON performance from three key aspects: model consistency, cloth consistency, and overall image quality. Through this framework, we systematically evaluate eight universal editing models and compare them with seven specialized VTON models. Results show that top universal editors are competitive on conventional tasks and generalize more stably to harder scenarios, but remain challenged by complex reference configurations, particularly multi-cloth conditioning.

Keywords: virtual try-on, multi-reference image editing, benchmark evaluation, VTON tasks, generalization analysis, model consistency, cloth consistency, image quality assessment

194. ❌ PolyCrysDiff: Controllable Generation of Three-Dimensional Computable Polycrystalline Material Structures

Authors: Chi Chen, Tianle Jiang, Xiaodong Wei, Yanming Wang Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11695v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

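Scoring rationale: PolyCrysDiff focuses on materials science, proposing a conditional latent diffusion framework for generating 3D polycrystalline microstructures. The work is an application of AI to science (AI for Science), specifically materials science and computational materials science, and is highly relevant to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). However, the paper does not involve large language models (LLMs), innovations in deep-learning fundamentals (e.g. MoE, Scaling Laws, Pre-training, RLHF), reasoning techniques (e.g. CoT, System 2 Thinking), agent systems (LLM Agents), model optimization (Quantization, Speculative Decoding), or any other keyword, so all other keywords receive 0.

!!! tip deepseek-chat TL;DR

The paper tackles the difficulty of constructing 3D polycrystalline microstructures realistically and controllably. It proposes PolyCrysDiff, a conditional latent diffusion framework that generates computable microstructures end to end, verifies their physical validity through simulation, and systematically clarifies how microstructural features affect the mechanical properties of the material.

Abstract (Chinese translation)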

多晶材料的三维微观结构对其力学与物理性能具有关键影响。真实、可控地构建这些微观结构是阐明结构-性能关系的关键步骤，但目前仍面临巨大挑战。本文提出PolyCrysDiff——一种基于条件潜在扩散的框架，能够端到端生成可计算的三维多晶微观结构。综合定性与定量评估表明，PolyCrysDiff能够准确复现目标晶粒形貌、取向分布及三维空间关联，同时在晶粒属性（如尺寸与球形度）控制上达到$R^2$超过0.972的精度，其性能优于基于马尔可夫随机场（Markov random field, MRF）和卷积神经网络（convolutional neural network, CNN）的主流方法。通过一系列晶体塑性有限元法（crystal plasticity finite element method, CPFEM）模拟，验证了生成微观结构的可计算性与物理有效性。借助PolyCrysDiff的可控生成能力，我们系统阐明了晶粒尺度微观结构特征如何影响多晶材料的力学性能。这一进展有望为加速、数据驱动的多晶材料优化与设计铺平关键道路。

摘要 (Abstract)

The three-dimensional (3D) microstructures of polycrystalline materials exert a critical influence on their mechanical and physical properties. Realistic, controllable construction of these microstructures is a key step toward elucidating structure-property relationships, yet remains a formidable challenge. Herein, we propose PolyCrysDiff, a framework based on conditional latent diffusion that enables the end-to-end generation of computable 3D polycrystalline microstructures. Comprehensive qualitative and quantitative evaluations demonstrate that PolyCrysDiff faithfully reproduces target grain morphologies, orientation distributions, and 3D spatial correlations, while achieving an $R^2$ over 0.972 on grain attributes (e.g., size and sphericity) control, thereby outperforming mainstream approaches such as Markov random field (MRF)- and convolutional neural network (CNN)-based methods. The computability and physical validity of the generated microstructures are verified through a series of crystal plasticity finite element method (CPFEM) simulations. Leveraging PolyCrysDiff’s controllable generative capability, we systematically elucidate how grain-level microstructural characteristics affect the mechanical properties of polycrystalline materials. This development is expected to pave a key step toward accelerated, data-driven optimization and design of polycrystalline materials.
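
The $R^2$ figure quoted for attribute control is the coefficient of determination between the requested attribute values and those measured on the generated structures. A minimal sketch of that metric follows. The sample numbers are made up for illustration and are not data from the paper.

```python
def r_squared(targets, achieved):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - a) ** 2 for t, a in zip(targets, achieved))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# Hypothetical requested vs. measured mean grain sizes (arbitrary units).
targets = [10.0, 20.0, 30.0, 40.0]
achieved = [11.0, 19.0, 31.0, 39.0]
print(r_squared(targets, achieved))  # approximately 0.992
```

A value of 1.0 means the generated attribute matches the requested one exactly, while 0.0 means the generator does no better than always producing the mean target value.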

Keywords: polycrystalline materials, 3D microstructure generation, conditional latent diffusion, grain morphology, crystal plasticity finite element method, structure-property relationships, controllable generation, computable microstructures

195. ❌ COTONET: A custom cotton detection algorithm based on YOLO11 for stage of growth cotton boll detection

Authors: Guillem González, Guillem Alenyà, Sergi Foix Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11717v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

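Scoring rationale: The paper focuses on a computer-vision application in agriculture, proposing COTONET, a YOLO11-based cotton detection algorithm for detecting cotton growth stages. Its core is architectural improvement of an object detection model (attention mechanisms, CARAFE upsampling, etc.), an application of deep learning to a specific domain (agriculture). All keywords relate to large language models (LLMs), whereas this paper involves no LLMs, language models, reasoning, alignment, fine-tuning, agents, quantization, or other LLM-related techniques. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since agriculture can be regarded as a broad 'AI for Science' application area, but it is not a core match (the paper does not explicitly mention scientific discovery or bioinformatics), hence 5 points (some relevance). The other keywords are unrelated to the paper and all receive 0.

!!! tip deepseek-chat TL;DR

To address the risk of fibre quality damage during cotton harvesting, the study proposes COTONET, an improved YOLO11-based algorithm for detecting cotton bolls by growth stage. By introducing several attention mechanisms and feature reassembly, it outperforms the standard YOLO baselines (mAP50 of 81.1%) while keeping computational requirements low.

Abstract (Chinese translation)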

棉花采收是棉花铃受到物理操作并可能导致纤维品质下降的关键阶段。为维持最高品质，采收方法必须模拟精细的人工抓取，以保持棉花的固有特性。实现这一过程的自动化需要能够识别不同物候期棉花铃的系统。为应对这一挑战，我们提出了 COTONET，这是一种增强型定制 YOLO11 模型，专门配备了注意力机制以改进困难实例的检测。该架构在不可学习的操作中引入梯度以增强形状和特征提取能力。关键的架构改进包括：用挤压-激励模块（Squeeze-and-Excitation blocks）替换卷积块、重新设计了集成注意力机制的主干网络，以及用内容感知特征重组（Content Aware Reassembly of Features, CARAFE）替代标准上采样操作。此外，我们集成了简单注意力模块（Simple Attention Modules, SimAM）用于初级特征聚合，并在下采样路径中采用并行混合注意力机制（Parallel Hybrid Attention Mechanisms, PHAM）以实现通道、空间和坐标维度的注意力。这种配置为解析棉花作物生长的复杂性提供了更高的灵活性和鲁棒性。COTONET 属于中小型 YOLO 模型，参数量为 7.6M，计算量为 27.8 GFLOPS，适用于资源受限的边缘计算和移动机器人平台。COTONET 性能优于标准 YOLO 基线模型，其 mAP50 达到 81.1%，mAP50-95 达到 60.6%。

摘要 (Abstract)

Cotton harvesting is a critical phase in which cotton capsules are physically manipulated, which can lead to fibre degradation. To maintain the highest quality, harvesting methods must emulate delicate manual grasping to preserve cotton’s intrinsic properties. Automating this process requires systems capable of recognising cotton capsules across various phenological stages. To address this challenge, we propose COTONET, an enhanced custom YOLO11 model tailored with attention mechanisms to improve the detection of difficult instances. The architecture incorporates gradients in non-learnable operations to enhance shape and feature extraction. Key architectural modifications include the replacement of convolutional blocks with Squeeze-and-Excitation blocks, a redesigned backbone integrating attention mechanisms, and the replacement of standard upsampling operations with Content Aware Reassembly of Features (CARAFE). Additionally, we integrate Simple Attention Modules (SimAM) for primary feature aggregation and Parallel Hybrid Attention Mechanisms (PHAM) for channel-wise, spatial-wise and coordinate-wise attention in the downward neck path. This configuration offers increased flexibility and robustness for interpreting the complexity of cotton crop growth. COTONET aligns with small-to-medium YOLO models, using 7.6M parameters and 27.8 GFLOPS, making it suitable for low-resource edge computing and mobile robotics. COTONET outperforms the standard YOLO baselines, achieving a mAP50 of 81.1% and a mAP50-95 of 60.6%.
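
For readers unfamiliar with Squeeze-and-Excitation, the block re-weights feature channels in three steps: global average pooling (squeeze), a small two-layer gate (excitation), and per-channel scaling. The sketch below is the generic SE computation in plain Python on nested lists, with biases omitted for brevity. It is not COTONET's implementation, and the weights and shapes are hypothetical.

```python
import math


def squeeze_excite(x, w1, w2):
    """Generic SE block. x: [C][H][W]; w1: [C//r][C]; w2: [C][C//r]."""
    # Squeeze: global average pool, one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: linear -> ReLU -> linear -> sigmoid gate per channel.
    h = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(row, h))))
         for row in w2]
    # Scale: multiply every value in channel c by its gate s[c].
    return [[[v * s[c] for v in row] for row in ch] for c, ch in enumerate(x)]


# Two channels of a 2x2 feature map, reduction ratio r = 2 (all hypothetical).
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[0.5, -0.5]]      # shape [1][2]
w2 = [[1.0], [-1.0]]    # shape [2][1]
print(squeeze_excite(x, w1, w2))
```

The gate lets the network emphasize informative channels and suppress others at very small parameter cost, which fits the abstract's emphasis on staying within a small-to-medium model budget.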

Keywords: cotton detection, YOLO11, attention mechanisms, edge computing, agricultural robotics, object detection, low-resource, CARAFE

196. ❌ UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution

Authors: Cao Thien Tan, Phan Thi Thu Trang, Do Nghiem Duc, Ho Ngoc Anh, Hanyang Zhuang, Nguyen Duc Dung Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11680v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

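Rationale: UCAN addresses image super-resolution, a computer vision task, and proposes a lightweight network architecture that combines convolution and attention. Although the paper uses an attention mechanism (Hedgehog Attention), it is spatial attention for computer vision, which is entirely different from the large language model (LLM) techniques used in natural language processing. All scoring keywords target LLMs, the technical principles of deep learning, or applications of AI in science, whereas this paper studies convolutional neural networks and visual attention for image processing, so it is unrelated to all of them.

!!! tip deepseek-chat TL;DR

The paper proposes UCAN, a lightweight unified convolutional attention network that efficiently expands the receptive field for image super-resolution, substantially reducing computational complexity while maintaining high accuracy.

Abstract (Chinese translation)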

混合CNN-Transformer架构在图像超分辨率任务中取得了优异性能，但扩大注意力窗口或卷积核会显著增加计算成本，限制了其在资源受限设备上的部署。我们提出UCAN——一种通过统一卷积与注意力机制来高效扩展有效感受野的轻量级网络。UCAN结合基于窗口的空间注意力与刺猬注意力（Hedgehog Attention）机制，共同建模局部纹理与长程依赖关系，并引入基于蒸馏的大核模块以在避免沉重计算负担的同时保持高频结构。此外，我们采用跨层参数共享策略进一步降低模型复杂度。在Manga109数据集（$4\times$超分）上，UCAN-L仅以48.4G乘加运算量即达到31.63 dB的峰值信噪比（PSNR），超越了近期轻量级模型。在BSDS100数据集上，UCAN取得27.79 dB的指标，其性能优于参数量显著更大的方法。大量实验表明，UCAN在精度、效率与可扩展性之间实现了更优的平衡，使其特别适用于实际高分辨率图像复原场景。

Abstract

Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.

Keywords: lightweight super-resolution, convolutional attention network, receptive field expansion, Hedgehog Attention, parameter sharing, computational efficiency, image restoration, hybrid CNN-Transformer

Authors: Haohua Chen, Tianze Zhou, Wei Zhu, Runqi Wang, Yandong Guan, Dejia Song, Yibo Chen, Xu Tang, Yao Hu, Lu Sheng, Zhiyong Wu Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11675v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

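Rationale: PROMO addresses virtual try-on (VTON) using a Flow Matching DiT backbone with latent multi-modal conditional concatenation, and belongs to computer vision and image generation. All scoring keywords target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on), whereas this paper involves no language models, text processing, or LLM techniques, and is not applied to a scientific domain such as bioinformatics. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes the PROMO framework, which uses a Flow Matching DiT backbone with latent multi-modal conditional concatenation to address the trade-off between high fidelity and efficient inference in virtual try-on, and surpasses existing methods on standard benchmarks.

Abstract (Chinese translation)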

虚拟试穿技术已成为在线零售的核心能力，其逼真的试穿效果能为消费者提供可靠的合身参考、降低退货率，使买卖双方共同受益。基于扩散模型的虚拟试穿方法虽能实现照片级真实感的合成，但通常依赖复杂的架构（如辅助参考网络）且采样速度缓慢，导致保真度与效率之间的权衡成为长期挑战。本文将虚拟试穿视为结构化图像编辑问题，该问题需在三个关键要求下实现强条件生成：主体保持、精准纹理迁移与无缝融合。基于此视角，我们的训练框架具有通用性，可迁移至更广泛的图像编辑任务。此外，虚拟试穿生成的配对数据为训练通用编辑模型提供了丰富的监督资源。我们提出PROMO——一个基于流匹配扩散Transformer主干网络并结合潜在多模态条件拼接的可提示虚拟试穿框架。通过利用条件编码效率与自参考机制，我们的方法显著降低了推理开销。在标准测试集上，PROMO在视觉保真度方面超越了先前的虚拟试穿方法与通用图像编辑模型，同时在质量与速度间实现了优越的平衡。这些结果表明，流匹配Transformer结合潜在多模态条件控制与自参考加速机制，为高质量虚拟试穿提供了一种高效且训练成本较低的解决方案。

Abstract

Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors. We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.

Keywords: Virtual Try-On, Flow Matching, Diffusion Transformer, Image Editing, Conditional Generation, Inference Efficiency, Self-reference Mechanisms, Multi-modal Conditioning

198. ❌ BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder

Authors: Siquan Huang, Yijiang Li, Ningzhi Gao, Xingfu Yan, Leyu Shi Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11664v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	2.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	3.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

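Rationale: The paper addresses backdoor detection for pretrained vision encoders and is unrelated to most keywords. It is only weakly related to 'Large Language Models' (it mentions LVLMs) and related to 'Pre-training' (it concerns pretrained encoders). No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper proposes BackdoorIDS, a zero-shot, inference-time method for detecting backdoor samples that protects pretrained vision encoders against backdoor attacks; experiments show that it outperforms existing defenses.

Abstract (Chinese translation)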

自监督与多模态视觉编码器能够学习强大的视觉表征，这些表征被广泛应用于下游视觉任务及大型视觉语言模型（LVLMs）中。然而，下游用户往往依赖来源不确定的第三方预训练编码器，使其面临后门攻击的风险。在本研究中，我们提出BackdoorIDS，一种简单而有效的零样本、推理时后门样本检测方法，适用于预训练视觉编码器。BackdoorIDS的设计基于两个观察：注意力劫持与恢复。在渐进式输入掩码下，被植入后门的图像最初会将注意力集中在恶意触发特征上。一旦掩码比例超过触发器的鲁棒性阈值，触发器即被禁用，注意力迅速转向良性内容。这一转变会导致图像嵌入发生显著变化，而干净图像的嵌入在掩码过程中变化则更为平缓。BackdoorIDS通过沿掩码轨迹提取嵌入序列，并应用基于密度的聚类方法（如DBSCAN）来捕捉这一信号。若某输入的嵌入序列形成多于一个聚类，则被标记为后门样本。大量实验表明，BackdoorIDS在不同攻击类型、数据集和模型家族中均持续优于现有防御方法。值得注意的是，该方法为即插即用式，无需重新训练，且在推理时完全以零样本方式运行，使其能够兼容广泛的编码器架构，包括卷积神经网络（CNNs）、视觉Transformer（ViTs）、CLIP以及LLaVA-1.5。

Abstract

Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor samples detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger’s robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly across masking progress. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.

Keywords: backdoor detection, pretrained vision encoders, zero-shot, inference-time, attention hijacking, embedding sequence, DBSCAN, plug-and-play
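
The detection rule described in the abstract (embed progressively masked copies of an input, cluster the resulting embedding sequence with a density-based method, and flag the input when more than one cluster forms) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 2-D embeddings, the `eps` and `min_samples` values, and the helper names `dbscan` and `is_backdoored` are assumptions, and a small pure-Python DBSCAN stands in for a library implementation.

```python
import math


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


def is_backdoored(embedding_sequence, eps=0.5, min_samples=2):
    """Flag an input whose masked-embedding trajectory splits into >1 cluster."""
    labels = dbscan(embedding_sequence, eps, min_samples)
    return len({label for label in labels if label != -1}) > 1


# Toy trajectories (assumed data): a clean input drifts smoothly, while a
# backdoored input jumps once the trigger is masked out.
clean = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (0.4, 0.0)]
hijacked = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (5.2, 5.0)]
print(is_backdoored(clean, eps=0.15), is_backdoored(hijacked, eps=0.15))
```

In practice the embedding sequence would come from running the encoder on the same image at increasing masking ratios; the thresholds would need tuning per encoder.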

199. ❌ FL-MedSegBench: A Comprehensive Benchmark for Federated Learning on Medical Image Segmentation

Authors: Meilu Zhu, Zhiwei Wang, Axiu Mao, Yuxing Li, Xiaohan Xing, Yixuan Yuan, Edmund Y. Lam Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11659v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

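Rationale: The paper presents a benchmark for federated learning on medical image segmentation. It is unrelated to the vast majority of keywords (large-model technical principles, training methods, inference optimization, alignment, agent systems, and so on) and is only somewhat related to 'AI for Science OR Bioinformatics OR Cheminformatics', because the work applies AI to biomedicine (medical image analysis). Its core contribution is not an innovation in large models or in the technical principles of deep learning; it applies existing FL methods to a domain problem, so this keyword receives a relevance of 5 (somewhat related).

!!! tip deepseek-chat TL;DR

To address the lack of a standardized federated learning benchmark for medical image segmentation, the paper introduces FL-MedSegBench, the first comprehensive benchmark of its kind. Systematic experiments show that personalized federated learning methods (e.g., FedBN) generally outperform generic methods in accuracy, fairness, and generalization, although performance depends strongly on the dataset.

Abstract (Chinese translation)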

联邦学习（Federated Learning, FL）为协作式医学图像分析提供了一种无需共享原始数据的隐私保护范式。然而，医学图像分割领域缺乏标准化基准，阻碍了对联邦学习方法进行公平且全面的评估。为填补这一空白，我们推出了FL-MedSegBench——首个针对医学图像分割的综合性联邦学习基准。该基准涵盖十种成像模态下的九项分割任务，包含2D和3D格式，并体现了真实的临床异质性。我们系统评估了八种通用联邦学习方法（gFL）和五种个性化联邦学习方法（pFL），评估维度包括：分割准确性、公平性、通信效率、收敛行为以及对未见域的泛化能力。大量实验揭示了若干关键发现：（i）个性化联邦学习方法，尤其是采用客户端特定批量归一化的方法（如FedBN），始终优于通用方法；（ii）没有单一方法能在所有任务中占优，其性能表现依赖于具体数据集；（iii）通信频率分析表明，基于归一化的个性化方法在通信频率降低时表现出显著的鲁棒性；（iv）公平性评估识别出如Ditto和FedRDN等方法能有效保护表现欠佳的客户端；（v）方法对未见域的泛化能力与其在各参与客户端上的良好表现密切相关。我们将发布一个开源工具包，以促进可重复研究并加速临床适用的联邦学习解决方案，为真实世界的临床部署提供基于实证的指导。源代码发布于https://github.com/meiluzhu/FL-MedSegBench。

Abstract

Federated learning (FL) offers a privacy-preserving paradigm for collaborative medical image analysis without sharing raw data. However, the absence of standardized benchmarks for medical image segmentation hinders fair and comprehensive evaluation of FL methods. To address this gap, we introduce FL-MedSegBench, the first comprehensive benchmark for federated learning on medical image segmentation. Our benchmark encompasses nine segmentation tasks across ten imaging modalities, covering both 2D and 3D formats with realistic clinical heterogeneity. We systematically evaluate eight generic FL (gFL) and five personalized FL (pFL) methods across multiple dimensions: segmentation accuracy, fairness, communication efficiency, convergence behavior, and generalization to unseen domains. Extensive experiments reveal several key insights: (i) pFL methods, particularly those with client-specific batch normalization (e.g., FedBN), consistently outperform generic approaches; (ii) No single method universally dominates, with performance being dataset-dependent; (iii) Communication frequency analysis shows normalization-based personalization methods exhibit remarkable robustness to reduced communication frequency; (iv) Fairness evaluation identifies methods like Ditto and FedRDN that protect underperforming clients; (v) A method’s generalization to unseen domains is strongly tied to its ability to perform well across participating clients. We will release an open-source toolkit to foster reproducible research and accelerate clinically applicable FL solutions, providing empirically grounded guidelines for real-world clinical deployment. The source code is available at https://github.com/meiluzhu/FL-MedSegBench.

Keywords: Federated Learning, Medical Image Segmentation, Benchmark, Personalized Federated Learning, Clinical Heterogeneity, Generalization, Fairness, Communication Efficiency
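
Finding (i) refers to client-specific batch normalization as in FedBN, where each client keeps its batch-normalization parameters local while the remaining parameters are averaged on the server. A minimal sketch of one aggregation round follows. It is an illustration only and is not taken from the benchmark's toolkit: scalar values stand in for parameter tensors, the mean is unweighted, and the name-based test `is_bn_param` is a hypothetical convention.

```python
def is_bn_param(name):
    # Hypothetical naming convention: batch-norm layers are named "bn...".
    return name.startswith("bn")


def fedbn_round(client_states):
    """Average shared parameters across clients; leave BN parameters local."""
    n_clients = len(client_states)
    shared_names = [name for name in client_states[0] if not is_bn_param(name)]
    averaged = {
        name: sum(state[name] for state in client_states) / n_clients
        for name in shared_names
    }
    # Each client receives the averaged shared weights but keeps its own BN.
    return [{**state, **averaged} for state in client_states]


clients = [
    {"conv1.weight": 1.0, "bn1.weight": 10.0},
    {"conv1.weight": 3.0, "bn1.weight": 20.0},
]
updated = fedbn_round(clients)
print(updated)
```

A sample-count-weighted mean, as in standard federated averaging, would replace the unweighted mean in a realistic setting.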

200. ❌ OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

Authors: Yaofeng Su, Yuming Li, Zeyue Xue, Jie Huang, Siming Fu, Haoran Li, Ying Li, Zezhong Qian, Haoyang Huang, Nan Duan Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11647v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	5.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

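Rationale: The paper addresses real-time generation with multimodal (audio-visual) diffusion models. Its core contribution is the OmniForcing framework, which uses knowledge distillation to convert a bidirectional diffusion model into an autoregressive generator for real-time inference. It is unrelated to most keywords, which mainly target language models, and only indirectly related to three: 1. "KV Cache Compression OR Linear Attention OR FlashAttention" (5): the paper mentions a "modality-independent rolling KV-cache inference scheme", which involves KV-cache optimization for efficient inference; this is related but not central. 2. "Self-Correction OR Self-Improvement OR Self-Reflection" (5): the paper proposes "Joint Self-Forcing Distillation" to dynamically self-correct cumulative cross-modal errors, which involves a self-correction mechanism. 3. "Speculative Decoding OR Inference Acceleration" (5): the paper aims to accelerate inference for real-time generation (~25 FPS). No other keyword is covered.

!!! tip deepseek-chat TL;DR

The paper addresses the high latency that bidirectional attention dependencies impose on joint audio-visual diffusion models, which prevents real-time use. It proposes the OmniForcing framework, which uses knowledge distillation and a self-correction mechanism to convert an offline bidirectional model into a high-fidelity streaming autoregressive generator, achieving real-time generation at about 25 FPS on a single GPU while maintaining multi-modal synchronization and visual quality.

Abstract (Chinese translation)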

近期联合视听扩散模型虽能实现卓越的生成质量，却因其双向注意力依赖导致高延迟，阻碍了实时应用。我们提出OmniForcing，这是首个将离线双流双向扩散模型蒸馏为高保真流式自回归生成器的框架。然而，对此类双流架构直接应用因果蒸馏会引发严重的训练不稳定问题，其根源在于模态间极端的时间不对称性及由此产生的令牌稀疏性。我们通过引入非对称块因果对齐（Asymmetric Block-Causal Alignment）与零截断全局前缀（zero-truncation Global Prefix）来解决固有的信息密度差距，从而防止多模态同步漂移。因果转换过程中因音频令牌极度稀疏导致的梯度爆炸问题，则通过配备身份旋转位置编码约束（Identity RoPE constraint）的音频汇聚令牌机制（Audio Sink Token mechanism）进一步化解。最后，联合自强制蒸馏（Joint Self-Forcing Distillation）范式使模型能够在长序列生成过程中动态自校正因曝光偏差累积的跨模态误差。借助模态独立的滚动KV缓存推理方案，OmniForcing在单GPU上实现了约25 FPS的先进流式生成性能，同时在多模态同步性与视觉质量上保持与双向教师模型相当的水平。项目页面：https://omniforcing.com

Abstract

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. Project page: https://omniforcing.com

Keywords: audio-visual generation, diffusion models, real-time inference, knowledge distillation, autoregressive generation, KV-cache optimization, multi-modal synchronization, streaming generation
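
The abstract names a modality-independent rolling KV-cache but does not specify it. One common way to realize a rolling cache is a fixed-length window per stream that evicts the oldest entries, sketched below. The class name `RollingKVCache`, the window sizes, and the integer stand-ins for key/value tensors are all assumptions for illustration and do not describe the paper's actual scheme.

```python
from collections import deque


class RollingKVCache:
    """Fixed-window cache: appending beyond `window` evicts the oldest entry."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


# One independent cache per modality, each with its own (assumed) window.
caches = {"video": RollingKVCache(window=4), "audio": RollingKVCache(window=2)}
for step in range(6):
    caches["video"].append(step, step * 10)
    caches["audio"].append(step, step * 10)

print(list(caches["video"].keys), list(caches["audio"].keys))
```

Keeping the caches separate lets each stream use a window matched to its own token rate, which is one plausible reading of "modality-independent".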

201. ❌ MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

Authors: Baicheng Li, Dong Wu, Jun Li, Shunkai Zhou, Zecui Zeng, Lusong Li, Hongbin Zha Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11633v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

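Rationale: The paper addresses 3D generation and proposes MV-SAM3D, a training-free framework that improves layout-aware 3D generation through multi-view fusion and physics-aware optimization. Its core techniques (diffusion models, multi-view fusion, physics-constrained optimization) are computer vision and graphics methods; it involves no large language models (LLMs), no innovation in the technical principles of deep learning, and no specific AI for Science application. All scoring keywords concern LLMs, deep learning technical principles, or AI applications in specific scientific fields and have no direct connection to this 3D generation work, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper addresses two problems in layout-aware 3D generation: single-view input cannot exploit complementary multi-view information, and independently estimated object poses often lead to physically implausible layouts. It proposes the MV-SAM3D framework, which uses adaptive multi-view fusion and physics-aware optimization to significantly improve reconstruction fidelity and layout plausibility without additional training.

Abstract (Chinese translation)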

近期统一化三维生成模型在从单张图像生成高质量三维资产方面取得了显著进展。值得注意的是，诸如SAM3D等布局感知方法能够重建多个物体并保持其空间排布，为实用级场景三维生成打开了大门。然而，现有方法局限于单视角输入，无法利用互补的多视角观测信息，且独立估计的物体姿态常导致物理上不合理的布局，如物体相互穿透和悬浮伪影等问题。

我们提出MV-SAM3D，一个无需训练的框架，通过多视角一致性与物理合理性扩展了布局感知的三维生成能力。我们将多视角融合建模为三维潜在空间中的多重扩散过程，并提出两种自适应加权策略——注意力熵加权与可见性加权——实现置信度感知的融合机制，确保每个视角根据其局部观测可靠性贡献信息。针对多物体组合，我们引入物理感知优化方法，在生成过程中及生成后注入碰撞约束与接触约束，从而产生物理合理的物体排布。在标准基准测试和真实世界多物体场景上的实验表明，该方法在重建保真度与布局合理性方面均取得显著提升，且无需任何额外训练。代码发布于https://github.com/devinli123/MV-SAM3D。

Abstract

Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts. We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies – attention-entropy weighting and visibility weighting – that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at https://github.com/devinli123/MV-SAM3D.

Keywords: 3D generation, multi-view fusion, layout-aware, physical plausibility, training-free, diffusion models, collision constraints, scene reconstruction
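
The confidence-aware fusion mentioned in the abstract can be illustrated with a small sketch of attention-entropy weighting. The abstract does not give the formula, so everything here is an assumption: views whose attention distribution has lower entropy are treated as more confident, the weights come from a softmax over negative entropy, and `entropy_weights`, `fuse`, and `temperature` are hypothetical names.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weights(attn_per_view, temperature=1.0):
    """Softmax over negative entropy: peaked attention gets a larger weight."""
    scores = [-entropy(p) / temperature for p in attn_per_view]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(latents, weights):
    """Weighted sum of per-view latent vectors."""
    dim = len(latents[0])
    return [sum(w * z[i] for w, z in zip(weights, latents)) for i in range(dim)]


# Assumed toy data: view 0 attends sharply, view 1 attends uniformly.
attention = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
weights = entropy_weights(attention)
fused = fuse([[1.0, 0.0], [0.0, 1.0]], weights)
print(weights, fused)
```

The paper also describes visibility weighting; combining the two would require details that the abstract does not provide.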

202. ❌ Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography

Authors: Yichi Zhang, Le Xue, Wenbo Zhang, Lanlan Li, Feiyang Xiao, Yuchen Liu, Xiaohui Zhang, Hongwei Zhang, Shuqi Wang, Gang Feng, Liling Peng, Xin Gao, Yuanfan Xu, Yuan Qi, Kuangyu Shi, Hong Zhang, Yuan Cheng, Mei Tian, Zixin Hu Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11627v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

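Rationale: The paper develops SegAnyPET, a foundation model for segmenting 3D whole-body PET images. It belongs to AI for Science (biomedical imaging) and is highly related to 'AI for Science OR Bioinformatics OR Cheminformatics' (10). It explicitly builds a foundation model, so it is highly related to 'Large Language Models OR LLMs OR Foundation Models' (10). Model development involves pre-training, so it is highly related to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10). The paper builds a large-scale dataset, which implies an effect of data quality on model performance, so it is somewhat related to 'Scaling Laws AND Data Quality' (5). The model supports human correction, which may involve fine-tuning, so it is somewhat related to 'Post-training OR Supervised Fine-tuning OR SFT' (5). Other keywords, such as MoE, SLMs, alignment, and inference acceleration, are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The study develops SegAnyPET, a foundation model for universal organ and lesion segmentation in 3D whole-body PET images, which achieves strong zero-shot performance by building a large-scale dataset and using an innovative 3D architecture.

Abstract (Chinese translation)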

正电子发射断层扫描（PET）是一种关键的核医学成像模态，通过可视化放射性示踪剂分布来量化体内生理与代谢过程，在疾病管理中发挥着不可替代的作用。尽管其临床意义重大，但由于PET图像固有的解剖对比度不足带来的分割挑战，以及数据采集与标注的高昂成本，针对定量PET图像分析的深度学习模型发展仍受到严重限制。为弥补这一空白，我们开发了适用于三维全身PET成像的通用分割基础模型。我们首先构建了迄今为止规模最大、最全面的PET数据集，包含11041例三维全身PET扫描及59831个分割掩模，用于模型开发。基于此数据集，我们提出了SegAnyPET——一种具有广泛适用性的创新基础模型，可适配多样化的分割任务。该模型基于三维架构构建，采用掩模生成的提示工程策略，能够实现通用且可扩展的器官与病灶分割，支持通过极简人工干预进行高效校正，并赋能临床人机协同工作流程。在多中心、多示踪剂、多疾病数据集上的广泛评估表明，SegAnyPET在广泛的分割任务中展现出强大的零样本性能，凸显了其推动分子影像临床应用的潜力。

Abstract

Positron emission tomography (PET) is a key nuclear medicine imaging modality that visualizes radiotracer distributions to quantify in vivo physiological and metabolic processes, playing an irreplaceable role in disease management. Despite its clinical importance, the development of deep learning models for quantitative PET image analysis remains severely limited, driven by both the inherent segmentation challenge from PET’s paucity of anatomical contrast and the high costs of data acquisition and annotation. To bridge this gap, we develop generalist foundational models for universal segmentation from 3D whole-body PET imaging. We first build the largest and most comprehensive PET dataset to date, comprising 11041 3D whole-body PET scans with 59831 segmentation masks for model development. Based on this dataset, we present SegAnyPET, an innovative foundational model with general-purpose applicability to diverse segmentation tasks. Built on a 3D architecture with a prompt engineering strategy for mask generation, SegAnyPET enables universal and scalable organ and lesion segmentation, supports efficient human correction with minimal effort, and enables a clinical human-in-the-loop workflow. Extensive evaluations on multi-center, multi-tracer, multi-disease datasets demonstrate that SegAnyPET achieves strong zero-shot performance across a wide range of segmentation tasks, highlighting its potential to advance the clinical applications of molecular imaging.

Keywords: Foundation Models, PET Imaging, Universal Segmentation, 3D Whole-Body, SegAnyPET, Zero-shot Performance, Medical Imaging, Deep Learning

203. ❌ MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models

Authors: Shengyuan Liu, Zanting Ye, Yunrui Lin, Chen Hu, Wanting Geng, Xu Han, Bulat Ibragimov, Yefeng Zheng, Yixuan Yuan Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11625v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

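Scoring rationale: The paper focuses on efficiency optimization for medical vision-language models (VLMs) and proposes MedPruner, a training-free hierarchical token pruning framework for 3D medical image understanding. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant (10 points), because the paper directly concerns an application of AI in biomedicine (3D medical image analysis). The other keywords mainly target pure language models, training techniques, reasoning methods, alignment, agent systems, and similar topics, which are unrelated to the paper's focus on VLM efficiency, so they receive 0 points.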

!!! tip deepseek-chat TL;DR

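To address the poor computational efficiency of 3D medical vision-language models, the paper proposes MedPruner, a training-free hierarchical token pruning framework that maintains or improves model performance while retaining fewer than 5% of visual tokens, substantially reducing computational overhead.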

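Abstract (translation)

Although specialized medical vision-language models (VLMs) have achieved remarkable success in interpreting 2D and 3D medical modalities, their deployment on 3D volumetric data remains limited by significant computational inefficiency. Current architectures typically carry massive anatomical redundancy because they directly concatenate consecutive 2D slices, and their fixed pruning ratios lack the flexibility to handle heterogeneous information density across slices. To address these challenges, we propose MedPruner, a training-free and model-agnostic hierarchical token pruning framework designed for efficient 3D medical image understanding. MedPruner introduces a two-stage mechanism: an Inter-slice Anchor-based Filtering module first removes slice-level temporal redundancy, and a Dynamic Information Nucleus Selection strategy then performs adaptive token-level compression by quantifying cumulative attention weights. Extensive experiments on three 3D medical benchmarks and three different medical VLMs show that existing architectures contain massive token redundancy. Notably, MedPruner lets models such as MedGemma maintain or even exceed their original performance while retaining fewer than 5% of visual tokens, drastically reducing computational overhead and validating the need for dynamic token selection in practical clinical deployment. Our code will be released.

Abstract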

While specialized Medical Vision-Language Models (VLMs) have achieved remarkable success in interpreting 2D and 3D medical modalities, their deployment for 3D volumetric data remains constrained by significant computational inefficiencies. Current architectures typically suffer from massive anatomical redundancy due to the direct concatenation of consecutive 2D slices and lack the flexibility to handle heterogeneous information densities across different slices using fixed pruning ratios. To address these challenges, we propose MedPruner, a training-free and model-agnostic hierarchical token pruning framework specifically designed for efficient 3D medical image understanding. MedPruner introduces a two-stage mechanism: an Inter-slice Anchor-based Filtering module to eliminate slice-level temporal redundancy, followed by a Dynamic Information Nucleus Selection strategy that achieves adaptive token-level compression by quantifying cumulative attention weights. Extensive experiments on three 3D medical benchmarks and across three diverse medical VLMs reveal massive token redundancy in existing architectures. Notably, MedPruner enables models such as MedGemma to maintain or even exceed their original performance while retaining fewer than 5% of visual tokens, thereby drastically reducing computational overhead and validating the necessity of dynamic token selection for practical clinical deployment. Our code will be released.

Keywords: Medical Vision-Language Models, 3D medical image understanding, token pruning, computational efficiency, training-free framework, dynamic token selection, attention weights, clinical deployment
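
As a rough illustration of what selecting tokens "by quantifying cumulative attention weights" could look like, the sketch below keeps the smallest set of tokens whose normalized attention scores reach a target share of the total, in the spirit of nucleus (top-p) selection. It is a toy under stated assumptions, not MedPruner's implementation: the function name `select_nucleus_tokens`, the `mass` threshold, and the per-token score lists are hypothetical, and the abstract does not say how the scores are obtained.

```python
def select_nucleus_tokens(attention_scores, mass=0.85):
    """Return indices of the smallest set of tokens whose normalized
    attention scores sum to at least `mass` (highest scores first).

    Hypothetical helper for illustration only."""
    if not attention_scores:
        return []
    total = sum(attention_scores)
    if total <= 0:
        raise ValueError("attention scores must have a positive sum")
    # Rank token indices from most to least attended.
    order = sorted(range(len(attention_scores)),
                   key=lambda i: attention_scores[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += attention_scores[idx] / total
        if cumulative >= mass:
            break
    return sorted(kept)  # restore the original token order


# Made-up scores: one slice with concentrated attention, one with flat attention.
peaked = [70, 20, 5, 3, 2]
flat = [1, 1, 1, 1, 1]
kept_peaked = select_nucleus_tokens(peaked)  # keeps the two dominant tokens
kept_flat = select_nucleus_tokens(flat)      # keeps all five tokens
print(kept_peaked, kept_flat)
```

Because the cutoff depends on how concentrated the scores are, a slice with a few dominant tokens is pruned harder than one with evenly spread attention, which is the kind of adaptivity a fixed pruning ratio cannot provide.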

204. ❌ Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild

Authors: Jiin Im, Sisung Liu, Je Hyeong Hong Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11618v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

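Scoring rationale: The paper studies semantic correspondence in computer vision, using a 3D foundation model and Fused Gromov-Wasserstein optimal transport. It has no direct connection to most of the large language model (LLM) keywords. It is moderately relevant (5 points) only to "Foundation Models" (the paper refers to 2D and 3D foundation models) and "Pre-training" (foundation models are typically pre-trained). All other keywords are irrelevant (0 points).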

!!! tip deepseek-chat TL;DR

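The paper proposes the Shape-of-You framework, which reformulates pseudo-label generation as a Fused Gromov-Wasserstein optimal transport problem and uses a 3D foundation model to define intra-structure in geometric space. This resolves geometric ambiguity in semantic correspondence for in-the-wild images and achieves state-of-the-art performance on the SPair-71k and AP-10k datasets.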

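Abstract (translation)

Semantic correspondence is essential for handling diverse in-the-wild images that lack explicit correspondence annotations. Although current 2D foundation models provide powerful feature representations, adapting them to unsupervised learning through nearest-neighbor pseudo-labels has key limitations: the approach operates only locally and ignores structural relationships, and its reliance on 2D appearance features cannot resolve geometric ambiguities caused by symmetries or repetitive features. This work addresses the shortcoming by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), uses a 3D foundation model to define this intra-structure in geometric space, thereby resolving the ambiguity. However, because FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. We therefore introduce a soft-target loss that dynamically blends guidance from the transport plan with network predictions, building a learning framework that is robust to this noise. SoY achieves state-of-the-art performance on the SPair-71k and AP-10k datasets, establishing a new benchmark for semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.

Abstract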

Semantic correspondence is essential for handling diverse in-the-wild images lacking explicit correspondence annotations. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving abovementioned ambiguity. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. Thus, we introduce a soft-target loss dynamically blending guidance from this plan with network predictions to build a learning framework robust to this noise. SoY achieves state-of-the-art performance on SPair-71k and AP-10k datasets, establishing a new benchmark in semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.

Keywords: Semantic Correspondence, Fused Gromov-Wasserstein, Optimal Transport, 3D Foundation Model, Geometric Ambiguity, Unsupervised Learning, Pseudo-label Generation, Structural Consistency
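
For readers unfamiliar with the objective named in the abstract, the sketch below evaluates the standard Fused Gromov-Wasserstein cost of a given transport plan: a feature term weighted by `1 - alpha` plus a structure term weighted by `alpha`. It only scores a plan and does not solve for one, and it does not reproduce the paper's anchor-based linearization. The function name `fgw_objective`, the squared-difference structure loss, and the toy keypoints are illustrative assumptions.

```python
def fgw_objective(T, M, C1, C2, alpha=0.5):
    """Standard FGW cost of a coupling T (n x m), assuming a squared loss:
    (1 - alpha) * sum_ij M[i][j] * T[i][j]
    + alpha * sum_ijkl (C1[i][k] - C2[j][l])**2 * T[i][j] * T[k][l]
    """
    n, m = len(T), len(T[0])
    feature_term = sum(M[i][j] * T[i][j] for i in range(n) for j in range(m))
    structure_term = 0.0
    for i in range(n):
        for j in range(m):
            if T[i][j] == 0.0:
                continue  # zero-mass pairs contribute nothing
            for k in range(n):
                for l in range(m):
                    diff = C1[i][k] - C2[j][l]
                    structure_term += diff * diff * T[i][j] * T[k][l]
    return (1.0 - alpha) * feature_term + alpha * structure_term


# Toy case: keypoints 0 and 1 look identical (zero feature cost either way),
# so appearance alone cannot tell the two matchings apart.
M = [[0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
# Pairwise distances for points at positions 0, 1, 3 on a line (both images).
C = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 2.0],
     [3.0, 2.0, 0.0]]
w = 1.0 / 3.0
identity = [[w, 0.0, 0.0], [0.0, w, 0.0], [0.0, 0.0, w]]
swapped = [[0.0, w, 0.0], [w, 0.0, 0.0], [0.0, 0.0, w]]

cost_identity = fgw_objective(identity, M, C, C)
cost_swapped = fgw_objective(swapped, M, C, C)
print(cost_identity, cost_swapped)
```

In this toy setup both matchings have the same feature cost, and only the structure term penalizes the swapped assignment, which mirrors the abstract's point that geometric structure can disambiguate matches that 2D appearance cannot.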

205. ❌ Noise-aware few-shot learning through bi-directional multi-view prompt alignment

Authors: Lu Niu, Cheng Xue Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11617v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

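Scoring rationale: The paper studies the noisy-label problem of vision-language models (VLMs) in few-shot learning and proposes the NA-MVP framework for bi-directional multi-view prompt alignment. All of the given keywords specifically target large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization), whereas this paper focuses on VLMs, which are multimodal models rather than text-only large models. Although VLMs overlap technically with large models (for example, in prompt learning and alignment), the paper does not involve any of the LLM-specific techniques, methods, or application areas (such as AI for science) named in the keywords. All keywords are therefore scored 0, indicating no relevance.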

!!! tip deepseek-chat TL;DR

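To address the vulnerability of vision-language models to noisy labels in few-shot learning, the paper proposes the NA-MVP framework, which separates clean from noisy signals through bi-directional multi-view prompt alignment and optimal transport. Experiments show that it achieves more robust few-shot learning under noisy supervision.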

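Abstract (translation)

Vision-language models show strong few-shot learning ability through prompt tuning, but they are not robust to noisy labels, which can corrupt prompts and weaken cross-modal alignment. Existing methods often struggle because they cannot model fine-grained semantic cues or adaptively separate clean signals from noisy ones. To address these problems, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP rests on a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To achieve this, NA-MVP uses (1) multi-view prompts combined with unbalanced optimal transport, which yield fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, so the model focuses on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks show that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness for robust few-shot learning under noisy supervision.

Abstract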

Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.

Keywords: vision-language models, few-shot learning, noisy labels, prompt alignment, multi-view prompts, optimal transport, robust learning, cross-modal alignment
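
The abstract does not say which unbalanced optimal transport formulation NA-MVP uses. One common choice, assumed here purely for illustration, is entropic OT with KL-relaxed marginals, solved by damped Sinkhorn scaling updates. The function name `unbalanced_sinkhorn`, the parameters `eps` and `rho`, and the toy patch-to-prompt cost matrix are all hypothetical.

```python
import math


def unbalanced_sinkhorn(cost, a, b, eps=0.5, rho=1.0, n_iters=200):
    """Entropic unbalanced OT with KL marginal penalties (assumed formulation).

    cost: n x m cost matrix; a, b: nominal marginal weights.
    eps: entropic regularization; rho: strength of the marginal penalty.
    Returns the n x m transport plan."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    power = rho / (rho + eps)  # exponent < 1 relaxes the marginal constraints
    for _ in range(n_iters):
        for i in range(n):
            Kv = sum(K[i][j] * v[j] for j in range(m))
            u[i] = (a[i] / Kv) ** power
        for j in range(m):
            Ktu = sum(K[i][j] * u[i] for i in range(n))
            v[j] = (b[j] / Ktu) ** power
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: 3 image patches vs. 2 prompts. Patch 2 stands in for an
# unreliable region that is far from both prompts.
cost = [[0.1, 0.9],
        [0.9, 0.1],
        [2.0, 2.0]]
a = [1.0 / 3.0] * 3
b = [0.5, 0.5]
plan = unbalanced_sinkhorn(cost, a, b)
row_mass = [sum(row) for row in plan]
print(row_mass)
```

Unlike balanced OT, which would force every patch to ship its full weight, the relaxed marginals let the high-cost patch transport less mass than the well-matched ones, which is one way "suppressing unreliable regions" could be realized.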

206. ❌ SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation

Authors: Muyi Sun, Yifan Gao, Ziang Jia, Xingqun Qi, Qianli Zhang, Qian Liu, Tianzheng Deng Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11616v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

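Scoring rationale: The paper focuses on a semi-supervised learning framework for medical imaging (CBCT tooth segmentation). As an application of AI in biomedicine, it has some relevance to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), but it does not involve large models, innovations in the principles of deep learning techniques, or other keywords (such as LLMs, MoE, and Scaling Laws), so all other keywords receive 0 points.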

!!! tip deepseek-chat TL;DR

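To address the difficulty of obtaining annotated data and the domain gaps in multi-source CBCT tooth segmentation, the paper proposes SemiTooth, a generalizable semi-supervised framework. With a multi-teacher, multi-student architecture and a Stricter Weighted-Confidence Constraint, it achieves state-of-the-art performance on the authors' own dataset.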

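Abstract (translation)

With the rapid development of artificial intelligence, intelligent dentistry for clinical diagnosis and treatment shows broad promise. As a core clinical dentistry task, tooth structure segmentation in Cone-Beam Computed Tomography (CBCT) has made significant progress in recent years. However, the difficulty of obtaining fully annotated data and the acquisition variability of multi-source data across institutions lead to low utilization, voxel-level inconsistency, and domain-specific disparity in CBCT slices. How to use multi-source unlabeled data rationally and efficiently therefore becomes a key problem. This paper proposes SemiTooth, a generalizable semi-supervised framework for multi-source tooth segmentation. Specifically, we first build MS3Toothset, a Multi-Source Semi-Supervised Tooth DataSet for clinical dental CBCT, which contains data from three sources with different annotation levels. We then design SemiTooth, a multi-teacher, multi-student framework that promotes semi-supervised learning on multi-source data. Distinct student networks learn from unlabeled data from different sources, each supervised by its corresponding teacher network. In addition, we introduce a Stricter Weighted-Confidence Constraint for the multi-teacher mechanism to improve segmentation accuracy on multi-source data. Extensive experiments on MS3Toothset verify the feasibility and superiority of the SemiTooth framework, which achieves state-of-the-art performance in the semi-supervised, multi-source tooth segmentation setting.

Abstract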

With the rapid advancement of artificial intelligence, intelligent dentistry for clinical diagnosis and treatment has become increasingly promising. As the primary clinical dentistry task, tooth structure segmentation for Cone-Beam Computed Tomography (CBCT) has made significant progress in recent years. However, challenges arise from the obtainment difficulty of full-annotated data, and the acquisition variability of multi-source data across different institutions, which have caused low-quality utilization, voxel-level inconsistency, and domain-specific disparity in CBCT slices. Thus, the rational and efficient utilization of multi-source and unlabeled data represents a pivotal problem. In this paper, we propose SemiTooth, a generalizable semi-supervised framework for multi-source tooth segmentation. Specifically, we first compile MS3Toothset, Multi-Source Semi-Supervised Tooth DataSet for clinical dental CBCT, which contains data from three sources with different-level annotations. Then, we design a multi-teacher and multi-student framework, i.e., SemiTooth, which promotes semi-supervised learning for multi-source data. SemiTooth employs distinct student networks that learn from unlabeled data with different sources, supervised by its respective teachers. Furthermore, a Stricter Weighted-Confidence Constraint is introduced for multiple teachers to improve the multi-source accuracy. Extensive experiments are conducted on MS3Toothset to verify the feasibility and superiority of the SemiTooth framework, which achieves SOTA performance on the semi-supervised and multi-source tooth segmentation scenario.

Keywords: tooth segmentation, semi-supervised learning, multi-source data, CBCT, medical imaging, dental AI, domain adaptation, multi-teacher framework

207. ❌ DyWeight: Dynamic Gradient Weighting for Few-Step Diffusion Sampling

Authors: Tong Zhao, Mingkun Lei, Liangyu Yuan, Yanming Yang, Chenxi Song, Yang Wang, Beier Zhu, Chi Zhang Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11607v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

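Scoring rationale: The paper focuses on efficient sampling for diffusion models and proposes DyWeight, a multi-step solver with dynamic gradient weighting. All scoring keywords relate directly to large language models (LLMs) and associated techniques (such as training, alignment, inference optimization, and agent systems) or to specific AI-for-science applications (such as bioinformatics). The paper's topic, accelerating diffusion model sampling, has no direct connection to these keywords, so every keyword receives a relevance score of 0.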

!!! tip deepseek-chat TL;DR

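To address the slow sampling of diffusion models, the paper proposes DyWeight, a multi-step solver with dynamic gradient weighting. By adaptively aggregating historical gradients and calibrating the effective step size, it achieves better visual fidelity and stability with significantly fewer function evaluations.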

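Abstract (translation)

Diffusion Models (DMs) have achieved state-of-the-art generative performance across multiple modalities, yet their sampling process remains extremely slow because it requires hundreds of function evaluations. Recent advances in multi-step ordinary differential equation (ODE) solvers have greatly improved efficiency by reusing historical gradients, but existing methods rely on handcrafted coefficients that cannot adapt to the non-stationary dynamics of diffusion sampling. To address this limitation, we propose Dynamic Gradient Weighting (DyWeight), a lightweight, learning-based multi-step solver that introduces a streamlined implicit coupling paradigm. By relaxing classical numerical constraints, DyWeight learns unconstrained time-varying parameters that adaptively aggregate historical gradients while intrinsically scaling the effective step size. Under large integration steps, this implicit time calibration accurately aligns the solver's numerical trajectory with the model's internal denoising dynamics, avoiding complex decoupled parameterizations and optimization procedures. Extensive experiments on CIFAR-10, FFHQ, AFHQv2, ImageNet64, LSUN-Bedroom, Stable Diffusion, and FLUX.1-dev show that DyWeight achieves superior visual fidelity and stability with significantly fewer function evaluations, setting a new state of the art among efficient diffusion solvers. Code is available at https://github.com/Westlake-AGI-Lab/DyWeight.

Abstract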

Diffusion Models (DMs) have achieved state-of-the-art generative performance across multiple modalities, yet their sampling process remains prohibitively slow due to the need for hundreds of function evaluations. Recent progress in multi-step ODE solvers has greatly improved efficiency by reusing historical gradients, but existing methods rely on handcrafted coefficients that fail to adapt to the non-stationary dynamics of diffusion sampling. To address this limitation, we propose Dynamic Gradient Weighting (DyWeight), a lightweight, learning-based multi-step solver that introduces a streamlined implicit coupling paradigm. By relaxing classical numerical constraints, DyWeight learns unconstrained time-varying parameters that adaptively aggregate historical gradients while intrinsically scaling the effective step size. This implicit time calibration accurately aligns the solver’s numerical trajectory with the model’s internal denoising dynamics under large integration steps, avoiding complex decoupled parameterizations and optimizations. Extensive experiments on CIFAR-10, FFHQ, AFHQv2, ImageNet64, LSUN-Bedroom, Stable Diffusion and FLUX.1-dev demonstrate that DyWeight achieves superior visual fidelity and stability with significantly fewer function evaluations, establishing a new state-of-the-art among efficient diffusion solvers. Code is available at https://github.com/Westlake-AGI-Lab/DyWeight

Keywords: Diffusion Models, Sampling Acceleration, Multi-step Solver, Dynamic Gradient Weighting, ODE Solvers, Few-Step Sampling, Implicit Coupling, Denoising Dynamics
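
To make "reusing historical gradients" concrete, the sketch below implements a generic linear multistep update in which each step combines the most recent gradients with a per-step list of coefficients. It is not DyWeight: the example plugs in classical handcrafted coefficients (Euler and second-order Adams-Bashforth) on a toy scalar ODE, whereas the abstract describes learning unconstrained, time-varying coefficients. The function name `multistep_solve` and the `step_weights` layout are hypothetical.

```python
import math


def multistep_solve(f, x0, ts, step_weights):
    """Integrate dx/dt = f(x, t) over the time grid `ts`.

    step_weights[n] lists the coefficients applied at step n to the stored
    gradients, newest first. If the coefficients do not sum to 1, the
    effective step size is rescaled implicitly."""
    x, history = x0, []
    for n in range(len(ts) - 1):
        h = ts[n + 1] - ts[n]
        history.insert(0, f(x, ts[n]))  # newest gradient first
        # zip() stops at the shorter list, so early steps use fewer gradients.
        direction = sum(w * g for w, g in zip(step_weights[n], history))
        x = x + h * direction
    return x


def decay(x, t):
    return -x  # toy ODE with exact solution x(t) = exp(-t)


ts = [0.0, 0.25, 0.5, 0.75, 1.0]
euler_weights = [[1.0]] * 4
ab2_weights = [[1.0]] + [[1.5, -0.5]] * 3  # first step falls back to Euler

euler = multistep_solve(decay, 1.0, ts, euler_weights)
ab2 = multistep_solve(decay, 1.0, ts, ab2_weights)
exact = math.exp(-1.0)
print(abs(euler - exact), abs(ab2 - exact))
```

With the same four function evaluations, the two-gradient rule lands much closer to the exact value than Euler does. A learned solver would replace the fixed `[1.5, -0.5]` entries with trained, step-dependent values.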

208. ❌ Articulat3D: Reconstructing Articulated Digital Twins From Monocular Videos with Geometric and Motion Constraints

Authors: Lijun Guo, Haoyu Zhao, Xingyue Zhao, Rong Fu, Linghao Zhuang, Siteng Huang, Zhongyu Li, Hua Zou Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11606v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

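Scoring rationale: The paper, "Articulat3D: Reconstructing Articulated Digital Twins From Monocular Videos with Geometric and Motion Constraints", focuses on computer vision and 3D reconstruction, specifically reconstructing digital twins of articulated objects from monocular videos. Its core contribution is a new framework that combines geometric and motion constraints, comprising Motion Prior-Driven Initialization and Geometric and Motion Constraints Refinement. The paper deals with concepts such as 3D point tracks, motion bases, motion decomposition, kinematic primitives, joint axes, and pivot points, placing it in computer vision, 3D reconstruction, and motion analysis. All scoring keywords relate to large models, the principles of deep learning techniques, AI for Science (such as bioinformatics and cheminformatics), and similar topics, to which the paper's content has no direct connection, so every keyword receives a relevance score of 0.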

!!! tip deepseek-chat TL;DR

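The paper proposes Articulat3D, a new framework that tackles the challenge of reconstructing high-fidelity digital twins of articulated objects from monocular videos. By jointly enforcing explicit 3D geometric and motion constraints, it achieves state-of-the-art performance on synthetic benchmarks and real-world videos.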

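Abstract (translation)

Building high-fidelity digital twins of articulated objects from visual data remains a central challenge. Existing methods rely on multi-view captures of the object in discrete static states, which severely limits their scalability in real-world scenarios. This paper proposes Articulat3D, a novel framework that builds such digital twins from casually captured monocular videos by jointly enforcing explicit 3D geometric and motion constraints. We first propose Motion Prior-Driven Initialization, which uses 3D point tracks to exploit the low-dimensional structure of articulated motion. By modeling scene dynamics with a compact set of motion bases, we obtain a soft decomposition of the scene into multiple rigidly moving groups. Building on this initialization, we further propose Geometric and Motion Constraints Refinement, which enforces physically plausible articulation through learnable kinematic primitives parameterized by a joint axis, a pivot point, and per-frame motion scalars, producing reconstructions that are geometrically accurate and temporally coherent. Extensive experiments show that Articulat3D achieves state-of-the-art performance on both synthetic benchmarks and real-world casually captured monocular videos, significantly advancing the feasibility of creating digital twins under uncontrolled real-world conditions. Our project page is at https://maxwell-zhao.github.io/Articulat3D.

Abstract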

Building high-fidelity digital twins of articulated objects from visual data remains a central challenge. Existing approaches depend on multi-view captures of the object in discrete, static states, which severely constrains their real-world scalability. In this paper, we introduce Articulat3D, a novel framework that constructs such digital twins from casually captured monocular videos by jointly enforcing explicit 3D geometric and motion constraints. We first propose Motion Prior-Driven Initialization, which leverages 3D point tracks to exploit the low-dimensional structure of articulated motion. By modeling scene dynamics with a compact set of motion bases, we facilitate soft decomposition of the scene into multiple rigidly-moving groups. Building on this initialization, we introduce Geometric and Motion Constraints Refinement, which enforces physically plausible articulation through learnable kinematic primitives parameterized by a joint axis, a pivot point, and per-frame motion scalars, yielding reconstructions that are both geometrically accurate and temporally coherent. Extensive experiments demonstrate that Articulat3D achieves state-of-the-art performance on synthetic benchmarks and real-world casually captured monocular videos, significantly advancing the feasibility of digital twin creation under uncontrolled real-world conditions. Our project page is at https://maxwell-zhao.github.io/Articulat3D.

Keywords: Articulated Digital Twins, Monocular Videos, 3D Geometric Constraints, Motion Constraints, Motion Prior-Driven Initialization, Kinematic Primitives, 3D Reconstruction, Articulated Objects
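
A kinematic primitive "parameterized by a joint axis, a pivot point, and per-frame motion scalars" can be pictured, for a revolute joint, as a rotation about a line in space whose angle changes from frame to frame. The sketch below applies such a rotation with Rodrigues' formula. It is an illustrative reading of the abstract rather than the authors' code: the function name `apply_revolute`, the choice of a revolute joint, and the door-hinge example are assumptions.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def apply_revolute(point, axis, pivot, angle):
    """Rotate `point` by `angle` (radians) about the line through `pivot`
    with direction `axis`, using Rodrigues' rotation formula."""
    norm = math.sqrt(_dot(axis, axis))
    a = [c / norm for c in axis]                 # unit joint axis
    v = [p - c for p, c in zip(point, pivot)]    # point relative to the pivot
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    a_cross_v = _cross(a, v)
    a_dot_v = _dot(a, v)
    rotated = [v[i] * cos_t + a_cross_v[i] * sin_t
               + a[i] * a_dot_v * (1.0 - cos_t) for i in range(3)]
    return [rotated[i] + pivot[i] for i in range(3)]


# Door-hinge toy example: vertical hinge through (1, 0, 0); one scalar per frame.
hinge_axis, hinge_pivot = [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]
door_edge = [2.0, 0.0, 0.0]
frame_angles = [0.0, math.pi / 4, math.pi / 2]
trajectory = [apply_revolute(door_edge, hinge_axis, hinge_pivot, t)
              for t in frame_angles]
print(trajectory[-1])
```

Here the axis and pivot are shared by all frames and only the angle varies, so the whole trajectory is described by a handful of parameters, which is what makes this kind of primitive a strong physical constraint.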

209. ❌ LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference

Authors: Junkun Jiang, Ho Yin Au, Jingyu Xiang, Jie Chen Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11605v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

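Scoring rationale: The paper's core is a novel application of LLMs to language-to-motion generation, so it is highly relevant to "Large Language Models" (10 points). It proposes an agent-based design in which the LLM interprets motion patterns and recombines symbols, making it highly relevant to "LLM Agents" (10 points). It emphasizes interpretability and develops the LabanLite representation, making it highly relevant to "Mechanistic Interpretability" (10 points). It involves symbolic reasoning and composition, which gives it some relevance to "Chain of Thought" and "System 2 Thinking" (5 points each). The LLM uses symbolic templates to produce executable plans, which gives it some relevance to "Tool Use" (5 points). Other keywords, such as MoE, quantization, and RAG, are not covered in the paper and receive 0 points.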

!!! tip deepseek-chat TL;DR

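The paper addresses the lack of interpretability and temporal accuracy in existing text-to-motion generation methods. By introducing the LabanLite symbolic representation and LaMoGen, an LLM-based agent framework, it enables interpretable and controllable language-driven motion synthesis and outperforms existing methods on multiple datasets.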

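Abstract (translation)

Human motion is highly expressive and naturally aligned with language, yet prevailing methods rely heavily on joint text-motion embeddings, struggle to synthesize temporally accurate and detailed motion, and often lack interpretability. To address these limitations, we propose LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text-motion embeddings, LabanLite encodes each atomic body-part action (for example, a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motion into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we further present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, ultimately producing motion that is both interpretable and linguistically grounded. To support systematic evaluation, we build a Labanotation-based benchmark with structured description-motion pairs and propose three metrics that jointly measure text-motion alignment along symbolic, temporal, and harmony dimensions. Experiments show that LaMoGen sets a new baseline for interpretability and controllability and outperforms existing methods on our benchmark and on two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis.

Abstract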

Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text-motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, producing motions that are both interpretable and linguistically grounded. To support rigorous evaluation, we introduce a Labanotation-based benchmark with structured description-motion pairs and three metrics that jointly measure text-motion alignment across symbolic, temporal, and harmony dimensions. Experiments demonstrate that LaMoGen establishes a new baseline for both interpretability and controllability, outperforming prior methods on our benchmark and two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis.

Keywords: Language to Motion Generation, LLM-Guided Symbolic Inference, LabanLite, Interpretable Motion Synthesis, Text-to-LabanLite-to-Motion, Symbolic Reasoning, LLM Agents, Motion Representation

210. ❌ WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

Authors: Hui Zhang, Juntao Liu, Zongkai Liu, Liqiang Niu, Fandong Meng, Zuxuan Wu, Yu-Gang Jiang Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11593v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

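Scoring rationale: The paper focuses on text-centric image editing and proposes the WeEdit framework, comprising a dataset, benchmarks, and a training strategy. Its core contributions are: 1) a dataset of 330K training pairs plus benchmarks; 2) a two-stage training strategy, with glyph-guided supervised fine-tuning (SFT) in the first stage and multi-objective reinforcement learning in the second. It is therefore highly relevant only to "Post-training OR Supervised Fine-tuning OR SFT" (10 points), because the paper explicitly uses SFT as part of its training strategy. All other keywords are unrelated to the paper's content; it does not address the technical principles of large models, AI-for-science applications, or the other listed techniques.

!!! tip deepseek-chat TL;DR

To address the difficulty existing models have in editing text precisely in text-centric image editing, the paper proposes the WeEdit framework, comprising a large-scale dataset, two benchmarks, and a two-stage training strategy; experiments show it significantly outperforms existing open-source models.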

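Abstract translation

Instruction-based image editing aims to modify specific content in an existing image according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric operations, text-centric image editing focuses on modifying, translating, or rearranging text elements embedded in images. However, existing leading models often struggle to carry out complex text edits precisely and frequently produce blurry or hallucinated characters. We attribute these failures mainly to the lack of a training paradigm tailored to text-centric editing, and to the absence of the large-scale datasets and standardized benchmarks needed for a closed-loop training and evaluation system. To address these limitations, we propose WeEdit, a systematic solution comprising a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline that generates 330K training pairs covering diverse editing operations and 15 languages, complemented by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we use glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage that aligns generated results with instruction following, text clarity, and background preservation. Extensive experiments show that WeEdit clearly outperforms previous open-source models across a variety of editing operations.

Abstract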

Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.

Keywords: text-centric image editing, instruction-based editing, glyph-guided fine-tuning, multi-objective reinforcement learning, HTML-based data construction, benchmark evaluation, supervised fine-tuning, WeEdit framework

211. ❌ R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection

Authors: Zhongyu Xia, Yousen Tang, Yongtao Wang, Zhifeng Wang, Weijun Qin Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11566v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

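Scoring rationale: R4Det focuses on 4D radar-camera fusion for 3D object detection in autonomous driving, proposing three modules (Panoramic Depth Fusion, Deformable Gated Temporal Fusion, and Instance-Guided Dynamic Refinement) to address inaccurate depth estimation, pose dependence, and sparse point clouds in existing methods. All scoring keywords concern large models, the technical principles of deep learning, or AI-for-science applications, whereas this paper studies specific engineering problems in sensor fusion and computer vision. It does not touch on large-model techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address inaccurate depth estimation, pose dependence, and sparse point clouds in 3D object detection with 4D radar-camera fusion, the paper proposes R4Det, which uses Panoramic Depth Fusion, Deformable Gated Temporal Fusion, and Instance-Guided Dynamic Refinement modules to achieve state-of-the-art detection performance on the TJ4DRadSet and VoD datasets.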

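Abstract translation

The 4D radar-camera fusion sensing configuration is becoming increasingly important in autonomous driving. However, existing 3D object detection methods that fuse 4D radar and camera data face several challenges. First, their absolute depth estimation modules are not sufficiently robust or accurate, which leads to imprecise 3D localization. Second, when ego-vehicle pose information is missing or inaccurate, the performance of their temporal fusion modules degrades significantly or fails altogether. Third, for some small objects, the sparse radar point cloud may receive no reflections from their surfaces at all, in which case detection must rely entirely on visual unimodal priors. To address these limitations, we propose the R4Det framework: a Panoramic Depth Fusion module improves depth estimation quality, enabling mutual reinforcement between absolute and relative depth; for temporal fusion, we design a Deformable Gated Temporal Fusion module that does not depend on the ego-vehicle pose; in addition, we build an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection performance on the TJ4DRadSet and VoD datasets.

Abstract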

4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D Radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle’s pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle’s pose. In addition, we built an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.

Keywords: 4D radar-camera fusion, 3D object detection, autonomous driving, depth estimation, temporal fusion, sparse point clouds, Panoramic Depth Fusion, Deformable Gated Temporal Fusion

212. ❌ SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

Authors: Yuyuan Yang, Junkun Hong, Hongrong Wang, Honghao Cai, Xunpeng Ren, Ge Wang, Mingcong Lei, Shenhao Yan, Jiahao Yang, Chengsi Yao, Xi Li, Yiming Zhao, Yatong Han, Jinke Ren Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11563v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	15.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

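Scoring rationale: The paper proposes the SVLL framework for embodied task planning. Its core innovation is the Bias-DPO alignment method, which directly improves on the DPO algorithm, so it is highly relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (15 points). The paper uses vision-language models for embodied planning, which is relevant to "Large Language Models OR LLMs OR Foundation Models", "Instruction Tuning OR Alignment OR Value Alignment", and "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points each). Bias-DPO aims to reduce hallucinations and physical constraint violations, which is relevant to "Hallucination Mitigation OR Factuality OR Truthfulness" (10 points). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the premature temporal binding and optimization instability that vision-language models face in embodied task planning, the paper proposes the three-stage SVLL framework and the Bias-DPO alignment method; on the AI2-THOR benchmark and in real-robot deployments it outperforms existing open-source and closed-source models, raising task success rates and reducing physical constraint violations.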

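Abstract translation

Embodied task planning requires vision-language models to generate action sequences that are visually grounded and causally coherent over time. However, existing training paradigms face a key trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods are limited by optimization instability. To bridge this gap, we propose Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before sequential action history is introduced. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO), namely its purely relative nature: it optimizes only the preference gap between winning and losing trajectories and ignores absolute likelihood constraints on the optimal path, which often leads to unsafe or hallucinated behavior. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing the likelihood of ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL driven by Bias-DPO ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and in real-world robot deployments show that SVLL outperforms both state-of-the-art open-source models (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.

Abstract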

Embodied task planning demands vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically-grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO), its purely relative nature – optimizing only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on optimal path, often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.

Keywords: Embodied Task Planning, Vision-Language Models, Direct Preference Optimization, Bias-DPO, Physical Grounding, Hallucination Mitigation, AI2-THOR, Robotic Deployment
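
The "purely relative" limitation of DPO described in the abstract can be illustrated with a minimal Python sketch. `dpo_loss` follows the standard DPO objective for a single preference pair. `anchored_dpo_loss` is a hypothetical stand-in that simply adds a likelihood term on the winning trajectory, weighted by an assumed coefficient `lam`. It is not the Bias-DPO objective itself, whose exact form (including the penalty on overconfident hallucinations) is not given in the abstract.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winning, losing) trajectory pair.

    logp_* are sequence log-probabilities under the policy being trained;
    ref_logp_* are the same quantities under the frozen reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def anchored_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      beta=0.1, lam=1.0):
    """Hypothetical variant: DPO plus an absolute likelihood term.

    The extra term lam * (-logp_w) is an assumption made for illustration;
    it is not taken from the paper.
    """
    base = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    return base + lam * (-logp_w)


# Lower both policy log-probabilities by 10: the preference gap is unchanged,
# so plain DPO cannot tell the two situations apart.
before = dpo_loss(-5.0, -8.0, -6.0, -7.0)
after = dpo_loss(-15.0, -18.0, -6.0, -7.0)

# The anchored variant penalizes the drop in the winner's likelihood.
anchored_before = anchored_dpo_loss(-5.0, -8.0, -6.0, -7.0)
anchored_after = anchored_dpo_loss(-15.0, -18.0, -6.0, -7.0)
```

In this toy setting `before` and `after` coincide, while `anchored_after` exceeds `anchored_before`, which is the kind of absolute constraint the abstract says standard DPO lacks.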

213. ❌ TornadoNet: Real-Time Building Damage Detection with Ordinal Supervision

Authors: Robinson Umeike, Cuong Pham, Ryan Hausen, Thang Dao, Shane Crawford, Tanya Brown-Giammanco, Gerard Lemson, John van de Lindt, Blythe Johnston, Arik Mitschang, Trung Do Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11557v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

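Scoring rationale: The paper focuses on building damage detection in computer vision, using CNN and Transformer architectures for object detection and classification, and is unrelated to most of the large-model keywords. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the study applies AI to disaster science, but this is not its core content, so it receives 5 points.

!!! tip deepseek-chat TL;DR

The study introduces the TornadoNet benchmark, which compares YOLO and RT-DETR models on post-tornado building damage detection and finds that adding ordinal supervision improves the Transformer model's accuracy in assessing damage severity.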

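Abstract translation

We present TornadoNet, a comprehensive benchmark for automated street-level building damage assessment, designed to evaluate how modern real-time object detection architectures and ordinal-aware supervision strategies perform under realistic post-disaster conditions. TornadoNet provides the first controlled benchmark showing how architecture design and loss function jointly affect multi-level damage detection from street-view imagery, offering methodological insights and deployable tools for disaster response. Using 3,333 high-resolution geotagged images and 8,890 annotated building instances from the 2021 US Midwest tornado outbreak, we systematically compare CNN-based YOLO-family detectors with a transformer-based model (RT-DETR) on multi-level damage detection. All models are trained under a standardized protocol using a five-level classification framework based on the IN-CORE damage-state definitions, validated through expert cross-annotation. Baseline experiments reveal complementary strengths across architectures: the CNN-based YOLO models achieve the highest detection accuracy and throughput, with the larger variants reaching 46.05% mAP@0.5 at 66-276 FPS on A100 GPUs; the transformer-based RT-DETR model shows stronger ordinal consistency, with 88.13% Ordinal Top-1 Accuracy and an MAOE of 0.65, indicating more reliable damage severity grading despite a lower baseline mAP. To align supervision with the ordinal nature of damage severity, we introduce soft ordinal classification targets and evaluate explicit ordinal-distance penalties. The RT-DETR model trained with calibrated ordinal supervision reaches 44.70% mAP@0.5, an improvement of 4.8 percentage points, while also improving on ordinal metrics (91.15% Ordinal Top-1 Accuracy, MAOE = 0.56). These findings confirm that ordinal-aware supervision, when matched to the detector architecture, effectively improves damage severity estimation. Models and data: https://github.com/crumeike/TornadoNet

Abstract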

We present TornadoNet, a comprehensive benchmark for automated street-level building damage assessment evaluating how modern real-time object detection architectures and ordinal-aware supervision strategies perform under realistic post-disaster conditions. TornadoNet provides the first controlled benchmark demonstrating how architectural design and loss formulation jointly influence multi-level damage detection from street-view imagery, delivering methodological insights and deployable tools for disaster response. Using 3,333 high-resolution geotagged images and 8,890 annotated building instances from the 2021 Midwest tornado outbreak, we systematically compare CNN-based detectors from the YOLO family against transformer-based models (RT-DETR) for multi-level damage detection. Models are trained under standardized protocols using a five-level damage classification framework based on IN-CORE damage states, validated through expert cross-annotation. Baseline experiments reveal complementary architectural strengths. CNN-based YOLO models achieve highest detection accuracy and throughput, with larger variants reaching 46.05% mAP@0.5 at 66-276 FPS on A100 GPUs. Transformer-based RT-DETR models exhibit stronger ordinal consistency, achieving 88.13% Ordinal Top-1 Accuracy and MAOE of 0.65, indicating more reliable severity grading despite lower baseline mAP. To align supervision with the ordered nature of damage severity, we introduce soft ordinal classification targets and evaluate explicit ordinal-distance penalties. RT-DETR trained with calibrated ordinal supervision achieves 44.70% mAP@0.5, a 4.8 percentage-point improvement, with gains in ordinal metrics (91.15% Ordinal Top-1 Accuracy, MAOE = 0.56). These findings establish that ordinal-aware supervision improves damage severity estimation when aligned with detector architecture. Model & Data: https://github.com/crumeike/TornadoNet

Keywords: building damage detection, ordinal supervision, object detection, YOLO, RT-DETR, disaster response, computer vision, benchmark dataset
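
The soft ordinal targets and the MAOE metric mentioned in the abstract can be sketched as follows. Both definitions here are assumptions: the abstract does not spell out how the soft targets are built, so the sketch uses a common construction (weights decaying exponentially with ordinal distance, controlled by a hypothetical temperature `tau`), and it reads MAOE as the mean absolute difference between predicted and true damage levels.

```python
import math


def soft_ordinal_target(true_level, num_levels=5, tau=1.0):
    """Soft label distribution that decays with distance from the true level.

    Assumed construction: weight_k = exp(-|k - true_level| / tau), normalized
    to sum to 1. Levels are integers 0 .. num_levels - 1.
    """
    if not 0 <= true_level < num_levels:
        raise ValueError("true_level must be in [0, num_levels)")
    weights = [math.exp(-abs(k - true_level) / tau) for k in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_absolute_ordinal_error(predicted_levels, true_levels):
    """Average absolute distance between predicted and true ordinal levels."""
    if len(predicted_levels) != len(true_levels) or not predicted_levels:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) for p, t in zip(predicted_levels, true_levels))
    return total / len(predicted_levels)


# A building labeled at damage level 2 (of 5): neighboring levels receive
# more target mass than distant ones.
target = soft_ordinal_target(2)

# Predictions that are off by 0, 1 and 2 levels respectively.
maoe = mean_absolute_ordinal_error([0, 2, 4], [0, 1, 2])
```

Compared with a one-hot label, a target of this shape charges a prediction of level 3 less than a prediction of level 0 when the true level is 2, which is the ordinal structure the supervision is meant to respect.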

214. ❌ Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models

Authors: Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li, Nicolo Fusi, Yilun Du, Sham M. Kakade, Carles Domingo-Enrich Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12248v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

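Scoring rationale: The paper's core topic is a fine-tuning method for language models, so it is highly relevant to "Large Language Models" and "Post-training/SFT" (10 points each), since it focuses on language-model fine-tuning techniques. It has some connection to "RLHF/DPO" (5 points), because it proposes a sequence-level optimization method as an alternative to RLHF but does not directly use preference learning. Other keywords such as MoE, quantization, inference acceleration, and AI for science are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes energy-based fine-tuning (EBFT), which optimizes the sequence-level behavior of language models through feature matching; on Q&A coding, unstructured coding, and translation tasks it outperforms supervised fine-tuning (SFT) on downstream accuracy and reaches a lower validation cross-entropy.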

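Abstract translation

Cross-entropy training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT). The method uses strided block-parallel sampling to generate multiple rollouts concurrently from nested prefixes, performs batched feature extraction over these rollouts, and uses the resulting embeddings for an on-policy policy-gradient update. We give a theoretical perspective on the connection between EBFT and KL-regularized feature matching and energy-based modeling. Empirically, on Q&A coding, unstructured coding, and translation tasks, EBFT matches RLVR in downstream accuracy and outperforms supervised fine-tuning, while keeping validation cross-entropy lower than the compared methods.

Abstract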

Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT), which uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. We present a theoretical perspective connecting EBFT to KL-regularized feature-matching and energy-based modeling. Empirically, across Q&A coding, unstructured coding, and translation, EBFT matches RLVR and outperforms SFT on downstream accuracy while achieving a lower validation cross-entropy than both methods.

Keywords: energy-based fine-tuning, feature matching, language model fine-tuning, sequence-level optimization, policy gradient, cross-entropy training, model rollouts, semantic feedback
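
The idea of matching sequence-level statistics rather than individual tokens can be made concrete with a generic moment-matching sketch. This is only an illustration under stated assumptions: it compares the mean feature vector of model rollouts with the mean feature vector of reference completions, and it takes the feature vectors as given. The function names are hypothetical, and the sketch does not reproduce EBFT's actual objective, its strided block-parallel sampling, or its policy-gradient update.

```python
def mean_feature(features):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not features:
        raise ValueError("need at least one feature vector")
    dim = len(features[0])
    count = len(features)
    return [sum(vec[i] for vec in features) / count for i in range(dim)]


def feature_matching_gap(rollout_features, reference_features):
    """Squared Euclidean distance between the two mean feature vectors.

    rollout_features: embeddings of completions sampled from the model.
    reference_features: embeddings of reference completions.
    A gap of 0 means the first moments of the two feature sets agree.
    """
    mu_model = mean_feature(rollout_features)
    mu_ref = mean_feature(reference_features)
    return sum((a - b) ** 2 for a, b in zip(mu_model, mu_ref))


# Toy 2-D embeddings: two rollouts versus one reference completion.
gap = feature_matching_gap([[0.0, 0.0], [2.0, 0.0]], [[1.0, 1.0]])

# Identical feature sets have no gap.
no_gap = feature_matching_gap([[1.0, 2.0]], [[1.0, 2.0]])
```

A quantity of this kind depends on whole completions through their embeddings, which is why it can give feedback on sequence-level behavior without a task-specific verifier.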

215. ❌ Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception

Authors: Xinyu Nan, Ning Wang, Yuyao Zhai, Mei Yang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11556v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

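Scoring rationale: The paper focuses on image aesthetic enhancement, proposing a diffusion-based dual-supervised framework (DIAE) and introducing Multimodal Aesthetic Perception (MAP) and a weakly supervised training method. Although it involves generative models and AI applications, every keyword explicitly targets large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and agents) or specific scientific fields (such as bioinformatics). The paper does not mention language models, the technical principles of LLMs, LLM applications, or AI-for-science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The study addresses vague editing instructions and the lack of perfectly paired data in image aesthetic enhancement by proposing the DIAE model, which combines multimodal aesthetic perception with a dual-branch weakly supervised framework and significantly improves image aesthetic scores and content consistency.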

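Abstract translation

Image aesthetic enhancement aims to perceive aesthetic deficiencies in an image and perform the corresponding editing operations. The task is highly challenging and requires the model to have both creativity and aesthetic perception. Although recent advances in image editing models have markedly improved their controllability and flexibility, they still struggle to enhance image aesthetics. The main challenges are twofold: first, incorporating aesthetic perception while following editing instructions is difficult; second, there is a lack of "perfectly paired" images with consistent content but very different aesthetic quality. This paper proposes the Dual-supervised Image Aesthetic Enhancement model (DIAE), a diffusion-based generative model with multimodal aesthetic perception. First, DIAE introduces a Multimodal Aesthetic Perception module that converts vague aesthetic instructions into explicit guidance by (i) using detailed, standardized aesthetic instructions across multiple aesthetic attributes, and (ii) using multimodal control signals derived from text-image pairs that remain consistent within the same aesthetic attribute. Second, to mitigate the scarcity of "perfectly paired" images, we collect an "imperfectly paired" dataset named IIAEData, which contains images with identical semantics but varying aesthetic quality. To make better use of the weakly matched nature of IIAEData during training, we also introduce a dual-branch supervision framework for weakly supervised image aesthetic enhancement. Experimental results show that DIAE outperforms the baseline models and achieves better image aesthetic scores and image content consistency scores.

Abstract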

Image aesthetic enhancement aims to perceive aesthetic deficiencies in images and perform corresponding editing operations, which is highly challenging and requires the model to possess creativity and aesthetic perception capabilities. Although recent advancements in image editing models have significantly enhanced their controllability and flexibility, they struggle with enhancing image aesthetic. The primary challenges are twofold: first, following editing instructions with aesthetic perception is difficult, and second, there is a scarcity of “perfectly-paired” images that have consistent content but distinct aesthetic qualities. In this paper, we propose Dual-supervised Image Aesthetic Enhancement (DIAE), a diffusion-based generative model with multimodal aesthetic perception. First, DIAE incorporates Multimodal Aesthetic Perception (MAP) to convert the ambiguous aesthetic instruction into explicit guidance by (i) employing detailed, standardized aesthetic instructions across multiple aesthetic attributes, and (ii) utilizing multimodal control signals derived from text-image pairs that maintain consistency within the same aesthetic attribute. Second, to mitigate the lack of “perfectly-paired” images, we collect “imperfectly-paired” dataset called IIAEData, consisting of images with varying aesthetic qualities while sharing identical semantics. To better leverage the weak matching characteristics of IIAEData during training, a dual-branch supervision framework is also introduced for weakly supervised image aesthetic enhancement. Experimental results demonstrate that DIAE outperforms the baselines and obtains superior image aesthetic scores and image content consistency scores.

Keywords: Image Aesthetic Enhancement, Diffusion Models, Multimodal Aesthetic Perception, Weakly Supervised Learning, Dual-branch Supervision, Aesthetic Instruction, Image Editing, Generative Models

216. ❌ Temporal Straightening for Latent Planning

Authors: Ying Wang, Oumayma Bounou, Gaoyue Zhou, Randall Balestriero, Tim G. J. Rudner, Yann LeCun, Mengye Ren Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12231v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

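Scoring rationale: The paper focuses on representation learning in world models, specifically using temporal straightening to improve latent planning. This is highly relevant to the keyword "World Models AND General World Models", because the paper explicitly studies representation learning in world models. However, it does not involve large language models (LLMs), innovations in the technical principles of deep learning (such as MoE, quantization, or inference acceleration), or applications in specific scientific fields (such as bioinformatics). It is mainly concerned with visual encoders, latent trajectories, and planning rather than LLM-related techniques. All keywords other than "World Models" are therefore unrelated.

!!! tip deepseek-chat TL;DR

The paper studies how temporal straightening can improve representation learning in world models, increasing the stability and success rate of latent planning.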

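Abstract translation

Learning good representations is essential for latent planning with world models. Although pretrained visual encoders produce strongly semantic visual features, these features are not tailored to planning and contain information that is irrelevant or even harmful to it. Inspired by the perceptual straightening hypothesis in human visual processing, we introduce temporal straightening to improve representation learning for latent planning. We jointly learn an encoder and a predictor using a curvature regularizer that encourages locally straightened latent trajectories. We show that reducing curvature in this way makes Euclidean distance in the latent space a better proxy for geodesic distance and improves the conditioning of the planning objective. Experiments demonstrate that temporal straightening makes gradient-based planning more stable and significantly increases success rates across a range of goal-reaching tasks.

Abstract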

Learning good representations is essential for latent planning with world models. While pretrained visual encoders produce strong semantic visual features, they are not tailored to planning and contain information irrelevant – or even detrimental – to planning. Inspired by the perceptual straightening hypothesis in human visual processing, we introduce temporal straightening to improve representation learning for latent planning. Using a curvature regularizer that encourages locally straightened latent trajectories, we jointly learn an encoder and a predictor. We show that reducing curvature this way makes the Euclidean distance in latent space a better proxy for the geodesic distance and improves the conditioning of the planning objective. We demonstrate empirically that temporal straightening makes gradient-based planning more stable and yields significantly higher success rates across a suite of goal-reaching tasks.

Keywords: temporal straightening, latent planning, world models, representation learning, curvature regularizer, visual encoders, goal-reaching tasks, geodesic distance
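
What a curvature regularizer on latent trajectories measures can be shown with a small sketch. The abstract does not give the paper's exact formula, so the definition below is an assumption: local curvature is taken as one minus the cosine similarity between consecutive latent displacement vectors, averaged over the trajectory. The function name and the `eps` stabilizer are hypothetical.

```python
import math


def curvature_penalty(latents, eps=1e-8):
    """Mean of (1 - cosine similarity) between consecutive displacements.

    latents: list of latent states z_0, z_1, ..., each a list of floats.
    The value is close to 0 for a straight trajectory and grows as the
    trajectory bends; a 90-degree turn contributes 1.
    """
    if len(latents) < 3:
        raise ValueError("need at least three latent states")
    # Displacement vectors v_t = z_{t+1} - z_t.
    velocities = [
        [b - a for a, b in zip(z0, z1)]
        for z0, z1 in zip(latents, latents[1:])
    ]
    terms = []
    for v0, v1 in zip(velocities, velocities[1:]):
        dot = sum(a * b for a, b in zip(v0, v1))
        norm = (math.sqrt(sum(a * a for a in v0))
                * math.sqrt(sum(b * b for b in v1)))
        terms.append(1.0 - dot / (norm + eps))
    return sum(terms) / len(terms)


# A straight latent path versus one with a right-angle turn.
straight = curvature_penalty([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
bent = curvature_penalty([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
```

On a straight path the Euclidean distance between the endpoints equals the length travelled along the path, which gives some intuition for the abstract's claim that lower curvature makes Euclidean distance a better proxy for geodesic distance.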

217. ❌ STAMP: Selective Task-Aware Mechanism for Text Privacy

Authors: Fengwei Tian, Payel Bhattacharjee, Heidi Hanson, Geoffrey D. Rubin, Joseph Y. Lo, Ravi Tandon Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12237v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

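Scoring rationale: STAMP focuses on text privacy protection. It proposes a task-aware text privatization framework that balances privacy and utility by selectively allocating privacy budgets and perturbing embedding directions with a polar mechanism. Although the work involves embedding representations and downstream tasks in natural language processing, its core content has no direct connection to any of the scoring keywords (which mainly cover large-model principles, training methods, inference optimization, alignment, and agent systems), so every keyword receives a relevance score of 0.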

!!! tip deepseek-chat TL;DR

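STAMP proposes a task-aware text privatization framework that selectively allocates privacy budgets and perturbs embedding directions with a polar mechanism, achieving a better privacy-utility trade-off on the SQuAD, Yelp, and AG News datasets.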

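Abstract (translated)

This paper presents STAMP (Selective Task-Aware Mechanism for Text Privacy), a new task-aware text privatization framework that achieves a better privacy-utility trade-off. STAMP allocates privacy budgets selectively at the token level by jointly considering two dimensions: (i) each token's importance to the downstream task (measured through task- or query-specific representations) and (ii) its privacy sensitivity (for example, names, dates, and identifiers). This token-level partitioning gives fine-grained, group-wise control over the noise strength added to different parts of the input text, balancing privacy protection against task relevance. To privatize individual token embeddings, we propose a polar perturbation mechanism that perturbs only the direction of an embedding vector on the unit sphere while keeping its norm unchanged. Decoding uses cosine nearest-neighbor search, so the perturbation geometry matches the decoding geometry. Unlike isotropic noise mechanisms, the polar mechanism preserves semantic neighborhoods in the embedding space and therefore better retains downstream utility. Experiments on SQuAD, Yelp, and AG News show that STAMP combined with the normalized polar mechanism consistently achieves a better privacy-utility trade-off across different per-token privacy budgets.

Abstract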

We present STAMP (Selective Task-Aware Mechanism for Text Privacy), a new framework for task-aware text privatization that achieves an improved privacy-utility trade-off. STAMP selectively allocates privacy budgets across tokens by jointly considering (i) each token’s importance to the downstream task (as measured via a task- or query-specific representation), and (ii) its privacy sensitivity (e.g., names, dates, identifiers). This token-level partitioning enables fine-grained, group-wise control over the level of noise applied to different parts of the input, balancing privacy protection with task relevance. To privatize individual token embeddings, we introduce the polar mechanism, which perturbs only the direction of embeddings on the unit sphere while preserving their magnitude. Decoding is performed via cosine nearest-neighbor search, aligning the perturbation geometry with the decoding geometry. Unlike isotropic noise mechanisms, the polar mechanism maintains semantic neighborhoods in the embedding space and better preserves downstream utility. Experimental evaluations on SQuAD, Yelp, and AG News datasets demonstrate that STAMP, when combined with the normalized polar mechanism, consistently achieves superior privacy-utility trade-offs across varying per-token privacy budgets.

Keywords: text privacy, privacy-utility trade-off, task-aware privatization, polar mechanism, token-level partitioning, embedding perturbation, cosine nearest-neighbor search, selective privacy budget allocation
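
The abstract above describes perturbing only the direction of a token embedding and decoding with cosine nearest-neighbor search. The sketch below illustrates that geometry under simplifying assumptions: the noise is plain Gaussian noise followed by renormalization, which is not the paper's polar mechanism and carries no formal privacy guarantee, and the embedding table, `noise_scale`, and function names are hypothetical.

```python
import numpy as np

def perturb_direction(e: np.ndarray, noise_scale: float, rng: np.random.Generator) -> np.ndarray:
    """Perturb only the direction of embedding e, keeping its norm unchanged.

    ASSUMPTION: Gaussian noise on the unit direction followed by renormalisation.
    This stands in for the paper's polar mechanism and is NOT differentially private.
    """
    norm = np.linalg.norm(e)
    noisy = e / norm + noise_scale * rng.standard_normal(e.shape)
    noisy /= np.linalg.norm(noisy)      # project back onto the unit sphere
    return norm * noisy                 # restore the original magnitude

def decode_cosine_nn(query: np.ndarray, vocab: np.ndarray) -> int:
    """Return the index of the vocabulary row most cosine-similar to query."""
    vocab_unit = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)
    return int(np.argmax(vocab_unit @ (query / np.linalg.norm(query))))

rng = np.random.default_rng(0)
vocab = rng.standard_normal((50, 16))   # hypothetical embedding table
token_id = 7
private = perturb_direction(vocab[token_id], noise_scale=0.05, rng=rng)
decoded = decode_cosine_nn(private, vocab)
```

A per-token budget scheme in the spirit of the abstract would choose a larger `noise_scale` for tokens flagged as sensitive and a smaller one for tokens that matter to the downstream task.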

218. ❌ Interpreting Contrastive Embeddings in Specific Domains with Fuzzy Rules

Authors: Javier Fumanal-Idocin, Mohammadreza Jamalifard, Javier Andreu-Perez Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12227v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

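Scoring rationale: The paper uses the CLIP model (a contrastively pre-trained model) together with a fuzzy rule system to map and interpret text features, and is mainly concerned with applying a pre-trained model to specific domains and with interpretability. It is unrelated to most keywords (such as LLM, MoE, and RLHF) because it focuses on CLIP rather than large language models. It is related to 'Pre-training' (5 points) because CLIP is a pre-trained model, to 'Mechanistic Interpretability' (5 points) because it studies the interpretation of model features, and to 'AI for Science' (5 points) because it is applied to clinical reports. The other keywords have no direct connection.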

!!! tip deepseek-chat TL;DR

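The paper studies how a fuzzy rule system can interpret the text embeddings of the pre-trained CLIP model in specific domains such as clinical reports and film reviews, and analyzes feature importance and rule associations.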

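Abstract (translated)

Free text remains one of the common ways of recording data in real-world settings such as legal proceedings and medical records. For this reason, the natural language processing community has invested considerable effort in converting such text into structured formats that standard machine learning methods can exploit. One of the most popular methods for embedding text into a vector representation is the Contrastive Language-Image Pre-training model (CLIP), which is trained on both images and text. Although the representations computed by CLIP have been very successful on zero-shot and few-shot learning problems, they still have limitations when applied to specific domains. This study uses a fuzzy rule-based classification system, combined with several standard text processing techniques, to map some of the features of interest into the space created by the CLIP model. We then analyze the resulting rules and associations and discuss the importance of each feature. We apply the approach to two different data domains, clinical reports and film reviews, and compare the results obtained when each is handled separately and when both are handled jointly. Finally, we discuss the limitations of the approach and possible directions for improvement.

Abstract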

Free-style text is still one of the common ways in which data is registered in real environments, like legal procedures and medical records. Because of that, there have been significant efforts in the area of natural language processing to convert these texts into a structured format, which standard machine learning methods can then exploit. One of the most popular methods to embed text into a vectorial representation is the Contrastive Language-Image Pre-training model (CLIP), which was trained using both images and text. Although the representations computed by CLIP have been very successful in zero-shot and few-shot learning problems, they still have problems when applied to a particular domain. In this work, we use a fuzzy rule-based classification system along with some standard text processing techniques to map some of our features of interest to the space created by a CLIP model. Then, we discuss the rules and associations obtained and the importance of each feature considered. We apply this approach in two different data domains, clinical reports and film reviews, and compare the results obtained individually and when considering both. Finally, we discuss the limitations of this approach and how it could be further improved.

Keywords: CLIP, fuzzy rule-based classification, text embedding, domain adaptation, interpretability, clinical reports, film reviews, feature importance

219. ❌ Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models

Authors: Jae-Won Chung, Jeff J. Ma, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, Mosharaf Chowdhury Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12118v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

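Scoring rationale: The paper focuses on the design and implementation of Cornserve, a distributed serving system for efficiently serving Any-to-Any multimodal models. Although it involves multimodal models (which may include large models), its core contributions are the system architecture, task abstraction, distributed runtime, and performance optimization, not innovations in large-model or deep learning principles, training methods, alignment, inference acceleration, or scientific applications. The keywords all concern large-model principles, training and alignment methods, inference optimization, and AI applications in science, none of which relates directly to this paper's systems-engineering focus, so every keyword is scored 0.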

!!! tip deepseek-chat TL;DR

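This paper presents Cornserve, a distributed serving system for Any-to-Any multimodal models. Through a flexible task abstraction and an efficient record-and-replay execution model, it disaggregates components and scales them independently, delivering up to 3.81× higher throughput and up to 5.79× lower tail latency.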

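Abstract (translated)

Any-to-Any models are an emerging class of multimodal models that accept arbitrary combinations of multimodal data (such as text, images, video, and audio) as input and generate such combinations as output. Serving these models is challenging: requests differ in their input and output modalities and therefore take different paths through the model computation graph, and each component of the model has different scaling characteristics.

We present Cornserve, a distributed serving system for generic Any-to-Any models. Cornserve provides a flexible task abstraction for expressing the computation graph of an Any-to-Any model, enabling component disaggregation and independent scaling. Its distributed runtime dispatches computation to the data plane through an efficient record-and-replay execution model that tracks data dependencies and forwards tensor data directly between producer and consumer components. Built on Kubernetes with roughly 23,000 new lines of Python, Cornserve supports a variety of Any-to-Any models and delivers up to 3.81× higher throughput and up to 5.79× lower tail latency. Cornserve is open source, and a demo video is available on YouTube.

Abstract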

Any-to-Any models are an emerging class of multimodal models that accept combinations of multimodal data (e.g., text, image, video, audio) as input and generate them as output. Serving these models is challenging; different requests with different input and output modalities traverse different paths through the model computation graph, and each component of the model has different scaling characteristics. We present Cornserve, a distributed serving system for generic Any-to-Any models. Cornserve provides a flexible task abstraction for expressing Any-to-Any model computation graphs, enabling component disaggregation and independent scaling. The distributed runtime dispatches compute to the data plane via an efficient record-and-replay execution model that keeps track of data dependencies, and forwards tensor data between components directly from the producer to the consumer. Built on Kubernetes with approximately 23K new lines of Python, Cornserve supports diverse Any-to-Any models and delivers up to 3.81$\times$ higher throughput and 5.79$\times$ lower tail latency. Cornserve is open-source, and the demo video is available on YouTube.

Keywords: Any-to-Any models, multimodal models, distributed serving system, task abstraction, component disaggregation, record-and-replay execution, throughput, tail latency

220. ❌ Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics

Authors: Ming-Hong Chen, Kuan-Chen Pan, You-De Huang, Xi Liu, Ping-Chun Hsieh Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12087v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

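Scoring rationale: The paper focuses on cross-domain reinforcement learning (CDRL) and proposes a method called QAvatar, which addresses cross-domain knowledge transfer through cross-domain Bellman consistency and a hybrid critic. The scoring keywords all relate directly to large language models, deep learning principles, or AI applications in science, whereas this paper studies cross-domain transfer in reinforcement learning and has no direct connection to them. It does not involve large models, deep learning principles, or AI applications in scientific fields such as bioinformatics or cheminformatics, so every keyword receives a relevance score of 0.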

!!! tip deepseek-chat TL;DR

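The paper proposes QAvatar, a cross-domain reinforcement learning method that uses cross-domain Bellman consistency and a hybrid critic to handle differing state and action spaces and to identify transferability during cross-domain knowledge transfer, and validates its effectiveness on several RL benchmark tasks.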

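Abstract (translated)

Cross-domain reinforcement learning (CDRL) aims to improve the data efficiency of reinforcement learning by using data samples collected from a source domain to aid learning in a similar target domain. Despite its potential, cross-domain transfer in reinforcement learning is known to face two fundamental and intertwined challenges: (i) the source and target domains may have different state or action spaces, which makes direct transfer infeasible and calls for more sophisticated inter-domain mappings; (ii) the transferability of a source-domain model is hard to determine a priori, so CDRL is prone to negative effects during transfer. This paper proposes to tackle both challenges jointly through the lens of cross-domain Bellman consistency and a hybrid critic. Specifically, we first introduce cross-domain Bellman consistency as a way to measure the transferability of a source-domain model. We then propose $Q$Avatar, which combines the source-domain and target-domain Q functions through an adaptive, hyperparameter-free weight function. With this design, we characterize the convergence behavior of $Q$Avatar and show that it achieves reliable transfer, in the sense that it effectively leverages the source-domain Q function to transfer knowledge to the target domain. Experiments show that $Q$Avatar achieves favorable transferability across a variety of reinforcement learning benchmark tasks, including locomotion and robot arm manipulation. Our code is available at https://rl-bandits-lab.github.io/Cross-Domain-RL/.

Abstract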

Cross-domain reinforcement learning (CDRL) is meant to improve the data efficiency of RL by leveraging the data samples collected from a source domain to facilitate the learning in a similar target domain. Despite its potential, cross-domain transfer in RL is known to have two fundamental and intertwined challenges: (i) The source and target domains can have distinct state or action spaces, and this makes direct transfer infeasible and thereby requires more sophisticated inter-domain mappings; (ii) The transferability of a source-domain model in RL is not easily identifiable a priori, and hence CDRL can be prone to negative effects during transfer. In this paper, we propose to jointly tackle these two challenges through the lens of cross-domain Bellman consistency and a hybrid critic. Specifically, we first introduce the notion of cross-domain Bellman consistency as a way to measure transferability of a source-domain model. Then, we propose $Q$Avatar, which combines the Q functions from both the source and target domains with an adaptive hyperparameter-free weight function. Through this design, we characterize the convergence behavior of $Q$Avatar and show that $Q$Avatar achieves reliable transfer in the sense that it effectively leverages a source-domain Q function for knowledge transfer to the target domain. Through experiments, we demonstrate that $Q$Avatar achieves favorable transferability across various RL benchmark tasks, including locomotion and robot arm manipulation. Our code is available at https://rl-bandits-lab.github.io/Cross-Domain-RL/.

Keywords: Cross-domain reinforcement learning, Bellman consistency, Hybrid critic, QAvatar, Knowledge transfer, State space, Action space, Transferability
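
The abstract above describes combining a source-domain and a target-domain Q function with an adaptive, hyperparameter-free weight informed by Bellman consistency. The tabular sketch below is a hypothetical illustration of that idea only: the weighting rule (inverse to each Q function's Bellman error on target-domain transitions) is an assumption, not the weight function defined in the paper, and the source Q table is assumed to be already mapped into the target state and action spaces.

```python
import numpy as np

def bellman_error(q: np.ndarray, transitions, gamma: float) -> float:
    """Mean squared one-step Bellman residual of a tabular Q over (s, a, r, s_next) tuples."""
    residuals = [(r + gamma * np.max(q[s2]) - q[s, a]) ** 2 for s, a, r, s2 in transitions]
    return float(np.mean(residuals))

def hybrid_q(q_src: np.ndarray, q_tgt: np.ndarray, transitions, gamma: float, eps: float = 1e-12):
    """Blend two Q tables; the one with lower Bellman error on target data gets more weight.

    HYPOTHETICAL weighting rule for illustration; not the paper's weight function.
    """
    e_src = bellman_error(q_src, transitions, gamma)
    e_tgt = bellman_error(q_tgt, transitions, gamma)
    w_src = (e_tgt + eps) / (e_src + e_tgt + 2.0 * eps)
    return w_src * q_src + (1.0 - w_src) * q_tgt, w_src

# Toy target-domain data: 2 states, 2 actions.
transitions = [(0, 1, 1.0, 1), (1, 0, 0.0, 1)]
q_good = np.array([[0.0, 1.0], [0.0, 0.0]])   # Bellman-consistent on these transitions
q_bad = np.full((2, 2), 5.0)                  # inconsistent
blended, w_src = hybrid_q(q_good, q_bad, transitions, gamma=0.5)
```

When the source Q table is consistent with target-domain transitions, the blend leans on it; when it is not, the weight shifts toward the target critic, which is the qualitative behavior a safeguard against negative transfer needs.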

221. ❌ Wasserstein Gradient Flows for Batch Bayesian Optimal Experimental Design

Author: Louis Sharrock Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12102v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

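Scoring rationale: The paper focuses on optimization methods for Bayesian optimal experimental design (BOED), in particular maximizing the expected information gain (EIG) in the batch setting. It proposes a computational framework based on probabilistic lifting, Wasserstein gradient flows, and particle algorithms. The keywords all relate directly to large models, deep learning principles, or their applications in science, but the paper does not involve large models, deep learning, or related AI techniques (such as LLMs, MoE, training methods, inference optimization, or agent systems). The only slightly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': BOED is a scientific-computing method and could be seen as a potential application of AI in science, but the paper does not explicitly mention AI, so it receives 5 points (some relevance). All other keywords are entirely unrelated and receive 0 points.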

!!! tip deepseek-chat TL;DR

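To address the difficulty of optimizing the expected information gain in batch Bayesian optimal experimental design, the paper proposes a new approach based on probabilistic lifting and Wasserstein gradient flows and develops scalable particle algorithms; numerical experiments demonstrate its ability to handle multimodal optimization landscapes and to find high-utility batch designs.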

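Abstract (translated)

Bayesian optimal experimental design (BOED) provides a powerful decision-theoretic framework for selecting experiments that maximize the expected utility of the data to be collected. In practice, however, its applicability is often limited by the difficulty of optimizing the chosen utility function. The expected information gain (EIG), for example, is often high-dimensional and strongly non-convex. This challenge is especially acute in the batch setting, where multiple experiments must be designed simultaneously. This paper proposes a new approach to batch EIG-based BOED that probabilistically lifts the original optimization problem to the space of probability measures. Specifically, we propose optimizing an entropy-regularized objective of the expected utility over the space of design measures. Under mild conditions, we show that this objective has a unique minimizer that can be characterized explicitly as a Gibbs distribution. The resulting design law can be used directly as a randomized batch-design policy, or as a computational relaxation from which a deterministic batch is extracted. To obtain scalable approximations for large batch sizes, we further consider two tractable restrictions of the full batch distribution: a mean-field family and an i.i.d. product family. For the i.i.d. objective (and formally for its mean-field extension), we derive the corresponding Wasserstein gradient flow, characterize its long-time behavior, and obtain particle-based algorithms through space-time discretization. We also propose doubly stochastic variants that combine interacting particle updates with Monte Carlo estimators of the EIG gradient. Finally, several numerical experiments illustrate the performance of the proposed methods, demonstrating their ability to explore multimodal optimization landscapes and to obtain high-utility batches in challenging examples.

Abstract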

Bayesian optimal experimental design (BOED) provides a powerful, decision-theoretic framework for selecting experiments so as to maximise the expected utility of the data to be collected. In practice, however, its applicability can be limited by the difficulty of optimising the chosen utility. The expected information gain (EIG), for example, is often high-dimensional and strongly non-convex. This challenge is particularly acute in the batch setting, where multiple experiments are to be designed simultaneously. In this paper, we introduce a new approach to batch EIG-based BOED via a probabilistic lifting of the original optimisation problem to the space of probability measures. In particular, we propose to optimise an entropic regularisation of the expected utility over the space of design measures. Under mild conditions, we show that this objective admits a unique minimiser, which can be explicitly characterised in the form of a Gibbs distribution. The resulting design law can be used directly as a randomised batch-design policy, or as a computational relaxation from which a deterministic batch is extracted. To obtain scalable approximations when the batch size is large, we then consider two tractable restrictions of the full batch distribution: a mean-field family, and an i.i.d. product family. For the i.i.d. objective, and formally for its mean-field extension, we derive the corresponding Wasserstein gradient flow, characterise its long-time behaviour, and obtain particle-based algorithms via space-time discretisations. We also introduce doubly stochastic variants that combine interacting particle updates with Monte Carlo estimators of the EIG gradient. Finally, we illustrate the performance of the proposed methods in several numerical experiments, demonstrating their ability to explore multimodal optimisation landscapes and obtain high-utility batches in challenging examples.

Keywords: Bayesian optimal experimental design, expected information gain, batch design, Wasserstein gradient flow, probabilistic lifting, entropic regularization, particle-based algorithms, multimodal optimization
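
The abstract above states that the entropically regularised objective over design measures has a unique minimiser in Gibbs form. As a worked illustration, and assuming (this notation is not taken from the paper) that the regulariser is a Kullback-Leibler divergence to a reference measure $\pi$ on the design space $\mathcal{D}$ with temperature $\varepsilon > 0$ and utility $U$, the standard Gibbs variational principle gives:

$$
\mu^\star
= \arg\min_{\mu \in \mathcal{P}(\mathcal{D})}
\left\{ -\int_{\mathcal{D}} U(d)\,\mu(\mathrm{d}d) + \varepsilon\,\mathrm{KL}(\mu \,\|\, \pi) \right\},
\qquad
\mu^\star(\mathrm{d}d)
= \frac{\exp\!\big(U(d)/\varepsilon\big)}{\int_{\mathcal{D}} \exp\!\big(U(d')/\varepsilon\big)\,\pi(\mathrm{d}d')}\,\pi(\mathrm{d}d).
$$

Under this assumed form, $\mu^\star$ concentrates on the maximisers of $U$ as $\varepsilon \to 0$, which is why sampling from it can serve either as a randomised design policy or as a relaxation from which a deterministic batch is extracted.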

222. ❌ Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference

Authors: Valentyn Melnychuk, Vahid Balazadeh, Stefan Feuerriegel, Rahul G. Krishnan Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12037v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

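Scoring rationale: The paper's core topic is the use of foundation models (PFNs) for causal inference. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) because PFNs fall within the category of foundation models. The paper explicitly frames causal inference as an in-context learning problem, so it is highly relevant to 'In-context Learning OR Many-shot Learning' (10 points). It touches on applications of causal inference in science and so has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), although it does not specifically address bioinformatics or cheminformatics. Other keywords such as MoE, SFT, RAG, and quantization are not covered in the paper and receive 0 points.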

!!! tip deepseek-chat TL;DR

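The paper studies the lack of frequentist consistency when foundation models based on prior-data fitted networks (PFNs) estimate the average treatment effect (ATE) in causal inference. It proposes a calibration method based on a one-step posterior correction (OSPC), implemented by tailoring martingale posteriors to recover functional nuisance posteriors, which restores frequentist consistency and, in experiments, yields ATE uncertainty estimates that asymptotically match frequentist uncertainty.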

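Abstract (translated)

Foundation models based on prior-data fitted networks (PFNs) have shown strong empirical performance by framing causal inference as an in-context learning problem. However, it remains unclear whether PFN-based causal estimators provide uncertainty quantification consistent with classical frequentist estimators. This study fills that gap by analyzing the frequentist consistency of PFN-based estimators of the average treatment effect (ATE). (1) We show that existing PFNs, when interpreted as Bayesian ATE estimators, can exhibit prior-induced confounding bias: the prior is not asymptotically overwritten by the data, which prevents frequentist consistency. (2) As a remedy, we propose a calibration procedure based on a one-step posterior correction (OSPC). We show that OSPC helps restore frequentist consistency and yields a semi-parametric Bernstein-von Mises theorem for calibrated PFNs (that is, as the data grow, the calibrated PFN-based estimators and the classical semi-parametric efficient estimators converge in distribution). (3) Finally, we implement OSPC by constructing martingale posteriors on top of PFNs. This lets us recover from the PFNs the functional nuisance posteriors that OSPC requires. In multiple (semi-)synthetic experiments, PFNs calibrated with the proposed martingale-posterior OSPC produce ATE uncertainty that (i) asymptotically matches frequentist uncertainty and (ii) is better calibrated in finite samples than that of other Bayesian ATE estimators.

Abstract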

Foundation models based on prior-data fitted networks (PFNs) have shown strong empirical performance in causal inference by framing the task as an in-context learning problem. However, it is unclear whether PFN-based causal estimators provide uncertainty quantification that is consistent with classical frequentist estimators. In this work, we address this gap by analyzing the frequentist consistency of PFN-based estimators for the average treatment effect (ATE). (1) We show that existing PFNs, when interpreted as Bayesian ATE estimators, can exhibit prior-induced confounding bias: the prior is not asymptotically overwritten by data, which, in turn, prevents frequentist consistency. (2) As a remedy, we suggest employing a calibration procedure based on a one-step posterior correction (OSPC). We show that the OSPC helps to restore frequentist consistency and can yield a semi-parametric Bernstein-von Mises theorem for calibrated PFNs (i.e., both the calibrated PFN-based estimators and the classical semi-parametric efficient estimators converge in distribution with growing data size). (3) Finally, we implement OSPC through tailoring martingale posteriors on top of the PFNs. In this way, we are able to recover functional nuisance posteriors from PFNs, required by the OSPC. In multiple (semi-)synthetic experiments, PFNs calibrated with our martingale posterior OSPC produce ATE uncertainty that (i) asymptotically matches frequentist uncertainty and (ii) is well calibrated in finite samples in comparison to other Bayesian ATE estimators.

Keywords: Foundation models, Prior-data fitted networks, Causal inference, Frequentist consistency, Average treatment effect, In-context learning, One-step posterior correction, Martingale posteriors
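
The abstract above builds on a one-step correction for the average treatment effect. The sketch below shows the classical frequentist one-step (AIPW) estimator, which is the textbook analogue rather than the paper's posterior version (OSPC): a plug-in estimate from outcome models is corrected by the sample mean of the efficient influence function. The synthetic data, the deliberately biased outcome model, and all names are assumptions for illustration.

```python
import numpy as np

def one_step_ate(y, a, mu0, mu1, pi, clip=1e-3):
    """Classical one-step (AIPW) estimate of the average treatment effect.

    y: outcomes; a: binary treatment; mu0, mu1: outcome-model predictions
    under control and treatment; pi: estimated propensity scores.
    Returns (estimate, standard error). Not the paper's OSPC procedure.
    """
    pi = np.clip(pi, clip, 1.0 - clip)
    plug_in = mu1 - mu0
    correction = a * (y - mu1) / pi - (1 - a) * (y - mu0) / (1.0 - pi)
    psi = plug_in + correction
    return float(np.mean(psi)), float(np.std(psi, ddof=1) / np.sqrt(len(y)))

# Synthetic data with a true ATE of 2.0 and confounding through x.
rng = np.random.default_rng(0)
n = 20_000
x = rng.standard_normal(n)
pi_true = 1.0 / (1.0 + np.exp(-x))
a = rng.binomial(1, pi_true)
y = 2.0 * a + x + rng.standard_normal(n)

# Outcome model for the treated arm is biased by -0.5; the propensity is correct.
mu0, mu1 = x, x + 1.5
plug_in_only = float(np.mean(mu1 - mu0))
estimate, std_err = one_step_ate(y, a, mu0, mu1, pi_true)
```

In this setup the plug-in estimate alone inherits the bias of the outcome model, while the corrected estimate recovers the true effect because the propensity scores are correct; this double robustness is the property the one-step correction is designed to provide.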

223. ❌ Efficient Generative Modeling with Unitary Matrix Product States Using Riemannian Optimization

Authors: Haotong Duan, Zhongming Chen, Ngai Wong Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12026v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

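Scoring rationale: The paper studies matrix product states (MPS) for generative modeling and proposes a Riemannian optimization method to improve training efficiency. Its core is tensor networks and generative models, which have no direct connection to any of the scoring keywords (all of which center on large models, deep learning principles, and their applications). The keywords mainly cover specific techniques such as large language models, fine-tuning, alignment, reasoning, and agents, whereas this paper belongs to basic machine learning and tensor networks and does not involve large models or novel applications of deep learning in science.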

!!! tip deepseek-chat TL;DR

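The paper studies generative modeling with unitary matrix product states and uses Riemannian optimization to overcome the inefficiency of standard gradient-based training, achieving efficient and stable learning on the Bars-and-Stripes and EMNIST datasets.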

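Abstract (translated)

Tensor networks, originally developed to describe complex quantum many-body systems, have recently emerged as a powerful framework for capturing high-dimensional probability distributions with strong physical interpretability. This paper systematically studies matrix product states (MPS) for generative modeling and shows that unitary MPS, a tensor-network architecture that is both simple and expressive, offers clear advantages for unsupervised learning by reducing ambiguity in parameter updates and improving efficiency. To overcome the inefficiency of standard gradient-based MPS training, we propose a Riemannian optimization approach that casts probabilistic modeling as an optimization problem with manifold constraints, and we further derive an efficient space-decoupling algorithm. Experiments on the Bars-and-Stripes and EMNIST datasets show that the method adapts quickly to data structure, updates stably, and performs strongly while retaining the efficiency and expressive power of MPS.

Abstract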

Tensor networks, which were originally developed for characterizing complex quantum many-body systems, have recently emerged as a powerful framework for capturing high-dimensional probability distributions with strong physical interpretability. This paper systematically studies matrix product states (MPS) for generative modeling and shows that unitary MPS, which is a tensor-network architecture that is both simple and expressive, offers clear benefits for unsupervised learning by reducing ambiguity in parameter updates and improving efficiency. To overcome the inefficiency of standard gradient-based MPS training, we develop a Riemannian optimization approach that casts probabilistic modeling as an optimization problem with manifold constraints, and further derive an efficient space-decoupling algorithm. Experiments on Bars-and-Stripes and EMNIST datasets demonstrate fast adaptation to data structure, stable updates, and strong performance while maintaining the efficiency and expressive power of MPS.

Keywords: Matrix Product States, Generative Modeling, Unitary MPS, Riemannian Optimization, Tensor Networks, Probabilistic Modeling, Unsupervised Learning, Manifold Constraints
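
The abstract above casts training as optimization under manifold constraints. The sketch below shows one generic Riemannian gradient-descent step on the unitary group: the Euclidean gradient is projected onto the tangent space, and the update is mapped back to the manifold with a QR retraction. This is a textbook construction on a toy objective, assumed here for illustration; it does not reproduce the paper's space-decoupling algorithm or its MPS likelihood.

```python
import numpy as np

def riemannian_step(u: np.ndarray, euclid_grad: np.ndarray, lr: float) -> np.ndarray:
    """One Riemannian gradient-descent step on the unitary group U(n).

    Tangent vectors at u have the form u @ S with S skew-Hermitian, so the
    Euclidean gradient is projected by taking the skew-Hermitian part of u^H G.
    """
    a = u.conj().T @ euclid_grad
    skew = 0.5 * (a - a.conj().T)
    riem_grad = u @ skew
    q, r = np.linalg.qr(u - lr * riem_grad)   # QR retraction back to U(n)
    d = np.diag(r)
    return q * (d / np.abs(d))                # fix column phases so R has a positive diagonal

def loss(u: np.ndarray, target: np.ndarray) -> float:
    """Toy objective: squared Frobenius distance to a target unitary."""
    return float(np.linalg.norm(u - target) ** 2)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
target, _ = np.linalg.qr(z)                   # random unitary target
u = np.eye(4, dtype=complex)
initial_loss = loss(u, target)
for _ in range(100):
    u = riemannian_step(u, 2.0 * (u - target), lr=0.05)
final_loss = loss(u, target)
```

Because every iterate is retracted onto the manifold, the matrix stays unitary throughout training without a separate re-orthogonalization pass or penalty term.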

224. ❌ Deep Learning-Based Metamodeling of Nonlinear Stochastic Dynamic Systems under Parametric and Predictive Uncertainty

Authors: Haimiti Atila, Seymour M. J. Spence Journal/Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12012v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

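Scoring rationale: The paper uses deep learning (MLP, MPNN, AE, LSTM) for metamodeling of nonlinear stochastic dynamic systems, addressing computational cost and uncertainty quantification in structural engineering. It is unrelated to nearly all keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), because those keywords refer specifically to large language models (LLMs) and related techniques, whereas this work applies conventional deep learning to an engineering problem. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies deep learning to structural engineering, which can be counted as a science/engineering domain, but this is an application of existing methods rather than a core innovation, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The study proposes three deep-learning metamodeling frameworks (MLP-LSTM, MPNN-LSTM, AE-LSTM) for efficient prediction of nonlinear stochastic dynamic structural systems under parametric and predictive uncertainty, and validates their low prediction error and uncertainty-quantification capability on two case studies.

Abstract (Chinese translation)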

对自然灾害作用下的高维非线性动态结构系统进行建模存在巨大的计算挑战，尤其是在同时考虑外部荷载与结构参数不确定性的情况下。现有研究已成功纳入自然灾害相关外部荷载的不确定性，但极少有工作能在考虑神经网络预测不确定性的同时，兼顾结构系统中的荷载与参数不确定性。为填补这些空白，本文构建了三种元建模框架，每个框架均通过多层感知机（MLP）、消息传递神经网络（MPNN）或自编码器（AE）与长短期记忆（LSTM）网络耦合实现特征提取，并采用蒙特卡洛丢弃法和负对数似然损失函数进行训练。所得到的架构（MLP-LSTM、MPNN-LSTM和AE-LSTM）通过两个案例研究进行验证：一个多自由度Bouc-Wen系统和一个37层纤维离散化非线性钢弯矩抵抗框架，两者均承受随机地震激励及结构参数不确定性的影响。三种方法均实现了较低的预测误差：对于较低维度的Bouc-Wen系统，MLP-LSTM获得了最精确的结果；而对于更复杂的钢框架模型，MPNN-LSTM与AE-LSTM表现出更优的性能。此外，预测方差与实际误差之间的一致性关联证实了这些框架适用于主动学习策略，并可用于评估结构响应预测中的模型置信度。

Abstract

Modeling high-dimensional, nonlinear dynamic structural systems under natural hazards presents formidable computational challenges, especially when simultaneously accounting for uncertainties in external loads and structural parameters. Studies have successfully incorporated uncertainties related to external loads from natural hazards, but few have simultaneously addressed loading and parameter uncertainties within structural systems while accounting for prediction uncertainty of neural networks. To address these gaps, three metamodeling frameworks were formulated, each coupling a feature-extraction module implemented through a multi-layer perceptron (MLP), a message-passing neural network (MPNN), or an autoencoder (AE) with a long short-term memory (LSTM) network using Monte Carlo dropout and a negative log-likelihood loss. The resulting architectures (MLP-LSTM, MPNN-LSTM, and AE-LSTM) were validated on two case studies: a multi-degree-of-freedom Bouc-Wen system and a 37-story fiber-discretized nonlinear steel moment-resisting frame, both subjected to stochastic seismic excitation and structural parameter uncertainty. All three approaches achieved low prediction errors: the MLP-LSTM yielded the most accurate results for the lower-dimensional Bouc-Wen system, whereas the MPNN-LSTM and AE-LSTM provided superior performance on the more complex steel-frame model. Moreover, a consistent correlation between predictive variance and actual error confirms the suitability of these frameworks for active-learning strategies and for assessing model confidence in structural response predictions.

Keywords: deep learning, metamodeling, nonlinear stochastic dynamic systems, parametric uncertainty, predictive uncertainty, Monte Carlo dropout, LSTM, structural engineering
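The abstract above combines Monte Carlo dropout with a negative log-likelihood loss. As a rough illustration of how those two pieces are commonly used together, here is a minimal pure-Python sketch. It is not the authors' architecture: `toy_stochastic_model`, its weights, the fixed aleatoric variance of 0.05, and the helper names are all hypothetical stand-ins chosen for the example.

```python
import math
import random


def gaussian_nll(y, mean, var):
    """Negative log-likelihood of a scalar y under N(mean, var)."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def toy_stochastic_model(x, rng, drop_p=0.2):
    """Hypothetical stand-in for a dropout network.

    Returns (predicted mean, predicted aleatoric variance). Two "hidden
    units" are dropped at random, so repeated calls give different means.
    """
    weights = [0.6, 0.4]
    kept = [w for w in weights if rng.random() >= drop_p]
    # Inverted-dropout scaling keeps the expected output unchanged.
    mean = sum(kept) * x / (1.0 - drop_p)
    return mean, 0.05


def mc_dropout_predict(x, n_samples=200, seed=0):
    """Average several stochastic forward passes with dropout left on."""
    rng = random.Random(seed)
    draws = [toy_stochastic_model(x, rng) for _ in range(n_samples)]
    means = [m for m, _ in draws]
    mu = sum(means) / n_samples
    # Spread of the sampled means approximates model (epistemic) uncertainty.
    epistemic = sum((m - mu) ** 2 for m in means) / n_samples
    # Average predicted variance approximates data (aleatoric) uncertainty.
    aleatoric = sum(v for _, v in draws) / n_samples
    return mu, epistemic + aleatoric


mu, var = mc_dropout_predict(2.0)
loss_at_target = gaussian_nll(2.0, mu, var)
```

The total predictive variance returned here is the kind of quantity the abstract relates to actual prediction error when it discusses active learning and model confidence.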

225. ❌ Decentralized Orchestration Architecture for Fluid Computing: A Secure Distributed AI Use Case

Authors: Diego Cajaraville-Aboy, Ana Fernández-Vilas, Rebeca P. Díaz-Redondo, Manuel Fernández-Veiga, Pablo Picallo-López Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12001v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

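Scoring rationale: The paper focuses on distributed AI, fluid computing, multi-domain orchestration architecture, and security hardening for Decentralized Federated Learning (DFL), particularly against Byzantine threats. All of the given keywords relate directly to large language models (LLMs) and associated techniques (training, alignment, reasoning, applications, and so on). The paper does not discuss LLMs, MoE, SLMs, scaling laws, pre-training/post-training, alignment techniques (RLHF/DPO), parameter-efficient fine-tuning (PEFT/LoRA), RAG, context extension, attention optimization, reasoning methods (CoT/System 2/MCTS), self-improvement, agents, tool use, multi-agent systems, model compression, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for science (bioinformatics/cheminformatics). Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a multi-domain orchestration architecture for fluid computing environments that supports decentralized coordination and intent-driven deployment, and validates it through a decentralized federated learning use case with enhanced Byzantine security.

Abstract (Chinese translation)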

分布式人工智能与物联网应用日益频繁地在跨越终端设备、边缘/雾基础设施及云平台的异构资源上执行，这些资源常隶属于不同的管理域。流体计算作为一种新兴的范式，通过将此类资源视为统一的计算织物，依据应用需求驱动实现与服务无关的最优部署，从而增强跨计算连续体的大规模资源管理能力。然而，现有解决方案在很大程度上仍采用集中式架构，且通常未明确考虑多域管理问题。本文提出一种面向流体计算环境的、与具体技术无关的多域编排架构。该编排平面支持各域之间进行去中心化协调，在保持本地自治的同时，共同实现租户基于意图的部署请求，确保端到端的服务安置与执行。为此，该架构将域侧控制服务提升为一等能力，以支持运行时应用级功能增强。作为一个代表性用例，我们研究了拜占庭威胁下的多域去中心化联邦学习部署场景。通过引入FU-HST——一种支持软件定义网络的多域异常检测机制，我们利用域侧能力来增强拜占庭安全性，该机制与拜占庭鲁棒性聚合方法形成互补。我们通过在单域及多域设置下的仿真验证了该方案，并对异常检测效果、去中心化联邦学习性能以及计算/通信开销进行了评估。

Abstract

Distributed AI and IoT applications increasingly execute across heterogeneous resources spanning end devices, edge/fog infrastructure, and cloud platforms, often under different administrative domains. Fluid Computing has emerged as a promising paradigm for enhancing massive resource management across the computing continuum by treating such resources as a unified fabric, enabling optimal service-agnostic deployments driven by application requirements. However, existing solutions remain largely centralized and often do not explicitly address multi-domain considerations. This paper proposes an agnostic multi-domain orchestration architecture for fluid computing environments. The orchestration plane enables decentralized coordination among domains that maintain local autonomy while jointly realizing intent-based deployment requests from tenants, ensuring end-to-end placement and execution. To this end, the architecture elevates domain-side control services as first-class capabilities to support application-level enhancement at runtime. As a representative use case, we consider a multi-domain Decentralized Federated Learning (DFL) deployment under Byzantine threats. We leverage domain-side capabilities to enhance Byzantine security by introducing FU-HST, an SDN-enabled multi-domain anomaly detection mechanism that complements Byzantine-robust aggregation. We validate the approach via simulation in single- and multi-domain settings, evaluating anomaly detection, DFL performance, and computation/communication overhead.

Keywords: Fluid Computing, Multi-domain Orchestration, Decentralized Coordination, Distributed AI, Decentralized Federated Learning, Byzantine Security, Anomaly Detection, SDN-enabled
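The abstract above says FU-HST complements Byzantine-robust aggregation but does not name a specific aggregator. To show what such an aggregator can look like, the sketch below uses the coordinate-wise median, a standard robust rule. This is an assumed example only: it is not FU-HST, and it may not be the rule used in the paper. The function name `coordinate_median` and the toy updates are hypothetical.

```python
from statistics import median


def coordinate_median(updates):
    """Aggregate equally sized update vectors by the median of each coordinate."""
    if not updates:
        raise ValueError("need at least one update")
    dim = len(updates[0])
    if any(len(u) != dim for u in updates):
        raise ValueError("all updates must have the same length")
    return [median(u[i] for u in updates) for i in range(dim)]


# Three honest peers send similar updates; one Byzantine peer sends an outlier.
honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1]]
byzantine = [[100.0, -100.0]]

aggregated = coordinate_median(honest + byzantine)
# For comparison: a plain mean is pulled far away by the single outlier.
naive_mean = [sum(col) / len(col) for col in zip(*(honest + byzantine))]
```

A median-style rule bounds the influence of a minority of corrupted updates, whereas an anomaly detector such as the one described in the abstract would act earlier by flagging suspicious peers.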

226. ❌ On-Average Stability of Multipass Preconditioned SGD and Effective Dimension

Authors: Simon Vary, Tyler Farghly, Ilja Kuzborskij, Patrick Rebeschini Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11989v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

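Scoring rationale: The paper studies the generalization ability of multipass Preconditioned Stochastic Gradient Descent (PSGD), focusing on optimization theory, statistical learning theory, and algorithmic stability analysis. All keywords concern large models, deep-learning techniques, or AI-for-science applications, whereas this paper belongs to classical machine-learning optimization theory and does not involve any specific large-model, deep-learning, or AI for Science technique, method, or application, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the generalization ability of multipass Preconditioned Stochastic Gradient Descent (PSGD): it develops a new on-average stability analysis, establishes excess risk bounds that depend on the effective dimension, and shows that a poorly chosen preconditioner leads to suboptimal behavior in both optimization and generalization.

Abstract (Chinese translation)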

本研究探讨了多轮次预处理随机梯度下降（PSGD）算法中，总体风险曲率、噪声几何特性与预处理策略对其泛化能力产生的权衡关系。许多实际优化启发式方法以不同方式隐式地处理这一权衡——例如，部分方法旨在白化梯度噪声，而另一些则试图使更新方向与期望损失曲率对齐。当总体风险曲率的几何特性与梯度噪声的几何特性不匹配时，旨在改善某一方面的激进选择可能沿另一方向放大不稳定性，从而导致次优的统计性能。本文采用平均算法稳定性框架，将PSGD的泛化能力与取决于这些曲率来源的有效维度联系起来。现有针对SGD平均稳定性的分析技术仅限于单轮次训练，作为首个贡献，我们为多轮次SGD开发了新的平均稳定性分析方法，能够处理数据复用引发的相关性。这使我们得以推导出依赖于有效维度的超额风险界。特别地，我们证明不当选择的预处理器可能在优化和泛化两方面均导致次优的有效维度依赖关系。最后，我们通过构造匹配的实例相关下界，对所得上界进行了补充。

Abstract

We study trade-offs between the population risk curvature, geometry of the noise, and preconditioning on the generalisation ability of the multipass Preconditioned Stochastic Gradient Descent (PSGD). Many practical optimisation heuristics implicitly navigate this trade-off in different ways – for instance, some aim to whiten gradient noise, while others aim to align updates with expected loss curvature. When the geometry of the population risk curvature and the geometry of the gradient noise do not match, an aggressive choice that improves one aspect can amplify instability along the other, leading to suboptimal statistical behavior. In this paper we employ on-average algorithmic stability to connect generalisation of PSGD to the effective dimension that depends on these sources of curvature. While existing techniques for on-average stability of SGD are limited to a single pass, as first contribution we develop a new on-average stability analysis for multipass SGD that handles the correlations induced by data reuse. This allows us to derive excess risk bounds that depend on the effective dimension. In particular, we show that an improperly chosen preconditioner can yield suboptimal effective dimension dependence in both optimisation and generalisation. Finally, we complement our upper bounds with matching, instance-dependent lower bounds.

Keywords: Preconditioned Stochastic Gradient Descent, on-average stability, generalization, effective dimension, multipass optimization, excess risk bounds, population risk curvature, gradient noise geometry
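For readers unfamiliar with the algorithm analysed above, the update rule of multipass preconditioned SGD can be sketched in a few lines. The example below assumes a least-squares loss and a diagonal preconditioner. The function name `psgd_least_squares`, the toy data, and the step size are hypothetical, and the sketch shows only the optimization side on noiseless data, not the paper's stability or generalisation analysis.

```python
import random


def psgd_least_squares(data, precond, lr=0.1, passes=20, seed=0):
    """Multipass preconditioned SGD for the loss mean((w.x - y)^2) / 2.

    `precond` holds per-coordinate scales, i.e. a diagonal preconditioner P,
    and each step applies w <- w - lr * P * grad_i(w).
    """
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(passes):  # multipass: the same samples are reused
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            residual = sum(wj * xj for wj, xj in zip(w, x)) - y
            # The per-sample gradient is residual * x; P rescales each coordinate.
            w = [wj - lr * p * residual * xj for wj, p, xj in zip(w, precond, x)]
    return w


# Toy noiseless data whose second feature is scaled down by a factor of 10.
data_rng = random.Random(1)
true_w = [2.0, -3.0]
data = []
for _ in range(50):
    x = [data_rng.uniform(-1, 1), 0.1 * data_rng.uniform(-1, 1)]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

w_plain = psgd_least_squares(data, precond=[1.0, 1.0])
w_precond = psgd_least_squares(data, precond=[1.0, 100.0])


def sq_error(w):
    return sum((a - b) ** 2 for a, b in zip(w, true_w))
```

Here the preconditioner `[1.0, 100.0]` roughly equalises the curvature of the two coordinates, so `w_precond` ends up much closer to `true_w` than `w_plain` after the same number of passes. The abstract's point is subtler: a choice that helps along the curvature direction can hurt along the noise direction, which this noiseless toy does not capture.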

227. ❌ Topological DeepONets and a generalization of the Chen-Chen operator approximation theorem

Authors: Vugar Ismailov Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11972v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

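Scoring rationale: The paper studies topological DeepONets, a mathematical extension of operator approximation theory within neural network architectures, and relates to the mathematical foundations of deep learning. All keywords focus on large language models (LLMs) and their specific techniques (training methods, inference optimization, alignment, applications, and so on), while this paper does not touch language models, natural language processing, or any LLM-related technique. The only loosely related keyword is "AI for Science", since the work is an application of AI in mathematics (operator approximation theory), but it does not target a concrete scientific field such as bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a topological DeepONet architecture that extends the classical Chen-Chen operator approximation theorem from spaces of continuous functions to locally convex spaces, proving that continuous operators in this setting can be uniformly approximated by branch-trunk neural networks.

Abstract (Chinese translation)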

深度算子网络（DeepONets）提供了一种分支-主干神经架构，用于逼近作用于函数空间之间的非线性算子。在经典算子逼近框架中，输入是定义在紧集$K_1$（通常是巴拿赫空间的紧子集）上的函数$u\in C(K_1)$，算子将$u$映射到定义在紧欧几里得区域$K_2\subset\mathbb{R}^d$上的输出函数$G(u)\in C(K_2)$。本文发展了一种拓扑扩展，其中算子输入位于任意豪斯多夫局部凸空间$X$中。我们利用对偶空间$X^*$中的连续线性泛函在$X$上构造拓扑前馈神经网络，并引入拓扑DeepONets，其分支组件通过此类线性测量作用于$X$，而主干组件作用于欧几里得输出区域。我们的主要定理表明，连续算子$G:V\to C(K;\mathbb{R}^m)$（其中$V\subset X$和$K\subset\mathbb{R}^d$为紧集）可由此类拓扑DeepONets一致逼近。这将经典的陈-陈算子逼近定理从连续函数空间推广到局部凸空间，并给出了超越巴拿赫空间设置的分支-主干逼近定理。

Abstract

Deep Operator Networks (DeepONets) provide a branch-trunk neural architecture for approximating nonlinear operators acting between function spaces. In the classical operator approximation framework, the input is a function $u\in C(K_1)$ defined on a compact set $K_1$ (typically a compact subset of a Banach space), and the operator maps $u$ to an output function $G(u)\in C(K_2)$ defined on a compact Euclidean domain $K_2\subset\mathbb{R}^d$. In this paper, we develop a topological extension in which the operator input lies in an arbitrary Hausdorff locally convex space $X$. We construct topological feedforward neural networks on $X$ using continuous linear functionals from the dual space $X^*$ and introduce topological DeepONets whose branch component acts on $X$ through such linear measurements, while the trunk component acts on the Euclidean output domain. Our main theorem shows that continuous operators $G:V\to C(K;\mathbb{R}^m)$, where $V\subset X$ and $K\subset\mathbb{R}^d$ are compact, can be uniformly approximated by such topological DeepONets. This extends the classical Chen-Chen operator approximation theorem from spaces of continuous functions to locally convex spaces and yields a branch-trunk approximation theorem beyond the Banach-space setting.

Keywords: Topological DeepONets, Operator Approximation, Chen-Chen Theorem, Locally Convex Spaces, Neural Networks, Continuous Operators, Branch-Trunk Architecture, Function Spaces
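The branch-trunk structure described in the abstract can be written out explicitly. The display below is a sketch for scalar-valued output (the abstract's $m=1$ case). The symbols $p$, $n$, $r$, $c_i^k$, $\xi_{ij}^k$, $\theta_i^k$, $w_k$, $\zeta_k$, and the activation $\sigma$ are notation assumed here, following the usual statement of the classical Chen-Chen theorem. The paper's exact formulation may differ.

$$
G(x)(y) \;\approx\; \sum_{k=1}^{p} \underbrace{\sum_{i=1}^{n} c_i^k\, \sigma\!\Big(\sum_{j=1}^{r} \xi_{ij}^k\, f_j(x) + \theta_i^k\Big)}_{\text{branch}} \;\underbrace{\sigma\big(w_k \cdot y + \zeta_k\big)}_{\text{trunk}}, \qquad x \in V \subset X,\; y \in K \subset \mathbb{R}^d,\; f_j \in X^*.
$$

In the classical setting the measurements are point evaluations, $f_j(u) = u(x_j)$ for sensor points $x_j \in K_1$. Under this reading, the topological extension replaces them with arbitrary continuous linear functionals on $X$.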

228. ❌ Statistical and structural identifiability in representation learning

Authors: Walter Nelson, Marco Fumero, Theofanis Karaletsos, Francesco Locatello Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11970v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

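Scoring rationale: The paper mainly studies identifiability in representation learning, proposing new definitions of statistical and structural near-identifiability and applying them to autoencoders and supervised models. GPTs are mentioned as an example but are not the focus. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies a foundation model-scale MAE to cell microscopy data to disentangle biological variation from technical batch effects, which is a bioinformatics application but not the core contribution, so it receives 5 points. All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes new definitions of statistical and structural near-identifiability in representation learning, proves statistical near-identifiability for models with nonlinear decoders, achieves disentanglement through ICA post-processing, and validates the approach on synthetic benchmarks and cell microscopy data.

Abstract (Chinese translation)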

表征学习模型在其内部表征中展现出令人惊讶的稳定性。以往研究大多将此稳定性视为单一属性，我们则将其形式化为两个不同的概念：统计可识别性（跨运行次数的表征一致性）与结构可识别性（表征与某种未观测到的基础真值的对齐）。认识到对于现代表征学习模型而言，完美的逐点可识别性通常不切实际，我们提出了新的、与模型无关的统计与结构近可识别性定义，允许一定的误差容忍度 $ε$。基于这些定义，我们证明了具有非线性解码器的模型表征在统计上具有 $ε$-近可识别性，从而将现有的可识别性理论（例如，生成式预训练变换器（GPTs）中仅针对最后一层表征的理论）推广至更广泛的模型类别（包括（掩码）自编码器（MAEs）和监督学习器）其中间表征的近可识别性。尽管这些较弱的假设带来较弱的可识别性，但我们证明独立成分分析（ICA）能够为这类模型解决大部分剩余的线性模糊性，并通过实验验证和度量了我们的近可识别性主张。在数据生成过程满足额外假设的情况下，统计可识别性可扩展至结构可识别性，从而为解纠缠提供了一种简单实用的方法：对潜在表征进行ICA后处理。在合成基准测试中，该方法使用普通自编码器实现了最先进的解纠缠性能。在一个用于细胞显微镜的基础模型级MAE上，该方法成功将生物变异与技术批次效应解纠缠，显著提升了下游任务的泛化能力。

Abstract

Representation learning models exhibit a surprising stability in their internal representations. Whereas most prior work treats this stability as a single property, we formalize it as two distinct concepts: statistical identifiability (consistency of representations across runs) and structural identifiability (alignment of representations with some unobserved ground truth). Recognizing that perfect pointwise identifiability is generally unrealistic for modern representation learning models, we propose new model-agnostic definitions of statistical and structural near-identifiability of representations up to some error tolerance $ε$. Leveraging these definitions, we prove a statistical $ε$-near-identifiability result for the representations of models with nonlinear decoders, generalizing existing identifiability theory beyond last-layer representations in e.g. generative pre-trained transformers (GPTs) to near-identifiability of the intermediate representations of a broad class of models including (masked) autoencoders (MAEs) and supervised learners. Although these weaker assumptions confer weaker identifiability, we show that independent components analysis (ICA) can resolve much of the remaining linear ambiguity for this class of models, and validate and measure our near-identifiability claims empirically. With additional assumptions on the data-generating process, statistical identifiability extends to structural identifiability, yielding a simple and practical recipe for disentanglement: ICA post-processing of latent representations. On synthetic benchmarks, this approach achieves state-of-the-art disentanglement using a vanilla autoencoder. With a foundation model-scale MAE for cell microscopy, it disentangles biological variation from technical batch effects, substantially improving downstream generalization.

Keywords: representation learning, identifiability, statistical identifiability, structural identifiability, autoencoders, independent components analysis, disentanglement, foundation model-scale MAE

229. ❌ Uncovering Locally Low-dimensional Structure in Networks by Locally Optimal Spectral Embedding

Authors: Hannah Sansford, Nick Whiteley, Patrick Rubin-Delanchy Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11965v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

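Scoring rationale: The paper studies a network embedding method (Local Adjacency Spectral Embedding) in graph theory and network science, and is entirely unrelated to the scoring keywords, all of which concern large models, deep-learning techniques, or AI-for-science applications. It does not involve any large-model technique, training method, inference optimization, alignment technique, or AI-for-science application.

!!! tip deepseek-chat TL;DR

To address the limitations of standard Adjacency Spectral Embedding (ASE) on sparse networks, the paper proposes Local Adjacency Spectral Embedding (LASE), which reveals locally low-dimensional structure through a locally weighted spectral decomposition, and shows that it outperforms global and subgraph baselines in local reconstruction and visualization.

Abstract (Chinese translation)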

标准邻接谱嵌入（ASE）依赖于全局低秩假设，该假设常与真实世界网络的稀疏性及传递性结构不相容，导致局部几何特征被“模糊化”。为解决此问题，我们提出局部邻接谱嵌入（LASE），该方法通过加权谱分解揭示局部低维结构。在具有核特征映射的潜在位置模型下，我们将潜在位置的像视为无限维特征空间中的局部低维集合。我们建立了有限样本界，量化了局部化带来的统计成本与通过针对嵌入的局部低维区域所减少的截断误差之间的权衡。此外，我们证明充分的局部化会诱导快速的谱衰减和显著谱间隙的出现，从理论上为低维局部嵌入提供了依据。在合成网络和真实网络上的实验表明，相较于全局及子图基线方法，LASE在局部重建和可视化方面均有提升；同时我们提出了UMAP-LASE方法，用于将重叠的局部嵌入整合为高保真度的全局可视化。

Abstract

Standard Adjacency Spectral Embedding (ASE) relies on a global low-rank assumption often incompatible with the sparse, transitive structure of real-world networks, causing local geometric features to be ‘smeared’. To address this, we introduce Local Adjacency Spectral Embedding (LASE), which uncovers locally low-dimensional structure via weighted spectral decomposition. Under a latent position model with a kernel feature map, we treat the image of latent positions as a locally low-dimensional set in infinite-dimensional feature space. We establish finite-sample bounds quantifying the trade-off between the statistical cost of localisation and the reduced truncation error achieved by targeting a locally low-dimensional region of the embedding. Furthermore, we prove that sufficient localisation induces rapid spectral decay and the emergence of a distinct spectral gap, theoretically justifying low-dimensional local embeddings. Experiments on synthetic and real networks show that LASE improves local reconstruction and visualisation over global and subgraph baselines, and we introduce UMAP-LASE for assembling overlapping local embeddings into high-fidelity global visualisations.

Keywords: Local Adjacency Spectral Embedding, spectral embedding, network analysis, low-dimensional structure, latent position model, visualization, UMAP-LASE, graph embedding

230. ❌ Causal Matrix Completion under Multiple Treatments via Mixed Synthetic Nearest Neighbors

Authors: Minrui Luo, Zhiheng Zhang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11942v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

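Scoring rationale: The paper studies a causal matrix completion method (Mixed Synthetic Nearest Neighbors) in the area of causal inference and missing-data handling, and is entirely unrelated to the scoring keywords, all of which concern large models and deep-learning techniques and applications. It does not mention large models, deep learning, AI for Science, or related topics, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address data scarcity in causal matrix completion with multiple treatments, the paper proposes a Mixed Synthetic Nearest Neighbors method that improves estimation by integrating information across treatment levels, and validates it both theoretically and empirically.

Abstract (Chinese translation)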

合成最近邻（Synthetic Nearest Neighbors，SNN）通过利用完全观测的锚点子矩阵所揭示的局部低秩结构，为缺失非随机（Missing-Not-At-Random，MNAR）条件下的因果矩阵补全提供了一个原理性解决方案。然而，其有效性关键依赖于每个处理水平内具有充足的数据，这一条件在处理多重或复杂时常常无法满足。本文提出混合合成最近邻（Mixed Synthetic Nearest Neighbors，MSNN），这是一种新的逐项因果识别估计器，能够整合跨处理水平的信息。我们证明，MSNN在保留SNN的有限样本误差界与渐近正态性保证的同时，扩大了可用于估计的有效样本量。在合成数据集与真实数据集上的实证结果验证了所提方法的有效性，尤其是在数据稀缺的处理水平下。

Abstract

Synthetic Nearest Neighbors (SNN) provides a principled solution to causal matrix completion under missing-not-at-random (MNAR) by exploiting local low-rank structure through fully observed anchor submatrices. However, its effectiveness critically relies on sufficient data availability within each treatment level, a condition that often fails in settings with multiple or complex treatments. In this work, we propose Mixed Synthetic Nearest Neighbors (MSNN), a new entry-wise causal identification estimator that integrates information across treatment levels. We show that MSNN retains the finite-sample error bounds and asymptotic normality guarantees of SNN, while enlarging the effective sample size available for estimation. Empirical results on synthetic and real-world datasets illustrate the efficacy of the proposed approach, especially under data-scarce treatment levels.

Keywords: causal matrix completion, missing-not-at-random, synthetic nearest neighbors, multiple treatments, finite-sample error bounds, asymptotic normality, data-scarce treatment levels, entry-wise causal identification

231. ❌ Causal Representation Learning with Optimal Compression under Complex Treatments

Authors: Wanting Liang, Haoang Chi, Zhiheng Zhang Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11907v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

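Scoring rationale: The paper focuses on causal representation learning, multi-treatment effect estimation, and computational optimization. It belongs to causal inference and machine learning and has no direct connection to most large-model keywords (LLM, MoE, RLHF, RAG, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper touches on scientific applications (potential uses of causal inference in medicine and the social sciences), but this is not its core content, so it receives 5 points (some relevance). The other keywords are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper tackles the hyperparameter selection dilemma and the curse of dimensionality in individual treatment effect estimation with multiple treatments, proposing a theoretical estimator and a scalable generative architecture that significantly improve estimation accuracy and efficiency in large-scale intervention scenarios.

Abstract (Chinese translation)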

在多干预场景下估计个体处理效应面临两大核心挑战：平衡权重的超参数选择困境与计算可扩展性的维度灾难问题。本文推导出新颖的多干预泛化边界，并提出最优平衡权重$α$的理论估计器，从而消除代价高昂的启发式调参过程。我们研究了三种平衡策略：配对平衡法、一对多平衡法以及干预聚合平衡法。实验表明，在低维场景下一对多平衡法具有最优精度，而本文提出的干预聚合平衡法在干预空间扩展时既能保持精度，又能实现O(1)级别的计算复杂度。进一步地，我们将该框架扩展为生成式架构 Multi-Treatment CausalEGM，该模型保持了干预流形的Wasserstein测地线结构。在半合成数据集与图像数据集上的实验证明，我们的方法在估计精度与计算效率上显著优于传统模型，尤其在大规模干预场景中表现突出。

Abstract

Estimating Individual Treatment Effects (ITE) in multi-treatment scenarios faces two critical challenges: the Hyperparameter Selection Dilemma for balancing weights and the Curse of Dimensionality in computational scalability. This paper derives a novel multi-treatment generalization bound and proposes a theoretical estimator for the optimal balancing weight $α$, eliminating expensive heuristic tuning. We investigate three balancing strategies: Pairwise, One-vs-All (OVA), and Treatment Aggregation. While OVA achieves superior precision in low-dimensional settings, our proposed Treatment Aggregation ensures both accuracy and O(1) scalability as the treatment space expands. Furthermore, we extend our framework to a generative architecture, Multi-Treatment CausalEGM, which preserves the Wasserstein geodesic structure of the treatment manifold. Experiments on semi-synthetic and image datasets demonstrate that our approach significantly outperforms traditional models in estimation accuracy and efficiency, particularly in large-scale intervention scenarios.

Keywords: Causal Representation Learning, Individual Treatment Effects, Multi-treatment Scenarios, Balancing Weights, Computational Scalability, Treatment Aggregation, Generative Architecture, Wasserstein Geodesic

232. ❌ Exhaustive Circuit Mapping of a Single-Cell Foundation Model Reveals Massive Redundancy, Heavy-Tailed Hub Architecture, and Layer-Dependent Differentiation Control

Authors: Ihor Kendiukhov Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11940v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

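Scoring rationale: The paper studies the mechanistic interpretability of Geneformer, a transformer-based single-cell foundation model. Through exhaustive circuit tracing, higher-order combinatorial ablation, and causal trajectory steering experiments, it reveals layer-dependent control of cell state. The paper is highly relevant to: 1) 'Large Language Models OR LLMs OR Foundation Models' (10 points), because it explicitly studies a transformer-based foundation model; 2) 'Mechanistic Interpretability OR Explainable AI' (10 points), because model interpretability is its core topic; 3) 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), because it applies to single-cell analysis in bioinformatics. Other keywords such as MoE, SFT, RAG, and quantization are not covered and score 0.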
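As a rough illustration of how the score table for this paper adds up, here is a minimal Python sketch. It assumes a row's score is the weight times the relevance scaled to the per-keyword maximum of 15 from the report's configuration, and that a paper passes when the total reaches the 26.6 threshold in the score line. The scaling rule and the names `row_score`, `total_score`, and `passes` are assumptions for illustration, not the report generator's documented logic.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword maximum from the report's configuration
PASS_THRESHOLD = 26.6   # threshold shown in each "score / 26.6" line


def row_score(weight, relevance, max_per_keyword=MAX_PER_KEYWORD):
    # Assumed rule: scale relevance (0 to 10) to the per-keyword maximum,
    # then multiply by the keyword weight.
    return weight * (relevance / 10.0) * max_per_keyword


def total_score(rows):
    # rows is a list of (weight, relevance) pairs, one per keyword.
    return sum(row_score(w, r) for w, r in rows)


def passes(rows, threshold=PASS_THRESHOLD):
    return total_score(rows) >= threshold


# Three keywords at relevance 10 and 24 at relevance 0, all with the
# weight of 0.0 that the table lists for this paper.
rows = [(0.0, 10.0)] * 3 + [(0.0, 0.0)] * 24
print(total_score(rows), passes(rows))
```

Under this assumed rule, a weight of 0.0 in every row forces the total to 0.0 regardless of relevance, which matches the table's score column.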

!!! tip deepseek-chat TL;DR

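Through exhaustive circuit tracing, higher-order combinatorial ablation, and causal trajectory steering experiments, the paper reveals massive redundancy, a heavy-tailed hub architecture, and layer-dependent control of cell differentiation in Geneformer, a transformer-based single-cell foundation model.

Abstract (Chinese translation)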

生物基础模型的机制可解释性研究通常依赖于选择性特征采样、成对相互作用测试以及观测性轨迹分析。这些方法均可能引入系统性偏差。本文通过三项实验，在基于Transformer的单细胞基础模型Geneformer中，采用穷举式回路追踪、高阶组合消融及因果轨迹导向方法，以应对上述局限。首先，对第5层全部4065个活跃稀疏自编码器特征进行穷举追踪，产生了1393850条显著下游边，较选择性采样扩展了27倍。这揭示了一种重尾枢纽分布：其中1.8%的特征占据不成比例的网络连接，而前20大枢纽中40%缺乏生物学注释。这些结果表明先前选择性分析存在系统性注释偏差。其次，对8组特征三元组进行三阶组合消融实验显示，冗余度随相互作用阶数单调增加：三阶冗余比为0.59，而成对冗余比为0.74，且未观察到协同效应。这证实模型架构在所有测试阶数上均呈现次可加性。第三，轨迹引导的特征导向实验建立了层位与分化方向性之间的因果关联。第17层的深层特征始终推动细胞状态向成熟方向转化，正向调控比例达1.0；而第0层和第11层的早中期特征主要推动细胞状态偏离成熟，正向调控比例介于0.00至0.58之间。这些研究共同将细胞状态层级调控的证据从相关性推向了因果性层面。

Abstract

Mechanistic interpretability of biological foundation models has relied on selective feature sampling, pairwise interaction testing, and observational trajectory analysis. Each of these can introduce systematic bias. Here we present three experiments that address these limitations through exhaustive circuit tracing, higher order combinatorial ablation, and causal trajectory steering in Geneformer, a transformer based single cell foundation model. First, exhaustive tracing of all 4065 active sparse autoencoder features at layer 5 yields 1393850 significant downstream edges, a 27 fold expansion over selective sampling. This reveals a heavy tailed hub distribution in which 1.8 percent of features account for disproportionate connectivity and 40 percent of the top 20 hubs lack biological annotation. These results indicate systematic annotation bias in prior selective analyses. Second, three way combinatorial ablation across 8 feature triplets shows that redundancy deepens monotonically with interaction order, with a three way ratio of 0.59 versus a pairwise ratio of 0.74, and with zero synergy. This confirms that the model architecture is subadditive at all tested orders. Third, trajectory guided feature steering establishes a causal link between layer position and differentiation directionality. Late layer features at L17 consistently push cell states toward maturity, with fraction positive equal to 1.0. Early and mid layer features at L0 and L11 mostly push away from maturity, with fraction positive ranging from 0.00 to 0.58. Together these results move from correlation toward causal evidence for layer dependent control of cell state.
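The redundancy ratios quoted in the abstract (0.59 for three-way, 0.74 for pairwise) can be read as a subadditivity check. The Python sketch below shows one plausible form of such a check, assuming the ratio is the effect of ablating features jointly divided by the sum of their individual ablation effects. That definition, the function names, and the example numbers are assumptions for illustration; the abstract does not spell out the formula.

```python
def redundancy_ratio(joint_effect: float, single_effects: list[float]) -> float:
    """Joint ablation effect divided by the sum of individual effects (assumed definition)."""
    total = sum(single_effects)
    if total == 0:
        raise ValueError("individual effects sum to zero; ratio is undefined")
    return joint_effect / total


def classify(ratio: float, tol: float = 1e-9) -> str:
    # Below 1: the joint effect is smaller than the parts combined (redundancy).
    # Above 1: the joint effect exceeds the parts combined (synergy).
    if ratio < 1 - tol:
        return "redundant (subadditive)"
    if ratio > 1 + tol:
        return "synergistic (superadditive)"
    return "additive"


# Hypothetical effect sizes chosen so the ratio equals 0.59.
ratio = redundancy_ratio(1.77, [1.0, 1.0, 1.0])
print(round(ratio, 2), classify(ratio))
```

Under this reading, a ratio that falls from 0.74 to 0.59 as the interaction order rises means each added feature contributes less unique effect.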

Keywords: mechanistic interpretability, foundation model, transformer, single-cell analysis, circuit tracing, causal trajectory, layer-dependent control, biological annotation bias

233. ❌ On the Role of Reversible Instance Normalization

Authors: Gaspard Berthelier, Tahar Nabil, Etienne Le Naour, Richard Niamke, Samir Perlaza, Giovanni Neglia Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11869v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

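Scoring rationale: The paper is a technical analysis of data normalization in time series forecasting, specifically Reversible Instance Normalization, examining the redundancy of its components and possible improvements. The scoring keywords all concern large models, the principles of deep learning techniques, or specific AI application areas (such as bioinformatics), whereas this paper studies a normalization technique from conventional deep learning and has no direct connection to those large-model topics.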

!!! tip deepseek-chat TL;DR

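The paper studies the role of Reversible Instance Normalization (RevIN) in time series forecasting. Ablation experiments show that several of its components are redundant or even harmful, and the paper offers new perspectives for improving its robustness and generalization.

Abstract (Chinese translation)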

数据归一化是深度学习模型的关键组成部分，但其在时间序列预测中的作用仍未得到充分理解。本文指出了时间序列预测中归一化面临的三个核心挑战：时间输入分布偏移、空间输入分布偏移以及条件输出分布偏移。在此背景下，我们重新审视了广泛使用的可逆实例归一化（Reversible Instance Normalization, RevIN），通过消融实验表明其若干组件是冗余甚至有害的。基于这些观察，我们提出了提升RevIN鲁棒性与泛化能力的新视角。

Abstract

Data normalization is a crucial component of deep learning models, yet its role in time series forecasting remains insufficiently understood. In this paper, we identify three central challenges for normalization in time series forecasting: temporal input distribution shift, spatial input distribution shift, and conditional output distribution shift. In this context, we revisit the widely used Reversible Instance Normalization (RevIN), by showing through ablation studies that several of its components are redundant or even detrimental. Based on these observations, we draw new perspectives to improve RevIN’s robustness and generalization.
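For readers unfamiliar with RevIN, the following Python sketch shows the basic mechanism on a single univariate window: normalize the input with its own mean and standard deviation, apply an affine transform, and invert both steps on the model output. It is a simplified illustration, not the paper's code. The class name and fixed `gamma` and `beta` values are assumptions; in the usual formulation the affine parameters are learned.

```python
import math


class RevIN:
    """Simplified reversible instance normalization for one univariate window."""

    def __init__(self, eps: float = 1e-5, gamma: float = 1.0, beta: float = 0.0):
        self.eps = eps
        self.gamma = gamma  # affine scale (learnable in practice, fixed here)
        self.beta = beta    # affine shift (learnable in practice, fixed here)
        self.mean = 0.0
        self.std = 1.0

    def normalize(self, window):
        # Statistics are computed per instance, over the time axis only.
        n = len(window)
        self.mean = sum(window) / n
        variance = sum((x - self.mean) ** 2 for x in window) / n
        self.std = math.sqrt(variance + self.eps)
        return [self.gamma * (x - self.mean) / self.std + self.beta for x in window]

    def denormalize(self, outputs):
        # Undo the affine transform, then restore the stored instance statistics.
        return [((y - self.beta) / self.gamma) * self.std + self.mean for y in outputs]


revin = RevIN(gamma=2.0, beta=0.5)
window = [10.0, 12.0, 11.0, 13.0]
normalized = revin.normalize(window)
restored = revin.denormalize(normalized)
print(normalized, restored)
```

In a forecasting pipeline the model would sit between `normalize` and `denormalize`; the round trip above only shows that the two steps are inverses.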

Keywords: time series forecasting, data normalization, Reversible Instance Normalization, RevIN, ablation studies, robustness, generalization, distribution shift

234. ❌ FlexRec: Adapting LLM-based Recommenders for Flexible Needs via Reinforcement Learning

Authors: Yijun Pan, Weikang Qiu, Qiyao Ma, Mingxuan Ju, Tong Zhao, Neil Shah, Rex Ying Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11901v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

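Scoring rationale: The paper's core topic is applying LLMs to recommender systems, using reinforcement learning for post-training alignment. Highly relevant keywords include: LLMs (the paper directly studies LLM-based recommenders), Post-training/SFT (the paper proposes a post-training RL framework), RLHF/DPO (it uses reinforcement learning for post-training alignment), and Instruction Tuning/Alignment (it studies aligning LLM recommendations to a need instruction). Other keywords such as MoE, SLMs, RAG, and CoT are not covered in the paper.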

!!! tip deepseek-chat TL;DR

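The paper studies how FlexRec, a reinforcement learning post-training framework, enables LLM-based recommenders to flexibly adjust their recommendation behavior according to a user's specific need instruction, with substantial gains in recommendation quality metrics across multiple recommendation scenarios.

Abstract (Chinese translation)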

现代推荐系统必须适应多样化推荐场景中动态且需求特定的目标，然而大多数传统推荐器仅针对单一静态目标进行优化，难以按需重构行为。基于强化学习的后训练技术近期取得进展，解锁了大语言模型强大的指令遵循与推理能力，这为将其与复杂推荐目标对齐提供了原则性路径。受此启发，我们研究闭集自回归排序任务，即大语言模型基于用户上下文和显式需求指令，在固定候选集上生成排列序列。然而，将强化学习应用于此场景面临两大障碍：（1）序列级奖励产生的粗粒度信用分配无法提供细粒度训练信号；（2）交互反馈稀疏且含噪声，二者共同导致更新低效且不稳定。我们提出FlexRec，一种后训练强化学习框架，通过以下方式解决上述问题：（1）基于剩余候选池内反事实交换的因果性项目级奖励；（2）评论家引导的不确定性感知缩放机制，显式建模奖励不确定性并对低置信度奖励降权，以在稀疏监督下稳定学习。在多样化推荐场景与目标下，FlexRec取得显著提升：在需求特定排序任务中，NDCG@5最高提升59%，Recall@5最高提升109.4%；在泛化设置下进一步实现最高24.1%的Recall@5提升，显著优于传统推荐器及基于大语言模型的基线方法。

Abstract

Modern recommender systems must adapt to dynamic, need-specific objectives for diverse recommendation scenarios, yet most traditional recommenders are optimized for a single static target and struggle to reconfigure behavior on demand. Recent advances in reinforcement-learning-based post-training have unlocked strong instruction-following and reasoning capabilities in LLMs, suggesting a principled route for aligning them to complex recommendation goals. Motivated by this, we study closed-set autoregressive ranking, where an LLM generates a permutation over a fixed candidate set conditioned on user context and an explicit need instruction. However, applying RL to this setting faces two key obstacles: (i) sequence-level rewards yield coarse credit assignment that fails to provide fine-grained training signals, and (ii) interaction feedback is sparse and noisy, which together lead to inefficient and unstable updates. We propose FlexRec, a post-training RL framework that addresses both issues with (1) a causally grounded item-level reward based on counterfactual swaps within the remaining candidate pool, and (2) critic-guided, uncertainty-aware scaling that explicitly models reward uncertainty and down-weights low-confidence rewards to stabilize learning under sparse supervision. Across diverse recommendation scenarios and objectives, FlexRec achieves substantial gains: it improves NDCG@5 by up to 59% and Recall@5 by up to 109.4% in need-specific ranking, and further achieves up to 24.1% Recall@5 improvement under generalization settings, outperforming strong traditional recommenders and LLM-based baselines.
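The idea of an item-level reward based on counterfactual swaps can be sketched in a few lines of Python. The version below scores the item placed at position `t` by comparing the ranking's NDCG@k with the average NDCG@k obtained when that item is swapped with each candidate still remaining after position `t`. This is an illustrative reading under stated assumptions, not FlexRec's actual reward: the function `swap_reward`, the averaging rule, and the toy relevance labels are all hypothetical.

```python
import math


def dcg_at_k(gains, k):
    # Standard discounted cumulative gain with a log2 position discount.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranking, relevance, k):
    gains = [relevance.get(item, 0.0) for item in ranking]
    ideal = sorted(gains, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def swap_reward(ranking, relevance, t, k=5):
    """Reward for the item at position t versus swapping in each remaining candidate."""
    base = ndcg_at_k(ranking, relevance, k)
    alternatives = []
    for j in range(t + 1, len(ranking)):
        swapped = list(ranking)
        swapped[t], swapped[j] = swapped[j], swapped[t]
        alternatives.append(ndcg_at_k(swapped, relevance, k))
    if not alternatives:
        return 0.0  # no remaining candidates to compare against
    return base - sum(alternatives) / len(alternatives)


relevance = {"a": 1.0}  # toy labels: only item "a" is relevant
good = swap_reward(["a", "b", "c"], relevance, t=0)
bad = swap_reward(["b", "a", "c"], relevance, t=0)
print(good, bad)
```

Placing the relevant item first earns a positive reward and placing an irrelevant item first earns a negative one, which gives a per-position signal instead of a single sequence-level score.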

Keywords: LLM-based recommenders, reinforcement learning, post-training, instruction-following, recommendation systems, autoregressive ranking, need-specific objectives, FlexRec

235. ❌ Multi-Station WiFi CSI Sensing Framework Robust to Station-wise Feature Missingness and Limited Labeled Data

Authors: Keita Kayano, Takayuki Nishio, Daiki Yoda, Yuta Hirai, Tomoko Adachi Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11858v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

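Scoring rationale: The paper studies a WiFi Channel State Information (CSI) sensing framework, focusing on feature missingness and limited labeled data in multi-station deployments, and uses self-supervised learning and data augmentation. The scoring keywords all relate to large language models, the principles of deep learning techniques, or scientific applications of AI, whereas this paper belongs to wireless sensing and involves no large-model techniques, deep learning innovations, or scientific applications of AI, so every keyword has a relevance of 0.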

!!! tip deepseek-chat TL;DR

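The paper proposes a framework for multi-station WiFi Channel State Information sensing that combines missingness-invariant pre-training with station-wise masking augmentation, effectively addressing station-wise feature missingness and limited labeled data in practical deployments.

Abstract (Chinese translation)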

本文提出一种面向多站点部署的WiFi信道状态信息（CSI）感知框架，旨在解决实际CSI感知中的两个核心挑战：站点级特征缺失与标记数据有限。特征缺失问题通常通过重采样非均匀间隔的CSI测量值或重建缺失样本来处理，而标记数据稀缺问题则通过数据增强或自监督表征学习来缓解。然而，现有技术往往孤立地处理这些问题，未能同时应对长期、结构化的站点不可用性与标记数据稀缺的联合挑战。为弥补这一空白，我们显式地将站点不可用性纳入表征学习与下游模型训练中。具体而言，我们改造了跨模态自监督学习（CroSSL）——一个原本为时序传感数据设计的表征学习框架，将其适配于多站点CSI感知任务，从而从未标记数据中学习对站点级特征缺失具有内在不变性的表征。此外，我们在下游模型训练中引入了站点级掩码增强（Station-wise Masking Augmentation, SMA），使模型在有限标记数据下接触真实的站点不可用性模式。实验表明，仅依靠缺失不变性预训练或仅使用站点级增强均不足以保证鲁棒性；二者的结合对于在站点级特征缺失与标记数据稀缺并存条件下实现稳健性能至关重要。所提框架为实际部署中的多站点WiFi CSI感知提供了实用且鲁棒的基础。

Abstract

We propose a WiFi Channel State Information (CSI) sensing framework for multi-station deployments that addresses two fundamental challenges in practical CSI sensing: station-wise feature missingness and limited labeled data. Feature missingness is commonly handled by resampling unevenly spaced CSI measurements or by reconstructing missing samples, while label scarcity is mitigated by data augmentation or self-supervised representation learning. However, these techniques are typically developed in isolation and do not jointly address long-term, structured station unavailability together with label scarcity. To bridge this gap, we explicitly incorporate station unavailability into both representation learning and downstream model training. Specifically, we adapt cross-modal self-supervised learning (CroSSL), a representation learning framework originally designed for time-series sensory data, to multi-station CSI sensing in order to learn representations that are inherently invariant to station-wise feature missingness from unlabeled data. Furthermore, we introduce Station-wise Masking Augmentation (SMA) during downstream model training, which exposes the model to realistic station unavailability patterns under limited labeled data. Our experiments show that neither missingness-invariant pre-training nor station-wise augmentation alone is sufficient; their combination is essential to achieve robust performance under both station-wise feature missingness and label scarcity. The proposed framework provides a practical and robust foundation for multi-station WiFi CSI sensing in real-world deployments.
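Station-wise masking can be pictured as dropping every feature from a randomly chosen subset of stations at once, so the model sees whole-station outages during training. The Python sketch below is a minimal version under stated assumptions: each sample is a list of per-station feature vectors, masked stations are zero-filled, and at least one station is always kept. The function name, the zero-fill choice, and the `drop_prob` parameter are hypothetical and not taken from the paper.

```python
import random


def station_wise_mask(sample, drop_prob, rng):
    """Zero out all features of randomly selected stations.

    sample: list of per-station feature vectors (lists of floats).
    Returns the masked sample and the per-station keep flags.
    """
    keep = [rng.random() >= drop_prob for _ in sample]
    if not any(keep):
        # Assumption: never mask every station, so some signal remains.
        keep[rng.randrange(len(sample))] = True
    masked = [
        list(features) if kept else [0.0] * len(features)
        for features, kept in zip(sample, keep)
    ]
    return masked, keep


rng = random.Random(0)
sample = [[0.2, 0.4], [1.1, 0.9], [0.5, 0.7]]  # three stations, two features each
masked, keep = station_wise_mask(sample, drop_prob=0.5, rng=rng)
print(keep, masked)
```

Masking whole stations, instead of individual samples, mirrors the long-term structured unavailability the abstract describes.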

Keywords: WiFi CSI sensing, multi-station deployments, feature missingness, limited labeled data, self-supervised learning, station-wise masking augmentation, robust performance

236. ❌ Inverse Neural Operator for ODE Parameter Optimization

Authors: Zhi-Song Liu, Wenqing Peng, Helmi Toropainen, Ammar Kheder, Andreas Rupp, Holger Froning, Xiaojie Lin, Michael Boy Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11854v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

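Scoring rationale: The paper proposes an inverse neural operator framework for ODE parameter optimization, which falls under scientific computing and AI for Science. Its content is highly relevant to applications of deep learning in science, in particular solving parameter inversion problems in atmospheric chemistry and gene regulatory networks with neural network methods. However, the paper involves no specific techniques such as large language models, model architectures, training alignment, inference optimization, or agent systems, so no keyword other than 'AI for Science' is relevant.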

!!! tip deepseek-chat TL;DR

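The paper proposes the Inverse Neural Operator framework for efficiently and accurately recovering hidden parameters of ordinary differential equations from sparse observations. On a real-world atmospheric chemistry benchmark and a synthetic gene regulatory network benchmark, it achieves higher accuracy than gradient-based methods and a 487x speedup.

Abstract (Chinese translation)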

我们提出逆向神经算子（INO），这是一个从稀疏、部分观测数据中恢复隐藏常微分方程参数的两阶段框架。在第一阶段，采用交叉注意力机制的条件傅里叶神经算子（C-FNO）通过学习一个可微分的代理模型，从任意稀疏输入中重建完整的常微分方程轨迹，并通过谱正则化抑制高频伪影。在第二阶段，摊销漂移模型（ADM）在参数空间中学习一个核加权的速度场，将随机参数初始化值向真实参数方向传输，而无需通过代理模型进行反向传播，从而避免了在刚性系统中困扰基于梯度的逆向求解方法的雅可比矩阵不稳定性问题。在真实世界的刚性大气化学基准测试（POLLU，25个参数）和合成基因调控网络（GRN，40个参数）上的实验表明，INO在参数恢复精度上优于基于梯度的和摊销的基线方法，同时仅需0.23秒的推理时间，相比迭代梯度下降法实现了487倍的加速。

Abstract

We propose the Inverse Neural Operator (INO), a two-stage framework for recovering hidden ODE parameters from sparse, partial observations. In Stage 1, a Conditional Fourier Neural Operator (C-FNO) with cross-attention learns a differentiable surrogate that reconstructs full ODE trajectories from arbitrary sparse inputs, suppressing high-frequency artifacts via spectral regularization. In Stage 2, an Amortized Drifting Model (ADM) learns a kernel-weighted velocity field in parameter space, transporting random parameter initializations toward the ground truth without backpropagating through the surrogate, avoiding the Jacobian instabilities that afflict gradient-based inversion in stiff regimes. Experiments on a real-world stiff atmospheric chemistry benchmark (POLLU, 25 parameters) and a synthetic Gene Regulatory Network (GRN, 40 parameters) show that INO outperforms gradient-based and amortized baselines in parameter recovery accuracy while requiring only 0.23s inference time, a 487x speedup over iterative gradient descent.
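The abstract mentions suppressing high-frequency artifacts via spectral regularization without giving the penalty's form. One common form, sketched below in Python, penalizes the energy of Fourier coefficients above a cutoff frequency. The function name, the cutoff parameter, the normalization, and the naive DFT are all assumptions for illustration and should not be read as the regularizer used in INO.

```python
import cmath


def high_frequency_penalty(signal, cutoff):
    """Normalized energy of DFT coefficients whose frequency exceeds `cutoff`."""
    n = len(signal)
    penalty = 0.0
    for k in range(n):
        # For a real signal, bins k and n - k carry the same frequency.
        frequency = min(k, n - k)
        if frequency <= cutoff:
            continue
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)
        )
        penalty += abs(coeff) ** 2
    return penalty / n**2


smooth = [1.0] * 8             # constant trajectory: no high-frequency content
jagged = [1.0, -1.0] * 4       # alternating trajectory: energy at the top frequency
print(high_frequency_penalty(smooth, cutoff=1))
print(high_frequency_penalty(jagged, cutoff=1))
```

Adding such a term to a reconstruction loss would discourage trajectories that oscillate faster than the cutoff, which is the general effect the abstract attributes to spectral regularization.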

Keywords: Inverse Neural Operator, ODE parameter optimization, Conditional Fourier Neural Operator, Amortized Drifting Model, atmospheric chemistry, gene regulatory network, parameter recovery, inference acceleration

237. ❌ Hypercomplex Widely Linear Processing: Fundamentals for Quaternion Machine Learning

Authors: Sayed Pouria Talebi, Clive Cheong Took Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11835v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

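Scoring rationale: The paper 'Hypercomplex Widely Linear Processing: Fundamentals for Quaternion Machine Learning' focuses on the theoretical foundations of quaternion machine learning, including quaternion algebra, statistical modelling, and estimation algorithms. The scoring keywords all target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, applications, and so on), whereas this paper does not involve LLMs or deep learning at all and only discusses foundational uses of quaternions, a specific mathematical structure, in machine learning. It has no direct connection to any keyword, so every keyword has a relevance of 0.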

!!! tip deepseek-chat TL;DR

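The paper lays the theoretical foundations for quaternion machine learning by establishing quaternion statistical models, widely linear models, and estimation algorithms, in support of research in application areas such as three-dimensional rotations.

Abstract (Chinese translation)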

为将复数代数在工程与科学领域的成功拓展至其他超复数域（如四元数、双复数、双四元数和八元数），已有大量尝试。其中，或许没有哪一种能媲美四元数的成就。四元数最具实用价值的特性在于其能够对三维旋转进行建模，这一特性已在航空航天和计算机图形学等多个工业领域得到广泛应用。近年来，随着机器学习的兴起，我们见证了四元数研究的复兴。为使读者能够为这一新兴研究领域做出贡献，本章将奠定以下基础：- 用于建模四元数值随机过程的增广统计学；- 利用此类先进统计量的广线性模型；- 用于算法推导的四元数微积分与代数；- 基于实际考量的均方估计。为便于理解，本章提供了若干示例，以促进对这一多维领域的学习、理解，并期望推动其实际应用。

Abstract

Numerous attempts have been made to replicate the success of complex-valued algebra in engineering and science to other hypercomplex domains such as quaternions, tessarines, biquaternions, and octonions. Perhaps, none have matched the success of quaternions. The most useful feature of quaternions lies in their ability to model three-dimensional rotations which, in turn, have found various industrial applications such as in aeronautics and computer graphics. Recently, we have witnessed a renaissance of quaternions due to the rise of machine learning. To equip the reader to contribute to this emerging research area, this chapter lays down the foundation for: - augmented statistics for modelling quaternion-valued random processes, - widely linear models to exploit such advanced statistics, - quaternion calculus and algebra for algorithmic derivations, - mean square estimation for practical considerations. For ease of exposure, several examples are offered to facilitate the learning, understanding, and (hopefully) the adoption of this multidimensional domain.
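The abstract singles out three-dimensional rotation as the most useful feature of quaternions. As a small worked example, the Python sketch below rotates a vector with the standard sandwich product q v q*, using the Hamilton product. The helper names are chosen for illustration and are not from the chapter.

```python
import math


def hamilton(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z) tuples."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )


def conjugate(q):
    w, x, y, z = q
    return (w, -x, -y, -z)


def rotate(vector, axis, angle):
    """Rotate a 3-D vector about a unit axis by `angle` radians via q v q*."""
    half = angle / 2.0
    s = math.sin(half)
    q = (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)
    pure = (0.0, vector[0], vector[1], vector[2])  # embed the vector as a pure quaternion
    _, x, y, z = hamilton(hamilton(q, pure), conjugate(q))
    return (x, y, z)


# Rotating the x-axis by 90 degrees about the z-axis should give the y-axis.
rotated = rotate((1.0, 0.0, 0.0), axis=(0.0, 0.0, 1.0), angle=math.pi / 2)
print(rotated)
```

The Hamilton product is not commutative, which is why the order of the two multiplications in `rotate` matters.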

Keywords: quaternion machine learning, hypercomplex algebra, widely linear models, augmented statistics, mean square estimation, three-dimensional rotations, quaternion calculus, multidimensional domain

238. ❌ Exponential-Family Membership Inference: From LiRA and RMIA to BaVarIA

Authors: Rickard Brännvall Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11799v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

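Scoring rationale: The paper focuses on membership inference attack (MIA) methods for machine learning models. It presents a unified framework for the LiRA, RMIA, and BASE attacks and proposes a new attack, BaVarIA. Its core content is machine learning privacy auditing and attack methods, not large models, the principles of deep learning techniques, or their scientific applications. The keywords all concern large-model techniques, training methods, inference optimization, application areas, and similar directions, which are unrelated to this paper's privacy-attack research.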

!!! tip deepseek-chat TL;DR

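The paper unifies three membership inference attacks (LiRA, RMIA, and BASE) under an exponential-family log-likelihood ratio framework and proposes BaVarIA, an attack based on Bayesian variance inference that delivers stable and improved performance at low shadow-model budgets.

Abstract (Chinese translation)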

成员推理攻击正逐渐成为审计机器学习模型隐私性的标准工具。主流攻击方法——LiRA（Carlini等人，2022）和RMIA（Zarifzadeh等人，2024）——似乎采用了不同的评分策略，而近期提出的BASE方法（Lassila等人，2025）被证明与RMIA等价，这使得实践者难以在它们之间做出选择。本文证明，这三种方法均属于同一指数族对数似然比框架的实例，其区别仅在于分布假设以及每个数据点所估计的参数数量。这种统一性揭示了一个层次结构（BASE1-4），将RMIA和LiRA连接为模型复杂度递增谱系的两个端点。在此框架内，我们指出方差估计是小规模影子模型预算下的关键瓶颈，并提出了BaVarIA——一种贝叶斯方差推理攻击，该方法采用共轭正态逆伽马先验替代基于阈值的参数切换。BaVarIA可生成学生t分布预测（BaVarIA-t）或具有稳定方差的高斯分布预测（BaVarIA-n），在无需额外超参数调优的情况下即可提供稳定的性能。在12个数据集和7种影子模型预算的测试中，BaVarIA均达到或超越了LiRA和RMIA的表现，其中在实践意义重大的低影子模型数量及离线场景下取得了最显著的性能提升。

Abstract

Membership inference attacks (MIAs) are becoming standard tools for auditing the privacy of machine learning models. The leading attacks – LiRA (Carlini et al., 2022) and RMIA (Zarifzadeh et al., 2024) – appear to use distinct scoring strategies, while the recently proposed BASE (Lassila et al., 2025) was shown to be equivalent to RMIA, making it difficult for practitioners to choose among them. We show that all three are instances of a single exponential-family log-likelihood ratio framework, differing only in their distributional assumptions and the number of parameters estimated per data point. This unification reveals a hierarchy (BASE1-4) that connects RMIA and LiRA as endpoints of a spectrum of increasing model complexity. Within this framework, we identify variance estimation as the key bottleneck at small shadow-model budgets and propose BaVarIA, a Bayesian variance inference attack that replaces threshold-based parameter switching with conjugate normal-inverse-gamma priors. BaVarIA yields a Student-t predictive (BaVarIA-t) or a Gaussian with stabilized variance (BaVarIA-n), providing stable performance without additional hyperparameter tuning. Across 12 datasets and 7 shadow-model budgets, BaVarIA matches or improves upon LiRA and RMIA, with the largest gains in the practically important low-shadow-model and offline regimes.
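The step from conjugate normal-inverse-gamma priors to a Student-t predictive can be illustrated with the textbook conjugate update. The Python sketch below computes the posterior from a handful of shadow-model scores and uses the resulting Student-t predictive in a log-likelihood ratio between "in" and "out" scores. It is a generic sketch, not BaVarIA itself: the prior hyperparameters, the function names, and the toy scores are assumptions for illustration.

```python
import math


def nig_posterior(samples, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Standard normal-inverse-gamma update for unknown mean and variance."""
    n = len(samples)
    mean = sum(samples) / n if n else 0.0
    sq_dev = sum((x - mean) ** 2 for x in samples)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * mean) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * sq_dev + kappa0 * n * (mean - mu0) ** 2 / (2 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n


def student_t_logpdf(x, dof, loc, scale_sq):
    z = (x - loc) ** 2 / (dof * scale_sq)
    return (
        math.lgamma((dof + 1) / 2)
        - math.lgamma(dof / 2)
        - 0.5 * math.log(dof * math.pi * scale_sq)
        - (dof + 1) / 2 * math.log1p(z)
    )


def predictive_logpdf(x, samples, **prior):
    # The posterior predictive of a NIG model is a Student-t distribution.
    mu_n, kappa_n, alpha_n, beta_n = nig_posterior(samples, **prior)
    scale_sq = beta_n * (kappa_n + 1) / (alpha_n * kappa_n)
    return student_t_logpdf(x, 2 * alpha_n, mu_n, scale_sq)


def membership_llr(x, in_scores, out_scores, **prior):
    """Log-likelihood ratio; positive values favor membership."""
    return predictive_logpdf(x, in_scores, **prior) - predictive_logpdf(x, out_scores, **prior)


# Toy shadow-model scores: members score near 2, non-members near 0.
llr = membership_llr(2.0, [2.1, 1.9, 2.0, 2.2], [0.1, -0.1, 0.0, 0.2])
print(llr)
```

Because the prior contributes to the variance estimate, the predictive stays well defined even with very few shadow models, which is the regime the abstract identifies as the bottleneck.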

Keywords: Membership Inference Attacks, LiRA, RMIA, Exponential-family Framework, BaVarIA, Bayesian Variance Inference, Privacy Auditing, Shadow Models

239. ❌ Disentangled Representation Learning through Unsupervised Symmetry Group Discovery

Authors: Dang-Nhu Barthélémy, Annabi Louis, Argentieri Sylvain Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11790v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

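Scoring rationale: The paper studies disentangled representation learning via unsupervised symmetry group discovery, which falls under representation learning, reinforcement learning, and symmetry theory. It is unrelated to every scoring keyword, all of which focus on large models and the principles and applications of deep learning techniques.

!!! tip deepseek-chat TL;DR

The paper proposes an unsupervised symmetry group discovery method that lets an agent autonomously discover the group structure of its action space by interacting with the environment. The method is used to learn Linear Symmetry-Based Disentangled (LSBD) representations and is validated on three environments, where it outperforms existing approaches.

Abstract (Chinese translation)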

基于对称性的解耦表示学习利用环境变换的群结构来揭示变化的潜在因子。先前基于对称性的解耦方法需要关于对称群结构的强先验知识，或对子群性质施加限制性假设。在本工作中，我们通过提出一种方法移除了这些约束，该方法使具身智能体通过与环境的无监督交互自主发现其动作空间的群结构。我们在最小假设下证明了真实对称群分解的可识别性，并推导出两种算法：一种用于从交互数据中发现群分解，另一种用于在不假设特定子群性质的情况下学习基于线性对称性的解耦表示。我们的方法在三个展现不同群分解的环境中得到验证，其性能优于现有的基于线性对称性的解耦方法。

Abstract

Symmetry-based disentangled representation learning leverages the group structure of environment transformations to uncover the latent factors of variation. Prior approaches to symmetry-based disentanglement have required strong prior knowledge of the symmetry group’s structure, or restrictive assumptions about the subgroup properties. In this work, we remove these constraints by proposing a method whereby an embodied agent autonomously discovers the group structure of its action space through unsupervised interaction with the environment. We prove the identifiability of the true symmetry group decomposition under minimal assumptions, and derive two algorithms: one for discovering the group decomposition from interaction data, and another for learning Linear Symmetry-Based Disentangled (LSBD) representations without assuming specific subgroup properties. Our method is validated on three environments exhibiting different group decompositions, where it outperforms existing LSBD approaches.

Keywords: disentangled representation learning, unsupervised symmetry group discovery, embodied agent, group structure, linear symmetry-based disentangled representations, environment transformations, latent factors, identifiability

240. ❌ A Further Efficient Algorithm with Best-of-Both-Worlds Guarantees for $m$-Set Semi-Bandit Problem

Authors: Botao Chen, Jongyeong Lee, Chansoo Kim, Junya Honda Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11764v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

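Scoring rationale: The paper studies the $m$-set semi-bandit problem in online learning, focusing on the optimality and computational complexity of the Follow-the-Perturbed-Leader (FTPL) algorithm in adversarial and stochastic settings. Its content belongs entirely to classical machine learning, online learning, and optimization theory: algorithmic analysis, regret-bound proofs, and computational complexity improvements. The scoring keywords all concern large models, deep learning, AI applications, or related techniques, and the paper touches none of them (fine-tuning, alignment, inference acceleration, and so on), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the optimality of the Follow-the-Perturbed-Leader (FTPL) algorithm for the $m$-set semi-bandit problem. It proves that FTPL attains the best possible $O(\sqrt{mdT})$ regret in the adversarial setting and logarithmic regret in the stochastic setting, and it extends conditional geometric resampling to reduce the computational complexity from $O(d^2)$ to $O(md(\log(d/m)+1))$.

Abstract (Chinese translation)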

本文研究了跟随扰动领导者（FTPL）策略在$m$集合半赌博问题中的最优性与计算复杂度。FTPL作为一种在对抗性组合半赌博问题中具有良好遗憾界的高效算法候选方案，已得到广泛研究。然而，与已在多种在线学习任务中被证明最优性的跟随正则化领导者（FTRL）不同，FTPL的最优性始终未明。本文中，我们将几何重采样（GR）的FTPL分析扩展至$m$集合半赌博问题（组合半赌博问题的一个特例），证明采用特定参数的Fréchet分布和Pareto分布的FTPL在对抗性环境中可实现$O(\sqrt{mdT})$的最优遗憾界。同时，我们证明采用特定参数的Fréchet分布和Pareto分布的FTPL在随机环境中可实现对数遗憾，这意味着FTPL在$m$集合半赌博问题上实现了“两全其美”的最优性。此外，我们将条件几何重采样技术扩展至$m$集合半赌博问题，以在FTPL中实现高效损失估计，将计算复杂度从原始几何重采样的$O(d^2)$降低至$O(md(\log(d/m)+1))$，且不牺牲遗憾界性能。

Abstract

This paper studies the optimality and complexity of Follow-the-Perturbed-Leader (FTPL) policy in $m$-set semi-bandit problems. FTPL has been studied extensively as a promising candidate of an efficient algorithm with favorable regret for adversarial combinatorial semi-bandits. Nevertheless, the optimality of FTPL has still been unknown unlike Follow-the-Regularized-Leader (FTRL) whose optimality has been proved for various tasks of online learning. In this paper, we extend the analysis of FTPL with geometric resampling (GR) to $m$-set semi-bandits, which is a special case of combinatorial semi-bandits, showing that FTPL with Fréchet and Pareto distributions with certain parameters achieves the best possible regret of $O(\sqrt{mdT})$ in adversarial setting. We also show that FTPL with Fréchet and Pareto distributions with a certain parameter achieves a logarithmic regret for stochastic setting, meaning the Best-of-Both-Worlds optimality of FTPL for $m$-set semi-bandit problems. Furthermore, we extend the conditional geometric resampling to $m$-set semi-bandits for efficient loss estimation in FTPL, reducing the computational complexity from $O(d^2)$ of the original geometric resampling to $O(md(\log(d/m)+1))$ without sacrificing the regret performance.

Keywords: Follow-the-Perturbed-Leader, FTPL, m-set semi-bandit, regret analysis, geometric resampling, computational complexity, adversarial setting, stochastic setting
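To make the abstract's two ingredients concrete, the following is a minimal sketch of FTPL with Fréchet perturbations and plain (not conditional) geometric resampling for an $m$-set semi-bandit. It is not the paper's implementation: the function names, the learning-rate schedule `1 / sqrt(t)`, the shape parameter `alpha = 2.0`, and the resampling cap are all assumptions chosen for illustration, and each chosen arm is resampled independently for simplicity.

```python
import math
import random


def frechet(alpha, rng):
    """Draw one Frechet(alpha) sample by inverting F(x) = exp(-x**(-alpha))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / alpha)


def ftpl_select(cum_loss, m, eta, alpha, rng):
    """Return the m arms with the smallest perturbed cumulative loss."""
    d = len(cum_loss)
    perturbed = [eta * cum_loss[i] - frechet(alpha, rng) for i in range(d)]
    return set(sorted(range(d), key=perturbed.__getitem__)[:m])


def geometric_resampling(arm, cum_loss, m, eta, alpha, rng, cap):
    """Count fresh draws until `arm` is selected again (capped).

    The count estimates 1 / P(arm is selected) without computing that probability.
    """
    for k in range(1, cap + 1):
        if arm in ftpl_select(cum_loss, m, eta, alpha, rng):
            return k
    return cap


def run_ftpl(losses, m, alpha=2.0, cap=200, seed=0):
    """Play FTPL on a T x d loss matrix and return the total incurred loss."""
    rng = random.Random(seed)
    d = len(losses[0])
    cum_loss = [0.0] * d  # estimated cumulative losses
    total = 0.0
    for t, loss in enumerate(losses, start=1):
        eta = 1.0 / math.sqrt(t)  # hypothetical learning-rate schedule
        chosen = ftpl_select(cum_loss, m, eta, alpha, rng)
        for i in chosen:  # semi-bandit feedback: only chosen arms are observed
            total += loss[i]
            k = geometric_resampling(i, cum_loss, m, eta, alpha, rng, cap)
            cum_loss[i] += loss[i] * k  # importance-weighted loss estimate
    return total
```

Calling `run_ftpl(losses, m=2)` on a list of per-round loss vectors returns the cumulative loss of the played sets. Each resampling draw here costs a full selection over all `d` arms; reducing that cost is what the conditional variant described in the abstract targets.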

Authors: Xiaofu Jin, Yunpeng Bai, Antti Oulasvirta Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11759v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

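Scoring rationale: The paper models how human users navigate information architectures, specifically through a sequential decision-making model based on information scent. It belongs to human-computer interaction (HCI), cognitive modeling, and user behavior analysis, and does not involve large models, deep learning, AI technical principles, or AI applications in science. The scoring keywords all focus on large-model techniques and related methods (training, inference, alignment, compression, and so on), whereas this paper studies human cognitive processes and behavior simulation, with no connection to AI model techniques. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

By modeling navigation as a sequential decision-making problem under memory constraints, the paper extends the concept of information scent and explains trial-and-error navigation in information architectures: premature link selections, wrong turns, and recovery through backtracking.

Abstract (Chinese translation)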

用户常常难以在信息架构中定位目标条目，尤其是当链接表述模糊或深嵌于层级结构时。信息线索理论常被用于解释用户为何选择错误链接，但这一概念默认用户在决策前会浏览所有可用链接。实际上，用户往往过快地选择链接，忽略相关线索，并在出错时依赖回溯操作。本研究通过将导航行为框架化为记忆约束下的序列决策问题，拓展了信息线索的理论内涵。具体而言，我们假设用户不会完整扫描页面内容，而是基于时间预算进行策略性浏览，仅查看“足够找到目标”的信息。在选择下一个待查看条目时，用户会同时考虑局部（当前页面）与全局（网站）信息线索；但二者均受记忆能力制约。为尽量避免时间浪费，用户有时会在未检视页面全部内容的情况下选择错误链接。与实证数据的对比表明，我们的模型复现了关键导航行为：过早选择、误入歧途以及通过回溯实现的路径恢复。研究结论表明：当纳入导航问题的序列性与有限性特征时，试错行为可通过信息线索理论得到合理解释。

Abstract

Users often struggle to locate an item within an information architecture, particularly when links are ambiguous or deeply nested in hierarchies. Information scent has been used to explain why users select incorrect links, but this concept assumes that users see all available links before deciding. In practice, users frequently select a link too quickly, overlook relevant cues, and then rely on backtracking when errors occur. We extend the concept of information scent by framing navigation as a sequential decision-making problem under memory constraints. Specifically, we assume that users do not scan entire pages but instead inspect strategically, looking “just enough” to find the target given their time budget. To choose which item to inspect next, they consider both local (this page) and global (site) scent; however, both are constrained by memory. Trying to avoid wasting time, they occasionally choose the wrong links without inspecting everything on a page. Comparisons with empirical data show that our model replicates key navigation behaviors: premature selections, wrong turns, and recovery from backtracking. We conclude that trial-and-error behavior is well explained by information scent when accounting for the sequential and bounded characteristics of the navigation problem.

Keywords: Information Scent, Sequential Decision-making, Navigation, Memory Constraints, Trial-and-error Behavior, User Behavior Modeling, Backtracking, Human-Computer Interaction

242. ❌ Mitigating the Multiplicity Burden: The Role of Calibration in Reducing Predictive Multiplicity of Classifiers

Authors: Mustafa Cavus Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11750v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

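Scoring rationale: The paper studies calibration and predictive multiplicity in machine learning classifiers, focusing on model reliability and fairness in credit risk assessment. The scoring keywords all concern large models, deep learning principles, or AI applications in science, whereas this paper discusses calibration techniques for conventional machine learning classifiers (not specified as deep learning models) and does not address large models, deep learning, or AI for Science. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how classifier calibration relates to predictive multiplicity. It finds that minority-class observations bear a disproportionate multiplicity burden in credit risk assessment, and that post-hoc calibration, in particular Platt Scaling and Isotonic Regression, reduces algorithmic arbitrariness.

Abstract (Chinese translation)

随着机器学习模型日益部署于高风险决策环境，确保概率可靠性与预测稳定性变得至关重要。本文研究了分类校准与预测多重性之间的相互作用，后者指罗生门集合（Rashomon set）中多个近似最优模型对同一申请人给出相互矛盾的信贷结果的现象。基于九个异构信贷风险基准数据集，我们探究了预测多重性是否集中于低预测置信度区域，以及事后校准如何缓解算法任意性。实证分析表明，少数类样本承担着不成比例的多重性负担，这通过预测多重性与预测置信度的显著差异得到证实。此外，实验比较显示，应用事后校准方法（特别是 Platt Scaling、保序回归（Isotonic Regression）与温度缩放（Temperature Scaling））与罗生门集合内更低的模糊度（obscurity）相关。在测试技术中，Platt Scaling 与保序回归对预测多重性的削减效果最为稳健。这些发现表明，校准可作为强制共识层发挥作用，并可能通过缓解预测多重性来支持程序公平性。

Abstract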

As machine learning models are increasingly deployed in high-stakes environments, ensuring both probabilistic reliability and prediction stability has become critical. This paper examines the interplay between classification calibration and predictive multiplicity - the phenomenon in which multiple near-optimal models within the Rashomon set yield conflicting credit outcomes for the same applicant. Using nine diverse credit risk benchmark datasets, we investigate whether predictive multiplicity concentrates in regions of low predictive confidence and how post-hoc calibration can mitigate algorithmic arbitrariness. Our empirical analysis reveals that minority class observations bear a disproportionate multiplicity burden, as confirmed by significant disparities in predictive multiplicity and prediction confidence. Furthermore, our empirical comparisons indicate that applying post-hoc calibration methods - specifically Platt Scaling, Isotonic Regression, and Temperature Scaling - is associated with lower obscurity across the Rashomon set. Among the tested techniques, Platt Scaling and Isotonic Regression provide the most robust reduction in predictive multiplicity. These findings suggest that calibration can function as a consensus-enforcing layer and may support procedural fairness by mitigating predictive multiplicity.

Keywords: predictive multiplicity, calibration, classification, Rashomon set, credit risk, algorithmic arbitrariness, post-hoc calibration, procedural fairness
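As an illustration of one of the post-hoc calibration methods named in the abstract, here is a minimal sketch of temperature scaling for a binary classifier. It assumes the model exposes raw logits and that a held-out calibration set is available; the function names and the grid of candidate temperatures are hypothetical, and this is the textbook technique rather than the paper's experimental pipeline.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of binary labels under sigmoid(logit / T)."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes held-out NLL."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # candidates 0.05 .. 10.0
    return min(grid, key=lambda t: nll(logits, labels, t))
```

A fitted temperature above 1 softens overconfident probabilities, while the ranking of examples is unchanged because dividing logits by a positive constant is monotone. Platt Scaling generalizes this by also learning an offset, and Isotonic Regression fits a non-parametric monotone map instead.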

243. ❌ Decomposing Observational Multiplicity in Decision Trees: Leaf and Structural Regret

Authors: Mustafa Cavus Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11701v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

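Scoring rationale: The paper studies observational multiplicity in decision trees, which is interpretability and safety research on conventional machine learning models and has no direct connection to the large-model and deep-learning keywords. Only "Mechanistic Interpretability OR Explainable AI" is partially related (relevance 5), because the paper concerns model interpretability and a safety framework, though not for large models or deep learning.

!!! tip deepseek-chat TL;DR

The paper studies the sources of observational multiplicity in decision tree classifiers. It proposes two complementary notions, leaf regret and structural regret, to quantify prediction variability, finds that structural regret is the main driver, and demonstrates the value of these measures as an abstention mechanism in selective prediction.

Abstract (Chinese translation)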

许多机器学习任务中存在多个性能几乎相当的模型，这一现象被称为预测多重性。其根本来源之一是观测多重性，它产生于标签收集的随机性：观测到的训练标签仅代表了底层真实概率的单一实现。尽管逻辑回归的观测多重性理论框架已经建立，但其对决策树这类非光滑、基于划分的模型的影响仍未得到充分探索。本文针对决策树分类器提出了两种互补的观测多重性概念：叶节点遗憾（leaf regret）和结构遗憾（structural regret）。叶节点遗憾量化了固定叶节点内由于有限样本噪声导致的预测内在变异性，而结构遗憾则捕捉了由学习到的树结构本身的不稳定性所引发的变异性。我们给出了将观测多重性分解为这两个组成部分的形式化方法，并建立了统计保证。我们在多个信用风险评分数据集上的实验评估证实，理论分解与经验观测的方差近乎完美吻合。值得注意的是，我们发现结构遗憾是观测多重性的主要驱动因素，在某些数据集中其变异性可达叶节点遗憾的15倍以上。此外，我们证明在选择性预测中将这些遗憾度量作为弃权机制，可以有效识别任意区域并提升模型安全性，在最稳定的子群体上将召回率从92%提升至100%。这些结果为量化观测多重性建立了一个严谨的框架，与算法安全性和可解释性的最新进展相契合。

Abstract

Many machine learning tasks admit multiple models that perform almost equally well, a phenomenon known as predictive multiplicity. A fundamental source of this multiplicity is observational multiplicity, which arises from the stochastic nature of label collection: observed training labels represent only a single realization of the underlying ground-truth probabilities. While theoretical frameworks for observational multiplicity have been established for logistic regression, their implications for non-smooth, partition-based models like decision trees remain underexplored. In this paper, we introduce two complementary notions of observational multiplicity for decision tree classifiers: leaf regret and structural regret. Leaf regret quantifies the intrinsic variability of predictions within a fixed leaf due to finite-sample noise, while structural regret captures variability induced by the instability of the learned tree structure itself. We provide a formal decomposition of observational multiplicity into these two components and establish statistical guarantees. Our experimental evaluation across diverse credit risk scoring datasets confirms the near-perfect alignment between our theoretical decomposition and the empirically observed variance. Notably, we find that structural regret is the primary driver of observational multiplicity, accounting for over 15 times the variability of leaf regret in some datasets. Furthermore, we demonstrate that utilizing these regret measures as an abstention mechanism in selective prediction can effectively identify arbitrary regions and improve model safety, elevating recall from 92% to 100% on the most stable sub-populations. These results establish a rigorous framework for quantifying observational multiplicity, aligning with recent advances in algorithmic safety and interpretability.

Keywords: observational multiplicity, decision trees, leaf regret, structural regret, predictive multiplicity, selective prediction, model safety, interpretability

244. ❌ Context-dependent manifold learning: A neuromodulated constrained autoencoder approach

Authors: Jérôme Adriaens, Guillaume Drion, Pierre Sacré Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11673v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

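Scoring rationale: The paper improves constrained autoencoders (cAE) by introducing a neuromodulatory mechanism for context-dependent manifold learning, which is an architectural innovation for autoencoders. The scoring keywords all relate directly to large language models (LLMs), deep learning principles, or AI applications in science, whereas this paper focuses on applying conventional autoencoders to dynamical systems and touches none of the keyword areas (LLMs, MoE, quantization, inference acceleration, alignment, RAG, and so on). Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Neuromodulated Constrained Autoencoder (NcAE), which addresses the inability of standard constrained autoencoders to adapt to varying physical parameters or environmental conditions. It achieves context-dependent manifold learning through gain and bias tuning, and experiments show that the NcAE accurately captures how manifold geometry varies across regimes while maintaining rigorous projection properties.

Abstract (Chinese translation)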

约束自编码器（cAE）通过在潜在空间上施加几何结构，为可解释的降维提供了一条成功路径。然而，标准的cAE无法适应变化的物理参数或环境条件，且容易将这些上下文变化与主要输入混为一谈。为解决这一问题，我们将神经调控机制整合到cAE框架中，以实现上下文依赖的流形学习。本文提出了神经调控约束自编码器（Neuromodulated Constrained Autoencoder, NcAE），它能够基于静态上下文信息，通过增益和偏置调节来自适应地参数化几何约束。在动态系统上的实验结果表明，NcAE能够准确捕捉不同状态下流形几何的变化，同时保持严格的投影特性。这些结果证明，神经调控机制有效地将全局上下文参数与局部流形表示解耦。该架构为在受（非平稳）环境约束的系统中开发更灵活、具有物理信息意识的表示奠定了基础。

Abstract

Constrained autoencoders (cAE) provide a successful path towards interpretable dimensionality reduction by enforcing geometric structure on latent spaces. However, standard cAEs cannot adapt to varying physical parameters or environmental conditions without conflating these contextual shifts with the primary input. To address this, we integrated a neuromodulatory mechanism into the cAE framework to allow for context-dependent manifold learning. This paper introduces the Neuromodulated Constrained Autoencoder (NcAE), which adaptively parameterizes geometric constraints via gain and bias tuning conditioned on static contextual information. Experimental results on dynamical systems show that the NcAE accurately captures how manifold geometry varies across different regimes while maintaining rigorous projection properties. These results demonstrate that neuromodulation effectively decouples global contextual parameters from local manifold representations. This architecture provides a foundation for developing more flexible, physics-informed representations in systems subject to (non-stationary) environmental constraints.

Keywords: Constrained autoencoder, Neuromodulation, Manifold learning, Context-dependent, Dimensionality reduction, Geometric constraints, Dynamic systems, Physics-informed representations
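The abstract describes conditioning a layer on static context through gain and bias tuning. One common way to express that idea is sketched below, under the assumption that the gain and bias are affine functions of the context vector and are applied element-wise before the nonlinearity. The layer shape, the `tanh` activation, and all parameter names are hypothetical; the paper's actual NcAE architecture and its geometric constraints are not reproduced here.

```python
import math


def affine(matrix, vector, offset):
    """Return matrix @ vector + offset for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(matrix, offset)]


def modulated_layer(x, context, layer_w, gain_w, gain_b, bias_w, bias_b):
    """Compute tanh(gain(context) * (layer_w @ x) + bias(context)) element-wise."""
    pre = affine(layer_w, x, [0.0] * len(layer_w))
    gain = affine(gain_w, context, gain_b)  # context-dependent gain per unit
    bias = affine(bias_w, context, bias_b)  # context-dependent bias per unit
    return [math.tanh(g * p + b) for g, p, b in zip(gain, pre, bias)]
```

With `gain_w` and `bias_w` set to zero, `gain_b` to ones, and `bias_b` to zeros, the layer reduces to an ordinary `tanh` layer, so the context only changes the mapping through the learned modulation weights.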

245. ❌ Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

Authors: Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, Roberto Martin-Martin Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11653v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

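Scoring rationale: The paper studies continual reinforcement learning for Vision-Language-Action (VLA) models, with parameter-efficient fine-tuning via LoRA at its core; it is an application of large models to robotics and embodied intelligence. Highly relevant keywords: PEFT/LoRA (the core method, 10), Large Language Models/Foundation Models (VLA models build on large models, 8), Pre-training/Continual Pre-training (pretrained models and continual learning, 8), Post-training/SFT (fine-tuning, 8), and LLM Agents/Autonomous Agents (embodied agents, 8). Other keywords, such as MoE, SLMs, Scaling Laws, RLHF, and RAG, are not addressed and receive 0.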
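Each table row carries a weight, a relevance, and a score, and the report configuration gives a per-keyword cap of 15 and a pass mark of 5.0 + 0.8 × total weight. The sketch below shows how a total could be derived from those rows. The row formula (weight × relevance, capped) is an assumption consistent with the rows shown, not something the report states, and the function names are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap from the report configuration


def keyword_score(weight, relevance):
    """Score one keyword row; weight x relevance is an assumed formula."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def pass_mark(total_weight):
    """Pass mark from the report configuration: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows):
    """Sum keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)
```

Under this assumption, a row with weight 0.0 contributes 0.0 regardless of its relevance, which matches the table above, where relevance values of 8.0 and 10.0 appear next to a score of 0.0.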

!!! tip deepseek-chat TL;DR

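The paper finds that, for large pretrained Vision-Language-Action models, simple sequential fine-tuning with LoRA performs strongly in continual reinforcement learning: it achieves high plasticity, little forgetting, and strong zero-shot generalization, challenging the conventional view that continual learning requires complex strategies.

Abstract (Chinese translation)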

面向视觉-语言-动作（Vision-Language-Action, VLA）模型的持续强化学习（Continual Reinforcement Learning, CRL）是开发能够在开放、动态环境中自我改进的具身智能体的一个前景广阔的方向。然而，持续学习领域的传统观点认为，简单的顺序微调（Sequential Fine-Tuning, Seq. FT）会导致灾难性遗忘，因此需要复杂的CRL策略。在本研究中，我们回归基础，对三个大型预训练VLA模型在五个具有挑战性的终身强化学习基准上进行了系统的CRL研究。我们发现，与既定认知相反，结合低秩自适应（LoRA）的简单顺序微调表现出惊人的强大性能：它具有很高的可塑性，几乎不产生遗忘，并保持了强大的零样本泛化能力，其表现常常优于更复杂的CRL方法。通过详细分析，我们揭示了这种鲁棒性源于大型预训练模型、参数高效自适应以及同策略强化学习三者之间的协同作用。这些组件共同重塑了稳定性与可塑性之间的权衡，使得持续自适应既稳定又可扩展。我们的研究结果确立了顺序微调作为VLA模型持续强化学习的一种有效方法，并为大模型时代的终身学习提供了新的见解。代码发布于 github.com/UT-Austin-RobIn/continual-vla-rl。

Abstract

Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in open-ended, evolving environments. However, conventional wisdom from continual learning suggests that naive Sequential Fine-Tuning (Seq. FT) leads to catastrophic forgetting, necessitating complex CRL strategies. In this work, we take a step back and conduct a systematic study of CRL for large pretrained VLAs across three models and five challenging lifelong RL benchmarks. We find that, contrary to established belief, simple Seq. FT with low-rank adaptation (LoRA) is remarkably strong: it achieves high plasticity, exhibits little to no forgetting, and retains strong zero-shot generalization, frequently outperforming more sophisticated CRL methods. Through detailed analysis, we show that this robustness arises from a synergy between the large pretrained model, parameter-efficient adaptation, and on-policy RL. Together, these components reshape the stability-plasticity trade-off, making continual adaptation both stable and scalable. Our results position Sequential Fine-Tuning as a powerful method for continual RL with VLAs and provide new insights into lifelong learning in the large model era. Code is available at github.com/UT-Austin-RobIn/continual-vla-rl.

Keywords: Continual Reinforcement Learning, Vision-Language-Action Models, Sequential Fine-Tuning, Low-rank Adaptation (LoRA), Parameter-efficient Adaptation, Embodied Agents, Lifelong Learning, Stability-Plasticity Trade-off
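For readers unfamiliar with low-rank adaptation, the following minimal sketch shows the LoRA parameterization on a single linear layer: the pretrained weight stays frozen and only a rank-`r` update `(alpha / r) * B @ A` is trainable. The class name, the pure-Python matrix code, and the initialization scale are illustrative assumptions; this is not the authors' released code, and no training loop is shown.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape d_out x d_in
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Number of adapter parameters (entries of A and B)."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)
```

Because `B` is initialized to zero, the adapted layer reproduces the pretrained layer exactly before any update, and the adapter holds `r * (d_in + d_out)` parameters instead of `d_in * d_out`.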

246. ❌ Personalized Federated Learning via Gaussian Generative Modeling

Authors: Peng Hu, Jianwei Ma Source: arXiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11620v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

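Scoring rationale: The paper studies personalized federated learning (pFedGM), focusing on model personalization under data heterogeneity using Gaussian generative modeling and Bayesian inference. The scoring keywords all relate directly to large language models (LLMs), deep learning principles, or AI applications in science, whereas this paper belongs to conventional federated learning: it does not involve LLMs, MoE, quantization, inference acceleration, alignment, RAG, or other large-model techniques, nor is it applied to scientific fields such as bioinformatics. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes pFedGM, a personalized federated learning method based on Gaussian generative modeling. By modeling client heterogeneity and using a dual-scale fusion framework, it achieves superior or competitive performance compared with existing methods under data heterogeneity.

Abstract (Chinese translation)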

联邦学习作为一种新兴范式，能够在保护隐私的前提下，基于本质上分布式的客户端数据协同训练模型。在此背景下，个性化联邦学习通过为每个客户端配备专用模型，以应对数据异质性的挑战。一种主流策略将模型解耦为共享特征提取器与个性化分类器头部，其中后者主动引导表征学习。然而，既往研究多聚焦于分类器头部引导的个性化，忽视了表征分布中潜在的个性化特性。基于此洞察，我们提出pFedGM，一种基于高斯生成建模的方法。该方法首先训练一个高斯生成器，通过加权重采样对客户端异质性进行建模。随后，通过采用双重目标实现全局协作与个性化之间的平衡：一个共享目标旨在最大化跨客户端的类间距离，一个本地目标则致力于最小化客户端内部的类内距离。为实现此目标，我们将传统高斯分类器解耦为用于全局优化的导航器，以及用于捕捉分布统计量的统计提取器。受卡尔曼增益启发，该算法随后在全局与本地层面采用双尺度融合框架，为每个客户端配备个性化分类器头部。在此框架中，我们将全局表征分布建模为先验分布，将客户端特定数据建模为似然分布，从而通过贝叶斯推断实现类别概率估计。评估涵盖了一系列广泛场景：类别数量异质性、环境干扰，以及多种基准数据集与配置。与现有先进方法相比，pFedGM展现出优越或具有竞争力的性能。

Abstract

Federated learning has emerged as a paradigm to train models collaboratively on inherently distributed client data while safeguarding privacy. In this context, personalized federated learning tackles the challenge of data heterogeneity by equipping each client with a dedicated model. A prevalent strategy decouples the model into a shared feature extractor and a personalized classifier head, where the latter actively guides the representation learning. However, previous works have focused on classifier head-guided personalization, neglecting the potential personalized characteristics in the representation distribution. Building on this insight, we propose pFedGM, a method based on Gaussian generative modeling. The approach begins by training a Gaussian generator that models client heterogeneity via weighted re-sampling. A balance between global collaboration and personalization is then struck by employing a dual objective: a shared objective that maximizes inter-class distance across clients, and a local objective that minimizes intra-class distance within them. To achieve this, we decouple the conventional Gaussian classifier into a navigator for global optimization, and a statistic extractor for capturing distributional statistics. Inspired by the Kalman gain, the algorithm then employs a dual-scale fusion framework at global and local levels to equip each client with a personalized classifier head. In this framework, we model the global representation distribution as a prior and the client-specific data as the likelihood, enabling Bayesian inference for class probability estimation. The evaluation covers a comprehensive range of scenarios: heterogeneity in class counts, environmental corruption, and multiple benchmark datasets and configurations. pFedGM achieves superior or competitive performance compared to state-of-the-art methods.

Keywords: Personalized Federated Learning, Gaussian Generative Modeling, Data Heterogeneity, Bayesian Inference, Client-specific Models, Dual-scale Fusion, Representation Distribution, Privacy-preserving Learning
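
The pFedGM abstract describes a Kalman-gain-inspired fusion in which the global representation distribution acts as a prior and client data as the likelihood. The sketch below is not the paper's algorithm; it only illustrates the standard one-dimensional Gaussian update that such a fusion resembles, followed by Bayes' rule over Gaussian class-conditionals. The function names (`kalman_fuse`, `class_posterior`) and the scalar setting are assumptions made for illustration.

```python
import math


def kalman_fuse(mu_global, var_global, mu_local, var_local):
    """Fuse a global prior N(mu_global, var_global) with a local estimate
    whose uncertainty is var_local (hypothetical 1-D illustration)."""
    gain = var_global / (var_global + var_local)    # Kalman gain in [0, 1]
    mu = mu_global + gain * (mu_local - mu_global)  # pull prior toward local data
    var = (1.0 - gain) * var_global                 # fused variance shrinks
    return mu, var


def class_posterior(x, class_stats, class_priors):
    """Bayes' rule over 1-D Gaussian class-conditionals.

    class_stats maps class -> (mean, variance); class_priors maps class -> P(class).
    """
    log_scores = {}
    for c, (mu, var) in class_stats.items():
        log_lik = -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        log_scores[c] = math.log(class_priors[c]) + log_lik
    top = max(log_scores.values())                  # subtract max for stability
    norm = sum(math.exp(s - top) for s in log_scores.values())
    return {c: math.exp(s - top) / norm for c, s in log_scores.items()}


# Personalize each class mean, then classify a feature value.
stats = {
    "cat": kalman_fuse(0.0, 1.0, 0.4, 0.5),
    "dog": kalman_fuse(2.0, 1.0, 2.6, 0.5),
}
print(class_posterior(1.0, stats, {"cat": 0.5, "dog": 0.5}))
```

A small `var_local` (a confident client estimate) pushes the gain toward 1 and the personalized statistics toward the client's own data; a large one keeps them near the global prior.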

247. ❌ AutoScout: Structured Optimization for Automating ML System Configuration

Authors: Jimmy Shong, Yuhan Ding, Yihan Jiang, Liheng Jing, Haonan Chen, Gaokai Zhang, Aditya Akella, Fan Lai Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11603v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

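Scoring rationale: The paper focuses on automated optimization of machine learning system configuration (AutoScout), covering system-level settings such as model-parallelism strategies, communication optimizations, and runtime parameters. All keywords concern specific topics such as the technical principles, training methods, inference optimization, alignment, and applications of large models, whereas this paper presents a general-purpose system configuration optimization framework that does not target a particular model technique or application domain; it is therefore judged unrelated to every keyword.

!!! tip deepseek-chat TL;DR

The paper presents AutoScout, a general-purpose configuration optimizer for machine learning systems that uses mixed discrete/continuous optimization and hierarchical dependency modeling to find high-performance configurations automatically, achieving a 2.7-3.0× training speedup across diverse models and hardware.

Abstract translation

Machine learning (ML) systems expose a rapidly expanding configuration space that spans model-parallelism strategies, communication optimizations, and low-level runtime parameters. End-to-end system efficiency is highly sensitive to these choices, yet identifying high-performance configurations is very difficult because of heterogeneous feature types (for example, sparse and dense parameters), conditional dependencies (for example, execution parameters that are valid only under specific upstream decisions), and the high cost of search (profiling). Existing approaches either optimize only a narrow subset of configuration dimensions or rely on ad hoc heuristics that fail to generalize as the configuration space keeps growing. This paper presents AutoScout, a general-purpose system configurator for ML training, fine-tuning, and inference. It formulates system configuration as a mixed discrete/continuous optimization problem with hierarchical dependencies and introduces a hybrid optimization framework that jointly optimizes sparse structural decisions and dense execution parameters. To reduce profiling cost, AutoScout adaptively prioritizes high-impact configuration features and integrates simulators of varying fidelity. Across diverse models, hardware platforms, and deployment objectives, AutoScout consistently identifies high-performance configurations, achieving a 2.7-3.0× training speedup over expert-tuned settings.

Abstract (original)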

Machine learning (ML) systems expose a rapidly expanding configuration space spanning model-parallelism strategies, communication optimizations, and low-level runtime parameters. End-to-end system efficiency is highly sensitive to these choices, yet identifying high-performance configurations is challenging due to heterogeneous feature types (e.g., sparse and dense parameters), conditional dependencies (e.g., valid execution parameters only under specific upstream decisions), and the high search (profiling) cost. Existing approaches either optimize a narrow subset of configuration dimensions or rely on ad-hoc heuristics that fail to generalize as configuration spaces continue to grow. We present AutoScout, a general-purpose systems configurator for ML training, fine-tuning, and inference. It formulates the system configuration as a mixed-discrete/continuous optimization problem with hierarchical dependencies and introduces a hybrid optimization framework that jointly refines sparse structural decisions and dense execution parameters. To reduce profiling cost, AutoScout adaptively prioritizes high-impact configuration features and ensembles simulators with varying fidelity. Across diverse models, hardware platforms, and deployment objectives, AutoScout consistently identifies high-performance configurations, achieving 2.7-3.0$\times$ training speedup over expert-tuned settings.

Keywords: AutoScout, ML system configuration, mixed-discrete/continuous optimization, hierarchical dependencies, training speedup, model-parallelism, communication optimizations, runtime parameters
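
To make "mixed discrete/continuous optimization with hierarchical dependencies" concrete, here is a minimal sketch of a conditional configuration space searched by plain random search. It is not AutoScout's hybrid optimizer: the option names (`parallelism`, `tp_degree`, `micro_batches`, `overlap_fraction`), their value ranges, and the `toy_step_time` cost model are all invented for illustration, and random search is swapped in only because it is the simplest baseline.

```python
import random


# Hypothetical search space: option names and ranges are illustrative
# assumptions, not AutoScout's actual configuration schema.
def sample_config(rng):
    """Draw one configuration while respecting conditional dependencies."""
    cfg = {"parallelism": rng.choice(["data", "tensor", "pipeline"])}
    # Discrete child parameters exist only under specific parent choices.
    if cfg["parallelism"] == "tensor":
        cfg["tp_degree"] = rng.choice([2, 4, 8])
    elif cfg["parallelism"] == "pipeline":
        cfg["micro_batches"] = rng.choice([4, 8, 16])
    # A continuous execution parameter that is always present.
    cfg["overlap_fraction"] = rng.uniform(0.0, 1.0)
    return cfg


def toy_step_time(cfg):
    """Made-up cost model standing in for an expensive profiling run."""
    base = {"data": 1.0, "tensor": 0.8, "pipeline": 0.9}[cfg["parallelism"]]
    if "tp_degree" in cfg:
        base += 0.05 * abs(cfg["tp_degree"] - 4)
    if "micro_batches" in cfg:
        base += 0.01 * abs(cfg["micro_batches"] - 8)
    return base + (cfg["overlap_fraction"] - 0.7) ** 2


def random_search(n_trials, seed=0):
    """Plain random search: keep the cheapest configuration seen."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return min(trials, key=toy_step_time)


best = random_search(200)
print(best, round(toy_step_time(best), 4))
```

Sampling the parent decision first guarantees that every drawn configuration is valid, which is the property the abstract's "conditional dependencies" make hard for flat search spaces.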

248. ❌ Hybrid Energy-Aware Reward Shaping: A Unified Lightweight Physics-Guided Methodology for Policy Optimization

Authors: Qijun Liao, Jue Yang, Yiting Kang, Xinxin Zhao, Yong Zhang, Mingan Zhao Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11600v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

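Scoring rationale: The paper focuses on policy optimization in deep reinforcement learning (DRL), specifically on improving continuous control tasks through Hybrid Energy-Aware Reward Shaping (H-EARS). Although it involves deep learning (DRL is a subfield of deep learning), its content does not touch on large language models (LLMs), the technical principles of large models, or applications of AI in science. All keywords relate to large language models, large-model techniques, or specific AI-for-science applications, whereas this paper studies a reward shaping method for reinforcement learning and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes Hybrid Energy-Aware Reward Shaping (H-EARS), which unifies potential-based reward shaping with energy-aware action regularization and improves the convergence, stability, and energy efficiency of deep reinforcement learning on continuous control tasks without requiring a complete system model.

Abstract translation

Deep reinforcement learning performs well on continuous control tasks but usually requires extensive exploration, whereas physics-based models require complete system equations and suffer from cubic complexity. This study proposes Hybrid Energy-Aware Reward Shaping (H-EARS), which unifies potential-based reward shaping with energy-aware action regularization. Through functional decomposition, H-EARS constrains action magnitude while balancing task-specific and energy-based potentials; it captures only the dominant energy components, needs no full dynamics model, and achieves linear complexity O(n). We establish the following theoretical foundations: (1) functional independence that separates task optimization from energy optimization; (2) an energy-based mechanism for accelerating convergence; (3) convergence guarantees under function approximation; and (4) error bounds for approximate potential functions. Connections to Lyapunov stability are also analyzed as heuristic guidance. Experiments across multiple benchmarks show improvements in convergence, stability, and energy efficiency. Vehicle simulations validate the method's applicability to safety-critical domains under extreme conditions. The results confirm that integrating lightweight physics priors enhances model-free reinforcement learning without a complete system model and helps move laboratory research toward industrial application.

Abstract (original)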

Deep reinforcement learning excels in continuous control but often requires extensive exploration, while physics-based models demand complete equations and suffer cubic complexity. This study proposes Hybrid Energy-Aware Reward Shaping (H-EARS), unifying potential-based reward shaping with energy-aware action regularization. H-EARS constrains action magnitude while balancing task-specific and energy-based potentials via functional decomposition, achieving linear complexity O(n) by capturing dominant energy components without full dynamics. We establish a theoretical foundation including: (1) functional independence for separate task/energy optimization; (2) energy-based convergence acceleration; (3) convergence guarantees under function approximation; and (4) approximate potential error bounds. Lyapunov stability connections are analyzed as heuristic guides. Experiments across baselines show improved convergence, stability, and energy efficiency. Vehicle simulations validate applicability in safety-critical domains under extreme conditions. Results confirm that integrating lightweight physics priors enhances model-free RL without complete system models, enabling transfer from lab research to industrial applications.

Keywords: Deep Reinforcement Learning, Reward Shaping, Energy-Aware, Policy Optimization, Physics-Guided, Continuous Control, Convergence Acceleration, Linear Complexity
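
The two ingredients named in the H-EARS abstract, potential-based reward shaping and an energy-aware penalty on action magnitude, can be sketched as follows. The shaping term uses the standard form F(s, s') = γΦ(s') − Φ(s); the particular potential, the weights `w_task`, `w_energy`, and `lam`, and the way the terms are combined are assumptions for illustration, not the paper's formulation.

```python
def potential(distance_to_goal, kinetic_energy, w_task=1.0, w_energy=0.1):
    """Hypothetical composite potential: a task term plus an energy term."""
    return -(w_task * distance_to_goal + w_energy * kinetic_energy)


def shaped_reward(reward, phi_s, phi_next, action, gamma=0.99, lam=0.01):
    """Environment reward + potential-based shaping - action-energy penalty."""
    shaping = gamma * phi_next - phi_s              # F(s, s') = gamma*Phi(s') - Phi(s)
    action_cost = lam * sum(a * a for a in action)  # penalize squared action magnitude
    return reward + shaping - action_cost


# One transition: the agent moves closer to the goal and slows down.
phi_s = potential(distance_to_goal=2.0, kinetic_energy=1.0)
phi_next = potential(distance_to_goal=1.5, kinetic_energy=0.5)
print(shaped_reward(0.0, phi_s, phi_next, action=[0.3, -0.2]))
```

With γ = 1 the shaping terms telescope along a trajectory to Φ(s_T) − Φ(s_0), which is why potential-based shaping changes the learning signal without favoring any particular path between two states.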

249. ❌ Survival Meets Classification: A Novel Framework for Early Risk Prediction Models of Chronic Diseases

Authors: Shaheer Ahmad Khan, Muhammad Usamah Shahid, Muddassar Farooq Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11598v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

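Scoring rationale: The paper focuses on chronic disease risk prediction from electronic medical record data and proposes a new method that combines survival analysis with classification techniques. Its content is unrelated to most keywords (which concern the technical principles, training methods, and inference optimization of large models), so those are scored 0. It has some relevance to 'Mechanistic Interpretability OR Explainable AI' (5 points), because the paper reports using a new method to generate explanations that were clinically validated, and higher relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points), because the paper applies AI to the biomedical domain, although it does not explicitly use large models or deep learning.

!!! tip deepseek-chat TL;DR

The study proposes a new framework that combines survival analysis with classification techniques to predict the early risk of five chronic diseases from electronic medical record data; experiments show performance better than or comparable to existing models such as LightGBM and XGBoost.

Abstract translation

Chronic diseases are long-term health conditions that require lifelong medical intervention. Using large-scale electronic medical record (EMR) data, we developed early disease risk prediction models for five common chronic diseases: diabetes, hypertension, chronic kidney disease, chronic obstructive pulmonary disease, and chronic ischemic heart disease. This study proposes a new approach to disease risk modeling that integrates survival analysis with classification techniques. Traditional chronic disease risk models focus on either survival analysis or classification in isolation. The paper shows that survival analysis methods can be re-engineered to perform classification efficiently and effectively, making them a comprehensive tool for developing disease risk surveillance models. Experiments on real-world large-scale EMR data show that the survival models match or exceed state-of-the-art models such as LightGBM and XGBoost in accuracy, F1 score, and AUROC. Finally, the proposed survival models use a novel method to generate prediction explanations, which were clinically validated by a review panel of three expert physicians.

Abstract (original)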

Chronic diseases are long-lasting conditions that require lifelong medical attention. Using big EMR data, we have developed early disease risk prediction models for five common chronic diseases: diabetes, hypertension, CKD, COPD, and chronic ischemic heart disease. In this study, we present a novel approach for disease risk models by integrating survival analysis with classification techniques. Traditional models for predicting the risk of chronic diseases predominantly focus on either survival analysis or classification independently. In this paper, we show survival analysis methods can be re-engineered to enable them to do classification efficiently and effectively, thereby making them a comprehensive tool for developing disease risk surveillance models. The results of our experiments on real-world big EMR data show that the performance of survival models in terms of accuracy, F1 score, and AUROC is comparable to or better than that of prior state-of-the-art models like LightGBM and XGBoost. Lastly, the proposed survival models use a novel methodology to generate explanations, which have been clinically validated by a panel of three expert physicians.

Keywords: chronic disease risk prediction, survival analysis, classification, electronic medical records, clinical validation, explainable models, diabetes, hypertension
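
One common way to turn a survival model into a classifier is to read off the probability that the event occurs before a fixed horizon, 1 − S(t), and threshold it. The sketch below assumes a constant hazard with a proportional-hazards form, so S(t) = exp(−h·t); this is a simplification chosen for illustration and is not necessarily how the paper re-engineers its models. The coefficients, features, and baseline hazard are made-up values.

```python
import math


def hazard_rate(features, coefs, baseline_hazard):
    """Constant hazard in proportional-hazards form: h0 * exp(beta . x)."""
    return baseline_hazard * math.exp(sum(b * x for b, x in zip(coefs, features)))


def risk_within(features, coefs, baseline_hazard, horizon_years):
    """P(event by horizon) = 1 - S(horizon), with S(t) = exp(-h * t)."""
    h = hazard_rate(features, coefs, baseline_hazard)
    return 1.0 - math.exp(-h * horizon_years)


def classify(features, coefs, baseline_hazard, horizon_years, threshold=0.5):
    """Binary label: 1 if the predicted risk by the horizon reaches the threshold."""
    return int(risk_within(features, coefs, baseline_hazard, horizon_years) >= threshold)


# Hypothetical standardized features and coefficients (illustrative values only).
coefs = [0.8, 0.3]
h0 = math.log(2) / 5.0  # baseline median time-to-event of 5 years
for patient in ([1.0, 1.0], [-1.0, -1.0]):
    print(patient, round(risk_within(patient, coefs, h0, 5.0), 3),
          classify(patient, coefs, h0, 5.0))
```

The same fitted model yields a different classifier for each horizon, which is what lets a single survival model serve several early-warning windows.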

250. ❌ CAETC: Causal Autoencoding and Treatment Conditioning for Counterfactual Estimation over Time

Authors: Nghia D. Nguyen, Pablo Robles-Granda, Lav R. Varshney Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11565v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

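Scoring rationale: The paper focuses on causal inference and counterfactual estimation using adversarial representation learning and an autoencoding architecture. It is an application of AI to science (medicine in particular), so it has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). It does not involve large models, innovations in deep learning principles, or any of the specific techniques named in the other keywords (such as LLMs, MoE, or scaling laws), so all other keywords are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes CAETC, a new method that uses causal autoencoding and treatment conditioning to address counterfactual estimation from time-series observational data, and reports better performance than existing methods on synthetic and real data.

Abstract translation

Counterfactual estimation over time matters in many applications, such as personalized medicine. However, time-dependent confounding bias in observational data remains a major obstacle to accurate and efficient estimation. This paper proposes a new method for the problem, Causal AutoEncoding and Treatment Conditioning (CAETC). Built on adversarial representation learning, the method uses an autoencoding architecture to learn a partially invertible, treatment-invariant representation and recasts outcome prediction as applying treatment-specific conditioning to that representation. The design is independent of the underlying sequence model and can be applied to existing architectures such as long short-term memory networks (LSTMs) or temporal convolutional networks (TCNs). Extensive experiments on synthetic, semi-synthetic, and real-world data show that CAETC yields significant improvements in counterfactual estimation over existing methods.

Abstract (original)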

Counterfactual estimation over time is important in various applications, such as personalized medicine. However, time-dependent confounding bias in observational data still poses a significant challenge in achieving accurate and efficient estimation. We introduce causal autoencoding and treatment conditioning (CAETC), a novel method for this problem. Built on adversarial representation learning, our method leverages an autoencoding architecture to learn a partially invertible and treatment-invariant representation, where the outcome prediction task is cast as applying a treatment-specific conditioning on the representation. Our design is independent of the underlying sequence model and can be applied to existing architectures such as long short-term memories (LSTMs) or temporal convolution networks (TCNs). We conduct extensive experiments on synthetic, semi-synthetic, and real-world data to demonstrate that CAETC yields significant improvement in counterfactual estimation over existing methods.

Keywords: counterfactual estimation, causal inference, time-dependent confounding, adversarial representation learning, autoencoding, treatment conditioning, personalized medicine, observational data

251. ❌ Multi-Task Anti-Causal Learning for Reconstructing Urban Events from Residents’ Reports

Authors: Liangkai Zhou, Susu Xu, Shuqi Zhong, Shan Lin Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11546v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

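Scoring rationale: The paper focuses on a Multi-Task Anti-Causal learning framework (MTAC) and its application to urban event reconstruction, involving causal discovery, structural equation models, and maximum a posteriori inference. All keywords relate to large models, deep learning principles, or specific AI applications (such as bioinformatics), but the paper mentions no large models, deep learning techniques, or bio/cheminformatics applications and involves only traditional machine learning methods for causality. Therefore 'AI for Science' receives 5 points for the scientific application (urban science), and all other keywords receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a Multi-Task Anti-Causal learning framework (MTAC) for reconstructing urban events from residents' reports, achieving up to a 34.61% reduction in MAE over baseline methods on real data.

Abstract translation

Many real-world machine learning tasks are inherently anti-causal: they require inferring latent causes from observed effects. In practice, we often face multiple related tasks in which part of the forward causal mechanism stays invariant across tasks while other components are task-specific. We propose the Multi-Task Anti-Causal learning framework (MTAC), which estimates causes from outcomes and confounders by explicitly exploiting such cross-task invariances. MTAC first performs causal discovery to learn a shared causal graph, then instantiates a structured multi-task structural equation model that factorizes the outcome-generation process into (i) a task-invariant mechanism and (ii) task-specific mechanisms, using a shared backbone with task-specific heads. Building on the learned forward model, MTAC performs maximum a posteriori (MAP) inference, reconstructing causes by jointly optimizing latent mechanism variables and cause magnitudes under the learned causal structure. We evaluate MTAC on reconstructing urban events from residents' reports across three tasks: parking violations, abandoned properties, and unsanitary conditions. On real-world data collected from Manhattan and the city of Newark, MTAC consistently improves reconstruction accuracy over strong baselines, achieving up to a 34.61% reduction in mean absolute error and demonstrating the benefit of learning transferable causal mechanisms across tasks.

Abstract (original)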

Many real-world machine learning tasks are anti-causal: they require inferring latent causes from observed effects. In practice, we often face multiple related tasks where part of the forward causal mechanism is invariant across tasks, while other components are task-specific. We propose Multi-Task Anti-Causal learning (MTAC), a framework for estimating causes from outcomes and confounders by explicitly exploiting such cross-task invariances. MTAC first performs causal discovery to learn a shared causal graph and then instantiates a structured multi-task structural equation model (SEM) that factorizes the outcome-generation process into (i) a task-invariant mechanism and (ii) task-specific mechanisms via a shared backbone with task-specific heads. Building on the learned forward model, MTAC performs maximum a posteriori (MAP)-based inference to reconstruct causes by jointly optimizing latent mechanism variables and cause magnitudes under the learned causal structure. We evaluate MTAC on the application of urban event reconstruction from resident reports, spanning three tasks: parking violations, abandoned properties, and unsanitary conditions. On real-world data collected from Manhattan and the city of Newark, MTAC consistently improves reconstruction accuracy over strong baselines, achieving up to 34.61% MAE reduction and demonstrating the benefit of learning transferable causal mechanisms across tasks.

Keywords: Multi-Task Anti-Causal Learning, Causal Discovery, Structural Equation Model, Urban Event Reconstruction, Resident Reports, Maximum A Posteriori Inference, Task-invariant Mechanism
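
The anti-causal step in the MTAC abstract, recovering a cause from observed effects by MAP inference under a forward model, has a closed form in the simplest linear-Gaussian case. The sketch below assumes a single scalar cause x, outcomes y_i = w_i·x + noise with known weights, and a Gaussian prior on x. MTAC's learned structural equation model and joint optimization over latent mechanism variables are far richer; none of the names here come from the paper.

```python
def map_cause(observations, weights, noise_var, prior_mean, prior_var):
    """MAP estimate of a scalar cause x under y_i = w_i * x + N(0, noise_var)
    with prior x ~ N(prior_mean, prior_var).

    Maximizing the log-posterior gives a precision-weighted average of the
    prior mean and the evidence from the observed effects.
    """
    precision = 1.0 / prior_var + sum(w * w for w in weights) / noise_var
    weighted = prior_mean / prior_var + sum(
        w * y for w, y in zip(weights, observations)
    ) / noise_var
    return weighted / precision


# Hypothetical example: two report counts driven by the same event intensity.
reports = [2.0, 2.0]
sensitivities = [1.0, 1.0]
print(map_cause(reports, sensitivities, noise_var=1.0, prior_mean=0.0, prior_var=1.0))
```

As `prior_var` grows the estimate approaches the least-squares solution; a tight prior pulls it toward `prior_mean`, which is how prior knowledge regularizes the reconstruction when reports are sparse.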

252. ❌ Simultaneous estimation of multiple discrete unimodal distributions under stochastic order constraints

Authors: Yasuhiro Yoshida, Noriyoshi Sukegawa, Jiro Iwanaga Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11532v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

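Scoring rationale: The paper studies the estimation of multiple discrete unimodal distributions under stochastic order constraints, which belongs to statistics and optimization. Its content does not involve large models, deep learning, AI techniques, or AI-for-science applications, and it is unrelated to every keyword, all of which concern the technical principles, training methods, inference optimization, or applications of large models. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a method for estimating multiple discrete unimodal distributions under stochastic order constraints, formulated as a mixed-integer convex quadratic optimization problem; experiments show an average 2.2% reduction in Jensen-Shannon divergence when samples are scarce.

Abstract translation

Motivated by search behavior analysis on a real-world platform, this study examines the estimation of multiple discrete unimodal distributions. To incorporate prior knowledge of precedence relations among the distributions, we impose stochastic order constraints and formulate the estimation task as a mixed-integer convex quadratic optimization problem. Experiments on synthetic and real datasets show that, when the sample size is small, the proposed method reduces the Jensen-Shannon divergence by 2.2% on average (and by up to 6.3%), while performing comparably to existing methods when sufficient data are available.

Abstract (original)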

We study the problem of estimating multiple discrete unimodal distributions, motivated by search behavior analysis on a real-world platform. To incorporate prior knowledge of precedence relations among distributions, we impose stochastic order constraints and formulate the estimation task as a mixed-integer convex quadratic optimization problem. Experiments on both synthetic and real datasets show that the proposed method reduces the Jensen-Shannon divergence by 2.2% on average (up to 6.3%) when the sample size is small, while performing comparably to existing methods when sufficient data are available.

Keywords: discrete unimodal distributions, stochastic order constraints, mixed-integer convex quadratic optimization, Jensen-Shannon divergence, search behavior analysis, estimation, sample size
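
The constraints and the evaluation metric named in this abstract can be checked with a few lines of code. The sketch below does not solve the paper's mixed-integer program; it only verifies unimodality and the usual first-order stochastic order (p ≤st q when the CDF of p is everywhere at least the CDF of q) for given probability vectors, and computes the Jensen-Shannon divergence with natural logarithms. That the paper uses exactly this order definition and log base is an assumption.

```python
import math
from itertools import accumulate


def is_unimodal(p):
    """True if p is non-decreasing up to its first peak and non-increasing after."""
    peak = p.index(max(p))
    rising = all(a <= b for a, b in zip(p[:peak], p[1:peak + 1]))
    falling = all(a >= b for a, b in zip(p[peak:], p[peak + 1:]))
    return rising and falling


def stochastically_smaller(p, q, tol=1e-12):
    """First-order stochastic order: CDF of p dominates CDF of q pointwise."""
    return all(cp >= cq - tol for cp, cq in zip(accumulate(p), accumulate(q)))


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats (0 for identical distributions)."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


early = [0.5, 0.3, 0.2]  # mass concentrated on small values
late = [0.2, 0.3, 0.5]   # mass concentrated on large values
print(is_unimodal(early), stochastically_smaller(early, late),
      round(js_divergence(early, late), 4))
```

Checks like these are useful for validating an estimator's output: any feasible solution of the constrained problem should pass both `is_unimodal` and the pairwise order test.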

253. ❌ CFD-HAR: User-controllable Privacy through Conditional Feature Disentanglement

Authors: Alex Gn, Fan Li, S Kuniyilh, Ada Axan Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11526v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

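Scoring rationale: The paper focuses on privacy protection in human activity recognition (HAR) using feature disentanglement and representation learning, but it does not involve large language models (LLMs), innovations in deep learning principles, or AI-for-science applications. All keywords relate to large models, deep learning techniques, or AI-for-science applications, whereas this paper studies traditional machine learning on IoT edge devices and is unrelated to the given keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a user-controllable privacy protection method based on conditional feature disentanglement for human activity recognition on IoT edge devices, which protects user privacy while preserving recognition performance, and compares it with autoencoder-based few-shot learning.

Abstract translation

Modern wearable and mobile devices are commonly equipped with inertial measurement units (IMUs). Human Activity Recognition (HAR) applications running on such devices use data-driven, machine-learning-based techniques that draw on this sensor data. Practical sensor-based HAR deployments, however, face two key challenges: protecting the sensitive information embedded in sensor data according to users' privacy preferences, and maintaining high recognition performance with limited labeled samples. This paper proposes a fine-grained dynamic privacy filtering technique based on feature-disentangled representation learning to provide user-controllable privacy. We also compare it with few-shot HAR based on autoencoder representation learning, analyzing architectural design, learning objectives, privacy guarantees, data efficiency, and suitability for edge Internet of Things (IoT) deployment. The study shows that HAR based on conditional feature disentanglement (CFD) provides explicit, tunable privacy protection by separating activity features from sensitive attributes in the latent space, whereas autoencoder-based few-shot HAR offers better label efficiency and lightweight adaptability but lacks built-in privacy safeguards. We further examine the security implications of both approaches in continual-learning IoT settings, highlighting their differences in susceptibility to representation leakage and embedding-level attacks. The analysis indicates that neither paradigm alone fully meets the emerging requirements of next-generation IoT HAR systems. Finally, the paper outlines future research directions toward unified frameworks that jointly optimize privacy preservation, few-shot adaptability, and system robustness for trustworthy IoT intelligence.

Abstract (original)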

Modern wearable and mobile devices are equipped with inertial measurement units (IMUs). Human Activity Recognition (HAR) applications running on such devices use machine-learning-based, data-driven techniques that leverage such sensor data. However, sensor-data-driven HAR deployments face two critical challenges: protecting sensitive user information embedded in sensor data in accordance with users’ privacy preferences and maintaining high recognition performance with limited labeled samples. This paper proposes a technique for user-controllable privacy through feature disentanglement-based representation learning at the granular level for dynamic privacy filtering. We also compare the efficacy of our technique against few-shot HAR using autoencoder-based representation learning. We analyze their architectural designs, learning objectives, privacy guarantees, data efficiency, and suitability for edge Internet of Things (IoT) deployment. Our study shows that CFD-based HAR provides explicit, tunable privacy protection controls by separating activity and sensitive attributes in the latent space, whereas autoencoder-based few-shot HAR offers superior label efficiency and lightweight adaptability but lacks inherent privacy safeguards. We further examine the security implications of both approaches in continual IoT settings, highlighting differences in susceptibility to representation leakage and embedding-level attacks. The analysis reveals that neither paradigm alone fully satisfies the emerging requirements of next-generation IoT HAR systems. We conclude by outlining research directions toward unified frameworks that jointly optimize privacy preservation, few-shot adaptability, and robustness for trustworthy IoT intelligence.

Keywords: Human Activity Recognition, privacy protection, feature disentanglement, representation learning, few-shot learning, edge IoT, sensor data, autoencoder

254. ❌ Binding Free Energies without Alchemy

Authors: Michael Brocidiacono, Brandon Novy, Rishabh Dey, Konstantin I. Popov, Alexander Tropsha Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12253v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

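Scoring rationale: The paper belongs to computational chemistry and bioinformatics. It proposes a new method for computing protein-ligand binding free energies (DBFE), which falls under scientific computing and molecular simulation. Its content is unrelated to the great majority of the large-model and deep-learning keywords, which mainly concern language model architectures, training methods, inference optimization, alignment techniques, and agent systems. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is a bioinformatics/computational chemistry application; however, the paper does not explicitly use AI or machine learning (it describes a physics-based simulation method), so it receives only 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes Direct Binding Free Energy (DBFE), a method that requires no alchemical intermediate states, for predicting protein-ligand binding affinities more efficiently; on benchmarks it performs better than or comparably to existing methods.

Abstract translation

Absolute Binding Free Energy (ABFE) methods are among the most accurate computational techniques for predicting protein-ligand binding affinities, but their use is limited by the need to simulate many alchemically modified intermediate states. We propose Direct Binding Free Energy (DBFE), an end-state ABFE method in implicit solvent that requires no alchemical intermediates. DBFE outperforms OBC2 double decoupling on a host-guest benchmark and performs comparably to OBC2 MM/GBSA on a protein-ligand benchmark. Because the receptor and ligand simulations can be precomputed and amortized across compounds, DBFE needs only one complex simulation per ligand, compared with the many lambda windows required by double decoupling, which makes it a promising candidate for virtual screening workflows. The code for the method is publicly available at https://github.com/molecularmodelinglab/dbfe.

Abstract (original)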

Absolute Binding Free Energy (ABFE) methods are among the most accurate computational techniques for predicting protein-ligand binding affinities, but their utility is limited by the need for many simulations of alchemically modified intermediate states. We propose Direct Binding Free Energy (DBFE), an end-state ABFE method in implicit solvent that requires no alchemical intermediates. DBFE outperforms OBC2 double decoupling on a host-guest benchmark and performs comparably to OBC2 MM/GBSA on a protein-ligand benchmark. Since receptor and ligand simulations can be precomputed and amortized across compounds, DBFE requires only one complex simulation per ligand compared to the many lambda windows needed for double decoupling, making it a promising candidate for virtual screening workflows. We publicly release the code for this method at https://github.com/molecularmodelinglab/dbfe.

Keywords: Binding Free Energy, Protein-ligand binding, Computational chemistry, Implicit solvent, Virtual screening, Molecular simulation, DBFE, Alchemical intermediates

255. ❌ Permutation invariant multi-scale full quantum neural network wavefunction

Authors: Pengzhen Cai, Yubing Qian, Li Deng, Weizhong Fu, Lei Yang, Zhiyu Sun, Xin-Zheng Li, En-Ge Wang, Liangwen Chen, Weiluo Ren, Ji Chen Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12233v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
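
The threshold in the score line above follows the report's configured formula (passing score = 5.0 + 0.8 × total weight, with a total weight of 27.0). A minimal Python sketch of that arithmetic is given here. The function names are hypothetical, and treating a paper's total as the plain sum of the per-keyword Score column is an assumption, not something the report states.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing threshold as configured in the report: base + factor * total weight."""
    return base + factor * total_weight


def total_score(row_scores: list[float]) -> float:
    """Assumption: the paper total is the sum of the per-keyword Score column."""
    return sum(row_scores)


# 27 keywords, each configured with weight 1.0 in the report header.
threshold = passing_score(27.0)
# Every row in the table above has a Score of 0.0.
paper_total = total_score([0.0] * 27)
passed = paper_total >= threshold
print(f"{paper_total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

With these inputs the paper total stays far below the threshold, which matches the ❌ marker on the entry.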

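Scoring rationale: The paper applies neural networks to quantum physics, proposing a neural network framework for modeling the wavefunctions of many-body quantum systems. All keywords relate to large language models (LLMs) or to the principles of deep-learning techniques, but the paper does not involve LLMs, deep-learning technique principles, or AI applications in scientific fields such as biomedicine. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to science (specifically quantum physics), though not to bioinformatics or cheminformatics, so it receives 5 points (some relevance). The other keywords have no direct connection to the paper's subject (quantum neural networks, wavefunction modeling) and all score 0.

!!! tip deepseek-chat TL;DR

The study proposes a permutation-invariant, multi-scale, full-quantum neural network wavefunction framework that directly models the complete wavefunction of complex many-body quantum systems containing electrons, nuclei and muons, going beyond the Born-Oppenheimer approximation, and validates its computational feasibility on molecular systems.

Abstract translation (Chinese)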

求解相互作用粒子复杂的量子行为是解开凝聚态物质奥秘的关键，但捕捉它们跨尺度的复杂关联性仍是一个巨大的挑战。我们提出一种神经网络框架，通过直接模拟系统（包括电子、原子核与μ子）的完整量子波函数来突破这一障碍，从而捕获超越玻恩-奥本海默近似的全量子效应。该神经网络以严格处理置换不变性的方式近似描述不同相互作用粒子的联合波函数，使得无需显式考虑激发态即可同时处理核量子效应及电子-原子核-μ子耦合。在分子体系上的验证表明，该方法为复杂多体系统中的全量子现象建模提供了一条计算上可行的路径，从而在基本粒子特性与涌现的材料行为之间建立了直接联系。

Abstract

Solving the intricate quantum behavior of interacting particles is key to unlocking the mysteries of condensed matter, but capturing their complex correlations across different scales remains a monumental challenge. We introduce a neural network framework that overcomes this barrier by modeling the full quantum wavefunction of a system, including electrons, nuclei and muons, directly capturing the full quantum effects beyond the Born-Oppenheimer approximation. The neural network approximates the joint wavefunction of different interacting particles with a rigorous handling of permutation invariance, enabling simultaneous treatment of nuclear quantum effects and electron-nucleus-muon couplings without explicit excited states. Validated on molecular systems, this approach offers a computationally feasible way to model full quantum phenomena in complex many-body systems, establishing a direct connection between fundamental particle properties and emergent material behavior.

Keywords: quantum neural network, wavefunction, permutation invariance, multi-scale, many-body systems, Born-Oppenheimer approximation, nuclear quantum effects, molecular systems
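
Permutation invariance, the property the abstract emphasizes, means that relabelling identical particles leaves the function value unchanged. The sketch below illustrates that property alone with a sum-pooled toy function in plain Python. It is not the paper's network: the feature map and readout weights are arbitrary choices made for illustration, and fermionic antisymmetry (a sign change under exchange) would need a different construction.

```python
import math
import random


def embed(x: float) -> list[float]:
    """Per-particle features; the same map is applied to every particle."""
    return [x, x * x, math.sin(x)]


def symmetric_amplitude(positions: list[float]) -> float:
    """Sum-pool the per-particle features, then apply a fixed readout.

    Summation does not depend on particle order, and math.fsum keeps the
    result free of order-dependent rounding error.
    """
    features = [embed(x) for x in positions]
    pooled = [math.fsum(column) for column in zip(*features)]
    return math.tanh(0.3 * pooled[0] - 0.1 * pooled[1] + 0.5 * pooled[2])


positions = [0.1, -0.7, 1.3, 2.4]
shuffled = positions[:]
random.Random(0).shuffle(shuffled)

# Relabelling the particles leaves the value unchanged.
original_value = symmetric_amplitude(positions)
shuffled_value = symmetric_amplitude(shuffled)
assert math.isclose(original_value, shuffled_value, rel_tol=0.0, abs_tol=1e-12)
```

Sum pooling is only one way to obtain a symmetric function; it is used here because the invariance is visible directly in the code.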

256. ❌ Raman relaxation in Yb(III) molecular qubits: non-trivial correlations between spin-phonon coupling and molecular structure

Authors: Giacomo Sansone, Lorenzo A. Mariano, Stefano Carretta, Paolo Santini, Alessandro Lunghi Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12160v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

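Scoring rationale: The paper studies spin-phonon relaxation in Yb(III) molecular qubits, which belongs to computational chemistry and quantum information science and is unrelated to nearly all of the large-model/deep-learning keywords. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the study involves computational chemistry simulation and molecular design and counts as a scientific computing application. However, the paper does not use AI or machine-learning methods and relies on first-principles calculations, so it receives only 5 points (some relevance).

!!! tip deepseek-chat TL;DR

Using first-principles calculations, the study uncovers the complex mechanism of spin-phonon relaxation in Yb(III) molecular qubits. Low-temperature relaxation is dominated by Raman processes driven by a few delocalized low-energy phonons, and the way molecular structure beyond the first coordination shell modulates spin-phonon coupling is hard to explain with simple chemical correlations. The authors therefore argue that future chemical design strategies should be guided by predictive first-principles frameworks.

Abstract translation (Chinese)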

镱(III)的配位配合物在4f化合物中展现出最长的自旋相干时间之一，这使其成为分子量子技术领域极具前景的平台。尽管即使在低温下，自旋-声子弛豫仍然是相干时间的限制因素，但通过化学设计对其进行调控，有望推动这些自旋量子比特原型突破当前极限。为了深入探究如何从化学角度调控自旋-声子弛豫，本文对三种化学差异极小但自旋弛豫时间存在定量差异的镱(III)分子进行了完整的从头算自旋-声子动力学研究。结果表明，低温弛豫由一小群高度离域的低能声子触发的拉曼过程主导。对这些贡献的分析表明，超越第一配位壳层的分子结构修饰对自旋-声子耦合的调控本质上高度复杂，难以用简单的化学术语进行合理化解释。这些发现呼吁在概念上实现转变：应摒弃试图使用简单的磁结构相关性来解释分子结构修饰对自旋-声子弛豫的影响，而将预测性第一性原理框架作为未来化学设计策略的潜在驱动力。

Abstract

The coordination complexes of Yb(III) exhibit some of the longest spin coherence times among 4f compounds, making them a promising platform for molecular quantum technologies. While spin-phonon relaxation remains a limiting factor for coherence times even at low temperature, its control through chemical design has the potential to push these spin qubit prototypes beyond current limits. With the aim of providing insights on how to chemically control spin-phonon relaxation, we here present a full ab initio study of spin-phonon dynamics for three Yb(III) molecules exhibiting minimal chemical differences, yet quantitatively different spin relaxation times. Results show that low-temperature relaxation is governed by Raman processes triggered by a small group of largely delocalized low-energy phonons. The analysis of these contributions highlights that the modulation of spin-phonon coupling by molecular structure modifications beyond the first coordination shell is highly non-trivial in nature and hard to rationalize in simple chemical terms. These findings call for a conceptual step change from the attempt to use simple magneto-structural correlations to interpret the effect of molecular structural modifications on spin-phonon relaxation, and present predictive first-principles frameworks as a potential driving force of future chemical design strategies.

Keywords: Yb(III) molecular qubits, spin-phonon relaxation, Raman processes, ab initio study, molecular quantum technologies, spin coherence times, first-principles frameworks, chemical design

257. ❌ Note on a rigorous derivation of self-consistent double-hybrid functional theory via generalized Kohn-Sham theory and cumulant approximation

Authors: Lan Nguyen Tran Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12131v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

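Scoring rationale: The paper is a theoretical derivation in density functional theory (DFT) within computational chemistry, specifically addressing the self-consistency of double-hybrid density functionals. Its content lies entirely in theoretical chemistry and computational physics and has no direct connection to any of the keywords on large models, deep learning or the principles of AI techniques. The only possible link is "AI for Science", because the work belongs to scientific computing, but the paper uses no AI or machine-learning method and is a purely theoretical derivation, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

Using generalized Kohn-Sham theory and the cumulant approximation, the paper proposes a rigorously self-consistent double-hybrid density functional theory framework that resolves the theoretical inconsistency caused by the non-self-consistent treatment of the MP2 correlation energy in conventional double-hybrid functionals.

Abstract translation (Chinese)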

在本短讯中，我们提出了单体双杂化密度泛函（OBDHF）理论的严格理论推导，这是一种新颖的自洽双杂化密度泛函框架，它将广义Kohn-Sham（GKS）形式体系与单体Møller-Plesset二阶微扰（OBMP2）理论相统一。传统的双杂化密度泛函存在一个根本性的理论不一致问题，这源于对微扰MP2相关能的非自洽处理：进入相关能表达式的轨道并未针对完整的双杂化能量泛函进行变分优化。为解决这一缺陷，我们构建了一个模型能量泛函，它是半局域密度泛函近似交换相关（XC）能、比例为$α_x$的精确Hartree-Fock（HF）交换能以及比例为$α_c$的OBMP2相关能的线性组合。凭借OBMP2的单体算符结构，微扰相关能贡献被直接且自洽地嵌入到GKS有效哈密顿量中，而无需借助优化有效势（OEP）构造或微扰轨道弛豫修正。通过对总OBDHF能量进行关于轨道的泛函微分，我们以严格且清晰的方式推导出了OBDHF有效哈密顿量及相应的自洽场方程。这一表述在GKS框架内为完全自洽的双杂化密度泛函理论计算提供了一条理论基础坚实且实际可行的路径，从而解决了传统双杂化泛函固有的自洽性问题。

Abstract

In this short note, we present a rigorous theoretical derivation of the one-body double-hybrid density functional (OBDHF) theory, a novel self-consistent double-hybrid density functional framework that unifies the generalized Kohn-Sham (GKS) formalism with one-body Møller-Plesset second-order perturbation (OBMP2) theory. Conventional double-hybrid density functionals suffer from a fundamental theoretical inconsistency arising from the non-self-consistent treatment of the perturbative MP2 correlation, in which the orbitals entering the correlation energy expression are not variationally optimized with respect to the full double-hybrid energy functional. To address this deficiency, we construct a model energy functional as a linear combination of semilocal density functional approximation XC, a fraction $α_x$ of exact Hartree-Fock (HF) exchange, and a fraction $α_c$ of OBMP2 correlation. By virtue of the one-body operator structure of OBMP2, the perturbative correlation contribution is embedded directly and self-consistently into the GKS effective Hamiltonian, without recourse to the optimized effective potential (OEP) construction or perturbative orbital relaxation corrections. Through functional differentiation of the total OBDHF energy with respect to the orbitals, we derive the OBDHF effective Hamiltonian and the associated self-consistent field equations in a rigorous and transparent manner. This formulation provides a theoretically well-founded and practically tractable pathway to fully self-consistent double-hybrid density functional theory calculations within the GKS framework, resolving the self-consistency problem inherent in conventional double-hybrid functionals.

Keywords: double-hybrid density functional, self-consistent, generalized Kohn-Sham theory, Møller-Plesset perturbation theory, theoretical derivation, orbital optimization, computational chemistry, density functional theory
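
The linear combination described in the abstract can be written schematically as follows. The abstract names only the fractions $α_x$ and $α_c$; the complementary weights $(1-α_x)$ and $(1-α_c)$ on the semilocal terms are an assumption borrowed from the conventional double-hybrid form, so the paper's own definition may differ.

$$
E_{xc}^{\mathrm{OBDHF}} = (1-\alpha_x)\,E_x^{\mathrm{DFA}} + \alpha_x\,E_x^{\mathrm{HF}} + (1-\alpha_c)\,E_c^{\mathrm{DFA}} + \alpha_c\,E_c^{\mathrm{OBMP2}}
$$

Here $E_x^{\mathrm{DFA}}$ and $E_c^{\mathrm{DFA}}$ denote the semilocal exchange and correlation energies, $E_x^{\mathrm{HF}}$ the exact Hartree-Fock exchange, and $E_c^{\mathrm{OBMP2}}$ the OBMP2 correlation energy; these symbols are ours.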

258. ❌ Thermodynamic Descriptors from Molecular Dynamics as Machine Learning Features for Extrapolable Property Prediction

Authors: Nuria H. Espejo, Pablo Llombart, Andrés González de Castilla, Jorge Ramirez, Jorge R. Espinosa, Adiran Garaizar Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.12017v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

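Scoring rationale: The paper uses thermodynamic descriptors computed from molecular dynamics simulations as machine-learning features for predicting molecular properties (boiling points in particular), with the aim of improving extrapolation across chemical space. Its core is a physics-augmented machine-learning framework built on a CatBoost regression model, which is an application of AI to science (cheminformatics in particular). It is therefore related only to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points, some relevance): the paper applies AI to chemistry, but it does not explicitly use the terms bioinformatics or cheminformatics, and its focus is not innovation in large models or deep-learning technique principles. All other keywords concern large language models, deep-learning techniques, model training/alignment/inference optimization, agent systems and the like, and are unrelated, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a physics-augmented machine-learning framework that replaces structural descriptors with thermodynamic descriptors computed from molecular dynamics simulations. It successfully predicts boiling points for a range of molecules, including inorganic compounds and salts absent from the training set, and markedly improves extrapolation across chemical space.

Abstract translation (Chinese)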

依赖分子结构的机器学习模型在预测具有充分代表性有机化合物的性质方面表现优异，但其对训练域外化学类型的泛化能力有限，这仍是化学发现领域的一个关键瓶颈。这一挑战在工业发现中尤为突出，因为探索未知化学空间以生成新知识产权是其主要目标。常压沸点可作为测试机器学习算法外推能力的关键基准。现有方法的一个主要局限在于：基团贡献法因其设计原理，无法对含有未参数化片段的分子进行预测。本文证明，通过用分子动力学模拟直接计算的热力学性质替代结构描述符，可以克服这一局限。我们提出一种物理增强框架，其中CatBoost回归模型直接从原子级液相模拟中提取的系综平均内聚能、汽化热和密度进行学习。基准比较表明，虽然我们的物理增强模型与传统的基于结构的模型在标准有机化合物上表现相当，但只有前者在外推至结构差异显著的化学空间时能保持误差增长受控。我们的模型成功预测了训练集中完全未出现的化学类别——包括无机化合物、盐类以及含Si、B、Te等元素的分子——而这些类别是基于结构的模型从根本上无法处理的。通过编码支配相行为的分子间作用力，本框架为超越现有方法结构边界的性质预测建立了一种可推广的策略。

Abstract

Machine learning (ML) models that rely on molecular structure excel at predicting properties for well-represented organic compounds; however, their limited ability to extrapolate to chemotypes outside their training domain remains a critical bottleneck in chemical discovery. This challenge is particularly acute in industrial discovery, where navigating uncharted chemical space to generate new intellectual property is a primary objective. Normal boiling points serve as a key benchmark for testing the extrapolative power of ML algorithms. A major limitation is that group-contribution methods are by design unable to generate predictions for molecules containing unparameterized fragments. Here, we demonstrate that this limitation can be overcome by replacing structural descriptors with thermodynamic properties computed directly from molecular dynamics simulations. We introduce a physics-augmented framework where a CatBoost regression model learns directly from ensemble-averaged cohesive energies, heats of vaporization, and densities extracted from atomistic liquid-phase simulations. Benchmark comparisons reveal that while both our physics-augmented model and conventional structure-based models perform comparably well on standard organic compounds, only the former maintains controlled error growth when extrapolating to structurally dissimilar chemical space. Our model successfully predicts boiling points for chemical classes entirely absent from training – including inorganic compounds, salts, and molecules with elements like Si, B, and Te – where structure-based models are fundamentally inapplicable. By encoding the intermolecular forces governing phase behavior, our framework establishes a generalizable strategy for property prediction beyond the structural boundaries of the existing methods.

Keywords: thermodynamic descriptors, molecular dynamics simulations, machine learning, property prediction, extrapolation, boiling points, physics-augmented framework, CatBoost regression

259. ❌ Accurate prediction of K-edge excitation energies using state-specific self-consistent perturbation theory

Authors: Lan Nguyen Tran Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11893v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

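Scoring rationale: The paper belongs to quantum-chemistry computation and proposes an OBMP2-based ΔSCF protocol for predicting K-edge excitation energies. It is unrelated to nearly all of the keywords (large models, deep learning, training methods, inference optimization, agents and so on). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper belongs to computational chemistry, a specific application branch of AI for Science. However, the paper does not mention AI or machine-learning methods and is purely a methodological advance in quantum chemistry, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The study proposes a new ΔSCF approach based on self-consistent perturbation theory (OBMP2) for accurately predicting molecular K-edge excitation energies, and shows on test sets of closed-shell and open-shell molecules that it outperforms established standard methods such as ΔDFT and EOM-CCSD.

Abstract translation (Chinese)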

本文介绍了新近发展的单体莫勒-普莱塞特微扰理论（One-Body Møller-Plesset Perturbation Theory, OBMP2）在K边激发态预测中的应用。OBMP2是一种自洽微扰理论，通过正则变换结合累积量近似，可推导出有效单体哈密顿量。该算符在标准福克算符基础上，增加了包含双激发MP2振幅的单体关联势，从而允许在关联效应存在下优化分子轨道与轨道能量。这种自洽框架缓解了开壳层体系和键拉伸体系中标准非迭代MP2方法常见的收敛性与精度问题。本研究评估了基于OBMP2的方法在计算K边激发时的性能。通过对闭壳层与开壳层分子的基准测试集进行分析，我们证明该方法优于现有标准技术（包括$Δ$DFT、EOM-CCSD和USTEOM-CCSD）。我们的研究结果表明，基于OBMP2的$Δ$自洽场计算方案为处理K边激发态提供了一种稳健而精确的新型计算方法。

Abstract

We present the application of the recently developed one-body Møller–Plesset perturbation theory (OBMP2) to the prediction of K-edge excited states. OBMP2 is a self-consistent perturbation theory in which a canonical transformation followed by a cumulant approximation yields an effective one-body Hamiltonian. This resulting operator augments the standard Fock operator with a one-body correlation potential containing double-excitation MP2 amplitudes, allowing molecular orbitals and orbital energies to be optimized in the presence of correlation. This self-consistent framework mitigates convergence and accuracy issues often encountered in standard non-iterative MP2 for open-shell systems and bond-stretching regimes. In this work, we evaluate the performance of an OBMP2-based approach for the calculation of K-edge excitations. Utilizing benchmark test sets of both closed-shell and open-shell molecules, we demonstrate that our method outperforms established standard techniques, including $Δ$DFT, EOM-CCSD, and USTEOM-CCSD. Our findings establish the OBMP2-based $Δ$SCF protocol as a robust and accurate new computational method for the treatment of K-edge excited states.

Keywords: K-edge excitation energies, OBMP2, self-consistent perturbation theory, ΔSCF protocol, molecular orbitals, correlation potential, computational chemistry, excited states
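
The ΔSCF idea named in the abstract can be stated in one line: the excitation energy is the difference between two separately converged self-consistent total energies, one for the core-excited state and one for the ground state. The notation below is ours, and using the OBMP2 energy for both states is our reading of the abstract rather than a formula quoted from the paper.

$$
\Delta E_{K} = E_{\mathrm{OBMP2}}^{\text{core-excited}} - E_{\mathrm{OBMP2}}^{\text{ground}}
$$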

260. ❌ ChemFit: A concurrent framework for model parametrization

Authors: Moritz Sallermann, Amrita Goswami, Hannes Jónsson, Elvar Ö. Jónsson, Jorge R. Espinosa Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11769v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

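Scoring rationale: ChemFit is a Python framework for parameter optimization in computational chemistry and physics, focused on parameter fitting for molecular dynamics simulations and density functional theory calculations. It is unrelated to nearly all of the keywords (large-model techniques, training methods, inference optimization, alignment, agents and so on), since these refer specifically to techniques for large language models or deep-learning models. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper belongs to computational chemistry and is an application of AI to science (chemistry specifically). However, the paper does not explicitly use AI or machine-learning methods for parameter optimization (it uses gradient-free and black-box optimization algorithms, which are traditional optimization methods and not necessarily AI-based), so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper presents ChemFit, a Python framework for defining, composing and concurrently evaluating simulation-based objective functions in computational chemistry and physics, supporting scalable, reproducible and optimizer-agnostic parameter fitting. It is validated on case studies including the determination of Lennard-Jones parameters for liquid argon and the parameterization of a polarizable force field for water.

Abstract translation (Chinese)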

计算化学与物理学中的参数优化常涉及目标函数具有计算成本高昂、含噪声、不可微分或由独立模拟产生的异质贡献组成等特点。无梯度和黑箱优化算法是解决此类优化问题的有力工具，尤其适用于此类场景。然而，模拟引擎与参数优化库的对接往往较为繁琐，特别是在模拟成本较高且需要并行运行的情况下。本文介绍ChemFit——一个灵活的Python框架，用于定义、组合及大规模并行评估基于模拟的目标函数，该框架专为与此类算法协同工作而设计。它提供了针对异质目标项、基于文件和内存的物理量评估的抽象机制，并能对目标函数各组分及参数猜测的并行计算进行显式控制。我们通过以下应用展示了ChemFit的多功能性：（i）利用分子动力学模拟，根据宽温压范围内的实验密度数据确定液态氩的Lennard-Jones参数；（ii）基于密度泛函理论计算获得的小型冰团簇结构，对H2O的极化力场进行参数化。这些案例表明，ChemFit能够实现可扩展、可重复且与优化器无关的参数拟合。

Abstract

Parameter optimization in computational chemistry and physics often involves objective functions that are expensive to evaluate, noisy, non-differentiable, or composed of heterogeneous contributions originating from separate simulations. Gradient-free and black box optimization algorithms are powerful tools which are particularly well-suited to solving such optimization problems. However, interfacing simulation engines and parameter optimization libraries can be cumbersome, especially if simulations are expensive and need to be run concurrently. Here, we introduce ChemFit, a flexible Python framework for the definition, composition, and massively concurrent evaluation of simulation-based objective functions, which is designed to operate in conjunction with these algorithms. This framework provides abstractions for heterogeneous objective terms, file-based and in-memory quantity evaluation, and explicit control over concurrency across both objective components and parameter guesses. We demonstrate the versatility of ChemFit for different applications such as: (i) determination of Lennard-Jones parameters for liquid Argon from experimental density data over a range in temperature and pressure, using molecular-dynamics simulations, and (ii) the parameterization of a polarizable force-field for H2O against the structure of small ice clusters obtained from density functional theory calculations. These examples illustrate how ChemFit enables scalable, reproducible, and optimizer-agnostic parameter fitting.

Keywords: parameter optimization, computational chemistry, molecular dynamics, density functional theory, concurrent evaluation, force-field parameterization, Python framework, black box optimization
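
The workflow the abstract describes (an objective built from separate terms, evaluated concurrently for several parameter guesses, and driven by a gradient-free optimizer) can be illustrated with a small standard-library sketch. This is not ChemFit's API: every name below (`lj_energy`, `evaluate_guesses`, `compass_search`) is hypothetical, the reference data are synthetic values generated from ε = σ = 1 in reduced units rather than experimental argon densities, and a simple compass search stands in for whichever optimizer a user would plug in.

```python
from concurrent.futures import ThreadPoolExecutor


def lj_energy(r, epsilon, sigma):
    """Lennard-Jones pair energy 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# Synthetic reference in reduced units (assumption: true eps = sigma = 1).
SHORT_RANGE = [0.95, 1.0, 1.1]
LONG_RANGE = [1.2, 1.5, 2.0]


def squared_error(params, distances):
    epsilon, sigma = params
    return sum((lj_energy(r, epsilon, sigma) - lj_energy(r, 1.0, 1.0)) ** 2
               for r in distances)


def short_range_term(params):
    return squared_error(params, SHORT_RANGE)


def long_range_term(params):
    return squared_error(params, LONG_RANGE)


# Two independent objective terms; in practice each could be its own simulation.
TERMS = [short_range_term, long_range_term]


def evaluate_guesses(guesses, pool):
    """Submit every (guess, term) pair to the pool, then sum the terms per guess."""
    futures = [[pool.submit(term, guess) for term in TERMS] for guess in guesses]
    return [sum(f.result() for f in row) for row in futures]


def compass_search(start, pool, step=0.1, min_step=1e-6, max_iter=10_000):
    """Gradient-free search: try +/- step on each axis, halve the step on failure."""
    best = tuple(start)
    best_loss = evaluate_guesses([best], pool)[0]
    for _ in range(max_iter):
        if step <= min_step:
            break
        guesses = []
        for axis in range(len(best)):
            for sign in (1.0, -1.0):
                trial = list(best)
                trial[axis] += sign * step
                guesses.append(tuple(trial))
        loss, guess = min(zip(evaluate_guesses(guesses, pool), guesses))
        if loss < best_loss:
            best, best_loss = guess, loss
        else:
            step *= 0.5  # no neighbour improves: refine the mesh
    return best, best_loss


with ThreadPoolExecutor(max_workers=4) as pool:
    start = (0.8, 1.1)
    start_loss = evaluate_guesses([start], pool)[0]
    best, best_loss = compass_search(start, pool)

print(f"start loss {start_loss:.3g}, fitted eps={best[0]:.4f}, sigma={best[1]:.4f}")
```

Threads give no speed-up for these pure-Python terms; they are used only to show the submission pattern. Flattening the work into (guess, term) pairs keeps a single pool busy across both levels without nesting executors.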

261. ❌ Why ice is so slippery

Authors: Sigbjørn Løland Bore, B. N. J. Persson, Henrik Andersen Sveinsson Journal/Source: arxiv Published: 2026-03-12 arXiv link: http://arxiv.org/abs/2603.11539v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

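Scoring rationale: The paper studies the tribology of ice, explaining the physical mechanism behind its slipperiness through nanoscale simulations and a frictional heating model. All scoring keywords concern large models, deep learning and related techniques (training methods, inference optimization, application areas and so on), whereas this is a purely physics/materials-science study that uses no artificial intelligence, machine learning or large-model technique and does not involve AI for Science applications such as bioinformatics or cheminformatics. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

Using nanoscale simulations and a frictional heating model, the study reveals the physical mechanism behind the slipperiness of ice: frictional heating raises the contact temperature close to the melting point, rather than slipperiness arising from the initial lubricating film alone, which brings the results into agreement with experimental data.

Abstract translation (Chinese)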

冰面为何光滑这一问题长期困扰着科学界。为解决此问题，我们首先基于第一性原理模拟了纳米尺度下冰与玻璃（非晶二氧化硅）的摩擦行为，并利用摩擦热模型将结果扩展至宏观尺度。研究发现，仅凭纳米尺度模拟无法准确捕捉冰摩擦的速度依赖性，会导致摩擦系数被高估。通过恰当考虑摩擦生热效应，我们发现即使在中低速运动（速度高于0.1 m/s，位移1毫米）条件下，接触面温度也会急剧上升至接近冰的熔点，从而使得模拟结果与宽速度范围内的实验摩擦数据高度吻合。尽管冰面润滑膜的最初形成可能无需热作用，但正如鲍登和休斯于1939年所提出的（未涉及融化机制），冰面的最终光滑性本质上取决于摩擦生热效应。

Abstract

The origin of ice’s slipperiness has long puzzled scientists. To resolve this question, we simulate ice-glass (amorphous silica) friction at the nanoscale from first principles and upscale to the macroscale using a frictional heating model. We find that nanoscale simulations alone cannot capture the correct velocity dependence of ice friction, resulting in an overestimated coefficient of friction. By properly accounting for frictional heating, we find a strong increase in contact temperature toward the melting point, even under modest motion of 1 millimeter with velocities above 0.1 m/s, yielding excellent agreement with experimental friction data across a wide range of velocities. While the initial formation of a lubricating film on ice may occur without heating, the ultimate slipperiness of ice hinges on frictional heating, as proposed by Bowden and Hughes in 1939, but without incorporating melting.

Keywords: ice friction, frictional heating, nanoscale simulation, melting point, lubricating film, velocity dependence, first principles, contact temperature

Token usage statistics

Total: 803,045 tokens (input 508,615 / output 294,430)

Model	Input	Output	Total
deepseek-chat	465,581	260,522	726,103
glm-4.7	43,034	33,908	76,942
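
The per-model rows can be cross-checked against the reported totals with a few lines of Python. The figures are copied from the table above; the variable names are ours.

```python
# Per-model token counts from the table above: (input, output, total).
usage = {
    "deepseek-chat": (465_581, 260_522, 726_103),
    "glm-4.7": (43_034, 33_908, 76_942),
}
reported = {"input": 508_615, "output": 294_430, "total": 803_045}

# Each row's total should equal its input plus output.
for model, (tokens_in, tokens_out, tokens_total) in usage.items():
    assert tokens_in + tokens_out == tokens_total, model

# Column sums should reproduce the reported grand totals.
summed = {
    "input": sum(row[0] for row in usage.values()),
    "output": sum(row[1] for row in usage.values()),
    "total": sum(row[2] for row in usage.values()),
}
assert summed == reported

share = usage["deepseek-chat"][2] / reported["total"]
print(f"deepseek-chat share of tokens: {share:.1%}")
```

Both checks pass, so the table is internally consistent.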

📊 ArXiv 研究报告 (2026-03-13)#

📌 配置信息#

关键词列表（共 27 个，总权重 27.0）#

评分设置#

📈 论文统计#

⭐ 及格论文详细分析#

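1. Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural

Tokenization enables multimodal large language models to understand, generate, and edit architectural floor plans

2. Tiny Aya: Bridging Scale and Multilingual Depth

Tiny Aya: bridging scale and multilingual depth

3. Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

Neural thickets: diverse task experts are densely distributed around pretrained weights

4. When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows

When OpenClaw meets the hospital: an agentic operating system for dynamic clinical workflows

5. One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries

One supervisor, many modalities: adaptive tool orchestration for autonomous queries

6. CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable

CrossEarth-SAR: a SAR-centric, billion-scale geospatial foundation model for domain-generalizable semantic segmentation

7. On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

A study of information self-locking in reinforcement learning for active reasoning of LLM agents

8. AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimiza

AdaFuse: accelerating dynamic adapter inference via token-level pre-gating and fused kernel optimization

9. Scaling Laws for Educational AI Agents

Scaling laws for educational AI agents

10. Long-Context Encoder Models for Polish Language Understanding

Long-context encoder models for Polish language understanding

11. Automating Skill Acquisition through Large-Scale Mining of Open-Source Agentic Repositories: A Frame

Automating skill acquisition through large-scale mining of open-source agentic repositories: a framework for multi-agent procedural knowledge extraction

12. From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration

From control to foresight: simulation as a new paradigm for human-agent collaboration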

📋 All papers

1. ✅ Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

2. ✅ Tiny Aya: Bridging Scale and Multilingual Depth

3. ✅ Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

4. ✅ When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows

5. ✅ One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries

6. ✅ CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation

7. ✅ On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

8. ✅ AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization

9. ✅ Scaling Laws for Educational AI Agents

10. ✅ Long-Context Encoder Models for Polish Language Understanding

11. ✅ Automating Skill Acquisition through Large-Scale Mining of Open-Source Agentic Repositories: A Framework for Multi-Agent Procedural Knowledge Extraction

12. ✅ From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration

13. ❌ Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions

14. ❌ PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents

15. ❌ Language Generation with Replay: A Learning-Theoretic View of Model Collapse

16. ❌ Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems

17. ❌ Streaming Translation and Transcription Through Speech-to-Text Causal Alignment

18. ❌ The Latent Color Subspace: Emergent Order in High-Dimensional Chaos

19. ❌ STAIRS-Former: Spatio-Temporal Attention with Interleaved Recursive Structure Transformer for Offline Multi-task Multi-agent Reinforcement Learning

20. ❌ SemBench: A Universal Semantic Framework for LLM Evaluation

21. ❌ AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling

22. ❌ EvoFlows: Evolutionary Edit-Based Flow-Matching for Protein Engineering

23. ❌ Coarse-Guided Visual Generation via Weighted h-Transform Sampling

24. ❌ Compactifying the Electronic Wavefunction II: Quantum Estimators for Spin-Coupled Generalized Valence Bond Wavefunctions

25. ❌ Accurate prediction of inverted singlet-triplet excited states using self-consistent spin-opposite perturbation theory

26. ❌ SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

27. ❌ Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

28. ❌ Separable neural architectures as a primitive for unified predictive and generative intelligence

29. ❌ Incremental Neural Network Verification via Learned Conflicts

30. ❌ Security Considerations for Artificial Intelligence Agents

31. ❌ Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

32. ❌ Portfolio of Solving Strategies in CEGAR-based Object Packing and Scheduling for Sequential 3D Printing

33. ❌ RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images

34. ❌ WORKSWORLD: A Domain for Integrated Numeric Planning and Scheduling of Distributed Pipelined Workflows

35. ❌ Compiling Temporal Numeric Planning into Discrete PDDL+: Extended Version

36. ❌ Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

37. ❌ Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials

38. ❌ BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

39. ❌ A Quantitative Characterization of Forgetting in Post-Training

40. ❌ GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

41. ❌ IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

42. ❌ FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

43. ❌ Automatic Generation of High-Performance RL Environments

44. ❌ TopoBench: Benchmarking LLMs on Hard Topological Reasoning

45. ❌ Increasing intelligence in AI agents can worsen collective outcomes

46. ❌ CRAFT: A Tendon-Driven Hand with Hybrid Hard-Soft Compliance

47. ❌ SommBench: Assessing Sommelier Expertise of Language Models

48. ❌ Taming the Adversary: Stable Minimax Deep Deterministic Policy Gradient via Fractional Objectives

49. ❌ A Robust and Efficient Multi-Agent Reinforcement Learning Framework for Traffic Signal Control
