📊 ArXiv Research Report (2026-03-25)

Generated: 2026-03-25 09:49:33 | Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

- Maximum score per keyword: 15
- Passing score formula: 5.0 + 0.8 × total weight
- Current passing score: 26.6
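
The passing-score formula can be checked with a few lines of Python. This is a minimal sketch, not the report generator's actual code: the function names are hypothetical, and it assumes that each keyword's score is its weight times its relevance, capped at the per-keyword maximum. The relevance values are the non-zero entries of the first paper analyzed in this report.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap from the scoring settings


def passing_score(weights):
    """Passing score = 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * sum(weights)


def total_score(weights, relevances):
    """Assumed rule: score = weight * relevance, capped per keyword."""
    return sum(min(w * r, MAX_PER_KEYWORD) for w, r in zip(weights, relevances))


weights = [1.0] * 27
# Non-zero relevance values of the first paper, padded with zeros.
# Order does not matter here because every weight is 1.0.
relevances = [10, 5, 10, 10, 10, 10, 10, 5, 5] + [0] * 18

threshold = passing_score(weights)
score = total_score(weights, relevances)
print(f"threshold={threshold:.1f} score={score:.1f} passed={score >= threshold}")
```

With 27 keywords of weight 1.0 the threshold is 5.0 + 0.8 × 27 = 26.6, which matches the current passing score.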

📈 Paper statistics

- Total fetched: 326 papers
- Passing papers: 10 (3.1%)
- Deep analysis: 10 papers

⭐ Detailed analysis of passing papers

1. Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment

Authors: Xinyu Zhang | Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.21558v1

Score: 75.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

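Scoring rationale: The paper studies recursive drift in recursive self-improvement and proposes the NSRSA framework, which uses a symbolic verification subsystem to filter training data quality at the reasoning-step level. It is highly relevant to LLMs, Alignment, DPO, Chain of Thought (CoT), System 2 Thinking, and Self-Improvement, which are its core techniques and methods. It is partially relevant to data quality (Scaling Laws AND Data Quality), Hallucination Mitigation, and Explainable AI, because it involves data filtering and reasoning verification. Other keywords such as MoE, SLMs, pre-training, RAG, and quantization are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses recursive drift in the recursive self-training of large models. It proposes the Neuro-Symbolic Recursive Self-Alignment (NSRSA) framework, which filters training data through symbolic verification at the reasoning-step level, stabilizing iterative self-training and making the model's reasoning more reliable.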

Abstract

Recursive self-improvement–where a model iteratively trains on its own outputs–promises sustained capability growth but faces a fundamental obstacle: recursive drift. As models train on self-generated data across multiple iterations, errors in intermediate reasoning compound, leading to mode collapse and performance degradation. We propose Neuro-Symbolic Recursive Self-Alignment (NSRSA), which stabilizes iterative self-training by embedding a symbolic verification subsystem that gates training data quality at the reasoning step level. Unlike outcome-only filtering (which admits “lucky guesses” with flawed reasoning), NSRSA verifies each arithmetic operation via sympy, checks logical flow consistency across reasoning steps, and enforces domain constraints. We evaluate NSRSA on GSM8K using Qwen3-4B-Thinking across 5 self-training iterations under five conditions: no verification, outcome verification, majority voting, full NSRSA symbolic verification, and NSRSA with DPO. Our filtering analysis shows that NSRSA rejects approximately 34% of correct-answer solutions that pass outcome verification, eliminating “lucky guesses” with flawed reasoning from the training set. We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%). NSRSA provides an extensible framework that demonstrates how external symbolic verification can make recursive self-improvement measurable and reliable within domains where automated verification is available.

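Keywords: recursive self-improvement, self-training, symbolic verification, reasoning steps, data quality filtering, DPO, alignment, large language models

In-depth analysis:

Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment

Summary:

To address "recursive drift" in recursive self-improvement (RSI), where errors in intermediate reasoning accumulate, the paper proposes the Neuro-Symbolic Recursive Self-Alignment (NSRSA) framework. It embeds a symbolic verification subsystem in the self-training loop and checks generated reasoning chains at fine granularity, covering answer correctness, arithmetic verification, logical flow consistency, and constraint satisfaction, so that "lucky guesses" with correct answers but flawed reasoning are removed. Experiments on GSM8K with Qwen3-4B-Thinking show that NSRSA is more selective than outcome-only verification and keeps erroneous patterns from propagating across iterations. In addition, using the verification results to build DPO preference pairs further improves the model's ability to tell sound reasoning from flawed reasoning, making recursive self-improvement more stable and reliable.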

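Contributions:

- Proposes the Neuro-Symbolic Recursive Self-Alignment (NSRSA) framework, which stabilizes iterative self-training with an external symbolic verification subsystem and addresses recursive drift.
- Designs a fine-grained, multi-level verification mechanism that checks not only the final answer but also, using symbolic computation tools such as sympy, each arithmetic operation and the consistency of the logical flow.
- Uses NSRSA verification results to build DPO (Direct Preference Optimization) training data, teaching the model to prefer sound reasoning rather than merely correct answers.
- Provides a complete, reproducible iterative self-training pipeline that shows how automated symbolic verification can raise data quality in recursive self-improvement.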
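
The preference-pair idea can be sketched as follows. The report does not say exactly how the paper pairs samples, so this sketch makes an assumption: for each prompt, a solution that passes both the outcome check and the step check is "chosen", and a correct-answer solution that fails the step check (a lucky guess) is "rejected". The record fields and the function name are hypothetical; the `prompt` / `chosen` / `rejected` layout is a common convention for DPO datasets.

```python
from collections import defaultdict


def build_preference_pairs(samples):
    """Pair verified solutions with lucky guesses for the same prompt."""
    by_prompt = defaultdict(list)
    for s in samples:
        by_prompt[s["prompt"]].append(s)

    pairs = []
    for prompt, group in by_prompt.items():
        # Chosen: correct answer and every reasoning step verified.
        chosen = [s["completion"] for s in group if s["outcome_ok"] and s["steps_ok"]]
        # Rejected (assumption): correct answer but flawed reasoning.
        rejected = [s["completion"] for s in group if s["outcome_ok"] and not s["steps_ok"]]
        pairs.extend(
            {"prompt": prompt, "chosen": c, "rejected": r}
            for c in chosen
            for r in rejected
        )
    return pairs


samples = [
    {"prompt": "p1", "completion": "sound", "outcome_ok": True, "steps_ok": True},
    {"prompt": "p1", "completion": "lucky", "outcome_ok": True, "steps_ok": False},
    {"prompt": "p1", "completion": "wrong", "outcome_ok": False, "steps_ok": False},
    {"prompt": "p2", "completion": "sound2", "outcome_ok": True, "steps_ok": True},
]
pairs = build_preference_pairs(samples)
print(pairs)
```

Prompts with no contrasting pair (such as `p2` here) contribute nothing, so the number of pairs depends on how often the model produces both sound and lucky solutions for the same problem.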

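Method

!!! info

The paper uses an iterative self-training loop in which each round has four steps: generation, verification, training, and evaluation. The core is the verification strategy: (1) the generation stage samples multiple solutions per problem from the model; (2) the verification stage applies the NSRSA filter, which parses arithmetic expressions and evaluates them with sympy to check arithmetic correctness, tracks variable assignments to check logical flow consistency, and checks domain constraints; (3) only solutions that pass every check enter the training set used to fine-tune the model. The paper also compares against baselines with no verification, outcome-only verification, and majority voting, and explores combining NSRSA with DPO by building preference pairs from the verification results.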
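
The arithmetic part of the verification stage can be illustrated with a short sketch. It is not the paper's implementation: the regular expression, the function names, and the rule "the last step's result is the final answer" are assumptions made for illustration, and the logical-flow and domain-constraint checks are omitted. It uses `sympy.sympify` when sympy is installed and falls back to plain Python arithmetic otherwise.

```python
import re

try:
    from sympy import sympify  # sympy is the checker named in the paper
except ImportError:  # fallback so the sketch still runs without sympy
    sympify = None

NUM = r"-?\d+(?:\.\d+)?"
# Matches steps such as "16 - 3 - 4 = 9": number (operator number)+ = number.
STEP = re.compile(rf"({NUM}(?:\s*[-+*/]\s*{NUM})+)\s*=\s*({NUM})")


def evaluate(expr):
    """Evaluate a plain arithmetic expression to a float."""
    if sympify is not None:
        return float(sympify(expr))
    # STEP only admits digits, spaces, '.', and + - * /, so eval sees arithmetic only.
    return float(eval(expr, {"__builtins__": {}}, {}))


def verify_solution(text, expected_answer):
    """Check every 'lhs = rhs' step and the final answer (assumed: last rhs)."""
    steps = STEP.findall(text)
    steps_ok = bool(steps)
    for lhs, rhs in steps:
        try:
            if abs(evaluate(lhs) - float(rhs)) > 1e-9:
                steps_ok = False
        except (ArithmeticError, TypeError, ValueError):
            steps_ok = False  # e.g. division by zero counts as a failed step
    final = float(steps[-1][1]) if steps else None
    outcome_ok = final is not None and abs(final - expected_answer) < 1e-9
    return {"outcome_ok": outcome_ok, "steps_ok": steps_ok,
            "accepted": outcome_ok and steps_ok}


sound = "She keeps 16 - 3 - 4 = 9 eggs and earns 9 * 2 = 18 dollars."
lucky = "She keeps 16 - 3 - 4 = 8 eggs and earns 8 * 2 = 18 dollars."
print(verify_solution(sound, 18))
print(verify_solution(lucky, 18))
```

The second solution reaches the expected answer but contains two wrong steps, so an outcome-only filter would keep it while the step-level check rejects it. Steps written in natural language never match the pattern, which mirrors the parsing-coverage limitation noted later in this analysis.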

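Key results:

- NSRSA is more selective than outcome-only verification (acceptance rate of about 52% vs. 78%), rejecting about 34% of the correct-answer solutions that pass outcome verification, the "lucky guesses" with flawed reasoning.
- DPO preference pairs built from NSRSA verification improve the model's ability to distinguish sound from flawed reasoning, with reward accuracy rising from 46% to 63%.
- The 5-iteration experiments on GSM8K show that NSRSA stabilizes self-training and prevents the performance degradation caused by accumulated reasoning errors.

Tech stack: Model: Qwen3-4B-Thinking; Algorithms: iterative self-training, Direct Preference Optimization (DPO), LoRA (Low-Rank Adaptation); Tools: vLLM (inference acceleration), HuggingFace Trainer, sympy (symbolic math library); Datasets: GSM8K, MATH-500; Techniques: symbolic computation, pattern matching, logical consistency checks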

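Strengths

- Fine-grained verification: goes beyond conventional outcome supervision with process-level symbolic verification that identifies and filters out data with flawed reasoning.
- Addresses recursive drift: offers an effective remedy for the core pain points of recursive self-improvement (error accumulation and mode collapse).
- Automated pipeline: symbolic computation tools provide automated step-level verification, with no need for manually annotated step labels.
- Better reasoning quality: emphasizes the soundness of the reasoning process, not only answer correctness, which helps strengthen the model's intrinsic reasoning ability.

Limitations

- Limited parsing coverage: arithmetic verification depends on text parsing, so errors in expressions that cannot be parsed (such as operations described in natural language) may go undetected.
- Simplified logical flow: logical-flow verification relies on simple string matching rather than full coreference resolution, so it may miss errors involving renamed variables or pronoun references.
- Domain specificity: the current verification rules mainly target math problems; other domains may require redesigned verifiers.
- Computational overhead: multi-step symbolic verification of every generated solution adds compute cost.

Relevance to research direction:

The paper is highly relevant. It is an innovation in the technical principles of large models, focused on the stability of recursive self-training. By combining symbolic AI (symbolic verification) with neural networks, it proposes a neuro-symbolic framework, which meets the requirement for innovation in deep learning principles. Although it is applied mainly to mathematical reasoning, its methodology is a useful reference for domains that need precise reasoning, such as scientific computing.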

2. Mind over Space: Can Multimodal Large Language Models Mentally Navigate?

Authors: Qihui Zhu, Shouwei Ruan, Xiao Yang, Hao Jiang, Yao Huang, Shiji Zhao, Hanwei Fan, Hang Su, Xingxing Wei | Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.21577v1

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

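Scoring rationale: The paper studies the spatial reasoning and planning abilities of multimodal large language models (MLLMs) in embodied agents. Its core contributions are: (1) the Video2Mental benchmark for evaluating the "mental navigation" ability of MLLMs; (2) the NavMind model, which strengthens structured planning by using explicit cognitive maps as intermediate representations. The paper is highly relevant to large-model technology (LLMs/MLLMs), reasoning methods (CoT/System 2 Thinking), agents (LLM Agents), and World Models, and it explicitly trains with supervised fine-tuning (SFT). Other keywords such as MoE, quantization, and RAG are not covered.

!!! tip deepseek-chat TL;DR

The paper targets the lack of long-horizon spatial reasoning in multimodal large language models used in embodied agents. It proposes the Video2Mental benchmark and the NavMind model, which substantially improve mental navigation performance through explicit cognitive maps and progressive supervised fine-tuning.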

Abstract

Despite the widespread adoption of MLLMs in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales. Cognitive science reveals that Biological Intelligence (BI) thrives on “mental navigation”: the strategic construction of spatial representations from experience and the subsequent mental simulation of paths prior to action. To bridge the gap between AI and BI, we introduce Video2Mental, a pioneering benchmark for evaluating the mental navigation capabilities of MLLMs. The task requires constructing hierarchical cognitive maps from long egocentric videos and generating landmark-based path plans step by step, with planning accuracy verified through simulator-based physical interaction. Our benchmarking results reveal that mental navigation capability does not naturally emerge from standard pre-training. Frontier MLLMs struggle profoundly with zero-shot structured spatial representation, and their planning accuracy decays precipitously over extended horizons. To overcome this, we propose NavMind, a reasoning model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations. Through a difficulty-stratified progressive supervised fine-tuning paradigm, NavMind effectively bridges the gap between raw perception and structured planning. Experiments demonstrate that NavMind achieves superior mental navigation capabilities, significantly outperforming frontier commercial and spatial MLLMs.

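Keywords: Multimodal Large Language Models, Mental Navigation, Spatial Reasoning, Cognitive Maps, Supervised Fine-tuning, Embodied Agents, World Models, Step-by-step Planning

In-depth analysis:

Mind over Space: Can Multimodal Large Language Models Mentally Navigate?

Summary:

Multimodal large language models (MLLMs) in embodied intelligence lack spatial reasoning over large spatiotemporal scales. Inspired by "mental navigation" in biological intelligence, the paper proposes the Video2Mental benchmark, which requires a model to build a hierarchical cognitive map from long videos and generate a path plan that is verified in a simulator. The study finds that existing MLLMs perform poorly on this task. The authors therefore propose NavMind, which introduces explicit, fine-grained cognitive maps as learnable intermediate representations and is trained with difficulty-stratified progressive supervised fine-tuning. Experiments show that NavMind significantly outperforms existing models on mental navigation.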

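Contributions:

- Proposes Video2Mental, the first benchmark dedicated to evaluating whether MLLMs can build cognitive maps from long videos and navigate mentally.
- Proposes the NavMind reasoning model, which uses explicit, fine-grained cognitive maps as learnable intermediate representations that connect raw perception to structured planning.
- Designs a difficulty-stratified progressive supervised fine-tuning paradigm that addresses performance degradation in long-horizon spatial reasoning.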

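Method

!!! info

The paper first builds the Video2Mental benchmark from long egocentric videos. The task requires building a hierarchical cognitive map and generating a landmark-based path plan, which is verified through physical interaction in a simulator. The authors then propose NavMind, which integrates the cognitive map explicitly into the reasoning process. Training uses a difficulty-stratified progressive supervised fine-tuning strategy that gradually increases the model's ability to handle complex spatial structures.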
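
To make "hierarchical cognitive map" and "landmark-based path plan" concrete, here is a toy sketch. The report does not describe the paper's actual map format, so everything below is an assumption for illustration: a two-level map (rooms containing landmarks), an adjacency list between landmarks, and a breadth-first search that returns the landmark sequence from start to goal.

```python
from collections import deque

# Hypothetical two-level cognitive map: rooms -> landmarks, plus landmark adjacency.
COGNITIVE_MAP = {
    "rooms": {
        "kitchen": ["fridge", "sink"],
        "hallway": ["door", "stairs"],
        "bedroom": ["bed"],
    },
    "adjacent": {
        "fridge": ["sink"],
        "sink": ["fridge", "door"],
        "door": ["sink", "stairs"],
        "stairs": ["door", "bed"],
        "bed": ["stairs"],
    },
}


def room_of(cmap, landmark):
    """Return the room that contains a landmark, or None if unknown."""
    for room, landmarks in cmap["rooms"].items():
        if landmark in landmarks:
            return room
    return None


def plan_path(cmap, start, goal):
    """Breadth-first search over landmarks; returns the shortest landmark path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in cmap["adjacent"].get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal not reachable from start


path = plan_path(COGNITIVE_MAP, "fridge", "bed")
print(" -> ".join(f"{lm} ({room_of(COGNITIVE_MAP, lm)})" for lm in path))
```

In the benchmark the map has to be inferred from video and the plan is produced by the model step by step; the search here only shows what a correct plan over an already-built map looks like.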

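Key results:

- Frontier MLLMs struggle with zero-shot structured spatial representation; mental navigation ability does not emerge naturally from standard pre-training.
- The planning accuracy of existing models drops sharply as the planning horizon grows.
- NavMind performs strongly on mental navigation tasks, significantly outperforming frontier commercial models and spatial MLLMs.

Tech stack: multimodal large language models, cognitive maps, supervised fine-tuning, simulator-based verification, hierarchical spatial representation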

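Strengths

- Interdisciplinary: brings the cognitive-science concept of mental navigation into AI, offering a new perspective for embodied intelligence.
- Rigorous evaluation: plans are verified through physical interaction in a simulator rather than judged on generated text alone, which makes the evaluation more realistic.
- Well targeted: addresses a core bottleneck of MLLMs in long-horizon, large-scale spatial reasoning.
- Interpretable method: the explicit cognitive map as an intermediate representation makes the model's reasoning process easier to interpret.

Limitations

- Depends on simulator environments, which may not fully cover the complexity and dynamics of real-world physical environments.
- Requires hierarchical cognitive-map data for supervised fine-tuning, which is costly to collect and annotate.
- The extra intermediate representation may increase compute overhead and inference latency.

Relevance to research direction:

The paper is highly relevant. It falls under innovation in the technical principles of deep learning and large models, focusing on multimodal large language models (MLLMs) in embodied intelligence and spatial reasoning. The NavMind model and the Video2Mental benchmark show notable innovation in large-model principles and address a key challenge in AI spatial cognition, matching the user's requirement for novel technical principles and highly innovative research.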

3. Probing How Scalable Table Data Enhances General Long-Context Reasoning

Authors: Huaibing Xie, Guoliang Zhao, Yang Liu, Shihan Dou, Siming Huang, Yanling Xiao, Shaolei Wang, Yiting Liu, Cheng Zhang, Shaofan Liu, Pluto Zhou | Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.21719v1

Score: 64.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

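Scoring rationale: The paper studies long-context reasoning in large language models (LLMs) and strengthens it with structured table data, which is an innovation in large-model technical principles. Highly relevant keywords: LLMs (the core object of study), Post-training (reasoning enhanced via RL), Long Context LLMs (the core research problem), RLHF (RL methods are used), and CoT Reasoning and System 2 Thinking (deep reasoning is involved). Moderately relevant keywords: Scaling Laws AND Data Quality (data quality and scaling experiments) and Mechanistic Interpretability (analysis of underlying mechanisms). Other keywords are not covered.

!!! tip deepseek-chat TL;DR

The paper studies how structured table data can strengthen long-context reasoning in large language models. Supported by mathematical analysis and experiments, it proposes a scalable data synthesis pipeline that significantly improves LLM performance on multiple long-context benchmarks.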

Abstract

As real-world tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline (TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24% on average), and even improves performance on out-of-domain benchmarks (+8.06% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.

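Keywords: Large Language Models, Long-context reasoning, Structured table data, Post-training, Reinforcement Learning, Scaling experiments, Mutual information analysis, Benchmark evaluation

In-depth analysis:

Probing How Scalable Table Data Enhances General Long-Context Reasoning

Summary:

As real-world tasks grow more complex, long-context reasoning has become a core capability of large models. The study finds that table data with periodic structure has strong potential here. The paper first analyzes tabular dependency structure mathematically using mutual information and shows that table data has periodic, non-vanishing dependencies, in contrast to the decaying dependencies of natural language. Building on this, the authors propose the TableLong pipeline, which synthesizes high-quality, diverse, and verifiable structured table data (SQL tasks) and uses reinforcement learning to improve long-context reasoning. Experiments show that table data improves performance on multiple long-context benchmarks (+8.24% on average) and generalizes to out-of-domain benchmarks such as math and code.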

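Contributions:

- For the first time, uses mutual information (MI) to analyze tabular dependency structure mathematically, showing that table data has periodic non-vanishing dependencies, unlike natural language.
- Proposes TableLong, a simple and scalable pipeline for synthesizing high-quality, diverse, and verifiable structured table data aimed at long-context reasoning.
- Validates the mechanism by which table data strengthens long-context reasoning and shows gains on both long-context and out-of-domain (OOD) benchmarks such as Math and Code.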
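
The phrase "periodic non-vanishing dependencies" can be illustrated on a toy table. The sketch below is not the paper's analysis: the table layout (one yes/no flag column plus three numeric columns drawn from a shared vocabulary), the row-major linearization, and the plug-in estimator are all assumptions chosen to keep the example small. It estimates the mutual information between tokens that are `lag` positions apart in the linearized sequence.

```python
import math
import random
from collections import Counter


def mutual_information(seq, lag):
    """Plug-in estimate (in bits) of I(X_i; X_{i+lag}) over all positions i."""
    pairs = list(zip(seq, seq[lag:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )


def linearize_toy_table(num_rows, num_cols, rng):
    """Row-major token sequence: column 0 is a flag, the rest are digits 0-7."""
    seq = []
    for _ in range(num_rows):
        seq.append(rng.choice(["yes", "no"]))
        seq.extend(rng.randrange(8) for _ in range(num_cols - 1))
    return seq


rng = random.Random(0)
seq = linearize_toy_table(num_rows=2000, num_cols=4, rng=rng)
mi = {lag: mutual_information(seq, lag) for lag in range(1, 13)}
for lag, value in mi.items():
    print(f"lag={lag:2d}  MI={value:.3f} bits")
```

In this setup the estimate peaks at every multiple of the column count (lags 4, 8, 12) and does not shrink as the lag grows, because tokens that far apart always come from the same column. Working the toy distribution out by hand gives about 0.81 bits at those lags against about 0.12 bits elsewhere.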

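Method

!!! info

The paper combines theoretical analysis with empirical experiments. First, it uses an information-theoretic framework (mutual information) to compare how dependencies decay in natural language and in table data. Second, it builds the TableLong data synthesis pipeline, which consists of environment initialization (mixing table sources), sample construction (generating SQL queries and natural-language questions, then executing them for verification), and a consistency-based verification and filtering mechanism. Finally, it trains large models on the synthetic data with reinforcement learning (RL) and evaluates them on multiple long-context and out-of-domain benchmarks.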
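
The "execute the SQL to obtain a verifiable answer" step can be sketched with Python's built-in `sqlite3` module (SQLite appears in the tech stack listed below). This is a minimal illustration, not the TableLong pipeline itself: the table, the questions, the function names, and the filter rule (drop samples whose SQL fails or returns nothing) are assumptions, and the paper's consistency-based filter is not reproduced here.

```python
import sqlite3


def build_env():
    """Create a small in-memory table standing in for a sampled table source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "north", 120.0), (2, "south", 80.0), (3, "north", 45.5), (4, "east", 200.0)],
    )
    return conn


def make_sample(conn, question, sql):
    """Execute the SQL; keep the sample only if it runs and returns rows."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # unexecutable SQL cannot yield a verifiable answer
    if not rows:
        return None
    return {"question": question, "sql": sql, "answer": rows}


conn = build_env()
good = make_sample(
    conn,
    "What is the total order amount in the north region?",
    "SELECT SUM(amount) FROM orders WHERE region = 'north'",
)
bad = make_sample(
    conn,
    "What is the total price?",
    "SELECT SUM(price) FROM orders",  # 'price' is not a column, so this fails
)
print(good)
print(bad)
```

Because the answer comes from executing the query against the table, it can serve as a checkable reward signal during RL training without manual labeling.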

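Key results:

- Table data shows a periodic non-vanishing dependency structure; its effective dependency distance is theoretically infinite, exceeding that of natural language.
- The TableLong pipeline generates high-quality training data and significantly improves LLM long-context reasoning.
- Average gains are 8.24% across multiple long-context benchmarks and 8.06% on out-of-domain benchmarks (Math, Code).

Tech stack: mutual information, KL divergence, power-law decay, reinforcement learning, SQL generation and execution, consistency filtering, SQLite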

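Strengths

- Solid theory: provides a mathematical mutual-information analysis that explains why table data works, rather than relying on empirical observation alone.
- Highly scalable: the TableLong pipeline is simple and can generate large, diverse training sets automatically.
- High data quality: SQL execution results are used for verification, and consistency filtering removes noisy and trivially easy samples, which keeps the training data both difficult and clean.
- Good generalization: improves long-context tasks and also transfers to other domains such as math and code.

Limitations

- Single data form: focuses on structured table data, so its generality for reasoning over unstructured long text may be limited.
- Linearization assumption: the analysis assumes row-major table linearization and may overlook potential advantages of other table representations.
- Computational overhead: SQL execution and consistency filtering require extra compute.
- Missing experimental details: the text provided for this analysis was truncated and omits the detailed setup of the experiments.

Relevance to research direction:

The paper is highly relevant. It falls under "innovation in the technical principles of large models and deep learning": it examines long-context reasoning, a core technical bottleneck, and offers a new mathematical explanation and solution from the perspective of data structure (table data). It also touches on "research applications of large models in different domains", because table data is widespread in science, finance, healthcare, and other fields, so the method has broad cross-domain potential. The paper is highly innovative and technically deep, and meets the criteria for a high score.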

4. Improving Coherence and Persistence in Agentic AI for System Optimization

Authors: Pantea Karimi, Kimia Noorbakhsh, Mohammad Alizadeh, Hari Balakrishnan | Source: arXiv | Published: 2026-03-22 | arXiv link: http://arxiv.org/abs/2603.21321v1

Score: 62.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	8.0/10	8.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	8.0/10	8.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

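Scoring rationale: The paper studies agentic use of LLMs for system optimization and directly involves LLMs, LLM Agents, and Multi-agent Systems (core content, 10 points each). It addresses long-context limits (Context Window Extension, 8 points) and KV cache optimization (KV Cache Compression, 8 points), and uses multi-step reasoning (Chain of Thought and System 2 Thinking, 8 points each). Other keywords such as MoE, SLMs, training methods, RAG, and quantization are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses failures of LLMs in complex system optimization caused by evolutionary neighborhood bias and the coherence ceiling. It proposes Engram, a multi-agent architecture that decouples long-horizon exploration from the limits of a single context window and keeps knowledge in persistent storage, achieving superior performance in several domains (multi-cloud multicast, LLM inference request routing, and KV cache reuse optimization).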

Abstract

Designing high-performance system heuristics is a creative, iterative process requiring experts to form hypotheses and execute multi-step conceptual shifts. While Large Language Models (LLMs) show promise in automating this loop, they struggle with complex system problems due to two critical failure modes: evolutionary neighborhood bias and the coherence ceiling. Evolutionary methods often remain trapped in local optima by relying on scalar benchmark scores, failing when coordinated multi-step changes are required. Conversely, existing agentic frameworks suffer from context degradation over long horizons or fail to accumulate knowledge across independent runs. We present Engram, an agentic researcher architecture that addresses these limitations by decoupling long-horizon exploration from the constraints of a single context window. Engram organizes exploration into a sequence of agents that iteratively design, test, and analyze mechanisms. At the conclusion of each run, an agent stores code snapshots, logs, and results in a persistent Archive and distills high-level modeling insights into a compact, persistent Research Digest. Subsequent agents then begin with a fresh context window, reading the Research Digest to build on prior discoveries. We find that Engram exhibits superior performance across diverse domains including multi-cloud multicast, LLM inference request routing, and optimizing KV cache reuse in databases with natural language queries.

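Keywords: Agentic AI, Large Language Models, System Optimization, Multi-agent Systems, Context Window, KV Cache, Coherence, Persistence

In-depth analysis:

Improving Coherence and Persistence in Agentic AI for System Optimization

Summary:

To address the "evolutionary neighborhood bias" and "coherence ceiling" that large language models (LLMs) face in automated system optimization, the paper proposes the Engram architecture. Engram decouples long-horizon exploration from a single context window and organizes a sequence of agents that iteratively design, test, and analyze. The key idea is a structured handoff: at the end of its run, each agent distills high-level insights into a persistent "Research Digest" and stores detailed data in an "Archive". Later agents start with a fresh context window and inherit earlier findings by reading the digest. Experiments show that Engram outperforms human experts and existing automated methods (such as Glia and OpenEvolve) on tasks including multi-cloud multicast and LLM inference request routing, combining coherence, flexibility, and persistence.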

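Contributions:

- Proposes a structured handoff mechanism: a persistent Research Digest and Archive accumulate and pass knowledge across agents, which counters performance degradation over long contexts.
- Designs a research architecture based on a sequence of agents that splits long-horizon exploration across agents with independent context windows, combining the persistence of evolutionary methods with the flexibility of tool-using agents.
- On complex system optimization tasks such as multi-cloud multicast and LLM inference routing, discovers new heuristics that beat human experts and existing automated baselines.
- Defines and addresses two key failure modes of LLM agents in system optimization: evolutionary neighborhood bias and the coherence ceiling.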

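Method

!!! info

The paper takes an agent-based research approach. Engram organizes system optimization as a series of single-agent explorations, each following a scientific-method loop (hypothesize, implement, experiment, analyze). Agents operate in a workspace with tools for reading and writing files, running scripts, and calling simulators. The key step is that each agent, when it finishes, writes its experimental results and insights to the Research Digest, and later agents consult the digest to build new exploration plans. Evaluation covers nine system problems, including the ADRS benchmark and an LLM request router.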
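
The handoff between agents can be sketched as a loop in which each run sees only the digest, never the previous run's full context. This is an illustrative sketch, not Engram's code: the `Workspace` class, the stubbed `run_agent` (which stands in for an LLM agent and simply pretends to improve on the best known score), and the digest size limit are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    archive: list = field(default_factory=list)  # full snapshots: code, logs, results
    digest: list = field(default_factory=list)   # compact insights passed to later agents


def run_agent(run_id, digest):
    """Stub for one agent run; its only carried-over input is the digest."""
    best = max((entry["score"] for entry in digest), default=0.0)
    return {
        "run": run_id,
        "code": f"heuristic_v{run_id}",
        "log": f"run {run_id}: built on best known score {best}",
        "score": best + 1.0,  # placeholder for a measured benchmark result
    }


def explore(num_runs, max_digest=3):
    """Run agents in sequence; archive everything, keep the digest compact."""
    ws = Workspace()
    for run_id in range(num_runs):
        result = run_agent(run_id, list(ws.digest))  # fresh context: digest only
        ws.archive.append(result)
        ws.digest.append({"insight": result["log"], "score": result["score"]})
        # Keep only the highest-scoring insights so the digest stays short.
        ws.digest = sorted(ws.digest, key=lambda e: e["score"], reverse=True)[:max_digest]
    return ws


ws = explore(num_runs=5)
print(len(ws.archive), "archived runs;", len(ws.digest), "digest entries")
print("best score in digest:", ws.digest[0]["score"])
```

The archive grows with every run while the digest stays bounded, which is the property that lets each new agent start from a small, fresh context and still build on earlier findings.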

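Key results:

- On multi-cloud multicast, the heuristic found by Engram reaches a best cost of $622, better than the human SOTA of $626 and all evolutionary baselines.
- On LLM inference request routing, Engram lowers mean response time to 23.9 s, better than the expert-designed heuristic (25.7 s) and Glia.
- Across the nine system problems evaluated, Engram beats the human SOTA in eight settings and matches or exceeds OpenEvolve in every category.
- Shows that, with structured knowledge transfer, agents can sustain coherent research exploration over hundreds of trials.

Tech stack: large language models (LLMs) as the core reasoning agents, simulators or experimental testbeds for code evaluation, a file system for persisting the Archive and Research Digest, a Python programming environment, the ADRS benchmark suite, an LLM request router simulation environment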

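Strengths

- Mitigates context rot and knowledge loss in long-running LLM agent tasks, improving long-horizon reasoning.
- The architecture balances coherence, flexibility, and persistence, overcoming limitations of existing approaches (such as pure code evolution or single-agent iteration).
- Shows strong performance across several real system optimization domains, suggesting broad applicability.
- The Research Digest offers a scalable pattern for passing knowledge between agents.

Limitations

- Performance depends on the underlying LLM's ability to generate good hypotheses and code; a weaker model may reduce effectiveness.
- Handoff quality depends on how accurately agents summarize and distill insights; an inaccurate digest can mislead later agents.
- Running many agents and a large number of simulation experiments can be computationally expensive.
- The work currently targets system optimization; its generality to other domains still needs validation.

Relevance to research direction:

The paper is highly relevant. It falls under "innovation in the technical principles of large models and deep learning", with innovations in agent architecture, long-context management, and knowledge transfer. It also shows large models applied to the specific domain of system optimization, matching "research applications of large models in different domains". The Engram architecture addresses a core pain point of agentic AI and has high technical novelty and academic value.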

5. EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises

Authors: Ankush Agarwal, Harsh Vishwakarma, Suraj Nagaje, Chaitanya Devaguptapu | Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.21630v1

Score: 56.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	15.0/10	15.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

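Scoring rationale: The paper's core is EnterpriseLab, a platform for deploying enterprise AI agents. It focuses on small language models (SLMs) in enterprise settings and improves agent capability by integrating tools, automated data generation, and training pipelines. It is highly relevant to "Small Language Models" (10 points) because it explicitly discusses SLMs as a privacy-preserving alternative; highly relevant to "LLM Agents" (15 points) because the whole platform is built around agent development and deployment; highly relevant to "Tool Use" (10 points) because the platform integrates 140+ enterprise tools; relevant to "Large Language Models" (8 points) because it compares against GPT-4o; relevant to "Post-training" (8 points) because it involves training pipelines; and partially relevant to "Pre-training" (5 points) because it involves domain adaptation. Other keywords such as MoE, Scaling Laws, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper presents the EnterpriseLab platform, which addresses fragmented development pipelines in enterprise AI agent deployment. By unifying tool integration, data generation, and training pipelines, it lets 8B-parameter small language models match GPT-4o on complex enterprise workflows while cutting inference cost by 8-10x.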

Abstract

Deploying AI agents in enterprise environments requires balancing capability with data sovereignty and cost constraints. While small language models offer privacy-preserving alternatives to frontier models, their specialization is hindered by fragmented development pipelines that separate tool integration, data generation, and training. We introduce EnterpriseLab, a full-stack platform that unifies these stages into a closed-loop framework. EnterpriseLab provides (1) a modular environment exposing enterprise applications via Model Context Protocol, enabling seamless integration of proprietary and open-source tools; (2) automated trajectory synthesis that programmatically generates training data from environment schemas; and (3) integrated training pipelines with continuous evaluation. We validate the platform through EnterpriseArena, an instantiation with 15 applications and 140+ tools across IT, HR, sales, and engineering domains. Our results demonstrate that 8B-parameter models trained within EnterpriseLab match GPT-4o’s performance on complex enterprise workflows while reducing inference costs by 8-10x, and remain robust across diverse enterprise benchmarks, including EnterpriseBench (+10%) and CRMArena (+10%). EnterpriseLab provides enterprises a practical path to deploying capable, privacy-preserving agents without compromising operational capability.

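Keywords: AI agents, enterprise deployment, small language models, tool integration, automated trajectory synthesis, training pipelines, privacy-preserving, inference cost reduction

In-depth analysis:

EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises

Summary:

Deploying AI agents in enterprises faces challenges of data sovereignty, high inference cost, and fragmented development pipelines. Small language models (SLMs) offer a privacy-preserving alternative, but there is no integrated infrastructure for turning internal tools into training data. The paper proposes EnterpriseLab, a full-stack platform whose closed-loop framework unifies a modular tool environment based on the Model Context Protocol (MCP), automated trajectory synthesis, and integrated training pipelines. Experiments show that 8B-parameter models trained on the platform match GPT-4o on complex enterprise workflows, cut inference cost by 8-10x, and perform well on multiple benchmarks, giving enterprises a low-cost, privacy-preserving path to agent deployment.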

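Contributions:

- Proposes the EnterpriseLab full-stack platform, which for the first time unifies tool integration, data synthesis, model training, and evaluation in one closed-loop framework, addressing fragmented enterprise agent development.
- Designs a modular tool environment based on the Model Context Protocol (MCP) that gives plug-and-play integration of enterprise applications and supports stateful execution and normalization of heterogeneous outputs.
- Develops a constraint-aware tool-graph traversal algorithm that automatically generates executable, high-quality training trajectories from environment schemas, removing the dependence on manual annotation.
- Shows that a small language model (8B) trained on the platform can match GPT-4o on specific enterprise tasks while sharply reducing inference cost and preserving data sovereignty.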

方法

!!! info

论文构建了包含15个容器化应用和140+工具的EnterpriseArena仿真环境。在技术路线上，实现了基于MCP的动态工具注册表和有状态执行容器；提出了四阶段任务合成流程：工具图构建、约束感知轨迹采样（DFS遍历）、分层任务生成和验证；采用监督微调（SFT）、偏好优化（DPO）和在线强化学习（RL）相结合的集成训练管道，并利用环境反馈进行持续优化。
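The constraint-aware DFS trajectory sampling mentioned above can be illustrated with a small sketch. This is not the paper's algorithm: the tool names, the `needs`/`gives` schema fields, and the no-repeat rule are all hypothetical, chosen only to show how a depth-first traversal can prune tool sequences whose inputs are not yet available.

```python
from typing import Dict, List, Set

# Hypothetical tool schemas: each tool lists the fields it requires and produces.
TOOLS: Dict[str, Dict[str, Set[str]]] = {
    "search_employee": {"needs": set(), "gives": {"employee_id"}},
    "get_leave_balance": {"needs": {"employee_id"}, "gives": {"balance"}},
    "create_ticket": {"needs": {"employee_id"}, "gives": {"ticket_id"}},
    "close_ticket": {"needs": {"ticket_id"}, "gives": set()},
}


def sample_trajectories(max_len: int = 3) -> List[List[str]]:
    """Enumerate tool sequences in which every call's inputs are already available."""
    results: List[List[str]] = []

    def dfs(path: List[str], available: Set[str]) -> None:
        if path:
            results.append(list(path))
        if len(path) == max_len:
            return
        for name, spec in TOOLS.items():
            # Constraint (assumed): skip repeated tools and tools with missing inputs.
            if name in path or not spec["needs"] <= available:
                continue
            path.append(name)
            dfs(path, available | spec["gives"])
            path.pop()

    dfs([], set())
    return results


if __name__ == "__main__":
    for trajectory in sample_trajectories():
        print(" -> ".join(trajectory))
```

Every sampled trajectory is executable by construction, which is the property that lets such sequences be turned into training tasks without manual labeling.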

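Key results:

Qwen3-8B matches GPT-4o's performance on EnterpriseArena.
Inference cost is 8-10x lower than with GPT-4o.
On the two external benchmarks, EnterpriseBench and CRMArena, the model gains +10% on each (the analysis attributes this margin to a comparison with GPT-4o).
Training is efficient: SFT takes about 2 hours and online RL 24-30 hours to produce a production-grade model.

Tech stack: Models: Qwen3-8B, GPT-4o (comparison baseline); Protocol: Model Context Protocol (MCP); Containerization: Docker; Algorithms: depth-first search (DFS), supervised fine-tuning (SFT), Direct Preference Optimization (DPO), online reinforcement learning (online RL); Environment: EnterpriseArena (15 apps, 140+ tools)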

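Strengths

Full-stack integration: removes the barriers between tool integration, data generation, and training, forming a closed feedback loop.
High automation: training data is generated automatically by traversing the tool graph, which greatly lowers the barrier and cost of building custom enterprise agents.
Performance and cost together: shows that a small model can outperform a large one in a specific vertical domain while balancing privacy, cost, and performance.
Generality: supports integration of arbitrary tools, covering proprietary enterprise systems as well as open-source tools.

Limitations

Environment dependence: although arbitrary tools are supported, building a high-fidelity simulated environment such as EnterpriseArena still takes substantial work, and the simulation may differ from a real production environment.
Model scale: validation so far focuses on 8B-parameter models; for very complex, cross-domain, long-chain reasoning, small models may still hit a ceiling.
Synthetic data quality: despite constraint awareness, automatically generated trajectories may still lack the flexibility of human experts in handling ambiguous or anomalous cases.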

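Relevance to research direction:

The paper is applied research on large language models (LLMs) in the enterprise vertical, centered on a platform for developing and deploying agents. It contributes technical innovation at the infrastructure level, in particular automated trajectory synthesis and a closed-loop training framework. Although it does not address scientific discovery directly (for example biology or physics), its methodology (tool use, data synthesis) is instructive for automated experiments and tool calling in AI for science. Overall, it fits the evaluation criteria "research applications of large models in different domains" and "innovation in the technical principles of large models" well, with high novelty and practical value.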

6. Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERT

Authors: Mariela M. Nina, Caio Veloso Costa, Lilian Berton, Didier A. Vega-Oliveros Venue/Source: arxiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21418v1

Score: 53.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	15.0/10	15.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	10.0/10	10.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

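Scoring rationale: The paper's core topic is applying parameter-efficient fine-tuning (PEFT) and quantization to Portuguese question answering, so it is highly relevant to the keywords "PEFT" and "Quantization" (core content, 10-15 points). It touches on large language models (LLMs) for a low-resource language, which relates it to "Large Language Models" (8 points). It compares generative LLMs with smaller encoder models such as BERTimbau, giving some relevance to "Small Language Models" (5 points). Its fine-tuning experiments relate to "Post-training" (core content, 10 points). It mentions domain adaptation (Portuguese), giving some relevance to "Pre-training" (5 points). Other keywords, such as MoE, Scaling Laws, and Instruction Tuning, are unrelated to the paper and score 0.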
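As a quick check of the arithmetic, the total can be recomputed from the non-zero rows of the table and compared against the pass mark. The pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) comes from the report's configuration section; the short dictionary keys are just labels chosen for this sketch.

```python
# Non-zero keyword scores copied from the score table (weight 1.0 each).
scores = {
    "LLMs": 8.0,
    "SLMs": 5.0,
    "Pre-training": 5.0,
    "Post-training/SFT": 10.0,
    "PEFT/LoRA": 15.0,
    "Quantization": 10.0,
}

total = sum(scores.values())

# Pass mark as defined in the report configuration: 5.0 + 0.8 * total weight.
total_weight = 27.0
pass_mark = 5.0 + 0.8 * total_weight

print(total, round(pass_mark, 1), total >= pass_mark)  # 53.0 26.6 True
```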

!!! tip deepseek-chat TL;DR

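This study systematically evaluates parameter-efficient fine-tuning (PEFT) and quantization for Portuguese question answering. It finds that LoRA cuts training time by 73.5% while retaining 95.8% of baseline F1 on BERTimbau-Large, and that small encoder models are more efficient than generative LLMs for this task.

Abstract (Chinese translation)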

尽管大语言模型已变革了自然语言处理领域，但其计算成本为巴西葡萄牙语等低资源语言的可及性设置了障碍。本研究对应用于BERTimbau模型的参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）与量化技术进行了系统性评估，任务为基于巴西葡萄牙语版SQuAD v1（SQuAD-BR）的问答任务。我们评估了40种配置组合，涵盖四种PEFT方法（LoRA、DoRA、QLoRA、QDoRA）及两种模型规模（Base版：1.1亿参数，Large版：3.35亿参数）。研究结果揭示了三个关键发现：（1）在BERTimbau-Large模型上，LoRA方法能达到基线性能的95.8%，同时将训练时间减少73.5%（F1分数为81.32对比84.86）；（2）较高的学习率（2e-4）能显著提升PEFT性能，相较于标准学习率可获得最高达+19.71分的F1分数提升；（3）更大规模的模型展现出两倍的量化鲁棒性（F1分数损失为4.83对比9.56分）。这些结果表明，基于编码器的模型可通过高效微调应用于巴西葡萄牙语抽取式问答任务，其计算成本远低于大型生成式大语言模型，从而推动了符合“绿色人工智能”原则的可持续方法。对Tucano和Sabiá模型在同一抽取式问答基准上的探索性评估显示：虽然生成式模型通过LoRA微调可获得有竞争力的F1分数，但其所需GPU内存最高达BERTimbau-Base的4.2倍，训练时间多出3倍，这进一步印证了基于编码器的小型架构在此类任务中的效率优势。

Abstract

Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1=81.32 vs 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles. An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to 4.2× more GPU memory and 3× more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.

Keywords: Parameter-Efficient Fine-Tuning, LoRA, Quantization, Portuguese Question Answering, BERTimbau, Computational Efficiency, Green AI, Extractive QA

In-depth analysis:

Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and an Exploratory Evaluation of Generative LLMs

Summary:

This paper systematically evaluates parameter-efficient fine-tuning (PEFT) and quantization on BERTimbau for Brazilian Portuguese question answering. On SQuAD-BR, it compares four methods (LoRA, DoRA, QLoRA, and QDoRA) at two model sizes (Base and Large), testing 40 configurations in total. LoRA on BERTimbau-Large reaches 95.8% of baseline performance while cutting training time by 73.5%; higher learning rates substantially improve PEFT performance; and the larger model is more robust to 4-bit quantization. An exploratory evaluation also shows that generative LLMs (Tucano and Sabiá) reach competitive F1 scores but need far more GPU memory and training time than BERTimbau. The study demonstrates the efficiency advantage of encoder models for extractive QA, in line with Green AI principles.

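Contributions:

For the low-resource setting of Brazilian Portuguese, systematically evaluates PEFT (LoRA, DoRA) combined with quantization (QLoRA, QDoRA), addressing a gap in efficient fine-tuning research for this language.
Proposes and tests three core hypotheses about PEFT (low-resource efficiency, robustness with scale, and sensitivity to optimization), finding in particular that a higher learning rate (2e-4) is crucial for PEFT performance.
Compares encoder models (BERTimbau) with generative LLMs (Tucano, Sabiá) on computational efficiency for extractive QA, quantifying the resource cost of generative models (up to 4.2× more GPU memory).

Method

!!! info

The study uses SQuAD-BR (the Brazilian Portuguese version of SQuAD v1) as the benchmark dataset and takes the Base (110M) and Large (335M) versions of BERTimbau as the main subjects. It designs 40 configurations that combine four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) with the two model sizes. Full fine-tuning serves as the baseline, and the metrics are F1, Exact Match (EM), training time, and GPU memory usage. Generative LLMs (Tucano and Sabiá) are also evaluated in an exploratory comparison of efficiency against the encoder models.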
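For readers unfamiliar with LoRA, the sketch below shows the core idea in plain Python: the pretrained weight `W` stays frozen and only two low-rank factors `A` and `B` are trained. The rank `r = 8`, the scaling `alpha`, and the 1024-wide layer used in the parameter count are illustrative assumptions, not the paper's configuration.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float],
                 alpha: float, r: int) -> List[float]:
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


if __name__ == "__main__":
    # Assumed sizes for illustration: one 1024 x 1024 projection, rank 8.
    full = 1024 * 1024
    lora = trainable_params(1024, 1024, 8)
    print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.2%}")
```

With these assumed sizes, the adapter trains 16,384 parameters per projection instead of 1,048,576, which is the kind of reduction that makes the reported training-time savings plausible.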

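Key results:

LoRA reaches 95.8% of baseline performance on BERTimbau-Large (F1=81.32 vs 84.86) while cutting training time by 73.5%.
A higher learning rate (2e-4) substantially improves PEFT performance, with F1 gains of up to +19.71 points over standard rates.
The larger model (Large) is about twice as robust to 4-bit quantization as the smaller one (Base), losing 4.83 vs 9.56 F1 points.
On extractive QA, the generative LLMs (Tucano, Sabiá) need up to 4.2× more GPU memory and 3× more training time than BERTimbau-Base.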
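The headline ratios follow directly from the reported F1 numbers; a short check makes the derivation explicit. Only the four figures quoted in the abstract are used.

```python
# F1 scores on BERTimbau-Large: LoRA vs. full fine-tuning baseline.
lora_f1, full_f1 = 81.32, 84.86
retained = lora_f1 / full_f1  # fraction of baseline F1 kept by LoRA

# F1 points lost under quantization: Large vs. Base.
loss_large, loss_base = 4.83, 9.56
resilience_ratio = loss_base / loss_large

print(f"{retained:.1%}")            # 95.8%
print(f"{resilience_ratio:.2f}x")   # 1.98x, i.e. roughly "twice"
```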

Tech stack: LoRA (Low-Rank Adaptation), DoRA (Weight-Decomposed Low-Rank Adaptation), QLoRA (Quantized LoRA), QDoRA, 4-bit NormalFloat (NF4) quantization, Double Quantization, Paged Optimizers, BERTimbau (Base/Large), Tucano, Sabiá, SQuAD-BR, F1-score, Exact Match (EM)

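Strengths

Well targeted: focuses on a low-resource language (Brazilian Portuguese) and addresses a concrete compute bottleneck.
Systematic experimental design: covers multiple PEFT methods, quantization strategies, and model sizes, giving a comprehensive comparison.
Practical value: gives concrete hyperparameter guidance (such as a learning rate of 2e-4) and quantifies the efficiency gains, which is useful in resource-constrained settings.
Green AI orientation: highlights the efficiency advantage of encoder models on specific tasks and advocates sustainable research practice.

Limitations

Narrow task scope: the focus is extractive QA (SQuAD-BR); generative QA and other NLP tasks are not covered, so the generality of the conclusions remains to be verified.
Limited model selection: generative LLMs are explored, but the in-depth analysis concentrates on the BERT architecture and does not cover newer efficient architectures (such as Mamba).
Shallow evaluation of generative LLMs: the evaluation of Tucano and Sabiá is exploratory only and may not fully tap their potential under prompt engineering or more elaborate fine-tuning.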

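Relevance to research direction:

The paper is highly relevant. It bears directly on technical innovation in large models and deep learning, specifically parameter-efficient fine-tuning (PEFT) and quantization. Although the application is Portuguese question answering (an NLP task), its core contribution is improving training efficiency under resource constraints, which matches the criterion "innovation in the technical principles of large models and deep learning". By systematically evaluating LoRA, DoRA, QLoRA, and related techniques, it shows how to reduce computational cost and offers clear technical value.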

7. User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Intera

Authors: Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür Venue/Source: arxiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20939v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

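Scoring rationale: The paper's core topic is personalization of LLMs acting as conversational agents (LLM Agents). It proposes the VARS framework, which uses retrieval augmentation to model user preferences, so it is highly relevant to "Large Language Models", "LLM Agents", and "Retrieval-Augmented Generation" (10 points each). The framework learns from user feedback, giving some relevance to "Self-Correction" (5 points), and the dual-vector design supports interpretability, giving some relevance to "Explainable AI" (5 points). Other keywords, such as MoE, SFT, and quantization, are not covered in the paper and score 0.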

!!! tip deepseek-chat TL;DR

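Addressing the lack of a persistent user model in conversational LLM agents, the paper proposes the VARS framework, which models user preferences through retrieval-augmented interaction and learning from weak rewards. With the backbone model frozen, it improves interaction efficiency and matches a strong baseline in task success.

Abstract (Chinese translation)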

大型语言模型正日益被用作个人助手，然而大多数模型缺乏持久的用户模型，迫使用户在不同会话中反复重述偏好。我们提出向量自适应检索评分（VARS），这是一个与具体流程无关、主干网络冻结的框架，通过在共享偏好空间中使用长期与短期向量来表示每位用户，并利用这些向量对结构化偏好记忆的检索评分进行偏置调整。这些向量通过用户反馈产生的弱标量奖励进行在线更新，从而实现无需针对每位用户进行微调的个性化适配。我们在\textsc{MultiSessionCollab}（一个包含丰富用户偏好档案的在线多会话协作基准测试）上对数学和编程任务进行了评估。在主干网络冻结的条件下，用户感知检索的主要优势在于提升交互效率，而非大幅提高原始任务准确率：我们完整的VARS智能体实现了最强的综合性能，在任务成功率上与强大的反思基线模型持平，同时降低了超时率和用户操作负担。学习到的长期向量与跨用户偏好重叠度保持一致，而短期向量则捕捉了会话特定的适应性，这支持了双向量设计的可解释性。代码、模型与数据可在https://github.com/YurenHao0426/VARS获取。

Abstract

Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector-Adapted Retrieval Scoring (VARS), a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users’ feedback, enabling personalization without per-user fine-tuning. We evaluate on MultiSessionCollab, an online multi-session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user-aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long-term vectors also align with cross-user preference overlap, while short-term vectors capture session-specific adaptation, supporting the interpretability of the dual-vector design. Code, model, and data are available at https://github.com/YurenHao0426/VARS.

Keywords: Large Language Models, LLM Agents, Retrieval-Augmented Generation, User Preference Modeling, Personalization, Weak Rewards, Frozen Backbone, Multi-session Collaboration

In-depth analysis:

Weak Reward Feedback from Retrieval-Augmented Interaction: User Preference Modeling for Conversational LLM Agents

Summary:

Existing conversational LLMs lack a persistent user model, so users must restate their preferences repeatedly. The paper proposes VARS (Vector-Adapted Retrieval Scoring), which represents each user in a shared preference space with a learned dual-vector state (a long-term and a short-term vector) and uses these vectors to bias retrieval scoring over a structured preference memory. VARS keeps the backbone model frozen and updates only the user vectors online, from weak scalar rewards derived from user feedback, which gives continual personalization without per-user fine-tuning. On the MultiSessionCollab benchmark, VARS matches a strong baseline in task success while improving interaction efficiency, with a lower timeout rate and less user effort. The analysis shows that the long-term vector reflects stable cross-session preferences and the short-term vector captures session-specific context.

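Contributions:

Proposes the VARS framework, which biases retrieval scoring with a compact, learned dual-vector user state, giving continual personalization without fine-tuning.
Designs a long-term/short-term dual-vector mechanism that separates stable cross-session preferences from transient in-session context.
Introduces an online update mechanism based on weak scalar rewards, using the REINFORCE algorithm to optimize the user vectors from user feedback.
Builds a structured preference memory that distinguishes global preferences from conditional ones, improving retrieval efficiency and accuracy.

Method

!!! info

1. Preference extraction: a fine-tuned lightweight model (Mext) converts dialogue into structured JSON preference tuples. 2. Preference memory: preferences are stored as memory cards containing metadata and embedding vectors, which PCA projects into a shared item space. 3. User state modeling: a long-term vector (the mean of all of the user's item vectors) and a short-term vector are combined into an effective user vector. 4. Personalized retrieval: in the retrieval and reranking stages, the effective user vector biases the scores of candidate memory cards. 5. Online update: keyword matching turns follow-up queries into scalar rewards, and the REINFORCE algorithm updates the user vectors.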
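Steps 4 and 5 can be illustrated with a toy sketch. The additive bias `base + dot(user, item)`, the softmax retrieval policy, and the learning rate are assumptions made for illustration; the paper's exact scoring and update rules are not given in this report. The update itself is a standard REINFORCE step: the gradient of the log-probability of the chosen card with respect to the user vector, scaled by the (baseline-corrected) reward.

```python
import math
from typing import List

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def biased_scores(base: List[float], items: List[Vec], user: Vec) -> List[float]:
    """Score per memory card: base relevance plus a user-specific bias (assumed form)."""
    return [s + dot(user, v) for s, v in zip(base, items)]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_step(user: Vec, base: List[float], items: List[Vec], chosen: int,
                   reward: float, baseline: float = 0.0, lr: float = 0.5) -> Vec:
    """One REINFORCE update of the user vector for a softmax retrieval policy."""
    probs = softmax(biased_scores(base, items, user))
    dims = range(len(user))
    expected = [sum(p * v[d] for p, v in zip(probs, items)) for d in dims]
    # Gradient of log p(chosen) w.r.t. the user vector: item[chosen] - E[item].
    grad = [items[chosen][d] - expected[d] for d in dims]
    return [u + lr * (reward - baseline) * g for u, g in zip(user, grad)]


if __name__ == "__main__":
    cards = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical memory-card vectors
    user = reinforce_step([0.0, 0.0], [0.0, 0.0], cards, chosen=0, reward=1.0)
    print(user, biased_scores([0.0, 0.0], cards, user))
```

A positive reward for a retrieved card moves the user vector toward that card's embedding, so similar cards score higher in later sessions while the backbone LLM stays untouched.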

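Key results:

VARS achieves the strongest overall performance on the MultiSessionCollab benchmark.
It matches the strong Reflection baseline in task success while reducing timeout rate and user effort, improving interaction efficiency.
The long-term vectors align with cross-user preference overlap, indicating that they capture stable preferences.
The short-term vectors adapt to session-specific context.

Tech stack: Models: Qwen3-0.6B (preference extraction), a frozen chat LLM, an embedding model, a reranker; Algorithms: PCA (principal component analysis), REINFORCE (reinforcement learning), dense retrieval, cross-encoder reranking; Data structures: JSON (structured preference extraction), memory cards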

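Strengths

Efficiency: no per-user fine-tuning of the large model; only low-dimensional vectors are updated, so compute cost is low and deployment is easy.
Interpretability: the dual-vector design cleanly separates long-term from short-term preferences, which makes the model's decisions easier to understand.
Practicality: reduces the repeated input and correction users face in long-term collaboration, improving the interaction experience.
Robustness: freezing the backbone avoids catastrophic forgetting and preserves the base model's capabilities.

Limitations

Simulation-based evaluation: evaluation relies mainly on an LLM user simulator, which may not fully reflect the complex feedback behavior of real users.
Simple reward signal: weak rewards come from keyword matching, which can be noisy and may miss subtle differences in user satisfaction.
Extraction precision: the preference extraction model has high recall but lower precision and depends on downstream filtering, so irrelevant information may be introduced.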

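Relevance to research direction:

The paper is highly relevant. It bears directly on "innovation in the technical principles of large models", particularly personalization, long-term memory, and reinforcement-learning-based online updates of user vectors (the backbone itself is not fine-tuned). Although the application is math and code collaboration, the core contribution is improving how LLM agents adapt to users through memory and feedback, a key innovation in LLM system architecture. It matches the user's interest in new technical principles.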

8. SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Im

Authors: Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Zhou Zhao Venue/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22228v1

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

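Scoring rationale: The paper studies reward modeling for spatial consistency in text-to-image generation; its core contributions are the SpatialReward model and the SpatRelBench benchmark. Relevance to the keywords: 1) "Chain of Thought" (10 points): the abstract states that a vision-language model applies chain-of-thought reasoning to assess complex spatial relations, which is a core part of the method. 2) "RLHF" (8 points): the paper trains text-to-image models with reinforcement learning, using SpatialReward as the reward model, which is closely related to RLHF techniques. 3) "Large Language Models" (5 points): although the paper mainly involves vision-language and text-to-image models, it is an application of large models to a specific domain. 4) "System 2 Thinking" (5 points): chain-of-thought reasoning reflects a deliberate reasoning process. 5) "Hallucination Mitigation" (5 points): fixing spatial inaccuracies in generated images can be seen as a form of hallucination mitigation. 6) "Explainable AI" (5 points): verifiable reward modeling relates to an interpretable evaluation process. Other keywords, such as MoE, quantization, and RAG, have no direct connection to the paper.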

!!! tip deepseek-chat TL;DR

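To address the weak evaluation of fine-grained spatial relations in text-to-image generation, the paper proposes SpatialReward, a verifiable spatial reward model that combines prompt decomposition, expert detection, and chain-of-thought reasoning, consistently improving the spatial consistency and overall quality of generated images.

Abstract (Chinese translation)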

近期，基于强化学习（RL）的文本到图像（T2I）生成技术得益于评估语义对齐与视觉质量的奖励模型而取得进展。然而，现有奖励模型大多对细粒度空间关系关注有限，常生成整体看似合理但物体定位存在偏差的图像。本研究提出 SpatialReward，一种可验证的奖励模型，专门用于评估生成图像中的空间布局。SpatialReward采用多阶段流程：提示分解器（Prompt Decomposer）从自由形式提示中提取实体、属性及空间元数据；专家检测器提供物体位置与属性的精确视觉定位；视觉语言模型则基于定位观察结果进行思维链推理，以评估基于规则方法难以处理的复杂空间关系。为更全面评估生成图像中的空间关系，我们引入 SpatRelBench 基准，涵盖物体属性、朝向、物体间关系及渲染文本布局等维度。在Stable Diffusion和FLUX上的实验表明，将SpatialReward融入RL训练能持续提升空间一致性与整体生成质量，其结果与人类判断更为吻合。这些发现表明，可验证的奖励模型在推动文本到图像生成模型实现更精准、可控的优化方面具有显著潜力。

Abstract

Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.

Keywords: text-to-image generation, spatial consistency, reward modeling, reinforcement learning, chain-of-thought reasoning, vision-language model, spatial relationships, verifiable evaluation

In-depth analysis:

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Summary:

Current text-to-image (T2I) models optimized with reinforcement learning lack fine-grained evaluation of spatial relations. The paper proposes SpatialReward, a verifiable spatial reward model built as a three-stage pipeline: a prompt decomposer first parses the free-form prompt into structured constraints; expert detectors then provide accurate visual grounding of object positions and attributes; finally, a vision-language model (VLM) applies chain-of-thought reasoning to assess complex spatial relations. The authors also introduce the SpatRelBench benchmark, covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that integrating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results closer to human judgments.

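Contributions:

Proposes SpatialReward, a verifiable spatial reward model that combines prompt decomposition, expert detection, and chain-of-thought reasoning, designed specifically to evaluate spatial layouts in T2I generation.
Designs the Prompt Decomposer, which converts free-form prompts into structured constraints and removes the dependence of earlier structured methods on fixed templates.
Introduces a collaborative verification mechanism: expert detectors supply verifiable facts, and the VLM's CoT reasoning builds on them, keeping accuracy while handling complex spatial relations more flexibly.
Releases the SpatRelBench benchmark, which extends the dimensions of spatial evaluation to fine-grained aspects such as object orientation, 3D spatial positioning, and rendered text placement.

Method

!!! info

The paper follows a multi-stage approach: 1. Prompt decomposition: a fine-tuned Qwen2.5-VL-7B model parses the natural-language prompt into a structured constraint set of entities, attributes, and spatial relations. 2. Verifiable reward: an open-set detector and OCR tools extract object bounding boxes, attributes, and text from the image, producing verifiable observations. 3. Reasoning: the verified information is passed to a vision-language model (such as Qwen-VL), which uses chain-of-thought reasoning to judge whether complex spatial relations satisfy the constraints; the judgments are aggregated into a final reward score. Finally, SpatialReward is integrated into the Flow-GRPO reinforcement learning framework to optimize base models such as Stable Diffusion and FLUX.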
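The detector-grounded part of this pipeline can be sketched with simple bounding-box checks. This is only the rule-based portion: the paper adds VLM chain-of-thought reasoning precisely because rules like these cannot handle complex relations. The box format, the four supported relations, and the "fraction of satisfied constraints" aggregation are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def holds(relation: str, a: Box, b: Box) -> bool:
    """Check a simple relation between two boxes (image y axis points down)."""
    (ax, ay), (bx, by) = center(a), center(b)
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by
    if relation == "below":
        return ay > by
    raise ValueError(f"unsupported relation: {relation}")


def spatial_reward(constraints: List[Tuple[str, str, str]],
                   detections: Dict[str, Box]) -> float:
    """Fraction of (subject, relation, object) constraints satisfied.

    Constraints that mention an undetected object count as failures.
    """
    if not constraints:
        return 0.0
    satisfied = 0
    for subj, rel, obj in constraints:
        if subj in detections and obj in detections:
            if holds(rel, detections[subj], detections[obj]):
                satisfied += 1
    return satisfied / len(constraints)


if __name__ == "__main__":
    boxes = {"cat": (0, 0, 10, 10), "dog": (20, 0, 30, 10)}
    print(spatial_reward([("cat", "left of", "dog")], boxes))
```

Because the reward is computed from detector outputs rather than from a free-form model judgment, each score can be traced back to specific boxes, which is what makes this kind of reward verifiable.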

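Key results:

On Stable Diffusion and FLUX, RL optimization with SpatialReward consistently improves the spatial consistency of generated images.
Compared with existing baseline reward models, the method handles complex spatial relations and object positioning better, and its results are closer to human evaluation.
SpatRelBench fills a gap in existing evaluation tools for fine-grained spatial relations (such as orientation, 3D position, and text placement).
The experiments indicate considerable potential for verifiable reward models in making T2I generation more controllable and accurate.

Tech stack: Stable Diffusion, FLUX, Flow-GRPO (Group Relative Policy Optimization), Qwen2.5-VL-7B, Qwen-VL, GPT-4o, Chain-of-Thought (CoT) Reasoning, Open-set Object Detection, OCR (Optical Character Recognition)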

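Strengths

Verifiability: expert detectors reduce the hallucinations that pure VLM reasoning can produce and ground the evaluation in facts.
Flexibility: handles arbitrary free-form prompts, overcoming the fixed-template limitation of earlier structured methods.
Well targeted: focuses on fine-grained spatial consistency, a known pain point in T2I generation, and fills a gap in how existing reward models evaluate spatial layout.
Comprehensive evaluation: the new SpatRelBench covers a richer set of spatial relation dimensions, supporting a fuller assessment of model performance.

Limitations

High computational cost: the three-stage pipeline (decomposition, detection, reasoning) needs more compute and time than a simple CLIP score.
Dependence on detector quality: accuracy relies heavily on the expert detectors; missed or false detections directly affect the reward score.
Reasoning chain length: for very complex scenes (many objects and nested relations), the VLM's reasoning chain may become too long, leading to reasoning errors or lower efficiency.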

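Relevance to research direction:

The paper is highly relevant to innovation in deep learning techniques. It applies large models (VLMs) to the feedback mechanism of image generation and combines reinforcement learning, verifiable AI, and chain-of-thought reasoning. It targets spatial consistency, a core technical difficulty in text-to-image generation, with a new solution, showing innovation at the level of both deep learning algorithms and multimodal large-model applications.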

9. Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with C

Authors: Shixu Liu Venue/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21673v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

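Scoring rationale: The paper's core topic is applying large language models (LLMs) to meteorological science. It proposes WeatherTGD, a training-free framework built on multi-agent systems and LLM agents that generates interpretable natural-language descriptions from weather time series data. It is therefore highly relevant to "Large Language Models", "LLM Agents", and "Multi-agent Systems" (10 points each) and has some relevance to "AI for Science" (5 points), since meteorology is a scientific application domain. Other keywords, such as MoE, SLMs, training methods, inference acceleration, and alignment techniques, are not covered in the paper and score 0.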

!!! tip deepseek-chat TL;DR

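The paper proposes WeatherTGD, a training-free multi-agent framework that uses text gradient descent and a consensus-aware gradient fusion mechanism. Three specialized LLM agents collaborate to generate high-quality, interpretable natural-language descriptions from weather time series data, substantially outperforming existing multi-agent baselines on real-world meteorological datasets.

Abstract (Chinese translation)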

将天气时间序列数据转化为可解释的自然语言描述，仍然是气象科学与自然语言处理交叉领域的一项重大挑战。尽管大语言模型（LLMs）在时间序列预测与分析方面已展现出卓越能力，但现有方法要么生成缺乏人类可理解解释的数值预测，要么产生缺乏领域专业深度的通用描述。我们提出了WeatherTGD，一个无需训练的多智能体框架，它通过文本梯度下降（Text Gradient Descent, TGD）的视角重新诠释了协作式描述优化过程。我们的系统部署了三个专业化的LLM智能体，包括统计分析师、物理解释器和气象学专家，它们从天气时间序列观测中生成领域特定的文本梯度。这些梯度通过一种新颖的共识感知梯度融合机制进行聚合，该机制在提取共同信号的同时保留了独特的领域视角。融合后的梯度随后指导一个类似于梯度下降的迭代优化过程，其中每个LLM生成的反馈信号都会更新描述，使其逼近最优解。在真实世界气象数据集上的实验表明，WeatherTGD在基于LLM的评估和人类专家评估中均取得了显著提升，大幅超越了现有的多智能体基线方法，同时通过并行智能体执行保持了计算效率。

Abstract

Generating interpretable natural language captions from weather time series data remains a significant challenge at the intersection of meteorological science and natural language processing. While recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in time series forecasting and analysis, existing approaches either produce numerical predictions without human-accessible explanations or generate generic descriptions lacking domain-specific depth. We introduce WeatherTGD, a training-free multi-agent framework that reinterprets collaborative caption refinement through the lens of Text Gradient Descent (TGD). Our system deploys three specialized LLM agents including a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert that generate domain-specific textual gradients from weather time series observations. These gradients are aggregated through a novel Consensus-Aware Gradient Fusion mechanism that extracts common signals while preserving unique domain perspectives. The fused gradients then guide an iterative refinement process analogous to gradient descent, where each LLM-generated feedback signal updates the caption toward an optimal solution. Experiments on real-world meteorological datasets demonstrate that WeatherTGD achieves significant improvements in both LLM-based evaluation and human expert evaluation, substantially outperforming existing multi-agent baselines while maintaining computational efficiency through parallel agent execution.

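Keywords: Large Language Models, Multi-agent Systems, LLM Agents, Weather Captioning, Text Gradient Descent, Training-Free Approach, Consensus-Aware Gradient Fusion, Meteorological Science

In-depth analysis:

Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion

Summary:

The paper proposes WeatherTGD, a training-free multi-agent framework for generating interpretable natural-language descriptions from weather time series data. It brings text gradient descent (TGD) to the meteorological domain: three specialized LLM agents (a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert) each produce domain-specific textual gradients from their own perspective. A consensus-aware gradient fusion mechanism aggregates these gradients and guides an iterative refinement process. Experiments show that WeatherTGD substantially outperforms existing baselines in both LLM-based and human expert evaluation, improving the quality and readability of weather descriptions.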

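Contributions:

Applies text gradient descent (TGD) to caption generation from weather time series, which the analysis describes as a first, recasting multi-agent collaboration as an iterative optimization process.
Designs a three-expert agent layer (Statistical Analyst, Physics Interpreter, Meteorology Expert) that generates domain-specific textual gradients.
Proposes a consensus-aware gradient fusion mechanism that extracts shared signals while preserving each domain's distinct perspective.
Implements an explicit convergence criterion based on a semantic similarity threshold, which keeps computation bounded.

Method

!!! info

The paper combines a multi-agent system with text gradient descent. First, three specialized LLM agents (statistics, physics, meteorology) analyze the weather data in parallel and produce textual "gradients" (feedback). Next, a consensus-aware gradient fusion module aggregates this feedback, extracting the shared signal and integrating the unique insights. Finally, an iterative refinement loop uses the fused gradient to revise the caption until a semantic similarity threshold or the maximum number of iterations is reached.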
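The loop described above can be sketched without any LLM by stubbing each piece. Everything here is a stand-in: agents are plain functions returning sets of feedback strings, "consensus" means an item raised by at least two agents, Jaccard token overlap replaces the semantic similarity measure, and the agents run sequentially rather than in parallel. None of these choices come from the paper.

```python
from typing import Callable, Dict, List, Set

Agent = Callable[[str], Set[str]]          # caption -> feedback items ("textual gradients")
Updater = Callable[[str, List[str]], str]  # (caption, fused feedback) -> revised caption


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a semantic similarity model."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def fuse(gradients: List[Set[str]]) -> List[str]:
    """Consensus items first (raised by 2+ agents), then each agent's unique items."""
    counts: Dict[str, int] = {}
    for gradient in gradients:
        for item in gradient:
            counts[item] = counts.get(item, 0) + 1
    consensus = sorted(i for i, c in counts.items() if c >= 2)
    unique = sorted(i for i, c in counts.items() if c == 1)
    return consensus + unique


def refine(caption: str, agents: List[Agent], apply_update: Updater,
           threshold: float = 0.9, max_iters: int = 5) -> str:
    """Iterate until the caption stops changing much or the budget runs out."""
    for _ in range(max_iters):
        fused = fuse([agent(caption) for agent in agents])
        new_caption = apply_update(caption, fused)
        if jaccard(caption, new_caption) >= threshold:
            return new_caption  # converged: the update barely changed the caption
        caption = new_caption
    return caption
```

In a real system each agent and the updater would be LLM calls; the explicit threshold and iteration cap are what keep the number of calls bounded.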

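Key results: On real-world meteorological datasets, WeatherTGD scores 8.50/10 in LLM-based evaluation and 8.34/10 in human expert evaluation, well above existing multi-agent baselines. While keeping computation efficient through parallel execution, it addresses the difficulty a single model has in balancing statistical accuracy, physical mechanisms, and meteorological meaning.

Tech stack: large language models, text gradient descent, multi-agent systems, semantic similarity computation, prompt engineering, iterative optimization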

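Strengths

Training-free: no model fine-tuning is needed, which lowers compute cost and the barrier to entry.
Domain fusion: combines statistical, physical, and meteorological perspectives, producing descriptions with more depth and accuracy.
Novelty: transfers the idea of gradient descent from numerical optimization to discrete natural-language generation.
Efficiency: agents run in parallel and there is an explicit convergence condition, which avoids unbounded loops.

Limitations

Dependence on LLM capability: performance depends heavily on the reasoning ability and knowledge of the underlying LLM (such as ChatGPT, Claude, or Qwen).
Inference overhead: although training-free, each caption requires multiple calls to several LLMs, so inference cost is relatively high.
Convergence guarantees: a stopping criterion is provided, but in extreme cases the process may fail to converge or may settle in a local optimum.
Generalization: the framework currently targets meteorology; moving to other complex scientific domains may require redesigning the agent roles and the fusion mechanism.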

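Relevance to research direction:

The paper is highly relevant. It belongs to "applications of large models and deep learning in science" (meteorology) and also involves "innovation in the technical principles of large models and deep learning" (text gradient descent and multi-agent collaboration). It shows how LLMs can make scientific data more interpretable and uses TGD in a new way, matching the user's interest in novelty and new technical applications.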

10. Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Artic

Authors: Sai Koneru, Jian Wu, Sarah Rajtmajer Venue/Source: arxiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21193v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

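Scoring rationale: The paper studies how to extract hypotheses and statistical evidence from full-text scientific articles, an application of large models to science. It explicitly uses large language models (LLMs) as extractors and systematically evaluates the role of retrieval-augmented generation (RAG) in context selection, so it is highly relevant to "Large Language Models" and "Retrieval-Augmented Generation" (10 points each). The subject is scientific literature analysis, which is highly relevant to "AI for Science" (10 points). The paper does not involve the specific techniques (such as MoE, quantization, or alignment) or application settings (such as agents or tool calling) described by the other keywords, so those score 0.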

!!! tip deepseek-chat TL;DR

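The paper studies how to extract hypotheses and their statistical evidence from full-text scientific articles. Targeted context selection that optimizes retrieval quality consistently improves hypothesis extraction, but statistical evidence extraction remains hard, which points to extractor limitations in handling hybrid numeric-textual statements.

Abstract (Chinese translation)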

从全文科学文献中提取假设及其支持的统计证据，是实证研究结果综合的核心环节，但由于文档长度以及科学论证分散于论文不同章节，这一任务仍面临困难。本研究探讨了一种序列式全文提取场景：将文章摘要中陈述的主要发现，与（i）论文正文中对应的假设陈述及（ii）支持或反驳该假设的统计证据进行关联。这一框架构建了一个具有挑战性的文档内检索场景，其中许多候选段落虽与发现主题相关，但修辞功能各异，从而为检索与提取过程制造了困难负例。通过采用两阶段“检索-提取”框架，我们开展了一项关于检索设计选择的对照研究，在四种大型语言模型提取器上，分别调整上下文数量、上下文质量（标准检索增强生成、重排序，以及结合重排序的微调检索器），并设置理想段落对照以区分检索失败与提取局限。研究发现，针对性的上下文选择相较于全文提示持续改进了假设提取，其增益主要集中在优化检索质量与上下文纯净度的配置中。相比之下，统计证据提取的难度仍然显著更高。即使在理想段落条件下，性能仍处于中等水平，这表明提取器在处理混合数值-文本陈述时存在持续的能力局限，而非仅由检索失败导致。

Abstract

Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. The work studies a sequential full-text extraction setting, where the statement of a primary finding in an article’s abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices, varying context quantity, context quality (standard Retrieval Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), as well as an oracle paragraph setting to separate retrieval failures from extraction limits across four Large Language Model extractors. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.

Keywords: hypothesis extraction, statistical evidence extraction, full-text scientific articles, retrieval augmented generation, large language models, context selection, within-document retrieval, scientific information extraction

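Deep analysis:

Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles

Summary:

This paper studies the challenge of extracting hypotheses and their supporting statistical evidence from full-text scientific articles. Because document length and the spread of scientific arguments across sections make extraction difficult, the authors use a two-stage retrieve-and-extract framework. The framework first links a finding stated in the abstract to the corresponding hypothesis in the paper body, then extracts the statistical evidence that supports or refutes that hypothesis. Controlled experiments compare context selection strategies (context quantity, retrieval quality, and an oracle paragraph setting) against full-text prompting. Targeted context selection consistently improves hypothesis extraction, especially when retrieval quality and context cleanliness are optimized. Statistical evidence extraction remains difficult: even with oracle paragraphs, performance stays moderate, which points to extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.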

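Contributions:

A two-stage sequential extraction framework that links an abstract finding to a hypothesis in the paper body and then to statistical evidence, addressing the spread of scientific arguments across sections.
A controlled study that varies context quantity and retrieval quality (including reranking and a fine-tuned retriever) and adds an oracle setting to separate retrieval failures from extractor limitations.
A task-dependent bottleneck: hypothesis extraction is limited mainly by retrieval quality and context cleanliness, whereas statistical evidence extraction is limited by the extractor's ability to handle hybrid numeric-textual statements.
Evidence that, in within-document retrieval, targeted context selection outperforms prompting with the full text, particularly when hard negatives are topically related but differ in rhetorical role.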

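Method

!!! info

The paper uses a two-stage retrieve-and-extract pipeline. In the first stage, the finding from the abstract serves as the query for retrieving candidate paragraphs, and an LLM extracts the core hypothesis from them. In the second stage, the abstract finding and the extracted hypothesis are combined into a query for retrieving candidate paragraphs that contain statistical evidence. Within the retrieval-augmented generation (RAG) paradigm, the study compares retrieval configurations (top-k paragraph count, standard dense retrieval, reranking, and a fine-tuned retriever) and uses an oracle paragraph setting to bound achievable performance and to separate retrieval errors from extraction errors.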
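The two-stage flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: token overlap stands in for dense retrieval, and the keyword-based `toy_hypothesis_extractor` and `toy_evidence_extractor` are hypothetical placeholders for LLM extractor calls.

```python
from typing import Callable, List, Tuple


def tokenize(text: str) -> set:
    """Lowercased whitespace tokens; a toy stand-in for a real embedding."""
    return set(text.lower().split())


def retrieve(query: str, paragraphs: List[str], k: int) -> List[str]:
    """Return the k paragraphs with the largest token overlap with the query."""
    q = tokenize(query)
    ranked = sorted(paragraphs, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def two_stage_extract(
    finding: str,
    paragraphs: List[str],
    extract_hypothesis: Callable[[str, List[str]], str],
    extract_evidence: Callable[[str, str, List[str]], str],
    k: int = 5,
) -> Tuple[str, str]:
    # Stage 1: the abstract finding is the query; the extractor returns the hypothesis.
    hypothesis = extract_hypothesis(finding, retrieve(finding, paragraphs, k))
    # Stage 2: finding + hypothesis form the query for statistical evidence.
    context = retrieve(finding + " " + hypothesis, paragraphs, k)
    return hypothesis, extract_evidence(finding, hypothesis, context)


# Hypothetical keyword-based extractors standing in for LLM calls.
def toy_hypothesis_extractor(finding: str, context: List[str]) -> str:
    return next((p for p in context if "hypothesi" in p.lower()), "")


def toy_evidence_extractor(finding: str, hypothesis: str, context: List[str]) -> str:
    return next((p for p in context if "p =" in p or "p <" in p), "")


# Invented example paragraphs, for illustration only.
paragraphs = [
    "We hypothesize that sleep duration predicts memory recall.",
    "Participants were recruited from a university campus.",
    "Sleep duration predicted memory recall, t(48) = 2.91, p = 0.005.",
]
finding = "Sleep duration predicts memory recall."

hypothesis, evidence = two_stage_extract(
    finding, paragraphs, toy_hypothesis_extractor, toy_evidence_extractor, k=2
)
```

Passing the gold paragraphs directly to the extractors instead of calling `retrieve` would correspond to the oracle setting: any remaining error then comes from the extractor rather than from retrieval.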

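Key results:

Targeted context selection consistently outperforms full-text prompting on hypothesis extraction, and increasing the number of retrieved paragraphs (from k=5 to k=20) is generally beneficial.
Reranking and a fine-tuned retriever improve performance in both stages, but statistical evidence extraction still falls well short of the oracle level.
Hypothesis extraction is bottlenecked mainly by retrieval quality and distracting context, whereas statistical evidence extraction reaches an F1 of only 0.47 to 0.55 even with perfect paragraphs, indicating that the extractor itself is the main bottleneck.
LLMs have significant difficulty with hybrid textual-numeric statements that contain numeric values, test statistics, and p-values.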

Tech stack: Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), Dense Retrieval, Reranking Models, Fine-tuning

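Strengths

A rigorous oracle setting that cleanly separates errors in the retrieval stage from errors in the extraction stage.
An in-depth analysis of hard negatives in within-document retrieval, that is, paragraphs that are topically related but have the wrong rhetorical role and therefore interfere with extraction.
A clear distinction between hypothesis extraction (semantic alignment) and evidence extraction (numeric information extraction), which gives follow-up work a clear direction.
Practical insights for optimizing information retrieval over scientific literature, with an emphasis on context quality.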

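Limitations

Statistical evidence extraction performance remains low: even with perfect retrieval, current LLM extractors show significant limitations on hybrid numeric-textual data.
The study covers only empirical findings backed by statistical evidence and excludes other kinds of scholarly reasoning such as theoretical arguments.
Preprocessing excludes tables, figures, and equations, which often contain key statistics, so evidence extraction is necessarily incomplete.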

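Relevance to the research direction:

This paper is highly relevant. It applies LLMs directly to a concrete problem in science (hypothesis and evidence extraction), which places it within LLM applications for science. It also examines retrieval-augmented generation (RAG), context window management, and the effect of retrieval configuration on model performance, touching on both the principles and the application of LLM techniques. The study targets the difficulties of processing long scientific documents and offers both novelty and practical value.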

📋 All papers

1. ✅ Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment

Authors: Xinyu Zhang Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21558v1

Score: 75.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
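The total in the score line can be reproduced from the table. The sketch below assumes that each keyword's score is its weight times its relevance and that the total is the sum over keywords; that rule is inferred from the table, not stated by the report. The pass threshold uses the formula from the report's configuration section (5.0 + 0.8 × total weight, with a total weight of 27.0).

```python
BASE = 5.0    # constant term of the pass-score formula in the report's configuration
FACTOR = 0.8  # multiplier on the total keyword weight


def pass_threshold(total_weight: float) -> float:
    """Pass score = 5.0 + 0.8 * total weight."""
    return BASE + FACTOR * total_weight


def total_score(rows) -> float:
    """Assumed rule: sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero rows of the table above as (weight, relevance); zero rows add nothing.
paper1_rows = [
    (1.0, 10.0),  # Large Language Models
    (1.0, 5.0),   # Scaling Laws AND Data Quality
    (1.0, 10.0),  # Instruction Tuning / Alignment
    (1.0, 10.0),  # RLHF / DPO
    (1.0, 10.0),  # Chain of Thought
    (1.0, 10.0),  # System 2 Thinking
    (1.0, 10.0),  # Self-Correction / Self-Improvement
    (1.0, 5.0),   # Hallucination Mitigation
    (1.0, 5.0),   # Mechanistic Interpretability
]

score = total_score(paper1_rows)
threshold = pass_threshold(27.0)
passed = score >= threshold
```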

!!! tip deepseek-chat TL;DR

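This paper addresses recursive drift in the recursive self-training of large models. It proposes the Neuro-Symbolic Recursive Self-Alignment (NSRSA) framework, which filters training data through symbolic verification at the reasoning-step level, stabilizing iterative self-training and making the model's reasoning more reliable.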

Abstract (Chinese translation)

递归自我改进——即模型基于自身输出进行迭代训练——虽能带来持续的能力增长，却面临一个根本性障碍：递归漂移。当模型在多轮迭代中基于自生成数据进行训练时，中间推理过程中的误差会不断累积，导致模式崩溃与性能下降。我们提出神经符号递归自对齐（Neuro-Symbolic Recursive Self-Alignment, NSRSA），该方法通过嵌入一个符号验证子系统，在推理步骤层面控制训练数据质量，从而稳定迭代式自训练。与仅基于结果的过滤方法（会纳入推理过程存在缺陷的“侥幸猜中”答案）不同，NSRSA通过sympy验证每个算术运算步骤，检查推理步骤间的逻辑流一致性，并强制执行领域约束。我们在GSM8K数据集上使用Qwen3-4B-Thinking模型，在五种条件下（无验证、结果验证、多数投票、完整NSRSA符号验证、以及NSRSA结合DPO）进行了5轮自训练迭代评估。我们的过滤分析表明，NSRSA拒绝了约34%能通过结果验证的正确答案解法，从而将推理有误的“侥幸猜中”样本从训练集中剔除。我们进一步证明，基于NSRSA验证结果构建的DPO偏好对，能够教会模型区分严谨与有缺陷的推理（奖励模型准确率从46%提升至63%）。NSRSA提供了一个可扩展的框架，证明了在可实现自动验证的领域内，外部符号验证如何使递归自我改进变得可衡量且可靠。

Abstract

Recursive self-improvement–where a model iteratively trains on its own outputs–promises sustained capability growth but faces a fundamental obstacle: recursive drift. As models train on self-generated data across multiple iterations, errors in intermediate reasoning compound, leading to mode collapse and performance degradation. We propose Neuro-Symbolic Recursive Self-Alignment (NSRSA), which stabilizes iterative self-training by embedding a symbolic verification subsystem that gates training data quality at the reasoning step level. Unlike outcome-only filtering (which admits “lucky guesses” with flawed reasoning), NSRSA verifies each arithmetic operation via sympy, checks logical flow consistency across reasoning steps, and enforces domain constraints. We evaluate NSRSA on GSM8K using Qwen3-4B-Thinking across 5 self-training iterations under five conditions: no verification, outcome verification, majority voting, full NSRSA symbolic verification, and NSRSA with DPO. Our filtering analysis shows that NSRSA rejects approximately 34% of correct-answer solutions that pass outcome verification, eliminating “lucky guesses” with flawed reasoning from the training set. We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%). NSRSA provides an extensible framework that demonstrates how external symbolic verification can make recursive self-improvement measurable and reliable within domains where automated verification is available.

Keywords: recursive self-improvement, self-training, symbolic verification, reasoning steps, data quality filtering, DPO, alignment, large language models
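The difference between outcome-only filtering and step-level verification can be illustrated with a small sketch. The abstract says NSRSA checks arithmetic with sympy; to stay dependency-free, this example substitutes Python's `ast` module, and it assumes each reasoning step has already been parsed into an `expression = value` string. The sample format and function names are hypothetical.

```python
import ast
import operator

# Arithmetic operators the toy verifier accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr.strip(), mode="eval").body)


def step_is_valid(step: str, tol: float = 1e-9) -> bool:
    """Check one 'expression = value' step by recomputing the expression."""
    expr, _, claimed = step.partition("=")
    return abs(safe_eval(expr) - float(claimed)) <= tol


def keep_by_outcome(sample: dict, gold: float) -> bool:
    """Outcome-only filter: keep any sample whose final answer is correct."""
    return sample["answer"] == gold


def keep_by_steps(sample: dict, gold: float) -> bool:
    """Step-level filter: correct answer AND every intermediate step verifies."""
    return keep_by_outcome(sample, gold) and all(
        step_is_valid(s) for s in sample["steps"]
    )


# Invented samples: both reach the gold answer 72, but only one reasons correctly.
sound = {"steps": ["48 / 2 = 24", "48 + 24 = 72"], "answer": 72}
lucky = {"steps": ["48 / 2 = 26", "48 + 26 = 72"], "answer": 72}
```

Here `lucky` passes the outcome filter but fails the step filter, which is the kind of "lucky guess" the abstract describes being removed from the training set. A fuller verifier would also check logical flow between steps and domain constraints, which this sketch omits.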

2. ✅ Mind over Space: Can Multimodal Large Language Models Mentally Navigate?

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

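This paper addresses the lack of long-horizon spatial reasoning in multimodal large language models used in embodied agents. It introduces the Video2Mental benchmark and the NavMind model, which substantially improves mental navigation performance through explicit cognitive maps and progressive supervised fine-tuning.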

Abstract (Chinese translation)

尽管多模态大语言模型（MLLMs）在具身智能体中被广泛采用，但其能力仍主要局限于基于即时观察的反应式规划，在跨越广阔时空尺度的空间推理方面始终表现不佳。认知科学揭示，生物智能（BI）的优势在于“心理导航”：即从经验中策略性地构建空间表征，并在行动前进行路径的心理模拟。为弥合人工智能（AI）与生物智能之间的差距，我们提出了Video2Mental，一个用于评估MLLMs心理导航能力的开创性基准。该任务要求从长时第一人称视角视频中构建分层认知地图，并逐步生成基于地标的路径规划，其规划准确性通过基于模拟器的物理交互进行验证。我们的基准测试结果表明，心理导航能力并未从标准预训练中自然涌现。前沿的MLLMs在零样本结构化空间表征方面存在显著困难，且其规划准确性随规划时域的延长而急剧下降。为克服这一局限，我们提出了NavMind，一种推理模型，它通过将显式、细粒度的认知地图作为可学习的中间表征，将心理导航过程内化。通过采用难度分层的渐进式监督微调范式，NavMind有效弥合了原始感知与结构化规划之间的鸿沟。实验证明，NavMind实现了卓越的心理导航能力，显著超越了前沿的商业及空间专用MLLMs。

Abstract

Despite the widespread adoption of MLLMs in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales. Cognitive science reveals that Biological Intelligence (BI) thrives on “mental navigation”: the strategic construction of spatial representations from experience and the subsequent mental simulation of paths prior to action. To bridge the gap between AI and BI, we introduce Video2Mental, a pioneering benchmark for evaluating the mental navigation capabilities of MLLMs. The task requires constructing hierarchical cognitive maps from long egocentric videos and generating landmark-based path plans step by step, with planning accuracy verified through simulator-based physical interaction. Our benchmarking results reveal that mental navigation capability does not naturally emerge from standard pre-training. Frontier MLLMs struggle profoundly with zero-shot structured spatial representation, and their planning accuracy decays precipitously over extended horizons. To overcome this, we propose NavMind, a reasoning model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations. Through a difficulty-stratified progressive supervised fine-tuning paradigm, NavMind effectively bridges the gap between raw perception and structured planning. Experiments demonstrate that NavMind achieves superior mental navigation capabilities, significantly outperforming frontier commercial and spatial MLLMs.

Keywords: Multimodal Large Language Models, Mental Navigation, Spatial Reasoning, Cognitive Maps, Supervised Fine-tuning, Embodied Agents, World Models, Step-by-step Planning

3. ✅ Probing How Scalable Table Data Enhances General Long-Context Reasoning

Score: 64.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

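This paper studies how structured table data can strengthen the long-context reasoning of large language models. Through mathematical analysis and experimental validation, it proposes a scalable data synthesis pipeline that significantly improves LLM performance on multiple long-context benchmarks.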

Abstract (Chinese translation)

随着现实世界任务日益复杂，长上下文推理已成为大语言模型（LLM）的核心能力。然而，目前少有研究探讨何种数据类型对长上下文推理有效及其原因。我们发现具有周期性结构的结构化表格数据展现出长上下文推理的强大潜力。基于这一观察，我们利用互信息对表格依赖结构进行数学分析，揭示了表格数据中周期性不衰减的依赖关系。进一步，我们系统分析了结构化表格数据的能力，开展了相关的扩展实验，并验证了其增强长上下文推理的内在机制，从而获得了若干有意义的洞见。基于这些洞见，我们提出了一种简单且可扩展的流程（TableLong），用于合成高质量、多样化且可验证的结构化表格数据，以通过强化学习（RL）提升长上下文推理能力。大量实验结果表明，表格数据显著提升了LLM在多个长上下文基准测试上的推理能力（平均提升8.24%），甚至改善了其在领域外基准测试上的性能（平均提升8.06%）。我们希望这些洞见能为提升LLM长上下文推理能力的有效后训练数据提供实用指导。

Abstract

As real-world tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline (TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24% on average), and even improves performance on out-of-domain benchmarks (+8.06% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.

Keywords: Large Language Models, Long-context reasoning, Structured table data, Post-training, Reinforcement Learning, Scaling experiments, Mutual information analysis, Benchmark evaluation

4. ✅ Improving Coherence and Persistence in Agentic AI for System Optimization

Authors: Pantea Karimi, Kimia Noorbakhsh, Mohammad Alizadeh, Hari Balakrishnan Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21321v1

Score: 62.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	8.0/10	8.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	8.0/10	8.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

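This paper addresses LLM failures in complex system optimization caused by evolutionary neighborhood bias and the coherence ceiling. It proposes Engram, an agentic architecture that decouples long-horizon exploration from the limits of a single context window by storing knowledge persistently, and it reports superior performance across several domains (multi-cloud multicast, LLM inference request routing, and KV cache reuse optimization).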

Abstract (Chinese translation)

设计高性能系统启发式方法是一个创造性、迭代性的过程，需要专家提出假设并执行多步骤的概念转换。尽管大型语言模型（LLMs）在自动化这一循环中展现出潜力，但由于两种关键失效模式——进化邻域偏差和连贯性上限——它们在处理复杂系统问题时仍面临困难。进化方法依赖标量基准分数，常常陷入局部最优解，当需要协调的多步骤变更时便会失效。相反，现有的智能体框架在长周期中会遭遇上下文退化，或无法在独立运行中积累知识。
我们提出了Engram，一种智能体研究者架构，通过将长周期探索与单一上下文窗口的限制解耦，以应对这些局限。Engram将探索组织为一系列智能体的序列，这些智能体迭代地设计、测试和分析机制。在每次运行结束时，一个智能体会将代码快照、日志和结果存储到持久化档案库中，并将高层建模洞见提炼为一份简洁、持久的研究摘要。随后的智能体则从一个全新的上下文窗口开始，通过阅读研究摘要来基于先前的发现进行构建。
我们发现，Engram在多个领域均表现出卓越性能，包括多云组播、LLM推理请求路由，以及通过自然语言查询优化数据库中的KV缓存复用。

Abstract

Designing high-performance system heuristics is a creative, iterative process requiring experts to form hypotheses and execute multi-step conceptual shifts. While Large Language Models (LLMs) show promise in automating this loop, they struggle with complex system problems due to two critical failure modes: evolutionary neighborhood bias and the coherence ceiling. Evolutionary methods often remain trapped in local optima by relying on scalar benchmark scores, failing when coordinated multi-step changes are required. Conversely, existing agentic frameworks suffer from context degradation over long horizons or fail to accumulate knowledge across independent runs. We present Engram, an agentic researcher architecture that addresses these limitations by decoupling long-horizon exploration from the constraints of a single context window. Engram organizes exploration into a sequence of agents that iteratively design, test, and analyze mechanisms. At the conclusion of each run, an agent stores code snapshots, logs, and results in a persistent Archive and distills high-level modeling insights into a compact, persistent Research Digest. Subsequent agents then begin with a fresh context window, reading the Research Digest to build on prior discoveries. We find that Engram exhibits superior performance across diverse domains including multi-cloud multicast, LLM inference request routing, and optimizing KV cache reuse in databases with natural language queries.

Keywords: Agentic AI, Large Language Models, System Optimization, Multi-agent Systems, Context Window, KV Cache, Coherence, Persistence
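The control flow described in the abstract (a sequence of agents, each starting from a fresh context and reading only the Research Digest) can be sketched as follows. This is a minimal illustration, not Engram's actual code: the `Archive` and `ResearchDigest` classes, the `AgentRun` signature, and `toy_agent` are all hypothetical names, and a real agent run would call an LLM to design, test, and analyze mechanisms.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Archive:
    """Persistent store of full run artifacts (code snapshots, logs, results)."""
    runs: List[Dict] = field(default_factory=list)


@dataclass
class ResearchDigest:
    """Compact, persistent list of high-level insights."""
    insights: List[str] = field(default_factory=list)


# One agent run sees only the digest text and returns (artifacts, distilled insight).
AgentRun = Callable[[str], Tuple[Dict, str]]


def explore(run_agent: AgentRun, num_agents: int) -> Tuple[Archive, ResearchDigest]:
    archive, digest = Archive(), ResearchDigest()
    for _ in range(num_agents):
        # Fresh context: the agent gets the digest, not earlier transcripts.
        context = "\n".join(digest.insights)
        artifacts, insight = run_agent(context)
        archive.runs.append(artifacts)   # full detail goes to the archive
        digest.insights.append(insight)  # only the distilled insight carries over
    return archive, digest


def toy_agent(context: str) -> Tuple[Dict, str]:
    """Hypothetical agent: counts how many prior insights it was given."""
    seen = len(context.splitlines())
    return {"prior_insights_seen": seen}, f"insight {seen}"


archive, digest = explore(toy_agent, num_agents=3)
```

The point of the design is that context size per agent depends on the digest length, not on the total length of all earlier runs, while the archive keeps the complete record.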

5. ✅ EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises

Authors: Ankush Agarwal, Harsh Vishwakarma, Suraj Nagaje, Chaitanya Devaguptapu Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21630v1

Score: 56.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	15.0/10	15.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

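This paper presents EnterpriseLab, a platform that addresses fragmented development pipelines in enterprise AI agent deployment. By unifying tool integration, data generation, and training, it enables 8B-parameter small language models to match GPT-4o on complex enterprise workflows while cutting inference costs by 8-10x.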

Abstract (Chinese translation)

在企业环境中部署人工智能代理，需要在能力与数据主权及成本限制之间取得平衡。虽然小型语言模型为前沿模型提供了保护隐私的替代方案，但其专业化进程受到开发流程碎片化的阻碍——这些流程将工具集成、数据生成和训练相互割裂。我们推出EnterpriseLab，一个全栈平台，将这些阶段统一到一个闭环框架中。EnterpriseLab提供：(1) 一个模块化环境，通过模型上下文协议（Model Context Protocol）暴露企业应用程序，实现专有工具与开源工具的无缝集成；(2) 自动化轨迹合成，能够根据环境模式以编程方式生成训练数据；(3) 集成的训练管道与持续评估机制。我们通过EnterpriseArena对该平台进行了验证，这是一个在IT、人力资源、销售和工程领域包含15个应用程序和140多个工具的具体实例。我们的结果表明，在EnterpriseLab中训练的80亿参数模型，在复杂企业工作流上的性能与GPT-4o相当，同时将推理成本降低了8-10倍，并且在包括EnterpriseBench（+10%）和CRMArena（+10%）在内的多种企业基准测试中均保持稳健性。EnterpriseLab为企业提供了一条可行的路径，使其能够部署能力强、保护隐私的智能代理，同时不牺牲运营能力。

Abstract

Deploying AI agents in enterprise environments requires balancing capability with data sovereignty and cost constraints. While small language models offer privacy-preserving alternatives to frontier models, their specialization is hindered by fragmented development pipelines that separate tool integration, data generation, and training. We introduce EnterpriseLab, a full-stack platform that unifies these stages into a closed-loop framework. EnterpriseLab provides (1) a modular environment exposing enterprise applications via Model Context Protocol, enabling seamless integration of proprietary and open-source tools; (2) automated trajectory synthesis that programmatically generates training data from environment schemas; and (3) integrated training pipelines with continuous evaluation. We validate the platform through EnterpriseArena, an instantiation with 15 applications and 140+ tools across IT, HR, sales, and engineering domains. Our results demonstrate that 8B-parameter models trained within EnterpriseLab match GPT-4o’s performance on complex enterprise workflows while reducing inference costs by 8-10x, and remain robust across diverse enterprise benchmarks, including EnterpriseBench (+10%) and CRMArena (+10%). EnterpriseLab provides enterprises a practical path to deploying capable, privacy-preserving agents without compromising operational capability.

Keywords: AI agents, enterprise deployment, small language models, tool integration, automated trajectory synthesis, training pipelines, privacy-preserving, inference cost reduction

6. ✅ Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs

Authors: Mariela M. Nina, Caio Veloso Costa, Lilian Berton, Didier A. Vega-Oliveros Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21418v1

Score: 53.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	15.0/10	15.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	10.0/10	10.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

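This study systematically evaluates parameter-efficient fine-tuning (PEFT) and quantization for Portuguese question answering. It finds that methods such as LoRA substantially reduce computational cost (73.5% less training time) while retaining high performance, and that small encoder models are more efficient than generative large language models.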

Abstract (Chinese translation)

尽管大语言模型已变革了自然语言处理领域，但其计算成本为巴西葡萄牙语等低资源语言的可及性设置了障碍。本研究对应用于BERTimbau模型的参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）与量化技术进行了系统性评估，任务为基于巴西葡萄牙语版SQuAD v1（SQuAD-BR）的问答任务。我们评估了40种配置组合，涵盖四种PEFT方法（LoRA、DoRA、QLoRA、QDoRA）及两种模型规模（Base版：1.1亿参数，Large版：3.35亿参数）。研究结果揭示了三个关键发现：（1）在BERTimbau-Large模型上，LoRA方法能达到基线性能的95.8%，同时将训练时间减少73.5%（F1分数为81.32对比84.86）；（2）较高的学习率（2e-4）能显著提升PEFT性能，相较于标准学习率可获得最高达+19.71分的F1分数提升；（3）更大规模的模型展现出两倍的量化鲁棒性（F1分数损失为4.83对比9.56分）。这些结果表明，基于编码器的模型可通过高效微调应用于巴西葡萄牙语抽取式问答任务，其计算成本远低于大型生成式大语言模型，从而推动了符合“绿色人工智能”原则的可持续方法。对Tucano和Sabiá模型在同一抽取式问答基准上的探索性评估显示：虽然生成式模型通过LoRA微调可获得有竞争力的F1分数，但其所需GPU内存最高达BERTimbau-Base的4.2倍，训练时间多出3倍，这进一步印证了基于编码器的小型架构在此类任务中的效率优势。

Abstract

Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1=81.32 vs 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles. An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to 4.2× more GPU memory and 3× more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.

Keywords: Parameter-Efficient Fine-Tuning, LoRA, Quantization, Portuguese Question Answering, BERTimbau, Computational Efficiency, Green AI, Extractive QA
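Two of the headline numbers in the abstract can be cross-checked with simple arithmetic. The sketch below only recomputes ratios from the figures quoted above; it assumes the 4.83-point loss belongs to the Large model and the 9.56-point loss to the Base model, which is how the "larger models show twice the resilience" sentence reads.

```python
# F1 scores quoted in the abstract for BERTimbau-Large.
lora_f1 = 81.32
baseline_f1 = 84.86

# Share of baseline F1 that LoRA retains, as a percentage.
retained_pct = round(100 * lora_f1 / baseline_f1, 1)

# F1 points lost under quantization (assumed assignment: Large vs Base).
large_loss = 4.83
base_loss = 9.56

# How many times larger the Base model's loss is than the Large model's.
loss_ratio = round(base_loss / large_loss, 2)

print(f"LoRA retains {retained_pct}% of baseline F1")
print(f"Base loses {loss_ratio}x as many F1 points as Large")
```

The retained share comes out at the 95.8% the abstract reports, and the loss ratio is just under 2, consistent with the "twice the quantization resilience" claim.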

7. ✅ User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction

Authors: Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20939v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

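Scoring rationale: The paper's core topic is personalization for LLMs acting as conversational agents (LLM Agents). It proposes the VARS framework, which uses retrieval augmentation (Retrieval-Augmented Generation) to model user preferences, so it is highly relevant to "Large Language Models", "LLM Agents", and "Retrieval-Augmented Generation" (10 points each). The framework learns from user feedback, which relates somewhat to "Self-Correction" (5 points), and the dual-vector design supports interpretability, which relates somewhat to "Explainable AI" (5 points). Other keywords such as MoE, SFT, and quantization are not covered in the paper and receive 0 points.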

!!! tip deepseek-chat TL;DR

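This paper addresses the lack of a persistent user model in conversational LLM agents. It proposes the VARS framework, which models user preferences through retrieval-augmented interaction and learning from weak rewards. With the backbone model frozen, it improves interaction efficiency and matches a strong baseline in task success rate.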

Abstract (Chinese translation)

大型语言模型正日益被用作个人助手，然而大多数模型缺乏持久的用户模型，迫使用户在不同会话中反复重述偏好。我们提出向量自适应检索评分（VARS），这是一个与具体流程无关、主干网络冻结的框架，通过在共享偏好空间中使用长期与短期向量来表示每位用户，并利用这些向量对结构化偏好记忆的检索评分进行偏置调整。这些向量通过用户反馈产生的弱标量奖励进行在线更新，从而实现无需针对每位用户进行微调的个性化适配。我们在MultiSessionCollab（一个包含丰富用户偏好档案的在线多会话协作基准测试）上对数学和编程任务进行了评估。在主干网络冻结的条件下，用户感知检索的主要优势在于提升交互效率，而非大幅提高原始任务准确率：我们完整的VARS智能体实现了最强的综合性能，在任务成功率上与强大的反思基线模型持平，同时降低了超时率和用户操作负担。学习到的长期向量与跨用户偏好重叠度保持一致，而短期向量则捕捉了会话特定的适应性，这支持了双向量设计的可解释性。代码、模型与数据可在https://github.com/YurenHao0426/VARS获取。

Abstract

Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector-Adapted Retrieval Scoring (VARS), a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users’ feedback, enabling personalization without per-user fine-tuning. We evaluate on MultiSessionCollab, an online multi-session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user-aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long-term vectors also align with cross-user preference overlap, while short-term vectors capture session-specific adaptation, supporting the interpretability of the dual-vector design. Code, model, and data are available at https://github.com/YurenHao0426/VARS.

关键词: Large Language Models, LLM Agents, Retrieval-Augmented Generation, User Preference Modeling, Personalization, Weak Rewards, Frozen Backbone, Multi-session Collaboration
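
The dual-vector idea in the abstract can be pictured with a small toy. The sketch below is not the VARS implementation: the additive scoring rule, the reward-scaled update, the learning rates, and every name (`UserState`, `biased_score`, `update`, `end_session`) are assumptions chosen only to show how a long-term and a short-term vector could bias retrieval scores while no model weights change.

```python
# Illustrative sketch only: the scoring rule, the update rule, and every name
# and constant below are assumptions, not the VARS implementation.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

class UserState:
    """Hypothetical per-user state: one long-term and one short-term vector."""

    def __init__(self, dim):
        self.long_term = [0.0] * dim
        self.short_term = [0.0] * dim

def biased_score(query, memory_vec, user, bias_weight=1.0):
    """Query-memory similarity plus a user-specific bias from both vectors."""
    user_vec = [l + s for l, s in zip(user.long_term, user.short_term)]
    return dot(query, memory_vec) + bias_weight * dot(user_vec, memory_vec)

def update(user, memory_vec, reward, lr_long=0.05, lr_short=0.5):
    """Online update from a weak scalar reward (positive or negative)."""
    user.long_term = [l + lr_long * reward * m for l, m in zip(user.long_term, memory_vec)]
    user.short_term = [s + lr_short * reward * m for s, m in zip(user.short_term, memory_vec)]

def end_session(user):
    """Clear session-specific adaptation; the long-term vector persists."""
    user.short_term = [0.0] * len(user.short_term)

# Two toy preference memories in a 2-d preference space.
concise, verbose = [1.0, 0.0], [0.0, 1.0]
query = [0.5, 0.5]
user = UserState(dim=2)
update(user, concise, reward=1.0)  # the user reacted well to a concise answer
in_session = biased_score(query, concise, user) - biased_score(query, verbose, user)
end_session(user)
next_session = biased_score(query, concise, user) - biased_score(query, verbose, user)
print(in_session, next_session)
```

In this toy, the preference for the rewarded memory is strong within the session and weaker but still present in the next one, which mirrors the split between session-specific and persistent adaptation described in the abstract.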

8. ✅ SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
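
The total and the pass mark above can be checked against the report's configuration, which gives the pass threshold as 5.0 + 0.8 × total weight. The sketch below uses hypothetical names (`passing_score`, `total_score`) and assumes that each row's score is weight × relevance, which is consistent with the table but cannot be confirmed from it because every weight is 1.0.

```python
# Sketch only: names are illustrative, and "score = weight * relevance" is an
# assumption consistent with the rows above (every weight is 1.0).
BASE = 5.0
PER_WEIGHT = 0.8

def passing_score(total_weight: float) -> float:
    """Pass threshold from the report configuration: 5.0 + 0.8 * total weight."""
    return round(BASE + PER_WEIGHT * total_weight, 1)

def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum weight * relevance over (weight, relevance) rows."""
    return sum(weight * relevance for weight, relevance in rows)

# Non-zero rows of the SpatialReward table: (weight, relevance).
rows = [(1.0, 5.0), (1.0, 8.0), (1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 5.0)]
threshold = passing_score(total_weight=27.0)  # 27 keywords, weight 1.0 each
total = total_score(rows)
print(total, threshold, total >= threshold)
```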

!!! tip deepseek-chat TL;DR

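To address the inadequate evaluation of fine-grained spatial relationships in text-to-image generation, the paper proposes SpatialReward, a verifiable spatial reward model that combines prompt decomposition, expert detection, and chain-of-thought reasoning, consistently improving the spatial consistency and overall quality of generated images.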

Abstract

Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.

Keywords: text-to-image generation, spatial consistency, reward modeling, reinforcement learning, chain-of-thought reasoning, vision-language model, spatial relationships, verifiable evaluation

9. ✅ Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion

Authors: Shixu Liu | Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.21673v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

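The paper proposes WeatherTGD, a training-free multi-agent framework that uses Text Gradient Descent and a Consensus-Aware Gradient Fusion mechanism. Three specialized LLM agents collaborate to generate high-quality, interpretable natural language descriptions from weather time series data, substantially outperforming existing multi-agent baselines on real-world meteorological datasets.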

Abstract

Generating interpretable natural language captions from weather time series data remains a significant challenge at the intersection of meteorological science and natural language processing. While recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in time series forecasting and analysis, existing approaches either produce numerical predictions without human-accessible explanations or generate generic descriptions lacking domain-specific depth. We introduce WeatherTGD, a training-free multi-agent framework that reinterprets collaborative caption refinement through the lens of Text Gradient Descent (TGD). Our system deploys three specialized LLM agents including a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert that generate domain-specific textual gradients from weather time series observations. These gradients are aggregated through a novel Consensus-Aware Gradient Fusion mechanism that extracts common signals while preserving unique domain perspectives. The fused gradients then guide an iterative refinement process analogous to gradient descent, where each LLM-generated feedback signal updates the caption toward an optimal solution. Experiments on real-world meteorological datasets demonstrate that WeatherTGD achieves significant improvements in both LLM-based evaluation and human expert evaluation, substantially outperforming existing multi-agent baselines while maintaining computational efficiency through parallel agent execution.

Keywords: Large Language Models, Multi-agent Systems, LLM Agents, Weather Captioning, Text Gradient Descent, Training-Free Approach, Consensus-Aware Gradient Fusion, Meteorological Science

10. ✅ Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles

Authors: Sai Koneru, Jian Wu, Sarah Rajtmajer | Source: arXiv | Published: 2026-03-22 | arXiv link: http://arxiv.org/abs/2603.21193v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

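The paper studies how to extract hypotheses and their statistical evidence from full-text scientific articles. It finds that targeted context selection that optimizes retrieval quality consistently improves hypothesis extraction, whereas statistical evidence extraction remains challenging, indicating that current extractors are limited in handling hybrid numeric-textual statements.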

Abstract

Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. The work studies a sequential full-text extraction setting, where the statement of a primary finding in an article’s abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices, varying context quantity, context quality (standard Retrieval Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), as well as an oracle paragraph setting to separate retrieval failures from extraction limits across four Large Language Model extractors. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.

11. ❌ Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

Authors: Rui Yang Tan, Yujia Hu, Roy Ka-Wei Lee | Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.21697v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

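Scoring rationale: The paper studies safety-alignment vulnerabilities in multimodal large language models (MLLMs), centering on LLMs and Alignment. Because it explicitly examines safety-alignment failure modes of MLLMs (an extension of LLMs), it is highly relevant to "Large Language Models" and "Alignment" (10 points each). It also touches on safety evaluation and factuality (for example, unreliable safety evaluators), which relates loosely to "Hallucination Mitigation" (5 points). Other keywords such as MoE, Scaling Laws, Pre-training, RLHF, RAG, and Agents are not covered in technical detail or application, so they score 0.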

!!! tip deepseek-chat TL;DR

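The paper finds that visual-narrative attacks built on comic templates can effectively bypass the safety alignment of multimodal large language models, with ensemble success rates exceeding 90% on several commercial models, and shows that existing defenses incur a high refusal rate on benign content.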

Abstract

Multimodal Large Language Models (MLLMs) extend text-only LLMs with visual reasoning, but also introduce new safety failure modes under visually grounded instructions. We study comic-template jailbreaks that embed harmful goals inside simple three-panel visual narratives and prompt the model to role-play and “complete the comic.” Building on JailbreakBench and JailbreakV, we introduce ComicJailbreak, a comic-based jailbreak benchmark with 1,167 attack instances spanning 10 harm categories and 5 task setups. Across 15 state-of-the-art MLLMs (six commercial and nine open-source), comic-based attacks achieve success rates comparable to strong rule-based jailbreaks and substantially outperform plain-text and random-image baselines, with ensemble success rates exceeding 90% on several commercial models. Then, with the existing defense methodologies, we show that these methods are effective against the harmful comics, they will induce a high refusal rate when prompted with benign prompts. Finally, using automatic judging and targeted human evaluation, we show that current safety evaluators can be unreliable on sensitive but non-harmful content. Our findings highlight the need for safety alignment robust to narrative-driven multimodal jailbreaks.

Keywords: Multimodal Large Language Models, Safety Alignment, Jailbreak, Visual Narratives, Comic-based Attack, Benchmark, Safety Evaluation, MLLMs

12. ❌ SynSym: A Synthetic Data Generation Framework for Psychiatric Symptom Identification

Authors: Migyeong Kang, Jihyun Kim, Hyolim Jeon, Sunwoo Hwang, Jihyun An, Yonghoon Kim, Haewoon Kwak, Jisun An, Jinyoung Han | Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.21529v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

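Scoring rationale: The paper's core contribution is using large language models (LLMs) to generate synthetic data that addresses data scarcity in psychiatric symptom identification, an application of large models to science (specifically mental health / bioinformatics). It is therefore highly relevant to "Large Language Models" (10 points) and to "AI for Science" (10 points). The paper notes that models can be further fine-tuned on real data, which relates loosely to "Post-training" (5 points). Other keywords such as MoE, Scaling Laws, RLHF, and RAG are not addressed and score 0.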

!!! tip deepseek-chat TL;DR

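The paper proposes SynSym, a synthetic data generation framework that uses large language models to produce high-quality, diverse training data for psychiatric symptom identification. Experiments show that models trained solely on the synthetic data perform comparably to models trained on real data and benefit further from fine-tuning on real data.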

Abstract

Psychiatric symptom identification on social media aims to infer fine-grained mental health symptoms from user-generated posts, allowing a detailed understanding of users’ mental states. However, the construction of large-scale symptom-level datasets remains challenging due to the resource-intensive nature of expert labeling and the lack of standardized annotation guidelines, which in turn limits the generalizability of models to identify diverse symptom expressions from user-generated text. To address these issues, we propose SynSym, a synthetic data generation framework for constructing generalizable datasets for symptom identification. Leveraging large language models (LLMs), SynSym constructs high-quality training samples by (1) expanding each symptom into sub-concepts to enhance the diversity of generated expressions, (2) producing synthetic expressions that reflect psychiatric symptoms in diverse linguistic styles, and (3) composing realistic multi-symptom expressions, informed by clinical co-occurrence patterns. We validate SynSym on three benchmark datasets covering different styles of depressive symptom expression. Experimental results demonstrate that models trained solely on the synthetic data generated by SynSym perform comparably to those trained on real data, and benefit further from additional fine-tuning with real data. These findings underscore the potential of synthetic data as an alternative resource to real-world annotations in psychiatric symptom modeling, and SynSym serves as a practical framework for generating clinically relevant and realistic symptom expressions.

Keywords: synthetic data generation, psychiatric symptom identification, large language models, mental health, social media analysis, data augmentation, clinical co-occurrence patterns, fine-tuning

13. ❌ Graph Fusion Across Languages using Large Language Models

Authors: Kaung Myat Kyaw, Khush Agarwal, Jonathan Chan | Source: arXiv | Published: 2026-03-22 | arXiv link: http://arxiv.org/abs/2603.21248v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

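Scoring rationale: The paper's core idea is to use the in-context reasoning ability of LLMs to solve cross-lingual knowledge graph fusion, so it is highly relevant to "Large Language Models" and "In-context Learning" (10 points each). The paper mentions the reasoning ability of LLMs, which relates loosely to "Chain of Thought" (5 points). Other keywords such as MoE, SFT, RAG, and quantization are neither mentioned in the abstract nor relevant to it, so they score 0.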

!!! tip deepseek-chat TL;DR

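The study proposes a framework that uses the in-context reasoning and multilingual semantic priors of large language models to tackle cross-lingual knowledge graph fusion, and its evaluation on the DBP15K dataset supports the use of LLMs as a universal semantic bridge.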

Abstract

Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph ($G_{c}^{(t-1)}$) and a new candidate graph ($G_{t}$). Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.

Keywords: cross-lingual graph fusion, knowledge graphs, Large Language Models, in-context reasoning, multilingual semantic priors, structural linearization, DBP15K dataset, knowledge synthesis
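
The structural linearization step described in the abstract, mapping triplets into "[head] [relation] [tail]" sequences, is simple enough to sketch. The function names and the prompt wording below are hypothetical, the example triplets are invented for illustration, and no model is called; the sketch only shows how a fused graph and a candidate graph might be rendered as text for an LLM to compare.

```python
# Sketch only: function names and prompt wording are hypothetical, and no
# model is called here.
Triplet = tuple[str, str, str]  # (head, relation, tail)

def linearize(triplets: list[Triplet]) -> list[str]:
    """Render each triplet as a '[head] [relation] [tail]' sequence."""
    return [f"[{head}] [{relation}] [{tail}]" for head, relation, tail in triplets]

def build_fusion_prompt(fused: list[Triplet], candidate: list[Triplet]) -> str:
    """Place the linearized fused graph and candidate graph in one prompt."""
    lines = ["Fused graph so far:"]
    lines += linearize(fused)
    lines += ["New candidate graph:"]
    lines += linearize(candidate)
    lines += ["Map equivalent relations and entities between the two graphs."]
    return "\n".join(lines)

# Invented example triplets in two languages.
fused_graph = [("Paris", "capital of", "France")]
candidate_graph = [("Paris", "capitale de", "France")]
prompt = build_fusion_prompt(fused_graph, candidate_graph)
print(prompt)
```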

14. ❌ AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference

Authors: Risa Shinoda, Kaede Shiohara, Nakamasa Inoue, Hiroaki Santo, Fumio Okura | Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.22053v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

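Scoring rationale: AnimalCLAP focuses on using deep learning for animal sound recognition and ecological trait inference, an application of AI to science (in particular bioinformatics). Its core contributions are a large-scale animal vocalization dataset and a model based on CLAP (Contrastive Language-Audio Pretraining) that incorporates taxonomic information to improve recognition of unseen species. It is therefore highly relevant to "Pre-training" (10 points), because the model is pre-trained on audio and text, and to "AI for Science" (10 points), as a bioinformatics application. It relates loosely to "Large Language Models" (5 points), because the CLAP framework includes a language model component, although the paper does not examine LLM techniques in depth. Other keywords (such as MoE, SFT, and RAG) are unrelated to the paper and score 0.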

!!! tip deepseek-chat TL;DR

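The paper proposes AnimalCLAP, a language-audio pretraining framework that incorporates taxonomic information to recognize species and infer ecological traits from animal vocalizations. It outperforms the existing CLAP model on unseen-species recognition.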

Abstract

Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.

Keywords: Animal vocalizations, Species recognition, Taxonomy-aware, Language-audio pretraining, Ecological trait inference, Zero-shot learning, Bioacoustics, CLAP model

15. ❌ mSFT: Addressing Dataset Mixtures Overfiting Heterogeneously in Multi-task SFT

Authors: Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun | Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.21606v1

Score: 23.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	15.0/10	15.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

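Score rationale: The paper studies heterogeneous overfitting across dataset mixtures in multi-task supervised fine-tuning (SFT) and proposes the mSFT algorithm. It is highly relevant to “Post-training” OR “Supervised Fine-tuning” OR “SFT” (15 points), since the title and abstract focus explicitly on SFT. It is moderately relevant to “Large Language Models” OR “LLMs” OR “Foundation Models” (8 points): the paper concerns language model training but does not explicitly target large models. The other keywords (MoE, Scaling Laws, RLHF, and so on) are not mentioned in the paper and score 0.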
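
The totals in the score table can be reproduced in a few lines. The sketch below is an assumption about how the report combines its columns: each keyword's score is taken as weight × relevance, capped at the configured per-keyword maximum of 15, and the pass mark follows the report's stated formula 5.0 + 0.8 × total weight. The function names are hypothetical and are not taken from the report generator.

```python
from typing import List, Tuple

MAX_PER_KEYWORD = 15.0  # per-keyword maximum from the report configuration


def pass_threshold(total_weight: float) -> float:
    """Pass mark as stated in the configuration: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows: List[Tuple[float, float]]) -> float:
    """Sum weight x relevance over all keywords (assumed rule), capped per keyword."""
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)


# (weight, relevance) rows for the mSFT paper: two non-zero keywords, 25 zeros.
rows = [(1.0, 8.0), (1.0, 15.0)] + [(1.0, 0.0)] * 25
total = paper_score(rows)
threshold = pass_threshold(sum(weight for weight, _ in rows))
passed = total >= threshold
print(f"{total:.1f} / {threshold:.1f}", "pass" if passed else "fail")
```

With 27 keywords of weight 1.0 the threshold comes out at 26.6, which matches the pass mark shown next to every score in this report.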

!!! tip deepseek-chat TL;DR

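The paper addresses heterogeneous overfitting caused by dataset mixtures in multi-task supervised fine-tuning. It proposes mSFT, which iteratively identifies and excludes overfitting sub-datasets during training, and reports that it outperforms baseline methods across multiple benchmarks and models.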

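Abstract translation

Current language model training commonly uses multi-task supervised fine-tuning (SFT) with a uniform compute budget across all sub-datasets. This approach is inherently sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower-learning tasks remain under-fitted. To address this, we propose mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on a dynamic mixture, identifies and excludes the sub-dataset that overfits earliest, and rolls back to the optimal checkpoint for that task before continuing training. Extensive evaluation shows that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT keeps robust gains across dataset sizes and task granularities and is insensitive to its only new hyperparameter (compute budget). Notably, at low compute budgets, mSFT can improve model performance while reducing training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes model potential across diverse data mixtures.

Abstract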

Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest overfitting sub-dataset, and reverts to that specific optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms mSFT maintains robust gains across diverse dataset sizes, task granularities, and is insensitive to its single new hyperparameter (compute budget). Notably, at low compute budget, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
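
The loop described in the abstract (train on the active mixture, find the sub-dataset that overfits first, roll back to its best checkpoint, exclude it, continue) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: validation losses are precomputed per step instead of coming from real training, "overfitting" is taken to mean that a task's validation loss reached its minimum before the end of the current window, and the name `msft_sketch` is hypothetical. It is not the authors' implementation.

```python
from typing import Dict, List


def msft_sketch(val_curves: Dict[str, List[float]], budget: int) -> Dict[str, int]:
    """Return the checkpoint step at which each task is excluded (toy model).

    val_curves[task][step] is a precomputed validation loss that stands in
    for "train on the active mixture, then evaluate". A real run would
    retrain after every rollback, so its curves would depend on the mixture.
    """
    if budget < 2:
        raise ValueError("budget must be at least 2 steps")
    if not val_curves:
        return {}
    active = dict(val_curves)
    total_steps = len(next(iter(val_curves.values())))
    step, stops = 0, {}
    while active:
        end = min(step + budget, total_steps)
        # Step at which each active task's validation loss bottoms out in this window.
        best_steps = {
            task: step + curve[step:end].index(min(curve[step:end]))
            for task, curve in active.items()
        }
        task = min(best_steps, key=best_steps.get)
        best = best_steps[task]
        if best < end - 1 or end == total_steps:
            # Earliest overfitting task: roll back to its best step and exclude it.
            stops[task] = best
            del active[task]
            step = best
        else:
            # Every active task is still improving; keep training from the window end.
            step = end - 1
    return stops


curves = {
    "fast": [1.0, 0.6, 0.5, 0.7, 0.9, 1.1, 1.2, 1.3],
    "slow": [2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.9, 0.95],
}
stops = msft_sketch(curves, budget=4)
print(stops)
```

In this toy run the fast-learning task is excluded at step 2 and the slow one at step 6, so each task stops at its own best checkpoint instead of sharing a single stopping point.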

Keywords: Supervised Fine-Tuning, multi-task SFT, dataset mixtures, overfitting, heterogeneous learning, iterative algorithm, training optimization, language model training

16. ❌ Conspiracy Frame: a Semiotically-Driven Approach for Conspiracy Theories Detection

Authors: Heidi Campana Piva, Shaina Ashraf, Maziar Kianimoghadam Jouneghani, Arianna Longo, Rossana Damiano, Lucie Flek, Marco Antonio Stranisci Journal/Source: arxiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21368v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

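Score rationale: The paper studies the use of LLMs for conspiracy theory detection. Its core is an evaluation of how well LLMs recognize conspiracy theories and of how frame semantics (frames) can support the task. The abstract explicitly mentions “the ability of LLMs to recognize this phenomenon” and the “injection of frames in an in-context approach”, so it is highly relevant to “Large Language Models” and “In-context Learning” (10 points each). The other keywords (MoE, Scaling Laws, RLHF, and so on) concern model architecture, training methods, or specific techniques that the paper does not mention, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies how well large language models (LLMs) detect conspiracy theories. It introduces the Conspiracy Frame dataset, grounded in frame semantics and semiotics, and finds that injecting frames through in-context learning does not significantly improve performance but does reveal latent semantic patterns, paving the way for more semantically aware conspiracy theory detection.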

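Abstract translation

Conspiracy theories are anti-authoritarian narratives that tend to provoke social conflict and affect how people perceive political information. To better understand this problem, we propose the Conspiracy Frame: a fine-grained semantic representation of conspiratorial narratives built on frame semantics and semiotics. On this basis we build the Conspiracy Frames dataset, a corpus of Telegram messages annotated at span level. The frame and the dataset help advance a more generalizable understanding and recognition of conspiracy theories. We examine the ability of large language models to recognize this phenomenon in in-domain and out-of-domain settings, and investigate the supporting role that frames may play in the task. Results show that although injecting frames through in-context learning does not significantly improve model performance, it still has potential; by mapping annotated spans to FrameNet, we find abstract semantic patterns (such as 'Kinship' and 'Ingest_substance') that may open a path toward conspiracy narrative detection that is more semantically and semiotically aware.

Abstract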

Conspiracy theories are anti-authoritarian narratives that lead to social conflict, impacting how people perceive political information. To help in understanding this issue, we introduce the Conspiracy Frame: a fine-grained semantic representation of conspiratorial narratives derived from frame-semantics and semiotics, which spawned the Conspiracy Frames (Con.Fra.) dataset: a corpus of Telegram messages annotated at span-level. The Conspiracy Frame and Con.Fra. dataset contribute to the implementation of a more generalizable understanding and recognition of conspiracy theories. We observe the ability of LLMs to recognize this phenomenon in-domain and out-of-domain, investigating the role that frames may have in supporting this task. Results show that, while the injection of frames in an in-context approach does not lead to clear increase of performance, it has potential; the mapping of annotated spans with FrameNet shows abstract semantic patterns (e.g., 'Kinship', 'Ingest_substance') that potentially pave the way for a more semantically- and semiotically-aware detection of conspiratorial narratives.

Keywords: Conspiracy theories detection, Large Language Models, Frame semantics, In-context learning, Semiotic analysis, Telegram messages, Span-level annotation, Generalizable understanding

17. ❌ TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols

Authors: Saketh Vinjamuri, Marielle Fis Loperena, Marie C. Spezia, Ramez Kouzy Journal/Source: arxiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21335v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

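Score rationale: The core of the paper is TimeTox, an automated pipeline built on an LLM (Google’s Gemini) that extracts time toxicity metrics from clinical trial protocols. This maps directly to “Large Language Models” OR “LLMs” OR “Foundation Models” (highly relevant, 10 points) and “AI for Science” OR “Bioinformatics” OR “Cheminformatics” (a biomedical AI application, highly relevant, 10 points). The paper introduces no other technical innovations (MoE, quantization, inference optimization, and so on) and uses no other training methods (SFT, RLHF, PEFT, and so on), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study develops TimeTox, an automated LLM-based pipeline that extracts time toxicity metrics from clinical trial protocols. On real-world data it reaches 95.3% clinically acceptable accuracy and 82.0% perfect stability.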

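Abstract translation

Time toxicity, the cumulative number of healthcare contact days incurred through clinical trial participation, is an important metric that currently has to be extracted manually from protocol documents. We developed TimeTox, an automated LLM-based pipeline for extracting time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models and runs in three stages: summary extraction from full protocol PDFs, quantification of time toxicity at six cumulative timepoints for each treatment arm, and consolidation through multi-run consensus based on position-based arm matching. We validated it on 20 synthetic assessment schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. We compared two architectures: single-pass direct extraction and two-stage stepwise extraction. On synthetic data, the two-stage pipeline achieved 100% clinically acceptable accuracy (error within ±3 days, mean absolute error 0.81 days), versus only 41.5% for the single-pass pipeline (mean absolute error 9.0 days). On real-world protocols, however, the single-pass pipeline showed better reproducibility: 95.3% clinically acceptable accuracy (interquartile range ≤ 3 days) across three runs on 644 protocols, with 82.0% fully stable (interquartile range = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease areas. For production LLM deployment, extraction stability on real-world data is more decisive than accuracy on synthetic benchmarks.

Abstract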

Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google’s Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy (±3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR ≤ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
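
The two reproducibility figures in the abstract (share of protocols with IQR ≤ 3 days, share with IQR = 0) can be computed from repeated extraction runs as sketched below. This is a minimal illustration, not the TimeTox code: the quartile convention (linear interpolation, `method="inclusive"`) is an assumption because the abstract does not state one, and the protocol names and day counts are invented example data.

```python
from statistics import quantiles
from typing import Dict, List


def iqr(values: List[float]) -> float:
    """Interquartile range using linear interpolation between data points (assumed convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def stability_summary(runs_by_protocol: Dict[str, List[float]],
                      tolerance_days: float = 3.0) -> Dict[str, float]:
    """Fraction of protocols whose repeated runs agree within tolerance, and exactly."""
    iqrs = [iqr(runs) for runs in runs_by_protocol.values()]
    n = len(iqrs)
    return {
        "acceptable": sum(x <= tolerance_days for x in iqrs) / n,
        "perfect": sum(x == 0 for x in iqrs) / n,
    }


# Hypothetical contact-day counts from three runs per protocol.
runs = {
    "protocol_a": [10, 10, 10],
    "protocol_b": [10, 12, 14],
    "protocol_c": [10, 20, 30],
}
summary = stability_summary(runs)
print(summary)
```

With only three runs per protocol, the IQR under this convention is half the spread between the smallest and largest value, so a different quartile convention would shift the reported percentages.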

Keywords: Time toxicity, LLM-based pipeline, clinical trial protocols, automated extraction, oncology, Schedule of Assessments, reproducibility, Gemini models

18. ❌ Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models

Authors: Abdul-Salem Beibitkhan Journal/Source: arxiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21036v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

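Score rationale: The core of the paper is an evaluation of large language models (LLMs) on low-resource languages (such as Kazakh and Mongolian) and a test of how effective a cross-lingual transfer strategy is. It is therefore highly relevant to “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points), since the paper takes LLMs as its direct object of study and benchmarks them. However, it does not examine the specific techniques (MoE, SFT, RAG, quantization, and so on), methods (CoT, RLHF, PEFT), or application domains (AI for Science) covered by the other keywords. Its focus is performance evaluation and cross-lingual transfer rather than technical innovation or a domain-specific application.

!!! tip deepseek-chat TL;DR

The study evaluates large language models on low-resource languages. It finds a significant performance gap between English and the low-resource languages, and that the effectiveness of cross-lingual transfer depends on the model architecture.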

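Abstract translation

This study examines how large language models handle low-resource languages by evaluating eight LLMs under five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions covering factual, reasoning, technical, and culturally grounded categories, we evaluate 2,000 responses for accuracy, fluency, and completeness. We find a consistent performance gap of 13.8 to 16.7 percentage points between English and the low-resource language conditions: models maintain surface-level fluency, but the accuracy of the generated content drops significantly. Cross-lingual transfer, in which the model is prompted to reason in English and then translate back into the target language, brings selective gains for bilingual-architecture models (+2.2 to +4.3 percentage points) but no gain for English-dominant models. Our results show that current LLMs systematically underserve low-resource language communities, and that effective mitigation strategies must be tailored to the model architecture rather than applied universally.

Abstract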

We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally grounded categories, we evaluate 2,000 responses on accuracy, fluency, and completeness. We find a consistent performance gap of 13.8-16.7 percentage points between English and low-resource language conditions, with models maintaining surface-level fluency while producing significantly less accurate content. Cross-lingual transfer (prompting models to reason in English before translating back) yields selective gains for bilingual architectures (+2.2pp to +4.3pp) but provides no benefit to English-dominant models. Our results demonstrate that current LLMs systematically underserve low-resource language communities, and that effective mitigation strategies are architecture-dependent rather than universal.
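
The cross-lingual transfer condition, in which the model reasons in English and then translates back, can be pictured as a prompt template. The sketch below is illustrative only: the paper's actual prompts are not given in the abstract, so the wording, the function name `build_prompt`, and its parameters are all assumptions.

```python
def build_prompt(question: str, language: str, cross_lingual: bool) -> str:
    """Build either a direct prompt or a reason-in-English-then-translate prompt."""
    if not cross_lingual:
        # Direct condition: answer in the target language without an English detour.
        return f"Answer the following question in {language}.\n\nQuestion: {question}"
    # Cross-lingual transfer condition (hypothetical wording).
    return (
        f"The following question is written in {language}.\n"
        "First reason through it step by step in English.\n"
        f"Then translate your final answer back into {language}.\n\n"
        f"Question: {question}"
    )


direct = build_prompt("<question text>", "Kazakh", cross_lingual=False)
transfer = build_prompt("<question text>", "Kazakh", cross_lingual=True)
print(transfer)
```

Running the same question set through both templates and comparing accuracy is one way to measure the kind of percentage-point gain the abstract reports.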

Keywords: Large Language Models, Low-resource Languages, Cross-lingual Transfer, Performance Evaluation, Benchmarking, Kazakh, Mongolian, Accuracy Gap

19. ❌ Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

Authors: Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy, Mehraj Mahmood, Mahbub E Sobhani, Md. Tarek Hasan, Swakkhar Shatabda Journal/Source: arxiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21165v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

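Score rationale: The paper evaluates multilingual vision-language models (VLMs) on Bengali culture understanding. This is application research on large models in a specific cultural domain (VLMs can be regarded as a kind of large model), so it has some relevance to “Large Language Models” OR “LLMs” OR “Foundation Models” (5 points). The other keywords mainly concern the technical underpinnings of large models (MoE, Scaling Laws, training methods, inference optimization, and so on) or specific application domains (such as AI for Science). The paper does not directly address these, so they score 0.

!!! tip deepseek-chat TL;DR

To address the underrepresentation of Bengali culture in multimodal evaluation, the paper introduces the BanglaVerse benchmark for assessing multilingual vision-language models on Bengali culture understanding. It finds that evaluating only standard Bangla overestimates model capability, that dialectal variation degrades performance, and that the main bottleneck is missing cultural knowledge rather than visual grounding.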

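Abstract translation

Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To fill this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. The benchmark contains 1,152 manually curated images across nine domains, supports visual question answering and image captioning, and is extended to four languages and five Bangla dialects, yielding about 32.3K data samples in total. Experiments show that evaluating only standard Bangla overestimates a model's true capability: performance drops markedly under dialectal variants (especially in caption generation), while historically linked languages such as Hindi and Urdu preserve some cultural meaning but remain weaker on structured reasoning tasks. Cross-domain analysis finds that the main bottleneck is the models' lack of cultural knowledge rather than visual grounding alone, which is especially evident in knowledge-intensive categories. These findings make BanglaVerse a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.

Abstract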

Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.

Keywords: multilingual vision-language models, Bengali culture understanding, BanglaVerse benchmark, dialectal variation, culturally grounded evaluation, visual question answering, caption generation, cultural knowledge bottleneck

20. ❌ A Latent Representation Learning Framework for Hyperspectral Image Emulation in Remote Sensing

Authors: Chedly Ben Azizi, Claire Guilloteau, Gilles Roussel, Matthieu Puigt Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21911v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

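Score rationale: The paper focuses on hyperspectral image emulation in remote sensing and proposes a framework based on latent representation learning that uses a variational autoencoder (VAE) for pretraining and interpolation. The work is an application of AI to science (remote sensing), so it has some relevance to “AI for Science” (5 points). It does not involve large language models (LLMs), innovations in the underlying principles of deep learning, or any of the specific techniques named in the other scored keywords (MoE, Scaling Laws, RLHF, and so on), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a hyperspectral image emulation framework based on latent representation learning. Using variational autoencoder pretraining and parameter-to-latent interpolation, it achieves higher reconstruction accuracy, spectral fidelity, and robustness than traditional regression-based methods on vegetation and remote sensing data, while preserving performance in downstream biophysical parameter retrieval.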

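Abstract translation

Synthetic hyperspectral image generation is essential for large-scale simulation, algorithm development, and mission design, yet traditional radiative transfer models are computationally expensive and usually limited to spectrum-level outputs. This work proposes a latent representation-based framework for hyperspectral emulation that learns a latent generative representation of hyperspectral data. The proposed method supports both spectrum-level and spatial-spectral emulation, and can be trained either directly in one step or with a two-step strategy that couples variational autoencoder pretraining with parameter-to-latent interpolation. Experiments on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery show that the method outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability. We further show that emulated hyperspectral images preserve performance in downstream biophysical parameter retrieval, which highlights the practical value of emulated data for remote sensing applications.

Abstract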

Synthetic hyperspectral image (HSI) generation is essential for large-scale simulation, algorithm development, and mission design, yet traditional radiative transfer models remain computationally expensive and often limited to spectrum-level outputs. In this work, we propose a latent representation-based framework for hyperspectral emulation that learns a latent generative representation of hyperspectral data. The proposed approach supports both spectrum-level and spatial-spectral emulation and can be trained either in a direct one-step formulation or in a two-step strategy that couples variational autoencoder (VAE) pretraining with parameter-to-latent interpolation. Experiments on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery demonstrate that the method outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability. We further show that emulated HSIs preserve performance in downstream biophysical parameter retrieval, highlighting the practical relevance of emulated data for remote sensing applications.

Keywords: hyperspectral image emulation, latent representation learning, variational autoencoder, remote sensing, spectral-spatial emulation, biophysical parameter retrieval, Sentinel-3 OLCI, PROSAIL

21. ❌ UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

Authors: Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22282v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

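Score rationale: UniMotion proposes a unified framework that understands and generates human motion, natural language, and RGB images within a single architecture. It explicitly uses a shared LLM backbone (highly relevant) and proposes LRA, a self-supervised pre-training strategy (relevant to pre-training). However, the paper focuses on unified multimodal modeling, in particular cross-modal alignment and generation across motion, text, and vision, and does not address the specific techniques behind the other keywords, such as MoE, SLMs, alignment, reasoning, agents, or compression.

!!! tip deepseek-chat TL;DR

UniMotion presents what its authors describe as the first unified framework that, through continuous modality processing and cross-modal alignment, jointly understands and generates human motion, natural language, and RGB images within a single architecture. It reports state-of-the-art performance on seven cross-modal tasks.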

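Abstract translation

This paper presents UniMotion, to our knowledge the first unified model that can simultaneously understand and generate human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (such as motion-text or static pose-image) and rely mainly on discrete tokenization, which introduces quantization error and breaks temporal continuity. UniMotion overcomes these limitations through one core principle: treating motion as a continuous modality on equal footing with RGB. We design a novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders that build parallel continuous processing pathways for motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring image input at inference, we propose Dual-Posterior KL Alignment (DPA), which distills the richer posterior of a vision-fused encoder into the motion-only encoder. To address the cold-start problem, where text supervision alone is too sparse to calibrate the newly introduced motion pathway, we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow prediction head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance on seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.

Abstract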

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder’s richer posterior into the motion-only encoder. To address the cold-start problem – where text supervision alone is too sparse to calibrate the newly introduced motion pathway – we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
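
Dual-Posterior KL Alignment is described only at a high level, but aligning two encoder posteriors with a KL term usually relies on the closed-form divergence between Gaussians. The sketch below computes that quantity for diagonal Gaussians. It rests on assumptions that the abstract does not confirm: that both posteriors are diagonal Gaussians, as is common for VAEs, and that the vision-fused posterior appears as the first argument. The function name is hypothetical.

```python
import math
from typing import Sequence


def kl_diag_gaussians(mu_p: Sequence[float], sigma_p: Sequence[float],
                      mu_q: Sequence[float], sigma_q: Sequence[float]) -> float:
    """KL( N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2)) ), summed over dimensions."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q)
    )


# p: assumed vision-fused posterior (teacher); q: assumed motion-only posterior (student).
same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
shifted = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])
print(same, shifted)
```

Minimizing such a term with respect to the motion-only encoder's parameters would pull its posterior toward the vision-fused one, which is the distillation effect the abstract describes.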

Keywords: Unified Framework, Motion-Text-Vision, Cross-Modal Alignment, Continuous Modality, LLM Backbone, Self-supervised Pre-training, Multi-modal Generation, State-of-the-art Performance

22. ❌ End-to-End Training for Unified Tokenization and Latent Denoising

Authors: Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman, Antonio Torralba, Phillip Isola, William T. Freeman Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22283v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文提出了一种名为UNITE的自动编码器架构，用于统一标记化和潜在扩散，主要关注图像和分子模态的生成模型。论文的核心贡献在于单阶段联合训练方法，通过共享参数实现标记化和生成的联合优化。所有关键词均与大型语言模型（LLMs）相关，但论文未涉及LLMs，而是专注于潜在扩散模型（LDMs）和自动编码器。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文提到了分子模态的应用，但这不是核心焦点，因此给予5分（有一定关联）。其他关键词与LLMs技术、对齐、推理、代理等无关，均得0分。

!!! tip deepseek-chat TL;DR

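The paper proposes UNITE, an autoencoder architecture that unifies image tokenization and latent diffusion through single-stage joint training, achieving near state-of-the-art performance on image and molecule modalities without adversarial losses or pretrained encoders.

Abstract (Translation)

Latent diffusion models achieve high-fidelity synthesis by operating in a learned latent space. However, training state-of-the-art latent diffusion models requires a complex staged process: a tokenizer must be trained first, and only then can the diffusion model be trained in the frozen latent space. We propose UNITE, an autoencoder architecture for unified tokenization and latent diffusion. UNITE contains a Generative Encoder that, through weight sharing, serves as both image tokenizer and latent generator. Our core insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning: tokenization infers latents from a fully observed image, while generation infers them from noise combined with text or class conditions. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. Shared parameters let gradients jointly shape the latent space, fostering a "common latent language". On image and molecule modalities, UNITE reaches near state-of-the-art performance without adversarial losses or pretrained encoders (e.g., DINO); on ImageNet 256 x 256, its Base and Large models reach FID 2.12 and 1.73, respectively. We further analyze the Generative Encoder from the perspectives of representation alignment and compression. These results show that single-stage joint training of tokenization and generation from scratch is feasible.

Abstract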

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a “common latent language”. Across image and molecule modalities, UNITE achieves near state of the art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256 x 256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single stage joint training of tokenization & generation from scratch is feasible.

Keywords: latent diffusion models, autoencoder, unified tokenization, generative encoder, single-stage training, image synthesis, molecule modalities, joint optimization

23. ❌ WorldCache: Content-Aware Caching for Accelerated Video World Models

Authors: Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan | Journal/Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.22286v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

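Scoring rationale: The paper focuses on inference acceleration for video world models, specifically cache optimization for Diffusion Transformers (DiTs). It is unrelated to the vast majority of the keywords (large language models, training methods, alignment, agents, and so on). Only two keywords are relevant: 1) "World Models AND General World Models" is highly relevant (10 points), because the paper directly studies video world models; 2) "Speculative Decoding OR Inference Acceleration" has some relevance (5 points), because the paper's core goal is faster inference, although it achieves this through caching rather than speculative decoding.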

!!! tip deepseek-chat TL;DR

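The paper addresses the slow inference of Diffusion Transformer video world models by proposing WorldCache, a perception-constrained dynamic caching framework that achieves a 2.3× speedup while preserving 99.4% of baseline quality.

Abstract (Translation)

Diffusion Transformers (DiTs) power high-fidelity video world models, but their sequential denoising and expensive spatio-temporal attention keep computational cost high. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods rely mainly on a zero-order hold assumption, reusing cached features as static snapshots when global drift is small. This often causes ghosting artifacts, blur, and motion inconsistency in dynamic scenes. We propose WorldCache, a perception-constrained dynamic caching framework that improves both when to reuse and how to reuse. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. This holistic approach enables adaptive, motion-consistent feature reuse without retraining. On the Cosmos-Predict2.5-2B model evaluated on PAI-Bench, WorldCache achieves a 2.3× inference speedup while preserving 99.4% of the baseline model's quality, substantially outperforming prior training-free caching methods. Our code is available at World-Cache (https://umair1221.github.io/World-Cache/).

Abstract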

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed on World-Cache (https://umair1221.github.io/World-Cache/).

Keywords: Video World Models, Diffusion Transformers, Inference Acceleration, Feature Caching, Training-free Optimization, Motion-adaptive Caching, Perception-Constrained Framework, Dynamic Scene Consistency
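
The zero-order hold reuse rule that the abstract describes as the existing baseline (reuse a cached feature as a static snapshot while drift stays small) can be sketched as follows. This is not WorldCache itself: the drift measure, the threshold value, and the names `relative_drift` and `ZeroOrderHoldCache` are illustrative assumptions, and the paper's motion-adaptive thresholds, saliency weighting, blending, and warping are left out.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def relative_drift(current: Vector, cached: Vector) -> float:
    # Assumed drift measure: L1 change relative to the cached input's L1 norm.
    change = sum(abs(c - k) for c, k in zip(current, cached))
    scale = sum(abs(k) for k in cached) + 1e-8
    return change / scale


class ZeroOrderHoldCache:
    """Reuse the last block output while the input has drifted only slightly."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.cached_input: Optional[list] = None
        self.cached_output: Optional[list] = None
        self.recomputed = 0
        self.reused = 0

    def __call__(self, block: Callable[[Vector], list], x: Vector) -> list:
        if (
            self.cached_input is not None
            and relative_drift(x, self.cached_input) < self.threshold
        ):
            # Small drift: return the stale output as a static snapshot.
            self.reused += 1
            return self.cached_output
        # Large drift (or first call): run the expensive block and refresh the cache.
        out = block(x)
        self.cached_input = list(x)
        self.cached_output = out
        self.recomputed += 1
        return out
```

Because the reused output is never adjusted, any motion between steps is ignored, which is consistent with the ghosting and blur the abstract attributes to this assumption.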

24. ❌ 3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing

Authors: Haoyu Zhen, Xiaolong Li, Yilin Zhao, Han Zhang, Sifei Liu, Kaichun Mo, Chuang Gan, Subhashree Radhakrishnan | Journal/Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.22279v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

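Scoring rationale: The paper's core topic is the structured reasoning ability of LLMs on spatial layout editing tasks. It is highly relevant to "Large Language Models" and "Chain of Thought" (10 points each), because it uses LLMs directly and improves their reasoning process. It has some relevance to "SFT" and "Instruction Tuning" (5 points each), because it involves fine-tuning for instruction following. It is weakly related to "System 2 Thinking" and "Explainable AI" (5 points each), because it emphasizes in-depth reasoning and interpretability. Other keywords such as MoE, SLMs, and RAG are not covered and score 0.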

!!! tip deepseek-chat TL;DR

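The paper addresses the weaknesses of LLMs in spatial understanding and layout consistency by proposing a structured reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning, and it reports a marked improvement in spatial precision over baseline methods.

Abstract (Translation)

Large language models (LLMs) and vision-language models (VLMs) have shown impressive reasoning abilities, yet they still struggle with spatial understanding and layout consistency when performing fine-grained visual editing tasks. This paper proposes a structured reasoning framework that performs text-conditioned spatial layout editing through scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the scene graph and generates an updated scene graph that satisfies the text condition while remaining spatially coherent. By explicitly guiding the reasoning process with structured relational representations, the method improves both interpretability and control over spatial relationships. We evaluate the method on a new text-guided layout editing benchmark covering sorting, spatial alignment, and room-editing tasks. Compared with chain-of-thought fine-tuning (CoT-SFT) and vanilla GRPO baselines, our training paradigm yields an average 15% improvement in intersection over union (IoU) and a 25% reduction in center-distance error. Compared with state-of-the-art zero-shot LLMs, our best models achieve up to 20% higher mean IoU (mIoU), showing markedly improved spatial precision.

Abstract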

Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.

Keywords: Large Language Models, Structured Reasoning, Spatial Layout Editing, Scene-graph Reasoning, Chain of Thought, Text-conditioned Editing, Spatial Coherence, Interpretability
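
The two metrics named in the abstract, IoU and center-distance error, can be illustrated with a small sketch. It assumes 2D axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the paper's actual box parameterization (possibly 3D) and evaluation protocol are not stated here, and the function names `iou` and `center_distance` are hypothetical.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), an assumed layout


def iou(a: Box, b: Box) -> float:
    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    overlap_x = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_y = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_x * overlap_y
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def center_distance(a: Box, b: Box) -> float:
    # Euclidean distance between the two box centers.
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)
```

Higher IoU and lower center distance both indicate that a predicted box sits closer to its target, which is why the abstract reports an improvement in one and a reduction in the other.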

25. ❌ ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Authors: Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu | Journal/Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.22281v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

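Scoring rationale: The paper focuses on a latent world modeling framework guided by a vision-language model (VLM); its core idea is to combine dense-frame dynamics modeling with long-horizon semantic guidance. It is highly relevant only to the keyword "World Models AND General World Models" (10 points), because the paper explicitly studies latent world models and improves their predictive ability. The other keywords mainly concern pure language-model techniques, training methods, inference optimization, agent systems, and so on; the paper does not address these specific techniques, so they score 0.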

!!! tip deepseek-chat TL;DR

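The paper proposes a VLM-guided, JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance. It addresses the shortcomings of existing methods in capturing long-horizon semantics and in downstream utility, and it outperforms baseline methods on hand-manipulation trajectory prediction.

Abstract (Translation)

Recent progress in latent world models (such as V-JEPA2) has shown promising ability to forecast future world states from video observations. However, dense prediction from a short observation window limits temporal context and tends to bias predictors toward local, low-level extrapolation, making long-horizon semantics hard to capture and reducing usefulness for downstream tasks. Vision-language models, by contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not well suited as standalone dense predictors because of compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided, JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance through a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride that provides knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and produces more robust long-horizon rollout behavior.

Abstract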

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision–language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM thinker branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM’s progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

Keywords: latent world models, vision-language models, JEPA, temporal context, long-horizon semantics, trajectory prediction, hierarchical pyramid representation, dual-temporal pathway
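
The dual-temporal pathway in the abstract feeds two branches with different views of the same clip: consecutive frames for the dense branch and widely spaced frames for the thinker branch. A minimal sketch of just the frame selection, assuming a fixed window and a fixed stride (the abstract gives neither value, and `dense_indices` and `strided_indices` are hypothetical names), might be:

```python
from typing import List


def dense_indices(start: int, length: int) -> List[int]:
    # Dense branch: every consecutive frame inside a short observation window.
    return list(range(start, start + length))


def strided_indices(num_frames: int, stride: int) -> List[int]:
    # Thinker branch: uniformly spaced frames with a larger temporal stride.
    if stride <= 0:
        raise ValueError("stride must be positive")
    return list(range(0, num_frames, stride))


# Illustrative numbers only: a 16-frame clip, a 4-frame dense window, stride 4.
dense_view = dense_indices(start=12, length=4)
thinker_view = strided_indices(num_frames=16, stride=4)
```

Here both views contain four frames, but the strided view spans the whole clip while the dense view covers only its last quarter, which is the trade-off between temporal coverage and fine-grained motion that the abstract describes.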

26. ❌ TiCo: Time-Controllable Training for Spoken Dialogue Models

Authors: Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass | Journal/Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.22267v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

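Scoring rationale: The paper proposes TiCo, an application-level innovation for large models in spoken dialogue. Relevant keywords: 1) "Post-training OR Supervised Fine-tuning OR SFT" (10 points): TiCo is explicitly described as a "post-training method", which is its core approach; 2) "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (5 points): the paper mentions optimization with "reinforcement learning"; 3) "Instruction Tuning OR Alignment OR Value Alignment" (5 points): it involves making the model follow time-constrained instructions; 4) "Large Language Models OR LLMs OR Foundation Models" (5 points): the paper studies spoken dialogue models, which fall within large-model applications; 5) "LLM Agents OR Autonomous Agents OR Agentic Workflow" (5 points): the method applies to voice assistants and interactive agents. The other keywords have no direct connection to the paper.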

!!! tip deepseek-chat TL;DR

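The paper proposes TiCo to address the lack of time awareness in existing spoken dialogue models, which cannot follow duration instructions. By introducing Spoken Time Markers and reinforcement learning, it significantly improves adherence to duration constraints.

Abstract (Translation)

We propose TiCo, a simple post-training method that enables spoken dialogue models to follow time-constrained instructions and generate responses of controllable duration. This capability is valuable for real-world spoken systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, although existing models generate natural spoken responses well, they lack time awareness and struggle to follow duration-related instructions (for example, "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of open-source and commercial spoken dialogue models, we find that they frequently fail to meet such time-control requirements. TiCo addresses this limitation by letting the model estimate elapsed speaking time during generation with Spoken Time Markers (for example, <10.6 seconds>). These markers help the model stay aware of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it needs only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.

Abstract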

We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., “Please generate a response lasting about 15 seconds”). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.

Keywords: spoken dialogue models, time-controllable training, post-training method, duration constraints, spoken time markers, reinforcement learning, voice assistants, interactive agents
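
The abstract gives `<10.6 seconds>` as an example of a Spoken Time Marker but does not say how markers are placed. The sketch below is one guess at the data-side mechanics: it interleaves markers into a word sequence whenever the elapsed speaking time crosses a fixed interval, and reads the last marker back out. The per-word durations, the 5-second interval, and the names `insert_time_markers` and `last_marker_seconds` are all assumptions for illustration, not the paper's procedure.

```python
import re
from typing import List, Optional

MARKER_PATTERN = re.compile(r"<(\d+(?:\.\d+)?) seconds>")


def insert_time_markers(
    words: List[str], durations: List[float], every: float = 5.0
) -> List[str]:
    """Interleave elapsed-time markers into a word sequence (assumed scheme)."""
    out: List[str] = []
    elapsed = 0.0
    next_mark = every
    for word, duration in zip(words, durations):
        out.append(word)
        elapsed += duration
        if elapsed >= next_mark:
            # Marker format follows the example quoted in the abstract.
            out.append(f"<{elapsed:.1f} seconds>")
            while next_mark <= elapsed:
                next_mark += every
    return out


def last_marker_seconds(text: str) -> Optional[float]:
    # Read back the most recent elapsed-time estimate, if any marker is present.
    matches = MARKER_PATTERN.findall(text)
    return float(matches[-1]) if matches else None
```

A model trained on sequences like this could, in principle, compare the latest marker against the target duration and shorten or extend the remaining content accordingly.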

27. ❌ One Model, Two Markets: Bid-Aware Generative Recommendation

Authors: Yanchen Jiang, Zhe Feng, Christopher P. Mah, Aranyak Mehta, Di Wang | Journal/Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.22231v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

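Scoring rationale: The paper studies a generative recommender system (GEM-Rec) focused on ad placement and bidding in commercial retrieval, using control tokens and bid-aware decoding to optimize semantic relevance and platform revenue. All keywords concern large models, the principles of deep learning techniques, or scientific applications (such as bioinformatics), whereas the core of this paper is recommender system architecture and commercial optimization. It does not involve any large-model technique, deep learning innovation, or scientific application, so every keyword has relevance 0.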

!!! tip deepseek-chat TL;DR

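The paper proposes the GEM-Rec framework, which addresses how to integrate commercial relevance and monetization objectives into generative recommender systems. Using control tokens and a bid-aware decoding mechanism, it dynamically optimizes semantic relevance and platform revenue without retraining the model.

Abstract (Translation)

Generative recommender systems based on semantic IDs, such as TIGER (Rajput et al., 2023), have become a widely adopted, competitive paradigm in sequential recommendation. However, existing architectures are designed only for semantic retrieval and do not address concerns such as monetization through ad revenue or the incorporation of bids in commercial retrieval. We propose GEM-Rec, a unified framework that integrates commercial relevance and monetization objectives directly into the generative sequence. We introduce control tokens to decouple the decision of whether to show an ad from the decision of which item to show. This lets the model learn valid ad placement patterns directly from interaction logs, which inherently reflect past successful ad placements. As a complement, we design a bid-aware decoding mechanism that handles real-time pricing by injecting bids directly into inference to steer generation toward high-value items. We prove that the approach guarantees allocation monotonicity, meaning that a higher bid weakly increases the likelihood that an ad is shown, without retraining the model. Experiments show that GEM-Rec lets platforms dynamically optimize for semantic relevance and platform revenue.

Abstract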

Generative Recommender Systems using semantic ids, such as TIGER (Rajput et al., 2023), have emerged as a widely adopted competitive paradigm in sequential recommendation. However, existing architectures are designed solely for semantic retrieval and do not address concerns such as monetization via ad revenue and incorporation of bids for commercial retrieval. We propose GEM-Rec, a unified framework that integrates commercial relevance and monetization objectives directly into the generative sequence. We introduce control tokens to decouple the decision of whether to show an ad from which item to show. This allows the model to learn valid placement patterns directly from interaction logs, which inherently reflect past successful ad placements. Complementing this, we devise a Bid-Aware Decoding mechanism that handles real-time pricing, injecting bids directly into the inference process to steer the generation toward high-value items. We prove that this approach guarantees allocation monotonicity, ensuring that higher bids weakly increase an ad’s likelihood of being shown without requiring model retraining. Experiments demonstrate that GEM-Rec allows platforms to dynamically optimize for semantic relevance and platform revenue.

Keywords: Generative Recommender Systems, Semantic IDs, Commercial Retrieval, Ad Revenue, Bid-Aware Decoding, Allocation Monotonicity, Platform Revenue Optimization, Control Tokens
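
The abstract does not give the form of Bid-Aware Decoding, so the following is only a toy illustration of how injecting bids at inference time can be monotone. It assumes an additive bonus of `alpha * log(1 + bid)` on each candidate's logit followed by a softmax; that functional form, the parameter `alpha`, and the function names are assumptions, not the paper's mechanism.

```python
import math
from typing import Dict


def bid_adjusted_logits(
    logits: Dict[str, float], bids: Dict[str, float], alpha: float = 1.0
) -> Dict[str, float]:
    # Assumed rule: add a bonus that is non-decreasing in the bid; items without a bid get none.
    return {
        item: logit + alpha * math.log1p(bids.get(item, 0.0))
        for item, logit in logits.items()
    }


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    # Numerically stable softmax over the candidate items.
    peak = max(scores.values())
    exps = {item: math.exp(score - peak) for item, score in scores.items()}
    total = sum(exps.values())
    return {item: value / total for item, value in exps.items()}


# Hypothetical model logits for one decoding step.
logits = {"organic_item": 1.0, "ad_a": 0.5, "ad_b": 0.2}
p_low_bid = softmax(bid_adjusted_logits(logits, {"ad_a": 1.0}))
p_high_bid = softmax(bid_adjusted_logits(logits, {"ad_a": 3.0}))
```

Because the bonus never decreases as the bid grows and the model weights are untouched, raising `ad_a`'s bid can only raise (or leave unchanged) its selection probability in this toy setting, which mirrors the allocation monotonicity property stated in the abstract.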

28. ❌ Confidence-Based Decoding is Provably Efficient for Diffusion Language Models

Authors: Changxiao Cai, Gen Li | Journal/Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.22248v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

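Scoring rationale: The paper is a theoretical analysis of decoding strategies for diffusion language models (DLMs), which falls within language-model technology. However, the keywords all target large language models (LLMs) and related techniques (such as MoE, SFT, RLHF, and RAG), and DLMs are a different architecture from LLMs (diffusion rather than autoregressive). The paper does not involve LLMs or any specific technique in the keywords, so every keyword has relevance 0.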

!!! tip deepseek-chat TL;DR

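The paper establishes the first theoretical analysis framework for confidence-based decoding in diffusion language models, proving that an entropy-sum-based decoding method achieves efficient sampling acceleration without prior knowledge or hyperparameter tuning.

Abstract (Translation)

Diffusion language models have emerged as a promising alternative to autoregressive models for language modeling, allowing flexible generation order and parallel generation of multiple tokens. However, this flexibility introduces a challenge that autoregressive models do not face: the decoding strategy, which determines the order and number of tokens generated at each iteration, critically affects sampling efficiency. Among decoding strategies explored in practice, confidence-based methods, which adaptively choose which tokens and how many tokens to unmask according to prediction confidence, have shown strong empirical performance. Despite these successes, theoretical understanding of confidence-based decoding remains limited.
In this work, we establish the first theoretical analysis framework for confidence-based decoding in diffusion language models. We focus on an entropy-sum-based strategy that keeps unmasking tokens within each iteration until the cumulative entropy exceeds a threshold. We prove that this strategy achieves $\varepsilon$-accurate sampling in KL divergence with an expected number of iterations $\widetilde O(H(X_0)/\varepsilon)$, where $H(X_0)$ denotes the entropy of the target data distribution. Notably, when the data distribution has low entropy relative to the sequence length, this strategy yields substantial sampling acceleration, while automatically adapting to the intrinsic complexity of the data without prior knowledge or hyperparameter tuning. Overall, our results provide a theoretical foundation for confidence-based decoding and may inform the design of more efficient decoding strategies for diffusion language models.

Abstract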

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models for language modeling, allowing flexible generation order and parallel generation of multiple tokens. However, this flexibility introduces a challenge absent in AR models: the decoding strategy – which determines the order and number of tokens generated at each iteration – critically affects sampling efficiency. Among decoding strategies explored in practice, confidence-based methods, which adaptively select which and how many tokens to unmask based on prediction confidence, have shown strong empirical performance. Despite this success, our theoretical understanding of confidence-based decoding remains limited. In this work, we develop the first theoretical analysis framework for confidence-based decoding in DLMs. We focus on an entropy sum-based strategy that continues unmasking tokens within each iteration until the cumulative entropy exceeds a threshold, and show that it achieves $\varepsilon$-accurate sampling in KL divergence with an expected number of iterations $\widetilde O(H(X_0)/\varepsilon)$, where $H(X_0)$ denotes the entropy of the target data distribution. Notably, this strategy yields substantial sampling acceleration when the data distribution has low entropy relative to the sequence length, while automatically adapting to the intrinsic complexity of data without requiring prior knowledge or hyperparameter tuning. Overall, our results provide a theoretical foundation for confidence-based decoding and may inform the design of more efficient decoding strategies for DLMs.

Keywords: Diffusion Language Models, Decoding Strategy, Confidence-Based Decoding, Sampling Efficiency, Theoretical Analysis, Entropy Sum, KL Divergence, Parallel Generation
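
The entropy-sum rule in the abstract (keep unmasking tokens within an iteration until the cumulative entropy exceeds a threshold) can be sketched for a single iteration as below. The abstract does not say in what order positions are visited or whether the token that crosses the threshold is itself unmasked; this sketch assumes lowest-entropy-first ordering and includes the crossing token. The names `entropy` and `select_positions` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_positions(
    predictions: Dict[int, Sequence[float]], threshold: float
) -> List[int]:
    """Choose which masked positions to unmask in one iteration.

    predictions maps each masked position to the model's predicted token
    distribution there. Positions are visited from most to least confident
    (an assumption), and selection stops once the running entropy sum
    exceeds the threshold.
    """
    ranked = sorted(predictions, key=lambda pos: entropy(predictions[pos]))
    chosen: List[int] = []
    cumulative = 0.0
    for pos in ranked:
        chosen.append(pos)
        cumulative += entropy(predictions[pos])
        if cumulative > threshold:
            break
    return chosen
```

When most positions are nearly deterministic, many of them fit under the threshold in one pass, which matches the abstract's remark that low-entropy data allows substantial sampling acceleration.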

29. ❌ Dyadic: A Scalable Platform for Human-Human and Human-AI Conversation Research

Authors: David M. Markowitz | Journal/Source: arXiv | Published: 2026-03-23 | arXiv link: http://arxiv.org/abs/2603.22227v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

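Scoring rationale: The paper introduces Dyadic, a conversation research platform that supports human-human and human-AI conversation studies, but the paper itself does not involve any technical principle, innovation, or application of large models or deep learning. It is only a research tool platform and contains no technical content on model training, optimization, inference, alignment, compression, interpretability, or scientific applications. All keywords relate to large-model techniques or scientific applications, whereas this paper only presents a platform tool, so every keyword has relevance 0.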

!!! tip deepseek-chat TL;DR

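The paper introduces Dyadic, a modular web platform for studying human-human and human-AI conversations. It addresses the insufficient modularity and adaptability of existing conversation research tools, offers multimodal studies, AI suggestions, live monitoring, and survey deployment, and can be operated without coding.

Abstract (Translation)

Conversation is ubiquitous in social life, but empirical study of this interactive process has long been limited by tools that are insufficiently modular and hard to adapt to researchers' needs. To overcome many constraints in conversation research, this tutorial gives an overview of and introduction to a new tool, Dyadic (https://www.chatdyadic.com/), a web-based platform for studying human-human and human-AI conversations through text or voice chat. Dyadic differs from other platforms by supporting multimodal studies, AI suggestions (for example, in human-human studies the AI can generate reply suggestions for a participant), live monitoring (for example, researchers can evaluate chats between communicators in real time), and survey deployment (for example, Likert-type scales, feeling thermometers, and open-ended text boxes can be sent to participants for in situ evaluation of the interaction), among other key features. No programming background is needed to use Dyadic, and the platform supports integration with existing survey tools.

Abstract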

Conversation is ubiquitous in social life, but the empirical study of this interactive process has been thwarted by tools that are insufficiently modular and unadaptive to researcher needs. To relieve many constraints in conversation research, the current tutorial presents an overview and introduction to a new tool, Dyadic (https://www.chatdyadic.com/), a web-based platform for studying human-human and human-AI conversations using text-based or voice-based chats. Dyadic is distinct from other platforms by offering studies with multiple modalities, AI suggestions (e.g., in human-human studies, AI can suggest responses to a participant), live monitoring (e.g., researchers can evaluate, in real time, chats between communicators), and survey deployment (e.g., Likert-type scales, feeling thermometers, and open-ended text boxes can be sent to humans for in situ evaluations of the interaction), among other consequential features. No coding is required to operate Dyadic directly, and integrations with existing survey platforms are offered.

Keywords: conversation research, human-human conversation, human-AI conversation, Dyadic platform, web-based platform, real-time monitoring, survey deployment, no coding required

30. ❌ Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Authors: Tom Biskupski, Stephan Kleber Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22214v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

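Scoring rationale: The paper's core topic is the reliability of LLM-as-judge and its agreement with human judgment, a direct application of LLM technology, so it is highly relevant to 'Large Language Models' (10 points). The study tests 37 conversational LLMs of different sizes, including GPT-4o and open-source models, and mentions 5 models fine-tuned for the task, which gives some connection to 'Post-training OR Supervised Fine-tuning OR SFT' (5 points). The paper does not cover the specific techniques or applications behind the other keywords, such as MoE, SLMs, Scaling Laws, Instruction Tuning, RAG, inference acceleration, hallucination mitigation, or AI for Science, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper studies how reliable LLMs are as automated judges of other LLMs' outputs and how well they agree with human judgment. Its experiments show that, with a suitable prompt, LLM judges (in particular GPT-4o and some large open-source models) correlate highly with human assessments.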

Abstract

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation of the analysis scales up the complex evaluation of the victim models’ free-form text outputs by faster and more consistent judgments compared to human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models’ use cases. Being a comparably new technique, LLMs as judges lack a thorough investigation for their reliability and agreement to human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As assessment objective, we curate datasets for eight different categories of judgment tasks and the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments, when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geqslant$ 32B parameters, and a few smaller models like Qwen2.5 14B.
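
The agreement between an LLM judge and human ground truth can be illustrated with a chance-corrected statistic. The sketch below computes Cohen's kappa on categorical verdicts. This is an assumption made for illustration: the abstract reports "correlation" without naming the statistic, and the labels shown are hypothetical, not data from the paper.

```python
from collections import Counter


def cohen_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two equally long label sequences."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(judge_labels)
    # Fraction of items on which the judge and the human agree.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on six victim-model outputs (not from the paper).
judge = ["pass", "fail", "pass", "pass", "fail", "pass"]
human = ["pass", "fail", "pass", "fail", "fail", "pass"]
kappa = cohen_kappa(judge, human)
```

A kappa near 1 indicates that the judge tracks the human labels closely, while a value near 0 indicates agreement no better than chance.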

Keywords: Large Language Models, LLM as judge, automated evaluation, reliability assessment, human judgment agreement, prompt engineering, model fine-tuning, empirical evaluation

31. ❌ SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

Authors: Kexian Tang, Jiani Wang, Shaowen Wang, Kaifeng Lyu Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22213v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

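Scoring rationale: The paper's core topic is knowledge injection for LLMs. It proposes SPA, which uses prompt engineering to generate synthetic data for knowledge augmentation, so it is highly relevant to 'Large Language Models' (10 points). It does not cover the specific techniques behind the other keywords, such as MoE, SLMs, Scaling Laws, the various training methods, reasoning techniques, agent systems, or compression, so those keywords score 0. The paper is applied large-model research and fits the stated research scope.

!!! tip deepseek-chat TL;DR

To address LLMs' incomplete knowledge coverage in specialized domains, the paper proposes SPA, a prompt-engineering-based method for generating synthetic data at scale, and shows experimentally that it outperforms existing baselines on knowledge injection.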

Abstract

While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.

Keywords: knowledge injection, synthetic data generation, prompt engineering, data augmentation, large language models, LLMs, scaling, baseline

32. ❌ CayleyPy-4: AI-Holography. Towards analogs of holographic string dualities for AI tasks

Authors: A. Chervov, F. Levkovich-Maslyuk, A. Smolensky, F. Khafizov, I. Kiselev, D. Melnikov, I. Koltsov, S. Kudashev, D. Shiltsov, M. Obozov, S. Krymskii, V. Kirova, E. V. Konstantinova, A. Soibelman, S. Galkin, L. Grunwald, A. Kotov, A. Alexandrov, S. Lytkin, D. Fedoriaka, A. Chevychelov, Z. Kogan, A. Natyrova, L. Cheldieva, O. Nikitina, S. Fironov, A. Vakhrushev, A. Lukyanenko, V. Ilin, D. Gorodkov, N. Bogachev, I. Gaiur, M. Zaitsev, F. Petrov, L. Petrov, T. Gaintseva, A. Gavrilova, M. N. Smirnov, N. Kalinin, A. Khan, K. Jung, H. Mousset, H. Isambert, O. Debeaupuis Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22195v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

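Scoring rationale: The paper studies how the mathematics of graph theory and holographic duality applies to AI tasks, which makes it theoretical, cross-disciplinary work. Although it falls under AI for Science (5 points), its content focuses heavily on graph theory and the physics of holographic duality, and it does not address any specific large-model technique, training method, inference optimization, alignment technique, or agent system covered by the keywords. GPT-style language models and RL systems appear only as analogies, not as the object of study.

!!! tip deepseek-chat TL;DR

The paper proposes a discrete holographic string duality for Cayley graphs, linking the problem of predicting particle trajectories on graphs to a description in terms of discrete strings. It reports evidence that the Cayley graphs of the symmetric group S_n correspond to planar polygons, realizing the 'complexity = volume' paradigm.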

Abstract

This is the fourth paper in the CayleyPy project, which applies AI methods to the exploration of large graphs. In this work, we suggest the existence of a new discrete version of holographic string dualities for this setup, and discuss their relevance to AI systems and mathematics. Many modern AI tasks – such as those addressed by GPT-style language models or RL systems – can be viewed as direct analogues of predicting particle trajectories on graphs. We investigate this problem for a large family of Cayley graphs, for which we show that surprisingly it admits a dual description in terms of discrete strings. We hypothesize that such dualities may extend to a range of AI systems where they can lead to more efficient computational approaches. In particular, string holographic images of states are proposed as natural candidates for data embeddings, motivated by the “complexity = volume” principle in AdS/CFT. For Cayley graphs of the symmetric group S_n, our results indicate that the corresponding dual objects are flat, planar polygons. The diameter of the graph is equal to the number of integer points inside the polygon scaled by n. Vertices of the graph can be mapped holographically to paths inside the polygon, and the usual graph distances correspond to the area under the paths, thus directly realising the “complexity = volume” paradigm. We also find evidence for continuous CFTs and dual strings in the large n limit. We confirm this picture and other aspects of the duality in a large initial set of examples. We also present new datasets (obtained by a combination of ML and conventional tools) which should be instrumental in establishing the duality for more general cases.
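
The graph-side quantities mentioned in the abstract (graph distance and diameter) can be computed by brute force for small n. The sketch below runs a breadth-first search on the Cayley graph of S_n generated by adjacent transpositions. That generating set is an assumption chosen for illustration, since the abstract does not say which generators the paper uses, and the code is not taken from the CayleyPy library.

```python
from collections import deque


def cayley_distances(n):
    """BFS distances from the identity in the Cayley graph of S_n.

    Generators are the adjacent transpositions (i, i+1); this choice is an
    assumption for illustration, not necessarily the paper's setup.
    """
    identity = tuple(range(n))
    dist = {identity: 0}
    queue = deque([identity])
    while queue:
        perm = queue.popleft()
        for i in range(n - 1):
            neighbor = list(perm)
            neighbor[i], neighbor[i + 1] = neighbor[i + 1], neighbor[i]
            neighbor = tuple(neighbor)
            if neighbor not in dist:
                dist[neighbor] = dist[perm] + 1
                queue.append(neighbor)
    return dist


dist = cayley_distances(4)
# Cayley graphs are vertex-transitive, so the largest distance from the
# identity equals the diameter of the graph.
diameter = max(dist.values())
```

For this generating set the distance from the identity equals the number of inversions of the permutation, so the diameter is n(n-1)/2. Brute-force search of this kind stops being feasible quickly as n grows, because the graph has n! vertices.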

Keywords: Cayley graphs, holographic dualities, string theory, AI for science, complexity equals volume, symmetric group, graph embeddings, discrete strings

33. ❌ Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

Authors: Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22187v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

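Scoring rationale: The paper's core topic is the use of multimodal large language models (MLLMs) for layout generation, and it proposes the VFLM framework, which refines layouts iteratively using visual feedback. It is highly relevant to 'Large Language Models' (10 points) because MLLMs extend LLMs; somewhat relevant to 'RLHF' (5 points) because it trains with reinforcement learning; and highly relevant to 'Self-Correction' (10 points) because the framework implements self-reflection and iterative improvement. Other keywords, such as MoE, SLMs, and Scaling Laws, have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the fact that multimodal large language models ignore the rendered visual result during layout generation, the paper proposes VFLM, an iterative refinement framework driven by visual feedback and trained with reinforcement learning, which significantly improves layout readability and aesthetics.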

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. It is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model’s iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.
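
The generate, render, and reflect cycle described in the abstract can be sketched as a plain control loop. Everything named below is hypothetical: `generate` and `render_and_check` stand in for the model and the rendering-plus-scoring step, and the threshold and round limit are illustrative values. The sketch covers only inference-time refinement and does not show the reinforcement learning that rewards the final outcome.

```python
from typing import Callable, List, Tuple


def refine_with_visual_feedback(
    generate: Callable[[str, List[str]], str],
    render_and_check: Callable[[str], Tuple[float, str]],
    prompt: str,
    quality_threshold: float = 0.9,  # hypothetical stopping threshold
    max_rounds: int = 4,             # hypothetical round limit
) -> Tuple[str, int]:
    """Regenerate a layout until its rendered result scores well enough."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback_history: List[str] = []
    layout = generate(prompt, feedback_history)
    for round_idx in range(1, max_rounds + 1):
        score, feedback = render_and_check(layout)
        if score >= quality_threshold or round_idx == max_rounds:
            break
        # Feed the visual critique back into the next generation step.
        feedback_history.append(feedback)
        layout = generate(prompt, feedback_history)
    return layout, round_idx


# Toy stand-ins: each round of feedback shrinks the font by 12 px.
def toy_generate(prompt: str, history: List[str]) -> str:
    return f"font-size:{48 - 12 * len(history)}"


def toy_render_and_check(layout: str) -> Tuple[float, str]:
    size = int(layout.split(":")[1])
    if size > 24:
        return 0.5, "text overflows the canvas"
    return 1.0, ""


final_layout, rounds_used = refine_with_visual_feedback(
    toy_generate, toy_render_and_check, "poster title"
)
```

With the toy functions the loop stops once the font size reaches 24 px. In a real system the check would render the layout code to an image and score it, for example with an OCR-based readability measure.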

Keywords: Multimodal Large Language Models, Layout Generation, Visual Feedback, Iterative Refinement, Reinforcement Learning, Self-improving Framework, OCR Accuracy, Design-oriented MLLMs

34. ❌ Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation

Authors: Ireh Kim, Tesia Sker, Chanwoo Kim Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22186v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

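Scoring rationale: The paper's core topic is LLMs for document-level machine translation, and it directly involves LLMs and SFT (a two-stage fine-tuning strategy), so both keywords score 10. Its data-quality filtering (sacreBLEU, COMET, and so on) gives some connection to 'Scaling Laws AND Data Quality' (5 points). Document-level translation requires handling long context, which relates to 'Context Window Extension' (5 points). The paper explicitly targets LLM hallucination, so it is highly relevant to 'Hallucination Mitigation' (8 points). Other keywords, such as MoE, SLMs, quantization, and inference acceleration, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

To address the data scarcity and hallucination problems that LLMs face in document-level machine translation, the paper proposes a strategy that combines LLM-augmented data generation, multi-metric filtering, and two-stage fine-tuning to improve translation quality.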

Abstract

In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it using multiple metrics, leveraging sacreBLEU, COMET, and LaBSE-based cosine similarity, to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
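
The multi-metric filtering step can be sketched as a conjunction of thresholds over precomputed scores. The abstract names the three metrics but not how they are combined, so the all-thresholds-must-pass rule, the threshold values, and the `ScoredPair` structure below are assumptions for illustration. The scores are taken as given here, and no metric library is called.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredPair:
    source: str
    target: str
    bleu: float          # precomputed sacreBLEU score, on a 0-100 scale
    comet: float         # precomputed COMET score
    labse_cosine: float  # precomputed cosine similarity of LaBSE embeddings


def filter_pairs(
    pairs: List[ScoredPair],
    min_bleu: float = 20.0,    # hypothetical threshold
    min_comet: float = 0.75,   # hypothetical threshold
    min_cosine: float = 0.80,  # hypothetical threshold
) -> List[ScoredPair]:
    """Keep only pairs that clear every metric threshold."""
    return [
        p
        for p in pairs
        if p.bleu >= min_bleu
        and p.comet >= min_comet
        and p.labse_cosine >= min_cosine
    ]


# Hypothetical synthetic document pairs with made-up scores.
candidates = [
    ScoredPair("doc A", "translation A", bleu=35.2, comet=0.84, labse_cosine=0.91),
    ScoredPair("doc B", "translation B", bleu=12.0, comet=0.81, labse_cosine=0.88),
    ScoredPair("doc C", "translation C", bleu=28.5, comet=0.79, labse_cosine=0.62),
]
kept = filter_pairs(candidates)
```

Requiring every metric to pass is the stricter choice: a pair with a high surface overlap but a low embedding similarity, which may indicate an omission or a hallucinated passage, is dropped.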

Keywords: Large Language Models, Document-Level Machine Translation, Synthetic Corpora, Two-Stage Fine-tuning, Hallucination Mitigation, Data Quality Filtering, LLM Adaptation, Parallel Data Augmentation

35. ❌ Calibeating Made Simple

Authors: Yurong Chen, Zhiyi Huang, Michael I. Jordan, Haipeng Luo Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22167v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

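Scoring rationale: The paper studies calibeating in online learning: post-processing external forecasts to minimize cumulative loss and match an informativeness-based benchmark. Its content belongs entirely to the theory of online learning, forecasting, and optimization algorithms, and has no direct connection to any of the provided keywords on large models, deep learning, or AI applications. It does not involve language models, model training, inference optimization, AI agents, or AI for science.

!!! tip deepseek-chat TL;DR

The paper studies calibeating in online learning. By reducing it to existing online learning techniques, it obtains results for general proper losses, including new optimal calibeating rates for mixable and general bounded losses, and, for binary predictions, gives the first algorithm that is calibrated while also achieving the optimal calibeating rate.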

Abstract

We study calibeating, the problem of post-processing external forecasts online to minimize cumulative losses and match an informativeness-based benchmark. Unlike prior work, which analyzed calibeating for specific losses with specific arguments, we reduce calibeating to existing online learning techniques and obtain results for general proper losses. More concretely, we first show that calibeating is minimax-equivalent to regret minimization. This recovers the $O(\log T)$ calibeating rate of Foster and Hart [FH23] for the Brier and log losses and its optimality, and yields new optimal calibeating rates for mixable losses and general bounded losses. Second, we prove that multi-calibeating is minimax-equivalent to the combination of calibeating and the classical expert problem. This yields new optimal multi-calibeating rates for mixable losses, including Brier and log losses, and general bounded losses. Finally, we obtain new bounds for achieving calibeating and calibration simultaneously for the Brier loss. For binary predictions, our result gives the first calibrated algorithm that at the same time also achieves the optimal $O(\log T)$ calibeating rate.
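
For the Brier loss, the benchmark can be made concrete with the standard decomposition of the Brier score into a refinement term and a calibration term. The sketch below assumes that the informativeness-based benchmark is the refinement term, which is how Foster and Hart's formulation is usually presented; the abstract itself does not define it. It is an offline illustration on hypothetical data, not the online algorithm from the paper.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Return (brier, refinement, calibration), with brier = refinement + calibration.

    Outcomes are grouped by the distinct forecast value that preceded them.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("inputs must be non-empty and equal in length")
    total = len(forecasts)
    bins = defaultdict(list)
    for c, a in zip(forecasts, outcomes):
        bins[c].append(a)
    # Average outcome observed whenever a given forecast value was issued.
    bin_mean = {c: sum(values) / len(values) for c, values in bins.items()}
    brier = sum((c - a) ** 2 for c, a in zip(forecasts, outcomes)) / total
    # Calibration: how far each forecast sits from its bin's average outcome.
    calibration = sum((c - bin_mean[c]) ** 2 for c in forecasts) / total
    # Refinement: outcome variance remaining inside each forecast bin.
    refinement = sum(
        (a - bin_mean[c]) ** 2 for c, a in zip(forecasts, outcomes)
    ) / total
    return brier, refinement, calibration


# Hypothetical binary outcomes and an overconfident forecaster.
forecasts = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2]
outcomes = [1, 1, 0, 0, 0, 0]
brier, refinement, calibration = brier_decomposition(forecasts, outcomes)
```

Under this reading, a calibeating procedure aims for a loss close to the refinement term, so it improves on the original forecaster by roughly that forecaster's calibration error.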

Keywords: calibeating, online learning, regret minimization, proper losses, multi-calibeating, calibration, Brier loss, log loss

36. ❌ MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management

Authors: Jack W O’Sullivan, Mohammad Asadi, Lennart Elbe, Akshay Chaudhari, Tahoura Nedaee, Francois Haddad, Michael Salerno, Li Fe-Fei, Ehsan Adeli, Rima Arnaout, Euan A Ashley Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22179v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

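Scoring rationale: MARCUS is a multimodal agentic system for cardiac diagnosis. Its core contributions are: (1) a hierarchical agentic architecture (LLM Agents, Multi-agent Systems) comprising modality-specific vision-language expert models and a multimodal orchestrator; (2) visual encoder training and language model optimization specific to the medical domain (AI for Science; Pre-training, Post-training); (3) mitigation of hallucination (Hallucination Mitigation) and improved reasoning (Chain of Thought, System 2 Thinking); and (4) the use of large-scale domain data (Scaling Laws AND Data Quality). The paper does not cover small models, reinforcement-learning alignment, parameter-efficient fine-tuning, retrieval augmentation, context extension, inference acceleration, or model compression, so the related keywords score low or 0.

!!! tip deepseek-chat TL;DR

The paper presents MARCUS, a multimodal agentic system for end-to-end interpretation of electrocardiograms, echocardiograms, and cardiac magnetic resonance imaging. Using a hierarchical agentic architecture and domain-specific visual encoders, it significantly outperforms frontier models on several cardiac diagnosis tasks and mitigates hallucination.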

Abstract

Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.

Keywords: agentic vision-language system, multimodal cardiac diagnosis, hierarchical agentic architecture, domain-specific visual encoders, hallucination mitigation, cardiac imaging interpretation, multimodal orchestrator, state-of-the-art performance

37. ❌ Multimodal Survival Analysis with Locally Deployable Large Language Models

Authors: Moritz Gögl, Christopher Yau Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22158v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

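Scoring rationale: The paper centers on applying large models to biomedicine, specifically multimodal survival analysis with lightweight, locally deployable LLMs, so it is highly relevant to 'Large Language Models' (10), 'Small Language Models/On-device AI' (10), and 'AI for Science/Bioinformatics' (10). It mentions avoiding hallucination and miscalibration, which gives some relevance to 'Hallucination Mitigation' (5). Other keywords, such as MoE, Scaling Laws, training methods, inference optimization, and agents, are not addressed and score 0.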
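
The score table above lists a weight of 0.0 for every keyword, which would explain a total of 0.0 even where relevance is 10. The following is a minimal sketch of that arithmetic. It assumes that a keyword's score is its weight times its relevance, capped at the per-keyword maximum; the report does not state this formula, so treat it as a guess. Only the passing-score formula (5.0 + 0.8 × total weight) and the cap of 15 come from the report's configuration section. The function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold as given in the report configuration."""
    return base + factor * total_weight


def total_score(rows, max_per_keyword=15.0):
    """Sum of per-keyword scores.

    ASSUMPTION: score = weight * relevance, capped at max_per_keyword.
    rows is a list of (weight, relevance) pairs.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


# The four non-zero relevance rows of this entry, with the weights as listed (0.0).
rows_as_listed = [(0.0, 10.0), (0.0, 10.0), (0.0, 5.0), (0.0, 10.0)]

threshold = passing_score(27.0)      # 27 keywords, configured weight 1.0 each
score = total_score(rows_as_listed)  # zero weights give a zero total

print(f"score {score:.1f} / threshold {threshold:.1f}")
```

Under this assumed formula, the same four rows with the configured weight of 1.0 would sum to 35.0, which is above the threshold, so the zero weights rather than the relevance values appear to drive the result.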

!!! tip deepseek-chat TL;DR

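The paper studies multimodal survival analysis with locally deployable large language models, integrating clinical text, tabular covariates, and genomic data. Using teacher-student distillation and multimodal fusion, it outperforms baselines on a TCGA cohort while avoiding cloud dependence and privacy concerns, and it reduces the hallucination and miscalibration that can occur in base LLMs.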

Abstract

We study multimodal survival analysis integrating clinical text, tabular covariates, and genomic profiles using locally deployable large language models (LLMs). As many institutions face tight computational and privacy constraints, this setting motivates the use of lightweight, on-premises models. Our approach jointly estimates calibrated survival probabilities and generates concise, evidence-grounded prognosis text via teacher-student distillation and principled multimodal fusion. On a TCGA cohort, it outperforms standard baselines, avoids reliance on cloud services and associated privacy concerns, and reduces the risk of hallucinated or miscalibrated estimates that can be observed in base LLMs.

Keywords: multimodal survival analysis, locally deployable LLMs, clinical text, genomic profiles, teacher-student distillation, privacy constraints, hallucination mitigation, TCGA cohort

Authors: Kejia Liu, Haoyang Zhou, Ruoyu Xu, Peicheng Wang, Mingli Song, Haofei Zhang Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22153v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

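Scoring rationale: The paper is a computer-vision study of UAV navigation. It investigates cross-view geo-localization (CVGL) using visual feature matching and spatial-relationship encoding. All scoring keywords concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), whereas this paper involves no language model, text processing, or LLM-related technique. It is purely visual navigation research, so every keyword has a relevance of 0.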

!!! tip deepseek-chat TL;DR

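The paper proposes Bearing-UAV, a purely vision-driven cross-view UAV navigation method that jointly predicts absolute location and heading. It targets the limitations of existing matching methods in accuracy, storage overhead, and neglect of heading, and achieves lower localization error on a multi-city benchmark.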

Abstract

Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV’s heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting their generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitly encodes relative spatial relationships, making it robust to cross-view variations, misalignment, and feature-sparse conditions. We also present Bearing-UAV-90k, a multi-city benchmark for evaluating cross-view localization and navigation. Extensive experiments show encouraging results: Bearing-UAV yields lower localization error than previous matching/retrieval paradigms across diverse terrains. Our code and dataset will be made publicly available.

Keywords: cross-view geo-localization, UAV navigation, vision-driven, bearing prediction, feature matching, GNSS-denied environments, multi-city benchmark, localization error

39. ❌ More Isn’t Always Better: Balancing Decision Accuracy and Conformity Pressures in Multi-AI Advice

Authors: Yuta Tsuchiya, Yukino Baba Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22152v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

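Scoring rationale: The paper studies how consulting multiple AI systems affects human decision-making; the experiments vary AI panel size, consensus level, and presentation style. It is related only to 'Multi-agent Systems OR Agent Coordination' (8), because it examines multiple AI systems jointly giving advice, which touches on coordination among AI agents. No other keyword is addressed: the paper does not discuss large-model internals, training methods, inference optimization, scientific applications, or similar technical content.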

!!! tip deepseek-chat TL;DR

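The study examines how multi-AI consultation affects the accuracy of human decisions. Small AI panels improve accuracy, high consensus leads to overreliance, a single dissent reduces conformity pressure, and human-like presentation increases perceived usefulness without raising conformity pressure.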

Abstract

Just as people improve decision-making by consulting diverse human advisors, they can now also consult with multiple AI systems. Prior work on group decision-making shows that advice aggregation creates pressure to conform, leading to overreliance. However, the conditions under which multi-AI consultation improves or undermines human decision-making remain unclear. We conducted experiments with three tasks in which participants received advice from panels of AIs. We varied panel size, within-panel consensus, and the human-likeness of presentation. Accuracy improved for small panels relative to a single AI; larger panels yielded no gains. The level of within-panel consensus affected participants’ reliance on AI advice: High consensus fostered overreliance; a single dissent reduced pressure to conform; wide disagreement created confusion and undermined appropriate reliance. Human-like presentations increased perceived usefulness and agency in certain tasks, without raising conformity pressure. These findings yield design implications for presenting multi-AI advice that preserve accuracy while mitigating conformity.

Keywords: multi-AI consultation, decision-making, conformity pressure, panel size, within-panel consensus, human-likeness presentation, overreliance, advice aggregation

40. ❌ Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding

Authors: Yunzhuo Sun, Xinyue Liu, Yanyang Li, Nanding Wu, Yifang Xu, Linlin Zong, Xianchao Zhang, Wenxin Liang Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22121v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

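Scoring rationale: The paper proposes a two-stage framework for video moment retrieval whose first stage uses an LLM to guide subtitle matching and generate auxiliary short videos as temporal priors, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8). It does not address the other keywords, such as MoE, SLMs, Scaling Laws, training techniques, inference optimization, agent systems, model compression, or scientific AI applications, so these score 0.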

!!! tip deepseek-chat TL;DR

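The paper proposes Mamba-VMR, a two-stage framework that augments multimodal queries with LLM-guided generation of short videos to address precise temporal grounding in long video sequences. It reports significant improvements over existing methods on the TVR benchmark.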

Abstract

Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Existing approaches fail to integrate subtitle contexts and generated temporal priors effectively; we therefore propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, which are fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba network, extending text-controlled selection with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.

Keywords: video moment retrieval, multimodal query augmentation, LLM-guided subtitle matching, text-to-video generation, Mamba network, temporal grounding, long-sequence grounding, computational efficiency

41. ❌ On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

Authors: Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, Jingren Zhou Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22117v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

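Scoring rationale: The paper studies how RLVR (reinforcement learning with verifiable rewards) improves the reasoning of large language models. It is highly relevant (10) to 'Large Language Models', 'RLHF/RLAIF/DPO' (RLVR is treated here as a reinforcement-learning alignment method), 'Chain of Thought/CoT Reasoning' (the paper focuses on reasoning ability), and 'System 2 Thinking/Slow Thinking' (it involves analysis of in-depth reasoning). Other keywords such as MoE, SLMs, RAG, and quantization are not addressed and score 0.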

!!! tip deepseek-chat TL;DR

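The paper finds that the direction of RLVR (reinforcement learning with verifiable rewards) updates is a more informative lens than their magnitude for understanding how RLVR improves LLM reasoning. Based on update direction, it proposes a test-time extrapolation method and a training-time reweighting method to improve reasoning accuracy.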

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR’s effects, which can be captured by the signed, token-level log probability difference $\Delta\log p$ between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that $\Delta\log p$ more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned $\Delta\log p$ direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability (corresponding to higher $\Delta\log p$) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.

Keywords: Reinforcement Learning with Verifiable Rewards, RLVR, Large Language Models, Reasoning, Update Direction, Token-level Log Probability, Test-time Extrapolation, Training-time Reweighting
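
The signed, token-level log-probability difference described in the abstract can be illustrated with a toy example. The sketch below computes it for one token from two sets of next-token logits, then amplifies the RLVR distribution along that direction. The extrapolation form used here (log p_rlvr + alpha × (log p_rlvr − log p_base), renormalized) is an assumption made for illustration; the abstract does not give the paper's exact formula. The function names and the logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def delta_log_p(base_logits, rlvr_logits, token_id):
    """Signed log-probability change of one token: log p_rlvr - log p_base."""
    return log_softmax(rlvr_logits)[token_id] - log_softmax(base_logits)[token_id]


def extrapolate(base_logits, rlvr_logits, alpha):
    """ASSUMED form: log p_rlvr + alpha * (log p_rlvr - log p_base), renormalized."""
    base_lp = log_softmax(base_logits)
    rlvr_lp = log_softmax(rlvr_logits)
    return log_softmax([r + alpha * (r - b) for r, b in zip(rlvr_lp, base_lp)])


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
base_logits = [0.0, 0.0, 0.0]
rlvr_logits = [1.0, 0.0, 0.0]

shift = delta_log_p(base_logits, rlvr_logits, token_id=0)
p_rlvr = math.exp(log_softmax(rlvr_logits)[0])
p_extra = math.exp(extrapolate(base_logits, rlvr_logits, alpha=1.0)[0])

print(f"delta log p = {shift:.3f}, p_rlvr = {p_rlvr:.3f}, p_extrapolated = {p_extra:.3f}")
```

With `alpha=0.0` the function returns the RLVR distribution unchanged. Positive values push probability further toward tokens whose log probability rose during RLVR training.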

42. ❌ GSEM: Graph-based Self-Evolving Memory for Experience Augmented Clinical Reasoning

Authors: Xiao Han, Yuzheng Fan, Sendong Zhao, Haochun Wang, Bing Qin Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22096v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

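Scoring rationale: The paper proposes the GSEM framework to strengthen the memory system of clinical decision-making agents; its core is the application of LLMs to clinical reasoning (highly relevant). It relates to RAG because it organizes experiences in a graph structure for retrieval and reuse. It involves clinical reasoning (CoT Reasoning/System 2 Thinking) and agents (LLM Agents), and it supports online feedback-driven self-calibration (Self-Correction). It falls under AI for Science (bioinformatics/clinical AI applications). Other keywords such as MoE, SFT, and quantization are not mentioned in the abstract and score 0.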

!!! tip deepseek-chat TL;DR

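The paper addresses the problem that experience stores in clinical decision-making agents lack structured relations, which causes noisy retrieval and degraded performance. It proposes the GSEM framework, which organizes clinical experiences in a dual-layer memory graph and supports applicability-aware retrieval and online feedback-driven calibration, and it achieves the highest average accuracy among all baselines across the benchmarks tested.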

摘要翻译

临床决策智能体能够从复用既往决策经验中获益。然而，许多基于记忆增强的方法将经验存储为独立的记录，缺乏显式的关系结构，这可能导致检索噪声、不可靠的经验复用，甚至在部分场景下性能劣于直接使用大语言模型进行推理。我们提出GSEM（基于图的自演进记忆），这是一种临床记忆框架，它将临床经验组织成双层记忆图，既能捕捉单次经验内部的决策结构，也能捕获跨经验之间的关联依赖，并支持基于适用性感知的检索以及基于在线反馈的节点质量与边权重校准。在MedR-Bench和MedAgentsBench基准测试中，使用两种大语言模型骨干网络进行实验，GSEM在所有基线方法中取得了最高的平均准确率，分别在使用DeepSeek-V3.2和Qwen3.5-35B时达到70.90%和69.24%。代码发布于https://github.com/xhan1022/gsem。

摘要 (Abstract)

Clinical decision-making agents can benefit from reusing prior decision experience. However, many memory-augmented methods store experiences as independent records without explicit relational structure, which may introduce noisy retrieval, unreliable reuse, and in some cases even hurt performance compared to direct LLM inference. We propose GSEM (Graph-based Self-Evolving Memory), a clinical memory framework that organizes clinical experiences into a dual-layer memory graph, capturing both the decision structure within each experience and the relational dependencies across experiences, and supporting applicability-aware retrieval and online feedback-driven calibration of node quality and edge weights. Across MedR-Bench and MedAgentsBench with two LLM backbones, GSEM achieves the highest average accuracy among all baselines, reaching 70.90% and 69.24% with DeepSeek-V3.2 and Qwen3.5-35B, respectively. Code is available at https://github.com/xhan1022/gsem.

Keywords: Clinical decision-making agents, Graph-based memory, Self-evolving memory, Experience retrieval, Clinical reasoning, LLM inference, Dual-layer memory graph, Online feedback calibration
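
To make the idea of feedback-calibrated node quality and edge weights concrete, here is a minimal sketch of a memory graph over experiences. It is not the GSEM implementation. It models only the cross-experience layer, and both the update rule (an exponential moving average toward 1 on success and 0 on failure) and the retrieval score (edge weight times node quality) are assumptions chosen for illustration. All class and method names are hypothetical.

```python
from itertools import combinations


class MemoryGraph:
    """Toy experience graph: nodes carry a quality score, edges carry a weight."""

    def __init__(self, learning_rate=0.2):
        self.quality = {}  # experience id -> quality in [0, 1]
        self.edges = {}    # (id_a, id_b) with id_a < id_b -> weight in [0, 1]
        self.learning_rate = learning_rate

    def add_experience(self, exp_id, quality=0.5):
        self.quality[exp_id] = quality

    def link(self, a, b, weight=0.5):
        self.edges[tuple(sorted((a, b)))] = weight

    def feedback(self, used_ids, success):
        """ASSUMED rule: move used nodes and their shared edges toward 1 or 0."""
        target = 1.0 if success else 0.0
        for exp_id in used_ids:
            self.quality[exp_id] += self.learning_rate * (target - self.quality[exp_id])
        for a, b in combinations(used_ids, 2):
            key = tuple(sorted((a, b)))
            if key in self.edges:
                self.edges[key] += self.learning_rate * (target - self.edges[key])

    def retrieve(self, seed_id, k):
        """Rank neighbors of seed_id by edge weight times neighbor quality."""
        scored = []
        for (a, b), weight in self.edges.items():
            if seed_id in (a, b):
                other = b if a == seed_id else a
                scored.append((weight * self.quality[other], other))
        scored.sort(reverse=True)
        return [exp_id for _, exp_id in scored[:k]]


graph = MemoryGraph()
for exp_id in ("e1", "e2", "e3"):
    graph.add_experience(exp_id)
graph.link("e1", "e2")
graph.link("e1", "e3")

graph.feedback(["e1", "e2"], success=True)   # reusing e1 with e2 helped
graph.feedback(["e1", "e3"], success=False)  # reusing e1 with e3 hurt

print(graph.retrieve("e1", k=2))
```

After the two feedback calls, the neighbor that was involved in the successful reuse ranks ahead of the one involved in the failed reuse.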

43. ❌ SpecTM: Spectral Targeted Masking for Trustworthy Foundation Models

Authors: Syed Usama Imtiaz, Mitra Nasr Azadani, Nasrin Alamdari Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22097v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

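Scoring rationale: The paper studies foundation models for Earth observation (EO), which falls under AI for Science, so the related keywords score high. Its main contribution is a new physics-informed masking method for the pre-training stage, so 'Pre-training' scores high. It is concerned with model trustworthiness and interpretability, which relates to Hallucination Mitigation and Explainable AI, but these are not central, so each scores 5. Other keywords such as MoE, SFT, and RAG are not addressed and score 0.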

!!! tip deepseek-chat TL;DR

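To address the trust problem caused by Earth observation foundation models lacking physics constraints, the paper proposes SpecTM, a physics-informed masking method. Using a multi-task self-supervised learning framework, it substantially improves the accuracy and label efficiency of microcystin concentration prediction.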

摘要翻译

当前，面向地球观测（EO）的基础模型正日益增多，但这些模型通常依赖于随机掩码技术，未能显式地融入物理约束——这一可信度缺陷在指导公共卫生决策的预测模型中尤为关键。本研究提出了一种基于物理信息的掩码设计方法——SpecTM（光谱定向掩码），该方法在预训练过程中通过跨光谱上下文信息，有目的地促进目标波段的重建。为实现这一目标，我们开发了一个灵活的多任务（波段重建、生物光学指数推断及8天超前时序预测）自监督学习（SSL）框架，该框架通过联合优化编码光谱本质表征，并利用NASA PACE卫星在伊利湖上空获取的高光谱影像，在下游微囊藻毒素浓度回归模型上进行了评估。SpecTM模型在当周预测中达到R^2 = 0.695，在8天超前预测中达到R^2 = 0.620，分别超越所有基线模型（较Ridge回归的0.51提升34%，较SVR的0.31提升99%）。消融实验表明，定向掩码相比随机掩码使预测性能提升了0.037 R^2。此外，在极端数据稀缺条件下，SpecTM以2.2倍的标签效率优势超越了强基线模型。SpecTM实现了跨地球观测领域的物理信息表征学习，并提升了基础模型的可解释性。

摘要 (Abstract)

Foundation models are now increasingly being developed for Earth observation (EO), yet they often rely on stochastic masking that does not explicitly enforce physics constraints, a critical trustworthiness limitation, in particular for predictive models that guide public health decisions. In this work, we propose SpecTM (Spectral Targeted Masking), a physics-informed masking design that encourages the reconstruction of targeted bands from cross-spectral context during pretraining. To achieve this, we developed an adaptable multi-task (band reconstruction, bio-optical index inference, and 8-day-ahead temporal prediction) self-supervised learning (SSL) framework that encodes spectrally intrinsic representations via joint optimization, and evaluated it on a downstream microcystin concentration regression model using NASA PACE hyperspectral imagery over Lake Erie. SpecTM achieves R^2 = 0.695 (current week) and R^2 = 0.620 (8-day-ahead) predictions, surpassing all baseline models by +34% (Ridge, 0.51) and +99% (SVR, 0.31), respectively. Our ablation experiments show targeted masking improves predictions by +0.037 R^2 over random masking. Furthermore, it outperforms strong baselines with 2.2x superior label efficiency under extreme scarcity. SpecTM enables physics-informed representation learning across EO domains and improves the interpretability of foundation models.

Keywords: Foundation Models, Earth Observation, Physics-informed Masking, Self-supervised Learning, Hyperspectral Imagery, Trustworthiness, Representation Learning, Microcystin Prediction
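
The contrast between targeted and random masking in the abstract can be sketched on a single toy spectrum. The example below is only an illustration of the general idea: a targeted mask hides chosen bands so that they must be reconstructed from the remaining bands, while a random mask hides an arbitrary subset. The band indices, reflectance values, and function names are hypothetical, and nothing here reflects how SpecTM actually selects its target bands.

```python
import random


def targeted_mask(num_bands, target_bands):
    """True marks a band that is hidden and must be reconstructed."""
    targets = set(target_bands)
    return [band in targets for band in range(num_bands)]


def random_mask(num_bands, num_masked, rng):
    """Hide num_masked bands chosen uniformly at random."""
    masked = set(rng.sample(range(num_bands), num_masked))
    return [band in masked for band in range(num_bands)]


def apply_mask(spectrum, mask, fill=0.0):
    """Return (visible input with hidden bands filled, reconstruction targets)."""
    visible = [fill if hidden else value for value, hidden in zip(spectrum, mask)]
    targets = [value for value, hidden in zip(spectrum, mask) if hidden]
    return visible, targets


# One pixel with six made-up band reflectances.
spectrum = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# Hypothetical choice: bands 2 and 3 are the ones of physical interest.
t_mask = targeted_mask(len(spectrum), target_bands=[2, 3])
visible, targets = apply_mask(spectrum, t_mask)

r_mask = random_mask(len(spectrum), num_masked=2, rng=random.Random(0))

print(visible, targets, r_mask)
```

A reconstruction model would receive `visible` as input and be trained to predict `targets`; the two masking schemes differ only in which bands end up hidden.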

44. ❌ On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration

Authors: Valentin Petrov Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22061v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

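Scoring rationale: The paper studies methods for removing refusal behavior from instruction-tuned language models, specifically how the construction of the contrast baseline affects refusal-direction extraction. It explicitly involves 'Large Language Models' (it uses the Qwen 3.5 2B model) and 'Instruction Tuning' (it studies refusal removal in instruction-tuned models); both keywords are highly relevant. Other keywords such as MoE, SLMs, Scaling Laws, RAG, and RLHF are not addressed in the paper and score 0.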

!!! tip deepseek-chat TL;DR

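The paper finds that, when removing refusal behavior from instruction-tuned language models, topic-matched contrast baselines fail to extract functional refusal directions, whereas unmatched contrast baselines achieve complete refusal elimination, because topic matching cancels the shared activation component.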

摘要翻译

鉴于通过定向消融从指令微调语言模型中移除拒绝行为需从残差流激活空间中提取拒绝介导方向，且现有文献将有害提示激活所对照的对比基线构建视为实现细节而非方法论问题，本研究探讨主题匹配的对比基线是否能产生更优的拒绝方向。研究基于Qwen~3.5 2B模型展开，采用逐类别匹配提示对、按类别自组织映射提取及奇异值分解正交化方法。实验发现：在任意测试层及任意权重水平上，主题匹配对比均未能产生功能性拒绝方向；而使用相同模型、相同提取代码与相同评估协议时，非匹配对比在六个层级上实现了完全拒绝消除。对失效案例的几何分析表明，主题匹配的差分操作抵消了同主题有害与无害提示间共享的主导激活成分，导致提取方向幅度低于权重矩阵投影扰动残差流的阈值。本文进一步讨论了该发现对消融研究中对比基线设计的启示。

摘要 (Abstract)

Inasmuch as the removal of refusal behavior from instruction-tuned language models by directional abliteration requires the extraction of refusal-mediating directions from the residual stream activation space, and inasmuch as the construction of the contrast baseline against which harmful prompt activations are compared has been treated in the existing literature as an implementation detail rather than a methodological concern, the present work investigates whether a topically matched contrast baseline yields superior refusal directions. The investigation is carried out on the Qwen 3.5 2B model using per-category matched prompt pairs, per-class Self-Organizing Map extraction, and Singular Value Decomposition orthogonalization. It was found that topic-matched contrast produces no functional refusal directions at any tested weight level on any tested layer, while unmatched contrast on the same model, same extraction code, and same evaluation protocol achieves complete refusal elimination on six layers. The geometric analysis of the failure establishes that topic-matched subtraction cancels the dominant activation component shared between harmful and harmless prompts of the same subject, reducing the extracted direction magnitude below the threshold at which weight-matrix projection perturbs the residual stream. The implications for the design of contrast baselines in abliteration research are discussed.

Keywords: instruction-tuned language models, refusal behavior, directional abliteration, contrast baseline, activation space, Qwen 3.5 2B, refusal elimination, geometric analysis
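
The cancellation effect described in the abstract can be illustrated with a toy sketch. This is not the paper's pipeline: it swaps in a plain difference of mean activations in place of the Self-Organizing Map extraction and SVD orthogonalization, and the 3-dimensional "activations", the axis meanings, and all function names are hypothetical values chosen for illustration only.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in v))

def contrast_direction(harmful, baseline):
    """Difference of mean activations (harmful minus baseline)."""
    return [h - b for h, b in zip(mean(harmful), mean(baseline))]

def ablate(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    n = norm(direction)
    unit = [x / n for x in direction]
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - coeff * u for a, u in zip(activation, unit)]

# Hypothetical toy activations: axis 0 = topic A, axis 1 = topic B,
# axis 2 = a small "refusal" component.
harmful = [[5.0, 0.0, 0.5], [5.2, 0.0, 0.5]]    # topic A plus refusal
matched = [[5.0, 0.0, 0.0], [5.2, 0.0, 0.0]]    # same topic, harmless
unmatched = [[0.0, 4.0, 0.0], [0.0, 4.2, 0.0]]  # different topic, harmless

d_matched = contrast_direction(harmful, matched)
d_unmatched = contrast_direction(harmful, unmatched)

# The matched contrast cancels the shared topic component, so the
# extracted direction has a much smaller magnitude.
print(norm(d_matched), norm(d_unmatched))
```

In this toy setting the matched direction keeps only the small residual component, while the unmatched direction is far larger. Whether a given magnitude is enough to perturb the residual stream depends on the model and is not modeled here.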

45. ❌ A Context Engineering Framework for Improving Enterprise AI Agents based on Digital-Twin MDP

Authors: Xi Yang, Aurelie Lozano, Naoki Abe, Bhavya, Saurabh Jha, Noah Zheutlin, Rohan R. Arora, Yu Deng, Daby M. Sow Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22083v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

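Scoring rationale: The paper proposes a framework that improves LLM-based enterprise agents through offline reinforcement learning. It centers on LLM agents (highly relevant, 10 points), complex reasoning demands (5 points each for CoT and System 2 Thinking), and data-quality constraints (5 points for Scaling Laws AND Data Quality). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract or are unrelated to the paper's focus, so they score 0.

!!! tip deepseek-chat TL;DR

To address the limitations of enterprise AI agents in data quality, complex reasoning, and feedback signals, the paper proposes a lightweight framework based on a Digital-Twin MDP and offline reinforcement learning that significantly improves agent decision-making on enterprise tasks such as IT automation.

Abstract Translation

Despite rapid progress in AI agents for enterprise automation and decision-making, their real-world deployment and further performance gains remain limited by insufficient data quality and quantity, complex real-world reasoning demands, the difficulty of self-play training, and the lack of reliable feedback signals. To address these challenges, we propose a lightweight, model-agnostic framework that improves LLM-based enterprise agents through offline reinforcement learning (RL). The proposed Context Engineering via DT-MDP (DT-MDP-CE) framework has three core components: (1) a Digital-Twin Markov Decision Process (DT-MDP), which abstracts the agent's reasoning behavior as a finite MDP; (2) robust contrastive inverse RL, which relies on the DT-MDP to efficiently estimate a theoretically grounded reward function and derive policies from mixed-quality offline trajectories; and (3) RL-guided context engineering, which uses the policy obtained from the integrated process of (1) and (2) to improve the agent's decision-making behavior. As a case study, we apply the framework to a representative task in enterprise IT automation. Extensive experiments show consistent and significant improvements over baseline agents across a wide range of evaluation settings, suggesting that the framework can generalize to other agents with similar characteristics in enterprise environments.

Abstract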

Despite rapid progress in AI agents for enterprise automation and decision-making, their real-world deployment and further performance gains remain constrained by limited data quality and quantity, complex real-world reasoning demands, difficulties with self-play, and the lack of reliable feedback signals. To address these challenges, we propose a lightweight, model-agnostic framework for improving LLM-based enterprise agents via offline reinforcement learning (RL). The proposed Context Engineering via DT-MDP (DT-MDP-CE) framework comprises three key components: (1) A Digital-Twin Markov Decision Process (DT-MDP), which abstracts the agent’s reasoning behavior as a finite MDP; (2) A robust contrastive inverse RL, which, armed with the DT-MDP, efficiently estimates a well-founded reward function and induces policies from mixed-quality offline trajectories; and (3) RL-guided context engineering, which uses the policy obtained from the integrated process of (1) and (2) to improve the agent’s decision-making behavior. As a case study, we apply the framework to a representative task in the enterprise-oriented domain of IT automation. Extensive experimental results demonstrate consistent and significant improvements over baseline agents across a wide range of evaluation settings, suggesting that the framework can generalize to other agents sharing similar characteristics in enterprise environments.

Keywords: LLM-based enterprise agents, offline reinforcement learning, Digital-Twin MDP, context engineering, IT automation, decision-making, contrastive inverse RL, policy improvement

46. ❌ Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

Authors: Hayeon Kim, Ji Ha Jang, Junghun James Kim, Se Young Chun Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22042v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

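Scoring rationale: The paper studies improvements to vision-language models (VLMs) in hyperbolic space, focusing on part-whole relations and hierarchical structure; it sits at the intersection of computer vision and natural language processing. All of the given keywords target specific aspects of large language models (LLMs), such as their techniques, training methods, reasoning, alignment, optimization, and applications, whereas this paper studies VLMs. Although VLMs and LLMs both fall under multimodal or language models, the paper involves none of the LLM-specific techniques (instruction tuning, RLHF, RAG, CoT, agents, and so on) and does not mention AI applications in science. All keywords are therefore unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes UNCHA, an uncertainty-guided alignment method for hyperbolic vision-language models that improves understanding of multi-object compositional scenes by modeling part-to-whole semantic representativeness, and achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification.

Abstract Translation

Although vision-language models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-whole or parent-child structures, and they often struggle in multi-object compositional scenes. Hyperbolic VLMs mitigate this problem by better preserving hierarchical structure and using entailment to model part-whole relations (that is, a whole scene and its part images). However, existing methods do not model the fact that each part has a different degree of semantic representativeness with respect to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) to enhance hyperbolic VLMs. UNCHA uses hyperbolic uncertainty to model part-to-whole semantic representativeness, assigning lower uncertainty to parts that are more representative of the whole scene and higher uncertainty to less representative ones. This representativeness is then incorporated into the contrastive learning objective through uncertainty-guided weights. Finally, an entailment loss regularized by an entropy-based term further calibrates the uncertainty. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, captures the underlying compositional structure of an image, and improves understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.

Abstract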

While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.

Keywords: Vision-Language Models, Hyperbolic embeddings, Part-to-whole relations, Semantic representativeness, Uncertainty-guided alignment, Multi-object compositional scenarios, Contrastive learning, Entailment loss

47. ❌ Future-Interactions-Aware Trajectory Prediction via Braid Theory

Authors: Caio Azevedo, Stefano Sabatini, Sascha Hornauer, Fabien Moutarde Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22035v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

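Scoring rationale: The paper studies multi-agent trajectory prediction, using braid theory to model interactions between agents; it is a conventional machine learning/deep learning application in autonomous driving. It does not address the core content of keywords such as large language models (LLMs), the technical principles of large models, or AI for Science. It has only some relation to 'Multi-agent Systems OR Agent Coordination' (5 points), because it deals with coordinated multi-agent prediction but does not use LLMs or related techniques. All other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a future-interaction-aware trajectory prediction method based on braid theory; introducing braid prediction as an auxiliary task significantly improves the accuracy of joint multi-agent trajectory prediction.

Abstract Translation

To drive safely, an autonomous vehicle must anticipate the future behavior of a large number of interacting agents around it, a task usually formulated as multi-agent trajectory prediction. Many earlier methods that model social interactions and solve the joint prediction task either add substantial computational cost or rely on heuristic rules to label multi-agent behavior types. Braid theory, by contrast, provides an exact and powerful descriptor of multi-agent behavior by projecting future trajectories into braids that express how the trajectories cross one another over time; each braid corresponds to a specific mode of coordination among the agents in the future. In prior work, braid theory was used only lightly, to reason about relations between interacting agents and to restrict the attention window of predicted agents. This study shows that exploiting the expressive power of the braid representation more fully, and conditioning trajectory generation on it, substantially improves joint prediction performance while adding negligible complexity in training and inference. We achieve this by proposing a novel auxiliary task, braid prediction, which runs in parallel with the trajectory prediction task. By classifying the edges between agents into the correct crossing types in the braid representation, the braid prediction task gives the model stronger social awareness, reflected in joint predictions that better match actual multi-agent behavior. This simple auxiliary task yields significant improvements in joint prediction metrics on three separate datasets. We further show how the braid prediction task gives the model awareness of future intentions, producing more accurate joint predictions. Code is available at github.com/caiocj1/traj-pred-braid-theory.

Abstract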

To safely operate, an autonomous vehicle must know the future behavior of a potentially high number of interacting agents around it, a task often posed as multi-agent trajectory prediction. Many previous attempts to model social interactions and solve the joint prediction task either add extensive computational requirements or rely on heuristics to label multi-agent behavior types. Braid theory, in contrast, provides a powerful exact descriptor of multi-agent behavior by projecting future trajectories into braids that express how trajectories cross with each other over time; a braid then corresponds to a specific mode of coordination between the multiple agents in the future. In past work, braids have been used lightly to reason about interacting agents and restrict the attention window of predicted agents. We show that leveraging more fully the expressivity of the braid representation and using it to condition the trajectories themselves leads to even further gains in joint prediction performance, with negligible added complexity either in training or at inference time. We do so by proposing a novel auxiliary task, braid prediction, done in parallel with the trajectory prediction task. By classifying edges between agents into their correct crossing types in the braid representation, the braid prediction task is able to imbue the model with improved social awareness, which is reflected in joint predictions that more closely adhere to the actual multi-agent behavior. This simple auxiliary task allowed us to obtain significant improvements in joint metrics on three separate datasets. We show how the braid prediction task infuses the model with future intention awareness, leading to more accurate joint predictions. Code is available at github.com/caiocj1/traj-pred-braid-theory.

Keywords: trajectory prediction, multi-agent systems, braid theory, autonomous vehicles, social interactions, joint prediction, future interactions, auxiliary task
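
To make the idea of crossing types concrete, here is a minimal sketch of how crossings between two trajectories might be labeled once they are projected onto one axis. It is an illustration under stated assumptions, not the authors' code: the function name `crossings`, the choice of the x-axis as the projection axis, and the sign convention are all hypothetical.

```python
def crossings(traj_a, traj_b):
    """Return a list of (step, sign) crossings between two trajectories.

    Each trajectory is a list of (x, y) positions sampled at the same times.
    A crossing is recorded when the order of the x-coordinates flips between
    consecutive steps. The sign is +1 if agent a has the larger y at that
    step and -1 otherwise (an assumed convention for this sketch).
    Steps where the x-coordinates are exactly equal are ignored.
    """
    out = []
    for t in range(1, len(traj_a)):
        before = traj_a[t - 1][0] - traj_b[t - 1][0]
        after = traj_a[t][0] - traj_b[t][0]
        if before * after < 0:  # x-order flipped between t-1 and t
            sign = 1 if traj_a[t][1] > traj_b[t][1] else -1
            out.append((t, sign))
    return out

# Agent a moves right and up; agent b moves left at constant y.
a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
b = [(2.0, 0.5), (1.5, 0.5), (0.0, 0.5)]
print(crossings(a, b))
```

A sequence of such signed crossings over all agent pairs is the kind of discrete label that an auxiliary classification head could be trained to predict.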

48. ❌ ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention

Authors: Xinyan Wang, Xiaogeng Liu, Chaowei Xiao Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22016v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

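Scoring rationale: The paper studies overthinking in large reasoning models (LRMs) when they generate long chain-of-thought reasoning, and proposes ROM for real-time detection and intervention. Core relevant keywords: 1) 'Large Language Models' (10 points): the paper explicitly studies large language/reasoning models; 2) 'Chain of Thought' (10 points): it directly targets overthinking in chain-of-thought reasoning; 3) 'System 2 Thinking' (8 points): it concerns efficiency optimization in deep reasoning; 4) 'Self-Correction' (8 points): it achieves self-improvement through detection and intervention; 5) 'Speculative Decoding' (5 points): it relates to inference acceleration and efficiency gains. Other keywords such as MoE, quantization, and RAG are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address overthinking by large reasoning models during chain-of-thought reasoning, the paper proposes ROM, which uses real-time streaming detection and intervention to significantly reduce response length and improve efficiency while maintaining high accuracy.

Abstract Translation

Large Reasoning Models (LRMs) achieve high accuracy on complex tasks by generating long chain-of-thought traces, but they suffer from overthinking. Even after reaching the correct answer, they keep generating redundant reasoning steps. This behavior increases latency and compute cost and can also cause answer drift. Existing mitigation methods either require training-heavy modifications to the backbone model or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method to formulate overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone, monitors generated tokens in real time, and triggers an early transition to the final answer as soon as overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest response length (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results indicate that streaming detection is an effective route to real-time overthinking mitigation.

Abstract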

Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.

Keywords: Large Reasoning Models, Overthinking Mitigation, Chain-of-Thought, Streaming Detection, Real-time Intervention, Response Efficiency, Latency Reduction, Answer Drift
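
The streaming detect-and-intervene loop described in the abstract can be sketched roughly as follows. This is a simplified illustration, not ROM itself: the per-token scores stand in for the output of the detection head, and the function name, the `threshold` and `patience` parameters, and the stopping rule are assumptions that do not come from the paper.

```python
def stream_with_intervention(tokens, scores, threshold=0.5, patience=2):
    """Consume tokens one at a time and stop early on sustained overthinking.

    `scores[i]` is a hypothetical overthinking probability for `tokens[i]`.
    Generation is cut once `patience` consecutive scores exceed `threshold`.
    Returns (tokens kept so far, whether an intervention was triggered).
    """
    kept = []
    run = 0  # length of the current streak of high scores
    for token, score in zip(tokens, scores):
        kept.append(token)
        run = run + 1 if score > threshold else 0
        if run >= patience:
            return kept, True  # caller would now switch to the final answer
    return kept, False

tokens = ["a", "b", "c", "d", "e", "f"]
scores = [0.1, 0.6, 0.2, 0.7, 0.9, 0.95]
kept, stopped = stream_with_intervention(tokens, scores)
print(kept, stopped)
```

Requiring a short streak instead of a single high score is one simple way to avoid reacting to an isolated noisy prediction; the actual trigger used by ROM may differ.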

49. ❌ SegMaFormer: A Hybrid State-Space and Transformer Model for Efficient Segmentation

Authors: Duy D. Nguyen, Phat T. Tran-Truong Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22002v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

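Scoring rationale: The paper focuses on a deep learning architecture for 3D medical image segmentation and proposes SegMaFormer, a hybrid model that combines Mamba and Transformer. The work belongs to computer vision and medical image analysis and is unrelated to the great majority of the keywords (which mainly concern large language model techniques, training methods, inference optimization, alignment, agent systems, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies deep learning to medical image analysis (a subfield of bioinformatics/AI for science); this is not its core contribution (the core is the architecture design), so it receives 8 points (some relation, but not central).

!!! tip deepseek-chat TL;DR

To address the excessive computational complexity of Transformer models in 3D medical image segmentation, the paper proposes SegMaFormer, a lightweight hybrid architecture that combines Mamba and Transformer modules and significantly reduces parameters and computational cost while maintaining competitive segmentation performance.

Abstract Translation

The emergence of Transformer and Mamba architectures has significantly advanced 3D medical image segmentation by enabling global context modeling, a capability that has been limited in traditional convolutional neural networks (CNNs). However, state-of-the-art Transformer models usually come with substantial computational complexity and parameter counts, which is especially prohibitive for volumetric data, and the scarcity of annotated medical imaging datasets makes the problem worse. To address these challenges, this study proposes SegMaFormer, a lightweight hybrid architecture that integrates Mamba and Transformer modules within a hierarchical volumetric encoder for efficient long-range dependency modeling. The model strategically uses Mamba-based layers in the early, high-resolution stages to reduce computational overhead while capturing key spatial context, and reserves self-attention for the later, lower-resolution stages to refine feature representations. The design is complemented by generalized rotary position embeddings that strengthen spatial awareness. Despite its compact structure, SegMaFormer achieves competitive performance on three public benchmarks (Synapse, BraTS, and ACDC), with Dice coefficients comparable to those of models with far more parameters. Experiments show that, compared with current state-of-the-art models, our method reduces the parameter count by up to 75x and substantially reduces FLOPs, establishing an efficient, high-performing solution for 3D medical image segmentation.

Abstract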

The advent of Transformer and Mamba-based architectures has significantly advanced 3D medical image segmentation by enabling global contextual modeling, a capability traditionally limited in Convolutional Neural Networks (CNNs). However, state-of-the-art Transformer models often entail substantial computational complexity and parameter counts, which is particularly prohibitive for volumetric data and further exacerbated by the limited availability of annotated medical imaging datasets. To address these limitations, this work introduces SegMaFormer, a lightweight hybrid architecture that synergizes Mamba and Transformer modules within a hierarchical volumetric encoder for efficient long-range dependency modeling. The model strategically employs Mamba-based layers in early, high-resolution stages to reduce computational overhead while capturing essential spatial context, and reserves self-attention mechanisms for later, lower-resolution stages to refine feature representation. This design is augmented with generalized rotary position embeddings to enhance spatial awareness. Despite its compact structure, SegMaFormer achieves competitive performance on three public benchmarks (Synapse, BraTS, and ACDC), matching the Dice coefficient of significantly larger models. Empirically, our approach reduces parameters by up to 75x and substantially decreases FLOPs compared to current state-of-the-art models, establishing an efficient and high-performing solution for 3D medical image segmentation.

Keywords: 3D medical image segmentation, Transformer, Mamba, hybrid architecture, computational efficiency, lightweight model, long-range dependency, volumetric encoder

50. ❌ TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning

Authors: Dilina Rajapakse, Juan C. Rosero, Ivana Dusparic Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21988v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

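Scoring rationale: The paper focuses on TREX, an explainability framework for multi-objective reinforcement learning (MORL) that explains policies through trajectory attribution. Although it involves reinforcement learning and explainable AI, the keywords all target large-model/deep-learning techniques, and the paper covers none of the topics of large models, language models, pre-training, fine-tuning, inference optimization, alignment, or agent systems. It is only weakly related to 'Mechanistic Interpretability OR Explainable AI' (5 points), because TREX falls under explainable AI but does not address the interpretability of large models. All other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes the TREX framework, which explains the decision process of multi-objective reinforcement learning policies through trajectory attribution and clustered behavioral segments, and validates its ability to quantify the influence of behavioral patterns in MuJoCo environments.

Abstract Translation

Reinforcement Learning (RL) has shown that it can solve complex decision-making problems in many domains by optimizing reward signals obtained through interaction with an environment. However, many real-world scenarios involve multiple, possibly conflicting objectives that are hard to capture with a single scalar reward. Multi-Objective Reinforcement Learning (MORL) addresses this limitation by letting agents optimize several objectives simultaneously and reason explicitly about the trade-offs between them. However, the "black box" nature of RL models leaves the decision process behind the chosen objective trade-offs unclear. Current Explainable Reinforcement Learning (XRL) methods are usually designed for a single scalar reward and do not provide explanations with respect to distinct objectives or user preferences. To fill this gap, this paper proposes TREX, an explainability framework based on trajectory attribution for explaining multi-objective reinforcement learning policies.
TREX generates trajectories directly from the learned expert policy across different user preferences and clusters them into semantically meaningful temporal segments. We quantify the influence of these behavioral segments on the Pareto trade-off by training complementary policies that exclude specific clusters and measuring the relative deviation in observed rewards and actions from the original expert policy. Experiments on the multi-objective MuJoCo environments HalfCheetah, Ant, and Swimmer show that the framework can effectively isolate and quantify specific behavioral patterns.

Abstract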

Reinforcement Learning (RL) has demonstrated its ability to solve complex decision-making problems in a variety of domains, by optimizing reward signals obtained through interaction with an environment. However, many real-world scenarios involve multiple, potentially conflicting objectives that cannot be easily represented by a single scalar reward. Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about trade-offs between them. However, the “black box” nature of the RL models makes the decision process behind chosen objective trade-offs unclear. Current Explainable Reinforcement Learning (XRL) methods are typically designed for single scalar rewards and do not account for explanations with respect to distinct objectives or user preferences. To address this gap, in this paper we propose TREX, a Trajectory based Explainability framework to explain Multi-objective Reinforcement Learning policies, based on trajectory attribution. TREX generates trajectories directly from the learned expert policy, across different user preferences, and clusters them into semantically meaningful temporal segments. We quantify the influence of these behavioural segments on the Pareto trade-off by training complementary policies that exclude specific clusters, measuring the resulting relative deviation on the observed rewards and actions compared to the original expert policy. Experiments on multi-objective MuJoCo environments (HalfCheetah, Ant, and Swimmer) demonstrate the framework’s ability to isolate and quantify the specific behavioural patterns.

Keywords: Multi-Objective Reinforcement Learning, Explainable Reinforcement Learning, Trajectory Attribution, Behavioral Clustering, Pareto Trade-off, Policy Explanation, MuJoCo Environments
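
The abstract measures how far a complementary policy deviates from the expert policy on each objective. One plausible way to compute such a per-objective relative deviation is sketched below; the exact formula used by TREX is not given in the abstract, so the normalization by the expert's absolute reward, the `eps` guard, and the example numbers are all assumptions made for illustration.

```python
def relative_deviation(expert, complementary, eps=1e-8):
    """Per-objective relative change of the complementary policy's reward.

    `expert` and `complementary` are lists of average rewards, one entry per
    objective. `eps` avoids division by zero when an expert reward is 0.
    """
    return [(c - e) / (abs(e) + eps) for e, c in zip(expert, complementary)]

# Hypothetical average rewards on two objectives (e.g. speed, energy cost).
expert_rewards = [10.0, -4.0]
without_cluster = [8.0, -5.0]  # policy trained without one behavior cluster
print(relative_deviation(expert_rewards, without_cluster))
```

A large deviation on one objective and a small one on another would suggest that the excluded behavior cluster matters mainly for the first objective.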

51. ❌ λ-GELU: Learning Gating Hardness for Controlled ReLU-ization in Deep Networks

Authors: Cristian Pérez-Corral, Alberto Fernández-Hernández, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21991v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

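Scoring rationale: The paper studies λ-GELU, a parameterized variant of the GELU activation function that controls gating hardness to provide a smooth transition toward ReLU; this is an optimization of a basic deep learning component. All of the scoring keywords focus on techniques, applications, or optimization methods specific to large language models (LLMs), such as MoE, RLHF, RAG, and quantization, whereas the paper involves no LLM-related content and does not discuss LLM-specific training, inference, alignment, or application techniques. Although the paper mentions Transformers, they appear only as one of the evaluated architectures, with no treatment of LLM-specific concepts such as scale, pre-training, or fine-tuning. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes λ-GELU, a parameterized GELU activation function whose learnable hardness parameter controls gating behavior, enabling a controlled transition from smooth training to ReLU-compatible models, and validates its effectiveness and robustness on a variety of network architectures.

Abstract Translation

The Gaussian Error Linear Unit (GELU) is a widely used smooth alternative to the Rectified Linear Unit (ReLU), yet many deployment, compression, and analysis toolchains apply most naturally to piecewise-linear (ReLU-type) networks. We study a hardness-parameterized formulation of GELU, f(x;λ)=xΦ(λx), where Φ is the Gaussian cumulative distribution function and λ ∈ [1, ∞) controls gate sharpness, with the goal of turning smooth gated training into a controlled path toward ReLU-compatible models. Learning λ is not trivial: naive updates lead to unstable dynamics and effective gradient attenuation, so we introduce a constrained reparameterization and an optimizer-aware update scheme.
Empirically, across diverse model-dataset pairs covering multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and Transformers, we observe structured layerwise hardness profiles and evaluate their robustness under different initializations. We further study a deterministic ReLU-ization strategy in which the learned gates are progressively hardened toward a principled target, so that λ-GELU can be replaced by ReLU after training with little disruption. Overall, λ-GELU provides a minimal, interpretable knob for characterizing and controlling gating hardness, bridging smooth training and ReLU-centric downstream pipelines.

Abstract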

Gaussian Error Linear Unit (GELU) is a widely used smooth alternative to Rectified Linear Unit (ReLU), yet many deployment, compression, and analysis toolchains are most naturally expressed for piecewise-linear (ReLU-type) networks. We study a hardness-parameterized formulation of GELU, f(x;λ)=xΦ(λx), where Φ is the Gaussian CDF and λ ∈ [1, ∞) controls gate sharpness, with the goal of turning smooth gated training into a controlled path toward ReLU-compatible models. Learning λ is non-trivial: naive updates yield unstable dynamics and effective gradient attenuation, so we introduce a constrained reparameterization and an optimizer-aware update scheme. Empirically, across a diverse set of model–dataset pairs spanning MLPs, CNNs, and Transformers, we observe structured layerwise hardness profiles and assess their robustness under different initializations. We further study a deterministic ReLU-ization strategy in which the learned gates are progressively hardened toward a principled target, enabling a post-training substitution of λ-GELU by ReLU with reduced disruption. Overall, λ-GELU provides a minimal and interpretable knob to profile and control gating hardness, bridging smooth training with ReLU-centric downstream pipelines.

Keywords: GELU, ReLU, activation function, gating hardness, parameterized GELU, deep networks, controlled ReLU-ization, post-training substitution
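
As a minimal sketch of the formulation quoted in the abstract above, f(x;λ)=xΦ(λx), the following Python evaluates λ-GELU with the standard library and measures how far it sits from ReLU on a fixed grid. The function names `lambda_gelu` and `max_gap_to_relu` and the choice of grid are illustrative assumptions; the paper's constrained reparameterization and optimizer-aware update are not described in the abstract and are not reproduced here.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard Gaussian CDF, Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lambda_gelu(x: float, lam: float = 1.0) -> float:
    """f(x; lam) = x * Phi(lam * x); lam = 1 gives the usual GELU."""
    if lam < 1.0:
        raise ValueError("lam is assumed to lie in [1, inf)")
    return x * std_normal_cdf(lam * x)


def relu(x: float) -> float:
    return max(0.0, x)


def max_gap_to_relu(lam: float) -> float:
    """Largest |lambda_gelu - relu| on the illustrative grid -5.0, -4.9, ..., 5.0."""
    grid = [i / 10 for i in range(-50, 51)]
    return max(abs(lambda_gelu(x, lam) - relu(x)) for x in grid)


if __name__ == "__main__":
    for lam in (1.0, 4.0, 64.0):
        print(f"lambda={lam:>5}: max gap to ReLU = {max_gap_to_relu(lam):.3e}")
```

Under this definition the pointwise gap to ReLU equals |x|Φ(−λ|x|), so substituting u = λx shows that its supremum scales as 1/λ; this is the sense in which a larger λ gives a harder, more ReLU-like gate.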

52. ❌ LRC-WeatherNet: LiDAR, RADAR, and Camera Fusion Network for Real-time Weather-type Classification in Autonomous Driving

Authors: Nour Alhuda Albashir, Lars Pernickel, Danial Hamoud, Idriss Gouigah, Eren Erdal Aksoy Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21987v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

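Scoring rationale: The paper focuses on multi-sensor fusion (LiDAR, RADAR, camera) for real-time weather classification in autonomous driving, which belongs to computer vision and sensor fusion. All scoring keywords concern large language models (LLMs) and related techniques such as training methods, inference optimization, and application frameworks, whereas the paper involves no language models, deep learning foundation models, or related technical principles. Its contribution is a multimodal sensor-fusion architecture rather than large-model techniques or their scientific applications, so it is unrelated to every keyword and scores 0 on all of them.

!!! tip deepseek-chat TL;DR

The paper proposes LRC-WeatherNet, a new multi-sensor fusion framework that integrates LiDAR, RADAR, and camera data to address real-time weather classification for autonomous vehicles in adverse weather; on the MSU-4S dataset it achieves better classification performance and computational efficiency than unimodal baselines.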

Abstract

Autonomous vehicles face major perception and navigation challenges in adverse weather such as rain, fog, and snow, which degrade the performance of LiDAR, RADAR, and RGB camera sensors. While each sensor type offers unique strengths, such as RADAR robustness in poor visibility and LiDAR precision in clear conditions, they also suffer distinct limitations when exposed to environmental obstructions. This study proposes LRC-WeatherNet, a novel multi-sensor fusion framework that integrates LiDAR, RADAR, and camera data for real-time classification of weather conditions. By employing both early fusion using a unified Bird’s Eye View representation and mid-level gated fusion of modality-specific feature maps, our approach adapts to the varying reliability of each sensor under changing weather. Evaluated on the extensive MSU-4S dataset covering nine weather types, LRC-WeatherNet achieves superior classification performance and computational efficiency, significantly outperforming unimodal baselines in adverse conditions. This work is the first to combine all three modalities for robust, real-time weather classification in autonomous driving. We release our trained models and source code in https://github.com/nouralhudaalbashir/LRC-WeatherNet.

Keywords: autonomous driving, multi-sensor fusion, LiDAR, RADAR, camera, weather classification, real-time, MSU-4S dataset

53. ❌ SecureBreak – A dataset towards safe and secure models

Authors: Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21975v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

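Scoring rationale: The paper centers on LLM safety alignment and harmful-output detection, so it is highly relevant to 'Large Language Models' and 'Alignment' (10 points each). It involves fine-tuning and safety filtering, which gives it some relevance to 'Post-training' and 'Hallucination Mitigation' (5 points each). No other keywords are covered (0 points).

!!! tip deepseek-chat TL;DR

Targeting the residual weaknesses of LLM safety alignment, the paper introduces the SecureBreak dataset for detecting harmful outputs and validates its usefulness through fine-tuning experiments.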

Abstract

Large language models are becoming pervasive core components in many real-world applications. As a consequence, security alignment represents a critical requirement for their safe deployment. Although previous related works focused primarily on model architectures and alignment methodologies, these approaches alone cannot ensure the complete elimination of harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies are needed both to provide qualitative feedback on the robustness of the obtained security alignment at the training stage, and to create an "ultimate" defense layer to block unsafe outputs possibly produced by deployed models. To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security alignment. The dataset is highly reliable due to careful manual annotation, where labels are assigned conservatively to ensure safety. It performs well in detecting unsafe content across multiple risk categories. Tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak. Overall, the dataset is useful both for post-generation safety filtering and for guiding further model alignment and security improvements.

Keywords: Large Language Models, Security Alignment, Harmful Output Detection, Safety Dataset, Fine-tuning, Jailbreaking, Prompt Injection, Post-generation Filtering

54. ❌ Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of Lora, Prompt Tuning, and Full Fine-Tuning

Authors: Ulugbek Shernazarov, Rostislav Svitsov, Bin Shi Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21970v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

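Scoring rationale: The paper's core topic is applying parameter-efficient fine-tuning (PEFT) methods, LoRA in particular, to medical text summarization, so it is highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (15 points). It fine-tunes Flan-T5 models, which counts as an application of large language models (8 points). It applies AI in the medical domain (the PubMed dataset), which relates to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It explicitly compares supervised fine-tuning (SFT) methods, so 'Post-training OR Supervised Fine-tuning OR SFT' receives 10 points. Other keywords, such as MoE, quantization, and inference acceleration, are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper compares LoRA, Prompt Tuning, and full fine-tuning on medical text summarization and finds that LoRA outperforms full fine-tuning while training only 0.6% of the parameters, suggesting that the low-rank constraint provides beneficial regularization.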

Abstract

Fine-tuning large language models for domain-specific tasks such as medical text summarization demands substantial computational resources. Parameter-efficient fine-tuning (PEFT) methods offer promising alternatives by updating only a small fraction of parameters. This paper compares three adaptation approaches, Low-Rank Adaptation (LoRA), Prompt Tuning, and Full Fine-Tuning, across the Flan-T5 model family on the PubMed medical summarization dataset. Through experiments with multiple random seeds, we demonstrate that LoRA consistently outperforms full fine-tuning, achieving 43.52 +/- 0.18 ROUGE-1 on Flan-T5-Large with only 0.6% trainable parameters compared to 40.67 +/- 0.21 for full fine-tuning. Sensitivity analyses examine the impact of LoRA rank and prompt token count. Our findings suggest the low-rank constraint provides beneficial regularization, challenging assumptions about the necessity of full parameter updates. Code is available at https://github.com/eracoding/llm-medical-summarization

Keywords: Parameter-efficient fine-tuning, LoRA, Medical text summarization, Flan-T5, PubMed dataset, Low-Rank Adaptation, Prompt Tuning, Full Fine-Tuning
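
The low-rank constraint mentioned above can be illustrated with a minimal sketch of a LoRA-style linear layer in plain Python: the frozen weight W is left untouched and only the factors A (r × d_in) and B (d_out × r) would be trained, with the update scaled by alpha / r. The helper names, the toy matrices, and the 1024 × 1024 layer size are illustrative assumptions and are not taken from the paper or its repository.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x), where r is the LoRA rank.

    W is the frozen d_out x d_in weight; A is r x d_in and B is d_out x r.
    """
    rank = len(A)
    base = matvec(W, x)                # frozen path
    update = matvec(B, matvec(A, x))   # low-rank path, only A and B are trainable
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_fraction(d_out, d_in, rank):
    """Adapter parameters divided by all parameters of one adapted layer."""
    frozen = d_out * d_in
    adapter = rank * (d_in + d_out)
    return adapter / (frozen + adapter)


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    A = [[1.0, 0.0]]
    B_zero = [[0.0], [0.0]]  # B starts at zero, so the adapted layer equals the frozen one
    print(lora_forward(W, A, B_zero, [3.0, 4.0], alpha=2.0))
    print(f"{lora_trainable_fraction(1024, 1024, 8):.4f}")
```

For a single 1024 × 1024 layer with rank 8, the adapter holds about 1.5% of that layer's parameters. The 0.6% figure in the abstract refers to the whole Flan-T5-Large model, which this sketch does not attempt to reproduce.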

55. ❌ Suiren-1.0 Technical Report: A Family of Molecular Foundation Models

Authors: Junyi An, Xinyu Lu, Yun-Fei Shi, Li-Cheng Xu, Nannan Zhang, Chao Qu, Yuan Qi, Fenglei Cao Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21942v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

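Scoring rationale: Suiren-1.0 is a family of molecular foundation models and falls squarely within 'AI for Science', so that keyword receives 10 points. As a foundation model, it belongs under 'Large Language Models OR LLMs OR Foundation Models' and receives 10 points. The paper explicitly describes pre-training and continued pre-training, which is highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation', for 10 points. Other keywords, such as MoE, SLMs, SFT, RLHF, RAG, inference acceleration, hallucination mitigation, model compression, and agents, are either not mentioned in the abstract or not directly related to molecular modeling, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper introduces the Suiren-1.0 family of molecular foundation models, which reaches state-of-the-art performance on quantum property prediction through pre-training and continued pre-training, and develops a lightweight model for efficient downstream use.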

Abstract

We introduce Suiren-1.0, a family of molecular foundation models for the accurate modeling of diverse organic systems. Suiren-1.0, comprising three specialized variants (Suiren-Base, Suiren-Dimer, and Suiren-ConfAvg), is integrated within an algorithmic framework that bridges the gap between 3D conformational geometry and 2D statistical ensemble spaces. We first pre-train Suiren-Base (1.8B parameters) on a 70M-sample Density Functional Theory dataset using spatial self-supervision and SE(3)-equivariant architectures, achieving robust performance in quantum property prediction. Suiren-Dimer extends this capability through continued pre-training on 13.5M intermolecular interaction samples. To enable efficient downstream application, we propose Conformation Compression Distillation (CCD), a diffusion-based framework that distills complex 3D structural representations into 2D conformation-averaged representations. This yields the lightweight Suiren-ConfAvg, which generates high-fidelity representations from SMILES or molecular graphs. Our extensive evaluations demonstrate that Suiren-1.0 establishes state-of-the-art results across a range of tasks. All models and benchmarks are open-sourced.

Keywords: molecular foundation models, pre-training, density functional theory, SE(3)-equivariant architectures, quantum property prediction, conformation compression distillation, state-of-the-art, open-sourced

56. ❌ Camera-Agnostic Pruning of 3D Gaussian Splats via Descriptor-Based Beta Evidence

Authors: Peter Fasogbon, Ugurcan Budak, Patrice Rondao Alface, Hamed Rezazadegan Tavakoli Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21933v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

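Scoring rationale: The paper studies pruning of 3D Gaussian splats, which belongs to computer vision and 3D reconstruction. It is unrelated to every scoring keyword, all of which focus on large language models, deep learning techniques, and their applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a pruning method for 3D Gaussian splats that does not depend on camera parameters; a descriptor-based Beta evidence model drives the pruning, substantially reducing complexity while preserving reconstruction quality.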

Abstract

The pruning of 3D Gaussian splats is essential for reducing their complexity to enable efficient storage, transmission, and downstream processing. However, most of the existing pruning strategies depend on camera parameters, rendered images, or view-dependent measures. This dependency becomes a hindrance in emerging camera-agnostic exchange settings, where splats are shared directly as point-based representations (e.g., .ply). In this paper, we propose a camera-agnostic, one-shot, post-training pruning method for 3D Gaussian splats that relies solely on attribute-derived neighbourhood descriptors. As our primary contribution, we introduce a hybrid descriptor framework that captures structural and appearance consistency directly from the splat representation. Building on these descriptors, we formulate pruning as a statistical evidence estimation problem and introduce a Beta evidence model that quantifies per-splat reliability through a probabilistic confidence score. Experiments conducted on standardized test sequences defined by the ISO/IEC MPEG Common Test Conditions (CTC) demonstrate that our approach achieves substantial pruning while preserving reconstruction quality, establishing a practical and generalizable alternative to existing camera-dependent pruning strategies.

Keywords: 3D Gaussian splats, pruning, camera-agnostic, descriptor-based, Beta evidence model, post-training, reconstruction quality, complexity reduction

57. ❌ Chronological Contrastive Learning: Few-Shot Progression Assessment in Irreversible Diseases

Authors: Clemens Watzenböck, Daniel Aletaha, Michaël Deman, Thomas Deimel, Jana Eder, Ivana Janickova, Robert Janiczek, Peter Mandl, Philipp Seeböck, Gabriela Supp, Paul Weiser, Georg Langs Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21935v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

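Scoring rationale: The paper focuses on ChronoCon, a self-supervised learning method in medical image analysis for assessing progression in irreversible diseases. The method is based on contrastive learning and trains on the chronological order of a patient's longitudinal scans without relying on expert annotations. Its core is computer vision and medical image analysis rather than large language models or innovation in deep learning principles. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to biomedicine (rheumatoid arthritis); since this is not core large-model technology, it receives 8 points. All other keywords concern large models, training techniques, inference optimization, agent systems, and similar topics that the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes ChronoCon, a chronological contrastive learning method that uses the temporal order of a patient's longitudinal medical images for self-supervised learning, reducing reliance on expert annotation. For rheumatoid arthritis severity assessment, fine-tuning on expert scores from only five patients yields an intraclass correlation coefficient of 86%.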

Abstract

Quantitative disease severity scoring in medical imaging is costly, time-consuming, and subject to inter-reader variability. At the same time, clinical archives contain far more longitudinal imaging data than expert-annotated severity scores. Existing self-supervised methods typically ignore this chronological structure. We introduce ChronoCon, a contrastive learning approach that replaces label-based ranking losses with rankings derived solely from the visitation order of a patient’s longitudinal scans. Under the clinically plausible assumption of monotonic progression in irreversible diseases, the method learns disease-relevant representations without using any expert labels. This generalizes the idea of Rank-N-Contrast from label distances to temporal ordering. Evaluated on rheumatoid arthritis radiographs for severity assessment, the learned representations substantially improve label efficiency. In low-label settings, ChronoCon significantly outperforms a fully supervised baseline initialized from ImageNet weights. In a few-shot learning experiment, fine-tuning ChronoCon on expert scores from only five patients yields an intraclass correlation coefficient of 86% for severity score prediction. These results demonstrate the potential of chronological contrastive learning to exploit routinely available imaging metadata to reduce annotation requirements in the irreversible disease domain. Code is available at https://github.com/cirmuw/ChronoCon.

Keywords: Chronological Contrastive Learning, Few-Shot Learning, Medical Imaging, Disease Progression Assessment, Self-supervised Learning, Rheumatoid Arthritis, Longitudinal Data, Label Efficiency
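
One way to read "rankings derived solely from the visitation order" is as ordering constraints between scans of the same patient. The sketch below is an illustration under stated assumptions, not the authors' loss: for each anchor scan it lists (anchor, nearer, farther) triplets in which the "nearer" scan lies chronologically between the anchor and the "farther" scan, which under monotonic progression means its severity is at least as close to the anchor's. The function name `chronological_triplets` and the use of months as the time unit are hypothetical.

```python
from itertools import permutations


def chronological_triplets(visit_times):
    """Return (anchor, nearer, farther) index triplets for one patient's scans.

    A triplet is kept only when the 'nearer' visit falls strictly between the
    anchor and the 'farther' visit in time, so the ordering follows from
    monotonic disease progression alone and needs no severity labels.
    """
    triplets = []
    for anchor, nearer, farther in permutations(range(len(visit_times)), 3):
        t_a, t_n, t_f = visit_times[anchor], visit_times[nearer], visit_times[farther]
        if t_a < t_n < t_f or t_a > t_n > t_f:
            triplets.append((anchor, nearer, farther))
    return triplets


if __name__ == "__main__":
    # Three visits of one hypothetical patient, in months since the first scan.
    print(chronological_triplets([0, 6, 12]))
```

Such triplets could feed a ranking or contrastive objective; how ChronoCon actually samples or weights them is not specified in the abstract.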

58. ❌ Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support

Authors: Shuying Chen, Sen Cui, Zhong Cao Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21925v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

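Scoring rationale: The paper proposes the Oph-Guid-RAG system, whose core is applying retrieval-augmented generation (RAG) to ophthalmic clinical decision support, making it highly relevant to 'Retrieval-Augmented Generation' (10 points). The system involves multimodal reasoning, query decomposition, and controllable retrieval, giving it some relevance to 'Chain of Thought' and 'System 2 Thinking' (5 points each). The application area is clinical AI for ophthalmology, which falls under 'AI for Science' (10 points). The system emphasizes evidence grounding and traceable outputs, which relates to 'Hallucination Mitigation' and 'Explainable AI' (5 points each). The paper uses GPT-5.2 and GPT-5.4 as baselines and thus involves large-model applications, but it is only indirectly related to 'Large Language Models' (5 points). Other keywords, such as MoE, SFT, and RLHF, do not appear in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The study proposes Oph-Guid-RAG, a guideline-grounded multimodal retrieval-augmented generation system for ophthalmic clinical question answering and decision support; controllable retrieval and evidence citation markedly improve accuracy and robustness on challenging cases.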

Abstract

In this work, we propose Oph-Guid-RAG, a multimodal visual RAG system for ophthalmology clinical question answering and decision support. We treat each guideline page as an independent evidence unit and directly retrieve page images, preserving tables, flowcharts, and layout information. We further design a controllable retrieval framework with routing and filtering, which selectively introduces external evidence and reduces noise. The system integrates query decomposition, query rewriting, retrieval, reranking, and multimodal reasoning, and provides traceable outputs with guideline page references. We evaluate our method on HealthBench using a doctor-based scoring protocol. On the hard subset, our approach improves the overall score from 0.2969 to 0.3861 (+0.0892, +30.0%) compared to GPT-5.2, and achieves higher accuracy, improving from 0.5956 to 0.6576 (+0.0620, +10.4%). Compared to GPT-5.4, our method achieves a larger accuracy gain of +0.1289 (+24.4%). These results show that our method is more effective on challenging cases that require precise, evidence-based reasoning. Ablation studies further show that reranking, routing, and retrieval design are critical for stable performance, especially under difficult settings. Overall, we show how combining vision-based retrieval with controllable reasoning can improve evidence grounding and robustness in clinical AI applications, while pointing out that further work is needed to be more complete.

Keywords: Retrieval-Augmented Generation, Clinical Decision Support, Ophthalmology, Multimodal Reasoning, Evidence-based AI, Guideline Grounding, Controllable Retrieval, HealthBench
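
The reported improvements can be checked with a few lines of arithmetic. The sketch below recomputes the absolute and relative gains from the scores quoted in the abstract above. The helper `gain` is hypothetical, and the GPT-5.4 accuracy of 0.5287 is not stated in the abstract; it is inferred here by subtracting the reported +0.1289 from 0.6576, on the assumption that both comparisons use the same accuracy for the proposed method.

```python
def gain(baseline: float, ours: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent) of ours over baseline."""
    absolute = ours - baseline
    return absolute, 100.0 * absolute / baseline


# Scores quoted in the abstract (hard subset of HealthBench, versus GPT-5.2).
overall_abs, overall_rel = gain(0.2969, 0.3861)
acc_abs, acc_rel = gain(0.5956, 0.6576)

# Inferred, not reported: the GPT-5.4 accuracy implied by the +0.1289 gain.
gpt54_accuracy = 0.6576 - 0.1289
gpt54_abs, gpt54_rel = gain(gpt54_accuracy, 0.6576)

if __name__ == "__main__":
    print(f"overall score vs GPT-5.2: +{overall_abs:.4f} (+{overall_rel:.1f}%)")
    print(f"accuracy vs GPT-5.2:      +{acc_abs:.4f} (+{acc_rel:.1f}%)")
    print(f"accuracy vs GPT-5.4:      +{gpt54_abs:.4f} (+{gpt54_rel:.1f}%)")
```

The relative figures are percentages of the baseline score, so the same absolute gain looks larger against a lower baseline.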

59. ❌ Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors

Authors: Juan Sebastian Rojas, Chi-Guhn Lee Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21921v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
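The pass mark in each score line follows from the configuration at the top of the report (pass mark = 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each). The sketch below reproduces 26.6 and totals a score table. Treating a row's score as weight × relevance, capped at the per-keyword maximum of 15, is an assumption that is merely consistent with the rows shown; the report does not document the rule.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap from the report configuration

def pass_mark(total_weight: float) -> float:
    """Pass mark formula from the report configuration."""
    return 5.0 + 0.8 * total_weight

def total_score(rows):
    """Sum per-keyword scores; score = weight * relevance is an assumed rule."""
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)

threshold = pass_mark(27 * 1.0)   # 27 keywords, weight 1.0 each
rows = [(0.0, 0.0)] * 27          # (weight, relevance) pairs as listed in the table above
score = total_score(rows)
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```

With every weight listed as 0.0, the total stays at 0.0 whatever the relevance values are, which matches the failing entries in this part of the report.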

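Scoring rationale: The paper is a theoretical analysis of the temporal difference (TD) error in deep reinforcement learning (deep RL). It studies how two interpretations of the TD error diverge numerically in nonlinear deep RL architectures and how that divergence affects algorithm performance. The scoring keywords all center on large language models (LLMs), innovations in deep learning techniques, and their applications across domains. The core of this paper is deep RL theory: it does not touch LLMs, language models, training or fine-tuning techniques, inference optimization, AI agents, or model compression, and it is not applied to a scientific domain such as bioinformatics. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper finds that, in deep reinforcement learning, the two classic interpretations of the temporal difference error (the difference between successive predictions versus the difference between a bootstrapped target and a prediction) yield different numerical values under nonlinear architectures, and that this discrepancy affects the performance of deep differential RL algorithms that build on the TD error.

Abstract (Chinese translation)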

时间差分（TD）误差最早由Sutton（1988）正式提出，在该研究中，它最初被定义为时序连续预测值之间的差异，随后在同一工作中又被表述为自举目标与预测值之间的差值。自此，这两种对TD误差的解释在文献中被交替使用，后者最终被采纳为深度强化学习（RL）架构中标准的评论家损失函数。本研究表明，这两种对TD误差的解释并非总是等价的。具体而言，我们证明日益非线性的深度强化学习架构会导致这两种解释产生数值上逐渐增大的差异。基于这一发现，我们进一步阐述了在深度强化学习算法中，选择不同的TD误差解释如何影响其性能——特别是那些利用TD误差计算其他量的算法，例如深度差分（即平均奖励）强化学习方法。总体而言，我们的研究结果表明，将TD误差默认为自举目标与预测值之差的解释在深度强化学习场景中并不总是成立。

Abstract (original English)

The temporal difference (TD) error was first formalized in Sutton (1988), where it was initially characterized as the difference between temporally successive predictions, and later, in that same work, formulated as the difference between a bootstrapped target and a prediction. Since then, these two interpretations of the TD error have been used interchangeably in the literature, with the latter eventually being adopted as the standard critic loss in deep reinforcement learning (RL) architectures. In this work, we show that these two interpretations of the TD error are not always equivalent. In particular, we show that increasingly nonlinear deep RL architectures can cause these interpretations of the TD error to yield increasingly different numerical values. Then, building on this insight, we show how choosing one interpretation of the TD error over the other can affect the performance of deep RL algorithms that utilize the TD error to compute other quantities, as in deep differential (i.e., average-reward) RL methods. All in all, our results show that the default interpretation of the TD error as the difference between a bootstrapped target and a prediction does not always hold in deep RL settings.
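The second interpretation named in the abstract, the bootstrapped target minus the current prediction, is commonly written as δ = r + γ·V(s') − V(s). The sketch below computes it for a tabular value function. The state names, reward, discount, and step size are hypothetical illustration values, and the code does not attempt to reproduce the paper's other interpretation or its analysis of nonlinear architectures.

```python
def td_error(value, state, reward, next_state, gamma=0.99, terminal=False):
    """Bootstrapped-target TD error: (r + gamma * V(s')) - V(s)."""
    bootstrap = 0.0 if terminal else gamma * value[next_state]
    target = reward + bootstrap
    return target - value[state]

# Hypothetical two-state value table, used only for illustration.
V = {"s0": 0.5, "s1": 1.0}
delta = td_error(V, "s0", reward=0.2, next_state="s1", gamma=0.9)

# One TD(0) update with step size alpha moves V(s0) toward the target.
alpha = 0.1
V["s0"] += alpha * delta
print(delta, V["s0"])
```

In a tabular setting like this one, updating V(s0) changes only that entry; with a shared nonlinear function approximator an update can also move the predictions for other states, which is the kind of setting the abstract is concerned with.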

Keywords: deep reinforcement learning, temporal difference error, TD error interpretations, nonlinear architectures, deep differential RL, average-reward RL, algorithm performance, bootstrapped target

60. ❌ SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation

Authors: Linkuan Zhou, Yinghao Xia, Yufei Shen, Xiangyu Li, Wenjie Du, Cong Cong, Leyi Wei, Ran Su, Qiangguo Jin Venue/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21904v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

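Scoring rationale: The paper addresses unsupervised domain adaptation (UDA) for medical image segmentation, which falls under AI for Science (biomedical AI) and is highly relevant to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). Its core method is domain adaptation, which is highly relevant to the keyword "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points). The remaining keywords mainly concern large language model (LLM) techniques, reasoning, alignment, and optimization. This paper studies medical image segmentation in computer vision and does not cover LLMs, MoE, scaling laws, fine-tuning, RAG, attention mechanisms, agents, or quantization, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes SHAPE, a structure-aware hierarchical unsupervised domain adaptation framework that tackles domain adaptation in medical image segmentation through hierarchical feature modulation and hypergraph plausibility estimation, and reports state-of-the-art results on cardiac and abdominal cross-modality benchmarks.

Abstract (Chinese translation)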

无监督域适应（Unsupervised Domain Adaptation, UDA）对于在不同临床环境中部署医学分割模型至关重要。现有方法存在根本性局限：一方面，其语义感知不足的特征对齐导致分布保真度低下；另一方面，伪标签验证过程忽略了全局解剖学约束，因而无法防止生成全局结构不合理的分割结果。为解决这些问题，我们提出SHAPE（基于结构感知与合理性评估的分层无监督域适应框架），该框架将适应目标重新定义为追求全局解剖合理性。该方法以DINOv3为基础架构，其分层特征调制模块首先生成兼具高保真度与类别感知的特征，从而将核心挑战转向对伪标签的鲁棒性验证。为增强传统的像素级验证，我们引入超图合理性估计模块，利用超图来评估标准图结构无法捕捉的全局解剖合理性。该模块与结构异常剪枝模块相结合，通过跨视图稳定性消除残留的伪影。SHAPE在心脏与腹部跨模态基准测试中显著优于现有方法，在心脏数据上达到平均Dice分数90.08%（MRI->CT）和78.51%（CT->MRI），在腹部数据上达到87.48%（MRI->CT）和86.89%（CT->MRI），实现了最先进的性能。代码已发布于https://github.com/BioMedIA-repo/SHAPE。

Abstract (original English)

Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo-label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class-awareness. This shifts the core challenge to robustly validating pseudo-labels. To augment conventional pixel-level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture. This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross-view stability. SHAPE significantly outperforms prior methods on cardiac and abdominal cross-modality benchmarks, achieving state-of-the-art average Dice scores of 90.08% (MRI->CT) and 78.51% (CT->MRI) on cardiac data, and 87.48% (MRI->CT) and 86.89% (CT->MRI) on abdominal data. The code is available at https://github.com/BioMedIA-repo/SHAPE.
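The results above are reported as Dice scores. As a reminder of what that metric measures, the sketch below computes the Dice coefficient, 2·|P ∩ T| / (|P| + |T|), for two binary masks. The masks are hypothetical, and returning 1.0 when both masks are empty is a convention chosen here, not something taken from the paper.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks (0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical 2x4 masks, flattened row by row.
pred   = [0, 1, 1, 0,
          0, 1, 1, 1]
target = [0, 1, 1, 0,
          0, 0, 1, 1]
print(f"Dice = {dice_score(pred, target):.4f}")
```

Dice rewards pixel overlap only, so a prediction can score well while still containing an anatomically implausible structure; that gap is what the abstract's plausibility modules are meant to address.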

Keywords: Unsupervised Domain Adaptation, Medical Image Segmentation, Structure-aware, Hierarchical Feature Modulation, Plausibility Evaluation, Hypergraph, Cross-modality, DINOv3

61. ❌ Not All Layers Are Created Equal: Adaptive LoRA Ranks for Personalized Image Generation

Authors: Donald Shenaj, Federico Errica, Antonio Carta Venue/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21884v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

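Scoring rationale: The paper focuses on personalized image generation with diffusion models and proposes an adaptive LoRA rank method (LoRA²). Its core is highly relevant to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10 points): LoRA is the paper's central technique, and the paper improves how LoRA ranks are chosen. All other keywords are unrelated to the paper (0 points), since it does not involve large language models, reasoning, alignment, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper addresses the fixed choice of LoRA rank in personalized image generation with diffusion models and proposes LoRA², an adaptive-rank method that maintains performance while substantially reducing memory consumption and rank size.

Abstract (Chinese translation)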

低秩自适应（Low Rank Adaptation，LoRA）是从预训练扩散模型中生成个性化图像的实际微调策略。选择合适的秩至关重要，因为它需要在性能与内存消耗之间进行权衡，但当前的决定往往依赖于社区共识，而忽略了个性化主题的复杂性。其原因显而易见：为每个LoRA组件选择合适秩的成本是组合性的，因此我们倾向于采用实用捷径，例如为所有组件固定相同的秩。本文中，我们迈出了克服这一挑战的第一步。受学习神经网络自适应宽度的变分方法启发，我们允许每个层的秩在针对特定主题的微调过程中自由适应。我们通过对秩的位置施加重要性排序来实现这一点，从而在严格需要时有效促进更高秩的创建。在定性与定量评估中，我们的方法LoRA$^2$在29个主题上实现了DINO、CLIP-I和CLIP-T指标之间的竞争性权衡，同时相比高秩版本的LoRA，所需内存更少且秩更低。代码：https://github.com/donaldssh/NotAllLayersAreCreatedEqual。

Abstract (original English)

Low Rank Adaptation (LoRA) is the de facto fine-tuning strategy to generate personalized images from pre-trained diffusion models. Choosing a good rank is extremely critical, since it trades off performance and memory consumption, but today the decision is often left to the community’s consensus, regardless of the personalized subject’s complexity. The reason is evident: the cost of selecting a good rank for each LoRA component is combinatorial, so we opt for practical shortcuts such as fixing the same rank for all components. In this paper, we take a first step to overcome this challenge. Inspired by variational methods that learn an adaptive width of neural networks, we let the ranks of each layer freely adapt during fine-tuning on a subject. We achieve it by imposing an ordering of importance on the rank’s positions, effectively encouraging the creation of higher ranks when strictly needed. Qualitatively and quantitatively, our approach, LoRA$^2$, achieves a competitive trade-off between DINO, CLIP-I, and CLIP-T across 29 subjects while requiring much less memory and lower rank than high rank LoRA versions. Code: https://github.com/donaldssh/NotAllLayersAreCreatedEqual.
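The memory side of the rank trade-off comes down to a parameter count: a LoRA adapter of rank r on a d_out × d_in weight adds r·(d_in + d_out) trainable parameters. The sketch below compares one fixed rank for all layers with a per-layer assignment. The layer shapes and the per-layer ranks are invented for illustration; the paper learns its ranks during fine-tuning, and this code does not implement that procedure.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Hypothetical layer shapes (d_in, d_out); not taken from the paper.
layers = [(320, 320), (640, 640), (1280, 1280), (1280, 1280)]

# Shortcut described in the abstract: one fixed rank for every component.
fixed_rank = 64
fixed_total = sum(lora_params(d_in, d_out, fixed_rank) for d_in, d_out in layers)

# Alternative: a different rank per layer (values made up for illustration).
per_layer_ranks = [4, 8, 32, 16]
adaptive_total = sum(
    lora_params(d_in, d_out, r) for (d_in, d_out), r in zip(layers, per_layer_ranks)
)

print(f"fixed rank {fixed_rank}: {fixed_total:,} parameters")
print(f"per-layer ranks {per_layer_ranks}: {adaptive_total:,} parameters")
```

Because the count is linear in the rank, any layer that can make do with a lower rank reduces adapter memory in direct proportion.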

Keywords: Low Rank Adaptation, LoRA, personalized image generation, diffusion models, adaptive rank, parameter-efficient fine-tuning, memory efficiency

62. ❌ SmaAT-QMix-UNet: A Parameter-Efficient Vector-Quantized UNet for Precipitation Nowcasting

Authors: Nikolas Stavrou, Siamak Mehrkanoon Venue/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21879v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

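Scoring rationale: The paper uses a deep convolutional neural network (a UNet architecture) for precipitation nowcasting, an application of AI in meteorology. Although this falls within AI for Science, the paper does not involve large language models (LLMs), foundation models, or related techniques such as MoE, RLHF, or RAG. Its core innovations are vector quantization (VQ) and mixed convolutions (MixConv) for model compression and better performance, but these are not directly related to the large-model techniques the scoring keywords target (there, quantization and model compression usually refer to LLM quantization). All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes SmaAT-QMix-UNet, a parameter-efficient vector-quantized UNet that adds a vector quantization bottleneck and mixed convolutions to shrink the model and improve precipitation nowcasting accuracy.

Abstract (Chinese translation)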

天气预报对关键的社会经济活动具有支撑作用，并有助于环境保护，然而业务化数值天气预报（Numerical Weather Prediction, NWP）系统仍然计算密集，因此在某些应用中效率不高。与此同时，近期基于深度数据驱动模型的研究在临近预报任务中展现出有前景的结果。本文提出SmaAT-QMix-UNet，它是SmaAT-UNet的增强变体，引入了两项关键创新：在编码器-解码器桥接处加入向量量化（Vector Quantization, VQ）瓶颈层，并用混合核深度可分离卷积（mixed kernel depth-wise convolutions, MixConv）替换选定的编码器和解码器模块。这些改进既减小了模型规模，又提升了其临近预报性能。我们在荷兰雷达降水数据集（2016-2019年）上训练并评估了SmaAT-QMix-UNet，用于预测未来30分钟的降水。我们对比了三种配置：仅使用VQ、仅使用MixConv以及完整的SmaAT-QMix-UNet。Grad-CAM显著性图突出了影响每次临近预报的关键区域，而码本的UMAP嵌入则展示了VQ层如何对编码器输出进行聚类。SmaAT-QMix-UNet的源代码已在GitHub上公开：https://github.com/nstavr04/MasterThesisSnellius。

Abstract (original English)

Weather forecasting supports critical socioeconomic activities and complements environmental protection, yet operational Numerical Weather Prediction (NWP) systems remain computationally intensive and are therefore inefficient for certain applications. Meanwhile, recent advances in deep data-driven models have demonstrated promising results in nowcasting tasks. This paper presents SmaAT-QMix-UNet, an enhanced variant of SmaAT-UNet that introduces two key innovations: a vector quantization (VQ) bottleneck at the encoder-decoder bridge, and mixed kernel depth-wise convolutions (MixConv) replacing selected encoder and decoder blocks. These enhancements both reduce the model’s size and improve its nowcasting performance. We train and evaluate SmaAT-QMix-UNet on a Dutch radar precipitation dataset (2016-2019), predicting precipitation 30 minutes ahead. Three configurations are benchmarked: using only VQ, only MixConv, and the full SmaAT-QMix-UNet. Grad-CAM saliency maps highlight the regions influencing each nowcast, while a UMAP embedding of the codewords illustrates how the VQ layer clusters encoder outputs. The source code for SmaAT-QMix-UNet is publicly available on GitHub at https://github.com/nstavr04/MasterThesisSnellius.
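The central operation of a vector quantization bottleneck is a nearest-codeword lookup: each encoder output vector is replaced by the closest entry in a learned codebook. The sketch below shows only that lookup. The 2-D codebook and encoder outputs are hypothetical, and the code leaves out codebook learning and everything else specific to SmaAT-QMix-UNet.

```python
def quantize(vector, codebook):
    """Return (index, codeword) of the codebook entry nearest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

# Hypothetical 2-D codebook with three codewords.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]

# Hypothetical encoder outputs at three spatial positions.
encoder_outputs = [(0.1, -0.2), (0.9, 1.2), (-0.7, 1.6)]

indices = [quantize(z, codebook)[0] for z in encoder_outputs]
print(indices)
```

Every encoder output that maps to the same index is represented by the same codeword, which is why the lookup acts as a clustering of encoder outputs.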

Keywords: Precipitation Nowcasting, Vector Quantization, UNet, Parameter-efficient, Deep Learning, Weather Forecasting, Model Compression, Radar Data

63. ❌ P^2O: Joint Policy and Prompt Optimization

Authors: Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun Venue/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21877v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

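Scoring rationale: The paper's core is strengthening LLM reasoning: it combines prompt optimization with policy optimization to address inefficient exploration in RLVR. It is highly relevant to "Large Language Models" (10 points) because it explicitly studies improving LLM reasoning. It is relevant to "Chain of Thought" and "System 2 Thinking" (8 points each) because it targets multi-step and in-depth reasoning. Other keywords, such as MoE, quantization, and AI for science, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the P^2O framework, which jointly optimizes prompts and the policy to address the inefficient exploration that LLMs exhibit on hard samples in reinforcement learning with verifiable rewards, and reports significant gains in model performance and in generalization to out-of-distribution data.

Abstract (Chinese translation)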

具备可验证奖励的强化学习已成为增强大语言模型推理能力的重要范式。然而，传统RLVR方法存在探索效率低下的问题，尤其在面对成功率趋近于零的“困难样本”时。在此类场景中，对稀疏结果奖励的依赖通常导致优势估计为零，尽管这些样本蕴含高信息价值，模型却无法获得有效的监督信号。为解决这一问题，我们提出P^2O框架，该框架将提示优化与策略优化协同整合。P^2O在训练迭代中识别困难样本，并利用遗传帕累托算法进化提示模板，引导模型发现成功轨迹。关键创新在于：与传统依赖输入增强的提示工程方法不同，P^2O将通过优化提示产生的推理增益直接蒸馏至模型参数中。该机制为困难样本提供了更密集的正向监督信号，并加速了收敛过程。大量实验表明，P^2O不仅在分布内数据集上取得更优性能，还展现出强大的泛化能力，在分布外基准测试中实现了显著提升（平均+4.7%）。

Abstract (original English)

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting “hard samples” that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P^2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P^2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P^2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).
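The zero-advantage problem described in the abstract is easy to see with a group-mean baseline: if every rollout for a prompt fails, each reward equals the group mean and every advantage is zero. The sketch below assumes that kind of group-relative estimator, as used in GRPO-style training; the abstract does not say which estimator P^2O uses, and the reward values are hypothetical.

```python
def group_advantages(rewards):
    """Advantage of each rollout relative to the mean reward of its group."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Binary outcome rewards for 8 rollouts of one prompt (hypothetical values).
hard_sample = [0, 0, 0, 0, 0, 0, 0, 0]      # success rate 0: no learning signal
easier_sample = [0, 1, 0, 0, 1, 0, 0, 0]    # mixed outcomes: nonzero advantages

print(group_advantages(hard_sample))
print(group_advantages(easier_sample))
```

A single successful rollout in the group is enough to make the advantages nonzero, which is the motivation for guiding the model toward at least one successful trajectory on hard samples.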

Keywords: Large Language Models, Reinforcement Learning, Prompt Optimization, Policy Optimization, Reasoning Capabilities, Hard Samples, Generalization, P^2O

64. ❌ Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Authors: Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, Harry Yang Venue/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21872v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

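Scoring rationale: The paper focuses on a reinforcement learning alignment method for video generation (SAGE-GRPO), which aims to fix the unreliable reward estimates and unstable training caused by exploration noise. It is unrelated to most keywords, which mainly target large language models (LLMs) and related techniques such as MoE, quantization, inference acceleration, and RAG. It is only indirectly related to a few keywords: 1) "Post-training OR Supervised Fine-tuning OR SFT" (5 points): the paper involves post-training alignment, but for video generation rather than language models. 2) "Instruction Tuning OR Alignment OR Value Alignment" (5 points): the paper involves alignment techniques, but applies them to video generation rather than instruction tuning. 3) "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (5 points): the paper uses GRPO, a reinforcement learning optimization method that belongs to the same family of RL-based alignment as RLHF and DPO, but applies it to video generation. Other keywords, such as "AI for Science", are not relevant because the paper focuses on video generation rather than applications in scientific domains.

!!! tip deepseek-chat TL;DR

The paper targets reinforcement learning alignment methods for video generation (such as GRPO), where exploration noise makes reward estimates unreliable and training unstable, and proposes SAGE-GRPO, a method based on manifold-aware constraints that achieves more stable alignment and higher video quality on HunyuanVideo1.5.

Abstract (Chinese translation)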

诸如FlowGRPO等面向视频生成的群组相对策略优化（Group Relative Policy Optimization, GRPO）方法，其可靠性仍远低于语言模型和图像生成领域的同类方法。这一差距源于视频生成具有复杂的解空间，且用于探索的常微分方程至随机微分方程（ODE-to-SDE）转换可能引入过量噪声，从而降低生成序列的质量，并使奖励估计的可靠性下降，进而破坏训练后对齐的稳定性。为解决此问题，我们将预训练模型视为定义了一个有效的视频数据流形，并将核心问题归结为将探索约束在该流形邻域内，从而确保生成质量得以保持且奖励估计依然可靠。我们提出SAGE-GRPO（基于稳定探索的对齐方法），该方法在微观与宏观层面同时施加约束。在微观层面，我们推导出一种具有对数曲率修正的精确流形感知随机微分方程，并引入梯度范数均衡器以稳定不同时间步的采样与更新过程。在宏观层面，我们采用带有周期性移动锚点与逐步约束的双重信任区域机制，使信任区域能够跟踪更接近流形的检查点，并限制长时程漂移。我们在HunyuanVideo1.5模型上使用原始VideoAlign作为奖励模型对SAGE-GRPO进行评估，结果显示其在视频质量（VQ）、运动质量（MQ）、时序对齐（TA）及视觉指标（CLIPScore、PickScore）上均较先前方法取得稳定提升，证明了其在奖励最大化与整体视频质量方面的优越性能。代码与视觉展示页面位于https://dungeonmassster.github.io/SAGE-GRPO-Page/。

Abstract (original English)

Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at https://dungeonmassster.github.io/SAGE-GRPO-Page/.
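To make the noise-injection concern concrete, the sketch below contrasts a deterministic Euler step for an ODE with an Euler-Maruyama step for an SDE that shares the same drift. It is a toy only: the drift function, noise scale `sigma`, and step count are hypothetical, and adding noise this way is not the ODE-to-SDE conversion used by FlowGRPO or SAGE-GRPO, which the abstract does not spell out.

```python
import math
import random

def drift(x, t):
    """Toy velocity field (hypothetical): pulls the sample toward 1.0."""
    return 1.0 - x

def ode_step(x, t, dt):
    """Deterministic Euler step: x + f(x, t) * dt."""
    return x + drift(x, t) * dt

def sde_step(x, t, dt, sigma, rng):
    """Euler-Maruyama step: Euler step plus Gaussian noise scaled by sigma * sqrt(dt)."""
    return ode_step(x, t, dt) + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)

def rollout(x0, steps, sigma, seed):
    """Integrate from t = 0 to t = 1 with a seeded noise source."""
    rng = random.Random(seed)
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        x = sde_step(x, k * dt, dt, sigma, rng)
    return x

def endpoint_std(sigma, n_rollouts=200, steps=20):
    """Standard deviation of rollout endpoints over differently seeded runs."""
    ends = [rollout(0.0, steps, sigma, seed) for seed in range(n_rollouts)]
    mean = sum(ends) / len(ends)
    return math.sqrt(sum((e - mean) ** 2 for e in ends) / len(ends))

# sigma = 0 recovers the deterministic ODE rollout; larger sigma spreads the endpoints.
for sigma in (0.0, 0.5, 2.0):
    print(f"sigma={sigma}: endpoint std = {endpoint_std(sigma):.3f}")
```

The spread of endpoints grows with `sigma`, which gives a rough picture of why too much exploration noise can push rollouts away from the region where a reward model is reliable.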

Keywords: Video Generation, Reinforcement Learning, Alignment, GRPO, Manifold-Aware Exploration, SDE, Trust Region, Stable Training

65. ❌ Adversarial Camouflage

Authors: Paweł Borsukiewicz, Daniele Lunghi, Melissa Tessa, Jacques Klein, Tegawendé F. Bissyandé Venue/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21867v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

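Scoring rationale: The paper studies adversarial attacks on facial recognition systems and proposes a method called "Adversarial Camouflage" to protect user privacy. Its content is entirely about attacking and defending facial recognition algorithms in computer vision. It does not involve large language models, innovations in deep learning techniques, AI for Science, or any of the large-model techniques covered by the scoring keywords. No keyword is relevant to the paper's topic, so all score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Adversarial Camouflage, a new adversarial attack that adds optimized patterns to facial regions to significantly degrade multiple state-of-the-art face recognition models, protecting user privacy and demonstrating that the attack transfers across models.

Abstract (Chinese translation)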

尽管面部识别算法的快速发展催生了众多有益应用，但其广泛部署也引发了人们对大规模监控风险与个人隐私威胁的严重担忧。本文提出一种名为“对抗伪装”（Adversarial Camouflage）的新颖方案，旨在以高效且易于用户在物理世界中复现的方式保护隐私。该算法首先定义了一个由颜色、形状和角度参数化的低维模式空间，随后将寻得的优化模式映射至语义有效的面部区域进行评估。我们的方法通过在多种架构上最大化识别误差，确保了即使面对黑盒系统仍具有较高的跨模型可迁移性。在模拟测试中，该方法显著降低了所有已测试前沿人脸识别模型的性能，并在真实世界的人体实验中展现出积极效果，同时揭示了不同模型间的鲁棒性差异以及攻击策略在跨架构间的可迁移性证据。

Abstract (original English)

While the rapid development of facial recognition algorithms has enabled numerous beneficial applications, their widespread deployment has raised significant concerns about the risks of mass surveillance and threats to individual privacy. In this paper, we introduce Adversarial Camouflage as a novel solution for protecting users’ privacy. This approach is designed to be efficient and simple to reproduce for users in the physical world. The algorithm starts by defining a low-dimensional pattern space parameterized by color, shape, and angle. Optimized patterns, once found, are projected onto semantically valid facial regions for evaluation. Our method maximizes recognition error across multiple architectures, ensuring high cross-model transferability even against black-box systems. It significantly degrades the performance of all tested state-of-the-art face recognition models during simulations and demonstrates promising results in real-world human experiments, while revealing differences in model robustness and evidence of attack transferability across architectures.

Keywords: Adversarial Camouflage, facial recognition, privacy protection, adversarial attack, cross-model transferability, face recognition models, pattern optimization, model robustness

66. ❌ Tacit Knowledge Management with Generative AI: Proposal of the GenAI SECI Model

Authors: Naoshi Uchihira Venue/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21866v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

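Scoring rationale: The paper studies the use of generative AI in knowledge management, specifically tacit knowledge management, and proposes the GenAI SECI model and the concept of Digital Fragmented Knowledge. It is entirely unrelated to most technical keywords (MoE, quantization, inference acceleration, and so on) and only partly related to "Large Language Models OR LLMs OR Foundation Models" (5 points): generative AI is usually built on large language models, but the paper does not discuss specific model techniques in depth. Other keywords, such as AI for Science, are not directly related to the paper's topic.

!!! tip deepseek-chat TL;DR

The paper addresses the limited use of generative AI for tacit knowledge management by proposing the GenAI SECI model and the concept of Digital Fragmented Knowledge, which integrate explicit and tacit knowledge, and by designing a corresponding system architecture.

Abstract (Chinese translation)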

生成式人工智能的出现正在引发知识管理的重大变革。生成式人工智能有望解决传统知识管理系统的局限性，并正日益在实际场景中得到部署，展现出可观的应用前景。相关研究也在迅速扩展。然而，现有工作大多聚焦于与显性知识管理相关的研究与实践。尽管已有一些利用生成式人工智能进行隐性知识管理的零散尝试，但以集成方式同时处理隐性知识与显性知识的建模与体系化研究仍显不足。本文提出了“GenAI SECI”模型，作为知识创造过程（SECI）模型的更新版本，其设计旨在充分利用生成式人工智能的能力。“GenAI SECI”模型的一个核心特征是引入了“数字碎片化知识”这一新概念，它实现了网络空间内显性知识与隐性知识的融合。此外，本文还提出了所建议模型的具体系统架构，并与具有相似问题意识和目标的先前研究模型进行了比较。

Abstract (original English)

The emergence of generative AI is bringing about a significant transformation in knowledge management. Generative AI has the potential to address the limitations of conventional knowledge management systems, and it is increasingly being deployed in real-world settings with promising results. Related research is also expanding rapidly. However, much of this work focuses on research and practice related to the management of explicit knowledge. While fragmentary efforts have been made regarding the management of tacit knowledge using generative AI, the modeling and systematization that handle both tacit and explicit knowledge in an integrated manner remain insufficient. In this paper, we propose the “GenAI SECI” model as an updated version of the knowledge creation process (SECI) model, redesigned to leverage the capabilities of generative AI. A defining feature of the “GenAI SECI” model is the introduction of “Digital Fragmented Knowledge”, a new concept that integrates explicit and tacit knowledge within cyberspace. Furthermore, a concrete system architecture for the proposed model is presented, along with a comparison with prior research models that share a similar problem awareness and objectives.

Keywords: Generative AI, Knowledge Management, Tacit Knowledge, SECI Model, Digital Fragmented Knowledge, System Architecture, Explicit Knowledge

67. ❌ Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation

Authors: Yuyang You, Yongzhi Li, Jiahui Li, Yadong Mu, Quan Chen, Peng Jiang Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21864v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

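Scoring rationale: The paper focuses on distillation for video diffusion models. Its core contributions are an adaptive regression loss, a temporal regularization loss, and an inference-time frame interpolation strategy that address oversaturation, temporal inconsistency, and mode collapse. All scoring keywords target large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, CoT, quantization, hallucination mitigation, and so on) or AI applications in specific scientific fields (such as bioinformatics). This work studies diffusion-model distillation for video generation, a subfield of generative AI with no direct connection to LLM techniques, language-model applications, or the listed scientific AI domains, so every keyword receives a relevance of 0.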

!!! tip deepseek-chat TL;DR

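The paper targets oversaturation, temporal inconsistency, and mode collapse in the distillation of video diffusion models. It proposes a framework that combines an adaptive regression loss, a temporal regularization loss, and an inference-time frame interpolation strategy, achieving stable few-step video synthesis on the VBench and VBench2 benchmarks with significantly improved perceptual fidelity and motion realism.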

Abstract (Chinese translation)

视频生成近期已成为生成式人工智能领域的核心任务。然而，视频合成固有的巨大计算成本使得模型蒸馏成为高效部署的关键技术。尽管其重要性显著，目前专门针对视频扩散模型设计的蒸馏方法仍较为匮乏。主流方法通常直接沿用图像蒸馏技术，这常常导致过饱和、时间不一致和模式崩溃等伪影问题。为应对这些挑战，我们提出了一种专为视频扩散模型定制的新型蒸馏框架。其核心创新包括：（1）一种自适应回归损失，通过动态调整空间监督权重以防止因分布偏移过大而产生的伪影；（2）一种时间正则化损失，用以对抗时间崩溃，促进平滑且物理合理的采样轨迹；（3）一种推理时帧插值策略，在保持感知质量的同时降低采样开销。在VBench和VBench2基准上进行的大量实验与消融研究表明，我们的方法实现了稳定的少步数视频合成，显著提升了感知保真度与运动真实感。在多项指标上，该方法均稳定优于现有蒸馏基线。

摘要 (Abstract)

Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.

Keywords: video generation, model distillation, diffusion models, oversaturation, temporal collapse, adaptive regression loss, temporal regularization, frame interpolation

68. ❌ Sim-to-Real of Humanoid Locomotion Policies via Joint Torque Space Perturbation Injection

Authors: Junhyeok Rui Cha, Woohyun Cha, Jaeyong Shin, Donghyeon Kim, Jaeheung Park Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21853v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

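Scoring rationale: The paper studies sim-to-real transfer of robot control policies, specifically humanoid locomotion control, by injecting state-dependent perturbations into joint torques to emulate the reality gap. The scoring keywords all concern large language models, the deep learning techniques behind them, or AI applications in science, whereas this work is about robot control, reinforcement learning, and simulation. It does not involve large models, the deep learning techniques named in the keywords, or AI applications in scientific fields such as bioinformatics or cheminformatics. Every keyword therefore receives a relevance of 0.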

!!! tip deepseek-chat TL;DR

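The paper proposes training humanoid locomotion policies by injecting state-dependent perturbations into joint torques, as a way to address sim-to-real transfer. Experiments show that the method improves policy robustness to complex, unseen reality gaps.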

Abstract (Chinese translation)

本文针对基于仿真经验训练控制策略的现有仿真到现实方法，提出了一种新颖的替代方案。与通常依赖于在固定有限参数集上进行领域随机化的先前方法不同，所提方法在正向仿真过程中向输入关节扭矩注入状态相关的扰动。这些扰动旨在模拟比标准参数随机化更广泛的现实差距，且无需额外训练。通过使用神经网络作为灵活的扰动生成器，所提方法能够表征参数随机化无法捕获的复杂、状态相关的不确定性，例如非线性执行器动力学和接触柔顺性。实验结果表明，所提方法使人形机器人运动策略在仿真和实际部署中，对复杂、未知的现实差距实现了卓越的鲁棒性。

摘要 (Abstract)

This paper proposes a novel alternative to existing sim-to-real methods for training control policies with simulated experiences. Unlike prior methods that typically rely on domain randomization over a fixed finite set of parameters, the proposed approach injects state-dependent perturbations into the input joint torque during forward simulation. These perturbations are designed to simulate a broader spectrum of reality gaps than standard parameter randomization without requiring additional training. By using neural networks as flexible perturbation generators, the proposed method can represent complex, state-dependent uncertainties, such as nonlinear actuator dynamics and contact compliance, that parametric randomization cannot capture. Experimental results demonstrate that the proposed approach enables humanoid locomotion policies to achieve superior robustness against complex, unseen reality gaps in both simulation and real-world deployment.

Keywords: sim-to-real, humanoid locomotion, control policies, joint torque perturbation, state-dependent perturbations, reality gap, robustness, neural networks

69. ❌ Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models

Authors: Aryan Kasat, Smriti Singh, Aman Chadha, Vinija Jain Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21854v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

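Scoring rationale: The paper studies how large language models (LLMs) behave on moral reasoning tasks, so it bears directly on 'Large Language Models' and on 'Alignment' (it examines how alignment training shapes reasoning outputs). By analyzing model responses to moral dilemmas, it evaluates the consistency and depth of their reasoning, which makes it highly relevant to 'Chain of Thought' and 'System 2 Thinking', keywords that cover multi-step reasoning and deliberate thinking. The remaining keywords (MoE, SLMs, Scaling Laws, Pre-training, and so on) concern techniques or application areas the paper does not address, so they score 0.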
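
The table above rates four keywords at 10.0/10 yet reports a total of 0.0, because every row carries a weight of 0.0. The sketch below reproduces that arithmetic. The pass-mark formula (5.0 + 0.8 × total weight) comes from the report's configuration section; the per-keyword rule `score = weight × relevance` and the function names are assumptions made for illustration, not a description of the report generator's actual code.

```python
# Hypothetical helpers: the names and the weight * relevance rule are assumptions.

def pass_threshold(total_weight, base=5.0, per_weight=0.8):
    """Pass mark as stated in the report configuration: base + per_weight * total weight."""
    return base + per_weight * total_weight


def paper_score(rows):
    """Sum weight * relevance over (weight, relevance) rows (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# Rows as (weight, relevance out of 10), mirroring the table: four keywords
# rated 10.0 and twenty-three rated 0.0, all with weight 0.0.
rows = [(0.0, 10.0)] * 4 + [(0.0, 0.0)] * 23

total = paper_score(rows)
threshold = pass_threshold(27.0)
print(f"score={total:.1f}, pass mark={threshold:.1f}, passed={total >= threshold}")
```

Under this assumed rule, the relevance ratings cannot lift the total while the weights are 0.0. With the configured weight of 1.0 on those four rows, the same rule would give 40.0, which is above the 26.6 pass mark.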

!!! tip deepseek-chat TL;DR

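The paper asks whether large language models show a genuine developmental trajectory of moral reasoning when answering moral dilemmas. It finds that responses overwhelmingly fall in the post-conventional stages, the inverse of human developmental norms, and that some models exhibit moral decoupling. This suggests that alignment training may give models only the rhetorical conventions of mature moral reasoning rather than the underlying developmental trajectory.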

Abstract (Chinese translation)

大型语言模型是否具备道德推理能力，抑或仅仅是听起来像具备这种能力？本研究探讨了LLM对道德困境的回应是否真实地呈现出科尔伯格道德发展阶段理论中的发展性递进，还是说对齐训练仅仅产生了表面类似成熟道德判断的、缺乏内在发展轨迹的推理式输出。我们采用一个经三种评判模型验证的“LLM即评判者”评分流程，对来自13种不同架构、参数规模和训练机制的LLM在六个经典道德困境中生成的600余条回应进行分类，并进行了十项互补性分析，以刻画所得模式的本质与内在一致性。我们的结果揭示了一个显著的倒置现象：无论模型规模、架构或提示策略如何，回应都压倒性地对应于后习俗水平推理（第5-6阶段），这实质上与人类以第4阶段为主导的发展常态相反。最引人注目的是，一个模型子集表现出道德脱钩现象：即陈述的道德理由与行动选择之间存在系统性不一致。这种逻辑上的不连贯性在不同规模和提示策略下持续存在，代表了一种独立于修辞复杂性的直接推理一致性失败。模型规模具有统计学上显著但实际影响较小的效应；训练类型没有显著的独立主效应；模型表现出近乎机械的跨困境一致性，在语义不同的道德问题上产生逻辑上无法区分的回应。我们认为，这些模式构成了道德口技现象的证据：即通过对齐训练，模型习得了成熟道德推理的修辞惯例，却未获得这些惯例本应代表的深层发展轨迹。

摘要 (Abstract)

Do large language models reason morally, or do they merely sound like they do? We investigate whether LLM responses to moral dilemmas exhibit genuine developmental progression through Kohlberg’s stages of moral development, or whether alignment training instead produces reasoning-like outputs that superficially resemble mature moral judgment without the underlying developmental trajectory. Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and conduct ten complementary analyses to characterize the nature and internal coherence of the resulting patterns. Our results reveal a striking inversion: responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model size, architecture, or prompting strategy, the effective inverse of human developmental norms, where Stage 4 dominates. Most strikingly, a subset of models exhibit moral decoupling: systematic inconsistency between stated moral justification and action choice, a form of logical incoherence that persists across scale and prompting strategy and represents a direct reasoning consistency failure independent of rhetorical sophistication. Model scale carries a statistically significant but practically small effect; training type has no significant independent main effect; and models exhibit near-robotic cross-dilemma consistency producing logically indistinguishable responses across semantically distinct moral problems. We posit that these patterns constitute evidence for moral ventriloquism: the acquisition, through alignment training, of the rhetorical conventions of mature moral reasoning without the underlying developmental trajectory those conventions are meant to represent.

Keywords: Large Language Models, Moral Reasoning, Alignment Training, Kohlberg’s Stages, Moral Dilemmas, Reasoning Consistency, Moral Ventriloquism, LLM-as-Judge

70. ❌ Agentic Personas for Adaptive Scientific Explanations with Knowledge Graphs

Authors: Susana Nunes, Tiago Guerreiro, Catia Pesquita Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21846v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

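Scoring rationale: The paper studies a reinforcement learning approach to scientific explanation generation that uses knowledge graphs and agentic personas to produce adaptive explanations, applied to drug discovery. Relevant keywords: (1) 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points): the paper explicitly studies an AI application in a scientific domain (drug discovery), which is its core content. (2) 'Mechanistic Interpretability OR Explainable AI' (8 points): the paper centers on AI explanation methods (explainability), its main research topic. (3) 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5 points): the paper uses 'agentic personas' and an 'explanation agent', which involve the notion of agents, but it does not explicitly use LLM agents or autonomous agentic workflows, so the connection is partial. No other keyword is mentioned in the title or abstract, so the rest score 0.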

!!! tip deepseek-chat TL;DR

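The paper addresses the lack of adaptivity in AI explanation methods for scientific discovery. It proposes a knowledge-graph explanation method based on reinforcement learning and agentic personas. In a drug discovery evaluation, the method matches state-of-the-art predictive performance, reduces feedback requirements by two orders of magnitude, and its adaptive explanations are consistently preferred over non-adaptive baselines.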

Abstract (Chinese translation)

人工智能解释方法通常采用静态用户模型，无论专家的目标、推理策略或决策情境如何，均生成非自适应的解释。基于知识图谱的解释方法虽具备基于路径的实体化推理能力，仍继承了这一局限。在科学发现等复杂领域中，这种假设无法捕捉专家群体多样的认知策略与认知立场，导致解释难以促进深度理解和知情决策。然而，人类专家的稀缺性限制了直接利用人工反馈生成自适应解释的可行性。
本研究提出一种用于科学解释生成的强化学习方法，该方法引入代理角色——即专家推理策略的结构化表征，以引导解释代理遵循特定的认知偏好。在针对药物发现领域的知识图谱解释评估中，我们测试了两种基于专家反馈构建的、体现不同认知立场的代理角色。
结果表明，角色驱动的解释在达到最先进预测性能的同时，其偏好与对应专家的偏好高度吻合。自适应解释持续优于非自适应基线模型（样本量 n = 22），且基于角色的训练将反馈需求降低了两个数量级。这些发现证明，代理角色能够为复杂高风险领域的人工智能系统提供可扩展的自适应可解释性解决方案。

摘要 (Abstract)

AI explanation methods often assume a static user model, producing non-adaptive explanations regardless of expert goals, reasoning strategies, or decision contexts. Knowledge graph-based explanations, despite their capacity for grounded, path-based reasoning, inherit this limitation. In complex domains such as scientific discovery, this assumption fails to capture the diversity of cognitive strategies and epistemic stances among experts, preventing explanations that foster deeper understanding and informed decision-making. However, the scarcity of human experts limits the use of direct human feedback to produce adaptive explanations. We present a reinforcement learning approach for scientific explanation generation that incorporates agentic personas, structured representations of expert reasoning strategies, that guide the explanation agent towards specific epistemic preferences. In an evaluation of knowledge graph-based explanations for drug discovery, we tested two personas that capture distinct epistemic stances derived from expert feedback. Results show that persona-driven explanations match state-of-the-art predictive performance while persona preferences closely align with those of their corresponding experts. Adaptive explanations were consistently preferred over non-adaptive baselines (n = 22), and persona-based training reduces feedback requirements by two orders of magnitude. These findings demonstrate how agentic personas enable scalable adaptive explainability for AI systems in complex and high-stakes domains.

Keywords: scientific explanation generation, knowledge graphs, agentic personas, reinforcement learning, drug discovery, adaptive explainability, expert reasoning strategies, AI systems

71. ❌ On the Number of Conditional Independence Tests in Constraint-based Causal Discovery

Authors: Marc Franquesa Monés, Jiaqi Zhang, Caroline Uhler Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21844v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

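Scoring rationale: The paper focuses on constraint-based methods for causal discovery, in particular improving on the test complexity of algorithms such as the PC algorithm, and belongs to classical machine learning and statistical learning. The keywords all concern large models, deep learning, and related techniques (MoE, RLHF, RAG, and so on), none of which the paper touches. The only keyword that may apply is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper validates its results on semi-synthetic gene-expression data and real-world data, an AI application in a scientific domain. That validation is not the core contribution, so the keyword receives 5 points (partially relevant). All other keywords are unrelated and receive 0.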

!!! tip deepseek-chat TL;DR

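The paper studies how many conditional independence tests constraint-based causal discovery requires. It proposes an algorithm that needs $p^{\mathcal{O}(s)}$ tests, where $p$ is the number of nodes and $s$ is the maximum undirected clique size of the essential graph, improving on existing algorithms whose worst-case test count is exponential in the maximum degree of the causal graph. It also proves a $2^{\Omega(s)}$ lower bound for any constraint-based algorithm, so the proposed algorithm is exponent-optimal up to a logarithmic factor.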

Abstract (Chinese translation)

从观测数据中学习因果关系是一个基础性问题，在众多领域具有广泛应用。基于约束的方法通过执行条件独立性检验来推断潜在的因果结构。然而，现有算法（如著名的PC算法）需要执行大量独立性检验，其最坏情况下的检验次数与因果图的最大度呈指数关系。尽管已有大量研究，但在不引入额外假设的前提下，是否存在具有更优复杂度的算法仍不明确。本文提出一种算法，其检验复杂度达到 $p^{\mathcal{O}(s)}$，其中 $p$ 为图中节点数，$s$ 表示底层本质图（essential graph）的最大无向团大小。作为该结果的补充，我们证明任何基于约束的算法至少需要进行 $2^{\Omega(s)}$ 次条件独立性检验，这表明所提算法在所需条件独立性检验次数方面达到了对数因子内的指数最优性。最后，我们在半合成基因表达数据和真实世界数据上通过仿真验证了理论结果，证明相较于现有方法，本算法在所需条件独立性检验次数方面具有更高效率。

摘要 (Abstract)

Learning causal relations from observational data is a fundamental problem with wide-ranging applications across many fields. Constraint-based methods infer the underlying causal structure by performing conditional independence tests. However, existing algorithms such as the prominent PC algorithm need to perform a large number of independence tests, which in the worst case is exponential in the maximum degree of the causal graph. Despite extensive research, it remains unclear if there exist algorithms with better complexity without additional assumptions. Here, we establish an algorithm that achieves a better complexity of $p^{\mathcal{O}(s)}$ tests, where $p$ is the number of nodes in the graph and $s$ denotes the maximum undirected clique size of the underlying essential graph. Complementing this result, we prove that any constraint-based algorithm must perform at least $2^{\Omega(s)}$ conditional independence tests, establishing that our proposed algorithm achieves exponent-optimality up to a logarithmic factor in terms of the number of conditional independence tests needed. Finally, we validate our theoretical findings through simulations, on semi-synthetic gene-expression data, and real-world data, demonstrating the efficiency of our algorithm compared to existing methods in terms of number of conditional independence tests needed.

Keywords: causal discovery, constraint-based methods, conditional independence tests, PC algorithm, algorithm complexity, essential graph, gene-expression data, simulations
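
To see why the two bounds in the abstract match only up to a logarithmic factor in the exponent, both can be written with base 2. This is a short worked step added for illustration, not taken from the paper:

$$
p^{\mathcal{O}(s)} = 2^{\mathcal{O}(s \log_2 p)}, \qquad \text{while the lower bound is } 2^{\Omega(s)}.
$$

The upper-bound exponent $\mathcal{O}(s \log p)$ and the lower-bound exponent $\Omega(s)$ differ by a factor of $\log p$, which is the sense in which the algorithm is exponent-optimal up to a logarithmic factor.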

72. ❌ Select, Label, Evaluate: Active Testing in NLP

Authors: Antonio Purificato, Maria Sofia Bucarelli, Andrea Bacciu, Amin Mantrach, Fabrizio Silvestri Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21840v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

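Scoring rationale: The paper 'Select, Label, Evaluate: Active Testing in NLP' addresses the cost and efficiency of annotating test data in natural language processing (NLP). It formalizes and evaluates the Active Testing framework, which reduces annotation effort by selecting the most informative test samples. None of the scoring keywords applies directly: the paper does not involve large language models (LLMs), deep learning techniques (MoE, Scaling Laws, Pre-training, Fine-tuning, Alignment, RLHF, PEFT, RAG, and so on), reasoning methods (CoT, System 2 Thinking, MCTS), agent systems (LLM Agents, Multi-agent Systems), model optimization (Quantization, Speculative Decoding), interpretability (Mechanistic Interpretability), World Models, Model Merging, In-context Learning, or scientific AI applications (AI for Science). Its core is annotation strategy and model evaluation, an experimental methodology within NLP rather than a large-model or deep learning technique. Every keyword therefore receives a relevance of 0.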

!!! tip deepseek-chat TL;DR

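The paper addresses the high cost of annotating test data in NLP. It formalizes Active Testing, a framework that selects the most informative test samples for labeling, and benchmarks existing approaches on 18 datasets. The experiments show annotation reductions of up to 95% while keeping the estimated performance within 1% of the value obtained on the full test set.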

Abstract (Chinese translation)

人工标注的成本与时间始终是自然语言处理（Natural Language Processing, NLP）领域的重要瓶颈，其中测试数据标注因需满足低误差、高质量标签的严格要求以确保模型评估的可靠性而尤为昂贵。传统方法需对整个测试集进行标注，导致巨大的资源消耗。主动测试（Active Testing）是一种选择信息量最大的测试样本进行标注的框架。在给定标注预算的条件下，该框架旨在选择能够最佳估计模型性能的子集，同时最小化成本与人力投入。本研究将主动测试在NLP领域进行形式化，并对现有方法在涵盖4项不同NLP任务的18个数据集和4种嵌入策略上进行了广泛基准测试。实验表明，标注量可减少高达95%，且性能估计准确度与完整测试集相比差异在1%以内。我们的分析揭示了不同数据特征与任务类型下方法效果的差异，未出现单一方法具有普遍优越性。最后，针对现有样本选择策略需预先设定标注预算的局限，我们提出了一种自适应停止准则，可自动确定最优样本数量。

摘要 (Abstract)

Human annotation cost and time remain significant bottlenecks in Natural Language Processing (NLP), with test data annotation being particularly expensive due to the stringent requirement for low-error and high-quality labels necessary for reliable model evaluation. Traditional approaches require annotating entire test sets, leading to substantial resource requirements. Active Testing is a framework that selects the most informative test samples for annotation. Given a labeling budget, it aims to choose the subset that best estimates model performance while minimizing cost and human effort. In this work, we formalize Active Testing in NLP and we conduct an extensive benchmarking of existing approaches across 18 datasets and 4 embedding strategies spanning 4 different NLP tasks. The experiments show annotation reductions of up to 95%, with performance estimation accuracy difference from the full test set within 1%. Our analysis reveals variations in method effectiveness across different data characteristics and task types, with no single approach emerging as universally superior. Lastly, to address the limitation of requiring a predefined annotation budget in existing sample selection strategies, we introduce an adaptive stopping criterion that automatically determines the optimal number of samples.

Keywords: Active Testing, NLP, test data annotation, model evaluation, annotation reduction, sample selection, adaptive stopping criterion, benchmarking
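
The idea of estimating model performance from a small labeled subset can be sketched as follows. This is a minimal illustration using plain importance sampling with replacement, which is one generic way to keep the estimate unbiased when samples are picked non-uniformly. It is not claimed to be any of the specific methods benchmarked in the paper, and the function names and the `scores` input (for example, a model-uncertainty score per test item) are hypothetical.

```python
import random


def sample_for_labeling(scores, budget, rng):
    """Draw `budget` test indices with replacement, proportional to `scores`.

    `scores` are positive acquisition scores (assumed here, e.g. uncertainty).
    Returns the picked indices and the sampling probability of every item.
    """
    total = sum(scores)
    probs = [s / total for s in scores]
    picked = rng.choices(range(len(scores)), weights=probs, k=budget)
    return picked, probs


def estimate_mean_loss(picked, probs, labeled_loss, n_total):
    """Importance-weighted estimate of the mean loss over all n_total items.

    Each labeled item contributes loss / (n_total * q), where q is the
    probability with which that item was drawn.
    """
    weighted = [labeled_loss[i] / (n_total * probs[i]) for i in picked]
    return sum(weighted) / len(weighted)


# Toy usage: 4 test items, only the picked ones would be sent to annotators.
rng = random.Random(0)
picked, probs = sample_for_labeling([1.0, 1.0, 2.0, 4.0], budget=3, rng=rng)
true_loss = {0: 0.0, 1: 1.0, 2: 0.0, 3: 1.0}  # stands in for human labels
print(estimate_mean_loss(picked, probs, true_loss, n_total=4))
```

Dividing by the sampling probability is what lets the selection favor informative items without biasing the estimate; the price is higher variance when some probabilities are very small.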

73. ❌ Instruction Set and Language for Symbolic Regression

Authors: Ezequiel Lopez-Rubio, Mario Pascual-Gonzalez Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21836v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

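Scoring rationale: The paper focuses on the representation problem in symbolic regression and proposes a framework called IsalSR to remove structural redundancy. Symbolic regression is a scientific computing and automated discovery task, so the work falls under AI for Science and is partially relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). The paper does not involve large language models (LLMs), deep learning techniques, training or fine-tuning methods (MoE, SFT, RLHF, PEFT), inference-side techniques (RAG, quantization), agent systems, or any other listed keyword, so those all score 0.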

!!! tip deepseek-chat TL;DR

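The paper addresses structural redundancy in symbolic regression, which arises because an expression DAG admits many equivalent node-numbering schemes. It proposes the IsalSR representation framework, which computes a pruned canonical string that collapses equivalent representations into a single canonical form.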

Abstract (Chinese translation)

符号回归（SR）中一个基础但长期未被充分解决的障碍是结构冗余：每个表达式有向无环图（DAG）都对应着许多不同的节点编号方案，这些方案均编码了同一表达式，各自占据搜索空间中的独立位置，消耗适应度评估资源却未增加多样性。我们提出了IsalSR（符号回归指令集与语言），这是一种表示框架，它将表达式DAG编码为基于紧凑双层字母表的字符串，并计算一种经过剪枝的规范字符串——一种完整的带标签DAG同构不变量——该规范形式将所有等价表示折叠为单一的规范形式。

摘要 (Abstract)

A fundamental but largely unaddressed obstacle in symbolic regression (SR) is structural redundancy: every expression DAG admits many distinct node-numbering schemes that all encode the same expression, each occupying a separate point in the search space and consuming fitness evaluations without adding diversity. We present IsalSR (Instruction Set and Language for Symbolic Regression), a representation framework that encodes expression DAGs as strings over a compact two-tier alphabet and computes a pruned canonical string – a complete labeled-DAG isomorphism invariant – that collapses all the equivalent representations into a single canonical form.

Keywords: Symbolic Regression, Structural Redundancy, Expression DAG, Canonical Form, Isomorphism Invariant, Instruction Set, Representation Framework

74. ❌ CoRA: Boosting Time Series Foundation Models for Multivariate Forecasting through Correlation-aware Adapter

Authors: Hanyin Cheng, Xingjian Wu, Yang Shu, Zhongwen Rao, Lujia Pan, Bin Yang, Chenjuan Guo Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21828v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
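The arithmetic behind a score table like the one above can be sketched as follows. The pass threshold formula (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) comes from the report's configuration section. The per-row rule `score = weight × relevance` is an assumption inferred from the table columns, and the names `KeywordRow`, `pass_threshold`, and `total_score` are hypothetical. Only three of the 27 rows are reproduced.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeywordRow:
    keyword: str
    weight: float      # "Weight" column of the table
    relevance: float   # "Relevance" column, on a 0-10 scale


def pass_threshold(total_weight: float) -> float:
    # Formula stated in the report configuration: 5.0 + 0.8 x total weight.
    return 5.0 + 0.8 * total_weight


def total_score(rows: List[KeywordRow]) -> float:
    # Assumption: each row contributes weight x relevance.
    return sum(row.weight * row.relevance for row in rows)


# Three rows copied from the table above (weights are 0.0 in this entry).
rows = [
    KeywordRow("Large Language Models OR LLMs OR Foundation Models", 0.0, 8.0),
    KeywordRow("Post-training OR Supervised Fine-tuning OR SFT", 0.0, 8.0),
    KeywordRow("PEFT OR LoRA OR Parameter-efficient Fine-tuning", 0.0, 10.0),
]

threshold = pass_threshold(27.0)   # 27 keywords, each configured with weight 1.0
score = total_score(rows)
passed = score >= threshold
```

Under this assumed rule, a weight of 0.0 zeroes every row regardless of relevance, which is consistent with the 0.0 total shown for this entry.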

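Scoring rationale: The paper proposes CoRA, a lightweight plug-and-play adapter for Time Series Foundation Models (TSFMs) that improves multivariate forecasting by capturing inter-channel correlations. Relevant keywords: (1) 'Foundation Models' (8 points): the paper explicitly studies TSFMs, which fall under foundation models; (2) 'Post-training OR Supervised Fine-tuning OR SFT' (8 points): CoRA improves performance by fine-tuning TSFMs; (3) 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points): CoRA is a parameter-efficient fine-tuning method, a lightweight adapter that requires only fine-tuning, so it is highly relevant; (4) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points): it involves fine-tuning foundation models and is somewhat related to pre-training and domain adaptation; (5) 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points): time series forecasting can be viewed as a scientific AI application, but it is not core bioinformatics or cheminformatics. All other keywords are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the neglect of inter-channel correlations by existing time series foundation models in multivariate forecasting, the paper proposes CoRA, a lightweight correlation-aware adapter that captures dynamic and static correlations through low-rank decomposition and contrastive learning; experiments show that it effectively improves forecasting performance.

Abstract translation (Chinese)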

现有大多数时间序列基础模型采用通道独立建模方式，主要关注时间依赖关系的捕获与泛化，却忽视了通道间的相关性或未能区分相关性的不同维度。然而在多变量时间序列预测任务中，这些相关性具有至关重要的作用。为此，我们提出一种相关性感知适配器——CoRrelation-aware Adapter（CoRA），这是一种轻量级即插即用方法，仅需对时间序列基础模型进行微调即可捕获不同类型的相关性，从而提升预测性能。具体而言，为降低计算复杂度，我们创新性地将相关矩阵分解为低秩的时变分量与时不变分量。针对时变分量，我们进一步设计可学习多项式，通过捕捉趋势或周期模式来学习动态相关性。为学习仅存在于部分通道间的正负相关性，我们引入一种新颖的双重对比学习方法，该方法通过投影层识别相关性，在训练阶段受异构-局部对比损失函数调控，且无需在推理阶段引入额外计算负担。在10个真实世界数据集上的大量实验表明，CoRA能有效提升时间序列基础模型在多变量预测任务中的性能表现。

Abstract

Most existing Time Series Foundation Models (TSFMs) use channel independent modeling and focus on capturing and generalizing temporal dependencies, while neglecting the correlations among channels or overlooking the different aspects of correlations. However, these correlations play a vital role in Multivariate time series forecasting. To address this, we propose a CoRrelation-aware Adapter (CoRA), a lightweight plug-and-play method that requires only fine-tuning with TSFMs and is able to capture different types of correlations, so as to improve forecast performance. Specifically, to reduce complexity, we innovatively decompose the correlation matrix into low-rank Time-Varying and Time-Invariant components. For the Time-Varying component, we further design learnable polynomials to learn dynamic correlations by capturing trends or periodic patterns. To learn positive and negative correlations that appear only among some channels, we introduce a novel dual contrastive learning method that identifies correlations through projection layers, regulated by a Heterogeneous-Partial contrastive loss during training, without introducing additional complexity in the inference stage. Extensive experiments on 10 real-world datasets demonstrate that CoRA can improve TSFMs in multivariate forecasting performance.

Keywords: Time Series Foundation Models, Multivariate Forecasting, Correlation-aware Adapter, Parameter-efficient Fine-tuning, Low-rank Decomposition, Contrastive Learning, Channel Correlations
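The abstract describes decomposing the channel correlation matrix into low-rank time-invariant and time-varying parts, with learnable polynomials for the time-varying part. A minimal NumPy sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the form C(t) = U_s V_sᵀ + Σ_k t^k U_k V_kᵀ, the function name `low_rank_correlation`, and all shapes are hypothetical, and random matrices stand in for learned parameters.

```python
import numpy as np


def low_rank_correlation(t, U_static, V_static, U_poly, V_poly):
    """Assumed form: C(t) = U_s V_s^T + sum_k t^k * U_k V_k^T.

    U_static, V_static: (channels, rank) factors of the time-invariant part.
    U_poly, V_poly: lists of (channels, rank) factors, one pair per
    polynomial degree k = 1..K, for the time-varying part.
    """
    corr = U_static @ V_static.T
    for k, (U_k, V_k) in enumerate(zip(U_poly, V_poly), start=1):
        corr = corr + (t ** k) * (U_k @ V_k.T)
    return corr


rng = np.random.default_rng(0)
n_channels, rank, degree = 6, 2, 2

# Random stand-ins for parameters that would be learned during fine-tuning.
U_s = rng.normal(size=(n_channels, rank))
V_s = rng.normal(size=(n_channels, rank))
U_p = [rng.normal(size=(n_channels, rank)) for _ in range(degree)]
V_p = [rng.normal(size=(n_channels, rank)) for _ in range(degree)]

C0 = low_rank_correlation(0.0, U_s, V_s, U_p, V_p)  # time-invariant part only
C1 = low_rank_correlation(0.5, U_s, V_s, U_p, V_p)  # includes polynomial terms
```

With this factorization each component stores 2 × channels × rank numbers instead of channels², which is the usual reason a low-rank form reduces complexity.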

75. ❌ BadminSense: Enabling Fine-Grained Badminton Stroke Evaluation on a Single Smartwatch

Authors: Taizhou Chen, Kai Chen, Xingyu Liu, Pingchuan Ke, Zhida Sun Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21825v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

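Scoring rationale: The paper "BadminSense: Enabling Fine-Grained Badminton Stroke Evaluation on a Single Smartwatch" focuses on a smartwatch-based wearable sensing system for fine-grained analysis of badminton strokes, including stroke type classification, quality prediction, and impact location estimation. All scoring keywords concern large models, deep learning principles, or AI applications in science, whereas this paper studies a traditional machine learning or signal processing application built on sensor data and does not involve large models, deep learning techniques, or AI for Science, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper develops BadminSense, a smartwatch-based system that uses vibration signals for fine-grained analysis of badminton strokes, including stroke segmentation, classification, quality prediction, and impact location estimation, to help amateur players assess their performance without a professional coach.

Abstract translation (Chinese)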

评估羽毛球表现通常需要专业教练指导，而业余选手往往难以获得此类资源。本文提出BadminSense——一个基于智能手表的可穿戴传感系统，用于实现细粒度羽毛球表现分析。通过对经验丰富的羽毛球运动员进行访谈，我们明确了四项系统设计需求与三项实现思路，以此指导BadminSense的开发。我们采集了12名经验丰富的业余选手的羽毛球击球数据集，并标注了细粒度标签，包括击球类型、专家评估的击球评分以及羽毛球击球点位置。基于该数据集，BadminSense利用市售智能手表采集的振动信号，实现了击球动作分割与分类、击球质量预测及击球点位置估算。评估结果表明

Abstract

Evaluating badminton performance often requires expert coaching, which is rarely accessible for amateur players. We present BadminSense, a smartwatch-based system for fine-grained badminton performance analysis using wearable sensing. Through interviews with experienced badminton players, we identified four system design requirements with three implementation insights that guide the development of BadminSense. We then collected a badminton strokes dataset on 12 experienced badminton amateurs and annotated it with fine-grained labels, including stroke type, expert-assessed stroke rating, and shuttle impact location. Built on this dataset, BadminSense segments and classifies strokes, predicts stroke quality, and estimates shuttle impact location using vibration signals from an off-the-shelf smartwatch. Our evaluations show that

Keywords: badminton stroke evaluation, smartwatch-based system, wearable sensing, stroke classification, stroke quality prediction, shuttle impact location, vibration signal, fine-grained performance analysis

76. ❌ SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection

Authors: Shuxian Zhao, Jie Gui, Baosheng Yu, Lu Dong, Zhipeng Gui Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21824v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

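Scoring rationale: The paper focuses on the computer vision task of steel surface defect detection, using a vision-language dataset and benchmark, and mainly covers image classification, few-shot/zero-shot recognition, and interpretability. It is entirely unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, agents, and so on). It is somewhat related only to 'Explainable AI' (the paper emphasizes interpretability) and strongly related to 'AI for Science' (an AI application in industrial manufacturing).

!!! tip deepseek-chat TL;DR

To address the limited interpretability and generalization of existing steel surface defect detection methods, this work proposes SteelDefectX, a vision-language dataset with coarse-to-fine textual annotations, together with a benchmark; experiments show that the dataset significantly improves model interpretability, generalization, and transferability.

Abstract translation (Chinese)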

钢材表面缺陷检测对于保障现代制造业的产品质量与可靠性至关重要。现有方法通常依赖于仅含标签数据训练的基础图像分类模型，这限制了其可解释性与泛化能力。为解决这些挑战，我们提出了SteelDefectX——一个包含25类缺陷、总计7,778张图像的视觉-语言数据集，并提供了从粗粒度到细粒度的文本标注。在粗粒度层面，数据集提供类别级信息，包括缺陷类型、典型视觉特征及相关的工业成因；在细粒度层面，则标注了样本特异性属性，如形状、尺寸、深度、位置与对比度，使模型能够学习更丰富、更精细的缺陷表征。我们进一步构建了包含四项任务的基准测试：纯视觉分类、视觉-语言分类、少样本/零样本识别以及零样本迁移，以评估模型性能与泛化能力。多个基线模型的实验表明，从粗到细的文本标注显著提升了模型的可解释性、泛化性与迁移能力。我们希望SteelDefectX能成为推动可解释、可泛化的钢材表面缺陷检测研究的重要资源。数据集将在https://github.com/Zhaosxian/SteelDefectX公开提供。

Abstract

Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks, vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer, to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available on https://github.com/Zhaosxian/SteelDefectX.

Keywords: steel surface defect detection, vision-language dataset, coarse-to-fine annotation, generalizable, explainable AI, few-shot zero-shot recognition, transfer learning, industrial manufacturing
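The coarse-to-fine annotation described in the abstract can be pictured as a two-level record per image. The sketch below is a hypothetical schema, not the dataset's actual file format: the class names, field names, image path, and example values are all assumptions chosen to mirror the attributes the abstract lists (class-level category, visual attributes, and industrial causes; sample-level shape, size, depth, position, and contrast).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoarseAnnotation:
    # Class-level information shared by every sample of a defect type.
    defect_category: str
    visual_attributes: List[str]
    industrial_causes: List[str]


@dataclass
class FineAnnotation:
    # Sample-specific attributes named in the abstract.
    shape: str
    size: str
    depth: str
    position: str
    contrast: str


@dataclass
class DefectSample:
    image_path: str
    coarse: CoarseAnnotation
    fine: FineAnnotation

    def to_caption(self) -> str:
        # Join both levels into one text description for a vision-language model.
        coarse = (f"A {self.coarse.defect_category} defect, typically "
                  f"{', '.join(self.coarse.visual_attributes)}; "
                  f"often caused by {', '.join(self.coarse.industrial_causes)}.")
        fine = (f"This sample: shape {self.fine.shape}, size {self.fine.size}, "
                f"depth {self.fine.depth}, position {self.fine.position}, "
                f"contrast {self.fine.contrast}.")
        return f"{coarse} {fine}"


# Hypothetical example values for illustration only.
sample = DefectSample(
    image_path="images/example_0001.png",
    coarse=CoarseAnnotation("scratch", ["thin", "linear"], ["mechanical abrasion"]),
    fine=FineAnnotation("elongated", "small", "shallow", "upper left", "low"),
)
caption = sample.to_caption()
```

Keeping the two levels in separate records makes it straightforward to train or evaluate with class-level text only, sample-level text only, or both.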

77. ❌ Ctrl-A: Control-Driven Online Data Augmentation

Authors: Jesper B. Christensen, Ciaran Bench, Spencer A. Thomas, Hüsnü Aslan, David Balslev-Harder, Nadia A. S. Smith, Alessandra Manzin Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21819v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

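Scoring rationale: The paper focuses on a data augmentation algorithm for computer vision (ControlAugment), studying automated adjustment of image augmentation policies and using control theory to dynamically adjust augmentation strength distributions. It does not involve large language models (LLMs), innovations in deep learning principles, AI for Science applications, or any technical topic among the scoring keywords. All keywords relate to large models, deep learning principles, or scientific AI applications, whereas this paper is purely a computer vision data augmentation study, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes ControlAugment, a control-theory-based automated data augmentation algorithm for image-vision tasks that dynamically adjusts augmentation strength distributions to improve model performance without manually designed augmentation policies, and is competitive with existing state-of-the-art methods on several benchmark datasets.

Abstract translation (Chinese)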

我们提出ControlAugment（Ctrl-A），一种面向图像视觉任务的自动化数据增强算法。该算法融合了控制理论原理，能够在模型训练过程中在线调整增强强度分布。Ctrl-A无需对单个增强强度进行初始化设置，而是通过控制循环架构及我们定义的相对操作响应曲线，在训练过程中动态且独立地调整各增强操作的强度分布。这种基于操作特性的更新机制使Ctrl-A能够抑制对模型性能产生负面影响的增强类型，从而无需针对新的图像视觉任务手动设计增强策略。基于通用WideResNet-28-10架构，在CIFAR-10、CIFAR-100和SVHN-core基准数据集上的实验表明，Ctrl-A与当前最先进的数据增强策略相比具有显著竞争力。

Abstract

We introduce ControlAugment (Ctrl-A), an automated data augmentation algorithm for image-vision tasks, which incorporates principles from control theory for online adjustment of augmentation strength distributions during model training. Ctrl-A eliminates the need for initialization of individual augmentation strengths. Instead, augmentation strength distributions are dynamically, and individually, adapted during training based on a control-loop architecture and what we define as relative operation response curves. Using an operation-dependent update procedure provides Ctrl-A with the potential to suppress augmentation styles that negatively impact model performance, alleviating the need for manually engineering augmentation policies for new image-vision tasks. Experiments on the CIFAR-10, CIFAR-100, and SVHN-core benchmark datasets using the common WideResNet-28-10 architecture demonstrate that Ctrl-A is highly competitive with existing state-of-the-art data augmentation strategies.

Keywords: ControlAugment, data augmentation, control theory, image-vision tasks, augmentation strength, online adjustment, WideResNet, benchmark datasets
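The abstract says augmentation strengths are adapted per operation by a control loop during training. As a rough illustration of that general idea only, the sketch below uses a plain proportional feedback update. It is not the Ctrl-A algorithm: the paper's relative operation response curves are not specified in the abstract, so the `responses` values, the `target`, the `gain`, and the function name `update_strength_bounds` are all hypothetical.

```python
def update_strength_bounds(upper_bounds, responses, target=1.0, gain=0.1):
    """One proportional control step per augmentation operation.

    upper_bounds: dict op -> current upper bound of its strength range, in [0, 1].
    responses: dict op -> measured response for that op (hypothetical signal,
               e.g. a performance ratio where values below `target` mean the
               operation is hurting the model).
    """
    updated = {}
    for op, upper in upper_bounds.items():
        error = responses[op] - target          # positive: room for more strength
        proposal = upper + gain * error         # proportional update
        updated[op] = min(1.0, max(0.0, proposal))  # clip to the valid range
    return updated


# Hypothetical operations and responses for illustration.
bounds = {"rotate": 0.5, "solarize": 0.5}
responses = {"rotate": 1.2, "solarize": 0.6}
new_bounds = update_strength_bounds(bounds, responses)
```

Because each operation has its own error term, an operation whose response stays below the target is pushed toward zero strength while the others are unaffected, which matches the abstract's description of suppressing harmful augmentation styles individually.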

78. ❌ Extending Precipitation Nowcasting Horizons via Spectral Fusion of Radar Observations and Foundation Model Priors

Authors: Yuze Qin, Qingyong Li, Zhiqing Guo, Wen Wang, Yan Liu, Yangli-ao Geng Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21768v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

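Scoring rationale: The paper mainly studies precipitation nowcasting, which belongs to AI for Science, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It uses Pangu-Weather as a foundation model, so it is somewhat related to 'Large Language Models OR LLMs OR Foundation Models' (8 points). The other keywords mainly concern large-model technical details (MoE, RLHF, quantization, and so on) or specific applications (agents, tool use, and so on) and are unrelated to the paper's weather forecasting topic, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes PW-FouCast, a frequency-domain fusion framework that integrates Pangu-Weather forecasts as spectral priors into a Fourier-based backbone, addressing the representational heterogeneity between radar imagery and meteorological data and thereby extending the reliable forecast horizon of precipitation nowcasting.

Abstract translation (Chinese)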

降水临近预报对于灾害缓解和航空安全至关重要。然而，仅依赖雷达的模型常因缺乏大尺度大气环境信息，导致在较长预见期内性能下降。虽然集成天气基础模型预测的气象变量提供了一种潜在的解决方案，但现有架构未能有效调和雷达图像与气象数据之间深刻的表征异质性。为弥补这一差距，我们提出了PW-FouCast——一种新颖的频域融合框架，该框架在基于傅里叶的主干网络中利用盘古天气（Pangu-Weather）预报作为频谱先验。我们的架构引入了三项关键创新：（i）盘古天气引导的频率调制，将频谱幅值和相位与气象先验对齐；（ii）频率记忆模块，用于校正相位差异并保持时间演化特性；（iii）逆向频率注意力机制，以重建通常在频谱滤波中丢失的高频细节。在SEVIR和MeteoNet基准数据集上的大量实验表明，PW-FouCast实现了最先进的性能，在保持结构保真度的同时有效延长了可靠预报的预见期。我们的代码公开于https://github.com/Onemissed/PW-FouCast。

Abstract

Precipitation nowcasting is critical for disaster mitigation and aviation safety. However, radar-only models frequently suffer from a lack of large-scale atmospheric context, leading to performance degradation at longer lead times. While integrating meteorological variables predicted by weather foundation models offers a potential remedy, existing architectures fail to reconcile the profound representational heterogeneities between radar imagery and meteorological data. To bridge this gap, we propose PW-FouCast, a novel frequency-domain fusion framework that leverages Pangu-Weather forecasts as spectral priors within a Fourier-based backbone. Our architecture introduces three key innovations: (i) Pangu-Weather-guided Frequency Modulation to align spectral magnitudes and phases with meteorological priors; (ii) Frequency Memory to correct phase discrepancies and preserve temporal evolution; and (iii) Inverted Frequency Attention to reconstruct high-frequency details typically lost in spectral filtering. Extensive experiments on the SEVIR and MeteoNet benchmarks demonstrate that PW-FouCast achieves state-of-the-art performance, effectively extending the reliable forecast horizon while maintaining structural fidelity. Our code is available at https://github.com/Onemissed/PW-FouCast.

Keywords: Precipitation nowcasting, Foundation models, Pangu-Weather, Frequency-domain fusion, Spectral priors, Radar observations, Weather forecasting, AI for science
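To make "aligning spectral magnitudes and phases with a prior" concrete, the following NumPy sketch blends the 2D Fourier amplitude and phase of a radar frame with those of a prior field. It is a simplified stand-in, not the PW-FouCast architecture: the paper's modulation is part of a learned network, whereas this uses two fixed scalar mixing weights, and the function name, parameters, and random inputs are assumptions.

```python
import numpy as np


def fuse_in_frequency_domain(radar, prior, amp_mix=0.3, phase_mix=0.3):
    """Blend amplitude and phase spectra of two same-shaped 2D fields.

    amp_mix / phase_mix = 0 keeps the radar spectrum, 1 takes the prior's.
    """
    radar_spec = np.fft.fft2(radar)
    prior_spec = np.fft.fft2(prior)

    # Linear blend of the magnitude spectra.
    amplitude = (1 - amp_mix) * np.abs(radar_spec) + amp_mix * np.abs(prior_spec)

    # Blend phases on the unit circle to avoid wrap-around at +/- pi.
    phasor = ((1 - phase_mix) * np.exp(1j * np.angle(radar_spec))
              + phase_mix * np.exp(1j * np.angle(prior_spec)))
    phase = np.angle(phasor)

    fused_spec = amplitude * np.exp(1j * phase)
    # Keep the real part; any small imaginary residue is discarded.
    return np.fft.ifft2(fused_spec).real


rng = np.random.default_rng(0)
radar_frame = rng.random((16, 16))   # stand-in for a radar reflectivity frame
prior_field = rng.random((16, 16))   # stand-in for a forecast variable

fused = fuse_in_frequency_domain(radar_frame, prior_field)
unchanged = fuse_in_frequency_domain(radar_frame, prior_field, 0.0, 0.0)
```

Setting both mixing weights to zero returns the radar frame unchanged, which is a convenient sanity check when experimenting with how strongly the prior should steer the spectrum.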

79. ❌ Cycle Inverse-Consistent TransMorph: A Balanced Deep Learning Framework for Brain MRI Registration

Authors: Jiaqi Shang, Haojin Wu, Yinyi Lai, Zongyu Li, Chenghao Zhang, Jia Guo Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21760v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

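Scoring rationale: The paper focuses on a deep learning framework for medical image registration using Swin-UNet and Transformer architectures, and is entirely unrelated to the vast majority of large-model keywords (LLM, MoE, RLHF, RAG, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves medical image analysis (a bioinformatics-adjacent application), but this is not its core contribution, so it receives 5 points (somewhat related). The other 26 keywords are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a cycle inverse-consistent Transformer-based deep learning framework (CICTM) for deformable registration of brain MRI images, achieving accurate and stable performance on a large multi-center dataset.

Abstract translation (Chinese)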

可变形图像配准在医学图像分析中具有基础性作用，它能够实现不同受试者间解剖结构的空间对齐。尽管近年来基于深度学习的方法显著提升了计算效率，但许多现有方法在捕捉长距离解剖对应关系及保持形变一致性方面仍存在局限。本研究提出了一种基于循环逆一致变换器的可变形脑部MRI配准框架。该模型将Swin-UNet架构与双向一致性约束相结合，实现了前向与后向形变场的联合估计。该设计使框架既能捕捉局部解剖细节，又能建模全局空间关系，同时提升形变稳定性。我们在一个包含2851例T1加权脑部MRI扫描的大型多中心数据集上对所提框架进行了全面评估，该数据集汇集自13个公开数据库。实验结果表明，所提框架在保持稳定且物理合理的形变场的同时，在多项定量评估指标上均取得了优异且均衡的性能。附录中提供了与ANTs、ICNet及VoxelMorph等基线方法的详细定量比较。实验证明，CICTM在多项评估标准中均保持稳定优异的性能，同时生成具有物理合理性的形变场。这些特性使得本框架特别适用于对精度与形变稳定性均有严格要求的大规模神经影像数据集。

Abstract

Deformable image registration plays a fundamental role in medical image analysis by enabling spatial alignment of anatomical structures across subjects. While recent deep learning-based approaches have significantly improved computational efficiency, many existing methods remain limited in capturing long-range anatomical correspondence and maintaining deformation consistency. In this work, we present a cycle inverse-consistent transformer-based framework for deformable brain MRI registration. The model integrates a Swin-UNet architecture with bidirectional consistency constraints, enabling the joint estimation of forward and backward deformation fields. This design allows the framework to capture both local anatomical details and global spatial relationships while improving deformation stability. We conduct a comprehensive evaluation of the proposed framework on a large multi-center dataset consisting of 2851 T1-weighted brain MRI scans aggregated from 13 public datasets. Experimental results demonstrate that the proposed framework achieves strong and balanced performance across multiple quantitative evaluation metrics while maintaining stable and physically plausible deformation fields. Detailed quantitative comparisons with baseline methods, including ANTs, ICNet, and VoxelMorph, are provided in the appendix. Experimental results demonstrate that CICTM achieves consistently strong performance across multiple evaluation criteria while maintaining stable and physically plausible deformation fields. These properties make the proposed framework suitable for large-scale neuroimaging datasets where both accuracy and deformation stability are critical.

Keywords: deformable image registration, brain MRI, transformer, cycle inverse-consistent, Swin-UNet, deep learning, medical image analysis, deformation fields
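The bidirectional consistency constraint mentioned in the abstract can be illustrated with a generic inverse-consistency error: composing the forward and backward deformations should return every point to where it started. The sketch below works on 1D displacement fields with linear interpolation for brevity. The paper operates on 3D brain volumes and its exact loss is not given in the abstract, so the function name and this formulation are assumptions.

```python
import numpy as np


def inverse_consistency_error(forward_disp, backward_disp):
    """Mean squared residual of composing forward and backward displacements.

    A point x is moved to x + u(x) by the forward field u, then moved again by
    the backward field v sampled at the new position. Perfect inverse
    consistency means u(x) + v(x + u(x)) = 0 for every x.
    """
    grid = np.arange(forward_disp.size, dtype=float)
    warped = grid + forward_disp
    # Sample the backward field at the warped positions (linear interpolation,
    # values clamped at the borders).
    backward_at_warped = np.interp(warped, grid, backward_disp)
    residual = forward_disp + backward_at_warped
    return float(np.mean(residual ** 2))


n = 32
shift = np.full(n, 1.5)                                      # uniform forward shift
consistent = inverse_consistency_error(shift, -shift)        # exact inverse
inconsistent = inverse_consistency_error(shift, np.zeros(n)) # no backward motion
```

Used as a training penalty, a term like this discourages forward and backward fields that disagree, which is one common way to encourage stable, invertible deformations.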

Authors: Xu Liu, Yongheng Zhang, Qiguang Chen, Yao Li, Sheng Wang, Libo Qin Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21754v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

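Scoring rationale: The paper focuses on a new multimodal chain-of-thought reasoning framework and is highly relevant to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points); it involves in-depth reasoning and is somewhat related to 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 points). The paper reports experiments on multiple models, implying the use of large models, so it is somewhat related to 'Large Language Models OR LLMs OR Foundation Models' (8 points). Other keywords such as MoE, quantization, RAG, and alignment are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address static insertion of visual information and incoherent representations in existing interleaved-modal chain-of-thought reasoning methods, the paper proposes DaP-ICoT, a framework that integrates dynamic and precise visual thoughts, significantly improving reasoning efficiency and reducing token consumption by 72.6%.

Abstract translation (Chinese)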

近年来，交错模态思维链推理通过融合多模态输入与输出取得了显著成功，受到日益广泛的关注。尽管现有交错模态思维链方法展现出良好性能，但仍存在两大局限：（1）静态视觉思维定位，即在固定推理步骤中机械插入视觉信息，导致推理效率低下且缺乏灵活性；（2）断裂的视觉思维表征，即视觉标记呈现不连续与语义不连贯的问题。为应对这些局限，本研究提出融合动态精准视觉思维的交错模态思维链推理方法，其包含两个核心组件：（1）动态视觉思维集成模块，可根据推理需求自适应引入视觉输入，减少冗余并提升效率；（2）精准视觉思维引导机制，确保视觉表征保持语义连贯与上下文对齐。在多基准测试与模型上的实验表明，该方法实现了最先进的性能表现。此外，该方法显著减少了图像插入数量，使标记消耗量降低72.6%，从而实现了更高效的交错模态思维链推理。

Abstract

Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration, which adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency; and (2) Precise Visual Thought Guidance, which ensures semantically coherent and contextually aligned visual representations. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption, enabling more efficient ICoT reasoning.

Keywords: Interleaved-modal Chain-of-Thought, ICoT, Dynamic Visual Thought Integration, Precise Visual Thought Guidance, Multimodal Reasoning, Efficient Reasoning, Token Consumption Reduction, State-of-the-art Performance

81. ❌ The Presupposition Problem in Representation Genesis

Authors: Yiling Wu Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21745v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

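Scoring rationale: The paper uses large language models (LLMs) as a case study to examine the problem of representation genesis in the philosophy of cognition, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not involve any specific large-model technique (MoE, quantization, inference acceleration, and so on), training method (pre-training, fine-tuning, alignment, and so on), application scenario (agents, tool use, scientific AI, and so on), or performance issue (hallucination mitigation, interpretability, and so on), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

Using large language models as a case study, the paper diagnoses the 'Representation Presupposition' structure that affects philosophical accounts of representation genesis and proposes two minimum adequacy conditions for avoiding it.

Abstract translation (Chinese)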

大语言模型是首个在未明确经历表征生成过程的情况下实现高水平认知性能的系统：表征生成指系统从非表征的物理状态转变为能够以内容敏感方式引导行为的状态。以往的认知系统在我们能够考察之前已完成这一转变，心灵哲学亦将生成问题视为背景条件而非解释目标。大语言模型提供了一个未明确经历该转变的案例，使得生成问题重新凸显：若生成过程未曾发生，哪些认知能力会受到影响？其原因何在？当前我们缺乏回答这些问题的概念工具。本文认为，这一困境源于结构性原因。当应用于生成问题时，心灵哲学的主要理论框架——包括思想语言假说、目的语义学、预测加工理论、生成主义以及发生现象学——都呈现出共同特征：在某个解释步骤中，这些框架都使用了其解释效力依赖于系统已被组织为表征者的概念。我们将这种模式称为"表征预设结构”，它导致了系统性的解释延迟。试图用现有范畴词汇解释内容可操控表征的首次获得时，会从转变本身的表征侧引入资源，这构成了"表征回归”。本文提供的是概念诊断而非新理论，通过确立问题的结构框架，推导出任何避免该模式的解释方案所需满足的两项最低充分性条件。大语言模型使得这一理论缺失产生了实际影响，而不仅仅是理论层面的问题。

Abstract

Large language models are the first systems to achieve high cognitive performance without clearly undergoing representation genesis: the transition from a non-representing physical system to one whose states guide behavior in a content-sensitive way. Prior cognitive systems had already made this transition before we could examine it, and philosophy of mind treated genesis as a background condition rather than an explanatory target. LLMs provide a case that does not clearly involve this transition, making the genesis question newly urgent: if genesis did not occur, which cognitive capacities are affected, and why? We currently lack the conceptual resources to answer this. The reason, this paper argues, is structural. Major frameworks in philosophy of mind, including the Language of Thought hypothesis, teleosemantics, predictive processing, enactivism, and genetic phenomenology, share a common feature when applied to the genesis question: at some explanatory step, each deploys concepts whose explanatory purchase depends on the system already being organized as a representer. This pattern, which we call the Representation Presupposition structure, generates systematic explanatory deferral. Attempts to explain the first acquisition of content-manipulable representation within the existing categorical vocabulary import resources from the representational side of the transition itself. We call this the Representation Regress. The paper offers a conceptual diagnosis rather than a new theory, establishing the structure of the problem and deriving two minimum adequacy conditions for any account that avoids this pattern. LLMs make the absence of such a theory consequential rather than merely theoretical.

Keywords: Large Language Models, Representation Genesis, Philosophy of Mind, Representation Presupposition, Cognitive Systems, Content-sensitive Behavior, Explanatory Deferral, Representation Regress

82. ❌ The Reasoning Error About Reasoning: Why Different Types of Reasoning Require Different Representational Structures

Authors: Yiling Wu Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21736v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
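For readers reconstructing how a table like the one above turns into the headline score, the following Python sketch may help. The passing-score formula (5.0 + 0.8 × total weight) and the 15-point per-keyword cap come from the report's configuration. The per-keyword rule (weight × relevance, capped) is an assumption chosen only because it is consistent with the rows shown, and all function names are hypothetical.

```python
from typing import Iterable, Tuple

MAX_PER_KEYWORD = 15.0  # "maximum score per keyword" from the report configuration


def passing_score(total_weight: float) -> float:
    """Passing threshold as stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMPTION: score = weight * relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def paper_score(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum the per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# Three rows with relevance 5.0/10 and 24 rows with relevance 0.0/10,
# all carrying weight 0.0, as in the table above.
rows = [(0.0, 5.0)] * 3 + [(0.0, 0.0)] * 24
total = paper_score(rows)
threshold = passing_score(27.0)  # 27 keywords, each with weight 1.0 in the configuration
passed = total >= threshold
```

Under this reading, a weight of 0.0 yields a score of 0.0 regardless of relevance, which matches the 0.0 / 26.6 total reported for this paper.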

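Scoring rationale: The paper proposes a theoretical framework relating types of reasoning to the structural demands they place on representational systems. It is foundational theory at the intersection of cognitive science, AI, and philosophy. It does not address specific LLM techniques, training methods, optimization algorithms, or application domains, so most technical keywords score 0. Three keywords are weakly related: (1) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (5): the paper discusses reasoning types, including multi-step reasoning, but not CoT techniques; (2) 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (5): the paper touches on in-depth reasoning without using these specific terms; (3) 'Mechanistic Interpretability OR Explainable AI' (5): the paper analyzes the structural causes of reasoning failures, which is conceptually related to interpretability, but it is not a technical study.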

!!! tip deepseek-chat TL;DR

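The paper proposes a theoretical framework analyzing how different types of reasoning (such as induction, analogy, causal inference, and deduction) place different demands on four structural properties of representational systems (operability, consistency, structural preservation, and compositionality). It argues that scaling statistical learning alone cannot provide the structural guarantees that deductive reasoning requires.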

Abstract (Chinese translation)

不同类型的推理对表征系统提出了不同的结构性要求，然而在心理学、人工智能与心灵哲学领域尚未形成关于这些要求的系统性解释。本文提出一个框架，识别出表征系统的四种结构特性：可操作性、一致性、结构保持性与组合性。从归纳、类比推理、因果推断到演绎与形式逻辑，不同推理形式对这些特性有着不同程度的需求。每种特性分别排除一类特定的推理失败类型。分析揭示了一个关键的结构性边界：位于该边界之下的推理类型可在关联性、概率性表征上运行，而边界之上的推理类型则需要完全满足所有四种特性。仅通过统计学习的规模扩展而无结构性重组，无法跨越这一边界，因为演绎推理所需的结构性保证无法通过概率性手段近似实现。来自人工智能评估、发展心理学与认知神经科学的汇聚性证据在不同直接性层面上支持该框架。本文推导出三项可检验的预测，包括复合性退化、针对结构性破坏的选择性脆弱性以及规模扩展下的不可还原性。该框架是一种必要条件式解释，不预设表征形式，旨在重组而非终结现有争论。

Abstract

Different types of reasoning impose different structural demands on representational systems, yet no systematic account of these demands exists across psychology, AI, and philosophy of mind. I propose a framework identifying four structural properties of representational systems: operability, consistency, structural preservation, and compositionality. These properties are demanded to different degrees by different forms of reasoning, from induction through analogy and causal inference to deduction and formal logic. Each property excludes a distinct class of reasoning failure. The analysis reveals a principal structural boundary: reasoning types below it can operate on associative, probabilistic representations, while those above it require all four properties to be fully satisfied. Scaling statistical learning without structural reorganization is insufficient to cross this boundary, because the structural guarantees required by deductive reasoning cannot be approximated through probabilistic means. Converging evidence from AI evaluation, developmental psychology, and cognitive neuroscience supports the framework at different levels of directness. Three testable predictions are derived, including compounding degradation, selective vulnerability to targeted structural disruption, and irreducibility under scaling. The framework is a necessary-condition account, agnostic about representational format, that aims to reorganize existing debates rather than close them.

Keywords: reasoning types, representational systems, structural properties, deductive reasoning, probabilistic representations, cognitive neuroscience, AI evaluation, developmental psychology

83. ❌ Cognitive Agency Surrender: Defending Epistemic Sovereignty via Scaffolded AI Friction

Authors: Kuangzhe Xu, Yu Shen, Longjie Yan, Yinghui Ren Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21735v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

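Scoring rationale: The paper studies cognitive agency surrender in AI-HCI interaction and proposes 'Scaffolded Cognitive Friction', with Multi-Agent Systems serving as cognitive forcing functions, to preserve human epistemic sovereignty. It is highly related only to 'Multi-agent Systems OR Agent Coordination' (10), because the abstract explicitly mentions 'repurposing Multi-Agent Systems (MAS) as explicit cognitive forcing functions'. The other keywords (covering topics such as LLM training, inference optimization, and alignment) are not mentioned in the title or abstract and are unrelated to the paper's content, so they score 0.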

!!! tip deepseek-chat TL;DR

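The paper examines the risk of cognitive agency surrender caused by generative AI. Analyzing AI-HCI papers from 2023 to early 2026, it finds an escalating 'agentic takeover', proposes using multi-agent systems as scaffolded cognitive friction to preserve human epistemic sovereignty, and outlines a multimodal computational phenotyping agenda.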

Abstract (Chinese translation)

生成式人工智能的扩散已将良性的认知卸载转变为认知能动性放弃的系统性风险。在“零摩擦”设计这一商业信条的驱动下，高度流畅的人工智能界面积极利用人类的认知吝啬性，过早地满足认知闭合需求，并诱发严重的自动化偏见。为实证量化这种认知侵蚀，我们对2023年至2026年初的1,223篇高置信度AI-HCI（人工智能-人机交互）论文部署了零样本语义分类流程（阈值 $τ=0.7$）。我们的分析揭示了一种不断升级的“能动性接管”：2025年曾短暂涌现捍卫人类认知主权的研究（占19.1%），但在2026年初被急剧压制（降至13.1%），同时优化自主机器智能体（autonomous machine agents）的研究呈爆发式转向（增至19.6%），而无摩擦可用性研究则持续保持着结构性霸权（占67.3%）。为拆解这一陷阱，我们理论化提出“支架式认知摩擦”，将多智能体系统（Multi-Agent Systems, MAS）重新定位为显式的认知强制功能（例如，计算型魔鬼代言人），以注入相关的认知张力并打断启发式执行。此外，我们勾勒了一个多模态计算表型分析议程——整合注视转移熵、任务诱发瞳孔测量、功能性近红外光谱（fNIRS）以及分层漂移扩散模型（Hierarchical Drift Diffusion Modeling, HDDM）——旨在从数学上解耦决策结果与认知努力。最终，有意设计的摩擦不仅是一种心理学干预，更是强制执行全球人工智能治理和维系社会认知韧性的基础性技术前提。

Abstract

The proliferation of Generative Artificial Intelligence has transformed benign cognitive offloading into a systemic risk of cognitive agency surrender. Driven by the commercial dogma of “zero-friction” design, highly fluent AI interfaces actively exploit human cognitive miserliness, prematurely satisfying the need for cognitive closure and inducing severe automation bias. To empirically quantify this epistemic erosion, we deployed a zero-shot semantic classification pipeline ($τ=0.7$) on 1,223 high-confidence AI-HCI papers from 2023 to early 2026. Our analysis reveals an escalating “agentic takeover”: a brief 2025 surge in research defending human epistemic sovereignty (19.1%) was abruptly suppressed in early 2026 (13.1%) by an explosive shift toward optimizing autonomous machine agents (19.6%), while frictionless usability maintained a structural hegemony (67.3%). To dismantle this trap, we theorize “Scaffolded Cognitive Friction,” repurposing Multi-Agent Systems (MAS) as explicit cognitive forcing functions (e.g., computational Devil’s Advocates) to inject germane epistemic tension and disrupt heuristic execution. Furthermore, we outline a multimodal computational phenotyping agenda – integrating gaze transition entropy, task-evoked pupillometry, fNIRS, and Hierarchical Drift Diffusion Modeling (HDDM) – to mathematically decouple decision outcomes from cognitive effort. Ultimately, intentionally designed friction is not merely a psychological intervention, but a foundational technical prerequisite for enforcing global AI governance and preserving societal cognitive resilience.

Keywords: Cognitive Agency Surrender, Epistemic Sovereignty, Scaffolded AI Friction, Multi-Agent Systems, Automation Bias, Cognitive Miserliness, Computational Devil’s Advocates, Multimodal Computational Phenotyping

84. ❌ EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning

Authors: Andreas Sauter, Yuyue Zhao, Jacopo Urbani, Wenxiang Hu, Zaiqiao Meng, Lun Zhou, Xiaohui Yan, Yougang Lyu Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21728v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

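Scoring rationale: The paper centers on applying LLMs to science (AI for Science). It proposes the EvoIdeator framework, which uses reinforcement learning (RL) to improve the ability of an LLM (Qwen3-4B) to generate scientific ideas. It touches on LLM Agents and Self-Improvement (self-refinement via RL), but not on other techniques such as MoE, quantization, or inference acceleration.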

!!! tip deepseek-chat TL;DR

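The paper proposes EvoIdeator, a framework that optimizes a large language model with checklist-grounded reinforcement learning, significantly improving the quality and generalization of scientific idea generation.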

Abstract (Chinese translation)

科学创意生成是自主知识发现的基石，然而将初始概念迭代演化为高质量研究提案的过程，对大型语言模型（LLMs）而言仍是巨大挑战。现有的强化学习（RL）范式通常依赖基于量规的标量奖励，这类奖励提供整体质量评分但缺乏可操作的细粒度指导；反之，基于语言的改进方法多局限于推理阶段的提示工程，其目标模型并未显式优化以内化此类批评。为弥合这一差距，我们提出 EvoIdeator 框架，该框架通过将强化学习训练目标与基于清单的反馈对齐，促进科学创意的演化。EvoIdeator 利用结构化评判模型生成两种协同信号：（1）用于多维度优化的词典序奖励，以及（2）提供细粒度语言反馈，针对事实依据、可行性与方法严谨性进行片段级批评。通过将这些信号整合至强化学习循环中，我们使策略模型在优化与推理阶段均能系统化利用精准反馈。大量实验表明，基于 Qwen3-4B 构建的 EvoIdeator 在关键科学指标上显著优于规模大得多的前沿模型。更重要的是，学习得到的策略展现出对多样化外部反馈源的强大泛化能力，无需进一步微调，为自优化的自主创意生成提供了可扩展且严谨的路径。

Abstract

Scientific idea generation is a cornerstone of autonomous knowledge discovery, yet the iterative evolution required to transform initial concepts into high-quality research proposals remains a formidable challenge for Large Language Models (LLMs). Existing Reinforcement Learning (RL) paradigms often rely on rubric-based scalar rewards that provide global quality scores but lack actionable granularity. Conversely, language-based refinement methods are typically confined to inference-time prompting, targeting models that are not explicitly optimized to internalize such critiques. To bridge this gap, we propose EvoIdeator, a framework that facilitates the evolution of scientific ideas by aligning the RL training objective with checklist-grounded feedback. EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, and (2) fine-grained language feedback that offers span-level critiques regarding grounding, feasibility, and methodological rigor. By integrating these signals into the RL loop, we condition the policy to systematically utilize precise feedback during both optimization and inference. Extensive experiments demonstrate that EvoIdeator, built on Qwen3-4B, significantly outperforms much larger frontier models across key scientific metrics. Crucially, the learned policy exhibits strong generalization to diverse external feedback sources without further fine-tuning, offering a scalable and rigorous path toward self-refining autonomous ideation.

Keywords: Large Language Models, Reinforcement Learning, Scientific Idea Generation, Checklist-grounded Feedback, Self-refining, Autonomous Ideation, AI for Science
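The abstract mentions lexicographic rewards without spelling out the dimensions or their order. As a rough illustration of what lexicographic comparison means, the Python sketch below ranks candidate ideas by a tuple of per-dimension scores, so that a higher-priority dimension always dominates the lower ones. The dimension names, their priority order, and the integer scores are assumptions for illustration, not details taken from the paper.

```python
from typing import Dict, List, Tuple

# ASSUMPTION: hypothetical reward dimensions, listed from highest to lowest priority.
PRIORITY: Tuple[str, ...] = ("grounding", "feasibility", "rigor")


def lexicographic_key(scores: Dict[str, int]) -> Tuple[int, ...]:
    """Order the per-dimension scores by priority; Python compares tuples lexicographically."""
    return tuple(scores[dim] for dim in PRIORITY)


def best_lexicographic(candidates: List[dict]) -> dict:
    """Pick the candidate whose score tuple is largest in lexicographic order."""
    return max(candidates, key=lambda c: lexicographic_key(c["scores"]))


def best_scalar(candidates: List[dict]) -> dict:
    """Baseline for contrast: collapse all dimensions into one summed scalar."""
    return max(candidates, key=lambda c: sum(c["scores"].values()))


candidates = [
    {"name": "a", "scores": {"grounding": 2, "feasibility": 5, "rigor": 5}},
    {"name": "b", "scores": {"grounding": 3, "feasibility": 0, "rigor": 0}},
    {"name": "c", "scores": {"grounding": 3, "feasibility": 1, "rigor": 0}},
]

lex_winner = best_lexicographic(candidates)["name"]
scalar_winner = best_scalar(candidates)["name"]
```

Here the summed scalar prefers candidate `a`, while the lexicographic ordering prefers `c`, because no amount of feasibility or rigor compensates for a lower score on the first-priority dimension.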

85. ❌ CurvZO: Adaptive Curvature-Guided Sparse Zeroth-Order Optimization for Efficient LLM Fine-Tuning

Authors: Shuo Wang, Ziyu Chen, Ming Tang Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21725v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

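Scoring rationale: The paper's core topic is efficient fine-tuning of LLMs, so it is highly related to 'Large Language Models', 'Post-training/SFT', and 'PEFT' (10 each). Its sparse zeroth-order optimization method is partly related to 'Mixture of Experts/Sparse Models', 'Small Language Models/On-device AI', and 'Quantization/Model Compression' (5 each), since these all concern model efficiency. Other keywords, such as Scaling Laws, Pre-training, and Alignment, are unrelated to the paper (0).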

!!! tip deepseek-chat TL;DR

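The paper proposes CurvZO, an adaptive curvature-guided sparse zeroth-order optimization method that targets the memory cost of fine-tuning large language models. Experiments show that, relative to baseline methods, it improves accuracy and reduces training time while preserving memory efficiency.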

Abstract (Chinese translation)

基于反向传播的大语言模型（LLM）微调虽能实现高性能，但会带来巨大的内存开销，从而限制了其在资源受限硬件上的可扩展性。零阶（ZO）优化仅依赖前向传播，提供了一种内存高效的替代方案，但由于其梯度估计方差较高，通常存在收敛速度慢或不稳定的问题。稀疏零阶更新通过仅扰动参数子集部分解决了这一问题，但其有效性取决于能否选择信息量丰富的参数，这在零阶优化中具有挑战性，因为每次查询仅产生标量反馈。我们提出了自适应曲率引导稀疏零阶优化（CurvZO），该方法能够在线从标量零阶反馈中追踪曲率信号，并利用这些信号构建参数级的采样分布，以在每次更新时选择坐标，从而降低稀疏零阶梯度估计器的方差。此外，CurvZO 能根据不断演化的曲率信号分布动态调整扰动预算，从而产生既保持聚焦又具备足够探索性的稀疏零阶更新。在多种自然语言处理任务上对 OPT 和 Llama 模型进行的广泛实验表明，与零阶基线方法相比，CurvZO 能持续提升微调性能并减少训练时间。它在准确率上最高提升了 4.4 个百分点，并实现了高达 $2\times$ 的加速，同时保持了内存效率。

Abstract

Fine-tuning large language models (LLMs) with backpropagation achieves high performance but incurs substantial memory overhead, limiting scalability on resource-constrained hardware. Zeroth-order (ZO) optimization provides a memory-efficient alternative by relying solely on forward passes, yet it typically suffers from slow or unstable convergence due to high-variance gradient estimates. Sparse ZO updates partially address this issue by perturbing only a subset of parameters, but their effectiveness hinges on selecting informative parameters, which is challenging in ZO optimization because each query yields only scalar feedback. We propose Adaptive Curvature-Guided Sparse Zeroth-Order Optimization (CurvZO), which tracks curvature signals online from scalar ZO feedback and leverages these signals to construct a parameter-wise sampling distribution for selecting coordinates at each update, reducing the variance of the sparse ZO gradient estimator. Moreover, CurvZO dynamically adapts the perturbation budget to the evolving curvature signal distribution, yielding sparse ZO updates that remain both focused and sufficiently exploratory. Extensive experiments on OPT and Llama across diverse NLP tasks show that CurvZO consistently improves fine-tuning performance and reduces training time over ZO baselines. It improves accuracy by up to 4.4 points and achieves up to a $2\times$ speedup, while preserving memory efficiency.

Keywords: Large Language Models, Fine-tuning, Zeroth-order Optimization, Sparse Updates, Memory Efficiency, Parameter-efficient Fine-tuning, Adaptive Curvature, Training Acceleration
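To make the sparse zeroth-order idea concrete, here is a minimal pure-Python sketch of a two-point gradient estimate that perturbs only a sampled subset of coordinates and uses nothing but scalar loss values. It is a generic estimator, not CurvZO itself: the `probs` argument is a hypothetical stand-in for a curvature-guided sampling distribution, and the function name, the Gaussian perturbation, and the 1/p reweighting are assumptions.

```python
import random
from typing import Callable, List, Optional, Sequence


def sparse_zo_gradient(
    loss: Callable[[Sequence[float]], float],
    theta: Sequence[float],
    probs: Sequence[float],
    eps: float = 1e-3,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Two-point zeroth-order gradient estimate on a randomly selected coordinate subset.

    loss  : scalar-valued function of the parameters (forward pass only)
    theta : current parameter vector
    probs : per-coordinate selection probabilities (ASSUMPTION: stands in for a
            curvature-guided sampling distribution)
    """
    rng = rng or random.Random()
    # Select coordinate i with probability probs[i]; unselected coordinates get no perturbation.
    z = [rng.gauss(0.0, 1.0) if rng.random() < p else 0.0 for p in probs]
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    # Two forward passes yield one scalar: a central-difference directional derivative.
    d = (loss(plus) - loss(minus)) / (2.0 * eps)
    # Reweight by 1/p so that rarely selected coordinates are not systematically underestimated.
    return [d * zi / p if zi != 0.0 else 0.0 for zi, p in zip(z, probs)]


def quadratic(x: Sequence[float]) -> float:
    """Toy loss with gradient (-2, 6, 10) at the point (0, 1, 5)."""
    return (x[0] - 1.0) ** 2 + 3.0 * x[1] ** 2 + x[2] ** 2


def averaged_estimate(n: int = 20000, seed: int = 0) -> List[float]:
    """Average many sparse estimates; the third coordinate is never selected."""
    rng = random.Random(seed)
    acc = [0.0, 0.0, 0.0]
    for _ in range(n):
        g = sparse_zo_gradient(quadratic, [0.0, 1.0, 5.0], [0.5, 0.5, 0.0], rng=rng)
        acc = [a + gi for a, gi in zip(acc, g)]
    return [a / n for a in acc]
```

A single estimate is noisy, which is the variance problem the abstract describes; averaging many of them on the toy quadratic approaches the true gradient on the coordinates that can be selected and stays at zero on the one that cannot.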

86. ❌ FISformer: Replacing Self-Attention with a Fuzzy Inference System in Transformer Models for Time Series Forecasting

Authors: Bulent Haznedar, Levent Karacan Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21724v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

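Scoring rationale: FISformer improves the Transformer architecture for time series forecasting by replacing the conventional attention mechanism with a fuzzy inference system. The keywords all concern LLM techniques, training methods, inference optimization, agent systems, and similar topics, whereas this paper studies an alternative to the attention mechanism in the basic Transformer architecture. That is an innovation in deep learning model structure and involves no LLM-specific technique. The paper is only partly related to 'Mechanistic Interpretability OR Explainable AI' (5), because it mentions the interpretability of fuzzy logic, though this is not its main focus. Keywords such as AI for Science cover scientific applications, but the paper addresses general time series forecasting rather than a specific scientific domain (e.g., bioinformatics).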

!!! tip deepseek-chat TL;DR

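The paper addresses the difficulty that attention in Transformers has in modeling uncertainty and nonlinear dependencies in time series forecasting by replacing self-attention with a fuzzy inference system. Experiments show that FISFormer outperforms existing Transformer variants in forecasting accuracy, noise robustness, and interpretability.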

Abstract (Chinese translation)

Transformer在时间序列预测领域取得了显著进展，但其对确定性点积注意力机制的依赖限制了其建模多元时间维度不确定性与非线性依赖关系的能力。为突破这一局限，我们提出FISFormer——一种模糊推理系统驱动的Transformer模型，该模型以模糊推理系统交互机制替代传统注意力模块。在此框架中，每个查询-键值对会在各特征维度上经历模糊推理过程，通过可学习的隶属度函数和基于规则的推理来估计令牌间关系强度。这些由模糊推理系统衍生的交互权重能够捕捉不确定性，并提供令牌间可解释的连续映射关系。沿令牌轴进行softmax归一化处理后，这些权重通过逐元素乘法与对应的值特征相结合，最终生成上下文增强的令牌表征。该设计将模糊逻辑的可解释性与不确定性建模能力，与Transformer的表征学习优势相融合。在多个基准数据集上的大量实验表明，相较于最先进的Transformer变体，FISFormer在预测精度、噪声鲁棒性和可解释性方面均表现出优越性，从而验证了模糊推理作为传统注意力机制有效替代方案的可行性。

Abstract

Transformers have achieved remarkable progress in time series forecasting, yet their reliance on deterministic dot-product attention limits their capacity to model uncertainty and nonlinear dependencies across multivariate temporal dimensions. To address this limitation, we propose FISFormer, a Fuzzy Inference System-driven Transformer that replaces conventional attention with a FIS Interaction mechanism. In this framework, each query-key pair undergoes a fuzzy inference process for every feature dimension, where learnable membership functions and rule-based reasoning estimate token-wise relational strengths. These FIS-derived interaction weights capture uncertainty and provide interpretable, continuous mappings between tokens. A softmax operation is applied along the token axis to normalize these weights, which are then combined with the corresponding value features through element-wise multiplication to yield the final context-enhanced token representations. This design fuses the interpretability and uncertainty modeling of fuzzy logic with the representational power of Transformers. Extensive experiments on multiple benchmark datasets demonstrate that FISFormer achieves superior forecasting accuracy, noise robustness, and interpretability compared to state-of-the-art Transformer variants, establishing fuzzy inference as an effective alternative to conventional attention mechanisms.

Keywords: FISFormer, Transformer, Fuzzy Inference System, Time Series Forecasting, Attention Mechanism, Interpretability, Uncertainty Modeling, Multivariate Temporal Data

87. ❌ SemEval-2026 Task 12: Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models

Authors: Pengfei Cao, Mingxuan Yang, Yubo Chen, Chenlong Zhang, Mingxuan Liu, Kang Liu, Jun Zhao Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21720v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

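Scoring rationale: The paper studies LLMs on abductive event reasoning and causal inference tasks, so it is highly related to 'Large Language Models' (8), since LLMs are explicitly the systems under evaluation. The reasoning-related keywords 'Chain of Thought' and 'System 2 Thinking' are partly related (5 each), because abductive reasoning involves multi-step logical reasoning and deliberate thinking. Other keywords, such as MoE, SFT, RAG, and quantization, are not covered in the paper and score 0. The paper does not target any specific scientific domain, so 'AI for Science' and similar keywords also score 0.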

!!! tip deepseek-chat TL;DR

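The paper presents SemEval-2026 Task 12, an Abductive Event Reasoning benchmark task that evaluates how well large language models identify the most plausible direct cause of a real-world event in evidence-rich settings. The task attracted 122 participants and received 518 submissions.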

Abstract (Chinese translation)

理解现实世界事件发生的原因对于自然语言处理和实践决策均至关重要，然而在证据丰富的场景中，直接原因推断的研究仍显不足。为填补这一空白，我们组织了SemEval-2026 Task 12：溯因事件推理（Abductive Event Reasoning，AER）[脚注：任务数据可在 https://github.com/sooo66/semeval2026-task12-dataset.git 获取]。该任务要求系统从支持性证据中识别出目标事件最合理的直接原因。我们将AER构建为一个基于证据的多选基准测试，旨在捕捉现实世界因果推理的关键挑战，包括分散的证据、间接的背景因素以及语义相关但非因果的干扰项。本次共享任务吸引了122名参与者，共收到518份提交结果。本文介绍了任务设计、数据集构建流程、评估设置及系统结果。AER为现实世界事件的溯因推理提供了一个聚焦的基准，并揭示了未来因果推理与多文档理解研究面临的挑战。

Abstract

Understanding why real-world events occur is important for both natural language processing and practical decision-making, yet direct-cause inference remains underexplored in evidence-rich settings. To address this gap, we organized SemEval-2026 Task 12: Abductive Event Reasoning (AER) [footnote: the task data is available at https://github.com/sooo66/semeval2026-task12-dataset.git]. The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence-grounded multiple-choice benchmark that captures key challenges of real-world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non-causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.

Keywords: Abductive Event Reasoning, Causal Inference, Large Language Models, Multi-document Understanding, Evidence-grounded Reasoning, Real-world Events, Benchmark Evaluation, SemEval Task

88. ❌ When Exploration Comes for Free with Mixture-Greedy: Do we need UCB in Diversity-Aware Multi-Armed Bandits?

Authors: Bahar Dibaei Nia, Farzan Farnia Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21716v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

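Scoring rationale: The paper studies diversity-aware multi-armed bandits applied to generative model selection. Its core is algorithmic optimization (Mixture-Greedy vs. UCB) and theoretical analysis. The given keywords all concern the principles of LLM or deep learning techniques, or scientific applications, but the paper covers no specific LLM, deep learning, or AI for Science technique (e.g., LLMs, MoE, training methods, inference optimization, agent systems), nor does it discuss applications in scientific fields such as bioinformatics. Generative model selection serves only as the application setting, and the paper does not go into generative modeling techniques themselves, so it is unrelated to every keyword.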

!!! tip deepseek-chat TL;DR

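The paper finds that, in diversity-aware multi-armed bandits for generative model selection, a simple Mixture-Greedy strategy converges faster and performs better than conventional UCB exploration. It also gives a theoretical explanation: the diversity objective itself induces implicit exploration.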

Abstract (Chinese translation)

在现代生成式人工智能中，从多个生成模型中进行高效选择日益重要，因为从次优模型采样成本高昂。该问题可被形式化为一个多臂老虎机任务。在考虑多样性的评估指标下，非退化的生成器混合体可能优于任何单一模型，这使得该设定区别于经典的最佳臂识别问题。因此，先前的研究方法将上置信界探索奖励项纳入混合目标函数。然而，在多个数据集和评估指标上，我们观察到UCB项持续减缓收敛速度，并常常降低采样效率。相比之下，一种简单的混合贪心策略（不含显式的UCB类乐观项）收敛更快且能实现更优性能，尤其对于广泛使用的指标（如FID和Vendi分数），这些指标难以构建紧致的置信界。我们提供了理论见解以解释这一现象：在明晰的结构条件下，考虑多样性的目标函数通过偏好内部混合而引发隐式探索，导致对所有臂的线性采样，并为基于熵、基于核函数以及FID类目标函数提供次线性遗憾保证。这些结果表明，在面向生成模型选择的多样性感知多臂老虎机中，探索可能源于目标函数几何结构的内在特性，从而对显式置信奖励的必要性提出了质疑。

Abstract

Efficient selection among multiple generative models is increasingly important in modern generative AI, where sampling from suboptimal models is costly. This problem can be formulated as a multi-armed bandit task. Under diversity-aware evaluation metrics, a non-degenerate mixture of generators can outperform any individual model, distinguishing this setting from classical best-arm identification. Prior approaches therefore incorporate an Upper Confidence Bound (UCB) exploration bonus into the mixture objective. However, across multiple datasets and evaluation metrics, we observe that the UCB term consistently slows convergence and often reduces sample efficiency. In contrast, a simple Mixture-Greedy strategy without explicit UCB-type optimism converges faster and achieves even better performance, particularly for widely used metrics such as FID and Vendi where tight confidence bounds are difficult to construct. We provide theoretical insight explaining this behavior: under transparent structural conditions, diversity-aware objectives induce implicit exploration by favoring interior mixtures, leading to linear sampling of all arms and sublinear regret guarantees for entropy-based, kernel-based, and FID-type objectives. These results suggest that in diversity-aware multi-armed bandits for generative model selection, exploration can arise intrinsically from the objective geometry, questioning the necessity of explicit confidence bonuses.

Keywords: multi-armed bandits, generative model selection, diversity-aware evaluation, Mixture-Greedy, UCB exploration, sample efficiency, implicit exploration, regret guarantees
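The greedy step described in the abstract can be illustrated with a toy example. The Python sketch below uses two arms that emit categorical outcomes and takes the Shannon entropy of the mixture as the diversity objective. At every round it picks the mixture weight that maximizes the plug-in objective, with no confidence bonus, and then samples an arm from that mixture. The two-arm restriction, the grid search, the add-one smoothing, and all function names are assumptions made for illustration; the paper's objectives (e.g., FID or Vendi) and its algorithmic details are not reproduced here.

```python
import math
import random
from typing import List, Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def mixture(w: float, p0: Sequence[float], p1: Sequence[float]) -> List[float]:
    """Outcome distribution when arm 0 is used with probability w and arm 1 otherwise."""
    return [w * a + (1.0 - w) * b for a, b in zip(p0, p1)]


def greedy_weight(est0: Sequence[float], est1: Sequence[float], grid: int = 100) -> float:
    """Weight on arm 0 that maximizes the plug-in entropy; no exploration bonus is added."""
    best_w, best_val = 0.0, -1.0
    for i in range(grid + 1):
        w = i / grid
        val = entropy(mixture(w, est0, est1))
        if val > best_val + 1e-12:  # keep the first maximizer; ignore float noise
            best_w, best_val = w, val
    return best_w


def run(true_dists: Sequence[Sequence[float]], rounds: int = 500, seed: int = 0) -> List[int]:
    """Play the greedy mixture for a number of rounds and return the pull count per arm."""
    rng = random.Random(seed)
    k = len(true_dists[0])
    counts = [[1] * k for _ in range(2)]  # add-one smoothing doubles as the initial estimate
    pulls = [0, 0]
    for _ in range(rounds):
        est = [[c / sum(row) for c in row] for row in counts]
        w = greedy_weight(est[0], est[1])
        arm = 0 if rng.random() < w else 1
        outcome = rng.choices(range(k), weights=true_dists[arm])[0]
        counts[arm][outcome] += 1
        pulls[arm] += 1
    return pulls
```

With two arms whose outcomes never overlap, for example `run([[1.0, 0.0], [0.0, 1.0]])`, the entropy objective favors an interior mixture, so both arms keep being sampled even though nothing in the code explicitly encourages exploration.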

89. ❌ Compensating Visual Insufficiency with Stratified Language Guidance for Long-Tail Class Incremental Learning

Authors: Xi Wang, Xu Yang, Donghao Sun, Cheng Deng; Source: arXiv; Published: 2026-03-23; arXiv link: http://arxiv.org/abs/2603.21708v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

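Scoring rationale: The paper explicitly uses LLMs to generate a stratified language tree that guides long-tail class incremental learning, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). The remaining keywords concern specific techniques (such as MoE, RLHF, and RAG), application domains (such as bioinformatics), or model properties (such as quantization and hallucination mitigation) that the paper does not address, so each scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes using large language models to generate a stratified language tree that guides long-tail class incremental learning. Stratified adaptive and alignment language guidance mitigate data imbalance and catastrophic forgetting, and the method achieves state-of-the-art performance on multiple benchmarks.

Abstract (Chinese translation)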

长尾类增量学习（LT CIL）仍然极具挑战性，因为尾类样本的稀缺不仅阻碍了其学习过程，还在持续演变且不平衡的数据分布下加剧了灾难性遗忘。为解决这些问题，我们利用了语言知识的丰富信息性和可扩展性。具体而言，我们分析LT CIL的数据分布，以指导大语言模型（LLMs）生成一种分层语言树，该树以从粗粒度到细粒度的层次结构组织语义信息。基于此结构，我们提出了分层自适应语言引导，该方法利用可学习的权重融合多尺度语义表征，从而实现对尾类的动态监督调整，并减轻数据不平衡的影响。此外，我们引入了分层对齐语言引导，该方法利用语言树的结构稳定性来约束优化过程并强化语义-视觉对齐，从而缓解灾难性遗忘。在多个基准数据集上的大量实验表明，我们的方法取得了最先进的性能。

Abstract

Long-tail class incremental learning (LT CIL) remains highly challenging because the scarcity of samples in tail classes not only hampers their learning but also exacerbates catastrophic forgetting under continuously evolving and imbalanced data distributions. To tackle these issues, we exploit the informativeness and scalability of language knowledge. Specifically, we analyze the LT CIL data distribution to guide large language models (LLMs) in generating a stratified language tree that hierarchically organizes semantic information from coarse-grained to fine-grained granularity. Building upon this structure, we introduce stratified adaptive language guidance, which leverages learnable weights to merge multi-scale semantic representations, thereby enabling dynamic supervisory adjustment for tail classes and alleviating the impact of data imbalance. Furthermore, we introduce stratified alignment language guidance, which exploits the structural stability of the language tree to constrain optimization and reinforce semantic-visual alignment, thereby alleviating catastrophic forgetting. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance.

Keywords: Long-tail class incremental learning, Large language models, Stratified language tree, Semantic alignment, Catastrophic forgetting, Data imbalance, Visual-language guidance, State-of-the-art performance

90. ❌ Rethinking Token Reduction for Large Vision-Language Models

Authors: Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang; Source: arXiv; Published: 2026-03-23; arXiv link: http://arxiv.org/abs/2603.21701v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

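Scoring rationale: The paper studies token compression for large vision-language models (LVLMs) to reduce inference cost. It is fairly relevant to "Large Language Models" (8 points) because LVLMs are a kind of large model. It is partly related to "Quantization OR Model Compression" and "Speculative Decoding OR Inference Acceleration" (5 points each) because token compression falls under model compression and inference acceleration. Other keywords, such as MoE, SFT, and RAG, are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the high inference cost caused by redundant visual tokens in multi-turn visual question answering with large vision-language models. It proposes MetaCompress, a learning-based, prompt-agnostic compression method that achieves a better efficiency-accuracy trade-off.

Abstract (Chinese translation)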

大型视觉语言模型（LVLMs）在视觉理解与推理方面表现卓越，但过多的视觉标记会导致高昂的推理成本。尽管近期的标记缩减方法缓解了这一问题，但它们主要针对单轮视觉问答（VQA），而更具实用性的多轮视觉问答（MT-VQA）场景则尚未得到充分探索。MT-VQA引入了额外的挑战，因为后续问题无法预先获知，且可能涉及任意图像区域，这使得现有的缩减策略难以生效。具体而言，现有方法可分为两类：提示依赖型方法，其偏向于初始文本提示，可能丢弃对后续对话轮次有用的信息；提示无关型方法，虽然技术上适用于多轮场景，但依赖于启发式缩减指标（如注意力分数），导致性能欠佳。本文提出一种基于学习的提示无关型方法，命名为MetaCompress，以克服启发式设计的局限性。我们首先将标记缩减形式化为一种可学习的压缩映射，将剪枝与合并等现有格式统一为单一学习目标。基于此形式化框架，我们引入了一种数据高效训练范式，能够以有限的计算成本学习最优压缩映射。在MT-VQA基准测试及多种LVLM架构上的大量实验表明，MetaCompress在保持跨对话轮次强泛化能力的同时，实现了更优的效率-精度权衡。代码发布于https://github.com/MArSha1147/MetaCompress。

Abstract

Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.

Keywords: Large Vision-Language Models, token reduction, inference costs, multi-turn VQA, MetaCompress, compression mapping, efficiency-accuracy trade-offs, generalization
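
One way to read the abstract's claim that pruning and merging are unified under a single compression mapping is to treat the mapping as an m x n matrix applied to n token vectors. This is an illustrative assumption: the abstract does not state how MetaCompress parameterizes or learns the mapping, and the function names here are hypothetical. The matrices are set by hand, whereas a learning-based method would fit them.

```python
def compress(tokens, mapping):
    """Apply an m x n compression mapping to n token vectors, giving m token vectors."""
    dim = len(tokens[0])
    return [
        [sum(row[i] * tokens[i][d] for i in range(len(tokens))) for d in range(dim)]
        for row in mapping
    ]

def pruning_mapping(keep, n):
    """One-hot rows: each output token copies one kept input token."""
    return [[1.0 if i == k else 0.0 for i in range(n)] for k in keep]

def merging_mapping(groups, n):
    """Averaging rows: each output token is the mean of one group of input tokens."""
    return [[1.0 / len(g) if i in g else 0.0 for i in range(n)] for g in groups]

# Four 2-dimensional visual tokens reduced to two tokens in two different ways.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
pruned = compress(tokens, pruning_mapping([0, 2], 4))           # keep tokens 0 and 2
merged = compress(tokens, merging_mapping([[0, 1], [2, 3]], 4))  # average neighbouring pairs
```

Under this reading, pruning and merging differ only in the rows of the matrix, which is what would let a single objective cover both.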

91. ❌ A Blueprint for Self-Evolving Coding Agents in Vehicle Aerodynamic Drag Prediction

Authors: Jinhui Ren, Huaiming Li, Yabin Liu, Tao Li, Zhaokun Liu, Yujia Liang, Zengle Ge, Chufan Wu, Xiaomin Yuan, Danyu Liu, Annan Li, Jianmin Wu; Source: arXiv; Published: 2026-03-23; arXiv link: http://arxiv.org/abs/2603.21698v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

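Scoring rationale: The paper develops a self-evolving coding agent framework for vehicle aerodynamic drag prediction, an application of AI to scientific computing and engineering. The title and abstract do not mention large models, deep learning principles, or specific large-model keywords (such as LLM, MoE, or RLHF). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the work applies AI to a scientific engineering problem (aerodynamic simulation), which fits AI for Science, although it does not involve bioinformatics or cheminformatics. All other keywords are unrelated; the paper's core is agent evolution, constrained optimization, and engineering workflow automation rather than large-model techniques.

!!! tip deepseek-chat TL;DR

The paper presents a contract-centric blueprint for self-evolving coding agents that discover executable surrogate pipelines under industrial constraints to predict the vehicle drag coefficient, accelerating design iteration while preserving reliability.

Abstract (Chinese translation)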

高保真车辆风阻评估的主要制约因素并非求解器运行时间，而是工作流程中的摩擦：几何清理、网格划分重试、队列竞争以及跨团队可复现性失败。本文提出一种以合约为中心的蓝图，用于构建自进化的编码智能体，这些智能体能够在工业约束下发现可执行的替代流程来预测阻力系数 $C_d$。该方法将替代流程的发现定义为对程序（而非静态模型实例）的约束优化问题，结合了Famou-Agent式评估器反馈、基于种群的岛屿演化、结构化变异（数据、模型、损失函数和分割策略）以及平衡排序质量、稳定性和成本的多目标选择。严格的评估合约强制要求，任何候选流程在被采纳前都必须满足防泄漏、确定性重放、多种子鲁棒性和资源预算约束。在八种匿名演化算子的测试中，最佳系统达到了综合得分0.9335，符号准确率0.9180，而演化轨迹与消融分析表明自适应采样和岛屿迁移是收敛质量的主要驱动因素。部署模型明确采用“筛选与升级”策略：替代流程为设计探索提供高通量排序，但低置信度或分布外案例会自动升级至高保真CFD计算。最终贡献在于建立了一个可审计、可复用的工作流程，在加速空气动力学设计迭代的同时，保持了决策级可靠性、治理可追溯性和安全边界。

Abstract

High-fidelity vehicle drag evaluation is constrained less by solver runtime than by workflow friction: geometry cleanup, meshing retries, queue contention, and reproducibility failures across teams. We present a contract-centric blueprint for self-evolving coding agents that discover executable surrogate pipelines for predicting drag coefficient $C_d$ under industrial constraints. The method formulates surrogate discovery as constrained optimization over programs, not static model instances, and combines Famou-Agent-style evaluator feedback with population-based island evolution, structured mutations (data, model, loss, and split policies), and multi-objective selection balancing ranking quality, stability, and cost. A hard evaluation contract enforces leakage prevention, deterministic replay, multi-seed robustness, and resource budgets before any candidate is admitted. Across eight anonymized evolutionary operators, the best system reaches a Combined Score of 0.9335 with sign-accuracy 0.9180, while trajectory and ablation analyses show that adaptive sampling and island migration are primary drivers of convergence quality. The deployment model is explicitly "screen-and-escalate": surrogates provide high-throughput ranking for design exploration, but low-confidence or out-of-distribution cases are automatically escalated to high-fidelity CFD. The resulting contribution is an auditable, reusable workflow for accelerating aerodynamic design iteration while preserving decision-grade reliability, governance traceability, and safety boundaries.

Keywords: self-evolving coding agents, vehicle aerodynamic drag prediction, surrogate pipelines, constrained optimization, population-based island evolution, multi-objective selection, CFD escalation, aerodynamic design iteration
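
The "screen-and-escalate" deployment model described in the abstract can be sketched as a simple routing rule. The record fields, the confidence score, and the 0.8 threshold are hypothetical and chosen only for illustration; the abstract does not say how confidence or out-of-distribution status is computed.

```python
from dataclasses import dataclass

@dataclass
class SurrogatePrediction:
    design_id: str
    cd: float              # predicted drag coefficient C_d
    confidence: float      # hypothetical confidence score in [0, 1]
    in_distribution: bool  # hypothetical out-of-distribution flag

def screen_and_escalate(preds, min_confidence=0.8):
    """Rank trusted predictions by C_d; route the rest to high-fidelity CFD."""
    ranked, escalated = [], []
    for p in preds:
        if p.confidence < min_confidence or not p.in_distribution:
            escalated.append(p.design_id)  # needs a full CFD run
        else:
            ranked.append(p)
    ranked.sort(key=lambda p: p.cd)        # lower drag first
    return [p.design_id for p in ranked], escalated

preds = [
    SurrogatePrediction("A", 0.31, 0.95, True),
    SurrogatePrediction("B", 0.28, 0.60, True),   # low confidence
    SurrogatePrediction("C", 0.27, 0.90, True),
    SurrogatePrediction("D", 0.25, 0.99, False),  # out of distribution
]
ranked, escalated = screen_and_escalate(preds)
```

Design D has the lowest predicted drag but is still escalated, which reflects the abstract's point that surrogate rankings are only trusted inside the region where the surrogate is reliable.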

92. ❌ MIND: Multi-agent inference for negotiation dialogue in travel planning

Authors: Hunmin Do, Taejun Yoon, Kiyong Jung; Source: arXiv; Published: 2026-03-23; arXiv link: http://arxiv.org/abs/2603.21696v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

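Scoring rationale: MIND is a multi-agent negotiation dialogue framework that uses LLMs as the basis for its agents (the abstract mentions LLM-as-a-Judge), so it belongs to multi-agent systems research. It is therefore highly relevant to "Large Language Models", "LLM Agents", and "Multi-agent Systems" (10 points each). Other keywords, such as MoE, SLMs, training methods, inference optimization, and AI-for-science applications, are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

The study proposes MIND, a multi-agent negotiation dialogue framework whose Strategic Appraisal phase infers opponent willingness. It outperforms traditional multi-agent debate frameworks in travel-planning negotiation, with a 20.5% improvement in High-w Hit and a 30.7% increase in Debate Hit-Rate, and is rated higher on rationality.

Abstract (Chinese translation)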

尽管多智能体辩论（Multi-Agent Debate, MAD）研究已取得进展，但其在协调复杂利益相关者需求（如旅行规划）方面的有效性仍很大程度上未被探索。为填补这一空白，我们提出了MIND（面向协商对话的多智能体推理框架），该框架旨在模拟具有异质偏好的旅行者之间建立现实共识的过程。基于心智理论（Theory of Mind, ToM），MIND引入了战略评估阶段，能够从语言细微差别中以90.2%的准确率推断对手意愿度（w）。实验结果表明，MIND优于传统MAD框架，在高意愿度命中率上提升20.5%，辩论命中率提高30.7%，并能有效优先处理关键约束条件。此外，通过大语言模型即评判的定性评估证实，MIND在合理性（68.8%）与流畅性（72.4%）上均超越基线模型，整体胜率达到68.3%。这些发现验证了MIND能有效建模人类协商动态，从而达成具有说服力的共识。

Abstract

While Multi-Agent Debate (MAD) research has advanced, its efficacy in coordinating complex stakeholder interests such as travel planning remains largely unexplored. To bridge this gap, we propose MIND (Multi-agent Inference for Negotiation Dialogue), a framework designed to simulate realistic consensus-building among travelers with heterogeneous preferences. Grounded in the Theory of Mind (ToM), MIND introduces a Strategic Appraisal phase that infers opponent willingness (w) from linguistic nuances with 90.2% accuracy. Experimental results demonstrate that MIND outperforms traditional MAD frameworks, achieving a 20.5% improvement in High-w Hit and a 30.7% increase in Debate Hit-Rate, effectively prioritizing high-stakes constraints. Furthermore, qualitative evaluations via LLM-as-a-Judge confirm that MIND surpasses baselines in Rationality (68.8%) and Fluency (72.4%), securing an overall win rate of 68.3%. These findings validate that MIND effectively models human negotiation dynamics to derive persuasive consensus.

Keywords: Multi-agent Systems, Negotiation Dialogue, Travel Planning, Theory of Mind, Consensus-building, LLM-as-a-Judge, Strategic Appraisal, Debate Framework

93. ❌ Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain

Authors: Mohammad Asadi, Tahoura Nedaee, Jack W. O’Sullivan, Euan Ashley, Ehsan Adeli; Source: arXiv; Published: 2026-03-23; arXiv link: http://arxiv.org/abs/2603.21693v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

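Scoring rationale: The paper studies hallucination detection for multimodal large language models (MLLMs) in medical visual question answering, so it is highly relevant to "Large Language Models" (10 points) as an application of large models in a scientific domain. It proposes a new method aimed directly at "Hallucination Mitigation" (10 points). It analyzes model confidence and evidence, which is partly related to "Mechanistic Interpretability" (5 points). The medical VQA application falls under "AI for Science" (10 points). Other keywords, such as MoE, SFT, and RAG, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper targets hallucinations produced by medical multimodal large language models in visual question answering. It proposes CEBaG, a deterministic hallucination detection method that needs no stochastic sampling and no external models, and that achieves the highest AUC in 13 of 16 evaluated settings.

Abstract (Chinese translation)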

多模态大语言模型（MLLMs）在医学视觉问答（VQA）任务中展现出强大潜力，但其仍易产生幻觉现象，即生成与输入图像内容相矛盾的回应，这在临床环境中可能带来严重风险。现有的幻觉检测方法，如语义熵（Semantic Entropy, SE）和视觉增强语义熵（Vision-Amplified Semantic Entropy, VASE），需要对每个样本进行10至20次随机生成，并依赖外部自然语言推理模型进行语义聚类，导致计算成本高昂且难以实际部署。我们观察到，幻觉响应在模型自身的对数概率中表现出一种独特特征：令牌级置信度不一致以及对视觉证据的敏感性较弱。基于这一观察，我们提出置信度-证据贝叶斯增益（Confidence-Evidence Bayesian Gain, CEBaG），这是一种确定性幻觉检测方法，无需随机采样、无需外部模型且无需任务特定超参数。CEBaG融合了两个互补信号：令牌级预测方差（用于捕捉响应令牌间不一致的置信度）和证据强度（用于衡量图像相对于纯文本推理对每个令牌预测的偏移程度）。在四种医学MLLM模型和三个VQA基准测试（共16种实验设置）上的评估表明，CEBaG在16种设置中的13种取得了最高的AUC值，平均比VASE提高8个AUC点，同时该方法完全确定且自包含。代码将在论文录用后公开。

Abstract

Multimodal large language models (MLLMs) have shown strong potential for medical Visual Question Answering (VQA), yet they remain prone to hallucinations, defined as generating responses that contradict the input image, posing serious risks in clinical settings. Current hallucination detection methods, such as Semantic Entropy (SE) and Vision-Amplified Semantic Entropy (VASE), require 10 to 20 stochastic generations per sample together with an external natural language inference model for semantic clustering, making them computationally expensive and difficult to deploy in practice. We observe that hallucinated responses exhibit a distinctive signature directly in the model’s own log-probabilities: inconsistent token-level confidence and weak sensitivity to visual evidence. Based on this observation, we propose Confidence-Evidence Bayesian Gain (CEBaG), a deterministic hallucination detection method that requires no stochastic sampling, no external models, and no task-specific hyperparameters. CEBaG combines two complementary signals: token-level predictive variance, which captures inconsistent confidence across response tokens, and evidence magnitude, which measures how much the image shifts per-token predictions relative to text-only inference. Evaluated across four medical MLLMs and three VQA benchmarks (16 experimental settings), CEBaG achieves the highest AUC in 13 of 16 settings and improves over VASE by 8 AUC points on average, while being fully deterministic and self-contained. The code will be made available upon acceptance.

Keywords: Multimodal Large Language Models, Medical Visual Question Answering, Hallucination Detection, Confidence-Evidence Bayesian Gain, Deterministic Method, Token-level Confidence, Visual Evidence Sensitivity, Medical AI Applications
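
The two signals named in the abstract can be sketched from per-token log-probabilities. This is a rough illustration under stated assumptions: variance is taken over log-probabilities, evidence magnitude is taken as the mean absolute shift between image-conditioned and text-only log-probabilities, and the function names are hypothetical. The abstract does not give the exact definitions, nor how CEBaG combines the two signals, so no combined score is computed here.

```python
from statistics import fmean, pvariance

def predictive_variance(logprobs_with_image):
    """Spread of token-level confidence across the response (assumed: variance of log-probs)."""
    return pvariance(logprobs_with_image)

def evidence_magnitude(logprobs_with_image, logprobs_text_only):
    """How far the image moves per-token predictions (assumed: mean absolute log-prob shift)."""
    shifts = [abs(a - b) for a, b in zip(logprobs_with_image, logprobs_text_only)]
    return fmean(shifts)

# Hypothetical log-probs for a three-token answer, scored with and without the image.
with_image = [-0.2, -0.3, -2.5]
text_only = [-0.2, -0.4, -2.4]

variance = predictive_variance(with_image)
evidence = evidence_magnitude(with_image, text_only)
```

Both quantities need only two deterministic forward passes over a fixed response, one with the image and one without, which matches the abstract's description of the method as requiring no stochastic sampling and no external models.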

94. ❌ AI Token Futures Market: Commoditization of Compute and Derivatives Contract Design

Authors: Yicai Xing; Source: arXiv; Published: 2026-03-23; arXiv link: http://arxiv.org/abs/2603.21690v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	5.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

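Scoring rationale: The paper studies the commoditization of AI inference tokens and the design of a futures market for them, an economic and financial application of large models. It is highly relevant to "Large Language Models" (8 points) because it analyzes the commodity attributes of tokens against the backdrop of widely deployed LLMs and VLAs. It is loosely related to "Monte Carlo Tree Search OR MCTS AND LLM" (5 points) because it uses Monte Carlo simulation to evaluate the hedging efficiency of the futures contracts, although this is not the MCTS algorithm itself. The other keywords concern large-model principles, training methods, inference optimization, and application scenarios that the paper does not address, so each scores 0.

!!! tip deepseek-chat TL;DR

The paper studies the commoditization of AI inference tokens and proposes a design for standardized token futures contracts. Monte Carlo simulation shows that, under an application-layer demand explosion scenario, these futures can reduce enterprise compute cost volatility by 62%-78%.

Abstract (Chinese translation)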

随着大语言模型（LLM）与视觉-语言-动作模型（VLA）的广泛部署，人工智能推理所消耗的令牌正演变为一种新型商品。本文系统分析了令牌的商品属性，论证了其从智能服务输出向计算基础设施原材料的转变，并与电力、碳排放配额及带宽等成熟商品进行了类比。基于电力期货市场的历史经验与商品金融化理论，我们提出了一套标准化的令牌期货合约完整设计方案，包括标准推理令牌（Standard Inference Token, SIT）的定义、合约规格、结算机制、保证金制度及做市商安排。通过构建均值回归跳跃扩散随机过程模型并进行蒙特卡洛模拟，我们评估了所设计期货合约对应用层企业的套期保值效率。模拟结果表明，在应用层需求爆发的情景下，令牌期货可将企业计算成本波动降低62%-78%。本文还探讨了GPU算力期货的可行性，并对令牌期货市场的监管框架进行了讨论，为计算资源的金融化提供了理论基础与实践路径。

Abstract

As large language models (LLMs) and vision-language-action models (VLAs) become widely deployed, the tokens consumed by AI inference are evolving into a new type of commodity. This paper systematically analyzes the commodity attributes of tokens, arguing for their transition from intelligent service outputs to compute infrastructure raw materials, and draws comparisons with established commodities such as electricity, carbon emission allowances, and bandwidth. Building on the historical experience of electricity futures markets and the theory of commodity financialization, we propose a complete design for standardized token futures contracts, including the definition of a Standard Inference Token (SIT), contract specifications, settlement mechanisms, margin systems, and market-maker regimes. By constructing a mean-reverting jump-diffusion stochastic process model and conducting Monte Carlo simulations, we evaluate the hedging efficiency of the proposed futures contracts for application-layer enterprises. Simulation results show that, under an application-layer demand explosion scenario, token futures can reduce enterprise compute cost volatility by 62%-78%. We also explore the feasibility of GPU compute futures and discuss the regulatory framework for token futures markets, providing a theoretical foundation and practical roadmap for the financialization of compute resources.

Keywords: AI tokens, futures market, commoditization, large language models, Monte Carlo simulation, hedging efficiency, compute resources, financialization
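
The simulation approach named in the abstract, a mean-reverting jump-diffusion process evaluated by Monte Carlo, can be sketched as follows. Every parameter value, the Euler discretization on the log price, and the idealized hedge with no basis risk are assumptions made for illustration. The sketch does not reproduce the paper's model or its 62%-78% figure: with a perfect hedge the volatility reduction simply equals the hedge ratio by construction.

```python
import math
import random

def simulate_spot(rng, p0=1.0, mean_level=1.0, kappa=2.0, sigma=0.3,
                  jump_rate=1.0, jump_mean=0.4, steps=52, dt=1 / 52):
    """Euler scheme on the log price: mean reversion + diffusion + upward jumps."""
    x = math.log(p0)
    target = math.log(mean_level)
    for _ in range(steps):
        x += kappa * (target - x) * dt                    # pull toward the long-run level
        x += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)  # diffusion shock
        if rng.random() < jump_rate * dt:                 # at most one jump per step
            x += rng.expovariate(1.0 / jump_mean)         # demand spike raises the price
    return math.exp(x)

def cost_std(hedge_ratio, futures_price=1.0, n_paths=500, seed=0):
    """Standard deviation of per-token cost when a fraction is locked in via futures."""
    rng = random.Random(seed)
    costs = []
    for _ in range(n_paths):
        spot = simulate_spot(rng)
        costs.append(hedge_ratio * futures_price + (1.0 - hedge_ratio) * spot)
    mean = sum(costs) / n_paths
    return math.sqrt(sum((c - mean) ** 2 for c in costs) / n_paths)

unhedged = cost_std(0.0)
hedged = cost_std(0.7)
reduction = 1.0 - hedged / unhedged  # equals the hedge ratio in this idealized setup
```

A more faithful study would add basis risk between the futures settlement price and the price an enterprise actually pays, which is where a reduction below the hedge ratio would come from.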

95. ❌ Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces

Authors: Neelmani Vispute; Source: arXiv; Published: 2026-03-23; arXiv link: http://arxiv.org/abs/2603.21692v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

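Scoring rationale: The paper focuses on a structured reasoning provenance system for AI agents and is highly relevant to keywords on LLM agents, reasoning processes, and interpretability. It proposes the Agent Execution Record (AER), which captures structured information such as an agent's intent, observations, inferences, and evidence chains, and supports analyses such as reasoning pattern mining and confidence calibration. It is highly relevant to "LLM Agents", "Chain of Thought", and "System 2 Thinking" (10 points each), fairly relevant to "Large Language Models", "Self-Correction", and "Explainable AI" (8 points each), and partly related to "Tool Use", "Multi-agent Systems", and "Hallucination Mitigation" (5 points each). Other keywords, such as those on model architecture, training methods, and compression techniques, are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses the lack of structured reasoning provenance for autonomous AI agents. It proposes the Agent Execution Record (AER), which captures intent, observation, inference, and evidence chains as structured fields, enabling normalized records of agent reasoning and population-level behavioral analytics.

Abstract (Chinese translation)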

随着人工智能代理从人类监督的副驾驶模式转向自主平台基础设施，跨调查群体分析其推理行为的能力已成为一项紧迫的基础设施需求。现有的运维工具能有效满足相邻需求：状态检查点系统提供容错能力；可观测性平台为调试提供执行追踪；遥测标准确保互操作性。然而，当前系统未能原生提供一种作为一等（first-class）、模式层（schema-level）原语的结构化推理溯源——即规范化的、可查询的记录，用以说明代理为何选择每个行动、从每次观察中得出何种结论、每个结论如何影响其策略，以及哪些证据支持其最终判定。本文提出代理执行记录（Agent Execution Record, AER），这是一种结构化推理溯源原语，它在每个步骤中将意图、观察和推论作为一等可查询字段进行捕获，同时记录带有修订依据的版本化计划、证据链、带有置信度的结构化判定以及委托授权链。我们形式化区分了计算状态持久化与推理溯源，论证了后者通常无法从前者中可靠重构，并展示了AER如何支持群体层面的行为分析：包括推理模式挖掘、置信度校准、跨代理比较以及通过模拟回放进行的反事实回归测试。我们提出了一个具有可扩展领域配置文件的领域无关模型、一个参考实现与SDK，并概述了一种基于生产级平台化根因分析代理的初步部署所制定的评估方法。

Abstract

As AI agents transition from human-supervised copilots to autonomous platform infrastructure, the ability to analyze their reasoning behavior across populations of investigations becomes a pressing infrastructure requirement. Existing operational tooling addresses adjacent needs effectively: state checkpoint systems enable fault tolerance; observability platforms provide execution traces for debugging; telemetry standards ensure interoperability. What current systems do not natively provide as a first-class, schema-level primitive is structured reasoning provenance – normalized, queryable records of why the agent chose each action, what it concluded from each observation, how each conclusion shaped its strategy, and which evidence supports its final verdict. This paper introduces the Agent Execution Record (AER), a structured reasoning provenance primitive that captures intent, observation, and inference as first-class queryable fields on every step, alongside versioned plans with revision rationale, evidence chains, structured verdicts with confidence scores, and delegation authority chains. We formalize the distinction between computational state persistence and reasoning provenance, argue that the latter cannot in general be faithfully reconstructed from the former, and show how AERs enable population-level behavioral analytics: reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. We present a domain-agnostic model with extensible domain profiles, a reference implementation and SDK, and outline an evaluation methodology informed by preliminary deployment on a production platformized root cause analysis agent.

Keywords: AI agents, reasoning provenance, autonomous agents, behavioral analytics, structured reasoning, evidence chains, execution records, population-level analysis
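
The idea of recording intent, observation, and inference as first-class queryable fields can be sketched with plain dataclasses. The class and field names are hypothetical and are not the AER schema or SDK, which the abstract does not spell out. The calibration function is one simple example of the population-level analytics the abstract mentions.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    intent: str       # why the agent chose this action
    observation: str  # what the action returned
    inference: str    # what the agent concluded from the observation

@dataclass
class AgentExecutionRecord:
    agent_id: str
    steps: list = field(default_factory=list)
    evidence: list = field(default_factory=list)  # references supporting the verdict
    verdict: str = ""
    confidence: float = 0.0                       # stated confidence in [0, 1]

def calibration_gap(records, is_correct):
    """Mean stated confidence minus observed accuracy over a population of records."""
    mean_conf = sum(r.confidence for r in records) / len(records)
    accuracy = sum(1 for ok in is_correct if ok) / len(is_correct)
    return mean_conf - accuracy

records = [
    AgentExecutionRecord(
        "rca-1",
        [Step("check recent deploys", "deploy at 09:12", "likely regression")],
        ["deploy-log"], "bad deploy", 0.9),
    AgentExecutionRecord(
        "rca-2",
        [Step("check disk usage", "disk at 40%", "disk is not the cause")],
        ["disk-metrics"], "network fault", 0.7),
]
gap = calibration_gap(records, [True, False])  # positive gap suggests overconfidence
```

Because confidence and verdict are explicit fields rather than text buried in a trace, a query like this can run across many investigations without parsing logs, which is the distinction the abstract draws between reasoning provenance and execution traces.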

96. ❌ Mirage The Illusion of Visual Understanding

Authors: Mohammad Asadi, Jack W. O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, Euan Ashley; Source: arXiv; Published: 2026-03-23; arXiv link: http://arxiv.org/abs/2603.21687v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

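Scoring rationale: The paper studies the reasoning mechanisms and evaluation of multimodal AI systems, specifically vision-language models. Its core finding is that models produce detailed descriptions and reasoning even without image input (termed "mirage reasoning"). This bears directly on the application of large models (LLMs/Foundation Models) in science (medicine), on reasoning processes (Chain of Thought/System 2 Thinking), on hallucination mitigation and factuality, on interpretability, and on AI for Science (bioinformatics/medical applications). Other keywords such as MoE, SLMs, training techniques, optimization methods, and agent systems are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper exposes a fundamental vulnerability in the vision-language reasoning of multimodal AI systems: frontier models generate detailed image descriptions and reasoning without any image input (termed "mirage reasoning") and still score highly on multimodal benchmarks, including medical ones. This reveals flaws in current evaluation methods, and the paper proposes B-Clean to ensure fair, vision-grounded evaluation.

Abstract (Chinese translation)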

多模态人工智能系统在广泛的实际任务中取得了显著性能，然而视觉-语言推理背后的机制仍鲜为人知。我们报告了三项挑战当前主流假设的研究发现，这些假设涉及此类系统如何处理与整合视觉信息。首先，前沿模型能够为从未提供的图像生成详细描述和复杂推理轨迹（包括具有病理偏倚的临床发现），我们将此现象称为幻象推理。其次，在没有图像输入的情况下，模型在通用及医学多模态基准测试中仍能获得极高的分数，这对其实际效用与设计提出了质疑。最极端的情况下，我们的模型在未获取任何图像时，仍在一个标准胸部X光问答基准测试中取得了最高排名。第三，当模型被明确要求在不查看图像的情况下猜测答案，而非通过隐式提示使其假定图像存在时，其性能显著下降。显式猜测似乎会触发一种更为保守的响应机制，这与幻象机制形成鲜明对比——在幻象机制中，模型的行为表现得仿佛已获得图像输入。这些发现揭示了视觉-语言模型在推理与评估方式上的根本性缺陷，表明亟需建立能够消除文本线索（这些线索可能支持非视觉推理）的私有基准测试，尤其在医疗等人工智能误判可能造成严重后果的领域。为此，我们提出了B-Clean作为一项原则性解决方案，旨在实现对多模态人工智能系统公平、基于视觉的评估。

Abstract

Multimodal AI systems have achieved remarkable performance across a broad range of real-world tasks, yet the mechanisms underlying visual-language reasoning remain surprisingly poorly understood. We report three findings that challenge prevailing assumptions about how these systems process and integrate visual information. First, frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly. Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided. These findings expose fundamental vulnerabilities in how visual-language models reason and are evaluated, pointing to an urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts where miscalibrated AI carries the greatest consequence. We introduce B-Clean as a principled solution for fair, vision-grounded evaluation of multimodal AI systems.

Keywords: Multimodal AI, Visual-language reasoning, Mirage reasoning, Hallucination, Medical AI evaluation, Benchmark design, Vision-grounded evaluation, Clinical AI

97. ❌ Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization

Author: Hung-Hsuan Chen Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21676v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

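Scoring rationale: The paper proposes a depth-recurrent Transformer architecture aimed at multi-step reasoning and compositional generalization, which is highly relevant to 'Chain of Thought' and 'System 2 Thinking' (its core content). As an architectural innovation on the Transformer, it is somewhat related to 'Large Language Models'. Its analysis of reasoning mechanisms is somewhat related to 'Mechanistic Interpretability'. Other keywords such as MoE, quantization, and RAG are not addressed in the paper.

!!! tip deepseek-chat TL;DR

The paper proposes a depth-recurrent Transformer that decouples computational depth from parameter count, so that the model can reason more deeply at inference time by adding recurrence steps. On three compositional reasoning tasks, performance improves markedly as the number of thinking steps increases.

Abstract (Chinese translation)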

标准Transformer具有固定的计算深度，这从根本上限制了其泛化至需要可变深度推理任务的能力，例如多跳图遍历或嵌套逻辑。我们提出一种深度循环Transformer，通过在潜在空间中迭代应用共享权重的Transformer模块，将计算深度与参数量解耦——使模型能在推理时通过循环步骤换取更深层次的推理能力。我们的架构包含三种机制以实现深度循环（20+步骤）的稳定性：（1）静默思考目标函数，仅监督最终输出，迫使模型进行真正的多步推理而非依赖中间启发式捷径；（2）LayerScale初始化技术，保护脆弱的推理状态免受未训练层噪声的影响；（3）具有恒等偏置的循环机制，构建跨越多步的梯度高速通路。我们在三个归纳偏置递减的组合推理领域进行评估：图可达性（严格邻接掩码）、嵌套布尔逻辑（相对位置编码）以及非结构化关系文本（序列位置不提供结构提示）。在所有任务中，我们观察到清晰的计算边界——当思考步骤随任务复杂度增加时，模型性能从随机水平跃迁至近乎完美的临界点。此外，这些任务展现出性质不同的泛化行为：精确但脆弱（图推理）、近似但稳健（逻辑推理），以及无需结构提示的自主潜在路径选择（文本推理）。这种递进关系揭示了任务不变的循环推理核心与任务特定的感知接口之间的相互作用如何塑造分布外泛化能力，为纵向思维链提供了机制性视角，与主流的横向令牌生成范式形成互补。

Abstract

Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space – enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear computational frontier – a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.

Keywords: depth-recurrent Transformer, compositional generalization, multi-step reasoning, computational frontier, silent thinking objective, out-of-distribution generalization, vertical chain-of-thought, latent space recurrence
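
The idea of trading recurrence steps for reasoning depth can be illustrated with a minimal sketch in plain Python. This is not the paper's architecture: the "block" below is a hand-written one-hop propagation rule rather than a learned Transformer layer, and the names `shared_block`, `recur`, `is_reachable`, and the toy chain graph are assumptions introduced purely for illustration. The sketch only shows why a target that is k hops away becomes reachable once the same shared update has been applied at least k times, a toy analogue of the computational frontier described in the abstract.

```python
from typing import Dict, List, Set

Graph = Dict[int, List[int]]


def shared_block(state: Set[int], graph: Graph) -> Set[int]:
    """One application of the shared update: expand the known set by one hop.

    Copying `state` first keeps everything already known, which is only
    loosely analogous to an identity-biased recurrence.
    """
    expanded = set(state)
    for node in state:
        expanded.update(graph.get(node, []))
    return expanded


def recur(source: int, graph: Graph, steps: int) -> Set[int]:
    """Apply the same block `steps` times, starting from the source node."""
    state = {source}
    for _ in range(steps):
        state = shared_block(state, graph)
    return state


def is_reachable(source: int, target: int, graph: Graph, steps: int) -> bool:
    """Only the final state is inspected, never the intermediate ones."""
    return target in recur(source, graph, steps)


# Toy directed chain 0 -> 1 -> 2 -> 3 -> 4 (hypothetical data).
chain: Graph = {i: [i + 1] for i in range(4)}

for steps in range(6):
    print(steps, is_reachable(0, 4, chain, steps))
```

In this toy setting the answer flips from `False` to `True` exactly when the number of steps reaches the hop distance (4), and extra steps do no harm. The learned model in the paper has to discover an equivalent propagation rule from final-output supervision alone, which this sketch does not attempt.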

98. ❌ Cross-Scenario Deraining Adaptation with Unpaired Data: Superpixel Structural Priors and Multi-Stage Pseudo-Rain Synthesis

Authors: Kangbo Zhao, Miaoxin Guan, Xiang Chen, Yukai Shi, Jinshan Pan Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21661v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

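Scoring rationale: The paper focuses on image deraining in computer vision and proposes a cross-scenario deraining adaptation framework that uses superpixel structural priors and multi-stage pseudo-rain synthesis. Although it applies deep learning to image processing, its core content is unrelated to the LLM-oriented keywords provided (LLMs, MoE, RLHF, RAG, and so on). The only weak connection is 'Domain Adaptation' (part of the 'Pre-training OR Continual Pre-training OR Domain Adaptation' keyword), because the paper addresses cross-scenario adaptation; this is domain adaptation in computer vision rather than in the large-model setting, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a cross-scenario image deraining adaptation framework that requires no paired data in the target domain. Using superpixel structural priors and multi-stage pseudo-rain synthesis, it reports PSNR gains of up to 32% to 59% in OOD scenarios and faster training convergence.

Abstract (Chinese translation)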

图像去雨在底层计算机视觉中扮演着关键角色，是构建鲁棒的户外监控与自动驾驶系统的先决条件。尽管深度学习范式在严格对齐的场景中取得了显著成功，但在泛化至未见过的分布外场景时，其性能常出现严重下降。这一失败主要源于合成训练数据集与真实世界降雨复杂物理动态之间的显著领域差异。为应对这些挑战，本文提出了一种开创性的跨场景去雨自适应框架。与传统方法不同，我们的方法无需目标域中的成对雨图观测，仅利用无雨背景图像。我们设计了一个超像素生成模块，通过简单线性迭代聚类从源域中提取稳定的结构先验。随后，引入一种分辨率自适应融合策略，通过纹理相似性将这些源结构与目标背景对齐，确保合成多样且真实的伪数据。最后，我们实现了一种伪标签重合成机制，该机制采用多阶段噪声生成来模拟真实的雨纹。该框架作为一个通用的即插即用模块，能够无缝集成到任意的去雨架构中。在多个前沿模型上的大量实验表明，我们的方法在分布外领域中实现了高达32%至59%的峰值信噪比提升，同时显著加速了训练收敛。

Abstract

Image deraining plays a pivotal role in low-level computer vision, serving as a prerequisite for robust outdoor surveillance and autonomous driving systems. While deep learning paradigms have achieved remarkable success in firmly aligned settings, they often suffer from severe performance degradation when generalized to unseen Out-of-Distribution (OOD) scenarios. This failure stems primarily from the significant domain discrepancy between synthetic training datasets and the complex physical dynamics of real-world rain. To address these challenges, this paper proposes a pioneering cross-scenario deraining adaptation framework. Diverging from conventional approaches, our method obviates the requirements for paired rainy observations in the target domain, leveraging exclusively rain-free background images. We design a Superpixel Generation (Sup-Gen) module to extract stable structural priors from the source domain using Simple Linear Iterative Clustering. Subsequently, a Resolution-adaptive Fusion strategy is introduced to align these source structures with target backgrounds through texture similarity, ensuring the synthesis of diverse and realistic pseudo-data. Finally, we implement a pseudo-label re-Synthesize mechanism that employs multi-stage noise generation to simulate realistic rain streaks. This framework functions as a versatile plug-and-play module capable of seamless integration into arbitrary deraining architectures. Extensive experiments on state-of-the-art models demonstrate that our approach yields remarkable PSNR gains of up to 32% to 59% in OOD domains while significantly accelerating training convergence.

Keywords: image deraining, cross-scenario adaptation, unpaired data, superpixel structural priors, pseudo-rain synthesis, domain discrepancy, out-of-distribution generalization, deep learning

99. ❌ Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks

Authors: Yanming Mu, Hao Hu, Feiyang Li, Qiao Yuan, Jiang Wu, Zichuan Liu, Pengcheng Liu, Mei Wang, Hongwei Zhou, Yuling Liu Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21654v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

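Scoring rationale: The paper's core topic is the security of RAG systems, which is highly relevant to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (15 points). It explicitly notes that RAG is used to mitigate LLM hallucinations, so it is relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'Hallucination Mitigation OR Factuality OR Truthfulness' (10 points each). Other keywords such as MoE, SLMs, training methods, reasoning techniques, and AI for Science are not addressed in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper systematically analyzes security threats to retrieval-augmented generation (RAG) systems, such as data poisoning and adversarial attacks, proposes a taxonomy of defense techniques from the dual perspective of input and output stages, and consolidates a unified benchmark framework, with the aim of advancing the robustness and trustworthiness of next-generation RAG systems.

Abstract (Chinese translation)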

检索增强生成（Retrieval-Augmented Generation，RAG）通过引入外部知识库，显著缓解了大语言模型中的幻觉问题与领域知识不足。然而，RAG的多模块架构引入了复杂的系统级安全漏洞。本文以RAG工作流程为指引，分析了其底层漏洞机制，并系统性地归类了数据投毒、对抗攻击、成员推理攻击等核心威胁向量。基于此威胁评估，我们从输入与输出双阶段视角构建了RAG防御技术的分类体系。输入侧分析梳理了包括动态访问控制、同态加密检索和对抗性预过滤在内的数据保护机制；输出侧研究总结了联邦学习隔离、差分隐私扰动及轻量化数据脱敏等先进的防泄露技术。为建立未来实验设计的统一基准，我们整合了权威测试数据集、安全标准与评估框架。据我们所知，本文首次提出了专注于RAG系统安全的端到端综述。与现有文献孤立分析特定漏洞不同，我们系统性地描绘了整个流程——对威胁模型、防御机制和评估基准进行了统一分析。通过深入揭示潜在风险，本研究旨在推动构建高鲁棒性、可信赖的下一代RAG系统。

Abstract

Retrieval-Augmented Generation (RAG) significantly mitigates the hallucinations and domain knowledge deficiency in large language models by incorporating external knowledge bases. However, the multi-module architecture of RAG introduces complex system-level security vulnerabilities. Guided by the RAG workflow, this paper analyzes the underlying vulnerability mechanisms and systematically categorizes core threat vectors such as data poisoning, adversarial attacks, and membership inference attacks. Based on this threat assessment, we construct a taxonomy of RAG defense technologies from a dual perspective encompassing both input and output stages. The input-side analysis reviews data protection mechanisms including dynamic access control, homomorphic encryption retrieval, and adversarial pre-filtering. The output-side examination summarizes advanced leakage prevention techniques such as federated learning isolation, differential privacy perturbation, and lightweight data sanitization. To establish a unified benchmark for future experimental design, we consolidate authoritative test datasets, security standards, and evaluation frameworks. To the best of our knowledge, this paper presents the first end-to-end survey dedicated to the security of RAG systems. Distinct from existing literature that isolates specific vulnerabilities, we systematically map the entire pipeline, providing a unified analysis of threat models, defense mechanisms, and evaluation benchmarks. By enabling deep insights into potential risks, this work seeks to foster the development of highly robust and trustworthy next-generation RAG systems.

Keywords: Retrieval-Augmented Generation, RAG, Security Vulnerabilities, Threat Vectors, Defense Technologies, Benchmark, Hallucination Mitigation, Large Language Models

100. ❌ Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks

Authors: Yiliang Song, Hongjun An, Jiangan Chen, Xuanchen Yan, Huan Song, Jiawei Shao, Xuelong Li Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21636v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

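Scoring rationale: The paper's core topic is contamination and score reliability in LLM benchmarks, which bears directly on LLM evaluation methodology, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not address the specific techniques behind the other keywords (MoE, SFT, RAG, and so on) or their application areas (such as bioinformatics), so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how data contamination undermines score reliability in LLM benchmarks and proposes an audit framework for assessing contamination sensitivity and score confidence. It finds heterogeneous above-baseline gains under noisy conditions, suggesting that benchmark scores may reflect contamination-related memory rather than genuine generalization.

Abstract (Chinese translation)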

公共基准测试日益主导着大型语言模型（LLM）的排名、选择与部署。我们将这种以基准为中心的制度框架称为“硅基官僚主义”与“人工智能应试教育”，并指出其建立在一个脆弱的假设之上：基准分数直接反映了真实的泛化能力。然而在实践中，此类分数可能将应试能力与原则性能力混为一谈，尤其当现代训练流程难以完全排除数据污染和语义泄漏时。为此，我们提出一种审计框架，用于分析LLM基准测试中的污染敏感性与分数置信度。通过采用路由-工作者架构，我们将洁净对照组与噪声实验组进行对比——在噪声条件下，基准问题在被传递至下游前会经历系统性删除、改写与扰动。对于一个真正洁净的基准，噪声条件不应持续优于洁净对照基线。然而在多个模型中，我们普遍观察到噪声条件下存在异质性但显著高于基线的性能增益，这表明基准相关线索可能被重组并激活污染关联记忆。这些结果意味着相似的基准分数可能承载着截然不同的置信水平。我们主张不应全盘否定基准测试，而应在基于基准的评估中补充对污染敏感性与分数置信度的显式审计。

Abstract

Public benchmarks increasingly govern how large language models (LLMs) are ranked, selected, and deployed. We frame this benchmark-centered regime as Silicon Bureaucracy and AI Test-Oriented Education, and argue that it rests on a fragile assumption: that benchmark scores directly reflect genuine generalization. In practice, however, such scores may conflate exam-oriented competence with principled capability, especially when contamination and semantic leakage are difficult to exclude from modern training pipelines. We therefore propose an audit framework for analyzing contamination sensitivity and score confidence in LLM benchmarks. Using a router-worker setup, we compare a clean-control condition with noisy conditions in which benchmark problems are systematically deleted, rewritten, and perturbed before being passed downstream. For a genuinely clean benchmark, noisy conditions should not consistently outperform the clean-control baseline. Yet across multiple models, we find widespread but heterogeneous above-baseline gains under noisy conditions, indicating that benchmark-related cues may be reassembled and can reactivate contamination-related memory. These results suggest that similar benchmark scores may carry substantially different levels of confidence. Rather than rejecting benchmarks altogether, we argue that benchmark-based evaluation should be supplemented with explicit audits of contamination sensitivity and score confidence.

Keywords: LLM benchmarks, contamination sensitivity, score confidence, benchmark evaluation, semantic leakage, generalization, audit framework, Silicon Bureaucracy
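
The core comparison in the audit, clean control versus noisy conditions, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' framework: the function names `accuracy` and `audit`, the `margin` parameter, the condition labels, and the per-item outcomes are all hypothetical, and the router-worker setup and any statistical testing used in the paper are omitted.

```python
from typing import Dict, List


def accuracy(outcomes: List[bool]) -> float:
    """Fraction of benchmark items answered correctly."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def audit(
    clean: List[bool],
    noisy: Dict[str, List[bool]],
    margin: float = 0.0,
) -> Dict[str, float]:
    """Return the gain over the clean-control baseline for every noisy
    condition whose gain exceeds `margin`.

    For a genuinely clean benchmark, this dictionary should be empty
    (or nearly so) across repeated runs.
    """
    baseline = accuracy(clean)
    gains: Dict[str, float] = {}
    for name, outcomes in noisy.items():
        gain = accuracy(outcomes) - baseline
        if gain > margin:
            gains[name] = gain
    return gains


# Made-up per-item correctness flags, for illustration only.
clean_control = [True, False, True, False]        # accuracy 0.50
noisy_conditions = {
    "deleted": [True, True, True, False],         # accuracy 0.75
    "rewritten": [True, False, False, False],     # accuracy 0.25
    "perturbed": [True, False, True, False],      # accuracy 0.50
}

flagged = audit(clean_control, noisy_conditions)
print(flagged)
```

With this toy data only the `deleted` condition is flagged. In practice one would presumably repeat the comparison over many items and models and set `margin` from an estimate of run-to-run variance, so that a single lucky run is not read as evidence of contamination.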

101. ❌ Efficient Zero-Shot AI-Generated Image Detection

Authors: Ryosuke Sonoda, Ramya Srinivasan Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21619v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

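Scoring rationale: The paper focuses on AI-generated image detection and proposes a training-free detection method based on sensitivity to frequency perturbations. All scoring keywords concern large language models (LLMs) or the underlying principles of deep learning techniques, whereas this paper addresses an image detection problem in computer vision and has no direct connection to keywords on text generation, language models, model training or fine-tuning, inference optimization, or AI agents. Although it deals with detecting AI-generated content, its subject is images rather than text, and it neither uses nor improves large-model techniques, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a training-free method for detecting AI-generated images based on sensitivity to frequency perturbations. It improves AUC by nearly 10% over the prior state of the art on the OpenFake benchmark while keeping computational cost low.

Abstract (Chinese translation)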

文本到图像模型的快速发展使得人工智能生成的图像日益逼真，这对生成内容的准确检测提出了重大挑战。基于训练的检测器通常对未见图像的泛化能力有限，而无训练方法虽具有更好的鲁棒性，却难以捕捉真实图像与合成图像之间的细微差异。本研究提出一种无训练的人工智能生成图像检测方法，通过测量表征对结构化频率扰动的敏感性，实现对细微篡改的检测。该方法计算量轻便，因为扰动生成仅需对输入图像进行单次傅里叶变换。因此，其推理速度比大多数无训练检测器快一到两个数量级。在具有挑战性的基准测试上进行的大量实验证明了本方法相对于最先进技术（SoTA）的有效性。特别是在OpenFake基准测试中，本方法将AUC（曲线下面积）较SoTA提升了近10%，同时保持了显著更低的计算成本。

Abstract

The rapid progress of text-to-image models has made AI-generated images increasingly realistic, posing significant challenges for accurate detection of generated content. While training-based detectors often suffer from limited generalization to unseen images, training-free approaches offer better robustness, yet struggle to capture subtle discrepancies between real and synthetic images. In this work, we propose a training-free AI-generated image detection method that measures representation sensitivity to structured frequency perturbations, enabling detection of minute manipulations. The proposed method is computationally lightweight, as perturbation generation requires only a single Fourier transform for an input image. As a result, it achieves one to two orders of magnitude faster inference than most training-free detectors. Extensive experiments on challenging benchmarks demonstrate the efficacy of our method over state-of-the-art (SoTA). In particular, on the OpenFake benchmark, our method improves AUC by nearly 10% compared to SoTA, while maintaining substantially lower computational cost.

Keywords: AI-generated image detection, training-free detection, frequency perturbations, Fourier transform, computational efficiency, zero-shot detection, representation sensitivity, OpenFake benchmark
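
The notion of "representation sensitivity to a frequency perturbation" can be made concrete with a one-dimensional toy sketch. Everything here is an assumption for illustration: the paper works on images with a pretrained representation, whereas this sketch uses a 1-D signal, a naive DFT written with the standard library, and a stand-in `identity_features` function in place of a real encoder. The abstract does not state the band, the scaling, or the decision rule, so none are implied here.

```python
import cmath
import math
from typing import Callable, List


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[float]:
    """Inverse DFT; the real part is returned because inputs are real."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def perturb_band(x: List[float], lo: int, hi: int, scale: float) -> List[float]:
    """Scale every bin whose frequency index min(k, n-k) lies in [lo, hi]."""
    n = len(x)
    spectrum = dft(x)  # a single forward transform per input
    for k in range(n):
        if lo <= min(k, n - k) <= hi:
            spectrum[k] *= scale
    return idft(spectrum)


def sensitivity(
    x: List[float],
    represent: Callable[[List[float]], List[float]],
    lo: int,
    hi: int,
    scale: float = 0.5,
) -> float:
    """Euclidean distance between representations before and after perturbation."""
    a = represent(x)
    b = represent(perturb_band(x, lo, hi, scale))
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def identity_features(x: List[float]) -> List[float]:
    """Hypothetical stand-in for a pretrained encoder."""
    return list(x)


n = 16
smooth = [math.cos(2 * math.pi * t / n) for t in range(n)]
detailed = [s + math.cos(2 * math.pi * 7 * t / n) for t, s in enumerate(smooth)]

print(sensitivity(smooth, identity_features, lo=6, hi=8))
print(sensitivity(detailed, identity_features, lo=6, hi=8))
```

The smooth signal has no energy in the perturbed band, so its score is numerically zero, while the signal with high-frequency content yields a clearly positive score. How such scores separate real from generated images is the paper's contribution and is not reproduced by this sketch.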

102. ❌ AgenticRec: End-to-End Tool-Integrated Policy Optimization for Ranking-Oriented Recommender Agents

Authors: Tianyi Li, Zixuan Wang, Guidong Lei, Xiaodong Li, Hui Li Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21613v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	15.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

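Scoring rationale: AgenticRec focuses on recommender agents built on large language models (LLM Agents) and addresses reasoning and ranking in recommender systems through tool integration (Tool Use) and policy optimization. The core relevant keywords are LLM Agents (highly relevant, the core of the paper), Tool Use (highly relevant, the core of the framework), Large Language Models (highly relevant, the underlying technology), and Chain of Thought and System 2 Thinking (highly relevant, as the work involves reasoning processes). Other keywords such as MoE, SLMs, Scaling Laws, and Pre-training are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the AgenticRec framework, which integrates recommendation tools and policy optimization to address the disconnect between reasoning and ranking feedback in LLM-based recommender agents, significantly improving recommendation performance.

Abstract (Chinese translation)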

基于大语言模型的推荐智能体为推荐系统提供了前景广阔的新范式。然而，现有推荐智能体通常存在中间推理与最终排序反馈脱节的问题，且难以捕捉细粒度用户偏好。为此，我们提出了AgenticRec——一个面向排序的智能体推荐框架，该框架能够在稀疏隐式反馈下优化完整的决策轨迹（包括中间推理、工具调用及最终排序列表生成）。我们的方法包含三项核心贡献。首先，我们设计了一套集成于ReAct循环的、面向推荐的专业工具集，以支持基于证据的推理。其次，我们提出了理论上无偏的列表级分组相对策略优化算法（List-Wise Group Relative Policy Optimization, list-wise GRPO）来最大化排序效用，确保为复杂的工具使用轨迹进行准确的信用分配。第三，我们引入了渐进式偏好细化方法（Progressive Preference Refinement, PPR）以解决细粒度偏好歧义。该方法通过从排序违规中挖掘困难负例，并应用双向偏好对齐，从而最小化成对排序误差的凸上界。基准测试实验证实，AgenticRec显著优于基线方法，验证了统一推理、工具使用与排序优化的必要性。

Abstract

Recommender agents built on Large Language Models offer a promising paradigm for recommendation. However, existing recommender agents typically suffer from a disconnect between intermediate reasoning and final ranking feedback, and are unable to capture fine-grained preferences. To address this, we present AgenticRec, a ranking-oriented agentic recommendation framework that optimizes the entire decision-making trajectory (including intermediate reasoning, tool invocation, and final ranking list generation) under sparse implicit feedback. Our approach makes three key contributions. First, we design a suite of recommendation-specific tools integrated into a ReAct loop to support evidence-grounded reasoning. Second, we propose theoretically unbiased List-Wise Group Relative Policy Optimization (list-wise GRPO) to maximize ranking utility, ensuring accurate credit assignment for complex tool-use trajectories. Third, we introduce Progressive Preference Refinement (PPR) to resolve fine-grained preference ambiguities. By mining hard negatives from ranking violations and applying bidirectional preference alignment, PPR minimizes the convex upper bound of pairwise ranking errors. Experiments on benchmarks confirm that AgenticRec significantly outperforms baselines, validating the necessity of unifying reasoning, tool use, and ranking optimization.

Keywords: Recommender Agents, Large Language Models, Tool Integration, Policy Optimization, Ranking, Reasoning, Preference Refinement, Agentic Workflow
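
The group-relative part of GRPO with a ranking reward can be sketched in a few lines. This is a generic illustration, not the paper's list-wise GRPO: the abstract does not give the exact reward or the unbiasedness correction, so NDCG is used here as an assumed ranking utility, and the function names and the toy group of trajectories are hypothetical.

```python
import math
from typing import List


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain of relevance labels in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances: List[float]) -> float:
    """DCG normalized by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of sampled trajectories.

    Each trajectory is scored relative to its siblings, so no learned
    value function is needed for the baseline.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Each inner list holds the relevance labels of the items in the order
# one sampled trajectory ranked them (made-up data, one relevant item).
group = [
    [1.0, 0.0, 0.0],  # relevant item ranked first
    [0.0, 1.0, 0.0],  # ranked second
    [0.0, 0.0, 1.0],  # ranked third
]

rewards = [ndcg(ranking) for ranking in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

The trajectory that ranks the relevant item first receives a positive advantage and the one that ranks it last a negative one. In a full training loop that single scalar would presumably be shared by every step of the trajectory, including reasoning and tool calls, which is where the credit-assignment question raised in the abstract comes in.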

103. ❌ Rule-State Inference (RSI): A Bayesian Framework for Compliance Monitoring in Rule-Governed Domains

Author: Abdou-Raouf Atarmla Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21610v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

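Scoring rationale: The paper proposes RSI, a Bayesian framework for compliance monitoring in rule-governed domains such as taxation and regulation. Its core is Bayesian inference of latent rule activation states, compliance rates, and parametric drift from partial and noisy observations. The content is unrelated to every scoring keyword, all of which concern large models, deep learning techniques, and their applications: the paper covers no large-model, deep-learning, or AI for Science topic and instead applies classical probabilistic machine learning to a specific domain.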

!!! tip deepseek-chat TL;DR

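The paper proposes Rule-State Inference (RSI), a Bayesian framework that, in rule-governed domains such as tax compliance, infers latent rule activation states, compliance rates, and parametric drift from partial and noisy observations. It validates the framework’s theoretical guarantees and efficiency on the Togolese fiscal system.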

摘要翻译

现有用于合规性监测的机器学习框架——马尔可夫逻辑网络、概率软逻辑、监督模型——共享一个基本范式：它们将观测数据视为真实情况，并试图从中推导规则。这一假设在诸如税务或监管合规等受规则约束的领域并不成立，因为这些领域的权威规则是预先已知的，真正的挑战在于从部分且有噪声的观测中推断规则激活、合规率及参数漂移的潜在状态。
我们提出了规则状态推断（Rule-State Inference, RSI），这是一个贝叶斯框架，它通过将监管规则编码为结构化先验，并将合规性监测转化为对潜在规则状态空间 S = {(a_i, c_i, delta_i)} 的后验推断，从而颠覆了上述范式。其中 a_i 捕捉规则激活状态，c_i 建模合规率，delta_i 量化参数漂移。我们证明了三个理论保证：（T1）RSI 通过先验比率修正，能以 O(1) 时间吸收监管规则变化，且独立于数据集规模；（T2）后验分布具有伯恩斯坦-冯·米塞斯一致性，随着观测数据的积累收敛于真实的规则状态；（T3）平均场变分推断能单调地最大化证据下界（Evidence Lower BOund, ELBO）。
我们在多哥财政系统上实例化了 RSI，并发布了 RSI-Togo-Fiscal-Synthetic v1.0 基准数据集，该数据集基于真实的多哥税务总局（OTR）监管规则（2022-2025年），包含 2000 家合成企业。在没有任何标注训练数据的情况下，RSI 实现了 F1=0.519 和 AUC=0.599，同时吸收监管规则变化的时间低于 1 毫秒，而完整模型重新训练需要 683-1082 毫秒——速度提升至少 600 倍。

Abstract

Existing machine learning frameworks for compliance monitoring – Markov Logic Networks, Probabilistic Soft Logic, supervised models – share a fundamental paradigm: they treat observed data as ground truth and attempt to approximate rules from it. This assumption breaks down in rule-governed domains such as taxation or regulatory compliance, where authoritative rules are known a priori and the true challenge is to infer the latent state of rule activation, compliance, and parametric drift from partial and noisy observations. We propose Rule-State Inference (RSI), a Bayesian framework that inverts this paradigm by encoding regulatory rules as structured priors and casting compliance monitoring as posterior inference over a latent rule-state space S = {(a_i, c_i, delta_i)}, where a_i captures rule activation, c_i models the compliance rate, and delta_i quantifies parametric drift. We prove three theoretical guarantees: (T1) RSI absorbs regulatory changes in O(1) time via a prior ratio correction, independently of dataset size; (T2) the posterior is Bernstein-von Mises consistent, converging to the true rule state as observations accumulate; (T3) mean-field variational inference monotonically maximizes the Evidence Lower BOund (ELBO). We instantiate RSI on the Togolese fiscal system and introduce RSI-Togo-Fiscal-Synthetic v1.0, a benchmark of 2,000 synthetic enterprises grounded in real OTR regulatory rules (2022-2025). Without any labeled training data, RSI achieves F1=0.519 and AUC=0.599, while absorbing regulatory changes in under 1ms versus 683-1082ms for full model retraining – at least a 600x speedup.

Keywords: Rule-State Inference, Bayesian framework, compliance monitoring, rule-governed domains, latent state inference, regulatory rules, variational inference, fiscal system
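The O(1) claim in (T1) can be illustrated with a much simpler model than the paper’s. The sketch below is not RSI itself: it assumes a single compliance rate with a conjugate Beta prior, and the names `Beta`, `update`, and `swap_prior` are hypothetical. It shows why multiplying a posterior by the ratio of the new prior to the old prior needs no second pass over the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Beta(alpha, beta) distribution over a compliance rate in [0, 1]."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def update(dist: Beta, compliant: int, non_compliant: int) -> Beta:
    """Conjugate Bayesian update from observed compliance counts."""
    return Beta(dist.alpha + compliant, dist.beta + non_compliant)


def swap_prior(posterior: Beta, old_prior: Beta, new_prior: Beta) -> Beta:
    """Prior ratio correction: multiply the posterior by new_prior / old_prior.

    For Beta densities the ratio is proportional to
    c**(a1 - a0) * (1 - c)**(b1 - b0), so the corrected posterior is again a
    Beta and the cost does not depend on how many observations were absorbed.
    """
    return Beta(
        posterior.alpha + new_prior.alpha - old_prior.alpha,
        posterior.beta + new_prior.beta - old_prior.beta,
    )


# Hypothetical priors: the rule before and after a regulatory change.
old_prior = Beta(2.0, 2.0)
new_prior = Beta(8.0, 2.0)

posterior = update(old_prior, compliant=30, non_compliant=10)
corrected = swap_prior(posterior, old_prior, new_prior)

# Full re-fit under the new prior, shown only for comparison.
refit = update(new_prior, compliant=30, non_compliant=10)
```

In this toy setting `corrected` and `refit` coincide, which is the property the prior ratio correction relies on. The paper’s latent state also includes rule activation and parametric drift, which this sketch leaves out.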

104. ❌ DiT-Flow: Speech Enhancement Robust to Multiple Distortions based on Flow Matching in Latent Space and Diffusion Transformers

Authors: Tianyu Cao, Helin Wang, Ari Frummer, Yuval Sieradzki, Adi Arbel, Laureano Moro Velazquez, Jesus Villalba, Oren Gal, Thomas Thebaud, Najim Dehak Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21608v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

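Scoring rationale: DiT-Flow targets speech enhancement with a generative approach based on flow matching and Diffusion Transformers. Most keywords (LLMs, instruction tuning, RAG, and so on) are unrelated because they target large language models and text generation. The paper does, however, explicitly use LoRA (parameter-efficient fine-tuning) and an MoE (Mixture of Experts) framework, so these two keywords receive high relevance: 10 for LoRA (a core method) and 8 for MoE (the integrating framework). No other keyword is covered.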

!!! tip deepseek-chat TL;DR

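The paper proposes DiT-Flow, a speech enhancement framework based on flow matching and Diffusion Transformers. By integrating LoRA with MoE it achieves parameter-efficient training and outperforms existing generative models under multiple distortion conditions.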

摘要翻译

近期，扩散模型与流匹配等生成模型在音频任务中展现出强大性能。然而，语音增强模型通常在有限数据集上训练，并在狭窄条件下评估，限制了其实际适用性。为此，我们提出DiT-Flow——一种基于流匹配的语音增强框架，该框架构建于潜在扩散变换器（Diffusion Transformer，DiT）主干网络之上，并针对包括噪声、混响和压缩在内的多种失真类型进行鲁棒性训练。DiT-Flow在紧凑的变分自编码器（Variational Auto-Encoders，VAEs）衍生的潜在特征上运行。我们在StillSonicSet数据集上验证了所提方法，该数据集是一个合成但声学逼真的数据集，由LibriSpeech、FSD50K、FMA以及90个Matterport3D场景构成。实验表明，DiT-Flow在多项指标上持续优于当前最先进的生成式语音增强模型，证明了流匹配在多条件语音增强中的有效性。尽管当前研究不断致力于提升合成数据的真实感，但语音增强领域仍存在一个持续瓶颈：训练条件与部署条件之间不可避免的失配。通过将低秩自适应（LoRA）与混合专家（MoE）框架相结合，我们为DiT-Flow实现了参数高效的高性能训练，使其对多种失真具有鲁棒性，仅使用总参数量的4.9%即在五种未见失真类型上取得了更优性能。

Abstract

Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact variational auto-encoders (VAEs)-derived latent features. We validated our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with the MoE framework, we achieve both parameter-efficient and high-performance training for DiT-Flow robust to multiple distortions with using 4.9% percentage of the total parameters to obtain a better performance on five unseen distortions.

Keywords: speech enhancement, flow matching, diffusion transformer, latent space, LoRA, Mixture of Experts, robustness, generative models
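As a rough illustration of the flow matching objective named in the abstract, the sketch below uses the common straight-line formulation: interpolate between a source latent and a clean latent, and regress a model onto the constant velocity between them. The abstract does not state which probability path, conditioning, or sampler DiT-Flow uses, so everything here, including the function names and the toy 4-dimensional latents, is an assumption.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at (x_t, t)."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]  # stand-in source latent
x1 = [0.5, -1.0, 2.0, 0.0]                    # stand-in clean latent


def oracle(x_t, t):
    """A perfect velocity model for this single (x0, x1) pair."""
    return target_velocity(x0, x1)


loss = flow_matching_loss(oracle, x0, x1, t=0.3)
x_end = euler_sample(oracle, x0)
```

With the oracle velocity the loss is zero and Euler integration lands on `x1`. A trained network would replace `oracle` and be fitted by minimizing the same loss over many pairs and random values of `t`.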

105. ❌ INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation

Authors: Alexandra Bazarova, Andrei Volodichev, Daria Kotova, Alexey Zaytsev Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21607v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

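Scoring rationale: The paper studies uncertainty estimation in RAG systems and bears directly on ‘Retrieval-Augmented Generation’ (the core topic, 15 points), ‘Hallucination Mitigation’ (it addresses hallucination, 10 points), and ‘Mechanistic Interpretability’ (a mechanistic explanation based on induction heads, 10 points). It uses LLMs as base models, so ‘Large Language Models’ is relevant (10 points). Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, and AI for Science, are not covered and receive 0.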

!!! tip deepseek-chat TL;DR

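The paper shows that entropy-based uncertainty estimation in RAG systems fails because of a mechanistic conflict between induction heads and entropy neurons. It proposes INTRYGUE, an entropy gating method driven by induction-head activation patterns, which matches or outperforms existing baselines across multiple benchmarks.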

摘要翻译

尽管检索增强生成（RAG）显著提升了大型语言模型（LLMs）的事实可靠性，但它并未完全消除幻觉现象，因此稳健的不确定性量化（UQ）仍然至关重要。本文揭示，在RAG场景中，基于标准熵的UQ方法常因一种机制性悖论而失效。模型内部存在一种固有的语境利用“拉锯战”：归纳头（induction heads）通过复制正确答案来促进基于事实的响应，但同时会附带激活先前建立的“熵神经元”。这种相互作用导致预测熵被夸大，使得模型在输出准确结果时反而发出错误的低确定性信号。为解决此问题，我们提出了INTRYGUE（面向不确定性估计的归纳感知熵门控），这是一种基于机制的方法，通过根据归纳头的激活模式对预测熵进行门控处理。在四个RAG基准测试和六个开源LLM（参数量4B至13B）上的评估表明，INTRYGUE始终匹配或超越一系列广泛的UQ基线方法。我们的研究证明，在RAG系统中结合预测不确定性与可解释的语境利用内部信号，能够有效提升幻觉检测性能。

Abstract

While retrieval-augmented generation (RAG) significantly improves the factual reliability of LLMs, it does not eliminate hallucinations, so robust uncertainty quantification (UQ) remains essential. In this paper, we reveal that standard entropy-based UQ methods often fail in RAG settings due to a mechanistic paradox. An internal “tug-of-war” inherent to context utilization appears: while induction heads promote grounded responses by copying the correct answer, they collaterally trigger the previously established “entropy neurons”. This interaction inflates predictive entropy, causing the model to signal false uncertainty on accurate outputs. To address this, we propose INTRYGUE (Induction-Aware Entropy Gating for Uncertainty Estimation), a mechanistically grounded method that gates predictive entropy based on the activation patterns of induction heads. Evaluated across four RAG benchmarks and six open-source LLMs (4B to 13B parameters), INTRYGUE consistently matches or outperforms a wide range of UQ baselines. Our findings demonstrate that hallucination detection in RAG benefits from combining predictive uncertainty with interpretable, internal signals of context utilization.

Keywords: Retrieval-Augmented Generation, Uncertainty Estimation, Hallucination Mitigation, Induction Heads, Mechanistic Interpretability, Entropy Gating, Large Language Models, RAG Benchmarks
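The gating idea can be made concrete with a toy calculation. The abstract does not give INTRYGUE’s gating function, so the sketch below assumes the simplest possible form: each token’s predictive entropy is multiplied by one minus a hypothetical induction score in [0, 1]. The names `gated_uncertainty` and `induction_scores` are illustrative only.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gated_uncertainty(token_probs, induction_scores):
    """Mean per-token entropy, down-weighted where induction activity is high.

    induction_scores[t] in [0, 1] is a hypothetical summary of how strongly
    induction heads are copying from the retrieved context at token t.
    """
    if len(token_probs) != len(induction_scores):
        raise ValueError("need one induction score per generated token")
    gated = [(1.0 - s) * entropy(p) for p, s in zip(token_probs, induction_scores)]
    return sum(gated) / len(gated)


# Two generated tokens: the first is copied from context (high induction
# activity) yet has inflated entropy; the second is generated freely.
token_probs = [[0.5, 0.5], [0.9, 0.1]]
plain = gated_uncertainty(token_probs, [0.0, 0.0])  # no gating
gated = gated_uncertainty(token_probs, [1.0, 0.0])  # copied token discounted
```

Here `gated` is lower than `plain` because the entropy of the copied token no longer counts toward the uncertainty score, which mirrors the failure mode the abstract describes: inflated entropy on accurate, context-grounded outputs.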

106. ❌ Riemannian Geometry Speaks Louder Than Words: From Graph Foundation Model to Next-Generation Graph Intelligence

Authors: Philip S. Yu, Li Sun Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21601v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

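Scoring rationale: The paper proposes the Riemannian Foundation Model (RFM) as a new paradigm for Graph Foundation Models (GFMs). It is highly relevant to LLMs and Foundation Models (8 points) because it aims to build a general-purpose graph foundation model analogous to LLMs. It touches on multi-domain pretraining and adaptation, giving some relevance to Pre-training/Domain Adaptation (5 points). It mentions RFM agents, which relates to LLM Agents (5 points). It emphasizes intrinsic geometry and interpretability, which relates to Explainable AI (5 points). It mentions graph applications in the life sciences and other fields, which relates to AI for Science (5 points). Other keywords, such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Quantization, are not mentioned in the abstract and receive 0.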

!!! tip deepseek-chat TL;DR

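Addressing the limitations of existing Graph Neural Networks (GNNs) in memory retention and interpretability, and the difficulty LLMs have in handling graph structure directly, the paper proposes a graph foundation model based on Riemannian geometry (RFM). It aims at general-purpose graph modeling through intrinsic geometry and, ultimately, at RFM agents that advance next-generation graph intelligence.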

摘要翻译

图作为一种描述对象间复杂关系的自然表达形式，在通信、交通、社会计算与生命科学等领域发挥着关键作用。当前学界普遍认同图基础模型对于推动图学习发展至关重要，然而在如何构建类似于大语言模型的强大通用图基础模型这一问题上仍存在显著分歧。图神经网络在面对多领域预训练与适应任务时，在记忆保持与可解释性方面存在局限。图序列化的挑战阻碍了大语言模型的直接应用，因为文字难以捕捉图数据固有的结构复杂性与多样性。相比之下，黎曼几何为结构建模提供了优雅的数学框架，同时保持与图语义学习（甚至与大语言模型）的兼容性。本文主张，对于图数据而言，黎曼几何比文字更具表现力，并系统阐述了图基础模型的基础原理。通过黎曼几何的重新构想，我们提出一个前瞻性理念——黎曼基础模型——该模型为捕捉复杂结构模式与发现跨领域通用规律开辟了新路径。黎曼基础模型强调图的内蕴几何特性，具备结构推理与生成的内生能力，超越了单纯表示空间的转换。基于此，我们规划了渐进式发展路线：首先通过内蕴几何实现通用结构理解，继而以黎曼几何引擎重构大语言模型，最终实现通用图建模及更广泛的应用。因此，黎曼基础模型推动着从设计图模型到利用黎曼智能体解决图结构应用的范式转变，从而开启新一代图智能的新纪元。

Abstract

Graphs provide a natural description of the complex relationships among objects, and play a pivotal role in communications, transportation, social computing, the life sciences, etc. Currently, there is strong agreement that Graph Foundation Models (GFMs) are essential for advancing graph learning, yet considerable disagreement persists on how to build a powerful, general-purpose GFM analogous to Large Language Models (LLMs). Graph Neural Networks (GNNs) exhibit limitations in memory retention and principled interpretability when confronted with multi-domain pretraining and adaptation. The challenge of graph serialization hinders the direct application of LLMs, as the words struggle to capture the structural complexity and diversity inherent in graphs. In contrast, Riemannian geometry offers an elegant mathematical framework for modeling structures, while remaining compatible with graph semantic learning, even with LLMs. In this paper, we argue that, for graphs, Riemannian geometry speaks louder than words, and lay out the foundational principles for GFM. Reimagining with Riemannian geometry, we introduce a blue sky idea-Riemannian Foundation Model (RFM)-that opens a new pathway for capturing complex structural patterns and uncovering cross-domain generalities. RFM emphasizes intrinsic graph geometry and embodies endogenous capacities for structural inference and generation, moving beyond mere representation-space switching. Accordingly, we outline a progressive agenda that begins with universal structural understanding through intrinsic geometry, and then rebuilds LLM with a Riemannian engine for general-purpose graph modeling and beyond. Thus, RFM enables a paradigm shift from designing graph models to solving graph-structured applications with RFM agents, unlocking the next-generation graph intelligence.

Keywords: Graph Foundation Models, Riemannian geometry, Large Language Models, Graph Neural Networks, Structural inference, Cross-domain generality, RFM agents, Graph intelligence

107. ❌ A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment

Authors: Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei, Arjun Masurkar, Kevin M. Spiegler, Philip Kuball, Stefania C. Bray, Megan Bernath, Deanna R. Willis, Jiang Bian, Lei Xing, Eric Topol, Kyunghyun Cho, Yu Huang, Ruogu Fang, Narges Razavian, James Zou Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21597v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

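Scoring rationale: The paper presents Cerebra, a multi-agent AI system for multimodal dementia characterization and risk assessment whose core is the coordination of specialized agents for EHR, clinical notes, and medical imaging analysis. Highly relevant keywords: ‘LLM Agents/Autonomous Agents/Agentic Workflow’ and ‘Multi-agent Systems/Agent Coordination’ (10 points), because the paper centers on a multi-agent system; ‘AI for Science/Bioinformatics/Cheminformatics’ (10 points), as a biomedical AI application. Moderately relevant: ‘Large Language Models/LLMs/Foundation Models’ (8 points), since the paper mentions multimodal foundation models and language model baselines; ‘Pre-training/Continual Pre-training/Domain Adaptation’ and ‘Post-training/Supervised Fine-tuning/SFT’ (5 points), which relate to model training; ‘Mechanistic Interpretability/Explainable AI’ (5 points), since interpretability is mentioned. The remaining keywords (MoE, quantization, inference acceleration, and so on) are unrelated to the paper’s technical details and receive 0.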

!!! tip deepseek-chat TL;DR

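The study develops Cerebra, a multi-agent AI system for multimodal dementia characterization and risk assessment. On a large multi-institutional dataset it significantly outperforms existing single-modality models and language model baselines, and it improves clinicians’ diagnostic accuracy.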

摘要翻译

现代临床实践日益依赖于对异构、动态且不完整的患者数据进行推理。尽管多模态基础模型的最新进展提升了各类临床任务的性能，但现有模型大多仍为静态、不透明且难以与真实临床工作流契合。我们提出Cerebra——一个交互式多智能体AI团队，它能协调针对电子健康记录（EHR）、临床文本和医学影像分析的专用智能体。这些输出被整合至面向临床医生的仪表板，该仪表板将可视化分析与对话界面相结合，使临床医生能在诊疗现场质询预测结果并评估风险背景。Cerebra通过操作结构化表征支持隐私保护部署，并在数据模态不完整时保持稳健性。我们使用涵盖四个独立医疗系统、涉及300万患者的大规模多机构数据集对Cerebra进行评估。其性能持续优于当前最先进的单模态模型与大型多模态语言模型基线。在痴呆症风险预测中，其AUROC最高达0.80，而最强单模态模型为0.74，语言模型基线为0.68。在痴呆症诊断任务中，其AUROC达到0.86；在生存预测中，C指数达到0.81。在由经验丰富的医师参与的阅片研究中，Cerebra显著提升了专家表现，将前瞻性痴呆风险评估的准确率提高了17.5个百分点。这些结果证明了Cerebra在临床护理中实现可解释、稳健决策支持的潜力。

Abstract

Modern clinical practice increasingly depends on reasoning over heterogeneous, evolving, and incomplete patient data. Although recent advances in multimodal foundation models have improved performance on various clinical tasks, most existing models remain static, opaque, and poorly aligned with real-world clinical workflows. We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis. These outputs are synthesized into a clinician-facing dashboard that combines visual analytics with a conversational interface, enabling clinicians to interrogate predictions and contextualize risk at the point of care. Cerebra supports privacy-preserving deployment by operating on structured representations and remains robust when modalities are incomplete. We evaluated Cerebra using a massive multi-institutional dataset spanning 3 million patients from four independent healthcare systems. Cerebra consistently outperformed both state-of-the-art single-modality models and large multimodal language model baselines. In dementia risk prediction, it achieved AUROCs up to 0.80, compared with 0.74 for the strongest single-modality model and 0.68 for language model baselines. For dementia diagnosis, it achieved an AUROC of 0.86, and for survival prediction, a C-index of 0.81. In a reader study with experienced physicians, Cerebra significantly improved expert performance, increasing accuracy by 17.5 percentage points in prospective dementia risk estimation. These results demonstrate Cerebra’s potential for interpretable, robust decision support in clinical care.

Keywords: multimodal AI, multi-agent system, dementia risk assessment, clinical decision support, EHR analysis, medical imaging, interpretable AI, healthcare AI

108. ❌ Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications

Authors: Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang, Dusit Niyato Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21594v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

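Scoring rationale: The paper studies a multi-agent deep reinforcement learning (MADRL) algorithm for UAV-assisted wireless networks. It focuses on jointly optimizing trajectory planning, network formation, and transmission control, and proposes a spatio-temporal attention prediction method to recover lost information. The content is unrelated to the vast majority of keywords, which concern large-model principles, training methods, inference optimization, alignment, and so on. The only relevant keyword is ‘Multi-agent Systems OR Agent Coordination’: the paper explicitly studies cooperation among multiple UAVs (agents), which is one of its core topics, so it receives 10 points. No other keyword is mentioned in the title or abstract, and none of the related technical concepts appear.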

!!! tip deepseek-chat TL;DR

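The paper proposes a multi-agent deep reinforcement learning algorithm that combines a delay-penalized reward with spatio-temporal attention prediction to optimize trajectory planning, network formation, and transmission control in UAV-assisted wireless networks. In simulations it cuts information delay by over 50% and raises throughput by 75% relative to conventional MADRL.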

摘要翻译

本文采用多架无人机通过中继通信，将地面用户的数据加速传输至远程基站。无人机间的间歇性信息交换通常导致获取完整系统状态的延迟，并阻碍其有效协作。为最大化总吞吐量，我们首先提出一种时延容忍的多智能体深度强化学习算法，该算法通过引入时延惩罚奖励机制激励无人机间信息共享，同时联合优化无人机的轨迹规划、网络构型与传输控制策略。此外，考虑到不可靠信道条件造成的信息丢失，我们进一步提出一种基于时空注意力的预测方法，以恢复丢失信息并增强各无人机对网络状态的感知能力。这两项设计旨在提升通信受限的无人机辅助无线网络的容量。仿真结果表明，与传统多智能体深度强化学习方法相比，新方法可实现信息延迟降低50%以上，吞吐量提升75%。有趣的是，研究显示提升无人机信息共享并不会牺牲网络容量，反而能同时显著改善学习性能与吞吐量。该方法还有效降低了对无人机信息交换的需求，从而推动多智能体深度强化学习在无人机辅助无线网络中的实际部署。

Abstract

In this paper, we employ multiple UAVs to accelerate data transmissions from ground users (GUs) to a remote base station (BS) via the UAVs’ relay communications. The UAVs’ intermittent information exchanges typically result in delays in acquiring the complete system state and hinder their effective collaboration. To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing the UAVs’ trajectory planning, network formation, and transmission control strategies. Additionally, considering information loss due to unreliable channel conditions, we further propose a spatio-temporal attention based prediction approach to recover the lost information and enhance each UAV’s awareness of the network state. These two designs are envisioned to enhance the network capacity in UAV-assisted wireless networks with limited communications. The simulation results reveal that our new approach achieves over 50% reduction in information delay and 75% throughput gain compared to the conventional MADRL. Interestingly, it is shown that improving the UAVs’ information sharing will not sacrifice the network capacity. Instead, it significantly improves the learning performance and throughput simultaneously. It is also effective in reducing the need for UAVs’ information exchange and thus fostering practical deployment of MADRL in UAV-assisted wireless networks.

Keywords: multi-agent deep reinforcement learning, UAV-assisted wireless networks, trajectory planning, spatio-temporal attention, throughput maximization, delay-tolerant algorithm, information sharing, network capacity
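A delay-penalized reward can be sketched in a few lines. The abstract does not give the paper’s reward definition, so the linear form, the `penalty` coefficient, and the function name below are all assumptions made for illustration: the agent is rewarded for throughput and charged for how stale its view of its peers is.

```python
def delay_penalized_reward(throughput, info_delays, penalty=0.1):
    """Throughput minus a penalty proportional to mean peer-state staleness.

    info_delays holds, for each peer UAV, the number of time slots since its
    state was last received (hypothetical units).
    """
    if not info_delays:
        return throughput
    mean_delay = sum(info_delays) / len(info_delays)
    return throughput - penalty * mean_delay


fresh = delay_penalized_reward(10.0, [0, 0, 0])  # peers heard from this slot
stale = delay_penalized_reward(10.0, [4, 4, 4])  # peers last heard 4 slots ago
```

Under this shaping an agent with identical throughput earns less when its peer information is stale, which is one way a reward could encourage the information sharing the abstract describes.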

109. ❌ PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection

Authors: Hyoseok Park, Yeonsang Park Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21576v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

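Scoring rationale: PRISM addresses the memory bottleneck in long-context LLM inference by using photonic computing to select KV blocks at O(1) complexity. Highly relevant keywords: ‘Large Language Models OR LLMs OR Foundation Models’ (the object of study is LLM inference), ‘Context Window Extension OR Long Context LLMs’ (the core problem is long context), ‘KV Cache Compression OR Linear Attention OR FlashAttention’ (it directly targets the memory bottleneck of scanning the KV cache), and ‘Speculative Decoding OR Inference Acceleration’ (the core goal is faster inference). Other keywords, such as MoE, SFT, RAG, and quantization, are unrelated because the paper focuses on hardware acceleration and memory optimization rather than model architecture, training methods, or specific application domains.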

!!! tip deepseek-chat TL;DR

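PRISM proposes a photonic hardware acceleration method whose O(1) block-selection mechanism addresses the O(n) memory bottleneck of scanning the KV cache in long-context LLM inference. At 64K context it achieves a 16x traffic reduction, and at practical context lengths (n >= 4K) it reports a four-order-of-magnitude energy advantage over GPU baselines.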

摘要翻译

长上下文大语言模型推理的瓶颈并非计算能力，而是源于每个解码步骤中扫描键值缓存所带来的O(n)内存带宽成本——这是一道仅靠算力扩展无法突破的壁垒。近期光子加速器在稠密注意力计算方面已展现出卓越的吞吐性能；然而，当应用于长上下文场景时，这些方法仍继承了与电子注意力相同的O(n)内存扩展特性。我们观察到，真正的关键杠杆点在于粗粒度块选择步骤：这是一个内存受限的相似性搜索任务，用于确定需要获取哪些键值块。我们首次发现，该任务在结构上契合光子广播加权范式——查询通过被动分束广播至所有候选块，特征签名具有准静态特性（与电光微环调制编程相匹配），且仅需排序信息（将精度要求放宽至4-6比特）。至关重要的是，光子优势随上下文长度增长而放大：当N增加时，电子扫描成本线性上升，而光子评估始终保持O(1)复杂度。我们将这一洞见实例化为PRISM（基于微环加权的内积相似性光子排序系统），这是一种薄膜铌酸锂相似性计算引擎。在Qwen2.5-7B模型上进行的硬件损伤型“大海捞针”评估表明，在k=32的设置下，从4K到64K词元的范围内均保持100%准确率，并在64K上下文长度下实现16倍的数据传输削减。在实际上下文长度（n ≥ 4K）条件下，PRISM相比GPU基线实现了四个数量级的能效优势。

Abstract

Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step – a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm – the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4-6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n >= 4K).

Keywords: Long-context LLM inference, KV cache, Memory bottleneck, Photonic accelerator, O(1) complexity, Block selection, Energy efficiency, Inference acceleration
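
The coarse block-selection step that the abstract identifies as the leverage point can be illustrated in software. The following is a minimal sketch, not the paper's photonic implementation: the mean-key signature, the uniform 5-bit quantizer, the block size of 128, and all function names are assumptions made for illustration. Only the 4-6 bit precision range and k=32 come from the abstract.

```python
import numpy as np

def quantize(x, bits=5):
    """Uniformly quantize x to signed integers of the given bit width."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) or 1.0
    return np.round(x / scale * levels).astype(np.int32)

def block_signatures(keys, block_size):
    """Summarize each block of keys by its mean key vector (assumed signature)."""
    n, d = keys.shape
    n_blocks = n // block_size
    trimmed = keys[: n_blocks * block_size]
    return trimmed.reshape(n_blocks, block_size, d).mean(axis=1)

def select_blocks(query, signatures, k, bits=5):
    """Return the indices of the k blocks with the highest low-precision scores."""
    scores = quantize(signatures, bits) @ quantize(query, bits)
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
block_size, k = 128, 32              # block_size is an assumption; k follows the abstract
keys = rng.standard_normal((65536, 64))
query = rng.standard_normal(64)

# Plant a "needle": make block 300 strongly aligned with the query.
needle = 300
keys[needle * block_size:(needle + 1) * block_size] += query

signatures = block_signatures(keys, block_size)
chosen = select_blocks(query, signatures, k)
traffic_reduction = signatures.shape[0] / k   # 512 blocks / 32 fetched = 16.0
```

Only rank order matters here, so the coarse quantization does not prevent the needle block from being selected. Under the assumed block size, fetching 32 of 512 blocks reads one sixteenth of the KV cache.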

110. ❌ Adaptive Robust Estimator for Multi-Agent Reinforcement Learning

Authors: Zhongyi Li, Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang, Hui Yang, Tao Ren, Jinyang Jiang, Yijie Peng, Yikun Ban, Fuzhen Zhuang Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21574v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

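Scoring rationale: The paper's core topic is multi-agent collaboration for strengthening LLM reasoning. It is highly relevant (10 points) to "Large Language Models", "Chain of Thought", "System 2 Thinking", "Self-Correction", "LLM Agents", and "Multi-agent Systems", because these keywords map directly onto the paper's LLM multi-agent collaborative reasoning framework, its structured answer-critique-rewrite pipeline, and its self-correction mechanism. Other keywords such as MoE, quantization, and RAG are not addressed in the paper and receive 0 points.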

!!! tip deepseek-chat TL;DR

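The paper proposes a robust multi-agent reinforcement learning framework (DACR+ARE). Through a structured three-stage collaborative reasoning pipeline and an adaptive robust estimator, it addresses difficult credit assignment and the training instability caused by reward noise in multi-agent collaboration, and shows stronger robustness and stability on mathematical reasoning and embodied intelligence benchmarks.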

Abstract (Chinese translation)

多智能体协作已成为增强大语言模型推理能力的重要范式，但其存在交互层面的模糊性，导致生成、批判与修订环节界限不清，使得跨智能体的贡献分配难以实现。此外，该场景下的策略优化易受重尾噪声奖励的影响，可能扭曲优势估计并引发训练不稳定甚至发散。为同时解决这两个问题，我们提出一个面向协作推理的鲁棒多智能体强化学习框架，该框架包含双智能体“回答-批判-重写”（Dual-Agent Answer-Critique-Rewrite, DACR）与自适应鲁棒估计器（Adaptive Robust Estimator, ARE）两个核心组件。DACR将推理过程解构为结构化的三阶段流程：回答、批判与重写，同时显式量化每个智能体对其协作伙伴性能的边际贡献。ARE在多智能体策略优化过程中对批次经验均值进行鲁棒估计。在数学推理与具身智能基准测试中，即使在噪声奖励环境下，我们的方法在同构与异构设置下均持续优于基线模型。这些结果表明，该方法对奖励噪声具有更强的鲁棒性，训练动态更为稳定，能有效避免因噪声奖励信号导致的优化失败。

摘要 (Abstract)

Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs generation, critique, and revision, making credit assignment across agents difficult. Moreover, policy optimization in this setting is vulnerable to heavy-tailed and noisy rewards, which can bias advantage estimation and trigger unstable or even divergent training. To address both issues, we propose a robust multi-agent reinforcement learning framework for collaborative reasoning, consisting of two components: Dual-Agent Answer-Critique-Rewrite (DACR) and an Adaptive Robust Estimator (ARE). DACR decomposes reasoning into a structured three-stage pipeline: answer, critique, and rewrite, while enabling explicit attribution of each agent’s marginal contribution to its partner’s performance. ARE provides robust estimation of batch experience means during multi-agent policy optimization. Across mathematical reasoning and embodied intelligence benchmarks, even under noisy rewards, our method consistently outperforms the baseline in both homogeneous and heterogeneous settings. These results indicate stronger robustness to reward noise and more stable training dynamics, effectively preventing optimization failures caused by noisy reward signals.

Keywords: Multi-agent Reinforcement Learning, Large Language Models, Collaborative Reasoning, Robust Estimation, Credit Assignment, Noisy Rewards, Policy Optimization, Answer-Critique-Rewrite
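
To see why a robust estimate of the batch mean matters under heavy-tailed rewards, consider the sketch below. The abstract does not specify how ARE is constructed, so this example swaps in the classical median-of-means estimator as a stand-in. The function names and the group count are assumptions for illustration only.

```python
import numpy as np

def median_of_means(values, n_groups=8, seed=0):
    """Shuffle, split into groups, average each group, and take the median."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(values), n_groups)
    return float(np.median([g.mean() for g in groups]))

def robust_advantages(rewards, n_groups=8):
    """Center rewards on the robust mean instead of the plain batch mean."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - median_of_means(rewards, n_groups)

# 63 clean rewards of 1.0 plus one corrupted, heavy-tailed reward.
rewards = np.concatenate([np.full(63, 1.0), [1000.0]])

plain_mean = float(rewards.mean())        # pulled far above 1.0 by the outlier
robust_mean = median_of_means(rewards)    # 1.0: only one group contains the outlier
advantages = robust_advantages(rewards)
```

With the plain mean as baseline, every clean sample would receive a large negative advantage because of a single corrupted reward. With the robust baseline, clean samples get an advantage of zero and only the outlier stands out.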

111. ❌ CataractSAM-2: A Domain-Adapted Model for Anterior Segment Surgery Segmentation and Scalable Ground-Truth Annotation

Authors: Mohammad Eslami, Dhanvinkumar Ganeshkumar, Saber Kazeminasab, Michael G. Morley, Michael V. Boland, Michael M. Lin, John B. Miller, David S. Friedman, Nazlee Zebardast, Lucia Sobrin, Tobias Elze Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21566v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

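Scoring rationale: CataractSAM-2 targets computer vision and medical robotics, specifically semantic segmentation of ophthalmic surgery videos. It is highly relevant to "Domain Adaptation" (10 points) because it is a domain-adapted extension of Segment Anything Model 2 for cataract surgery. It is also highly relevant to "AI for Science" (10 points) because it is applied in the bioinformatics/medical domain and advances AI-driven solutions for medical robotics. The remaining keywords mainly concern LLM principles, training methods, inference optimization, or agent systems, which are unrelated to this paper's focus on a visual segmentation model and its medical application, so they receive 0 points.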

!!! tip deepseek-chat TL;DR

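The paper proposes CataractSAM-2, a domain-adapted model based on Segment Anything Model 2 for real-time, high-accuracy segmentation of cataract surgery videos. It also introduces an interactive annotation framework to speed up the creation of high-quality ground-truth masks and demonstrates zero-shot generalization to glaucoma trabeculectomy.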

Abstract (Chinese translation)

我们推出CataractSAM-2，这是对Meta公司Segment Anything Model 2（SAM-2）进行领域适配的扩展模型，专为白内障眼科手术视频的高精度实时语义分割而设计。该模型定位于计算机视觉与医疗机器人技术的交叉领域，能够实现精确的术中感知，这对于机器人辅助和计算机引导的手术系统至关重要。此外，为减轻人工标注负担，我们引入了一种交互式标注框架，将稀疏提示与基于视频的掩码传播相结合。该工具显著缩短了标注时间，促进了高质量真实掩码的可扩展生成，从而加速了眼部前节手术数据集的开发。我们还验证了该模型在青光眼小梁切除术中强大的零样本泛化能力，证实了其跨手术流程的实用性及在更广泛外科应用中的潜力。训练完成的模型与标注工具包已作为开源资源发布，确立了CataractSAM-2作为扩展眼科前节手术数据集、推进医疗机器人实时人工智能驱动解决方案以及手术视频理解研究的基础平台。

摘要 (Abstract)

We present CataractSAM-2, a domain-adapted extension of Meta’s Segment Anything Model 2, designed for real-time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM-2 enables precise intraoperative perception crucial for robotic-assisted and computer-guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video-based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high-quality ground-truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model’s strong zero-shot generalization to glaucoma trabeculectomy procedures, confirming its cross-procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open-source resources, establishing CataractSAM-2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics, as well as surgical video understanding.

Keywords: CataractSAM-2, domain adaptation, semantic segmentation, ophthalmic surgery, annotation framework, zero-shot generalization, medical robotics, surgical video understanding

112. ❌ Rethinking SAR ATR: A Target-Aware Frequency-Spatial Enhancement Framework with Noise-Resilient Knowledge Guidance

Authors: Yansong Lin, Zihan Cheng, Jielei Wang, Guoming Lua, Zongyong Cui Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21565v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

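Scoring rationale: This paper focuses on synthetic aperture radar (SAR) automatic target recognition (ATR) and proposes a deep learning framework that combines frequency-spatial feature enhancement with knowledge distillation. Its core is a specific application in computer vision and remote sensing, not LLMs or general large-model techniques. Of the 27 keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some connection, since SAR ATR can be seen as an application of AI to science (remote sensing, Earth observation), although the paper does not directly involve bioinformatics or cheminformatics. The other keywords all concern large-model principles, training methods, inference optimization, or agents, whereas this paper studies conventional convolutional architectures and knowledge distillation and has no direct link to them. Accordingly, the "AI for Science" keyword receives 5 points (some relevance) and all others receive 0 points (unrelated).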

!!! tip deepseek-chat TL;DR

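To address coherent speckle noise in SAR imagery, which blurs target features and lowers recognition accuracy, the paper proposes a target-aware frequency-spatial enhancement framework (FSCE). By combining spatial multi-scale convolution, frequency-domain wavelet convolution, and online knowledge distillation, it significantly improves recognition stability under noise and model generalization, and validates the approach on multiple datasets.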

Abstract (Chinese translation)

合成孔径雷达自动目标识别（SAR ATR）在海洋导航与灾害监测中具有重要价值。然而，SAR图像固有的相干斑噪声往往会掩盖显著的目标特征，导致识别精度下降并限制模型的泛化能力。为解决这一问题，本文提出一种具有抗噪知识引导的目标感知频空增强框架（FSCE），用于SAR目标识别。该框架包含一个频空浅层特征自适应增强（DSAF）模块，该模块通过空间多尺度卷积与频域小波卷积处理浅层特征。此外，采用结合在线知识蒸馏（KD）的师生学习范式，引导学生网络更有效地聚焦目标区域，从而增强其对高噪声背景的鲁棒性。通过注意力迁移与抗噪表征学习的协同优化，所提方法显著提升了噪声条件下目标识别的稳定性。基于FSCE框架，本文开发了两种不同性能侧重的网络架构：轻量化的DSAFNet-M与高精度的DSAFNet-L。在MSTAR、FUSARShip和OpenSARShip数据集上进行了大量实验。结果表明，DSAFNet-L在三个数据集上相比多种方法均取得竞争性或更优的性能；DSAFNet-M在保持相当精度的同时显著降低了模型复杂度。这些结果证明所提出的FSCE框架具有强大的跨模型泛化能力。

摘要 (Abstract)

Synthetic aperture radar automatic target recognition (SAR ATR) is of considerable importance in marine navigation and disaster monitoring. However, the coherent speckle noise inherent in SAR imagery often obscures salient target features, leading to degraded recognition accuracy and limited model generalization. To address this issue, this paper proposes a target-aware frequency-spatial enhancement framework with noise-resilient knowledge guidance (FSCE) for SAR target recognition. The proposed framework incorporates a frequency-spatial shallow feature adaptive enhancement (DSAF) module, which processes shallow features through spatial multi-scale convolution and frequency-domain wavelet convolution. In addition, a teacher-student learning paradigm combined with an online knowledge distillation method (KD) is employed to guide the student network to focus more effectively on target regions, thereby enhancing its robustness to high-noise backgrounds. Through the collaborative optimization of attention transfer and noise-resilient representation learning, the proposed approach significantly improves the stability of target recognition under noisy conditions. Based on the FSCE framework, two network architectures with different performance emphases are developed: lightweight DSAFNet-M and high-precision DSAFNet-L. Extensive experiments are conducted on the MSTAR, FUSARShip and OpenSARShip datasets. The results show that DSAFNet-L achieves competitive or superior performance compared with various methods on three datasets; DSAFNet-M significantly reduces the model complexity while maintaining comparable accuracy. These results indicate that the proposed FSCE framework exhibits strong cross-model generalization.

Keywords: Synthetic Aperture Radar, Automatic Target Recognition, Frequency-Spatial Enhancement, Noise-Resilient, Knowledge Distillation, Teacher-Student Learning, Feature Enhancement, Model Generalization

113. ❌ Toward a Theory of Hierarchical Memory for Language Agents

Authors: Yashar Talebirad, Ali Parsaee, Csongor Y. Szepesvari, Amirhossein Nadiri, Osmar Zaiane Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21564v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

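Scoring rationale: The paper proposes a theory of hierarchical memory for language agents and is highly relevant to the following keywords: 1) "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (10 points), because the paper centrally involves information retrieval and generation; 2) "Context Window Extension OR Long Context LLMs" (10 points), because the paper explicitly tackles context-length limitations; 3) "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points), because the paper focuses on language agent systems. It has some relevance to "Large Language Models OR LLMs OR Foundation Models" (8 points), because agents are usually built on large models. The other keywords are unrelated to the paper and receive 0 points.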

!!! tip deepseek-chat TL;DR

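The paper addresses the lack of a unified theory for hierarchical memory design in language agent systems. It proposes a general formal framework built on three operators (extraction, coarsening, and traversal) and shows that it applies to 11 existing systems.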

Abstract (Chinese translation)

近期许多长上下文与智能体系统通过引入分层记忆机制来应对上下文长度限制：它们从原始数据中提取原子单元，通过分组与压缩构建多级表征，并在令牌预算约束下遍历该结构以检索内容。尽管相关实现反复出现，但目前缺乏用于比较设计选择的共享形式化框架。我们提出一种基于三种操作符的统一理论框架。提取操作（$α$）将原始数据映射为原子信息单元；粗化操作（$C = (π, ρ)$）对单元进行分区并为每个分组分配表征；遍历操作（$τ$）则根据查询与预算选择需纳入上下文的单元。我们界定了表征函数 $ρ$ 的自足性谱系，并论证其如何约束可行的检索策略（即粗化与遍历的耦合关系）。最后，我们在涵盖文档层级、对话记忆与智能体执行轨迹的十一个现有系统中实例化解构，展示了该框架的普适性。

摘要 (Abstract)

Many recent long-context and agentic systems address context-length limitations by adding hierarchical memory: they extract atomic units from raw data, build multi-level representatives by grouping and compression, and traverse this structure to retrieve content under a token budget. Despite recurring implementations, there is no shared formalism for comparing design choices. We propose a unifying theory in terms of three operators. Extraction ($α$) maps raw data to atomic information units; coarsening ($C = (π, ρ)$) partitions units and assigns a representative to each group; and traversal ($τ$) selects which units to include in context given a query and budget. We identify a self-sufficiency spectrum for the representative function $ρ$ and show how it constrains viable retrieval strategies (a coarsening-traversal coupling). Finally, we instantiate the decomposition on eleven existing systems spanning document hierarchies, conversational memory, and agent execution traces, showcasing its generality.

Keywords: hierarchical memory, language agents, context-length limitations, retrieval, extraction, coarsening, traversal, agentic systems
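
The three operators named in the abstract can be made concrete with a toy pipeline. This is a minimal sketch under strong assumptions, not the paper's formalism: units are sentences, the partition groups consecutive units, the representative is simply the first unit of each group, relevance is word overlap, and token cost is a whitespace word count. All names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Group:
    representative: str   # output of rho for this group
    units: list           # atomic units assigned to the group by pi

def extract(raw):
    """alpha: map raw text to sentence-like atomic units."""
    return [s.strip() for s in raw.split(".") if s.strip()]

def coarsen(units, group_size=2):
    """C = (pi, rho): partition consecutive units; use the first unit as representative."""
    groups = []
    for i in range(0, len(units), group_size):
        chunk = units[i:i + group_size]
        groups.append(Group(representative=chunk[0], units=chunk))
    return groups

def overlap(a, b):
    """Count shared lowercase words between two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def traverse(groups, query, budget):
    """tau: rank groups by representative relevance, then add units until the budget is spent."""
    selected, used = [], 0
    ranked = sorted(groups, key=lambda g: overlap(g.representative, query), reverse=True)
    for group in ranked:
        for unit in group.units:
            cost = len(unit.split())
            if used + cost > budget:
                return selected
            selected.append(unit)
            used += cost
    return selected

raw = ("The agent opened the door. The door was red. "
       "The user asked about lunch. Lunch was pasta.")
groups = coarsen(extract(raw))
context = traverse(groups, query="what was for lunch", budget=8)
# context == ["The user asked about lunch", "Lunch was pasta"]
```

The representative in this sketch does not summarize its group, so traversal has to descend into the group's units to answer the query. A more self-sufficient representative could let traversal stop at the group level, which is one way to read the coarsening-traversal coupling the abstract describes.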

114. ❌ Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

Authors: Zhongyi Li, Wan Tian, Yikun Ban, Jinju Chen, Huiming Zhang, Yang Liu, Fuzhen Zhuang Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21563v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

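Scoring rationale: The paper's core topic is credit assignment in collaborative multi-agent LLMs, for which it proposes the CCPO framework. Highly relevant keywords: "Large Language Models" (the paper explicitly uses LLMs), "LLM Agents" (it studies collaboration among LLM agents), and "Multi-agent Systems" (multi-agent coordination is its central topic). Moderately relevant: "Chain of Thought" and "System 2 Thinking" (the paper involves mathematical and logical reasoning tasks). Other keywords such as MoE, SLMs, and RLHF are not addressed in the paper and receive 0 points.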

!!! tip deepseek-chat TL;DR

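The paper addresses credit assignment in collaborative multi-agent LLMs by proposing the Counterfactual Credit Policy Optimization framework, which estimates individual contributions through counterfactual trajectories. On mathematical and logical reasoning tasks it mitigates free-riding and outperforms existing multi-agent reinforcement learning methods.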

Abstract (Chinese translation)

协作式多智能体大语言模型（LLMs）能够通过角色分解与多元假设聚合来解决复杂推理任务。然而，此类系统的强化学习（RL）常受信用分配问题制约：共享的全局奖励掩盖了个体贡献，增大了更新方差并助长搭便车行为。我们提出了反事实信用策略优化（Counterfactual Credit Policy Optimization, CCPO），该框架通过反事实轨迹估计每个智能体的边际贡献，从而为各智能体分配特定的学习信号。CCPO构建动态反事实基线，模拟移除某智能体贡献后的结果，为策略优化提供角色敏感的优势函数。为进一步提升在异构任务与数据分布下的稳定性，我们提出一种全局历史感知归一化方案，利用全局推演统计量校准优势函数。我们在两种协作拓扑结构上评估CCPO：顺序式“思考—推理”双人组与多智能体投票机制。在数学与逻辑推理基准测试中，CCPO有效缓解了搭便车现象，并优于现有强多智能体RL基线方法，为协作式LLM训练提供了更细粒度且更高效的信用分配方案。代码已发布于https://github.com/bhai114/ccpo。

摘要 (Abstract)

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent’s marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent’s contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think–Reason dyad and multi-agent voting. Across mathematical and logical reasoning benchmarks, CCPO mitigates free-riding and outperforms strong multi-agent RL baselines, yielding finer-grained and more effective credit assignment for collaborative LLM training. Our code is available at https://github.com/bhai114/ccpo.

Keywords: multi-agent collaboration, large language models, credit assignment, counterfactual trajectories, reinforcement learning, policy optimization, mathematical reasoning, logical reasoning
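
The idea of estimating an agent's marginal contribution by removing it can be illustrated for the multi-agent voting topology. The sketch below uses a plain leave-one-out baseline as a stand-in. It is not CCPO's dynamic counterfactual baseline or its normalization scheme, and the reward definition, tie handling, and function names are assumptions for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer, or None for an empty or tied vote."""
    if not answers:
        return None
    counts = Counter(answers).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def team_reward(answers, target):
    """Shared global reward: 1.0 if the majority answer matches the target."""
    return 1.0 if majority_vote(answers) == target else 0.0

def counterfactual_credit(answers, target):
    """Credit for agent i = team reward minus the reward with agent i's vote removed."""
    full = team_reward(answers, target)
    credits = []
    for i in range(len(answers)):
        without_i = answers[:i] + answers[i + 1:]
        credits.append(full - team_reward(without_i, target))
    return credits

answers = ["42", "42", "7"]     # the third agent votes for a wrong answer
credits = counterfactual_credit(answers, target="42")
# credits == [1.0, 1.0, 0.0]
```

All three agents would receive the same shared reward of 1.0, which hides the fact that the third agent did not help. The leave-one-out credit gives that agent zero, which is the kind of agent-specific signal the abstract argues is needed to discourage free-riding.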

115. ❌ Greater accessibility can amplify discrimination in generative AI

Authors: Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou, Valentin Hofmann, Katharina von der Wense, Anne Lauscher Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22260v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

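Scoring rationale: The paper's core topic is gender discrimination in audio-enabled LLMs. It is highly relevant only to "Large Language Models OR LLMs OR Foundation Models" (10 points), because it explicitly studies bias amplification by LLMs in voice interaction. The other keywords concern specific techniques (e.g., MoE, SFT, RAG), reasoning methods (e.g., CoT, MCTS), efficiency optimizations (e.g., quantization, attention mechanisms), or scientific applications (e.g., bioinformatics), none of which the paper addresses, so they receive 0 points.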

!!! tip deepseek-chat TL;DR

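The study finds that voice-enabled LLMs exhibit systematic gender discrimination based on the speaker's voice and amplify social bias, and that pitch manipulation can serve as a mitigation strategy, revealing a tension between AI accessibility and fairness.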

Abstract (Chinese translation)

数亿人在教育、工作和医疗保健等领域依赖大语言模型（LLMs）。然而，已知这些模型会复制并放大其训练数据中存在的社会偏见。此外，基于文本的交互界面对许多人而言仍构成障碍，例如识字能力有限、运动功能受损或仅能使用移动设备的用户。语音交互有望提升可访问性，但与文本不同，语音携带了用户难以隐藏的身份线索，这引发了人们对可访问性提升是否可能以公平对待为代价的担忧。本文研究表明，支持音频的大语言模型表现出系统性的性别歧视，仅基于说话者的语音，其回应就会偏向性别刻板印象的形容词和职业，并且放大了超越基于文本交互中观察到的偏见。因此，语音界面并非仅仅将文本模型扩展至新模态，而是引入了与副语言线索相关的独特偏见机制。一项补充性调查证据（样本量 n=1,000）表明，不常使用聊天机器人的用户最不愿接受未公开的属性推断，且在得知此类做法时最可能停止使用。为展示一种潜在的缓解策略，我们证明通过音高操纵可以系统地调节性别歧视性输出。总体而言，我们的研究揭示了人工智能发展中的一个关键矛盾：通过语音界面扩大可访问性的努力，同时为歧视创造了新的途径，这就要求我们必须同步解决公平性与可访问性问题。

摘要 (Abstract)

Hundreds of millions of people rely on large language models (LLMs) for education, work, and even healthcare. Yet these models are known to reproduce and amplify social biases present in their training data. Moreover, text-based interfaces remain a barrier for many, for example, users with limited literacy, motor impairments, or mobile-only devices. Voice interaction promises to expand accessibility, but unlike text, speech carries identity cues that users cannot easily mask, raising concerns about whether accessibility gains may come at the cost of equitable treatment. Here we show that audio-enabled LLMs exhibit systematic gender discrimination, shifting responses toward gender-stereotyped adjectives and occupations solely on the basis of speaker voice, and amplifying bias beyond that observed in text-based interaction. Thus, voice interfaces do not merely extend text models to a new modality but introduce distinct bias mechanisms tied to paralinguistic cues. Complementary survey evidence ($n=1,000$) shows that infrequent chatbot users are most hesitant to undisclosed attribute inference and most likely to disengage when such practices are revealed. To demonstrate a potential mitigation strategy, we show that pitch manipulation can systematically regulate gender-discriminatory outputs. Overall, our findings reveal a critical tension in AI development: efforts to expand accessibility through voice interfaces simultaneously create new pathways for discrimination, demanding that fairness and accessibility be addressed in tandem.

Keywords: large language models, voice interfaces, gender discrimination, social biases, accessibility, fairness, audio-enabled LLMs, bias amplification

116. ❌ MemDLM: Memory-Enhanced DLM Training

Authors: Zehua Pei, Hui-Ling Zhen, Weizhe Lin, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22241v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

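Scoring rationale: The paper proposes MemDLM, a new training method for Diffusion Language Models (DLMs), which is an innovation in large-model technical principles. Core relevant keywords: 1) "Large Language Models" (10 points): a DLM is a type of language model, and the paper directly studies its training method; 2) "Pre-training" (8 points): the paper focuses on improving the DLM training process; 3) "Context Window Extension" (8 points): re-enabling the method at inference time improves long-context understanding; 4) "Retrieval-Augmented Generation" (5 points): at inference time the parametric memory acts as an emergent in-weight retrieval mechanism that helps retrieval tasks. Other keywords such as MoE, SFT, and RLHF are not addressed and receive 0 points.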

!!! tip deepseek-chat TL;DR

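To address the train-inference mismatch in Diffusion Language Models (DLMs), the paper proposes MemDLM, which embeds a simulated denoising process into training via bi-level optimization and uses a parametric memory to capture trajectory experience. This speeds up convergence and lowers training loss, and at inference time the memory acts as an in-weight retrieval mechanism that improves long-context understanding and retrieval performance.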

Abstract (Chinese translation)

扩散语言模型（Diffusion Language Models, DLMs）相比自回归（Auto-Regressive, AR）模型具有显著优势，例如全注意力并行解码和灵活的生成能力。然而，它们存在明显的训练-推断不匹配问题：DLMs 在训练时采用静态的单步掩码预测目标，但在部署时却通过多步渐进去噪轨迹进行生成。我们提出 MemDLM（记忆增强型 DLM），通过双层优化（Bi-level Optimization）将模拟去噪过程嵌入训练，从而缩小这一差距。内层循环更新一组快速权重，形成参数化记忆（Parametric Memory），以捕获每个样本的局部轨迹经验；而外层循环则基于此记忆更新基础模型。通过将记忆压力从词元表示卸载到参数，MemDLM 实现了更快的收敛速度和更低的训练损失。此外，内层循环在推断时可重新启用作为适应步骤，从而在长上下文理解任务上带来额外提升。我们发现，当在推断时激活时，这种参数化记忆表现为一种新兴的权重内检索机制，有助于 MemDLM 在具有挑战性的“大海捞针”检索任务上进一步减轻词元级注意力瓶颈。代码：https://github.com/JarvisPei/MemDLM。

摘要 (Abstract)

Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: https://github.com/JarvisPei/MemDLM.

Keywords: Diffusion Language Models, DLM, train-inference mismatch, Bi-level Optimization, Parametric Memory, long-context understanding, in-weight retrieval, Needle-in-a-Haystack retrieval

117. ❌ Adapting Self-Supervised Speech Representations for Cross-lingual Dysarthria Detection in Parkinson’s Disease

Authors: Abner Hernandez, Eunjung Yeo, Kwanghee Choi, Chin-Jou Li, Zhengjun Yue, Rohan Kumar Das, Jan Rusz, Mathew Magimai Doss, Juan Rafael Orozco-Arroyave, Tomás Arias-Vergara, Andreas Maier, Elmar Nöth, David R. Mortensen, David Harwath, Paula Andrea Perez-Toro Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22225v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

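Scoring rationale: This paper focuses on cross-lingual dysarthria detection using self-supervised speech representations, an application of AI in the biomedical domain. It is unrelated to most of the large-model keywords (LLMs, MoE, RLHF, and so on), which concern text generation models rather than speech processing. Only two keywords are relevant: 1) "Pre-training OR Continual Pre-training OR Domain Adaptation" (relevance 8), because the paper uses self-supervised pre-trained speech representations and performs domain adaptation (language alignment); 2) "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 10), because the study concerns Parkinson's disease diagnosis, a bioinformatics/medical AI application. No other keyword applies.

!!! tip deepseek-chat TL;DR

The paper proposes a representation-level language shift that aligns self-supervised speech representations to improve cross-lingual detection of dysarthria in Parkinson's disease. On Czech, German, and Spanish datasets it substantially improves sensitivity and F1 in cross-lingual settings.

Abstract (Chinese translation)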

构音障碍语音数据的有限性使得跨语言检测成为一个重要但具有挑战性的问题。一个关键难点在于，语音表征通常编码了语言依赖的结构，这可能干扰构音障碍的检测。我们提出了一种表征层面的语言迁移方法，该方法利用基于健康对照语音估计的质心向量适应，将源语言的自监督语音表征与目标语言的分布对齐。我们在捷克语、德语和西班牙语的帕金森病语音数据集的口部DDK（Diadochokinetic）录音上，于跨语言和多语言两种设置下评估了该方法。语言迁移在跨语言设置中显著提高了检测的敏感性和F1分数，同时在多语言设置中也带来了较小但一致的性能提升。表征分析进一步表明，语言迁移降低了嵌入空间中的语言身份信息，这支持了“语言迁移移除了语言依赖结构”的解释。

Abstract

The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson’s disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.

Keywords: dysarthria detection, Parkinson’s disease, self-supervised speech representations, cross-lingual, language shift, domain adaptation, speech analysis, medical AI
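
To make the centroid-based adaptation in the abstract concrete, here is a minimal sketch under one assumption that the abstract does not confirm: the language shift is a plain translation of every source-language embedding by the difference between the target and source healthy-control centroids. The function names (`centroid`, `language_shift`) and the toy 2-D vectors are hypothetical; real self-supervised speech embeddings have hundreds of dimensions.

```python
from typing import List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Return the mean vector of a non-empty list of equal-length embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def language_shift(
    source: List[Vector],
    source_healthy: List[Vector],
    target_healthy: List[Vector],
) -> List[Vector]:
    """Translate source-language embeddings toward the target language.

    Assumption (not confirmed by the abstract): the shift is the difference
    between the healthy-control centroids of the two languages.
    """
    mu_source = centroid(source_healthy)
    mu_target = centroid(target_healthy)
    offset = [t - s for s, t in zip(mu_source, mu_target)]
    return [[x + o for x, o in zip(vec, offset)] for vec in source]


if __name__ == "__main__":
    # Toy data: healthy-control embeddings for a source and a target language.
    src_hc = [[0.0, 0.0], [2.0, 2.0]]
    tgt_hc = [[4.0, 0.0], [6.0, 2.0]]
    # Source-language recordings (patients and controls) to be shifted.
    src_all = [[1.0, 1.0], [3.0, 0.0]]
    print(language_shift(src_all, src_hc, tgt_hc))
```

Estimating the offset from healthy controls only, as the abstract describes, keeps the disease-related variation out of the shift itself, so a downstream classifier trained on shifted source data still sees the patient-versus-control contrast.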

118. ❌ Gumbel Distillation for Parallel Text Generation

Authors: Chi Zhang, Xixi Hu, Bo Liu, Qiang Liu Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22216v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

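Scoring rationale: The paper focuses on improving the generation quality of parallel language models through Gumbel Distillation. Its core concerns are language models (in particular, an autoregressive model used as the teacher) and inference acceleration (through parallel decoding), so "Large Language Models" and "Speculative Decoding" both receive a high relevance of 8. Other keywords such as MoE, SFT, RAG, and quantization are not mentioned in the abstract and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes Gumbel Distillation, a new distillation technique that uses the Gumbel-Max trick to build a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing autoregressive teacher. This improves the generation quality of parallel language models: relative to MDLM trained on OpenWebText, MAUVE improves by 30.0% and generative perplexity by 10.5%.

Abstract (Chinese translation)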

自回归语言模型缓慢的序列生成特性推动了并行解码方法的发展。然而，这些非自回归模型在建模词元序列的复杂联合分布时存在困难，往往以牺牲生成质量为代价。为缩小这一性能差距，我们提出了Gumbel蒸馏法——一种新颖的蒸馏技术，能使并行解码器有效学习该分布。我们的方法利用Gumbel-Max技巧，构建了从潜在Gumbel噪声空间到高性能自回归教师模型输出词元的确定性映射。作为一种模型无关技术，Gumbel蒸馏可无缝集成于多种并行解码架构（包括MDLM和BD3-LM）。在LM1B和OpenWebText数据集上的实验表明，Gumbel蒸馏显著提升了并行语言模型的生成质量：在OpenWebText数据集上训练的MDLM模型，其MAUVE分数提升30.0%，生成困惑度改善10.5%。代码发布于https://github.com/hxixixh/gumbel-distill。

Abstract

The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and 10.5% in generative perplexity over MDLM trained on OpenWebText dataset. Code available at https://github.com/hxixixh/gumbel-distill.

Keywords: Gumbel Distillation, parallel decoding, autoregressive language models, non-AR models, generation quality, MDLM, BD3-LM, MAUVE score
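
The Gumbel-Max trick that the abstract relies on is a standard result: adding independent Gumbel(0, 1) noise to logits and taking the argmax yields a sample from the softmax distribution, and for a fixed noise vector the chosen token is fully determined. The sketch below shows only that trick, with toy logits standing in for a teacher's output; it is not the paper's distillation procedure, and the function names are illustrative.

```python
import math
import random
from typing import List


def sample_gumbel(rng: random.Random) -> float:
    """Draw one standard Gumbel(0, 1) variate by inverse-transform sampling."""
    # Clamp away from 0 so that log(u) stays finite.
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def gumbel_max(logits: List[float], noise: List[float]) -> int:
    """Deterministic map from (logits, Gumbel noise) to a token index."""
    scores = [logit + g for logit, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


def empirical_distribution(
    logits: List[float], n_samples: int, seed: int = 0
) -> List[float]:
    """Estimate token frequencies produced by repeated Gumbel-Max sampling."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(n_samples):
        noise = [sample_gumbel(rng) for _ in logits]
        counts[gumbel_max(logits, noise)] += 1
    return [c / n_samples for c in counts]


if __name__ == "__main__":
    # Toy logits equal to log-probabilities of (0.7, 0.2, 0.1).
    toy_logits = [math.log(0.7), math.log(0.2), math.log(0.1)]
    # The printed frequencies should be close to the softmax probabilities.
    print(empirical_distribution(toy_logits, n_samples=20000))
```

Because the noise fully determines the sampled token, a (noise, token) pair from the teacher can serve as a supervised training example for a parallel decoder that receives the same noise as input, which is presumably why the abstract stresses that the mapping is deterministic.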

119. ❌ The Semantic Ladder: A Framework for Progressive Formalization of Natural Language Content for Knowledge Graphs and AI Systems

Authors: Lars Vogt Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22136v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

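Scoring rationale: The paper proposes the Semantic Ladder, a framework for progressively converting natural-language content into formal semantic models; it belongs to knowledge representation and semantic infrastructure. Every scoring keyword targets a specific aspect of large models or deep learning (technical principles, training methods, inference optimization, applications), and the paper addresses none of them. It discusses general semantic representation and knowledge infrastructure rather than any particular large-model technique or application, so it is unrelated to all keywords.

!!! tip deepseek-chat TL;DR

The paper proposes the Semantic Ladder, an architectural framework that addresses the representational gap between natural language and formal semantic models, using progressive semantic formalization to support scalable, interoperable, AI-ready data and knowledge infrastructures.

Abstract (Chinese translation)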

语义数据与知识基础设施必须调和两种根本不同的表征形式：自然语言（大多数知识借此创造与传播）与形式化语义模型（其支持机器可操作的集成、互操作与推理）。弥合这一鸿沟仍是核心挑战，尤其在数据录入环节即需完全语义形式化的情况下。本文提出“语义阶梯”这一架构框架，支持数据与知识的渐进式形式化。该框架基于模块化语义单元（即可识别意义载体）的概念，将表征组织在语义明确性逐级递增的多个层次上，涵盖从自然语言文本片段到基于本体的高阶逻辑模型。层级间的转换在保持语义连续性与可追溯性的同时，支持语义增强、陈述结构化和逻辑建模。该方法支持语义知识空间的增量构建，减轻语义解析负担，并促进异构表征（包括自然语言、结构化语义模型与基于向量的嵌入表示）的集成。语义阶梯由此为可扩展、可互操作且适配人工智能的数据与知识基础设施奠定了基础。

Abstract

Semantic data and knowledge infrastructures must reconcile two fundamentally different forms of representation: natural language, in which most knowledge is created and communicated, and formal semantic models, which enable machine-actionable integration, interoperability, and reasoning. Bridging this gap remains a central challenge, particularly when full semantic formalization is required at the point of data entry. Here, we introduce the Semantic Ladder, an architectural framework that enables the progressive formalization of data and knowledge. Building on the concept of modular semantic units as identifiable carriers of meaning, the framework organizes representations across levels of increasing semantic explicitness, ranging from natural language text snippets to ontology-based and higher-order logical models. Transformations between levels support semantic enrichment, statement structuring, and logical modelling while preserving semantic continuity and traceability. This approach enables the incremental construction of semantic knowledge spaces, reduces the semantic parsing burden, and supports the integration of heterogeneous representations, including natural language, structured semantic models, and vector-based embeddings. The Semantic Ladder thereby provides a foundation for scalable, interoperable, and AI-ready data and knowledge infrastructures.

Keywords: Semantic Ladder, progressive formalization, natural language, formal semantic models, knowledge representation, semantic infrastructure, semantic enrichment, AI-ready data

120. ❌ Multiperspectivity as a Resource for Narrative Similarity Prediction

Authors: Max Upravitelev, Veronika Solopova, Jing Yang, Charlott Jakob, Premtim Sahitaj, Ariana Sahitaj, Vera Schmitt Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22103v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

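Scoring rationale: The core of the paper is an ensemble of 31 LLM personas used for narrative similarity prediction, a direct application of LLMs, so "Large Language Models OR LLMs OR Foundation Models" receives a relevance of 10. Other keywords (MoE, SLMs, training methods, reasoning techniques, compression, AI for science, and so on) are not mentioned in the abstract and are unrelated to the paper, so they receive 0.

!!! tip deepseek-chat TL;DR

The study proposes an ensemble of LLM personas that incorporates multiple interpretive perspectives, addressing the challenge that divergent but equally valid readings pose for narrative similarity benchmarks with a single ground truth. It reaches an accuracy of 0.705 on the SemEval-2026 Task 4 dataset and finds that gender-focused interpretive vocabulary is negatively associated with accuracy.

Abstract (Chinese translation)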

预测叙事相似性可被理解为一项本质上的阐释性任务：对同一文本的不同但同样有效的解读可能产生相异的阐释，从而导致不同的相似性判断，这对编码单一标准答案的语义评估基准构成了根本性挑战。我们并非将这种多视角性视为需要克服的障碍，而是提议将其纳入预测系统的决策过程中。为探索这一策略，我们构建了一个包含31个大型语言模型角色的集合。这些角色涵盖遵循特定阐释框架的专业实践者到更具直觉性、大众化风格的人物。我们在SemEval-2026任务4数据集上进行了实验，系统取得了0.705的准确率分数。准确率随集合规模扩大而提升，这与弱化独立性条件下的孔多塞陪审团定理动态特征相符。专业实践者角色个体表现较差，但其错误相关性更低，在多数投票机制下能产生更大的集合增益。我们的错误分析显示，在所有角色类别中，聚焦性别的阐释性词汇与准确率均存在一致的负相关关系，这可能表明模型关注了与基准无关的维度，或是产生了标准答案中未包含的有效阐释。这一发现凸显了需要建立能够容纳阐释多样性的评估框架。

Abstract

Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.

Keywords: narrative similarity prediction, multiperspectivity, LLM personas, ensemble, interpretive frameworks, SemEval-2026, accuracy, gender-focused vocabulary
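
The Condorcet-style effect mentioned in the abstract can be illustrated with the idealized, fully independent case: if each of n voters is correct with probability p > 0.5, the probability that a strict majority is correct grows with n. The sketch below computes that binomial tail and a simple majority vote. Independence is an assumption made here for illustration only; the abstract explicitly describes weakened independence, so real ensemble gains would be smaller. The per-persona accuracy of 0.6 is a made-up value, not a figure from the paper.

```python
from collections import Counter
from math import comb
from typing import Hashable, List


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Each voter is assumed to be correct with the same probability p.
    n must be odd so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    k_min = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


def majority_vote(labels: List[Hashable]) -> Hashable:
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical per-persona accuracy; 31 is the ensemble size in the abstract.
    for n in (1, 3, 11, 31):
        print(n, round(majority_accuracy(0.6, n), 3))
```

The same formula also hints at why less correlated errors matter: the binomial tail only applies when mistakes are independent, so personas that are individually weaker but err on different items can still yield a larger gain under majority voting.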

121. ❌ Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison

Authors: Caio Vicentino Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22075v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

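Scoring rationale: The paper compares two generation paradigms, autoregressive (AR) and masked diffusion (MDLM) language models, which is foundational research on large language models (LLMs). It also involves pre-training, since both models are trained under identical data, compute budget, and hardware, and their convergence behavior, overfitting, and generation diversity are analyzed. Other keywords such as MoE, SFT, RLHF, RAG, inference acceleration, and AI for Science are not covered by the paper, so they receive 0.

!!! tip deepseek-chat TL;DR

Under tightly controlled conditions, the study compares autoregressive and masked diffusion language models. Training throughput is comparable. The autoregressive model converges faster but begins overfitting by step 14,000, whereas the masked diffusion model converges more slowly, is still improving at step 20,000, and generates more diverse text with occasional grammatical inconsistencies.

Abstract (Chinese translation)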

本文对自回归语言模型与掩码扩散语言模型进行了受控的实证比较。两种模型均在相同数据（来自TinyStories的5000万词元）、相同计算预算（20,000步，批次大小32，序列长度512）和相同硬件（NVIDIA H100 80GB）下训练，从而将生成范式作为唯一变量进行隔离研究。我们报告了三点发现。首先，两种范式实现了相近的训练吞吐量（约每秒5万词元），其中掩码扩散模型仅需增加4.7%的实耗时间。其次，自回归模型收敛更快，在14,000步时开始出现过拟合；而掩码扩散模型收敛较慢，在20,000步时仍在持续改进，这表明二者存在不同的计算最优训练机制。第三，通过对1000个生成样本的定量多样性分析，揭示了结构多样性-流畅性之间的权衡：自回归模型能生成流畅但重复性高的输出（99.8%的样本以相同词语开头），而掩码扩散模型能产生更多样化的叙事（93.4%的样本具有独特的5词开头，更高的Distinct-n分数，更低的Self-BLEU分数），但代价是偶尔出现语法不一致现象。我们已公开所有代码、训练检查点和数据流程以确保可复现性。

Abstract

We present a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, while MDLM converges more slowly and is still improving at step 20,000, suggesting different compute-optimal training regimes. Third, quantitative diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU), at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.

Keywords: autoregressive language models, masked diffusion language models, controlled comparison, training convergence, overfitting, generation diversity, fluency-diversity trade-off, TinyStories dataset
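
The diversity statistics quoted in the abstract (Distinct-n and the share of unique 5-word openings) can be computed along the following lines. This is a sketch that uses whitespace tokenization and the common definition of Distinct-n as unique n-grams divided by total n-grams; the paper's exact tokenization and definitions are not given in the abstract and may differ. The sample strings are invented.

```python
from typing import List


def distinct_n(samples: List[str], n: int) -> float:
    """Unique n-grams divided by total n-grams over all samples."""
    total = 0
    unique = set()
    for text in samples:
        tokens = text.split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i : i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def unique_opening_rate(samples: List[str], k: int = 5) -> float:
    """Fraction of distinct k-word openings among the samples."""
    if not samples:
        return 0.0
    openings = {tuple(text.split()[:k]) for text in samples}
    return len(openings) / len(samples)


if __name__ == "__main__":
    generated = [
        "once upon a time there",
        "once upon a time she",
        "the cat sat on mats",
    ]
    print(distinct_n(generated, 1), unique_opening_rate(generated, k=4))
```

A model whose samples almost all start with the same word, as the abstract reports for the AR model, would score low on `unique_opening_rate` even for small `k`, while higher Distinct-n indicates less repetition across the whole sample set.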

122. ❌ Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

Authors: Stella Eva Tsiapali, Cong-Thanh Do, Kate Knill Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22056v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

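Scoring rationale: The paper studies knowledge distillation (KD) for large language models (LLMs), specifically between models with different tokenizers, so "Large Language Models" is highly relevant (10). KD trains a smaller student model, which gives some connection to "Small Language Models" and "Quantization/Model Compression" (5 each). The paper analyzes the attention mechanism and proposes a new method, which gives some connection to "Mechanistic Interpretability" (5). Other keywords such as MoE, Scaling Laws, Alignment, RAG, and Agents are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies the vocabulary mismatch that arises when distilling knowledge between LLMs with different tokenizers and proposes DSKD-CMA-GA, a new method based on generative adversarial learning. Experiments show modest but consistent gains in text generation quality, particularly on out-of-distribution data.

Abstract (Chinese translation)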

大型语言模型（LLM）在各类语言任务中实现了最先进的性能，但由于其规模和资源需求，部署成本高昂。知识蒸馏通过训练较小的学生模型模仿较大的教师模型来解决这一问题，在无明显性能损失的前提下提升了效率。基于跨模型注意力的双空间知识蒸馏已成为针对不同分词器的LLM之间进行知识蒸馏的SOTA方法，但其内部工作机制在很大程度上仍不透明。本研究通过手动令牌对齐探测和热力图可视化，系统分析了DSKD-CMA的注意力机制，揭示了其优势与局限。在此基础上，我们提出了一种基于生成对抗学习的新方法DSKD-CMA-GA，以解决由不同模型计算的键与查询之间分布不匹配的问题。实验表明，该方法在文本生成质量上取得了适度但稳定的ROUGE-L提升（平均提升0.37分），尤其在分布外数据上表现显著，从而缩小了跨分词器与同分词器知识蒸馏之间的性能差距。

Abstract

Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.

Keywords: Knowledge Distillation, Large Language Models, Vocabulary Mismatch, Cross-Model Attention, Generative Adversarial Learning, Text Generation, ROUGE-L, Out-of-distribution Data
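
For context, the conventional same-tokenizer distillation objective that cross-tokenizer methods have to work around is sketched below: a temperature-softened KL divergence between teacher and student token distributions. It is the classic logit-matching loss, not DSKD-CMA or DSKD-CMA-GA, and it only works when both models share a vocabulary, which is exactly the restriction the paper targets. Function names and the temperature default are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def kd_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Both logit vectors must index the same vocabulary; with different
    tokenizers the positions do not line up and this loss is undefined.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student vocabularies differ in size")
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # Sketch only: extreme logits could underflow q to 0 and break log(q).
    kl = sum(
        pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0
    )
    return temperature**2 * kl


if __name__ == "__main__":
    print(kd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

The `ValueError` branch marks the point where vocabulary mismatch bites: once teacher and student tokenize text differently, neither the vocabulary indices nor the sequence positions correspond, so some alignment mechanism (cross-model attention, in the method the abstract analyzes) is needed before any such loss can be applied.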

123. ❌ Retrieving Climate Change Disinformation by Narrative

Authors: Max Upravitelev, Veronika Solopova, Charlott Jakob, Premtim Sahitaj, Vera Schmitt Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22015v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

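Scoring rationale: The paper focuses on detecting climate disinformation, reframes narrative detection as a retrieval task, and proposes the SpecFi framework. It is unrelated to most large-model keywords but has some connection to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (5), because it involves a retrieval task and the generation of hypothetical documents, and to "AI for Science OR Bioinformatics OR Cheminformatics" (5), because it is applied to the climate science domain. No other keyword is covered; all others receive 0.

!!! tip deepseek-chat TL;DR

The study reframes climate disinformation narrative detection as a retrieval task and proposes SpecFi, a framework that generates hypothetical documents to bridge the gap between abstract narrative descriptions and their concrete textual instantiations, evaluating it on three repurposed climate disinformation datasets.

Abstract (Chinese translation)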

检测气候虚假信息叙事通常依赖于固定的分类体系，这类体系难以适应新兴叙事。因此，我们将叙事检测重新定义为检索任务：以某一叙事的核心信息作为查询，根据文本与叙事的一致性对语料库中的文本进行排序。这一框架无需预定义标签集，且能兼容新兴叙事。我们重新调整了三个气候虚假信息数据集（CARDS、Climate Obstruction、PolyNarrative的气候变化子集）以用于检索评估，并提出了SpecFi框架——该框架通过生成假设性文档来弥合抽象叙事描述与其具体文本实例之间的差距。SpecFi利用基于图社区检测方法提取的社区摘要作为少样本生成示例，在未接触叙事标签的情况下，于CARDS数据集上实现了0.505的平均准确率均值（MAP）。我们进一步提出了叙事方差这一基于嵌入表示的难度度量指标，并通过偏相关分析表明：标准检索方法在高方差叙事上性能显著下降（BM25的MAP损失达63.4%），而SpecFi-CS方法仍保持稳健（仅损失32.7%）。我们的分析还发现，无监督生成的社区摘要与专家构建的分类体系描述高度接近，这表明基于图的方法能够从未标注文本中有效提取叙事结构。

Abstract

Detecting climate disinformation narratives typically relies on fixed taxonomies, which do not accommodate emerging narratives. Thus, we re-frame narrative detection as a retrieval task: given a narrative’s core message as a query, rank texts from a corpus by alignment with that narrative. This formulation requires no predefined label set and can accommodate emerging narratives. We repurpose three climate disinformation datasets (CARDS, Climate Obstruction, climate change subset of PolyNarrative) for retrieval evaluation and propose SpecFi, a framework that generates hypothetical documents to bridge the gap between abstract narrative descriptions and their concrete textual instantiations. SpecFi uses community summaries from graph-based community detection as few-shot examples for generation, achieving a MAP of 0.505 on CARDS without access to narrative labels. We further introduce narrative variance, an embedding-based difficulty metric, and show via partial correlation analysis that standard retrieval degrades on high-variance narratives (BM25 loses 63.4% of MAP), while SpecFi-CS remains robust (32.7% loss). Our analysis also reveals that unsupervised community summaries converge on descriptions close to expert-crafted taxonomies, suggesting that graph-based methods can surface narrative structure from unlabeled text.

Keywords: climate disinformation, narrative detection, retrieval task, SpecFi framework, hypothetical documents, community detection, embedding-based metric, graph-based methods
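
Mean average precision (MAP), the metric reported in the abstract, can be computed as follows. The sketch assumes each query's ranking covers the whole corpus, so every relevant text appears somewhere in the list; the toy relevance flags are invented and unrelated to CARDS.

```python
from typing import List


def average_precision(ranked_relevance: List[bool]) -> float:
    """Average precision for one query.

    ranked_relevance[i] is True if the text at rank i + 1 matches the
    query narrative. Assumes the ranking contains all relevant texts.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: List[List[bool]]) -> float:
    """Mean of per-query average precision (one query per narrative)."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


if __name__ == "__main__":
    # Two hypothetical narrative queries with their ranked relevance flags.
    print(mean_average_precision([[True, False, True], [False, True]]))
```

In the retrieval framing of the abstract, each narrative's core message is one query, so MAP averages over narratives; a narrative whose relevant texts are scattered deep in the ranking pulls the mean down sharply.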

124. ❌ On the Challenges and Opportunities of Learned Sparse Retrieval for Code

Authors: Simon Lupart, Maxime Louis, Thibault Formal, Hervé Déjean, Stéphane Clinchant Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22008v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

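Scoring rationale: The paper focuses on learned sparse retrieval models for code (SPLADE-Code). It is rated highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (10) because its core is learned sparse retrieval models; highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (10) because it studies code retrieval as a key component of RAG; and relevant to "Large Language Models OR LLMs OR Foundation Models" (8) because it concerns LLM-based software engineering systems. Other keywords such as SLMs, Scaling Laws, and Pre-training are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper studies the challenges and opportunities of learned sparse retrieval (LSR) for code and introduces the SPLADE-Code model family. With a lightweight one-stage training pipeline, it achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code), and a latency analysis shows sub-millisecond retrieval on a 1M-passage collection.

Abstract (Chinese translation)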

大规模代码库检索是现代基于大语言模型的软件工程系统的关键组成部分。现有方法主要依赖稠密嵌入模型，而学习型稀疏检索在代码领域仍基本未被探索。然而，将稀疏检索应用于代码具有挑战性，原因在于子词片段化、自然语言查询与代码之间的语义鸿沟、编程语言和子任务的多样性，以及代码文档的长度——这些因素可能损害稀疏性和检索延迟。我们提出了SPLADE-Code，这是首个专为代码检索定制的大规模学习型稀疏检索模型系列（参数规模6亿至80亿）。尽管采用轻量级单阶段训练流程，SPLADE-Code在10亿参数以下的检索模型中实现了最先进的性能（在MTEB Code基准上达到75.4分），并在更大规模上取得了有竞争力的结果（80亿参数达到79.0分）。研究表明，学习型扩展词元对于弥合词汇匹配与语义匹配至关重要；同时，延迟分析表明，学习型稀疏检索能够在包含100万个代码段的集合上实现亚毫秒级检索，且效果损失极小。

Abstract

Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.

Keywords: learned sparse retrieval, code retrieval, SPLADE-Code, large language models, retrieval-augmented generation, latency analysis, programming languages, semantic matching

125. ❌ Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe

Authors: Xixi Wu, Qianguo Sun, Ruiyang Zhang, Chao Song, Junlong Wu, Yiyan Qi, Hong Cheng Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21972v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

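Score rationale: The paper's core topic is reinforcement learning (RL) for autonomous agents driven by large language models (LLMs), specifically for long-horizon planning and tool use. It is therefore highly relevant (10) to "Large Language Models OR LLMs OR Foundation Models", "LLM Agents OR Autonomous Agents OR Agentic Workflow", and "Tool Use OR Function Calling OR API Tool Use". The paper does not address the specific techniques or applications behind the other keywords, such as MoE, SLMs, Scaling Laws, training methods (Pre-training, SFT, RLHF, etc.), inference optimization, multi-agent systems, model compression, or AI for science, so those keywords are scored 0.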
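
The pass threshold in the report header (5.0 + 0.8 × total weight) and the per-row scores can be checked with a short sketch. The threshold formula and the per-keyword maximum of 15 come from the header; the rule "row score = weight × relevance" is an assumption that is merely consistent with the rows listed above, and the function names are hypothetical.

```python
def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass score from the report header: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight

def paper_score(rows, max_per_keyword=15.0):
    """Sum per-keyword scores; weight * relevance is an assumed rule."""
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)

# 27 keywords with weight 1.0 each, as listed in the configuration.
threshold = round(pass_threshold(27 * 1.0), 1)

# The three rows with non-zero relevance, using the weight of 0.0 shown in the table.
rows_as_listed = [(0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
score = paper_score(rows_as_listed)
passed = score >= threshold
```

Under these assumptions the threshold is 26.6 and the listed rows sum to 0.0, matching the score line for this paper.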

!!! tip deepseek-chat TL;DR

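The paper systematically studies the design space of reinforcement learning for long-horizon planning and tool use in LLM-driven autonomous agents, distills a practical recipe, and achieves state-of-the-art performance on the TravelPlanner testbed.

Abstract (Chinese translation)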

强化学习（RL）对于将大型语言模型（LLMs）进化为能够进行长程规划的自主智能体至关重要，然而在复杂、多轮环境中扩展强化学习的实用方案仍不明确。本文利用TravelPlanner——一个需要工具协调以满足多维度约束的挑战性测试平台——进行了一项系统的实证研究。我们将智能体强化学习的设计空间分解为五个维度：奖励塑形、模型缩放、数据构成、算法选择和环境稳定性。通过控制实验，我们得出七项关键结论，例如：（1）奖励和算法的选择具有规模依赖性，较小模型受益于分阶段奖励和增强探索，而较大模型则能通过简单的密集奖励高效收敛；（2）约1K个训练样本配合难度均衡的混合数据，是在领域内和领域外性能均达到最佳平衡的关键点；（3）环境稳定性对于防止策略退化至关重要。基于我们提炼的方案，经强化学习训练的模型在TravelPlanner上实现了最先进的性能，显著超越了领先的大型语言模型。

Abstract

Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~ 1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.
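
As an illustration of the data-composition finding (roughly 1K training samples with a balanced difficulty mixture), the sketch below draws an equal number of tasks from each difficulty pool. The three level names, the task IDs, and the helper `balanced_mixture` are hypothetical; the abstract does not say how difficulty is defined or how the mixture is sampled.

```python
import random

def balanced_mixture(pools, total, seed=0):
    """Draw total // len(pools) examples from each difficulty pool."""
    rng = random.Random(seed)
    per_level = total // len(pools)
    sample = []
    for level, examples in pools.items():
        sample.extend((level, ex) for ex in rng.sample(examples, per_level))
    rng.shuffle(sample)
    return sample

# Hypothetical task IDs grouped by an assumed difficulty label.
pools = {
    "easy": [f"easy-{i}" for i in range(1000)],
    "medium": [f"medium-{i}" for i in range(1000)],
    "hard": [f"hard-{i}" for i in range(1000)],
}
train_set = balanced_mixture(pools, total=999)
```

Fixing the seed keeps the sampled set reproducible across runs, which makes controlled comparisons between reward or algorithm choices easier.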

Keywords: Reinforcement Learning, Large Language Models, Autonomous Agents, Long-horizon Planning, Tool Use, TravelPlanner, Agentic RL, State-of-the-art Performance

126. ❌ BHDD: A Burmese Handwritten Digit Dataset

Authors: Swan Htet Aung, Hein Htet, Htoo Say Wah Khaing, Thuya Myo Nyunt Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21966v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

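Score rationale: The paper "BHDD: A Burmese Handwritten Digit Dataset" focuses on building and evaluating a Burmese handwritten digit dataset, covering data collection, statistical analysis, and benchmarking with conventional machine learning models such as an MLP and CNNs. All of the given keywords concern large language models, the principles of deep learning techniques, scientific applications, or applications of large models across research domains, whereas this paper covers only basic computer vision and dataset construction. It involves no large models, no innovation in deep learning technique, and no scientific application, so the relevance for every keyword is 0.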

!!! tip deepseek-chat TL;DR

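The paper introduces the Burmese Handwritten Digit Dataset (BHDD), containing 87,561 images, and reports test accuracy of up to 99.83% with simple baseline models such as an MLP and CNNs.

Abstract (Chinese translation)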

本文介绍了缅甸手写数字数据集（Burmese Handwritten Digit Dataset, BHDD），该数据集包含87,561张分为十类的缅甸手写数字灰度图像。每张图像尺寸为28×28像素，遵循MNIST格式。训练集包含60,000个样本，各类别均匀分布；测试集包含27,561个样本，其类别频率与采集时自然出现的分布一致。超过150名不同年龄和背景的贡献者提供了样本。我们分析了数据集的类别分布、像素统计特征和形态学变化，并识别出因缅甸文字圆形特征而易混淆的数字对。使用简单基线模型（多层感知机MLP、双层卷积神经网络CNN，以及采用批量归一化和数据增强的改进CNN）分别达到了99.40%、99.75%和99.83%的测试准确率。BHDD数据集基于CC BY-SA 4.0协议公开，访问地址为：https://github.com/baseresearch/BHDD

Abstract

We introduce the Burmese Handwritten Digit Dataset (BHDD), a collection of 87,561 grayscale images of handwritten Burmese digits in ten classes. Each image is 28x28 pixels, following the MNIST format. The training set has 60,000 samples split evenly across classes; the test set has 27,561 samples with class frequencies as they arose during collection. Over 150 people of different ages and backgrounds contributed samples. We analyze the dataset’s class distribution, pixel statistics, and morphological variation, and identify digit pairs that are easily confused due to the round shapes of the Myanmar script. Simple baselines (an MLP, a two-layer CNN, and an improved CNN with batch normalization and augmentation) reach 99.40%, 99.75%, and 99.83% test accuracy respectively. BHDD is available under CC BY-SA 4.0 at https://github.com/baseresearch/BHDD
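
The figures in the abstract can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated above; the error counts are approximations, because the reported accuracies are rounded to two decimals, and the helper `approx_test_errors` is a hypothetical name.

```python
TRAIN_SIZE = 60_000
TEST_SIZE = 27_561
NUM_CLASSES = 10

total_images = TRAIN_SIZE + TEST_SIZE
train_per_class = TRAIN_SIZE // NUM_CLASSES  # training set is split evenly

def approx_test_errors(accuracy_percent, test_size=TEST_SIZE):
    """Approximate misclassified test images implied by a rounded accuracy."""
    return round(test_size * (100.0 - accuracy_percent) / 100.0)

reported = {"MLP": 99.40, "two-layer CNN": 99.75, "improved CNN": 99.83}
errors = {name: approx_test_errors(acc) for name, acc in reported.items()}
```

Read this way, the gap between the MLP and the improved CNN corresponds to roughly 165 versus 47 misclassified test images out of 27,561.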

Keywords: Burmese Handwritten Digit Dataset, handwritten digit recognition, dataset collection, computer vision, MNIST format, convolutional neural network, baseline models, test accuracy

127. ❌ SLURP-TN : Resource for Tunisian Dialect Spoken Language Understanding

Authors: Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21940v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

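Score rationale: The paper mainly concerns the creation of a spoken language understanding dataset for the Tunisian dialect and the development of baseline models, which is resource-building work for a specific language. The abstract mentions "deep neural network models" and "pre-trained language models", but all of the scoring keywords target specific techniques, methods, or application areas of large language models (LLMs), such as MoE, RLHF, RAG, and quantization, none of which this paper addresses. Its core is spoken language understanding and a dialect dataset rather than innovation in large-model techniques or scientific applications, so it is unrelated to all keywords.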

!!! tip deepseek-chat TL;DR

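The paper introduces SLURP-TN, a spoken language understanding (SLU) dataset for the Tunisian dialect, and develops automatic speech recognition and SLU baseline models on it, addressing the lack of SLU data resources for low-resource languages.

Abstract (Chinese translation)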

口语理解旨在从用户查询的语音表述中提取语义信息，是任务导向对话系统的核心组成部分。随着深度神经网络模型的显著进步与预训练语言模型的发展，口语理解领域已取得重大突破。然而，由于缺乏相关资源，仅有少数高资源语言受益于此进展。本文通过引入SLURP-TN数据集以缓解这一障碍。该数据集由55名母语者录制突尼斯方言语句构建而成，语句内容经人工翻译自SLURP的六个领域。最终形成的突尼斯方言口语理解数据集包含4165条语句，录音时长约5小时。基于此，我们开发了若干利用SLURP-TN的自动语音识别与口语理解模型。数据集及基线模型已公开于：https://huggingface.co/datasets/Elyadata/SLURP-TN。

Abstract

Spoken Language Understanding (SLU) aims to extract the semantic information from the speech utterance of user queries. It is a core component in a task-oriented dialogue system. With the spectacular progress of deep neural network models and the evolution of pre-trained language models, SLU has obtained significant breakthroughs. However, only a few high-resource languages have taken advantage of this progress due to the absence of SLU resources. In this paper, we seek to mitigate this obstacle by introducing SLURP-TN. This dataset was created by recording 55 native speakers uttering sentences in Tunisian dialect, manually translated from six SLURP domains. The result is an SLU Tunisian dialect dataset that comprises 4165 sentences recorded into around 5 hours of acoustic material. We also develop a number of Automatic Speech Recognition and SLU models exploiting SLURP-TN. The dataset and baseline models are available at: https://huggingface.co/datasets/Elyadata/SLURP-TN.

Keywords: Spoken Language Understanding, Tunisian dialect, SLU dataset, Automatic Speech Recognition, low-resource language, pre-trained language models, baseline models, speech utterance

128. ❌ Ara-Best-RQ: Multi Dialectal Arabic SSL

Authors: Haroun Elleuch, Ryan Whetten, Salima Mdhaffar, Yannick Estève, Fethi Bougares Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21900v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

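Score rationale: The paper focuses on self-supervised learning models for multi-dialectal Arabic speech processing. It is highly relevant only to "Pre-training OR Continual Pre-training OR Domain Adaptation" (it involves pre-training and domain adaptation) and unrelated to the other keywords, which mainly target topics such as large language models, reasoning, alignment, and compression.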

!!! tip deepseek-chat TL;DR

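The paper presents Ara-BEST-RQ, a family of self-supervised learning models for multi-dialectal Arabic speech processing, pre-trained at up to 600M parameters. The models achieve state-of-the-art performance on dialect identification, and the paper shows that pre-training targeted at Arabic dialects improves downstream performance more than multilingual models or monolingual models trained on non-Arabic data.

Abstract (Chinese translation)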

我们推出Ara-BEST-RQ系列模型，这是一组专为多方言阿拉伯语语音处理设计的自监督学习模型。通过利用5,640小时爬取的创作共用许可语音数据，并结合公开可用数据集，我们基于Conformer架构预训练了参数量达6亿的BEST-RQ模型。该系列模型在方言识别和自动语音识别任务上进行了评估，在方言识别任务中实现了最先进的性能，且参数量少于同类竞争模型。我们证明，相较于使用非阿拉伯语数据训练的多语言或单语言模型，针对阿拉伯语方言家族进行定向预训练能显著提升下游任务性能。所有模型、代码及预处理数据集将公开发布，以支持阿拉伯语语音技术研究的可复现性与持续探索。

Abstract

We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech and combining it with publicly available datasets, we pre-train conformer-based BEST-RQ models up to 600M parameters. Our models are evaluated on dialect identification (DID) and automatic speech recognition (ASR) tasks, achieving state-of-the-art performance on the former while using fewer parameters than competing models. We demonstrate that family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual or monolingual models trained on non-Arabic data. All models, code, and pre-processed datasets will be publicly released to support reproducibility and further research in Arabic speech technologies.
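
The abstract does not describe how BEST-RQ derives its training targets, so the sketch below follows the general BEST-RQ recipe rather than this paper's code: a frozen random projection and a frozen random codebook turn each speech feature frame into a discrete label, and the encoder is then trained to predict the labels of masked frames. The dimensions, the Gaussian initialization, and the class name `RandomProjectionQuantizer` are simplifying assumptions.

```python
import math
import random

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class RandomProjectionQuantizer:
    """Frozen random projection + frozen random codebook (never trained)."""

    def __init__(self, input_dim, code_dim, codebook_size, seed=0):
        rng = random.Random(seed)
        self.projection = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
                           for _ in range(code_dim)]
        self.codebook = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                         for _ in range(codebook_size)]

    def quantize(self, frame):
        """Return the index of the nearest codebook entry for one feature frame."""
        projected = l2_normalize([sum(w * x for w, x in zip(row, frame))
                                  for row in self.projection])
        # For unit vectors, the nearest neighbour maximizes the dot product.
        sims = [sum(p * c for p, c in zip(projected, code)) for code in self.codebook]
        return max(range(len(sims)), key=sims.__getitem__)

quantizer = RandomProjectionQuantizer(input_dim=8, code_dim=4, codebook_size=16)
frame_rng = random.Random(1)
frames = [[frame_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
targets = [quantizer.quantize(f) for f in frames]
```

Because neither the projection nor the codebook is learned, the targets are fixed for a given seed, which avoids training a separate quantizer.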

Keywords: self-supervised learning, Arabic speech processing, multi-dialectal, pre-training, conformer-based, dialect identification, automatic speech recognition, state-of-the-art

129. ❌ Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning

Authors: Xi Xuan, Wenxin Zhang, Zhiyu Li, Jennifer Williams, Ville Hautamäki, Tomi H. Kinnunen Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21875v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

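Score rationale: The paper focuses on speech deepfake source verification, studying how Chebyshev polynomials and Riemannian metric learning can disentangle speaker traits from source-generator traits. It covers speech processing, metric learning, feature disentanglement, and deepfake detection, but does not involve large language models (LLMs), innovation in deep learning technique, or any technique among the scoring keywords (such as MoE, Scaling Laws, RLHF, RAG, or quantization). The paper belongs to speech AI and multimedia security and has no direct connection to the topics in the keyword list, such as large models, deep learning principles, or AI for Science.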

!!! tip deepseek-chat TL;DR

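The paper investigates how speaker traits affect speech deepfake source verification and proposes a speaker-disentangled metric learning framework that combines Chebyshev polynomials with Riemannian metric learning, improving the discriminability of source features.

Abstract (Chinese translation)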

语音深度伪造源验证系统旨在判定两段合成语音是否源自同一生成器，其通常假设所得源嵌入向量独立于说话人特征。然而，该假设尚未得到验证。本文首先探究了说话人因素对源验证的影响，并提出一种结合两种新型损失函数的说话人解耦度量学习框架。第一种损失函数利用切比雪夫多项式缓解解耦优化过程中的梯度不稳定问题；第二种损失函数将源嵌入与说话人嵌入投影至双曲空间，借助黎曼度量距离减少说话人信息干扰，从而学习更具区分性的源特征。在MLAAD基准数据集上的实验结果表明，该框架在四种专为源-说话人解耦场景设计的新评估协议下均表现优异。相关代码、评估协议及演示网站已公开于https://github.com/xxuan-acoustics/RiemannSD-Net。

Abstract

Speech deepfake source verification systems aim to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We propose a speaker-disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages Chebyshev polynomials to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on the MLAAD benchmark, evaluated under four newly proposed protocols designed for source-speaker disentanglement scenarios, demonstrate the effectiveness of the SDML framework. The code, evaluation protocols and demo website are available at https://github.com/xxuan-acoustics/RiemannSD-Net.
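
The two mathematical building blocks named in the abstract can be written down in a few lines. This is a sketch of the standard definitions only: the abstract does not say how the Chebyshev polynomials enter the loss or which model of hyperbolic space is used, so the Poincaré ball below is an assumption, and the function names are hypothetical.

```python
import math

def chebyshev_t(n, x):
    """Chebyshev polynomial of the first kind via T_n = 2x*T_{n-1} - T_{n-2}."""
    t_prev, t_curr = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

# The same Euclidean gap (0.1) measured near the origin and near the boundary.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

The same Euclidean gap yields a larger hyperbolic distance near the boundary of the ball, a property often used to separate embeddings that would otherwise crowd together in Euclidean space.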

Keywords: speech deepfake, source verification, speaker disentanglement, Chebyshev polynomial, Riemannian metric learning, metric learning, hyperbolic space, MLAAD benchmark

130. ❌ Riding Brainwaves in LLM Space: Understanding Activation Patterns Using Individual Neural Signatures

Authors: Ajan Subramanian, Sumukh Bettadapura, Rohan Sathish Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21847v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

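Score rationale: The paper's core topic is the application of large language models (LLMs) to neuroscience: it explores how the hidden states of a frozen LLM encode individual-specific electroencephalography (EEG) signals, an innovative application of large models in science (neuroscience/bioinformatics). It is therefore highly relevant (10) to "Large Language Models OR LLMs OR Foundation Models" and strongly related (8) to "AI for Science OR Bioinformatics OR Cheminformatics". Because the study analyzes the model's internal activation patterns, it is somewhat related (5) to "Mechanistic Interpretability OR Explainable AI". The paper does not involve the specific techniques behind the other keywords (such as MoE, training methods, inference optimization, or agents), so those keywords are scored 0.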

!!! tip deepseek-chat TL;DR

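The study examines whether and how the hidden states of frozen large language models (such as Qwen 2.5 7B) encode individual-specific electroencephalography (EEG) signals. It finds that the models' deep layers contain stable, person-specific neural directions, providing a geometric foundation for EEG-driven personalization.

Abstract (Chinese translation)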

消费级脑电图设备正逐步进入日常穿戴设备领域，从耳塞到头带均可见其应用，这引发了一个问题：语言模型能否适配个体神经响应？我们通过探究冻结的大语言模型表征是否编码了人特异性脑电信号——即激活空间中能预测特定个体大脑活动而非他人活动的方向——来验证这一设想。利用30名参与者阅读自然主义句子时采集的词汇级脑电图数据（ZuCo语料库），我们为每位参与者训练独立的线性探针，将冻结的Qwen 2.5 7B模型的隐藏状态映射到个体脑电功率。在所有测试的脑电特征上，人特异性探针均优于单一群体探针；针对高伽马功率，人特异性探针达到ρ=0.183，较群体探针（ρ=0.020，p<10^-4）提升九倍。以注视次数作为阴性对照则未显示人特异性优势（p=0.360），因为注视次数反映的是词汇长度与频率而非个体认知特征。个体方向具有时间稳定性（分半余弦相似度=0.824），在个体间不可迁移（自身ρ=0.369 vs. 他人ρ=0.143，p<10^-19），且与共享群体信号相区别：当群体成分被移除后，人特异性探针仍保持预测能力。该人特异性信号集中体现在模型的深层网络，随深度增加持续增强，在28层中的第24层达到峰值。这一结论在不同架构模型（LLaMA 3.1 8B）中保持一致，且通过词汇级混杂因素控制检验。研究表明，冻结的语言模型在其深层网络中蕴含稳定的人特异性神经方向，为基于脑电的个性化应用提供了几何学基础。

Abstract

Consumer-grade EEG is entering everyday devices, from earbuds to headbands, raising the question of whether language models can be adapted to individual neural responses. We test this by asking whether frozen LLM representations encode person-specific EEG signals, directions in activation space that predict one person’s brain activity but not another’s. Using word-level EEG from 30 participants reading naturalistic sentences (ZuCo corpus), we train a separate linear probe for each person, mapping hidden states from a frozen Qwen 2.5 7B to that individual’s EEG power. Person-specific probes outperform a single population probe on every EEG feature tested; for high-gamma power, the person-specific probe achieves rho = 0.183, a ninefold improvement over the population probe (rho = 0.020, p < 10^-4). A negative control, fixation count, shows no person-specific advantage (p = 0.360); fixation count reflects word length and frequency rather than individual cognition. The individual directions are temporally stable (split-half cosine = 0.824), non-transferable across people (self rho = 0.369 vs. other rho = 0.143, p < 10^-19), and distinct from the shared population signal: person-specific probes retain predictive power after the population component is removed. The person-specific signal concentrates in the model’s deep layers, rising consistently with depth and peaking at Layer 24 of 28. The results are consistent across architectures (LLaMA 3.1 8B) and survive word-level confound controls. Frozen language models contain stable, person-specific neural directions in their deep layers, providing a geometric foundation for EEG-driven personalization.
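
The two evaluation quantities quoted in the abstract, a correlation rho between probe predictions and EEG power and a split-half cosine between probe directions, can be sketched as follows. This assumes rho denotes Spearman rank correlation, which the abstract does not state; all numbers below are made-up toy values, not data from the paper.

```python
import math

def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy probe predictions vs. measured EEG power for five words.
predicted = [0.2, 0.5, 0.1, 0.9, 0.4]
measured = [1.0, 3.0, 2.0, 8.0, 2.5]
rho = spearman_rho(predicted, measured)

# Toy probe weight vectors fit on two halves of one person's data.
half_a = [1.0, 2.0, 0.5]
half_b = [2.0, 4.0, 1.0]
stability = cosine(half_a, half_b)
```

A cosine close to 1 between the two half-data probes indicates that the learned direction is stable, which is the sense in which the abstract reports a split-half cosine of 0.824.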

Keywords: Large Language Models, EEG, Neural Signatures, Personalization, Activation Patterns, Linear Probe, Qwen 2.5, ZuCo Corpus

131. ❌ Politics of Questions in News: A Mixed-Methods Study of Interrogative Stances as Markers of Voice and Power

Authors: Bros Victor, Barbini Matilde, Gerard Patrick, Gatica-Perez Daniel Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21823v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

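Score rationale: The paper studies the political function of interrogatives in news discourse, using mixed methods to analyze a French-language digital news corpus; it belongs to computational linguistics, discourse analysis, and journalism studies. Its content does not involve large models, deep learning, the principles of AI techniques, or applications of AI in science, and it is unrelated to every keyword (large-model techniques, AI methods, model training and optimization, inference acceleration, AI applications, and so on), so the relevance for every keyword is 0.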

!!! tip deepseek-chat TL;DR

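Using mixed methods, the paper studies the political function of interrogatives in contemporary French-language digital news. It finds that interrogatives mainly introduce or organize issues, are usually answered within the same article, and tend to foreground already prominent individuals and places, exhibiting strong personalization.

Abstract (Chinese translation)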

新闻话语中的疑问句已在语言学和会话分析领域得到研究，但主要集中在广播访谈及规模较小、通常为英语的语料库中，而大规模计算新闻研究很少区分疑问句与陈述句或其功能差异。本文通过混合方法研究当代法语数字新闻中的“提问政治”，将上述研究方向相结合。基于2023年1月至2024年6月期间发布的超百万篇文章，我们自动检测疑问立场、近似划分其功能类型，并在存在时定位文本回答，同时将这些量化指标与基于疑问句语义和语用理论进行定性标注的子语料库相关联。疑问句分布稀疏但呈现系统性模式：它们主要用以引入或组织议题，其余多数案例为信息寻求型或回声式疑问，而显性的引导性或附加疑问句则较为罕见。尽管其密度和组合因媒体机构和主题而异，我们的探索性分析表明，疑问句绝大多数在同一文章内得到承接，且通常与后续类回答片段相关联——这些回答多呈现于记者的叙事声音中，较少通过引语呈现。疑问语境密集出现具名个人、组织和地点，而公众及广泛社会群体的提及频率显著较低，这表明疑问话语倾向于凸显已具显著性的行动者与地点，从而呈现出强烈的个人化特征。我们展示了如何在语料库规模上操作化疑问立场、文本承接和声音特征，并论证计算方法与语用学、社会学视角的结合，有助于解释提问实践如何建构当代新闻话语。

Abstract

Interrogatives in news discourse have been examined in linguistics and conversation analysis, but mostly in broadcast interviews and relatively small, often English-language corpora, while large-scale computational studies of news rarely distinguish interrogatives from declaratives or differentiate their functions. This paper brings these strands together through a mixed-methods study of the “Politics of Questions” in contemporary French-language digital news. Using over one million articles published between January 2023 and June 2024, we automatically detect interrogative stances, approximate their functional types, and locate textual answers when present, linking these quantitative measures to a qualitatively annotated subcorpus grounded in semantic and pragmatic theories of questions. Interrogatives are sparse but systematically patterned: they mainly introduce or organize issues, with most remaining cases being information-seeking or echo-like, while explicitly leading or tag questions are rare. Although their density and mix vary across outlets and topics, our heuristic suggests that questions are overwhelmingly taken up within the same article and usually linked to a subsequent answer-like span, most often in the journalist’s narrative voice and less often through quoted speech. Interrogative contexts are densely populated with named individuals, organizations, and places, whereas publics and broad social groups are mentioned much less frequently, suggesting that interrogative discourse tends to foreground already prominent actors and places and thus exhibits strong personalization. We show how interrogative stance, textual uptake, and voice can be operationalized at corpus scale, and argue that combining computational methods with pragmatic and sociological perspectives can help account for how questioning practices structure contemporary news discourse.

Keywords: interrogatives, news discourse, computational linguistics, corpus analysis, French-language news, questioning practices, political communication, mixed-methods study

132. ❌ TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression

Authors: Li Wang, Yandong Wang, Xin Yu, Kui Zhang, Tianhao Peng, Wenjun Wu Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21663v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

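Score rationale: The paper's core topic is multi-turn reinforcement learning for large language models (LLMs) processing long contexts. It is highly relevant (10) to "Large Language Models OR LLMs OR Foundation Models" and "Context Window Extension OR Long Context LLMs", because it explicitly targets LLMs handling long documents that exceed the context window and proposes TAMTRL to improve long-context processing. The other keywords, such as MoE, SLMs, Scaling Laws, training techniques (Pre-training, SFT, RLHF, etc.), inference optimization (KV Cache, Speculative Decoding), agent systems, and AI for science applications, are not mentioned in the title or abstract and are therefore scored 0.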

!!! tip deepseek-chat TL;DR

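The paper addresses the credit assignment problem in multi-turn training that arises when large language models must process long documents chunk by chunk because of context window limits. It proposes Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL), which provides fine-grained, self-supervised learning signals for each memory update, and reports consistent gains over strong baselines on seven long-context benchmarks.

Abstract (Chinese translation)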

大型语言模型（LLM）的快速发展使其在广泛任务中取得了显著性能提升。然而，当处理超出模型上下文窗口限制的长文档时，无法单次处理全部上下文，因此必须进行分块处理。这需要多轮读取不同文本块并更新记忆。然而，监督信号通常仅由最终结果提供，这使得在多轮训练场景中难以评估每一轮记忆更新的质量，从而引出了时序信用分配难题。现有方法（如LLM-as-a-judge或过程奖励模型）会产生高昂计算开销且存在估计噪声。为更好地解决多轮记忆训练中的信用分配问题，我们提出面向多轮强化学习的教师对齐奖励重塑方法（Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning, TAMTRL）。TAMTRL利用相关文档作为教师信号，通过将其与每轮模型输入对齐，并以自监督方式通过归一化概率分配奖励。这为每次记忆更新提供了细粒度学习信号，从而提升了长上下文处理能力。在七个长上下文基准测试中，使用多种不同规模模型进行的实验表明，TAMTRL始终优于强基线方法，验证了其有效性。代码发布于https://anonymous.4open.science/r/TAMTRL-F1F8。

摘要 (Abstract)

The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model’s context window limit, the entire context cannot be processed in a single pass, making chunk-wise processing necessary. This requires multiple turns to read different chunks and update memory. However, supervision is typically provided only by the final outcome, which makes it difficult to evaluate the quality of memory updates at each turn in the multi-turn training setting. This introduces a temporal credit assignment challenge. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at https://anonymous.4open.science/r/TAMTRL-F1F8.

Keywords: large language models, long-context processing, multi-turn reinforcement learning, credit assignment, memory updates, teacher-aligned reward reshaping, context window limit, chunk-wise processing

133. ❌ A Comparative Analysis of LLM Memorization at Statistical and Internal Levels: Cross-Model Commonalities and Model-Specific Signatures

Authors: Bowen Chen, Namgi Han, Yusuke Miyao Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21658v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

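Scoring rationale: The paper's core topic is the memorization mechanism of LLMs, a contribution to the technical principles of large models, so it is highly relevant to "Large Language Models" (10 points). It analyzes middle layers and attention heads to understand the memorization process, which falls under "Mechanistic Interpretability" (10 points). It finds that the memorization rate scales log-linearly with model size, which gives some connection to "Scaling Laws" (5 points). The analysis is carried out on pre-trained models, which gives some connection to "Pre-training" (5 points). The other keywords, such as MoE, SFT, RAG, and inference acceleration, are not covered and score 0.

!!! tip deepseek-chat TL;DR

By comparing several LLM series (Pythia, OpenLLaMa, and others), this study analyzes what is shared and what is model-specific in memorization behavior at the statistical and internal levels. It reports log-linear scaling of the memorization rate with model size, shared patterns in memorized sequences, and family-specific distributions of important attention heads.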

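Abstract (translation)

Memorization is a fundamental building block of intelligence for both humans and large language models. Yet while LLM performance has improved rapidly, our understanding of the memorization mechanism lags behind. Because access to LLM pre-training data is limited, earlier studies mostly focused on a single model series, so observations across series remained isolated and it was hard to tell which findings are general and which are specific. This study brings together multiple model series (Pythia, OpenLLaMa, StarCoder, OLMo1/2/3) and analyzes their shared or unique memorization behavior at the statistical level and at the level of internal mechanisms, connecting individual observations while revealing new findings. At the statistical level, we find that the memorization rate grows log-linearly with model size and that memorized sequences can be compressed further. Deeper analysis shows that memorized sequences follow shared frequency and domain distribution patterns across models, although each model also shows individual differences within those patterns. At the internal level, we find that LLMs can remove certain injected perturbations, while memorized sequences are more sensitive to such perturbations. By decoding middle layers and running attention head ablation experiments, we reveal a general decoding mechanism in the memorization process as well as shared important attention heads. The distribution of those heads, however, differs across model series and shows distinctive family-level characteristics. By integrating several kinds of experiments and revealing new findings, this study lays the groundwork for a general and fundamental understanding of memorization in LLMs.

Abstract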

Memorization is a fundamental component of intelligence for both humans and LLMs. However, while LLM performance scales rapidly, our understanding of memorization lags. Due to limited access to the pre-training data of LLMs, most previous studies focus on a single model series, leading to isolated observations among series, making it unclear which findings are general or specific. In this study, we collect multiple model series (Pythia, OpenLLaMa, StarCoder, OLMo1/2/3) and analyze their shared or unique memorization behavior at both the statistical and internal levels, connecting individual observations while showing new findings. At the statistical level, we reveal that the memorization rate scales log-linearly with model size, and memorized sequences can be further compressed. Further analysis demonstrated a shared frequency and domain distribution pattern for memorized sequences. However, different models also show individual features under the above observations. At the internal level, we find that LLMs can remove certain injected perturbations, while memorized sequences are more sensitive. By decoding middle layers and attention head ablation, we revealed the general decoding process and shared important heads for memorization. However, the distribution of those important heads differs between families, showing a unique family-level feature. Through bridging various experiments and revealing new findings, this study paves the way for a universal and fundamental understanding of memorization in LLM.

Keywords: LLM memorization, model series comparison, statistical analysis, internal mechanisms, attention head ablation, scaling laws, mechanistic interpretability, pre-training data
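
The abstract above states that the memorization rate scales log-linearly with model size. As a rough illustration only, the sketch below fits `rate = a + b * log10(size)` by ordinary least squares. The reading of "log-linear" as linear in `log10(size)`, the function name `fit_log_linear`, and all numbers are assumptions made for this example; none of the values come from the paper.

```python
import math


def fit_log_linear(sizes, rates):
    """Least-squares fit of rate = a + b * log10(size); returns (a, b)."""
    xs = [math.log10(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical parameter counts; the rates are synthetic and generated
# from a known line so that the fit can be checked, not measured data.
sizes = [7e7, 1.6e8, 4.1e8, 1e9, 2.8e9, 6.9e9]
rates = [0.002 + 0.004 * math.log10(s) for s in sizes]

a, b = fit_log_linear(sizes, rates)
print(f"intercept={a:.4f}, slope per decade={b:.4f}")
```

Under this reading, the slope `b` is the change in memorization rate for each tenfold increase in model size, which makes trends from different model families directly comparable.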

134. ❌ DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing

Authors: Nasser-Eddine Monir, Zakaria Baou Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21571v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

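Scoring rationale: The paper's main contribution is DATASHI, a parallel English-Tashlhiyt corpus for low-resource language processing tasks, especially orthography normalization. Its core relevance is twofold: 1) it explicitly evaluates state-of-the-art large language models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max), so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points); 2) it shows improvements from zero-shot to few-shot prompting, which directly concerns "In-context Learning OR Many-shot Learning" (10 points). The other keywords, such as MoE, SLMs, Scaling Laws, the various training techniques (pre-training, fine-tuning, alignment, etc.), inference optimization, agent systems, and model compression, are neither covered nor mentioned and are scored 0. The paper belongs to low-resource language processing rather than to AI-for-science applications such as bioinformatics or cheminformatics, so the "AI for Science" keyword also scores 0.

!!! tip deepseek-chat TL;DR

The study builds DATASHI, a parallel English-Tashlhiyt corpus, to address orthography normalization for the low-resource Amazigh languages. Evaluating several large language models, it finds that Gemini-2.5-Pro achieves the lowest error rates under few-shot prompting and generalizes well across languages.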

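Abstract (translation)

DATASHI is a new parallel English-Tashlhiyt corpus that fills a critical gap in computational resources for the Amazigh languages. The corpus contains 5,000 sentence pairs, including a 1,500-sentence subset that comes in both an expert-standardized version and a non-standard user-generated version, which supports systematic study of orthographic diversity and normalization. This dual design supports text-based NLP tasks such as tokenization, translation, and text normalization, and also lays a foundation for read-speech data collection and multimodal alignment research. A comprehensive evaluation with frontier large language models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) shows clear gains when moving from zero-shot to few-shot prompting, with Gemini-2.5-Pro achieving the best word-level and character-level error rates and showing strong cross-lingual generalization. A fine-grained analysis of edit operations (deletions, substitutions, and insertions) across phonological classes (geminates, emphatics, uvulars, and pharyngeals) further reveals how differently the models respond to marked Tashlhiyt features and offers new diagnostic insights for low-resource Amazigh orthography normalization.

Abstract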

DATASHI is a new parallel English-Tashlhiyt corpus that fills a critical gap in computational resources for Amazigh languages. It contains 5,000 sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, enabling systematic study of orthographic diversity and normalization. This dual design supports text-based NLP tasks - such as tokenization, translation, and normalization - and also serves as a foundation for read-speech data collection and multimodal alignment. Comprehensive evaluations with state-of-the-art Large Language Models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) show clear improvements from zero-shot to few-shot prompting, with Gemini-2.5-Pro achieving the lowest word and character-level error rates and exhibiting robust cross-lingual generalization. A fine-grained analysis of edit operations - deletions, substitutions, and insertions - across phonological classes (geminates, emphatics, uvulars, and pharyngeals) further highlights model-specific sensitivities to marked Tashlhiyt features and provides new diagnostic insights for low-resource Amazigh orthography normalization.

Keywords: parallel corpus, low-resource language processing, orthography normalization, Tashlhiyt, Amazigh languages, large language models, few-shot prompting, cross-lingual generalization
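
The abstract above reports word-level and character-level error rates and analyzes deletions, substitutions, and insertions. The sketch below shows the standard way such rates are usually computed: Levenshtein edit distance divided by the reference length. The paper's exact evaluation script is not given here, so the function names and the placeholder English strings are assumptions for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra item in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_items, hyp_items):
    """Edit distance normalized by the reference length."""
    if not ref_items:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_items, hyp_items) / len(ref_items)


def cer(reference, hypothesis):
    """Character error rate."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate, splitting on whitespace."""
    return error_rate(reference.split(), hypothesis.split())


# Placeholder strings, not taken from the corpus.
print(cer("kitten", "sitting"))
print(wer("the cat sat", "the cat sit"))
```

Because the denominator is the reference length, both rates can exceed 1.0 when the hypothesis contains many insertions.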

135. ❌ CatRAG: Functor-Guided Structural Debiasing with Retrieval Augmentation for Fair LLMs

Authors: Ravi Ranjan, Utkarsh Grover, Mayur Akewar, Xiaomin Lin, Agoritsa Polyzou Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21524v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

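Scoring rationale: The paper's core topic is debiasing LLMs, so it is highly relevant to "Large Language Models" (10 points). It combines category theory with RAG in a novel way, so it is highly relevant to "Retrieval-Augmented Generation" (10 points). The other keywords, such as MoE, quantization, and inference acceleration, are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address fairness-related bias in LLMs, the paper proposes the CatRAG framework, which combines category-theory-guided structural debiasing with retrieval-augmented generation. On the BBQ benchmark it substantially improves accuracy and sharply reduces bias scores.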

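Abstract (translation)

Large language models (LLMs) are deployed in high-stakes settings but can exhibit demographic, gender, and geographic biases that undermine fairness and trust. Existing debiasing methods, including embedding-space projections, prompt-based steering, and causal interventions, usually act at only a single stage of the pipeline, which leaves bias mitigation incomplete and produces brittle utility trade-offs under distribution shift. We propose the CatRAG debiasing framework, a dual-pronged approach that combines a functor with structural debiasing guided by Retrieval-Augmented Generation (RAG). The functor component uses category-theoretic structure to induce a principled, structure-preserving projection that suppresses bias-associated directions in the embedding space while retaining task-relevant semantics. On the Bias Benchmark for Question Answering (BBQ) with three open-source LLMs (Meta Llama-3, OpenAI GPT-OSS, and Google Gemma-3), CatRAG achieves state-of-the-art results: accuracy improves by up to 40% over the corresponding base models and by more than 10% over prior debiasing methods, while bias scores fall to near zero (from 60% for the base models) across gender, nationality, race, and intersectional subgroups.

Abstract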

Large Language Models (LLMs) are deployed in high-stakes settings but can show demographic, gender, and geographic biases that undermine fairness and trust. Prior debiasing methods, including embedding-space projections, prompt-based steering, and causal interventions, often act at a single stage of the pipeline, resulting in incomplete mitigation and brittle utility trade-offs under distribution shifts. We propose CatRAG Debiasing, a dual-pronged framework that integrates functor with Retrieval-Augmented Generation (RAG) guided structural debiasing. The functor component leverages category-theoretic structure to induce a principled, structure-preserving projection that suppresses bias-associated directions in the embedding space while retaining task-relevant semantics. On the Bias Benchmark for Question Answering (BBQ) across three open-source LLMs (Meta Llama-3, OpenAI GPT-OSS, and Google Gemma-3), CatRAG achieves state-of-the-art results, improving accuracy by up to 40% over the corresponding base models and by more than 10% over prior debiasing methods, while reducing bias scores to near zero (from 60% for the base models) across gender, nationality, race, and intersectional subgroups.

Keywords: Large Language Models, Debiasing, Retrieval-Augmented Generation, Fairness, Bias Mitigation, Category Theory, Structural Debiasing, BBQ Benchmark
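
The abstract above describes a projection that suppresses bias-associated directions in the embedding space. The sketch below is not CatRAG's functor-based construction, which the abstract does not specify. It shows only the plain linear-algebra idea that embedding-space projection methods build on: subtracting a vector's component along a single direction. The function name and the toy vectors are assumptions for illustration.

```python
import math


def remove_direction(vec, direction):
    """Return vec with its component along `direction` subtracted."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    dot = sum(v * u for v, u in zip(vec, unit))
    return [v - dot * u for v, u in zip(vec, unit)]


# Toy 3-dimensional "embedding" and a hypothetical bias direction.
embedding = [3.0, 4.0, 5.0]
bias_direction = [0.0, 0.0, 2.0]

cleaned = remove_direction(embedding, bias_direction)
print(cleaned)
```

After the projection, the result is orthogonal to the chosen direction while the remaining coordinates are unchanged, which is the sense in which such a projection can suppress one direction and keep the rest of the representation.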

136. ❌ Generalizable Self-Evolving Memory for Automatic Prompt Optimization

Authors: Guanbao Liang, Yuanchen Bei, Sheng Zhou, Yuheng Qin, Huan Zhou, Bingxin Jia, Bin Li, Jiajun Bu Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21520v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

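Scoring rationale: The paper's core topic is MemAPO, an automatic prompt optimization framework, so it is highly relevant to LLMs (10 points). It works with reasoning trajectories and error patterns, which gives a fairly strong connection to Chain of Thought and System 2 Thinking (8 points). It emphasizes self-reflection and memory updates, so it is highly relevant to Self-Correction (10 points). The other keywords, such as MoE, SLMs, and RAG, are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes the MemAPO framework, which uses a dual-memory mechanism and self-evolving experience accumulation to address the poor generalization and lack of knowledge accumulation in automatic prompt optimization. Experiments show that it outperforms existing methods on multiple benchmarks while reducing optimization cost.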

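Abstract (translation)

Automatic prompt optimization is an effective way to adapt large language models to downstream tasks, but existing methods usually search for one specific prompt for a fixed task. This paradigm limits generalization across heterogeneous queries and prevents the model from accumulating reusable prompting knowledge over time. This paper proposes MemAPO, a memory-driven framework that recasts prompt optimization as generalizable, self-evolving experience accumulation. MemAPO uses a dual-memory mechanism: it distills successful reasoning trajectories into reusable strategy templates and organizes incorrect generations into structured error patterns that capture recurring failure types. Given a new prompt, the framework retrieves both relevant strategies and failure patterns and composes a prompt that promotes effective reasoning while avoiding known mistakes. Through iterative self-reflection and memory editing, MemAPO keeps updating its memory, so prompt optimization improves over time instead of starting over for each task. Experiments on diverse benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.

Abstract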

Automatic prompt optimization is a promising approach for adapting large language models (LLMs) to downstream tasks, yet existing methods typically search for a specific prompt specialized to a fixed task. This paradigm limits generalization across heterogeneous queries and prevents models from accumulating reusable prompting knowledge over time. In this paper, we propose MemAPO, a memory-driven framework that reconceptualizes prompt optimization as generalizable and self-evolving experience accumulation. MemAPO maintains a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates while organizing incorrect generations into structured error patterns that capture recurrent failure modes. Given a new prompt, the framework retrieves both relevant strategies and failure patterns to compose prompts that promote effective reasoning while discouraging known mistakes. Through iterative self-reflection and memory editing, MemAPO continuously updates its memory, enabling prompt optimization to improve over time rather than restarting from scratch for each task. Experiments on diverse benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.

Keywords: Automatic Prompt Optimization, Large Language Models, Memory-driven Framework, Self-evolving Experience, Reasoning Trajectories, Error Patterns, Self-reflection, Generalizable Prompting

137. ❌ Triangulating Temporal Dynamics in Multilingual Swiss Online News

Authors: Bros Victor, Dufraisse Evan, Popescu Adrian, Gatica-Perez Daniel Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21519v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

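Scoring rationale: This paper analyzes temporal dynamics in multilingual Swiss online news and belongs to media studies and computational social science. It uses NLP techniques such as named entity recognition and sentiment analysis, but every keyword focuses either on large models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, quantization, inference acceleration, etc.) or on specific AI-for-science applications. The paper does not address large-model principles, training methods, optimization techniques, or the use of AI in bioinformatics or cheminformatics, so the relevance is 0 for all keywords.

!!! tip deepseek-chat TL;DR

Using a triangulated methodology that combines quantitative analysis with qualitative interpretation, the paper studies temporal trends in digital news across Switzerland's three main linguistic regions (French, German, and Italian). It reveals distinct temporal patterns and shows how linguistic and cultural context influences news reporting.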

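Abstract (translation)

Analyzing news coverage in multilingual societies can provide valuable insight into the dynamics of public discourse and the development of collective narratives, yet comprehensive studies that account for both linguistic and cultural diversity within national media ecosystems remain limited, especially in complex contexts such as Switzerland. This paper uses a triangulated methodology that combines quantitative analysis with qualitative insight to study temporal trends in digital media across Switzerland's three main linguistic regions (French, German, and Italian). We collected and processed more than 1.7 million news articles, applying lexical metrics, named entity recognition with Wikidata-based linking, targeted sentiment analysis, and consensus-based change-point detection. To enable principled cross-language comparison and to connect with theories of domestication and cultural proximity, we derive domestication profiles and a proximity salience ratio. Our analysis covers thematic, recurrent, and singular events. By integrating quantitative data with qualitative interpretation, we offer new insights into the dynamics of Swiss digital media and demonstrate the usefulness of triangulation in media studies. The findings reveal distinct temporal patterns and highlight how linguistic and cultural context influences news reporting. The approach provides a framework that can be applied to other multilingual or culturally diverse media environments and contributes to a deeper understanding of how linguistic and cultural factors shape the news.

Abstract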

Analyzing news coverage in multilingual societies can offer valuable insights into the dynamics of public discourse and the development of collective narratives, yet comprehensive studies that account for linguistic and cultural diversity within national media ecosystems remain limited, particularly in complex contexts such as Switzerland. This paper studies temporal trends in Swiss digital media across the country’s three main linguistic regions, French, German, and Italian, using a triangulated methodology that combines quantitative analyses with qualitative insights. We collected and processed over 1.7 million news articles, applying lexical metrics, named entity recognition and Wikidata-based linking, targeted sentiment analysis, and consensus-based change-point detection. To enable principled cross-language comparisons and to connect to theories of domestication and cultural proximity, we derive domestication profiles together with a proximity salience ratio. Our analysis spans thematic, recurrent, and singular events. By integrating quantitative data with qualitative interpretation, we provide new insights into the dynamics of Swiss digital media and demonstrate the usefulness of triangulation in media studies. The findings reveal distinct temporal patterns and highlight how linguistic and cultural contexts influence reporting. Our approach offers a framework applicable to other multilingual or culturally diverse media environments, contributing to a deeper understanding of how news is shaped by linguistic and cultural factors.

Keywords: multilingual news analysis, temporal trends, Swiss digital media, triangulated methodology, linguistic regions, cultural diversity, named entity recognition, sentiment analysis

138. ❌ Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

Authors: Mohamed Sobhi Jabal, Jikai Zhang, Dominic LaBella, Jessica L. Houk, Dylan Zhang, Jeffrey D. Rudie, Kirti Magudia, Maciej A. Mazurowski, Evan Calabrese Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21494v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

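Scoring rationale: The core of the paper is a multi-agent LLM system for brain tumor follow-up assessment, so it is highly relevant to "Large Language Models", "LLM Agents", and "Multi-agent Systems" (10 points each). The work is a medical AI application and is highly relevant to "AI for Science" (10 points). The other keywords, such as MoE, SLMs, training methods, inference optimization, and model compression, are neither mentioned nor covered in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The study develops an end-to-end system that combines a multi-agent large language model with a convolutional neural network to automate Brain Tumor Reporting and Data System (BT-RADS) scoring. Of 509 post-treatment MRI examinations, 492 met the inclusion criteria; on these the system reached 76.0% accuracy, significantly better than the 57.5% of the initial clinical assessments.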

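Abstract (translation)

The Brain Tumor Reporting and Data System (BT-RADS) provides a standardized framework for post-treatment MRI response assessment in patients with diffuse gliomas, but applying it requires integrating several complex factors, including imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end system that combines a multi-agent large language model (LLM) with a convolutional neural network (CNN) for automated BT-RADS classification. The system, which includes automated CNN-based tumor segmentation, was evaluated retrospectively on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, and a scorer agent applied BT-RADS decision logic to integrate the extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of the 509 examinations, 492 met the inclusion criteria. The system's classification accuracy was 374/492 (76.0%; 95% CI, 72.1%-79.6%), compared with 283/492 (57.5%; 95% CI, 53.1%-61.8%) for the initial clinical assessments (an improvement of 18.5 percentage points; P<0.001). Sensitivity was high in context-dependent categories (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%) and moderate in threshold-dependent categories (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, the positive predictive value was 92.9%. Compared with initial clinical scoring, the multi-agent LLM system agreed more closely with the expert reference standard on BT-RADS classification, with high accuracy for context-dependent scores and a high positive predictive value for BT-4 detection.

Abstract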

The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.

Keywords: multi-agent LLM system, BT-RADS classification, brain tumor assessment, automated scoring, clinical variable extraction, CNN-based segmentation, medical imaging AI, glioma follow-up
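
The accuracy figures in the abstract above can be checked with a few lines of arithmetic. The abstract does not say how its 95% confidence intervals were computed, so the sketch below assumes the Wilson score interval with z = 1.96. Under that assumption the results land close to the reported 72.1%-79.6% and 53.1%-61.8%, but this is a plausibility check rather than a statement of the authors' method.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Counts taken from the abstract: correct classifications out of 492 exams.
system_lo, system_hi = wilson_interval(374, 492)
clinical_lo, clinical_hi = wilson_interval(283, 492)

# Difference in accuracy, in percentage points.
gain_points = (374 / 492 - 283 / 492) * 100

print(f"system:   {374 / 492:.1%} ({system_lo:.1%} to {system_hi:.1%})")
print(f"clinical: {283 / 492:.1%} ({clinical_lo:.1%} to {clinical_hi:.1%})")
print(f"gain:     {gain_points:.1f} percentage points")
```

The gain is expressed in percentage points (76.0% minus 57.5%), not as a relative improvement over the clinical baseline.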

139. ❌ Effective Strategies for Asynchronous Software Engineering Agents

Authors: Jiayi Geng, Graham Neubig Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21489v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

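Scoring rationale: The paper studies multi-agent collaboration for software engineering tasks, centered on the CAID coordination paradigm. It is highly relevant to "LLM Agents" and "Multi-agent Systems" (10 points) because it focuses on multi-agent collaboration and coordination mechanisms. It is relevant to "Tool Use" (8 points) because the agents collaborate through tools such as git. It is relevant to "Large Language Models" (8 points) because AI agents are typically built on LLMs. It has some connection to "Chain of Thought", "System 2 Thinking", and "Self-Correction" (5 points) because multi-step task planning and test-based verification involve reasoning and correction. The other keywords, such as MoE, quantization, and RAG, are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address inefficient multi-agent collaboration on long-horizon software engineering tasks, the paper proposes a coordination paradigm called Centralized Asynchronous Isolated Delegation (CAID). Compared with single-agent baselines, it improves accuracy by 26.7% absolute on paper reproduction tasks and by 14.3% on library development tasks.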

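Abstract (translation)

AI agents have become increasingly capable at isolated software engineering tasks such as resolving GitHub issues. However, long-horizon tasks that involve multiple interdependent subtasks still pose challenges for both accuracy and timely completion. A natural way to tackle such tasks is asynchronous multi-agent collaboration, in which several agents work on different parts of the task at the same time. Applying multi-agent systems effectively has nevertheless proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are hard to synchronize, and combining partial progress into a coherent whole is challenging. Human developers, on the other hand, have long relied on mature collaboration infrastructure to handle these challenges in large software projects. Inspired by these collaboration primitives, we propose Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm built on three core software engineering primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID builds dependency-aware task plans through a central manager, executes subtasks in parallel in isolated workspaces, and consolidates progress through structured integration with executable test-based verification. In empirical evaluation, CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and by 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is the central coordination mechanism for multi-agent collaboration, and that software engineering primitives such as git worktree, git commit, and git merge allow it to be realized in a reliable and executable way.

Abstract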

AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on Github. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges both with respect to accuracy, and with respect to timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But effective application of multi-agent systems has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. On the other hand, human developers have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.

关键词: AI agents, multi-agent collaboration, software engineering, asynchronous coordination, task delegation, isolated workspaces, dependency-aware planning, test-based verification
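
The CAID abstract describes dependency-aware planning followed by concurrent execution in isolated workspaces. A minimal sketch of that idea, assuming a hypothetical four-subtask plan and hypothetical branch and path names (`agent/<task>`, `../wt-<task>`); it is not the paper's implementation. It groups subtasks into waves that could run in parallel and only builds, without running, the corresponding `git worktree`, `git commit`, and `git merge` command strings.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_waves(dependencies):
    """Group subtasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to keep output stable
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so dependents unlock
    return waves


def worktree_commands(task):
    """Build (not run) git commands for one isolated workspace (names are hypothetical)."""
    branch = f"agent/{task}"
    path = f"../wt-{task}"
    return [
        f"git worktree add -b {branch} {path}",       # isolated checkout on a new branch
        f"git -C {path} commit -am 'agent: {task}'",  # the agent records its work
        f"git merge --no-ff {branch}",                # manager integrates once tests pass
    ]


# Hypothetical plan: each key maps to the set of subtasks it depends on.
plan = {
    "tokenizer": set(),
    "parser": {"tokenizer"},
    "formatter": {"tokenizer"},
    "cli": {"parser", "formatter"},
}

for wave in plan_waves(plan):
    print("run concurrently:", wave)
```

In this sketch the merge step would be gated on the executable tests mentioned in the abstract; how CAID schedules and verifies merges is not specified here.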

140. ❌ TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

Authors: Kai-Wei Chang, Yi-Cheng Lin, Huang-Cheng Chou, Wenze Ren, Yu-Han Huang, Yun-Shao Tsai, Chien-Cheng Chen, Yu Tsao, Yuan-Fu Liao, Shrikanth Narayanan, James Glass, Hung-yi Lee Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21478v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

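Scoring rationale: The paper builds a speech intent dataset for a low-resource language (Taiwanese Taigi) and explores two data mining strategies, one of which uses an LLM for pseudo labeling. It is therefore partially related to the 'Large Language Models' keyword (5 points), since the LLM serves as a data annotation tool. The paper also targets practical speech applications such as healthcare, which gives it some relevance to the scientific applications covered by 'AI for Science' (5 points). Other keywords such as MoE, SFT, and RAG are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

For the low-resource language Taigi, the paper builds TaigiSpeech, a real-world speech intent dataset, and explores data mining strategies based on LLM pseudo labeling and an audio-visual framework to address the scarcity of labeled data.

Abstract (Chinese translation)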

语音技术发展迅速，已服务于全球多样化人群。然而，由于资源有限，许多语言在技术中仍未被充分代表。本文介绍TaigiSpeech，一个针对台湾闽南语（亦称台语/闽南语）的真实场景语音意图数据集；该语言属于资源匮乏且主要依赖口语传播的语言。本数据集采集自年长使用者，包含21位说话者共计3000条语音样本，专为医疗保健与家庭助理等实际意图检测场景设计。为应对标注数据稀缺的挑战，我们探索了两种具有不同监督程度的数据挖掘策略：一是通过中间语言进行大型语言模型伪标注的关键词匹配数据挖掘；二是利用多模态线索、仅需极少文本监督的视听框架。这一设计为资源匮乏的无文字口语提供了可扩展的数据集构建方案。TaigiSpeech将依据CC BY 4.0许可协议公开发布，以促进对资源匮乏及无文字语言的广泛采用与研究。项目网站及数据集可通过https://kwchang.org/taigispeech获取。

Abstract

Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (aka Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset can be found at https://kwchang.org/taigispeech.

Keywords: TaigiSpeech, low-resource language, speech intent dataset, data mining, LLM pseudo labeling, audio-visual framework, Taiwanese Taigi, intent detection

141. ❌ Beyond Correlation: Refutation-Validated Aspect-Based Sentiment Analysis for Explainable Energy Market Returns

Authors: Wihan van der Heever, Keane Ong, Ranjan Satapathy, Erik Cambria Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21473v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

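Scoring rationale: The paper focuses on an aspect-based sentiment analysis framework for financial markets and uses conventional statistical methods (OLS, Newey-West HAC errors, refutation tests) to analyze the relationship between sentiment signals and stock returns. It does not involve large models, deep learning techniques, or a concrete AI for Science application, so it has no direct connection to any scoring keyword.

!!! tip deepseek-chat TL;DR

The study proposes a refutation-validated aspect-based sentiment analysis framework for examining the relationship between energy market sentiment signals and stock returns. It finds that only a few associations pass all statistical checks, offering a more robust methodological validation for financial analysis.

Abstract (Chinese translation)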

本文提出了一种用于金融市场细粒度情感分析的证伪验证框架，以解决相关性研究无法区分真实关联与虚假关联的局限性。利用能源行业的X平台数据，我们检验了细分维度情感信号是否与股票收益存在经证伪验证的稳健关系。我们的分析流程整合了净比率评分与Z标准化处理、采用Newey West异方差自相关稳健标准误的普通最小二乘法回归，以及包含安慰剂检验、随机共同因果检验、子集稳定性检验和自助法在内的系列证伪测试。在六个能源行业股票代码的测试中，仅少数关联性通过全部验证，而可再生能源板块则表现出特定情感维度与预测周期的差异化响应。尽管未能确立因果关系，但该框架提供了统计稳健、方向可解释的信号。受样本规模限制（六只股票、一个季度数据），本研究结论的普适性受限，主要作为方法论的概念验证研究。

Abstract

This paper proposes a refutation-validated framework for aspect-based sentiment analysis in financial markets, addressing the limitations of correlational studies that cannot distinguish genuine associations from spurious ones. Using X data for the energy sector, we test whether aspect-level sentiment signals show robust, refutation-validated relationships with equity returns. Our pipeline combines net-ratio scoring with z-normalization, OLS with Newey West HAC errors, and refutation tests including placebo, random common cause, subset stability, and bootstrap. Across six energy tickers, only a few associations survive all checks, while renewables show aspect and horizon specific responses. While not establishing causality, the framework provides statistically robust, directionally interpretable signals, with limited sample size (six stocks, one quarter) constraining generalizability and framing this work as a methodological proof of concept.

Keywords: aspect-based sentiment analysis, financial markets, refutation-validated framework, energy sector, equity returns, statistical robustness, methodological proof of concept
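
The pipeline in the abstract combines net-ratio scoring, z-normalization, OLS, and refutation tests such as a placebo check. The following sketch illustrates the placebo idea only, under stated assumptions: the net-ratio definition `(positive - negative) / (positive + negative)`, the mention counts, and the toy returns are all hypothetical, and the Newey-West HAC standard errors used in the paper are omitted. The placebo test shuffles the sentiment series and asks how often a shuffled series yields a slope at least as large as the observed one.

```python
import random
import statistics


def net_ratio(positive, negative):
    """Net sentiment ratio in [-1, 1]; 0.0 when there are no mentions (assumed definition)."""
    total = positive + negative
    return (positive - negative) / total if total else 0.0


def z_normalize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var


def placebo_p_value(x, y, trials=500, seed=0):
    """Share of shuffled-x placebos whose |slope| reaches the observed |slope|."""
    rng = random.Random(seed)
    observed = abs(ols_slope(x, y))
    hits = 0
    for _ in range(trials):
        shuffled = x[:]
        rng.shuffle(shuffled)  # breaks any real link between sentiment and returns
        if abs(ols_slope(shuffled, y)) >= observed:
            hits += 1
    return hits / trials


# Hypothetical daily (positive, negative) mention counts and toy returns.
counts = [(9, 1), (8, 2), (7, 3), (6, 4), (5, 5), (4, 6), (3, 7), (2, 8), (7, 1), (1, 5)]
sentiment = z_normalize([net_ratio(p, n) for p, n in counts])
returns = [2.0 * s for s in sentiment]  # perfectly linear toy relationship
print("placebo p-value:", placebo_p_value(sentiment, returns))
```

A small placebo p-value means random reassignment rarely reproduces the association; it does not establish causality, which matches the caveat in the abstract.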

142. ❌ DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

Authors: Siqi Guo, Ming Lin, Tianbao Yang Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21465v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

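Scoring rationale: The paper explicitly uses Large Language Models (LLMs) to automatically convert PyTorch code into optimized Triton/CUDA kernels, and this is its core method, so that keyword scores 10. Other keywords, including MoE, SLMs, Scaling Laws, the various training techniques (Pre-training, SFT, RLHF, and so on), inference optimization (Speculative Decoding), interpretability (Explainable AI), and AI for Science applications, are not addressed or mentioned and score 0. The paper concentrates on a specific application of LLMs to code generation and optimization rather than on general large-model principles or applications in other domains.

!!! tip deepseek-chat TL;DR

The paper proposes DRTriton, a framework that trains LLMs with large-scale synthetic data and reinforcement learning to convert PyTorch code automatically into high-performance Triton/CUDA kernels. On KernelBench it significantly outperforms GPT-5.2 and Claude-Sonnet-4.5.

Abstract (Chinese translation)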

开发高效的CUDA内核是生成式人工智能领域一项基础而富有挑战性的任务。近期研究利用大语言模型（LLMs）将PyTorch参考实现自动转换为CUDA内核，显著降低了工程投入。然而，诸如GPT-5.2和Claude-Sonnet-4.5等先进大语言模型在此特定任务中仍面临困难。为应对这一挑战，我们提出了DRTriton——一个可扩展的学习框架，用于训练大语言模型将PyTorch代码转换为高度优化的Triton内核，这些内核在运行时被编译为CUDA内核。DRTriton包含三个核心组件：（i）数据合成算法CSP-DAG，该算法在可控难度下保证对算子空间的全面覆盖和无偏均匀采样；（ii）采用解耦奖励的课程强化学习，能同步高效优化转换成功率和推理速度；（iii）测试时搜索算法，可进一步提升所生成Triton内核的推理速度。值得注意的是，尽管仅使用合成数据进行训练，DRTriton能有效泛化至现实世界中即使对人类专家也颇具挑战性的CUDA内核。实验结果表明，在KernelBench Level 2测试集上，DRTriton-7B在92%的案例中实现了加速，而GPT-5.2和Claude-Sonnet-4.5的加速比例分别为23%和19%。

Abstract

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing the engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this specific task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch code into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) curriculum reinforcement learning with a decoupled reward that efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves a speedup on 92% of KernelBench Level 2, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.

Keywords: Large Language Models, CUDA kernels, Triton kernels, synthetic data, reinforcement learning, code generation, inference speed, PyTorch
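
The abstract mentions a test-time search that further improves the speed of generated kernels but does not describe the algorithm. One simple form such a search could take is best-of-N selection: keep only candidates that reproduce the reference output, then pick the fastest. The sketch below is an assumption for illustration, not DRTriton's algorithm, and it uses plain Python callables as stand-ins for kernels; `select_fastest_correct` and the candidate names are hypothetical.

```python
import time


def select_fastest_correct(reference, candidates, inputs, repeats=5):
    """Return the name of the fastest candidate that matches the reference, or None."""
    expected = [reference(x) for x in inputs]
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        try:
            if [fn(x) for x in inputs] != expected:
                continue  # wrong output: never eligible, however fast
        except Exception:
            continue  # a crashing candidate counts as a failed conversion
        start = time.perf_counter()
        for _ in range(repeats):
            for x in inputs:
                fn(x)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


def crashing_candidate(values):
    raise RuntimeError("simulated compile or launch failure")


# Stand-ins: the reference plays the PyTorch implementation, candidates play kernels.
inputs = [[1, 2, 3], [4, 5]]
candidates = {
    "wrong": lambda values: 0,
    "crash": crashing_candidate,
    "right": lambda values: sum(values),
}
print(select_fastest_correct(sum, candidates, inputs))
```

Checking correctness before timing keeps the two objectives separate, which loosely mirrors the abstract's distinction between conversion success and inference speed.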

143. ❌ DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment

Authors: James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21461v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

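Scoring rationale: The paper's core topic is preference alignment. It proposes an inference-time method that applies conditional steering through a sparse autoencoder (SAE) without updating model weights. It is therefore rated highly relevant (10 points) to the preference alignment, fine-tuning, parameter-efficient fine-tuning, and interpretability keywords. The paper applies the method to LLMs (Gemma, Qwen), so the LLM keyword scores 10. An SAE is a sparse model, so the MoE/sparse models keyword scores 5. The paper tests 2B to 9B models, which include smaller models, so the SLM keyword scores 5. Other keywords, such as data quality, pre-training, RAG, inference acceleration, and hallucination mitigation, are unrelated to or not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes DSPA, a dynamic sparse autoencoder steering method for data-efficient preference alignment at inference time. It requires no updates to base-model weights and substantially reduces compute overhead while maintaining performance.

Abstract (Chinese translation)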

偏好对齐通常通过对偏好数据进行权重更新训练来实现，这增加了大量对齐阶段的计算开销且机制可解释性有限。我们提出用于偏好对齐的动态稀疏自编码器导向方法（Dynamic SAE Steering for Preference Alignment, DSPA），这是一种推理时方法，可使稀疏自编码器（sparse autoencoder, SAE）导向具备提示条件性。基于偏好三元组，DSPA计算一个条件差异映射，将提示特征与生成控制特征相关联；在解码过程中，该方法仅修改被激活的token潜在表示，无需更新基础模型权重。在Gemma-2-2B/9B和Qwen3-8B模型上的实验表明，DSPA提升了MT-Bench评分，在AlpacaEval基准上表现具有竞争力，同时保持了多项选择任务的准确性。在受限偏好数据条件下，DSPA仍保持鲁棒性，其性能可与两阶段RAHF-SCIT流程相媲美，同时其对齐阶段所需浮点运算量最多可降至后者的$1/4.47$。最后，我们对DSPA修改的SAE特征进行审计，发现偏好方向主要由语篇和风格信号主导，并通过理论分析阐明了条件差异映射的估计原理以及top-$k$消融技术的适用条件。

Abstract

Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to $4.47\times$ fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies, finding that preference directions are dominated by discourse and stylistic signals, and provide theory clarifying the conditional-difference map estimate and when top-$k$ ablation is principled.

Keywords: preference alignment, sparse autoencoder, inference-time method, parameter-efficient, data-efficient, mechanistic interpretability, conditional-difference map, weight-updating training

144. ❌ Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

Authors: Tae-Eun Song Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21454v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

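Scoring rationale: The paper studies benchmark contamination detection for LLMs. Its core contributions are the Cross-Context Verification method and the Hierarchical Cross-Context Architecture multi-agent framework. It is highly relevant to LLM technology (10 points), involves analysis of the reasoning process (Chain of Thought and System 2 Thinking, 8 points each), and directly uses a multi-agent system for verification (LLM Agents and Multi-agent Systems, 10 points each). Other keywords such as MoE, quantization, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

Addressing solution leakage and test quality issues in LLM coding benchmarks, the paper proposes the Cross-Context Verification method and the Hierarchical Cross-Context Architecture multi-agent framework. On 9 SWE-bench Verified problems, the method perfectly separates contaminated samples from genuine reasoning, and the study finds that 33% of prior contamination labels are false positives.

Abstract (Chinese translation)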

大语言模型编码基准正面临可信度危机：普遍存在的解决方案泄露与测试质量问题削弱了SWE-bench Verified的可靠性，而现有检测方法——包括释义一致性、n-元重叠度、困惑度分析——均未直接观测模型是进行推理还是单纯回忆。同时，简单重复验证会降低准确性：多轮审查产生误判的速度远快于发现真实错误的速度，这表明需要结构性方法。
我们提出跨上下文验证（Cross-Context Verification, CCV），一种黑盒方法，通过在N个独立会话中解决同一基准问题并测量解决方案的多样性，结合分层跨上下文架构（Hierarchical Cross-Context Architecture, HCCA）——一个多智能体分析框架，通过在不同专业分析角色间刻意限制信息传递来防止确认偏误。
在9个SWE-bench Verified问题上（45次试验，Claude Opus 4.6，温度参数0），CCV实现了对污染样本与真实推理的完美区分（曼-惠特尼U=0，p约等于0.012，r=1.0）。关键发现包括：（1）污染具有二元性——模型要么完美复现，要么完全无法回忆；（2）推理缺失是完美的区分指标；（3）先前33%的污染标注为误判；（4）HCCA的独立分析结构能发现单分析师方法遗漏的污染-缺陷复合案例。一项将HCCA扩展至多阶段验证（执行者→验证者→指导者）的试点实验得出了负面结果——100%的盲从性确认——进一步证明关键机制在于信息限制而非结构复杂性。我们已公开全部代码与数据。

Abstract

LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods (paraphrase consistency, n-gram overlap, perplexity analysis) never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U = 0, p ≈ 0.012, r = 1.0). Key findings: (1) contamination is binary: models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result, 100% sycophantic confirmation, providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.

Keywords: benchmark contamination, LLM coding benchmarks, multi-agent analysis, reasoning detection, Cross-Context Verification, Hierarchical Cross-Context Architecture, solution leakage, SWE-bench Verified
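
CCV is described as solving the same problem in N independent sessions, measuring solution diversity, and reporting a Mann-Whitney U statistic. The sketch below shows one way these two quantities could be computed. The diversity metric (one minus mean pairwise `difflib` similarity) and the toy scores are assumptions for illustration; the abstract does not state how the paper measures diversity.

```python
from difflib import SequenceMatcher
from itertools import combinations


def solution_diversity(solutions):
    """1 - mean pairwise text similarity across sessions; 0.0 means identical outputs."""
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0  # a single session has nothing to compare against
    similarity = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
    return 1.0 - similarity


def mann_whitney_u(group_a, group_b):
    """Mann-Whitney U statistic (the smaller of the two U values); ties count 0.5."""
    wins_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                wins_a += 1.0
            elif a == b:
                wins_a += 0.5
    wins_b = len(group_a) * len(group_b) - wins_a
    return min(wins_a, wins_b)


# Hypothetical per-problem diversity scores for two groups of benchmark problems.
suspected_contaminated = [0.0, 0.0, 0.1]
suspected_genuine = [0.5, 0.6, 0.7, 0.8]
print("U =", mann_whitney_u(suspected_contaminated, suspected_genuine))
```

U = 0 means every score in one group lies below every score in the other, which is what "perfect separation" refers to in the abstract.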

145. ❌ KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning

Authors: Shuai Wang, Yinan Yu Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21440v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

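Scoring rationale: The paper's core topic is applying large language models (LLMs) to knowledge graph reasoning, using a reinforcement learning framework to achieve multi-hop reasoning, so it is highly relevant to 'Large Language Models' (10 points). The work concerns the reasoning process and is rated highly relevant to 'Chain of Thought' and 'System 2 Thinking' (10 points each), because the paper folds multi-step reasoning into a single inference stage, enabling global reasoning and dynamic path exploration. Other keywords such as MoE, SLMs, Scaling Laws, and Pre-training are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes KG-Hopper, a reinforcement learning framework that enables compact open LLMs to perform multi-hop knowledge graph reasoning within a single inference round. Experiments show that it outperforms larger multi-step systems and is competitive with proprietary models.

Abstract (Chinese translation)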

大型语言模型（LLMs）展现出令人瞩目的自然语言处理能力，但在知识密集型推理任务中往往表现不佳。知识库问答（Knowledge Base Question Answering, KBQA）作为一项典型挑战，因其需要准确的多跳推理而尤为突出，该任务依赖于结构化的知识图谱（Knowledge Graphs, KGs）。现有方法通常遵循预定义流程执行顺序推理步骤，这种分步隔离的推理方式限制了灵活性，并容易导致错误级联。为应对这些局限，我们提出KG-Hopper——一种新颖的强化学习（Reinforcement Learning, RL）框架，使紧凑的开源LLMs能够在单轮推理中执行集成的多跳知识图谱推理。我们并非逐步推理，而是训练一个推理专用LLM，将整个知识图谱遍历与决策过程嵌入到统一的“思考”阶段，从而实现对跨步骤依赖的全局推理以及具备回溯能力的动态路径探索。在八个知识图谱推理基准测试上的实验结果表明，基于70亿参数LLM的KG-Hopper持续超越规模更大的多步推理系统（参数量高达700亿），并与GPT-3.5-Turbo、GPT-4o-mini等专有模型达到相当的性能水平，同时保持紧凑、开源和数据高效的特点。代码已公开于：https://github.com/Wangshuaiia/KG-Hopper。

Abstract

Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs), exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified "thinking" stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG-Hopper.

Keywords: Large Language Models, Knowledge Graph Reasoning, Reinforcement Learning, Multi-hop Reasoning, Compact LLMs, KBQA, Global Reasoning, Dynamic Path Exploration

146. ❌ PROMPT2BOX: Uncovering Entailment Structure among LLM Prompts

Authors: Neeladri Bhuiya, Shib Sankar Dasgupta, Andrew McCallum, Haw-Shiuan Chang Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21438v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

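Scoring rationale: The paper focuses on methods for analyzing and evaluating LLM prompts. Its core contribution is PROMPT2BOX, which uses box embeddings to better capture semantic similarity and specificity relations between prompts and thereby identify LLM weaknesses more effectively. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (weight 1.0), since LLMs are the central object of study. The paper does not address the other keywords, such as MoE, SLMs, training techniques (pre-training, fine-tuning, alignment), inference optimization, agent systems, model compression, or AI for Science applications; these have no direct connection to the paper's focus on prompt embeddings and weakness analysis.

!!! tip deepseek-chat TL;DR

The paper proposes PROMPT2BOX, which uses box embeddings to capture semantic similarity and specificity relations between LLM prompts and thereby identify LLM weaknesses more effectively, finding 8.9% more weaknesses than vector baselines.

Abstract (Chinese translation)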

为探究大语言模型（LLM）的弱点，研究者通常将提示词嵌入向量空间并进行聚类以提取有洞察力的模式。然而，向量嵌入主要捕获主题相似性。因此，那些主题相同但具体程度不同（进而难度不同）的提示词往往获得相似的表示，这使得细粒度的弱点分析变得困难。为克服这一局限，我们提出了PROMPT2BOX方法，该方法通过一个训练过的编码器将提示词嵌入到盒型嵌入（box embedding）空间中。该编码器在现有数据集与合成数据集上训练，输出的盒型嵌入不仅能捕捉语义相似性，还能捕获提示词之间的具体性关系（例如，“撰写一个冒险故事”比“撰写一个故事”更具体）。我们进一步为盒型嵌入开发了一种新颖的降维技术，以促进数据集的可视化与比较。实验表明，盒型嵌入在捕捉提示词具体性方面始终优于向量基线方法。在下游任务中，针对来自UltraFeedback数据集的17个大语言模型构建层次聚类树时，PROMPT2BOX能比向量基线多识别出8.9%的LLM弱点，并且在层次深度与指令具体性之间实现了约33%更强的相关性。

Abstract

To discover the weaknesses of LLMs, researchers often embed prompts into a vector space and cluster them to extract insightful patterns. However, vector embeddings primarily capture topical similarity. As a result, prompts that share a topic but differ in specificity, and consequently in difficulty, are often represented similarly, making fine-grained weakness analysis difficult. To address this limitation, we propose PROMPT2BOX, which embeds prompts into a box embedding space using a trained encoder. The encoder, trained on existing and synthesized datasets, outputs box embeddings that capture not only semantic similarity but also specificity relations between prompts (e.g., “writing an adventure story” is more specific than “writing a story”). We further develop a novel dimension reduction technique for box embeddings to facilitate dataset visualization and comparison. Our experiments demonstrate that box embeddings consistently capture prompt specificity better than vector baselines. On the downstream task of creating hierarchical clustering trees for 17 LLMs from the UltraFeedback dataset, PROMPT2BOX can identify 8.9% more LLM weaknesses than vector baselines and achieves an approximately 33% stronger correlation between hierarchical depth and instruction specificity.

Keywords: LLM prompts, box embeddings, entailment structure, weakness analysis, specificity relations, hierarchical clustering, UltraFeedback dataset, PROMPT2BOX
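
The abstract's example, that "writing an adventure story" is more specific than "writing a story", can be pictured geometrically: the more specific prompt's box sits inside the more general prompt's box. The sketch below illustrates this with hard axis-aligned boxes and hand-picked 2-D coordinates. Both are assumptions for illustration; PROMPT2BOX learns its boxes with a trained encoder, and learned box models typically use smoothed volumes rather than the hard volumes used here.

```python
def box_volume(box):
    """Volume of an axis-aligned box given as (lower_corner, upper_corner)."""
    lower, upper = box
    volume = 1.0
    for lo, hi in zip(lower, upper):
        volume *= max(hi - lo, 0.0)  # an empty side makes the whole volume zero
    return volume


def intersect(box_a, box_b):
    """Axis-aligned intersection; may be empty (some upper bound below a lower bound)."""
    lower = [max(a, b) for a, b in zip(box_a[0], box_b[0])]
    upper = [min(a, b) for a, b in zip(box_a[1], box_b[1])]
    return lower, upper


def containment(inner, outer):
    """Fraction of `inner` that lies inside `outer`; 1.0 means fully contained."""
    inner_volume = box_volume(inner)
    if inner_volume == 0.0:
        return 0.0
    return box_volume(intersect(inner, outer)) / inner_volume


# Hand-picked 2-D boxes (hypothetical, not learned embeddings).
story = ([0.0, 0.0], [4.0, 4.0])
adventure_story = ([1.0, 1.0], [2.0, 3.0])

print(containment(adventure_story, story))  # specific prompt inside general prompt
print(containment(story, adventure_story))  # general prompt only partly covered
```

The asymmetry between the two scores is what a single vector similarity cannot express, since cosine similarity is symmetric.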

147. ❌ Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval

Authors: Hang Gao, Dimitris N. Metaxas Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21437v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

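Scoring rationale: The paper studies semantic shift in Transformer-based embedding models and its effect on retrieval performance. It is highly relevant to retrieval-augmented generation (RAG) (8 points), since RAG depends on text embeddings and retrieval, and moderately relevant to large language models (LLMs) (5 points), since embedding models are part of the LLM ecosystem. Other keywords such as MoE, SFT, and RLHF are unrelated to the paper's focus on embeddings and retrieval (0 points).

!!! tip deepseek-chat TL;DR

The paper argues that semantic shift is the fundamental challenge in text embedding and retrieval, showing through theoretical analysis and experiments that semantic shift explains embedding collapse and predicts retrieval degradation.

Abstract translation (Chinese)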

基于Transformer的嵌入模型依赖池化操作将变长文本映射为单个向量，这虽然支持高效的相似性搜索，但也引发了众所周知的几何病态问题，例如各向异性与长度诱导的嵌入坍缩。现有研究主要描述了这些病态现象的*表现*，但对于其*何时*以及*为何*损害下游检索任务，则缺乏深入解释。本文认为，缺失的因果因素是*语义偏移*：即文本内部语义固有的、结构化的演变与离散。
我们首先对Transformer嵌入中的*语义平滑*现象进行了理论分析：随着文本组成句子间的语义多样性增加，池化后的表示必然偏离每个独立句子的嵌入，从而产生一个平滑且区分度更低的向量。在此基础上，我们将语义偏移形式化为一个可计算的度量，该度量融合了局部语义演变与全局语义离散度。通过在多种语料库和多个嵌入模型上进行受控实验，我们证明语义偏移与嵌入集中现象的严重程度高度吻合，并能预测检索性能的下降，而仅凭文本长度则无法做到这一点。总体而言，语义偏移为理解嵌入坍缩以及诊断各向异性何时产生危害提供了一个统一且可操作的视角。

Abstract

Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe *what* these pathologies look like, yet provide limited insight into *when* and *why* they harm downstream retrieval. In this work, we argue that the missing causal factor is *semantic shift*: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of *semantic smoothing* in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.

Keywords: semantic shift, text embedding, retrieval, Transformer embeddings, embedding collapse, anisotropy, semantic smoothing, retrieval degradation
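
The semantic-smoothing claim in the abstract above (a pooled vector drifts away from its constituent sentence embeddings as they become more diverse) can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the "sentence embeddings" are synthetic random vectors, mean pooling stands in for whatever pooling a real model uses, and the helper names (`make_sentences`, `mean_similarity_to_pooled`) are hypothetical. It is not the paper's semantic-shift measure.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def make_sentences(n, dim, spread, rng):
    """Synthetic 'sentence embeddings': one shared topic direction plus
    Gaussian noise scaled by `spread` (larger spread = more diverse text)."""
    topic = normalize([rng.gauss(0, 1) for _ in range(dim)])
    return [normalize([t + spread * rng.gauss(0, 1) for t in topic]) for _ in range(n)]


def mean_similarity_to_pooled(sentences):
    """Average cosine similarity between the pooled vector and each sentence."""
    pooled = mean_pool(sentences)
    return sum(cosine(pooled, s) for s in sentences) / len(sentences)


rng = random.Random(0)
coherent = make_sentences(8, 64, spread=0.05, rng=rng)  # low semantic diversity
diverse = make_sentences(8, 64, spread=0.5, rng=rng)    # high semantic diversity

sim_coherent = mean_similarity_to_pooled(coherent)
sim_diverse = mean_similarity_to_pooled(diverse)
print(f"coherent text: {sim_coherent:.3f}  diverse text: {sim_diverse:.3f}")
```

In this toy setup the pooled vector stays close to every sentence when the sentences agree, and moves away from all of them when they diverge, which is the qualitative effect the abstract describes.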

148. ❌ Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks

Authors: Navya Mehrotra, Adam Visokay, Kristina Gligorić Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21404v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

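Scoring rationale: The paper's core topic is the use of LLMs for text annotation, in particular the bias with which LLM outputs reflect different human perspectives in subjective tasks, and it proposes the Perspective-Driven Inference method. It is therefore highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since LLMs are both the basic tool and the object of study. The other keywords cover specific techniques (e.g., MoE, SFT, RAG), application areas (e.g., AI for Science), or performance optimization (e.g., quantization, inference acceleration), none of which the paper addresses, so all receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses perspective bias in LLM annotation of subjective tasks. It proposes Perspective-Driven Inference, which uses an adaptive sampling strategy to allocate human annotation effort, and improves annotation accuracy for harder-to-model demographic groups on politeness and offensiveness rating tasks.

Abstract translation (Chinese)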

大型语言模型正日益被用于文本标注，但其输出反映某些人类视角的能力优于其他视角。现有的LLM标注误差校正方法均假设存在单一标准答案，然而这一假设在主观性任务中并不成立——不同人口统计学群体间的意见分歧本身具有研究价值。为此，我们提出"视角驱动推断"方法，将跨群体标注分布作为核心研究对象，并利用有限的人工标注资源对其进行估算。我们设计了一种自适应抽样策略，可将人工标注资源集中投放于LLM代理模型预测准确度最低的群体。通过在礼貌度与冒犯性评分任务上的实验评估，本方法相较于均匀抽样基线在较难建模的人口群体上实现了针对性提升，同时保持了整体覆盖度。

Abstract

Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.

Keywords: Large language models, LLM annotations, subjective tasks, perspective-driven inference, adaptive sampling, demographic groups, human annotation, bias correction
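
The abstract above describes concentrating human annotation effort on groups where LLM proxies are least accurate while maintaining coverage. One simple way to picture that idea is error-proportional budget allocation with a per-group floor. The sketch below is an assumption-laden stand-in, not the paper's adaptive sampling strategy: the function name `allocate_budget`, the `floor` parameter, the group names, and the error estimates are all hypothetical.

```python
def allocate_budget(proxy_error, budget, floor=5):
    """Split `budget` human labels across groups.

    Every group first receives `floor` labels (a stand-in for coverage);
    the remainder is shared in proportion to the estimated LLM proxy error,
    with largest-remainder rounding so the total matches `budget` exactly.
    """
    groups = list(proxy_error)
    if budget < floor * len(groups):
        raise ValueError("budget too small for the per-group floor")

    remaining = budget - floor * len(groups)
    total_error = sum(proxy_error.values())
    if total_error == 0:
        shares = {g: remaining / len(groups) for g in groups}
    else:
        shares = {g: remaining * proxy_error[g] / total_error for g in groups}

    alloc = {g: floor + int(shares[g]) for g in groups}
    leftover = budget - sum(alloc.values())
    # Hand out the labels lost to rounding, largest fractional share first.
    by_fraction = sorted(groups, key=lambda g: shares[g] - int(shares[g]), reverse=True)
    for g in by_fraction[:leftover]:
        alloc[g] += 1
    return alloc


# Illustrative error estimates only; not taken from the paper.
estimated_error = {"group_a": 0.05, "group_b": 0.15, "group_c": 0.30}
alloc = allocate_budget(estimated_error, budget=100, floor=5)
print(alloc)
```

In an adaptive loop, the error estimates would be refreshed after each batch of human labels and the allocation recomputed.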

149. ❌ Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models

Authors: Jinghan Cao, Yu Ma, Xinjin Li, Qingyang Ren, Xiangyun Chen Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21389v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

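Scoring rationale: The paper compares large language models (LLMs) and small language models (SLMs) on task-specific efficiency, so it is highly relevant to 'Large Language Models' and 'Small Language Models' (10 points each). It proposes the Performance-Efficiency Ratio (PER) metric, which involves inference efficiency, giving some relevance to 'Speculative Decoding OR Inference Acceleration' (5 points). Other keywords, including MoE, Scaling Laws, training methods, alignment, RAG, reasoning techniques, agents, quantization, hallucination mitigation, interpretability, and AI for Science, are not mentioned in the title or abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

Introducing the Performance-Efficiency Ratio (PER) metric, the paper systematically compares the efficiency of 16 language models on five NLP tasks and finds that small language models (0.5–3B parameters) achieve better PER scores on every task, providing a quantitative basis for model deployment in resource-constrained environments.

Abstract translation (Chinese)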

大型语言模型展现出卓越性能，但伴随的计算成本过高，难以适配资源受限的部署场景。本文首次针对五大不同自然语言处理任务，对16种语言模型进行了全面的任务特定效率分析。我们提出了性能效率比（Performance-Efficiency Ratio, PER）这一新颖指标，该指标通过几何平均归一化方法，综合了准确率、吞吐量、内存占用和延迟。系统性评估表明，小型模型（0.5–30亿参数）在所有给定任务中均取得了更优的PER分数。这些发现为在生产环境中优先考虑推理效率而非边际精度提升的小型模型部署，奠定了量化基础。

Abstract

Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the Performance-Efficiency Ratio (PER), a novel metric integrating accuracy, throughput, memory, and latency through geometric mean normalization. Our systematic evaluation reveals that small models (0.5–3B parameters) achieve superior PER scores across all given tasks. These findings establish quantitative foundations for deploying small models in production environments prioritizing inference efficiency over marginal accuracy gains.

Keywords: Small Language Models, Large Language Models, efficiency analysis, Performance-Efficiency Ratio, inference efficiency, computational costs, NLP tasks, resource-constrained deployments
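
The abstract above says PER integrates accuracy, throughput, memory, and latency "through geometric mean normalization" but does not give the formula. The sketch below shows one plausible reading, assuming each metric is normalized against the best value in the comparison set (inverted for memory and latency, where lower is better) and the four ratios are combined with a geometric mean. The normalization scheme, the function name `per_scores`, the field names, and the numbers are assumptions for illustration, not the paper's definition or results.

```python
import math


def per_scores(models):
    """Geometric mean of four best-relative ratios, each in (0, 1]."""
    best_acc = max(m["accuracy"] for m in models.values())
    best_thr = max(m["throughput"] for m in models.values())
    best_mem = min(m["memory_gb"] for m in models.values())
    best_lat = min(m["latency_ms"] for m in models.values())

    scores = {}
    for name, m in models.items():
        parts = [
            m["accuracy"] / best_acc,     # higher is better
            m["throughput"] / best_thr,   # higher is better
            best_mem / m["memory_gb"],    # lower is better, so inverted
            best_lat / m["latency_ms"],   # lower is better, so inverted
        ]
        scores[name] = math.prod(parts) ** (1.0 / len(parts))
    return scores


# Made-up measurements for two hypothetical models.
models = {
    "small_model": {"accuracy": 0.80, "throughput": 200.0, "memory_gb": 4.0, "latency_ms": 50.0},
    "large_model": {"accuracy": 0.90, "throughput": 20.0, "memory_gb": 140.0, "latency_ms": 500.0},
}
scores = per_scores(models)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A geometric mean penalizes a model that is very weak on any single axis, which is why a modest accuracy gap can be outweighed by large efficiency gaps under a metric of this shape.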

150. ❌ PLR: Plackett-Luce for Reordering In-Context Learning Examples

Authors: Pawel Batorski, Paul Swoboda Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21373v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	15.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

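Scoring rationale: PLR aims to improve the in-context learning (ICL) performance of large language models (LLMs), proposing a probabilistic ordering method based on the Plackett-Luce model to address sensitivity to example order. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since LLMs are the base models for ICL, and to 'In-context Learning OR Many-shot Learning' (15 points), since optimizing ICL is the paper's core topic. Other keywords such as MoE, SFT, and RAG are unrelated, as the paper does not touch on those techniques or application areas.

!!! tip deepseek-chat TL;DR

PLR is a probabilistic method based on the Plackett-Luce model for optimizing the order of in-context learning examples in large language models; experiments show it improves accuracy on few-shot classification and mathematical reasoning tasks.

Abstract translation (Chinese)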

上下文学习（ICL）通过以少量ICL示例为条件来调整大型语言模型，从而避免昂贵的参数更新。在诸多影响因素中，模型性能通常对示例的排列顺序高度敏感。然而，在$n!$种可能的排列顺序上进行穷举搜索是不可行的。因此，更高效的排序方法通常采用基于标签集的模型置信度度量（例如标签概率熵），或直接寻找最优排序。我们提出PLR，一种基于概率的上下文学习示例排序方法，它利用普拉基特-卢斯模型学习排序的概率分布，以替代离散的排序搜索。PLR使用普拉基特-卢斯分布对排序进行建模，并迭代更新其参数，以在任务级度量指标下将概率质量集中于高性能的排序上。候选排序通过Gumbel扰动排序过程高效采样。在多个分类基准测试上的实验表明，对于$k \in \{4, 8, 16, 32\}$个示例，PLR能持续提升少样本准确率；我们进一步在基于标签的排序方法不适用的数学推理任务上验证了其性能提升。代码发布于https://github.com/Batorskq/PLR。

Abstract

In-context learning (ICL) adapts large language models by conditioning on a small set of ICL examples, avoiding costly parameter updates. Among other factors, performance is often highly sensitive to the ordering of the examples. However, exhaustive search over the $n!$ possible orderings is infeasible. Therefore more efficient ordering methods use model confidence measures (e.g., label-probability entropy) over label sets or take a direct approach to finding the best ordering. We propose PLR, a probabilistic approach to in-context example ordering that replaces discrete ordering search with learning a probability distribution over orderings with the Plackett-Luce model. PLR models orderings using a Plackett-Luce distribution and iteratively updates its parameters to concentrate probability mass on high-performing orderings under a task-level metric. Candidate orderings are sampled efficiently via a Gumbel perturb-and-sort procedure. Experiments on multiple classification benchmarks show that PLR consistently improves few-shot accuracy for $k \in \{4, 8, 16, 32\}$ examples, and we further demonstrate gains on mathematical reasoning tasks where label-based ordering methods are not applicable. Our code is available at https://github.com/Batorskq/PLR.

Keywords: In-context learning, Example ordering, Plackett-Luce model, Large language models, Few-shot accuracy, Probabilistic approach, Classification benchmarks, Mathematical reasoning
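
The Gumbel perturb-and-sort procedure mentioned in the abstract above is a standard way to sample a permutation from a Plackett-Luce distribution: add independent Gumbel(0, 1) noise to each item's log-weight and sort in descending order. The sketch below shows only that sampling step with toy parameters. The function name `sample_ordering` and the weights are illustrative assumptions, and PLR's parameter update rule is not reproduced here.

```python
import math
import random


def sample_ordering(log_weights, rng):
    """Draw one permutation (list of item indices) from a Plackett-Luce
    distribution parameterized by `log_weights`."""
    perturbed = []
    for lw in log_weights:
        u = max(rng.random(), 1e-300)      # keep u strictly inside (0, 1)
        gumbel = -math.log(-math.log(u))   # Gumbel(0, 1) noise
        perturbed.append(lw + gumbel)
    # Sorting the perturbed scores in descending order yields the sample.
    return sorted(range(len(log_weights)), key=lambda i: perturbed[i], reverse=True)


rng = random.Random(0)
# Toy parameters: example 0 has weight 2, examples 1 and 2 have weight 1.
log_weights = [math.log(2.0), 0.0, 0.0]
samples = [sample_ordering(log_weights, rng) for _ in range(20000)]

# Under Plackett-Luce, P(example 0 is placed first) = 2 / (2 + 1 + 1) = 0.5.
first_is_zero = sum(1 for s in samples if s[0] == 0) / len(samples)
print(f"empirical P(example 0 first) = {first_is_zero:.3f}")
```

Because each sample costs one sort, many candidate orderings can be drawn cheaply, scored on a task metric, and used to shift the log-weights toward orderings that perform well.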

151. ❌ TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference

Authors: Jaber Jaber, Osama Jaber Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21365v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

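Scoring rationale: TIDE is an LLM inference acceleration technique that uses a post-training system to implement per-token early exit. Core relevant keywords: (1) 'Large Language Models' (10 points): the paper explicitly targets LLM inference optimization; (2) 'Post-training' (10 points): the system is a post-training method that requires no model retraining; (3) 'Speculative Decoding OR Inference Acceleration' (10 points): it directly improves inference speed and throughput. Other keywords such as MoE, quantization, and alignment are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper presents TIDE, a system that attaches lightweight routers after training to enable per-token early exit during LLM inference, reducing prefill latency by 7.2% and increasing throughput by 6.6% to 8.1% in the reported setups while preserving accuracy.

Abstract translation (Chinese)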

大型语言模型无论计算难度如何，都会将每个词元输入所有层进行处理。本文提出TIDE，一种训练后增强系统，该系统在周期性检查点层附加微型学习路由器，并在推理阶段为每个词元选择其隐藏状态已收敛的最早层。TIDE无需模型重训练，兼容所有HuggingFace因果语言模型，可自动检测GPU架构，并通过融合CUDA内核支持float32、float16和bfloat16精度。在配备DeepSeek R1 Distill 8B的NVIDIA A100上，TIDE实现了100%的预填充提前退出率（5%的词元在第11层退出，其余在第31层退出），将预填充延迟降低7.2%，单批次吞吐量提升6.6%。在自回归解码过程中，98-99%的词元实现提前退出，同时模型能正确求解包含95个独立输出词元的多步骤数学问题。在Qwen3 8B（36层）测试中，批次大小为8时吞吐量提升8.1%。使用2,000个WikiText样本进行校准耗时不足3分钟，生成约4 MB的路由器检查点。该系统包含1,308行Python代码和1,081行CUDA/C++代码，并通过74项测试。代码地址：https://github.com/RightNow-AI/TIDE

Abstract

Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post-training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto-detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1 Distill 8B, TIDE achieves 100% prefill exit rate (5% of tokens exit at layer 11, the remaining at layer 31), reduces prefill latency by 7.2%, and increases single-batch throughput by 6.6%. During autoregressive decoding, 98-99% of tokens exit early while the model correctly solves a multi-step math problem with 95 unique output tokens. On Qwen3 8B (36 layers), throughput improves by 8.1% at batch size 8. Calibration on 2,000 WikiText samples takes under 3 minutes and produces a ~4 MB router checkpoint. The system comprises 1,308 lines of Python and 1,081 lines of CUDA/C++ with 74 passing tests. Code: https://github.com/RightNow-AI/TIDE

Keywords: early exit, LLM inference, post-training, token-level routing, inference acceleration, latency reduction, throughput improvement, CUDA optimization
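
The per-token exit decision described in the abstract above (pick the earliest checkpoint layer whose hidden state has converged) can be pictured with a small selection routine. This is a sketch only: the router scores are supplied as plain numbers rather than computed by learned routers, and the function name `choose_exit_layers`, the threshold, and the layer indices are hypothetical. It is not TIDE's implementation or API.

```python
def choose_exit_layers(router_scores, checkpoint_layers, final_layer, threshold=0.5):
    """For each token, return the earliest checkpoint layer whose router
    score reaches `threshold`; tokens with no such layer run to `final_layer`.

    `router_scores` is a list with one dict per token, mapping a checkpoint
    layer index to a convergence score in [0, 1].
    """
    exits = []
    for token_scores in router_scores:
        exit_layer = final_layer
        for layer in sorted(checkpoint_layers):
            if token_scores.get(layer, 0.0) >= threshold:
                exit_layer = layer
                break
        exits.append(exit_layer)
    return exits


# Hypothetical 36-layer model (indices 0..35) with three checkpoint layers.
checkpoints = [11, 21, 31]
scores = [
    {11: 0.90, 21: 0.95, 31: 0.99},  # easy token: converges early
    {11: 0.10, 21: 0.20, 31: 0.80},  # harder token: converges late
    {11: 0.10, 21: 0.30, 31: 0.40},  # never passes the threshold
]
exits = choose_exit_layers(scores, checkpoints, final_layer=35)
layers_skipped = [35 - e for e in exits]
print(exits, layers_skipped)
```

Raising the threshold trades speed for fidelity: fewer tokens exit early, so more computation is spent but outputs stay closer to the full model's.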

152. ❌ AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation

Authors: Liang Ding Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21362v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

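Scoring rationale: The paper's core topic is evaluation of LLM agents, so it is highly relevant to 'LLM Agents' (10 points); it trains agents with DPO, making it highly relevant to 'RLHF/DPO' (10 points); and it involves tool-use scenarios, giving some relevance to 'Tool Use' (5 points). Other keywords such as MoE, SLMs, and Scaling Laws are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses the problem that fixed rubrics in LLM agent evaluation cannot adapt to the needs of different tasks. It proposes AdaRubric, which automatically generates task-specific evaluation rubrics and raises correlation with human judgments (Pearson r=0.79, +0.16 over the best static baseline); DPO agents trained on its preference pairs achieve higher task success rates on several benchmarks.

Abstract translation (Chinese)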

LLM-as-Judge评估方法在智能体任务中表现不佳，因为固定的评估标准无法捕捉此类任务的关键要素：代码调试需要关注正确性与错误处理能力；网页导航则强调目标对齐与操作效率。为此，我们提出了ADARUBRIC方法，通过以下机制弥补这一缺陷：根据任务描述动态生成任务专属评估标准，以置信度加权的分维度反馈对任务轨迹进行逐步评分，并采用新颖的维度感知过滤器（DimensionAwareFilter）筛选偏好对——该过滤器被证明是防止高评分维度掩盖维度层面失败的必要条件。在WebArena和ToolBench基准测试中，ADARUBRIC与人类评估的皮尔逊相关系数达到r=0.79（较最佳静态基线提升0.16），且具备部署级可靠性（克里彭多夫α系数=0.83）。基于ADARUBRIC偏好对训练的DPO智能体在三个基准测试中，任务成功率较Prometheus提升6.8至8.5个百分点；其优势可迁移至SWE-bench代码修复任务（提升4.9个百分点），并在5千步训练时将PPO收敛速度提升6.6个百分点——以上成果均无需任何人工设计评估标准。代码地址：https://github.com/alphadl/AdaRubrics。

Abstract

LLM-as-Judge evaluation fails agent tasks because a fixed rubric cannot capture what matters for this task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC, which closes this gap by generating task-specific evaluation rubrics on the fly from task descriptions, scoring trajectories step-by-step with confidence-weighted per-dimension feedback, and filtering preference pairs with the novel DimensionAwareFilter - a provably necessary condition for preventing high-scoring dimensions from masking dimension-level failures. On WebArena and ToolBench, ADARUBRIC achieves Pearson r=0.79 human correlation (+0.16 over the best static baseline) with deployment-grade reliability (Krippendorff’s $\alpha$=0.83). DPO agents trained on ADARUBRIC preference pairs gain +6.8 to +8.5 pp task success over Prometheus across three benchmarks; gains transfer to SWE-bench code repair (+4.9 pp) and accelerate PPO convergence by +6.6 pp at 5K steps - both without any rubric engineering. Code: https://github.com/alphadl/AdaRubrics.

Keywords: LLM Agent Evaluation, Task-Adaptive Rubrics, DPO Training, Human Correlation, Preference Pairs, WebArena, ToolBench, SWE-bench

153. ❌ Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF

Authors: K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21359v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

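Scoring rationale: The paper evaluates LLM performance bias across dialects of a low-resource language and directly involves LLMs, RAG, and RLAIF. The abstract explicitly mentions 'Large language models (LLMs)', a 'retrieval-augmented generation (RAG) pipeline', and 'RLAIF evaluations', all central to the paper's methodology, so each receives 10 points. Other keywords such as MoE, SLMs, Scaling Laws, and Pre-training are neither mentioned nor relevant and receive 0 points. Although the paper is an AI application, it concerns language evaluation rather than bioinformatics or cheminformatics, so the 'AI for Science' keyword also receives 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a two-phase framework for evaluating LLM performance bias in question answering across nine Bengali dialects, finding that performance drops track linguistic divergence and that larger model scale does not consistently mitigate the bias.

Abstract translation (Chinese)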

大语言模型（LLM）在处理低资源语言的地区方言时，常表现出性能偏差，但目前仍缺乏量化这些差异的框架。我们提出了一个两阶段框架，用于评估LLM在九种孟加拉语方言上的问答方言偏差。首先，我们采用检索增强生成（RAG）流程，将标准孟加拉语问题翻译并人工标注为方言变体，构建了包含4,000个问题集的数据集。由于传统翻译质量评估指标对非标准化方言失效，我们采用“LLM即评判员”的方法评估翻译保真度，其与人工评估的相关性证实其优于传统指标。其次，我们基于这些人工标注集对19个LLM进行基准测试，通过多评判员一致性与人工复核验证，完成了68,395次RLAIF评估。研究结果显示，性能下降与语言差异程度密切相关。例如，对于差异极大的吉大港方言，模型回答得分仅为5.44/10，而坦盖尔方言则达到7.68/10。此外，增大模型规模并不能持续缓解这种偏差。本研究贡献包括：一套经过验证的翻译质量评估方法、一个严谨的基准数据集，以及面向安全关键应用的临界偏差敏感度（Critical Bias Sensitivity, CBS）度量指标。

Abstract

Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages. However, frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate and gold-label standard Bengali questions into dialectal variants adopting a retrieval-augmented generation (RAG) pipeline to prepare 4,000 question sets. Since traditional translation quality evaluation metrics fail on unstandardized dialects, we evaluate fidelity using an LLM-as-a-judge, which human correlation confirms outperforms legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence. For instance, responses to the highly divergent Chittagong dialect score 5.44/10, compared to 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.

Keywords: Large language models, Bengali dialects, dialectal bias, retrieval-augmented generation, RLAIF, benchmarking, question-answering, low-resource languages

154. ❌ AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

Authors: Liang Ding Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21357v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

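Scoring rationale: The paper studies how to make use of failed trajectories of LLM agents on real-world tasks (such as WebArena navigation and ToolBench tool use), proposing the AgentHER framework to relabel failed trajectories as high-quality training data for SFT and DPO. It is therefore highly relevant to LLM Agents, Tool Use, SFT, and DPO (10 points each) and relevant to LLMs (10 points); other keywords such as MoE, SLMs, and RAG are not covered (0 points).

!!! tip deepseek-chat TL;DR

Addressing the problem that failed trajectories of LLM agents on real tasks are discarded, the paper proposes the AgentHER framework, which relabels failed trajectories as high-quality training data and improves agent performance on WebArena and ToolBench by 7.1 to 11.7 pp over success-only SFT.

Abstract translation (Chinese)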

大语言模型智能体在多数现实任务中表现不佳——GPT-4o在WebArena导航任务中的成功率低于15%，在ToolBench上的pass@1准确率不足55%（Zhou et al., 2024; Qin et al., 2024）——然而每个失败轨迹通常都被直接丢弃，浪费了经验收集的主要来源。我们提出AgentHER框架，通过将后见经验回放（Hindsight Experience Replay, HER；Andrychowicz et al., 2017）原理适配至自然语言智能体轨迹，以恢复这些丢失的训练信号，实现离线数据增强。其核心洞见简明扼要：未能达成目标A的轨迹，往往能成为某个可达成的替代目标B的正确示范。AgentHER通过四阶段流程实现这一构想——失败分类、结果提取、基于置信度门控的LLM引导提示重标注以及数据封装——将废弃的失败轨迹转化为高质量的监督微调（SFT）、直接偏好优化（DPO）和ShareGPT格式的训练数据，并提供零成本的基于规则的实现与LLM评判器实现两种方案。在WebArena（Zhou et al., 2024）和ToolBench（Qin et al., 2024）测试中，AgentHER在四个模型系列（GPT-4o、Qwen2.5-72B/7B、LLaMA-3.1-8B）上相较仅使用成功轨迹的SFT提升了7.1-11.7个百分点，同时实现2倍数据效率——仅用50%的成功示范即可达到基线性能。该增益在1.5B至72B参数规模范围内保持稳定（提升5.8-9.2个百分点），并在迭代重部署中持续累积（额外轮次中再提升2.1个百分点）。人工评估证实，在多评判器验证下重标注精确度达97.7%。

Abstract

LLM agents fail on the majority of real-world tasks – GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) – yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation. The key insight is simple: a trajectory that fails goal A is often a correct demonstration for some achievable alternative goal B. AgentHER realises this idea through a four-stage pipeline – failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging – that converts discarded failures into high-quality SFT, DPO, and ShareGPT training data, with both zero-cost rule-based and LLM-judge implementations. On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency – matching baseline performance with only 50% of successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8-9.2 pp) and compound under iterative redeployment (+2.1 pp over additional rounds). Human evaluation confirms 97.7% relabeling precision under multi-judge verification.

Keywords: LLM agents, Hindsight Experience Replay, trajectory relabeling, SFT, DPO, offline data augmentation, WebArena, ToolBench
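
The hindsight relabeling idea in the abstract above (a trajectory that fails goal A may be a correct demonstration for an achievable goal B, kept only when the relabeling is confident enough) can be sketched as a small filter. Everything here is a hypothetical illustration: the `Trajectory` fields, the toy `rule_based_proposer`, the confidence values, and the output record layout are assumptions, and the paper's four-stage pipeline is not reproduced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Trajectory:
    goal: str          # goal the agent was originally given
    steps: List[str]   # actions the agent took
    outcome: str       # what the trajectory actually achieved ("" if nothing)
    success: bool      # whether the original goal was met


def rule_based_proposer(traj: Trajectory) -> Tuple[str, float]:
    """Toy stand-in for the relabeling step: propose a goal describing what
    was actually achieved, with a made-up confidence value."""
    if not traj.outcome:
        return "", 0.0
    return f"Reach this state: {traj.outcome}", 0.9


def relabel_failures(
    trajectories: List[Trajectory],
    proposer: Callable[[Trajectory], Tuple[str, float]],
    min_confidence: float = 0.8,
) -> List[Dict[str, object]]:
    """Turn failed trajectories into supervised examples for a relabeled
    goal, keeping only relabelings that pass the confidence gate."""
    relabeled = []
    for traj in trajectories:
        if traj.success:
            continue  # successes are already usable as-is
        new_goal, confidence = proposer(traj)
        if confidence >= min_confidence:
            relabeled.append(
                {"prompt": new_goal, "response": traj.steps, "original_goal": traj.goal}
            )
    return relabeled


trajectories = [
    Trajectory("Book the cheapest flight", ["search", "open result"], "flight results page is open", False),
    Trajectory("Cancel the order", ["click menu"], "", False),
    Trajectory("Open settings", ["click gear"], "settings page is open", True),
]
examples = relabel_failures(trajectories, rule_based_proposer)
print(examples)
```

The confidence gate is what keeps relabeled data from adding noise: a failure with no identifiable achieved outcome is dropped rather than paired with a guessed goal.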

155. ❌ Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles

Authors: Adi Gabay, Gabriel Stanovsky, Liat Peterfreund Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21350v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

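Scoring rationale: The paper studies how LLMs differ in epistemic reasoning versus memorization/reduction, evaluated on classic logic puzzles. Highly relevant keywords: 'Large Language Models' (the object of study), and 'Chain of Thought' and 'System 2 Thinking' (analysis of multi-step and in-depth reasoning). Moderately relevant: 'LLM Agents' (reasoning about agents' knowledge) and 'Mechanistic Interpretability' (analysis of the model's internal behavior). Other keywords, such as model architecture, training methods, and optimization techniques, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The study introduces a 'reduction ladder' to evaluate large language models on epistemic reasoning tasks, separating in-depth reasoning from simple memorization or reduction, and finds that models generally struggle once genuine epistemic reasoning is required.

Abstract (Chinese translation)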

认知推理要求智能体根据局部观察和其他智能体知识的信息来推断世界状态。先前针对大语言模型在经典认知谜题上的评估研究，将其行为解释为认知推理与机械记忆之间的二元对立。我们认为这种框架并不完整：在近期模型中，记忆应被更好地理解为归约的一种特例，即新实例被映射到已知问题上。为此，我们提出了“归约阶梯”框架——通过一系列渐进式修改，使实例逐步偏离经典认知谜题，在保持底层逻辑不变的同时不断增加归约难度。研究发现，虽然某些大型模型能通过归约取得成功，但其他模型在早期阶段即告失败，且一旦需要真正的认知推理，所有模型均表现出显著困难。

Abstract

Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents’ knowledge. Prior work evaluating LLMs on canonical epistemic puzzles interpreted their behavior through a dichotomy between epistemic reasoning and brittle memorization. We argue that this framing is incomplete: in recent models, memorization is better understood as a special case of reduction, where a new instance is mapped onto a known problem. Instead, we introduce a reduction ladder, a sequence of modifications that progressively move instances away from a canonical epistemic puzzle, making reduction increasingly difficult while preserving the underlying logic. We find that while some large models succeed via reduction, other models fail early, and all models struggle once epistemic reasoning is required.

Keywords: Large Language Models, Epistemic Reasoning, Reduction, Logic Puzzles, Cognitive Evaluation, Reasoning Capabilities, Memorization, Model Behavior Analysis

156. ❌ Generalized Discrete Diffusion from Snapshots

Authors: Oussama Zekri, Théo Uscidda, Nicolas Boullé, Anna Korba Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21342v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

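Scoring rationale: The paper focuses on a general framework for discrete diffusion models (GDDS), covering diffusion modeling, uniformization, the evidence lower bound (ELBO), and large-vocabulary discrete generation tasks. All keywords concern large language models, deep-learning techniques, or scientific applications, whereas this paper studies a general mathematical framework for generative modeling with no direct link to them, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes GDDS, a general discrete diffusion modeling framework that supports arbitrary noising processes. On large discrete state spaces it achieves higher training efficiency and generation quality than existing methods, and it surpasses autoregressive models on large-scale tasks for the first time.

Abstract (Chinese translation)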

我们提出广义离散扩散快照模型（Generalized Discrete Diffusion from Snapshots，简称GDDS），这是一个用于离散扩散建模的统一框架，支持在大型离散状态空间上进行任意噪声化过程。我们的框架涵盖了所有现有的离散扩散方法，同时在选择破坏性动力学方面允许更大的灵活性。前向噪声化过程基于均匀化方法，能够实现快速的任意破坏。对于反向过程，我们基于快照隐变量（而非整个噪声化路径）推导出一个简单的证据下界（ELBO），使得能够以清晰的概率解释高效训练标准生成建模架构。我们在大规模离散生成任务上的实验表明，所提出的框架在训练效率和生成质量方面优于现有离散扩散方法，并首次在此规模上超越自回归模型。我们在项目页面（https://oussamazekri.fr/gdds）上提供了代码及相关博客文章。

Abstract

We introduce Generalized Discrete Diffusion from Snapshots (GDDS), a unified framework for discrete diffusion modeling that supports arbitrary noising processes over large discrete state spaces. Our formulation encompasses all existing discrete diffusion approaches, while allowing significantly greater flexibility in the choice of corruption dynamics. The forward noising process relies on uniformization and enables fast arbitrary corruption. For the reverse process, we derive a simple evidence lower bound (ELBO) based on snapshot latents, instead of the entire noising path, that allows efficient training of standard generative modeling architectures with clear probabilistic interpretation. Our experiments on large-vocabulary discrete generation tasks suggest that the proposed framework outperforms existing discrete diffusion methods in terms of training efficiency and generation quality, and beats autoregressive models for the first time at this scale. We provide the code along with a blog post on the project page: https://oussamazekri.fr/gdds

Keywords: discrete diffusion modeling, generalized framework, uniformization, evidence lower bound (ELBO), large-vocabulary discrete generation, generative modeling, training efficiency, generation quality
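
The abstract states that the forward noising process relies on uniformization. As a rough illustration, the sketch below applies the textbook uniformization construction for a continuous-time Markov chain to a single token: with rate matrix Q and a rate λ at least as large as every exit rate, the chain makes a Poisson(λt) number of jumps under the kernel P = I + Q/λ. Whether GDDS uses exactly this construction is an assumption, and the function names and the three-state rate matrix are hypothetical.

```python
import random


def uniformized_kernel(rate_matrix):
    """Return (lam, P) with P = I + Q / lam, where lam is the largest exit rate."""
    n = len(rate_matrix)
    lam = max(-rate_matrix[i][i] for i in range(n))  # assumes at least one nonzero rate
    kernel = [
        [(1.0 if i == j else 0.0) + rate_matrix[i][j] / lam for j in range(n)]
        for i in range(n)
    ]
    return lam, kernel


def sample_poisson(mean, rng):
    """Count unit-rate exponential arrivals in [0, mean), i.e. a Poisson(mean) draw."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def corrupt(state, t, rate_matrix, rng):
    """Noise one token up to time t: Poisson(lam * t) jumps of the uniformized chain."""
    lam, kernel = uniformized_kernel(rate_matrix)
    for _ in range(sample_poisson(lam * t, rng)):
        state = rng.choices(range(len(kernel)), weights=kernel[state])[0]
    return state


# Hypothetical 3-token vocabulary with uniform corruption (each row sums to 0).
uniform_rates = [
    [-1.0, 0.5, 0.5],
    [0.5, -1.0, 0.5],
    [0.5, 0.5, -1.0],
]
noisy_token = corrupt(0, 2.0, uniform_rates, random.Random(0))
```

Because only the number of jumps and the final state are sampled, a token can be corrupted to any time t without simulating the whole path, which is in the spirit of the abstract's "snapshot" wording.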

157. ❌ Enhancing reasoning accuracy in large language models during inference time

Authors: Vinay Sharma, Manish Jain Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21301v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

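Scoring rationale: The paper studies how to improve LLM accuracy on reasoning tasks and directly involves 'Large Language Models' (the core object of study), 'Chain of Thought' (CoT prompting is used), and 'Self-Reflection' (one of the three strategies evaluated). 'System 2 Thinking' and 'Hallucination Mitigation' are partially relevant, because improving reasoning accuracy indirectly touches on deliberate thinking and factuality. Other keywords, such as MoE, SLMs, training methods, acceleration techniques, and scientific applications, are not mentioned in the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

The study systematically evaluates three inference-time strategies (self-consistency, dual-model agreement, and self-reflection) for improving LLM accuracy on multi-step reasoning tasks, and finds that self-consistency yields a substantial 9% to 15% improvement in accuracy.

Abstract (Chinese translation)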

大语言模型（LLMs）通常展现出强大的语言能力，但在多步骤推理任务上仍不可靠，尤其是在未经额外训练或微调直接部署时。本研究探讨了通过推理时技术来提升大语言模型推理准确性的方法。我们系统评估了三类推理时策略：（i）基于随机解码的自我一致性，即通过控制温度参数和核采样对模型进行多次采样，并选择出现频率最高的最终答案；（ii）双模型推理一致性，即比较两个独立模型的输出，仅信任推理轨迹一致的结果；（iii）自我反思，即模型对自身推理过程进行批判与修正。在所有评估方法中，我们均采用思维链（Chain-of-Thought, CoT）[1]提示技术，引导模型在生成最终答案前先展示明确的中间推理步骤。本研究在统一的提示与验证设置下，对三种推理时策略进行了受控比较评估。我们在LLM[2]上的实验表明：采用核采样与受控温度参数的自我一致性方法效果显著，相比贪婪单次解码在准确率上实现了9%至15%的绝对提升，该方法非常适合低风险领域，能以极低开销带来显著增益；双模型方法为模型推理步骤提供了额外确认，因而更适用于中等风险领域，其更高的可靠性可抵消额外的计算成本；自我反思仅带来边际改善，表明对于规模较小、非专门用于推理的模型，在推理时采用该策略效果有限。

Abstract

Large Language Models (LLMs) often exhibit strong linguistic abilities while remaining unreliable on multi-step reasoning tasks, particularly when deployed without additional training or fine-tuning. In this work, we study inference-time techniques to improve the reasoning accuracy of LLMs. We systematically evaluate three classes of inference-time strategies: (i) self-consistency via stochastic decoding, where the model is sampled multiple times using controlled temperature and nucleus sampling and the most frequent final answer is selected; (ii) dual-model reasoning agreement, where outputs from two independent models are compared and only consistent reasoning traces are trusted; and (iii) self-reflection, where the model critiques and revises its own reasoning. Across all evaluated methods, we employ Chain-of-Thought (CoT) [1] prompting to elicit explicit intermediate reasoning steps before generating final answers. We provide a controlled comparative evaluation across the three inference-time strategies under identical prompting and verification settings. Our experiments on LLM [2] show that self-consistency with nucleus sampling and a controlled temperature value yields substantial gains, achieving a 9% to 15% absolute improvement in accuracy over greedy single-pass decoding; it is well suited to low-risk domains, offering meaningful gains with minimal overhead. The dual-model approach provides additional confirmation of the model's reasoning steps and is thus more appropriate for moderate-risk domains, where higher reliability justifies additional compute. Self-reflection offers only marginal improvements, suggesting limited effectiveness for smaller non-reasoning models at inference time.

Keywords: Large Language Models, reasoning accuracy, inference-time techniques, self-consistency, self-reflection, Chain-of-Thought, multi-step reasoning, stochastic decoding
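
Strategy (i) in the abstract, self-consistency, amounts to a majority vote over sampled final answers. The following is a minimal sketch under that reading. The `self_consistency` helper and the canned answer list are hypothetical; a real setup would replace the stub with calls to a model sampled at non-zero temperature with nucleus sampling.

```python
from collections import Counter
from typing import Callable


def self_consistency(sample_answer: Callable[[], str], n_samples: int = 5) -> tuple[str, float]:
    """Sample n final answers and return the most frequent one with its vote share."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Stand-in for stochastic decoding: a fixed sequence of sampled final answers.
canned = iter(["42", "41", "42", "43", "42"])
answer, share = self_consistency(lambda: next(canned), n_samples=5)
```

The vote share can double as a crude confidence signal: a low share indicates that the sampled reasoning paths disagree.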

158. ❌ More Than Sum of Its Parts: Deciphering Intent Shifts in Multimodal Hate Speech Detection

Authors: Runze Sun, Yu Zheng, Zexuan Xiong, Zhongjin Qu, Lei Chen, Jiwen Lu, Jie Zhou Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21298v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

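Scoring rationale: The paper focuses on multimodal hate speech detection and proposes a framework based on courtroom-style agent debate (ARCADE) together with a new benchmark dataset (H-VLI). Although it concerns an application of AI to content moderation, its core content (multimodal fusion, intent-shift analysis, and an agent debate framework) has no direct link to the specific technical topics in the keyword list, such as large-model techniques, training methods, inference optimization, or agent systems. The paper mentions no techniques tied to keywords such as large language models, MoE, scaling laws, fine-tuning methods, RAG, attention mechanisms, chain of thought, or quantization. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

To address the difficulty of detecting hate speech as it evolves from plain text to complex multimodal forms on social media, the paper proposes ARCADE, a framework that simulates a courtroom debate to analyze semantic interactions between modalities in depth. It significantly outperforms existing methods on the authors' H-VLI benchmark, especially on implicit hateful content.

Abstract (Chinese translation)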

打击社交媒体上的仇恨言论对维护网络安全至关重要，但这高度依赖于自动检测系统的效能。随着内容形式的演变，仇恨言论正从纯文本形式转向复杂的多模态表达，使得隐性攻击更难被发现。然而，当前系统在处理这些微妙案例时常显不足，因为它们难以应对多模态内容中涌现的、超越各模态简单叠加的含义。为弥补这一差距，我们超越二元分类的框架，转而刻画语义意图的转移过程——即多模态如何通过交互，从良性线索中构建隐性仇恨，或通过语义反转来中和毒性内容。基于这一细粒度的问题定义，我们构建了“视觉-语言交互仇恨”（H-VLI）基准数据集，其中真实意图取决于多模态间复杂的相互作用，而非显性的视觉或文本侮辱。为有效解析这些复杂线索，我们进一步提出了“基于法庭代理辩论的非对称推理”（ARCADE）框架。通过模拟司法流程，让代理模型主动进行指控与辩护辩论，ARCADE迫使模型在作出裁决前深入审视深层语义线索。大量实验表明，ARCADE在H-VLI基准上显著优于现有先进基线模型，尤其在具有挑战性的隐性案例中表现突出，同时在现有成熟基准上保持了竞争力。我们的代码与数据公开于：https://github.com/Sayur1n/H-VLI

Abstract

Combating hate speech on social media is critical for securing cyberspace, yet relies heavily on the efficacy of automated detection systems. As content formats evolve, hate speech is transitioning from solely plain text to complex multimodal expressions, making implicit attacks harder to spot. Current systems, however, often falter on these subtle cases, as they struggle with multimodal content where the emergent meaning transcends the aggregation of individual modalities. To bridge this gap, we move beyond binary classification to characterize semantic intent shifts where modalities interact to construct implicit hate from benign cues or neutralize toxicity through semantic inversion. Guided by this fine-grained formulation, we curate the Hate via Vision-Language Interplay (H-VLI) benchmark where the true intent hinges on the intricate interplay of modalities rather than overt visual or textual slurs. To effectively decipher these complex cues, we further propose the Asymmetric Reasoning via Courtroom Agent DEbate (ARCADE) framework. By simulating a judicial process where agents actively argue for accusation and defense, ARCADE forces the model to scrutinize deep semantic cues before reaching a verdict. Extensive experiments demonstrate that ARCADE significantly outperforms state-of-the-art baselines on H-VLI, particularly for challenging implicit cases, while maintaining competitive performance on established benchmarks. Our code and data are available at: https://github.com/Sayur1n/H-VLI

Keywords: multimodal hate speech detection, intent shifts, vision-language interplay, implicit attacks, courtroom agent debate, semantic cues, benchmark dataset, social media content moderation

159. ❌ Conversation Tree Architecture: A Structured Framework for Context-Aware Multi-Branch LLM Conversations

Authors: Pranav Hemanth, Sampriti Saha Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21278v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

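Scoring rationale: The paper studies context management for LLMs in extended multi-topic conversations and proposes the Conversation Tree Architecture framework. Highly relevant keywords: (1) 'Large Language Models' (the paper explicitly studies LLM conversation systems); (2) 'Context Window Extension' (it addresses logical context poisoning caused by accumulation in the context window); (3) 'LLM Agents' and 'Multi-agent Systems' (the framework supports multi-agent settings). Other keywords, such as MoE, SFT, and RAG, are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address logical context poisoning, which arises when the context window accumulates without bound in extended multi-topic LLM conversations, the paper proposes the Conversation Tree Architecture, which provides structured context management through a tree structure and node isolation.

Abstract (Chinese translation)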

大型语言模型（LLM）正越来越多地被部署用于长篇幅、多主题的对话，然而当前对话界面采用的扁平化、仅追加式结构存在一个根本性限制：所有上下文都累积在单一的无边界窗口中，导致主题不同的对话线程相互干扰，并逐步降低响应质量。我们将这种失效模式称为逻辑上下文污染。本文中，我们提出了对话树架构（Conversation Tree Architecture, CTA），这是一种分层框架，它将LLM对话组织为离散的、上下文隔离的节点树。每个节点维护其自身的本地上下文窗口；结构化机制控制着上下文如何在父节点与子节点之间流动——在创建分支时向下游传递，在删除分支时向上游回溯。此外，我们引入了易失性节点，即瞬时分支，其本地上下文在清除前必须被选择性地向上合并或永久丢弃。我们形式化了该架构的基本要素，阐述了上下文流动中存在的开放性设计问题，将我们的框架与先前LLM记忆管理方面的研究联系起来，并描述了一个可运行的原型实现。CTA为结构化的对话上下文管理提供了原则性基础，并能自然地扩展到多智能体场景。

Abstract

Large language models (LLMs) are increasingly deployed for extended, multi-topic conversations, yet the flat, append-only structure of current conversation interfaces introduces a fundamental limitation: all context accumulates in a single unbounded window, causing topically distinct threads to bleed into one another and progressively degrade response quality. We term this failure mode logical context poisoning. In this paper, we introduce the Conversation Tree Architecture (CTA), a hierarchical framework that organizes LLM conversations as trees of discrete, context-isolated nodes. Each node maintains its own local context window; structured mechanisms govern how context flows between parent and child nodes, downstream on branch creation and upstream on branch deletion. We additionally introduce volatile nodes, transient branches whose local context must be selectively merged upward or permanently discarded before purging. We formalize the architecture’s primitives, characterize the open design problems in context flow, relate our framework to prior work in LLM memory management, and describe a working prototype implementation. The CTA provides a principled foundation for structured conversational context management and extends naturally to multi-agent settings.

Keywords: Large Language Models, Conversation Tree Architecture, Context Management, Logical Context Poisoning, Multi-topic Conversations, Hierarchical Framework, Context Window, Multi-agent Systems
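
The tree of context-isolated nodes described in the abstract can be pictured with a small data structure. The sketch below is an illustration under stated assumptions, not the authors' prototype: the `Node` class, the `inherit_last` and `merge_up` parameters, and the policy of copying only the most recent parent messages downstream are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import Optional, Sequence


@dataclass(eq=False)  # identity-based equality avoids recursive parent/child comparisons
class Node:
    name: str
    parent: Optional["Node"] = None
    volatile: bool = False  # transient branch that must be merged or discarded
    context: list[str] = field(default_factory=list)   # this node's local window
    children: list["Node"] = field(default_factory=list)

    def branch(self, name: str, inherit_last: int = 2, volatile: bool = False) -> "Node":
        """Downstream flow: the child starts with the parent's last few messages."""
        start = max(0, len(self.context) - inherit_last)
        child = Node(name, parent=self, volatile=volatile, context=list(self.context[start:]))
        self.children.append(child)
        return child

    def delete(self, merge_up: Sequence[str] = ()) -> None:
        """Upstream flow: merge the selected messages into the parent, then detach."""
        if self.parent is None:
            raise ValueError("cannot delete the root node")
        for message in merge_up:
            self.parent.context.append(f"[{self.name}] {message}")
        self.parent.children.remove(self)
        self.parent = None


root = Node("trip-planning")
root.context += ["user: plan a trip to Kyoto", "assistant: which dates?"]

# A side question lives in its own volatile branch and does not pollute the root.
side = root.branch("visa-question", inherit_last=1, volatile=True)
side.context += ["user: do I need a visa?", "assistant: check the embassy site"]

# On deletion, only the selected message flows back upstream.
side.delete(merge_up=["assistant: check the embassy site"])
```

Passing an empty `merge_up` corresponds to permanently discarding a volatile branch's local context.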

160. ❌ The Library Theorem: How External Organization Governs Agentic Reasoning Capacity

Authors: Zachary F. Mainen Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21272v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

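Scoring rationale: The paper studies how transformer-based agents (LLM Agents) can strengthen reasoning through structured retrieval (Retrieval-Augmented Generation) and tool use (Tool Use), in particular by indexing external memory to reduce retrieval cost, which relates directly to Chain of Thought reasoning. It treats the transformer context window as an I/O page and studies agent performance with indexed external memory, touching on Context Window Extension and System 2 Thinking (in-depth reasoning). Other keywords, such as MoE, SLMs, scaling laws, training methods, alignment, compression, and interpretability, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper studies how transformer-based agents can structure retrieval by indexing external memory, reducing retrieval cost during reasoning exponentially. It finds that language models do well at index construction, but that index traversal should use deterministic algorithms to avoid interference from parametric memory.

Abstract (Chinese translation)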

基于Transformer的智能体已通过思维链实现外部化推理，但结构化检索——即对自身推理状态建立索引——仍未被充分探索。本文将Transformer上下文窗口形式化为I/O页面，并证明具备索引化外部记忆的工具增强型智能体，其检索成本相比受限于顺序扫描的智能体呈指数级降低：每次查询的页面读取次数从$Ω(N)$降至$O(\log_b N)$，在$T$步推理过程中的累计成本从$Θ(T^2)$降至$O(T \log_b T)$——这一差距随思考深度增加而扩大。我们在一个受控查找基准测试中验证了这些预测，测试涵盖三种内容类型（随机哈希值、有序整数和百科全书条目），存储规模从50到5,000项不等，并在两个模型世代（GPT-4o-mini与GPT-5.4）中复现了关键条件。对于抽象内容，索引化智能体无论存储规模大小均实现中位数1次页面读取，符合$O(1)$预测。未建立索引的排序页面未能缩小差距：较弱模型无法维持大规模二分搜索，较强模型虽能实现接近最优的$\log_2 N$搜索，但仍比索引方案慢$5$倍。在熟悉内容（百科全书条目）上则出现竞争性失效模式：模型识别出内容领域后，绕过检索协议直接从参数化记忆中生成答案，导致即使索引健全时仍产生灾难性的令牌消耗。这种参数化记忆竞争分离了索引化本应结合的两类认知操作：理解内容（语言模型擅长）与遵循导航协议（当理解内容诱使模型走捷径时，其表现会失效）。该结果表明需要关注点分离：利用语言模型进行索引构建（语义理解在此有益），而采用确定性算法执行索引遍历（语义理解在此反而有害）。

Abstract

Externalized reasoning is already exploited by transformer-based agents through chain-of-thought, but structured retrieval – indexing over one’s own reasoning state – remains underexplored. We formalize the transformer context window as an I/O page and prove that tool-augmented agents with indexed external memory achieve exponentially lower retrieval cost than agents restricted to sequential scanning: $O(\log_b N)$ versus $Ω(N)$ page reads per query, and $O(T \log_b T)$ versus $Θ(T^2)$ cumulative cost over $T$ reasoning steps – a gap that widens as deliberation deepens. We test these predictions on a controlled lookup benchmark across three content types – random hashes, ordered integers, and encyclopedia entries – varying store size from 50 to 5,000 items, and replicate key conditions across two model generations (GPT-4o-mini and GPT-5.4). On abstract content, the indexed agent achieves median 1 page read regardless of store size, confirming the $O(1)$ prediction. Sorted pages without an index fail to close the gap: the weaker model cannot sustain binary search at scale, and the stronger model achieves near-optimal $\log_2 N$ search but still loses to the index by $5\times$. On familiar content (encyclopedia entries), a competing failure mode emerges: the model recognizes the domain, bypasses the retrieval protocol, and generates answers from parametric memory, producing catastrophic token expenditure even when the index is sound. This parametric memory competition dissociates the two cognitive operations that indexing combines: understanding content (where language models excel) and following navigational protocols (where they fail when understanding tempts them to shortcut). The result argues for a separation of concerns: use language models for index construction, where semantic understanding helps, and deterministic algorithms for index traversal, where it hurts.

Keywords: Transformer agents, external memory, indexed retrieval, chain-of-thought, tool-augmented agents, reasoning capacity, parametric memory, deterministic algorithms
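
The asymptotic gap stated in the abstract, sequential scanning versus indexed lookup, can be made concrete with a simple counting model. The sketch below is not the paper's benchmark. It assumes a hypothetical store that grows by one item per reasoning step, pages that hold `page_size` items, and an index whose depth equals the height of a b-ary tree.

```python
import math


def scan_reads(n_items: int, page_size: int) -> int:
    """Worst-case page reads for a sequential scan: every page is read."""
    return math.ceil(n_items / page_size)


def index_reads(n_items: int, branching: int) -> int:
    """Page reads to descend a b-ary index: the smallest d with branching**d >= n_items."""
    reads, capacity = 1, branching
    while capacity < n_items:
        capacity *= branching
        reads += 1
    return reads


def cumulative_cost(cost, steps: int, param: int) -> int:
    """Total reads over `steps` reasoning steps, one lookup per step on a growing store."""
    return sum(cost(t, param) for t in range(1, steps + 1))


per_query_scan = scan_reads(5000, 10)                 # grows linearly in N
per_query_index = index_reads(5000, 10)               # grows like log_b N
total_scan = cumulative_cost(scan_reads, 100, 10)     # grows like T^2
total_index = cumulative_cost(index_reads, 100, 10)   # grows like T log_b T
```

With these assumed parameters, a store of 5,000 items costs 500 page reads per scan against 4 for the index, and over 100 steps the cumulative totals are 550 against 190.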

161. ❌ Explainable Semantic Textual Similarity via Dissimilar Span Detection

Authors: Diego Miguel Lozano, Daryna Dementieva, Alexander Fraser Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21174v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

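Scoring rationale: The paper studies the interpretability of semantic textual similarity (STS). It proposes the Dissimilar Span Detection (DSD) task to identify semantically differing parts of a text pair, and uses LLMs both to help build the dataset and as one of the baselines. It is therefore partially relevant to 'Large Language Models' (5 points), and highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10 points), because its core aim is to make STS more interpretable. Other keywords, such as MoE, SFT, and RAG, are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address the lack of interpretability in semantic textual similarity (STS), the paper proposes the Dissimilar Span Detection (DSD) task, which identifies semantically differing parts of a text pair, and builds the Span Similarity Dataset (SSD). Experiments show that LLMs and supervised models perform best on DSD but leave room for improvement, and that DSD can improve paraphrase detection.

Abstract (Chinese translation)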

语义文本相似度（Semantic Textual Similarity, STS）是众多自然语言处理（Natural Language Processing, NLP）应用的核心组成部分。然而，现有方法通常将语义上的细微差别简化为单一分数，限制了结果的可解释性。为解决此问题，我们引入了差异片段检测（Dissimilar Span Detection, DSD）任务，其旨在识别文本对之间语义存在差异的片段。这有助于用户理解哪些特定的词语或标记对相似度分数产生了负面影响，或可用于提升依赖STS的下游任务性能。此外，我们发布了一个适用于该任务的新数据集——片段相似度数据集（Span Similarity Dataset, SSD），该数据集通过结合大语言模型（Large Language Models, LLMs）与人工验证的半自动化流程构建而成。我们针对DSD任务提出并评估了多种基线方法，包括基于LIME、SHAP、LLMs的无监督方法、我们自研的无监督方法，以及一种有监督方法。尽管大语言模型和有监督模型取得了最佳性能，但总体结果仍处于较低水平，这凸显了该任务的复杂性。最后，我们通过一项额外实验证明，DSD能够提升特定任务——复述检测——的性能。

Abstract

Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.

Keywords: Semantic Textual Similarity, Dissimilar Span Detection, Explainable AI, Large Language Models, Span Similarity Dataset, Paraphrase Detection, Natural Language Processing

162. ❌ Entropy Alone is Insufficient for Safe Selective Prediction in LLMs

Authors: Edward Phillips, Fredrik K. Gustafsson, Sean Wu, Anshul Thakur, David A. Clifton Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21172v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
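
The table can be read against the pass mark stated in the report header (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each). The following is a minimal Python sketch, not the report generator's actual code. It assumes each row's score is weight × relevance, which is inferred from the table rather than documented, and the function names are hypothetical.

```python
def pass_mark(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Pass mark as stated in the report header: 5.0 + 0.8 x total weight."""
    return base + slope * total_weight


def total_score(rows):
    """Sum weight * relevance over (weight, relevance) rows.

    ASSUMPTION: the per-row rule is inferred from the table, not documented.
    """
    return sum(weight * relevance for weight, relevance in rows)


# The three non-zero relevance rows of the table; every weight is listed as 0.0.
rows = [(0.0, 10.0), (0.0, 10.0), (0.0, 5.0)]

threshold = pass_mark(27 * 1.0)
score = total_score(rows)
print(round(threshold, 1), score, score >= threshold)
```

Under this assumption, a weight of 0.0 on every row forces a total of 0.0 regardless of relevance, which matches the 0.0 / 26.6 shown for this paper.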

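Scoring rationale: The paper's core contribution is a selective prediction system for LLMs that mitigates hallucination by combining entropy with a correctness probe signal, so it is highly relevant to 'Large Language Models' and 'Hallucination Mitigation' (10 points each). It is evaluated on biomedical QA benchmarks such as BioASQ and MedicalQA, giving it some relevance to 'AI for Science' (5 points). Other keywords such as MoE, SFT, RAG, and quantization are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that entropy alone is a flawed uncertainty measure for selective prediction in LLMs, and proposes combining entropy with a correctness probe signal, which improves the risk-coverage trade-off and calibration on several QA benchmarks.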

Abstract

Selective prediction systems can mitigate harms resulting from language model hallucinations by abstaining from answering in high-risk cases. Uncertainty quantification techniques are often employed to identify such cases, but are rarely evaluated in the context of the wider selective prediction policy and its ability to operate at low target error rates. We identify a model-dependent failure mode of entropy-based uncertainty methods that leads to unreliable abstention behaviour, and address it by combining entropy scores with a correctness probe signal. We find that across three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score generally improves both the risk–coverage trade-off and calibration performance relative to entropy-only baselines. Our results highlight the importance of deployment-facing evaluation of uncertainty methods, using metrics that directly reflect whether a system can be trusted to operate at a stated risk level.
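
To make the risk-coverage trade-off concrete, here is a minimal Python sketch. It ranks answers by a confidence score and reports the error rate among the answered items at each coverage level. The `combined_score` function is a hypothetical linear mix, since the abstract does not say how entropy and the probe signal are merged, and the toy data are invented.

```python
def combined_score(entropy, probe_prob, alpha=0.5):
    """HYPOTHETICAL combination (higher = more confident).

    The abstract does not specify how the two signals are merged.
    """
    return alpha * probe_prob - (1.0 - alpha) * entropy


def risk_coverage(scores, correct):
    """Return (coverage, risk) pairs when answering the k most confident items."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        curve.append((k / len(scores), errors / k))
    return curve


# Toy data, made up for illustration only.
entropy = [0.2, 0.3, 1.5, 0.1]
probe = [0.9, 0.2, 0.4, 0.7]
correct = [True, False, False, True]

scores = [combined_score(e, p) for e, p in zip(entropy, probe)]
for coverage, risk in risk_coverage(scores, correct):
    print(f"coverage={coverage:.2f} risk={risk:.2f}")
```

A system that must operate at a stated risk level would answer only up to the largest coverage whose risk stays below that level and abstain on the rest.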

Keywords: selective prediction, uncertainty quantification, language model hallucinations, entropy-based methods, correctness probe, risk-coverage trade-off, calibration performance, QA benchmarks

163. ❌ Mixture of Chapters: Scaling Learnt Memory in Transformers

Authors: Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda, Kunal Singh, Pradeep Moturi Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21096v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	15.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

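Scoring rationale: The paper's core innovation is a learnable sparse memory bank architecture, inspired by Mixture-of-Experts (MoE), that strengthens the Transformer's capacity to store knowledge, so it is highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (15 points). The method is evaluated on pre-training and instruction fine-tuning, which relates directly to 'Large Language Models OR LLMs OR Foundation Models', 'Pre-training OR Continual Pre-training OR Domain Adaptation', and 'Post-training OR Supervised Fine-tuning OR SFT' (10 points each). The paper explores scaling memory capacity without increasing compute cost, giving it some relevance to 'Scaling Laws AND Data Quality' (5 points). It mentions instruction fine-tuning, a weak link to 'Instruction Tuning OR Alignment OR Value Alignment' (5 points). Other keywords such as SLMs, RAG, quantization, inference acceleration, and AI for Science are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

To address the Transformer's lack of an explicit knowledge storage mechanism, the paper proposes an MoE-inspired learnable sparse memory bank that scales memory through chapter routing. Experiments show it outperforms iso-FLOP baselines in pre-training and instruction fine-tuning and improves knowledge retention and resistance to forgetting.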

Abstract

Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines suggesting scope for a new axis of scaling, demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine-tuning).
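
The chapter-routing idea described in the abstract can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation. The dot-product router, the use of each memory token as both key and value, the random (untrained) memory, and all sizes and names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_and_read(query, chapters, router_keys, top_k=2):
    """Pick top_k chapters for this query, then attend over their tokens only."""
    # Router: one score per chapter (HYPOTHETICAL dot-product router).
    chapter_scores = [dot(query, key) for key in router_keys]
    chosen = sorted(range(len(chapters)),
                    key=lambda c: chapter_scores[c], reverse=True)[:top_k]
    # Cross-attention restricted to the memory tokens of the chosen chapters.
    # Simplification: each token acts as both key and value.
    tokens = [tok for c in chosen for tok in chapters[c]]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, tok) / scale for tok in tokens])
    read = [sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(len(query))]
    return chosen, read


random.seed(0)
dim, n_chapters, tokens_per_chapter = 4, 8, 16
chapters = [[[random.gauss(0, 1) for _ in range(dim)]
             for _ in range(tokens_per_chapter)] for _ in range(n_chapters)]
router_keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_chapters)]
query = [random.gauss(0, 1) for _ in range(dim)]

chosen, read = route_and_read(query, chapters, router_keys, top_k=2)
print("chapters attended:", chosen)
print("tokens scored:", 2 * tokens_per_chapter, "of", n_chapters * tokens_per_chapter)
```

The point of the sketch is the cost profile: attention is computed over the tokens of the selected chapters only, so the total memory size can grow without a matching growth in per-query attention work.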

Keywords: Transformers, learnable sparse memory banks, Mixture-of-Experts, chapter-based routing, scaling memory capacity, pre-training, instruction fine-tuning, knowledge retention

164. ❌ Evaluating Reasoning-Based Scaffolds for Human-AI Co-Annotation: The ReasonAlign Annotation Protocol

Authors: Smitha Muthya Sudheendra, Jaideep Srivastava Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21094v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

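Scoring rationale: The paper's core topic is the use of LLMs in human annotation tasks, in particular how LLM-generated reasoning explanations affect annotator behavior. Highly relevant keywords: LLMs (the core tool), Chain of Thought / System 2 Thinking (the paper studies reasoning explanations), and Explainable AI (explanation mechanisms). Moderately relevant: Self-Correction (label revision is involved). The remaining keywords are unrelated to the paper's technical details, application area, or methods.

!!! tip deepseek-chat TL;DR

The paper studies how exposing LLM-generated reasoning explanations, rather than predicted labels, affects annotator behavior in human-AI co-annotation, and finds that the explanations improve annotation agreement without triggering large-scale revisions.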

Abstract

Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. We introduce ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. We frame this as a controlled study of how reasoning affects human annotation behavior, rather than a full evaluation of annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for supporting human-AI annotation workflows.
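
The Annotator Effort Proxy is described as the proportion of labels revised after exposure to reasoning, which is simple to compute. The Python sketch below follows that description. The raw percent-agreement function is an assumed stand-in, since the abstract does not name the agreement statistic used, and the labels are invented.

```python
def annotator_effort_proxy(first_pass, second_pass):
    """Share of labels that changed between the two annotation passes."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must label the same items")
    revised = sum(1 for a, b in zip(first_pass, second_pass) if a != b)
    return revised / len(first_pass)


def percent_agreement(labels_a, labels_b):
    """Raw agreement between two annotators (ASSUMED statistic, for illustration)."""
    same = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return same / len(labels_a)


# Toy labels, invented for illustration.
ann1_pass1 = ["pos", "neg", "neg", "pos", "neu"]
ann1_pass2 = ["pos", "neg", "pos", "pos", "neu"]  # one label revised
ann2_pass1 = ["pos", "neg", "pos", "neg", "neu"]
ann2_pass2 = ["pos", "neg", "pos", "neg", "neu"]  # nothing revised

print("AEP, annotator 1:", annotator_effort_proxy(ann1_pass1, ann1_pass2))
print("agreement before:", percent_agreement(ann1_pass1, ann2_pass1))
print("agreement after: ", percent_agreement(ann1_pass2, ann2_pass2))
```

In this toy case a single revision raises agreement, which mirrors the pattern the abstract reports: higher agreement alongside minimal revision.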

Keywords: Human-AI co-annotation, Large language models, Reasoning explanations, Annotation consistency, Inter-annotator agreement, Annotation scaffold, Delphi-style revision, Annotator behavior

165. ❌ ViCLSR: A Supervised Contrastive Learning Framework with Natural Language Inference for Natural Language Understanding Tasks

Authors: Tin Van Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21084v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

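Scoring rationale: The paper focuses on Vietnamese natural language understanding tasks. It proposes ViCLSR, a supervised contrastive learning framework, and uses natural language inference datasets to optimize sentence representations. Although it involves pre-trained models (such as PhoBERT) and contrastive learning, every keyword targets large models (LLMs) and related techniques (MoE, RLHF, RAG, agents, and so on), specific techniques (quantization, attention optimization), or AI-for-science applications. The paper is about sentence representation learning and low-resource language processing, and it involves neither new large-model techniques nor applications of large models to other domains, so it is unrelated to all keywords.

!!! tip deepseek-chat TL;DR

To address data scarcity in Vietnamese natural language understanding, the paper proposes ViCLSR, a supervised contrastive learning framework that optimizes sentence representations with natural language inference datasets and significantly outperforms existing pre-trained models on several benchmark datasets.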

Abstract

High-quality text representations are crucial for natural language understanding (NLU), but low-resource languages like Vietnamese face challenges due to limited annotated data. While pre-trained models like PhoBERT and CafeBERT perform well, their effectiveness is constrained by data scarcity. Contrastive learning (CL) has recently emerged as a promising approach for improving sentence representations, enabling models to effectively distinguish between semantically similar and dissimilar sentences. We propose ViCLSR (Vietnamese Contrastive Learning for Sentence Representations), a novel supervised contrastive learning framework specifically designed to optimize sentence embeddings for Vietnamese, leveraging existing natural language inference (NLI) datasets. Additionally, we propose a process to adapt existing Vietnamese datasets for supervised learning, ensuring compatibility with CL methods. Our experiments demonstrate that ViCLSR significantly outperforms the powerful monolingual pre-trained model PhoBERT on five benchmark NLU datasets such as ViNLI (+6.97% F1), ViWikiFC (+4.97% F1), ViFactCheck (+9.02% F1), UIT-ViCTSD (+5.36% F1), and ViMMRC2.0 (+4.33% Accuracy). ViCLSR shows that supervised contrastive learning can effectively address resource limitations in Vietnamese NLU tasks and improve sentence representation learning for low-resource languages. Furthermore, we conduct an in-depth analysis of the experimental results to uncover the factors contributing to the superior performance of contrastive learning models. ViCLSR is released for research purposes in advancing natural language processing tasks.

Keywords: Vietnamese, contrastive learning, sentence representations, natural language inference, low-resource languages, supervised learning, NLU tasks, ViCLSR

166. ❌ Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation

Authors: Tianle Yang, Chengzhe Sun, Phil Rose, Cassandra L. Jacobs, Siwei Lyu Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21078v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

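Scoring rationale: The paper studies how well neural TTS systems model consonant-induced F0 perturbation, which places it in speech synthesis. All scoring keywords target large models, the principles of deep learning techniques, and their applications in various domains, whereas this paper evaluates conventional TTS models (Tacotron 2 and FastSpeech 2) and touches none of the scoring keywords such as large models, LLMs, MoE, quantization, inference acceleration, alignment, or RAG. Its content is unrelated to the keyword list, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The study proposes a segmental-level prosodic probing framework to evaluate how well neural TTS systems reproduce consonant-induced F0 perturbation. It finds accurate reproduction for high-frequency words but poor generalization to low-frequency words, suggesting reliance on lexical-level memorization rather than abstract segmental-prosodic encoding.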

Abstract

This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models’ ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems’ ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.

Keywords: neural TTS, consonant-induced f0 perturbation, prosodic probing, segmental-prosodic effect, Tacotron 2, FastSpeech 2, lexical frequency, synthetic speech evaluation

167. ❌ SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing

Authors: Jianyi Chen, Rongxiu Zhong, Shilei Zhang, Kun Qian, Jinglei Liu, Yike Guo, Wei Xue Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21073v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

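Scoring rationale: The paper focuses on music generation and proposes a temporal speed-up technique for long audio generation, using diffusion models for generation and refinement. All scoring keywords concern large language models (LLMs), the principles of deep learning techniques, AI for Science, and similar areas, whereas this paper is an application of generative AI to audio and has no direct connection to the technical topics in the keyword list (LLMs, MoE, scaling laws, alignment, reasoning, agents, and so on). All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper addresses the high memory and compute cost that long audio representations impose on long-form music generation. It proposes a simple trick, generating sped-up audio first and then restoring the original speed, and its experiments show efficient, scalable, high-quality long-form music generation.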

Abstract

Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.
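
The resource argument behind the speed-up trick is plain arithmetic: playing audio N times faster divides the number of frames the generator must handle by N. The Python sketch below illustrates this. The song duration and the latent frame rate are hypothetical values chosen for the example, not figures from the paper.

```python
def latent_frames(duration_s: float, frames_per_second: float, speed_up: int) -> int:
    """Frames needed to represent a clip after playing it `speed_up` times faster."""
    return int(duration_s * frames_per_second / speed_up)


DURATION_S = 240.0        # hypothetical four-minute song
FRAMES_PER_SECOND = 50.0  # hypothetical latent frame rate, not taken from the paper

for rate in (1, 2, 4, 8):
    frames = latent_frames(DURATION_S, FRAMES_PER_SECOND, rate)
    print(f"{rate}x: {frames} frames")
```

Under these assumed numbers, the 8x version needs one eighth of the frames of the original, and the restored-speed refinement stage then recovers the full temporal resolution.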

Keywords: long-form music generation, temporal speed-up, diffusion models, audio generation, computational efficiency, hierarchical generation, music composition, SqueezeComposer

168. ❌ LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Authors: Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao, Zhengyu Chen, Zhimin Lin, Wei Wang, Peng Pei, Xunliang Cai Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21065v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

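Scoring rationale: The paper's core contribution is a 560B-parameter MoE model (LongCat-Flash-Prover) focused on formal reasoning (mathematical theorem proving), which falls under AI for Science. Highly relevant keywords: MoE (model architecture), LLMs (foundation model), RLHF (reinforcement learning optimization with the HisPO algorithm), Agentic Workflow (agentic tool-integrated reasoning), Tool Use (tool integration), CoT Reasoning (multi-step reasoning decomposed into auto-formalization, sketching, and proving), and System 2 Thinking (in-depth formal reasoning). It is also highly relevant to AI for Science because of its focus on mathematical theorem proving. Scaling Laws, Pre-training, SFT, Alignment, Self-Correction, Multi-agent Systems, Hallucination Mitigation, Interpretability, and In-context Learning are somewhat related but not central. The remaining keywords, such as SLMs, PEFT, RAG, Context Extension, KV Compression, MCTS, Quantization, Speculative Decoding, World Models, and Model Merging, are unrelated.

!!! tip deepseek-chat TL;DR

The paper presents LongCat-Flash-Prover, a 560B-parameter MoE model that advances native formal reasoning in Lean4 through agentic tool-integrated reinforcement learning (the HisPO algorithm) and achieves state-of-the-art performance on auto-formalization and theorem-proving benchmarks.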

Abstract

We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using only 72 inference budget per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.
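
The three capabilities named in the abstract (auto-formalization, sketching, proving) can be illustrated with a toy Lean 4 example. This is a hand-written sketch of the general workflow, not output from the model or an example from the paper. It assumes a recent Lean 4 toolchain in which the `omega` tactic is available, and the theorem names are hypothetical.

```lean
-- Informal problem: "For every natural number n, n + n is twice some number."

-- (1) Auto-formalization produces the statement below.
-- (2) Sketching: a lemma-style sketch that leaves the intermediate step open.
theorem double_is_even_sketch (n : Nat) : ∃ k, n + n = 2 * k := by
  have h : n + n = 2 * n := by sorry
  exact ⟨n, h⟩

-- (3) Proving: the open step is discharged, giving a complete proof.
theorem double_is_even (n : Nat) : ∃ k, n + n = 2 * k := by
  have h : n + n = 2 * n := by omega
  exact ⟨n, h⟩
```

The sketch version compiles with a `sorry` warning, which is what makes it checkable as a plan before every lemma has been proved.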

Keywords: Mixture of Experts, Large Language Models, Formal Reasoning, Reinforcement Learning, Agentic Tool-Integrated Reasoning, Theorem Proving, AI for Science, Hierarchical Importance Sampling Policy Optimization

169. ❌ Reading Between the Lines: How Electronic Nonverbal Cues shape Emotion Decoding

Authors: Taara Kumar, Kokil Jaidka Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21038v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

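Scoring rationale: The paper studies how electronic nonverbal cues in computer-mediated communication affect emotion decoding, which places it in computational social science and affective computing. It mainly involves text analysis, emotion recognition, and user behavior research, and does not touch the technical directions the keywords represent, such as large models, the principles of deep learning techniques, training optimization, inference acceleration, or AI agents. All keywords relate directly to large-model techniques, deep learning innovation, or scientific applications of AI, whereas this paper is purely a study of textual behavior, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

The paper studies how electronic nonverbal cues affect users' emotion decoding accuracy in text-based computer-mediated communication. Through three complementary studies it builds a taxonomy of these cues, provides causal evidence, and reveals users' interpretive strategies.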

Abstract

As text-based computer-mediated communication (CMC) increasingly structures everyday interaction, a central question re-emerges with new urgency: How do users reconstruct nonverbal expression in environments where embodied cues are absent? This paper provides a systematic, theory-driven account of electronic nonverbal cues (eNVCs) - textual analogues of kinesics, vocalics, and paralinguistics - in public microblog communication. Across three complementary studies, we advance conceptual, empirical, and methodological contributions. Study 1 develops a unified taxonomy of eNVCs grounded in foundational nonverbal communication theory and introduces a scalable Python toolkit for their automated detection. Study 2, a within-subject survey experiment, offers controlled causal evidence that eNVCs substantially improve emotional decoding accuracy and lower perceived ambiguity, while also identifying boundary conditions, such as sarcasm, under which these benefits weaken or disappear. Study 3, through focus group discussions, reveals the interpretive strategies users employ when reasoning about digital prosody, including drawing meaning from the absence of expected cues and defaulting toward negative interpretations in ambiguous contexts. Together, these studies establish eNVCs as a coherent and measurable class of digital behaviors, refine theoretical accounts of cue richness and interpretive effort, and provide practical tools for affective computing, user modeling, and emotion-aware interface design. The eNVC detection toolkit is available as a Python and R package at https://github.com/kokiljaidka/envc.

Keywords: electronic nonverbal cues, emotion decoding, computer-mediated communication, affective computing, user modeling, textual analysis, microblog communication, interpretive strategies

170. ❌ Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

Authors: Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21016v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

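Scoring rationale: The paper addresses selection bias of large language models (LLMs) in multiple-choice and pairwise evaluation tasks and proposes PA-GRPO. The core relevant keywords are 'Large Language Models OR LLMs OR Foundation Models' (highly relevant: the paper studies LLMs directly) and 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (highly relevant: PA-GRPO builds on GRPO, a reinforcement-learning policy-optimization method in the RLHF family). The remaining keywords, including MoE, SLMs, Scaling Laws, Pre-training, RAG, CoT, Agents, Quantization, and AI for Science, are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper targets selection bias in large language models on multiple-choice and pairwise evaluation tasks, where non-semantic factors such as option position and label symbols skew the choice. It proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which combines a cross-permutation advantage with a consistency-aware reward and substantially reduces selection bias on seven benchmarks while maintaining high performance.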

Abstract

Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on Github (https://github.com/ECNU-Text-Computing/PA-GRPO).
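
The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes one sampled answer per permutation, a binary correctness reward, a consistency bonus proportional to agreement with the majority answer, and GRPO-style standard-deviation normalization. The function names and the `bonus` parameter are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def consistency_aware_rewards(answers, gold, bonus=0.5):
    """Reward each permutation's answer for correctness plus agreement.

    `answers` holds the option *content* chosen under each permutation,
    i.e. labels are assumed to be mapped back to their semantic option.
    The bonus scheme is an illustrative assumption, not the paper's.
    """
    majority, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    rewards = []
    for answer in answers:
        correct = 1.0 if answer == gold else 0.0
        consistent = bonus * agreement if answer == majority else 0.0
        rewards.append(correct + consistent)
    return rewards


def cross_permutation_advantages(rewards, eps=1e-8):
    """Advantage relative to the mean reward over all permutations
    of the same instance (std normalization is an assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four permutations of one question; the model flips its answer once.
answers = ["Paris", "Paris", "Lyon", "Paris"]
rewards = consistency_aware_rewards(answers, gold="Paris")
advantages = cross_permutation_advantages(rewards)
```

In this toy case the answer that changes under reordering receives a negative advantage, while the permutation-consistent correct answers receive a positive one.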

Keywords: Large Language Models, Selection Bias, Permutation-Aware, Group Relative Policy Optimization, PA-GRPO, Multiple-choice Evaluation, Pairwise Evaluation, Debiasing

171. ❌ CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs

Authors: Florent Draye, Abir Harrasse, Vedant Palit, Tung-Yu Wu, Jiarui Liu, Punya Syon Pandey, Roderick Wu, Terry Jingchen Zhang, Zhijing Jin, Bernhard Schölkopf Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21014v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	15.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

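Scoring rationale: The paper focuses on mechanistic interpretability of large language models (LLMs) and develops the CLT-Forge library for training and analyzing cross-layer transcoders (CLTs), which yield more compact feature attribution graphs. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) and core-relevant to 'Mechanistic Interpretability OR Explainable AI' (15 points). The paper does not cover the other keywords, such as MoE, training techniques, inference optimization, agent systems, or specific scientific applications, so they score 0.

!!! tip deepseek-chat TL;DR

Feature attribution graphs in LLM mechanistic interpretability tend to be large and redundant. The paper presents CLT-Forge, an open-source library for scalable training and analysis of cross-layer transcoders, which produce more compact representations and improve interpretability.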

Abstract

Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross-Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer-specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open-source library for end-to-end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit-Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT-based mechanistic interpretability. Our code is available at: https://github.com/LLM-Interp/CLT-Forge.
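
The idea of sharing features across layers while keeping layer-specific decoding can be sketched in a few lines of plain Python. This is a toy forward pass, not the CLT-Forge API: it assumes each layer has its own encoder, that features read at a source layer write to that layer and every later one through a separate decoder per (source, target) pair, and that activations are a simple ReLU. All names and sizes are hypothetical.

```python
import random

random.seed(0)
N_LAYERS, D_MODEL, N_FEATURES = 3, 4, 6


def rand_matrix(rows, cols):
    """Small random matrix stored as a list of rows."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


# One encoder per layer; one decoder per (source, target) pair with source <= target.
encoders = [rand_matrix(D_MODEL, N_FEATURES) for _ in range(N_LAYERS)]
decoders = {(src, tgt): rand_matrix(N_FEATURES, D_MODEL)
            for src in range(N_LAYERS) for tgt in range(src, N_LAYERS)}


def clt_forward(residuals):
    """Return feature activations and reconstructed per-layer outputs."""
    # Feature activations: ReLU of the encoded residual stream at each layer.
    acts = [[max(0.0, a) for a in vec_mat(residuals[layer], encoders[layer])]
            for layer in range(N_LAYERS)]
    outputs = []
    for tgt in range(N_LAYERS):
        out = [0.0] * D_MODEL
        # Target layer `tgt` sums contributions from features of all layers <= tgt.
        for src in range(tgt + 1):
            contrib = vec_mat(acts[src], decoders[(src, tgt)])
            out = [o + c for o, c in zip(out, contrib)]
        outputs.append(out)
    return acts, outputs


residuals = [[random.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(N_LAYERS)]
acts, outputs = clt_forward(residuals)
```

With three layers this setup needs six decoders (3 + 2 + 1), which hints at why training and analysis at scale benefit from dedicated tooling.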

Keywords: Large Language Models, Mechanistic Interpretability, Cross-Layer Transcoders, Feature Attribution Graphs, Dictionary Learning, Interpretable Features, Scalable Training, Open-source Library

172. ❌ Structural Sensitivity in Compressed Transformers: Error Propagation, Lyapunov Stability, and Formally Verified Bounds

Authors: Abhinaba Basu Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.20991v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

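Scoring rationale: The paper's core is a sensitivity analysis of Transformer compression (on models such as GPT-2 Small). It is highly relevant to 'Large Language Models' (10 points) because the models studied are LLM architectures such as GPT-2, and highly relevant to 'Quantization OR Model Compression' (10 points) because it is dedicated to model compression (including activation-aware pruning and a Compression Fragility Index). Other keywords such as MoE, SLMs, training methods, inference acceleration, and AI for Science are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how sensitive Transformers are to compression and finds that compression tolerance differs across matrices by five orders of magnitude. It explains error propagation with Lyapunov stability theory and contributes formally verified error bounds and a Compression Fragility Index.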

Abstract

A single matrix out of 468 in GPT-2 Small can increase perplexity by 20,000x when compressed, revealing that transformer compression sensitivity spans five orders of magnitude. We map this sensitivity landscape across five architectures (117M-8B parameters), finding a consistent hierarchy: early-layer MLP up-projections are catastrophically sensitive while value projections compress nearly for free. This hierarchy is stable across compression levels, evaluation scales (2K-51K tokens), and datasets (WikiText-103, C4). Using Lyapunov stability theory, we show that residual connections contract compression errors by growing the hidden state faster than the error. Error contraction is necessary but not sufficient for compression tolerance: architecture-specific redundancy plays an equally important role, as demonstrated by the hybrid LFM2-2.6B degrading only 7x despite higher amplification than the fully-contracting GPT-2 Small (120x). Ten machine-checked Lean 4 theorems formalize per-matrix error bounds with no sorry markers; all bounds produce zero violations across 14,040+ configurations. We validate with downstream task evaluation (HellaSwag, ARC-Easy, Winogrande), activation-aware pruning on two architectures, and a Compression Fragility Index that rank-orders model robustness.
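
To make the "20,000x" figure concrete, the following sketch computes perplexity as the exponential of the mean per-token negative log-likelihood and expresses sensitivity as a perplexity ratio. The function names and the sample values are illustrative assumptions; the paper's exact evaluation protocol is not described in the abstract.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def sensitivity_ratio(baseline_nlls, compressed_nlls):
    """How many times perplexity grows after compressing one matrix."""
    return perplexity(compressed_nlls) / perplexity(baseline_nlls)


# Hypothetical per-token losses for an uncompressed model.
baseline = [3.0, 3.2, 2.8]

# A 20,000x perplexity increase corresponds to a mean loss increase
# of ln(20000), roughly 9.9 nats per token.
shift = math.log(20000)
catastrophic = [nll + shift for nll in baseline]

ratio = sensitivity_ratio(baseline, catastrophic)
orders_of_magnitude = math.log10(ratio)
```

Because the ratio depends only on the change in mean loss, matrices whose compression barely moves the loss sit near a ratio of 1, while the most sensitive ones sit several orders of magnitude higher.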

Keywords: Transformer compression, GPT-2, Lyapunov stability, error propagation, model sensitivity, pruning, formal verification, Compression Fragility Index

173. ❌ DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles

Authors: Bo Jiang Journal/Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20975v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

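Scoring rationale: The paper's core is uncertainty quantification in multi-agent LLM systems, which directly involves 'LLM Agents' and 'Multi-agent Systems' (highly relevant, 10 points). It targets complex reasoning tasks, so it has some connection to 'Chain of Thought' and 'System 2 Thinking' (5 points). Uncertainty quantification bears on factuality and explainability, so 'Factuality' and 'Explainable AI' receive 5 points. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes DiscoUQ, a framework that quantifies the uncertainty of a multi-agent LLM system's collective output by analyzing the structure of disagreement among agents, covering both linguistic properties and embedding geometry. It achieves better calibration than existing methods on several benchmarks.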

Abstract

Multi-agent LLM systems, where multiple prompted instances of a language model independently answer questions, are increasingly used for complex reasoning tasks. However, existing methods for quantifying the uncertainty of their collective outputs rely on shallow voting statistics that discard the rich semantic information in agents’ reasoning. We introduce DiscoUQ, a framework that extracts and leverages the structure of inter-agent disagreement – both linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion) – to produce well-calibrated confidence estimates. We propose three methods of increasing complexity: DiscoUQ-LLM (logistic regression on LLM-extracted structure features), DiscoUQ-Embed (logistic regression on embedding geometry), and DiscoUQ-Learn (a neural network combining all features). Evaluated on four diverse benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent system using Qwen3.5-27B, DiscoUQ-LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near-zero performance degradation and provide the largest improvements where they are most needed: in the ambiguous “weak disagreement” tier where simple vote counting fails.
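
The calibration metric quoted in the abstract (ECE) can be sketched as follows. This is the standard expected calibration error with equal-width confidence bins; the bin count and the function name are assumptions, since the abstract does not state the paper's binning scheme.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four ensemble answers reported at 90% confidence, three of them right.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

Here the system claims 90% confidence but is right 75% of the time, so the gap is 0.15; a lower ECE means stated confidence tracks actual accuracy more closely.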

Keywords: LLM agents, multi-agent systems, uncertainty quantification, disagreement analysis, confidence calibration, structured disagreement, ensemble methods, reasoning tasks

174. ❌ Understanding Contextual Recall in Transformers: How Finetuning Enables In-Context Reasoning over Pretraining Knowledge

Authors: Bhavya Vasudeva, Puneesh Deora, Alberto Bietti, Vatsal Sharan, Christos Thrampoulidis Journal/Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20969v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

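Scoring rationale: The paper's core is the in-context learning mechanism of Transformer language models, specifically how finetuning enables contextual recall, that is, in-context reasoning over knowledge acquired during pretraining. Highly relevant keywords: LLMs (the object of study), Pre-training (the role of pretraining knowledge), SFT (the role of finetuning), Mechanistic Interpretability (the mechanistic explanation), and In-context Learning (the central topic). Moderately relevant: CoT Reasoning and System 2 Thinking (reasoning mechanisms are involved but are not central). The remaining keywords are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper studies how finetuning gives Transformer language models in-context reasoning ability. Pretraining yields factual knowledge but is not sufficient for contextual recall; finetuning on specific tasks triggers contextual recall across subjects and is accompanied by the formation of low-dimensional latent encodings.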

Abstract

Transformer-based language models excel at in-context learning (ICL), where they can adapt to new tasks based on contextual examples, without parameter updates. In a specific form of ICL, which we refer to as contextual recall, models pretrained on open-ended text leverage pairwise examples to recall specific facts in novel prompt formats. We investigate whether contextual recall emerges from pretraining alone, what finetuning is required, and what mechanisms drive the necessary representations. For this, we introduce a controlled synthetic framework where pretraining sequences consist of subject-grammar-attribute tuples, with attribute types tied to grammar statistics. We demonstrate that while such pretraining successfully yields factual knowledge, it is insufficient for contextual recall: models fail to implicitly infer attribute types when the grammar statistics are removed in ICL prompts. However, we show that finetuning on tasks requiring implicit inference, distinct from the ICL evaluation, using a subset of subjects, triggers the emergence of contextual recall across all subjects. This transition is accompanied by the formation of low-dimensional latent encodings of the shared attribute type. For mechanistic insight, we derive a construction for an attention-only transformer that replicates the transition from factual to contextual recall, corroborated by empirical validation.

Keywords: Transformer, in-context learning, contextual recall, pretraining, finetuning, mechanistic interpretability, attention-only transformer, latent encodings

175. ❌ Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

Authors: Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, Tuhin Chakrabarty Journal/Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20957v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

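Scoring rationale: The paper's core finding is that finetuning large language models (LLMs) activates copyrighted book content memorized during pretraining, so it is highly relevant to LLMs, finetuning (SFT), and alignment (10 points). It also touches RLHF as an existing safety measure (5 points), memorization formed during pretraining (5 points), and questions of factuality/truthfulness (hallucination mitigation) and interpretability (mechanistic interpretability) (5 points). Other keywords such as MoE, SLMs, RAG, reasoning techniques, compression, and agents are not covered (0 points).

!!! tip deepseek-chat TL;DR

The study finds that finetuning large language models (such as GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1) activates copyrighted book content memorized during pretraining, so that the models reproduce up to 85-90% of held-out copyrighted books from semantic descriptions alone. This exposes an industry-wide vulnerability and challenges a premise of existing legal defenses in copyright cases.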

Abstract

Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami’s novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors’ works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions ($r \ge 0.90$), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors’ works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
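
One plausible way to measure a "verbatim span" in words, as mentioned in the abstract, is the longest contiguous word sequence shared by a model output and the reference text. The sketch below uses Python's standard `difflib` for this; it is an illustrative assumption and not necessarily the extraction metric the authors used. The function name and sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def longest_verbatim_span(generated, reference):
    """Longest run of consecutive words appearing in both texts."""
    gen_words = generated.split()
    ref_words = reference.split()
    # autojunk=False keeps frequent words from being ignored on long inputs.
    matcher = SequenceMatcher(None, gen_words, ref_words, autojunk=False)
    match = matcher.find_longest_match(0, len(gen_words), 0, len(ref_words))
    return gen_words[match.a:match.a + match.size]


generated = "he said the quick brown fox jumps over it"
reference = "once the quick brown fox jumps away"
span = longest_verbatim_span(generated, reference)
```

Whitespace tokenization is the simplest choice here; a real evaluation would likely normalize punctuation and casing before matching.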

Keywords: Large Language Models, Fine-tuning, Copyright Infringement, Data Memorization, Verbatim Recall, RLHF, Model Security, Fair Use

176. ❌ The Hidden Puppet Master: A Theoretical and Real-World Account of Emotional Manipulation in LLMs

Authors: Jocelyn Shen, Amina Luvsanchultem, Jessica Kim, Kynnedy Smith, Valdemar Danry, Kantwon Rogers, Sharifa Alghowinem, Hae Won Park, Maarten Sap, Cynthia Breazeal Journal/Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20907v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

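Scoring rationale: The paper's core is emotional manipulation by LLMs in dialogue. It focuses on the moral dimension of hidden incentives (closely tied to value alignment) and builds the theoretical taxonomy PUPPET for empirical study. It is therefore highly relevant to "Large Language Models" and "Instruction Tuning OR Alignment OR Value Alignment" (10 points), since it directly studies manipulative behavior by LLMs and the associated moral-alignment problem. The other keywords concern model architecture, training techniques, reasoning methods, and application domains that the paper does not cover, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper studies emotional manipulation driven by hidden incentives in realistic LLM conversations. It builds the theoretical taxonomy PUPPET, shows in a human study that harmful incentives produce larger belief shifts than prosocial ones, and finds that LLMs predict belief change moderately well but underestimate its magnitude.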

Abstract

As users increasingly turn to LLMs for practical and personal advice, they become vulnerable to being subtly steered toward hidden incentives misaligned with their own interests. Prior works have benchmarked persuasion and manipulation detection, but these efforts rely on simulated or debate-style settings, remain uncorrelated with real human belief shifts, and overlook a critical dimension: the morality of hidden incentives driving the manipulation. We introduce PUPPET, a theoretical taxonomy of personalized emotional manipulation in LLM-human dialogues that centers around incentive morality, and conduct a human study with N=1,035 participants across realistic everyday queries, varying personalization and incentive direction (harmful versus prosocial). We find that harmful hidden incentives produce significantly larger belief shifts than prosocial ones. Finally, we benchmark LLMs on the task of belief prediction, finding that models exhibit moderate predictive ability of belief change based on conversational contexts (r=0.3 - 0.5), but they also systematically underestimate the magnitude of belief shift. Together, this work establishes a theoretically grounded and behaviorally validated foundation for studying, and ultimately combatting, incentive-driven manipulation in LLMs during everyday, practical user queries.

Keywords: LLMs, emotional manipulation, hidden incentives, value alignment, belief shifts, human study, PUPPET taxonomy, moral incentives

177. ❌ Mitigating Shortcut Reasoning in Language Models: A Gradient-Aware Training Approach

Authors: Hongyu Cao, Kunpeng Liu, Dongjie Wang, Yanjie Fu Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20899v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

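Rationale: The paper directly studies the reasoning ability of large language models (LLMs) and proposes the SART training framework to mitigate shortcut reasoning, so it is highly relevant to "Large Language Models" (10 points). The method belongs to supervised fine-tuning (SFT) and improves reasoning by modifying training dynamics, so it is highly relevant to "Post-training" (10 points). Its core goal is stronger logical reasoning, which makes it highly relevant to "Chain of Thought" and "System 2 Thinking" (10 points each). Mitigating shortcut reasoning can indirectly improve factuality, giving some relevance to "Hallucination Mitigation" (8 points). The gradient analysis and shortcut detection involve explaining model behavior, giving some relevance to "Mechanistic Interpretability" (5 points). Other keywords, such as MoE, quantization, RAG, and AI for Science, do not appear in the abstract and score 0.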

!!! tip deepseek-chat TL;DR

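The paper targets the tendency of large language models to rely on shortcuts such as surface pattern matching instead of genuine logical reasoning. It proposes Shortcut-Aware Reasoning Training (SART), which uses gradient-aware training and shortcut detection to improve generalization under distribution shift, achieving +16.5% accuracy and +40.2% robustness over the strongest baseline on controlled reasoning benchmarks.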

Abstract

Large language models exhibit strong reasoning capabilities, yet often rely on shortcuts such as surface pattern matching and answer memorization rather than genuine logical inference. We propose Shortcut-Aware Reasoning Training (SART), a gradient-aware framework that detects and mitigates shortcut-promoting samples via ShortcutScore and gradient surgery. Our method identifies shortcut signals through gradient misalignment with validation objectives and answer-token concentration, and modifies training dynamics accordingly. Experiments on controlled reasoning benchmarks show that SART achieves +16.5% accuracy and +40.2% robustness over the strongest baseline, significantly improving generalization under distribution shifts. Code is available at: https://github.com/fuyanjie/short-cut-aware-data-centric-reasoning.
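
The abstract does not say how ShortcutScore or the gradient surgery step is computed. As a rough illustration only, the sketch below assumes a PCGrad-style projection: when a sample's gradient opposes a validation gradient (negative dot product), the opposing component is removed. The function names, the cosine-based misalignment score, and the toy vectors are hypothetical and are not taken from the paper or its repository.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def misalignment(sample_grad, val_grad):
    """Hypothetical score in [0, 1]: 0 means aligned with the validation
    gradient, 1 means directly opposed. Not the paper's ShortcutScore."""
    norm = math.sqrt(dot(sample_grad, sample_grad) * dot(val_grad, val_grad))
    if norm == 0.0:
        return 0.0
    cosine = dot(sample_grad, val_grad) / norm
    return (1.0 - cosine) / 2.0


def project_out_conflict(sample_grad, val_grad):
    """PCGrad-style surgery (an assumption): if the sample gradient opposes
    the validation gradient, subtract its projection onto val_grad."""
    overlap = dot(sample_grad, val_grad)
    if overlap >= 0.0:
        return list(sample_grad)  # no conflict, keep the gradient unchanged
    scale = overlap / dot(val_grad, val_grad)
    return [g - scale * v for g, v in zip(sample_grad, val_grad)]


# Toy example: the sample gradient points partly against the validation one.
val_grad = [1.0, 0.0]
conflicting = [-1.0, 1.0]
fixed = project_out_conflict(conflicting, val_grad)
score = misalignment(conflicting, val_grad)
```

After the projection, `fixed` is orthogonal to `val_grad`, so the update no longer pushes against the validation objective while keeping the rest of the sample's signal.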

Keywords: Large Language Models, Reasoning, Shortcut Learning, Gradient-Aware Training, Generalization, Distribution Shift, SART, Logical Inference

178. ❌ LLM Router: Prefill is All You Need

Authors: Tanay Varshney, Annie Surla, Michelle Xu, Gomathy Venkata Krishnan, Maximilian Jeblick, David Austin, Neal Vaidya, Davide Onofrio Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20895v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

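Rationale: The paper's core topic is LLM routing: it predicts model performance from prefill activations via Encoder-Target Decoupling. It is highly relevant only to "Large Language Models" (10 points); none of the other keywords are addressed.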

!!! tip deepseek-chat TL;DR

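The paper proposes an LLM routing mechanism based on prefill activations, using Encoder-Target Decoupling and mathematical probes to optimize model selection. It captures up to 45.58% of the accuracy gap between the strongest standalone model and the Oracle, with 74.31% cost savings relative to the highest-cost model.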

Abstract

LLMs often share comparable benchmark accuracies, but their complementary performance across task subsets suggests that an Oracle router–a theoretical selector with perfect foresight–can significantly surpass standalone model accuracy by navigating model-specific strengths. While current routers rely on fragile semantic signals, we propose using internal prefill activations via Encoder-Target Decoupling–a functional separation between the model providing the predictive signal (the Encoder) and the model whose performance is being estimated (the Target). This allows optimized heterogeneous pairing between unique encoders and target models. We utilize Fisher Separability (J) and Effective Dimensionality (d_eff) as mathematical probes to isolate optimal layer-wise signals, providing the predictive foundation for our SharedTrunkNet architecture. SharedTrunkNet captures up to 45.58% of the accuracy gap between the strongest standalone model and the Oracle while achieving 74.31% cost savings relative to the highest-cost model.
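
The abstract names Fisher Separability (J) and Effective Dimensionality (d_eff) without giving formulas. The sketch below uses common textbook definitions, which may differ from the paper's: a one-dimensional Fisher criterion, J = (mean1 - mean0)^2 / (var0 + var1), and the participation ratio, d_eff = (sum of eigenvalues)^2 / (sum of squared eigenvalues). It also shows the arithmetic behind "fraction of the Oracle gap captured". All function names and numbers are illustrative assumptions.

```python
from statistics import mean, pvariance


def fisher_j(class0, class1):
    """1-D Fisher criterion (assumed definition): squared mean gap divided
    by the summed within-class variances. Undefined if both variances are 0."""
    spread = pvariance(class0) + pvariance(class1)
    return (mean(class1) - mean(class0)) ** 2 / spread


def effective_dim(eigenvalues):
    """Participation ratio (assumed definition of d_eff): equals n when all
    n eigenvalues are equal and 1 when a single eigenvalue dominates."""
    total = sum(eigenvalues)
    return total * total / sum(v * v for v in eigenvalues)


def gap_captured(router_acc, best_single_acc, oracle_acc):
    """Share of the (Oracle - best standalone) accuracy gap a router closes."""
    return (router_acc - best_single_acc) / (oracle_acc - best_single_acc)


# Toy probe values for prompts the target model answered wrong vs. right.
wrong = [0.0, 1.0, 2.0]
right = [4.0, 5.0, 6.0]
j = fisher_j(wrong, right)

flat = effective_dim([1.0, 1.0, 1.0, 1.0])   # variance spread over 4 axes
peaked = effective_dim([1.0, 0.0, 0.0])      # variance on a single axis
half = gap_captured(router_acc=0.75, best_single_acc=0.5, oracle_acc=1.0)
```

Under these definitions, a layer with high J and modest d_eff would be a natural candidate for a routing probe, though the paper's actual selection rule is not stated in the abstract.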

Keywords: LLM Router, Prefill Activations, Encoder-Target Decoupling, Fisher Separability, Effective Dimensionality, SharedTrunkNet, Model Routing, Performance Estimation

179. ❌ NoveltyAgent: Autonomous Novelty Reporting Agent with Point-wise Novelty Analysis and Self-Validation

Authors: Jiajun Hou, Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Xiaopeng Ke, Min Zhang Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20884v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	15.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

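Rationale: The paper builds NoveltyAgent, a multi-agent system for assessing the novelty of academic papers. The system is built on large language models, uses retrieval-augmented generation (RAG) for fine-grained retrieval and comparison, and includes a self-validation mechanism to ensure faithfulness. It is therefore highly relevant to "LLM Agents/Autonomous Agents" and "Multi-agent Systems" (core content, 15 points) and to "Retrieval-Augmented Generation" (core method, 10 points), and moderately relevant to "Large Language Models", "Self-Correction/Self-Improvement", and "Hallucination Mitigation/Factuality" (as the underlying technology, the validation mechanism, and the evaluation goal respectively, 8 points). As an application of AI to academic evaluation, it has some relevance to "AI for Science" (5 points). Other keywords, such as MoE, Scaling Laws, the various training and tuning techniques, inference acceleration, and model compression, are either not covered or only used implicitly rather than studied, so they score 0.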

!!! tip deepseek-chat TL;DR

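To address the rising cost of screening a rapidly growing volume of academic papers, the paper proposes NoveltyAgent, a multi-agent system that generates comprehensive and faithful novelty reports through point-wise novelty analysis and self-validation. In the reported experiments it outperforms GPT-5 DeepResearch by 10.15%.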

Abstract

The exponential growth of academic publications has led to a surge in papers of varying quality, increasing the cost of paper screening. Current approaches either use novelty assessment within general AI Reviewers or repurpose DeepResearch, which lacks domain-specific mechanisms and thus delivers lower-quality results. To bridge this gap, we introduce NoveltyAgent, a multi-agent system designed to generate comprehensive and faithful novelty reports, enabling thorough evaluation of a paper’s originality. It decomposes manuscripts into discrete novelty points for fine-grained retrieval and comparison, and builds a comprehensive related-paper database while cross-referencing claims to ensure faithfulness. Furthermore, to address the challenge of evaluating such open-ended generation tasks, we propose a checklist-based evaluation framework, providing an unbiased paradigm for building reliable evaluations. Extensive experiments show that NoveltyAgent achieves state-of-the-art performance, outperforming GPT-5 DeepResearch by 10.15%. We hope this system will provide reliable, high-quality novelty analysis and help researchers quickly identify novel papers. Code and demo are available at https://github.com/SStan1/NoveltyAgent.

Keywords: NoveltyAgent, multi-agent system, novelty reporting, point-wise novelty analysis, self-validation, retrieval-augmented generation, autonomous agents, academic paper screening

180. ❌ RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation

Authors: Kaustubh D. Dhole, Eugene Agichtein Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20882v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

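Rationale: The paper studies LLM evaluation methods, so it is highly relevant to "Large Language Models" (10 points). The proposed RubricRAG method directly uses retrieval-augmented generation (RAG), so it is highly relevant to "Retrieval-Augmented Generation" (10 points). Its focus on transparent and interpretable evaluation makes it highly relevant to "Mechanistic Interpretability" (10 points). The paper mentions few-shot and post-training strategies, giving some relevance to "Post-training" (5 points). It aims to improve the reliability and factuality of evaluation, giving some relevance to "Hallucination Mitigation" (5 points). The method involves few-shot learning, giving some relevance to "In-context Learning" (5 points). The other keywords have no direct connection to the paper and score 0.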

!!! tip deepseek-chat TL;DR

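The paper studies how to create interpretable, query-specific evaluation rubrics by retrieving domain knowledge, with the goal of making LLM evaluation more transparent and reliable. The proposed RubricRAG method produces rubrics that are closer to human-authored ones and more effective than those generated by off-the-shelf LLMs.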

Abstract

Large language models (LLMs) are increasingly evaluated and sometimes trained using automated graders such as LLM-as-judges that output scalar scores or preferences. While convenient, these approaches are often opaque: a single score rarely explains why an answer is good or bad, which requirements were missed, or how a system should be improved. This lack of interpretability limits their usefulness for model development, dataset curation, and high-stakes deployment. Query-specific rubric-based evaluation offers a more transparent alternative by decomposing quality into explicit, checkable criteria. However, manually designing high-quality, query-specific rubrics is labor-intensive and cognitively demanding and not feasible for deployment. While previous approaches have focused on generating intermediate rubrics for automated downstream evaluation, it is unclear if these rubrics are both interpretable and effective for human users. In this work, we investigate whether LLMs can generate useful, instance-specific rubrics as compared to human-authored rubrics, while also improving effectiveness for identifying good responses. Through our systematic study on two rubric benchmarks, and on multiple few-shot and post-training strategies, we find that off-the-shelf LLMs produce rubrics that are poorly aligned with human-authored ones. We introduce a simple strategy, RubricRAG, which retrieves domain knowledge via rubrics at inference time from related queries. We demonstrate that RubricRAG can generate more interpretable rubrics both for similarity to human-authored rubrics, and for improved downstream evaluation effectiveness. Our results highlight both the challenges and a promising approach of scalable, interpretable evaluation through automated rubric generation.
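
The abstract describes RubricRAG as retrieving rubrics from related queries at inference time but gives no implementation detail. A minimal sketch under strong assumptions: a tiny in-memory store of (query, rubric) pairs, Jaccard token overlap as a stand-in retriever, and a plain-text prompt that shows the retrieved rubrics as examples. Every name, the similarity measure, and the toy data are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap between two token sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_rubrics(query, store, k=2):
    """Return the rubrics of the k stored queries most similar to `query`."""
    query_tokens = tokens(query)
    ranked = sorted(
        store,
        key=lambda item: jaccard(query_tokens, tokens(item[0])),
        reverse=True,
    )
    return [rubric for _, rubric in ranked[:k]]


def build_prompt(query, examples):
    """Assemble a rubric-generation prompt that includes retrieved rubrics."""
    lines = ["Example rubrics from related queries:"]
    lines += [f"- {rubric}" for rubric in examples]
    lines.append(f"Write a rubric for: {query}")
    return "\n".join(lines)


# Toy store of (query, human-authored rubric) pairs; contents are made up.
STORE = [
    ("how to fix a flat bike tire at home", "Lists required tools; explains patching steps in order"),
    ("how to fix a squeaky bike brake at home", "Identifies likely causes; explains cleaning the pads"),
    ("best way to learn python", "Recommends practice projects; names at least one resource"),
]

new_query = "how to fix a loose bike chain at home"
top = retrieve_rubrics(new_query, STORE, k=2)
prompt = build_prompt(new_query, top)
```

A real system would likely swap the token-overlap retriever for dense embeddings and pass `prompt` to an LLM; the sketch stops at prompt construction so that it stays runnable without external services.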

Keywords: LLM evaluation, rubric generation, interpretability, retrieval-augmented generation, domain knowledge, automated grading, transparent evaluation, RubricRAG

181. ❌ Semantic Sections: An Atlas-Native Feature Ontology for Obstructed Representation Spaces

Authors: Hossein Javidnia Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20867v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

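Rationale: The paper is a mechanistic interpretability study of large language models (Llama 3.2 3B Instruct, Qwen 2.5 3B Instruct, Gemma 2 2B IT) and proposes a new feature ontology, semantic sections, to handle obstructions in representation spaces. It is therefore highly relevant to "Large Language Models", "Small Language Models", and "Mechanistic Interpretability" (10 points). Other keywords, such as MoE, training methods, inference acceleration, and alignment, are not addressed in the paper and score 0.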

!!! tip deepseek-chat TL;DR

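The paper addresses the limits of the conventional feature ontology in obstructed representation spaces of large language models. It proposes semantic sections as a replacement feature representation and shows experimentally that they recover semantic identity more accurately than raw global-vector similarity.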

Abstract

Recent interpretability work often treats a feature as a single global direction, dictionary atom, or latent coordinate shared across contexts. We argue that this ontology can fail in obstructed representation spaces, where locally coherent meanings need not assemble into one globally consistent feature. We introduce an atlas-native replacement object, the semantic section: a transport-compatible family of local feature representatives defined over a context atlas. We formalize semantic sections, prove that tree-supported propagation is always pathwise realizable, and show that cycle consistency is the key criterion for genuine globalization. This yields a distinction between tree-local, globalizable, and twisted sections, with twisted sections capturing locally coherent but holonomy-obstructed meanings. We then develop a discovery-and-certification pipeline based on seeded propagation, synchronization across overlaps, defect-based pruning, cycle-aware taxonomy, and deduplication. Across layer-16 atlases for Llama 3.2 3B Instruct, Qwen 2.5 3B Instruct, and Gemma 2 2B IT, we find nontrivial populations of semantic sections, including cycle-supported globalizable and twisted regimes after deduplication. Most importantly, semantic identity is not recovered by raw global-vector similarity. Even certified globalizable sections show low cross-chart signed cosine similarity, and raw similarity baselines recover only a small fraction of true within-section pairs, often collapsing at moderate thresholds. By contrast, section-based identity recovery is perfect on certified supports. These results support semantic sections as a better feature ontology in obstructed regimes.

Keywords: interpretability, feature ontology, semantic sections, obstructed representation spaces, Llama 3.2, Qwen 2.5, Gemma 2, mechanistic interpretability

182. ❌ SozKZ: Training Efficient Small Language Models for Kazakh from Scratch

Authors: Saken Tukenov Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20854v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

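Rationale: The paper studies pre-training small language models (SLMs) from scratch for a low-resource language (Kazakh). It is highly relevant to "Small Language Models" (10 points) and "Pre-training" (10 points), relevant to "Large Language Models" through its use of the Llama architecture (8 points), and somewhat relevant to "Scaling Laws" because performance improves as model size grows from 50M to 600M (5 points). Other keywords, such as MoE, SFT, RAG, and quantization, are not mentioned in the abstract and score 0.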

!!! tip deepseek-chat TL;DR

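The paper examines the feasibility of training small language models (50M-600M parameters) from scratch for Kazakh, a low-resource language. The results show that, with a language-appropriate tokenizer and dedicated training data, small models can reach performance competitive with larger multilingual models at lower computational cost.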

Abstract

Kazakh, a Turkic language spoken by over 22 million people, remains underserved by existing multilingual language models, which allocate minimal capacity to low-resource languages and employ tokenizers ill-suited to agglutinative morphology. We present SozKZ, a family of Llama-architecture language models (50M-600M parameters) trained entirely from scratch on 9 billion tokens of Kazakh text with a dedicated 50K BPE tokenizer. We evaluate all models on three Kazakh benchmarks – multiple-choice cultural QA, reading comprehension (Belebele), and topic classification (SIB-200) – alongside five multilingual baselines ranging from 500M to 3B parameters. Our 600M model achieves 30.3% accuracy on Kazakh cultural QA, approaching the 32.0% of Llama-3.2-1B (2x larger), and 25.5% on SIB-200 topic classification, surpassing all evaluated multilingual models up to 2B parameters. We observe consistent scaling from 50M to 600M, with MC QA accuracy rising from 22.8% to 30.3%, suggesting that further scaling remains beneficial. These results demonstrate that small, dedicated models trained from scratch with a language-appropriate tokenizer offer a viable path for low-resource language technology, achieving competitive performance at a fraction of the computational cost. All models and the tokenizer are released under open licenses.

Keywords: Small Language Models, Low-resource language, Kazakh language, Pre-training from scratch, Llama architecture, BPE tokenizer, Model scaling, Computational efficiency

183. ❌ Can ChatGPT Really Understand Modern Chinese Poetry?

Authors: Shanshan Wang, Derek F. Wong, Jingming Yao, Lidia S. Chao Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20851v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

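Rationale: The paper directly studies how well ChatGPT (a large language model) understands modern poetry, building an evaluation framework and running an empirical analysis, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not touch the specific techniques, methods, or application areas of the other keywords (MoE, SLMs, training techniques, inference optimization, agent systems, model compression, AI for science, and so on), which all score 0.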

!!! tip deepseek-chat TL;DR

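The study evaluates ChatGPT's understanding of modern Chinese poetry and finds that its interpretations align with the poets' original intent in over 73% of cases, but that it falls short on some dimensions, notably capturing poeticity.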

Abstract

ChatGPT has demonstrated remarkable capabilities on both poetry generation and translation, yet its ability to truly understand poetry remains unexplored. Previous poetry-related work merely analyzed experimental outcomes without addressing fundamental issues of comprehension. This paper introduces a comprehensive framework for evaluating ChatGPT’s understanding of modern poetry. We collaborated with professional poets to evaluate ChatGPT’s interpretation of modern Chinese poems by different poets along multiple dimensions. Evaluation results show that ChatGPT’s interpretations align with the original poets’ intents in over 73% of the cases. However, its understanding in certain dimensions, particularly in capturing poeticity, proved to be less satisfactory. These findings highlight the effectiveness and necessity of our proposed framework. This study not only evaluates ChatGPT’s ability to understand modern poetry but also establishes a solid foundation for future research on LLMs and their application to poetry-related tasks.

Keywords: ChatGPT, modern Chinese poetry, poetry understanding, evaluation framework, LLMs, poetry interpretation, poeticity

184. ❌ HiCI: Hierarchical Construction-Integration for Long-Context Attention

Authors: Xiangyu Zeng, Qi Xu, Yunke Wang, Chang Xu Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20843v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	15.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
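
The arithmetic behind a score table like the one above can be sketched as follows. This is a hypothetical reconstruction, not the report generator's code: it assumes each row's score is weight × relevance (consistent with the zero scores shown) and takes the pass mark from the report's configuration formula, 5.0 + 0.8 × total weight, with a configured total weight of 27.0. The function names are illustrative.

```python
# Hypothetical reconstruction of the report's scoring arithmetic.
# Assumption: row score = weight * relevance. The pass mark uses the
# formula stated in the report configuration: 5.0 + 0.8 * total weight.

def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report configuration."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows) -> float:
    """rows: iterable of (weight, relevance) pairs, one per keyword."""
    return sum(weight * relevance for weight, relevance in rows)


# The three non-zero relevance rows from the table, with weights as listed.
rows = [(0.0, 10.0), (0.0, 8.0), (0.0, 15.0)]
threshold = pass_mark(27.0)  # 27 keywords in the configured list
score = paper_score(rows)
passed = score >= threshold
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Under these assumptions the sketch prints the same 0.0 / 26.6 shown in the score line, since every listed weight is 0.0.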

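Scoring rationale: The paper's core topic is long-context language modeling. It proposes HiCI, a hierarchical attention module, and extends LLaMA-2's context from 4K to 100K tokens (7B) and 64K tokens (13B), so it is highly relevant to 'Context Window Extension OR Long Context LLMs' (15 points). It uses parameter-efficient fine-tuning (PEFT), adding fewer than 5.5% parameters, which makes it relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (8 points). It builds on LLaMA-2 and is therefore large language model research, relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and RLHF are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes HiCI, a hierarchical attention module that uses parameter-efficient fine-tuning to extend LLaMA-2's context length from 4K to 100K tokens, with consistent improvements over strong baselines on language modeling, retrieval, and instruction-following tasks.

Abstract (Chinese translation)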

长上下文语言建模通常被视作词元级注意力的可扩展性挑战，但现有方法中局部到全局的信息结构构建大多仍处于隐式状态。借鉴语篇理解的认知理论，我们提出HiCI（分层构建-集成）模块——一种分层注意力机制，该模块构建片段级表征，将其集成至共享的全局上下文中，并同步广播这两类信息以调节片段级注意力。我们通过对LLaMA-2进行参数高效适配（仅增加<5.5%参数）验证HiCI，成功将上下文窗口从4K扩展到100K词元（7B模型）和64K词元（13B模型）。在语言建模、信息检索和指令跟随基准测试中，HiCI相较于强基线模型均取得稳定提升，包括在主题检索任务上匹配专有模型，在代码理解任务上超越GPT-3.5-Turbo-16K。这些结果表明，显式分层结构作为长上下文建模的归纳偏置具有显著有效性。

Abstract

Long-context language modeling is commonly framed as a scalability challenge of token-level attention, yet local-to-global information structuring remains largely implicit in existing approaches. Drawing on cognitive theories of discourse comprehension, we propose HiCI (Hierarchical Construction–Integration), a hierarchical attention module that constructs segment-level representations, integrates them into a shared global context, and broadcasts both to condition segment-level attention. We validate HiCI through parameter-efficient adaptation of LLaMA-2 with only <5.5% additional parameters, extending context from 4K to 100K tokens (7B) and 64K tokens (13B). Across language modeling, retrieval, and instruction-following benchmarks, HiCI yields consistent improvements over strong baselines, including matching proprietary models on topic retrieval and surpassing GPT-3.5-Turbo-16K on code comprehension. These results demonstrate the effectiveness of explicit hierarchical structuring as an inductive bias for long-context modeling.

Keywords: Long-context language modeling, Hierarchical attention, Context window extension, Parameter-efficient adaptation, LLaMA-2, Segment-level representations, Global context, Instruction-following benchmarks
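
The construct, integrate, and broadcast steps described in the abstract can be illustrated with a toy sketch. It is not the HiCI module: mean pooling stands in for whatever learned components the paper uses, there are no trainable parameters, and `hierarchical_attention` and its helpers are hypothetical names. It only shows how each token can attend to its own segment plus a segment summary and a shared global vector, instead of to the full sequence.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def hierarchical_attention(tokens, segment_len):
    # Construction: split into segments and pool each into a summary vector
    # (mean pooling is an assumption made for this sketch).
    segments = [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]
    summaries = [mean(seg) for seg in segments]
    # Integration: merge the segment summaries into one shared global vector.
    global_ctx = mean(summaries)
    # Broadcast: each token attends to its own segment plus the two extra slots.
    outputs = []
    for seg, summary in zip(segments, summaries):
        memory = seg + [summary, global_ctx]
        outputs.extend(attend(tok, memory, memory) for tok in seg)
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0], [2.0, 0.0]]
out = hierarchical_attention(tokens, segment_len=2)
print(len(out), "output vectors")
```

In this sketch each token attends over `segment_len + 2` entries rather than the whole sequence, which is the kind of local-to-global structure the abstract describes.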

185. ❌ VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

Authors: Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22285v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

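Scoring rationale: The paper studies clue localization in long video understanding and proposes the VideoDetective framework. Relevance to the keywords: 1) the paper explicitly refers to multimodal large language models (MLLMs), so it is relevant to 'Large Language Models' (8 points); 2) its core problem is the context window limit in long video understanding, which is highly relevant to 'Context Window Extension OR Long Context LLMs' (10 points); 3) the other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and AI for Science, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty multimodal large language models have in localizing clues in long videos under limited context windows. It proposes the VideoDetective framework, which combines query-to-segment relevance with inter-segment affinity to locate key video segments, and reports accuracy gains of up to 7.5% on VideoMME-long.

Abstract (Chinese translation)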

长视频理解对多模态大语言模型（MLLMs）而言仍具挑战性，这主要受限于其有限的上下文窗口，因而需要识别与查询相关的稀疏视频片段。然而，现有方法主要仅依据查询本身进行线索定位，忽略了视频的内在结构及各片段间不同程度的相关性。为解决这一问题，我们提出了VideoDetective框架，该框架整合了查询-片段相关性与片段间亲和度，以在长视频问答中进行有效的线索搜寻。具体而言，我们将视频划分为多个片段，并基于视觉相似性和时序邻近性构建视觉-时序亲和图来表征这些片段。随后，我们执行一个“假设-验证-精炼”循环，以估计已观测片段与查询的相关性分数，并将其传播至未观测片段，从而生成全局相关性分布。该分布可指导定位最关键的视频片段，以便在稀疏观测条件下进行最终答案生成。实验表明，我们的方法在代表性基准测试中，针对一系列主流MLLMs均取得了显著性能提升，其中在VideoMME-long数据集上的准确率最高提升了7.5%。代码发布于https://videodetective.github.io/。

Abstract

Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video’s intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/

Keywords: long video understanding, multimodal large language models, context window limitation, query-relevant segments, visual-temporal affinity graph, hypothesis-verification-refinement, sparse observation, VideoMME-long benchmark
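
The idea of spreading relevance from observed segments to unseen ones over a visual-temporal affinity graph can be sketched as below. This is a minimal illustration under stated assumptions, not the VideoDetective algorithm: the affinity is taken to be clipped cosine similarity multiplied by an exponential temporal decay, and propagation is plain iterative neighbour averaging with observed scores held fixed. The names `affinity_matrix`, `propagate`, and the `tau` parameter are hypothetical.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def affinity_matrix(features, tau=2.0):
    """Affinity = visual similarity (clipped cosine) * temporal proximity."""
    n = len(features)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                visual = max(cosine(features[i], features[j]), 0.0)
                temporal = math.exp(-abs(i - j) / tau)
                matrix[i][j] = visual * temporal
    return matrix


def propagate(affinity, observed, steps=20):
    """Spread scores of observed segments to unobserved ones.

    observed maps segment index -> relevance score; these stay fixed.
    """
    n = len(affinity)
    scores = [observed.get(i, 0.0) for i in range(n)]
    for _ in range(steps):
        updated = []
        for i in range(n):
            if i in observed:
                updated.append(observed[i])
                continue
            total = sum(affinity[i])
            weighted = sum(a * s for a, s in zip(affinity[i], scores))
            updated.append(weighted / total if total else 0.0)
        scores = updated
    return scores


# Four segments with toy 2-D visual features; segments 0 and 3 were "observed".
features = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
observed = {0: 1.0, 3: 0.0}
scores = propagate(affinity_matrix(features), observed)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Segment 1, which looks like the relevant segment 0 and sits next to it in time, ends up with a higher propagated score than segment 2, so it would be inspected first under a sparse observation budget.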

186. ❌ DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

Authors: Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, Yingcong Chen, Liuqing Yang, Haoang Li Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22280v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

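Scoring rationale: The paper proposes DualCoT-VLA, whose core contribution is a parallel visual-linguistic reasoning mechanism. It is highly relevant to 'Chain of Thought' (10 points) and touches on the 'System 2 Thinking' concept (8 points). It studies VLA models, an application of large models to robotics, which is relevant to 'Large Language Models' (8 points). The proposed parallel reasoning mechanism reduces inference latency, giving some relevance to 'Inference Acceleration' (5 points). VLA models are used to execute robot tasks, which relates to the 'LLM Agents' concept (5 points). Other keywords such as MoE, quantization, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper targets weak logical planning and high inference latency in vision-language-action models on complex multi-step tasks. It proposes DualCoT-VLA, a visual-linguistic chain-of-thought method with parallel reasoning, and reports state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks and on real-world platforms.

Abstract (Chinese translation)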

视觉-语言-动作（Vision-Language-Action, VLA）模型将视觉观测与语言指令直接映射为机器人动作。尽管在简单任务中表现有效，但标准VLA模型在处理需要逻辑规划的复杂多步骤任务，以及要求细粒度空间感知的精确操作时，往往面临困难。近期研究尝试引入思维链（Chain-of-Thought, CoT）推理，赋予VLA模型“先思后行”的能力。然而，当前基于CoT的VLA模型存在两个关键局限：1）由于依赖孤立、单模态的CoT，无法同时捕捉低层视觉细节与高层逻辑规划；2）逐步自回归解码导致推理延迟高且错误会逐级累积。为应对这些局限，我们提出DualCoT-VLA，一种采用并行推理机制的视觉-语言CoT方法。为实现全面的多模态推理，本方法整合了用于低层空间理解的视觉CoT与用于高层任务规划的语言CoT。此外，为突破延迟瓶颈，我们引入了并行CoT机制，该机制包含两组可学习的查询令牌，将自回归推理转变为单步前向推理。大量实验表明，我们的DualCoT-VLA在LIBERO和RoboCasa GR1基准测试以及真实世界平台上均取得了最先进的性能。

Abstract

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.

Keywords: Vision-Language-Action Models, Chain-of-Thought Reasoning, Parallel Reasoning, Visual CoT, Linguistic CoT, Robotic Actions, Inference Latency, Multi-step Tasks

187. ❌ The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

Authors: Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22278v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

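Scoring rationale: The paper studies the mechanisms of spatial reasoning in vision-language models (VLMs), an application of large models to the vision-language multimodal domain, so it is relevant to 'Large Language Models' (8 points). It analyzes internal model representations to understand how spatial associations are computed, which falls under explainable AI and is highly relevant to 'Mechanistic Interpretability' (8 points). The spatial reasoning tasks give it some relevance to 'Chain of Thought' and 'System 2 Thinking' (5 points each), though neither is the core focus. Other keywords such as MoE, quantization, and RAG are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper studies how vision-language models compute spatial associations. It finds that spatial information comes mainly from the vision encoder's globally distributed representations rather than from the language model backbone, and that enhancing those visual representations improves spatial reasoning performance.

Abstract (Chinese translation)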

许多多模态任务，如图像描述生成和视觉问答，要求视觉语言模型（VLMs）将对象与其属性及空间关系关联起来。然而，目前尚不清楚此类关联在VLMs内部何处以及如何被计算。在本研究中，我们发现VLMs依赖两种并行机制来表征此类关联。在语言模型主干中，中间层在与对象对应的视觉标记之上，表征内容无关的空间关系。然而，这一机制在塑造模型预测中仅起次要作用。相反，空间信息的主要来源在于视觉编码器，其表征编码了对象的布局，并被语言模型主干直接利用。值得注意的是，这种空间信号在视觉标记之间全局分布，不仅限于对象区域，还延伸至周围的背景区域。我们证明，通过增强所有图像标记中源自视觉的全局空间表征，可以提高自然图像上的空间推理性能。综上所述，我们的研究结果阐明了空间关联在VLMs内部的计算方式，并凸显了视觉编码器在实现空间推理中的核心作用。

Abstract

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.

Keywords: vision-language models, spatial reasoning, visual encoders, language model backbone, multimodal tasks, object properties, spatial relations, model interpretability

188. ❌ Repurposing Geometric Foundation Models for Multi-view Diffusion

Authors: Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, Sainan Liu Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22275v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

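Scoring rationale: The paper proposes the Geometric Latent Diffusion (GLD) framework, which uses the feature space of geometric foundation models as the latent space for multi-view diffusion, a novel application of foundation models in computer vision. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points) because it explicitly uses and repurposes geometric foundation models. It has some relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points): the diffusion model is trained without large-scale text-to-image pretraining, but the method does rely on pretrained foundation-model features. Other keywords such as MoE, SLMs, alignment, and inference acceleration are unrelated to the paper's computer vision and geometric generation topic and all score 0.

!!! tip deepseek-chat TL;DR

The paper looks for the best latent space for novel view synthesis and proposes the Geometric Latent Diffusion framework. By repurposing the feature space of geometric foundation models, it outperforms VAE and RAE on 2D image quality and 3D consistency metrics and speeds up training by more than 4.4x relative to the VAE latent space.

Abstract (Chinese translation)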

尽管生成式隐空间的最新进展推动了单图像生成的显著进步，但适用于新视角合成（NVS）的最优隐空间在很大程度上仍未得到探索。具体而言，NVS需要在不同视角间保持几何一致的生成，但现有方法通常在与视角无关的VAE隐空间中操作。本文提出几何隐扩散（GLD）框架，该框架将几何基础模型中几何一致的特征空间重新用作多视角扩散的隐空间。我们证明这些特征不仅支持高保真度的RGB重建，还编码了强大的跨视角几何对应关系，从而为NVS提供了一个高度适配的隐空间。实验表明，GLD在二维图像质量和三维一致性指标上均优于VAE和RAE方法，同时相较于VAE隐空间，其训练速度提升了4.4倍以上。值得注意的是，尽管GLD的扩散模型完全从头开始训练而未利用大规模文本到图像预训练，其性能仍与依赖此类生成式预训练的最先进方法保持竞争力。

Abstract

While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.

Keywords: Geometric Latent Diffusion, novel view synthesis, multi-view diffusion, geometric foundation models, latent space, 3D consistency, generative models, computer vision

189. ❌ DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution

Authors: Zhengyao Lv, Menghan Xia, Xintao Wang, Kwan-Yee K. Wong Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22271v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

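Scoring rationale: The paper focuses on video super-resolution (VSR) and proposes DUO-VSR, a framework based on diffusion model distillation that aims to speed up generation and improve visual quality. Its core techniques are diffusion models, generative adversarial networks (GANs), knowledge distillation, and perceptual quality optimization. All of the given keywords concern large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and agents) or AI applications in specific scientific fields (such as bioinformatics), whereas this paper addresses a video processing task in computer vision and involves neither LLM techniques nor scientific applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper addresses the high sampling cost of diffusion-based video super-resolution. It proposes the DUO-VSR framework, whose dual-stream distillation strategy unifies distribution matching and adversarial supervision to achieve high-quality one-step generation, and it outperforms previous one-step VSR methods in visual quality and efficiency.

Abstract (Chinese translation)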

基于扩散模型的视频超分辨率技术近期在保真度方面取得了显著进展，但仍受限于高昂的采样成本。虽然分布匹配蒸馏方法能够将扩散模型加速至一步生成，但直接将其应用于视频超分辨率常导致训练不稳定，同时存在监督信号退化与不足的问题。为解决这些挑战，本文提出DUO-VSR——一个基于双流蒸馏策略的三阶段框架，通过统一分布匹配与对抗监督实现一步式视频超分辨率。首先，采用渐进引导蒸馏初始化方法，通过轨迹保持蒸馏稳定后续训练过程。其次，双流蒸馏模块联合优化分布匹配蒸馏流与真伪分数特征生成对抗网络流，后者通过利用真实与生成分数模型的判别性特征，提供互补的对抗监督信号。最后，偏好引导优化阶段进一步使学生模型与感知质量偏好对齐。大量实验表明，DUO-VSR在视觉质量与效率上均优于现有的一步式视频超分辨率方法。

Abstract

Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability alongside degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a Dual-Stream Distillation strategy that unifies distribution matching and adversarial supervision for one-step VSR. Firstly, a Progressive Guided Distillation Initialization is employed to stabilize subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision leveraging discriminative features from both real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.

Keywords: Video Super-Resolution, Diffusion Models, Knowledge Distillation, One-step Generation, Dual-Stream Distillation, Adversarial Supervision, Perceptual Quality, Efficiency

190. ❌ GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning

Authors: Yixuan Luo, Feng Qiao, Zhexiao Xiong, Yanjing Li, Nathan Jacobs Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22270v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

Scoring rationale: The paper studies optical flow estimation in computer vision and proposes an unsupervised generative approach that uses a pre-trained depth estimation network to generate pseudo optical flow data. All of the scoring keywords target large language models (LLMs) and related techniques (such as MoE, SFT, RLHF, RAG, and quantization), whereas this paper involves no language model or natural language processing technique and belongs purely to computer vision. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes GenOpticalFlow, an unsupervised generative framework that synthesizes large-scale aligned frame-flow data pairs to remove the dependence of optical flow estimation on expensive annotated data. On the KITTI and Sintel datasets it achieves results competitive with or better than existing unsupervised and semi-supervised methods.

Abstract (Chinese translation)

光流估计是计算机视觉中的一个基础问题，然而对昂贵真实标注的依赖限制了监督方法的可扩展性。尽管无监督和半监督方法缓解了这一问题，但它们通常依赖于基于亮度恒定性和平滑性假设的不可靠监督信号，导致在复杂现实场景中运动估计不准确。为克服这些限制，我们提出了 GenOpticalFlow，这是一个新颖的框架，能够合成大规模、完美对齐的帧-光流数据对，用于监督式光流训练，且无需人工标注。具体而言，我们的方法利用预训练的深度估计网络生成伪光流，这些伪光流作为条件输入，用于训练一个下一帧生成模型，以生成高保真、像素对齐的后续帧。这一过程能够创建大量具有精确运动对应关系的高质量合成数据。此外，我们提出了一种不一致像素过滤策略，用于识别并移除生成帧中不可靠的像素，有效提升了在真实世界数据集上的微调性能。在KITTI2012、KITTI2015和Sintel数据集上进行的大量实验表明，与现有的无监督和半监督方法相比，GenOpticalFlow 取得了具有竞争力或更优的结果，凸显了其作为一种可扩展、无需标注的光流学习解决方案的潜力。我们将在论文被接受后公开代码。

Abstract

Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce GenOpticalFlow, a novel framework that synthesizes large-scale, perfectly aligned frame–flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an inconsistent pixel filtering strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.

Keywords: optical flow estimation, unsupervised learning, generative approach, synthetic data generation, depth estimation, frame generation, inconsistent pixel filtering, computer vision
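
The abstract does not say how pseudo optical flows are derived from predicted depth. One standard construction, shown here purely as an assumption, is to treat the scene as rigid, move a virtual pinhole camera by a small translation, and re-project each pixel. The function name `flow_from_depth`, the centred principal point, and the toy values are all hypothetical.

```python
def flow_from_depth(depth, focal, translation):
    """Per-pixel flow (du, dv) induced by a small camera translation.

    depth: 2D list of positive depths Z (rows of pixels), as a depth
           network might predict.
    focal: focal length in pixels; the principal point is assumed to be
           at the image centre.
    translation: (tx, ty, tz) camera motion in the same units as depth.
    Assumes a static scene and that every point stays in front of the camera.
    """
    tx, ty, tz = translation
    h, w = len(depth), len(depth[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    flow = []
    for v in range(h):
        row = []
        for u in range(w):
            z = depth[v][u]
            # Back-project the pixel to a 3D point in the first camera frame.
            x3 = (u - cx) * z / focal
            y3 = (v - cy) * z / focal
            # Express the point in the moved camera's frame and re-project.
            z2 = z - tz
            u2 = focal * (x3 - tx) / z2 + cx
            v2 = focal * (y3 - ty) / z2 + cy
            row.append((u2 - u, v2 - v))
        flow.append(row)
    return flow


# Toy 2x2 depth map: the top row is near (Z = 2), the bottom row is far (Z = 4).
depth = [[2.0, 2.0], [4.0, 4.0]]
flow = flow_from_depth(depth, focal=100.0, translation=(0.1, 0.0, 0.0))
for row in flow:
    print([(round(du, 3), round(dv, 3)) for du, dv in row])
```

For a purely lateral translation the horizontal flow reduces to -focal * tx / Z, so nearer pixels move further, which is the parallax cue that makes depth usable as a source of motion supervision.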

Authors: Jeffri Murrugarra-Llerena, Pranav Chitale, Zicheng Liu, Kai Ao, Yujin Ham, Guha Balakrishnan, Paola Cascante-Bonilla Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22249v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

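Scoring rationale: The paper studies social group detection and uses VLMs/LLMs for zero-shot evaluation, so it is highly relevant to 'Large Language Models' (8 points). It has some relevance to 'LLM Agents' (5 points), because the paper notes that social intelligence matters for agents. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper introduces the EgoGroups dataset for evaluating VLMs/LLMs on real-world social group detection. It finds that VLMs/LLMs can outperform supervised baselines in a zero-shot setting and that crowd density and cultural region affect model performance.

Abstract (Chinese translation)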

社会群体检测，即识别参与人际互动（如家庭成员、朋友、顾客与商家）的人类，是智能体在世界中进行交互所需社会智能的关键组成部分。现有少数社会群体检测基准受限于场景多样性不足以及对第三人称摄像头来源（如监控录像）的依赖。因此，这些基准普遍缺乏对群体在不同文化背景和无约束环境中如何形成与演变的真实世界评估。为填补这一空白，我们提出了EgoGroups——一个捕捉全球城市社会动态的第一人称视角数据集。EgoGroups涵盖65个国家，包含低、中、高人群密度场景，并覆盖四种天气/时段条件。我们提供了密集的人员与社会群体人工标注，以及丰富的地理和场景元数据。基于该数据集，我们对前沿视觉语言模型（VLM）/大语言模型（LLM）及监督模型进行了群体检测能力的广泛评估。研究发现若干有趣现象：在零样本设置下，VLM与LLM能够超越监督基线模型，而人群密度和文化区域明显影响模型性能。

Abstract

Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of social intelligence needed for agents transacting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation on how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries covering low, medium, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for person and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of state-of-the-art VLM/LLMs and supervised models on their group detection capabilities. We found several interesting findings, including VLMs and LLMs can outperform supervised baselines in a zero-shot setting, while crowd density and cultural regions clearly influence model performance.

Keywords: social group detection, first-person view dataset, VLM/LLM evaluation, zero-shot learning, crowd density, cultural context, benchmark, real-world social dynamics

192. ❌ Riverine Land Cover Mapping through Semantic Segmentation of Multispectral Point Clouds

Authors: Sopitta Thurachen, Josef Taher, Matti Lehtomäki, Leena Matikainen, Linnea Blåfield, Mikel Calle Navarro, Antero Kukko, Tomi Westerlund, Harri Kaartinen Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22230v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

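Scoring rationale: The paper focuses on semantic segmentation of multispectral LiDAR point clouds in riverine environments with Point Transformer v2, for land cover classification. Its core is computer vision and remote sensing: it applies deep learning (specifically a Transformer architecture) to one scientific domain, environmental monitoring. Almost all keywords concern LLM-related techniques, training methods, inference optimization, alignment, agent systems, and similar topics, whereas this paper does not involve LLMs or natural language processing at all. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because applying deep learning to environmental science (river management) falls under AI for Science in the broad sense, though not under core bioinformatics or cheminformatics; it therefore receives 5 points (some relevance). The remaining keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The study explores semantic segmentation of multispectral LiDAR point clouds with Point Transformer v2 to map land cover in riverine environments. Combining geometric and spectral features significantly improves performance, and multi-dataset training improves generalization.

Abstract (Chinese translation)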

河流环境中精确的土地覆盖制图对于有效的河流管理、生态理解及地貌变化监测至关重要。本研究探索使用点云数据专用先进深度神经网络架构——点变换器v2（Point Transformer v2，PTv2），通过对真实河流环境中多光谱激光雷达（LiDAR）数据进行语义分割来实现土地覆盖制图。我们利用三通道激光雷达点云的几何与光谱信息，绘制包括沙地、砾石、低矮植被、高大植被、森林地表及水体在内的土地覆盖类别。研究采用芬兰北部奥兰卡河（Oulanka）的点云数据，结合几何与光谱特征对PTv2模型进行训练与评估。为提升模型在新河流环境中的泛化能力，我们进一步探究了多数据集训练策略，通过引入额外河流数据集的稀疏标注数据来增强训练。结果表明，采用全特征配置的模型平均交并比（mean Intersection over Union，mIoU）达到0.950，显著优于仅使用几何特征的基线模型。其他消融实验揭示，激光雷达强度与反射率特征是实现精确土地覆盖制图的关键。多数据集训练实验显示出泛化性能的提升，表明即使在高质量标注数据有限的情况下，仍具备开发更强健模型的潜力。本研究论证了基于变换器的架构在多光谱点云河流环境应用中的潜力，该方法为监测泥沙输移及其他河流管理应用提供了新的技术手段。

Abstract

Accurate land cover mapping in riverine environments is essential for effective river management, ecological understanding, and geomorphic change monitoring. This study explores the use of Point Transformer v2 (PTv2), an advanced deep neural network architecture designed for point cloud data, for land cover mapping through semantic segmentation of multispectral LiDAR data in real-world riverine environments. We utilize the geometric and spectral information from the 3-channel LiDAR point cloud to map land cover classes, including sand, gravel, low vegetation, high vegetation, forest floor, and water. The PTv2 model was trained and evaluated on point cloud data from the Oulanka river in northern Finland using both geometry and spectral features. To improve the model’s generalization in new riverine environments, we additionally investigate multi-dataset training that adds sparsely annotated data from an additional river dataset. Results demonstrated that using the full-feature configuration resulted in performance with a mean Intersection over Union (mIoU) of 0.950, significantly outperforming the geometry baseline. Other ablation studies revealed that intensity and reflectance features were the key for accurate land cover mapping. The multi-dataset training experiment showed improved generalization performance, suggesting potential for developing more robust models despite limited high-quality annotated data. Our work demonstrates the potential of applying transformer-based architectures to multispectral point clouds in riverine environments. The approach offers new capabilities for monitoring sediment transport and other river management applications.

Keywords: riverine land cover mapping, semantic segmentation, multispectral LiDAR, point cloud, Point Transformer v2, deep neural network, environmental monitoring, geometric and spectral features
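
The mIoU figure quoted in the abstract above is a standard segmentation metric. As a minimal sketch (not the authors' evaluation code), per-class IoU and its mean can be computed from integer label arrays as follows. The function name `mean_iou`, the toy labels, and the choice to skip classes absent from both arrays are assumptions made for illustration.

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes):
    """Return (mIoU, list of per-class IoU) for integer class labels.

    Classes that appear in neither array are skipped. This is an assumption;
    some toolkits handle empty classes differently.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    ious = []
    for c in range(num_classes):
        true_c = y_true == c
        pred_c = y_pred == c
        union = np.logical_or(true_c, pred_c).sum()
        if union == 0:
            continue  # class absent from both arrays
        inter = np.logical_and(true_c, pred_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)), ious

# Toy labels for six points and three hypothetical classes
# (0 = sand, 1 = water, 2 = vegetation).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
miou, per_class = mean_iou(y_true, y_pred, num_classes=3)
print(f"mIoU = {miou:.3f}")
```

Because every class contributes equally to the mean, a small class that is segmented poorly lowers mIoU as much as a large one would.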

193. ❌ Benchmarking Deep Learning Models for Aerial LiDAR Point Cloud Semantic Segmentation under Real Acquisition Conditions: A Case Study in Navarre

Authors: Alex Salvatierra, José Antonio Sanz, Christian Gutiérrez, Mikel Galar Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22229v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

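Scoring rationale: The paper benchmarks deep learning models (KPConv, RandLA-Net, Superpoint Transformer, Point Transformer V3) for semantic segmentation of aerial LiDAR point clouds, which places it in computer vision and 3D point cloud processing. Almost all keywords concern LLMs, model training techniques, inference optimization, AI alignment, agents, and similar topics, none of which the paper covers. The only marginally relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to geospatial science (remote sensing), though not to core bioinformatics or cheminformatics; it therefore receives 5 points (some relevance). The remaining keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The study benchmarks four deep learning models for semantic segmentation on real aerial LiDAR point cloud data. All models exceed 93% overall accuracy; KPConv achieves the best mean IoU (78.51%), while Point Transformer V3 stands out on the vehicle class (75.11% IoU).

Abstract (Chinese translation)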

深度学习的最新进展显著提升了三维语义分割的性能，但多数模型聚焦于室内或地面数据集。它们在真实航空采集条件下的表现仍未得到充分探索，尽管已有少量研究涉及类似场景，但这些研究在数据集设计、采集条件和模型选择方面存在差异。为填补这一空白，我们开展了一项实验基准测试，在西班牙纳瓦拉地区操作飞行条件下获取的大规模航空激光雷达（LiDAR）数据集上评估了多种先进架构，该数据集覆盖了异质的城市、乡村和工业景观。本研究比较了四种代表性的深度学习模型，包括KPConv、RandLA-Net、Superpoint Transformer和Point Transformer V3，针对航空勘测中常见的五个语义类别（如地面、植被、建筑物和车辆）进行分析，突出了航空数据中类别不平衡和几何变异性的固有挑战。结果表明，所有测试模型均实现了超过93%的整体准确率，其中KPConv通过在各类别（尤其是具有挑战性和代表性不足的类别）上保持稳定性能，获得了最高的平均交并比（78.51%）。Point Transformer V3在代表性不足的车辆类别上表现出最优性能（交并比75.11%），而Superpoint Transformer和RandLA-Net则在分割鲁棒性与计算效率之间进行了权衡。

Abstract

Recent advances in deep learning have significantly improved 3D semantic segmentation, but most models focus on indoor or terrestrial datasets. Their behavior under real aerial acquisition conditions remains insufficiently explored, and although a few studies have addressed similar scenarios, they differ in dataset design, acquisition conditions, and model selection. To address this gap, we conduct an experimental benchmark evaluating several state-of-the-art architectures on a large-scale aerial LiDAR dataset acquired under operational flight conditions in Navarre, Spain, covering heterogeneous urban, rural, and industrial landscapes. This study compares four representative deep learning models, including KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3, across five semantic classes commonly found in airborne surveys, such as ground, vegetation, buildings, and vehicles, highlighting the inherent challenges of class imbalance and geometric variability in aerial data. Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes, particularly on challenging and underrepresented categories. Point Transformer V3 demonstrates superior performance on the underrepresented vehicle class (75.11% IoU), while Superpoint Transformer and RandLA-Net trade off segmentation robustness for computational efficiency.

Keywords: aerial LiDAR, point cloud, semantic segmentation, deep learning, benchmarking, 3D models, KPConv, Point Transformer
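
The gap between overall accuracy (above 93%) and mean IoU (78.51%) reported above is typical of class-imbalanced data. The sketch below uses a made-up confusion matrix, not any number from the paper, to show how both metrics come from the same matrix and why a rare class pulls mean IoU down while barely moving overall accuracy. The class names and counts are hypothetical.

```python
import numpy as np

def overall_accuracy(cm):
    """Fraction of points on the diagonal of the confusion matrix."""
    return float(np.trace(cm) / cm.sum())

def per_class_iou(cm):
    """IoU per class: TP / (TP + FP + FN). Assumes every class occurs."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as class c, but true class differs
    fn = cm.sum(axis=1) - tp  # true class c, but predicted otherwise
    return tp / (tp + fp + fn)

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
# Class order: ground, vegetation, vehicle (the rare class).
cm = np.array([
    [900,  10,  0],
    [ 20, 960,  0],
    [  5,   5, 10],
])

oa = overall_accuracy(cm)
iou = per_class_iou(cm)
miou = float(iou.mean())
print(f"overall accuracy = {oa:.3f}, mIoU = {miou:.3f}, vehicle IoU = {iou[2]:.3f}")
```

In this toy case only half of the rare class is recovered, which costs about one percentage point of overall accuracy but lowers the three-class mean IoU to roughly 0.81.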

194. ❌ Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

Authors: Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22212v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

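Scoring rationale: The paper develops an evaluation benchmark for video world models, centered on assessing interactive response in 4D generation. It is highly relevant only to 'World Models AND General World Models' (10 points), because it directly studies the evaluation of world models, 4D world models in particular. The other keywords mainly concern LLM technical details, training methods, reasoning, alignment, compression, agents, and so on. The paper does not address these techniques; it mentions an 'agent-based evaluation framework', but this is not an LLM agent, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Omni-WorldBench, a comprehensive benchmark for evaluating the interactive response capabilities of 4D world models, and reveals critical limitations of current world models in interactive response.

Abstract (Chinese translation)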

基于视频的世界模型主要沿着两大范式发展：视频生成与三维重建。然而，现有的评估基准要么狭隘地聚焦于生成模型的视觉保真度与文本-视频对齐度，要么依赖于静态的三维重建指标，这些指标从根本上忽视了时间动态。我们认为，世界建模的未来在于四维生成，即对空间结构与时间演化进行联合建模。在此范式中，核心能力是交互响应：即准确反映交互行为如何驱动跨时空状态转换的能力。然而，目前尚无基准能系统性地评估这一关键维度。为填补这一空白，我们提出了Omni-WorldBench，这是一个专门设计用于评估世界模型在四维场景中交互响应能力的综合性基准。Omni-WorldBench包含两个关键组成部分：Omni-WorldSuite，一个涵盖不同交互层级与场景类型的系统性提示集；以及Omni-Metrics，一个基于智能体的评估框架，通过量化交互行为对最终结果及中间状态演化轨迹的因果影响，来衡量世界建模能力。我们对跨多种范式的18个代表性世界模型进行了广泛评估。我们的分析揭示了当前世界模型在交互响应方面存在的关键局限，为未来研究提供了可操作的见解。Omni-WorldBench将公开发布，以促进交互式四维世界建模领域的发展。

Abstract

Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text–video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.

Keywords: world models, 4D generation, interactive response, evaluation benchmark, video generation, 3D reconstruction, temporal dynamics, agent-based evaluation

195. ❌ Mixture of Mini Experts: Overcoming the Linear Layer Bottleneck in Multiple Instance Learning

Authors: Daniel Shao, Joel Runevic, Richard J. Chen, Drew F. K. Williamson, Ahrong Kim, Andrew H. Song, Faisal Mahmood Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22198v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

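Scoring rationale: The paper focuses on Multiple Instance Learning (MIL) in computational pathology and proposes MAMMOTH, a mixture-of-experts module that addresses the linear-layer bottleneck. It is highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (10 points), because its core contribution builds on a mixture-of-experts architecture. It is highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points), because MAMMOTH is designed as a parameter-efficient module that improves performance with minimal changes to the parameter count. It is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), because it targets computational pathology, which the scorer counts under bioinformatics. Other keywords, such as large language models, reasoning methods, and alignment techniques, are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the linear-layer bottleneck in Multiple Instance Learning (MIL) for computational pathology by introducing MAMMOTH, a parameter-efficient mixture-of-experts module, which yields an average +3.8% change in performance across the examined configurations (eight MIL methods on 19 classification tasks).

Abstract (Chinese translation)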

多示例学习（Multiple Instance Learning, MIL）是计算病理学中对千兆像素全切片图像进行分类的主流框架。MIL通常遵循以下流程：1）提取图像块特征，2）通过线性层获取任务特定的图像块特征，3）将图像块特征聚合成切片级特征以进行分类。尽管已有大量研究致力于优化图像块特征提取与聚合步骤，但尚未有工作关注第二步骤——即将通用特征转换为任务特定特征的关键层。我们假设该层构成了一个被忽视的性能瓶颈，且通过为每个图像块的表型定制低秩变换，能够获得更强的特征表示，并与现有任何MIL方法产生协同效应。为此，我们提出MAMMOTH——一个参数高效的多头专家混合模块，旨在以最小的总参数量改动提升任意MIL模型的性能。通过对八种MIL方法和19项不同分类任务的评估，我们发现这种任务特定的变换对性能的影响甚至大于聚合方法的选择。例如，当配备MAMMOTH模块时，即使如最大值池化或均值池化等简单方法，其平均性能也优于使用标准线性层的任何方法。总体而言，在152组实验配置中，MAMMOTH在130组中提升了性能，平均性能提升达$+3.8\%$。代码公开于 https://github.com/mahmoodlab/mammoth 。

Abstract

Multiple Instance Learning (MIL) is the predominant framework for classifying gigapixel whole-slide images in computational pathology. MIL follows a sequence of 1) extracting patch features, 2) applying a linear layer to obtain task-specific patch features, and 3) aggregating the patches into a slide feature for classification. While substantial efforts have been devoted to optimizing patch feature extraction and aggregation, none have yet addressed the second point, the critical layer which transforms general-purpose features into task-specific features. We hypothesize that this layer constitutes an overlooked performance bottleneck and that stronger representations can be achieved with a low-rank transformation tailored to each patch’s phenotype, yielding synergistic effects with any of the existing MIL approaches. To this end, we introduce MAMMOTH, a parameter-efficient, multi-head mixture of experts module designed to improve the performance of any MIL model with minimal alterations to the total number of parameters. Across eight MIL methods and 19 different classification tasks, we find that such task-specific transformation has a larger effect on performance than the choice of aggregation method. For instance, when equipped with MAMMOTH, even simple methods such as max or mean pooling attain higher average performance than any method with the standard linear layer. Overall, MAMMOTH improves performance in 130 of the 152 examined configurations, with an average $+3.8\%$ change in performance. Code is available at https://github.com/mahmoodlab/mammoth.

Keywords: Multiple Instance Learning, computational pathology, mixture of experts, parameter-efficient, linear layer bottleneck, whole-slide images, MAMMOTH, task-specific transformation
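
The three-step MIL pipeline described in the abstract can be sketched as below. This is a minimal NumPy illustration with random weights, not the MAMMOTH module and not the authors' code. The dimensions, the function name `mil_forward`, and the ReLU after the linear layer are assumptions. Step 2 is the single linear layer that the paper identifies as the bottleneck.

```python
import numpy as np

def mil_forward(patch_feats, w_proj, b_proj, w_cls, b_cls, pooling="mean"):
    """Slide-level logits from patch features of shape (num_patches, d_in)."""
    # Step 2: linear layer mapping general-purpose features to task-specific
    # ones (the ReLU is an assumption for this sketch).
    task_feats = np.maximum(patch_feats @ w_proj + b_proj, 0.0)
    # Step 3: aggregate all patch features into one slide-level feature.
    if pooling == "mean":
        slide_feat = task_feats.mean(axis=0)
    elif pooling == "max":
        slide_feat = task_feats.max(axis=0)
    else:
        raise ValueError(f"unknown pooling: {pooling}")
    # Slide-level classifier on the aggregated feature.
    return slide_feat @ w_cls + b_cls

rng = np.random.default_rng(0)
d_in, d_task, n_classes = 16, 8, 2          # hypothetical sizes
patch_feats = rng.normal(size=(100, d_in))  # step 1 stand-in: encoder output
w_proj = rng.normal(size=(d_in, d_task)) * 0.1
b_proj = np.zeros(d_task)
w_cls = rng.normal(size=(d_task, n_classes)) * 0.1
b_cls = np.zeros(n_classes)

logits_mean = mil_forward(patch_feats, w_proj, b_proj, w_cls, b_cls, "mean")
logits_max = mil_forward(patch_feats, w_proj, b_proj, w_cls, b_cls, "max")
print(logits_mean, logits_max)
```

In this baseline every patch passes through the same `w_proj`. According to the abstract, MAMMOTH replaces that shared layer with low-rank transformations tailored to each patch's phenotype, while the aggregation step stays interchangeable.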

196. ❌ A Backbone Benchmarking Study on Self-supervised Learning as a Auxiliary Task with Texture-based Local Descriptors for Face Analysis

Authors: Shukesh Reddy, Abhijit Das Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22190v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

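Scoring rationale: The paper studies face analysis in computer vision. It uses self-supervised learning as an auxiliary task combined with texture features, employs a Masked Auto-Encoder (MAE), and evaluates the effect of different backbones. The scoring keywords all relate to LLMs, innovations in the underlying principles of deep learning techniques, or applications of AI to science. The paper touches none of these and covers only self-supervised learning and face analysis within computer vision, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how different backbones affect face analysis performance when self-supervised learning with texture features is used as an auxiliary task. It finds that no single backbone suits all tasks and that performance depends strongly on the downstream task.

Abstract (Chinese translation)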

本研究以不同骨干网络为基准，探讨了其在自监督学习作为辅助任务时的影响，旨在将基于纹理的局部描述符融入特征建模以实现高效人脸分析。已有研究证实，将主任务与自监督辅助任务相结合能够实现更鲁棒且更具判别性的表征学习。
在局部模式自监督辅助任务框架中，我们采用从浅层到深层的不同骨干网络，以掩码自编码器的自监督任务作为辅助目标，在完成主任务的同时重建局部模式等纹理特征，从而确保鲁棒且无偏的人脸分析。
为扩展基准测试范围，我们在所提框架内对多种模型配置进行了全面比较分析。为此，我们针对三个研究问题展开探讨：“骨干网络在局部模式自监督辅助任务性能中扮演何种角色？”“何种类型的骨干网络对不同人脸分析任务有效？”以及“是否存在适用于局部模式自监督辅助任务的通用骨干网络？”
为回答这些问题，我们开展了详细的研究与实验。性能评估表明，所提方法中骨干网络的选择高度依赖于下游任务，在FaceForensics++、CelebA和AffectNet数据集上分别取得了0.94、0.87和0.88的平均准确率。
针对人脸属性预测、情绪分类和深度伪造检测等多种人脸分析范式，为保持特征表征质量的一致性与泛化能力，目前尚不存在统一的骨干网络架构。

Abstract

In this work, we benchmark with different backbones and study their impact for self-supervised learning (SSL) as an auxiliary task to blend texture-based local descriptors into feature modelling for efficient face analysis. It is established in previous work that combining a primary task and a self-supervised auxiliary task enables more robust and discriminative representation learning. We employed different shallow to deep backbones for the SSL task of Masked Auto-Encoder (MAE) as an auxiliary objective to reconstruct texture features such as local patterns alongside the primary task in local pattern SSAT (L-SSAT), ensuring robust and unbiased face analysis. To expand the benchmark, we conducted a comprehensive comparative analysis across multiple model configurations within the proposed framework. To this end, we address the three research questions: “What is the role of the backbone in performance L-SSAT?”, “What type of backbone is effective for different face analysis tasks?”, and “Is there any generalized backbone for effective face analysis with L-SSAT?”. Towards answering these questions, we provide a detailed study and experiments. The performance evaluation demonstrates that the backbone for the proposed method is highly dependent on the downstream task, achieving average accuracies of 0.94 on FaceForensics++, 0.87 on CelebA, and 0.88 on AffectNet. For consistency of feature representation quality and generalisation capability across various face analysis paradigms, including face attribute prediction, emotion classification, and deepfake detection, there is no unified backbone.

Keywords: self-supervised learning, auxiliary task, backbone benchmarking, texture-based local descriptors, face analysis, Masked Auto-Encoder, feature representation, downstream tasks
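
The abstract describes training a primary task together with an MAE reconstruction objective as an auxiliary task. A common way to write such a joint objective is shown below; it is an assumption, since the abstract does not state the exact form or weighting:

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{primary}} + \lambda \, \mathcal{L}_{\text{MAE}}
$$

Here $\mathcal{L}_{\text{primary}}$ is the supervised loss of the downstream task (for example, cross-entropy for emotion classification), $\mathcal{L}_{\text{MAE}}$ is the reconstruction error on the masked patches of the texture target, and $\lambda$ is a hypothetical weighting coefficient.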

197. ❌ PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation

Authors: Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding, Wenyi Li, Yangcheng Yu, Jinkun Liu, Shaocong Xu, Yike Niu, Haohan Chi, Hao Chen, Hao Tang, Li Yi, Hao Zhao Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22193v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

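Scoring rationale: The paper "PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation" focuses on hand-object interaction (HOI) video generation. It belongs to computer vision and graphics and deals specifically with unified modeling of pose, appearance, and motion. Its content is unrelated to the vast majority of keywords (LLM, MoE, SFT, RLHF, RAG, CoT, Agents, and so on), which center on the technical principles, training methods, inference optimization, and application paradigms of large language models. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because HOI generation can be seen as an application of AI to scientific or engineering simulation (for example AR/VR and embodied AI). However, the paper does not explicitly involve bioinformatics or cheminformatics and is not centered on large-model techniques, so it receives only 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes PAM, a unified Pose-Appearance-Motion engine for controllable hand-object interaction video generation. It outperforms baselines on the DexYCB and OAKINK2 datasets, and its synthetic data can augment training for a downstream hand pose estimation task.

Abstract (Chinese translation)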

手-物交互重建与合成正逐渐成为具身人工智能与增强/虚拟现实领域的核心议题。然而，尽管进展迅速，现有手-物交互生成研究仍分散在三个互不关联的方向：(1) 仅姿态合成方法，其预测MANO参数轨迹而不生成像素；(2) 单图像手-物交互生成方法，其通过掩码或二维线索推测外观但缺乏动态信息；(3) 视频生成方法，其需要完整姿态序列与真实首帧图像作为输入，导致无法实现真正的仿真到现实部署。受Joo等人(2018)研究理念的启发，我们认为手-物交互生成需要一个统一引擎，将姿态、外观与运动整合到连贯的框架中。为此，我们提出PAM：一个用于可控手-物交互视频生成的姿态-外观-运动引擎。本引擎的性能通过以下实验验证：(1) 在DexYCB数据集上，我们取得了29.13的FVD分数（对比InterDyn的38.83）与19.37毫米的MPJPE误差（对比CosHand的30.05毫米），同时生成480x720的高分辨率视频，优于256x256与256x384的基线方法。(2) 在OAKINK2数据集上，我们的完整多条件模型将FVD从68.76提升至46.31。(3) 在DexYCB上进行的输入条件消融实验表明，结合深度、分割与关键点信息能持续产生最佳结果。(4) 在下游手部姿态估计任务中，使用SimpleHand框架并增加3,400段合成视频（共207,000帧）进行训练，可使仅使用50%真实数据加成本文合成数据的模型达到与100%真实数据基线相当的性能。

Abstract

Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.

Keywords: Hand-object interaction, Video generation, Pose-Appearance-Motion, Sim-to-real, Controllable generation, Synthetic data augmentation, HOI reconstruction, Embodied AI
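
MPJPE (mean per-joint position error), one of the metrics quoted above, is the average Euclidean distance between predicted and ground-truth 3D joints. The following is a minimal sketch, assuming joints are given in millimetres as arrays of shape (frames, joints, 3) and that no root alignment is applied; evaluation protocols differ on alignment and the abstract does not specify one. The function name `mpjpe` and the 21-joint hand layout are illustrative assumptions.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between matching joints, in the input units."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    # Norm over the xyz axis gives one distance per joint per frame.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy data: 2 frames, 21 hand joints (assumed layout), all joints at the origin.
gt = np.zeros((2, 21, 3))
# Shift every predicted joint by (3, 4, 0) mm, i.e. 5 mm from ground truth.
pred = gt + np.array([3.0, 4.0, 0.0])
error_mm = mpjpe(pred, gt)
print(f"MPJPE = {error_mm:.2f} mm")
```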

198. ❌ ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints

Authors: Kaili Huang, Hongming Zhang, Rui Shen, Linjun Dai, Jiahao Wang, Hanming Deng, Lewei Lu Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22165v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

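Scoring rationale: The paper studies the problems of Direct Preference Optimization (DPO) when aligning large vision-language models (LVLMs) and proposes an improved method, ACPO. It is therefore highly relevant to 'Direct Preference Optimization', 'Alignment', and 'Large Language Models' (10 points each). It directly addresses hallucination, so it is highly relevant to 'Hallucination Mitigation' (10 points). It involves fine-tuning after alignment, which gives it some relevance to 'Post-training' (5 points). Other keywords such as MoE, SLMs, and RAG are not covered in the paper and score 0.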
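
The pass mark of 26.6 in the score line above follows from the report's configuration (pass mark = 5.0 + 0.8 × total weight, with 27 keywords at weight 1.0 each). A minimal sketch of that arithmetic; the function name `pass_mark` is illustrative and is not part of the report's tooling:

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as configured in this report: base + factor * total weight."""
    return base + factor * total_weight


# 27 keywords, each configured with weight 1.0 in the report header.
threshold = pass_mark(27 * 1.0)
print(round(threshold, 1))
```

A paper passes only if the sum of its per-keyword scores reaches this threshold.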

!!! tip deepseek-chat TL;DR

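To address the likelihood displacement that Direct Preference Optimization exhibits when aligning large vision-language models, the paper proposes Asymmetric Constrained Preference Optimization, which mitigates visual anchor collapse, reduces hallucination, and outperforms baseline methods on multiple benchmarks.

Abstract (Chinese translation)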

尽管直接偏好优化（DPO）已成为对齐大型视觉语言模型（LVLMs）的事实标准方法，但它存在似然位移问题，即被选答案和遭拒答案的概率均发生坍缩。这一优化缺陷在多模态场景中尤为有害：被选似然值的侵蚀——我们称之为视觉锚点坍缩——导致模型放弃视觉证据而依赖强语言先验，从而引发严重的幻觉现象。为解决此问题，我们提出非对称约束偏好优化（ACPO），这是一种模态无关的对齐机制，通过动态、目标导向的缩放应用于偏好优化。ACPO推导出仅作用于遭拒奖励项的复杂度感知缩放系数，非对称地抑制遭拒项的梯度流，同时将被选分布保持为梯度稳定的参考基准。虽然本质上属于通用优化目标，但打破这种梯度对称性对多模态任务至关重要，因为它能缓解语言先验对视觉标记的压制。在InternVL系列模型上的实验表明，ACPO能有效逆转标准DPO中被选奖励的退化趋势。通过阻止视觉锚点坍缩，ACPO在幻觉基准测试（HallusionBench、MM-IFEval）和通用能力排行榜（MMBench、MMStar、OCRBenchV2）上普遍优于基线方法，同时驱动通用能力的同步提升。

Abstract

While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods – a failure we term Visual Anchor Collapse – causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
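
The asymmetry described in the abstract can be illustrated with a small sketch. It implements the standard DPO loss and adds a single scalar, `rejected_scale`, applied only to the rejected reward. That scalar is a hypothetical stand-in: the abstract does not give the form of ACPO's complexity-aware coefficient, so this is not the authors' objective, only an illustration of scaling one side of the margin.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                    beta=0.1, rejected_scale=1.0):
    """DPO-style loss with an optional scale on the rejected reward only.

    rejected_scale=1.0 recovers the standard DPO loss. Other values are a
    hypothetical illustration, not the coefficient derived in the paper.
    """
    chosen_reward = beta * (logp_chosen - ref_chosen)
    rejected_reward = beta * (logp_rejected - ref_rejected)
    margin = chosen_reward - rejected_scale * rejected_reward
    return -log_sigmoid(margin)


# Same log-probabilities, with and without down-weighting the rejected term.
standard = preference_loss(-1.0, -3.0, -1.0, -1.0)
scaled = preference_loss(-1.0, -3.0, -1.0, -1.0, rejected_scale=0.5)
print(standard, scaled)
```

In this sketch the gradient with respect to the rejected log-probability is multiplied by `rejected_scale`, while the chosen term is left untouched, which is the kind of asymmetry the abstract describes.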

Keywords: Direct Preference Optimization, Large Vision-Language Models, Alignment, Hallucination Mitigation, Visual Anchor Collapse, Asymmetric Constrained Preference Optimization, Multimodal Learning

199. ❌ OpenEarth-Agent: From Tool Calling to Tool Creation for Open-Environment Earth Observation

Authors: Sijie Zhao, Feng Liu, Xueliang Zhang, Hao Chen, Xinyu Gu, Zhe Jiang, Fenghua Ling, Ben Fei, Wenlong Zhang, Junjue Wang, Weihao Xuan, Pengfeng Xiao, Naoto Yokoya, Lei Bai Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22148v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

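Scoring rationale: The paper proposes OpenEarth-Agent, a tool-creation agent framework for open-environment Earth observation. Its core involves LLM agents and tool use, and it is an AI for Science application in the Earth observation domain. Other keywords such as MoE, quantization, and inference acceleration are not mentioned in the abstract and are unrelated to the paper.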

!!! tip deepseek-chat TL;DR

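To handle the diversity of multi-source data and heterogeneous tasks in open-environment Earth observation, the paper proposes OpenEarth-Agent, the first tool-creation agent framework. It performs full-pipeline Earth observation across multiple application domains through adaptive workflow planning and tool creation, and is validated on a new benchmark.

Abstract (Chinese translation)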

地球观测（Earth Observation，EO）对于感知动态地表变化至关重要，但在开放环境中部署自主EO系统受到多源数据巨大异质性和任务多样性的制约。尽管遥感智能体已出现以简化EO工作流，但现有的工具调用型智能体局限于封闭环境，依赖预定义工具且适用范围狭窄，难以泛化至多样化的数据与任务。为突破这些限制，我们提出了OpenEarth-Agent——首个专为开放环境EO设计的工具创建型智能体框架。该框架不依赖预定义工具调用，而是通过自适应工作流规划与工具创建来泛化至未见数据与任务。这种适应性得益于多阶段工具的开放式集成与跨领域知识库的支撑，使其能够在多个应用领域的完整EO流程中实现稳健执行。为全面评估开放环境下的EO智能体，我们提出了OpenEarth-Bench——一个包含七大应用领域596个真实世界全流程案例的新型基准测试，专门用于评估智能体的自适应规划与工具创建能力。该基准仅提供必要的预训练模型工具，不含任何其他预定义的任务专用工具。大量实验表明，OpenEarth-Agent在开放环境中成功掌握了跨多领域的全流程EO任务。值得注意的是，在跨基准测试Earth-Bench上，仅配备6个基础预训练模型的工具创建型智能体取得了与依赖104个专用工具的工具调用型智能体相当的性能，且在提供完整工具集时显著超越后者。在多个案例中，所创建的工具相较于人工设计的工具展现出对数据异常更强的鲁棒性。

Abstract

Earth Observation (EO) is essential for perceiving dynamic land surface changes, yet deploying autonomous EO in open environments is hindered by the immense diversity of multi-source data and heterogeneous tasks. While remote sensing agents have emerged to streamline EO workflows, existing tool-calling agents are confined to closed environments. They rely on pre-defined tools and are restricted to a narrow scope, limiting their generalization to diverse data and tasks. To overcome these limitations, we introduce OpenEarth-Agent, the first tool-creation agent framework tailored for open-environment EO. Rather than calling predefined tools, OpenEarth-Agent employs adaptive workflow planning and tool creation to generalize to unseen data and tasks. This adaptability is bolstered by an open-ended integration of multi-stage tools and cross-domain knowledge bases, enabling robust execution across the entire EO pipeline in multiple application domains. To comprehensively evaluate EO agents in open environments, we propose OpenEarth-Bench, a novel benchmark comprising 596 real-world, full-pipeline cases across seven application domains, explicitly designed to assess agents’ adaptive planning and tool creation capabilities. Only essential pre-trained model tools are provided in this benchmark, with no other predefined task-specific tools. Extensive experiments demonstrate that OpenEarth-Agent successfully masters full-pipeline EO across multiple domains in the open environment. Notably, on the cross-benchmark Earth-Bench, our tool-creating agent equipped with 6 essential pre-trained models achieves performance comparable to tool-calling agents relying on 104 specialized tools, and significantly outperforms them when provided with the complete toolset. In several cases, the created tools exhibit superior robustness to data anomalies compared to human-engineered counterparts.

Keywords: Earth Observation, Open-Environment, Tool-Creation Agent, Adaptive Workflow Planning, Multi-domain Applications, OpenEarth-Bench, Autonomous EO, Cross-domain Knowledge Bases

200. ❌ dynActivation: A Trainable Activation Family for Adaptive Nonlinearity

Authors: Alois Bachmann Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22154v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

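Scoring rationale: The paper proposes dynActivation, a trainable activation function applied mainly to computer vision and language modeling tasks. Relevance to the keywords: (1) the paper tests the dynActGLU variant in language modeling experiments, so it has some relevance to 'Large Language Models' (5 points); (2) the other keywords (MoE, SLMs, Scaling Laws, the various training methods, inference techniques, AI for Science, and so on) are not covered in the paper and score 0. The paper's core contribution is a new activation function, not direct research on large-model techniques or scientific AI applications.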

!!! tip deepseek-chat TL;DR

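The paper proposes dynActivation, a trainable activation function that dynamically interpolates between a nonlinear path and a linear path, improving training efficiency and performance on vision and language modeling tasks.

Abstract (Chinese translation)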

本文提出 $\mathrm{dynActivation}$，一种逐层可训练的激活函数，其定义为 $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$。其中 $\alpha_i$ 和 $\beta_i$ 为轻量级可学习标量，用于在基础非线性函数与线性路径之间进行插值，而 $\mathrm{BaseAct}(x)$ 可为任意类 ReLU 函数。研究在多种视觉任务、语言建模任务及消融实验中对比了静态与动态类 ReLU 变体。结果表明，dynActivation 变体倾向于使深层线性化，同时保持高性能，相比 ReLU 最高可提升 $+54\%$ 的训练效率。
在 CIFAR-10 数据集上，dynActivation(Mish) 在 AttentionCNN 上相比静态 Mish 最高提升 $+14.02\%$，平均提升 $+6.00\%$，且收敛 AUC 相对 Mish 减少 $24\%$（2120 对比 2785）。在 1 至 75 层的 MNIST 深度扩展实验中，dynActivation 的测试精度始终不低于 $95\%$（$95.3$–$99.3\%$），而 ReLU 在 25 层时已崩溃至 $80\%$ 以下。在 $\varepsilon{=}0.08$ 的 FGSM 攻击下，dynActivation(Mish) 的精度下降为 $55.39\%$，而 ReLU 下降 $62.79\%$（优势达 $7.40\%$）。迁移至语言建模任务中，新提出的 dynActGLU 变体在 5620 步时相比 SwiGLU 相对困惑度降低 $10.3\%$（4.047 对比 4.514），尽管该差距在 34300 步时消失。

Abstract

This paper proposes $\mathrm{dynActivation}$, a per-layer trainable activation defined as $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, where $\alpha_i$ and $\beta_i$ are lightweight learned scalars that interpolate between the base nonlinearity and a linear path, and $\mathrm{BaseAct}(x)$ can be any ReLU-like function. The static and dynamic ReLU-like variants are then compared across multiple vision tasks, language modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, which can improve training efficiency by up to $+54\%$ over ReLU. On CIFAR-10, dynActivation(Mish) improves over static Mish by up to $+14.02\%$ on AttentionCNN, with an average improvement of $+6.00\%$ and a $24\%$ convergence-AUC reduction relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below $95\%$ test accuracy ($95.3$–$99.3\%$), while ReLU collapses below $80\%$ at 25 layers. Under FGSM at $\varepsilon{=}0.08$, dynActivation(Mish) incurs a $55.39\%$ accuracy drop versus $62.79\%$ for ReLU (a $7.40\%$ advantage). Transferred to language modeling, a newly proposed dynActGLU variant achieves a $10.3\%$ relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.
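
The formula in the abstract is small enough to sketch directly. The version below is a scalar, standard-library illustration, assuming ReLU or Mish as `BaseAct`; in the paper the two scalars are learned per layer, whereas here they are plain arguments. The function names are illustrative and do not come from the authors' code.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def mish(x: float) -> float:
    # Mish: x * tanh(softplus(x)), with a numerically stable softplus.
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)


def dyn_activation(x: float, alpha: float, beta: float, base_act=relu) -> float:
    """f_i(x) = BaseAct(x) * (alpha_i - beta_i) + beta_i * x."""
    return base_act(x) * (alpha - beta) + beta * x


# alpha=1, beta=0 recovers the base activation.
print(dyn_activation(-2.0, alpha=1.0, beta=0.0))   # 0.0
# alpha == beta removes the nonlinearity and leaves the linear path beta * x.
print(dyn_activation(-2.0, alpha=0.5, beta=0.5))   # -1.0

# Relative perplexity reduction reported in the abstract (4.047 vs. 4.514).
relative_reduction = (4.514 - 4.047) / 4.514
print(round(relative_reduction * 100, 1))          # 10.3
```

The second call shows what "linearizing" a layer means here: when the learned scalars coincide, the base nonlinearity drops out of the expression entirely.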

Keywords: dynActivation, trainable activation, adaptive nonlinearity, language modeling, perplexity reduction, training efficiency, ReLU-like functions, deep layer linearization

201. ❌ DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment

Authors: Xin Cai, Zhiyuan You, Zhoutong Zhang, Tianfan Xue Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22125v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

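Scoring rationale: The paper focuses on latent-space compression for diffusion models (DA-VAE): a detail-alignment mechanism raises the compression ratio of a variational autoencoder, and lightweight adaptation reduces the cost of retraining the diffusion backbone. Its core is latent-representation optimization and efficient training and inference for image generation, not innovation in large language models (LLMs) or in the underlying principles of deep learning. The scoring keywords all concern large language models, alignment, reasoning, agents, scientific AI applications, and similar topics, which have no direct connection to the paper's focus on diffusion models and image generation. All keyword relevance scores are therefore 0.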

!!! tip deepseek-chat TL;DR

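The paper proposes DA-VAE, a plug-in latent compression method that uses a detail-alignment mechanism to expand the latent dimensionality of a pretrained VAE. While preserving image quality, it cuts the token count of Stable Diffusion 3.5 by 4×, enabling efficient generation of 1024×1024 and 2048×2048 images and speeding up inference by 6×.

Abstract (Chinese translation)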

降低令牌数量对于潜在扩散模型的高效训练与推理至关重要，尤其是在高分辨率场景下。一种常见策略是构建高压缩率的图像令牌化器，使每个令牌包含更多通道。然而，若仅针对重建任务进行训练，高维潜在空间往往会丧失有意义的结构，从而增加扩散训练的难度。现有方法通过引入语义对齐或选择性丢弃等额外目标来解决这一问题，但通常需要昂贵的扩散模型重训练。事实上，预训练的扩散模型已具备结构化的低维潜在空间；因此，一种更简单的思路是在保持该结构的同时扩展潜在空间的维度。为此，我们提出 Detail-Aligned VAE（细节对齐变分自编码器），该方法通过仅对预训练扩散主干进行轻量级适配，即可提升预训练VAE的压缩率。DA-VAE采用显式的潜在空间布局：前$C$个通道直接来自基础分辨率下的预训练VAE，而额外的$D$个通道则编码更高分辨率的细节信息。通过一种简单的细节对齐机制，我们促使扩展后的潜在空间保持原始空间的结构。结合热启动微调策略，本方法仅需5个H100训练日，即可使用Stable Diffusion 3.5实现以$32 \times 32$令牌生成$1024 \times 1024$图像，令牌数量较原始模型减少$4$倍。该方法进一步实现了基于SD3.5的$2048 \times 2048$图像生成，在保持图像质量的同时获得$6$倍加速。我们还在ImageNet数据集上定量验证了该方法及其设计选择的有效性。

Abstract

Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training harder. Existing methods address this with extra objectives such as semantic alignment or selective dropout, but usually require costly diffusion retraining. Pretrained diffusion models, however, already exhibit a structured, lower-dimensional latent space; thus, a simpler idea is to expand the latent dimensionality while preserving this structure. We therefore propose Detail-Aligned VAE (DA-VAE), which increases the compression ratio of a pretrained VAE with only lightweight adaptation of the pretrained diffusion backbone. DA-VAE uses an explicit latent layout: the first $C$ channels come directly from the pretrained VAE at a base resolution, while an additional $D$ channels encode higher-resolution details. A simple detail-alignment mechanism encourages the expanded latent space to retain the structure of the original one. With a warm-start fine-tuning strategy, our method enables $1024 \times 1024$ image generation with Stable Diffusion 3.5 using only $32 \times 32$ tokens, $4\times$ fewer than the original model, within 5 H100-days. It further unlocks $2048 \times 2048$ generation with SD3.5, achieving a $6\times$ speedup while preserving image quality. We also validate the method and its design choices quantitatively on ImageNet.
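
The token figures in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (a 32 × 32 token grid for a 1024 × 1024 image, 4× fewer tokens than the original model). The attention-cost ratio additionally assumes full self-attention whose pairwise cost grows with the square of the token count; that is an assumption of this sketch, not a claim from the paper.

```python
def token_grid(tokens_per_side: int) -> int:
    """Total tokens for a square grid of tokens."""
    return tokens_per_side * tokens_per_side


image_side = 1024
da_vae_tokens = token_grid(32)          # 32 x 32 grid stated in the abstract
baseline_tokens = 4 * da_vae_tokens     # abstract: "4x fewer than the original model"

# Pixels covered by one token along each image side.
pixels_per_token_side = image_side // 32

# Assumption: full self-attention, so pairwise cost scales with tokens squared.
attention_pair_ratio = (baseline_tokens ** 2) // (da_vae_tokens ** 2)

print(da_vae_tokens, baseline_tokens, pixels_per_token_side, attention_pair_ratio)
# 1024 4096 32 16
```

Under that assumption, a 4× reduction in tokens corresponds to roughly 16× fewer pairwise attention interactions, which helps explain why token count matters most at high resolution.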

Keywords: latent compression, diffusion models, VAE, detail alignment, high-resolution image generation, token reduction, efficient training, inference acceleration

202. ❌ Biophysics-Enhanced Neural Representations for Patient-Specific Respiratory Motion Modeling

Authors: Jan Boysen, Hristina Uzunova, Heinz Handels, Jan Ehrhardt Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22123v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

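Scoring rationale: The paper studies respiratory motion modeling with implicit neural representations (INRs), at the intersection of medical imaging and computational biophysics. Its core techniques are implicit neural representations and physical constraints, not innovation in large language models or in the principles of deep learning. Of all the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' has some relevance, because the paper applies AI in biomedicine (radiotherapy); this is not the paper's core, so it receives 5 points. The other keywords, which concern large language models, deep learning principles, model training and optimization, and so on, are unrelated and all receive 0.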

!!! tip deepseek-chat TL;DR

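The paper proposes PRISM-RM, a physics-regularized implicit neural representation method for modeling patient-specific respiratory motion, with the aim of improving the precision of dose delivery in radiotherapy. Experiments show performance on par in interpolation scenarios and improved performance in extrapolation scenarios relative to the authors' earlier INR-based approach.

Abstract (Chinese translation)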

放射治疗中辐射剂量的精准空间递送对治疗成功至关重要。在肺部及上腹部区域，呼吸运动引入了显著的治疗不确定性，需要采用特殊的运动管理技术。为此，呼吸运动模型常被用于推断患者特异性呼吸运动，从而更有效地实施剂量投照。本研究探讨了利用隐式神经表示（INR）进行基于替代信号的呼吸运动建模的可能性。为此，我们提出了基于物理正则化的隐式替代信号呼吸运动建模方法（PRISM-RM）。我们这一新型集成呼吸运动模型无需依赖固定的参考呼吸状态。与传统成对配准技术不同，我们的方法提供了一种具有轨迹感知能力、时空连续且可微分的运动表示，从而提升了对外推场景的泛化能力。我们引入了生物物理约束，确保在训练数据之外的时间范围内也能获得生理上合理的运动估计。结果表明，与我们最初提出的基于INR的方法相比，这种轨迹感知方法在内插场景中表现相当，并显著提升了外推能力。与基于序列配准的方法相比，我们的两种方法在内插场景中表现同样出色，但在外推场景中稍显不足。然而，隐式神经表示的方法学特性使其在呼吸运动建模中尤为有效，随着其性能的持续提升，该方法展现出推动该领域发展的强大潜力。

Abstract

A precise spatial delivery of the radiation dose is crucial for the treatment success in radiotherapy. In the lung and upper abdominal region, respiratory motion introduces significant treatment uncertainties, requiring special motion management techniques. To address this, respiratory motion models are commonly used to infer the patient-specific respiratory motion and target the dose more efficiently. In this work, we investigate the possibility of using implicit neural representations (INR) for surrogate-based motion modeling. Therefore, we propose physics-regularized implicit surrogate-based modeling for respiratory motion (PRISM-RM). Our new integrated respiratory motion model is free of a fixed reference breathing state. Unlike conventional pairwise registration techniques, our approach provides a trajectory-aware spatio-temporally continuous and diffeomorphic motion representation, improving generalization to extrapolation scenarios. We introduce biophysical constraints, ensuring physiologically plausible motion estimation across time beyond the training data. Our results show that our trajectory-aware approach performs on par in interpolation and improves the extrapolation ability compared to our initially proposed INR-based approach. Compared to sequential registration-based approaches both our approaches perform equally well in interpolation, but underperform in extrapolation scenarios. However, the methodical features of INRs make them particularly effective for respiratory motion modeling, and with their performance steadily improving, they demonstrate strong potential for advancing this field.

Keywords: implicit neural representations, respiratory motion modeling, biophysical constraints, radiotherapy, patient-specific, trajectory-aware, extrapolation, surrogate-based modeling

203. ❌ StreamingClaw Technical Report

Authors: Jiawei Chen, Zhe Chen, Chaoqun Du, Maokui He, Wei He, Hengtao Li, Qizhen Li, Zide Liu, Hao Ma, Xuhao Pan, Chang Ren, Xudong Rao, Xintian Shen, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Shengyu Yao, Chunpeng Zhou, Kun Zhan, Lihao Zheng, Pan Zhou, Xuhan Zhu, Yufei Zheng Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22120v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

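Scoring rationale: StreamingClaw is a unified agent framework for streaming video understanding and embodied intelligence. Its core involves LLM-driven agents, multi-step reasoning, tool use, and multi-agent systems, so it is highly relevant to those keywords (10 points each). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and score 0.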

!!! tip deepseek-chat TL;DR

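The paper proposes the StreamingClaw framework, which addresses the lack of real-time reasoning, long-term memory, and proactive interaction in existing agents for streaming video understanding, and closes the perception-decision-action loop.

Abstract (Chinese translation)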

具身智能等应用依赖于实时感知-决策-行动闭环，这对流式视频理解提出了严苛要求。然而，当前智能体存在能力割裂的问题：仅支持离线视频理解、缺乏长期多模态记忆机制，或难以在流式输入下实现实时推理与主动交互。这些缺陷已成为阻碍其在真实环境中持续感知、实时决策并执行行动的关键瓶颈。为缓解这些问题，我们提出StreamingClaw——一个面向流式视频理解与具身智能的统一智能体框架。它同时也是兼容OpenClaw的框架，支持实时、多模态的流式交互。StreamingClaw集成了五项核心能力：（1）支持实时流式推理。（2）支持在交互目标在线演化的条件下对未来事件进行推理并实现主动交互。（3）支持多模态长期存储、分层演化及跨智能体的共享记忆高效检索。（4）支持感知-决策-行动闭环。除常规工具与技能外，还提供专为真实物理环境设计的流式工具及以行动为中心的技能。（5）兼容OpenClaw框架，可充分利用开源社区的资源与支持。通过这些设计，StreamingClaw将在线实时推理、多模态长期记忆与主动交互整合于统一框架中。此外，通过将决策转化为可执行动作，实现了对物理世界的直接控制，支持具身交互的实际部署。

Abstract

Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities, such as supporting only offline video understanding, lacking long-term multimodal memory mechanisms, or struggling to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck for preventing them from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) It supports real-time streaming reasoning. (2) It supports reasoning about future events and proactive interaction under the online evolution of interaction objectives. (3) It supports multimodal long-term storage, hierarchical evolution, and efficient retrieval of shared memory across multiple agents. (4) It supports a closed-loop of perception-decision-action. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments. (5) It is compatible with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.

Keywords: Streaming Video Understanding, Embodied Intelligence, Real-time Reasoning, Multimodal Long-term Memory, Agent Framework, Proactive Interaction, Perception-Decision-Action Loop, OpenClaw-compatible

204. ❌ FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario

Authors: Hang Dai, Hongwei Fan, Han Zhang, Duojin Wu, Jiyao Zhang, Hao Dong Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22102v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

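Scoring rationale: FreeArtGS belongs to computer vision and 3D reconstruction. It proposes a new method for reconstructing articulated objects in a free-moving scenario, combining free-moving part segmentation, joint estimation, and end-to-end optimization based on 3D Gaussian Splatting. The scoring keywords all relate directly to large language models, deep learning principles, or AI for Science applications, whereas this paper's core techniques (3D Gaussian Splatting, part segmentation, and joint estimation) fall under 3D reconstruction and motion analysis within computer vision and have no direct connection to the keyword list. All keyword relevance scores are therefore 0.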

!!! tip deepseek-chat TL;DR

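The paper proposes FreeArtGS, a new method that reconstructs the visual texture, geometry, and joint angles of articulated objects in a free-moving scenario from a monocular RGB-D video alone. Experiments show that it performs strongly on free-moving articulated object reconstruction and remains highly competitive in previous reconstruction settings.

Abstract (Chinese translation)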

增强现实与机器人技术日益增长的需求，正推动着对高可扩展性铰接物体重建的需求。然而，现有基于离散铰接状态或随意单目视频进行重建的方案，通常需要进行复杂的轴对齐或存在覆盖不足的问题，限制了其应用范围。本文提出FreeArtGS，一种在自由移动场景下重建铰接物体的新方法，该场景设置简单且具有高可扩展性。FreeArtGS将自由移动部件分割与关节估计及端到端优化相结合，仅需单目RGB-D视频作为输入。通过利用现有点追踪与特征模型提供的先验进行优化，自由移动部件分割模块能够在无约束采集条件下，根据相对运动识别出刚性部件。关节估计模块则校准统一的对象到相机位姿，并稳健地从部件分割中恢复关节类型与轴。最后，实施基于3D高斯泼溅（3DGS）的端到端优化，以联合重建铰接物体的视觉纹理、几何结构与关节角度。我们在两个基准数据集及真实世界的自由移动铰接物体上进行了实验。实验结果表明，FreeArtGS在重建自由移动铰接物体方面持续表现优异，并在以往的重建设置中保持高度竞争力，证明了其作为现实资产生成方案的实用性与有效性。项目页面位于：https://freeartgs.github.io/

Abstract

The increasing demand for augmented reality and robotics is driving the need for articulated object reconstruction with high scalability. However, existing settings for reconstructing from discrete articulation states or casual monocular videos require non-trivial axis alignment or suffer from insufficient coverage, limiting their applicability. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under free-moving scenario, a new setting with a simple setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, the free-moving part segmentation module identifies rigid parts from relative motion under unconstrained capture. The joint estimation module calibrates the unified object-to-camera poses and recovers joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry, and joint angles of the articulated object. We conduct experiments on two benchmarks and real-world free-moving articulated objects. Experimental results demonstrate that FreeArtGS consistently excels in reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, proving itself a practical and effective solution for realistic asset generation. The project page is available at: https://freeartgs.github.io/

Keywords: Articulated Object Reconstruction, Free-moving Scenario, 3D Gaussian Splatting, Part Segmentation, Joint Estimation, End-to-end Optimization, Monocular RGB-D Video, Realistic Asset Generation

205. ❌ Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models

Authors: Xingyu Zhu, Beier Zhu, Shuo Wang, Junfeng Fang, Kesen Zhao, Hanwang Zhang, Xiangnan He Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22094v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

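Scoring rationale: The paper studies jailbreak defenses for vision-language models (VLMs), which falls under safety alignment of large models. It is relevant to 'Large Language Models' (8 points) because VLMs are a kind of large model; to 'Instruction Tuning OR Alignment OR Value Alignment' (8 points) because it studies safety alignment and refusal of harmful content; to 'Hallucination Mitigation OR Factuality OR Truthfulness' (8 points) because defending against jailbreak attacks involves reducing harmful or false outputs; and to 'Mechanistic Interpretability OR Explainable AI' (5 points) because the method involves theoretical interpretability. The other keywords, such as MoE, SLMs, training methods, inference optimization, and agent systems, are unrelated to the paper (0 points).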

!!! tip deepseek-chat TL;DR

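The paper proposes NullSteer, a null-space projected activation defense framework that strengthens vision-language models against jailbreak attacks, substantially lowering the harmful-output rate while preserving performance on benign inputs.

Abstract translation (Chinese)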

随着视觉语言模型（VLMs）在开放世界场景中的日益广泛应用，它们极易受到视觉越狱攻击的诱导而生成有害内容，对模型的安全性与可信使用构成严重风险。近期的激活导向方法通过在推理阶段向模型激活中注入方向向量来诱导拒绝行为，已展现出一定效果。然而，导向向量在增强拒绝能力的同时也可能引发过度拒绝，从而降低模型在良性输入上的性能。此外，由于缺乏理论可解释性，这些方法在鲁棒性和有效性方面仍存在局限。为更好地平衡安全性与实用性，我们提出了NullSteer——一种零空间投影激活防御框架。该方法通过线性变换在模型激活中构建拒绝方向：在良性子空间内保持零扰动，同时沿潜在有害方向动态诱导拒绝，从而在理论上实现安全增强而不损害模型的通用能力。大量实验表明，NullSteer在各种越狱攻击下显著减少了有害输出（在MiniGPT-4上平均攻击成功率降低超过15%），同时在通用基准测试中保持了与原模型相当的性能。

Abstract

As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage. Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness. However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs. Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness. To better balance safety and utility, we propose NullSteer, a null-space projected activation defense framework. Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model’s general capabilities. Extensive experiments show that NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15 percent on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.
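
The idea of a linear perturbation that vanishes on a benign subspace can be illustrated with a small sketch. This is not NullSteer's actual construction, which the abstract does not spell out; it is a generic null-space projection under stated assumptions. The names `probe` and `refusal_dir`, the 3-D toy activations, and the specific map `h + refusal_dir * (probe · P h)` are all hypothetical, where `P` projects onto the directions orthogonal to every benign activation.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: List[Vector], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of the benign activations."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def null_space_component(h: Vector, benign_basis: List[Vector]) -> Vector:
    """P h: the part of h orthogonal to the benign subspace."""
    out = list(h)
    for b in benign_basis:
        coeff = dot(out, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out


def steer(h: Vector, benign_basis: List[Vector],
          probe: Vector, refusal_dir: Vector) -> Vector:
    """Hypothetical linear steering: h + refusal_dir * (probe . P h).

    The perturbation is linear in h and exactly zero whenever h lies in
    the benign subspace, because P h is zero there.
    """
    strength = dot(probe, null_space_component(h, benign_basis))
    return [hi + strength * ri for hi, ri in zip(h, refusal_dir)]


# Toy data (assumed): benign activations span the x-y plane of a 3-D space.
benign_activations = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(benign_activations)
probe = [0.0, 0.0, 1.0]        # hypothetical "harmful direction" readout
refusal_dir = [0.0, 1.0, 1.0]  # hypothetical refusal direction

benign_out = steer([2.0, -1.0, 0.0], basis, probe, refusal_dir)
other_out = steer([1.0, 0.0, 3.0], basis, probe, refusal_dir)
print(benign_out)  # unchanged: the input lies in the benign subspace
print(other_out)   # shifted along refusal_dir
```

In this toy setting the benign input passes through unchanged, while the input with an out-of-subspace component is shifted along the refusal direction in proportion to that component.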

Keywords: vision-language models, jailbreak defense, activation steering, null-space projection, safety alignment, harmful content mitigation, model robustness, trustworthy AI

206. ❌ P-Flow: Prompting Visual Effects Generation

Authors: Rui Zhao, Mike Zheng Shou Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22091v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

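Scoring rationale: P-Flow focuses on customizing dynamic visual effects in video generation and proposes a training-free framework that uses a vision-language model for test-time prompt optimization. All scoring keywords relate directly to large language model (LLM) techniques, training methods, inference optimization, alignment, model compression, AI agents, and similar topics, whereas the core of this paper is video generation and an application of vision-language models (VLMs); it does not cover LLM technical principles, training methods, or AI applications in science. Although the research background notes that "research applying large models in different domains may be given points at the reviewer's discretion," the paper explicitly uses a vision-language model rather than a language model and touches none of the specific techniques in the scoring keywords, so the relevance for every keyword is 0.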

!!! tip deepseek-chat TL;DR

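The paper addresses the difficulty of precisely controlling dynamic visual effects (such as object crushing or explosion) in video generation with a single text prompt. It proposes P-Flow, a training-free framework that uses a vision-language model for iterative prompt optimization, achieving high-fidelity and diverse effect customization and outperforming other models on text-to-video and image-to-video generation tasks.

Abstract translation (Chinese)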

近期视频生成模型在遵循文本提示方面取得了显著进展。然而，针对动态视觉效果（定义为随时间演变且由外观驱动的视觉现象，如物体破碎或爆炸）的定制化研究仍显不足。先前关于运动定制或控制的研究主要集中于主体或摄像机的低层运动，这类运动可通过运动轨迹等显式控制信号进行引导。相比之下，动态视觉效果涉及更高层次的语义，更自然地适合通过文本提示进行控制。然而，人类难以通过单一提示精准描述这些效果，因其需要复杂的时间推理和持续的迭代优化。为应对这一挑战，我们提出了P-Flow——一种无需训练、可在不修改底层模型的情况下定制视频生成中动态视觉效果的新型框架。通过利用视觉-语言模型的语义与时间推理能力，P-Flow执行测试时提示优化，根据参考视频与生成输出之间的视觉效果差异迭代优化提示。经过多次迭代，提示词逐步演化，从而在新场景中更有效地引导出期望的动态效果。实验表明，P-Flow能够实现高保真度且多样化的视觉效果定制，并在文本到视频与图像到视频生成任务中优于其他模型。代码发布于https://github.com/showlab/P-Flow。

Abstract

Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks. Code is available at https://github.com/showlab/P-Flow.

Keywords: video generation, dynamic visual effects, prompt optimization, vision-language models, training-free framework, text-to-video, image-to-video, temporal reasoning

207. ❌ Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning

Authors: Xingyu Zhu, Liang Yi, Shuo Wang, Wenbo Zhu, Yonglinag Wu, Beier Zhu, Hanwang Zhang Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22070v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

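Scoring rationale: The paper studies test-time adaptation of multimodal 3D vision-language models for point cloud analysis and proposes the BayesMM framework. It is unrelated to most keywords because it does not involve large language models, reasoning, alignment, compression, or other core large-model techniques. It has some connection only to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because test-time adaptation (TTA) is a form of domain adaptation, although the paper focuses on the test stage rather than pre-training.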

!!! tip deepseek-chat TL;DR

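The paper targets the performance drop of multimodal 3D vision-language models under domain shift in point cloud analysis. It proposes BayesMM, which performs online test-time adaptation through Bayesian distribution learning and achieves an average improvement of over 4% across multiple benchmarks.

Abstract translation (Chinese)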

多模态三维视觉-语言模型在多样化的三维任务中展现出强大的泛化能力，但其性能在领域偏移下仍会显著下降。这推动了近期关于测试时适应（TTA）的研究，该技术使模型能够利用测试时数据进行在线适应。在现有的TTA方法中，基于缓存的机制被广泛采用，以利用先前观测到的样本进行在线预测优化。然而，这些方法仅存储有限的历史信息，导致随着测试流演进，信息逐渐丢失。此外，其预测逻辑值通常通过启发式方式融合，使得适应过程不稳定。为应对这些局限，我们提出了BayesMM，一种用于测试时点云分析的多模态贝叶斯分布学习框架。BayesMM将每个类别的文本先验和流式视觉特征建模为高斯分布：文本参数源自语义提示，而视觉参数则随到达样本在线更新。两种模态通过贝叶斯模型平均进行融合，该机制基于后验证据自动调整其贡献度，从而产生一个持续适应演化测试数据且无需训练的统合预测。在多个点云基准数据集上的大量实验表明，BayesMM在分布偏移下保持了鲁棒性，平均性能提升超过4%。

Abstract

Multimodal 3D vision-language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable. To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual parameters are derived from semantic prompts, while visual parameters are updated online with arriving samples. The two modalities are fused via Bayesian model averaging, which automatically adjusts their contributions based on posterior evidence, yielding a unified prediction that adapts continually to evolving test-time data without training. Extensive experiments on multiple point cloud benchmarks demonstrate that BayesMM maintains robustness under distributional shifts, yielding over 4% average improvement.
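
Fusing two per-class Gaussian models by Bayesian model averaging can be sketched as follows. This is a generic illustration rather than BayesMM's implementation: the 1-D feature, the uniform class prior, the equal model priors, the single-sample evidence, and all numbers are assumptions, and the online update of the visual parameters is omitted.

```python
import math
from typing import Dict, Tuple

Gaussian = Tuple[float, float]  # (mean, variance) of a 1-D feature


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def class_posterior(x: float, classes: Dict[str, Gaussian]) -> Tuple[Dict[str, float], float]:
    """Class posterior under a uniform class prior, plus the evidence p(x | model).

    Toy code: assumes the evidence does not underflow to zero.
    """
    likelihoods = {name: gaussian_pdf(x, m, v) for name, (m, v) in classes.items()}
    total = sum(likelihoods.values())
    evidence = total / len(likelihoods)
    posterior = {name: lik / total for name, lik in likelihoods.items()}
    return posterior, evidence


def bayesian_model_average(x: float,
                           text_model: Dict[str, Gaussian],
                           visual_model: Dict[str, Gaussian]) -> Dict[str, float]:
    """Weight each modality's class posterior by its posterior model probability."""
    text_post, text_ev = class_posterior(x, text_model)
    vis_post, vis_ev = class_posterior(x, visual_model)
    w_text = text_ev / (text_ev + vis_ev)  # equal model priors assumed
    return {name: w_text * text_post[name] + (1.0 - w_text) * vis_post[name]
            for name in text_post}


# Hypothetical per-class Gaussians for the two modalities.
text_model = {"chair": (0.0, 1.0), "table": (4.0, 1.0)}
visual_model = {"chair": (1.0, 0.5), "table": (3.0, 0.5)}

fused = bayesian_model_average(0.8, text_model, visual_model)
prediction = max(fused, key=fused.get)
print(prediction, fused)
```

The modality that explains the incoming sample better receives the larger weight, which is the sense in which the averaging adjusts contributions based on evidence.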

Keywords: Multimodal 3D vision-language models, Point cloud analysis, Test-time adaptation, Domain shift, Bayesian distribution learning, Online adaptation, BayesMM, Distributional shifts

208. ❌ SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

Authors: Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22057v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

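Scoring rationale: The core of the paper is using a large language model (LLM) with multi-turn chain-of-thought (CoT) reasoning to inject 3D spatial knowledge into vision encoders, so it is highly relevant to 'Large Language Models' and 'Chain of Thought' (10 points each). 'System 2 Thinking' has some connection (5 points) because CoT reasoning involves deliberate thinking. 'Pre-training' has moderate relevance (5 points) because the paper enhances pre-trained vision encoders. The other keywords are unrelated to the paper (0 points).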

!!! tip deepseek-chat TL;DR

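The paper proposes SpatialBoost, a framework that injects 3D spatial knowledge into pre-trained vision encoders through LLM-driven multi-turn chain-of-thought reasoning, markedly improving performance on tasks that require 3D perception; for example, it raises DINOv3 on ADE20K from 55.9 to 59.7 mIoU.

Abstract translation (Chinese)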

尽管大规模预训练图像表征模型（即视觉编码器）在各种视觉任务中取得了显著成功，但这些模型主要基于二维图像数据进行训练，因此往往难以捕捉现实世界中物体与背景之间的三维空间关系，这限制了许多下游应用中的模型效能。为解决这一问题，我们提出了SpatialBoost，一个可扩展的框架，通过注入以语言描述表达的三维空间知识，增强现有预训练视觉编码器的空间感知能力。其核心思想是将二维图像中的密集三维空间信息转化为语言表达，进而通过大语言模型（Large Language Model, LLM）将此类空间知识注入视觉编码器。为此，我们采用了一种多轮思维链（Chain-of-Thought, CoT）推理过程，逐步整合密集空间知识并构建层次化的空间理解。为验证有效性，我们将SpatialBoost适配于DINOv3等先进视觉编码器，并在需要三维感知和通用视觉能力的广泛基准测试中评估其性能提升。例如，在ADE20K数据集上，SpatialBoost将DINOv3的性能从55.9 mIoU提升至59.7 mIoU，相比预训练的DINOv3实现了3.8%的性能增益，达到了最先进的水平。

Abstract

Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which is then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with 3.8% gain over the pre-trained DINOv3.

Keywords: SpatialBoost, Large Language Model, Chain-of-Thought reasoning, 3D spatial knowledge, vision encoders, DINOv3, spatial awareness, performance improvement

209. ❌ FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation

Authors: Wuyang Luo, Chengkai Tan, Chang Ge, Binye Hong, Su Yang, Yongjiu Ma Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22054v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

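Scoring rationale: FontCrafter focuses on artistic font generation and proposes an element-driven framework based on visual in-context generation. Its core innovation is treating element images as visual context and using an inpainting model to transfer element styles into glyph regions at the pixel level, together with a Context-aware Mask Adapter (CMA) and an attention redirection mechanism for fine-grained control. The work is entirely unrelated to the vast majority of techniques in the keyword list (LLMs, MoE, alignment, RAG, reasoning, agents, and so on), because those keywords target large language models and related techniques, while this paper addresses a computer vision and image generation task. The only relevant keyword is 'In-context Learning OR Many-shot Learning': the paper explicitly proposes an "in-context generation strategy" that uses element images as visual context for style transfer, which is methodologically close to in-context learning even though it is applied to vision rather than text. This keyword therefore receives 10 points (highly relevant), and all others receive 0 (unrelated).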

!!! tip deepseek-chat TL;DR

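To address limited style diversity and coarse control in artistic font generation, the paper proposes FontCrafter, which uses a visual in-context generation strategy and a Context-aware Mask Adapter to achieve high-fidelity element-driven font creation, with strong texture and structure fidelity in zero-shot generation.

Abstract translation (Chinese)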

艺术字体生成旨在基于参考风格合成风格化的字形。然而，现有方法存在风格多样性有限和控制粒度粗糙的问题。在本工作中，我们探索了元素驱动的艺术字体生成的潜力。元素是字体的基本视觉单元，可作为目标风格的参考图像。从概念上，我们将元素分为具有明确结构的对象元素（如花朵或石头）和具有非结构化纹理的无定形元素（如火焰或云朵）。我们提出了FontCrafter，一个用于字体创建的元素驱动框架，并构建了一个大规模数据集ElementFont，其中包含多样化的元素类型和高质量的字形图像。然而，实现参考元素纹理与结构的高保真重建仍具挑战性。为此，我们提出了一种上下文生成策略，将元素图像视为视觉上下文，并利用修复模型在像素级别将元素风格迁移至字形区域。为进一步控制字形轮廓，我们设计了一个轻量级的上下文感知掩码适配器（Context-aware Mask Adapter, CMA）以注入形状信息。此外，一种免训练的注意力重定向机制实现了区域感知的风格控制并抑制笔画幻觉。同时，通过边缘重绘使边界更自然。大量实验表明，FontCrafter在零样本生成任务中表现出色，尤其在保持结构和纹理保真度方面，同时支持风格混合等灵活控制。

Abstract

Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures. We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which contains diverse element types and high-quality glyph images. However, achieving high-fidelity reconstruction of both texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level. To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. In addition, edge repainting is applied to make boundaries more natural. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, particularly in preserving structural and textural fidelity, while also supporting flexible controls such as style mixture.

Keywords: artistic font generation, element-driven, visual in-context generation, inpainting model, context-aware mask adapter, attention redirection, zero-shot generation, style mixture

210. ❌ DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation

Authors: Binhong Tan, Zhaoxin Wang, Handing Wang Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22041v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

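Scoring rationale: The paper focuses on safety defenses for text-to-image (T2I) diffusion models and proposes DTVI, a dual-stage intervention framework. All scoring keywords relate to large language models (LLMs) or deep learning technical principles, whereas this paper studies the safety of diffusion models (a type of generative model), not large language models. Although it involves the notion of "safety," its semantic focus differs from keywords such as 'Hallucination Mitigation' or 'Alignment', which concern LLM hallucination or alignment. All keywords therefore score 0.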

!!! tip deepseek-chat TL;DR

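The paper proposes DTVI, a dual-stage textual and visual intervention framework that defends against unsafe content generation in text-to-image models at inference time, achieving high defense success rates across multiple harmful categories while preserving generation quality on benign prompts.

Abstract translation (Chinese)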

文本到图像（Text-to-Image, T2I）扩散模型已展现出强大的生成能力，但其生成不安全内容的潜力引发了严重的安全担忧。现有的推理时防御方法通常在文本嵌入空间中进行类别无关的令牌级干预，这种方法无法捕捉分布在整个令牌序列中的恶意语义，且仍易受对抗性提示的攻击。本文提出DTVI，一种用于安全T2I生成的双阶段推理时防御框架。与现有方法在特定令牌嵌入上进行干预不同，我们的方法引入了对完整提示嵌入的类别感知序列级干预，以更好地捕捉分布式的恶意语义，并进一步在视觉生成阶段衰减残留的不安全影响。在现实世界的不安全提示、对抗性提示及多种有害类别上的实验结果表明，我们的方法实现了有效且鲁棒的防御，同时在良性提示上保持了合理的生成质量，在涉及性相关类别的基准测试中平均防御成功率（Defense Success Rate, DSR）达到94.43%，在七类不安全类别中平均达到88.56%，且对良性提示的生成质量得以维持。

Abstract

Text-to-Image (T2I) diffusion models have demonstrated strong generation ability, but their potential to generate unsafe content raises significant safety concerns. Existing inference-time defense methods typically perform category-agnostic token-level intervention in the text embedding space, which fails to capture malicious semantics distributed across the full token sequence and remains vulnerable to adversarial prompts. In this paper, we propose DTVI, a dual-stage inference-time defense framework for safe T2I generation. Unlike existing methods that intervene on specific token embeddings, our method introduces category-aware sequence-level intervention on the full prompt embedding to better capture distributed malicious semantics, and further attenuates the remaining unsafe influences during the visual generation stage. Experimental results on real-world unsafe prompts, adversarial prompts, and multiple harmful categories show that our method achieves effective and robust defense while preserving reasonable generation quality on benign prompts, obtaining an average Defense Success Rate (DSR) of 94.43% across sexual-category benchmarks and 88.56% across seven unsafe categories, while maintaining generation quality on benign prompts.

Keywords: Text-to-Image Generation, Diffusion Models, Safety Defense, Inference-time Intervention, Dual-stage Framework, Malicious Semantics, Adversarial Prompts, Defense Success Rate

211. ❌ GTSR: Subsurface Scattering Awared 3D Gaussians for Translucent Surface Reconstruction

Authors: Youwen Yuan, Xi Zhao Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22036v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

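Scoring rationale: The paper belongs to computer vision and graphics. It proposes GTSR, a translucent-object surface reconstruction method based on 3D Gaussian Splatting (3DGS); its core contributions are combining surface and interior Gaussians, introducing Fresnel-term blending, and using the Disney BSDF model to strengthen constraints. All keywords relate directly to large language models (LLMs), deep learning technical principles, or AI applications in science, but the paper's topic is 3D reconstruction and rendering and does not involve LLMs, MoE, scaling laws, training techniques, inference optimization, agents, model compression, and so on. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since 3D reconstruction can be viewed as a scientific application within computer vision; however, the paper does not mention bioinformatics or cheminformatics and focuses on graphics rather than AI for science broadly, so it receives a relevance of 5 (some connection). The other keywords are entirely unrelated and receive 0. The relevance values sum to 5.0 (only one keyword scored), but every weight in the table above is 0.0, so the weighted total is 0.0.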
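
The score arithmetic for this entry can be checked with a short sketch. The passing-score formula (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report's configuration section; the per-keyword rule `weight * relevance` is an assumption inferred from the table rows, not a documented formula, and the function names are hypothetical.

```python
from typing import Dict, Tuple

MAX_SCORE_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def passing_score(total_weight: float) -> float:
    """Passing threshold as stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMPTION: per-keyword score = weight * relevance, capped at the maximum."""
    return min(weight * relevance, MAX_SCORE_PER_KEYWORD)


def total_score(rows: Dict[str, Tuple[float, float]]) -> float:
    """Sum the per-keyword scores; rows map keyword -> (weight, relevance)."""
    return sum(keyword_score(w, r) for w, r in rows.values())


# Only the non-zero relevance row from the table above; every other row scores 0.
gtsr_rows = {"AI for Science OR Bioinformatics OR Cheminformatics": (0.0, 5.0)}

threshold = passing_score(27.0)  # 27 keywords, total weight 27.0 in the configuration
total = total_score(gtsr_rows)
passed = total >= threshold
print(f"total={total:.1f} threshold={threshold:.1f} passed={passed}")
# prints: total=0.0 threshold=26.6 passed=False
```

Under this assumed rule, a relevance of 5 with a weight of 0.0 contributes nothing, which matches the 0.0 / 26.6 shown for this paper.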

!!! tip deepseek-chat TL;DR

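The paper proposes GTSR, a new method based on 3D Gaussian Splatting (3DGS) for reconstructing the surface geometry of translucent objects from multi-view images. By combining surface and interior Gaussians, Fresnel blending, and a Disney BSDF enhancement, it outperforms baseline methods on the NeuralTO Syn dataset and achieves real-time rendering.

Abstract translation (Chinese)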

从多视角图像重建半透明物体是一个难题。先前的研究者采用可微分路径追踪与神经隐式场方法，但计算成本较高。近年来，许多研究基于3DGS框架以更高效率实现了对不透明物体的高质量重建。然而，此类方法难以处理半透明物体，因其未考虑半透明物体的光学特性。本文提出一种基于3DGS的新型框架（GTSR）来重建半透明物体的表面几何。GTSR结合了两组高斯模型——表面高斯与内部高斯，分别用于建模半透明物体的表面特性及光线穿透时的散射颜色。为渲染半透明物体的外观，我们引入一种利用菲涅尔项融合两组高斯模型的方法。此外，为提升非轮廓区域的重建细节，我们采用迪士尼BSDF模型结合延迟渲染技术，以加强对法线与深度的约束。实验结果表明，在NeuralTO Syn数据集上，我们的方法优于基线重建方法，同时展现出优异的实时渲染性能。我们通过扩展包含不同材质特性的新半透明物体数据集，进一步证明本方法能够适应多样化的半透明材质。

Abstract

Reconstructing translucent objects from multi-view images is a difficult problem. Previously, researchers have used differentiable path tracing and the neural implicit field, which require relatively large computational costs. Recently, many works have achieved good reconstruction results for opaque objects based on a 3DGS pipeline with much higher efficiency. However, such methods have difficulty dealing with translucent objects, because they do not consider the optical properties of translucent objects. In this paper, we propose a novel 3DGS-based pipeline (GTSR) to reconstruct the surface geometry of translucent objects. GTSR combines two sets of Gaussians, surface and interior Gaussians, which are used to model the surface and scattering color when lights pass translucent objects. To render the appearance of translucent objects, we introduce a method that uses the Fresnel term to blend two sets of Gaussians. Furthermore, to improve the reconstructed details of non-contour areas, we introduce the Disney BSDF model with deferred rendering to enhance constraints of the normal and depth. Experimental results demonstrate that our method outperforms baseline reconstruction methods on the NeuralTO Syn dataset while showing great real-time rendering performance. We also extend the dataset with new translucent objects of varying material properties and demonstrate our method can adapt to different translucent materials.
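
Blending two color contributions with a Fresnel term can be sketched as below. The abstract does not say which Fresnel formulation GTSR uses, so this sketch substitutes Schlick's approximation; the refractive indices, the choice to weight the surface color by the Fresnel reflectance and the interior color by its complement, and the function names are all assumptions for illustration.

```python
from typing import Tuple

RGB = Tuple[float, float, float]


def schlick_fresnel(cos_theta: float,
                    ior_outside: float = 1.0,
                    ior_inside: float = 1.5) -> float:
    """Schlick's approximation of Fresnel reflectance at a dielectric interface."""
    r0 = ((ior_outside - ior_inside) / (ior_outside + ior_inside)) ** 2
    cos_theta = min(max(cos_theta, 0.0), 1.0)  # clamp to the valid range
    return r0 + (1.0 - r0) * (1.0 - cos_theta) ** 5


def blend_colors(surface_rgb: RGB, interior_rgb: RGB, cos_theta: float) -> RGB:
    """ASSUMED blend: reflectance weights the surface color, the rest the interior color."""
    f = schlick_fresnel(cos_theta)
    r, g, b = (f * s + (1.0 - f) * i for s, i in zip(surface_rgb, interior_rgb))
    return (r, g, b)


surface = (1.0, 1.0, 1.0)   # hypothetical specular surface color
interior = (0.8, 0.3, 0.2)  # hypothetical subsurface scattering color

head_on = blend_colors(surface, interior, cos_theta=1.0)  # mostly interior color
grazing = blend_colors(surface, interior, cos_theta=0.0)  # surface color dominates
print(head_on, grazing)
```

Viewed head-on, the interior color dominates because the reflectance is small; at grazing angles the reflectance approaches 1 and the surface color takes over.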

Keywords: 3D Gaussian Splatting, translucent object reconstruction, subsurface scattering, multi-view images, surface geometry, real-time rendering, Disney BSDF, Fresnel blending

212. ❌ Tuning Real-World Image Restoration at Inference: A Test-Time Scaling Paradigm for Flow Matching Models

Authors: Purui Bai, Junxian Duan, Pin Wang, Jinhua Hao, Ming Sun, Chao Zhou, Huaibo Huang Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22027v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

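Scoring rationale: The paper addresses image restoration, a computer vision task, using a flow-matching diffusion model (FLUX.1-dev), and proposes a test-time scaling paradigm. Although it involves a large pre-trained model (a T2I model) and inference optimization, all keywords explicitly target language models (LLMs) or related techniques (such as RLHF, instruction tuning, and RAG), whereas this paper studies an image generation/restoration model with no direct connection to language models. The only potentially relevant keyword, 'AI for Science', usually refers to the natural sciences (such as biology and chemistry); this paper belongs to computer vision/image processing and falls outside that scope. All keywords are therefore scored 0.

!!! tip deepseek-chat TL;DR

This paper proposes ResFlow-Tuner, an image restoration framework built on a flow matching model, which combines unified multi-modal fusion with test-time scaling to achieve state-of-the-art restoration performance on multiple benchmarks.

Abstract translation (Chinese)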

尽管基于扩散模型的真实世界图像修复（Real-IR）已取得显著进展，但如何高效利用超大规模预训练文本到图像（T2I）模型并充分挖掘其潜力仍是重大挑战。为解决此问题，我们提出了ResFlow-Tuner——一个基于先进流匹配模型FLUX.1-dev的图像修复框架，该框架通过整合统一多模态融合（UMMF）与测试时缩放（TTS）技术，实现了前所未有的修复性能。我们的方法充分利用多模态扩散变换器（MM-DiT）架构的优势，将多模态条件编码为统一序列以指导高质量图像的合成。此外，我们针对图像修复任务设计了一种免训练的测试时缩放范式。在推理过程中，该技术通过奖励模型（RM）的反馈动态引导去噪方向，从而以可控的计算开销实现显著的性能提升。大量实验表明，我们的方法在多个标准基准测试中均达到了最先进的性能。这项工作不仅验证了流匹配模型在底层视觉任务中的强大能力，更重要的是提出了一种适用于大型预训练模型的新型高效推理时缩放范式。

Abstract

Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models.
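The abstract describes steering the denoising direction with reward-model feedback at inference time, without training. The toy sketch below shows only the general shape of such a loop: at every step several candidate updates are proposed and the one the reward prefers is kept. It is not ResFlow-Tuner's algorithm. The scalar state, the `reward` function and all parameter values are hypothetical stand-ins for an image latent and a learned reward model.

```python
import random


def reward(x: float, target: float = 1.0) -> float:
    """Hypothetical reward: higher when the state is closer to the target."""
    return -abs(x - target)


def guided_sampling(steps: int = 20, candidates: int = 4, seed: int = 0):
    """Reward-guided candidate selection, a toy stand-in for guided denoising.

    More candidates per step means more compute and (usually) higher reward,
    which is the trade-off that test-time scaling exposes.
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)  # start from noise
    rewards = [reward(x)]
    for _ in range(steps):
        # Keep the current state as a candidate so the reward never decreases.
        proposals = [x] + [x + rng.gauss(0.0, 0.2) for _ in range(candidates)]
        x = max(proposals, key=reward)
        rewards.append(reward(x))
    return x, rewards


final_state, reward_trace = guided_sampling()
```

Because the current state is always among the candidates, `reward_trace` is non-decreasing by construction; the extra cost is `candidates` reward evaluations per step.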

Keywords: image restoration, flow matching, test-time scaling, multi-modal fusion, diffusion models, inference optimization, real-world image, reward model

213. ❌ 6D Robotic OCT Scanning of Curved Tissue Surfaces

Authors: Suresh Guttikonda, Maximilian Neidhardt, Vidas Raudonis, Alexander Schlaefer Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22012v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

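Scoring rationale: The paper focuses on 6D hand-eye calibration and scanning for robot-assisted optical coherence tomography (OCT), which belongs to medical imaging and robot control. All scoring keywords concern large models, deep learning, the technical principles of AI, or applications of AI in science, whereas this paper mentions no AI, machine learning, or large-model content and covers only conventional robot calibration, scan path planning, and image acquisition. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a marker for six-dimensional hand-eye calibration of a robot-mounted OCT probe, addressing the limitations of conventional translational scanning and image registration when scanning curved tissue surfaces; experiments show that the method yields highly repeatable calibration and consistent scanning of large curved surfaces.

Abstract translation (Chinese)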

光学相干断层扫描（OCT）是一种具有高时空分辨率的非侵入式三维成像技术。为获取更大范围的组织结构图像，需移动OCT探头对目标区域进行扫描。在手持扫描场景中，获取的OCT三维图像拼接需要依赖重叠区域以实现图像配准。对于机器人扫描与拼接，典型方法是将运动限制在平移维度，这避免了复杂的手眼标定过程——该标定因大多数OCT探头视场较小而尤为困难。然而，当需要扫描弯曲组织表面时，基于配准或平移扫描的拼接方法存在局限性。本文提出一种用于机器人搭载OCT探头的全六维手眼标定标记方法。实验表明，该标定方法能获得高度可重复的变换矩阵估计值。此外，通过对两个仿体表面进行机器人扫描评估，我们验证了所提出的标定方案能够实现对大面积弯曲组织表面的连续稳定扫描。由于该方法不依赖于图像配准，避免了沿扫描路径可能产生的误差累积问题。最后，我们对比展示了该方法相较于传统三维平移机器人扫描方式的改进效果。

Abstract

Optical coherence tomography (OCT) is a non-invasive volumetric imaging modality with high spatial and temporal resolution. For imaging larger tissue structures, OCT probes need to be moved to scan the respective area. For handheld scanning, stitching of the acquired OCT volumes requires overlap to register the images. For robotic scanning and stitching, a typical approach is to restrict the motion to translations, as this avoids a full hand-eye calibration, which is complicated by the small field of view of most OCT probes. However, stitching by registration or by translational scanning are limited when curved tissue surfaces need to be scanned. We propose a marker for full six-dimensional hand-eye calibration of a robot mounted OCT probe. We show that the calibration results in highly repeatable estimates of the transformation. Moreover, we evaluate robotic scanning of two phantom surfaces to demonstrate that the proposed calibration allows for consistent scanning of large, curved tissue surfaces. As the proposed approach is not relying on image registration, it does not suffer from a potential accumulation of errors along a scan path. We also illustrate the improvement compared to conventional 3D-translational robotic scanning.
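A full hand-eye calibration lets each OCT volume be placed in the robot base frame directly, without image registration. The sketch below illustrates that idea by composing a robot pose with a calibrated probe transform and mapping a point from the OCT frame to the base frame. The frame names and all numeric poses are hypothetical, and the calibration procedure itself (the marker-based estimation in the paper) is not shown.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_point(t, p):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(t[i][k] * v[k] for k in range(4)) for i in range(3))


def rot_z(angle_rad, translation=(0.0, 0.0, 0.0)):
    """Transform with a rotation about z followed by a translation."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


# Hypothetical values: robot end-effector pose in the base frame, and the
# hand-eye transform from end effector to OCT probe (the calibration result).
T_base_ee = rot_z(math.pi / 2, translation=(0.5, 0.0, 0.3))
T_ee_oct = rot_z(0.0, translation=(0.0, 0.0, 0.1))

# Chain the transforms: OCT frame -> end effector -> robot base.
T_base_oct = matmul(T_base_ee, T_ee_oct)
p_base = transform_point(T_base_oct, (0.01, 0.0, 0.0))
```

Because every volume is mapped through the same chain, an error in one scan does not propagate to the next, which matches the abstract's point about avoiding error accumulation along the scan path.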

Keywords: Optical coherence tomography, Robotic scanning, Hand-eye calibration, Curved tissue surfaces, 6D calibration, OCT probe, Image registration, Phantom surfaces

214. ❌ STENet: Superpixel Token Enhancing Network for RGB-D Salient Object Detection

Authors: Jianlin Chen, Gongyang Li, Zhijiang Zhang, Liang Chang, Dan Zeng Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21999v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

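Scoring rationale: STENet focuses on RGB-D salient object detection and proposes a computer vision method based on Transformers and superpixels. Although it uses a Transformer architecture, its content is entirely unrelated to the scoring keywords (which mainly concern large language models, training techniques, inference optimization, alignment, agent systems, and so on). The paper involves no language model, no AI-for-science application, and no innovation in large-model techniques, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes STENet, a Superpixel Token Enhancing Network that introduces superpixels into cross-modal interaction to address the high computational complexity of attention and the limited local detail extraction in RGB-D salient object detection, and achieves competitive performance on seven datasets.

Abstract translation (Chinese)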

基于Transformer的RGB-D显著目标检测方法因Transformer捕获长距离像素依赖的卓越能力而备受关注。然而，当前RGB-D显著目标检测方法面临诸多挑战，例如注意力机制的二次复杂度以及局部细节提取能力有限。为克服这些局限，我们提出了一种新颖的超像素令牌增强网络（STENet），它将超像素引入跨模态交互中。STENet采用双流编码器-解码器结构，其核心是两个定制的超像素驱动跨模态交互模块，分别负责全局与局部特征增强。具体而言，我们通过扩展每个超像素的邻域范围来改进超像素生成方法，从而实现像素与超像素之间的灵活转换。基于改进的超像素生成方法，我们首先提出了超像素注意力全局增强模块，该模块建模全局像素-超像素关系而非传统的全局像素-像素关系，从而能够捕获区域级信息并降低计算复杂度。我们还提出了超像素注意力局部细化模块，该模块利用超像素内的像素相似性筛选出部分像素（即局部像素），并对这些局部像素进行特征增强，从而捕获关注的局部细节。此外，我们将全局与局部增强特征以及跨尺度特征相融合，以实现全面的特征表征。在七个RGB-D显著目标检测数据集上的实验表明，我们的STENet相较于现有先进方法具有竞争力的性能。本方法的代码与结果公开于https://github.com/Mark9010/STENet。

Abstract

Transformer-based methods for RGB-D Salient Object Detection (SOD) have gained significant interest, owing to the transformer’s exceptional capacity to capture long-range pixel dependencies. Nevertheless, current RGB-D SOD methods face challenges, such as the quadratic complexity of the attention mechanism and the limited local detail extraction. To overcome these limitations, we propose a novel Superpixel Token Enhancing Network (STENet), which introduces superpixels into cross-modal interaction. STENet follows the two-stream encoder-decoder structure. Its cores are two tailored superpixel-driven cross-modal interaction modules, responsible for global and local feature enhancement. Specifically, we update the superpixel generation method by expanding the neighborhood range of each superpixel, allowing for flexible transformation between pixels and superpixels. With the updated superpixel generation method, we first propose the Superpixel Attention Global Enhancing Module to model the global pixel-to-superpixel relationship rather than the traditional global pixel-to-pixel relationship, which can capture region-level information and reduce computational complexity. We also propose the Superpixel Attention Local Refining Module, which leverages pixel similarity within superpixels to filter out a subset of pixels (i.e., local pixels) and then performs feature enhancement on these local pixels, thereby capturing concerned local details. Furthermore, we fuse the globally and locally enhanced features along with the cross-scale features to achieve comprehensive feature representation. Experiments on seven RGB-D SOD datasets reveal that our STENet achieves competitive performance compared to state-of-the-art methods. The code and results of our method are available at https://github.com/Mark9010/STENet.
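The complexity claim can be made concrete by counting attention scores: pixel-to-pixel attention needs one score per pair of pixels, while pixel-to-superpixel attention needs one per (pixel, superpixel) pair. The sketch below does this count for a hypothetical 64×64 feature map with 256 superpixels; these sizes are illustrative assumptions, not figures from the paper.

```python
def attention_pairs(num_queries: int, num_keys: int) -> int:
    """Number of query-key scores a dense attention layer computes."""
    return num_queries * num_keys


def compare(height: int, width: int, num_superpixels: int):
    """Compare pixel-to-pixel with pixel-to-superpixel attention cost."""
    n = height * width
    pixel_to_pixel = attention_pairs(n, n)
    pixel_to_superpixel = attention_pairs(n, num_superpixels)
    return pixel_to_pixel, pixel_to_superpixel, pixel_to_pixel / pixel_to_superpixel


# Hypothetical sizes: 4096 pixel tokens against 256 superpixel tokens.
p2p, p2s, ratio = compare(64, 64, 256)
```

With these sizes the count drops from 16,777,216 to 1,048,576 scores, a 16× reduction; in general the saving is N / M for N pixels and M superpixels.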

Keywords: RGB-D Salient Object Detection, Transformer, Superpixel, Cross-modal Interaction, Attention Mechanism, Computational Complexity, Feature Enhancement, Encoder-Decoder Structure

215. ❌ Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

Authors: SII-GAIR, Sand.ai: Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Yixiu Liu, Yunbo Zhang, Yunpeng Huang, Yutong Lin, Zewei Tao, Zhaoliang Liu, Zheng Zhang, Zhiyao Cen, Zhixuan Yu, Zhongshu Wang, Zhulin Hu, Zijin Zhou, Zinan Guo, Yue Cao, Pengfei Liu Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21986v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

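Scoring rationale: The paper proposes daVinci-MagiHuman, an audio-video generative foundation model whose core innovation is a single-stream Transformer architecture that processes text, video, and audio tokens in a unified way; it is an application of large/foundation models to cross-modal generation. Keyword relevance: (1) highly relevant (10 points): it is a Foundation Model; (2) moderately relevant (5 points): it involves model training (pre-training/fine-tuning) and inference acceleration; (3) not relevant (0 points): the other keywords, such as MoE, SLMs, alignment, RAG, reasoning methods, and AI for Science, are not covered.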
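The table for this paper shows why a nonzero relevance can still produce a zero score. The sketch below recomputes the pass mark from the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0) and sums the rows. The per-row formula, weight × relevance / 10 × 15, is an assumption: the report states only the maximum of 15 per keyword, not how a row score is derived.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # "max score per keyword" from the report config


def pass_mark(total_weight: float) -> float:
    """Pass-mark formula stated in the report configuration."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMED row formula: scale relevance (0-10) to the 15-point maximum."""
    return weight * (relevance / 10.0) * MAX_SCORE_PER_KEYWORD


def total_score(rows) -> float:
    """Sum the per-keyword scores of (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# (weight, relevance) pairs as listed in the table for this paper:
# one row at 10.0, three rows at 5.0, and 23 rows at 0.0, all with weight 0.0.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 5.0)] + [(0.0, 0.0)] * 23

threshold = pass_mark(27.0)
score = total_score(rows)
passed = score >= threshold
```

Under this assumed formula, every row scores 0.0 because the listed weight is 0.0, so the total stays at 0.0 against a pass mark of 26.6 regardless of relevance.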

!!! tip deepseek-chat TL;DR

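The paper proposes daVinci-MagiHuman, a single-stream Transformer foundation model for audio-video generation that achieves efficient synchronized generation through a unified architecture and reaches leading visual quality and speech intelligibility in human-centric scenarios.

Abstract translation (Chinese)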

我们推出daVinci-MagiHuman——一个面向人物中心生成的开源音视频生成基础模型。该模型通过单流Transformer架构，将文本、视频和音频统一编码为单一令牌序列，仅通过自注意力机制实现同步的音视频生成。这种单流设计避免了多流或交叉注意力架构的复杂性，同时能利用标准训练与推理基础设施轻松优化。该模型在人物中心场景表现卓越，能生成富有表现力的面部表演、自然的语音-表情协调、逼真的身体动作以及精确的音视频同步。它支持跨中文（普通话与粤语）、英语、日语、韩语、德语和法语的多语言语音生成。为实现高效推理，我们将单流主干网络与模型蒸馏、潜在空间超分辨率以及Turbo VAE解码器相结合，可在单张H100 GPU上于2秒内生成5秒时长的256p分辨率视频。在自动评估中，daVinci-MagiHuman在主流开源模型中取得了最高的视觉质量与文本对齐度，同时语音清晰度的词错误率最低（14.60%）。在2000组人工配对评估中，其对比Ovi 1.1的胜率达80.0%，对比LTX 2.3的胜率达60.9%。我们开源了完整的模型栈，包括基础模型、蒸馏模型、超分辨率模型及推理代码库。

Abstract

We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.

Keywords: audio-video generation, foundation model, single-stream Transformer, human-centric generation, multilingual speech, efficient inference, model distillation, latent-space super-resolution

216. ❌ GeoFusion-CAD: Structure-Aware Diffusion with Geometric State Space for Parametric 3D Design

Authors: Xiaolei Zhou, Chuangjie Fang, Jie Wu, Jingyi Yang, Boyi Lin, Jianwei Zheng Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21978v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

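Scoring rationale: The paper focuses on generative 3D CAD modeling and proposes GeoFusion-CAD, an architecture based on diffusion models and a state space model (Mamba). Although it is an AI application, its content is entirely unrelated to the vast majority of keywords (which concern large-model technical principles, training methods, inference optimization, alignment, agents, and so on). It is only partly related to 'AI for Science OR Bioinformatics OR Cheminformatics', because CAD modeling can be viewed as an application of AI to engineering/design science; however, the paper does not explicitly emphasize scientific discovery or bio/cheminformatics, so it receives 5 points (partly related).

!!! tip deepseek-chat TL;DR

The paper tackles the generation of long command sequences in parametric CAD modeling. Its GeoFusion-CAD framework, based on diffusion and a state space model (C-Mamba), preserves geometric and topological consistency while modeling long-range dependencies better than Transformers and generating scalably, and achieves state-of-the-art performance on the new DeepCAD-240 benchmark.

Abstract translation (Chinese)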

参数化计算机辅助设计（CAD）是现代三维建模的基础，然而现有方法难以生成长指令序列，尤其是在复杂的几何与拓扑依赖条件下。基于Transformer的架构因其强大的依赖建模能力主导了CAD序列生成领域，但其二次注意力计算成本与有限的上下文窗口限制了处理长程序的可扩展性。我们提出GeoFusion-CAD，一个面向可扩展与结构感知生成的端到端扩散框架。该框架将CAD程序编码为层次化树结构，在状态空间扩散过程中联合捕捉几何与拓扑信息。具体而言，轻量化的C-Mamba模块通过选择性状态转移建模长程结构依赖，从而在扩展的指令序列中实现连贯生成。为支持长序列评估，我们引入了DeepCAD-240扩展基准数据集，将序列长度从40提升至240，同时保留源自ABC数据集的草图-拉伸语义。大量实验表明，GeoFusion-CAD在短指令与长指令范围内均取得卓越性能，在基于Transformer的模型性能下降时仍保持高几何保真度与拓扑一致性。我们的方法为长序列参数化CAD生成设立了新的技术标杆，为下一代CAD建模系统奠定了可扩展的基础。代码与数据集已在GitHub平台开源。

Abstract

Parametric Computer-Aided Design (CAD) is fundamental to modern 3D modeling, yet existing methods struggle to generate long command sequences, especially under complex geometric and topological dependencies. Transformer-based architectures dominate CAD sequence generation due to their strong dependency modeling, but their quadratic attention cost and limited context windowing hinder scalability to long programs. We propose GeoFusion-CAD, an end-to-end diffusion framework for scalable and structure-aware generation. Our proposal encodes CAD programs as hierarchical trees, jointly capturing geometry and topology within a state-space diffusion process. Specifically, a lightweight C-Mamba block models long-range structural dependencies through selective state transitions, enabling coherent generation across extended command sequences. To support long-sequence evaluation, we introduce DeepCAD-240, an extended benchmark that increases the sequence length ranging from 40 to 240 while preserving sketch-extrusion semantics from the ABC dataset. Extensive experiments demonstrate that GeoFusion-CAD achieves superior performance on both short and long command ranges, maintaining high geometric fidelity and topological consistency where Transformer-based models degrade. Our approach sets new state-of-the-art scores for long-sequence parametric CAD generation, establishing a scalable foundation for next-generation CAD modeling systems. Code and datasets are available at GitHub.

Keywords: Parametric CAD, Diffusion Models, State Space Models, Mamba, Long-sequence Generation, Geometric Fidelity, Topological Consistency, 3D Modeling

217. ❌ Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention

Authors: Junhao Du, Jialong Xue, Anqi Li, Jincheng Dai, Guo Lu Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21957v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

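Scoring rationale: The paper focuses on visual token compression for Video-LLMs; its core contribution is a unified spatiotemporal token compression method that reduces computational cost while preserving performance. It is directly and highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (weight 1.0), because it explicitly targets Video-LLMs, a subclass of large language models. The paper does not touch the other keywords, such as MoE, SLMs, training methods, alignment, reasoning, agents, compression techniques (such as quantization), or AI-for-science applications; these topics are not mentioned in the abstract or title, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The paper addresses the high computational cost of Video-LLMs by proposing a unified spatiotemporal token compression method that, while retaining only about 2% of visual tokens, preserves 90.1% of baseline performance and reduces FLOPs to roughly 2.6%.

Abstract translation (Chinese)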

视频大语言模型（Video-LLMs）因处理大量视觉标记而面临高昂的计算成本。现有的标记压缩方法通常采用两阶段时空压缩策略，依赖于阶段特定的度量指标以及时空可分离性的隐含假设。然而，在极低的保留比例下，此类方法往往导致分配不平衡，并丢失问答所需的关键视觉证据。我们将标记压缩重新定义为全局标记保留池内的时空分配任务。我们提出了一种统一的选择机制，该机制整合注意力权重与语义相似度，以全局性地筛选出贡献度高且冗余度低的标记。未被选中的标记通过聚类进行合并并重新填充，从而保持信息的完整性。在大语言模型内部，我们进一步引入了文本感知合并机制，以基于查询相关性执行二次压缩。本方法无需重新训练，可作为即插即用模块与现有视频大语言模型兼容。实验表明，仅保留约2%的视觉标记即可在多个基准测试中保持基线性能的90.1%，同时将浮点运算量降低至约2.6%。这些优势在不同骨干网络中均具有普适性，有效降低了端到端推理延迟与内存消耗。我们提出的统一时空标记压缩策略，在超低标记保留率下的视频理解任务中确立了最先进的性能水平。

Abstract

Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.
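The selection step described above picks tokens with high contribution and low redundancy from one global pool. One simple way to realise such a criterion is a greedy, MMR-style rule: repeatedly take the token whose attention weight, minus a penalty for similarity to already-selected tokens, is largest. The sketch below implements that rule as an illustration; it is a technique chosen here, not the paper's actual mechanism, and `redundancy_weight` and the toy data are hypothetical. The clustering-based merging of unselected tokens is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_tokens(features, attention, budget, redundancy_weight=0.5):
    """Greedily select `budget` token indices from one global pool."""
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < budget:
        def gain(i):
            # Penalise tokens that resemble something already kept.
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected),
                default=0.0,
            )
            return attention[i] - redundancy_weight * redundancy

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy pool: tokens 0 and 1 are near-duplicates, token 2 is distinct.
features = [(1.0, 0.0), (1.0, 0.01), (0.0, 1.0)]
attention = [0.9, 0.85, 0.5]
kept = select_tokens(features, attention, budget=2)
```

Here `kept` is `[0, 2]`: token 1 has the second-highest attention weight but is skipped because it is nearly identical to token 0.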

Keywords: Video-LLMs, token compression, spatiotemporal allocation, computational efficiency, inference acceleration, visual tokens, plug-and-play module, video understanding

218. ❌ Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection

Authors: Youbin Kim, Jinho Park, Hogun Park, Eunbyung Park Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21944v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

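Scoring rationale: The paper proposes Group3D, a framework that uses a multimodal large language model (MLLM) for open-vocabulary 3D object detection; its core innovation is integrating semantic constraints into instance construction. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points), because the MLLM is a core component, and partly related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), as an application of AI in science/computer vision; the other keywords, such as MoE, SFT, and RAG, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Group3D, which uses semantic grouping driven by a multimodal large language model to resolve the over-merging and fragmentation that geometry-only consistency causes in multi-view open-vocabulary 3D detection, and achieves state-of-the-art performance on ScanNet and ARKitScenes.

Abstract translation (Chinese)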

开放词汇三维目标检测旨在定位并识别超出固定训练分类体系的对象。在多视角RGB场景中，现有方法通常将基于几何的实例构建与语义标注解耦，先生成类别无关的片段，再后验地分配开放词汇类别。尽管这种方式具有灵活性，但解耦操作使得实例构建过程主要受几何一致性主导，在合并阶段缺乏语义约束。当几何证据存在视角依赖性且不完整时，这种仅依赖几何的合并可能导致不可逆的关联错误，包括不同对象的过度合并或单个实例的碎片化。我们提出Group3D，一种多视角开放词汇三维检测框架，它将语义约束直接整合到实例构建过程中。Group3D维护一个源自多模态大语言模型（Multimodal Large Language Model, MLLM）的场景自适应词汇表，并将其组织成语义兼容组，这些组编码了合理的跨视角类别等价关系。这些组作为合并时的约束条件：只有当三维片段同时满足语义兼容性和几何一致性时才会被关联。这种语义门控的合并机制减轻了几何驱动的过度合并问题，同时吸纳了多视角的类别变异性。Group3D同时支持姿态已知与姿态未知的设置，仅依赖RGB观测。在ScanNet和ARKitScenes数据集上的实验表明，Group3D在多视角开放词汇三维检测中取得了最先进的性能，并在零样本场景中展现出强大的泛化能力。项目页面详见 https://ubin108.github.io/Group3D/。

Abstract

Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.
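The merge rule in the abstract is a conjunction: two 3D fragments are associated only if their labels fall in the same semantic compatibility group and they are geometrically consistent. The sketch below expresses that gate using axis-aligned box IoU as a stand-in for the geometric test. The IoU criterion, the threshold, and the example groups and fragments are all assumptions made for illustration, not details from the paper.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for lo, hi in ((0, 3), (1, 4), (2, 5)):
        overlap = min(a[hi], b[hi]) - max(a[lo], b[lo])
        if overlap <= 0:
            return 0.0
        inter *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def can_merge(frag_a, frag_b, groups, iou_threshold=0.25):
    """Semantic gate first, then geometric consistency (both must hold)."""
    same_group = any(
        frag_a["label"] in g and frag_b["label"] in g for g in groups
    )
    return same_group and iou_3d(frag_a["box"], frag_b["box"]) >= iou_threshold


# Hypothetical compatibility groups, e.g. as proposed by an MLLM for a scene.
groups = [{"sofa", "couch"}, {"table", "desk"}]

sofa = {"label": "sofa", "box": (0.0, 0.0, 0.0, 2.0, 1.0, 1.0)}
couch = {"label": "couch", "box": (0.5, 0.0, 0.0, 2.5, 1.0, 1.0)}
table = {"label": "table", "box": (0.5, 0.0, 0.0, 2.5, 1.0, 1.0)}

merge_same_group = can_merge(sofa, couch, groups)
merge_cross_group = can_merge(sofa, table, groups)
```

The `sofa` and `couch` fragments merge because they share a group and overlap with IoU 0.6. The `table` fragment occupies the same box as `couch` but is rejected by the semantic gate, which is the over-merging case the abstract describes.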

Keywords: open-vocabulary 3D object detection, multimodal large language model, semantic grouping, multi-view RGB, geometric consistency, instance construction, zero-shot generalization, state-of-the-art performance

219. ❌ GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction

Authors: Ayesh Abu Lehyeh, Xiaohan Zhang, Ahmad Arrabi, Waqas Sultani, Chen Chen, Safwan Wshah Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21943v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

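Scoring rationale: The paper "GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction" focuses on computer vision and geolocalization, proposing a lightweight, efficient framework for fine-grained cross-view geolocalization between ground and satellite images. Its core components are iterative flow prediction, a probabilistic mapping, an inference algorithm (Iterative Refinement Sampling), and real-time performance optimization. All of the given keywords concern large language models (LLMs), deep learning fundamentals, or applications of AI for science (such as bioinformatics and cheminformatics), whereas this paper addresses a computer vision geolocalization task and involves neither large models nor a specific AI-for-Science application. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper addresses the difficulty of achieving both accuracy and speed in fine-grained cross-view geolocalization. It proposes GeoFlow, a lightweight framework that combines iterative flow prediction with a dedicated inference algorithm, running in real time at 29 FPS on the KITTI and VIGOR datasets while maintaining competitive localization accuracy.

Abstract translation (Chinese)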

在GPS拒止区域中，精确快速的定位对于安全的自主导航至关重要。细粒度跨视角地理定位（Fine-Grained Cross-View Geolocalization，FG-CVG）旨在估计地面图像相对于卫星图像的精确二维自由度（2-DoF）位置。然而，现有方法面临一个艰难的权衡：高精度模型往往速度较慢，难以满足实时应用需求。本文提出GeoFlow，这是一种新颖方法，提供了一个轻量级且高效的框架，打破了这种精度与速度的权衡。我们的技术学习一种直接的概率映射，预测校正任意给定位置假设所需的位移（包括距离和方向）。这辅以我们提出的新颖推理算法——迭代优化采样（Iterative Refinement Sampling，IRS）。IRS不依赖单一预测，而是对一组假设进行优化，使其能够从随机起点迭代“流动”，最终形成一个稳健且收敛的共识结果。尽管具有迭代特性，该方法支持灵活的推理时缩放，无需重新训练即可直接在性能与计算量之间进行权衡。在KITTI和VIGOR数据集上的实验表明，GeoFlow实现了最先进的效率，能以29 FPS的实时速度运行，同时保持具有竞争力的定位精度。这项工作为开发实用的实时地理定位系统开辟了新路径。

Abstract

Accurate and fast localization is vital for safe autonomous navigation in GPS-denied areas. Fine-Grained Cross-View Geolocalization (FG-CVG) aims to estimate the precise 2-Degree-of-Freedom (2-DoF) location of a ground image relative to a satellite image. However, current methods force a difficult trade-off, with high-accuracy models being too slow for real-time use. In this paper, we introduce GeoFlow, a new approach that offers a lightweight and highly efficient framework that breaks this accuracy-speed trade-off. Our technique learns a direct probabilistic mapping, predicting the displacement (in distance and direction) required to correct any given location hypothesis. This is complemented by our novel inference algorithm, Iterative Refinement Sampling (IRS). Instead of trusting a single prediction, IRS refines a population of hypotheses, allowing them to iteratively ‘flow’ from random starting points to a robust, converged consensus. Despite its iterative nature, this approach offers flexible inference-time scaling, allowing a direct trade-off between performance and computation without any re-training. Experiments on the KITTI and VIGOR datasets show that GeoFlow achieves state-of-the-art efficiency, running at real-time speeds of 29 FPS while maintaining competitive localization accuracy. This work opens a new path for the development of practical real-time geolocalization systems.

Keywords: Fine-Grained Cross-View Geolocalization, Real-Time Localization, Iterative Flow Prediction, Lightweight Framework, GPS-Denied Areas, Autonomous Navigation, Iterative Refinement Sampling, Accuracy-Speed Trade-off
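
The idea of refining a population of hypotheses toward a consensus, as the GeoFlow abstract describes for IRS, can be illustrated with a toy sketch. This is not the authors' implementation: `predict_displacement` is a hypothetical stand-in for the learned model (here it simply returns a noisy vector toward a known target), and the median consensus, step size, and map extent are assumptions chosen for illustration.

```python
import random
import statistics


def predict_displacement(hypothesis, target, rng, noise=0.5):
    """Hypothetical stand-in for a learned model: returns a noisy (dx, dy)
    pointing from the hypothesis toward the true location."""
    dx = target[0] - hypothesis[0] + rng.gauss(0.0, noise)
    dy = target[1] - hypothesis[1] + rng.gauss(0.0, noise)
    return dx, dy


def iterative_refinement(target, num_hypotheses=32, num_steps=10,
                         step_size=0.5, extent=100.0, seed=0):
    """Move a population of random hypotheses along predicted displacements
    and return the coordinate-wise median as the consensus estimate."""
    rng = random.Random(seed)
    hypotheses = [(rng.uniform(0.0, extent), rng.uniform(0.0, extent))
                  for _ in range(num_hypotheses)]
    for _ in range(num_steps):
        updated = []
        for x, y in hypotheses:
            dx, dy = predict_displacement((x, y), target, rng)
            updated.append((x + step_size * dx, y + step_size * dy))
        hypotheses = updated
    return (statistics.median([x for x, _ in hypotheses]),
            statistics.median([y for _, y in hypotheses]))


estimate = iterative_refinement(target=(40.0, 60.0))
print(estimate)
```

In this sketch, `num_hypotheses` and `num_steps` play the role of inference-time knobs: raising them costs more computation and tightens the estimate, with no retraining of the (here simulated) predictor.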

220. ❌ MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation

Authors: Wenqing Tian, Hanyi Mao, Zhaocheng Liu, Lihua Zhang, Qiang Liu, Jian Wu, Liang Wang Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21937v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

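Scoring rationale: The paper focuses on attribute misbinding in multi-subject image generation and proposes the MultiBind benchmark and an evaluation protocol. Although it falls within applied AI, every keyword targets a specific direction in large language model (LLM) technology, such as training methods, inference optimization, or agent systems, whereas this paper studies an image generation task in computer vision and involves no language model techniques, training methods, or related applications. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

This paper targets cross-subject attribute misbinding in multi-subject image generation. It proposes the MultiBind benchmark and a dimension-wise confusion evaluation protocol that diagnose binding failures which conventional reconstruction metrics miss.

Abstract translation (Chinese)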

主题驱动的图像生成日益需要支持对单张图像中多个实体进行细粒度控制。在多参考工作流程中，用户可能提供多张主体图像、一张背景参考图以及长篇幅的实体索引提示词，以控制同一场景中的多个人物。在此情境下，一个关键的失败模式是跨主体属性错误绑定：属性被保留、编辑或转移至错误的主体。现有基准测试和评估指标主要强调整体保真度或单主体自相似性，导致此类故障难以诊断。我们提出了MultiBind基准，该基准构建于真实多人照片之上。每个实例提供按槽位排序的主体裁剪图（附带掩码和边界框）、规范化主体参考、修复后的背景参考图，以及源自结构化标注的密集实体索引提示词。我们还提出了一种维度混淆评估方案，该方案将生成的主体与真实槽位进行匹配，并利用针对面部身份、外观、姿态和表情的专用评估器测量槽位间相似度。通过减去相应的真实相似度矩阵，我们的方法将主体自身质量退化与真实的跨主体干扰区分开来，并揭示出可解释的故障模式，如漂移、交换、主导和混合。对现代多参考生成器的实验表明，MultiBind能够揭示传统重建指标所遗漏的绑定失败问题。

Abstract

Subject-driven image generation is increasingly expected to support fine-grained control over multiple entities within a single image. In multi-reference workflows, users may provide several subject images, a background reference, and long, entity-indexed prompts to control multiple people within one scene. In this setting, a key failure mode is cross-subject attribute misbinding: attributes are preserved, edited, or transferred to the wrong subject. Existing benchmarks and metrics largely emphasize holistic fidelity or per-subject self-similarity, making such failures hard to diagnose. We introduce MultiBind, a benchmark built from real multi-person photographs. Each instance provides slot-ordered subject crops with masks and bounding boxes, canonicalized subject references, an inpainted background reference, and a dense entity-indexed prompt derived from structured annotations. We also propose a dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and measures slot-to-slot similarity using specialists for face identity, appearance, pose, and expression. By subtracting the corresponding ground-truth similarity matrices, our method separates self-degradation from true cross-subject interference and exposes interpretable failure patterns such as drift, swap, dominance, and blending. Experiments on modern multi-reference generators show that MultiBind reveals binding failures that conventional reconstruction metrics miss.

Keywords: MultiBind, attribute misbinding, multi-subject generation, benchmark, evaluation protocol, cross-subject interference, image generation, subject-driven generation
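
The subtraction step described in the MultiBind abstract (generated-to-ground-truth similarities minus the ground-truth similarity matrix) can be sketched as follows. The matrices, the summary statistics, and the function names are assumptions made for illustration; the paper's exact protocol and specialist models are not reproduced here.

```python
def confusion_delta(sim_generated, sim_ground_truth):
    """Element-wise difference between two slot-to-slot similarity matrices."""
    return [[g - t for g, t in zip(gen_row, gt_row)]
            for gen_row, gt_row in zip(sim_generated, sim_ground_truth)]


def summarize(delta):
    """Return (mean drop on the diagonal, largest gain off the diagonal)."""
    n = len(delta)
    self_drop = sum(-delta[i][i] for i in range(n)) / n
    cross_gain = max(delta[i][j] for i in range(n) for j in range(n) if i != j)
    return self_drop, cross_gain


# Hypothetical face-identity similarities for two subjects (values invented).
sim_gt = [[1.0, 0.2], [0.2, 1.0]]      # ground-truth slot i vs. ground-truth slot j
sim_gen = [[0.25, 0.9], [0.85, 0.3]]   # generated subject i vs. ground-truth slot j

delta = confusion_delta(sim_gen, sim_gt)
self_drop, cross_gain = summarize(delta)
print(delta, self_drop, cross_gain)
```

In this invented example the diagonal drops sharply while the off-diagonal entries rise, which is the kind of pattern one would read as two subjects being swapped rather than each merely degrading on its own.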

221. ❌ FeatDistill: A Feature Distillation Enhanced Multi-Expert Ensemble Framework for Robust AI-generated Image Detection

Authors: Zhilin Tu, Kemou Li, Fengpeng Li, Jianwei Fei, Jiamin Zhang, Haiwei Wu Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21939v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

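Scoring rationale: The paper focuses on AI-generated image detection in computer vision, using Vision Transformer (ViT) backbones, feature distillation, and a multi-expert ensemble. All scoring keywords target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), whereas this paper studies an image detection task and involves no language models, text generation, or LLM-specific techniques. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes FeatDistill, an AI-generated image detection framework that combines feature distillation with a multi-expert ensemble to address degradation interference, insufficient feature representation, and limited generalization in real-world settings. It shows strong robustness and generalization in the NTIRE challenge setting.

Abstract translation (Chinese)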

深度伪造技术的快速迭代与广泛传播对信息安全构成了严峻挑战，使得对AI生成伪造图像进行鲁棒且可泛化的检测变得日益重要。本文提出FeatDistill框架，一种融合特征蒸馏与多专家集成的AI生成图像检测方法，专为NTIRE野外鲁棒AI生成图像检测挑战赛而设计。该框架明确针对现实世界取证中的三个实际瓶颈：退化干扰、特征表示不足以及泛化能力有限。
具体而言，我们构建了一个由CLIP与SigLIP变体组成的四骨干视觉Transformer（ViT）集成系统，以捕捉互补的取证线索。为提升数据覆盖度，我们扩展了训练集并引入全面的退化建模，使检测器能够接触无约束场景中常见的多样化质量变化与合成伪影。我们进一步采用两阶段训练范式：首先通过标准二元分类目标优化模型，随后通过密集特征级自蒸馏进行表示对齐的微调。该设计有效缓解了过拟合并增强了所学特征的语义一致性。
在推理阶段，最终预测通过平均四个独立训练专家的概率输出获得，从而在面对未知生成器与复杂退化时产生稳定可靠的决策。尽管采用集成设计，该框架仍保持高效性，仅需约10GB的峰值GPU内存。在NTIRE挑战赛设定下的广泛评估表明，FeatDistill在多样化的“野外”条件下实现了强大的鲁棒性与泛化能力，为现实世界的深度伪造图像检测提供了有效且实用的解决方案。

Abstract

The rapid iteration and widespread dissemination of deepfake technology have posed severe challenges to information security, making robust and generalizable detection of AI-generated forged images increasingly important. In this paper, we propose FeatDistill, an AI-generated image detection framework that integrates feature distillation with a multi-expert ensemble, developed for the NTIRE Challenge on Robust AI-Generated Image Detection in the Wild. The framework explicitly targets three practical bottlenecks in real-world forensics: degradation interference, insufficient feature representation, and limited generalization. Concretely, we build a four-backbone Vision Transformer (ViT) ensemble composed of CLIP and SigLIP variants to capture complementary forensic cues. To improve data coverage, we expand the training set and introduce comprehensive degradation modeling, which exposes the detector to diverse quality variations and synthesis artifacts commonly encountered in unconstrained scenarios. We further adopt a two-stage training paradigm: the model is first optimized with a standard binary classification objective, then refined by dense feature-level self-distillation for representation alignment. This design effectively mitigates overfitting and enhances semantic consistency of learned features. At inference time, the final prediction is obtained by averaging the probabilities from four independently trained experts, yielding stable and reliable decisions across unseen generators and complex degradations. Despite the ensemble design, the framework remains efficient, requiring only about 10 GB peak GPU memory. Extensive evaluations in the NTIRE challenge setting demonstrate that FeatDistill achieves strong robustness and generalization under diverse “in-the-wild” conditions, offering an effective and practical solution for real-world deepfake image detection.

Keywords: AI-generated image detection, feature distillation, multi-expert ensemble, Vision Transformer, deepfake detection, robustness, generalization, NTIRE challenge
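
The inference rule stated in the FeatDistill abstract, averaging the probabilities of four independently trained experts, amounts to the small computation below. The probability values and the 0.5 decision threshold are assumptions for illustration; the abstract does not state a threshold.

```python
def ensemble_probability(expert_probs):
    """Average the per-expert probabilities that an image is AI-generated."""
    return sum(expert_probs) / len(expert_probs)


def classify(expert_probs, threshold=0.5):
    """Flag the image as AI-generated when the averaged probability reaches
    the threshold (0.5 is an assumed value, not taken from the paper)."""
    return ensemble_probability(expert_probs) >= threshold


# Hypothetical outputs of four experts for one image (values invented).
print(ensemble_probability([0.9, 0.8, 0.7, 0.6]))
# Two experts fall below 0.5, yet the averaged decision is still positive.
print(classify([0.95, 0.4, 0.45, 0.6]))
```

Averaging probabilities, as opposed to majority voting, lets a confident expert outweigh weakly dissenting ones, which the second call illustrates.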

222. ❌ Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment

Authors: Roy Amoyal, Oren Freifeld, Chaim Baskin Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21936v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

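Scoring rationale: The paper focuses on aligning 3D Gaussian Splatting models in 3D computer vision and proposes a new method, GSA, for aligning two independent 3DGS models. Its core contribution is a geometry-aware, feature-guided alignment framework with a coarse and a fine registration step, achieving state-of-the-art performance on same-object and category-level alignment. However, this research is unrelated to all of the given scoring keywords, which center on large models, deep learning fundamentals, and their scientific applications. The paper does not address large language models, model training or fine-tuning, inference optimization, LLM alignment methods, agent systems, model compression, or AI for Science. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes Gaussian Splatting Alignment (GSA), a method for aligning two independent 3D Gaussian Splatting models via a similarity transformation, even when they depict different objects of the same category. Its two-step optimization framework achieves state-of-the-art performance and provides the first effective solution for category-level 3DGS registration.

Abstract translation (Chinese)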

我们提出了高斯泼溅对齐（Gaussian Splatting Alignment, GSA）方法，这是一种通过相似变换（旋转、平移和缩放）来对齐两个独立的三维高斯泼溅（3D Gaussian Splatting, 3DGS）模型的新方法，即使它们属于同一类别中的不同物体（例如不同的汽车）。相比之下，现有方法只能对齐同一物体的3DGS模型（例如同一辆汽车），且通常需要以真实尺度作为输入，而我们的方法能成功估计该尺度。GSA利用视点引导的球面映射特征来获取鲁棒的特征对应，并引入一个两步优化框架，在保持3DGS模型固定的同时完成对齐。首先，我们采用迭代的特征引导绝对定向求解器进行粗配准，该方法对较差的初始条件（例如180度误对齐或10倍的尺度差异）具有鲁棒性。接着，我们进行精细配准，该步骤受逆向辐射场公式启发，强制实施多视角特征一致性。第一步已实现最先进的性能，第二步则进一步提升了结果。在相同物体对齐任务中，即使其他方法已获知真实尺度，GSA仍显著优于现有工作。在更具挑战性的同一类别不同物体对齐任务中，GSA远超现有方法，为类别级3DGS配准提供了首个有效解决方案，并开启了新的应用可能。项目网页：https://bgu-cs-vil.github.io/GSA-project/

Abstract

We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation, translation, and scale), even when they are of different objects in the same category (e.g., different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g., the same car) and often must be given true scale as input, while we estimate it successfully. GSA leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns 3DGS models while keeping them fixed. First, we apply an iterative feature-guided absolute orientation solver as our coarse registration, which is robust to poor initialization (e.g., 180 degrees misalignment or a 10x scale gap). Next, we use a fine registration step that enforces multi-view feature consistency, inspired by inverse radiance-field formulations. The first step already achieves state-of-the-art performance, and the second further improves results. In the same-object case, GSA outperforms prior works, often by a large margin, even when the other methods are given the true scale. In the harder case of different objects in the same category, GSA vastly surpasses them, providing the first effective solution for category-level 3DGS registration and unlocking new applications. Project webpage: https://bgu-cs-vil.github.io/GSA-project/

Keywords: 3D Gaussian Splatting, 3DGS registration, similarity transformation, geometry-aware feature-guided alignment, viewpoint-guided spherical map features, two-step optimization, category-level alignment, inverse radiance-field formulations
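
To make the notion of a similarity transformation (rotation, translation, and scale) concrete, here is a 2D toy analogue that recovers all three from known point correspondences using the classical closed-form least-squares fit over complex numbers. It is not the GSA solver, which operates on 3DGS models with feature-guided correspondences; the point sets and the function name are assumptions for illustration. The test case mirrors the poor-initialization example mentioned in the abstract (180 degrees of misalignment and a 10x scale gap).

```python
import cmath
import math


def fit_similarity_2d(source, target):
    """Least-squares fit of target ~ a * source + t over complex points,
    where abs(a) is the scale and phase(a) is the rotation angle."""
    n = len(source)
    src_mean = sum(source) / n
    tgt_mean = sum(target) / n
    num = sum((p - src_mean).conjugate() * (q - tgt_mean)
              for p, q in zip(source, target))
    den = sum(abs(p - src_mean) ** 2 for p in source)
    a = num / den
    t = tgt_mean - a * src_mean
    return abs(a), cmath.phase(a), t


# Invented 2D points; the target is the source scaled 10x, rotated by
# 180 degrees, and shifted by (3, -4).
source = [complex(0, 0), complex(1, 0), complex(1, 2), complex(-1, 1)]
true_a = 10 * cmath.exp(1j * math.pi)
true_t = complex(3, -4)
target = [true_a * p + true_t for p in source]

scale, angle, shift = fit_similarity_2d(source, target)
print(scale, math.degrees(angle), shift)
```

With exact correspondences the closed form recovers the transform regardless of how far apart the two point sets start; the hard part in practice, and the part the paper addresses, is obtaining reliable correspondences between different objects.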

223. ❌ SatGeo-NeRF: Geometrically Regularized NeRF for Satellite Imagery

Authors: Valentin Wagner, Sebastian Bullinger, Michael Arens, Rainer Stiefelhagen Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21931v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

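Scoring rationale: SatGeo-NeRF focuses on neural radiance field (NeRF) reconstruction from satellite imagery and uses geometric regularization to reduce overfitting artifacts, which places it in computer vision and remote sensing. The keywords concern large models, deep learning fundamentals, or applications of AI for science, but the paper does not involve large models (such as LLMs), model training techniques (pre-training, fine-tuning, alignment), inference optimization (quantization, acceleration), agent systems, or specific scientific AI subfields (bioinformatics, cheminformatics). Only "AI for Science" is weakly related, since satellite image processing can be viewed as a scientific application in a broad sense; this is not the paper's core, so it receives 5 points. The remaining keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

This paper proposes a geometrically regularized NeRF for satellite imagery that uses three regularizers to reduce overfitting artifacts. On the DFC2019 benchmark it lowers the Mean Altitude Error by 13.9% and 11.7% relative to existing methods.

Abstract translation (Chinese)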

本文提出SatGeo-NeRF，一种用于卫星影像的几何正则化神经辐射场方法，通过三种模型无关的正则化器缓解当前最先进模型中因过拟合导致的几何伪影。重力对齐平面性正则化将深度推断的近似表面法线与重力轴对齐以促进局部平面性，并通过相应的表面近似耦合相邻射线以促进跨射线梯度流动。粒度正则化实施从粗到细的几何学习方案，而深度监督正则化则稳定早期训练以提升几何精度。在DFC2019卫星重建基准测试中，相较于EO-NeRF和EO-GS等最先进基线模型，SatGeo-NeRF将平均海拔误差分别降低了13.9%和11.7%。

Abstract

We present SatGeo-NeRF, a geometrically regularized NeRF for satellite imagery that mitigates overfitting-induced geometric artifacts observed in current state-of-the-art models using three model-agnostic regularizers. Gravity-Aligned Planarity Regularization aligns depth-inferred, approximated surface normals with the gravity axis to promote local planarity, coupling adjacent rays via a corresponding surface approximation to facilitate cross-ray gradient flow. Granularity Regularization enforces a coarse-to-fine geometry-learning scheme, and Depth-Supervised Regularization stabilizes early training for improved geometric accuracy. On the DFC2019 satellite reconstruction benchmark, SatGeo-NeRF improves the Mean Altitude Error by 13.9% and 11.7% relative to state-of-the-art baselines such as EO-NeRF and EO-GS.

Keywords: Satellite Imagery, NeRF, Geometric Regularization, Overfitting Mitigation, Depth Reconstruction, Surface Normals, DFC2019 Benchmark, Mean Altitude Error

224. ❌ The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation

Authors: Guannan Lai, Da-Wei Zhou, Zhenguo Li, Han-Jia Ye Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21928v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

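Scoring rationale: The paper studies Continual Test-Time Adaptation (CTTA), which falls under domain adaptation and is highly relevant to the keyword "Pre-training OR Continual Pre-training OR Domain Adaptation" (8 points). It uses a lightweight adapter for parameter-efficient adaptation, which is moderately related to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (5 points). The paper does not involve large language models, reasoning, alignment, scientific AI applications, or the other keywords, so the remaining keywords receive 0.

!!! tip deepseek-chat TL;DR

This paper addresses the efficiency-generalization trade-off in Continual Test-Time Adaptation and proposes GOLD, which identifies and dynamically maintains a "golden subspace" to achieve efficient and stable online adaptation.

Abstract translation (Chinese)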

持续测试时适应（CTTA）旨在使模型能够在无法访问源数据的情况下，在线适应分布偏移下的无标注数据流。现有CTTA方法面临效率与泛化的权衡：更新更多参数可提升适应能力，但会严重降低在线推理效率。理想的解决方案是以最小的特征更新实现可比的适应效果；我们将这一最小子空间称为黄金子空间。我们在单步适应设定中证明了其存在性，并表明该子空间与预训练分类器的行空间重合。为实现该子空间的在线维护，我们引入了样本级平均梯度外积（AGOP）作为无需重新训练即可估计分类器权重的有效代理。基于这些发现，我们提出了引导式在线低秩方向适应（GOLD），该方法通过轻量适配器将特征投影至黄金子空间，并学习紧凑的缩放向量，同时利用AGOP动态更新子空间。在分类与分割基准（包括自动驾驶场景）上的大量实验表明，GOLD在效率、稳定性及整体性能上均达到优越水平。代码发布于https://github.com/AIGNLAI/GOLD。

Abstract

Continual Test-Time Adaptation (CTTA) aims to enable models to adapt online to unlabeled data streams under distribution shift without accessing source data. Existing CTTA methods face an efficiency-generalization trade-off: updating more parameters improves adaptation but severely reduces online inference efficiency. An ideal solution is to achieve comparable adaptation with minimal feature updates; we call this minimal subspace the golden subspace. We prove its existence in a single-step adaptation setting and show that it coincides with the row space of the pretrained classifier. To enable online maintenance of this subspace, we introduce the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining. Building on these insights, we propose Guided Online Low-rank Directional adaptation (GOLD), which uses a lightweight adapter to project features onto the golden subspace and learns a compact scaling vector while the subspace is dynamically updated via AGOP. Extensive experiments on classification and segmentation benchmarks, including autonomous-driving scenarios, demonstrate that GOLD attains superior efficiency, stability, and overall performance. Our code is available at https://github.com/AIGNLAI/GOLD.

Keywords: Continual Test-Time Adaptation, Domain Adaptation, Efficiency-Generalization Trade-off, Golden Subspace, Online Adaptation, Parameter-efficient Fine-tuning, Autonomous Driving
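
The abstract's claim that the golden subspace coincides with the row space of the pretrained classifier has a simple linear-algebra intuition: for a linear head with weights W, any feature component outside the row space of W leaves the logits unchanged. The sketch below checks this with plain Python. It is a sketch of that intuition only, not the GOLD method; the weight matrix, the feature vector, and the helper names are invented for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def row_space_basis(weights, tol=1e-10):
    """Orthonormal basis of the row space of `weights` via Gram-Schmidt."""
    basis = []
    for row in weights:
        residual = list(row)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * bi for r, bi in zip(residual, b)]
        norm = dot(residual, residual) ** 0.5
        if norm > tol:
            basis.append([r / norm for r in residual])
    return basis


def project(x, basis):
    """Orthogonal projection of feature vector x onto span(basis)."""
    out = [0.0] * len(x)
    for b in basis:
        coeff = dot(x, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out


def logits(weights, x):
    return [dot(row, x) for row in weights]


# Invented 2-class linear head over 4-dimensional features.
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, -1.0]]
x = [0.5, -1.0, 3.0, 2.0]

x_proj = project(x, row_space_basis(W))
print(logits(W, x), logits(W, x_proj))
```

For a purely linear head the Jacobian of the logits with respect to the features is W itself, so an averaged gradient outer product reduces to WᵀW, which spans the same row space; how the paper estimates and maintains the subspace online is beyond this sketch.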

225. ❌ CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

Authors: Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21901v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

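Scoring rationale: The paper studies video subtitle removal, which belongs to computer vision rather than large language model research. The only relevant keyword is "PEFT OR LoRA OR Parameter-efficient Fine-tuning": the paper explicitly uses "LoRA-based adaptation" for parameter-efficient fine-tuning as one of its core methods, so it receives 10 points. All other keywords are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

This paper proposes CLEAR, a mask-free, end-to-end framework for video subtitle removal. Using context-aware adaptive learning and LoRA-based adaptation, it improves PSNR by 6.77 dB over mask-dependent methods on Chinese subtitle benchmarks and generalizes zero-shot across six languages.

Abstract translation (Chinese)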

视频字幕去除旨在区分文本叠加层与背景内容，同时保持时序一致性。现有的基于扩散模型的方法在训练和推理阶段均需依赖显式的掩码序列，这限制了其实际应用。本文提出CLEAR（面向端到端自适应视频字幕去除的上下文感知学习框架），这是一种无需掩码的框架，通过上下文感知自适应学习实现真正的端到端推理。我们的两阶段设计将先验提取与生成优化解耦：第一阶段通过双编码器的自监督正交约束学习解耦的字幕表征，而第二阶段采用基于LoRA的自适应机制结合生成反馈进行动态上下文调整。值得注意的是，本方法仅需基础扩散模型0.77%的参数进行训练。在中文字幕基准测试中，CLEAR相较于依赖掩码的基线方法实现了PSNR指标提升+6.77dB与VFID指标降低74.7%，同时在六种语言（英语、韩语、法语、日语、俄语、德语）上展现出卓越的零样本泛化能力——这一性能得益于我们生成驱动的反馈机制，该机制确保在推理过程中无需真实掩码即可实现鲁棒的字幕去除。

Abstract

Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.

Keywords: Video subtitle removal, Mask-free framework, Context-aware learning, LoRA adaptation, End-to-end inference, Diffusion model, Zero-shot generalization, Parameter-efficient fine-tuning
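
Why LoRA-based adaptation trains only a small fraction of the parameters can be seen from a minimal sketch of a single LoRA-augmented linear layer, where the frozen weight W is combined with a low-rank update B·A. The layer sizes, rank, and scaling factor below are assumptions for illustration and are not the configuration behind the 0.77% figure reported in the abstract.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), with W frozen and only the
    low-rank factors A (r x d_in) and B (d_out x r) trainable."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def trainable_fraction(d_in, d_out, rank):
    """Share of LoRA parameters relative to the full weight matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


# Invented 2x2 layer with a rank-1 adapter.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]
x = [1.0, 1.0]

# With B initialised to zero the adapted layer matches the frozen layer.
print(lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0))
# After B has moved away from zero, the output shifts by the low-rank update.
print(lora_forward(W, A, [[0.5], [0.0]], x, alpha=2.0))
# A rank-8 adapter on a 1024 x 1024 layer trains about 1.6% of its weights.
print(trainable_fraction(1024, 1024, 8))
```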

226. ❌ HMS-VesselNet: Hierarchical Multi-Scale Attention Network with Topology-Preserving Loss for Retinal Vessel Segmentation

Author: Amarnath R Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21891v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

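Scoring rationale: This paper focuses on the computer-vision task of retinal vessel segmentation, using a convolutional neural network (HMS-VesselNet) and specialized loss functions (Dice, cross-entropy, centerline Dice). All keywords concern large language models (LLMs), their training techniques (such as pre-training, fine-tuning, and alignment), inference optimization, agent systems, or model compression, and the paper neither involves LLMs nor contributes innovations in foundational deep-learning techniques. The only loosely related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because medical image analysis can be viewed as an application of AI in science (biomedicine). However, the paper does not use these terms explicitly, and its core is a specific CV method rather than a broad AI for Science innovation, so it receives 5 points (some relevance). All other keywords are entirely unrelated and score 0.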

!!! tip deepseek-chat TL;DR

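The paper proposes a hierarchical multi-scale attention network (HMS-VesselNet) for retinal vessel segmentation. By combining multi-resolution processing with a topology-preserving loss, it significantly improves the recall of thin peripheral vessels and achieves high segmentation accuracy across multiple datasets (mean Dice of 88.72%).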

Abstract translation (Chinese)

基于标准重叠损失的视网膜血管分割方法往往遗漏纤细的周边血管，因为这些结构占据的像素极少且与背景对比度低。我们提出HMS-VesselNet——一种分层多尺度网络，该网络以四种不同分辨率并行处理眼底图像，并通过学习得到的融合权重整合各分支输出。训练损失结合Dice系数、二元交叉熵与中心线Dice系数，以联合优化区域重叠度与血管连续性。从第20个训练周期开始采用困难样本挖掘策略，使梯度更新集中于最具挑战性的训练图像。在DRIVE、STARE和CHASE_DB1数据集的68张图像上采用五折交叉验证进行测试，该模型取得了平均Dice系数88.72±0.67%、灵敏度90.78±1.42%和AUC（受试者工作特征曲线下面积）98.25±0.21%的性能。在留一数据集外实验中，每个未见数据集的AUC均保持在95%以上。最显著的改进体现在纤细周边血管的召回率上——这些结构是标准方法最易遗漏且对糖尿病视网膜病变早期检测至关重要的部分。

Abstract

Retinal vessel segmentation methods based on standard overlap losses tend to miss thin peripheral vessels because these structures occupy very few pixels and have low contrast against the background. We propose HMS-VesselNet, a hierarchical multi-scale network that processes fundus images across four parallel branches at different resolutions and combines their outputs using learned fusion weights. The training loss combines Dice, binary cross-entropy, and centerline Dice to jointly optimize area overlap and vessel continuity. Hard example mining is applied from epoch 20 onward to concentrate gradient updates on the most difficult training images. Tested on 68 images from DRIVE, STARE, and CHASE_DB1 using 5-fold cross-validation, the model achieves a mean Dice of 88.72 +/- 0.67%, Sensitivity of 90.78 +/- 1.42%, and AUC of 98.25 +/- 0.21%. In leave-one-dataset-out experiments, AUC remains above 95% on each unseen dataset. The largest improvement is in the recall of thin peripheral vessels, which are the structures most frequently missed by standard methods and most critical for early detection of diabetic retinopathy.

Keywords: retinal vessel segmentation, hierarchical multi-scale network, attention mechanism, topology-preserving loss, hard example mining, diabetic retinopathy, medical image analysis, convolutional neural network
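
The abstract describes a training loss that combines Dice, binary cross-entropy, and centerline Dice. A minimal sketch of such a combination follows. It assumes the centerline term follows the commonly used clDice formulation (harmonic mean of topology precision and topology sensitivity) with skeletons computed beforehand, and it uses equal term weights. Both the formulation and the weights are assumptions, since the abstract does not state them.

```python
import math


def soft_dice(pred, target, eps=1e-6):
    """Soft Dice coefficient over flattened probability / label lists."""
    inter = sum(p * t for p, t in zip(pred, target))
    return (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def centerline_dice(pred, target, pred_skel, target_skel, eps=1e-6):
    """clDice-style score from precomputed skeleton masks (assumed inputs)."""
    # Topology precision: share of the predicted skeleton inside the target mask.
    tprec = (sum(s * t for s, t in zip(pred_skel, target)) + eps) / (sum(pred_skel) + eps)
    # Topology sensitivity: share of the target skeleton covered by the prediction.
    tsens = (sum(s * p for s, p in zip(target_skel, pred)) + eps) / (sum(target_skel) + eps)
    return 2.0 * tprec * tsens / (tprec + tsens)


def combined_loss(pred, target, pred_skel, target_skel,
                  w_dice=1.0, w_bce=1.0, w_cl=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_dice * (1.0 - soft_dice(pred, target))
            + w_bce * bce(pred, target)
            + w_cl * (1.0 - centerline_dice(pred, target, pred_skel, target_skel)))


# Toy 4-pixel example: a perfect prediction versus an uninformative one.
target = [1.0, 0.0, 1.0, 0.0]
skeleton = [1.0, 0.0, 0.0, 0.0]
good = combined_loss(target, target, skeleton, skeleton)
bad = combined_loss([0.5] * 4, target, [0.0, 1.0, 0.0, 0.0], skeleton)
print(good, bad)
```

The centerline term penalizes breaks along the vessel skeleton even when few pixels are affected, which is the property the abstract relies on for thin peripheral vessels.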

227. ❌ ADaFuSE: Adaptive Diffusion-generated Image and Text Fusion for Interactive Text-to-Image Retrieval

Authors: Zhuocheng Zhang, Xingwu Zhang, Kangheng Liang, Guanxuan Li, Richard Mccreadie, Zijun Long Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21886v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

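Scoring rationale: The ADaFuSE paper focuses on interactive text-to-image retrieval (I-TIR) and proposes a lightweight fusion model whose core innovation is a semantic-aware mixture-of-experts branch that captures fine-grained cross-modal nuances. It is therefore highly relevant only to the keyword 'Mixture of Experts OR MoE OR Sparse Models' (10 points), because the paper explicitly uses an MoE architecture. The other keywords mainly concern LLM-specific techniques, training methods, inference optimization, agent systems, or AI-for-science applications, whereas this paper studies diffusion-based image retrieval and does not touch on them, so their relevance is 0.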

!!! tip deepseek-chat TL;DR

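Existing interactive text-to-image retrieval methods fuse multi-modal views statically, which degrades performance. The paper proposes ADaFuSE, which dynamically calibrates multi-modal information through adaptive gating and a semantic-aware mixture-of-experts branch, achieving state-of-the-art performance on multiple benchmarks with improved robustness.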

Abstract translation (Chinese)

交互式文本到图像检索（I-TIR）的最新进展利用扩散模型弥合了文本信息需求与待检索图像之间的模态鸿沟，从而提升了检索效能。然而，现有框架仅通过简单的嵌入相加来融合用户反馈的多模态视图。本研究表明，这种静态且无差别的融合方式会不加区分地引入扩散模型产生的生成噪声，导致高达55.62%的样本出现性能下降。我们进一步提出ADaFuSE（基于语义感知专家的自适应扩散-文本融合），这是一种轻量级融合模型，专为扩散增强的I-TIR中的多模态视图对齐与校准而设计，无需修改骨干编码器即可嵌入现有框架。具体而言，我们引入了双分支融合机制：其中自适应门控分支动态平衡模态可靠性，同时语义感知的混合专家分支捕捉细粒度的跨模态差异。通过在四个标准I-TIR基准上的全面评估，ADaFuSE实现了最先进的性能，在Hits@10指标上以仅5.29%的参数增长超越DAR模型达3.49%，同时对噪声更多、更长的交互式查询表现出更强的鲁棒性。这些结果表明，生成式增强与原则性融合相结合，为交互式检索提供了一种简单、可泛化的替代微调方案。

Abstract

Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse multi-modal views of user feedback by simple embedding addition. In this work, we show that this static and undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, leading to performance degradation for up to 55.62% of samples. We further propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that employs an adaptive gating branch to dynamically balance modality reliability, alongside a semantic-aware mixture-of-experts branch to capture fine-grained cross-modal nuances. Via thorough evaluation over four standard I-TIR benchmarks, ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% parameter increase, while exhibiting stronger robustness to noisy and longer interactive queries. These results show that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.

Keywords: interactive text-to-image retrieval, diffusion models, multi-modal fusion, mixture of experts, adaptive gating, semantic-aware, generative augmentation, robustness
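
The contrast the abstract draws between static embedding addition and an adaptive gating branch can be illustrated with a small sketch. The scalar sigmoid gate below is a generic gating design chosen for illustration; the gate parameterization, the function names, and the toy embeddings are all assumptions, and the mixture-of-experts branch is omitted entirely.

```python
import math


def add_fusion(text_emb, image_emb):
    """Static baseline: element-wise addition of the two views."""
    return [t + v for t, v in zip(text_emb, image_emb)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_emb, image_emb, gate_weights, gate_bias):
    """Convex combination of the two views, weighted by an input-dependent gate.

    gate_weights has length 2 * d and acts on the concatenated embeddings
    (hypothetical parameterization, not taken from the paper).
    """
    concat = text_emb + image_emb  # list concatenation: [text ; image]
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [g * t + (1.0 - g) * v for t, v in zip(text_emb, image_emb)]
    return fused, g


text = [1.0, 0.0]
image = [0.0, 1.0]  # stands in for the embedding of a diffusion-generated image

static = add_fusion(text, image)
neutral, g_neutral = gated_fusion(text, image, [0.0] * 4, 0.0)
text_heavy, g_text = gated_fusion(text, image, [0.0] * 4, 20.0)
print(static, neutral, g_neutral, text_heavy, g_text)
```

With a gate, a noisy generated image can be down-weighted per query, whereas plain addition always passes its noise through at full strength.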

228. ❌ Deep S2P: Integrating Learning Based Stereo Matching Into the Satellite Stereo Pipeline

Authors: Elías Masquil, Thibaud Ehret, Pablo Musé, Gabriele Facciolo Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21882v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

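Scoring rationale: This paper focuses on satellite stereo vision and deep learning for remote sensing, within computer vision and Earth observation. Its core is integrating learning-based stereo matchers (such as StereoAnywhere, MonSter, and Foundation Stereo) into the satellite stereo pipeline to improve digital surface model generation. This is entirely unrelated to most topics in the keyword list, such as large models, training techniques, inference optimization, alignment, and agents. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to Earth science (remote sensing). It is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance). No other keyword is addressed, and each scores 0.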

!!! tip deepseek-chat TL;DR

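The study addresses the challenge of integrating learning-based stereo matchers into the satellite stereo pipeline. By adapting the rectification stage and validating the result experimentally, it improves digital surface model accuracy over classical methods, although limitations remain on challenging surface types such as vegetation.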

Abstract translation (Chinese)

基于卫星影像生成数字表面模型是地球观测领域的核心任务，在卫星数据处理流程（如卫星立体处理流程S2P）中通常采用经典立体匹配算法进行处理。尽管近年来基于学习的立体匹配器（如StereoAnywhere、MonSter、Foundation Stereo以及卫星微调版MonSter）在标准测试集上达到了最先进的性能，但由于观测几何与视差假设的差异，将其集成到实际卫星处理流程中仍面临挑战。本研究将多种现代学习型立体匹配器集成至卫星立体处理流程，通过调整校正阶段以确保视差极性与范围的兼容性。我们公开了相应代码，以支持这些方法在大规模地球观测工作流中的可复现应用。卫星影像实验表明，相较于基于经典代价体积的方法，数字表面模型的精度获得持续提升，但常用指标（如平均绝对误差）显示出饱和效应。定性分析结果揭示了几何细节与结构锐度的显著改善，这凸显了需要更能反映感知与结构保真度的评估策略。同时，所有评估模型在植被等复杂地表类型的处理性能仍存在局限，表明基于学习的立体匹配方法在自然环境中仍面临开放挑战。

Abstract

Digital Surface Model generation from satellite imagery is a core task in Earth observation and is commonly addressed using classical stereoscopic matching algorithms in satellite pipelines as in the Satellite Stereo Pipeline (S2P). While recent learning-based stereo matchers achieve state-of-the-art performance on standard benchmarks, their integration into operational satellite pipelines remains challenging due to differences in viewing geometry and disparity assumptions. In this work, we integrate several modern learning-based stereo matchers, including StereoAnywhere, MonSter, Foundation Stereo, and a satellite fine-tuned variant of MonSter, into the Satellite Stereo Pipeline, adapting the rectification stage to enforce compatible disparity polarity and range. We release the corresponding code to enable reproducible use of these methods in large-scale Earth observation workflows. Experiments on satellite imagery show consistent improvements over classical cost-volume-based approaches in terms of Digital Surface Model accuracy, although commonly used metrics such as mean absolute error exhibit saturation effects. Qualitative results reveal substantially improved geometric detail and sharper structures, highlighting the need for evaluation strategies that better reflect perceptual and structural fidelity. At the same time, performance over challenging surface types such as vegetation remains limited across all evaluated models, indicating open challenges for learning-based stereo in natural environments.

Keywords: satellite stereo, stereo matching, deep learning, digital surface model, Earth observation, learning-based methods, rectification, remote sensing

229. ❌ Thermal Topology Collapse: Universal Physical Patch Attacks on Infrared Vision Systems

Authors: Chengyin Hu, Yikun Guo, Yuxian Dong, Qike Zhang, Kalibinuer Tiliwalidi, Yiwei Wei, Haitao Shi, Jiujiang Guo, Jiahuan Long, Xiang Chen Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21876v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

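Scoring rationale: This paper studies a physical adversarial attack method (UPPA) against infrared vision systems. It belongs to computer vision and physical security and is entirely unrelated to all scoring keywords, which center on large models, deep-learning technical principles, and their applications in science. The paper does not involve large models, language models, training methods, inference techniques, alignment, compression, agent systems, or AI-for-science applications.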

!!! tip deepseek-chat TL;DR

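The paper proposes a universal physical patch attack (UPPA) against infrared pedestrian detectors. It models perturbations with parameterized Bezier blocks and optimizes them globally with particle swarm optimization, producing low-temperature patches for physical deployment that achieve a high attack success rate with no online computational overhead.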

Abstract translation (Chinese)

尽管红外行人检测器已在视觉感知任务中得到广泛应用，但其对物理对抗攻击的脆弱性日益凸显。现有的物理攻击方法主要依赖于实例特定的在线优化和刚性图案设计，导致部署成本高昂且物理鲁棒性不足。为应对这些局限，本研究提出通用物理补丁攻击（Universal Physical Patch Attack, UPPA），这是红外领域首个通用物理攻击方法。该方法采用几何约束的参数化贝塞尔块对扰动进行建模，并利用粒子群优化算法在全局数据分布上进行统一优化，从而在动态形变下保持拓扑稳定性。在物理部署阶段，我们将优化后的数字扰动实体化为物理冷贴片，实现连续平滑的低温分布，自然契合红外成像的热辐射特性。大量实验表明，UPPA在无需任何在线计算开销的情况下实现了卓越的物理攻击成功率，同时展现出强大的跨域泛化能力和可靠的黑盒可迁移性。

Abstract

Although infrared pedestrian detectors have been widely deployed in visual perception tasks, their vulnerability to physical adversarial attacks is becoming increasingly apparent. Existing physical attack methods predominantly rely on instance-specific online optimization and rigid pattern design, leading to high deployment costs and insufficient physical robustness. To address these limitations, this work proposes the Universal Physical Patch Attack (UPPA), the first universal physical attack method in the infrared domain. This method employs geometrically constrained parameterized Bezier blocks to model perturbations and utilizes the Particle Swarm Optimization (PSO) algorithm to perform unified optimization across the global data distribution, thus maintaining topological stability under dynamic deformations. In the physical deployment phase, we materialize the optimized digital perturbations into physical cold patches, achieving a continuous and smooth low-temperature distribution that naturally aligns with the thermal radiation characteristics of infrared imaging. Extensive experiments demonstrate that UPPA achieves an outstanding physical attack success rate without any online computational overhead, while also exhibiting strong cross-domain generalization and reliable black-box transferability.

Keywords: infrared vision systems, physical adversarial attacks, universal physical patch attack, thermal topology collapse, particle swarm optimization, infrared pedestrian detectors, cross-domain generalization, black-box transferability
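
The abstract names Particle Swarm Optimization (PSO) as the optimizer for the patch parameters. The sketch below is a textbook PSO loop on a toy quadratic objective. The objective, bounds, and hyperparameters are placeholders chosen for illustration. In the paper the objective would be an attack loss over Bezier-block parameters, which is not reproduced here.

```python
import random


def pso_minimize(objective, dim, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box using basic global-best PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # per-particle best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm-wide best
    history = [gbest_val]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity mixes inertia, pull toward own best, pull toward swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


def toy_objective(x):
    """Placeholder objective with its minimum at (1, 1, ..., 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


best, best_val, history = pso_minimize(toy_objective, dim=3, bounds=(-5.0, 5.0))
print(best, best_val)
```

Because PSO needs only objective values and no gradients, it suits black-box settings such as optimizing a patch against a detector whose internals are unavailable.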

230. ❌ Climate Prompting: Generating the Madden-Julian Oscillation using Video Diffusion and Low-Dimensional Conditioning

Authors: Sulian Thual, Feiyang Cai, Jingjing Wang, Feng Luo Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21856v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

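Scoring rationale: This paper uses a video diffusion model to generate Madden-Julian oscillation (MJO) sequences, an application of deep learning to climate science. It does not involve any LLM-related techniques such as pre-training, fine-tuning, alignment, inference optimization, or agents. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to science (climate science), but it is not a core match (the paper uses a video diffusion model rather than an LLM), so it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper and score 0.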

!!! tip deepseek-chat TL;DR

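The paper proposes a video-diffusion-based method that generates Madden-Julian oscillation (MJO) sequences from low-dimensional conditioning, bridging low-dimensional MJO theory and high-resolution atmospheric complexity and aiding tropical atmosphere prediction.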

Abstract translation (Chinese)

生成式深度学习是模拟热带地区马登-朱利安振荡（MJO）的强大工具，但其与传统理论框架的关系仍不甚明晰。本文提出一种基于大气再分析数据训练的视频扩散模型，该模型可根据关键低维指标合成长时间的MJO序列。尽管存在一定偏差，生成的MJO序列仍能捕捉包括合成场、功率谱以及多尺度结构（如对流耦合波）在内的关键特征。我们进一步引导模型基于刻意理想化的低维条件（例如永续型MJO、受季节和/或厄尔尼诺-南方涛动（ENSO）独立调制等场景）生成更易处理的MJO序列，从而解构其内在过程并识别物理驱动因子。本方法为弥合低维MJO理论与高分辨率大气复杂性之间的鸿沟提供了实用框架，并将助力热带大气预测研究。

Abstract

Generative Deep Learning is a powerful tool for modeling the Madden-Julian oscillation (MJO) in the tropics, yet its relationship to traditional theoretical frameworks remains poorly understood. Here we propose a video diffusion model, trained on atmospheric reanalysis, to synthesize long MJO sequences conditioned on key low-dimensional metrics. The generated MJOs capture key features including composites, power spectra and multiscale structures including convectively coupled waves, despite some bias. We then prompt the model to generate more tractable MJOs based on intentionally idealized low-dimensional conditionings, for example a perpetual MJO, an isolated modulation by seasons and/or the El Niño-Southern Oscillation, and so on. This enables deconstructing the underlying processes and identifying physical drivers. The present approach provides a practical framework for bridging the gap between low-dimensional MJO theory and high-resolution atmospheric complexity and will help tropical atmosphere prediction.

Keywords: video diffusion model, Madden-Julian oscillation, MJO, generative deep learning, atmospheric reanalysis, low-dimensional conditioning, tropical atmosphere prediction, climate modeling

231. ❌ Multi-View Deformable Convolution Meets Visual Mamba for Coronary Artery Segmentation

Authors: Xiaochan Yuan, Pai Zeng Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21829v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

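Scoring rationale: This paper focuses on medical image segmentation and proposes a coronary artery segmentation framework that combines multidirectional snake convolution with visual Mamba. Its content is entirely unrelated to the vast majority of keywords (large-model technical principles, training methods, inference optimization, alignment, agents, and so on), because those keywords mainly target LLM techniques in natural language processing. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work applies AI to biomedicine (specifically cardiovascular image analysis) and so falls under 'AI for Science'. This is not the paper's core innovation, which lies in the model architecture, so it receives 5 points (some relevance).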

!!! tip deepseek-chat TL;DR

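To address the difficulty of modeling long-range dependencies and the high computational cost in coronary CTA image segmentation, the paper proposes MDSVM-UNet, a two-stage segmentation framework that combines multidirectional snake convolution with residual visual Mamba, improving segmentation accuracy while maintaining linear computational complexity.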

Abstract translation (Chinese)

从计算机断层扫描血管造影（CTA）图像中精确分割冠状动脉对于心血管疾病的诊断和治疗规划具有至关重要的临床意义。然而，由于血管固有的多分支、细长管状形态，以及前景血管与背景组织之间严重的类别不平衡，冠状动脉分割仍然是一项挑战。传统的基于卷积神经网络（CNN）的方法难以捕捉空间上相距较远的血管结构之间的长程依赖关系，而基于视觉变换器（Vision Transformer, ViT）的方法则会产生过高的计算开销，阻碍其在资源受限的临床环境中的部署。受到状态空间模型（State Space Models, SSMs）近期在线性复杂度下高效建模长程序列依赖关系方面成功的启发，我们提出了MDSVM-UNet，一种新颖的两阶段冠状动脉分割框架，它将多向蛇形卷积（multidirectional snake convolution, MDSConv）与残差视觉曼巴（residual visual Mamba, RVM）协同整合。在编码阶段，我们引入了MDSConv，这是一个可变形卷积模块，它沿着三个正交解剖平面——矢状面、冠状面和轴向面——学习自适应偏移，从而实现全面的多视角特征融合，忠实地捕捉冠状动脉细长且迂曲的几何形态。在解码阶段，我们设计了一个基于RVM的上采样解码器块，它利用选择性状态空间机制来建模切片间的长程依赖关系，同时保持线性计算复杂度。此外，我们提出了一种渐进式的两阶段分割策略：第一阶段执行粗略的全图像分割以指导智能块提取，而第二阶段则进行细粒度的块级分割，以恢复血管细节并抑制假阳性。

Abstract

Accurate segmentation of coronary arteries from computed tomography angiography (CTA) images is of paramount clinical importance for the diagnosis and treatment planning of cardiovascular diseases. However, coronary artery segmentation remains challenging due to the inherent multi-branching and slender tubular morphology of the vasculature, compounded by severe class imbalance between foreground vessels and background tissue. Conventional convolutional neural network (CNN)-based approaches struggle to capture long-range dependencies among spatially distant vascular structures, while Vision Transformer (ViT)-based methods incur prohibitive computational overhead that hinders deployment in resource-constrained clinical settings. Motivated by the recent success of state space models (SSMs) in efficiently modeling long-range sequential dependencies with linear complexity, we propose MDSVM-UNet, a novel two-stage coronary artery segmentation framework that synergistically integrates multidirectional snake convolution (MDSConv) with residual visual Mamba (RVM). In the encoding stage, we introduce MDSConv, a deformable convolution module that learns adaptive offsets along three orthogonal anatomical planes – sagittal, coronal, and axial – thereby enabling comprehensive multi-view feature fusion that faithfully captures the elongated and tortuous geometry of coronary vessels. In the decoding stage, we design an RVM-based upsampling decoder block that leverages selective state space mechanisms to model inter-slice long-range dependencies while preserving linear computational complexity. Furthermore, we propose a progressive two-stage segmentation strategy: the first stage performs coarse whole-image segmentation to guide intelligent block extraction, while the second stage conducts fine-grained block-level segmentation to recover vascular details and suppress false positives.

Keywords: coronary artery segmentation, computed tomography angiography, multi-view deformable convolution, visual Mamba, state space models, long-range dependencies, medical image analysis, cardiovascular disease

232. ❌ Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion

Authors: Yanglin Deng, Tianyang Xu, Chunyang Cheng, Hui Li, Xiao-jun Wu, Josef Kittler Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21820v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

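Scoring rationale: This paper studies infrared and visible image fusion (IVIF), which belongs to computer vision rather than direct research on LLMs or deep-learning technical principles. It mainly concerns training paradigms (SPTP, UPTP, APTP), data alignment, cross-modal relationships, and lightweight network design (CNN, Transformer, GAN). All keywords relate directly to LLMs, deep-learning technical principles, or subfields of AI for Science (such as bioinformatics and cheminformatics), whereas this paper is a general computer-vision application with only a weak link to AI for Science (it can be viewed as an AI application in a broad sense). Therefore only 'AI for Science OR Bioinformatics OR Cheminformatics' receives 5 points (some relevance), and all other keywords score 0 (entirely unrelated).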

!!! tip deepseek-chat TL;DR

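The paper challenges the necessity of the strictly paired training paradigm (SPTP) in infrared and visible image fusion, proposes unpaired and arbitrarily paired training paradigms (UPTP and APTP), and develops a practical framework that significantly enriches cross-modal relationships even when training data is severely limited and unaligned. Experiments show that the new paradigms match SPTP performance while using only 1% of the data.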

Abstract translation (Chinese)

红外与可见光图像融合（IVIF）旨在结合互补的模态信息，同时保留自然纹理与显著的热特征。现有方法主要依赖大量严格配准的图像对进行训练，然而由于配准过程成本高昂且劳动密集，获取此类数据往往不切实际。此外，训练中保持严格的配对设置限制了跨模态关系的丰富性，从而制约了模型的泛化性能。为此，本研究通过系统探究非配对与任意配对训练范式（UPTP 和 APTP）在高性能 IVIF 中的应用，对严格配对训练范式（SPTP）的必要性提出了挑战。我们建立了 APTP 的理论目标，体现了 UPTP 与 SPTP 之间的互补特性。更重要的是，我们开发了一个实用框架，即使在训练数据严重受限且未配准的情况下，仍能显著丰富跨模态关系。为验证所提方法，我们设计了三个端到端的轻量级基线模型及一系列创新损失函数，覆盖三种经典框架（CNN、Transformer、GAN）。综合实验表明，所提出的 APTP 和 UPTP 是可行的，能够在严重受限且内容不一致的红外与可见光数据集上训练模型，并达到与 SPTP 下 100 倍规模数据集相当的性能。这一发现从根本上缓解了数据收集的成本与难度，同时从数据角度增强了模型的鲁棒性，为 IVIF 研究提供了可行的解决方案。代码发布于 https://github.com/yanglinDeng/IVIF_unpair 。

Abstract

Infrared and visible image fusion (IVIF) combines complementary modalities while preserving natural textures and salient thermal signatures. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Besides, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting generalisation performance. To this end, this work challenges the necessity of the Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100× larger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies. The code is available at https://github.com/yanglinDeng/IVIF_unpair.

Keywords: Infrared and visible image fusion, Unpaired training, Arbitrarily paired training, Cross-modal relationships, Lightweight baselines, CNN Transformer GAN, Data alignment, Model robustness

233. ❌ Clinical Graph-Mediated Distillation for Unpaired MRI-to-CFI Hypertension Prediction

Authors: Dillan Imans, Phuoc-Nguyen Bui, Duc-Tai Le, Hyunseung Choo Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21809v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

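Scoring rationale: This paper focuses on medical image analysis (MRI and fundus images) and knowledge distillation for hypertension prediction. Its content is unrelated to nearly all of the keywords, which concern LLM technical principles, training methods, inference optimization, alignment, agents, and so on. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to biomedicine (specifically medical imaging and hypertension prediction), which falls under 'AI for Science', but that is not its core contribution (the core is a cross-modal knowledge distillation framework), so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

For MRI and fundus image modalities that lack paired data, this paper proposes Clinical Graph-Mediated Distillation (CGMD), a knowledge distillation framework based on a clinical similarity graph. It transfers knowledge learned from MRI to a fundus image model, improving the accuracy of hypertension prediction from fundus images alone.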

Abstract (Chinese translation)

视网膜眼底成像能够实现低成本、可扩展的高血压筛查，但高血压相关的视网膜特征较为细微，导致预测结果方差较高。脑部磁共振成像能够提供更强的高血压血管及小血管疾病标志物，但其成本高昂，且很少与眼底图像同时采集，这导致了模态孤立的数据集，即磁共振与眼底影像队列互不匹配。本研究针对这种非配对的磁共振-眼底影像场景，提出了临床图介导蒸馏框架，该框架能够在无需配对多模态数据的情况下，将磁共振成像衍生的高血压知识迁移至眼底模型。CGMD利用共享的结构化生物标志物作为桥梁，通过构建一个跨越两个队列的临床相似性k近邻图来实现。我们训练一个磁共振教师模型，将其表征在图上传播，并为眼底患者推算具有脑部信息指导的表征目标。随后，通过结合高血压监督、目标蒸馏和关系蒸馏的联合目标，训练一个眼底学生模型。在我们新收集的非配对磁共振-眼底-生物标志物数据集上的实验表明，相较于标准蒸馏和非图推算基线方法，CGMD能持续提升基于眼底影像的高血压预测性能，消融实验也证实了基于临床的图连接结构的重要性。代码发布于https://github.com/DillanImans/CGMD-unpaired-distillation。

Abstract

Retinal fundus imaging enables low-cost and scalable hypertension (HTN) screening, but HTN-related retinal cues are subtle, yielding high-variance predictions. Brain MRI provides stronger vascular and small-vessel-disease markers of HTN, yet it is expensive and rarely acquired alongside fundus images, resulting in modality-siloed datasets with disjoint MRI and fundus cohorts. We study this unpaired MRI-fundus regime and introduce Clinical Graph-Mediated Distillation (CGMD), a framework that transfers MRI-derived HTN knowledge to a fundus model without paired multimodal data. CGMD leverages shared structured biomarkers as a bridge by constructing a clinical similarity kNN graph spanning both cohorts. We train an MRI teacher, propagate its representations over the graph, and impute brain-informed representation targets for fundus patients. A fundus student is then trained with a joint objective combining HTN supervision, target distillation, and relational distillation. Experiments on our newly collected unpaired MRI-fundus-biomarker dataset show that CGMD consistently improves fundus-based HTN prediction over standard distillation and non-graph imputation baselines, with ablations confirming the importance of clinically grounded graph connectivity. Code is available at https://github.com/DillanImans/CGMD-unpaired-distillation.

Keywords: Hypertension Prediction, Unpaired MRI-Fundus, Knowledge Distillation, Clinical Graph, Multimodal Learning, Medical Imaging, Biomarkers, Retinal Fundus Imaging
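The graph-mediated imputation step described in the CGMD abstract above (link the two cohorts through shared biomarkers, then derive MRI-informed targets for fundus patients) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the Euclidean distance, the one-hop neighbor averaging, and the toy data are all assumptions, and the paper's propagation over the kNN graph may differ substantially.

```python
import math

def knn_indices(query, candidates, k):
    """Return indices of the k candidates closest to `query` (Euclidean distance)."""
    dists = sorted((math.dist(query, c), i) for i, c in enumerate(candidates))
    return [i for _, i in dists[:k]]

def impute_targets(fundus_biomarkers, mri_biomarkers, mri_embeddings, k=2):
    """For each fundus patient, average the teacher embeddings of the k
    clinically most similar MRI patients (hypothetical one-hop propagation)."""
    dim = len(mri_embeddings[0])
    targets = []
    for query in fundus_biomarkers:
        nbrs = knn_indices(query, mri_biomarkers, k)
        mean = [sum(mri_embeddings[i][d] for i in nbrs) / len(nbrs) for d in range(dim)]
        targets.append(mean)
    return targets

# Toy data (assumed): two biomarkers per patient, 2-D teacher embeddings.
mri_biomarkers = [[60.0, 140.0], [45.0, 118.0], [62.0, 150.0]]
mri_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]
fundus_biomarkers = [[61.0, 145.0]]

targets = impute_targets(fundus_biomarkers, mri_biomarkers, mri_embeddings, k=2)
print(targets)
```

In a full pipeline, these imputed vectors would serve as the distillation targets for the fundus student, alongside the hypertension labels; in practice the biomarkers would also need normalization before distances are computed.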

234. ❌ Cascade-Free Mandarin Visual Speech Recognition via Semantic-Guided Cross-Representation Alignment

Authors: Lei Yang, Yi He, Fei Wu, Shilin Wang Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21808v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

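Scoring rationale: This paper focuses on Mandarin Chinese visual speech recognition (VSR) and proposes a cascade-free architecture based on multitask learning that aligns features with a semantic-guided local contrastive loss. It mainly involves computer vision, speech recognition, and multitask learning, and does not address large language models (LLMs), innovations in deep learning technical principles, or applications of large models across domains. All keywords relate to large models, deep learning technical principles, or AI for Science, while the paper's research area (visual speech recognition) has no direct connection to them, so every keyword receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

To address the error accumulation and inference latency caused by cascade architectures in Mandarin Chinese visual speech recognition, this paper proposes a cascade-free architecture based on multitask learning that aligns features with a semantic-guided local contrastive loss and achieves superior recognition performance on public datasets.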

Abstract (Chinese translation)

汉语普通话视觉语音识别任务近年来虽取得进展，但其性能仍落后于英语等非声调语言。主要挑战之一源于普通话的声调特性，这限制了传统序列到序列建模方法的有效性。为缓解此问题，现有汉语视觉语音识别系统通常在级联架构中引入中间表示（尤其是拼音）以提升识别准确率。此类设计虽有益处，但在推理过程中后续阶段依赖于前一阶段的输出，导致错误累积与推理延迟增加。为克服这些局限，我们提出一种基于多任务学习的无级联架构，该架构联合集成音素与视位（viseme）等多种中间表示，以更好地利用上下文信息。所提出的语义引导局部对比损失在时序上对齐特征，支持推理过程中按需激活，从而在推理效率与性能之间实现权衡，同时减轻因投影和重嵌入导致的错误累积。在公开数据集上的实验表明，本方法取得了优越的识别性能。

Abstract

Chinese Mandarin visual speech recognition (VSR) is a task that has advanced in recent years, yet still lags behind the performance on non-tonal languages such as English. One primary challenge arises from the tonal nature of Mandarin, which limits the effectiveness of conventional sequence-to-sequence modeling approaches. To alleviate this issue, existing Chinese VSR systems commonly incorporate intermediate representations, most notably pinyin, within cascade architectures to enhance recognition accuracy. While beneficial, in these cascaded designs, the subsequent stage during inference depends on the output of the preceding stage, leading to error accumulation and increased inference latency. To address these limitations, we propose a cascade-free architecture based on multitask learning that jointly integrates multiple intermediate representations, including phoneme and viseme, to better exploit contextual information. The proposed semantic-guided local contrastive loss temporally aligns the features, enabling on-demand activation during inference, thereby providing a trade-off between inference efficiency and performance while mitigating error accumulation caused by projection and re-embedding. Experiments conducted on publicly available datasets demonstrate that our method achieves superior recognition performance.

Keywords: Mandarin Visual Speech Recognition, Cascade-Free Architecture, Multitask Learning, Semantic-Guided Local Contrastive Loss, Phoneme, Viseme, Error Accumulation, Inference Efficiency

235. ❌ Anatomical Token Uncertainty for Transformer-Guided Active MRI Acquisition

Authors: Lev Ayzenberg, Shady Abu-Hussein, Raja Giryes, Hayit Greenspan Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21806v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

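Scoring rationale: The paper focuses on active sampling and reconstruction for medical imaging (MRI), an application of AI in science and biomedicine. It is highly relevant only to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), because it applies AI directly to medical image analysis. It has some relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because it mentions using a pretrained medical image tokenizer. The other keywords mainly concern large language model (LLM) techniques, reasoning, alignment, optimization, and so on, which are unrelated to the paper's computer vision and medical imaging focus, so they receive 0 points.

!!! tip deepseek-chat TL;DR

This paper proposes an active sampling framework built on a pretrained medical image tokenizer and a latent transformer, using token entropy to guide MRI undersampling, and achieves accelerated reconstruction that outperforms existing methods on the fastMRI datasets.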

Abstract (Chinese translation)

磁共振成像（MRI）的全数据采集本质上是缓慢的，这限制了临床通量并增加了患者的不适。压缩感知磁共振成像（CS-MRI）旨在通过从欠采样的k空间数据重建图像来加速采集，这需要同时优化采样轨迹和高保真度的重建模型。在本研究中，我们提出了一种新颖的主动采样框架，该框架利用预训练的医学图像分词器（tokenizer）和潜在变换器（latent transformer）的固有离散结构。通过量化视觉词元（visual tokens）的字典来表示解剖结构，该模型在潜在空间上提供了一个定义良好的概率分布。我们利用该分布，通过词元熵推导出一种基于原理的不确定性度量，以指导主动采样过程。我们引入了两种策略来利用这种潜在不确定性：（1）潜在熵选择（Latent Entropy Selection, LES），将逐块（patch-wise）的词元熵投影到k空间域，以识别信息丰富的采样线；（2）基于梯度的熵优化（Gradient-based Entropy Optimization, GEO），它通过总潜在熵损失在k空间的梯度来识别不确定性减少最大的区域。我们在fastMRI单线圈膝关节和大脑数据集上，以×8和×16的加速倍率评估了我们的框架。结果表明，我们的主动策略在感知指标和基于特征的距离上均优于最先进的基线方法。我们的代码可在 https://github.com/levayz/TRUST-MRI 获取。

Abstract

Full data acquisition in MRI is inherently slow, which limits clinical throughput and increases patient discomfort. Compressed Sensing MRI (CS-MRI) seeks to accelerate acquisition by reconstructing images from under-sampled k-space data, requiring both an optimal sampling trajectory and a high-fidelity reconstruction model. In this work, we propose a novel active sampling framework that leverages the inherent discrete structure of a pretrained medical image tokenizer and a latent transformer. By representing anatomy through a dictionary of quantized visual tokens, the model provides a well-defined probability distribution over the latent space. We utilize this distribution to derive a principled uncertainty measure via token entropy, which guides the active sampling process. We introduce two strategies to exploit this latent uncertainty: (1) Latent Entropy Selection (LES), projecting patch-wise token entropy into the $k$-space domain to identify informative sampling lines, and (2) Gradient-based Entropy Optimization (GEO), which identifies regions of maximum uncertainty reduction via the $k$-space gradient of a total latent entropy loss. We evaluate our framework on the fastMRI singlecoil Knee and Brain datasets at $\times 8$ and $\times 16$ acceleration. Our results demonstrate that our active policies outperform state-of-the-art baselines in perceptual metrics and feature-based distances. Our code is available at https://github.com/levayz/TRUST-MRI.

Keywords: Active MRI acquisition, Compressed Sensing MRI, Transformer, Medical image tokenizer, Latent uncertainty, Token entropy, k-space sampling, Image reconstruction
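The token-entropy uncertainty measure mentioned in the abstract above can be illustrated with a small sketch. It computes the Shannon entropy of each patch's categorical distribution over codebook tokens and ranks patches by uncertainty. The function names and the toy distributions are assumptions for illustration only; the paper's LES and GEO strategies additionally map this uncertainty into k-space, which is not reproduced here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution over codebook tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_uncertain_patches(patch_distributions, top_n):
    """Rank patches by token entropy (highest first) and return the top_n indices
    together with all per-patch entropies."""
    entropies = [token_entropy(p) for p in patch_distributions]
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return order[:top_n], entropies

# Toy per-patch token distributions over a 4-entry codebook (assumed values).
patch_distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform prediction, maximum entropy (ln 4)
    [0.70, 0.10, 0.10, 0.10],  # intermediate
]

selected, entropies = most_uncertain_patches(patch_distributions, top_n=2)
print(selected, entropies)
```

High-entropy patches mark where the latent model is least sure of the anatomy, which is the signal an active policy would use to decide what to measure next.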

236. ❌ Timing In stand-up Comedy: Text, Audio, Laughter, Kinesics (TIC-TALK): Pipeline and Database for the Multimodal Study of Comedic Timing

Authors: Yaelle Zribi, Florian Cafiero, Vincent Lépinay, Chahan Vidal-Gorène Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21803v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

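Scoring rationale: This paper studies multimodal analysis of stand-up comedy performance. It builds its dataset and analysis pipeline from existing models (BERTopic for topic segmentation, Whisper-AT for laughter detection, YOLOv8 for pose estimation, and others), but it involves no innovation in large models or deep learning technical principles and does not apply AI to scientific fields such as biomedicine, so it has no direct connection to any of the scoring keywords.

!!! tip deepseek-chat TL;DR

This paper builds the TIC-TALK multimodal dataset and analysis pipeline for studying the temporal relationships among language, gesture, and audience response in stand-up comedy, and finds patterns such as a negative correlation between kinetic energy and laughter rate and more laughter for personal topics.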

Abstract (Chinese translation)

单口喜剧及广义的幽默研究常聚焦于其语言内容。然而，现场表演同样依赖于表演者的身体呈现与观众反馈。本文介绍TIC-TALK——一个多模态资源库，包含5,400多个时间对齐的主题片段，涵盖90部专业拍摄的单口喜剧专场（2015-2024年）中的语言、姿态与观众反应。该处理流程结合了以下技术：采用BERTopic模型与密集句子嵌入对60秒片段进行主题分割；利用Whisper-AT进行0.8秒间隔的笑声检测；通过微调的YOLOv8-cls镜头分类器识别镜头类型；并以每秒1帧的频率运用YOLOv8s-pose提取原始人体关键点。我们保留了未经预聚类的原始17关节骨骼坐标，从而能够计算连续的运动学信号——手臂伸展幅度、动能和躯干倾斜度——这些信号可作为表演动态的代理指标。所有数据流通过分层时间包含关系进行对齐（无需重采样），每个主题片段均存储其句子-BERT嵌入向量，以支持下游相似度计算与聚类任务。作为具体应用案例，我们研究了24个主题类别中的笑声动态：动能与观众笑声率呈负相关（r = -0.75, N = 24），这与“抖包袱前的静止”模式相符；个人与身体主题比地缘政治主题引发更多笑声；特写镜头比例与笑声呈正相关（r = +0.28），符合反应性蒙太奇规律。

Abstract

Stand-up comedy, and humor in general, are often studied through their verbal content. Yet live performance relies just as much on embodied presence and audience feedback. We introduce TIC-TALK, a multimodal resource with 5,400+ temporally aligned topic segments capturing language, gesture, and audience response across 90 professionally filmed stand-up comedy specials (2015-2024). The pipeline combines BERTopic for 60 s thematic segmentation with dense sentence embeddings, Whisper-AT for 0.8 s laughter detection, a fine-tuned YOLOv8-cls shot classifier, and YOLOv8s-pose for raw keypoint extraction at 1 fps. Raw 17-joint skeletal coordinates are retained without prior clustering, enabling the computation of continuous kinematic signals (arm spread, kinetic energy, and trunk lean) that serve as proxies for performance dynamics. All streams are aligned by hierarchical temporal containment without resampling, and each topic segment stores its sentence-BERT embedding for downstream similarity and clustering tasks. As a concrete use case, we study laughter dynamics across 24 thematic topics: kinetic energy negatively predicts audience laughter rate (r = -0.75, N = 24), consistent with a stillness-before-punchline pattern; personal and bodily content elicits more laughter than geopolitical themes; and shot close-up proportion correlates positively with laughter (r = +0.28), consistent with reactive montage.

Keywords: stand-up comedy, multimodal analysis, laughter detection, kinematic signals, BERTopic, YOLOv8, temporal alignment, performance dynamics
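The kinematic analysis summarized in the abstract above (a kinetic-energy signal computed from pose keypoints, then correlated with laughter rate) can be sketched as follows. This is an illustrative approximation under stated assumptions: unit mass per joint, 2-D keypoints, a simple finite-difference velocity, and made-up toy numbers. The paper's exact definition of kinetic energy and its per-topic aggregation are not given in the abstract.

```python
import math

def kinetic_energy(frames, dt=1.0):
    """Kinetic-energy proxy per frame transition: 0.5 * sum of squared joint
    speeds, assuming unit mass per joint and frames sampled every `dt` seconds."""
    energies = []
    for prev, curr in zip(frames, frames[1:]):
        energy = 0.0
        for (x0, y0), (x1, y1) in zip(prev, curr):
            vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
            energy += 0.5 * (vx * vx + vy * vy)
        energies.append(energy)
    return energies

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy skeleton with two joints over three frames at 1 fps (assumed values).
frames = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(1.0, 0.0), (1.0, 1.0)],
    [(1.0, 0.0), (1.0, 3.0)],
]
energies = kinetic_energy(frames)

# Toy per-topic aggregates: more movement paired with less laughter.
topic_energy = [1.0, 2.0, 3.0]
topic_laughter_rate = [6.0, 4.0, 2.0]
r = pearson_r(topic_energy, topic_laughter_rate)
print(energies, r)
```

The toy aggregates are perfectly anti-correlated, so `r` comes out close to -1; the reported r = -0.75 over 24 topics reflects a weaker but still strong negative relationship.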

237. ❌ Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTEvent

Authors: Lokeshwaran Manohar, Moritz Roidl Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21787v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

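Scoring rationale: The paper studies multi-class object detection with event cameras in industrial settings. It belongs to computer vision rather than to innovation in large language models or deep learning technical principles. It is weakly relevant to only two keywords: 1) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because the paper discusses how GEN1 and PEDRo pretrained initialization affects performance; 2) 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because industrial robotics can be viewed as an application of AI in science and engineering. The other keywords concern large language models, reasoning, alignment, compression, and similar techniques and are unrelated, so they receive 0 points.

!!! tip deepseek-chat TL;DR

This paper studies recurrent event-camera object detection models on industrial multi-class recognition. Its benchmark shows that the recurrent architecture yields a 9.6% relative improvement in mAP50 and that suitable pretrained initialization further improves detection accuracy.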

Abstract (Chinese translation)

事件相机因其高时间分辨率、高动态范围及低运动模糊特性，在工业机器人领域具有显著吸引力。然而，当前多数基于事件的目标检测研究集中于户外驾驶场景或有限类别设置。本研究以MTEvent数据集为基准，评估循环式ReYOLOv8s在工业多类别识别任务中的性能，并采用非循环YOLOv8s变体作为基线以分析时序记忆机制的影响。在MTEvent验证集上，最优的从头训练循环模型（C21）达到0.285 mAP50，相较于非循环YOLOv8s基线（0.260）实现了9.6%的相对提升。事件域预训练表现出更强效果：基于GEN1初始化的微调在片段长度21时取得最佳综合结果（0.329 mAP50），且与从头训练不同，GEN1预训练模型性能随片段长度增加持续改善。而PEDRo初始化结果下降至0.251，表明源域不匹配的预训练可能劣于从头训练。主要的持续错误模式源于类别不平衡及人-物交互场景。总体而言，本研究定位为针对工业环境中基于循环事件检测的专项基准测试与分析探索。

Abstract

Event cameras are attractive for industrial robotics because they provide high temporal resolution, high dynamic range, and reduced motion blur. However, most event-based object detection studies focus on outdoor driving scenarios or limited class settings. In this work, we benchmark recurrent ReYOLOv8s on MTEvent for industrial multi-class recognition and use a non-recurrent YOLOv8s variant as a baseline to analyze the effect of temporal memory. On the MTEvent validation split, the best scratch recurrent model (C21) reaches 0.285 mAP50, corresponding to a 9.6% relative improvement over the nonrecurrent YOLOv8s baseline (0.260). Event-domain pretraining has a stronger effect: GEN1-initialized fine-tuning yields the best overall result of 0.329 mAP50 at clip length 21, and unlike scratch training, GEN1-pretrained models improve consistently with clip length. PEDRo initialization drops to 0.251, indicating that mismatched source-domain pretraining can be less effective than training from scratch. Persistent failure modes are dominated by class imbalance and human-object interaction. Overall, we position this work as a focused benchmarking and analysis study of recurrent event-based detection in industrial environments.

Keywords: event cameras, object detection, industrial robotics, recurrent models, YOLOv8, pretraining, multi-class recognition, benchmarking
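The 9.6% figure in the abstract above is a relative (not absolute) mAP50 improvement, and the short calculation below reproduces it from the reported numbers. The helper name `relative_change` is an assumption for illustration; the mAP50 values are taken from the abstract.

```python
def relative_change(new, baseline):
    """Relative change of `new` with respect to `baseline` (0.10 means +10%)."""
    return (new - baseline) / baseline

# mAP50 values reported in the abstract.
nonrecurrent_baseline = 0.260   # YOLOv8s, no temporal memory
recurrent_scratch = 0.285       # best scratch recurrent model (C21)
pedro_initialized = 0.251       # PEDRo-initialized model

# Recurrent scratch model vs. non-recurrent baseline: about +9.6%.
gain = relative_change(recurrent_scratch, nonrecurrent_baseline)
print(f"recurrent vs. baseline: {gain * 100:.1f}%")

# PEDRo initialization vs. scratch recurrent training: negative, i.e. worse.
pedro_vs_scratch = relative_change(pedro_initialized, recurrent_scratch)
print(f"PEDRo vs. scratch: {pedro_vs_scratch * 100:.1f}%")
```

In absolute terms the recurrent gain is 0.025 mAP50, which is worth keeping in mind when comparing it with the absolute numbers quoted for the pretrained variants.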

238. ❌ The Universal Normal Embedding

Authors: Chen Tasker, Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21786v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

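Scoring rationale: The paper studies the latent-space geometry of generative models (diffusion models) and encoders (such as CLIP and DINO) in computer vision, proposes the Universal Normal Embedding (UNE) hypothesis, and validates its semantic alignment and controllable editing capabilities. All scoring keywords relate to large language models (LLMs), whereas this paper focuses on vision models (diffusion models and vision encoders) and does not involve any LLM techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper studies the latent-space Gaussianity shared by generative models and vision encoders, proposes the Universal Normal Embedding (UNE) hypothesis, and shows experimentally that diffusion noise and encoder representations are semantically aligned and enable controllable image editing.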

Abstract (Chinese translation)

生成模型与视觉编码器长期以来基本沿着各自独立的路径发展，分别针对不同目标进行优化，并基于不同的数学原理。然而，它们共享一个基本特性：潜在空间的高斯性。生成模型将高斯噪声映射为图像，而编码器则将图像映射为语义嵌入，其坐标在经验上表现出高斯特性。我们假设二者皆为一个共享潜在源——即“通用正态嵌入”（Universal Normal Embedding, UNE）的不同视图：这是一个近似高斯的潜在空间，编码器嵌入和DDIM反转噪声均可视为其带噪声的线性投影。为验证此假设，我们引入了NoiseZoo数据集，其中包含每张图像对应的潜在表示，涵盖DDIM反转的扩散噪声以及匹配的编码器表示（如CLIP、DINO）。在CelebA数据集上，两个空间中的线性探针均能实现强大且一致的属性预测，表明生成噪声沿着线性方向编码了有意义的语义信息。这些方向进一步支持无需修改模型架构即可实现忠实、可控的图像编辑（例如笑容、性别、年龄），其中简单的正交化处理可缓解虚假纠缠问题。综上所述，我们的结果为UNE假设提供了实证支持，并揭示了一种共享的类高斯潜在几何结构，从而具体地连接了编码与生成过程。代码与数据详见 https://rbetser.github.io/UNE/。

Abstract

Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/

Keywords: Universal Normal Embedding, generative models, vision encoders, latent space Gaussianity, DDIM-inverted noise, CLIP, DINO, controllable edits
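The "simple orthogonalization" mentioned in the abstract above, used to reduce spurious entanglement between attribute directions before editing a latent, can be illustrated with a single Gram-Schmidt projection step. This is a sketch under assumptions: the direction vectors, the latent, the function names, and the choice of one projection step are illustrative, and the paper's exact procedure (with directions taken from linear probes on DDIM-inverted noise) may differ.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(direction, nuisance):
    """Remove from `direction` its component along `nuisance`
    (one Gram-Schmidt projection step)."""
    scale = dot(direction, nuisance) / dot(nuisance, nuisance)
    return [d - scale * n for d, n in zip(direction, nuisance)]

def edit_latent(latent, direction, strength):
    """Move a latent vector along an attribute direction."""
    return [z + strength * d for z, d in zip(latent, direction)]

# Toy 3-D example (assumed): the "smile" direction leaks into the "gender" axis.
smile_direction = [1.0, 1.0, 0.0]
gender_direction = [0.0, 1.0, 0.0]

clean_smile = orthogonalize(smile_direction, gender_direction)
latent = [0.5, 0.5, 0.5]
edited = edit_latent(latent, clean_smile, strength=2.0)
print(clean_smile, edited)
```

After projection, moving along `clean_smile` leaves the latent's component along `gender_direction` unchanged, which is the sense in which the edit is disentangled in this toy setting.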

239. ❌ Image-Conditioned Adaptive Parameter Tuning for Visual Odometry Frontends

Authors: Simone Nascivera, Leonard Bauersfeld, Jeff Delaune, Davide Scaramuzza Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21785v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

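Scoring rationale: The paper focuses on adaptive parameter tuning for visual odometry (VO) frontends in computer vision and robotics, using a reinforcement learning framework and a lightweight CNN encoder. All keywords concern large language models (LLMs) and related techniques (such as MoE, SFT, RAG, and quantization), AI for Science (bioinformatics, cheminformatics), or specific reasoning methods (such as CoT and MCTS), and have no direct connection to the paper's core topics of visual odometry, reinforcement learning, and robot navigation, so every keyword receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

This paper proposes an image-conditioned reinforcement learning framework for online adaptive tuning of visual odometry frontend parameters. A lightweight texture-aware CNN encoder maps visual input to parameters, and experiments show 3x longer feature tracks and 3x lower computational cost despite training entirely in simulation.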

Abstract (Chinese translation)

资源受限的自主机器人依赖于稀疏直接与半直接视觉（惯性）里程计（VO）流程，因其在精度、鲁棒性和计算成本之间提供了良好的平衡。然而，大多数系统的性能关键取决于手动调整的超参数，这些参数控制着特征检测、跟踪和异常值剔除。这些参数通常在部署期间固定不变，尽管其最优值会随场景特性（如纹理密度、光照、运动模糊和传感器噪声）而变化，导致其在真实环境中性能脆弱。我们提出了首个基于图像条件的强化学习框架，用于在线调整VO前端参数，从而将专家知识有效嵌入系统。我们的核心思想是将前端配置建模为一个序列决策问题，并学习一个直接将视觉输入映射到特征检测与跟踪参数的策略。该策略在训练中使用轻量级的纹理感知CNN编码器和特权评论家。与先前仅依赖内部VO统计数据的基于强化学习的方法不同，我们的方法观察图像内容，并在跟踪质量下降前主动调整参数。在TartanAirV2和TUM RGB-D数据集上的实验表明，尽管完全在仿真环境中训练，该方法仍实现了特征跟踪长度延长3倍，计算成本降低3倍。

Abstract

Resource-constrained autonomous robots rely on sparse direct and semi-direct visual-(inertial)-odometry (VO) pipelines, as they provide a favorable tradeoff between accuracy, robustness, and computational cost. However, the performance of most systems depends critically on hand-tuned hyperparameters governing feature detection, tracking, and outlier rejection. These parameters are typically fixed during deployment, even though their optimal values vary with scene characteristics such as texture density, illumination, motion blur, and sensor noise, leading to brittle performance in real-world environments. We propose the first image-conditioned reinforcement learning framework for online tuning of VO frontend parameters, effectively embedding the expert into the system. Our key idea is to formulate the frontend configuration as a sequential decision-making problem and learn a policy that directly maps visual input to feature detection and tracking parameters. The policy uses a lightweight texture-aware CNN encoder and a privileged critic during training. Unlike prior RL-based approaches that rely solely on internal VO statistics, our method observes the image content and proactively adapts parameters before tracking degrades. Experiments on TartanAirV2 and TUM RGB-D show 3x longer feature tracks and 3x lower computational cost, despite training entirely in simulation.

Keywords: visual odometry, reinforcement learning, parameter tuning, feature detection, CNN encoder, autonomous robots, online adaptation, computational efficiency

240. ❌ Dynamic Exposure Burst Image Restoration

Authors: Woohyeok Kim, Jaesung Rim, Daeyeon Kim, Sunghyun Cho Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21784v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

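Scoring rationale: This paper focuses on dynamic-exposure burst image restoration in computer vision. It proposes a new DEBIR pipeline and the BAENet network, which optimizes exposure time settings to improve restoration quality. The content centers entirely on image processing, camera systems, and deep learning for vision tasks, and does not touch on large language models, innovations in deep learning technical principles, or AI for Science. All scoring keywords relate to large models, deep learning technical principles, or scientific AI applications, whereas this paper belongs to traditional computer vision, so every keyword receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

This paper proposes DEBIR, a dynamic-exposure burst image restoration method in which the BAENet network predicts optimal exposure times, significantly improving burst restoration quality, and validates its effectiveness on a real camera system.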

Abstract (Chinese translation)

突发图像复原旨在从突发图像序列中重建高质量图像，这些图像通常采用手动设计的曝光设置进行采集。尽管曝光设置显著影响最终复原性能，但寻找最优曝光设置的问题长期被忽视。本文提出动态曝光突发图像复原（DEBIR），这是一种新型突发图像复原流程，通过动态预测适应拍摄环境的曝光时间来提升复原质量。在我们的流程中，突发自动曝光网络（BAENet）基于预览图像、运动幅度与增益参数，为每帧突发图像估算最优曝光时间。随后，突发图像复原网络利用这些最优曝光时间采集的突发图像序列重建高质量图像。为训练该模型，我们引入了可微分突发模拟器与三阶段训练策略。实验表明，我们的流程实现了最先进的复原质量。此外，我们在真实相机系统上验证了该方法的有效性，证明了其实际应用价值。

Abstract

Burst image restoration aims to reconstruct a high-quality image from burst images, which are typically captured using manually designed exposure settings. Although these exposure settings significantly influence the final restoration performance, the problem of finding optimal exposure settings has been overlooked. In this paper, we present Dynamic Exposure Burst Image Restoration (DEBIR), a novel burst image restoration pipeline that enhances restoration quality by dynamically predicting exposure times tailored to the shooting environment. In our pipeline, Burst Auto-Exposure Network (BAENet) estimates the optimal exposure time for each burst image based on a preview image, as well as motion magnitude and gain. Subsequently, a burst image restoration network reconstructs a high-quality image from burst images captured using these optimal exposure times. For training, we introduce a differentiable burst simulator and a three-stage training strategy. Our experiments demonstrate that our pipeline achieves state-of-the-art restoration quality. Furthermore, we validate the effectiveness of our approach on a real-world camera system, demonstrating its practicality.

Keywords: burst image restoration, dynamic exposure, exposure time optimization, BAENet, burst auto-exposure, differentiable burst simulator, image quality enhancement, camera system

241. ❌ SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis

Authors: Bingxuan Zhao, Qing Zhou, Chuang Yang, Qi Wang Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21783v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

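Scoring rationale: The paper focuses on remote sensing image synthesis using Diffusion Transformers (DiTs) and the FLUX model, placing it in computer vision and generative AI. It is unrelated to most large language model (LLM) keywords, but it involves Domain Adaptation and fine-tuning, so those two keywords receive 5 points each. Remote sensing is also a scientific application related to 'AI for Science', which receives 8 points. Other keywords such as LLMs, MoE, Scaling Laws, RLHF, RAG, and Agents are not addressed and receive 0.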
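
The arithmetic behind the table can be sketched as follows. This is a minimal sketch rather than the report generator's actual code: it assumes each keyword's score is weight × relevance, capped at the per-keyword maximum of 15, and it uses the pass-mark formula from the report configuration (5.0 + 0.8 × total weight). The names `keyword_score` and `pass_mark` are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap listed in the report configuration


def keyword_score(weight, relevance):
    """Assumed rule (not confirmed by the report): weight x relevance, capped."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def pass_mark(total_weight):
    """Pass-mark formula stated in the report configuration."""
    return round(5.0 + 0.8 * total_weight, 1)


# Non-zero-relevance rows of the table above: (weight, relevance out of 10).
rows = [(0.0, 5.0), (0.0, 5.0), (0.0, 8.0)]
total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_mark(27.0)  # 27 keywords, each configured with weight 1.0

print(total, threshold, total >= threshold)
```

Under this assumption, a weight of 0.0 forces a score of 0.0 regardless of relevance, which matches the Score column of the table.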

!!! tip deepseek-chat TL;DR

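The paper addresses quality problems in remote sensing image synthesis caused by the lack of a domain-specific prior and the cost of high-resolution training. It fine-tunes FLUX to build a remote sensing prior (RS-FLUX) and proposes SHARP, which achieves training-free resolution promotion and high-quality multi-scale generation.

Abstract (Chinese translation)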

基于扩散变换器（Diffusion Transformers, DiTs）的文生图技术已取得显著进展，但遥感（Remote Sensing, RS）图像合成领域仍相对滞后，主要受限于两大障碍：缺乏领域专用的DiT先验模型，以及在遥感应用所需的高分辨率下训练成本过高。通过旋转位置编码（Rotary Position Embedding, RoPE）重缩放实现免训练的分辨率提升提供了一种实用解决方案，但现有方法均在去噪全过程采用静态的位置缩放规则。这种均匀压缩对遥感影像尤为不利，因为其显著更密集的中高频能量编码了对航拍场景真实性至关重要的精细结构，例如车辆、建筑轮廓和道路标记。要应对这两大挑战，需要结合领域专用的生成先验与一种感知去噪过程的位置自适应策略。为此，我们在超过10万张精选遥感图像上对FLUX模型进行微调，构建了强大的领域先验模型（RS-FLUX），并提出了一种免训练方法——频谱感知高动态分辨率提升适配器（Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion, SHARP）。该方法将一个有理分数时间调度函数k_rs(t)引入RoPE中：在早期布局形成阶段施加强位置提升，并在细节恢复阶段逐步放松，从而使外推强度与扩散去噪过程频率渐进特性保持一致。其分辨率无关的公式化设计进一步实现了使用单一组超参数进行鲁棒的多尺度生成。在六种正方形与矩形分辨率上的大量实验表明，SHARP在CLIP分数、美学分数和HPSv2指标上均持续优于所有免训练基线方法，且在外推因子更激进时优势差距扩大，同时计算开销可忽略不计。代码与模型权重已发布于https://github.com/bxuanz/SHARP。

Abstract

Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.

Keywords: Remote Sensing Synthesis, Diffusion Transformers, Resolution Promotion, Rotary Position Embedding, Domain Adaptation, Fine-tuning, Training-free Method, Multi-scale Generation
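
The idea of a time-dependent RoPE scaling factor can be illustrated with a small sketch. The abstract does not give the form of k_rs(t), so `k_schedule` below is a hypothetical rational function chosen only to reproduce the described shape: the full extrapolation factor `s` at the start of denoising (t = 1) that relaxes to 1 at the end (t = 0). Dividing the position index by the factor is likewise an assumption about how the rescaling is applied, and the constant `c` is illustrative.

```python
def k_schedule(t, s, c=0.5):
    """Hypothetical rational schedule: returns s at t=1 (pure noise), 1 at t=0."""
    return 1.0 + (s - 1.0) * t / (t + c * (1.0 - t))


def rope_angles(position, dim, k=1.0, base=10000.0):
    """RoPE rotation angles for one position, with the index divided by k."""
    return [(position / k) / (base ** (2 * i / dim)) for i in range(dim // 2)]


# Early step: position 8 on a 2x-extrapolated grid is mapped back to position 4.
early = rope_angles(8, dim=4, k=k_schedule(1.0, s=2.0))
# Final step: the factor has relaxed to 1, so positions are used unscaled.
late = rope_angles(8, dim=4, k=k_schedule(0.0, s=2.0))
```

A static rule would use the same `k` at every denoising step; the schedule makes `k` a function of the step instead.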

242. ❌ Getting to the Point: Why Pointing Improves LVLMs

Authors: Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21746v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

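Scoring rationale: The paper studies the pointing mechanism in Large Vision-Language Models (LVLMs), which falls under large-model applications. Core relevant keywords: 1) 'Post-training OR Supervised Fine-tuning OR SFT' (10 points): the paper explicitly uses fine-tuning; 2) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points): pointing decomposes grounding and reasoning into explicit sequential steps, a form of multi-step reasoning; 3) 'Mechanistic Interpretability OR Explainable AI' (10 points): the paper investigates how pointing improves model performance, which is interpretability research; 4) 'Large Language Models OR LLMs OR Foundation Models' (8 points): LVLMs are a kind of large model; 5) 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8 points): pointing encourages a more deliberate reasoning process. The remaining keywords are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper studies how pointing in Large Vision-Language Models (predicting object coordinates before counting) improves accuracy and generalization on zero-shot counting through an explicit multi-step process, and shows that the spatial information encoded in the coordinates is the key mechanism behind the gains.

Abstract (Chinese translation)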

指向机制通过将视觉定位与推理建模为显式的序列化步骤，提升了大规模视觉语言模型（LVLMs）的准确性与可解释性。该机制首先通过预测自然语言查询中提及物体的坐标来实现视觉定位，随后基于这些坐标点生成答案。尽管已有研究表明指向机制能提高LVLMs的准确性，但其背后的作用机制及其在认知任务中的关联性尚不明确。此外，中间坐标点的可靠性仍缺乏深入研究，这限制了其作为视觉解释工具的应用。本研究以一项认知任务——视觉场景中的零样本计数——为切入点，探究指向机制的作用。我们采用两种方法对前沿LVLMs进行微调：直接计数法（模型仅预测物体总数）与先指向后计数法（LVLMs先生成目标物体的坐标，再进行计数）。实验结果表明，先指向后计数法具有更强的分布外泛化能力，说明坐标信息有助于LVLMs学习通用技能而非局限于特定任务的过拟合。尽管在超过89%的情况下（以F1分数衡量）预测坐标能准确对应图像中的位置，但不同图像区域的性能存在差异，揭示了模型的空间偏差。最终的机制分析表明，计数性能的提升源于坐标所编码的空间信息。

Abstract

Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs’ accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects’ coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.

Keywords: Large Vision-Language Models, pointing mechanism, zero-shot counting, fine-tuning, multi-step reasoning, spatial information, model interpretability, visual grounding
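
A Point-then-Count output can be scored in two ways: by the count it implies and by how well the intermediate points are grounded. The sketch below is a minimal illustration under stated assumptions, not the paper's evaluation protocol: a predicted point counts as grounded if it falls inside a ground-truth box that no earlier point has claimed, and matching is greedy. The function names and the box format `(x0, y0, x1, y1)` are hypothetical.

```python
def count_from_points(points):
    """Point-then-Count: the answer is the number of predicted points."""
    return len(points)


def point_f1(points, boxes):
    """F1 of predicted points against ground-truth boxes (greedy matching)."""
    matched = set()
    true_positives = 0
    for x, y in points:
        for idx, (x0, y0, x1, y1) in enumerate(boxes):
            if idx not in matched and x0 <= x <= x1 and y0 <= y <= y1:
                matched.add(idx)
                true_positives += 1
                break
    precision = true_positives / len(points) if points else 0.0
    recall = true_positives / len(boxes) if boxes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


predicted = [(1, 1), (5, 5), (9, 9)]       # third point hits nothing
ground_truth = [(0, 0, 2, 2), (4, 4, 6, 6)]
print(count_from_points(predicted), point_f1(predicted, ground_truth))
```

Here the model overcounts by one, and the grounding score drops accordingly, which is the kind of signal Direct Counting cannot provide.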

243. ❌ PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and Paraganglioma

Authors: Zelin Liu, Xiangfu Yu, Jie Huang, Ge Wang, Yizhe Yuan, Zhenyu Yi, Jing Xie, Haotian Jiang, Lichi Zhang Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21700v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

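Scoring rationale: PPGL-Swarm is a multi-agent diagnostic system for rare neuroendocrine tumors (PPGL), an application of AI to biomedicine (Bioinformatics), so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). The system uses a multi-agent architecture for task decomposition and coordination, giving it some relevance to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' and 'Multi-agent Systems OR Agent Coordination' (5 points each), but it does not explicitly use large language models (LLMs) or introduce innovations in deep learning principles. The other keywords mainly concern large-model technical details (training methods, inference optimization, alignment, and so on) that the paper does not address, so they receive 0.

!!! tip deepseek-chat TL;DR

The paper targets three problems in diagnosing pheochromocytoma and paraganglioma (PPGL): GAPP scoring is labor-intensive, it relies on subjective criteria, and it ignores genetic risk factors. It proposes PPGL-Swarm, a multi-agent diagnostic system that automatically generates a comprehensive report with GAPP scoring, genotype risk alerts, and multimodal evidence, and uses reinforcement learning to refine tool selection and task assignment.

Abstract (Chinese translation)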

嗜铬细胞瘤和副神经节瘤（PPGLs）是罕见的神经内分泌肿瘤，其中15-25%会发展为转移性疾病，据报道其5年生存率低至34%。PPGL可能提示存在遗传综合征，需要更严格、针对特定综合征的治疗和监测，但临床医生在常规诊疗中常未能识别这些关联。临床实践中使用GAPP评分对PPGL进行分级，但PPGL诊断仍存在若干局限：（1）GAPP评分要求临床医生投入大量工作量，因其需人工评估六个独立指标；（2）细胞密度和Ki-67等关键指标常依赖主观标准评估；（3）GAPP未能涵盖若干临床相关的转移风险因素，例如SDHB基因突变（其报告的转移率高达35-75%）。智能体驱动的诊断系统前景广阔，但多数系统缺乏可追溯的决策推理过程，且未整合PPGL基因型信息等专业领域知识。为应对这些局限，我们提出PPGL-Swarm——一个智能体化的PPGL诊断系统，可生成包含自动化GAPP评分（含量化细胞密度与Ki-67）、基因型风险预警及多模态证据整合报告在内的综合诊断报告。该系统通过将诊断分解为微任务并分配给专业智能体，形成可审计的推理路径。其中基因智能体与表格智能体采用知识增强技术以优化基因型和实验室结果的解读，并在训练阶段运用强化学习优化工具选择与任务分配机制。

Abstract

Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors, of which 15-25% develop metastatic disease with 5-year survival rates reported as low as 34%. PPGL may indicate hereditary syndromes requiring stricter, syndrome-specific treatment and surveillance, but clinicians often fail to recognize these associations in routine care. Clinical practice uses GAPP score for PPGL grading, but several limitations remain for PPGL diagnosis: (1) GAPP scoring demands a high workload for clinician because it requires the manual evaluation of six independent components; (2) key components such as cellularity and Ki-67 are often evaluated with subjective criteria; (3) several clinically relevant metastatic risk factors are not captured by GAPP, such as SDHB mutations, which have been associated with reported metastatic rates of 35-75%. Agent-driven diagnostic systems appear promising, but most lack traceable reasoning for decision-making and do not incorporate domain-specific knowledge such as PPGL genotype information. To address these limitations, we present PPGL-Swarm, an agentic PPGL diagnostic system that generates a comprehensive report, including automated GAPP scoring (with quantified cellularity and Ki-67), genotype risk alerts, and multimodal report with integrated evidence. The system provides an auditable reasoning trail by decomposing diagnosis into micro-tasks, each assigned to a specialized agent. The gene and table agents use knowledge enhancement to better interpret genotype and laboratory findings, and during training we use reinforcement learning to refine tool selection and task assignment.

Keywords: PPGL, diagnostic system, agentic, GAPP scoring, genotype risk, multimodal report, reinforcement learning, multi-agent

244. ❌ RefracGS: Novel View Synthesis Through Refractive Water Surfaces with 3D Gaussian Ray Tracing

Authors: Yiming Shao, Qiyu Dai, Chong Gao, Guanbin Li, Yeqiang Wang, He Sun, Qiong Zeng, Baoquan Chen, Wenzheng Chen Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21695v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

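Scoring rationale: The paper focuses on novel view synthesis (NVS) in computer vision, specifically rendering through refractive water surfaces. It uses representations such as a 3D Gaussian field and a neural height field, and develops a refraction-aware Gaussian ray tracing method. All scoring keywords relate directly to large language models (LLMs), deep learning principles, or AI applications in science, whereas this paper studies a specific rendering problem in computer graphics with no direct connection to them. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the severe artifacts that arise in novel view synthesis through refractive water surfaces, where rays no longer travel in straight lines. It proposes RefracGS, a framework that jointly optimizes the refractive surface and the underlying scene representation, achieving high-fidelity rendering and real-time performance.

Abstract (Chinese translation)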

通过非平面折射表面进行新视角合成（NVS）面临着根本性挑战，这主要源于严重且空间变化的光学畸变。尽管近期如神经辐射场（NeRF）和3D高斯泼溅（3DGS）等表征方法在新视角合成方面表现出色，但它们基于光线直线传播的假设在此类条件下失效，导致显著的伪影。为克服这一局限，我们提出了RefracGS框架，该框架能够联合重建折射水面及界面下的场景。我们的核心见解是将折射边界与目标物体显式解耦：折射表面通过一个捕捉波浪几何的神经高度场进行建模，而水下场景则表征为一个3D高斯场。我们提出了一种折射感知的高斯光线追踪方法，该方法利用斯涅尔定律精确计算非线性光线轨迹，高效渲染底层高斯场，同时将损失梯度反向传播至参数化的折射表面。通过对两种表征进行端到端的联合优化，我们的方法确保了高保真的新视角合成和视角一致的表面重建。在具有复杂波浪的合成场景和真实场景上的实验表明，RefracGS在视觉质量上优于先前的折射方法，同时实现了15倍的训练加速和200 FPS的实时渲染。RefracGS的项目页面位于 https://yimgshao.github.io/refracgs/。

Abstract

Novel view synthesis (NVS) through non-planar refractive surfaces presents fundamental challenges due to severe, spatially varying optical distortions. While recent representations like NeRF and 3D Gaussian Splatting (3DGS) excel at NVS, their assumption of straight-line ray propagation fails under these conditions, leading to significant artifacts. To overcome this limitation, we introduce RefracGS, a framework that jointly reconstructs the refractive water surface and the scene beneath the interface. Our key insight is to explicitly decouple the refractive boundary from the target objects: the refractive surface is modeled via a neural height field, capturing wave geometry, while the underlying scene is represented as a 3D Gaussian field. We formulate a refraction-aware Gaussian ray tracing approach that accurately computes non-linear ray trajectories using Snell’s law and efficiently renders the underlying Gaussian field while backpropagating the loss gradients to the parameterized refractive surface. Through end-to-end joint optimization of both representations, our method ensures high-fidelity NVS and view-consistent surface recovery. Experiments on both synthetic and real-world scenes with complex waves demonstrate that RefracGS outperforms prior refractive methods in visual quality, while achieving 15x faster training and real-time rendering at 200 FPS. The project page for RefracGS is available at https://yimgshao.github.io/refracgs/.

Keywords: Novel View Synthesis, Refractive Surfaces, 3D Gaussian Splatting, Ray Tracing, Neural Height Field, Snell's Law, Real-time Rendering, Joint Optimization
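
The ray-bending step that the abstract attributes to Snell's law can be sketched in a few lines. This is a generic vector form of Snell's law, not RefracGS's implementation: the function names are hypothetical, the surface is assumed to be a height field z = h(x, y) whose slopes are supplied by the caller, and 1.33 is the usual approximate refractive index of water.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def height_field_normal(dh_dx, dh_dy):
    """Upward unit normal of the surface z = h(x, y), given its slopes."""
    return normalize((-dh_dx, -dh_dy, 1.0))


def refract(direction, normal, n1=1.0, n2=1.33):
    """Refract a ray crossing from index n1 into index n2.

    `normal` must point toward the side the ray comes from.
    Returns None on total internal reflection.
    """
    d = normalize(direction)
    n = normalize(normal)
    eta = n1 / n2
    cos_i = -sum(a * b for a, b in zip(d, n))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # no transmitted ray
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * a + (eta * cos_i - cos_t) * b for a, b in zip(d, n))


# A ray entering flat water at 45 degrees bends toward the normal.
angle = math.pi / 4
bent = refract((math.sin(angle), 0.0, -math.cos(angle)), height_field_normal(0.0, 0.0))
```

Because the normal depends on the height field's slopes, a gradient of the rendering loss with respect to the refracted direction can flow back to the surface parameters, which is the coupling the abstract describes.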

245. ❌ PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing

Authors: Yuheng Ji, Yuyang Liu, Huajie Tan, Xuchuan Huang, Fanding Huang, Yijie Xu, Cheng Chi, Yuting Zhao, Huaihai Lyu, Peterson Co, Mingyu Cao, Qiongyu Zhang, Zhe Li, Enshen Zhou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, Xiaolong Zheng Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21669v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

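Scoring rationale: The paper focuses on evaluating robot policies, proposing a dense evaluation paradigm based on Process Reward Models (PRMs) that audits policy execution from trajectory videos. Its core contributions are a metric system for robotic evaluation (OPD) and a benchmark (RoboPulse); it does not touch on large language models, deep learning principles, model training methods (pre-training, fine-tuning, alignment), inference optimization, agent systems, or scientific AI applications. All keywords are unrelated to the paper, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

Robotic evaluation today relies mainly on binary success rates and ignores the quality of the execution process. The paper proposes PRM-as-a-Judge, a dense evaluation paradigm that uses Process Reward Models to estimate task progress from trajectory videos, together with the OPD metric system, and shows empirically that the approach exposes behavioral signatures and failure modes that conventional outcome metrics cannot detect.

Abstract (Chinese translation)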

当前机器人评估仍主要被二元成功率所主导，这种方法将丰富的执行过程压缩为单一结果，掩盖了进展、效率与稳定性等关键特性。为应对这一局限，我们提出“PRM即裁判”这一密集评估范式，其利用过程奖励模型通过观测序列估计任务进度，从而直接从轨迹视频中审核策略执行。该范式的核心是OPD（结果-过程-诊断）度量体系，其通过任务对齐的进度势能显式形式化执行质量。我们通过两个公理性特征来刻画密集机器人评估：宏观一致性（要求具备可加性与路径一致的聚合能力）与微观分辨率（要求对细粒度物理演化具有敏感性）。在此框架下，基于势能的PRM裁判为密集评估提供了自然的实例化方案，其诱导的标量势能直接满足宏观一致性要求。我们使用RoboPulse（一个专为探测微观尺度进度判别能力而设计的诊断基准）对微观分辨率特性进行了实证验证，其中多个基于轨迹训练的PRM裁判在性能上超越了基于判别相似性的方法与通用基础模型裁判。最后，依托“PRM即裁判”范式与OPD度量体系，我们对主流策略范式在长视野任务中进行了结构化审核，揭示了仅依赖结果度量的方法所无法观测的行为特征与失效模式。

Abstract

Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.

Keywords: robotic evaluation, dense evaluation, Process Reward Models, trajectory videos, task progress, OPD metric system, RoboPulse benchmark, policy auditing
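
The claim that macro-consistency follows from a scalar potential can be checked with a small sketch. It assumes the judge assigns each observation a progress potential and that per-step progress is the difference of consecutive potentials; the step sums then telescope, so the aggregate depends only on the endpoints and adds up across segments. The function names and the `regression_steps` diagnostic are hypothetical and are not the paper's OPD definitions.

```python
def step_progress(potentials):
    """Per-step progress: difference of consecutive potential values."""
    return [b - a for a, b in zip(potentials, potentials[1:])]


def total_progress(potentials):
    """Sum of step progress; telescopes to potentials[-1] - potentials[0]."""
    return sum(step_progress(potentials))


def regression_steps(potentials):
    """Hypothetical diagnostic: indices of steps where progress decreased."""
    return [i for i, delta in enumerate(step_progress(potentials)) if delta < 0]


# Potentials a judge might assign to five frames of one trajectory.
trajectory = [0.0, 0.25, 0.5, 0.25, 1.0]
print(total_progress(trajectory), regression_steps(trajectory))
```

A binary success rate would report this episode only as a success; the per-step view also shows the setback at step 2.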

246. ❌ HumanOmni-Speaker: Identifying Who said What and When

Authors: Detao Bai, Shimin Yao, Weixuan Chen, Xihan Wei, Zhiheng Ma Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21664v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

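Scoring rationale: The paper studies vision-language alignment of Omni-modal LLMs in complex multi-person conversations and proposes the VR-SDR paradigm and the HumanOmni-Speaker model. The core relevant keyword is 'Large Language Models OR LLMs OR Foundation Models' (8.0), because the paper explicitly targets limitations of Omni-modal LLMs. Other keywords such as MoE, SLMs, training methods, reasoning techniques, agent systems, and AI for Science are neither mentioned nor involved in the abstract and receive 0. The paper applies large models to a specific domain (multimodal conversation understanding) and is somewhat novel, but it does not explore other technical principles in depth.

!!! tip deepseek-chat TL;DR

Existing multimodal large language models struggle to identify "who said what and when" in complex multi-person conversations. The paper proposes the Visual-Registered Speaker Diarization and Recognition paradigm and the HumanOmni-Speaker model, which encodes inter-frame visual residuals to capture fine-grained lip movements and speaker trajectories, enabling end-to-end lip-reading and spatial localization and achieving superior performance on speaker-centric tasks.

Abstract (Chinese translation)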

尽管全模态大语言模型在联合感官处理方面取得了进展，但其在理解人类互动的基石——解析复杂的多人对话动态以准确回答“谁在何时说了什么”——方面仍存在根本性困难。现有模型普遍存在一种“能力幻觉”：它们利用传统基准测试中的视觉偏见来规避真正的跨模态对齐，同时依赖稀疏的低帧率视觉采样，这破坏了如唇部运动等关键的高频动态信息。为打破这种幻觉，我们引入了视觉注册说话人日志与识别（VR-SDR）范式及HumanOmni-Speaker基准。通过严格消除视觉捷径，这一严谨的范式要求仅使用自然语言查询实现真正的端到端时空身份绑定。为克服底层架构的感知差距，我们提出了由视觉增量编码器驱动的HumanOmni-Speaker模型。该模型以25帧/秒采样原始视频，并将帧间运动残差显式压缩至每帧仅6个令牌，从而在不引发灾难性令牌爆炸的前提下，捕获细粒度的视位单元和说话人轨迹。最终，HumanOmni-Speaker展现出强大的多模态协同能力，原生支持端到端唇语阅读和高精度空间定位而无需侵入式裁剪，并在广泛的以说话人为中心的任务中实现了卓越性能。

Abstract

While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.

Keywords: Omni-modal Large Language Models, Visual-Registered Speaker Diarization and Recognition, HumanOmni-Speaker, multimodal synergy, lip-reading, speaker trajectories, end-to-end spatio-temporal identity binding, visemes
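
The token budget implied by the abstract's figures is easy to work out: 25 frames per second at 6 tokens per frame is 150 visual tokens per second of video. The sketch below computes that budget and shows what an inter-frame residual is on toy data. It illustrates the quantities only and is not the Visual Delta Encoder; the function names are hypothetical, and frames are flattened to short lists of pixel values for brevity.

```python
def visual_token_count(duration_s, fps=25, tokens_per_frame=6):
    """Visual tokens for a clip, using the sampling figures from the abstract."""
    return int(duration_s * fps) * tokens_per_frame


def frame_residuals(frames):
    """Element-wise difference between each frame and the one before it."""
    return [
        [cur - prev for prev, cur in zip(previous, current)]
        for previous, current in zip(frames, frames[1:])
    ]


# Three toy "frames" of two pixels each; only the changed pixels are non-zero.
toy_frames = [[0, 0], [1, 0], [1, 2]]
print(visual_token_count(60), frame_residuals(toy_frames))
```

Residuals are mostly zero when little moves between frames, which is why they can plausibly be compressed into a small, fixed number of tokens.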

247. ❌ OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging

Authors: Meilin Liu, Jiaying Wang, Jing Shan Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21660v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文专注于联邦学习在异构医学影像分析中的应用，提出了OmniFM框架来解决模态和任务无关的协作学习问题。所有关键词均与大模型（LLMs）或深度学习技术原理直接相关，但论文内容完全不涉及大模型、语言模型、模型训练/对齐技术、推理优化、代理系统或模型压缩等主题。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于医学影像分析的AI应用（Bioinformatics相关领域），但并非核心创新点（核心是联邦学习框架），因此给予8分（有一定关联）。其他关键词均与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

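The paper addresses the tight coupling of federated learning frameworks to specific tasks and modalities in heterogeneous medical image analysis. It proposes OmniFM, a framework that uses a frequency-domain approach to enable modality- and task-agnostic collaborative learning, and reports outperforming existing baselines in experiments.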

Abstract (Chinese translation)

联邦学习（FL）已成为协作医学图像分析的一种前景广阔的模式，然而现有框架仍与特定任务的主干网络紧密耦合，且在异构成像模态下表现脆弱。这些限制阻碍了其在现实世界中的部署，因为不同机构的模态分布差异巨大，且必须支持多样化的下游任务。为解决这一局限，我们提出了OmniFM，一种模态与任务无关的联邦学习框架，它统一了分类、分割、超分辨率、视觉问答及多模态融合的训练过程，无需重新设计优化流程。OmniFM基于一个关键的频域洞见：低频频谱成分展现出强大的跨模态一致性，并编码了模态不变的解剖结构。因此，OmniFM整合了（i）全局频谱知识检索以注入全局频率先验，（ii）嵌入级交叉注意力融合以对齐表征，以及（iii）前缀-后缀频谱提示来共同调节全局与个性化线索，并通过频谱-近端对齐目标进行联合正则化以稳定聚合。在真实数据集上的实验表明，OmniFM在模态内与跨模态异质性场景下均持续超越最先进的联邦学习基线，在微调与从头训练设置下均取得了优异结果。

Abstract

Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM, a modality- and task-agnostic FL framework that unifies training across classification, segmentation, super-resolution, visual question answering, and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates (i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.
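The frequency-domain insight in the abstract (low-frequency components are shared across modalities) can be illustrated with a toy 1D example. This is a minimal sketch, not the OmniFM pipeline: the two "modalities" are synthetic signals that share one low-frequency component and differ only in high-frequency content, and the helper names `dft`, `idft`, and `low_pass` are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), fine for a toy signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def low_pass(signal, cutoff):
    """Zero every bin whose frequency index (counting mirrored bins) exceeds cutoff."""
    n = len(signal)
    spectrum = dft(signal)
    kept = [c if min(k, n - k) <= cutoff else 0j for k, c in enumerate(spectrum)]
    return idft(kept)


n = 32
# Shared low-frequency "structure" plus modality-specific high-frequency content.
shared = [math.sin(2 * math.pi * t / n) for t in range(n)]
modality_a = [s + 0.5 * math.cos(2 * math.pi * 10 * t / n) for t, s in enumerate(shared)]
modality_b = [s + 0.5 * math.sin(2 * math.pi * 13 * t / n) for t, s in enumerate(shared)]

low_a = low_pass(modality_a, cutoff=2)
low_b = low_pass(modality_b, cutoff=2)
gap_raw = max(abs(a - b) for a, b in zip(modality_a, modality_b))
gap_low = max(abs(a - b) for a, b in zip(low_a, low_b))
print(gap_raw, gap_low)
```

In this constructed case the raw signals differ noticeably while their low-pass versions coincide up to floating-point error. Real imaging modalities would only approximately share low-frequency content.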

Keywords: Federated Learning, Medical Imaging, Modality-Robust, Task-Agnostic, Heterogeneous Data, Spectral Analysis, Cross-Modality Consistency, Multimodal Fusion

248. ❌ FedCVU: Federated Learning for Cross-View Video Understanding

Authors: Shenghan Zhang, Run Ling, Ke Cao, Ao Ma, Zhanjie Zhang Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21647v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

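Scoring rationale: The paper "FedCVU: Federated Learning for Cross-View Video Understanding" focuses on federated learning for cross-view video understanding and proposes a framework that addresses non-IID data, representation alignment, and communication overhead. All scoring keywords relate directly to large models, the principles of deep learning techniques, or AI applications in science, whereas this paper studies a specific application of federated learning to video analysis. It does not cover large models, LLMs, MoE, SLMs, scaling laws, pre-training, post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agent systems, quantization, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for science, so every keyword has a relevance of 0.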

!!! tip deepseek-chat TL;DR

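The paper targets three challenges that federated learning faces in cross-view video understanding: non-IID data, representation alignment, and communication overhead. It proposes the FedCVU framework, whose three components (VS-Norm, CV-Align, and SLA) improve accuracy on unseen views while maintaining strong overall performance.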

Abstract (Chinese translation)

联邦学习（Federated Learning, FL）已成为一种具有前景的隐私保护多摄像头视频理解范式。然而，将FL应用于跨视角场景面临三大挑战：（i）异构的视角与背景导致高度非独立同分布的客户端数据分布，并容易过拟合于特定视角模式；（ii）局部分布偏差导致表征失准，阻碍跨视角语义的一致性；（iii）大规模视频架构带来极高的通信开销。为解决这些问题，我们提出FedCVU框架，其包含三个核心组件：VS-Norm（视角特定归一化），通过保留归一化参数以处理视角相关的统计特征；CV-Align（跨视角对齐），一个轻量级对比正则化模块，用于提升跨视角表征对齐；以及SLA（选择性层聚合策略），在不牺牲精度的情况下降低通信成本。在跨视角协议下对行为理解与行人重识别任务进行的大量实验表明，FedCVU能持续提升未知视角的准确率，同时保持强大的已知视角性能，其表现优于现有联邦学习基线方法，并展现出对领域异构性与通信限制的鲁棒性。

Abstract

Federated learning (FL) has emerged as a promising paradigm for privacy-preserving multi-camera video understanding. However, applying FL to cross-view scenarios faces three major challenges: (i) heterogeneous viewpoints and backgrounds lead to highly non-IID client distributions and overfitting to view-specific patterns, (ii) local distribution biases cause misaligned representations that hinder consistent cross-view semantics, and (iii) large video architectures incur prohibitive communication overhead. To address these issues, we propose FedCVU, a federated framework with three components: VS-Norm, which preserves normalization parameters to handle view-specific statistics; CV-Align, a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy that reduces communication without sacrificing accuracy. Extensive experiments on action understanding and person re-identification tasks under a cross-view protocol demonstrate that FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and showing robustness to domain heterogeneity and communication constraints.
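The combination of view-specific normalization and selective layer aggregation can be sketched roughly as below. This is an illustration under stated assumptions, not the FedCVU implementation: normalization parameters are assumed to be identifiable by the substring `"norm"` in their name, the set of shared layers is hand-picked (the abstract does not say how SLA selects them), and clients are averaged with equal weights.

```python
def is_norm_param(name):
    # ASSUMPTION: normalization parameters can be recognised by name.
    return "norm" in name


def aggregate(clients, shared_layers):
    """Average only the selected, non-normalization parameters across clients."""
    names = [n for n in clients[0] if n in shared_layers and not is_norm_param(n)]
    averaged = {}
    for name in names:
        vectors = [client[name] for client in clients]
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*vectors)]
    return averaged


def apply_update(client, averaged):
    """Overwrite aggregated parameters; everything else stays local."""
    updated = dict(client)
    updated.update(averaged)
    return updated


# Two toy clients (camera views) with flat parameter vectors.
client_a = {"backbone.conv": [1.0, 2.0], "backbone.norm": [0.1, 0.2], "head.fc": [5.0]}
client_b = {"backbone.conv": [3.0, 4.0], "backbone.norm": [0.9, 0.8], "head.fc": [7.0]}

# Hypothetical selection: only backbone layers are candidates for communication.
shared = {"backbone.conv", "backbone.norm"}
update = aggregate([client_a, client_b], shared)
new_a = apply_update(client_a, update)
print(update, new_a)
```

Only `backbone.conv` is communicated here: the normalization parameters and the head remain on each client, which reduces the payload and keeps view-specific statistics local.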

Keywords: Federated Learning, Cross-View Video Understanding, Non-IID Data, Representation Alignment, Communication Efficiency, Action Understanding, Person Re-identification, Domain Heterogeneity

249. ❌ No Dense Tensors Needed: Fully Sparse Object Detection on Event-Camera Voxel Grids

Authors: Mohamad Yazan Sadoun, Sarah Sharif, Yaser Mike Banad Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21638v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

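Scoring rationale: The paper focuses on object detection for event cameras and proposes a fully sparse processing method (SparseVoxelDet), which is unrelated to most keywords. The only relevant keyword is 'Mixture of Experts OR MoE OR Sparse Models': the paper's core innovation is fully sparse processing with 3D sparse convolutions, which is related to sparse-model techniques but is neither MoE nor a conventional sparse model, so it receives 5 points (some relevance). The remaining keywords concern large language models, training methods, reasoning techniques, agent systems, or AI for science applications, none of which the paper addresses.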

!!! tip deepseek-chat TL;DR

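The paper proposes SparseVoxelDet, a fully sparse object detection method for event-camera data. It processes only active voxels with 3D sparse convolutions, achieving substantial memory compression and storage reduction while maintaining high detection accuracy.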

Abstract (Chinese translation)

事件相机产生的异步高动态范围数据流非常适合检测小型快速移动的无人机，然而大多数基于事件的检测器将稀疏事件流转换为密集张量，丢弃了神经形态传感的表征效率。我们提出SparseVoxelDet——据我们所知这是首个面向事件相机的完全稀疏目标检测器，其主干特征提取、特征金字塔融合和检测头均通过三维稀疏卷积专门在占用的体素位置上进行计算；整个流程的任何阶段均未实例化密集特征张量。在FRED基准数据集（629,832帧标注数据）上，SparseVoxelDet在IoU阈值为0.50时达到83.38% mAP，每帧仅处理14,900个活跃体素（占T×H×W网格的0.23%），而作为密集基线的YOLOv11需处理409,600像素（87.68% mAP@0.50）。将IoU阈值从0.50放宽至0.40可使mAP恢复至89.26%，表明剩余的精度差距主要源于边界框回归精度而非检测能力。相较于等效的密集三维体素张量，稀疏表征实现了858倍GPU内存压缩和3,670倍存储缩减，其数据结构规模随场景动态变化而非传感器分辨率。对119,459帧测试数据的误差分析证实，71%的失败案例属于定位偏差而非漏检目标。这些结果表明：原生稀疏处理是事件相机目标检测的可行范式，它能够利用神经形态传感器数据的结构稀疏性而无需神经形态计算硬件，并构建了一个表征成本由场景活动度而非像素数量决定的框架——这一特性将随着事件相机向更高分辨率发展而日益重要。

Abstract

Event cameras produce asynchronous, high-dynamic-range streams well suited for detecting small, fast-moving drones, yet most event-based detectors convert the sparse event stream into dense tensors, discarding the representational efficiency of neuromorphic sensing. We propose SparseVoxelDet, to our knowledge the first fully sparse object detector for event cameras, in which backbone feature extraction, feature pyramid fusion, and the detection head all operate exclusively on occupied voxel positions through 3D sparse convolutions; no dense feature tensor is instantiated at any stage of the pipeline. On the FRED benchmark (629,832 annotated frames), SparseVoxelDet achieves 83.38% mAP@50 while processing only 14,900 active voxels per frame (0.23% of the T×H×W grid), compared to 409,600 pixels for the dense YOLOv11 baseline (87.68% mAP@50). Relaxing the IoU threshold from 0.50 to 0.40 recovers mAP to 89.26%, indicating that the remaining accuracy gap is dominated by box regression precision rather than detection capability. The sparse representation yields 858 times GPU memory compression and 3,670 times storage reduction relative to the equivalent dense 3D voxel tensor, with data-structure size that scales with scene dynamics rather than sensor resolution. Error forensics across 119,459 test frames confirms that 71 percent of failures are localization near-misses rather than missed targets. These results demonstrate that native sparse processing is a viable paradigm for event-camera object detection, exploiting the structural sparsity of neuromorphic sensor data without requiring neuromorphic computing hardware, and providing a framework whose representation cost is governed by scene activity rather than pixel count, a property that becomes increasingly valuable as event cameras scale to higher resolutions.
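The idea of storing only occupied voxels can be sketched with a coordinate-keyed dictionary. This is a toy illustration, not SparseVoxelDet's data structure: the event format `(t, x, y)`, the binning rule, and the function names are assumptions made for the example.

```python
from collections import Counter


def voxelize(events, t_bins, duration):
    """Map (t, x, y) events to a sparse {(t_bin, y, x): count} dictionary."""
    grid = Counter()
    for t, x, y in events:
        # Clamp so that an event at t == duration falls into the last bin.
        t_bin = min(int(t / duration * t_bins), t_bins - 1)
        grid[(t_bin, y, x)] += 1
    return grid


def occupancy(grid, t_bins, height, width):
    """Fraction of the dense T x H x W grid that holds at least one event."""
    return len(grid) / (t_bins * height * width)


# Four toy events on a 4 x 4 sensor over one second, split into 4 time bins.
events = [(0.00, 1, 1), (0.01, 1, 1), (0.40, 2, 3), (0.99, 0, 0)]
grid = voxelize(events, t_bins=4, duration=1.0)
ratio = occupancy(grid, t_bins=4, height=4, width=4)
print(dict(grid), ratio)
```

Memory here grows with the number of occupied voxels (3 entries) rather than with the 64 cells of the dense grid. For scale, the abstract's figures of 14,900 active voxels at 0.23% occupancy imply a dense grid of roughly 14,900 / 0.0023 ≈ 6.5 million cells.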

Keywords: event cameras, sparse object detection, 3D sparse convolutions, voxel grids, neuromorphic sensing, GPU memory compression, FRED benchmark, SparseVoxelDet

250. ❌ Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition

Authors: Wen Guo, Pengfei Zhao, Zongmeng Wang, Yufan Hu, Junyu Gao Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21629v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

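Scoring rationale: The paper studies test-time adaptation (TTA) for multi-object tracking (MOT) and proposes a two-level calibration framework based on experience and intuition. It belongs to computer vision and focuses on distribution shift in object tracking; it does not touch on large language models, the principles of deep learning techniques, or AI for Science. All keywords concern those topics, whereas this is purely computer vision research, so none of the keywords is relevant.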

!!! tip deepseek-chat TL;DR

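The paper addresses the performance degradation caused by train-test distribution shift in multi-object tracking. It proposes a test-time calibration framework based on experience and intuition that significantly improves the model's adaptability under distribution shift on multiple benchmark datasets.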

Abstract (Chinese translation)

多目标跟踪（Multiple Object Tracking，MOT）长期以来是计算机视觉领域的一项基础任务，在众多现实场景中具有广泛应用。然而，由于训练数据与测试数据在外观、运动模式和类别分布上存在偏移，模型在MOT的在线推理过程中性能显著下降。测试时适应（Test-Time Adaptation，TTA）作为一种缓解此类分布偏移的有效范式应运而生。然而，现有TTA方法在MOT中往往难以取得令人满意的效果，因为它们主要侧重于帧级适应，而忽视了跨帧与跨视频的时间一致性和身份关联性。受人类决策过程的启发，本文提出一种基于经验与直觉的测试时校准（Test-time Calibration from Experience and Intuition，TCEI）框架。在该框架中，直觉系统利用瞬时记忆回顾近期观测到的目标以进行快速预测，而经验系统则借助先前测试视频中积累的经验对这些直觉预测进行重新评估与校准。此外，在线测试过程中置信度高的目标与不确定性目标分别被用作历史先验和反思案例，从而使模型能够适应测试环境并缓解性能下降。大量实验表明，所提出的TCEI框架在多个基准数据集上均能取得优越性能，并显著提升了模型在分布偏移下的适应能力。代码将在https://github.com/1941Zpf/TCEI 发布。

Abstract

Multiple Object Tracking (MOT) has long been a fundamental task in computer vision, with broad applications in various real-world scenarios. However, due to distribution shifts in appearance, motion pattern, and category between the training and testing data, model performance degrades considerably during online inference in MOT. Test-Time Adaptation (TTA) has emerged as a promising paradigm to alleviate such distribution shifts. However, existing TTA methods often fail to deliver satisfactory results in MOT, as they focus primarily on frame-level adaptation while neglecting temporal consistency and identity association across frames and videos. Inspired by the human decision-making process, this paper proposes a Test-time Calibration from Experience and Intuition (TCEI) framework. In this framework, the Intuitive system utilizes transient memory to recall recently observed objects for rapid predictions, while the Experiential system leverages the accumulated experience from prior test videos to reassess and calibrate these intuitive predictions. Furthermore, both confident and uncertain objects during online testing are exploited as historical priors and reflective cases, respectively, enabling the model to adapt to the testing environment and alleviate performance degradation. Extensive experiments demonstrate that the proposed TCEI framework consistently achieves superior performance across multiple benchmark datasets and significantly enhances the model’s adaptability under distribution shifts. The code will be released at https://github.com/1941Zpf/TCEI.

Keywords: Multiple Object Tracking, Test-Time Adaptation, Distribution Shifts, Temporal Consistency, Identity Association, Experience and Intuition, Online Inference, Model Calibration

251. ❌ PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation

Authors: Jiacheng Lu, Hui Ding, Shiyu Zhang, Guoping Huo Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21626v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

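Scoring rationale: The paper focuses on medical image segmentation (brain tumor MRI segmentation) and proposes a deep learning network, PGR-Net, that improves segmentation accuracy and efficiency through a data-driven spatial prior set, a hierarchical Top-K ROI decision mechanism, and the WinGS-ROI module. Its core content is unrelated to most keywords (LLMs, MoE, RLHF, RAG, and so on), which concern large language models, training techniques, inference optimization, and agents. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to biomedicine (specifically medical image analysis) and so falls under AI for Science. That is not its core contribution, so it receives 5 points (some relevance).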

!!! tip deepseek-chat TL;DR

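The paper proposes PGR-Net, a brain tumor MRI segmentation network with a prior-guided ROI reasoning mechanism. It outperforms existing methods on multiple datasets while keeping the parameter count low.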

Abstract (Chinese translation)

脑肿瘤MRI分割对于临床诊断与治疗规划至关重要，能够实现精准的病灶检测与放疗靶区勾画。然而，肿瘤病灶仅占据三维空间中的极小部分，导致严重的空间稀疏性问题，而现有分割网络往往忽略临床观察到的肿瘤发生空间先验，从而在广泛的背景区域上进行冗余的特征计算。为解决这一问题，我们提出PGR-Net（先验引导的感兴趣区域推理网络）——一种显式的ROI感知框架，它整合了数据驱动的空间先验集合以捕捉肿瘤病灶的分布与尺度特征，为更稳定的分割提供全局引导。利用这些先验，PGR-Net引入了分层的Top-K ROI决策机制，该机制在编码器层中逐步选择置信度最高的病灶候选区域，以提升定位精度。我们进一步开发了WinGS-ROI（窗口化高斯空间衰减ROI）模块，该模块采用具有空间衰减函数的多窗口高斯模板来生成中心强化的引导图，从而在整个网络中引导特征学习。基于这些ROI特征，我们采用了窗口化的RetNet主干网络以增强定位可靠性。在BraTS-2019/2023和MSD Task01数据集上的实验表明，PGR-Net在仅使用8.64M参数的情况下，性能持续优于现有方法，在全肿瘤区域上分别取得了89.02%、91.82%和89.67%的Dice分数。代码公开于https://github.com/CNU-MedAI-Lab/PGR-Net。

Abstract

Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided ROI Reasoning Network), an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions across encoder layers to improve localization precision. We further develop the WinGS-ROI (Windowed Gaussian-Spatial Decay ROI) module, which uses multi-window Gaussian templates with a spatial decay function to produce center-enhanced guidance maps, thus directing feature learning throughout the network. With these ROI features, a windowed RetNet backbone is adopted to enhance localization reliability. Experiments on BraTS-2019/2023 and MSD Task01 show that PGR-Net consistently outperforms existing approaches while using only 8.64M parameters, achieving Dice scores of 89.02%, 91.82%, and 89.67% on the Whole Tumor region. Code is available at https://github.com/CNU-MedAI-Lab/PGR-Net.
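One plausible reading of "Top-K ROI selection followed by a center-enhanced Gaussian guidance map" is sketched below on a 2D toy grid. This is an assumption-laden illustration, not PGR-Net's WinGS-ROI module: it uses a single Gaussian width instead of multi-window templates, works in 2D rather than on volumes, and combines the Gaussians with a pointwise maximum, a choice the abstract does not specify. The function names are hypothetical.

```python
import math


def top_k_rois(confidence, k):
    """Return (row, col) positions of the k highest-confidence cells."""
    cells = [
        (score, r, c)
        for r, row in enumerate(confidence)
        for c, score in enumerate(row)
    ]
    cells.sort(key=lambda item: item[0], reverse=True)
    return [(r, c) for _, r, c in cells[:k]]


def gaussian_guidance(height, width, centers, sigma):
    """Pointwise max over Gaussians centred on each ROI; equals 1.0 at a centre."""
    guide = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            guide[r][c] = max(
                math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2 * sigma ** 2))
                for cr, cc in centers
            )
    return guide


# Toy lesion-confidence map with two clear peaks.
confidence = [
    [0.10, 0.20, 0.10, 0.05],
    [0.20, 0.90, 0.30, 0.10],
    [0.10, 0.30, 0.40, 0.20],
    [0.05, 0.10, 0.80, 0.30],
]
rois = top_k_rois(confidence, k=2)
guide = gaussian_guidance(4, 4, rois, sigma=1.0)
print(rois, guide[1][1], guide[0][0])
```

The guidance map peaks at the selected centers and decays with distance, so features far from any candidate region receive little weight.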

Keywords: Brain tumor segmentation, MRI segmentation, ROI reasoning, Spatial prior, Deep learning, Medical image analysis, PGR-Net, RetNet backbone

252. ❌ 4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video

Authors: Jae Won Jang, Yeonjin Chang, Wonsik Shin, Juhwan Cho, Nojun Kwak Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21618v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

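Scoring rationale: The paper focuses on 3D reconstruction of dynamic objects in computer vision, using Gaussian splatting with monocular video input; its core contributions are a 3D-native initialization and a trajectory optimization method. All scoring keywords concern large language models, the principles of deep learning techniques, or AI applications in science. The paper covers none of these topics and is purely computer vision / 3D reconstruction research, so every keyword has a relevance of 0.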

!!! tip deepseek-chat TL;DR

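The paper proposes 4DGS360, a diffusion-free framework for reconstructing the 360° 3D geometry of dynamic objects from monocular video. Through 3D-native initialization and AnchorTAP3D trajectory optimization, it addresses the geometric inconsistency of existing methods in occluded regions and achieves state-of-the-art performance on the new iPhone360 benchmark.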

Abstract (Chinese translation)

我们提出了4DGS360，一种无需扩散模型的框架，用于从单目日常视频中重建360°动态物体。现有方法通常难以重建一致的360°几何结构，因为它们过度依赖二维原生先验，导致初始点过度拟合每个训练视角中的可见表面。4DGS360通过一种先进的三维原生初始化方法应对这一挑战，该方法缓解了被遮挡区域的几何模糊性。我们提出的三维跟踪器AnchorTAP3D通过将置信度高的二维跟踪点作为锚点，生成强化的三维点轨迹，从而抑制漂移并提供可靠的初始化，以保持被遮挡区域的几何结构。这种初始化与优化相结合，产生了连贯的360°四维重建结果。我们进一步推出了iPhone360这一新基准数据集，其测试相机与训练视角的间隔可达135°，实现了现有数据集无法提供的360°全方位评估。实验表明，4DGS360在iPhone360、iPhone和DAVIS数据集上均取得了定性与定量评估的最先进性能。

Abstract

We introduce 4DGS360, a diffusion-free framework for 360° dynamic object reconstruction from casual monocular video. Existing methods often fail to reconstruct consistent 360° geometry, as their heavy reliance on 2D-native priors causes initial points to overfit to the visible surface in each training view. 4DGS360 addresses this challenge through an advanced 3D-native initialization that mitigates the geometric ambiguity of occluded regions. Our proposed 3D tracker, AnchorTAP3D, produces reinforced 3D point trajectories by leveraging confident 2D track points as anchors, suppressing drift and providing reliable initialization that preserves geometry in occluded regions. This initialization, combined with optimization, yields coherent 360° 4D reconstructions. We further present iPhone360, a new benchmark where test cameras are placed up to 135° apart from training views, enabling 360° evaluation that existing datasets cannot provide. Experiments show that 4DGS360 achieves state-of-the-art performance on the iPhone360, iPhone, and DAVIS datasets, both qualitatively and quantitatively.

Keywords: 360° dynamic object reconstruction, monocular video, 3D-native initialization, Gaussian splatting, AnchorTAP3D, occluded regions, iPhone360 benchmark, 4D reconstruction

253. ❌ AdaEdit: Adaptive Temporal and Channel Modulation for Flow-Based Image Editing

Authors: Guandong Li, Zhaobin Chu Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21615v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

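Scoring rationale: The paper focuses on AdaEdit, an image editing method based on flow matching models. It studies training-free, text-guided image manipulation, using techniques such as a progressive injection schedule and channel-selective latent perturbation. All scoring keywords relate directly to large language models (LLMs), the principles of deep learning techniques, or AI applications in science, whereas this paper studies image editing in computer vision. It has no direct connection to LLMs, MoE, scaling laws, alignment, reasoning, agents, quantization, or similar keywords, and does not involve scientific AI applications such as bioinformatics or cheminformatics. Every keyword therefore has a relevance of 0.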

!!! tip deepseek-chat TL;DR

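The paper tackles the injection dilemma in image editing with flow matching models. It proposes two adaptive methods, a progressive injection schedule and channel-selective latent perturbation, that significantly improve editing quality and reduce artifacts.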

Abstract (Chinese translation)

基于流匹配模型的反演图像编辑已成为免训练、文本引导图像操控的强大范式。该范式的核心挑战在于注入困境：在去噪过程中注入源图像特征虽能保留原始图像的背景，但会同时抑制模型合成编辑内容的能力。现有方法通过固定的注入策略应对此问题——包括二值化的开关时序调度、均匀的空间混合比例以及通道无关的潜在扰动——这些策略忽视了时序与通道维度上注入需求固有的异质性。本文提出AdaEdit，一种免训练的自适应编辑框架，通过两项互补的创新解决此困境。首先，我们提出渐进式注入调度，使用连续衰减函数（Sigmoid、余弦或线性）替代硬性的二值截断，实现从源特征保留到目标特征生成的平滑过渡，消除特征不连续伪影。其次，我们引入通道选择性潜在扰动，该方法基于反演潜在向量与随机潜在向量的分布差异估计各通道重要性，并相应施加差异化的扰动强度——对编辑相关通道进行强扰动，同时保留结构编码通道。在PIE-Bench基准测试（700张图像，10种编辑类型）上的大量实验表明，AdaEdit相较于强基线方法，在LPIPS指标上降低了8.7%，在SSIM和PSNR指标上分别提升了2.6%和2.3%，同时保持了具有竞争力的CLIP相似度。AdaEdit完全即插即用，兼容多种常微分方程求解器（包括Euler、RF-Solver和FireFlow）。代码发布于https://github.com/leeguandong/AdaEdit。

Abstract

Inversion-based image editing in flow matching models has emerged as a powerful paradigm for training-free, text-guided image manipulation. A central challenge in this paradigm is the injection dilemma: injecting source features during denoising preserves the background of the original image but simultaneously suppresses the model’s ability to synthesize edited content. Existing methods address this with fixed injection strategies – binary on/off temporal schedules, uniform spatial mixing ratios, and channel-agnostic latent perturbation – that ignore the inherently heterogeneous nature of injection demand across both the temporal and channel dimensions. In this paper, we present AdaEdit, a training-free adaptive editing framework that resolves this dilemma through two complementary innovations. First, we propose a Progressive Injection Schedule that replaces hard binary cutoffs with continuous decay functions (sigmoid, cosine, or linear), enabling a smooth transition from source-feature preservation to target-feature generation and eliminating feature discontinuity artifacts. Second, we introduce Channel-Selective Latent Perturbation, which estimates per-channel importance based on the distributional gap between the inverted and random latents and applies differentiated perturbation strengths accordingly – strongly perturbing edit-relevant channels while preserving structure-encoding channels. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing types) demonstrate that AdaEdit achieves an 8.7% reduction in LPIPS, a 2.6% improvement in SSIM, and a 2.3% improvement in PSNR over strong baselines, while maintaining competitive CLIP similarity. AdaEdit is fully plug-and-play and compatible with multiple ODE solvers including Euler, RF-Solver, and FireFlow. Code is available at https://github.com/leeguandong/AdaEdit
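The contrast between a hard binary cutoff and the continuous decay functions named in the abstract (sigmoid, cosine, linear) can be sketched as follows. This is a minimal illustration, not the AdaEdit code: `t` is assumed to be normalized denoising progress in [0, 1], the returned weight is assumed to scale source-feature injection, and the cutoff, midpoint, and steepness values are arbitrary.

```python
import math


def binary_schedule(t, cutoff=0.5):
    """Hard on/off injection: full strength before the cutoff, none after."""
    return 1.0 if t < cutoff else 0.0


def linear_schedule(t):
    """Injection weight falls linearly from 1 to 0."""
    return 1.0 - t


def cosine_schedule(t):
    """Half-cosine decay from 1 at t=0 to 0 at t=1."""
    return 0.5 * (1.0 + math.cos(math.pi * t))


def sigmoid_schedule(t, midpoint=0.5, steepness=10.0):
    """Smooth step centred on `midpoint`; both parameters are illustrative."""
    return 1.0 / (1.0 + math.exp(steepness * (t - midpoint)))


def blend(source_feature, target_feature, weight):
    """Mix source and generated features with the scheduled weight."""
    return [weight * s + (1.0 - weight) * g for s, g in zip(source_feature, target_feature)]


steps = [i / 4 for i in range(5)]  # 0.0, 0.25, 0.5, 0.75, 1.0
hard = [binary_schedule(t) for t in steps]
soft = [cosine_schedule(t) for t in steps]
mixed = blend([1.0, 1.0], [0.0, 0.0], linear_schedule(0.25))
print(hard, soft, mixed)
```

The binary schedule jumps from 1.0 to 0.0 at the cutoff, while the continuous schedules change the mixing weight gradually, which is the property the abstract credits with removing feature-discontinuity artifacts.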

Keywords: flow matching models, image editing, injection dilemma, progressive injection schedule, channel-selective latent perturbation, training-free editing, text-guided manipulation, adaptive editing framework

254. ❌ SARe: Structure-Aware Large-Scale 3D Fragment Reassembly

Authors: Hanze Jia, Chunshi Wang, Yuxiao Yang, Zhonghua Jiang, Yawei Luo, Shuainan Ye, Tan Tang Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21611v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

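Scoring rationale: The paper, SARe, focuses on 3D fragment reassembly, a computer vision / geometry processing problem, and uses deep learning techniques (such as a geometry encoder and a generative framework) to address large-scale fragment assembly. All keywords relate directly to large language models (LLMs), model training and alignment techniques, inference optimization, agent systems, and the like, while this paper involves no language model or text processing, so every keyword except 'AI for Science' scores 0. 'AI for Science' receives 5 points because 3D reassembly can be viewed as an application of AI in science and engineering (for example cultural heritage restoration or materials science), although it is not core bioinformatics or cheminformatics.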

!!! tip deepseek-chat TL;DR

该论文提出了一种名为SARe的结构感知生成框架，通过联合预测断裂表面标记概率和碎片间接触图来解决大规模3D碎片重组中因接触推理不可靠导致的级联失败问题，并在合成、模拟和真实断裂扫描中实现了最先进的性能。

摘要翻译

三维碎片重组旨在将无序的碎片点云或网格的刚性姿态恢复至同一物体坐标系中，以重建完整形状。随着碎片数量的增加，该问题变得尤为困难，因为目标形状未知且碎片仅能提供微弱的语义线索。现有的端到端方法由于依赖不可靠的接触关系推理（尤其是碎片邻接关系判断不准确）而容易产生级联式失败。为此，我们提出结构感知重组框架（SARe），该生成式框架包含用于欧几里得空间组装生成的SARe-Gen模块，以及具备显式接触建模的推理时优化模块SARe-Refine。SARe-Gen通过联合预测断裂表面标记概率与碎片间接触图，以定位接触区域并推断候选邻接关系。该模块采用基于查询点的条件调节机制，从冻结的几何编码器中提取查询点位置处对齐的局部几何标记，从而无需额外结构预训练即可生成可查询的结构化表征。我们进一步引入推理时优化阶段SARe-Refine：通过几何一致性检验验证候选接触边，该模块能够筛选可靠子结构，同时对未确定的区域进行重采样，同时保持已验证部分固定，从而在多碎片场景中实现更稳定、更一致的组装结果。我们在三种实验场景（包括合成断裂、真实扫描物体的模拟断裂以及真实物理断裂扫描）中对SARe进行评估。结果表明该方法取得了最先进的性能，在具有挑战性的大规模重组任务中，随着碎片数量增加，其性能下降更为平缓且成功率更高。

Abstract

3D fragment reassembly aims to recover the rigid poses of unordered fragment point clouds or meshes in a common object coordinate system to reconstruct the complete shape. The problem becomes particularly challenging as the number of fragments grows, since the target shape is unknown and fragments provide weak semantic cues. Existing end-to-end approaches are prone to cascading failures due to unreliable contact reasoning, most notably inaccurate fragment adjacencies. To address this, we propose Structure-Aware Reassembly (SARe), a generative framework with SARe-Gen for Euclidean-space assembly generation and SARe-Refine for inference-time refinement, with explicit contact modeling. SARe-Gen jointly predicts fracture-surface token probabilities and an inter-fragment contact graph to localize contact regions and infer candidate adjacencies. It adopts a query-point-based conditioning scheme and extracts aligned local geometric tokens at query locations from a frozen geometry encoder, yielding queryable structural representations without additional structural pretraining. We further introduce an inference-time refinement stage, SARe-Refine. By verifying candidate contact edges with geometric-consistency checks, it selects reliable substructures and resamples the remaining uncertain regions while keeping verified parts fixed, leading to more stable and consistent assemblies in the many-fragment regime. We evaluate SARe across three settings, including synthetic fractures, simulated fractures from scanned real objects, and real physically fractured scans. The results demonstrate state-of-the-art performance, with more graceful degradation and higher success rates as the fragment count increases in challenging large-scale reassembly.

Keywords: 3D fragment reassembly, structure-aware, contact modeling, generative framework, geometry encoder, inference-time refinement, large-scale reassembly, point clouds

255. ❌ Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

Authors: Alexandra Zelenin, Alexandra Zhuravlyova Venue/source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22276v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

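Scoring rationale: The paper's core contribution is an efficient implementation of DoRA (an extension of LoRA), which is highly relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points) because DoRA is an improvement on LoRA. The paper addresses inference speedup and memory optimization, so it is somewhat related to 'Speculative Decoding OR Inference Acceleration' (5 points). The paper benchmarks on 8-32B vision-language models, which relates to large models without being the core topic, so 'Large Language Models' receives 5 points. Other keywords such as MoE, SFT, and RAG are unrelated to the paper's content and all receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the high cost of high-rank DoRA (weight-decomposed low-rank adaptation, built on LoRA) in single-GPU settings, where memory demands are large. It proposes a factored norm and fused kernels, achieving 1.5-2.7x speedups and up to 7 GB lower VRAM while preserving numerical stability and training consistency.

Abstract (Chinese translation)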

权重分解低秩适应（DoRA）通过将权重幅度与方向解耦来扩展LoRA，但其前向传播需要计算W + sBA的行范数，而我们所调研的所有主流框架均通过实例化稠密的[d_out, d_in]乘积BA来实现该计算。当d_in = 8192且秩r = 384时，单个模块的范数计算在bf16精度下约需512 MB的瞬时工作内存，这使得高秩DoRA成本高昂，且一旦涉及数百个适配模块和检查点机制，在常见的单GPU配置中往往难以实现。
我们提出两项系统层面的贡献。首先，分解范数法将平方范数分解为可分别通过O(d_out r + r^2)中间量计算的基础项、交叉项和格拉姆项，从而避免了稠密乘积的生成。其次，融合Triton内核将原本四步内核的DoRA组合压缩为单次计算，减少约4倍的内存流量，并采用数值稳定的形式，避免了实际应用中幅度缩放集中于接近1的重新缩放区域时可能出现的灾难性抵消问题。
在bf16精度、秩r=384的条件下，于三款NVIDIA GPU（RTX 6000 PRO、H200、B200）上对六个8-32B规模的视觉语言模型（VLMs）进行测试，融合实现在推理速度上比Hugging Face PEFT的DoRA实现快1.5-2.0倍，在梯度计算（不含优化器步骤）上快1.5-1.9倍，且峰值显存降低最高达7 GB。在横跨四代架构的六款GPU（L40S、A100、RTX 6000 PRO、H200、B200、B300）上的微基准测试证实，组合内核速度提升达1.5-2.7倍。所有模型/GPU配对下的最终逻辑输出余弦相似度均超过0.9999，且多随机种子训练曲线在2000步内的平均每步损失差异保持在7.1×10^-4以内。

Abstract

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module’s norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT’s DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.

Keywords: DoRA, LoRA, Parameter-efficient Fine-tuning, Memory Optimization, Inference Acceleration, Fused Kernels, Vision-Language Models, High-Rank Adaptation
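
The factored-norm idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' Triton implementation: the function names, the shapes (`W` as `[d_out, d_in]`, `B` as `[d_out, r]`, `A` as `[r, d_in]`), and the scalar `s` are assumptions chosen for illustration. The sketch expands the squared row norm of `W + sBA` into a base term, a cross term, and a Gram term, so that only `[d_out, r]` and `[r, r]` intermediates are formed.

```python
import numpy as np


def dense_row_norm(W, B, A, s):
    """Reference version: materializes the dense [d_out, d_in] product B @ A."""
    return np.linalg.norm(W + s * (B @ A), axis=1)


def factored_row_norm(W, B, A, s):
    """Row norms of W + s*B@A without ever forming B @ A.

    For row i:
        ||w_i + s (BA)_i||^2 = ||w_i||^2 + 2 s <w_i, (BA)_i> + s^2 ||(BA)_i||^2
    """
    # Base term: squared norm of each row of W, shape [d_out].
    base = np.einsum("ij,ij->i", W, W)
    # Cross term: <w_i, b_i A> = (W A^T)_i . b_i, using a [d_out, r] intermediate.
    WA = W @ A.T
    cross = 2.0 * s * np.einsum("ij,ij->i", WA, B)
    # Gram term: ||b_i A||^2 = b_i (A A^T) b_i^T, using the [r, r] Gram matrix.
    G = A @ A.T
    gram = s**2 * np.einsum("ij,ij->i", B @ G, B)
    # Clamp tiny negative values caused by rounding before the square root.
    return np.sqrt(np.maximum(base + cross + gram, 0.0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((16, 32))
    B = rng.standard_normal((16, 4))
    A = rng.standard_normal((4, 32))
    print(np.allclose(factored_row_norm(W, B, A, 0.5), dense_row_norm(W, B, A, 0.5)))
```

The dense function is kept only as a correctness reference. The memory and kernel-fusion benefits reported in the paper depend on the GPU implementation, which this CPU sketch does not attempt to reproduce.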

256. ❌ Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

Authors: Zakaria Mhammedi, James Cohan Venue/source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22273v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

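Scoring rationale: The paper focuses on exploration in reinforcement learning (RL). It proposes a new paradigm that decouples exploration from exploitation, uses a tree-search strategy and an epistemic-uncertainty measure to drive exploration, and validates its efficiency on benchmarks such as Atari and MuJoCo. Its core content is RL algorithms and exploration strategies; it does not involve large language models (LLMs), deep learning fundamentals, AI for Science applications, or any of the specific techniques listed in the scoring keywords (such as MoE, SFT, RAG, or quantization). All keywords concern LLMs and related techniques (training, alignment, inference, applications, and so on), whereas this paper studies exploration in traditional RL, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new reinforcement learning paradigm that decouples exploration from exploitation. It explores the environment efficiently with an uncertainty-guided tree-search strategy, requires no policy optimization during the exploration phase, and achieves state-of-the-art performance on several hard-exploration benchmark tasks.

Abstract (Chinese translation)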

发现过程需要主动探索——即收集新颖且信息丰富的数据行为。然而，高效的自主探索仍然是一个尚未解决的重要问题。主流范式通过使用强化学习（RL）训练具有内在动机的智能体来应对这一挑战，以最大化外在奖励与内在奖励的复合目标。我们认为这种方法会产生不必要的开销：尽管策略优化对于精确任务执行是必要的，但仅为了扩展状态覆盖而采用此类机制可能效率低下。本文提出一种新范式，明确将探索与利用分离，并在探索阶段绕过强化学习。我们的方法采用受“胜者优先”（Go-With-The-Winner）算法启发的树搜索策略，结合认知不确定性度量来系统性地驱动探索。通过消除策略优化的开销，我们的方法在困难Atari基准测试中的探索效率比标准内在动机基线高出一个数量级。此外，我们证明所发现的轨迹可通过现有监督式反向学习算法提炼为可部署策略，在《蒙特祖玛的复仇》《陷阱！》和《冒险》游戏中以显著优势取得最先进分数，且无需依赖领域特定知识。最后，我们通过在高维连续动作空间中解决稀疏奖励设置下的MuJoCo Adroit灵巧操作任务和AntMaze任务，验证了框架的普适性——该方法直接基于图像观测实现，且无需专家示范或离线数据集。据我们所知，这在此前尚未实现。

Abstract

The process of discovery requires active exploration – the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma’s Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before.

Keywords: Reinforcement Learning, Exploration, Tree Search, Uncertainty, Decoupling, Intrinsic Motivation, Hard Exploration, Policy Optimization
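
To make the "explore without policy optimization" idea concrete, here is a toy sketch in the spirit of a Go-With-The-Winners population search. It is not the paper's algorithm: the 1-D chain environment, the random action choice, and the visit-count novelty score (standing in for an epistemic-uncertainty measure) are all assumptions made for illustration. It also assumes states can be cloned, which in practice requires a resettable simulator.

```python
import random
from collections import Counter


def explore(n_particles=8, n_iters=50, chain_len=100, seed=0):
    """Population-based exploration on a toy chain of states 0..chain_len-1.

    Returns a Counter mapping each state to how often particles occupied it.
    """
    rng = random.Random(seed)
    particles = [0] * n_particles
    visits = Counter({0: n_particles})  # all particles start in state 0
    for _ in range(n_iters):
        # 1) Every particle takes one random step; no policy is trained.
        particles = [
            min(max(state + rng.choice((-1, 1)), 0), chain_len - 1)
            for state in particles
        ]
        visits.update(particles)
        # 2) Rank particles by novelty: rarely visited states come first.
        particles.sort(key=lambda state: visits[state])
        # 3) Clone the most novel half over the rest ("go with the winners").
        half = n_particles // 2
        particles = particles[:half] + particles[: n_particles - half]
    return visits


if __name__ == "__main__":
    visit_counts = explore()
    print("farthest state reached:", max(visit_counts))
```

In the paper, the trajectories found during exploration are afterwards distilled into a policy with supervised backward learning; that stage is outside the scope of this sketch.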

257. ❌ Characterizing High-Capacity Janus Aminobenzene-Graphene Anode for Sodium-Ion Batteries with Machine Learning

Authors: Claudia Islas-Vargas, L. Ricardo Montoya, Carlos A. Vital-José, Oliver T. Unke, Klaus-Robert Müller, Huziel E. Sauceda Venue/source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22254v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

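Scoring rationale: The paper uses a machine-learning force field (the SpookyNet MLFF) together with density-functional theory calculations to characterize an anode material for sodium-ion batteries, placing it in materials science and computational chemistry. Because it explicitly uses a machine learning method (MLFF), it is somewhat related to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' and is scored 5. However, the paper does not involve large language models (LLMs), deep learning fundamentals, model training, inference optimization, alignment, agent systems, or the other keyword topics, all of which are scored 0.

!!! tip deepseek-chat TL;DR

Using machine-learning force field and density-functional theory simulations, the study reveals a three-stage sodium storage mechanism in aminobenzene-functionalized Janus graphene as a sodium-ion battery anode, which shows high capacity, a low voltage plateau, and fast Na-ion diffusion.

Abstract (Chinese translation)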

钠离子电池需要兼具高容量、低工作电压、快速钠离子传输和机械稳定性的负极材料，而传统负极难以满足这些要求。本研究利用SpookyNet机器学习力场（MLFF）结合全电子密度泛函理论计算，表征了室温下氨基苯功能化Janus石墨烯（Na$_x$AB）中的钠存储行为。对不同荷电状态的模拟揭示了一种三阶段存储机制：首先在氨基苯基团处发生位点特异性吸附并形成Na$_n$@AB$_m$结构，随后进行层间通道填充——这与硬碳中多阶段的孔隙主导、石墨层间主导及缺陷主导的行为形成鲜明对比。该机制产生了扩展的0.15 V（相对于Na/Na$^{+}$）低电压平台开路电压（OCV）曲线，估算质量容量约为400 mAh g$^{-1}$，体积变化可忽略不计，且钠离子扩散系数达$\sim10^{-6}$ cm$^{2}$ s$^{-1}$，比硬碳高两到三个数量级。我们的研究证实Janus氨基苯-石墨烯是一种结构明确、具有前景的高容量钠离子负极材料，并展示了基于MLFF的模拟在电极材料表征中的强大能力。

Abstract

Sodium-ion batteries require anodes that combine high capacity, low operating voltage, fast Na-ion transport, and mechanical stability, which conventional anodes struggle to deliver. Here, we use the SpookyNet machine-learning force field (MLFF) together with all-electron density-functional theory calculations to characterize Na storage in aminobenzene-functionalized Janus graphene (Na$_x$AB) at room temperature. Simulations across state of charge reveal a three-stage storage mechanism (site-specific adsorption at aminobenzene groups and Na$_n$@AB$_m$ structure formation, followed by interlayer gallery filling), in contrast to the multi-stage pore-, graphite-interlayer-, and defect-controlled behavior in hard carbon. This leads to an OCV profile with an extended low-voltage plateau of 0.15 V vs. Na/Na$^{+}$, an estimated gravimetric capacity of $\sim$400 mAh g$^{-1}$, negligible volume change, and Na diffusivities of $\sim10^{-6}$ cm$^{2}$ s$^{-1}$, two to three orders of magnitude higher than in hard carbon. Our results establish Janus aminobenzene-graphene as a promising, structurally defined high-capacity Na-ion anode and illustrate the power of MLFF-based simulations for characterizing electrode materials.

Keywords: Sodium-ion batteries, Janus graphene anode, Machine learning force field, Density-functional theory, Na storage mechanism, High capacity, Low operating voltage, Na diffusivity

258. ❌ ShapDBM: Exploring Decision Boundary Maps in Shapley Space

Authors: Luke Watkin, Daniel Archambault, Alex Telea Venue/source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22235v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

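Scoring rationale: The paper "ShapDBM: Exploring Decision Boundary Maps in Shapley Space" focuses on visualizing machine learning classification boundaries. It proposes a new Shapley-space-based method for computing decision boundary maps (DBMs) to improve the quality and explorability of the visualization. The work belongs to traditional machine learning visualization and interpretability and has no direct connection to any of the scoring keywords (which center on large models, deep learning fundamentals, training methods, inference optimization, alignment, agent systems, scientific applications, and so on). The paper does not involve large models, innovations in deep learning fundamentals, or applications in science, and mentions none of the specific techniques or concepts in the keywords. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new Shapley-space-based method for computing decision boundary maps (DBMs) to improve the visualization quality and explorability of machine learning classification boundaries.

Abstract (Chinese translation)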

决策边界图（Decision Boundary Maps，DBMs）是一种用于可视化机器学习分类边界的有效工具。然而，DBM的质量在很大程度上取决于所使用的降维（dimensionality reduction，DR）技术以及数据点所处的高维空间。对于复杂的机器学习数据集，降维过程可能产生许多混合类别，进而导致生成的决策边界图难以解读。我们提出一种新技术，通过将数据空间转换为沙普利空间（Shapley space）并在其上执行降维来计算决策边界图。与直接基于原始数据计算的标准决策边界图相比，我们的图谱在质量指标上具有相当或更高的数值，并且呈现出明显更紧凑、更易于探索的决策区域。

Abstract

Decision Boundary Maps (DBMs) are an effective tool for visualising machine learning classification boundaries. Yet, DBM quality strongly depends on the dimensionality reduction (DR) technique and high dimensional space used for the data points. For complex ML datasets, DR can create many mixed classes which, in turn, yield DBMs that are hard to use. We propose a new technique to compute DBMs by transforming data space into Shapley space and computing DR on it. Compared to standard DBMs computed directly from data, our maps have similar or higher quality metric values and visibly more compact, easier to explore, decision zones.

Keywords: Decision Boundary Maps, Shapley space, dimensionality reduction, machine learning classification, visualization, data transformation, compact decision zones, explorability

259. ❌ Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting

Authors: Qilin Wang Venue/source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22219v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

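Scoring rationale: The paper focuses on a new benchmarking method for time series forecasting (Noise Titration) and a probabilistic generative model (an extension of Fern). Its core topics are time series analysis, chaotic systems, probabilistic modeling, and evaluation methodology. It does not involve large language models, deep learning fundamentals, or applications of AI in science, and has no direct connection to any of the given keywords.

!!! tip deepseek-chat TL;DR

The paper proposes Noise Titration, an exact-statistical benchmarking paradigm based on noise injection for evaluating probabilistic time series forecasting models, and extends the Fern architecture to output calibrated joint covariance structures. The results show that state-of-the-art zero-shot foundation models fail under non-stationarity and elevated noise, whereas Fern maintains structural fidelity and statistical calibration.

Abstract (Chinese translation)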

现代时间序列预测几乎完全通过对单一历史轨迹的被动观察进行评估，这使得关于模型对非平稳性鲁棒性的主张从根本上无法证伪。我们提出向干预主义、精确统计基准测试的范式转变。通过将校准的高斯观测噪声系统性地注入已知的混沌和随机动力系统，我们将预测从黑箱序列匹配游戏转变为精确的分布推断任务。由于底层数据生成过程和噪声方差在数学上是显式定义的，评估可以依赖精确的负对数似然和校准的分布检验，而非启发式近似。为充分利用此框架，我们将Fern架构扩展为概率生成模型，该模型原生参数化对称正定（SPD）锥，输出校准的联合协方差结构，无需通用雅可比建模的计算瓶颈。在此严格评估下，我们发现最先进的零样本基础模型表现出与上下文复述机制一致的行为，在非平稳状态转移和噪声增强时系统性失效。相比之下，Fern显式捕捉了底层动力学的不变测度和多元几何结构，在大型序列匹配模型崩溃之处，仍能保持结构保真度和统计上的精确校准。

Abstract

Modern time series forecasting is evaluated almost entirely through passive observation of single historical trajectories, rendering claims about a model’s robustness to non-stationarity fundamentally unfalsifiable. We propose a paradigm shift toward interventionist, exact-statistical benchmarking. By systematically titrating calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, we transform forecasting from a black-box sequence matching game into an exact distributional inference task. Because the underlying data-generating process and noise variance are mathematically explicit, evaluation can rely on exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations. To fully leverage this framework, we extend the Fern architecture into a probabilistic generative model that natively parameterizes the Symmetric Positive Definite (SPD) cone, outputting calibrated joint covariance structures without the computational bottleneck of generic Jacobian modeling. Under this rigorous evaluation, we find that state-of-the-art zero-shot foundation models behave consistently with the context-parroting mechanism, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of the underlying dynamics, maintaining structural fidelity and statistically sharp calibration precisely where massive sequence-matching models collapse.

Keywords: time series forecasting, probabilistic forecasting, noise titration, exact statistical benchmarking, chaotic dynamical systems, covariance structures, non-stationarity, calibration
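
The benchmarking idea, injecting Gaussian observation noise of known variance into a known dynamical system and then scoring forecasts by negative log-likelihood, can be sketched as follows. The logistic map, the starting value, and the two toy forecasters are assumptions chosen for illustration; the paper's actual systems, noise schedules, and tests are not specified in the abstract.

```python
import numpy as np


def logistic_map(x, r=4.0):
    """One step of the logistic map, a standard chaotic toy system."""
    return r * x * (1.0 - x)


def titrate(n_steps, sigma, seed=0):
    """Return clean states x and noisy observations y = x + N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x[0] = 0.2
    for t in range(1, n_steps):
        x[t] = logistic_map(x[t - 1])
    y = x + sigma * rng.standard_normal(n_steps)
    return x, y


def gaussian_nll(y, mean, sigma):
    """Mean negative log-likelihood of y under N(mean, sigma^2)."""
    z = (y - mean) / sigma
    return float(np.mean(0.5 * np.log(2.0 * np.pi * sigma**2) + 0.5 * z**2))


if __name__ == "__main__":
    sigma = 0.1
    x, y = titrate(2000, sigma)
    # Oracle: knows the clean previous state and the true noise level.
    oracle = gaussian_nll(y[1:], logistic_map(x[:-1]), sigma)
    # Persistence: predicts that the next observation repeats the last one.
    persistence = gaussian_nll(y[1:], y[:-1], sigma)
    print("oracle NLL:", oracle, "persistence NLL:", persistence)
```

Because the data-generating process and noise variance are known here, the oracle's likelihood gives a reference point against which any forecaster's NLL can be compared.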

260. ❌ Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs

Authors: Kangqi Ni, Wenyue Hua, Xiaoxiang Shi, Jiang Guo, Shiyu Chang, Tianlong Chen Venue/source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22206v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

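Scoring rationale: The paper's core contribution is a scheduling system for multi-agent workflows on heterogeneous LLM clusters. It is highly relevant to 'Large Language Models' (the paper studies LLM serving systems) and to 'LLM Agents' and 'Multi-agent Systems' (the paper explicitly studies multi-agent applications and workflows). The other keywords concern model architectures, training methods, inference optimization, specific application domains, and so on; the paper does not address these specific techniques, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper targets the scheduling challenges of serving multi-agent workflows on heterogeneous LLM clusters and proposes the Chimera system. Through semantic routing, output-length prediction, and load balancing, it reduces end-to-end latency by 1.2-2.4x and improves task performance by 8.0-9.5 percentage points on code generation and math reasoning tasks.

Abstract (Chinese translation)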

多智能体应用常以多阶段工作流的形式执行复杂任务，其中每个阶段对应一次大语言模型调用，其输出将作为后续步骤的上下文组成部分。现有的大语言模型服务系统大多假设集群为同构环境，即所有模型副本完全一致。这种设计忽略了异构部署的潜力——通过部署不同规模与能力的模型，可以在延迟与性能之间实现更精细的权衡。然而，异构性也带来了新的挑战，即如何在具有不同吞吐量与性能的模型间进行调度。本文提出Chimera，一种面向异构大语言模型集群的多智能体工作流服务预测调度系统，它能协同优化端到端延迟与任务性能。Chimera通过语义路由为每个请求估算各模型的置信度分数，预测工作流的剩余总输出长度，并利用运行中的预测令牌量来估计各模型拥塞程度以实现负载均衡。我们在代码生成与数学推理两类代表性智能体工作流上，使用多种异构大语言模型配置对Chimera进行评估。在可比设置下，Chimera能够逼近最优的延迟-性能边界，相较于包括vLLM在内的竞争性基线方法，端到端延迟降低1.2–2.4倍，任务性能平均提升8.0–9.5个百分点。

Abstract

Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity introduces new challenges in scheduling across models with diverse throughput and performance. We present Chimera, a predictive scheduling system for multi-agent workflow serving on heterogeneous LLM clusters that jointly improves end-to-end latency and task performance. Chimera applies semantic routing to estimate per-model confidence scores for each request, predicts the total remaining output length of the workflow, and estimates per-model congestion using in-flight predicted token volumes for load balancing. We evaluate Chimera on representative agentic workflows for code generation and math reasoning using multiple heterogeneous LLM configurations. Across comparable settings, Chimera traces the best latency-performance frontier, reducing end-to-end latency by 1.2–2.4$\times$ and improving task performance by 8.0-9.5 percentage points on average over competitive baselines including vLLM.

Keywords: multi-agent serving, heterogeneous LLMs, workflow scheduling, latency-performance trade-off, semantic routing, load balancing, code generation, math reasoning
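
The three signals named in the abstract (per-model confidence from semantic routing, predicted remaining output length, and congestion estimated from in-flight predicted tokens) could be combined in many ways. The sketch below shows one hypothetical scoring rule, not Chimera's actual policy: `ModelState`, `pick_model`, the `latency_weight` parameter, and the linear trade-off are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    name: str
    tokens_per_second: float      # assumed measured decode throughput
    inflight_tokens: float = 0.0  # predicted tokens still queued on this model


def pick_model(models, confidence, predicted_len, latency_weight=0.01):
    """Pick the model with the best confidence-minus-latency score.

    confidence: dict mapping model name to a routing confidence in [0, 1].
    predicted_len: predicted remaining output tokens for this request.
    """
    def score(model):
        # Congestion-aware latency estimate: queued work plus this request.
        est_latency = (model.inflight_tokens + predicted_len) / model.tokens_per_second
        return confidence[model.name] - latency_weight * est_latency

    best = max(models, key=score)
    best.inflight_tokens += predicted_len  # account for the newly routed request
    return best


if __name__ == "__main__":
    small = ModelState("small", tokens_per_second=100.0)
    large = ModelState("large", tokens_per_second=20.0)
    conf = {"small": 0.6, "large": 0.9}
    for _ in range(5):
        print(pick_model([small, large], conf, predicted_len=200).name)
```

With this rule the more capable model is preferred while it is lightly loaded, and requests spill over to the faster model as predicted in-flight tokens accumulate. A real system would also decrement `inflight_tokens` as tokens are generated.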

261. ❌ Revisiting Quantum Code Generation: Where Should Domain Knowledge Live?

Authors: Oscar Novo, Oscar Bastidas-Jossa, Alberto Calvo, Antonio Peris, Carlos Kuchkovsky Venue/source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22184v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

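Scoring rationale: The paper studies the use of LLMs for quantum computing code generation, centrally comparing parameter fine-tuning against inference-time augmentation methods (RAG and agent-based execution feedback). Highly relevant keywords: LLMs (the core of the paper), RAG (explicitly evaluated), LLM Agents (execution-feedback agents), and AI for Science (quantum software is a scientific application). Somewhat relevant keywords: SFT (a parameter fine-tuned baseline is mentioned) and Tool Use (agent execution involves tool use). The other keywords are not addressed.

!!! tip deepseek-chat TL;DR

The study examines quantum software code generation and finds that inference-time augmentation methods, such as retrieval-augmented generation and agent-based execution feedback, improve LLM performance more effectively than parameter fine-tuning, enabling more flexible and maintainable assisted development.

Abstract (Chinese translation)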

近年来，大规模语言模型（LLM）的进展使得越来越多的编程任务能够实现自动化，包括科学与工程领域的代码生成。在快速演进的软件生态系统中，例如量子软件开发，其框架往往涉及复杂的抽象层次，一个核心问题是如何在保持库的可维护性的同时，将领域知识有效地整合到基于LLM的辅助工具中。
本研究基于Qiskit-HumanEval基准，探讨了Qiskit代码生成的专门化策略。我们比较了先前工作中提出的参数专门化微调基线模型与一系列最新的通用LLM，后者结合了检索增强生成（RAG）和基于智能体的执行反馈推理。
我们的结果表明，现代通用LLM在性能上持续优于参数专门化的基线模型。微调模型在Qiskit-HumanEval上实现了约47%的pass@1准确率，而最新的通用模型在零样本和检索增强设置下达到了60-65%，当与迭代执行反馈智能体结合时，表现最强的评估模型甚至达到了85%——这相较于零样本通用模型性能提升了超过20%，相较于参数专门化基线提升了超过35%。
基于智能体的执行反馈带来了最稳定的性能提升，尽管其运行时间成本有所增加；而RAG则提供了适度的、依赖于模型的增益。这些发现表明，无需进行领域特定的微调，仅通过推理时增强即可实现性能提升，从而为LLM辅助的量子软件开发提供了一种更灵活、更可维护的路径。

Abstract

Recent advances in large language models (LLMs) have enabled the automation of an increasing number of programming tasks, including code generation for scientific and engineering domains. In rapidly evolving software ecosystems such as quantum software development, where frameworks expose complex abstractions, a central question is how best to incorporate domain knowledge into LLM-based assistants while preserving maintainability as libraries evolve. In this work, we study specialization strategies for Qiskit code generation using the Qiskit-HumanEval benchmark. We compare a parameter-specialized fine-tuned baseline introduced in prior work against a range of recent general-purpose LLMs enhanced with retrieval-augmented generation (RAG) and agent-based inference with execution feedback. Our results show that modern general-purpose LLMs consistently outperform the parameter-specialized baseline. While the fine-tuned model achieves approximately 47% pass@1 on Qiskit-HumanEval, recent general-purpose models reach 60-65% under zero-shot and retrieval-augmented settings, and up to 85% for the strongest evaluated model when combined with iterative execution-feedback agents, representing an improvement of more than 20% over zero-shot general-purpose performance and more than 35% over the parameter-specialized baseline. Agentic execution feedback yields the most consistent improvements, albeit at increased runtime cost, while RAG provides modest and model-dependent gains. These findings indicate that performance gains can be achieved without domain-specific fine-tuning, instead relying on inference-time augmentation, thereby enabling a more flexible and maintainable approach to LLM-assisted quantum software development.

关键词: Large Language Models, Quantum Code Generation, Retrieval-Augmented Generation, Agent-based Inference, Execution Feedback, Qiskit, Fine-tuning, Software Development
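
The iterative execution-feedback setting described in the abstract can be pictured as a generate, run, retry loop. The following is a minimal sketch under stated assumptions, not the paper's agent: `run_candidate`, `solve_with_feedback`, and `fake_generate` are hypothetical names, and `fake_generate` stands in for an LLM call that is wrong on the first attempt and corrected once it sees an error message.

```python
from typing import Callable, Optional, Tuple


def run_candidate(code: str, test: str) -> Optional[str]:
    """Run candidate code and its test; return None on success, else an error string."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate function(s); sandbox this in real use
        exec(test, namespace)   # assertions raise on failure
        return None
    except Exception as exc:    # any failure becomes feedback for the next round
        return f"{type(exc).__name__}: {exc}"


def solve_with_feedback(generate: Callable[[str, Optional[str]], str],
                        prompt: str, test: str,
                        max_rounds: int = 3) -> Tuple[bool, int]:
    """Request code, execute it, and retry with the error message until it passes."""
    feedback = None
    for round_idx in range(1, max_rounds + 1):
        code = generate(prompt, feedback)
        feedback = run_candidate(code, test)
        if feedback is None:
            return True, round_idx
    return False, max_rounds


def fake_generate(prompt: str, feedback: Optional[str]) -> str:
    # Hypothetical stand-in for an LLM: buggy first, fixed after seeing feedback.
    if feedback is None:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"


passed, rounds = solve_with_feedback(fake_generate, "write add", "assert add(2, 3) == 5")
```

Here the task passes on the second round. A pass@1 figure for such a loop would be the fraction of benchmark tasks for which one run of the loop ends in success; the extra rounds are where the increased runtime cost comes from.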

262. ❌ Causal Evidence that Language Models use Confidence to Drive Behavior

Authors: Dharshan Kumaran, Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22161v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

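Scoring rationale: The paper's core question is how LLMs use internal confidence signals to regulate behavior (for example, abstaining), which bears directly on their capability as autonomous agents. Highly relevant keywords: LLM (the central object of study), LLM Agents (the paper explicitly discusses the transition of LLMs into autonomous agents), Self-Correction/Self-Reflection (confidence assessment is a form of self-reflection), and Mechanistic Interpretability (causal evidence via activation steering is interpretability research). RAG appears only as a comparison factor and is not central, so it scores 5. Other keywords such as MoE, SFT, and quantization are not addressed and score 0.

!!! tip deepseek-chat TL;DR

Using a four-phase abstention paradigm, this study provides causal evidence that large language models (LLMs) actively use internal confidence signals to regulate behavior, such as deciding whether to answer a question. This mirrors metacognitive control in biological systems and matters for the development of LLMs as autonomous agents.

Abstract (Chinese translation)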

元认知——即评估自身认知表现的能力——已在多个物种中得到证实，其中内部置信度估计是适应性行为的关键信号。虽然可以从大型语言模型（LLM）的输出中提取置信度，但模型是否主动利用这些信号来调节行为仍是一个根本性问题。我们通过一个四阶段的弃权范式对此进行研究。第一阶段在无弃权选项的情况下建立了内部置信度估计。第二阶段显示，LLM在决定作答或弃权时会对这些估计值应用隐式阈值。置信度成为行为的主导预测因子，其效应量比知识检索可及性（RAG分数）或表层语义特征高出一个数量级。第三阶段通过激活导向提供了因果证据：操纵内部置信度信号会相应改变弃权率。最后，第四阶段证明模型能够根据指令阈值系统性地调整弃权策略。我们的研究结果表明，弃权行为源于内部置信度表征与基于阈值的策略的协同运作，这反映了生物系统中存在的两阶段元认知控制。随着LLM向自主智能体转型——它们必须识别自身的不确定性以决定何时行动或寻求帮助——这种能力至关重要。

Abstract

Metacognition – the ability to assess one’s own cognitive performance – is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention paradigm. Phase 1 established internal confidence estimates in the absence of an abstention option. Phase 2 revealed that LLMs apply implicit thresholds to these estimates when deciding to answer or abstain. Confidence emerged as the dominant predictor of behavior, with effect sizes an order of magnitude larger than knowledge retrieval accessibility (RAG scores) or surface-level semantic features. Phase 3 provided causal evidence through activation steering: manipulating internal confidence signals correspondingly shifted abstention rates. Finally, Phase 4 demonstrated that models can systematically vary abstention policies based on instructed thresholds. Our findings indicate that abstention arises from the joint operation of internal confidence representations and threshold-based policies, mirroring the two-stage metacognitive control found in biological systems. This capacity is essential as LLMs transition into autonomous agents that must recognize their own uncertainty to decide when to act or seek help.

Keywords: Large Language Models, confidence, abstention, metacognition, autonomous agents, activation steering, behavior regulation, uncertainty
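
The two-stage picture in the abstract (a confidence estimate followed by a threshold-based policy) can be illustrated with a toy rule. This is a sketch under an assumption the abstract does not confirm: confidence is taken here to be the largest softmax probability over the answer options, and `answer_or_abstain` is a hypothetical helper, not the paper's procedure.

```python
import math


def softmax(logits):
    """Convert raw option scores into probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_or_abstain(option_logits, threshold=0.5):
    """Stage 1: estimate confidence. Stage 2: abstain if it falls below the threshold.

    Returns (chosen option index, or None when abstaining; confidence).
    """
    probs = softmax(option_logits)
    confidence = max(probs)
    if confidence < threshold:
        return None, confidence
    return probs.index(confidence), confidence


confident_choice, confident_score = answer_or_abstain([2.0, 0.1, 0.1])
unsure_choice, unsure_score = answer_or_abstain([0.3, 0.2, 0.1])
lenient_choice, _ = answer_or_abstain([0.3, 0.2, 0.1], threshold=0.3)
```

With the default threshold the first question is answered and the second is skipped; lowering the threshold to 0.3 makes the same low-confidence question answerable, which is the kind of instructed-threshold shift Phase 4 describes.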

263. ❌ Data Curation for Machine Learning Interatomic Potentials by Determinantal Point Processes

Authors: Joanna Zou, Youssef Marzouk Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22160v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

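Scoring rationale: The paper focuses on data curation for machine learning interatomic potentials, using determinantal point processes (DPPs) to select informative subsets of atomic configurations for labeling. Its content is unrelated to most keywords (large models, training techniques, inference optimization, alignment, agents, and so on), which mainly target large language models and related techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies machine learning to molecular systems (a scientific domain), but it is not core bioinformatics or cheminformatics, so it receives 5 (some relevance). The weighted total is computed as 5.0 (a single keyword scoring 5, weight 1.0).

!!! tip deepseek-chat TL;DR

The paper proposes using determinantal point processes (DPPs) to efficiently select subsets of atomic configurations for labeling, addressing the data generation and labeling bottleneck in developing machine learning interatomic potentials. Experiments show that the method builds compact, diverse training sets and improves the accuracy and robustness of molecular-system representations.

Abstract (Chinese translation)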

机器学习原子间势的开发面临一个关键的计算瓶颈：即如何生成并标注有效的训练数据集。本文提出了一种行列式点过程（DPPs）的新颖应用，用于从原子构型中选择信息丰富的子集，以便利用昂贵的量子力学方法为其标注参考能量和力。通过对氧化铪数据的实验，我们证明，通过利用分子描述符的核函数，DPPs在构建紧凑且多样化的训练集方面与现有方法相比具有竞争力，从而提高了分子系统机器学习表征的准确性和鲁棒性。我们的工作为DPPs的应用指明了有前景的方向，包括在异构或多模态数据中进行无监督训练数据筛选，或在分子动力学模拟期间用于迭代数据增强的在线主动学习方案中加以应用。

Abstract

The development of machine learning interatomic potentials faces a critical computational bottleneck with the generation and labeling of useful training datasets. We present a novel application of determinantal point processes (DPPs) to the task of selecting informative subsets of atomic configurations to label with reference energies and forces from costly quantum mechanical methods. Through experiments with hafnium oxide data, we show that DPPs are competitive with existing approaches to constructing compact but diverse training sets by utilizing kernels of molecular descriptors, leading to improved accuracy and robustness in machine learning representations of molecular systems. Our work identifies promising directions to employ DPPs for unsupervised training data curation with heterogeneous or multimodal data, or in online active learning schemes for iterative data augmentation during molecular dynamics simulation.

Keywords: machine learning interatomic potentials, data curation, determinantal point processes, training datasets, atomic configurations, molecular descriptors, hafnium oxide, active learning
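
To make the idea of diversity-driven subset selection concrete, here is a minimal sketch of greedy log-determinant maximization over a kernel of descriptors, a common approximation to DPP MAP inference. It is not claimed to be the paper's selection procedure: the RBF kernel, the 2-D toy "descriptors", and the function names are all illustrative assumptions.

```python
import numpy as np


def rbf_kernel(X, lengthscale=1.0):
    """Similarity matrix between descriptor vectors (rows of X)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)


def greedy_dpp(L, k):
    """Greedily add the item that maximizes log det of the selected kernel submatrix."""
    selected = []
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:   # no candidate keeps the submatrix positive definite
            break
        selected.append(best_item)
    return selected


# Toy descriptors: two near-duplicate pairs and one isolated configuration.
descriptors = np.array([[0.0, 0.0], [0.01, 0.0],
                        [5.0, 5.0], [5.01, 5.0],
                        [10.0, 0.0]])
chosen = greedy_dpp(rbf_kernel(descriptors), k=3)
```

Because near-duplicate rows make the determinant collapse toward zero, the greedy rule picks one configuration from each of the three groups rather than spending labels on both members of a near-duplicate pair.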

264. ❌ RAMPAGE: RAndomized Mid-Point for debiAsed Gradient Extrapolation

Authors: Abolfazl Hashemi Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22155v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

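Scoring rationale: The paper studies numerical optimization algorithms (RAMPAGE and RAMPAGE+) for variational inequalities (VIs), which belongs to mathematical optimization and computational mathematics. All keywords concern large language models, deep learning techniques, AI applications, or AI for science, none of which the paper addresses. It focuses on classical optimization techniques such as gradient extrapolation, randomized midpoint methods, and variance reduction, and has no connection to the given large-model keywords.

!!! tip deepseek-chat TL;DR

To address the discretization bias of the Extragradient method when solving variational inequalities, the paper proposes two unbiased randomized mid-point gradient extrapolation algorithms, RAMPAGE and RAMPAGE+, and proves convergence guarantees for them across several problem settings.

Abstract (Chinese translation)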

变分不等式（VIs）的一种经典方法是外梯度法（Extragradient, EG），该方法可视为一种标准的离散时间积分格式。基于这一视角，本文指出当EG应用于非线性向量场（无论其是否保守）时，可能受到离散化偏差的影响。为克服这一离散化缺陷，我们提出了随机中点去偏梯度外推法（RAndomized Mid-Point for debiAsed Gradient Extrapolation, RAMPAGE）及其方差缩减版本RAMPAGE+，后者利用了反采样技术。与EG不同，这两种方法均是无偏的。此外，通过利用负相关性，RAMPAGE+作为一种无偏的几何路径积分器，能够完全消除方差中的一阶内部项，理论上优于RAMPAGE。我们进一步证明，对于包括在共强制、共亚单调及广义Lipschitz性条件下的求根问题在内的一系列问题，两种方法均具有可证明的$\mathcal{O}(1/k)$收敛保证。同时，我们引入了对称缩放变体，将结果推广到约束变分不等式。最后，我们为随机和确定性光滑凸凹博弈提供了两种方法的收敛性保证。值得注意的是，尽管RAMPAGE+是一种随机方法，它在多种研究设定下获得了纯确定性的收敛界。

Abstract

A celebrated method for Variational Inequalities (VIs) is Extragradient (EG), which can be viewed as a standard discrete-time integration scheme. With this view in mind, in this paper we show that EG may suffer from discretization bias when applied to non-linear vector fields, conservative or otherwise. To resolve this discretization shortcoming, we introduce RAndomized Mid-Point for debiAsed Gradient Extrapolation (RAMPAGE) and its variance-reduced counterpart, RAMPAGE+ which leverages antithetic sampling. In contrast with EG, both methods are unbiased. Furthermore, leveraging negative correlation, RAMPAGE+ acts as an unbiased, geometric path-integrator that completely removes internal first-order terms from the variance, provably improving upon RAMPAGE. We further demonstrate that both methods enjoy provable $\mathcal{O}(1/k)$ convergence guarantees for a range of problems including root finding under co-coercive, co-hypomonotone, and generalized Lipschitzness regimes. Furthermore, we introduce symmetrically scaled variants to extend our results to constrained VIs. Finally, we provide convergence guarantees of both methods for stochastic and deterministic smooth convex-concave games. Somewhat interestingly, despite being a randomized method, RAMPAGE+ attains purely deterministic bounds for a number of the studied settings.

Keywords: Variational Inequalities, Extragradient, RAMPAGE, Randomized Mid-Point, Gradient Extrapolation, Convergence Guarantees, Convex-Concave Games, Variance Reduction
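
For readers unfamiliar with the baseline the abstract starts from, the sketch below shows the classical Extragradient (EG) update next to a plain gradient step on the bilinear game min_x max_y x·y. It illustrates EG only; the RAMPAGE and RAMPAGE+ updates are not specified in the abstract and are not reproduced here. The step size and iteration count are arbitrary choices.

```python
import numpy as np


def field(z):
    """Vector field of the bilinear game min_x max_y x*y: F(x, y) = (y, -x)."""
    x, y = z
    return np.array([y, -x])


def gradient_step(z, step):
    """Simultaneous gradient descent-ascent: one evaluation of the field."""
    return z - step * field(z)


def extragradient_step(z, step):
    """EG: look ahead to an extrapolated point, then update with the field there."""
    z_mid = z - step * field(z)
    return z - step * field(z_mid)


start = np.array([1.0, 1.0])
z_gd, z_eg = start.copy(), start.copy()
for _ in range(200):
    z_gd = gradient_step(z_gd, 0.1)
    z_eg = extragradient_step(z_eg, 0.1)

dist_gd = np.linalg.norm(z_gd)   # distance from the solution (0, 0)
dist_eg = np.linalg.norm(z_eg)
```

On this linear field the plain step spirals away from the solution at the origin, while the EG iterate contracts toward it: each EG step scales the squared norm by 1 - η² + η⁴ for step size η, versus 1 + η² for the plain step.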

265. ❌ Computationally lightweight classifiers with frequentist bounds on predictions

Authors: Shreeram Murali, Cristian R. Rojas, Dominik Baumann Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22128v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

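Scoring rationale: The paper proposes a computationally efficient classification algorithm based on the Nadaraya-Watson estimator, derives frequentist uncertainty intervals for it, and applies it to electrocardiogram signal classification. It is unrelated to most keywords (large-model techniques, training methods, inference optimization, alignment, and so on), which mainly target large language models and deep learning, whereas this paper studies a traditional machine learning classifier. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the method is applied to biomedical signal (ECG) classification, an application of AI in science and bioinformatics, but this is not the core contribution, so it receives 5 (some relevance).

!!! tip deepseek-chat TL;DR

To address the lack of uncertainty bounds on predictions in existing classifiers, the paper proposes a computationally efficient classification algorithm based on the Nadaraya-Watson estimator. It maintains high accuracy while providing actionable uncertainty bounds, and suits resource-constrained real-time diagnostic monitoring.

Abstract (Chinese translation)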

尽管经典分类器与神经网络分类器均可实现高准确率，但其预测结果缺乏不确定性边界，因此不适用于安全关键型应用。现有可提供此类边界的基于核函数的分类器在时间上具有$\mathcal{O}(n^{\sim 3})$的复杂度，对于大规模数据集计算上不可行。为解决此问题，我们提出一种基于Nadaraya-Watson估计器的新型高效计算分类算法，并为其估计值推导了频率学派不确定性区间。我们在合成生成数据及MIT-BIH心律失常数据库的心电（ECG）心跳信号上评估了该分类器。结果表明，该方法在$\mathcal{O}(n)$与$\mathcal{O}(\log n)$计算量下实现了超过96%的竞争性准确率，同时提供了可操作的不确定性边界。此类边界可用于标记低置信度预测，使其适用于资源受限的实时场景，如诊断监测或植入式设备。

Abstract

While both classical and neural network classifiers can achieve high accuracy, they fall short on offering uncertainty bounds on their predictions, making them unfit for safety-critical applications. Existing kernel-based classifiers that provide such bounds scale with $\mathcal{O}(n^{\sim 3})$ in time, making them computationally intractable for large datasets. To address this, we propose a novel, computationally efficient classification algorithm based on the Nadaraya-Watson estimator, for whose estimates we derive frequentist uncertainty intervals. We evaluate our classifier on synthetically generated data and on electrocardiographic heartbeat signals from the MIT-BIH Arrhythmia database. We show that the method achieves competitive accuracy above 96% at $\mathcal{O}(n)$ and $\mathcal{O}(\log n)$ operations, while providing actionable uncertainty bounds. These bounds can, e.g., aid in flagging low-confidence predictions, making them suitable for real-time settings with resource constraints, such as diagnostic monitoring or implantable devices.

Keywords: classification algorithm, uncertainty bounds, computationally efficient, Nadaraya-Watson estimator, frequentist intervals, electrocardiographic signals, real-time applications, diagnostic monitoring
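
The Nadaraya-Watson estimator named in the abstract is a kernel-weighted average of training labels. The sketch below applies it to binary classification on toy 1-D data. It covers only the point estimate: the paper's frequentist intervals and its $\mathcal{O}(\log n)$ variant are not described in the abstract and are not reproduced. The Gaussian kernel, the bandwidth, and the data are illustrative assumptions.

```python
import numpy as np


def nw_class_probability(x_train, y_train, x_query, bandwidth=0.5):
    """Estimate P(y = 1 | x) as a Gaussian-kernel-weighted mean of 0/1 labels."""
    sq_dist = np.sum((x_query[:, None, :] - x_train[None, :, :]) ** 2, axis=-1)
    weights = np.exp(-0.5 * sq_dist / bandwidth**2)
    return weights @ y_train / weights.sum(axis=1)


x_train = np.array([[0.0], [0.2], [0.4], [2.0], [2.2], [2.4]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_query = np.array([[0.1], [2.3], [1.2]])

prob = nw_class_probability(x_train, y_train, x_query)
labels = (prob >= 0.5).astype(int)
```

The first two queries sit inside one class and get probabilities near 0 and 1. The third lies midway between the classes, so its estimate is close to 0.5; this is the kind of low-confidence prediction that an uncertainty bound would help flag. This naive version costs one pass over the training set per query.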

266. ❌ MIHT: A Hoeffding Tree for Time Series Classification using Multiple Instance Learning

Authors: Aurora Esteban, Amelia Zafra, Sebastián Ventura Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22074v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

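Scoring rationale: The paper focuses on traditional machine learning methods for time series classification (Hoeffding decision trees and multiple instance learning) and is unrelated to all large-model and deep-learning keywords. It has only a weak link to 'Explainable AI' (because it mentions interpretability) and to 'AI for Science' (because time series classification can be used in scientific domains), and neither is core content.

!!! tip deepseek-chat TL;DR

The paper proposes MIHT, an algorithm based on multi-instance learning and incremental decision trees for classifying multivariate, variable-length time series. It outperforms 11 state-of-the-art models on 28 public datasets while providing interpretable results.

Abstract (Chinese translation)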

由于时间序列数据在众多现实问题中的普遍性及其固有的依赖性，时间序列分类在各领域具有至关重要的意义。然而，现有模型在处理变长或高维序列时往往面临困难。本文提出MIHT（多实例霍夫丁树）算法，这是一种高效模型，利用多实例学习对多变量变长时间序列进行分类，同时提供可解释的结果。该算法采用一种新颖的时间序列表示方法，即将序列表示为“子序列包”，并结合基于增量决策树的优化过程，以区分序列中的相关部分与噪声。该方法能够提取具有多变量和变长特性的序列的潜在概念。生成的决策树是序列概念的紧凑、白盒表示，为序列中最相关的变量和片段提供了可解释性洞察。实验结果表明，MIHT在28个公共数据集（包括高维数据集）上优于11种先进的时间序列分类模型，展现出卓越性能。MIHT在准确性和可解释性方面均有提升，为处理复杂动态时间序列数据提供了一种前景广阔的解决方案。

Abstract

Due to the prevalence of temporal data and its inherent dependencies in many real-world problems, time series classification is of paramount importance in various domains. However, existing models often struggle with series of variable length or high dimensionality. This paper introduces the MIHT (Multi-instance Hoeffding Tree) algorithm, an efficient model that uses multi-instance learning to classify multivariate and variable-length time series while providing interpretable results. The algorithm uses a novel representation of time series as “bags of subseries,” together with an optimization process based on incremental decision trees that distinguish relevant parts of the series from noise. This methodology extracts the underlying concept of series with multiple variables and variable lengths. The generated decision tree is a compact, white-box representation of the series’ concept, providing interpretability insights into the most relevant variables and segments of the series. Experimental results demonstrate MIHT’s superiority, as it outperforms 11 state-of-the-art time series classification models on 28 public datasets, including high-dimensional ones. MIHT offers enhanced accuracy and interpretability, making it a promising solution for handling complex, dynamic time series data.

Keywords: time series classification, multi-instance learning, Hoeffding tree, variable-length series, interpretability, multivariate time series, incremental decision tree, bags of subseries
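
Two ingredients mentioned in the abstract can be sketched in a few lines: turning a series into a "bag of subseries", and the Hoeffding bound that incremental (Hoeffding) trees use to decide when enough examples have been seen to commit to a split. The sliding-window construction with a fixed `window` and `stride` is an assumption made for illustration; the abstract does not say how MIHT builds its bags.

```python
import math


def bag_of_subseries(series, window, stride):
    """Split one series (a list of time steps) into overlapping windows."""
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, stride)]


def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the mean of n samples lies within this
    distance of the true mean, for a quantity with the given range."""
    return math.sqrt(value_range**2 * math.log(1.0 / delta) / (2.0 * n))


# Series of different lengths map to bags of different sizes,
# but every instance in every bag has the same window length.
short_bag = bag_of_subseries(list(range(7)), window=4, stride=2)
long_bag = bag_of_subseries(list(range(10)), window=4, stride=2)

eps_small_n = hoeffding_bound(value_range=1.0, delta=1e-7, n=100)
eps_large_n = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
```

In a Hoeffding tree, a leaf is split once the gap between the best and second-best split criteria exceeds this bound, which shrinks as more examples arrive.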

267. ❌ MAGPI: Multifidelity-Augmented Gaussian Process Inputs for Surrogate Modeling from Scarce Data

Authors: Atticus Rex, Elizabeth Qian, David Peterson Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22050v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

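Scoring rationale: The paper focuses on multifidelity Gaussian process regression for building surrogate models from scarce data, which belongs to scientific computing and engineering optimization. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agent systems, and so on), which mainly target natural language processing and large language models. The only keyword with some relevance is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves machine learning applications in science and engineering, but this is an application setting rather than the core content, so it receives 5 (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes a new multifidelity training approach that uses low-fidelity data to define additional features augmenting the input space of Gaussian process regression, improving surrogate-model predictive accuracy and reducing computational cost when data are scarce.

Abstract (Chinese translation)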

监督机器学习描述了将参数化模型拟合到带标签的输入-输出数据的实践。监督机器学习方法已展现出在学习高效代理模型方面的潜力，这些模型能够（部分）替代昂贵的高保真模型，从而使优化、不确定性量化和推断等大量查询分析变得可行。然而，当训练数据必须通过评估昂贵的模型或实验获得时，可获取的训练数据量往往有限，这可能导致学习到的代理模型不可靠。但在许多工程与科学场景中，可能存在更廉价的低保真模型，例如源自简化物理建模或粗网格的模型。这些模型可用于生成额外的低保真训练数据。多保真机器学习的目标是利用高保真和低保真训练数据，学习一个比高保真模型评估成本更低、但比任何可用低保真模型更精确的代理模型。本研究提出了一种新的高斯过程回归多保真训练方法，该方法利用低保真数据定义额外特征，以扩展学习模型的输入空间。此方法融合了现有两类多保真高斯过程回归方法——协同克里金法和自回归估计器——的理想特性。在多个测试问题上的数值实验表明，相较于现有先进方法，该方法既提升了预测精度，又降低了计算成本。

Abstract

Supervised machine learning describes the practice of fitting a parameterized model to labeled input-output data. Supervised machine learning methods have demonstrated promise in learning efficient surrogate models that can (partially) replace expensive high-fidelity models, making many-query analyses, such as optimization, uncertainty quantification, and inference, tractable. However, when training data must be obtained through the evaluation of an expensive model or experiment, the amount of training data that can be obtained is often limited, which can make learned surrogate models unreliable. However, in many engineering and scientific settings, cheaper low-fidelity models may be available, for example arising from simplified physics modeling or coarse grids. These models may be used to generate additional low-fidelity training data. The goal of multifidelity machine learning is to use both high- and low-fidelity training data to learn a surrogate model which is cheaper to evaluate than the high-fidelity model, but more accurate than any available low-fidelity model. This work proposes a new multifidelity training approach for Gaussian process regression which uses low-fidelity data to define additional features that augment the input space of the learned model. The approach unites desirable properties from two separate classes of existing multifidelity GPR approaches, cokriging and autoregressive estimators. Numerical experiments on several test problems demonstrate both increased predictive accuracy and reduced computational cost relative to the state of the art.

Keywords: multifidelity machine learning, Gaussian process regression, surrogate modeling, scarce data, low-fidelity models, high-fidelity models, predictive accuracy, computational cost
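
One plausible reading of "low-fidelity data define additional features that augment the input space" is to append the low-fidelity model's output at each input as an extra input dimension of the Gaussian process. The sketch below implements that reading with a bare-bones GP posterior mean in NumPy. It is an illustration under assumptions, not the MAGPI algorithm: the two toy models, the RBF kernel with unit lengthscale, and the fixed noise level are all hypothetical.

```python
import numpy as np


def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)


def gp_predict(X_train, y_train, X_test, noise=1e-6):
    """Posterior mean of a zero-mean GP with an RBF kernel."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    return rbf(X_test, X_train) @ np.linalg.solve(K, y_train)


def low_fidelity(x):    # hypothetical cheap model
    return np.sin(2.0 * x)


def high_fidelity(x):   # hypothetical expensive model
    return np.sin(2.0 * x) + 0.3 * x


def augment(X):
    """Append the low-fidelity output as an extra input feature."""
    return np.hstack([X, low_fidelity(X)])


X_hi = np.linspace(0.0, 3.0, 5).reshape(-1, 1)   # scarce high-fidelity samples
y_hi = high_fidelity(X_hi).ravel()
X_test = np.linspace(0.0, 3.0, 50).reshape(-1, 1)

pred_plain = gp_predict(X_hi, y_hi, X_test)
pred_aug = gp_predict(augment(X_hi), y_hi, augment(X_test))
refit_aug = gp_predict(augment(X_hi), y_hi, augment(X_hi))
```

The augmented model needs a low-fidelity evaluation at every test point, so it is only worthwhile when that model is cheap. Whether it beats the plain GP depends on how informative the low-fidelity output is about the high-fidelity response.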

268. ❌ RAFL: Generalizable Sim-to-Real of Soft Robots with Residual Acceleration Field Learning

Authors: Dong Heon Cho, Boyuan Chen Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22039v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

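Scoring rationale: The paper studies the sim-to-real problem for soft robots and proposes a residual acceleration field learning (RAFL) framework to reduce the gap between simulation and reality. Although it involves AI applied to robotics, all keywords explicitly target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on), whereas this paper focuses on physical simulation, robot control, and geometric modeling. It does not involve language models, deep learning applied to science, or innovations in large-model techniques, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

To address the sim-to-real gap caused by geometry changes in soft-robot simulation, the paper proposes a residual acceleration field learning framework that achieves zero-shot improvements on unseen morphologies and supports continual improvement of simulation accuracy during morphology optimization.

Abstract (Chinese translation)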

可微分仿真器支持基于梯度对软体机器人的材料参数、控制与形态进行优化，但由于仿真与现实间的差异，精确建模真实系统仍具挑战性。当几何结构本身成为设计变量时，这一问题尤为突出。系统辨识通过将全局材料参数拟合至数据来减小差异；然而，当本构模型设定错误或观测数据稀疏时，辨识出的参数往往吸收了几何相关的效应，而非反映材料的内在特性。更具表达力的本构模型可提升精度，但会显著增加计算成本，限制其实用性。
我们提出一种残差加速度场学习框架，通过一个可迁移的单元级修正动力学场增强基础仿真器。该模型基于共享的局部特征运行，独立于全局网格拓扑与离散化方式。通过可微分仿真器利用稀疏标记观测数据进行端到端训练，所学习的残差场能够泛化至不同形状。在仿真到仿真及仿真到现实的实验中，我们的方法在未见过的形态上均实现了稳定的零样本性能提升，而系统辨识方法则常出现负迁移现象。该框架还支持持续优化，使得仿真精度能够在形态优化过程中持续累积。

Abstract

Differentiable simulators enable gradient-based optimization of soft robots over material parameters, control, and morphology, but accurately modeling real systems remains challenging due to the sim-to-real gap. This issue becomes more pronounced when geometry is itself a design variable. System identification reduces discrepancies by fitting global material parameters to data; however, when constitutive models are misspecified or observations are sparse, identified parameters often absorb geometry-dependent effects rather than reflect intrinsic material behavior. More expressive constitutive models can improve accuracy but substantially increase computational cost, limiting practicality. We propose a residual acceleration field learning (RAFL) framework that augments a base simulator with a transferable, element-level corrective dynamics field. Operating on shared local features, the model is agnostic to global mesh topology and discretization. Trained end-to-end through a differentiable simulator using sparse marker observations, the learned residual generalizes across shapes. In both sim-to-sim and sim-to-real experiments, our method achieves consistent zero-shot improvements on unseen morphologies, while system identification frequently exhibits negative transfer. The framework also supports continual refinement, enabling simulation accuracy to accumulate during morphology optimization.

Keywords: soft robots, sim-to-real gap, differentiable simulator, residual acceleration field learning, morphology optimization, zero-shot generalization, system identification, geometry-dependent effects
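
The core structural idea in the abstract, a base simulator whose acceleration is corrected by an additive residual term, can be shown on a one-dimensional toy system. This is only a schematic: in the paper the residual is a learned, element-level field trained through a differentiable simulator, whereas here it is a hand-written damping term standing in for unmodeled physics, and every function name is hypothetical.

```python
def simulate(base_accel, residual_accel, pos, vel, dt=0.01, steps=1000):
    """Semi-implicit Euler integration with acceleration = base + residual."""
    for _ in range(steps):
        acc = base_accel(pos, vel) + residual_accel(pos, vel)
        vel = vel + dt * acc
        pos = pos + dt * vel
    return pos, vel


def spring(pos, vel):
    # Base simulator: ideal unit-mass, unit-stiffness spring with no damping.
    return -pos


def no_residual(pos, vel):
    return 0.0


def damping_residual(pos, vel):
    # Stand-in for a learned correction capturing damping the base model lacks.
    return -0.5 * vel


def energy(pos, vel):
    return 0.5 * (pos**2 + vel**2)


base_state = simulate(spring, no_residual, pos=1.0, vel=0.0)
corrected_state = simulate(spring, damping_residual, pos=1.0, vel=0.0)
```

The uncorrected spring keeps oscillating with roughly constant energy, while the corrected simulation loses energy over time, as a damped real system would. Keeping the correction as a separate additive term leaves the base simulator untouched, so the same correction can be attached to it for other configurations.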

269. ❌ On the Interplay of Priors and Overparametrization in Bayesian Neural Network Posteriors

Authors: Julius Kobialka, Emanuel Sommer, Chris Kolb, Juntae Kwon, Daniel Dold, David Rügamer Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22030v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

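Scoring rationale: The paper studies theoretical properties of Bayesian neural network posteriors, in particular the interplay of overparametrization and priors, which falls under deep learning theory. Although it involves neural networks, all keywords target large language models (LLMs) and related techniques (training methods, inference optimization, applications, and so on). This paper does not involve language models, large-model techniques, or AI-for-science applications, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper studies how overparametrization and priors jointly reshape the geometry of Bayesian neural network posteriors, identifies three key phenomena introduced by redundancy (balancedness, weight reallocation, and prior conformity), and validates through large-scale experiments that overparametrization induces structured, prior-aligned weight posteriors.

Abstract (Chinese translation)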

贝叶斯神经网络（BNN）的后验分布常被认为难以用于实际推断，因为对称性会使其碎片化，不可识别性会增加其维度，且权重空间先验常被视为缺乏意义。本研究探讨了过参数化与先验如何共同重塑BNN后验，并推导出相关推论，以帮助我们更好地理解二者的相互作用。我们证明，冗余性会引入三种关键现象，从根本上改变后验几何结构：平衡性、等概率流形上的权重重分配以及先验一致性。我们通过远超早期工作规模的后验采样实验验证了这些发现，并展示了过参数化如何诱导出具有结构化、与先验对齐的权重后验分布。

Abstract

Bayesian neural network (BNN) posteriors are often considered impractical for inference, as symmetries fragment them, non-identifiabilities inflate dimensionality, and weight-space priors are seen as meaningless. In this work, we study how overparametrization and priors together reshape BNN posteriors and derive implications allowing us to better understand their interplay. We show that redundancy introduces three key phenomena that fundamentally reshape the posterior geometry: balancedness, weight reallocation on equal-probability manifolds, and prior conformity. We validate our findings through extensive experiments with posterior sampling budgets that far exceed those of earlier works, and demonstrate how overparametrization induces structured, prior-aligned weight posterior distributions.

Keywords: Bayesian neural networks, posterior distributions, overparametrization, priors, posterior geometry, weight reallocation, prior conformity, posterior sampling

270. ❌ Do Papers Match Code? A Benchmark and Framework for Paper-Code Consistency Detection in Bioinformatics Software

Authors: Tianxiang Xu, Xiaoyan Zhu, Xin Lai, Sizhe Dang, Xin Lian, Hangyu Cheng, Jiayin Wang Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22018v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

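Scoring rationale: The paper focuses on bioinformatics. It proposes a new task, paper-code consistency detection, and builds the BioCon benchmark dataset and a detection framework. Its core aim is to resolve inconsistencies between paper descriptions and code implementations in bioinformatics software, improving scientific reproducibility. It is therefore highly relevant only to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 10), which directly matches the paper's application domain. All other keywords concern specific techniques such as large models, deep learning principles, training methods, inference optimization, and agent systems, none of which the paper studies or applies. The paper is mainly about cross-modal (natural language and code) semantic alignment and consistency detection, a more specific software engineering and scientific computing problem with no direct link to the other large-model keywords.

!!! tip deepseek-chat TL;DR

To address inconsistencies between paper descriptions and code implementations in bioinformatics software, the paper proposes the new task of paper-code consistency detection and builds the BioCon benchmark dataset and a cross-modal consistency detection framework, which effectively identifies consistency between the two.

Abstract (Chinese translation)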

确保研究论文与其对应的软件实现之间的一致性，对于软件可靠性和科学可复现性至关重要。然而，这一问题仍未得到充分探索，特别是在生物信息学领域，论文中的方法描述与实际代码实现之间的差异普遍存在。为填补这一空白，本文提出了一项新任务，即论文-代码一致性检测，并收集整理了48个生物信息学软件项目及其相关出版物。我们系统地将论文中句子级别的算法描述与函数级别的代码片段进行对齐。结合专家标注和混合负采样策略，我们构建了该任务在生物信息学领域的首个基准数据集，命名为BioCon。基于此基准，我们进一步提出了一个跨模态一致性检测框架，旨在建模自然语言描述与代码实现之间的语义关系。该框架采用统一的输入表示，并利用预训练模型来捕获论文与代码之间的深层语义对齐。为缓解类别不平衡和困难样本的影响，我们引入了加权焦点损失以增强模型的鲁棒性。实验结果表明，我们的框架能有效识别生物信息学中论文与代码之间的一致性，达到了0.9056的准确率和0.8011的F1分数。总体而言，本研究为论文-代码一致性分析开辟了新的研究方向，并为科学软件的自动化可复现性评估和跨模态理解奠定了基础。

Abstract

Ensuring consistency between research papers and their corresponding software implementations is fundamental to software reliability and scientific reproducibility. However, this problem remains underexplored, particularly in the domain of bioinformatics, where discrepancies between methodological descriptions in papers and their actual code implementations are prevalent. To address this gap, this paper introduces a new task, namely paper-code consistency detection, and curates a collection of 48 bioinformatics software projects along with their associated publications. We systematically align sentence-level algorithmic descriptions from papers with function-level code snippets. Combined with expert annotations and a hybrid negative sampling strategy, we construct the first benchmark dataset in the bioinformatics domain tailored to this task, termed BioCon. Based on this benchmark, we further propose a cross-modal consistency detection framework designed to model the semantic relationships between natural language descriptions and code implementations. The framework adopts a unified input representation and leverages pre-trained models to capture deep semantic alignment between papers and code. To mitigate the effects of class imbalance and hard samples, we incorporate a weighted focal loss to enhance model robustness. Experimental results demonstrate that our framework effectively identifies consistency between papers and code in bioinformatics, achieving an accuracy of 0.9056 and an F1 score of 0.8011. Overall, this study opens a new research direction for paper-code consistency analysis and lays the foundation for automated reproducibility assessment and cross-modal understanding in scientific software.
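
The abstract mentions a weighted focal loss for class imbalance and hard samples but does not give its exact form. As a minimal sketch, assuming the standard alpha-weighted binary focal loss $-\alpha_t (1 - p_t)^{\gamma} \log p_t$, it could look like the following. The function name, the default `alpha` and `gamma`, and the binary setting are illustrative assumptions, not details taken from the paper.

```python
import math


def weighted_focal_loss(probs, labels, alpha=0.75, gamma=2.0, eps=1e-12):
    """Mean alpha-weighted binary focal loss (illustrative, not the paper's code).

    probs:  predicted probabilities of the positive class, each in [0, 1]
    labels: ground-truth labels, each 0 or 1
    alpha:  weight on the positive class (1 - alpha goes to the negative class)
    gamma:  focusing parameter; gamma = 0 recovers weighted cross-entropy
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
        a_t = alpha if y == 1 else 1.0 - alpha  # class weight for this sample
        p_t = min(max(p_t, eps), 1.0)           # guard against log(0)
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# An easy positive (p = 0.9) contributes far less than a hard positive (p = 0.1).
easy = weighted_focal_loss([0.9], [1])
hard = weighted_focal_loss([0.1], [1])
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of well-classified samples, so training concentrates on hard ones, while `alpha` rebalances the two classes.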

Keywords: paper-code consistency, bioinformatics, benchmark dataset, cross-modal detection, scientific reproducibility, software reliability, semantic alignment, BioCon

Authors: Peter Pak, Amir Barati Farimani Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22017v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

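Scoring rationale: The paper's core work builds on the Gemma 3 large language model, using domain-adaptive pretraining and instruction tuning to develop AdditiveLLM2, a multi-modal LLM for additive manufacturing. It is therefore highly relevant (10) to "Large Language Models", "Pre-training/Domain Adaptation", and "Instruction Tuning". The work is an application of AI to science and engineering, so it is also highly relevant (10) to "AI for Science". Other keywords such as MoE, SLMs, SFT, RAG, reasoning methods, agents, and compression or acceleration are not mentioned or involved in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

Using domain-adaptive pretraining and instruction tuning on top of Gemma 3, the study develops AdditiveLLM2, a multi-modal large language model for additive manufacturing, which reaches accuracies above 90% on general additive manufacturing knowledge.

Abstract (Chinese translation)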

本研究提出了AdditiveLLM2，这是一个基于指令调优版Gemma 3模型构建的多模态、领域自适应大语言模型，其训练使用了相对较小的约5000万词元的数据集。该数据集（AdditiveLLM2-OA）由开放获取的增材制造期刊论文构成，其数据被提取用于领域自适应预训练和视觉指令调优过程。开发模型的各个阶段均通过“增材制造基准”进行评估，该基准由已发布资源汇编的增材制造领域特定任务组成。AdditiveLLM2在基于语言和视觉的任务中均表现出色，在通用增材制造知识方面准确率超过90%。这种领域自适应预训练与指令调优策略，为大型语言模型在增材制造等领域的专业化提供了一种可行的定制方法。

Abstract

This work presents AdditiveLLM2, a multi-modal, domain adapted large language model built upon the instruction tuned variant of the Gemma 3 model using a relatively small dataset of around 50 million tokens. The dataset (AdditiveLLM2-OA) consists of open-access additive manufacturing journal articles with data extracted for the domain adaptive pretraining and visual instruction tuning processes. Various stages of the developed model are evaluated with the Additive-Manufacturing-Benchmark which consists of additive manufacturing domain specific tasks compiled from published resources. AdditiveLLM2 exhibits proficiency in both language and vision based tasks, achieving accuracies upwards of 90% in general additive manufacturing knowledge. This domain adaptive pretraining and instruction tuning strategy outlines an accessible specialization method for large language models to a domain such as additive manufacturing.

Keywords: AdditiveLLM2, multi-modal large language model, additive manufacturing, domain adaptation, instruction tuning, Gemma 3, Additive-Manufacturing-Benchmark, visual instruction tuning

272. ❌ A plug-and-play approach with fast uncertainty quantification for weak lensing mass mapping

Authors: Hubert Leterme, Andreas Tersenov, Jalal Fadili, Jean-Luc Starck Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22006v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

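Scoring rationale: The paper studies a deep learning method (PnPMass) for weak lensing mass mapping, an application of AI to science (specifically astrophysics and cosmology). It is therefore only partially relevant (5) to "AI for Science OR Bioinformatics OR Cheminformatics", since it involves neither large-model techniques nor bioinformatics or cheminformatics. The remaining keywords concern large models, training methods, inference optimization, agent systems, and similar topics, which are unrelated, so they are scored 0.

!!! tip deepseek-chat TL;DR

For weak lensing mass mapping, the paper proposes PnPMass, a plug-and-play deep learning method that delivers accurate reconstruction, fast inference, and calibrated uncertainty quantification with coverage guarantees, making it suitable for upcoming large-scale cosmological surveys.

Abstract (Chinese translation)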

即将开展的第四阶段巡天项目（如欧几里得和鲁宾望远镜）将提供海量高精度数据，为以前所未有的精度约束宇宙学模型开辟新机遇。该过程中的关键步骤是从含噪的弱引力透镜剪切测量中重建暗物质分布。当前基于深度学习的质量映射方法虽能实现高重建精度，但要么需要为每个新观测天区重新训练模型（限制了实用性），要么依赖耗时的马尔可夫链蒙特卡洛采样。因此，有效利用未来巡天数据需要一种兼具高精度、灵活性和快速推理能力的新方法。此外，具备覆盖保证的不确定性量化对于可靠的宇宙学参数估计至关重要。我们提出PnPMass——一种即插即用式弱引力透镜质量映射方法。该算法通过交替执行两个步骤生成点估计：采用精心设计的数据保真项进行梯度下降，以及使用在受高斯白噪声污染的模拟数据上训练的单一深度学习模型执行去噪步骤。我们还提出一种基于矩网络的快速免采样不确定性量化方案，其校准误差条通过保形预测获得，以确保覆盖保证。最后，我们将PnPMass与模型驱动及数据驱动的质量映射技术进行基准测试。 PnPMass在实现与最先进深度学习方法相近性能的同时，具备快速推理能力（仅需数次迭代即可收敛），且仅需单次训练阶段即可独立于观测噪声协方差运行。该方法因此融合了灵活性、高效性与重建精度，并能提供比现有方法更严格的误差条，使其特别适用于未来的弱引力透镜巡天研究。

Abstract

Upcoming stage-IV surveys such as Euclid and Rubin will deliver vast amounts of high-precision data, opening new opportunities to constrain cosmological models with unprecedented accuracy. A key step in this process is the reconstruction of the dark matter distribution from noisy weak lensing shear measurements. Current deep learning-based mass mapping methods achieve high reconstruction accuracy, but either require retraining a model for each new observed sky region (limiting practicality) or rely on slow MCMC sampling. Efficient exploitation of future survey data therefore calls for a new method that is accurate, flexible, and fast at inference. In addition, uncertainty quantification with coverage guarantees is essential for reliable cosmological parameter estimation. We introduce PnPMass, a plug-and-play approach for weak lensing mass mapping. The algorithm produces point estimates by alternating between a gradient descent step with a carefully chosen data fidelity term, and a denoising step implemented with a single deep learning model trained on simulated data corrupted by Gaussian white noise. We also propose a fast, sampling-free uncertainty quantification scheme based on moment networks, with calibrated error bars obtained through conformal prediction to ensure coverage guarantees. Finally, we benchmark PnPMass against both model-driven and data-driven mass mapping techniques. PnPMass achieves performance close to that of state-of-the-art deep-learning methods while offering fast inference (converging in just a few iterations) and requiring only a single training phase, independently of the noise covariance of the observations. It therefore combines flexibility, efficiency, and reconstruction accuracy, while delivering tighter error bars than existing approaches, making it well suited for upcoming weak lensing surveys.
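
The alternation described in the abstract (a gradient step on a data-fidelity term followed by a denoising step) can be illustrated on a toy 1-D problem. This is only a sketch under stated assumptions: the forward operator is a simple observation mask rather than the weak lensing shear operator, the quadratic data-fidelity term is a generic choice, and `moving_average` is a stand-in for the trained deep denoiser. None of these names or choices come from the paper.

```python
def moving_average(x, radius=1):
    """Stand-in denoiser (assumption): moving average with clamped edges."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def pnp_reconstruct(y, mask, denoise=moving_average, step=1.0, n_iter=50):
    """Toy plug-and-play iteration for the model y = mask * x + noise.

    Data fidelity: f(x) = 0.5 * sum_i mask_i * (x_i - y_i)^2,
    so the gradient is grad f(x)_i = mask_i * (x_i - y_i).
    """
    x = [yi * mi for yi, mi in zip(y, mask)]  # initialise from the masked data
    for _ in range(n_iter):
        # 1) gradient descent step on the data-fidelity term
        x = [xi - step * mi * (xi - yi) for xi, yi, mi in zip(x, y, mask)]
        # 2) denoising step, which plays the role of an implicit prior
        x = denoise(x)
    return x


# A constant signal with two unobserved samples is filled in by the iteration.
mask = [1, 1, 0, 1, 1, 1, 0, 1, 1]
y = [float(m) for m in mask]  # true signal is 1.0 everywhere; gaps read as 0.0
x_hat = pnp_reconstruct(y, mask)
```

Because the denoiser is passed in as an argument, a single trained model can be reused across observations with different noise properties, which is the flexibility the abstract attributes to the plug-and-play design.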

Keywords: weak lensing mass mapping, deep learning, plug-and-play, uncertainty quantification, cosmological surveys, denoising, conformal prediction, fast inference

273. ❌ CRPS-Optimal Binning for Conformal Regression

Authors: Paolo Toccaceli Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22000v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

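Scoring rationale: The paper studies a non-parametric method for conditional distribution estimation that chooses bins by optimizing the CRPS (Continuous Ranked Probability Score) and applies it to conformal regression prediction intervals. All keywords concern large models, deep learning principles, or specific AI application domains (such as bioinformatics), whereas this paper is a regression method from classical statistical machine learning and involves no large models, deep learning, or AI for Science content, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes a CRPS-optimized binning method for non-parametric conditional distribution estimation. It selects the optimal binning by dynamic programming and builds conformal prediction sets, producing substantially narrower prediction intervals while maintaining near-nominal coverage.

Abstract (Chinese translation)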

我们提出一种非参数条件分布估计方法：首先将按协变量排序的观测值划分为连续区间，并以区间内经验累积分布函数作为预测分布。区间边界的选择旨在最小化总留一连续分级概率评分，该优化问题存在闭式成本函数，其预计算复杂度为$O(n^2 \log n)$，存储需求为$O(n^2)$；通过动态规划算法可在$O(n^2 K)$时间内恢复全局最优的$K$分区。研究发现，最小化样本内留一连续分级概率评分并不适用于选择$K$值，因其会导致样本内乐观偏差。因此我们改为通过交替留出分割的测试连续分级概率评分来选择$K$，该准则呈U形曲线且存在明确最小值。选定$K^*$并拟合全数据分区后，我们构建两个互补的预测对象：基于文恩预测的置信带，以及以连续分级概率评分作为非契合度分数的保形预测集，后者可在任意设定水平$\varepsilon$下提供有限样本边际覆盖保证。在与分割保形竞争方法（高斯分割保形、条件分位数回归及其分位数随机森林变体）的真实基准测试中，本方法在保持接近名义覆盖水平的同时，能产生显著更窄的预测区间。

Abstract

We propose a method for non-parametric conditional distribution estimation based on partitioning covariate-sorted observations into contiguous bins and using the within-bin empirical CDF as the predictive distribution. Bin boundaries are chosen to minimise the total leave-one-out Continuous Ranked Probability Score (LOO-CRPS), which admits a closed-form cost function with $O(n^2 \log n)$ precomputation and $O(n^2)$ storage; the globally optimal $K$-partition is recovered by a dynamic programme in $O(n^2 K)$ time. Minimisation of Within-sample LOO-CRPS turns out to be inappropriate for selecting $K$ as it results in in-sample optimism. So we instead select $K$ by evaluating test CRPS on an alternating held-out split, which yields a U-shaped criterion with a well-defined minimum. Having selected $K^*$ and fitted the full-data partition, we form two complementary predictive objects: the Venn prediction band and a conformal prediction set based on CRPS as the nonconformity score, which carries a finite-sample marginal coverage guarantee at any prescribed level $\varepsilon$. On real benchmarks against split-conformal competitors (Gaussian split conformal, CQR, and CQR-QRF), the method produces substantially narrower prediction intervals while maintaining near-nominal coverage.
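
The binning step in the abstract can be sketched as a dynamic programme over contiguous bins. The sketch below is a naive illustration: it evaluates the leave-one-out CRPS of each candidate bin by brute force using the identity $\mathrm{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$, instead of the closed-form cost with $O(n^2 \log n)$ precomputation that the paper reports. All function names are hypothetical, and requiring at least two points per bin is an assumption made so that the leave-one-out sample is never empty.

```python
def crps_empirical(sample, y):
    """CRPS of the empirical CDF of `sample` evaluated at observation y."""
    m = len(sample)
    term1 = sum(abs(z - y) for z in sample) / m
    term2 = sum(abs(a - b) for a in sample for b in sample) / (2.0 * m * m)
    return term1 - term2


def loo_crps_cost(ys):
    """Total leave-one-out CRPS of one bin (infinite if fewer than 2 points)."""
    if len(ys) < 2:
        return float("inf")
    return sum(crps_empirical(ys[:i] + ys[i + 1:], ys[i]) for i in range(len(ys)))


def optimal_bins(ys, K):
    """ys: responses already sorted by covariate.

    Returns (total cost, list of exclusive end indices of the K bins).
    """
    n = len(ys)
    inf = float("inf")
    # cost[i][j] is the LOO-CRPS of the bin ys[i:j]
    cost = [[loo_crps_cost(ys[i:j]) if j > i else inf for j in range(n + 1)]
            for i in range(n + 1)]
    best = [[inf] * (n + 1) for _ in range(K + 1)]
    back = [[0] * (n + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, n + 1):
            for i in range(j):
                c = best[k - 1][i] + cost[i][j]
                if c < best[k][j]:
                    best[k][j], back[k][j] = c, i
    # Walk the back-pointers to recover the bin boundaries.
    cuts, j = [], n
    for k in range(K, 0, -1):
        cuts.append(j)
        j = back[k][j]
    return best[K][n], cuts[::-1]


# Two well-separated clusters of responses are split between them.
ys = [0.0, 0.1, 0.05, 0.02, 5.0, 5.1, 4.95, 5.05]
total, cuts = optimal_bins(ys, K=2)
```

The triple loop over `k`, `j`, and `i` gives the $O(n^2 K)$ dynamic-programming time quoted in the abstract. Only the cost table is computed less efficiently here.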

Keywords: conformal regression, CRPS optimization, non-parametric estimation, prediction intervals, dynamic programming, Venn prediction, coverage guarantee, distribution estimation

274. ❌ BOOST-RPF: Boosted Sequential Trees for Radial Power Flow

Authors: Ehimare Okoyomon, Christoph Goebel Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21977v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

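Scoring rationale: BOOST-RPF focuses on power flow analysis in power systems, using gradient-boosted decision trees (XGBoost) to solve voltage prediction. The work applies machine learning to an engineering domain but does not involve large language models, new deep learning principles, or any of the specific techniques in the scoring keywords (MoE, RLHF, RAG, and so on). The keywords all relate directly to large models, deep learning techniques, or AI for Science (bioinformatics, cheminformatics), whereas this paper applies classical machine learning to power engineering, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes BOOST-RPF, a sequential method based on gradient-boosted decision trees for power flow analysis in radial grids. By recasting global graph regression as a path-based learning problem, it achieves better accuracy and generalization than analytical and neural baselines.

Abstract (Chinese translation)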

精确的潮流分析对现代配电系统至关重要，然而经典求解器面临可扩展性问题，当前机器学习模型则常受泛化能力不足的制约。本文提出BOOST-RPF这一新方法，它将电压预测任务从全局图回归重构为基于路径的序列学习问题。通过将辐射状网络分解为从根节点到叶节点的路径，我们利用梯度提升决策树（XGBoost）对局部压降规律进行建模。我们评估了三种架构变体：绝对电压模型、父节点残差模型以及物理信息残差模型。该方法使模型架构与潮流的递归物理特性相契合，确保其具备与系统规模无关的适用性及卓越的分布外鲁棒性。在Kerber Dorfnetz电网和ENGAGE测试集上的基准实验表明，BOOST-RPF的父节点残差变体取得了最先进的性能，在标准精度和泛化任务中均持续优于解析方法与神经网络基线。尽管全局多层感知机（MLP）和图神经网络（GNN）在拓扑结构变化时常出现性能下降，BOOST-RPF在未见过的馈线上仍能保持高精度。此外，该框架展现出线性的$O(N)$计算复杂度，并通过基于边的监督显著提升了样本效率，为配电系统运营商（DSO）的实时应用提供了一种可扩展且泛化能力强的解决方案。

Abstract

Accurate power flow analysis is critical for modern distribution systems, yet classical solvers face scalability issues, and current machine learning models often struggle with generalization. We introduce BOOST-RPF, a novel method that reformulates voltage prediction from a global graph regression task into a sequential path-based learning problem. By decomposing radial networks into root-to-leaf paths, we leverage gradient-boosted decision trees (XGBoost) to model local voltage-drop regularities. We evaluate three architectural variants: Absolute Voltage, Parent Residual, and Physics-Informed Residual. This approach aligns the model architecture with the recursive physics of power flow, ensuring size-agnostic application and superior out-of-distribution robustness. Benchmarked against the Kerber Dorfnetz grid and the ENGAGE suite, BOOST-RPF achieves state-of-the-art results with its Parent Residual variant which consistently outperforms both analytical and neural baselines in standard accuracy and generalization tasks. While global Multi-Layer Perceptrons (MLPs) and Graph Neural Networks (GNNs) often suffer from performance degradation under topological shifts, BOOST-RPF maintains high precision across unseen feeders. Furthermore, the framework displays linear $O(N)$ computational scaling and significantly increased sample efficiency through per-edge supervision, offering a scalable and generalizable alternative for real-time distribution system operator (DSO) applications.
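
The data preparation implied by the abstract (decomposing a radial network into root-to-leaf paths and supervising per edge) might look roughly like the sketch below. It covers only the graph bookkeeping and leaves out the XGBoost model itself. The function names, the parent-map representation, and the reading of "Parent Residual" as the child voltage minus the parent voltage are assumptions for illustration, not details confirmed by the paper.

```python
def root_to_leaf_paths(parent):
    """parent: dict mapping each node to its parent (the root maps to None)."""
    children = {}
    for node, par in parent.items():
        children.setdefault(node, [])
        if par is not None:
            children.setdefault(par, []).append(node)
    root = next(n for n, p in parent.items() if p is None)
    paths, stack = [], [[root]]
    while stack:
        path = stack.pop()
        kids = children[path[-1]]
        if not kids:            # reached a leaf: the path is complete
            paths.append(path)
        for kid in kids:
            stack.append(path + [kid])
    return paths


def parent_residual_targets(parent, voltage):
    """One supervised target per edge: child voltage minus parent voltage."""
    return {(par, node): voltage[node] - voltage[par]
            for node, par in parent.items() if par is not None}


# Small feeder: node 0 is the substation, node 1 branches to nodes 2 and 3.
parent = {0: None, 1: 0, 2: 1, 3: 1, 4: 0}
voltage = {0: 1.00, 1: 0.98, 2: 0.97, 3: 0.96, 4: 0.99}  # per-unit, made-up values
paths = sorted(root_to_leaf_paths(parent))
targets = parent_residual_targets(parent, voltage)
```

Each edge yields one training target regardless of feeder size, which is consistent with the size-agnostic application and per-edge supervision described in the abstract.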

Keywords: power flow analysis, gradient-boosted decision trees, XGBoost, radial networks, voltage prediction, distribution systems, generalization, sequential learning

275. ❌ Structural Concentration in Weighted Networks: A Class of Topology-Aware Indices

Authors: L. Riso, M. G. Zoia Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21918v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

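Scoring rationale: The paper "Structural Concentration in Weighted Networks: A Class of Topology-Aware Indices" focuses on network science and complex systems theory. It proposes a new framework of topology-aware concentration indices for measuring structural concentration in weighted networks. Its core content covers network topology, weight distributions, concentration measurement, and complex systems analysis, at the intersection of mathematics, economics, and network science. All scoring keywords relate directly to large models, deep learning, AI principles, or applications of AI in science, none of which the paper addresses, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new framework of topology-aware concentration indices that measures structural concentration in weighted systems by accounting for both the weight distribution and the network structure, and shows through theoretical and empirical analysis that network topology matters for measuring concentration.

Abstract (Chinese translation)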

本文提出了一种用于衡量嵌入交互网络中的加权系统集中度的统一框架。传统指标（如赫芬达尔-赫希曼指数）虽能捕捉权重分布的离散程度，却忽略了权重接收单元间关系的拓扑结构。为弥补这一缺陷，我们引入了一系列拓扑感知的集中度指数，可同时考量权重分布与网络结构。该框架的核心是基础性的网络集中度指数（Network Concentration Index, NCI），其定义为一种标准化二次型，用于衡量沿观测网络链路实现的潜在加权互联比例。在此基础上，我们构建了一类灵活的扩展指标，通过调整交互结构或标准化基准，衍生出加权型、密度调整型、零模型型、度约束型、数据转换型及多层变体等多种形式。该指数族保留了标准化性、不变性与可解释性等关键性质，同时支持从不同依赖维度（包括强度、高阶交互与极端事件）评估集中度。理论研究刻画了这些指数的特性，并建立了其与经典集中度指标及网络度量之间的关系。实证与模拟研究表明，具有相同权重分布的系统可能因网络拓扑差异而呈现显著不同的结构集中度，这凸显了本框架所捕捉的额外信息。该方法广泛适用于经济、金融及复杂系统中通过网络交互的加权要素分析。

Abstract

This paper develops a unified framework for measuring concentration in weighted systems embedded in networks of interactions. While traditional indices such as the Herfindahl-Hirschman Index capture dispersion in weights, they neglect the topology of relationships among the elements receiving those weights. To address this limitation, we introduce a family of topology-aware concentration indices that jointly account for weight distributions and network structure. At the core of the framework lies a baseline Network Concentration Index (NCI), defined as a normalized quadratic form that measures the fraction of potential weighted interconnection realized along observed network links. Building on this foundation, we construct a flexible class of extensions that modify either the interaction structure or the normalization benchmark, including weighted, density-adjusted, null-model, degree-constrained, transformed-data, and multi-layer variants. This family of indices preserves key properties such as normalization, invariance, and interpretability, while allowing concentration to be evaluated across different dimensions of dependence, including intensity, higher-order interactions, and extreme events. Theoretical results characterize the indices and establish their relationship with classical concentration and network measures. Empirical and simulation evidence demonstrate that systems with identical weight distributions may exhibit markedly different levels of structural concentration depending on network topology, highlighting the additional information captured by the proposed framework. The approach is broadly applicable to economic, financial, and complex systems in which weighted elements interact through networks.
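
To make the contrast in the abstract concrete, the sketch below computes the Herfindahl-Hirschman Index, which depends only on the weights, next to a normalized quadratic form that also depends on the links. The second function is an illustrative assumption and not the paper's NCI formula: it takes the weighted interconnection realized on observed links, $\sum_{i \neq j} a_{ij} w_i w_j$, and divides by the potential interconnection on a complete graph, $\sum_{i \neq j} w_i w_j$. The paper's exact definition and normalization may differ.

```python
def hhi(weights):
    """Herfindahl-Hirschman Index: sum of squared shares."""
    total = sum(weights)
    return sum((w / total) ** 2 for w in weights)


def quadratic_concentration(weights, adjacency):
    """Hypothetical topology-aware index (assumed form, see lead-in).

    adjacency[i][j] is 1 if nodes i and j are linked, else 0.
    """
    n = len(weights)
    realized = sum(adjacency[i][j] * weights[i] * weights[j]
                   for i in range(n) for j in range(n) if i != j)
    potential = sum(weights[i] * weights[j]
                    for i in range(n) for j in range(n) if i != j)
    return realized / potential


weights = [0.5, 0.3, 0.2]
complete = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
star_on_heaviest = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # hub is node 0
star_on_lightest = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]  # hub is node 2

same_hhi = hhi(weights)  # identical for all three topologies
index_complete = quadratic_concentration(weights, complete)
index_heavy = quadratic_concentration(weights, star_on_heaviest)
index_light = quadratic_concentration(weights, star_on_lightest)
```

With identical weights, the index is higher when the hub is the heaviest node than when it is the lightest, while the HHI stays the same. This mirrors the abstract's point that systems with identical weight distributions can differ in structural concentration.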

Keywords: weighted networks, concentration indices, network topology, structural concentration, complex systems, network concentration index, topology-aware, interaction structure

276. ❌ A Novel Method for Enforcing Exactly Dirichlet, Neumann and Robin Conditions on Curved Domain Boundaries for Physics Informed Machine Learning

Authors: Suchuan Dong, Yuchuan Zhang Venue/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21909v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

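Scoring rationale: The paper focuses on exactly enforcing boundary conditions in physics-informed machine learning (PIML), which belongs to scientific computing and the numerical solution of partial differential equations. It uses the extreme learning machine (ELM) technique, an application of machine learning to science, so it is partially relevant (5) to "AI for Science OR Bioinformatics OR Cheminformatics". Its core content (boundary-condition handling, TFC, ELM), however, is unrelated to all other keywords, which center on large models, deep learning principles, training methods, inference optimization, alignment, agents, and so on. None of these keywords or their concepts appear in the title or abstract, so all other keywords are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new method for exactly enforcing Dirichlet, Neumann, and Robin boundary conditions in physics-informed machine learning. By combining TFC constrained expressions with transfinite interpolation, it attains boundary-condition errors at machine accuracy on domains with complex boundary geometries.

Abstract (Chinese translation)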

本文提出了一种系统方法，用于在具有任意曲线边界的广义四边形区域上精确施加狄利克雷、诺伊曼和罗宾类型边界条件。该方法建立在广义四边形区域与标准区域之间的精确映射基础上，并结合了TFC（函数连接理论）约束表达式与超限插值技术。当存在诺伊曼或罗宾边界时，特别是当两条诺伊曼（或罗宾）边界在顶点相交时，精确满足交点处衍生的相容性约束对于精确实现连接边界上的给定条件至关重要。我们详细分析并提出了针对两类情形的边界条件处理及相容性约束构造方案：（i）当诺伊曼（或罗宾）边界仅与狄利克雷边界相交时；（ii）当两条诺伊曼（或罗宾）边界相互相交时。我们描述了一个四步流程，用于系统构建在广义四边形区域上精确满足给定狄利克雷、诺伊曼或罗宾边界条件的函数通用形式。本文开发的方法已与我们近期为科学机器学习所开发的极限学习机（ELM）技术结合实现。通过在多类具有复杂边界几何的二维区域上开展大量线性/非线性稳态/动态问题的数值实验，仿真结果表明所提方法能够精确实现曲线区域边界上的狄利克雷、诺伊曼和罗宾条件，其数值边界条件误差达到机器精度水平。

Abstract

We present a systematic method for exactly enforcing Dirichlet, Neumann, and Robin type conditions on general quadrilateral domains with arbitrary curved boundaries. Our method is built upon exact mappings between general quadrilateral domains and the standard domain, and employs a combination of TFC (theory of functional connections) constrained expressions and transfinite interpolations. When Neumann or Robin boundaries are present, especially when two Neumann (or Robin) boundaries meet at a vertex, it is critical to enforce exactly the induced compatibility constraints at the intersection, in order to enforce exactly the imposed conditions on the joining boundaries. We analyze in detail and present constructions for handling the imposed boundary conditions and the induced compatibility constraints for two types of situations: (i) when Neumann (or Robin) boundary only intersects with Dirichlet boundaries, and (ii) when two Neumann (or Robin) boundaries intersect with each other. We describe a four-step procedure to systematically formulate the general form of functions that exactly satisfy the imposed Dirichlet, Neumann, or Robin conditions on general quadrilateral domains. The method developed herein has been implemented together with the extreme learning machine (ELM) technique we have developed recently for scientific machine learning. Ample numerical experiments are presented with several linear/nonlinear stationary/dynamic problems on a variety of two-dimensional domains with complex boundary geometries. Simulation results demonstrate that the proposed method has enforced the Dirichlet, Neumann, and Robin conditions on curved domain boundaries exactly, with the numerical boundary-condition errors at the machine accuracy.
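
The idea of a constrained expression can be shown in one dimension, which is much simpler than the curved quadrilateral domains treated in the paper. As a minimal sketch, assuming Dirichlet values `a` at $x = 0$ and `b` at $x = 1$, the expression $u(x) = g(x) + (1 - x)\,(a - g(0)) + x\,(b - g(1))$ satisfies both conditions for any free function $g$. The function name is hypothetical, and the 2-D mapping, transfinite interpolation, and Neumann or Robin compatibility constraints from the paper are not covered.

```python
import math


def constrained_expression(g, a, b):
    """Wrap a free function g so that u(0) = a and u(1) = b hold by construction."""
    g0, g1 = g(0.0), g(1.0)

    def u(x):
        # The correction terms cancel g at the end points and insert a and b.
        return g(x) + (1.0 - x) * (a - g0) + x * (b - g1)

    return u


# Any free function works; in a learning setting g would be the trainable network.
u = constrained_expression(math.cos, a=2.0, b=-1.0)
left, right = u(0.0), u(1.0)
```

Because the boundary values hold for every choice of `g`, training only has to fit the governing equation in the interior, which is why the boundary error can reach machine accuracy rather than depending on a penalty term.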

Keywords: Physics Informed Machine Learning, Boundary Conditions, Dirichlet, Neumann, Robin, Extreme Learning Machine (ELM), Numerical Simulation, Curved Boundaries

277. ❌ SparseDVFS: Sparse-Aware DVFS for Energy-Efficient Edge Inference

Authors: Ziyang Zhang, Zheshun Wu, Jie Liu, Luca Mottola Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21908v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

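Scoring rationale: The paper "SparseDVFS: Sparse-Aware DVFS for Energy-Efficient Edge Inference" focuses on the energy efficiency of deep neural network (DNN) inference on edge devices and proposes a fine-grained, sparse-aware dynamic voltage and frequency scaling (DVFS) framework. Although it concerns deploying deep learning models at the edge, its core is hardware-level energy management (DVFS, frequency modulation, hardware switching latency) rather than innovation in large-model or deep learning techniques. All scoring keywords relate directly to large-model techniques, training methods, inference optimization (such as attention mechanisms, alignment, and agent systems), or AI-for-science applications, none of which this paper addresses. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes SparseDVFS, a fine-grained, sparse-aware DVFS framework for improving the energy efficiency of DNN inference on edge devices. By distinguishing compute-bound dense operators from memory-bound sparse operators and applying specialized frequency triplets, it achieves an average 78.17% energy efficiency gain in its evaluations.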

摘要翻译

在功耗敏感的边缘设备上部署深度神经网络（DNN）是一项艰巨挑战。动态电压频率调节（DVFS）虽被广泛用于能耗优化，但传统的模型级调节粒度往往过于粗糙，无法捕捉推理过程中的内部变化；而细粒度的算子级调节则会因显著的硬件切换延迟导致性能严重下降。本文提出SparseDVFS，一种面向高效边缘推理的细粒度稀疏感知DVFS框架。我们的核心洞见是：算子稀疏度是硬件频率调节的关键指标。通过区分计算密集型稠密算子与内存密集型稀疏算子，系统可应用定制化的频率三元组（CPU/GPU/EMC）以最大化能效。为克服切换开销与组件间干扰，SparseDVFS融合三项关键创新：（1）离线建模器通过白盒时间线分析，建立算子稀疏度与最优频率三元组之间的确定性映射；（2）运行时图分区器采用贪心合并启发式算法，将算子聚合为超块，通过延迟摊销约束平衡调节粒度与DVFS切换延迟；（3）统一协同控制器集成频率统一调节引擎（FUSE）与前瞻指令队列，以消除独立控制器间的对抗效应并隐藏硬件切换延迟。大量实验表明，SparseDVFS在保持14%优越成本收益比的同时，相比现有最优方案平均提升78.17%的能效。

摘要 (Abstract)

Deploying deep neural networks (DNNs) on power-sensitive edge devices presents a formidable challenge. While Dynamic Voltage and Frequency Scaling (DVFS) is widely employed for energy optimization, traditional model-level scaling is often too coarse to capture intra-inference variations, whereas fine-grained operator-level scaling suffers from prohibitive performance degradation due to significant hardware switching latency. This paper presents SparseDVFS, a fine-grained, sparse-aware DVFS framework designed for energy-efficient edge inference. Our key insight is that operator sparsity is a primary metric for hardware frequency modulation. By distinguishing between compute-bound dense operators and memory-bound sparse operators, the system can apply specialized frequency triplets to maximize energy efficiency. To overcome switching overheads and component interference, SparseDVFS incorporates three key innovations: (1) an offline modeler that establishes a deterministic mapping between operator sparsity and optimal frequency triplets (CPU/GPU/EMC) via white-box timeline analysis; (2) a runtime graph partitioner that utilizes a greedy merging heuristic to aggregate operators into super-blocks, balancing scaling granularity and DVFS switching latency through a latency amortization constraint; and (3) a unified co-governor that employs a frequency unified scaling engine (FUSE) and a look-ahead instruction queue to eliminate antagonistic effects between independent controllers and hide hardware transition latencies. Extensive evaluations show that SparseDVFS achieves an average 78.17% energy efficiency gain over state-of-the-art solutions while maintaining a superior 14% cost-gain ratio.
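The abstract does not give the details of the greedy merging heuristic, so the following is only a hedged sketch of what a latency-amortized partitioner could look like. Everything here is an assumption made for illustration: the `Op` record, the sparsity threshold, the `max_overhead` ratio, and the rule that a block closes only when it is long enough and the next operator changes sparsity class.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    sparsity: float     # fraction of zero entries, in [0, 1]
    duration_ms: float  # profiled execution time

def block_ms(block):
    return sum(op.duration_ms for op in block)

def block_is_sparse(block, threshold):
    # Duration-weighted mean sparsity decides the block's class.
    weighted = sum(op.sparsity * op.duration_ms for op in block)
    return weighted / block_ms(block) >= threshold

def partition(ops, switch_latency_ms, max_overhead=0.1, threshold=0.5):
    """Greedily merge consecutive operators into super-blocks.

    A block is closed only when (a) it is long enough that one frequency
    switch costs at most `max_overhead` of its run time, and (b) the next
    operator belongs to the other sparsity class.
    """
    min_block_ms = switch_latency_ms / max_overhead
    blocks, current = [], []
    for op in ops:
        if current and block_ms(current) >= min_block_ms:
            if (op.sparsity >= threshold) != block_is_sparse(current, threshold):
                blocks.append(current)
                current = []
        current.append(op)
    if current:
        # A short tail cannot amortize a switch, so fold it into the last block.
        if blocks and block_ms(current) < min_block_ms:
            blocks[-1].extend(current)
        else:
            blocks.append(current)
    return blocks

# Hypothetical operator profile, not taken from the paper.
ops = [Op("conv1", 0.1, 6.0), Op("conv2", 0.2, 6.0), Op("fc1", 0.9, 6.0),
       Op("fc2", 0.8, 6.0), Op("softmax", 0.1, 1.0)]
blocks = partition(ops, switch_latency_ms=1.0)
print([[op.name for op in b] for b in blocks])
```

With a 1 ms switch latency and a 10% overhead budget, no block shorter than 10 ms gets its own frequency setting, which is the trade-off between scaling granularity and switching latency that the abstract describes.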

Keywords: SparseDVFS, energy-efficient edge inference, dynamic voltage and frequency scaling, operator sparsity, hardware frequency modulation, graph partitioner, frequency unified scaling engine, latency amortization

278. ❌ Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization

Authors: Weilin Wan, Jingtao Han, Weizhong Zhang, Cheng Jin Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21862v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	15.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

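Scoring rationale: The paper's core topic is scaling-law optimization for Mixture-of-Experts architectures, so it is highly relevant to 'Mixture of Experts' (15 points) and directly concerns 'Large Language Models' and 'Scaling Laws' (10 points each). Other keywords, such as SLMs, training methods, inference techniques, and application domains, are not addressed in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses the open problem of translating LLM scaling laws into precise Mixture-of-Experts architectural configurations. It proposes a reusable holistic optimization framework that, by establishing compute-fairness constraints and reducing the dimensionality of the search space, can produce a complete, optimal MoE architecture for any compute budget.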

摘要翻译

大语言模型的缩放定律主导着宏观资源分配，但由于组合爆炸的设计空间，将其转化为精确的混合专家模型架构配置仍是一个开放性问题。现有的MoE缩放研究受限于实验成本，要么在缩放公式中引入额外的MoE变量（可能导致拟合不可靠），要么固定所有非MoE因素（忽略了全局交互）。我们提出了一个可复用的整体MoE架构优化框架来弥合这一差距。我们首先证明，仅凭单令牌浮点运算次数不足以公平评估MoE模型，因为不同层类型的计算密度差异可能导致参数量膨胀而计算成本未成比例增加，并由此建立了单令牌浮点运算次数、激活参数量与总参数量三者联合的约束体系。随后，我们通过代数约束和隐藏维度的秩保持特性，将16维架构搜索空间缩减为两个顺序的低维搜索阶段。该框架在跨越六个数量级计算量的数百个MoE模型上得到验证，能够生成稳健的缩放定律，将任意计算预算映射为完整且最优的MoE架构。一个关键发现是，接近最优的配置带宽随模型规模扩大而增加，这为实践者提供了量化灵活性，使其能在缩放定律建议与基础设施限制之间进行权衡。

摘要 (Abstract)

Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architectural optimization that bridges this gap. We first show that FLOPs per token alone is an inadequate fairness metric for MoE models because differing computational densities across layer types can inflate parameters without proportional compute cost, and establish a joint constraint triad of FLOPs per token, active parameters, and total parameters. We then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints and a rank-preserving property of the hidden dimension. Validated across hundreds of MoE models spanning six orders of magnitude in compute, our framework yields robust scaling laws that map any compute budget to a complete, optimal MoE architecture. A key finding is that the near-optimal configuration band widens with scale, giving practitioners quantitative flexibility to balance scaling law recommendations against infrastructure constraints.
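One simple way to see why FLOPs per token, active parameters, and total parameters can diverge is to count them for a single layer. The sketch below is a rough illustration under stated assumptions, not the paper's accounting: `moe_block_stats` is a hypothetical helper that assumes four square attention projections, two-matrix FFN experts, a linear router, and 2 FLOPs per multiply-add, and it ignores biases, norms, and embeddings.

```python
def moe_block_stats(d_model, d_ff, n_experts, top_k, seq_len):
    """Rough per-layer counts for one attention + MoE-FFN block."""
    attn_params = 4 * d_model * d_model      # Q, K, V, and output projections
    expert_params = 2 * d_model * d_ff       # up and down projections
    router_params = d_model * n_experts

    total_params = attn_params + router_params + n_experts * expert_params
    active_params = attn_params + router_params + top_k * expert_params

    # Attention scores (QK^T and weights @ V) cost FLOPs but hold no parameters.
    score_flops = 2 * 2 * seq_len * d_model
    flops_per_token = 2 * active_params + score_flops
    return {"total": total_params, "active": active_params, "flops": flops_per_token}

# Two hypothetical configurations that differ only in the number of experts.
small = moe_block_stats(d_model=1024, d_ff=4096, n_experts=8, top_k=2, seq_len=2048)
large = moe_block_stats(d_model=1024, d_ff=4096, n_experts=64, top_k=2, seq_len=2048)
print(large["flops"] / small["flops"], large["total"] / small["total"])
```

Under these assumptions the two layers cost almost the same FLOPs per token while the total parameter count differs several-fold, which is why a comparison that fixes only FLOPs per token can be unfair.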

Keywords: Mixture-of-Experts, Scaling Laws, Large Language Models, Architectural Optimization, FLOPs per token, Active Parameters, Holistic Framework, Compute Budget

279. ❌ All elementary functions from a single binary operator

Authors: Andrzej Odrzywołek Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21852v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

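Scoring rationale: This paper is foundational mathematics: it shows that a single binary operator, eml(x,y)=exp(x)-ln(y), can generate all elementary functions, and it demonstrates an application to symbolic regression. Its content belongs entirely to mathematical foundations and symbolic computation and has no direct connection to any of the scoring keywords, which all concern large models, deep learning, and their principles, applications, and optimization. The paper does not involve any AI model, machine learning method, or related technical concept.

!!! tip deepseek-chat TL;DR

The paper finds that a single binary operator, eml(x,y)=exp(x)-ln(y), together with the constant 1, can generate all elementary functions, and shows how the operator's tree structure supports gradient-based optimization to recover exact closed-form functions from numerical data.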

摘要翻译

在数字硬件中，单个双输入门即可实现所有布尔逻辑。然而，连续数学领域一直缺乏类似的基元：计算诸如 sin、cos、sqrt 和 log 等初等函数始终需要多种不同的运算。本文证明，单个二元运算符 eml(x,y)=exp(x)-ln(y) 与常数 1 结合，即可生成科学计算器的标准功能集合。这包括 $e$、$π$、$i$ 等常数；算术运算 $+$、$-$、$\times$、$/$ 及乘方运算，以及常见的超越函数和代数函数。例如，$e^x=\operatorname{eml}(x,1)$，$\ln x=\operatorname{eml}(1,\operatorname{eml}(\operatorname{eml}(1,x),1))$，其他所有运算均可类似构造。此类运算符的存在性此前未被预见；我通过系统性的穷举搜索发现了它，并以构造性方式证明其足以实现具体科学计算器的功能基础。在 EML（指数减对数）形式下，每个此类表达式都转化为由相同节点构成的二叉树，产生如 $S \to 1 \mid \operatorname{eml}(S,S)$ 这般简洁的语法。这种统一结构还支持基于梯度的符号回归：通过将 EML 树作为可训练电路并采用标准优化器（如 Adam），我证明了在树深度不超过 4 的浅层结构中，从数值数据精确恢复闭式初等函数的可行性。该架构同样能拟合任意数据，但当生成规律为初等函数时，它可能恢复出精确公式。

摘要 (Abstract)

A single two-input gate suffices for all of Boolean logic in digital hardware. No comparable primitive has been known for continuous mathematics: computing elementary functions such as sin, cos, sqrt, and log has always required multiple distinct operations. Here I show that a single binary operator, eml(x,y)=exp(x)-ln(y), together with the constant 1, generates the standard repertoire of a scientific calculator. This includes constants such as $e$, $π$, and $i$; arithmetic operations including $+$, $-$, $\times$, $/$, and exponentiation as well as the usual transcendental and algebraic functions. For example, $e^x=\operatorname{eml}(x,1)$, $\ln x=\operatorname{eml}(1,\operatorname{eml}(\operatorname{eml}(1,x),1))$, and likewise for all other operations. That such an operator exists was not anticipated; I found it by systematic exhaustive search and established constructively that it suffices for the concrete scientific-calculator basis. In EML (Exp-Minus-Log) form, every such expression becomes a binary tree of identical nodes, yielding a grammar as simple as $S \to 1 \mid \operatorname{eml}(S,S)$. This uniform structure also enables gradient-based symbolic regression: using EML trees as trainable circuits with standard optimizers (Adam), I demonstrate the feasibility of exact recovery of closed-form elementary functions from numerical data at shallow tree depths up to 4. The same architecture can fit arbitrary data, but when the generating law is elementary, it may recover the exact formula.
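The two identities quoted in the abstract can be checked numerically. The sketch below is a minimal check restricted to positive real inputs; the helper names `exp_via_eml` and `ln_via_eml` are chosen here for illustration and do not come from the paper.

```python
import math

def eml(x, y):
    """The binary operator from the abstract: exp(x) - ln(y)."""
    return math.exp(x) - math.log(y)

def exp_via_eml(x):
    # e^x = eml(x, 1), because ln(1) = 0.
    return eml(x, 1.0)

def ln_via_eml(x):
    # eml(1, x) = e - ln(x); eml(that, 1) = e^e / x; eml(1, e^e / x) = ln(x).
    return eml(1.0, eml(eml(1.0, x), 1.0))

for x in (0.5, 2.0, 7.0):
    print(x, exp_via_eml(x) - math.exp(x), ln_via_eml(x) - math.log(x))
```

The printed differences should be at the level of floating-point rounding. The nested form for `ln` also shows the grammar from the abstract in action: every node is the same operator and every leaf is either the input or the constant 1.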

Keywords: elementary functions, binary operator, symbolic regression, exp-minus-log, scientific calculator, gradient-based optimization, closed-form recovery, mathematical foundation

280. ❌ Deriving Health Metrics from the Photoplethysmogram: Benchmarks and Insights from MIMIC-III-Ext-PPG

Authors: Mohammad Moulaeifard, Philip J. Aston, Peter H. Charlton, Nils Strodthoff Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21832v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

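Scoring rationale: This paper focuses on deep learning for clinical prediction tasks from photoplethysmography (PPG) signals, including arrhythmia detection and physiological parameter estimation. It is unrelated to most keywords (large-model techniques, training methods, inference optimization, agent systems, and so on), which target large language models and general-purpose AI systems. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the study applies AI in biomedicine (specifically clinical prediction), which falls under AI for Science, but it is not core large-model technology, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The study establishes a comprehensive benchmark for multi-task clinical prediction from photoplethysmography (PPG) signals, uses deep learning models to achieve high-accuracy atrial fibrillation detection and physiological parameter estimation, and demonstrates good cross-dataset generalization.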

摘要翻译

光电容积脉搏波描记法（Photoplethysmography, PPG）是临床预测任务中采集最广泛的生物信号之一，然而基于PPG的算法通常在小规模且质量不确定的数据集上进行训练，这阻碍了有意义的算法比较。我们利用 MIMIC-III-Ext-PPG 数据集提出了一个基于PPG的临床预测综合基准，为全系列临床相关应用建立了基线：包括多类心律分类，以及呼吸频率（Respiratory Rate, RR）、心率（Heart Rate, HR）和血压（Blood Pressure, BP）等生理参数的回归。最值得注意的是，我们首次对PPG在房颤（Atrial Fibrillation, AF）和房扑（Atrial Flutter, AFLT）之外的广义心律失常检测进行了全面评估，并按血压、心率及人口统计学亚组对性能进行了分层分析。采用成熟的深度学习架构，我们在房颤检测（AUROC = 0.96）和生理参数估计方面取得了优异性能（RR平均绝对误差：2.97次/分钟；HR平均绝对误差：1.13次/分钟；收缩压/舒张压平均绝对误差：16.13/8.70 mmHg）。跨数据集验证显示房颤检测具有出色的泛化能力（AUROC = 0.97），而临床亚组分析揭示了不同血压、心率及人口统计学层次亚组间显著的性能差异。这些差异似乎反映了人群特异性的波形差异，而非模型行为的系统性偏差。该框架首次建立了基于PPG的多任务临床预测一体化基准，证明PPG信号能有效支持多种同步监测任务，并为未来算法开发提供了必要的基线。

摘要 (Abstract)

Photoplethysmography (PPG) is one of the most widely captured biosignals for clinical prediction tasks, yet PPG-based algorithms are typically trained on small-scale datasets of uncertain quality, which hinders meaningful algorithm comparisons. We present a comprehensive benchmark for PPG-based clinical prediction using the MIMIC-III-Ext-PPG dataset, establishing baselines across the full spectrum of clinically relevant applications: multi-class heart rhythm classification, and regression of physiological parameters including respiratory rate (RR), heart rate (HR), and blood pressure (BP). Most notably, we provide the first comprehensive assessment of PPG for general arrhythmia detection beyond atrial fibrillation (AF) and atrial flutter (AFLT), with performance stratified by BP, HR, and demographic subgroups. Using established deep learning architectures, we achieved strong performance for AF detection (AUROC = 0.96) and accurate physiological parameter estimation (RR MAE: 2.97 bpm; HR MAE: 1.13 bpm; SBP/DBP MAE: 16.13/8.70 mmHg). Cross-dataset validation demonstrates excellent generalizability for AF detection (AUROC = 0.97), while clinical subgroup analysis reveals marked performance differences across subgroups by BP, HR, and demographic strata. These variations appear to reflect population-specific waveform differences rather than systematic bias in model behavior. This framework establishes the first integrated benchmark for multi-task PPG-based clinical prediction, demonstrating that PPG signals can effectively support multiple simultaneous monitoring tasks and providing essential baselines for future algorithm development.
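The two metrics reported in the abstract, MAE for regression and AUROC for detection, have short standard definitions. The sketch below computes both in plain Python on made-up toy values; the numbers are illustrative only and have nothing to do with the paper's data or models.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy values for illustration only.
hr_mae = mean_absolute_error([60.0, 72.0, 88.0], [61.0, 70.0, 88.0])
af_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(hr_mae, af_auroc)
```

The pairwise form of AUROC is quadratic in the number of samples, so it suits small checks like this one; rank-based implementations are the usual choice for full datasets.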

Keywords: Photoplethysmography, PPG, clinical prediction, deep learning, arrhythmia detection, physiological parameter estimation, benchmark, MIMIC-III

281. ❌ Show Me What You Don’t Know: Efficient Sampling from Invariant Sets for Model Validation

Authors: Armand Rousselot, Joran Wendebourg, Ullrich Köthe Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21782v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

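Scoring rationale: The paper proposes a method for analyzing the invariances of feature extractors, visualizing model behavior by sampling from feature fibers. It is unrelated to most keywords because it does not involve large-model principles, training methods, inference optimization, or agent systems. It has moderate relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points) because it uses a pretrained diffusion model as a prior, although this is not the paper's core. It is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (8 points) because its core is model explanation and visualization. It is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points) because the experiments include a biomedical dataset (CheXpert) and model (BiomedClip) and show Qwen-2B applied to medical image analysis.

!!! tip deepseek-chat TL;DR

The paper proposes a training-free method that visualizes the invariances of machine learning models by guiding a pretrained generative model to sample from feature fibers, revealing learned-feature behavior that ranges from desirable to potentially problematic, and validates it on datasets such as ImageNet and CheXpert.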

摘要翻译

机器学习模型的性能取决于其学习特征的质量。理想情况下，这些特征应对不相关的数据变化保持不变，同时对任务相关的细节保持敏感。为了可视化这一特性是否实现，我们提出一种方法，通过从特征提取器的纤维（fiber）——即由其不变性定义的等价类——中进行采样来分析特征提取器，采样过程基于任意给定的代表性样本。与现有工作中需要为每个特征检测器训练专用生成模型不同，我们的算法无需训练，并利用预训练的扩散模型或流匹配模型作为先验。纤维损失（fiber loss）——惩罚特征不匹配——通过非线性扩散轨迹匹配，引导去噪过程朝向目标等价类。这将以相当的保真度，将原本需要数天训练的不变性学习过程，替换为一次引导生成过程。在流行数据集（ImageNet、CheXpert）和模型类型（ResNet、DINO、BiomedClip）上的实验表明，我们的框架能够揭示从非常理想到值得关注的各种不变性行为。例如，我们展示了Qwen-2B模型如何将患有内脏反位（心脏位于右侧）的患者与典型解剖结构的样本置于同一纤维中。

摘要 (Abstract)

The performance of machine learning models is determined by the quality of their learned features. They should be invariant under irrelevant data variation but sensitive to task-relevant details. To visualize whether this is the case, we propose a method to analyze feature extractors by sampling from their fibers – equivalence classes defined by their invariances – given an arbitrary representative. Unlike existing work where a dedicated generative model is trained for each feature detector, our algorithm is training-free and exploits a pretrained diffusion or flow-matching model as a prior. The fiber loss – which penalizes mismatch in features – guides the denoising process toward the desired equivalence class, via non-linear diffusion trajectory matching. This replaces days of training for invariance learning with a single guided generation procedure at comparable fidelity. Experiments on popular datasets (ImageNet, CheXpert) and model types (ResNet, DINO, BiomedClip) demonstrate that our framework can reveal invariances ranging from very desirable to concerning behaviour. For instance, we show how Qwen-2B places patients with situs inversus (heart on the right side) in the same fiber as typical anatomy.

Keywords: feature invariance, model validation, fiber sampling, diffusion models, interpretability, medical imaging, CheXpert, BiomedClip

282. ❌ Cluster-Specific Predictive Modeling: A Scalable Solution for Resource-Constrained Wi-Fi Controllers

Authors: Gianluca Fontanesi, Luca Barbieri, Lorenzo Galati Giordano, Alfonso Fernandez Duran, Thorsten Wild Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21778v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

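Scoring rationale: The paper studies predictive-modeling optimization in Wi-Fi networks, using clustering algorithms and model evaluation techniques to improve prediction accuracy and optimize resource utilization. It focuses entirely on network engineering and a domain-specific application of machine learning, and does not involve large models, deep learning principles, AI for Science, or any technique named in the keywords. All keywords concern large models, deep learning, or scientific AI applications, whereas this paper applies traditional machine learning to network management, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how feature clustering and cluster-specific predictive models can improve prediction accuracy and optimize resource utilization on resource-constrained Wi-Fi controllers, and finds that cluster-specific models achieve lower mean absolute error than the global model in high-activity clusters.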

摘要翻译

本文通过整合聚类算法与模型评估技术，对托管式Wi-Fi网络中的预测模型优化进行了全面分析。研究针对在内存和计算资源受限的中央控制器管理的大规模环境中部署预测算法所面临的挑战，提出了解决方案。研究采用基于特征的聚类方法，辅以主成分分析（PCA）和先进的特征工程技术，将时间序列数据依据其共享特征进行分组，从而能够为每个聚类开发特定的预测模型。通过对比全局模型与聚类特定模型的评估结果表明，在高活跃度聚类中，聚类特定模型在平均绝对误差值方面始终展现出更优的准确性。研究分析了模型复杂度（及准确性）与资源利用之间的权衡，凸显了定制化建模方法的可扩展性。研究结果主张采用自适应网络管理策略，通过选择性模型部署来优化资源分配，提升预测准确性，并确保在大型、集中管理的Wi-Fi环境中实现可扩展的运营。

摘要 (Abstract)

This manuscript presents a comprehensive analysis of predictive modeling optimization in managed Wi-Fi networks through the integration of clustering algorithms and model evaluation techniques. The study addresses the challenges of deploying forecasting algorithms in large-scale environments managed by a central controller constrained by memory and computational resources. Feature-based clustering, supported by Principal Component Analysis (PCA) and advanced feature engineering, is employed to group time series data based on shared characteristics, enabling the development of cluster-specific predictive models. Comparative evaluations between global models (GMs) and cluster-specific models demonstrate that cluster-specific models consistently achieve superior accuracy in terms of Mean Absolute Error (MAE) values in high-activity clusters. The trade-offs between model complexity (and accuracy) and resource utilization are analyzed, highlighting the scalability of tailored modeling approaches. The findings advocate for adaptive network management strategies that optimize resource allocation through selective model deployment, enhance predictive accuracy, and ensure scalable operations in large-scale, centrally managed Wi-Fi environments.

Keywords: predictive modeling, Wi-Fi networks, clustering algorithms, resource-constrained, scalability, feature engineering, time series data, model accuracy

283. ❌ Identifiability and amortized inference limitations in Kuramoto models

Authors: Emma Hannula, Jana de Wiljes, Matthew T. Moores, Heikki Haario, Lassi Roininen Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21752v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

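Scoring rationale: The paper studies Bayesian inference for Kuramoto models (networks of nonlinear oscillators) and proposes an amortized inference method that uses a neural network to approximate the posterior. All keywords relate directly to large language models, deep learning principles, or specific AI application areas, whereas this paper focuses on traditional Bayesian inference and computational physics/engineering problems. It is only weakly related to 'AI for Science' (because it uses AI methods in scientific computing), and this is not its core content.

!!! tip deepseek-chat TL;DR

To address the computational difficulty of Bayesian inference in Kuramoto oscillator networks, the paper proposes a neural-network-based amortized inference method that provides fast, scalable posterior approximation and uncertainty quantification.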

摘要翻译

贝叶斯推断是动力学系统中参数估计与不确定性量化的有力工具。然而，对于诸如Kuramoto模型这类广泛应用于物理、生物和工程领域同步现象研究的非线性振子网络，由于高维状态空间和难以处理的似然函数，推断过程往往在计算上难以实现。本文提出一种摊销式贝叶斯推断方法，该方法通过模拟相位动力学学习后验分布的神经近似，从而无需重复采样或优化即可实现快速、可扩展的推断。将本方法应用于合成Kuramoto网络，其在近似后验分布和捕捉不确定性方面显示出良好效果，与传统贝叶斯技术相比显著节约了计算成本。这些结果表明，摊销式推断为振子网络的不确定性感知分析提供了一个实用且灵活的框架。

摘要 (Abstract)

Bayesian inference is a powerful tool for parameter estimation and uncertainty quantification in dynamical systems. However, for nonlinear oscillator networks such as Kuramoto models, widely used to study synchronization phenomena in physics, biology, and engineering, inference is often computationally prohibitive due to high-dimensional state spaces and intractable likelihood functions. We present an amortized Bayesian inference approach that learns a neural approximation of the posterior from simulated phase dynamics, enabling fast, scalable inference without repeated sampling or optimization. Applied to synthetic Kuramoto networks, the method shows promising results in approximating posterior distributions and capturing uncertainty, with computational savings compared to traditional Bayesian techniques. These findings suggest that amortized inference is a practical and flexible framework for uncertainty-aware analysis of oscillator networks.
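Amortized inference of this kind is trained on simulated phase dynamics, so a forward simulator is the first ingredient. The sketch below covers only that step, using the standard all-to-all Kuramoto equations with forward-Euler integration; the network size, step size, coupling values, and function names are assumptions made for illustration, and the neural posterior approximation itself is not shown.

```python
import cmath
import math
import random

def order_parameter(thetas):
    """Return (r, psi) with r * exp(i * psi) = mean of exp(i * theta_j)."""
    z = sum(cmath.exp(1j * t) for t in thetas) / len(thetas)
    return abs(z), cmath.phase(z)

def simulate(omegas, coupling, thetas, dt=0.05, steps=2000):
    """Forward-Euler integration of the all-to-all Kuramoto model."""
    thetas = list(thetas)
    for _ in range(steps):
        r, psi = order_parameter(thetas)
        # Mean-field form of d(theta_i)/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i).
        thetas = [t + dt * (w + coupling * r * math.sin(psi - t))
                  for t, w in zip(thetas, omegas)]
    return thetas

rng = random.Random(0)
n = 20
theta0 = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
r_coupled, _ = order_parameter(simulate([0.0] * n, 2.0, theta0))
r_free, _ = order_parameter(simulate([0.0] * n, 0.0, theta0))
print(r_coupled, r_free)
```

With identical natural frequencies, positive coupling drives the order parameter `r` toward 1 (synchronization), while zero coupling leaves the phases where they started. Pairs of parameters and simulated trajectories like these are what a posterior network would be trained on.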

Keywords: Bayesian inference, Kuramoto models, amortized inference, neural approximation, oscillator networks, parameter estimation, uncertainty quantification, computational savings

284. ❌ Model selection in hybrid quantum neural networks with applications to quantum transformer architectures

Authors: Harsh Wadhwa, Rahul Bhowmick, Naipunnya Raj, Rajiv Sangle, Ruchira V. Bhat, Krishnakumar Sabapathy Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21749v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

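Scoring rationale: This paper studies model selection in quantum machine learning, in particular quantum transformer architectures, and is unrelated to most of the keywords (LLM techniques, training methods, inference optimization, alignment, agents, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies quantum machine learning to scientific computing. It is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

To address the lack of design guidelines for quantum machine learning models, the paper develops the Quantum Bias-Expressivity Toolbox (QBET), which introduces Simplicity Bias and Expressivity metrics to pre-screen quantum transformer architecture variants efficiently, and shows that quantum self-attention variants outperform their classical counterparts on some tasks.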

Abstract translation (Chinese)

量子机器学习模型通常缺乏系统化的设计准则，往往需要对大量编码方案、量子电路设计和初始化策略进行资源密集型的完整训练，才能找到有效配置。为应对这一挑战，我们开发了量子偏置-表达能力工具箱（QBET），这是一个用于评估量子、经典及混合Transformer架构的框架。在该工具箱中，我们引入了简洁性偏置（SB）和表达能力（EXP）的轻量化度量指标，以比较不同模型，并将SB的分析拓展至生成式和多分类任务。我们证明，QBET能够高效预筛选有潜力的模型变体，从而避免执行完整的训练流程。在基于Transformer的分类和生成任务评估中，我们共使用18个量子比特进行嵌入（查询、键和值各占6个量子比特）。通过依据SB度量对相应模型进行排序并比较其相对性能，我们识别出了量子自注意力变体超越其经典对应模型的若干场景。

Abstract

Quantum machine learning models generally lack principled design guidelines, often requiring full, resource-intensive training across numerous choices of encodings, quantum circuit designs, and initialization strategies to find an effective configuration. To address this challenge, we develop the Quantum Bias-Expressivity Toolbox (QBET), a framework for evaluating quantum, classical, and hybrid transformer architectures. In this toolbox, we introduce lean metrics for Simplicity Bias (SB) and Expressivity (EXP) for comparing across various models, and extend the analysis of SB to generative and multiclass-classification tasks. We show that QBET enables efficient pre-screening of promising model variants, obviating the need to execute complete training pipelines. In evaluations on transformer-based classification and generative tasks, we employ a total of 18 qubits for embeddings (6 qubits each for query, key, and value). We identify scenarios in which quantum self-attention variants surpass their classical counterparts by ranking the respective models according to the SB metric and comparing their relative performance.

Keywords: quantum machine learning, transformer architectures, model selection, quantum bias-expressivity toolbox, simplicity bias, expressivity, quantum self-attention, hybrid quantum neural networks

285. ❌ CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

Authors: Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B. Fox, Serena Yeung-Levy Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21743v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

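Scoring rationale: The paper "CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning" focuses on using reinforcement learning (RL) to optimize a virtual cell generative model so that its outputs are more biologically plausible. Its core is an AI application in bioinformatics and computational biology, specifically cell image generation and accelerating drug discovery. It is therefore highly relevant only to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 10): the work falls directly within bioinformatics applications, and the abstract explicitly mentions "accelerating drug discovery" and "biologically meaningful evaluators". The other keywords concern LLM-related techniques (MoE, SFT, RAG, CoT, etc.), model optimization methods (quantization, attention mechanisms), or specific AI concepts (agents, world models). This paper involves neither LLMs nor any of those techniques. It uses RL only as an optimization tool, and RL itself is not a scoring keyword, so all other keywords are rated 0.

!!! tip deepseek-chat TL;DR

The study addresses the problem that virtual cell generative models can violate biological constraints. It introduces an RL-based post-training framework that uses biological evaluators as reward functions, markedly improving the biological plausibility and functional accuracy of the generated cells.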

Abstract translation (Chinese)

利用生成模型构建虚拟细胞以在计算机中模拟细胞行为，正成为加速药物发现的一种新兴且有前景的研究范式。然而，现有的基于图像的生成方法可能产生违反基本物理和生物学约束的、不合理的细胞图像。为解决这一问题，我们提出使用强化学习对虚拟细胞模型进行后训练，将具有生物学意义的评估器作为奖励函数。我们设计了涵盖三个类别——生物功能、结构有效性和形态正确性——的七种奖励，并优化了最先进的CellFlux模型，从而得到CellFluxRL。在所有奖励指标上，CellFluxRL均持续优于CellFlux，且通过测试时缩放可进一步提升性能。总体而言，我们的研究成果提出了一个通过强化学习强制执行基于物理约束的虚拟细胞建模框架，推动细胞生成从“视觉上真实”迈向“生物学上有意义”的新阶段。

Abstract

Building virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories (biological function, structural validity, and morphological correctness) and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond “visually realistic” generations towards “biologically meaningful” ones.

Keywords: virtual cell modeling, reinforcement learning, biological constraints, drug discovery, generative models, CellFlux, post-training, biologically meaningful

286. ❌ Uncertainty Quantification for Distribution-to-Distribution Flow Matching in Scientific Imaging

Authors: Dongxia Wu, Yuhui Zhang, Serena Yeung-Levy, Emma Lundberg, Emily B. Fox Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21717v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

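Scoring rationale: The paper focuses on uncertainty quantification for distribution-to-distribution generative models in scientific imaging. It belongs to AI for Science (the 27th keyword, relevance 8/10), but its content is mainly generative models, uncertainty quantification, Bayesian methods, and Monte Carlo techniques. These are unrelated to the other 26 keywords in the list (mostly LLM techniques, training methods, inference optimization, agents, and so on), so those keywords are all rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes Bayesian Stochastic Flow Matching, a unified uncertainty quantification framework for distribution-to-distribution generative models in scientific imaging. Stochastic Flow Matching improves reliability, and the MCD-Antithetic method strengthens accountability in anomaly detection.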

Abstract translation (Chinese)

分布到分布生成模型支持从细胞扰动响应建模到跨条件医学图像转换等一系列科学成像任务。可信生成既需要可靠性（跨实验室、设备和实验条件的泛化能力），也需要可问责性（检测预测可能不可靠的分布外情况）。基于不确定性量化（UQ）的方法为这些任务提供了有前景的解决方案，但针对分布到分布生成模型的UQ研究仍显不足。本文提出一个统一的不确定性量化框架——贝叶斯随机流匹配（Bayesian Stochastic Flow Matching，BSFM），能够解耦任意不确定性和认知不确定性。其随机流匹配（Stochastic Flow Matching，SFM）组件通过引入扩散项增强确定性流，以提升模型对未见场景的泛化能力。针对不确定性量化，我们开发了一种可扩展的贝叶斯方法——MCD-Antithetic，该方法将蒙特卡洛丢弃（Monte Carlo Dropout）与样本高效的对立采样相结合，为分布外检测生成有效的异常分数。在细胞成像（BBBC021、JUMP）和脑功能磁共振成像（心智理论任务）的多种场景实验表明，SFM提升了模型可靠性，而MCD-Antithetic增强了可问责性。

Abstract

Distribution-to-distribution generative models support scientific imaging tasks ranging from modeling cellular perturbation responses to translating medical images across conditions. Trustworthy generation requires both reliability (generalization across labs, devices, and experimental conditions) and accountability (detecting out-of-distribution cases where predictions may be unreliable). Uncertainty quantification (UQ) based approaches serve as promising candidates for these tasks, yet UQ for distribution-to-distribution generative models remains underexplored. We present a unified UQ framework, Bayesian Stochastic Flow Matching (BSFM), that disentangles aleatoric and epistemic uncertainty. The Stochastic Flow Matching (SFM) component augments deterministic flows with a diffusion term to improve model generalization to unseen scenarios. For UQ, we develop a scalable Bayesian approach – MCD-Antithetic – that combines Monte Carlo Dropout with sample-efficient antithetic sampling to produce effective anomaly scores for out-of-distribution detection. Experiments on cellular imaging (BBBC021, JUMP) and brain fMRI (Theory of Mind) across diverse scenarios show that SFM improves reliability while MCD-Antithetic enhances accountability.

Keywords: Uncertainty Quantification, Distribution-to-distribution generative models, Scientific imaging, Bayesian Stochastic Flow Matching, Monte Carlo Dropout, Anomaly detection, Cellular imaging, Brain fMRI

287. ❌ LipsAM: Lipschitz-Continuous Amplitude Modifier for Audio Signal Processing and its Application to Plug-and-Play Dereverberation

Authors: Kazuki Matsumoto, Ren Uchida, Kohei Yatabe Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21684v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

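Scoring rationale: The paper focuses on the robustness of deep neural networks (DNNs) in audio signal processing, in particular the use of Lipschitz continuity, and has no direct connection to any of the scoring keywords (which all center on large models, innovations in deep learning techniques, and their applications in science). It does not touch on large models, language models, training methods, inference techniques, agent systems, model compression, or scientific AI applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes LipsAM, a Lipschitz-continuous amplitude modifier architecture for audio signal processing, applies it to a Plug-and-Play algorithm for speech dereverberation, and demonstrates improved stability through numerical experiments.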

Abstract translation (Chinese)

深度神经网络（DNNs）的鲁棒性可通过其Lipschitz连续性得到认证，这使得构建Lipschitz连续DNN成为一个活跃的研究领域。然而，由于与现有成果兼容性较差，面向音频处理的DNN尚未成为主要焦点。本文针对处理音频信号的常用架构，即幅度修正器（AM，Amplitude Modifier），提出了其Lipschitz连续变体，称为LipsAM。我们证明了AM具备Lipschitz连续性的充分条件，并以两种架构作为LipsAM的实例进行说明。所提出的架构被应用于语音去混响的即插即用算法中，数值实验证明了其稳定性的提升。

Abstract

The robustness of deep neural networks (DNNs) can be certified through their Lipschitz continuity, which has made the construction of Lipschitz-continuous DNNs an active research field. However, DNNs for audio processing have not been a major focus due to their poor compatibility with existing results. In this paper, we consider the amplitude modifier (AM), a popular architecture for handling audio signals, and propose its Lipschitz-continuous variants, which we refer to as LipsAM. We prove a sufficient condition for an AM to be Lipschitz continuous and propose two architectures as examples of LipsAM. The proposed architectures were applied to a Plug-and-Play algorithm for speech dereverberation, and their improved stability is demonstrated through numerical experiments.

Keywords: Lipschitz continuity, deep neural networks, audio signal processing, amplitude modifier, speech dereverberation, Plug-and-Play algorithm, robustness, stability

288. ❌ Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs

Authors: Tian Xia Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21705v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	15.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

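Scoring rationale: The paper's core contribution is a model merging method, specifically layer-adaptive merging for long chain-of-thought reasoning LLMs, so it is highly relevant to 'Model Merging OR Model Soups OR Weight Averaging' (15 points). It explicitly involves LLMs and chain-of-thought reasoning, so it is also highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points each). Other keywords, such as MoE, SLMs, pre-training, alignment, RAG, and quantization, are neither mentioned in nor related to the abstract and are rated 0.

!!! tip deepseek-chat TL;DR

Addressing model merging for long chain-of-thought reasoning LLMs, the paper proposes FIM-Merging, a data-free, layer-adaptive merging method based on the Fisher Information Matrix, which achieves state-of-the-art performance on several benchmarks while substantially reducing output length.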

Abstract translation (Chinese)

模型融合已成为一种无需额外训练即可整合专用大语言模型（LLM）能力的实用方法。在长链推理到短链输出（L2S）场景中，将基础模型与长思维链推理模型融合旨在保持推理准确性的同时减少输出长度。现有方法依赖于任务算术及其变体，这些方法隐含假设模型输出随融合系数线性变化，而我们证明该假设在L2S场景中系统性失效。我们首次为分层自适应融合提供了理论依据：证明融合误差以一个与逐层海森矩阵范数成正比的项为上界（命题1），并通过局部最优处的费雪-海森等价性，确立费雪信息矩阵（FIM）可作为该上界的可计算理论代理指标。基于此理论，我们提出FIM-Merging方法，该方法仅使用随机词元（token）输入（无需领域特定校准数据）计算对角FIM，并据此分配逐层融合系数。在7B参数的L2S基准测试中，FIM-TIES在六项评估基准中的五项达到最优性能，其中在MATH500上较ACM-TIES提升+6.2分（90.2 vs. 84.0），且无需校准数据。在1.5B参数基准测试中，FIM-TIES平均准确率达到47.3，超越此前最优方法ACM-TIES（43.3）+3.9分，同时将平均响应长度相对长思维链模型降低91.9%。我们的框架还为现有分层自适应方法（如ACM）在经验上优于均匀融合提供了统一的理论解释。

Abstract

Model merging has emerged as a practical approach to combine capabilities of specialized large language models (LLMs) without additional training. In the Long-to-Short (L2S) scenario, merging a base model with a long-chain-of-thought reasoning model aims to preserve reasoning accuracy while reducing output length. Existing methods rely on Task Arithmetic and its variants, which implicitly assume that model outputs vary linearly with the merging coefficient, an assumption we show is systematically violated in L2S settings. We provide the first theoretical justification for layer-adaptive merging: we prove that merging error is bounded by a term proportional to the per-layer Hessian norm (Proposition 1), and establish that the Fisher Information Matrix (FIM) is a principled, computable proxy for this bound via the Fisher-Hessian equivalence at local optima. Building on this theory, we propose FIM-Merging, which computes diagonal FIM using only random token inputs (no domain-specific calibration data required) and uses it to assign per-layer merging coefficients. On the 7B L2S benchmark, FIM-TIES achieves state-of-the-art performance on five out of six evaluation benchmarks, including a +6.2 point gain on MATH500 over ACM-TIES (90.2 vs. 84.0), while requiring no calibration data. On the 1.5B benchmark, FIM-TIES achieves an average accuracy of 47.3, surpassing the previous best ACM-TIES (43.3) by +3.9 points, while reducing average response length by 91.9% relative to the long-CoT model. Our framework also provides a unified theoretical explanation for why existing layer-adaptive methods such as ACM empirically outperform uniform merging.

Keywords: Model Merging, Large Language Models, Chain-of-Thought Reasoning, Fisher Information Matrix, Layer-adaptive Merging, Long-to-Short Reasoning, Data-Free Merging, FIM-Merging
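As a rough illustration of what "per-layer merging coefficients" means in the abstract above, the sketch below applies the standard task-arithmetic update, merged = base + λ · (reasoning − base), with a separate λ for each layer. It is not the FIM-Merging implementation: the function name `merge_layerwise`, the toy weight dictionaries, and the coefficient values are all hypothetical, and because the abstract does not say how the diagonal FIM is converted into coefficients, that step is deliberately left out.

```python
from typing import Dict, List

# Hypothetical representation: layer name -> flattened list of parameters.
Weights = Dict[str, List[float]]


def merge_layerwise(base: Weights, reasoning: Weights,
                    coeffs: Dict[str, float]) -> Weights:
    """Task-arithmetic merge with one coefficient per layer.

    For every layer: merged = base + coeff * (reasoning - base).
    coeff = 0 keeps the base layer, coeff = 1 takes the reasoning layer.
    """
    merged: Weights = {}
    for name, base_params in base.items():
        lam = coeffs[name]
        other = reasoning[name]
        if len(other) != len(base_params):
            raise ValueError(f"shape mismatch in layer {name!r}")
        merged[name] = [b + lam * (r - b) for b, r in zip(base_params, other)]
    return merged


# Toy example with made-up numbers (assumption: two layers, two weights each).
base_model = {"l1": [0.0, 0.0], "l2": [1.0, 1.0]}
reasoning_model = {"l1": [2.0, 4.0], "l2": [3.0, 1.0]}
layer_coeffs = {"l1": 0.5, "l2": 0.0}  # hypothetical per-layer coefficients

merged_model = merge_layerwise(base_model, reasoning_model, layer_coeffs)
# merged_model == {"l1": [1.0, 2.0], "l2": [1.0, 1.0]}
```

Uniform merging is the special case where every layer shares the same coefficient. A layer-adaptive method differs only in how `layer_coeffs` is chosen.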

289. ❌ CoNBONet: Conformalized Neuroscience-inspired Bayesian Operator Network for Reliability Analysis

Authors: Shailesh Garg, Souvik Chakraborty Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21678v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

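Scoring rationale: The paper proposes CoNBONet, a deep learning model for reliability analysis of nonlinear dynamical systems, which is an application of AI in science and engineering. Its content is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, and so on). It has some connection only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because the work applies AI to engineering science (reliability analysis). It is not bioinformatics or cheminformatics, so it receives 5 points.

!!! tip deepseek-chat TL;DR

The paper proposes CoNBONet, a neuroscience-inspired Bayesian operator network for time-dependent reliability analysis of nonlinear dynamical systems under stochastic excitation, delivering fast, energy-efficient reliability assessment with guaranteed uncertainty quantification.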

Abstract translation (Chinese)

非线性动力系统在随机激励下的时变可靠性分析是一项关键但计算量巨大的任务。传统方法（如蒙特卡洛模拟）需要对计算成本高昂的数值求解器进行反复评估，导致显著的计算瓶颈。为应对这一挑战，我们提出了一种受神经科学启发的代理模型CoNBONet，它能够实现快速、节能且具备不确定性感知的可靠性分析，为蒙特卡洛模拟等技术提供了一种可扩展的替代方案。CoNBONet（全称Conformalized Neuroscience-inspired Bayesian Operator Network，即共形化神经科学启发的贝叶斯算子网络）充分利用了深度算子网络的表达能力，同时集成了受神经科学启发的神经元模型，以实现快速、低功耗的推理。与高斯过程、多项式混沌展开或支持向量回归等传统代理模型不同（这些模型在面对高维、时变可靠性问题时可能面临可扩展性挑战），CoNBONet具备以下优势：通过受神经科学启发的网络架构实现快速且节能的推理；通过分割共形预测（split conformal prediction）提供具有理论保证的校准后不确定性量化；以及通过一种将输入函数映射到系统响应轨迹的算子学习范式，获得强大的泛化能力。在各种非线性动力系统上对所提出的CoNBONet进行的验证表明，CoNBONet保持了预测保真度，并实现了对失效概率的可靠覆盖，这使其成为工程设计中稳健且可扩展的可靠性分析的有力工具。

Abstract

Time-dependent reliability analysis of nonlinear dynamical systems under stochastic excitations is a critical yet computationally demanding task. Conventional approaches, such as Monte Carlo simulation, necessitate repeated evaluations of computationally expensive numerical solvers, leading to significant computational bottlenecks. To address this challenge, we propose CoNBONet, a neuroscience-inspired surrogate model that enables fast, energy-efficient, and uncertainty-aware reliability analysis, providing a scalable alternative to techniques such as Monte Carlo simulations. CoNBONet, short for Conformalized Neuroscience-inspired Bayesian Operator Network, leverages the expressive power of deep operator networks while integrating neuroscience-inspired neuron models to achieve fast, low-power inference. Unlike traditional surrogates such as Gaussian processes, polynomial chaos expansions, or support vector regression, which may face scalability challenges for high-dimensional, time-dependent reliability problems, CoNBONet offers fast and energy-efficient inference enabled by a neuroscience-inspired network architecture, calibrated uncertainty quantification with theoretical guarantees via split conformal prediction, and strong generalization capability through an operator-learning paradigm that maps input functions to system response trajectories. Validation of the proposed CoNBONet for various nonlinear dynamical systems demonstrates that CoNBONet preserves predictive fidelity and achieves reliable coverage of failure probabilities, making it a powerful tool for robust and scalable reliability analysis in engineering design.

Keywords: reliability analysis, nonlinear dynamical systems, surrogate model, neuroscience-inspired, Bayesian operator network, conformal prediction, uncertainty quantification, Monte Carlo simulation
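The abstract above names split conformal prediction as the source of its calibrated uncertainty estimates. A minimal sketch of the generic technique for scalar regression follows. It is not CoNBONet itself: the predictor, the calibration data, and the helper names `conformal_quantile` and `split_conformal_interval` are assumptions made for illustration. The idea is to compute absolute residuals on a held-out calibration set, take the ceil((n + 1)(1 − α))-th smallest residual as the margin q, and report prediction ± q.

```python
import math
from typing import Callable, Sequence, Tuple


def conformal_quantile(residuals: Sequence[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration residual.

    If that rank exceeds n, the calibration set is too small for the requested
    level, and the only valid margin is infinite.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(residuals)[rank - 1]


def split_conformal_interval(
    predict: Callable[[float], float],
    calib_x: Sequence[float],
    calib_y: Sequence[float],
    alpha: float,
) -> Callable[[float], Tuple[float, float]]:
    """Build an interval function from a fitted predictor and calibration data."""
    residuals = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(residuals, alpha)

    def interval(x: float) -> Tuple[float, float]:
        centre = predict(x)
        return centre - q, centre + q

    return interval


# Toy example (all values hypothetical): a surrogate y = 2x and noisy targets.
surrogate = lambda x: 2.0 * x
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
noise = [0.5, -1.0, 1.5, -2.0, 2.5, -3.0, 3.5]
ys = [2.0 * x + e for x, e in zip(xs, noise)]

interval_fn = split_conformal_interval(surrogate, xs, ys, alpha=0.25)
low, high = interval_fn(10.0)  # margin is the 6th smallest residual, 3.0
```

Under the usual exchangeability assumption between calibration and test points, intervals built this way cover the true value with probability at least 1 − α, without any distributional assumption on the residuals.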

290. ❌ SPINONet: Scalable Spiking Physics-informed Neural Operator for Computational Mechanics Applications

Authors: Shailesh Garg, Luis Mandl, Somdatta Goswami, Souvik Chakraborty Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21674v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

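Scoring rationale: SPINONet focuses on physics-informed operator learning in computational mechanics and proposes a neuroscience-inspired spiking neural network framework to improve energy efficiency. All keywords relate to large language models (LLMs), their training and alignment techniques, inference optimization, agent systems, or general-purpose AI models, whereas this paper studies a domain-specific scientific computing model (a physics-informed neural network), not a large model. The only slightly relevant keyword is "AI for Science", since the paper is an AI application in scientific computing. It is not bioinformatics or cheminformatics, so it receives 5 points (some relevance). The other keywords are entirely unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

To address the low energy efficiency of physics-informed operator learning models on edge devices in computational mechanics, the paper proposes SPINONet, a neuroscience-inspired spiking neural network framework that achieves sparse computation and lower energy use while keeping predictive performance comparable.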

Abstract translation (Chinese)

在计算力学与科学计算领域，部署物理信息算子学习模型时，能源效率仍是一个关键挑战，尤其在边缘设备和嵌入式设备等功耗受限的场景中，密集网络内重复的算子评估会带来巨大的计算与能耗成本。为应对这一挑战，我们提出了可分离物理信息神经科学启发算子网络（Separable Physics-informed Neuroscience-inspired Operator Network, SPINONet），这是一种受神经科学启发的框架，能够在保持与物理信息训练兼容的同时，减少重复评估中的冗余计算。SPINONet通过架构感知设计，引入了适用于回归任务的神经科学启发脉冲神经元，实现了稀疏、事件驱动的计算方式，在提升能源效率的同时，保留了计算时空导数所需的连续、坐标可微路径。我们在具有时空及参数依赖性的瞬态与稳态场景下，针对一系列代表计算力学问题的偏微分方程对SPINONet进行了评估，结果表明尽管引入了稀疏通信，其预测性能仍与传统物理信息算子学习方法相当。此外，在混合训练设置中引入有限的数据监督，可提升在纯物理信息训练可能收敛到伪解时的困难场景下的性能表现。最后，我们通过理论分析探讨了SPINONet的架构组件与设计选择如何降低计算负荷与能耗。

Abstract

Energy efficiency remains a critical challenge in deploying physics-informed operator learning models for computational mechanics and scientific computing, particularly in power-constrained settings such as edge and embedded devices, where repeated operator evaluations in dense networks incur substantial computational and energy costs. To address this challenge, we introduce the Separable Physics-informed Neuroscience-inspired Operator Network (SPINONet), a neuroscience-inspired framework that reduces redundant computation across repeated evaluations while remaining compatible with physics-informed training. SPINONet incorporates regression-friendly neuroscience-inspired spiking neurons through an architecture-aware design that enables sparse, event-driven computation, improving energy efficiency while preserving the continuous, coordinate-differentiable pathways required for computing spatio-temporal derivatives. We evaluate SPINONet on a range of partial differential equations representative of computational mechanics problems, with spatial, temporal, and parametric dependencies in both time-dependent and steady-state settings, and demonstrate predictive performance comparable to conventional physics-informed operator learning approaches despite the induced sparse communication. In addition, limited data supervision in a hybrid setup is shown to improve performance in challenging regimes where purely physics-informed training may converge to spurious solutions. Finally, we provide an analytical discussion linking architectural components and design choices of SPINONet to reductions in computational load and energy consumption.

Keywords: Spiking Neural Networks, Physics-informed Neural Operator, Computational Mechanics, Energy Efficiency, Sparse Computation, Partial Differential Equations, Edge Computing, Neuroscience-inspired

291. ❌ TrustFed: Enabling Trustworthy Medical AI under Data Privacy Constraints

Authors: Vagish Kumar, Syed Bahauddin Alam, Souvik Chakraborty Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21656v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

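Scoring rationale: The paper focuses on TrustFed, a federated learning framework for medical AI that addresses uncertainty quantification under medical data privacy constraints. All keywords relate directly to large-model principles, training methods, inference optimization, alignment techniques, and so on, whereas this paper involves no large-model techniques and concerns only conventional machine learning applied to medicine. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper is a medical AI application. It is not an application of large models to science, so it receives 5 points (some relevance). The other keywords are entirely unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes TrustFed, a federated learning framework that addresses the loss of predictive reliability caused by multi-institution data heterogeneity and class imbalance under medical data privacy constraints. Using representation-aware client assignment and a soft-nearest threshold aggregation strategy, it achieves uncertainty quantification with statistical guarantees across six clinical imaging modalities.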

Abstract (Chinese translation)

保护患者隐私仍是医疗健康机构间规模化应用机器学习的基本障碍，由于伦理、法律和监管限制，集中敏感数据往往不可行。联邦学习通过在不共享原始患者数据的前提下实现隐私保护的多机构协同训练，提供了一种前景广阔的替代方案；然而，实际部署面临数据异质性、机构特异性偏差和类别不平衡等严峻挑战，这些问题会降低预测可靠性，并使现有的不确定性量化方法失效。本文提出TrustFed——一种联邦不确定性量化框架，该框架在无需集中访问数据的前提下，针对异构且不平衡的医疗数据提供无分布、有限样本的覆盖保证。TrustFed引入了表征感知的客户端分配机制，利用模型内部表征实现跨机构有效校准，同时采用软最近邻阈值聚合策略，在生成紧凑可靠预测集的同时缓解分配不确定性。基于六种临床差异显著的影像模态（imaging modalities）中超过43万张医学图像，我们开展了医学影像领域最全面的不确定性感知联邦学习评估之一，证明了该方法在具有不同类别基数与不平衡状态的数据集上均能提供稳健的覆盖保证。通过在此规模和广度上验证TrustFed，本研究将不确定性感知联邦学习从概念验证推进至具有临床意义、模态无关的部署阶段，并将统计保证的不确定性确立为下一代医疗人工智能系统的核心要求。

Abstract

Protecting patient privacy remains a fundamental barrier to scaling machine learning across healthcare institutions, where centralizing sensitive data is often infeasible due to ethical, legal, and regulatory constraints. Federated learning offers a promising alternative by enabling privacy-preserving, multi-institutional training without sharing raw patient data; however, real-world deployments face severe challenges from data heterogeneity, site-specific biases, and class imbalance, which degrade predictive reliability and render existing uncertainty quantification methods ineffective. Here, we present TrustFed, a federated uncertainty quantification framework that provides distribution-free, finite-sample coverage guarantees under heterogeneous and imbalanced healthcare data, without requiring centralized access. TrustFed introduces a representation-aware client assignment mechanism that leverages internal model representations to enable effective calibration across institutions, along with a soft-nearest threshold aggregation strategy that mitigates assignment uncertainty while producing compact and reliable prediction sets. Using over 430,000 medical images across six clinically distinct imaging modalities, we conduct one of the most comprehensive evaluations of uncertainty-aware federated learning in medical imaging, demonstrating robust coverage guarantees across datasets with diverse class cardinalities and imbalance regimes. By validating TrustFed at this scale and breadth, our study advances uncertainty-aware federated learning from proof-of-concept toward clinically meaningful, modality-agnostic deployment, positioning statistically guaranteed uncertainty as a core requirement for next-generation healthcare AI systems.

Keywords: Federated Learning, Medical AI, Uncertainty Quantification, Data Privacy, Healthcare, Medical Imaging, Heterogeneous Data, Class Imbalance
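
The "distribution-free, finite-sample coverage guarantees" mentioned in the abstract above are the kind of guarantee that split conformal prediction provides. The sketch below is plain, centralized split conformal prediction on synthetic data, not TrustFed's federated procedure; the function names, the `1 - p` nonconformity score, and the simulated classifier are all assumptions made for illustration.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    # Rank of the order statistic used by split conformal prediction.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration points: every class must be included
    return np.sort(cal_scores)[k - 1]

def prediction_sets(probs, threshold):
    """Keep every class whose nonconformity score (1 - p) is within the threshold."""
    return [np.flatnonzero(1.0 - p <= threshold) for p in probs]

# Synthetic stand-in for a trained classifier (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n_classes = 5

def simulate(n):
    labels = rng.integers(0, n_classes, size=n)
    logits = rng.normal(size=(n, n_classes))
    logits[np.arange(n), labels] += 2.0  # make the true class more likely
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs, labels

cal_probs, cal_labels = simulate(2000)
test_probs, test_labels = simulate(2000)

# Calibrate on held-out data, then build prediction sets for the test points.
cal_scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
tau = conformal_threshold(cal_scores, alpha=0.1)
sets = prediction_sets(test_probs, tau)
coverage = np.mean([y in s for y, s in zip(test_labels, sets)])
print(f"empirical coverage: {coverage:.3f}")
```

With exchangeable calibration and test data, the true label falls inside the prediction set with probability at least `1 - alpha`. The abstract's point is that heterogeneous, imbalanced sites break this exchangeability, which is what motivates a federated calibration scheme.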

292. ❌ MISApp: Multi-Hop Intent-Aware Session Graph Learning for Next App Prediction

Authors: Yunchi Yang, Longlong Li, Jianliang Wu, Cunquan Qu Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21653v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

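Scoring rationale: MISApp targets next mobile app prediction using graph neural networks (GNNs) and session graph learning; its core is modeling higher-order structural dependencies and intent evolution in users' app usage sequences. The scoring keywords all concern large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization), LLM reasoning techniques (such as CoT and MCTS), LLM alignment and optimization, or specific AI-for-science applications (such as bioinformatics). The paper neither uses nor mentions any large-model technique, and it does not touch bioinformatics or cheminformatics; it applies conventional machine learning and graph learning to behavior prediction. It is therefore unrelated to every keyword and scores 0 on all of them.

!!! tip deepseek-chat TL;DR

The paper proposes MISApp, a profile-free framework based on multi-hop session graph learning that predicts the next mobile app a user will launch; experiments show it outperforms existing baselines under both standard and cold-start settings.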

Abstract (Chinese translation)

预测用户即将启动的移动应用对于实现主动式移动服务至关重要。然而，在实际场景中，准确预测仍面临挑战：用户意图可能在短会话内快速切换，且用户特定的历史画像通常稀疏或不可用，尤其在冷启动条件下更为突出。现有方法主要将应用使用行为建模为序列行为或局部会话转移，限制了其捕捉高阶结构依赖与会话意图动态演变的能力。为解决这一问题，我们提出MISApp，一种基于多跳会话图学习的免画像下一应用预测框架。MISApp通过构建多跳会话图以捕捉不同结构范围内的转移依赖，通过轻量级图传播学习会话表征，融合时空上下文以刻画会话情境，并从近期交互中捕捉意图演变。在两个真实世界应用使用数据集上的实验表明，MISApp在标准与冷启动设置下均持续优于现有基线方法，同时在预测准确性与实际效率之间保持了良好平衡。进一步分析表明，学习得到的跳级注意力权重与结构相关性高度吻合，为所提出的多跳建模策略的有效性提供了可解释的证据。

Abstract

Predicting the next mobile app a user will launch is essential for proactive mobile services. Yet accurate prediction remains challenging in real-world settings, where user intent can shift rapidly within short sessions and user-specific historical profiles are often sparse or unavailable, especially under cold-start conditions. Existing approaches mainly model app usage as sequential behavior or local session transitions, limiting their ability to capture higher-order structural dependencies and evolving session intent. To address this issue, we propose MISApp, a profile-free framework for next app prediction based on multi-hop session graph learning. MISApp constructs multi-hop session graphs to capture transition dependencies at different structural ranges, learns session representations through lightweight graph propagation, incorporates temporal and spatial context to characterize session conditions, and captures intent evolution from recent interactions. Experiments on two real-world app usage datasets show that MISApp consistently outperforms competitive baselines under both standard and cold-start settings, while maintaining a favorable balance between predictive accuracy and practical efficiency. Further analyses show that the learned hop-level attention weights align well with structural relevance, offering interpretable evidence for the effectiveness of the proposed multi-hop modeling strategy.

Keywords: next app prediction, session graph learning, multi-hop modeling, intent evolution, cold-start setting, graph neural networks, user behavior prediction, mobile services
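
The abstract above says MISApp "constructs multi-hop session graphs to capture transition dependencies at different structural ranges" but does not give the construction. One plausible reading, sketched here purely as an assumption, is that the hop-`h` graph links each app in a session to the app launched `h` steps later; the function name and the toy session are hypothetical.

```python
from collections import Counter

def multi_hop_edges(session, max_hop):
    """Return {hop: Counter of (source, target) app pairs} for hops 1..max_hop."""
    graphs = {}
    for hop in range(1, max_hop + 1):
        # Pair every app with the app launched `hop` steps later in the session.
        graphs[hop] = Counter(zip(session, session[hop:]))
    return graphs

# Toy session of app launches (hypothetical data).
session = ["mail", "maps", "chat", "maps", "music"]
graphs = multi_hop_edges(session, max_hop=2)

for hop, edges in graphs.items():
    print(hop, dict(edges))
```

Hop 1 recovers the usual local transition graph, while larger hops expose longer-range co-usage that a purely sequential model would only see indirectly.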

293. ❌ Engineering Distributed Governance for Regional Prosperity: A Socio-Technical Framework for Mitigating Under-Vibrancy via Human Data Engines

Authors: Amil Khanzada, Takuji Takemoto Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21639v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	2.0/10	0.0

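Scoring rationale: The paper studies regional economic governance, using an AI-driven decision support system to analyze data and propose a governance framework; it is an application of AI to the social sciences. The keywords all relate directly to large models, deep learning principles, or specific AI subfields (such as MoE, RLHF, and RAG), whereas the paper only mentions a generic "AI-driven decision support system" and covers no specific large-model technique, architecture, training method, or frontier research direction. It is only weakly related to "AI for Science": applying AI to regional economic analysis can be read as a "science" application in a broad sense, but not as core bioinformatics or cheminformatics, hence 2 points. The remaining keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

Targeting the "under-vibrancy" problem facing depopulating regions, the paper proposes a Distributed Human Data Engine framework, uses an AI-driven decision support system to analyze spending and sentiment data from Fukui Prefecture, Japan, quantifies an opportunity gap of 865,917 unrealized visits per year, and designs a dual-nudge governance architecture to optimize cross-regional economic flows.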

Abstract (Chinese translation)

当前城市信息学与旅游研究多聚焦于缓解全球高密度城市的过度旅游问题。然而，对于面临人口衰退与结构停滞的地区，其主要风险在于“活力不足”——即游客密度过低抑制经济活动并降低满意度的状态。本文引入分布式人类数据引擎（Distributed Human Data Engine, DHDE），这一先前在生物危机管理中已验证的社会技术框架，并将其调整应用于区域经济流动优化。通过使用日本游客最少县（福井县）的高粒度数据，我们利用人工智能驱动的决策支持系统（Decision Support System, DSS）分析两个数据集：原始福井消费数据库（90,350条记录）与区域标准化情感数据库（97,719条回应）。该系统实现了81%的样本内解释力（R^2 = 0.810）与68%的样本外预测性能（R^2 = 0.683）。我们量化出每年865,917次未实现的访问机会缺口，相当于约119.6亿日元（7,620万美元）的收入损失。我们提出一种双助推治理架构，利用DHDE重新分配跨县流动并减少经济漏损。

Abstract

Most research in urban informatics and tourism focuses on mitigating overtourism in dense global cities. However, for regions experiencing demographic decline and structural stagnation, the primary risk is “under-vibrancy”, a condition where low visitor density suppresses economic activity and diminishes satisfaction. This paper introduces the Distributed Human Data Engine (DHDE), a socio-technical framework previously validated in biological crisis management, and adapts it for regional economic flow optimization. Using high-granularity data from Japan’s least-visited prefecture (Fukui), we utilize an AI-driven decision support system (DSS) to analyze two datasets: a raw Fukui spending database (90,350 records) and a regional standardized sentiment database (97,719 responses). The system achieves in-sample explanatory power of 81% (R^2 = 0.810) and out-of-sample predictive performance of 68% (R^2 = 0.683). We quantify an annual opportunity gap of 865,917 unrealized visits, equivalent to approximately 11.96 billion yen (USD 76.2 million) in lost revenue. We propose a dual-nudge governance architecture leveraging the DHDE to redistribute cross-prefectural flows and reduce economic leakage.

Keywords: under-vibrancy, distributed human data engine, AI-driven decision support system, regional economic flow optimization, dual-nudge governance, economic leakage, Fukui prefecture, opportunity gap

294. ❌ Proximal Policy Optimization in Path Space: A Schrödinger Bridge Perspective

Authors: Yuehu Gong, Zeyuan Wang, Yulin Chen, Yanwei Fu Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21621v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

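Scoring rationale: The paper focuses on generative policy optimization in reinforcement learning, proposing a path-space PPO method based on the Generalized Schrödinger Bridge. Although it involves generative models (diffusion and flow models), its core contribution is an improved reinforcement learning algorithm, not large-model technology or an application of AI to science. The keywords target large language models, model training techniques, reasoning methods, and AI applications, whereas this paper studies policy optimization in reinforcement learning and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes GSB-PPO, a path-space PPO method based on the Generalized Schrödinger Bridge for reinforcement learning with generative policies; experiments show that the penalty-based objective is more stable and performs better than the clipping-based one.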

Abstract (Chinese translation)

基于生成策略的同策略强化学习前景广阔但尚未得到充分探索。一个核心挑战在于，近端策略优化（PPO）传统上以动作空间概率比的形式表述，而基于扩散和流的策略更自然地表现为轨迹层面的生成过程。本文提出GSB-PPO——一种受广义薛定谔桥（Generalized Schrödinger Bridge，GSB）启发的生成式PPO路径空间表述框架。该框架将PPO风格的近端更新从终端动作提升至完整生成轨迹，为生成策略的同策略优化提供了统一视角。在此框架内，我们开发了两个具体目标函数：基于截断的目标GSB-PPO-Clip和基于惩罚的目标GSB-PPO-Penalty。实验结果表明，虽然两种目标函数均适用于同策略训练，但惩罚式表述始终比截断式具有更好的稳定性和性能。总体而言，我们的研究结果揭示了路径空间近端正则化作为使用PPO训练生成策略的有效原则。

Abstract

On-policy reinforcement learning with generative policies is promising but remains underexplored. A central challenge is that proximal policy optimization (PPO) is traditionally formulated in terms of action-space probability ratios, whereas diffusion- and flow-based policies are more naturally represented as trajectory-level generative processes. In this work, we propose GSB-PPO, a path-space formulation of generative PPO inspired by the Generalized Schrödinger Bridge (GSB). Our framework lifts PPO-style proximal updates from terminal actions to full generation trajectories, yielding a unified view of on-policy optimization for generative policies. Within this framework, we develop two concrete objectives: a clipping-based objective, GSB-PPO-Clip, and a penalty-based objective, GSB-PPO-Penalty. Experimental results show that while both objectives are compatible with on-policy training, the penalty formulation consistently delivers better stability and performance than the clipping counterpart. Overall, our results highlight path-space proximal regularization as an effective principle for training generative policies with PPO.

Keywords: Proximal Policy Optimization, PPO, Generative Policies, Schrödinger Bridge, Path-space Optimization, Reinforcement Learning, Trajectory Generation, On-policy Training
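
For orientation, the abstract above contrasts a clipping-based and a penalty-based proximal objective. The sketch below shows the two standard action-space PPO surrogates that the paper lifts to path space; it is not GSB-PPO itself, and the function names, the default `eps` and `beta`, and the sample-based KL estimate are assumptions chosen for illustration.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate: the probability ratio is limited to [1 - eps, 1 + eps]."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return np.mean(np.minimum(ratio * adv, clipped * adv))

def ppo_penalty_objective(logp_new, logp_old, adv, beta=1.0):
    """Penalized surrogate: unclipped ratio minus a KL(old || new) penalty."""
    ratio = np.exp(logp_new - logp_old)
    # Monte Carlo estimate of KL(old || new) from actions sampled under the old policy.
    kl_estimate = np.mean(logp_old - logp_new)
    return np.mean(ratio * adv) - beta * kl_estimate

# Toy batch (hypothetical numbers): log-probabilities of sampled actions and advantages.
logp_old = np.log(np.array([0.5, 0.25, 0.25]))
logp_new = np.log(np.array([0.6, 0.30, 0.10]))
adv = np.array([1.0, -0.5, 2.0])

print("clip objective:   ", ppo_clip_objective(logp_new, logp_old, adv))
print("penalty objective:", ppo_penalty_objective(logp_new, logp_old, adv))
```

Both objectives are maximized. The clipped form hard-limits how far each sample's ratio can contribute, while the penalized form lets the ratio move freely and charges for the average divergence, which is the distinction the paper's Clip and Penalty variants carry over to full generation trajectories.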

295. ❌ Rateless DeepJSCC for Broadcast Channels: a Rate-Distortion-Complexity Tradeoff

Authors: Zijun Qin, Jingxuan Huang, Zesong Fei, Haichuan Ding, Yulin Shao, Xianhao Chen Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21616v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

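Scoring rationale: The paper studies deep learning-based joint source-channel coding (DeepJSCC) for broadcast channels at the wireless edge. It proposes NTRSCC, a variable-length JSCC framework based on rateless codes, and focuses on optimizing the tradeoff between distortion, rate, and decoding complexity in image transmission. The scoring keywords all concern large language models (LLMs) and related techniques (training methods, inference optimization, applications, and so on), whereas this paper is a deep learning application in communications, at the intersection of wireless communication and deep learning, with no direct connection to LLM technology. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes nonlinear transform rateless source-channel coding (NTRSCC) for broadcast channels. By integrating learned source transformations with physical-layer LT codes, it gives heterogeneous receivers a controllable tradeoff between distortion, transmission rate, and decoding complexity and improves image broadcast quality.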

Abstract (Chinese translation)

近年来，无线边缘涌现出大量数据密集型广播应用，亟需在失真度、传输速率与处理复杂度之间实现灵活权衡。尽管基于深度学习的联合信源信道编码（DeepJSCC）已被视为数据密集型通信的潜在解决方案，但现有方案大多局限于最坏情况设计，缺乏自适应复杂度调节能力，且在广播场景下效率低下。为突破这些限制，本文提出非线性变换无速率信源信道编码（NTRSCC），一种基于无速率码的广播信道可变长度JSCC框架。具体而言，我们将学习型信源变换与物理层LT码相结合，开发了利用解码器边信息的不等保护方案，并设计了近似方法以实现无速率参数的端到端优化。该框架使得异构接收机能够在置信传播中自适应调整接收的无速率符号数量与解码迭代次数，从而实现对失真度、速率和解码复杂度的可控权衡。仿真结果表明，在异构边缘设备面临严格通信与处理资源约束的条件下，所提方法显著提升了图像广播质量。

Abstract

In recent years, numerous data-intensive broadcasting applications have emerged at the wireless edge, calling for a flexible tradeoff between distortion, transmission rate, and processing complexity. While deep learning-based joint source-channel coding (DeepJSCC) has been identified as a potential solution to data-intensive communications, most of these schemes are confined to worst-case solutions, lack adaptive complexity, and are inefficient in broadcast settings. To overcome these limitations, this paper introduces nonlinear transform rateless source-channel coding (NTRSCC), a variable-length JSCC framework for broadcast channels based on rateless codes. In particular, we integrate learned source transformations with physical-layer LT codes, develop unequal protection schemes that exploit decoder side information, and devise approximations to enable end-to-end optimization of rateless parameters. Our framework enables heterogeneous receivers to adaptively adjust their received number of rateless symbols and decoding iterations in belief propagation, thereby achieving a controllable tradeoff between distortion, rate, and decoding complexity. Simulation results demonstrate that the proposed method enhances image broadcast quality under stringent communication and processing budgets over heterogeneous edge devices.

Keywords: DeepJSCC, rateless coding, broadcast channels, rate-distortion-complexity tradeoff, LT codes, unequal protection, end-to-end optimization, heterogeneous receivers

296. ❌ Towards Multimodal Time Series Anomaly Detection with Semantic Alignment and Condensed Interaction

Authors: Shiyan Hu, Jianxin Jin, Yang Shu, Peng Chen, Bin Yang, Chenjuan Guo Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21612v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

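Scoring rationale: The paper focuses on time series anomaly detection and proposes a multimodal method, MindTS, that combines time series data with text, using fine-grained time-text semantic alignment and content condenser reconstruction to handle cross-modal alignment and redundant-information filtering. Although it involves multimodal learning and alignment techniques, the paper mentions no large language model (LLM), no innovation in deep learning principles, and no application of AI to science, so every keyword is unrelated to its content and scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes MindTS, a multimodal time series anomaly detection model that uses fine-grained time-text semantic alignment and content condenser reconstruction to address cross-modal alignment and redundant-information filtering; on six real-world datasets it achieves competitive or superior results compared with existing methods.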

Abstract (Chinese translation)

时间序列异常检测在许多动态系统中发挥着关键作用。尽管其重要性显著，先前的方法主要依赖于单模态数值数据，忽视了来自其他模态的互补信息的重要性。本文提出了一种新颖的多模态时间序列异常检测模型（MindTS），该模型聚焦于解决两个关键挑战：（1）如何实现异构多模态数据间的语义一致性对齐，以及（2）如何过滤冗余模态信息以有效增强跨模态交互。针对第一个挑战，我们提出了细粒度时间-文本语义对齐方法。该方法通过跨视图文本融合和多模态对齐机制，整合外生与内生的文本信息，实现了时间与文本模态间的语义一致性对齐。对于第二个挑战，我们引入了内容浓缩器重构模块，该模块过滤对齐后文本模态内的冗余信息，并执行跨模态重构以实现交互。在六个真实世界多模态数据集上的大量实验表明，与现有方法相比，所提出的MindTS模型取得了具有竞争力或更优的结果。代码发布于：https://github.com/decisionintelligence/MindTS。

Abstract

Time series anomaly detection plays a critical role in many dynamic systems. Despite its importance, previous approaches have primarily relied on unimodal numerical data, overlooking the importance of complementary information from other modalities. In this paper, we propose a novel multimodal time series anomaly detection model (MindTS) that focuses on addressing two key challenges: (1) how to achieve semantically consistent alignment across heterogeneous multimodal data, and (2) how to filter out redundant modality information to enhance cross-modal interaction effectively. To address the first challenge, we propose Fine-grained Time-text Semantic Alignment. It integrates exogenous and endogenous text information through cross-view text fusion and a multimodal alignment mechanism, achieving semantically consistent alignment between time and text modalities. For the second challenge, we introduce Content Condenser Reconstruction, which filters redundant information within the aligned text modality and performs cross-modal reconstruction to enable interaction. Extensive experiments on six real-world multimodal datasets demonstrate that the proposed MindTS achieves competitive or superior results compared to existing methods. The code is available at: https://github.com/decisionintelligence/MindTS.

Keywords: multimodal time series, anomaly detection, semantic alignment, cross-modal interaction, text fusion, content condensation, reconstruction, heterogeneous data

297. ❌ In-network Attack Detection with Federated Deep Learning in IoT Networks: Real Implementation and Analysis

Authors: Devashish Chaudhary, Sutharshan Rajasegarar, Shiva Raj Pokhrel, Lei Pan, Ruby D Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21596v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

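Scoring rationale: The paper focuses on federated learning and lightweight anomaly detection for IoT network security. It does not involve large language models, innovation in deep learning principles, or scientific applications, and has no direct connection to any scoring keyword.

!!! tip deepseek-chat TL;DR

The paper proposes an IoT network attack detection framework based on federated learning and a lightweight autoencoder, validates it on a real testbed, and reports significantly reduced communication overhead with performance comparable to the centralized approach.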

Abstract (Chinese translation)

物联网（IoT）的快速扩展及其与骨干网络的融合加剧了安全漏洞的风险。传统的集中式异常检测方法需要将大量数据传输至中央服务器，存在隐私、可扩展性和延迟方面的局限性。本文提出了一种基于轻量级自编码器的异常检测框架，专为部署在资源受限的边缘设备而设计，能够在实现实时检测的同时最小化数据传输并保护隐私。该框架采用联邦学习在分布式设备间协同训练模型，其中本地训练在边缘节点进行，仅将模型权重在中央服务器聚合。研究开发了一个使用树莓派（Raspberry Pi）传感器节点的真实物联网测试平台，用于收集正常流量与攻击流量数据。在该测试平台上实现并评估的所提联邦异常检测系统，证明了其能够有效准确地识别网络攻击。在达到与集中式方法相当性能的同时，通信开销显著降低。

Abstract

The rapid expansion of the Internet of Things (IoT) and its integration with backbone networks have heightened the risk of security breaches. Traditional centralized approaches to anomaly detection, which require transferring large volumes of data to central servers, suffer from privacy, scalability, and latency limitations. This paper proposes a lightweight autoencoder-based anomaly detection framework designed for deployment on resource-constrained edge devices, enabling real-time detection while minimizing data transfer and preserving privacy. Federated learning is employed to train models collaboratively across distributed devices, where local training occurs on edge nodes and only model weights are aggregated at a central server. A real-world IoT testbed using Raspberry Pi sensor nodes was developed to collect normal and attack traffic data. The proposed federated anomaly detection system, implemented and evaluated on the testbed, demonstrates its effectiveness in accurately identifying network attacks. The communication overhead was reduced significantly while achieving comparable performance to the centralized method.

Keywords: IoT networks, anomaly detection, federated learning, autoencoder, edge devices, network attacks, real-time detection, privacy preservation
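
The abstract above states that local training happens on edge nodes and "only model weights are aggregated at a central server", without naming the aggregation rule. A common choice is sample-size-weighted averaging (FedAvg); the sketch below assumes that rule, and the function name and toy weights are hypothetical rather than taken from the paper.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Average per-layer weights across clients, weighted by local sample count."""
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(weights[layer] * (size / total)
            for weights, size in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Two edge nodes, each holding a tiny two-layer model (hypothetical numbers).
node_a = [np.array([[0.0, 0.0]]), np.array([1.0])]
node_b = [np.array([[4.0, 8.0]]), np.array([5.0])]

# Node B trained on three times as many samples, so it counts three times as much.
global_weights = federated_average([node_a, node_b], client_sizes=[100, 300])
print(global_weights)
```

Only the weight arrays cross the network in this scheme, which is why the approach can cut communication relative to shipping raw traffic captures to a central server.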

298. ❌ Feature Incremental Clustering with Generalization Bounds

Authors: Jing Zhang, Chenping Hou Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21590v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

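Scoring rationale: The paper studies feature incremental clustering algorithms and their generalization bounds, which belongs to conventional machine learning clustering; it does not involve the large-model, deep learning, or AI for Science techniques and applications that the keywords cover. The keywords all relate directly to large-model principles, training methods, inference optimization, alignment, and applications, whereas this paper extends the classical k-means algorithm to the feature-incremental setting and analyzes it theoretically, so there is no connection.

!!! tip deepseek-chat TL;DR

Addressing clustering in the feature-incremental setting, the paper proposes four feature incremental clustering algorithms, analyzes their generalization error bounds, and validates their effectiveness on activity recognition clustering tasks.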

Abstract (Chinese translation)

在许多学习系统中，例如活动识别系统，随着新的数据采集方法在各种动态环境应用中不断涌现，实例的属性会逐渐累积增加，数据被存储在持续扩展的特征空间中。如何设计具有理论保证的算法来有效聚类这种特殊类型的数据流（通常称为活动识别），目前仍未得到充分探索。与传统场景相比，在这种特征增量场景中，我们至少面临两个基本问题：（i）如何设计初步且有效的算法来解决特征增量聚类问题？（ii）如何分析所提出算法的泛化界，以及在何种条件下这些算法能提供强有力的泛化保证？为解决这些问题，我们以最常用的聚类算法——$k$-means为例，针对数据访问的不同情况，提出了四种类型的特征增量聚类（Feature Incremental Clustering, FIC）算法，分别对应：特征裁剪（Feature Tailoring, FT）、数据重构（Data Reconstruction, DR）、数据适应（Data Adaptation, DA）和模型复用（Model Reuse, MR），简称为FIC-FT、FIC-DR、FIC-DA和FIC-MR。随后，我们对这四种算法的泛化误差界进行了详细分析，并强调了影响这些界的关键因素，例如训练数据量、假设空间的复杂性、预训练模型的质量以及重构特征分布的差异。数值实验验证了所提出算法的有效性，特别是在活动识别聚类任务中的应用。

Abstract

In many learning systems, such as activity recognition systems, as new data collection methods continue to emerge in various dynamic environmental applications, the attributes of instances accumulate incrementally, with data being stored in gradually expanding feature spaces. How to design theoretically guaranteed algorithms to effectively cluster this special type of data stream, commonly referred to as activity recognition, remains unexplored. Compared to traditional scenarios, we will face at least two fundamental questions in this feature incremental scenario. (i) How to design preliminary and effective algorithms to address the feature incremental clustering problem? (ii) How to analyze the generalization bounds for the proposed algorithms and under what conditions do these algorithms provide a strong generalization guarantee? To address these problems, by tailoring the most common clustering algorithm, i.e., $k$-means, as an example, we propose four types of Feature Incremental Clustering (FIC) algorithms corresponding to different situations of data access: Feature Tailoring (FT), Data Reconstruction (DR), Data Adaptation (DA), and Model Reuse (MR), abbreviated as FIC-FT, FIC-DR, FIC-DA, and FIC-MR. Subsequently, we offer a detailed analysis of the generalization error bounds for these four algorithms and highlight the critical factors influencing these bounds, such as the amounts of training data, the complexity of the hypothesis space, the quality of pre-trained models, and the discrepancy of the reconstruction feature distribution. The numerical experiments show the effectiveness of the proposed algorithms, particularly in their application to activity recognition clustering tasks.

Keywords: Feature Incremental Clustering, Generalization Bounds, k-means, Activity Recognition, Data Stream, Feature Space Expansion, Clustering Algorithms, Theoretical Guarantees

299. ❌ SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models

Authors: Md Kaykobad Reza, Ameya Patil, Edward Ayrapetian, M. Salman Asif Journal/Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21584v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

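Scoring rationale: SSAM addresses the merging of multimodal large language models (MLLMs); its core contribution is a training-free model merging framework. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), since MLLMs extend LLMs, and to 'Model Merging OR Model Soups OR Weight Averaging' (10 points), since model merging is the paper's central technique. The remaining keywords, covering MoE, SLMs, scaling laws, training methods (pre-training, SFT, RLHF, etc.), inference optimization (RAG, context window, KV cache), reasoning (CoT, System 2), agents (agents, tool use), efficiency techniques (quantization, speculative decoding), reliability (hallucination mitigation), interpretability, world models, and AI for science, are neither mentioned nor discussed in the title or abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes SSAM, a training-free model merging framework that unifies independently trained multimodal large language models into a single model able to handle any combination of input modalities, and reports state-of-the-art performance across four datasets.

Abstract (Chinese translation)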

多模态大语言模型（MLLMs）通过联合处理来自视觉、音频和语言等多种模态的输入，实现了强大的性能。然而，构建此类模型或将其扩展到新模态通常需要大量配对数据集和可观的计算资源。鉴于许多预训练的多模态大语言模型（例如视觉-语言或音频-语言模型）已公开可用，我们提出：能否将它们合并为一个能够处理多种模态的单一多模态大语言模型？合并具有不同输入模态的多模态大语言模型仍然具有挑战性，部分原因在于所学表征的差异以及其参数空间之间的相互干扰。为应对这些挑战，我们提出奇异子空间对齐与合并（Singular Subspace Alignment and Merging, SSAM），这是一种无需训练的模型合并框架，可将独立训练的专家多模态大语言模型统一为能够处理任意输入模态组合的单一模型。SSAM 分别维护各模态特定的参数更新，并识别出一个用于语言相关参数更新的共享低秩子空间，在该子空间内对齐这些更新，然后将其合并，以在最小化参数干扰的同时保留互补知识。在不使用任何多模态训练数据的情况下，SSAM 在四个数据集上实现了最先进的性能，超越了先前的免训练合并方法，甚至优于联合训练的多模态模型。这些结果表明，在参数空间中对齐模型为传统的联合多模态训练提供了一种可扩展且资源高效的替代方案。

Abstract

Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether we can merge them into a single MLLM that can handle multiple modalities? Merging MLLMs with different input modalities remains challenging, partly because of differences in the learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination of input modalities. SSAM maintains modality-specific parameter updates separately and identifies a shared low-rank subspace for language-related parameter updates, aligns them within this subspace, and merges them to preserve complementary knowledge while minimizing parameter interference. Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models. These results demonstrate that aligning models in parameter space provides a scalable and resource-efficient alternative to conventional joint multimodal training.

Keywords: Multimodal Large Language Models, Model Merging, Singular Subspace Alignment, Training-free Framework, Parameter Interference, Modality-specific Parameters, Low-rank Subspace, State-of-the-art Performance

300. ❌ Stability and Bifurcation Analysis of Nonlinear PDEs via Random Projection-based PINNs: A Krylov-Arnoldi Approach

Authors: Gianluca Fabiani, Michail E. Kavousanakis, Constantinos Siettos, Ioannis G. Kevrekidis Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21568v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

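Scoring rationale: The paper performs stability and bifurcation analysis of nonlinear partial differential equations (PDEs) using physics-informed random projection neural networks (PI-RPNNs) and a Krylov-Arnoldi method, which places it in computational mathematics and scientific computing. The keyword list targets large language models (LLMs), deep learning techniques, and AI applications in science, whereas the paper does not touch on LLMs, deep model training, alignment, reasoning, agent systems, or model optimization. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies neural networks to a scientific computing problem (PDEs), but that theme is not its core focus, so it receives 5 points (partially relevant). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a numerical framework, based on physics-informed random projection neural networks (PI-RPNNs) and a Krylov-Arnoldi method, for the stability and bifurcation analysis of nonlinear PDEs; it computes the eigenvalue spectrum reliably by sidestepping the numerical rank deficiency of the collocation matrix.

Abstract (Chinese translation)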

本文提出了一种用于非线性偏微分方程（PDE）稳定性与分岔分析的数值框架，其中解在由物理信息随机投影神经网络（Physics-Informed Random Projection Neural Networks, PI-RPNNs）所张成的函数空间中求解，并通过配置法进行离散化。这些网络为单隐藏层结构，其隐藏层权重随机采样并预先固定；仅优化线性输出层权重，从而将训练简化为一次最小二乘求解。这种线性输出结构使得我们能够直接且显式地构建控制稳态解线性稳定性的特征值问题。该问题呈现广义特征值形式，自然地将物理域内部动力学与边界条件施加的代数约束分离开来，无需额外训练成本，也不要求额外的偏微分方程求解。然而，随机投影配置矩阵本质上是数值秩亏的，导致直接的特征值计算不可靠，并使得真实特征值谱被虚假的近零模态污染。为克服这一局限，我们引入了一种在权重空间中直接操作的、无矩阵的移位逆Krylov-Arnoldi方法，避免了对数值秩亏配置矩阵的显式求逆，从而能够可靠地计算物理雅可比矩阵的若干主导特征对——该雅可比矩阵是偏微分方程算子关于解场的离散化弗雷歇导数，其特征值谱决定了线性稳定性。我们进一步证明，基于PI-RPNN的广义特征值问题几乎必然是正则的，这保证了标准特征求解器的可解性，并且对于解析激活函数，随机投影配置矩阵的奇异值呈指数衰减。

Abstract

We address a numerical framework for the stability and bifurcation analysis of nonlinear partial differential equations (PDEs) in which the solution is sought in the function space spanned by physics-informed random projection neural networks (PI-RPNNs), and discretized via a collocation approach. These are single-hidden-layer networks with randomly sampled and fixed a priori hidden-layer weights; only the linear output layer weights are optimized, reducing training to a single least-squares solve. This linear output structure enables the direct and explicit formulation of the eigenvalue problem governing the linear stability of stationary solutions. This takes a generalized eigenvalue form, which naturally separates the physical domain interior dynamics from the algebraic constraints imposed by boundary conditions, at no additional training cost and without requiring additional PDE solves. However, the random projection collocation matrix is inherently numerically rank-deficient, rendering naive eigenvalue computation unreliable and contaminating the true eigenvalue spectrum with spurious near-zero modes. To overcome this limitation, we introduce a matrix-free shift-invert Krylov-Arnoldi method that operates directly in weight space, avoiding explicit inversion of the numerically rank-deficient collocation matrix and enabling the reliable computation of several leading eigenpairs of the physical Jacobian (the discretized Fréchet derivative of the PDE operator with respect to the solution field), whose eigenvalue spectrum determines linear stability. We further prove that the PI-RPNN-based generalized eigenvalue problem is almost surely regular, guaranteeing solvability with standard eigensolvers, and that the singular values of the random projection collocation matrix decay exponentially for analytic activation functions.

Keywords: nonlinear PDEs, stability analysis, bifurcation analysis, physics-informed neural networks, random projection, Krylov-Arnoldi method, eigenvalue problem, numerical framework

301. ❌ Kolmogorov Complexity Bounds for LLM Steganography and a Perplexity-Based Detection Proxy

Author: Andrii Shportko Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21567v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

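Scoring rationale: The paper studies LLM steganography, in which LLMs rewrite text to embed hidden information while preserving surface meaning; this bears directly on potential LLM misuse and on alignment monitoring. It is therefore highly relevant to 'Large Language Models' (10 points). It is also relevant to 'Alignment' (8 points), because steganography could evade alignment monitoring. The other keywords, such as MoE, SLMs, scaling laws, and pre-training, are neither mentioned in nor related to the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the information-theoretic cost of LLM steganography, proves that any scheme that embeds a payload while preserving semantics must increase the text's Kolmogorov complexity, and proposes a perplexity-based detection proxy.

Abstract (Chinese translation)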

大语言模型能够在保持表层语义的同时改写文本以嵌入隐藏载荷，这一能力在协作的AI系统间开辟了隐蔽信道，并对对齐监控提出了挑战。我们研究了此类嵌入的信息论代价。我们的主要结论是：任何在保持载体文本 $M_1$ 语义负载的同时将载荷 $P$ 编码为隐写文本 $M_2$ 的隐写方案，都必须满足 $K(M_2) \geq K(M_1) + K(P) - O(\log n)$，其中 $K$ 表示柯尔莫哥洛夫复杂度（Kolmogorov complexity），$n$ 为组合消息长度。一个推论是，无论编码器如何巧妙地分布信号，任何非平凡载荷都会迫使隐写文本的复杂度严格增加。由于柯尔莫哥洛夫复杂度不可计算，我们探讨了是否可以通过实用代理指标来检测这一理论预测的复杂度增长。借鉴无损压缩与柯尔莫哥洛夫复杂度之间的经典对应关系，我们认为语言模型的困惑度（perplexity）在概率体系中扮演着类似角色，并提出将 Binoculars 困惑度比值分数作为此类代理指标之一。基于颜色编码的LLM隐写方案的初步实验支持了理论预测：对300个样本进行的配对$t$检验得到 $t = 5.11$，$p < 10^{-6}$。

Abstract

Large language models can rewrite text to embed hidden payloads while preserving surface-level meaning, a capability that opens covert channels between cooperating AI systems and poses challenges for alignment monitoring. We study the information-theoretic cost of such embedding. Our main result is that any steganographic scheme that preserves the semantic load of a covertext $M_1$ while encoding a payload $P$ into a stegotext $M_2$ must satisfy $K(M_2) \geq K(M_1) + K(P) - O(\log n)$, where $K$ denotes Kolmogorov complexity and $n$ is the combined message length. A corollary is that any non-trivial payload forces a strict complexity increase in the stegotext, regardless of how cleverly the encoder distributes the signal. Because Kolmogorov complexity is uncomputable, we ask whether practical proxies can detect this predicted increase. Drawing on the classical correspondence between lossless compression and Kolmogorov complexity, we argue that language-model perplexity occupies an analogous role in the probabilistic regime and propose the Binoculars perplexity-ratio score as one such proxy. Preliminary experiments with a color-based LLM steganographic scheme support the theoretical prediction: a paired $t$-test over 300 samples yields $t = 5.11$, $p < 10^{-6}$.

Keywords: Large Language Models, Steganography, Kolmogorov Complexity, Perplexity, Detection Proxy, Alignment Monitoring, Information Theory, LLM Security
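The abstract's reasoning (a payload forces a complexity increase, and a computable proxy may expose it) can be illustrated with a toy experiment. The sketch below is not the paper's method: it swaps in zlib-compressed length as the proxy instead of the Binoculars perplexity ratio, and the synonym-pair encoder, `PAIRS`, `make_pair`, and all parameters are hypothetical choices made for illustration.

```python
import math
import random
import statistics
import zlib

# Hypothetical synonym pairs: each slot hides one payload bit by choosing pair[bit].
PAIRS = [("big", "large"), ("quick", "rapid"), ("start", "begin"), ("end", "finish"),
         ("help", "assist"), ("show", "display"), ("use", "employ"), ("buy", "purchase")]

def compressed_bits(text):
    """Compressed length in bits: a crude, computable stand-in for K(text)."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def make_pair(rng, slots=64):
    """Return (covertext, stegotext) that share word slots but differ in payload."""
    order = [rng.randrange(len(PAIRS)) for _ in range(slots)]
    payload = [rng.randrange(2) for _ in range(slots)]
    cover = " ".join(PAIRS[i][0] for i in order)            # always the first synonym
    stego = " ".join(PAIRS[i][bit] for i, bit in zip(order, payload))
    return cover, stego

def paired_t(before, after):
    """Paired t statistic for the per-sample differences after - before."""
    diffs = [a - b for a, b in zip(after, before)]
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

rng = random.Random(0)
samples = [make_pair(rng) for _ in range(300)]
cover_bits = [compressed_bits(cover) for cover, _ in samples]
stego_bits = [compressed_bits(stego) for _, stego in samples]

t_stat = paired_t(cover_bits, stego_bits)
mean_increase = statistics.fmean(s - c for c, s in zip(cover_bits, stego_bits))
print(f"mean increase: {mean_increase:.1f} bits, paired t: {t_stat:.2f}")
```

Compressed length only upper-bounds Kolmogorov complexity up to a constant, so a positive shift here is suggestive rather than a verification of the bound.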

302. ❌ What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators

Author: Xinyu Zhang Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21546v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

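Scoring rationale: The paper studies the internal representations of world models in reinforcement learning, applying interpretability techniques (linear probing, causal interventions, attention analysis) to two world-model architectures, IRIS and DIAMOND. It is highly relevant to 'World Models AND General World Models' (10 points), the core topic of the study, and to 'Mechanistic Interpretability OR Explainable AI' (10 points), because it applies several interpretability techniques to understand the models' internal representations. The other keywords (LLMs, MoE, RLHF, RAG, etc.) are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the internal representations of world models in reinforcement learning and, using interpretability techniques, finds that two architecturally distinct world models both develop structured, approximately linear internal representations of environment state.

Abstract (Chinese translation)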

世界模型通过经验学习环境动态模拟，从而实现样本高效的强化学习。但这些模型在内部究竟表征了什么？我们将可解释性技术——包括线性和非线性探针、因果干预及注意力分析——应用于两种架构迥异的世界模型：在Atari Breakout和Pong游戏上训练的IRIS（离散令牌变换器）和DIAMOND（连续扩散UNet）。通过线性探针分析，我们发现两种模型均形成了游戏状态变量（物体位置、得分）的线性可解码表征，多层感知机探针仅产生略微更高的R^2值，证实这些表征近似线性。因果干预——沿探针推导方向偏移隐藏状态——在模型预测中产生相关变化，证明这些表征具有功能性用途而非仅存在相关性。对IRIS注意力头的分析揭示了空间特化现象：特定注意力头优先关注与游戏物体重叠的令牌。多基线令牌消融实验一致表明，包含物体的令牌具有超比例的重要性。我们的研究为可解释性提供了证据：经过学习的世界模型在两种游戏和两种架构中形成了结构化、近似线性的环境状态内部表征。

Abstract

World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques (including linear and nonlinear probing, causal interventions, and attention analysis) to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher $R^2$, confirming that these representations are approximately linear. Causal interventions (shifting hidden states along probe-derived directions) produce correlated changes in model predictions, providing evidence that representations are functionally used rather than merely correlated. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Multi-baseline token ablation experiments consistently identify object-containing tokens as disproportionately important. Our findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.

Keywords: World Models, Interpretability, Linear Probing, Causal Interventions, Attention Analysis, Reinforcement Learning, Internal Representations, Environment Simulators
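The two probing steps named in the abstract, fitting a linear probe and shifting a hidden state along the probe direction, can be sketched on synthetic data. Everything below is an assumption for illustration: the hidden states are random Gaussian vectors rather than IRIS or DIAMOND activations, the probe is fitted by plain gradient descent, and the intervention only checks the probe's own readout, whereas the paper measures changes in the world model's predictions.

```python
import math
import random

def readout(h, w, b):
    """Linear probe output for one hidden-state vector h."""
    return sum(wj * hj for wj, hj in zip(w, h)) + b

def fit_linear_probe(H, y, lr=0.1, epochs=500):
    """Least-squares probe fitted by batch gradient descent on mean squared error."""
    n, d = len(H), len(H[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [readout(h, w, b) - t for h, t in zip(H, y)]
        grad_w = [2.0 / n * sum(e * h[j] for e, h in zip(errs, H)) for j in range(d)]
        grad_b = 2.0 / n * sum(errs)
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def r_squared(H, y, w, b):
    """Coefficient of determination of the probe on (H, y)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((t - readout(h, w, b)) ** 2 for h, t in zip(H, y))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def intervene(h, w, alpha):
    """Shift h by alpha along the unit-norm probe direction."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return [hj + alpha * wj / norm for hj, wj in zip(h, w)]

# Synthetic stand-in: a state variable (e.g. a paddle position) that is linear in h.
rng = random.Random(0)
true_w, true_b = [0.5, -1.0, 0.0, 2.0], 0.3
H = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
y = [readout(h, true_w, true_b) + rng.gauss(0.0, 0.05) for h in H]

w, b = fit_linear_probe(H[:150], y[:150])        # fit on 150 samples
r2_heldout = r_squared(H[150:], y[150:], w, b)   # evaluate on the remaining 50

alpha = 1.5
shifted = intervene(H[0], w, alpha)
readout_shift = readout(shifted, w, b) - readout(H[0], w, b)
w_norm = math.sqrt(sum(wj * wj for wj in w))
print(f"held-out R^2: {r2_heldout:.4f}, readout shift: {readout_shift:.3f}")
```

A high held-out $R^2$ shows linear decodability only; the causal claim in the abstract rests on downstream prediction changes, which this toy does not model.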

303. ❌ Sharper Generalization Bounds for Transformer

Authors: Yawen Li, Tao Hu, Zhouhui Lian, Wan Tian, Yijie Peng, Huiming Zhang, Zhongyi Li Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21541v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

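Scoring rationale: The paper studies generalization error bounds for Transformer models, which is theoretical analysis of deep learning. Every keyword targets a specific large-model technique, application, or optimization direction, while the paper is confined to generalization theory and involves none of them, so all keywords receive a relevance of 0.

!!! tip deepseek-chat TL;DR

Based on the offset Rademacher complexity, the paper derives sharper generalization error bounds for several Transformer architectures and extends the results to unbounded features and heavy-tailed distributions.

Abstract (Chinese translation)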

本文研究了Transformer模型的泛化误差界。基于偏移Rademacher复杂度，我们为不同Transformer架构推导了更尖锐的泛化界，包括单层单头、单层多头以及多层Transformer。我们首先用偏移Rademacher复杂度表示Transformer的过剩风险。通过利用其与相应假设空间的经验覆盖数之间的联系，我们获得了在常数因子范围内达到最优收敛速率的过剩风险界。随后，我们通过使用矩阵秩和矩阵范数对Transformer假设空间的覆盖数进行上界估计，推导出更精细的过剩风险界，从而得到精确的、与架构相关的泛化界。最后，我们放宽了对特征映射的有界性假设，将理论结果扩展到无界（亚高斯）特征和重尾分布的场景。

Abstract

This paper studies generalization error bounds for Transformer models. Based on the offset Rademacher complexity, we derive sharper generalization bounds for different Transformer architectures, including single-layer single-head, single-layer multi-head, and multi-layer Transformers. We first express the excess risk of Transformers in terms of the offset Rademacher complexity. By exploiting its connection with the empirical covering numbers of the corresponding hypothesis spaces, we obtain excess risk bounds that achieve optimal convergence rates up to constant factors. We then derive refined excess risk bounds by upper bounding the covering numbers of Transformer hypothesis spaces using matrix ranks and matrix norms, leading to precise, architecture-dependent generalization bounds. Finally, we relax the boundedness assumption on feature mappings and extend our theoretical results to settings with unbounded (sub-Gaussian) features and heavy-tailed distributions.

Keywords: Transformer, generalization error bounds, offset Rademacher complexity, excess risk, covering numbers, matrix ranks, sub-Gaussian features, heavy-tailed distributions
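For orientation, the quantity these bounds are built on can be written in the form commonly used in the learning-theory literature. The notation ($\mathcal{F}$ for the hypothesis class, $c$ for the offset parameter, $x_i$ for the sample, $\epsilon_i$ for the random signs) is an assumption here, and the paper's own normalization may differ.

```latex
\mathcal{R}^{\mathrm{off}}_n(\mathcal{F}; c)
  = \mathbb{E}_{\epsilon}\left[\, \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n}
      \left( \epsilon_i f(x_i) - c\, f(x_i)^2 \right) \right],
\qquad
\epsilon_i \overset{\text{iid}}{\sim} \mathrm{Unif}\{-1, +1\}, \quad c > 0 .
```

Setting $c = 0$ recovers the standard empirical Rademacher complexity. The negative quadratic term discounts functions with large values, which is the usual route to faster rates for excess risk.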

304. ❌ Generalization Limits of In-Context Operator Networks for Higher-Order Partial Differential Equations

Authors: Jamie Mahowald, Tan Bui-Thanh Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21534v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

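Scoring rationale: The paper studies In-Context Operator Networks (ICONs) applied to partial differential equations. It is highly relevant to 'In-context Learning' (10 points), because the network is explicitly built on in-context learning principles. As an AI application in scientific computing, it is relevant to 'AI for Science' (8 points). The paper mentions a 'foundation model', which links it to 'Large Language Models' (8 points), although it is about operator networks rather than language models. The other keywords, such as MoE, SFT, and RAG, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how well In-Context Operator Networks (ICONs), operator networks built on in-context learning principles, generalize to higher-order partial differential equations; it finds that point-wise accuracy degrades but the model remains qualitatively accurate in capturing solution dynamics and overall behavior.

Abstract (Chinese translation)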

本研究探讨了基于上下文学习原理构建的新型算子网络——上下文算子网络（In-Context Operator Networks, ICONs）——在求解高阶偏微分方程时的泛化能力。我们通过扩展基础模型所能处理的微分方程类型与范围，对先前研究进行了延伸。研究表明，尽管处理复杂输入需要引入新的计算方法，但其底层机器学习技术与较简单案例基本保持一致。实验结果表明，虽然对于热方程等高阶问题，逐点精度有所下降，但模型在捕捉解的动态特性与整体行为方面仍保持定性层面的准确性。这证明了该模型能够将基本解的特征外推至训练范围之外的问题中。

Abstract

We investigate the generalization capabilities of In-Context Operator Networks (ICONs), a new class of operator networks that build on the principles of in-context learning, for higher-order partial differential equations. We extend previous work by expanding the type and scope of differential equations handled by the foundation model. We demonstrate that while processing complex inputs requires some new computational methods, the underlying machine learning techniques are largely consistent with simpler cases. Our implementation shows that although point-wise accuracy degrades for higher-order problems like the heat equation, the model retains qualitative accuracy in capturing solution dynamics and overall behavior. This demonstrates the model’s ability to extrapolate fundamental solution characteristics to problems outside its training regime.

Keywords: In-Context Operator Networks, ICONs, higher-order partial differential equations, in-context learning, foundation model, generalization capabilities, operator networks, heat equation

305. ❌ BOxCrete: A Bayesian Optimization Open-Source AI Model for Concrete Strength Forecasting and Mix Optimization

Authors: Bayezid Baten, M. Ayyan Iqbal, Sebastian Ament, Julius Kusuma, Nishant Garg Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21525v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

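Scoring rationale: BOxCrete uses Gaussian process regression for concrete strength forecasting and mix optimization, an application of AI in materials science. The keyword list targets large models, deep learning techniques, and specific AI methods (MoE, RLHF, RAG, etc.), whereas the paper uses a classical machine learning method (Gaussian process regression) and involves no large-model or deep learning technique. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to science (materials engineering), but that theme is not its core focus, so it receives 5 points (partially relevant). The other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The study develops BOxCrete, an open-source probabilistic modeling and optimization framework that forecasts concrete strength and optimizes mix designs; on a new open-access dataset it reaches an average R² of 0.94 and supports multi-objective optimization.

Abstract (Chinese translation)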

现代混凝土必须同时满足力学性能、工作性、耐久性和可持续性方面不断演进的需求，这使得配合比设计日益复杂。近期利用人工智能（AI）与机器学习（ML）模型的研究在预测抗压强度与指导配合比优化方面展现出潜力，但现有工作大多基于专有工业数据集和闭源实现。本文介绍BOxCrete——一个开源的概率建模与优化框架，其训练基于一个全新的开放数据集，该数据集包含来自123种混合料（69种砂浆与54种混凝土混合料）在五个养护龄期（1、3、5、14和28天）测试所得的500余组强度测量值（1-15 ksi）。BOxCrete利用高斯过程（Gaussian Process, GP）回归来预测强度发展，平均R²达到0.94，均方根误差（RMSE）为0.69 ksi，并能够量化不确定性，同时执行对抗压强度和隐含碳的多目标优化。该数据集与模型为基于人工智能的优化配合比设计的数据驱动开发建立了可复现的开源基础。

Abstract

Modern concrete must simultaneously satisfy evolving demands for mechanical performance, workability, durability, and sustainability, making mix designs increasingly complex. Recent studies leveraging Artificial Intelligence (AI) and Machine Learning (ML) models show promise for predicting compressive strength and guiding mix optimization, but most existing efforts are based on proprietary industrial datasets and closed-source implementations. Here we introduce BOxCrete, an open-source probabilistic modeling and optimization framework trained on a new open-access dataset of over 500 strength measurements (1-15 ksi) from 123 mixtures (69 mortar and 54 concrete mixes) tested at five curing ages (1, 3, 5, 14, and 28 days). BOxCrete leverages Gaussian Process (GP) regression to predict strength development, achieving average $R^2 = 0.94$ and RMSE = 0.69 ksi, quantify uncertainty, and carry out multi-objective optimization of compressive strength and embodied carbon. The dataset and model establish a reproducible open-source foundation for data-driven development of AI-based optimized mix designs.

Keywords: concrete strength forecasting, mix optimization, Gaussian Process regression, open-source AI model, probabilistic modeling, multi-objective optimization, embodied carbon, data-driven development
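To make "GP regression with uncertainty quantification" concrete, here is a minimal one-dimensional Gaussian process sketch in pure Python. It is not BOxCrete's implementation: the strength values are made up for illustration, and the log-age input, the RBF kernel, and the hyperparameters (`length`, `var`, `noise`) are assumptions.

```python
import math

def rbf(a, b, length=1.0, var=9.0):
    """Squared-exponential kernel (hypothetical hyperparameters)."""
    return var * math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def back_sub_transposed(L, y):
    """Solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(xs, ys, x_star, noise=0.01):
    """Posterior mean and latent variance at x_star, with a constant prior mean."""
    n = len(xs)
    mean_y = sum(ys) / n
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    alpha = back_sub_transposed(L, forward_sub(L, [y - mean_y for y in ys]))
    k_star = [rbf(x, x_star) for x in xs]
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    v = forward_sub(L, k_star)
    return mean, rbf(x_star, x_star) - sum(t * t for t in v)

# Curing ages from the abstract; the strengths (ksi) are made up for illustration.
ages_days = [1, 3, 5, 14, 28]
strength_ksi = [1.5, 3.2, 4.0, 5.6, 6.5]
xs = [math.log(a) for a in ages_days]   # assumption: model strength against log-age

mean_day1, var_train = gp_posterior(xs, strength_ksi, math.log(1))
mean_day28, _ = gp_posterior(xs, strength_ksi, math.log(28))
_, var_far = gp_posterior(xs, strength_ksi, math.log(365))
print(f"day 28: {mean_day28:.2f} ksi; variance at day 1: {var_train:.4f}, at day 365: {var_far:.2f}")
```

The predictive variance is small at measured ages and grows far from the data, which is the kind of signal a Bayesian optimization loop can use to decide which mixes to test next.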

306. ❌ Optimizing Feature Extraction for On-device Model Inference with User Behavior Sequences

Authors: Chen Gong, Zhenzhe Zheng, Yiliu Chen, Sheng Wang, Fan Wu, Guihai Chen Source: arXiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21508v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于优化移动设备上的机器学习模型特征提取过程，以提高端到端推理延迟。仅与关键词’Small Language Models OR SLMs OR On-device AI’相关，因为论文明确研究on-device模型推理优化，但未涉及SLMs或大模型技术。其他关键词均与论文内容无关，论文未讨论大模型、训练方法、对齐、推理、代理、压缩、科学AI等主题。

!!! tip deepseek-chat TL;DR

该论文针对移动设备上模型推理中特征提取过程的瓶颈，提出了AutoFeature系统，通过图抽象、图优化和高效缓存技术，在工业移动服务中将端到端执行延迟降低了1.33x-4.53倍。

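Abstract translation

Machine learning models are widely integrated into modern mobile apps to analyze user behavior and deliver personalized services. Low-latency on-device model execution is critical to a high-quality user experience. Whereas prior work has mainly accelerated model inference given the input features, we identify an overlooked bottleneck in real-world on-device execution pipelines: extracting the input features from raw application logs. In this work, we explore feature extraction optimization as a new direction by analyzing and eliminating redundant extraction operations across different model features and across consecutive model inferences. We then present AutoFeature, an automated feature extraction engine that accelerates on-device feature extraction without compromising inference accuracy. AutoFeature has three core designs: (1) graph abstraction, which builds the extraction workflows of different input features into a single directed acyclic graph; (2) graph optimization, which identifies and fuses redundant operation nodes across features in the graph; and (3) efficient caching, which minimizes operations on raw data that overlaps between consecutive inferences. We implement a prototype of AutoFeature and integrate it into five industrial mobile services spanning search, video, and e-commerce. Online evaluation shows that AutoFeature reduces end-to-end on-device model execution latency by 1.33x to 3.93x during the day and by 1.43x to 4.53x at night.

Abstract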

Machine learning models are widely integrated into modern mobile apps to analyze user behaviors and deliver personalized services. Ensuring low-latency on-device model execution is critical for maintaining high-quality user experiences. While prior research has primarily focused on accelerating model inference with given input features, we identify an overlooked bottleneck in real-world on-device model execution pipelines: extracting input features from raw application logs. In this work, we explore a new direction of feature extraction optimization by analyzing and eliminating redundant extraction operations across different model features and consecutive model inferences. We then introduce AutoFeature, an automated feature extraction engine designed to accelerate on-device feature extraction process without compromising model inference accuracy. AutoFeature comprises three core designs: (1) graph abstraction to formulate the extraction workflows of different input features as one directed acyclic graph, (2) graph optimization to identify and fuse redundant operation nodes across different features within the graph; (3) efficient caching to minimize operations on overlapping raw data between consecutive model inferences. We implement a system prototype of AutoFeature and integrate it into five industrial mobile services spanning search, video and e-commerce domains. Online evaluations show that AutoFeature reduces end-to-end on-device model execution latency by 1.33x-3.93x during daytime and 1.43x-4.53x at night.

Keywords: on-device model inference, feature extraction optimization, mobile apps, user behavior sequences, low-latency execution, AutoFeature, graph optimization, efficient caching
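
The graph abstraction and node fusion described in the abstract can be illustrated with a toy sketch. It assumes each feature is a chain of named operations over the raw log, so two features that share a prefix share the corresponding nodes. `FeatureGraph`, `add_feature`, and the operation names are hypothetical and are not AutoFeature's API; caching across consecutive inferences is left out.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical operations; each takes the upstream result and returns a new one.
OPS: Dict[str, Callable] = {
    "filter_clicks": lambda logs: [e for e in logs if e["type"] == "click"],
    "count": lambda events: len(events),
    "last_item": lambda events: events[-1]["item"] if events else None,
}


class FeatureGraph:
    """Toy DAG in which a node is identified by the chain of ops that produces it."""

    def __init__(self) -> None:
        self.features: Dict[str, Tuple[str, ...]] = {}
        self.calls = 0  # operations actually executed, for comparison

    def add_feature(self, name: str, ops: List[str]) -> None:
        self.features[name] = tuple(ops)

    def run(self, logs: list) -> Dict[str, object]:
        memo: Dict[Tuple[str, ...], object] = {}  # one result per distinct node

        def evaluate(chain: Tuple[str, ...]):
            if not chain:
                return logs  # the root node is the raw log
            if chain not in memo:
                memo[chain] = OPS[chain[-1]](evaluate(chain[:-1]))
                self.calls += 1
            return memo[chain]

        return {name: evaluate(chain) for name, chain in self.features.items()}


logs = [
    {"type": "click", "item": "a"},
    {"type": "view", "item": "b"},
    {"type": "click", "item": "c"},
]
graph = FeatureGraph()
graph.add_feature("click_count", ["filter_clicks", "count"])
graph.add_feature("last_clicked", ["filter_clicks", "last_item"])
features = graph.run(logs)
```

Here the shared `filter_clicks` node runs once, so three operations execute instead of the four that two independent extraction chains would need.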

307. ❌ Quotient Geometry, Effective Curvature, and Implicit Bias in Simple Shallow Neural Networks

Authors: Hang-Cheng Dong, Pengcheng Cheng Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21502v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

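Scoring rationale: The paper studies the geometric structure, symmetries, and implicit bias of shallow neural networks. It is deep learning theory, but it does not touch large models (LLMs) or any of the specific techniques among the scoring keywords (MoE, RLHF, RAG, and so on). It focuses on mathematical properties of the network parameter space (quotient geometry, curvature, gradient flow) rather than large-model applications, training methods, inference optimization, or AI for science. Every keyword targets large-model techniques or applications, whereas this paper is purely theoretical, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper analyzes the parameter symmetries of shallow neural networks through a quotient-geometry framework, proposes an effective curvature that removes the degeneracy along symmetry orbits, and reports that in underdetermined regimes implicit bias is most naturally described in quotient coordinates.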

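Abstract translation

Overparameterized shallow neural networks have substantial parameter redundancy: because of hidden-unit permutations, rescalings, and related symmetries, different parameter vectors can represent the same predictor. Geometric quantities computed directly in the ambient Euclidean parameter space may therefore reflect artifacts of the representation rather than intrinsic properties of the predictor. This paper develops a differential-geometric framework that analyzes simple shallow networks through the quotient space obtained by modding out parameter symmetries on a regular set. We first characterize the symmetry and quotient structure of regular shallow-network parameters and show that the finite-sample realization map induces a natural metric on the quotient manifold. This yields an effective notion of curvature that removes the degeneracy along symmetry orbits and produces a symmetry-reduced Hessian capturing the intrinsic local geometry. We then study gradient flow on the quotient and show that only the horizontal component of parameter motion contributes to the first-order evolution of the predictor, while the vertical component is pure gauge variation. Finally, we formulate an implicit-bias viewpoint at the quotient level, arguing that meaningful complexity should be attributed to predictor classes rather than to individual parameter representatives. Experiments confirm that ambient flatness depends on the representation, that local dynamics are better organized by quotient-level curvature summaries, and that in underdetermined regimes implicit bias is most naturally described in quotient coordinates.

Abstract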

Overparameterized shallow neural networks admit substantial parameter redundancy: distinct parameter vectors may represent the same predictor due to hidden-unit permutations, rescalings, and related symmetries. As a result, geometric quantities computed directly in the ambient Euclidean parameter space can reflect artifacts of representation rather than intrinsic properties of the predictor. In this paper, we develop a differential-geometric framework for analyzing simple shallow networks through the quotient space obtained by modding out parameter symmetries on a regular set. We first characterize the symmetry and quotient structure of regular shallow-network parameters and show that the finite-sample realization map induces a natural metric on the quotient manifold. This leads to an effective notion of curvature that removes degeneracy along symmetry orbits and yields a symmetry-reduced Hessian capturing intrinsic local geometry. We then study gradient flows on the quotient and show that only the horizontal component of parameter motion contributes to first-order predictor evolution, while the vertical component corresponds purely to gauge variation. Finally, we formulate an implicit-bias viewpoint at the quotient level, arguing that meaningful complexity should be assigned to predictor classes rather than to individual parameter representatives. Our experiments confirm that ambient flatness is representation-dependent, that local dynamics are better organized by quotient-level curvature summaries, and that in underdetermined regimes, implicit bias is most naturally described in quotient coordinates.

Keywords: shallow neural networks, quotient geometry, parameter symmetries, effective curvature, implicit bias, gradient flow, Hessian, underdetermined regimes
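
The parameter redundancy that the abstract starts from can be checked numerically. The sketch below assumes a ReLU activation and a scalar input, neither of which the abstract specifies. Under that assumption, rescaling a hidden unit by c > 0 while dividing its output weight by c, or reordering the hidden units, leaves the predictor unchanged.

```python
import random


def relu(x: float) -> float:
    return x if x > 0 else 0.0


def predict(params, x: float) -> float:
    """Shallow network f(x) = sum_j a_j * relu(w_j * x + b_j) with scalar input."""
    return sum(a * relu(w * x + b) for (w, b, a) in params)


random.seed(0)
params = [(random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)) for _ in range(4)]

# Rescaling symmetry: (w, b, a) -> (c*w, c*b, a/c) for c > 0, since relu(c*z) = c*relu(z).
c = 2.5
rescaled = [(c * w, c * b, a / c) for (w, b, a) in params]

# Permutation symmetry: the sum over hidden units does not depend on their order.
permuted = list(reversed(params))

inputs = [-2.0, -0.5, 0.0, 0.7, 3.0]
max_gap = max(
    max(abs(predict(params, x) - predict(rescaled, x)),
        abs(predict(params, x) - predict(permuted, x)))
    for x in inputs
)
```

The three parameter vectors are distinct points in the ambient space but agree on every input up to floating-point rounding, which is the sense in which they form one point of the quotient.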

308. ❌ Learning Can Converge Stably to the Wrong Belief under Latent Reliability

Authors: Zhipeng Zhang, Zhenjie Yao, Kai Li, Lei Yang Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21491v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

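Scoring rationale: The paper studies how a learning system can converge stably to a wrong belief when feedback reliability is unobservable, and proposes the Monitor-Trust-Regulator framework to mitigate the problem. The work belongs to machine learning theory and methodology and is mainly concerned with learning dynamics, feedback reliability, and optimization stability, not with large models, their technical principles, or domain applications. All keywords concern specific topics such as large-model techniques, training methods, inference optimization, and application areas, whereas this paper addresses a more fundamental machine learning question, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how learning systems can converge stably to wrong beliefs when feedback reliability is unobservable, and proposes the Monitor-Trust-Regulator framework, which infers reliability to reduce bias accumulation and improve recovery.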

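Abstract translation

Learning systems are typically optimized by minimizing a loss or maximizing a reward, on the premise that improvement in these signals reflects progress toward the true objective. When feedback reliability is unobservable, however, this assumption can fail, and learning algorithms may converge stably to wrong solutions. The failure arises because single-step feedback cannot reveal whether an experience is informative or persistently biased. When information is aggregated along the learning trajectory, however, systematic differences between reliable and unreliable regimes can emerge. We propose a Monitor-Trust-Regulator (MTR) framework that infers reliability from learning dynamics and modulates updates through a slow-timescale trust variable. In reinforcement learning and supervised learning settings, standard algorithms optimize stably while learning wrong solutions under latent unreliability, whereas trust-modulated systems reduce bias accumulation and improve recovery. These results suggest that learning dynamics are not only optimization traces but also a source of information about feedback reliability.

Abstract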

Learning systems are typically optimized by minimizing loss or maximizing reward, assuming that improvements in these signals reflect progress toward the true objective. However, when feedback reliability is unobservable, this assumption can fail, and learning algorithms may converge stably to incorrect solutions. This failure arises because single-step feedback does not reveal whether an experience is informative or persistently biased. When information is aggregated over learning trajectories, however, systematic differences between reliable and unreliable regimes can emerge. We propose a Monitor-Trust-Regulator (MTR) framework that infers reliability from learning dynamics and modulates updates through a slow-timescale trust variable. Across reinforcement learning and supervised learning settings, standard algorithms exhibit stable optimization behavior while learning incorrect solutions under latent unreliability, whereas trust-modulated systems reduce bias accumulation and improve recovery. These results suggest that learning dynamics are not only optimization traces but also a source of information about feedback reliability.

Keywords: learning dynamics, feedback reliability, stable convergence, bias accumulation, trust modulation, optimization behavior, latent unreliability, Monitor-Trust-Regulator

309. ❌ Multinoulli Extension: A Lossless Continuous Relaxation for Partition-Constrained Subset Selection

Authors: Qixin Zhang, Wei Huang, Yan Sun, Yao Shu, Yi Yu, Dacheng Tao Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21492v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

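Scoring rationale: The paper studies optimization algorithms for subset selection under partition constraints and proposes a continuous-relaxation framework called the Multinoulli Extension together with matching algorithms. Its content lies entirely in combinatorial optimization, approximation algorithms, and theoretical computer science, involving submodular functions, weakly submodular functions, partition constraints, and continuous relaxation. All scoring keywords relate directly to large models, deep learning, AI applications, or their technical principles, none of which the paper covers. It does not mention language models, neural networks, training methods, reasoning techniques, AI agents, or AI-for-science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a continuous-relaxation framework called the Multinoulli Extension and the Multinoulli-SCG algorithm for subset selection under partition constraints. Without prior knowledge of structural parameters, it matches the approximation guarantees of existing methods with fewer function evaluations.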

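Abstract translation

Identifying the subset that best represents a close-to-submodular objective while satisfying a predefined partition constraint is a fundamental task with broad applications in machine learning. Existing distorted local-search methods, however, are often limited by their prohibitive query complexity and by a rigid requirement for prior knowledge of structural parameters that are hard to obtain. To overcome these limitations, we propose a new algorithm, Multinoulli-SCG, which is parameter-free and achieves the same approximation guarantees as distorted local search with significantly fewer function evaluations. Specifically, when the objective is monotone $α$-weakly DR-submodular or $(γ,β)$-weakly submodular, Multinoulli-SCG attains a value of $(1-e^{-α})\text{OPT}-ε$ or $(\frac{γ^{2}(1-e^{-(β(1-γ)+γ^2)})}{β(1-γ)+γ^2})\text{OPT}-ε$ with only $O(1/ε^{2})$ function evaluations, where OPT denotes the optimal value. At the core of Multinoulli-SCG is a new continuous-relaxation framework, the Multinoulli Extension (ME), which converts the discrete subset selection problem under partition constraints into a solvable continuous maximization problem centered on learning the optimal multinoulli priors over the partition. Compared with the well-established multilinear extension for submodular subset selection, a notable advantage of ME is its intrinsic ability to provide a lossless rounding scheme for any set function. Building on ME, we also propose two new online algorithms, Multinoulli-OSCG and Multinoulli-OSGA, for the previously unexplored online subset selection problem under partition constraints.

Abstract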

Identifying the most representative subset for a close-to-submodular objective while satisfying the predefined partition constraint is a fundamental task with numerous applications in machine learning. However, the existing distorted local-search methods are often hindered by their prohibitive query complexities and the rigid requirement for prior knowledge of difficult-to-obtain structural parameters. To overcome these limitations, we introduce a novel algorithm titled Multinoulli-SCG, which not only is parameter-free, but also can achieve the same approximation guarantees as the distorted local-search methods with significantly fewer function evaluations. More specifically, when the objective function is monotone $α$-weakly DR-submodular or $(γ,β)$-weakly submodular, our Multinoulli-SCG algorithm can attain a value of $(1-e^{-α})\text{OPT}-ε$ or $(\frac{γ^{2}(1-e^{-(β(1-γ)+γ^2)})}{β(1-γ)+γ^2})\text{OPT}-ε$ with only $O(1/ε^{2})$ function evaluations, where OPT denotes the optimal value. The cornerstone of our Multinoulli-SCG algorithm is an innovative continuous-relaxation framework named Multinoulli Extension(ME), which can effectively convert the discrete subset selection problem subject to partition constraints into a solvable continuous maximization focused on learning the optimal multinoulli priors across the concerned partition. In sharp contrast with the well-established multi-linear extension for submodular subset selection, a notable advantage of our proposed ME is its intrinsic capacity to provide a lossless rounding scheme for any set function. Furthermore, based on our proposed ME, we also present two novel online algorithms, namely, Multinoulli-OSCG and Multinoulli-OSGA, for the unexplored online subset selection problems over partition constraints.

Keywords: subset selection, partition constraints, continuous relaxation, Multinoulli Extension, weakly submodular, approximation algorithms, function evaluations, online algorithms
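
The lossless-rounding claim can be made concrete with a toy sketch. It assumes the relaxation is the expected objective when exactly one item is drawn independently from a categorical distribution on each partition block. That is one plausible reading of "multinoulli priors", not a definition taken from the paper, whose constraint and extension may differ. Under this assumption the relaxed objective is linear in each block's distribution, so replacing one block at a time by its best one-hot vertex never lowers the value, whatever the set function is. The names `blocks`, `f`, `extension`, and `round_blockwise` are illustrative.

```python
from itertools import product

# Hypothetical ground set split into two partition blocks.
blocks = [["a1", "a2"], ["b1", "b2", "b3"]]


def f(items):
    """Arbitrary set function with no submodularity assumed, for illustration only."""
    chosen = set(items)
    value = float(len(chosen))
    if {"a1", "b2"} <= chosen:
        value += 3.0
    if {"a2", "b3"} <= chosen:
        value -= 1.0
    return value


def extension(probs):
    """Expected f(S) when one item per block is drawn independently from probs[k]."""
    total = 0.0
    for choice in product(*(range(len(b)) for b in blocks)):
        weight = 1.0
        for k, i in enumerate(choice):
            weight *= probs[k][i]
        total += weight * f([blocks[k][i] for k, i in enumerate(choice)])
    return total


def round_blockwise(probs):
    """Replace each block's distribution by its best one-hot vertex, one block at a time."""
    probs = [list(p) for p in probs]
    for k, block in enumerate(blocks):
        vertices = [[1.0 if j == i else 0.0 for j in range(len(block))]
                    for i in range(len(block))]
        probs[k] = max(vertices, key=lambda v: extension(probs[:k] + [v] + probs[k + 1:]))
    return [block[p.index(1.0)] for block, p in zip(blocks, probs)]


fractional = [[0.5, 0.5], [0.2, 0.3, 0.5]]
fractional_value = extension(fractional)
rounded = round_blockwise(fractional)
rounded_value = f(rounded)
```

The brute-force enumeration in `extension` is exponential in the number of blocks and is only meant to make the linearity argument checkable on a small instance.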

310. ❌ Characterizing Long-Range Dependencies in Knee Joint Contact Mechanics: A Comparison of Topology Diffusion, Global Routing, and Hybrid Graph Neural Networks

Authors: Zhengye Pan, Jianwei Zuo, Jiajia Luo Source: arxiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21020v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

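Scoring rationale: The paper uses graph neural networks (GNNs) as surrogate models for knee joint contact mechanics and compares topology diffusion, global routing, and hybrid architectures. It belongs to biomechanics and computational modeling and relates to deep learning applied to science, but it does not involve large language models (LLMs), large-model principles, or the specific techniques named in the keywords (MoE, SFT, RLHF, RAG, and so on). The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the study applies AI (GNNs) to biomechanics, a scientific domain. Since this is not core bioinformatics or cheminformatics, it receives 5 points (partially related). All other keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The study compares topology diffusion, global routing, and hybrid graph neural networks for surrogate modeling of knee joint contact mechanics, and finds that the hybrid model gives the lowest full-field error and peak stress error.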

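Abstract translation

Finite element analysis of knee joint contact mechanics is computationally expensive, which has motivated graph neural network surrogate models. Representing the long-range dependencies in joint mechanical responses effectively remains a challenge, however. This study systematically compares topology diffusion, global routing, and their hybrid for surrogate modeling of knee joint contact mechanics. Using kinematic and force data from nine soccer players performing change-of-direction maneuvers, finite element simulations generated graph-structured samples, which were used for training and evaluation under a grouped three-fold cross-subject framework. Five architectures were compared: standard MeshGraphNet, hierarchical MeshGraphNet, a routing-only Transformer, a topology-biased routing Transformer, and a hybrid model. The hybrid model performed best overall, with the lowest full-field error and peak stress error and the highest spatial agreement in high-risk regions. Among the non-hybrid models, the standard topology-diffusion model performed best overall, while routing-only strategies were less effective. These results indicate that, within the present benchmark, topology diffusion provides a robust basis for surrogate modeling of knee joint contact mechanics, and that adding global routing can further improve the reconstruction of clinically relevant high-stress patterns.

Abstract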

Finite element analysis of knee joint contact mechanics is computationally expensive, which has motivated the development of graph neural network surrogate models. However, effectively representing long-range dependencies in joint mechanical responses remains challenging. This study systematically compared topology diffusion, global routing, and their hybridization for surrogate modeling of knee joint contact mechanics. Using kinematic and force data from nine soccer players performing change-of-direction maneuvers, finite element simulations were used to generate graph-structured samples for training and evaluation under a grouped three-fold cross-subject evaluation framework. Five architectures were compared: standard MeshGraphNet, hierarchical MeshGraphNet, a routing-only transformer, a topology-biased routing transformer, and a hybrid model. The hybrid model achieved the best overall performance, yielding the lowest full-field error and peak stress error, together with the highest spatial agreement for high-risk regions. Among the non-hybrid models, the standard topology-diffusion model performed best overall, whereas routing-only strategies were less effective. These findings indicate that topology diffusion provides a robust basis for surrogate modeling of knee joint contact mechanics within the present benchmark, while the addition of global routing can further improve reconstruction of clinically relevant high-stress patterns.

Keywords: knee joint contact mechanics, graph neural networks, topology diffusion, global routing, hybrid model, finite element analysis, surrogate modeling, biomechanics
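
Why long-range dependencies are hard for purely local message passing can be seen on a toy path graph. The sketch below is not MeshGraphNet or any architecture from the paper. It uses plain neighbor averaging as a stand-in for topology diffusion and a single global mean as a stand-in for global routing, and all names are illustrative.

```python
def diffuse(signal, neighbors, rounds):
    """Repeated averaging over each node and its graph neighbors."""
    h = list(signal)
    for _ in range(rounds):
        h = [
            (h[i] + sum(h[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
            for i in range(len(h))
        ]
    return h


n = 6
# Path graph 0 - 1 - 2 - 3 - 4 - 5.
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]
signal = [1.0] + [0.0] * (n - 1)  # information starts at node 0 only

after_two_rounds = diffuse(signal, neighbors, rounds=2)
reached_by_diffusion = [i for i, v in enumerate(after_two_rounds) if v > 0]

# Stand-in for global routing: every node reads a global summary in one step.
global_summary = sum(signal) / n
after_routing = [v + global_summary for v in signal]
reached_by_routing = [i for i, v in enumerate(after_routing) if v > 0]
```

After k rounds of local averaging the signal has reached only nodes within k hops, while the global summary reaches every node in one step at the cost of ignoring the mesh topology. A hybrid combines the two kinds of step.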

311. ❌ An Accurate Tensorial Model for Prediction of Full Zeolite NMR Spectra

Authors: Carlos Bornes, Chiheb Ben Mahmoud, Volker L. Deringer, Christopher J. Heard, Lukáš Grajciar Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22268v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

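Scoring rationale: The paper uses machine learning to predict the NMR spectra of zeolites and belongs to scientific computing and materials science. It is unrelated to most keywords, which target natural language processing and large language model techniques (training methods, inference optimization, alignment, and so on). The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work applies AI to science, specifically cheminformatics and materials science, so it receives 10 points (highly related).

!!! tip deepseek-chat TL;DR

The study develops a new tensorial machine learning model that predicts the full NMR spectra of zeolites with high accuracy, enabling high-throughput simulation across several nuclei (such as 27Al and 29Si).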

Abstract translation

Solid-state nuclear magnetic resonance (ss-NMR) is one of the most sensitive and widely used techniques for resolving the structure of geometrically complex crystalline materials such as zeolites. Synergistic support from computational modeling is essential for interpreting experimental spectra and relating ss-NMR to atomistic models. However, the high cost of computing magnetic shielding (MS) and electric field gradient (EFG) tensors from first principles hinders computational prediction. This work uses a new tensorial machine learning approach to train a general model that predicts complete NMR tensors. We demonstrate the approach on a diverse dataset of zeolitic materials and NMR-active nuclei ($^{27}$Al, $^{29}$Si, $^{17}$O, $^{23}$Na, and $^{1}$H), predicting all NMR observables with high accuracy. These observables are then translated into predictions of the full $^{27}$Al and $^{29}$Si solid-state NMR spectra of the exemplary zeolite RTH. The work thus opens a route to accurate, high-throughput NMR simulation for large-scale, realistic models of chemically complex zeolites.

Abstract

Solid state nuclear magnetic resonance (ss-NMR) is one of the most sensitive and popular techniques for structure elucidation in geometrically complex crystalline materials, such as zeolites. Synergistic support from computational modelling is vital to interpret experimental spectra, and relate ss-NMR to atomistic models. Nevertheless, computational predictions are hindered by the high expense of calculating magnetic shielding (MS) and electric field gradient (EFG) tensors from first principles. In this work, we leverage a novel tensorial machine learning approach to train a general model for predicting complete NMR tensors. We demonstrate the utility of the approach for a diverse dataset of zeolitic materials and NMR-active nuclei ($^{27}$Al, $^{29}$Si, $^{17}$O, $^{23}$Na and $^{1}$H), predicting all NMR observables to a high degree of precision. These observables are then translated into predictions of the full $^{27}$Al and $^{29}$Si ss-NMR spectra for the exemplary zeolite RTH. Thus, this work opens a pathway to accurate, high-throughput NMR simulation for large-scale and realistic models of chemically complex zeolites.

Keywords: zeolite, NMR spectra, tensorial machine learning, magnetic shielding, electric field gradient, high-throughput simulation, solid state NMR, computational modeling
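
As a small illustration of how a predicted tensor becomes an observable, the isotropic shielding is one third of the trace of the 3×3 magnetic shielding tensor, and the symmetric part of that tensor carries the anisotropy information. The tensor values below are made up for the example and do not come from the paper. Converting shielding to a chemical shift needs a reference value, which is not attempted here.

```python
def isotropic(tensor):
    """Isotropic value of a 3x3 tensor: one third of its trace."""
    return sum(tensor[i][i] for i in range(3)) / 3.0


def symmetric_part(tensor):
    """Symmetric part (T + T^T) / 2 of a 3x3 tensor."""
    return [[0.5 * (tensor[i][j] + tensor[j][i]) for j in range(3)] for i in range(3)]


# Hypothetical shielding tensor in ppm, for illustration only.
sigma = [
    [510.0, 4.0, -2.0],
    [6.0, 495.0, 1.0],
    [0.0, 3.0, 480.0],
]

sigma_iso = isotropic(sigma)
sigma_sym = symmetric_part(sigma)
```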

312. ❌ Microscopic view of materials properties of liquids: An atomic scale perspective

Author: Jaeyun Moon Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22266v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

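Scoring rationale: The paper "Microscopic view of materials properties of liquids: An atomic scale perspective" is a review of the physical properties of liquids that focuses on atomic-scale theoretical, computational, and experimental methods (instantaneous normal modes, velocity autocorrelation functions, X-ray and neutron scattering). It does not involve large models, deep learning, AI techniques, or any concept among the scoring keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The review discusses the challenges of studying atomic-scale dynamics in liquids and summarizes progress in understanding their thermodynamic and dynamic behavior through theoretical, computational, and experimental methods.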

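Abstract translation

A microscopic understanding of liquid properties is essential for advancing a wide range of fields, from energy applications such as nuclear reactors and batteries to biomedical applications such as drug delivery and microfluidics. However, the intrinsic dynamic disorder of liquids and their lack of structural periodicity pose fundamental challenges to developing rigorous microscopic theories of their thermodynamic and dynamic behavior. Recent breakthroughs in computational power and experimental metrology have driven significant progress in revealing the complex atomic-scale dynamics of liquids. In this review, we give a brief historical background of liquid-state physics and discuss recent advances through theoretical, computational, and experimental approaches. On the theoretical and computational side, we discuss instantaneous normal mode and velocity autocorrelation function calculations. On the experimental side, we focus on X-ray and neutron scattering techniques that probe liquid dynamics at the atomic level. Finally, we outline emerging opportunities and future directions in the study of atomic dynamics in liquids.

Abstract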

Microscopic understanding of liquid properties is essential for advancing a wide range of applications from energy applications such as nuclear reactors and batteries to biomedical applications including drug delivery and microfluidics. However, intrinsic dynamic disorder and lack of structural periodicity in liquids have presented fundamental challenges in developing rigorous microscopic theories of their thermodynamic and dynamic behavior. Recent breakthroughs in computational power and experimental metrologies have driven significant progress in unraveling the complex atomic scale dynamics of liquids. In this Review, we provide a brief historical context of liquid state physics and explore recent advances through theoretical, computational, and experimental approaches. For theoretical and computational approaches, instantaneous normal mode and velocity autocorrelation function calculations are discussed. For experiments, we focus on X-ray and neutron scattering techniques that probe liquid dynamics at the atomic level. Finally, we highlight emerging opportunities and future directions in the study of liquid atomic dynamics.

Keywords: liquid properties, atomic scale dynamics, instantaneous normal mode, velocity autocorrelation function, X-ray scattering, neutron scattering, computational approaches, experimental metrologies
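
For readers unfamiliar with the velocity autocorrelation function mentioned in the abstract, a minimal sketch of its usual unnormalized estimator follows: C(lag) is the dot product of a particle's velocity with its own velocity `lag` frames later, averaged over particles and time origins. The trajectory is synthetic and the function name `vacf` is illustrative, not taken from the review.

```python
def vacf(velocities, max_lag):
    """velocities[t][i] is the (vx, vy, vz) tuple of particle i at frame t."""
    n_frames = len(velocities)
    n_particles = len(velocities[0])
    result = []
    for lag in range(max_lag + 1):
        total, count = 0.0, 0
        for t0 in range(n_frames - lag):  # average over time origins
            for i in range(n_particles):  # average over particles
                a = velocities[t0][i]
                b = velocities[t0 + lag][i]
                total += a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
                count += 1
        result.append(total / count)
    return result


# Synthetic trajectory: one particle whose x-velocity flips sign every frame.
trajectory = [[(1.0 if t % 2 == 0 else -1.0, 0.0, 0.0)] for t in range(6)]
correlation = vacf(trajectory, max_lag=2)
```

The sign flip at lag 1 in this artificial trajectory mimics, in exaggerated form, the negative dip that caged, back-scattering motion produces in the velocity autocorrelation function of a dense liquid.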

313. ❌ Decoupling Precipitation and Surface Complexation during Mn(II) Removal by Biochar via Experiments and Atomistic Simulations

Authors: Audrey Ngambia, Anastasiia Gavrilova, Haitao Huang, Zhuodong Lyu, Ondřej Mašek, Margaret Graham, Valentina Erastova Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22144v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

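Scoring rationale: The paper studies the mechanisms by which biochar removes Mn(II) from water. It belongs to environmental science and materials science and analyzes the chemistry through experiments and molecular dynamics simulations. All scoring keywords concern large models, deep learning, and related techniques (training methods, inference optimization, alignment, applications, and so on), none of which the paper covers, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

Using experiments and atomistic simulations, the study identifies two main mechanisms by which biochar removes Mn(II) from water, alkaline precipitation and surface complexation, and provides chemical criteria for designing efficient water remediation materials.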

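Abstract translation

Mn(II) released by mining activity poses a persistent water-quality challenge, yet the mechanisms by which low-cost sorbents such as biochar immobilize Mn(II) remain unclear. By combining fixed-bed column experiments, batch experiments, and atomistic molecular dynamics simulations, this study identifies the specific chemical drivers of Mn(II) immobilization. Oilseed rape straw biochars produced by pyrolysis at 350 °C, 550 °C, and 700 °C removed 20-50% of dissolved Mn from acidic influent (pH 4, 5 ppm). High-temperature biochar achieved the highest removal (about 50%) and rapidly raised the effluent pH to 9, triggering alkaline precipitation. In contrast, lower-temperature biochars removed 20-30% of Mn while maintaining a near-neutral pH (7-7.5). The increased release of K+ in these systems indicates significant cation exchange and non-precipitative pathways. Molecular simulations confirm that although neutral surfaces bind Mn(II) only weakly, deprotonated sites drive strong adsorption through inner-sphere complexation (about 50% removal) and outer-sphere association (about 10% removal). These results establish a mechanistic framework that distinguishes precipitation-led from surface-complexation-led removal. By providing specific chemical criteria for targeted Mn immobilization, the study lays the groundwork for the rational design of engineered biochars for sustainable water remediation.

Abstract

Manganese(II) mobilised by mining activity poses a persistent water-quality challenge, yet the mechanisms by which low-cost sorbents, such as biochar, sequester Mn(II) remain poorly resolved. This study identifies the specific chemical drivers of Mn(II) sequestration by combining fixed-bed column and batch experiments with atomistic molecular dynamics simulations. Oilseed rape straw biochars, produced at 350 °C, 550 °C, and 700 °C, removed 20-50% of dissolved Mn from acidic influent (pH 4, 5 ppm). High-temperature biochar achieved the greatest removal (~50%) and rapidly increased effluent pH to 9, triggering alkaline precipitation. Conversely, lower-temperature biochars removed 20-30% of Mn while maintaining a near-neutral pH (7-7.5). Enhanced K+ release in these systems indicates significant cation exchange and non-precipitative pathways. Molecular simulations confirmed that while neutral surfaces show weak Mn(II) association, deprotonated sites drive strong adsorption through inner-sphere complexation (~50% removal) and outer-sphere association (~10%). These results establish a mechanistic framework to distinguish between precipitation-led and surface-complexation-led removal. By providing specific chemical criteria for Mn-targeted sequestration, this work enables the rational design of engineered biochars for sustainable water remediation.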

Keywords: Manganese removal, Biochar, Water remediation, Molecular dynamics simulations, Surface complexation, Precipitation, Cation exchange, Atomistic simulations

314. ❌ Stable, Fast, and Accurate Kohn-Sham Inversion in Gaussian Basis for Open Shell Molecular and Condensed Phase Systems via Density Matrix Penalization

Authors: Ziwei Chai, Sandra Luber Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22140v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

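Scoring rationale: This paper develops an inverse Kohn-Sham density functional theory method in computational chemistry, which is a scientific computing application. All keywords relate directly to large models, deep learning principles, or AI applications, and the paper covers none of these topics. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper falls within computational chemistry, but it does not mention any AI method, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.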
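
The arithmetic behind the score line can be sketched as follows. The passing threshold uses the formula stated in the report configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The per-keyword rule `score = weight × relevance` is an assumption inferred from the table, where a weight of 0.0 turns a relevance of 5.0 into a score of 0.0; the report does not spell it out, and the function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Threshold from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum per-keyword scores.

    ASSUMPTION: each keyword contributes weight * relevance. This is inferred
    from the score table and is not stated explicitly in the report.
    """
    return sum(weight * relevance for weight, relevance in rows)


# (weight, relevance) pairs as listed in the table for this paper:
# 26 keywords at relevance 0.0 and one keyword at relevance 5.0, all weight 0.0.
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]

threshold = passing_score(27.0)
total = paper_score(rows)
verdict = "pass" if total >= threshold else "fail"
print(f"{total:.1f} / {threshold:.1f} -> {verdict}")
```

Under this assumption, a nonzero relevance cannot lift the total while the listed weight is 0.0, which matches the 0.0 totals reported for these entries.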

!!! tip deepseek-chat TL;DR

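This paper proposes a density-matrix-based inverse Kohn-Sham method in a Gaussian basis that optimizes the KS potential matrix to reproduce a target electron density. Compared with the conventional ZMP method, it achieves smaller density deviations and higher computational efficiency for molecular and condensed-phase systems.

Abstract (Chinese translation)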

本文提出一种完全基于高斯基组表示的密度矩阵科恩-沙姆（Kohn-Sham, KS）反演方法，通过优化KS势矩阵来复现目标电子密度。反演科恩-沙姆密度泛函理论（inverse Kohn-Sham density functional theory, inverse KS-DFT）旨在确定能够复现目标电子密度的有效局域KS势，这对于电子结构分析和基于轨道的校正方法发展均具有重要意义。然而在有限高斯基组实现中，传统反演KS-DFT方法（如赵-莫里森-帕尔（Zhao-Morrison-Parr, ZMP）方法）常因实空间惩罚势被投影到有限数量的高斯基组矩阵元上而导致约束性弱化、效率低下，这种投影会严重粗粒化势场的空间变化。本方法在勒夫丁（Löwdin）正交化基组中定义密度矩阵偏差，使得惩罚能量在该基组的幺正变换下保持恒定。我们还在原始非正交高斯基组中解析推导了KS哈密顿量中对应的惩罚势贡献项。在广泛的惩罚强度范围内，自洽场（self consistent field, SCF）优化对各种开壳层体系均保持鲁棒性和高效性，而逐步增强的惩罚力驱使电子密度与目标值精确吻合。分子与凝聚相体系的测试表明，相较于传统ZMP方法，本方法可获得显著更小的可达到密度偏差。该方法为有限高斯基组中的KS反演提供了快速精确的途径，并有望应用于未来基于轨道的校正方案。

Abstract

Here we present a density matrix based KS inversion method formulated entirely within a Gaussian basis representation to optimize a KS potential matrix that reproduces a target electron density. Inverse Kohn-Sham (KS) density functional theory (DFT) aims to determine the effective local KS potential that reproduces a target electron density, and is important both for electronic structure analysis and for the development of orbital based correction methods. In finite Gaussian basis implementations, however, conventional inverse KS-DFT approaches such as the Zhao-Morrison-Parr (ZMP) method often become poorly constrained and inefficient, because the real space penalty potential is projected onto a limited number of Gaussian basis matrix elements, which can strongly coarse-grain its spatial variation. In the present method, the density matrix mismatch is defined in a Lowdin orthogonalized basis, which yields a penalty energy invariant under unitary rotations in that basis. The corresponding penalty potential contribution to the KS Hamiltonian is derived analytically in the original nonorthogonal Gaussian basis. Across a wide range of penalty strengths, the self consistent field (SCF) optimization remains robust and efficient for various open shell systems, while progressively tightening the penalty drives the electron density into accurate agreement with the target. Benchmarks on molecules and condensed phase systems show that the method achieves substantially smaller attainable density deviations than the conventional ZMP method. The method provides a fast and accurate route to KS inversion in finite Gaussian basis sets and may also be useful for future orbital based correction schemes.
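
To illustrate why a mismatch defined in the Löwdin-orthogonalized basis is invariant under unitary rotations of that basis, here is a minimal NumPy sketch. It assumes a quadratic penalty of the form λ‖S^(1/2)(P − P_target)S^(1/2)‖², with the Frobenius norm, which is an illustrative choice and not necessarily the functional used in the paper. The names `lowdin_mismatch`, `lowdin_penalty`, and `strength`, as well as the random test matrices, are hypothetical.

```python
import numpy as np


def sym_sqrt(S):
    """Symmetric square root S^(1/2) of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T


def lowdin_mismatch(P, P_target, S):
    """Density matrix difference expressed in the Lowdin-orthogonalized basis."""
    S_half = sym_sqrt(S)
    return S_half @ (P - P_target) @ S_half


def lowdin_penalty(P, P_target, S, strength):
    """ASSUMED penalty form: strength * squared Frobenius norm of the mismatch."""
    D = lowdin_mismatch(P, P_target, S)
    return strength * float(np.sum(D * D))


rng = np.random.default_rng(0)
n = 4

# Random symmetric positive-definite matrix standing in for the overlap matrix.
A = rng.normal(size=(n, n))
S = A @ A.T + n * np.eye(n)

# Random symmetric matrices standing in for the current and target densities.
B = rng.normal(size=(n, n))
C = rng.normal(size=(n, n))
P = B + B.T
P_target = C + C.T

# Rotate the orthogonalized basis with a random orthogonal matrix Q.
D = lowdin_mismatch(P, P_target, S)
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
D_rot = Q.T @ D @ Q

invariant = bool(np.isclose(np.sum(D * D), np.sum(D_rot * D_rot)))
print("penalty:", lowdin_penalty(P, P_target, S, strength=1.0))
print("invariant under rotation:", invariant)
```

The invariance follows because the Frobenius norm is unchanged by an orthogonal similarity transform; the same argument would not hold for a mismatch measured directly in the non-orthogonal basis.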

Keywords: Inverse Kohn-Sham DFT, Density matrix, Gaussian basis, KS inversion, ZMP method, Open shell systems, Self-consistent field, Electronic structure

315. ❌ Adsorption energies and decomposition barrier heights for ethylene carbonate on the surface of lithium from cluster-based quantum chemistry

Authors: Ethan A. Vo, Hung T. Vuong, Zachary K. Goldsmith, Hong-Zhou Ye, Yujing Wei, Sohang Kundu, Ardavan Farahvash, Garvit Agarwal, Richard A. Friesner, Timothy C. Berkelbach Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22139v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

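Scoring rationale: This paper is a computational chemistry study of the adsorption energies and decomposition barrier height of ethylene carbonate on the lithium metal surface. It uses high-accuracy quantum chemistry methods (such as coupled-cluster theory and quantum Monte Carlo) and assesses the accuracy of different density functionals. The content is unrelated to nearly all keywords (large-model techniques, training methods, inference optimization, alignment, agent systems), which belong to AI and machine learning, whereas this is purely a physical chemistry calculation. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is computational chemistry (arguably part of cheminformatics or scientific computing); however, it uses conventional quantum chemistry rather than AI or machine learning, so it receives only 5 points (some relevance).

!!! tip deepseek-chat TL;DR

Using high-accuracy quantum chemistry methods, this paper computes the adsorption energies and decomposition barrier height of ethylene carbonate on the lithium metal (100) surface, validates a scheme that corrects finite-cluster results to the thermodynamic limit, and identifies ωB97X-V as an especially promising functional for the interfacial chemistry of electrolyte solvents at lithium metal anodes.

Abstract (Chinese translation)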

针对碳酸乙烯酯在锂金属(100)表面的吸附行为，我们计算了两种结合构型的吸附能以及开环分解反应的能垒高度。我们验证了一种在热力学极限下获取结果的校正方案：通过对仅含40-100个原子的有限锂团簇计算结果进行修正，使得采用杂化密度泛函、随机相位近似以及耦合簇理论和辅助场量子蒙特卡罗等关联波函数理论成为可能。研究发现，高阶理论的计算结果差异在2-5 kcal/mol范围内，因此可为更经济的计算方法提供基准参考。利用我们的基准数据，我们证明广义梯度近似泛函（如PBE）对反应能垒高度的计算精度不足，并发现$ω$B97X-V泛函在锂金属负极电解质溶剂界面化学研究中表现出特别的应用潜力。

Abstract

For ethylene carbonate on the (100) surface of lithium, we calculate the adsorption energy in two binding motifs as well as the barrier height for a ring-opening decomposition reaction. We validate a scheme for producing results in the thermodynamic limit by correcting results obtained on finite lithium clusters containing only 40-100 atoms, which enables the use of hybrid density functionals, the random-phase approximation, and correlated wavefunction theories such as coupled-cluster theory and auxiliary-field quantum Monte Carlo. We find that the high-level theories agree to within 2-5 kcal/mol and can therefore serve as benchmarks for more affordable methods. Using our reference data, we demonstrate that generalized gradient approximation functionals, such as PBE, are not sufficiently accurate for reaction barrier heights, and we identify $ω$B97X-V as an especially promising functional for the interfacial chemistry of electrolyte solvents at lithium metal anodes.
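
To put the 2-5 kcal/mol spread in context, the sketch below estimates how much a barrier-height difference of that size would change a rate constant. It assumes a simple Arrhenius-type dependence, rate ∝ exp(−E/RT), at 298.15 K; neither the rate model nor the temperature comes from the paper, and the function name is hypothetical.

```python
import math

GAS_CONSTANT_KCAL_PER_MOL_K = 1.987204e-3  # R in kcal / (mol K)


def rate_ratio(barrier_difference_kcal_per_mol, temperature_k=298.15):
    """Factor by which an Arrhenius rate changes for a given barrier difference.

    ASSUMPTION: rate is proportional to exp(-E / (R T)) with an unchanged
    prefactor, so the ratio is exp(dE / (R T)).
    """
    rt = GAS_CONSTANT_KCAL_PER_MOL_K * temperature_k
    return math.exp(barrier_difference_kcal_per_mol / rt)


for delta in (2.0, 5.0):
    print(f"dE = {delta:.1f} kcal/mol -> rate changes by a factor of {rate_ratio(delta):.0f}")
```

Under these assumptions, even the lower end of the range corresponds to more than an order of magnitude in the rate, which is why barrier heights call for benchmark-quality reference data.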

Keywords: ethylene carbonate, lithium metal surface, adsorption energy, decomposition barrier, quantum chemistry, density functional theory, coupled-cluster theory, quantum Monte Carlo

316. ❌ Overcoming sampling limitations using machine-learned interatomic potentials: the case of water-in-salt electrolytes

Authors: Luca Brugnoli, Mathieu Salanne, A. Marco Saitta, Alessandra Serva, Arthur France-Lanord Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.22099v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

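Scoring rationale: This paper studies the use of machine-learned interatomic potentials (MACE potentials) to simulate concentrated electrolytes. It belongs to AI for Science and is highly relevant to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It does not cover large language models (LLMs), deep learning principles, training optimization, inference acceleration, or agents, so the other 26 keywords score 0.

!!! tip deepseek-chat TL;DR

This study evaluates machine-learned interatomic potentials for simulating a highly concentrated water-in-salt electrolyte. It finds that they overcome the sampling limitations of ab initio molecular dynamics and agree well with experimental data, and that fine-tuning a foundation model is more data-efficient than training from scratch and provides information about hard-to-sample configurations.

Abstract (Chinese translation)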

机器学习原子间势有望实现对高浓度液体在具有实际意义的时间尺度上的建模，这远超当前从头算电子结构方法的能力范围。本文评估了多种MACE势在模拟基于双（三氟甲磺酰基）亚胺锂的$21 m$高浓水系电解质中的性能。我们测试了开箱即用的基础模型，以及微调与从零开始训练两种策略。模拟结果表明，代理模型能够克服从头算分子动力学的采样限制，在与结构因子等实验观测数据上达到优异的一致性。我们还证明了微调基础模型相较于从零训练的优势：这不仅体现在数据效率方面，更重要的是，它能提供关于难以采样的构型（例如短Li$^+$–Li$^+$距离）的信息。最后，我们发现根据所选参考交换关联泛函的不同，经验色散校正方案可能产生不利影响。总而言之，我们的研究表明机器学习原子间势非常适合用于长时间尺度下高浓度电解质的建模。

Abstract

Machine-learned interatomic potentials hold the promise to enable the modeling of highly concentrated liquids over meaningful timescales, far from reach for current ab initio electronic structure methods. Here we evaluate the performances of various MACE potentials in modeling a $21 m$ water-in-salt electrolyte based on lithium bis(trifluoromethanesulfonyl)imide. We test out-of-the-box foundation models, as well as both fine tuning and from scratch training strategies. Our simulations demonstrate that surrogate models allow to overcome sampling limitations of ab initio molecular dynamics, reaching an excellent agreement with experimental observables such as the structure factor. We also demonstrate the benefit of fine tuning a foundation model over training from scratch: in terms of data efficiency, but most importantly as a means to provide information regarding configurations hard to sample, such as short Li$^+$–Li$^+$ distances. Finally, we show that depending on the reference exchange-correlation functional, empirical dispersion correction schemes can be detrimental. All in all, our work shows that machine-learned interatomic potentials are a good fit for the modeling of highly concentrated electrolytes over long timescales.
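
To give a sense of how concentrated a 21 m solution is, the molality can be converted into a water-to-salt molar ratio. The sketch below assumes that "21 m" means 21 mol of LiTFSI per kilogram of water and uses 18.015 g/mol for the molar mass of water; the function name is hypothetical.

```python
WATER_MOLAR_MASS_G_PER_MOL = 18.015


def water_per_salt(molality_mol_per_kg):
    """Water molecules per formula unit of salt at a given molality."""
    moles_water_per_kg = 1000.0 / WATER_MOLAR_MASS_G_PER_MOL
    return moles_water_per_kg / molality_mol_per_kg


ratio = water_per_salt(21.0)
print(f"{ratio:.2f} water molecules per LiTFSI formula unit")
```

Under these assumptions there are fewer than three water molecules per lithium ion, far fewer than in a dilute solution, which is the regime the term "water-in-salt" refers to.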

Keywords: machine-learned interatomic potentials, water-in-salt electrolytes, MACE potentials, fine tuning, foundation models, ab initio molecular dynamics, sampling limitations, lithium bis(trifluoromethanesulfonyl)imide

317. ❌ Molecular dynamics simulation of high slip flow of water confined between graphene nanochannels at experimentally accessible strain rates

Authors: Carmelo Civello, Luca Maffioli, Edward Smith, James Ewen, Peter Daivis, Daniele Dini, Billy Todd Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21907v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

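Scoring rationale: This paper uses molecular dynamics simulation to study high-slip flow of water in graphene nanochannels. It belongs to computational physics and materials science and is unrelated to all of the large-model and deep learning keywords (relevance 0). The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves scientific computing and molecular simulation, a potential application area for AI in science; however, it uses only conventional molecular dynamics and no AI method, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

Using the transient time correlation function (TTCF) method, this study simulates, for the first time, high-slip flow of water in graphene nanochannels at experimentally accessible shear rates. It demonstrates the effectiveness of TTCF for low-strain-rate systems, and the results agree with equilibrium molecular dynamics simulations and experiments.

Abstract (Chinese translation)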

瞬态时间相关函数方法（TTCF）已成为在低剪切速率下精确探测系统的有力工具。本研究采用TTCF方法，针对一个由石墨烯壁间受限水构成的高滑移系统，在实验可及的剪切速率范围内评估了滑移长度对剪切速率的依赖性——该速率条件下经典的非平衡分子动力学（NEMD）模拟难以实施。我们计算了跨越六个数量级的所有剪切速率对应的纳维摩擦系数，并与平衡极限进行了比较。本研究首次报道了利用TTCF方法在实验可及剪切速率下获得的NEMD结果，该系统在过去数十年间已引起广泛关注。通过TTCF计算得到的滑移长度与先前的平衡分子动力学模拟及实验结果高度吻合。本文旨在彰显TTCF方法的卓越能力，尤其对于高滑移（低应变率）体系，并验证平衡方法在实验可及应变率下与非平衡分子动力学测量结果直接匹配。

Abstract

The transient time correlation function method (TTCF) has emerged as a powerful methodology for accurately probing systems at low shear rates. In the present study, TTCF was used to evaluate the shear rate dependence of the slip length in a high-slip system consisting of water confined between graphene walls at experimentally accessible shear rates, for which classical nonequilibrium molecular dynamics (NEMD) is unfeasible. The corresponding Navier friction coefficient was computed for all shear rates spanning six orders of magnitude and compared with the equilibrium limit. We report for the first time NEMD results obtained at experimentally accessible shear rates using the TTCF approach for a system that has attracted significant interest over the past decades. The slip length calculated with TTCF is in good agreement with previous equilibrium molecular dynamics simulations and experiments. Our aim here is to highlight the extraordinary power of TTCF, particularly for high-slip (low strain-rate) systems, and to verify that equilibrium methods directly match NEMD measurements at experimentally accessible strain rates.
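
The Navier friction coefficient and the slip length mentioned in the abstract are related through the Navier slip condition, b = η/λ, where η is the shear viscosity of the fluid and λ is the interfacial friction coefficient. A minimal sketch of that conversion follows. The numerical inputs are placeholders chosen for illustration and are not taken from the paper, and the function name is hypothetical.

```python
def slip_length(viscosity_pa_s, friction_coeff_n_s_per_m3):
    """Slip length b = eta / lambda from the Navier slip condition, in metres."""
    if friction_coeff_n_s_per_m3 <= 0:
        raise ValueError("friction coefficient must be positive")
    return viscosity_pa_s / friction_coeff_n_s_per_m3


# Placeholder inputs (NOT from the paper):
eta = 8.9e-4   # Pa s, roughly the viscosity of liquid water near room temperature
lam = 1.0e4    # N s / m^3, illustrative low-friction interface

b = slip_length(eta, lam)
print(f"slip length = {b * 1e9:.1f} nm")
```

A lower friction coefficient gives a longer slip length, which is why low-friction surfaces such as graphene are described as high-slip systems.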

Keywords: molecular dynamics simulation, slip flow, graphene nanochannels, transient time correlation function, shear rate dependence, Navier friction coefficient, nonequilibrium molecular dynamics, experimentally accessible strain rates

318. ❌ olLOSC: Unified and efficient density functional approximation to correct delocalization error in molecules and periodic materials

Authors: Yichen Fan, Jacob Z. Williams, Weitao Yang Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21906v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

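Scoring rationale: This paper improves density functional theory (DFT) methods in computational chemistry, specifically by developing the olLOSC correction for delocalization error. It belongs to computational chemistry and materials science and is unrelated to all keywords on large language models (LLMs), deep learning principles, training methods, inference optimization, and agents. The only possible link is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper falls within computational science (theoretical chemistry), a potential application area for AI for Science, but it does not explicitly use AI or machine learning and instead refines a first-principles physical model, so it receives 5 points (some relevance). All other keywords score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the systematic delocalization error in density functional theory (DFT) and proposes olLOSC, a unified and efficient correction that fixes underestimated band gaps and total-energy errors in both molecules and periodic materials.

Abstract (Chinese translation)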

密度泛函理论（DFT）是中大规模计算分子与材料量子性质最具前景的方法。然而，常用的密度泛函近似（DFAs）存在系统性的离域误差，具体表现为带隙低估、电荷过度离域以及界面能级失准，这限制了其定量预测能力。为应对离域误差，学界已投入大量努力，例如发展多体微扰理论中的$GW$近似、针对特定体系调整DFA参数以及设计校正泛函等。然而，目前仍缺乏一种精确、高效且统一的方法，能够同时描述有限体系与材料的总能量、电荷密度和能带结构。基于线性响应局域轨道标度校正（lrLOSC），我们提出了olLOSC：一种通过无轨道电子线性响应计算曲率的局域轨道标度校正方法。olLOSC在精度上与lrLOSC相当，但计算效率显著提高。该方案在相同的无轨道近似框架内，校正了分子以及小带隙与中等带隙材料中的离域误差——尤其是带隙低估问题，同时也修正了总能量。重要的是，凭借这一统一近似，olLOSC为在分子、材料及界面体系中实现稳健且高效的DFT应用开辟了道路。

Abstract

Density functional theory (DFT) is the most promising method for calculating quantum properties of molecules and materials at moderate and large scales. However, commonly used density functional approximations (DFAs) have systematic delocalization error, as demonstrated by underestimated band gaps, over-delocalized charges, and energy level misalignment at interfaces, which limits its quantitative prediction. Extensive efforts, such as the $GW$ approximation to many-body perturbation theory, system-specific tuning of DFA parameters, and correction functionals have been developed to address delocalization error. However, an accurate, efficient, and unified solution to describe total energy, charge density and band structure for both finite systems and materials is still not available. Building on the linear-response localized orbital scaling correction (lrLOSC), we introduce olLOSC: a localized orbital scaling correction with curvature calculated by orbital-free electronic linear response. olLOSC has comparable accuracy to lrLOSC, but is much more computationally efficient. olLOSC corrects delocalization error - especially underestimated gaps, but also the total energy - both in molecules and in materials with small and moderate band gaps, within the same orbital-free approximation. Critically, with a unified approximation, olLOSC opens the path for robust and efficient DFT applications across molecules, materials, and interfaces.

Keywords: Density functional theory, Delocalization error, Band gap correction, Localized orbital scaling correction, Computational efficiency, Unified approximation, Materials science, Quantum chemistry

319. ❌ Emergent single-species non-reciprocity from bistable chemical dynamics

Authors: Jakob Metson, Ramin Golestanian Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21863v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

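Scoring rationale: This paper studies non-reciprocal interactions and emergent dynamics in chemically active colloidal systems, which belongs to soft matter and chemical physics. It is unrelated to nearly all keywords (large models, deep learning, AI principles). It has some connection only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because the work is scientific computing or theoretical science, but it uses no AI or machine learning method, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This paper studies a mechanism by which bistable chemical dynamics generates non-reciprocal interactions in a single-species suspension of chemically active colloids made of semi-permeable vesicles that encapsulate enzymes. The colloids can dynamically switch among behaviors such as attracting, repelling, and chasing one another, which enables controllable emergent dynamics.

Abstract (Chinese translation)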

在由可形成复合单元的组分构成的复杂系统中，涌现对称性的出现为我们设计和调控奇异相行为提供了机遇，例如可通过利用与之相关的动态对称破缺来实现。本文提出了一种新颖的非互易相互作用涌现机制，该机制存在于由半透性囊泡构成的化学活性胶体单组分悬浮液中，这些囊泡封装了可催化非线性化学反应的酶。双稳态化学动力学使得胶体反应腔能够根据其内部及周围化学浓度的选定值，充当化学物质的净生产者或消费者。由于胶体的内部化学状态取决于动态化学浓度而非材料参数，两个完全相同的胶体在通过扩散电泳响应相应浓度梯度时，可在同一系统内表现出不同的有效化学相互作用。此外，胶体能够在有效消费者与生产者之间自发且可逆地切换。因此，胶体能够以非互易的方式动态切换彼此间的忽视、吸引、排斥和追逐行为。通过调控参数以诱导化学动力学中的分岔，可利用这种灵活性实现对相互作用模式的稳健控制，并产生丰富的涌现动力学，例如自发的多体极性集群运动。

Abstract

The appearance of emergent symmetries in complex systems with components that can form composite units provides us with opportunities for design and control of exotic phase behaviour, for example by exploiting the dynamical symmetry breaking associated with them. We present a novel mechanism for the emergence of non-reciprocal interactions in a single-species suspension of chemically active colloids made out of semi-permeable vesicles, which encapsulate enzymes that catalyze a non-linear chemical reaction. Bistable chemical dynamics enables the colloidal reaction chamber to act as a net producer or consumer of a chemical, depending on the selected values of the chemical concentrations inside and around it. Since the internal chemical state of the colloid depends on the dynamic chemical concentrations rather than the material parameters, two identically produced colloids can present different effective chemical interactions within the same system upon responding to the corresponding gradients via diffusiophoresis. Furthermore, the colloids can spontaneously and reversibly switch between being effective consumers or producers. As a consequence, the colloids can dynamically switch between ignoring, attracting, repelling, and chasing each other, in a non-reciprocal manner. This flexibility can be exploited by manipulation of tuning parameters to induce bifurcations in the chemical dynamics, resulting in a robust control over the interaction motifs, and rich emergent dynamics such as spontaneous many-body polar swarming.

Keywords: non-reciprocal interactions, chemically active colloids, bistable chemical dynamics, diffusiophoresis, emergent dynamics, spontaneous switching, many-body polar swarming, semi-permeable vesicles

320. ❌ Deformed states in paraelectric and ferroelectric nematic liquid crystals

Authors: Oleg D. Lavrentovich Journal/Source: arxiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21338v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

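Scoring rationale: This paper is fundamental physics research on deformed states in nematic liquid crystals (specifically paraelectric and ferroelectric nematics) and belongs entirely to condensed matter physics and materials science. It deals with physical concepts such as molecular shape, chirality, polarity, spatial confinement, and elastic and electrostatic energy, and has no connection to any scoring keyword (all of which center on large models, deep learning, and AI techniques and applications). The paper mentions nothing about artificial intelligence, machine learning, or computational models, so the relevance for every keyword is 0.

!!! tip deepseek-chat TL;DR

This review examines how molecular shape, chirality, and spatial confinement induce equilibrium and polydomain states with parity breaking, splay, bend, and twist-bend deformations in paraelectric and ferroelectric nematic liquid crystals, and describes the splay cancellation effect.

Abstract (Chinese translation)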

具有取向有序性的材料（从固体铁磁体、铁电体到液晶）的基态通常包含由内在因素（如结构单元的形状）或受限几何结构引起的空间变化矢量型序参量。本综述通过实例阐述分子形状、手性和极性以及空间限制如何诱导顺电性与铁电性向列相液晶中产生具有宇称破缺、展曲、弯曲及扭曲-弯曲形变的平衡态和多畴态。宇称破缺的产生源于组成分子的手性（作为顺电向列相中能量代价高昂的展曲与弯曲的替代机制），或源于铁电向列相中对退极化场的响应。顺电与铁电向列相均表现出展曲抵消效应，即沿某一方向的展曲弹性能与静电能可通过沿正交方向的附加展曲得以降低。

Abstract

Ground states of materials with orientational order ranging from solid ferromagnets and ferroelectrics to liquid crystals often contain spatially varying vector-like order parameter caused by inner factors such as the shape of building units or by the geometry of confinement. This review presents examples of how the shapes, chirality, and polarity of molecules and spatial confinement induce deformed equilibrium and polydomain states with parity breaking, splay, bend, and twist-bend deformations of the order parameter in paraelectric and ferroelectric nematic liquid crystals. Parity breaking results either from chirality of the constituent molecules, as a replacement of energetically costly splay and bend in paraelectric nematics, or in response to depolarization field in the ferroelectric nematic. Both paraelectric and ferroelectric nematics exhibit a splay cancellation effect, in which the elastic and electrostatic energies of splay along one direction are reduced by an additional splay along orthogonal directions.

Keywords: nematic liquid crystals, paraelectric, ferroelectric, deformed states, parity breaking, splay bend twist, spatial confinement, splay cancellation

321. ❌ TERS-ABNet: A Deep Learning Approach for Automated Single-Molecule Structure Reconstruction with Atomic Precision from TERS Mapping

Authors: Jie Cui, Yao Zhang, Yang Zhang, Yi Luo, Zhen-Chao Dong Journal/Source: arxiv Published: 2026-03-23 arXiv link: http://arxiv.org/abs/2603.21579v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

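Scoring rationale: The paper develops TERS-ABNet, a deep-learning framework that automatically reconstructs the atomic structure of single molecules from tip-enhanced Raman spectroscopy (TERS) maps. It is essentially a computer-vision / graph-neural-network task applied to science (specifically chemical and materials characterization). Every keyword except the last concerns large language models (LLMs) and related techniques: training methods, inference optimization, alignment, agent systems, and so on. The paper neither mentions nor uses any LLM, foundation model, or related large-model technique; it uses a deep-learning architecture tailored to a specific scientific imaging task (a "two-track" CNN for image-to-graph inference). All of those keywords therefore receive a relevance of 0. The one exception is "AI for Science OR Bioinformatics OR Cheminformatics", which the paper matches as an application of AI to science (cheminformatics / nanoscale characterization).

!!! tip deepseek-chat TL;DR

The paper addresses the challenge of automatically and precisely determining the chemical structure of a single molecule from high-dimensional tip-enhanced Raman spectroscopy (TERS) images. It proposes TERS-ABNet, a deep-learning framework that reconstructs the complete atom-bond graph of a molecule directly from TERS maps, with about 94% atom-type classification accuracy and a mean coordinate error of about 0.23 Å.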

Abstract translation (Chinese)

从光谱数据中确定表面单分子的化学结构是一个具有挑战性的高维反演问题。针尖增强拉曼光谱（TERS）能够以亚纳米空间分辨率实现单分子的化学特异性成像，但由于振动信号存在模糊性且依赖专家解读，从TERS图谱中完整重建分子结构仍然困难。本文提出TERS-ABNet深度学习框架，将光谱图像中的单分子结构测定构建为图像到图结构的推理任务。该模型采用“双通道”架构，同步预测概率化的原子与化学键分布图，从而无需依赖预定义的化学规则即可直接构建显式的原子-化学键图。通过在模拟数据集上进行训练，TERS-ABNet实现了约94%的原子类型分类准确率（平均坐标误差约0.23 Å），能够可靠恢复分子连接性并从TERS图谱中完整重建单分子结构。该框架通过迁移学习可适应不同空间分辨率与结构复杂度，并成功从实验TERS数据中重建出单个卟啉分子的原子结构。本研究建立了一种从高维光谱成像数据推断显式原子-化学键图表示的通用深度学习策略，为纳米尺度表征中的自动化分子结构测定提供了新途径。

摘要 (Abstract)

Determining the chemical structure for a single molecule on surface from spectroscopic data represents a challenging high-dimensional inverse problem. Tip-enhanced Raman spectroscopy (TERS) enables chemically specific imaging of single molecules with sub-nanometer spatial resolution, yet reconstructing complete molecular structures from TERS maps remains difficult owing to the ambiguous vibrational signatures and reliance on expert interpretation. Here, we introduce TERS-ABNet, a deep-learning framework that formulates single-molecule structure determination from spectroscopic images as an image-to-graph inference task. Using a “two-track” architecture, the model jointly predicts probabilistic atom and bond maps, enabling direct construction of explicit atom-bond graphs without relying on predefined chemical rules. Trained on simulated datasets, TERS-ABNet achieves about 94% atom-type classification accuracy (with a mean coordinate error of about 0.23 Å), enabling to reliably recovering molecular connectivity and fully reconstruct single-molecule structure from its TERS maps. The framework generalizes across varying spatial resolutions and structural complexity through transfer learning, and successfully reconstructs the atomic structure of a single porphyrin molecule from experimental TERS data. This work establishes a general deep-learning strategy for inferring explicit atom-bond graph representations from high-dimensional spectroscopic imaging data, providing a new pathway towards automated molecular structure determination in nanoscale characterization.

Keywords: TERS-ABNet, deep learning, single-molecule structure determination, tip-enhanced Raman spectroscopy, image-to-graph inference, atom-bond graph, automated molecular reconstruction, nanoscale characterization

322. ❌ Machine-Learned Leftmost Hessian Eigenvectors for Robust Transition State Finding

Authors: Guanchen Wu, Chung-Yueh Yuan, Kareem Hegazy, Samuel M. Blau, Teresa Head-Gordon Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21323v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

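Scoring rationale: The paper studies machine learning applied to computational chemistry: it develops a machine-learning-driven transition-state optimizer that predicts the leftmost eigenvector of the Hessian to make reaction-coordinate calculations more efficient and robust. It is unrelated to the vast majority of the keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), which refer specifically to large language models and related techniques in natural language processing. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to scientific computing (computational chemistry) but is not core bioinformatics or cheminformatics, so it receives a relevance of 5 (some connection).

!!! tip deepseek-chat TL;DR

The paper proposes a machine-learning-driven optimizer that finds chemical reaction transition states efficiently and robustly by directly predicting the leftmost eigenvector of the Hessian, achieving second-order stability at only first-order computational cost.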

Abstract translation (Chinese)

过渡态（TS）的可靠确定得益于二阶信息所提供的稳健收敛与验证能力，但海森矩阵（Hessian）的计算成本阻碍了其在过渡态优化中的常规应用。本文提出一种机器学习驱动的过渡态优化器，可直接预测最左海森本征向量（LMHE）——该关键模态在局部近似包含过渡态的反应坐标。我们证明，该LMHE优化器能以与全海森优化器相同的成功率找到过渡态解，且对于质量较差的初始猜测几何结构具有更强的鲁棒性，从而消除了全海森方法典型的长耗时瓶颈，并相比标准拟牛顿法减少了总梯度计算次数。我们进一步通过不确定性量化来识别偶发的LMHE预测失败案例，从而提升准确性与鲁棒性；此时系统会在该优化步骤中回退至基于机器学习势能的全海森更新，避免了昂贵的主动学习过程。总体而言，我们的方法与半自动化工作流程以一阶计算成本实现了二阶稳定性，为高通量反应发现提供了一个高效引擎。

摘要 (Abstract)

The reliable determination of transition states (TSs) benefits from second-order information for robust convergence and validation, but the computational expense of Hessians prohibits their routine use in TS optimization. Here, we present a machine-learning-driven TS optimizer that directly predicts the leftmost Hessian eigenvector (LMHE), the critical mode that locally approximates the reaction coordinate encompassing the TS. We demonstrate that our LMHE optimizer recovers TS solutions at the same rate as full Hessian optimizers, and more robustly from degraded initial guess geometries, thereby eliminating the excessively long wall times characteristic of full-Hessian approaches and reducing total gradient evaluations compared to standard quasi-Newton methods. We further improve accuracy and robustness using uncertainty quantification for identifying occasional LMHE prediction failures, that then falls back to a full Hessian update from the machine learned potential at that optimization step, avoiding expensive active learning. Overall our methodology and semi-automated workflow delivers second-order stability at first-order computational expense to provide a highly efficient engine for high-throughput reaction discovery.

Keywords: transition state finding, Hessian eigenvector, machine learning optimizer, reaction coordinate, uncertainty quantification, computational chemistry, high-throughput reaction discovery
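The idea in the abstract, stepping uphill along the leftmost Hessian eigenvector and downhill elsewhere, with a fallback to the full Hessian when the prediction is uncertain, can be illustrated on a toy surface. This is a minimal sketch and not the authors' optimizer: the 2D surface, the `predict_lmhe` stand-in for a learned model, its made-up uncertainty, and the `max_uncertainty` threshold are all assumptions, and the update rule is a plain mode-following (gradient-reflection) step.

```python
import numpy as np

def gradient(x):
    """Gradient of the toy surface f(x, y) = x**4/4 - x**2/2 + y**2/2."""
    return np.array([x[0] ** 3 - x[0], x[1]])

def hessian(x):
    """Exact Hessian of the toy surface (the expensive object in real chemistry)."""
    return np.array([[3.0 * x[0] ** 2 - 1.0, 0.0], [0.0, 1.0]])

def leftmost_eigenvector(h):
    """Eigenvector of the smallest eigenvalue; eigh sorts eigenvalues ascending."""
    _, vecs = np.linalg.eigh(h)
    return vecs[:, 0]

def predict_lmhe(x):
    """Hypothetical stand-in for a learned model: returns (vector, uncertainty)."""
    # Assumption: a fixed guess along x and an invented uncertainty that grows with |y|.
    return np.array([1.0, 0.0]), abs(x[1])

def find_saddle(x0, step=0.1, max_uncertainty=0.3, tol=1e-8, max_iter=2000):
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = gradient(x)
        if np.linalg.norm(g) < tol:
            break
        v, sigma = predict_lmhe(x)
        if sigma > max_uncertainty:
            # Fallback: pay for the full Hessian only when the prediction is doubtful.
            v = leftmost_eigenvector(hessian(x))
        v = v / np.linalg.norm(v)
        # Reflect the gradient along v: ascend along the mode, descend elsewhere.
        x = x - step * (g - 2.0 * np.dot(g, v) * v)
    return x

saddle = find_saddle([0.5, 0.4])
```

Starting from `[0.5, 0.4]`, the first iterations take the fallback branch because the invented uncertainty exceeds the threshold; later iterations use the predicted vector only.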

323. ❌ Measurement Reduction in Orbital-Optimized Variational Quantum Eigensolver via Orbital Compression

Authors: Yanxian Tao, Lingyun Wan, Jie Liu Journal/Source: arXiv Published: 2026-03-22 arXiv link: http://arxiv.org/abs/2603.21109v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

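Scoring rationale: The paper focuses on optimizing the variational quantum eigensolver (VQE) algorithm in quantum computing, using orbital compression to reduce measurement cost and improve the accuracy of electronic structure calculations. It is unrelated to the vast majority of the keywords (large models, deep learning, training techniques, inference optimization, alignment, agents, and so on): those belong to AI / deep learning, whereas the paper sits at the intersection of quantum computing and quantum chemistry. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies quantum computing to chemical simulation (formaldehyde decomposition), which falls within scientific computing / AI for Science in a broad sense, but it uses no conventional AI or deep-learning method, so it receives a relevance of 5 (some connection).

!!! tip deepseek-chat TL;DR

The paper proposes an orbital-optimized variational quantum eigensolver framework based on orbital compression (FNO/SVO-OO-VQE) that improves the accuracy of electronic structure calculations and substantially reduces measurement cost while keeping the active space small, and validates it on simulations of molecular dissociation and formaldehyde decomposition.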

Abstract translation (Chinese)

变分量子本征求解器（VQE）已成为在近期噪声中等规模量子设备上求解电子结构问题的主要量子算法之一。然而，由于量子比特相干时间有限、量子门保真度不完美以及所需测量次数庞大，其在实际量子化学中的应用仍面临挑战，这些因素共同限制了当前电子结构模拟只能在相对较小的活性空间中进行。本研究提出了一种基于轨道压缩的轨道优化VQE框架，旨在保持较小活性空间的同时提高电子结构计算的精度。我们首先采用冻结自然轨道（Frozen Natural Orbitals, FNO）与分裂虚轨道（Split Virtual Orbitals, SVO）为VQE模拟构建紧凑的活性空间，形成了FNO/SVO-VQE方法。随后引入轨道优化以进一步恢复电子关联效应，从而发展出FNO/SVO-OO-VQE方法。我们将所提方法应用于分子解离势能面与甲醛分解反应活化能的模拟。数值结果表明，FNO-OO-VQE与SVO-OO-VQE在显著降低测量成本的同时，均提升了变分计算的精度。

摘要 (Abstract)

The variational quantum eigensolver (VQE) has emerged as one of the leading quantum algorithms for solving electronic structure problems on near-term noisy intermediate-scale quantum devices. However, its practical application to quantum chemistry remains challenging due to the limited coherence time, imperfect quantum gate fidelity, and the large number of measurements required, which together confine current electronic structure simulations to relatively small active spaces. In this work, we present an orbital-optimized VQE framework based on orbital compression, designed to improve the accuracy of electronic structure calculations while maintaining relatively small active spaces. Frozen natural orbitals (FNO) and split virtual orbitals (SVO) are first employed to construct compact active spaces for VQE simulations, leading to the FNO/SVO-VQE approach. Orbital optimization is then incorporated to further recover electron correlation effects, resulting in the FNO/SVO-OO-VQE methods. We apply the proposed method to simulate potential energy surfaces for molecular dissociation and the activation energy of formaldehyde decomposition. Numerical results demonstrate that both FNO-OO-VQE and SVO-OO-VQE improve the variational accuracy while substantially reducing measurement cost.

Keywords: variational quantum eigensolver, orbital compression, frozen natural orbitals, split virtual orbitals, measurement reduction, electronic structure, quantum chemistry, active space
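The frozen-natural-orbital step mentioned in the abstract is commonly described as diagonalizing a virtual-virtual density matrix and keeping only the orbitals with the largest occupations. The sketch below shows that truncation in isolation; it is not the paper's implementation. The function name `frozen_natural_orbitals`, the parameter `n_keep`, and the synthetic `toy_density` matrix (random, not derived from any molecule) are assumptions for illustration.

```python
import numpy as np

def frozen_natural_orbitals(virtual_density, n_keep):
    """Keep the n_keep natural orbitals with the largest occupation numbers.

    virtual_density: symmetric virtual-virtual block of a correlated
    one-particle density matrix (in practice often taken from MP2).
    Returns (occupations, rotation) for the retained orbitals.
    """
    occupations, rotation = np.linalg.eigh(virtual_density)
    order = np.argsort(occupations)[::-1]  # largest occupation first
    kept = order[:n_keep]
    return occupations[kept], rotation[:, kept]

# Synthetic, symmetric positive semidefinite stand-in for a density block.
rng = np.random.default_rng(0)
a = rng.normal(size=(6, 6))
toy_density = a @ a.T / 100.0

occ, orbitals = frozen_natural_orbitals(toy_density, n_keep=3)
```

The columns of `orbitals` span the compressed virtual space; the discarded columns correspond to the smallest occupations.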

Authors: Stephen Wiggins Journal/Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20803v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

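Scoring rationale: The paper studies information scrambling in quantum information and proposes a geometric diagnostic based on Bohmian trajectories; it belongs to quantum physics and quantum information theory. All scoring keywords concern large models, deep learning, and related techniques, none of which the paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a geometric diagnostic based on Bohmian trajectories for studying quantum information scrambling. By analyzing wavepacket dynamics in the inverted harmonic oscillator, it offers a new geometric perspective on scrambling-related sensitivity.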

Abstract translation (Chinese)

乱序时序关联子（Out-of-Time-Order Correlator，OTOC）是量子信息扰动的标准代数诊断工具，但其提供的直接几何直观较为有限。本文提出一种基于玻姆轨迹的框架，利用拉格朗日描述子（Lagrangian Descriptors，LDs）构建与扰动相关的敏感性的几何诊断方法。为规避不确定性原理对单一波函数内独立赋予初始位置与动量的限制，我们在由初始中心与动量冲量标记的局域高斯波包所构成的二维制备空间上评估玻姆动力学。对于倒谐振子，该构建可解析处理：波包中心动力学及其对制备参数的依赖关系可被显式表达。特别地，在远离平衡原点的区域，相关制备空间稳定性矩阵的指数增长导致波包中心拉格朗日描述子的敏感性存在$\mathcal{O}(e^{\omega T})$量级的界限，这促使我们将其与乱序时序关联子增长相关的敏感性结构进行半经典比较。在此意义上，拉格朗日描述子提供了与扰动相关的敏感性的几何指标。最后，我们讨论了这一制备空间图像如何为未来研究指明方向，特别是针对先前在倒谐振子中报道的不同微正则体系。

摘要 (Abstract)

The Out-of-Time-Order Correlator (OTOC) is a standard algebraic diagnostic of quantum information scrambling, but it offers limited direct geometric intuition. In this note, we propose a Bohmian, trajectory-based framework for constructing a geometric diagnostic of scrambling-related sensitivity using Lagrangian Descriptors (LDs). To avoid the uncertainty-principle obstruction to assigning independent initial position and momentum within a single wave function, we evaluate Bohmian dynamics over a two-dimensional preparation space of localized Gaussian wavepackets labeled by their initial center and momentum kick. For the inverted harmonic oscillator, this construction is analytically tractable: the wavepacket-center dynamics and their dependence on preparation parameters can be written explicitly. In particular, away from the equilibrium origin, the exponential growth of the associated preparation-space stability matrix yields an $\mathcal{O}(e^{\omega T})$ bound on the sensitivity of the wavepacket-center LDs, motivating a semiclassical comparison with sensitivity structures associated with OTOC growth. In this sense, the LD provides a geometric indicator of scrambling-related sensitivity. We conclude by discussing how this preparation-space picture suggests a program for future work regarding the distinct microcanonical regimes previously reported for the inverted harmonic oscillator.

Keywords: Out-of-Time-Order Correlator, quantum information scrambling, Bohmian dynamics, Lagrangian Descriptors, inverted harmonic oscillator, wavepacket-center dynamics, preparation-space stability matrix, semiclassical comparison
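The $e^{\omega T}$ growth quoted in the abstract can be made concrete with a short worked calculation. This is a sketch under stated assumptions, not the paper's derivation: it takes the standard inverted-oscillator Hamiltonian $H = p^2/(2m) - \tfrac{1}{2} m \omega^2 x^2$ and assumes the packet center $(x_0, p_0)$ follows the classical equations of motion, as expectation values do in a quadratic potential. The symbols $m$, $x_0$, $p_0$ and $M(T)$ are notation introduced here.

$$
x(T) = x_0 \cosh(\omega T) + \frac{p_0}{m\omega} \sinh(\omega T), \qquad
p(T) = p_0 \cosh(\omega T) + m\omega\, x_0 \sinh(\omega T)
$$

$$
M(T) = \frac{\partial\,(x(T),\, p(T))}{\partial\,(x_0,\, p_0)} =
\begin{pmatrix}
\cosh(\omega T) & \dfrac{\sinh(\omega T)}{m\omega} \\
m\omega \sinh(\omega T) & \cosh(\omega T)
\end{pmatrix},
\qquad \det M(T) = 1, \quad \operatorname{tr} M(T) = 2\cosh(\omega T)
$$

The eigenvalues of $M(T)$ are therefore $e^{\pm \omega T}$, so sensitivity to the preparation parameters grows at most like $e^{\omega T}$, consistent with the order of the bound stated in the abstract.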

325. ❌ Efficient Coupled-Cluster Python Frameworks for Next-Generation GPUs: A Comparative Study of CuPy and PyTorch on the Hopper and Grace Hopper Architecture

Authors: Antonina Dobrowolska, Julian Świerczyński, Paweł Tecmer, Emil Sujkowski, Somayeh Ahmadkhani, Grzegorz Mazur, Klemens Noga, Jeff Hammond, Katharina Boguslawski Journal/Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20912v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

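Scoring rationale: The paper focuses on a high-performance GPU implementation of the coupled-cluster singles and doubles (CCSD) method from quantum chemistry, optimized and benchmarked with the CuPy and PyTorch libraries. It is unrelated to the vast majority of the keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), which target large-model research in natural language processing or general AI. The only slightly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper belongs to computational chemistry and can be regarded as an application of AI in scientific computing, but its core is high-performance computing and algorithm optimization rather than AI models themselves, so it receives a relevance of 5 (some connection).

!!! tip deepseek-chat TL;DR

The study develops new batching algorithms and a generic tensor contraction protocol for coupled-cluster singles and doubles (CCSD) calculations, implemented with the CuPy and PyTorch libraries on the NVIDIA Hopper and Grace Hopper GPU architectures, and reports a 10-fold speedup over the authors' initial GPU implementation.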

Abstract translation (Chinese)

本研究提出了新的批处理算法，以有效处理在图形处理器（GPU）的视频随机存取存储器（VRAM）上使用Python实现耦合簇单双激发（CCSD）方法时遇到的大规模张量收缩问题，从而提升计算性能。具体而言，我们在单个英伟达Hopper（H100）及Grace Hopper（GH200）架构上，对CuPy与PyTorch库的性能进行了基准测试。我们首先通过一种非对称动态分割方案，优化了CCSD中粒子-粒子阶梯项这一瓶颈收缩过程，进而发展出一种通用的张量收缩方案，使得张量收缩计算几乎完全在GPU上执行。我们使用CuPy和PyTorch库，针对不同分子体系和基组大小，对我们全新的、完全通用的GPU加速耦合簇实现进行了性能基准测试。在H100上，PyTorch的表现比CuPy快约20%，而在GH200架构上两者性能相近。与我们最初的GPU实现[J. Chem. Theory Comput. 2024, 20, 3, 1130–1142]相比，我们实现了10倍的加速。在分子CCSD计算中，与我们原始的GPU-CPU混合实现相比，使用Cholesky分解的电子排斥积分进行单次CCSD迭代，我们获得了3至16倍的额外加速。

摘要 (Abstract)

In this work, we introduce new batching algorithms to effectively handle large contractions encountered in coupled-cluster singles and doubles (CCSD) implementations in Python on the Video Random Access Memory (VRAM) of graphical processing units (GPUs), thereby improving performance. Specifically, we benchmark the performance of the CuPy and PyTorch libraries on a single NVIDIA Hopper (H100) and the Grace Hopper (GH200) architectures. We begin by optimizing the particle-particle ladder bottleneck contraction in CCSD using an asymmetric and dynamic splitting recipe, and then move toward a generic tensor contraction protocol that enables tensor contractions to be performed almost exclusively on GPUs. We benchmark our new, fully generic GPU-accelerated coupled-cluster implementations for various molecular systems and basis-set sizes, using both the CuPy and PyTorch libraries. While PyTorch outperforms CuPy on H100 by approximately 20%, both perform similarly on the GH200 architecture. Compared to our initial GPU implementation [J. Chem. Theory Comput. 2024, 20, 3, 1130–1142], we achieve a 10-fold speedup. In molecular CCSD calculations, we report additional speedups between 3 and 16 for a single CCSD iteration using Cholesky-decomposed electron repulsion integrals compared to our original GPU-CPU hybrid implementation.

Keywords: coupled-cluster singles and doubles (CCSD), GPU acceleration, CuPy, PyTorch, tensor contractions, NVIDIA Hopper, Grace Hopper, performance benchmarking
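The batching idea in the abstract, splitting a large contraction so that each piece fits in limited device memory, can be sketched on the particle-particle ladder pattern. This is an illustration only: it uses NumPy on the CPU as a stand-in for the GPU libraries, splits uniformly along one virtual index rather than using the paper's asymmetric and dynamic splitting recipe, and omits prefactors. The function names, the `batch` parameter, and the random tensors are assumptions.

```python
import numpy as np

def ladder_full(v, t):
    """Reference contraction: r[i,j,a,b] = sum over c,d of v[a,b,c,d] * t[i,j,c,d]."""
    return np.einsum("abcd,ijcd->ijab", v, t)

def ladder_batched(v, t, batch):
    """Same contraction, computed one slab of the first virtual index at a time.

    Only v[start:stop] has to be resident for each step, which bounds the
    peak memory of the largest operand by batch * n_virt**3 elements.
    """
    n_occ, n_virt = t.shape[0], v.shape[0]
    out = np.empty((n_occ, n_occ, n_virt, n_virt), dtype=np.result_type(v, t))
    for start in range(0, n_virt, batch):
        stop = min(start + batch, n_virt)
        out[:, :, start:stop, :] = np.einsum("abcd,ijcd->ijab", v[start:stop], t)
    return out

rng = np.random.default_rng(1)
n_occ, n_virt = 3, 7
v = rng.normal(size=(n_virt, n_virt, n_virt, n_virt))  # synthetic integrals
t = rng.normal(size=(n_occ, n_occ, n_virt, n_virt))    # synthetic amplitudes

result = ladder_batched(v, t, batch=3)
```

A batch size that does not divide the index range (3 into 7 here) exercises the final partial slab.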

326. ❌ Resolving Discrepancies in Disjoining Pressure Predictions for Liquid Nanofilms from Molecular Simulations

Authors: Yafan Yang, Zufeng Zuo, Jingyu Wan, Denvid Lau Journal/Source: arXiv Published: 2026-03-21 arXiv link: http://arxiv.org/abs/2603.20720v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

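Scoring rationale: The paper studies discrepancies in disjoining pressure predictions for liquid nanofilms from molecular simulations. It belongs to computational physics / materials science and involves no large models, deep learning, AI techniques, or related methods. All keywords concern the principles and applications of large-model technology or AI for science, and none has any connection with the paper's molecular-simulation physics.

!!! tip deepseek-chat TL;DR

By analyzing the neglect of long-range dispersion interactions and the inconsistent definitions of film thickness, the paper resolves discrepancies in disjoining pressure predictions for liquid nanofilms from molecular simulations, and proposes a revised Peng method that yields more accurate Hamaker constants.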

Abstract translation (Chinese)

不同分子模拟方法获得的液体纳米膜分离压文献值存在显著差异。研究表明，这些差异源于原始彭方法中忽略了长程色散相互作用以及膜厚定义的不一致。关键发现在于：长程色散作用以厚度依赖的方式影响表面张力——在大厚度下增强表面张力，而在小厚度下由于分离压引起的法向压缩和横向膨胀效应，会抑制这种增强。这导致水纳米膜表面张力呈现交叉转变行为。由于分离压是通过表面张力对厚度的微分推导得出，这种非平庸的厚度依赖性会严重影响其计算精度。通过对色散相互作用的正确处理和统一的厚度定义，修正后的彭方法与巴特方法结果一致，并能获得更精确的哈梅克常数。

摘要 (Abstract)

Literature values of disjoining pressure in liquid nanofilms from different molecular simulation methods show significant discrepancies. We demonstrate that these arise from neglecting long-range dispersion interactions and inconsistent definitions of film thickness in the original Peng method. A key insight is that long-range dispersion affects surface tension in a thickness-dependent manner, increasing it at large thickness but suppressing its enhancement at small thickness due to disjoining-pressure-induced normal compression and lateral expansion. This leads to crossover behavior in the surface tension of water nanofilms. Since disjoining pressure is obtained from the derivative of surface tension with respect to thickness, this nontrivial dependence strongly impacts its accuracy. With proper treatment of dispersion interactions and a consistent thickness definition, the revised Peng method agrees with the Bhatt method and yields more accurate Hamaker constants.

Keywords: disjoining pressure, liquid nanofilms, molecular simulations, long-range dispersion interactions, film thickness, surface tension, Peng method, Hamaker constants

Token usage statistics

Total: 972,269 tokens (input 619,497 / output 352,772)

Model	Input	Output	Total
deepseek-chat	585,131	324,060	909,191
glm-4.7	34,366	28,712	63,078
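As a quick consistency check, the per-model rows can be summed and compared with the reported totals. The figures below are copied from the table above; the variable names are illustrative.

```python
# (input tokens, output tokens) per model, copied from the table above.
usage = {
    "deepseek-chat": (585_131, 324_060),
    "glm-4.7": (34_366, 28_712),
}

total_in = sum(tokens_in for tokens_in, _ in usage.values())
total_out = sum(tokens_out for _, tokens_out in usage.values())
per_model = {model: tokens_in + tokens_out for model, (tokens_in, tokens_out) in usage.items()}
grand_total = total_in + total_out
```

The sums agree with the reported 619,497 input, 352,772 output, and 972,269 total tokens.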

📋 All papers

1. ✅ Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment

2. ✅ Mind over Space: Can Multimodal Large Language Models Mentally Navigate?

3. ✅ Probing How Scalable Table Data Enhances General Long-Context Reasoning

4. ✅ Improving Coherence and Persistence in Agentic AI for System Optimization

5. ✅ EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises

6. ✅ Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs

7. ✅ User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction

8. ✅ SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

9. ✅ Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion

10. ✅ Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles

11. ❌ Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

12. ❌ SynSym: A Synthetic Data Generation Framework for Psychiatric Symptom Identification

13. ❌ Graph Fusion Across Languages using Large Language Models

14. ❌ AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference

15. ❌ mSFT: Addressing Dataset Mixtures Overfiting Heterogeneously in Multi-task SFT

16. ❌ Conspiracy Frame: a Semiotically-Driven Approach for Conspiracy Theories Detection

17. ❌ TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols

18. ❌ Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models

19. ❌ Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

20. ❌ A Latent Representation Learning Framework for Hyperspectral Image Emulation in Remote Sensing

21. ❌ UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

22. ❌ End-to-End Training for Unified Tokenization and Latent Denoising

23. ❌ WorldCache: Content-Aware Caching for Accelerated Video World Models

24. ❌ 3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing

25. ❌ ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

26. ❌ TiCo: Time-Controllable Training for Spoken Dialogue Models

27. ❌ One Model, Two Markets: Bid-Aware Generative Recommendation

28. ❌ Confidence-Based Decoding is Provably Efficient for Diffusion Language Models

29. ❌ Dyadic: A Scalable Platform for Human-Human and Human-AI Conversation Research

30. ❌ Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

31. ❌ SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

32. ❌ CayleyPy-4: AI-Holography. Towards analogs of holographic string dualities for AI tasks

33. ❌ Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

34. ❌ Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation

35. ❌ Calibeating Made Simple

36. ❌ MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management

37. ❌ Multimodal Survival Analysis with Locally Deployable Large Language Models

38. ❌ Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation

39. ❌ More Isn’t Always Better: Balancing Decision Accuracy and Conformity Pressures in Multi-AI Advice

40. ❌ Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding

41. ❌ On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

42. ❌ GSEM: Graph-based Self-Evolving Memory for Experience Augmented Clinical Reasoning

43. ❌ SpecTM: Spectral Targeted Masking for Trustworthy Foundation Models

44. ❌ On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration

45. ❌ A Context Engineering Framework for Improving Enterprise AI Agents based on Digital-Twin MDP

46. ❌ Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

47. ❌ Future-Interactions-Aware Trajectory Prediction via Braid Theory

48. ❌ ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention

49. ❌ SegMaFormer: A Hybrid State-Space and Transformer Model for Efficient Segmentation

50. ❌ TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning

51. ❌ λ-GELU: Learning Gating Hardness for Controlled ReLU-ization in Deep Networks

52. ❌ LRC-WeatherNet: LiDAR, RADAR, and Camera Fusion Network for Real-time Weather-type Classification in Autonomous Driving

53. ❌ SecureBreak – A dataset towards safe and secure models