📊 ArXiv Research Report (2026-03-31)

Generated: 2026-03-31 09:09:52
Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
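
The passing threshold above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and treating a score equal to the threshold as passing is an assumption, since the report does not say whether the comparison is strict.

```python
def passing_score(total_weight: float, base: float = 5.0, per_weight: float = 0.8) -> float:
    """Passing threshold from the report's formula: base + per_weight * total_weight."""
    return base + per_weight * total_weight


def is_passing(paper_score: float, total_weight: float) -> bool:
    """Assumption: a paper passes when its score is at least the threshold."""
    return paper_score >= passing_score(total_weight)


threshold = passing_score(27.0)   # 27 keywords, each with weight 1.0
print(round(threshold, 1))        # 26.6
print(is_passing(73.0, 27.0))     # True
```

With 27 keywords of weight 1.0 the threshold is 5.0 + 0.8 × 27.0 = 26.6, matching the value reported above.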

📈 Paper statistics

Total fetched: 237 papers
Passing papers: 8 (3.4%)

⭐ Detailed analysis of passing papers

1. Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification

Authors: Shuai Lv, Chang Liu, Feng Tang, Yujie Yuan, Aojun Zhou, Kui Zhang, Xi Yang, Yangqiu Song
Source: arXiv
Published: 2026-03-27
arXiv link: http://arxiv.org/abs/2603.26348v1

Score: 73.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	15.0/10	15.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	15.0/10	15.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

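Scoring rationale: The paper proposes the Visual Re-Examination (VRE) framework to address the tendency of multimodal large language models (MLLMs) to drift away from image evidence and hallucinate during long-form generation. Its core contribution is self-improvement through self-reflection and visual re-examination, which places it in current research on LLM reasoning and hallucination mitigation. Highly relevant keywords: Self-Correction/Self-Improvement/Self-Reflection (15 points, the core mechanism), Hallucination Mitigation (15 points, the main objective), Chain of Thought and System 2 Thinking (10 points each, multi-step and in-depth reasoning), Large Language Models (10 points, built on MLLMs), and Post-training/SFT (8 points, a training framework). Mechanistic Interpretability (5 points, attention analysis) is partially relevant; the remaining keywords are unrelated to the paper.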

!!! tip deepseek-chat TL;DR

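To address the tendency of multimodal large language models to drift away from image evidence and hallucinate during long-form generation, the paper proposes Visual Re-Examination (VRE), a self-evolving training framework. Through visual introspection and information-gain-driven verification, it improves reasoning accuracy and perceptual reliability while substantially reducing hallucinations.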

Abstract

Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance, yet we identify a recurring failure mode in long-form generation: as outputs grow longer, models progressively drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. Interestingly, based on attention analysis, we find that MLLMs have a latent capability for late-stage visual verification that is present but not consistently activated. Motivated by this observation, we propose Visual Re-Examination (VRE), a self-evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE promotes iterative self-improvement by leveraging the model itself to generate reflection traces, making visual information actionable through information gain. Extensive experiments across diverse multimodal benchmarks demonstrate that VRE consistently improves reasoning accuracy and perceptual reliability, while substantially reducing hallucinations, especially in long-chain settings. Code is available at https://github.com/Xiaobu-USTC/VRE.

Keywords: Multimodal Large Language Models, Visual Re-Examination, Self-improvement, Hallucination mitigation, Multimodal reasoning, Long-form generation, Visual verification, Information gain

2. Learning to Commit: Generating Organic Pull Requests via Online Repository Memory

Authors: Mo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu
Source: arXiv
Published: 2026-03-27
arXiv link: http://arxiv.org/abs/2603.26664v1

Score: 51.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

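Scoring rationale: The paper studies why pull requests generated by LLM-based coding agents lack organicity, and proposes the Learning to Commit framework with Online Repository Memory. LLM Agents is highly relevant (10 points) because the paper studies LLM agents on code-generation tasks. Large Language Models is highly relevant (10 points) because an LLM serves as the base model. Self-Correction/Self-Improvement is relevant (8 points) because the agent learns by reflecting on the gap between its own attempts and historical commits. Tool Use/Function Calling is relevant (8 points) because the work involves code generation and API usage. Pre-training/Domain Adaptation and Post-training/SFT are partially relevant (5 points each) through domain adaptation and supervised learning from historical data, and In-context Learning is partially relevant (5 points) through skill learning from historical context. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Quantization are unrelated to the paper (0 points).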

!!! tip deepseek-chat TL;DR

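Pull requests generated by LLM-based coding agents lack organicity: they ignore project-specific conventions, duplicate functionality already provided by internal APIs, and violate architectural constraints. The paper proposes the Learning to Commit framework and Online Repository Memory, which learn project-specific skill patterns from historical commits and improve the organicity scores of generated pull requests.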

Abstract

Large language model (LLM)-based coding agents achieve impressive results on controlled benchmarks yet routinely produce pull requests that real maintainers reject. The root cause is not functional incorrectness but a lack of organicity: generated code ignores project-specific conventions, duplicates functionality already provided by internal APIs, and violates implicit architectural constraints accumulated over years of development. Simply exposing an agent to the latest repository snapshot is not enough: the snapshot reveals the final state of the codebase, but not the repository-specific change patterns by which that state was reached. We introduce Learning to Commit, a framework that closes this gap through Online Repository Memory. Given a repository with a strict chronological split, the agent performs supervised contrastive reflection on earlier commits: it blindly attempts to resolve each historical issue, compares its prediction against the oracle diff, and distils the gap into a continuously growing set of skills: reusable patterns capturing coding style, internal API usage, and architectural invariants. When a new PR description arrives, the agent conditions its generation on these accumulated skills, producing changes grounded in the project’s own evolution rather than generic pretraining priors. Evaluation is conducted on genuinely future, merged pull requests that could not have been seen during the skill-building phase, and spans multiple dimensions including functional correctness, code-style consistency, internal API reuse rate, and modified-region plausibility. Experiments on an expert-maintained repository with rich commit history show that Online Repository Memory effectively improves organicity scores on held-out future tasks.
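
The "strict chronological split" mentioned in the abstract can be illustrated with a short sketch. The abstract does not describe how the split is implemented, so the data layout, the function name `chronological_split`, and the cutoff semantics below are assumptions made for illustration only.

```python
from datetime import datetime


def chronological_split(commits, cutoff):
    """Split (timestamp, commit_id) pairs into history and held-out future sets.

    Assumption: everything strictly before `cutoff` may be used for skill
    building; everything at or after it is reserved for evaluation.
    """
    ordered = sorted(commits, key=lambda c: c[0])
    history = [c for c in ordered if c[0] < cutoff]
    future = [c for c in ordered if c[0] >= cutoff]
    return history, future


# Hypothetical commit log, deliberately out of order.
commits = [
    (datetime(2025, 6, 1), "c3"),
    (datetime(2025, 1, 15), "c1"),
    (datetime(2025, 3, 10), "c2"),
    (datetime(2025, 9, 20), "c4"),
]
history, future = chronological_split(commits, datetime(2025, 6, 1))
print([cid for _, cid in history])  # ['c1', 'c2']
print([cid for _, cid in future])   # ['c3', 'c4']
```

Splitting by time rather than at random keeps any commit used for reflection from postdating a task used for evaluation, which is the property the abstract relies on when it calls the test pull requests "genuinely future".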

Keywords: Large Language Models, LLM-based coding agents, Online Repository Memory, organicity, pull requests, supervised contrastive reflection, skill distillation, repository-specific patterns

3. Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

Authors: Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng, Yudong Zhang, Mulei Zhang, Yang Chen, Shuyue Hu, Zhenfei Yin, Chen Zhang, Lei Bai
Source: arXiv
Published: 2026-03-27
arXiv link: http://arxiv.org/abs/2603.26535v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

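Scoring rationale: The paper proposes PAPO, a training-optimization method for large language models (LLMs) and a contribution to core LLM techniques. It centers on reinforcement-learning alignment (RLHF/DPO) and fine-tuning (SFT): it modifies the GRPO algorithm to incorporate process-level evaluation, directly targeting reward design for reasoning quality (Chain of Thought/System 2 Thinking). It is therefore highly relevant (10 points each) to LLMs, Post-training/SFT, RLHF/DPO, Chain of Thought, and System 2 Thinking. Other keywords such as MoE, Quantization, and RAG are not covered by the paper and score 0.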

!!! tip deepseek-chat TL;DR

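To address limitations of reward design in LLM training, the paper proposes PAPO, which integrates process-level evaluation into GRPO through decoupled advantage normalization, so that reasoning quality can be distinguished while answer correctness is preserved. Experiments show that it outperforms outcome reward models (ORM) across multiple benchmarks.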

Abstract

We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component A_out, derived from ORM and normalized over all responses, and a process component A_proc, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that A_out anchors training on correctness while A_proc differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.
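
The decoupled normalization described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two components are combined or what incorrect responses receive for the process component, so the plain sum and the zero process advantage for incorrect responses are assumptions, as are all function and variable names.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Group-relative normalization: subtract the group mean, divide by the std."""
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def decoupled_advantages(outcome_rewards, process_scores):
    """Combine an outcome advantage and a process advantage for one group.

    outcome_rewards: 1.0 for a correct response, 0.0 otherwise.
    process_scores:  rubric-based process scores, one per response.
    """
    # Outcome component: normalized over all responses in the group.
    a_out = normalize(outcome_rewards)

    # Process component: normalized only among the correct responses.
    correct = [i for i, r in enumerate(outcome_rewards) if r > 0]
    a_proc = [0.0] * len(outcome_rewards)  # assumption: 0 for incorrect responses
    for i, a in zip(correct, normalize([process_scores[i] for i in correct])):
        a_proc[i] = a

    # Assumption: the two components are simply added.
    return [o + p for o, p in zip(a_out, a_proc)]


# A uniformly correct group: the outcome component vanishes,
# but the process component still ranks the responses.
adv = decoupled_advantages([1.0, 1.0, 1.0, 1.0], [0.2, 0.4, 0.6, 0.8])
print(adv[0] < adv[1] < adv[2] < adv[3])  # True
```

In the uniformly correct group, every outcome advantage is zero, which is the vanishing-signal problem the abstract attributes to ORM. Because the process scores are normalized only among correct responses, an incorrect response cannot gain advantage from a high process score in this sketch.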

Keywords: Process-Aware Policy Optimization, PAPO, Group Relative Policy Optimization, GRPO, decoupled advantage normalization, process reward models, reasoning quality, outcome reward models

4. SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatol

Authors: Zhangtianyi Chen, Yuhao Shen, Florensia Widjaja, Yan Xu, Liyuan Sun, Zijian Wang, Hongyi Chen, Wufei Dai, Juexiao Zhou
Source: arXiv
Published: 2026-03-27
arXiv link: http://arxiv.org/abs/2603.26122v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

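Scoring rationale: SkinGPT-X is a multimodal collaborative multi-agent system for dermatological diagnosis. LLMs and Multi-agent Systems are highly relevant (10 points each) because the system is built on LLMs and uses a multi-agent architecture. AI for Science is relevant (10 points) because the work applies AI to dermatology, a scientific field. Explainable AI is partially relevant (5 points) because the system emphasizes transparent and explainable diagnosis. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract or are not central to it, and score 0.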

!!! tip deepseek-chat TL;DR

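The paper presents SkinGPT-X, a multimodal collaborative multi-agent system with a self-evolving dermatological memory mechanism. It targets the lack of interpretability and the weak handling of rare diseases in existing LLM-based dermatological diagnosis, and reports performance gains on multiple datasets.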

Abstract

While recent advancements in Large Language Models have significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate the rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin diseases which contains 564 clinical samples with eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, a +10% Cohen’s Kappa improvement.
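
The abstract reports gains in weighted F1 and Cohen's Kappa. For readers unfamiliar with these metrics, the sketch below computes both from label lists using their standard definitions. The paper's exact evaluation protocol is not given in the abstract; the function names and the toy labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), agreement corrected for chance."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when chance agreement is already perfect
    return (p_o - p_e) / (1 - p_e)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    n = len(y_true)
    total = 0.0
    for cls, support in Counter(y_true).items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = support - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += (support / n) * f1
    return total


# Toy labels, not data from the paper.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(cohens_kappa(y_true, y_pred))          # 0.5
print(round(weighted_f1(y_true, y_pred), 3))  # 0.733
```

Weighting F1 by support keeps frequent classes from being drowned out by rare ones, while kappa discounts the agreement a classifier would reach by chance. With 498 categories and a rare-disease subset, plain accuracy alone would say little about either effect.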

Keywords: Large Language Models, Multi-agent Systems, Dermatological Diagnosis, Self-evolving Memory, Transparent Diagnostics, Rare Skin Diseases, AI for Science, Collaborative Agents

5. ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-r

Authors: Raia Abu Ahmad, Max Upravitelev, Aida Usmanova, Veronika Solopova, Georg Rehm
Source: arXiv
Published: 2026-03-27
arXiv link: http://arxiv.org/abs/2603.26449v1

Score: 44.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	8.0/10	8.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

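Scoring rationale: The paper studies the use of large language models (LLMs) for scientific fact-checking of climate-related claims and for disinformation classification, an application of LLMs to a scientific domain. The abstract states that participating systems combined dense retrieval pipelines with large language models using structured hierarchical reasoning, so "Large Language Models" (8 points), "Retrieval-Augmented Generation" (8 points), and "Chain of Thought" (8 points) are highly relevant. The core task is fact-checking and disinformation detection, which relates directly to "Hallucination Mitigation" (10 points). The subject is climate science, which falls under "AI for Science" (10 points). Other keywords, including MoE, SLMs, training methods, inference optimization, and agent systems, are not mentioned in the paper and score 0.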

!!! tip deepseek-chat TL;DR

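The paper presents the ClimateCheck 2026 shared task, in which systems combine large language models with retrieval and structured reasoning to fact-check climate-related claims against scientific literature and to classify disinformation narratives. It also exposes systematic biases in conventional evaluation metrics and shows that climate disinformation varies in how verifiable it is.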

Abstract

Automatically verifying climate-related claims against scientific literature is a challenging task, complicated by the specialised nature of scholarly evidence and the diversity of rhetorical strategies underlying climate disinformation. ClimateCheck 2026 is the second iteration of a shared task addressing this challenge, expanding on the 2025 edition with tripled training data and a new disinformation narrative classification task. Running from January to February 2026 on the CodaBench platform, the competition attracted 20 registered participants and 8 leaderboard submissions, with systems combining dense retrieval pipelines, cross-encoder ensembles, and large language models with structured hierarchical reasoning. In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems. A cross-task analysis further reveals that not all climate disinformation is equally verifiable, potentially implicating how future fact-checking systems should be designed.
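
Recall@K, one of the standard metrics named in the abstract, can be computed as below. This is a minimal sketch using the usual definition; the document identifiers are made up, and the shared task's official scorer may differ in details such as tie handling.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the annotated relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumption: a claim with no annotated evidence scores 0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical ranking for one claim; "d7" was never judged by annotators.
ranking = ["d3", "d1", "d7", "d2"]
annotated_relevant = {"d1", "d2"}
print(recall_at_k(ranking, annotated_relevant, 2))  # 0.5
print(recall_at_k(ranking, annotated_relevant, 4))  # 1.0
```

Any retrieved document that annotators never judged, such as `d7` here, is treated exactly like a non-relevant one. That is one way incomplete annotations can distort how such a metric ranks systems, the concern the abstract raises about conventional metrics.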

Keywords: climate fact-checking, disinformation classification, large language models, retrieval-augmented generation, structured reasoning, evaluation metrics, scientific literature, climate claims

6. MA-Bench: Towards Fine-grained Micro-Action Understanding

Authors: Kun Li, Jihao Gu, Fei Wang, Zhiliang Wu, Hehe Fan, Dan Guo
Source: arXiv
Published: 2026-03-27
arXiv link: http://arxiv.org/abs/2603.26586v1

Score: 34.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

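Scoring rationale: The paper focuses on multimodal large language models (MLLMs) applied to micro-action understanding, that is, LLM research in a specific domain (human behavior analysis). The relevance rests on three points: (1) it directly involves large language models (LLMs) and their multimodal extension (MLLMs), so "Large Language Models" scores 10; (2) it improves model performance through fine-tuning, which relates to "Post-training" and scores 8; (3) the evaluation architecture includes interpretive reasoning, which involves multi-step and in-depth reasoning, so "Chain of Thought" and "System 2 Thinking" each score 8. Other keywords such as MoE, Quantization, and RAG are not covered by the paper and score 0.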

!!! tip deepseek-chat TL;DR

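To address the lack of a dedicated benchmark for micro-action understanding with multimodal large language models, the paper introduces the MA-Bench benchmark and a training dataset, and shows that fine-tuning clearly improves performance on micro-action reasoning and explanation tasks.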

Abstract

With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal that there are significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundation benchmark for advancing MLLMs in understanding subtle micro-action and human-related behaviors. Project Page: https://MA-Bench.github.io

Keywords: Multimodal Large Language Models, Micro-Action Understanding, Benchmark, Fine-tuning, Interpretive Reasoning, Human Emotion Analysis, Video Understanding, Qwen3-VL-8B

7. From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasonin

Authors: Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang
Source: arXiv
Published: 2026-03-27
arXiv link: http://arxiv.org/abs/2603.26323v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

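Scoring rationale: The paper's core subject is the internal representation mechanisms of large language models (LLMs) on spatial reasoning tasks, so it is highly relevant to "Large Language Models" and "Mechanistic Interpretability" (10 points each). It also decomposes and evaluates spatial reasoning, which relates somewhat to "Chain of Thought" and "System 2 Thinking" (5 points each), though these are not its central focus. The remaining keywords, such as MoE, SFT, RAG, and quantization, are neither mentioned in nor relevant to the paper and therefore score 0.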
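The total for this paper can be reproduced from its score table. The following is a minimal sketch, assuming the total is the sum of weight × relevance over all keywords and that the pass mark follows the report's configured formula (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each). The function and variable names are illustrative and are not the report generator's actual code.

```python
# Hypothetical re-computation of the score for paper 7 from its table.
# Only the non-zero rows are listed; every omitted keyword scored 0.0.
rows = [
    ("Large Language Models", 1.0, 10.0),
    ("Chain of Thought", 1.0, 5.0),
    ("System 2 Thinking", 1.0, 5.0),
    ("Mechanistic Interpretability", 1.0, 10.0),
]


def total_score(rows):
    """Sum weight * relevance over all keyword rows (assumed scoring rule)."""
    return sum(weight * relevance for _, weight, relevance in rows)


def passing_score(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report's configuration: base + factor * total weight."""
    return base + factor * total_weight


score = total_score(rows)
threshold = passing_score(total_weight=27.0)
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```

Under these assumptions the sketch reproduces the 30.0 / 26.6 shown in the score line.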

!!! tip deepseek-chat TL;DR

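This study decomposes spatial reasoning into computational primitives and analyzes the internal representations of multilingual LLMs, finding that current LLMs have only limited, context-dependent spatial representations rather than robust, general-purpose spatial reasoning.

Abstract translation (Chinese)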

随着空间智能日益成为基础模型的关键能力，大型语言模型（LLMs）在空间推理基准测试中的表现究竟反映了其内部结构化的空间表征，还是依赖于语言启发式策略，目前尚不明确。我们从机制视角出发，通过考察空间信息在模型内部如何被表征和利用来探讨这一问题。借鉴人类空间认知的计算理论，我们将空间推理分解为三个基本要素：关系组合、表征转换和状态化空间更新，并为每个要素设计了受控的任务族。我们在单次推理条件下评估了英语、中文和阿拉伯语的多语言LLMs，并运用线性探测、基于稀疏自编码器的特征分析和因果干预等方法对内部表征进行了分析。研究发现，任务相关的空间信息被编码在模型中间层，并能因果性地影响行为，但这些表征是瞬态的、分散在不同任务族之间，且与最终预测的整合程度较弱。跨语言分析进一步揭示了机制上的简并性：相似的行为表现源于不同的内部处理路径。总体而言，我们的结果表明，当前LLMs展现出的空间表征是有限且依赖于具体情境的，而非稳健的通用空间推理能力，这凸显了在基准测试精度之外进行机制性评估的必要性。

Abstract

As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models’ (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial representations or reliance on linguistic heuristics. We address this question from a mechanistic perspective by examining how spatial information is internally represented and used. Drawing on computational theories of human spatial cognition, we decompose spatial reasoning into three primitives (relational composition, representational transformation, and stateful spatial updating) and design controlled task families for each. We evaluate multilingual LLMs in English, Chinese, and Arabic under single-pass inference, and analyze internal representations using linear probing, sparse-autoencoder-based feature analysis, and causal interventions. We find that task-relevant spatial information is encoded in intermediate layers and can causally influence behavior, but these representations are transient, fragmented across task families, and weakly integrated into final predictions. Cross-linguistic analysis further reveals mechanistic degeneracy, where similar behavioral performance arises from distinct internal pathways. Overall, our results suggest that current LLMs exhibit limited and context-dependent spatial representations rather than robust, general-purpose spatial reasoning, highlighting the need for mechanistic evaluation beyond benchmark accuracy.

Keywords: large language models, spatial reasoning, mechanistic interpretability, internal representations, computational primitives, linear probing, causal interventions, multilingual analysis
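Linear probing, one of the analysis methods named in the abstract, trains a linear classifier on frozen hidden activations to test whether a property is linearly decodable from them. The following is a minimal sketch on synthetic data. It assumes a perceptron stands in for the probe (probes are more commonly fit with logistic regression) and that the "activations" are random vectors with a planted label direction. Nothing here comes from the paper's code or data.

```python
import random


def make_activations(n, dim, rng):
    """Synthetic stand-in for hidden states: noise plus a label-dependent shift on axis 0."""
    data = []
    for _ in range(n):
        label = rng.choice([0, 1])
        vec = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        vec[0] += 2.0 if label == 1 else -2.0  # planted, linearly decodable signal
        data.append((vec, label))
    return data


def predict(weights, bias, vec):
    return 1 if sum(w * x for w, x in zip(weights, vec)) + bias > 0 else 0


def train_probe(data, dim, epochs=100):
    """Perceptron updates until an epoch has no mistakes (or epochs run out)."""
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for vec, label in data:
            error = label - predict(weights, bias, vec)  # -1, 0, or +1
            if error != 0:
                weights = [w + error * x for w, x in zip(weights, vec)]
                bias += error
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def accuracy(weights, bias, data):
    return sum(predict(weights, bias, v) == y for v, y in data) / len(data)


rng = random.Random(0)
train = make_activations(200, 8, rng)
held_out = make_activations(100, 8, rng)
weights, bias = train_probe(train, dim=8)
train_acc = accuracy(weights, bias, train)
held_out_acc = accuracy(weights, bias, held_out)
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {held_out_acc:.2f}")
```

In a real probing study the vectors would be activations read from a fixed layer of the model, and accuracy on held-out examples would be compared across layers to locate where the property is encoded.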

8. Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR

Authors: Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang, Mingzhu Chen, Jiancan Wu, Kuien Liu, Xiang Wang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26126v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

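Scoring rationale: The paper focuses on reinforcement learning with verifiable rewards (RLVR) for multimodal large language models (MLLMs), and its core aim is to improve the reasoning process so that it better integrates visual evidence. It is therefore highly relevant to "Large Language Models" (10 points), since MLLMs extend LLMs. The reasoning keywords "Chain of Thought" and "System 2 Thinking" are also highly relevant (10 points each), because the paper directly tackles reasoning chains that are weakly grounded in visual facts and introduces trajectory guidance to strengthen fine-grained reasoning. The other keywords, including MoE, SLMs, Scaling Laws, training methods (Pre-training, SFT, RLHF, etc.), efficiency techniques (PEFT, Quantization), agents, tool use, and AI for Science, are either not mentioned in the abstract or unrelated to the paper's core content, so they score 0.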

!!! tip deepseek-chat TL;DR

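This paper addresses the disconnect between visual perception and logical reasoning when multimodal large language models are trained with reinforcement learning. It proposes a trajectory-guided reinforcement learning method that improves reasoning performance and narrows the gap between visual evidence and the reasoning process.

Abstract translation (Chinese)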

近期，基于可验证奖励的强化学习在多模态大语言模型中的应用进展主要聚焦于提升最终答案的正确性及加强视觉基础。然而，一个关键瓶颈依然存在：尽管模型能够关注到相关的视觉区域，却常常无法有效地将视觉证据融入后续推理过程，导致推理链条与视觉事实的关联较弱。为解决这一问题，我们提出了轨迹引导强化学习方法，该方法利用更强模型提供的专家推理轨迹，引导策略模型将视觉证据整合到细粒度的推理过程中。我们进一步引入了词元级重加权与轨迹过滤机制，以确保策略优化的稳定性和有效性。在多个多模态推理基准上的大量实验表明，轨迹引导强化学习方法持续提升了推理性能，并有效弥合了视觉感知与逻辑推理之间的差距。

Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for multimodal large language models (MLLMs) have mainly focused on improving final answer correctness and strengthening visual grounding. However, a critical bottleneck remains: although models can attend to relevant visual regions, they often fail to effectively incorporate visual evidence into subsequent reasoning, leading to reasoning chains that are weakly grounded in visual facts. To address this issue, we propose Trajectory-Guided Reinforcement Learning (TGRL), which guides the policy model to integrate visual evidence into fine-grained reasoning processes using expert reasoning trajectories from stronger models. We further introduce token-level reweighting and trajectory filtering to ensure stable and effective policy optimization. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.

Keywords: Multimodal Large Language Models, Reinforcement Learning, Reasoning Trajectories, Visual Grounding, Policy Optimization, Multimodal Reasoning, Trajectory-Guided RL, RLVR

📋 All Papers

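1. ✅ Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification

Authors: Shuai Lv, Chang Liu, Feng Tang, Yujie Yuan, Aojun Zhou, Kui Zhang, Xi Yang, Yangqiu Song Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26348v1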

Score: 73.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	15.0/10	15.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	15.0/10	15.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

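This paper targets a failure mode in which multimodal large language models gradually drift away from image evidence and hallucinate during long-form generation. It proposes Visual Re-Examination (VRE), a self-evolving training framework that uses visual introspection and information-gain-driven verification to improve reasoning accuracy and perceptual reliability while substantially reducing hallucinations.

Abstract translation (Chinese)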

多模态大语言模型（MLLMs）在多模态推理任务中展现出强大性能，然而我们发现其在长文本生成中存在一种反复出现的失效模式：随着输出内容增长，模型逐渐偏离图像证据并依赖文本先验，导致推理缺乏依据并产生幻觉。有趣的是，通过注意力机制分析，我们发现MLLMs具备一种潜在但未被持续激活的后期视觉验证能力。基于这一观察，我们提出视觉再审视（Visual Re-Examination, VRE）框架——一种自演化的训练范式，使MLLMs能够在无需额外视觉输入的情况下，在推理过程中自主执行视觉内省。与从更强教师模型蒸馏视觉能力的方法不同，VRE通过模型自身生成反思轨迹来促进迭代式自我改进，借助信息增益使视觉信息转化为可执行知识。在多样化多模态基准测试上的大量实验表明，VRE能持续提升推理准确性与感知可靠性，同时显著减少幻觉现象，尤其在长链推理场景中效果尤为突出。代码已发布于https://github.com/Xiaobu-USTC/VRE。

Abstract

Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance, yet we identify a recurring failure mode in long-form generation: as outputs grow longer, models progressively drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. Interestingly, based on attention analysis, we find that MLLMs have a latent capability for late-stage visual verification that is present but not consistently activated. Motivated by this observation, we propose Visual Re-Examination (VRE), a self-evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE promotes iterative self-improvement by leveraging the model itself to generate reflection traces, making visual information actionable through information gain. Extensive experiments across diverse multimodal benchmarks demonstrate that VRE consistently improves reasoning accuracy and perceptual reliability, while substantially reducing hallucinations, especially in long-chain settings. Code is available at https://github.com/Xiaobu-USTC/VRE.

Keywords: Multimodal Large Language Models, Visual Re-Examination, Self-improvement, Hallucination mitigation, Multimodal reasoning, Long-form generation, Visual verification, Information gain

2. ✅ Learning to Commit: Generating Organic Pull Requests via Online Repository Memory

Authors: Mo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26664v1

Score: 51.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	5.0/10	5.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

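This paper addresses the lack of organicity in code changes generated by LLM-based coding agents, which ignore project-specific conventions, duplicate functionality of internal APIs, and violate architectural constraints. It proposes the Learning to Commit framework with Online Repository Memory, which learns project-specific skill patterns from historical commits and improves the organicity scores of the generated changes.

Abstract translation (Chinese)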

基于大语言模型（LLM）的编码代理在受控基准测试中取得了令人瞩目的成果，但在实际应用中，其生成的拉取请求却常常遭到真实维护者的拒绝。根本原因并非功能错误，而是缺乏有机性：生成的代码忽视了项目特定的惯例、重复了内部API已提供的功能，并且违反了多年开发过程中积累的隐性架构约束。仅仅让代理接触最新的代码库快照是不够的：快照仅揭示了代码库的最终状态，却无法展现达成该状态所依赖的、特定于该仓库的变更模式。
为此，我们提出了“学习提交”框架，该框架通过在线仓库记忆来弥合这一差距。给定一个按严格时间顺序分割的代码仓库，代理会对早期的提交进行监督式对比反思：它首先盲目地尝试解决每个历史问题，随后将其预测结果与真实的代码差异进行比对，并将其中的差距提炼为一套持续增长的技能——这些可复用的模式捕捉了编码风格、内部API使用方式以及架构不变性。当新的拉取请求描述到来时，代理会基于这些积累的技能来生成代码，从而产生植根于项目自身演化历程而非通用预训练先验的变更。
评估在技能构建阶段完全未接触过的、真实未来的已合并拉取请求上进行，并涵盖多个维度，包括功能正确性、代码风格一致性、内部API复用率以及修改区域的合理性。在一个由专家维护、具有丰富提交历史的仓库上进行的实验表明，在线仓库记忆能有效提升在预留未来任务上的有机性得分。

Abstract

Large language model (LLM)-based coding agents achieve impressive results on controlled benchmarks yet routinely produce pull requests that real maintainers reject. The root cause is not functional incorrectness but a lack of organicity: generated code ignores project-specific conventions, duplicates functionality already provided by internal APIs, and violates implicit architectural constraints accumulated over years of development. Simply exposing an agent to the latest repository snapshot is not enough: the snapshot reveals the final state of the codebase, but not the repository-specific change patterns by which that state was reached. We introduce Learning to Commit, a framework that closes this gap through Online Repository Memory. Given a repository with a strict chronological split, the agent performs supervised contrastive reflection on earlier commits: it blindly attempts to resolve each historical issue, compares its prediction against the oracle diff, and distils the gap into a continuously growing set of skills: reusable patterns capturing coding style, internal API usage, and architectural invariants. When a new PR description arrives, the agent conditions its generation on these accumulated skills, producing changes grounded in the project’s own evolution rather than generic pretraining priors. Evaluation is conducted on genuinely future, merged pull requests that could not have been seen during the skill-building phase, and spans multiple dimensions including functional correctness, code-style consistency, internal API reuse rate, and modified-region plausibility. Experiments on an expert-maintained repository with rich commit history show that Online Repository Memory effectively improves organicity scores on held-out future tasks.

Keywords: Large Language Models, LLM-based coding agents, Online Repository Memory, organicity, pull requests, supervised contrastive reflection, skill distillation, repository-specific patterns
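The strict chronological split described in the abstract can be illustrated with a small sketch. It assumes commits carry a timestamp and that a single cutoff separates the skill-building history from the held-out future tasks. The `Commit` type, its fields, and the sample data are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Commit:
    sha: str             # hypothetical identifier
    merged_at: datetime  # hypothetical merge timestamp
    description: str


def chronological_split(commits, cutoff):
    """Return (history, future): commits strictly before the cutoff vs. at or after it."""
    ordered = sorted(commits, key=lambda c: c.merged_at)
    history = [c for c in ordered if c.merged_at < cutoff]
    future = [c for c in ordered if c.merged_at >= cutoff]
    return history, future


commits = [
    Commit("c3", datetime(2025, 9, 1), "Add retry helper"),
    Commit("c1", datetime(2025, 3, 1), "Fix config loader"),
    Commit("c2", datetime(2025, 6, 1), "Refactor cache layer"),
    Commit("c4", datetime(2025, 12, 1), "Reuse internal HTTP client"),
]
history, future = chronological_split(commits, cutoff=datetime(2025, 8, 1))

# No held-out task may predate anything the agent learned from.
assert max(c.merged_at for c in history) < min(c.merged_at for c in future)
print([c.sha for c in history], [c.sha for c in future])
```

Splitting by time rather than at random is what keeps the evaluation tasks genuinely unseen: a random split could place a later commit in the skill-building set and leak information about an earlier held-out task.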

3. ✅ Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

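This paper addresses limitations of reward design in large language model training. It proposes PAPO, which integrates process-level evaluation into GRPO through decoupled advantage normalization so that reasoning quality can be distinguished while answer correctness is preserved. Experiments show that it outperforms conventional outcome reward models on multiple benchmarks.

Abstract translation (Chinese)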

我们提出过程感知策略优化（PAPO），该方法通过解耦优势归一化，将过程级评估整合至群体相对策略优化（GRPO）中，以解决现有奖励设计的两项局限。结果奖励模型（ORM）仅评估最终答案的正确性，对所有正确回答一视同仁而忽略推理质量，且随着群体表现趋于一致正确，其优势信号逐渐消失。过程奖励模型（PRM）能提供更丰富的监督信息，但直接使用PRM分数会导致奖励破解问题，即模型通过冗长回答人为提高分数，而实际准确率却大幅下降。PAPO通过组合两个优势分量来解决这些问题：结果分量Aout源自ORM并在所有响应中归一化，过程分量Aproc则基于规则化PRM且仅在正确响应内部归一化。这种解耦设计确保Aout将训练锚定于正确性，而Aproc能在不扭曲结果信号的前提下区分推理质量。在多种模型规模和六个基准测试上的实验表明，PAPO始终优于ORM，在OlympiadBench上达到51.3%对比46.3%的准确率，且在ORM进入平台期并下降时仍能持续提升性能。

Abstract

We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component A_out, derived from ORM and normalized over all responses, and a process component A_proc, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that A_out anchors training on correctness while A_proc differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.

Keywords: Process-Aware Policy Optimization, PAPO, Group Relative Policy Optimization, GRPO, decoupled advantage normalization, process reward models, reasoning quality, outcome reward models
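The decoupled normalization described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: advantages are z-scored within a group, the two components are combined by plain addition, incorrect responses receive a zero process component, and the reward values are made up. The abstract does not specify these details, so none of them should be read as the paper's implementation.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-8):
    """Group-relative normalization: subtract the mean, divide by the std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def decoupled_advantages(correct, process_scores):
    """correct: 0/1 outcome rewards; process_scores: rubric scores per response."""
    # Outcome component: normalized over ALL responses in the group.
    a_out = zscore([float(c) for c in correct])
    # Process component: normalized ONLY among correct responses.
    correct_idx = [i for i, c in enumerate(correct) if c]
    a_proc = [0.0] * len(correct)
    if len(correct_idx) > 1:
        normalized = zscore([process_scores[j] for j in correct_idx])
        for i, z in zip(correct_idx, normalized):
            a_proc[i] = z
    # Assumed combination rule: simple sum of the two components.
    return [o + p for o, p in zip(a_out, a_proc)]


correct = [1, 1, 1, 0]                 # hypothetical group of four responses
process_scores = [0.9, 0.5, 0.1, 0.8]  # hypothetical rubric scores
adv = decoupled_advantages(correct, process_scores)
print([round(a, 3) for a in adv])

# When every response is correct, the outcome component is zero everywhere,
# and only the process component separates the responses.
adv_all_correct = decoupled_advantages([1, 1, 1, 1], [0.9, 0.5, 0.1, 0.5])
print([round(a, 3) for a in adv_all_correct])
```

In the first group the incorrect response keeps the lowest advantage even though its rubric score is high, because its process score never enters the normalization. The second group shows the case the abstract highlights: the outcome signal vanishes once the group is uniformly correct, while the process component still ranks the responses.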

4. ✅ SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

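This paper presents SkinGPT-X, a multimodal collaborative multi-agent system with a self-evolving dermatological memory mechanism. It targets the limited interpretability of existing LLMs in dermatological diagnosis and their weakness on rare diseases, and reports significant performance gains on multiple datasets.

Abstract translation (Chinese)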

尽管大语言模型的最新进展显著推动了皮肤病学诊断的发展，但单体大语言模型常因训练数据稀疏性而在细粒度、大规模多类别诊断任务及罕见皮肤病诊断中面临困难，同时缺乏临床推理所必需的可解释性与可追溯性。虽然多智能体系统能够提供更透明、可解释的诊断，但现有框架主要集中于视觉问答与会话任务，且其对静态知识库的严重依赖限制了其在复杂现实临床场景中的适应性。本文提出SkinGPT-X，这是一个集成自进化皮肤病记忆机制的多模态协作多智能体皮肤病诊断系统。通过模拟皮肤病学家的诊断工作流程并实现持续记忆进化，SkinGPT-X为复杂及罕见皮肤病病例的管理提供透明且可信的诊断。为验证SkinGPT-X的鲁棒性，我们设计了三层对比实验。首先，我们在四个公共数据集上将SkinGPT-X与四种先进大语言模型进行基准测试，结果显示其取得了最先进的性能：在DDI31数据集上准确率较最佳模型提升9.6%，在Dermnet数据集上加权F1分数提升13%。其次，我们构建了一个涵盖498种不同皮肤病类别的大规模多类别数据集，以评估其细粒度分类能力。最后，我们整理了罕见皮肤病数据集——这是首个针对临床罕见皮肤病稀缺性问题设计的基准数据集，包含564个临床样本，涵盖八种罕见皮肤病。在该数据集上，SkinGPT-X实现了准确率提升9.8%、加权F1分数提升7.1%、科恩卡帕系数提升10%的显著改进。

Abstract

While recent advancements in Large Language Models have significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate the rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin diseases, which contains 564 clinical samples with eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, and a +10% Cohen’s Kappa improvement.

Keywords: Large Language Models, Multi-agent Systems, Dermatological Diagnosis, Self-evolving Memory, Transparent Diagnostics, Rare Skin Diseases, AI for Science, Collaborative Agents
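Cohen's Kappa, one of the metrics reported in the abstract, corrects the raw agreement between predictions and reference labels for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement. The following is a minimal sketch with made-up labels. The class names and data are illustrative and unrelated to the paper's datasets.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).

    Undefined (division by zero) when chance agreement is exactly 1,
    e.g. when both label lists contain a single identical class.
    """
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement: probability that both pick the same class independently.
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


y_true = ["eczema", "eczema", "psoriasis", "psoriasis", "vitiligo", "vitiligo"]
y_pred = ["eczema", "psoriasis", "psoriasis", "psoriasis", "vitiligo", "eczema"]
kappa = cohens_kappa(y_true, y_pred)
print(round(kappa, 3))
```

Here four of six predictions match (p_o = 2/3) and chance agreement is 1/3, so kappa is 0.5, noticeably lower than the raw accuracy. This is why kappa is often reported alongside accuracy for multi-class diagnosis.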

Authors: Raia Abu Ahmad, Max Upravitelev, Aida Usmanova, Veronika Solopova, Georg Rehm Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26449v1

Score: 44.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	8.0/10	8.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

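This paper presents the ClimateCheck 2026 shared task, which uses large language models combined with retrieval-augmented generation and structured reasoning to fact-check climate-related claims against scientific literature and to classify disinformation narratives. It also exposes systematic biases in conventional evaluation metrics and shows that climate disinformation varies in how verifiable it is.

Abstract translation (Chinese)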

自动依据科学文献验证气候相关主张是一项具有挑战性的任务，其复杂性既源于学术证据的专业性，也源于气候虚假信息背后多样化的修辞策略。ClimateCheck 2026 是应对这一挑战的共享任务的第二次迭代，它在 2025 年版的基础上进行了扩展，训练数据增加了两倍，并新增了一项虚假信息叙事分类任务。该竞赛于 2026 年 1 月至 2 月在 CodaBench 平台上进行，吸引了 20 名注册参与者和 8 个排行榜提交系统，这些系统结合了密集检索流程、交叉编码器集成、大型语言模型以及结构化的层次推理方法。除了标准评估指标（Recall@K 和 Binary Preference），我们还采用了一个自动化框架来评估不完全标注下的检索质量，揭示了传统指标在系统排名中存在的系统性偏差。一项跨任务分析进一步表明，并非所有气候虚假信息都具有同等的可验证性，这可能对未来事实核查系统的设计方式具有启示意义。

Abstract

Automatically verifying climate-related claims against scientific literature is a challenging task, complicated by the specialised nature of scholarly evidence and the diversity of rhetorical strategies underlying climate disinformation. ClimateCheck 2026 is the second iteration of a shared task addressing this challenge, expanding on the 2025 edition with tripled training data and a new disinformation narrative classification task. Running from January to February 2026 on the CodaBench platform, the competition attracted 20 registered participants and 8 leaderboard submissions, with systems combining dense retrieval pipelines, cross-encoder ensembles, and large language models with structured hierarchical reasoning. In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems. A cross-task analysis further reveals that not all climate disinformation is equally verifiable, with potential implications for how future fact-checking systems should be designed.
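The two standard metrics named in the abstract can be sketched as follows. Recall@K is the share of relevant documents found in the top K results. Binary Preference (bpref) looks only at judged documents, which is why it is often used when annotations are incomplete. The version below follows one common formulation: each retrieved relevant document is penalized by the number of judged non-relevant documents ranked above it, capped at R and divided by min(R, N), where R and N are the numbers of judged relevant and non-relevant documents. It may differ in detail from the task's official scorer, and the document IDs and judgments are made up.

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of relevant documents that appear in the top-k results."""
    return len(set(ranking[:k]) & relevant) / len(relevant)


def bpref(ranking, relevant, non_relevant):
    """Binary preference over judged documents only; unjudged ones are ignored."""
    r = len(relevant)
    denom = min(r, len(non_relevant))
    nonrel_seen, total = 0, 0.0
    for doc in ranking:
        if doc in non_relevant:
            nonrel_seen += 1
        elif doc in relevant:
            penalty = min(nonrel_seen, r) / denom if denom else 0.0
            total += 1.0 - penalty
    return total / r


ranking = ["d1", "d7", "d3", "d9", "d2", "d5"]  # system output, best first
relevant = {"d1", "d2", "d4"}                   # judged relevant
non_relevant = {"d3", "d5"}                     # judged non-relevant; d7 and d9 are unjudged

recall5 = recall_at_k(ranking, relevant, k=5)
bpref_score = bpref(ranking, relevant, non_relevant)
print(round(recall5, 3), round(bpref_score, 3))
```

In this example the unjudged documents d7 and d9 do not affect bpref at all, whereas a metric that treats unjudged documents as non-relevant would penalize the system for retrieving them.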

6. ✅ MA-Bench: Towards Fine-grained Micro-Action Understanding

Authors: Kun Li, Jihao Gu, Fei Wang, Zhiliang Wu, Hehe Fan, Dan Guo Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26586v1

Score: 34.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

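This paper addresses the lack of a dedicated benchmark for micro-action understanding in multimodal large language models. It proposes the MA-Bench benchmark and a training dataset, and shows that fine-tuning clearly improves model performance on micro-action reasoning and explanation tasks.

Abstract translation (Chinese)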

随着多模态大语言模型（MLLMs）的快速发展，其在人类情感分析中至关重要的微动作理解潜力，由于缺乏专门的评测基准，尚未得到充分探索。为解决这一问题，我们提出了MA-Bench基准，该基准包含1,000个视频和一个三层评估架构，逐步考察微动作感知、关系理解和解释推理能力。MA-Bench包含12,000个结构化问答对，能够系统性地评估模型的识别准确性和动作解释能力。对23个代表性MLLM的评测结果表明，现有模型在捕捉动作粒度和细粒度身体部位动态方面仍面临显著挑战。为应对这些挑战，我们进一步构建了MA-Bench-Train，这是一个包含20.5K个视频的大规模训练语料库，所有视频均配有结构化的微动作描述标注，可用于微调MLLMs。基于MA-Bench-Train微调的Qwen3-VL-8B模型在微动作推理与解释任务上均显示出明显的性能提升。我们的工作旨在为推进MLLMs理解细微微动作及人类相关行为建立一个基础性基准。项目页面：https://MA-Bench.github.io

Abstract

With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal that there are significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundation benchmark for advancing MLLMs in understanding subtle micro-action and human-related behaviors. Project Page: https://MA-Bench.github.io

Keywords: Multimodal Large Language Models, Micro-Action Understanding, Benchmark, Fine-tuning, Interpretive Reasoning, Human Emotion Analysis, Video Understanding, Qwen3-VL-8B

7. ✅ From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs

Authors: Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26323v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
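
The total in the score line can be re-derived from the table. The following is a minimal sketch, assuming (as the rows suggest) that each row's score is weight × relevance and that the total is the sum of row scores. The threshold uses the formula stated in the report's configuration, 5.0 + 0.8 × total weight, with a total weight of 27.0. The function names and the shortened keyword labels are illustrative and are not taken from the report generator.

```python
# Hypothetical re-computation of the score for paper 7; names are illustrative.

def total_score(rows):
    """Sum weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


def passing_threshold(total_weight, base=5.0, factor=0.8):
    """Passing score as configured in the report: base + factor * total weight."""
    return base + factor * total_weight


# Non-zero rows from the table above; the other 23 keywords contribute 0.
rows = [
    ("Large Language Models", 1.0, 10.0),
    ("Chain of Thought", 1.0, 5.0),
    ("System 2 Thinking", 1.0, 5.0),
    ("Mechanistic Interpretability", 1.0, 10.0),
]

score = total_score(rows)
threshold = passing_threshold(total_weight=27.0)
# Assumption: a score equal to the threshold counts as a pass; the report does not say.
passed = score >= threshold
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

The same check applies to every entry in the report; only the non-zero rows change.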

!!! tip deepseek-chat TL;DR

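This study decomposes spatial reasoning into computational primitives and analyzes the internal representations of multilingual LLMs. It finds that current LLMs exhibit only limited, context-dependent spatial representations rather than robust, general-purpose spatial reasoning.

Abstract translation (Chinese)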

随着空间智能日益成为基础模型的关键能力，大型语言模型（LLMs）在空间推理基准测试中的表现究竟反映了其内部结构化的空间表征，还是依赖于语言启发式策略，目前尚不明确。我们从机制视角出发，通过考察空间信息在模型内部如何被表征和利用来探讨这一问题。借鉴人类空间认知的计算理论，我们将空间推理分解为三个基本要素：关系组合、表征转换和状态化空间更新，并为每个要素设计了受控的任务族。我们在单次推理条件下评估了英语、中文和阿拉伯语的多语言LLMs，并运用线性探测、基于稀疏自编码器的特征分析和因果干预等方法对内部表征进行了分析。研究发现，任务相关的空间信息被编码在模型中间层，并能因果性地影响行为，但这些表征是瞬态的、分散在不同任务族之间，且与最终预测的整合程度较弱。跨语言分析进一步揭示了机制上的简并性：相似的行为表现源于不同的内部处理路径。总体而言，我们的结果表明，当前LLMs展现出的空间表征是有限且依赖于具体情境的，而非稳健的通用空间推理能力，这凸显了在基准测试精度之外进行机制性评估的必要性。

Abstract

As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models’ (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial representations or reliance on linguistic heuristics. We address this question from a mechanistic perspective by examining how spatial information is internally represented and used. Drawing on computational theories of human spatial cognition, we decompose spatial reasoning into three primitives, relational composition, representational transformation, and stateful spatial updating, and design controlled task families for each. We evaluate multilingual LLMs in English, Chinese, and Arabic under single-pass inference, and analyze internal representations using linear probing, sparse-autoencoder-based feature analysis, and causal interventions. We find that task-relevant spatial information is encoded in intermediate layers and can causally influence behavior, but these representations are transient, fragmented across task families, and weakly integrated into final predictions. Cross-linguistic analysis further reveals mechanistic degeneracy, where similar behavioral performance arises from distinct internal pathways. Overall, our results suggest that current LLMs exhibit limited and context-dependent spatial representations rather than robust, general-purpose spatial reasoning, highlighting the need for mechanistic evaluation beyond benchmark accuracy.

Keywords: large language models, spatial reasoning, mechanistic interpretability, internal representations, computational primitives, linear probing, causal interventions, multilingual analysis
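
Linear probing, one of the analysis methods named in the abstract, fits a linear classifier on frozen internal activations to test whether a property is linearly decodable from them. The following is a toy sketch only: the two-dimensional "activations", the binary labels, and the perceptron update rule are all assumptions made for illustration, and none of it comes from the paper. A real probe would use hidden-state vectors extracted from a chosen model layer.

```python
# Toy linear probe: the feature vectors stand in for frozen hidden activations.

def train_linear_probe(features, labels, epochs=20, lr=0.1):
    """Perceptron-style probe: features stay fixed, only w and b are learned."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred  # 0 when correct, +1 or -1 when wrong
            if err:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples the linear probe classifies correctly."""
    hits = 0
    for x, y in zip(features, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += int(pred == y)
    return hits / len(labels)


# Hypothetical activations; label 1 might mean "left of", label 0 "right of".
features = [(2.0, 0.5), (1.5, 0.2), (1.8, 1.0), (0.3, 1.7), (0.1, 2.2), (0.9, 1.9)]
labels = [1, 1, 1, 0, 0, 0]

w, b = train_linear_probe(features, labels)
acc = probe_accuracy(w, b, features, labels)
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy only shows that the information is linearly decodable. Whether the model actually uses it is a separate question, which is why the abstract pairs probing with causal interventions.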

8. ✅ Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

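This paper targets the disconnect between visual perception and logical reasoning when multimodal large language models are trained with reinforcement learning. It proposes Trajectory-Guided Reinforcement Learning (TGRL), which improves reasoning performance and narrows the gap between visual evidence and the reasoning process.

Abstract translation (Chinese)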

近期，基于可验证奖励的强化学习在多模态大语言模型中的应用进展主要聚焦于提升最终答案的正确性及加强视觉基础。然而，一个关键瓶颈依然存在：尽管模型能够关注到相关的视觉区域，却常常无法有效地将视觉证据融入后续推理过程，导致推理链条与视觉事实的关联较弱。为解决这一问题，我们提出了轨迹引导强化学习方法，该方法利用更强模型提供的专家推理轨迹，引导策略模型将视觉证据整合到细粒度的推理过程中。我们进一步引入了词元级重加权与轨迹过滤机制，以确保策略优化的稳定性和有效性。在多个多模态推理基准上的大量实验表明，轨迹引导强化学习方法持续提升了推理性能，并有效弥合了视觉感知与逻辑推理之间的差距。

Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for multimodal large language models (MLLMs) have mainly focused on improving final answer correctness and strengthening visual grounding. However, a critical bottleneck remains: although models can attend to relevant visual regions, they often fail to effectively incorporate visual evidence into subsequent reasoning, leading to reasoning chains that are weakly grounded in visual facts. To address this issue, we propose Trajectory-Guided Reinforcement Learning (TGRL), which guides the policy model to integrate visual evidence into fine-grained reasoning processes using expert reasoning trajectories from stronger models. We further introduce token-level reweighting and trajectory filtering to ensure stable and effective policy optimization. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.

Keywords: Multimodal Large Language Models, Reinforcement Learning, Reasoning Trajectories, Visual Grounding, Policy Optimization, Multimodal Reasoning, Trajectory-Guided RL, RLVR

9. ❌ Automated near-term quantum algorithm discovery for molecular ground states

Authors: Fabian Finger, Frederic Rapp, Pranav Kalidindi, Kerry He, Kante Yin, Alexander Koziell-Pipe, David Zsolt Manrique, Gabriel Greene-Diniz, Stephen Clark, Hamza Fawzi, Bernardino Romera Paredes, Alhussein Fawzi, Konstantinos Meichanetzidis Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26359v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

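Scoring rationale: The paper's core contribution is quantum algorithm discovery using Hive, an AI platform driven by large language models, which is an innovative application of large models to science. It is therefore highly relevant to “Large Language Models” (10 points) and to “AI for Science” (10 points). The paper includes an interpretability study, so it is partially relevant to “Mechanistic Interpretability” (5 points). It does not cover the technical specifics behind the other keywords, such as MoE, SFT, RAG, or inference acceleration, so those keywords score 0.

!!! tip deepseek-chat TL;DR

This study uses Hive, an LLM-driven AI platform, to automatically discover efficient quantum heuristic algorithms for the ground state problem of the molecules LiH, H2O, and F2. The discovered algorithms require significantly fewer quantum resources, and the resulting circuits were benchmarked on the Quantinuum System Model H2 quantum computer.

Abstract translation (Chinese)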

量子算法的设计是一项复杂且反直觉的任务，这使其成为人工智能驱动算法发现的理想对象。为此，我们采用Hive——一个用于程序合成的人工智能平台，该平台利用大型语言模型驱动高度分布式的进化过程以发现新算法。我们聚焦于量子化学中的基态问题，并发现了高效的量子启发式算法，用于求解LiH、H2O和F2分子的基态问题，同时相较于当前最先进的近期量子算法，其所需的量子资源显著减少。此外，我们对所发现的算法进行了可解释性研究，并识别出导致效率提升的关键功能。最后，我们在Quantinuum System Model H2量子计算机上对Hive发现的量子电路进行了基准测试，并确定了达到化学精度所需的最低系统要求。我们预见，这种新颖的量子算法发现方法可应用于化学以外的其他领域，也可用于为容错量子计算机设计量子算法。

Abstract

Designing quantum algorithms is a complex and counterintuitive task, making it an ideal candidate for AI-driven algorithm discovery. To this end, we employ the Hive, an AI platform for program synthesis, which utilises large language models to drive a highly distributed evolutionary process for discovering new algorithms. We focus on the ground state problem in quantum chemistry, and discover efficient quantum heuristic algorithms that solve it for molecules LiH, H2O, and F2 while exhibiting significant reductions in quantum resources relative to state-of-the-art near-term quantum algorithms. Further, we perform an interpretability study on the discovered algorithms and identify the key functions responsible for the efficiency gains. Finally, we benchmark the Hive-discovered circuits on the Quantinuum System Model H2 quantum computer and identify minimum system requirements for chemical precision. We envision that this novel approach to quantum algorithm discovery applies to other domains beyond chemistry, as well as to designing quantum algorithms for fault-tolerant quantum computers.

Keywords: quantum algorithm discovery, large language models, AI for science, quantum chemistry, ground state problem, program synthesis, evolutionary process, interpretability study

10. ❌ Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference

Authors: Konstantinos Papaioannou, Thaleia Dimitra Doudali Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26498v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

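Scoring rationale: The paper's core concern is scheduling and performance in inference serving for multimodal large language models (MLLMs), which falls under LLM systems optimization. Only two keywords are highly relevant. (1) “Large Language Models” OR “LLMs” OR “Foundation Models”: the paper explicitly studies MLLMs (such as those behind ChatGPT and Gemini), a subclass of LLMs, so it scores 10. (2) “Speculative Decoding” OR “Inference Acceleration”: the core contribution, RPS-Serve, is a scheduling system that uses modality-aware scheduling to reduce time-to-first-token (TTFT) and overall latency, which falls squarely under inference acceleration, so it scores 10. The remaining keywords concern model architecture, training methods, alignment, application domains (such as AI for science), specific reasoning techniques (such as CoT and RAG), or model compression. The paper does not address these, so they score 0.

!!! tip deepseek-chat TL;DR

This paper addresses severe queue blocking and performance degradation in multimodal LLM inference serving, caused by the very different resource demands of heterogeneous requests (text, images, video). It proposes RPS-Serve, a modality-aware scheduler that uses dynamic prioritization and aging to reduce time-to-first-token (TTFT) by 54% on average overall and by 78.5% for latency-critical requests, giving MLLMs responsiveness comparable to text-only LLMs.

Abstract translation (Chinese)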

多模态大语言模型（Multimodal Large Language Models, MLLMs）驱动着如ChatGPT、Gemini和Copilot等平台，实现了与文本、图像和视频更丰富的交互。这些异构工作负载引入了额外的推理阶段，例如视觉预处理和编码，从而增加了延迟和内存需求。现有仅针对纯文本工作负载优化的LLM服务系统在多模态场景下表现不佳：大型请求（如视频）会独占资源，导致严重的队头阻塞和性能下降。我们的核心洞察在于，多模态请求在资源需求上存在数量级差异，我们通过一个简单的抽象来捕捉这一特征：视频如同岩石，图像如同卵石，文本如同沙粒。我们设计了RPS-Serve，一种模态感知调度器，它让沙粒能够快速穿过卵石和岩石，在确保交互响应性的同时避免饥饿。RPS-Serve对请求进行分类，动态调整优先级，并应用老化机制以防止饥饿。在多种先进MLLM上的评估表明，与现有系统相比，RPS-Serve平均将首词元时间（time-to-first-token, TTFT）整体降低了54%，对延迟敏感请求更是降低了78.5%。通过模态感知调度以及对可用资源的高效利用，RPS-Serve为MLLM提供了类似LLM的响应能力。

Abstract

Multimodal Large Language Models (MLLMs) power platforms like ChatGPT, Gemini, and Copilot, enabling richer interactions with text, images, and videos. These heterogeneous workloads introduce additional inference stages, such as vision preprocessing and encoding, that inflate latency and memory demand. Existing LLM serving systems, optimized for text-only workloads, fail under multimodality: large requests (e.g., videos) monopolize resources, causing severe head-of-line blocking and performance degradation. Our key insight is that multimodal requests differ by orders of magnitude in resource demands, which we capture through a simple abstraction: videos behave like rocks, images like pebbles, and text like sand. We design RPS-Serve, a modality-aware scheduler that lets sand flow quickly through pebbles and rocks, ensuring interactive responsiveness while avoiding starvation. RPS-Serve classifies requests, prioritizes them dynamically, and applies aging to avoid starvation. Evaluation across state-of-the-art MLLMs shows that RPS-Serve reduces, on average, time-to-first-token (TTFT) by 54% overall, and by 78.5% for latency-critical requests, compared to current systems. RPS-Serve delivers LLM-like responsiveness for MLLMs, with modality-aware scheduling and by making the most efficient use of the available resources.

Keywords: Multimodal Large Language Models, MLLMs, Inference Scheduling, Modality-aware Scheduling, Time-to-First-Token, TTFT, Head-of-Line Blocking, RPS-Serve
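
The abstract describes a scheduler that classifies requests by modality, prioritizes them dynamically, and applies aging to avoid starvation. The following is a minimal sketch of that general idea, not RPS-Serve's actual algorithm: the base priorities, the linear aging rule, the `aging_rate` value, and every name in the code are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumed base priorities (lower runs first): sand < pebbles < rocks.
BASE_PRIORITY = {"text": 0.0, "image": 1.0, "video": 2.0}


@dataclass
class Request:
    req_id: str
    modality: str   # "text", "image", or "video"
    arrival: float  # arrival time in seconds


def effective_priority(req, now, aging_rate=0.5):
    """Base priority minus an aging credit that grows with waiting time."""
    waited = now - req.arrival
    return BASE_PRIORITY[req.modality] - aging_rate * waited


def pick_next(queue, now, aging_rate=0.5):
    """Remove and return the request with the lowest effective priority."""
    best = min(queue, key=lambda r: (effective_priority(r, now, aging_rate), r.arrival))
    queue.remove(best)
    return best


queue = [
    Request("video-1", "video", arrival=0.0),
    Request("text-1", "text", arrival=1.0),
    Request("image-1", "image", arrival=1.0),
]

order = []
# At t=1.0 the text request overtakes the earlier video ("sand flows through").
order.append(pick_next(queue, now=1.0).req_id)
# A new text request arrives at t=6.0, but the older requests have aged past it.
queue.append(Request("text-2", "text", arrival=6.0))
while queue:
    order.append(pick_next(queue, now=6.0).req_id)
print(order)
```

In this toy run the short text request is served first, while aging ensures that the long-waiting image and video requests are not starved by text that arrives later.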

11. ❌ Experimental study on surveillance video-based indoor occupancy measurement with occupant-centric control

Authors: Irfan Qaisar, Kailai Sun, Qingshan Jia, Qianchuan Zhao Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26081v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

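Scoring rationale: The paper explicitly applies large language models (LLMs) to indoor occupancy measurement and HVAC control optimization in smart buildings, an application of large models to science and engineering. It is therefore highly relevant to “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points). The study is also an application of AI to building science and energy, so it is partially relevant to “AI for Science” OR “Bioinformatics” OR “Cheminformatics” (5 points), although the domain is not an exact match (it is neither bioinformatics nor cheminformatics). The paper does not address the model internals, training methods, inference optimizations, agent systems, or application domains described by the other keywords, so they score 0.

!!! tip deepseek-chat TL;DR

This study proposes an LLM-enhanced, vision-based indoor occupancy measurement method and integrates it into a model predictive control framework for HVAC. Experiments show that the method improves measurement accuracy and achieves an HVAC energy-saving potential of 17.94%.

Abstract translation (Chinese)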

精确的占用信息对于智能建筑中以人为中心的闭环控制至关重要。然而，现有的基于视觉的占用测量方法在实际室内环境中往往难以提供稳定而精确的测量结果，且其对下游暖通空调控制的影响尚未得到充分研究。为实现2050年净零排放目标，本文对基于大型语言模型增强的视觉室内占用测量及其对支持OCC的暖通空调运行的影响进行了实验研究。利用从中国某研究实验室收集的真实监控数据（附带逐帧人工标注的真实值），在相同条件下比较了纯检测、基于跟踪以及基于LLM优化的处理流程。结果表明，基于跟踪的方法相比纯检测测量提升了时间稳定性，而基于LLM的优化进一步提高了占用测量性能，并减少了错误的无人状态预测。性能最佳的流程（YOLOv8+DeepSeek）实现了0.8824的准确率和0.9320的F1分数。随后，该流程被集成到OpenStudio-EnergyPlus中的暖通空调监督模型预测控制框架内。实验结果表明，所提出的框架能够支持更高效的OCC运行，实现高达17.94%的显著暖通空调节能潜力。这些发现为未来人工智能增强的智能建筑运营研究提供了有效的方法论和实践基础。

Abstract

Accurate occupancy information is essential for closed-loop occupant-centric control (OCC) in smart buildings. However, existing vision-based occupancy measurement methods often struggle to provide stable and accurate measurements in real indoor environments, and their implications for downstream HVAC control remain insufficiently studied. To achieve Net Zero emissions by 2050, this paper presents an experimental study of large language models (LLMs)-enhanced vision-based indoor occupancy measurement and its impact on OCC-enabled HVAC operation. Detection-only, tracking-based, and LLM-based refinement pipelines are compared under identical conditions using real surveillance data collected from a research laboratory in China, with frame-level manual ground-truth annotations. Results show that tracking-based methods improve temporal stability over detection-only measurement, while LLM-based refinement further improves occupancy measurement performance and reduces false unoccupied prediction. The best-performing pipeline, YOLOv8+DeepSeek, achieves an accuracy of 0.8824 and an F1-score of 0.9320. This pipeline is then integrated into an HVAC supervisory model predictive control framework in OpenStudio-EnergyPlus. Experimental results demonstrate that the proposed framework can support more efficient OCC operation, achieving a substantial HVAC energy-saving potential of 17.94%. These findings provide an effective methodology and practical foundation for future research in AI-enhanced smart building operations.

Keywords: indoor occupancy measurement, large language models (LLMs), vision-based, HVAC control, smart buildings, energy saving, model predictive control, DeepSeek
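
The abstract reports accuracy and F1-score for frame-level occupancy prediction. The following is a minimal sketch of how these two metrics are computed for binary occupied/unoccupied labels. The label sequences are made-up toy data, not the paper's measurements, and the function name is illustrative.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as occupied."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives are occupied frames predicted as unoccupied,
    # i.e. the "false unoccupied predictions" mentioned in the abstract.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy frame-level labels: 1 = occupied, 0 = unoccupied.
y_true = [1, 1, 1, 0, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.4f}, F1={f1:.4f}")
```

Because F1 ignores true negatives, it can exceed accuracy when occupied frames dominate, which is consistent with the reported F1 being higher than the reported accuracy.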

12. ❌ PQuantML: A Tool for End-to-End Hardware-aware Model Compression

Authors: Roope Niemi, Anastasiia Petrovych, Arghya Ranjan Das, Enrico Lupi, Chang Sun, Dimitrios Danopoulos, Marlon Joshua Helbing, Mia Liu, Sebastian Dittmeier, Michael Kagan, Vladimir Loncar, Maurizio Pierini Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26595v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	10.0/10	10.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

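Scoring rationale: PQuantML focuses on neural network model compression, specifically quantization and pruning, for deploying performant models in environments with strict latency constraints. It is highly relevant to “Quantization” OR “Model Compression” OR “Low-bit Weights” (10 points), since this is its core content. The paper has a scientific application (jet tagging in high-energy physics), so it is partially relevant to “AI for Science” OR “Bioinformatics” OR “Cheminformatics” (5 points), although it is neither bioinformatics nor cheminformatics. The other keywords mainly concern large language models (LLMs), training methods, inference techniques, and agent systems, none of which the paper mentions, so they score 0.

!!! tip deepseek-chat TL;DR

PQuantML is a new open-source, hardware-aware neural network model compression library. It simplifies the training of compressed models by providing a unified interface for applying pruning and quantization, and on tasks such as jet substructure classification it achieves substantial reductions in parameter count and bit width while maintaining accuracy.

Abstract translation (Chinese)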

PQuantML是一款新型开源、硬件感知的神经网络模型压缩库，专为端到端工作流设计。该库的研发动机源于将高性能模型部署至具有严格延迟限制环境的需求，它通过提供统一接口来联合或单独应用剪枝与量化技术，从而简化压缩模型的训练流程。该库实现了多种不同粒度的剪枝方法，以及支持高粒度量化（High-Granularity Quantization）的定点量化方案。我们在代表性任务上评估了PQuantML的性能，例如喷注子结构分类（即喷注标记）——这是与大型强子对撞机（LHC）实时数据处理相关的边缘计算难题。通过结合多种剪枝方法与定点量化，PQuantML在保持精度的同时实现了显著的参数量与比特宽度压缩。最终压缩效果进一步与QKeras、HGQ等现有工具进行了对比分析。

Abstract

PQuantML is a new open-source, hardware-aware neural network model compression library tailored to end-to-end workflows. Motivated by the need to deploy performant models to environments with strict latency constraints, PQuantML simplifies training of compressed models by providing a unified interface to apply pruning and quantization, either jointly or individually. The library implements multiple pruning methods with different granularities, as well as fixed-point quantization with support for High-Granularity Quantization. We evaluate PQuantML on representative tasks such as the jet substructure classification, so-called jet tagging, an on-edge problem related to real-time LHC data processing. Using various pruning methods with fixed-point quantization, PQuantML achieves substantial parameter and bit-width reductions while maintaining accuracy. The resulting compression is further compared against existing tools, such as QKeras and HGQ.

Keywords: model compression, quantization, pruning, hardware-aware, end-to-end workflow, neural network, jet tagging, high-granularity quantization
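
The abstract combines pruning with fixed-point quantization. The following is a minimal sketch of both operations on a plain list of weights. It does not use or imitate the PQuantML API: the function names, the 8-bit format with 4 fractional bits, round-to-nearest with saturation, and unstructured magnitude pruning are all assumptions chosen for illustration.

```python
def quantize_fixed_point(x, total_bits=8, frac_bits=4):
    """Round x to a signed fixed-point grid and saturate at the range limits."""
    scale = 1 << frac_bits                # 2**frac_bits steps per unit
    q_min = -(1 << (total_bits - 1))      # most negative integer code
    q_max = (1 << (total_bits - 1)) - 1   # most positive integer code
    code = max(q_min, min(q_max, round(x * scale)))
    return code / scale


def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (unstructured)."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.03, -1.27, 0.52, -0.08, 9.5, 0.31]
# Prune first, then quantize the surviving weights.
compressed = [quantize_fixed_point(w) for w in prune_by_magnitude(weights)]
print(compressed)
```

In this toy run the value 9.5 falls outside the representable range and saturates at the largest code, which illustrates why the split between integer and fractional bits matters when choosing a bit width.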

13. ❌ SPECTRA: An Efficient Spectral-Informed Neural Network for Sensor-Based Activity Recognition

Authors: Deepika Gurung, Lala Shakti Swarup Ray, Mengxi Liu, Bo Zhou, Paul Lukowicz Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26482v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

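Scoring rationale: SPECTRA focuses on optimizing deep learning models for sensor-based human activity recognition (HAR). Its core is the design of an efficient, deployable edge-AI architecture. Most keywords are irrelevant because the paper does not cover large language models (LLMs), alignment, reasoning, agents, or similar topics. Relevant keywords: (1) “Small Language Models” OR “SLMs” OR “On-device AI” (5 points): the paper emphasizes edge deployment (for example on smartphones and microcontrollers), which counts as on-device AI, but the model is not a language model. (2) “Quantization” OR “Model Compression” OR “Low-bit Weights” (5 points): the paper reduces parameters and computation through architectural design (such as depthwise separable convolutions and a compact GRU), which is model compression in a broad sense, but it does not explicitly mention quantization or low-bit weights. (3) “Speculative Decoding” OR “Inference Acceleration” (5 points): the paper optimizes latency and energy efficiency, which relates to inference acceleration, but it does not involve decoding or speculative decoding. Other keywords, such as pre-training, fine-tuning, and RAG, are not relevant.

!!! tip deepseek-chat TL;DR

This paper proposes SPECTRA, an efficient spectral-temporal neural network architecture that addresses the trade-off among accuracy, latency, and energy consumption when deploying sensor-based activity recognition models on edge devices. Across five public datasets it matches or approaches larger baselines with far fewer parameters, and it is deployed on a smartphone and a microcontroller.

Abstract translation (Chinese)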

普适计算中基于传感器的实时应用需要可边缘部署的模型，以确保低延迟、隐私保护和高效交互。一个典型范例是基于传感器的人类活动识别，其模型必须在准确性与严格的资源限制之间取得平衡。然而，许多深度学习方法将时序传感器信号视为黑箱序列，忽视了频谱-时序结构，同时需要过量计算。我们提出SPECTRA——一种部署优先、协同设计的频谱-时序架构，该架构集成短时傅里叶变换特征提取、深度可分离卷积和通道自注意力机制，以在真实边缘设备的运行时和内存限制下捕捉频谱-时序依赖关系。一个紧凑的带注意力池化的双向门控循环单元以低成本汇总窗口内动态，减轻下游模型负担的同时保持准确性。在五个公开的人类活动识别数据集上，SPECTRA达到或接近更大的卷积神经网络-长短期记忆网络和Transformer基线模型的性能，同时显著减少参数量、延迟和能耗。在Google Pixel 9智能手机和STM32L4微控制器上的部署进一步验证了该架构可实现端到端可部署、实时、隐私保护且高效的人类活动识别系统。

Abstract

Real-time sensor-based applications in pervasive computing require edge-deployable models to ensure low latency, privacy, and efficient interaction. A prime example is sensor-based human activity recognition, where models must balance accuracy with stringent resource constraints. Yet many deep learning approaches treat temporal sensor signals as black-box sequences, overlooking spectral-temporal structure while demanding excessive computation. We present SPECTRA, a deployment-first, co-designed spectral-temporal architecture that integrates short-time Fourier transform (STFT) feature extraction, depthwise separable convolutions, and channel-wise self-attention to capture spectral-temporal dependencies under real edge runtime and memory constraints. A compact bidirectional GRU with attention pooling summarizes within-window dynamics at low cost, reducing downstream model burden while preserving accuracy. Across five public HAR datasets, SPECTRA matches or approaches larger CNN-LSTM and Transformer baselines while substantially reducing parameters, latency, and energy. Deployments on a Google Pixel 9 smartphone and an STM32L4 microcontroller further demonstrate end-to-end deployable, real-time, private, and efficient HAR.

Keywords: sensor-based activity recognition, edge deployment, spectral-temporal architecture, low latency, energy efficiency, depthwise separable convolutions, bidirectional GRU, real-time HAR
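
The architecture in the abstract begins with short-time Fourier transform (STFT) feature extraction. The following is a minimal pure-Python sketch of an STFT magnitude computation on a one-dimensional sensor signal. The frame length, hop size, Hann window, and the synthetic signal are assumptions for illustration, since the abstract does not state SPECTRA's actual settings. A naive DFT is used for clarity where a practical implementation would use an FFT.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=8, hop=4):
    """Return per-frame magnitude spectra, shape: frames x (frame_len // 2 + 1)."""
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT for bin k over the windowed frame.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Synthetic stand-in for one sensor channel: 2 cycles per 8-sample frame.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = stft_magnitude(signal)
# Index of the strongest frequency bin in each frame.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
print(len(spec), peak_bins)
```

The resulting time-frequency matrix is the kind of input that convolutional and attention layers can then process. For this toy signal the energy concentrates in bin 2 of every frame.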

14. ❌ ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs

Authors: Inês Vieira, Inês Calvo, Iago Paulo, James Furtado, Rafael Ferreira, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26516v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

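Score rationale: The paper's core contribution is evaluating large language models (LLMs) in a specific language (European Portuguese), so it is highly relevant to “Large Language Models” (10 points). It does not address the technical principles, methods, or applications behind the other keywords, such as MoE, SLMs, training techniques, reasoning methods, agent systems, or model optimization, so all of those keywords score 0.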
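The scoring arithmetic behind these tables can be sketched as follows. This is a minimal illustration, assuming that each row's score is weight × relevance (which matches the rows shown) and that a paper passes when its total reaches the threshold; the function names and the `>=` comparison are assumptions, not taken from the report generator.

```python
# Sketch of the report's scoring arithmetic (function names are illustrative).

def pass_threshold(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing score: 5.0 + 0.8 x total weight, as stated in the report configuration."""
    return base + factor * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum of weight x relevance over all keyword rows (relevance is on a 0-10 scale)."""
    return sum(weight * relevance for weight, relevance in rows)


# 27 keywords with weight 1.0; ALBA has relevance 10.0 on the first keyword, 0.0 elsewhere.
alba_rows = [(1.0, 10.0)] + [(1.0, 0.0)] * 26
threshold = pass_threshold(total_weight=27.0)
score = paper_score(alba_rows)
passed = score >= threshold  # assumption: the pass check is "greater than or equal"
print(f"{score:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

With 27 keywords of weight 1.0 the threshold evaluates to 26.6, which is the passing score quoted in each entry.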

!!! tip deepseek-chat TL;DR

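This work addresses the lack of evaluation benchmarks for European Portuguese by developing ALBA, a benchmark that assesses LLMs across eight linguistic dimensions; experiments show that model performance varies across those dimensions.

Abstract (Chinese translation)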

随着大语言模型在多语言领域的扩展，评估其在资源不足语言中的表现变得日益重要。欧洲葡萄牙语（pt-PT）尤其受到这一问题的影响，因为现有的训练数据和基准测试主要基于巴西葡萄牙语（pt-BR）。为此，我们推出了ALBA——一个基于语言学原理构建的基准测试，旨在从八个语言学维度全面评估大语言模型在欧洲葡萄牙语中的语言相关任务能力，这些维度包括：语言变体、文化关联语义、话语分析、文字游戏、句法、形态学、词汇学以及语音与音系学。ALBA由语言专家人工构建，并搭配一个基于大语言模型的评判框架，以实现对欧洲葡萄牙语生成语言的可扩展评估。通过对多种模型的实验，我们发现不同语言学维度上的表现存在显著差异，这凸显了需要建立全面且对语言变体敏感的基准测试，以支持欧洲葡萄牙语工具的进一步发展。

Abstract

As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency in linguistic-related tasks in pt-PT across eight linguistic dimensions, including Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of pt-PT generated language. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools in pt-PT.

Keywords: Large Language Models, European Portuguese, linguistic benchmark, multilingual evaluation, language variety, LLM-as-a-judge, linguistic dimensions, under-represented languages

15. ❌ EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching

Authors: Paul Bontempo Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26587v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

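Score rationale: The paper studies the relationship between sentiment and language choice in English-Tamil code-switching, using a fine-tuned XLM-RoBERTa model for language identification followed by statistical analysis. It mainly concerns code-switching and sentiment analysis in natural language processing, and involves neither innovation in the technical principles of large models nor novel applications of large models in other domains. It is only weakly related to the keyword “Post-training” OR “Supervised Fine-tuning” OR “SFT” (a fine-tuned model is used); no other keyword is directly related.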

!!! tip deepseek-chat TL;DR

This paper investigates how utterance sentiment influences language choice in English-Tamil code-switching and finds that positive utterances have higher English proportion while mixed-sentiment utterances show more frequent language switches.

Abstract (Chinese translation)

本文采用机器学习与统计建模方法，探究英语-泰米尔语语码转换文本中话语情感与语言选择之间的关系。我们基于DravidianCodeMix数据集的35,650条罗马化YouTube评论，使用微调后的XLM-RoBERTa模型进行词元级语言识别，并生成每句话的英语占比与语码转换频率指标。线性回归分析表明：在控制话语长度后，积极情感话语的英语占比（34.3%）显著高于消极情感话语（24.8%），而混合情感话语则呈现最高的语码转换频率。这些发现验证了研究假设——由于嵌入语与基质语言在社会语言学层面具有声望与身份认同的关联性，情感内容在多语语码转换环境中确实对语言选择产生显著影响。

Abstract

This paper investigates the relationship between utterance sentiment and language choice in English-Tamil code-switched text, using methods from machine learning and statistical modelling. We apply a fine-tuned XLM-RoBERTa model for token-level language identification on 35,650 romanized YouTube comments from the DravidianCodeMix dataset, producing per-utterance measurements of English proportion and language switch frequency. Linear regression analysis reveals that positive utterances exhibit significantly greater English proportion (34.3%) than negative utterances (24.8%), and mixed-sentiment utterances show the highest language switch frequency when controlling for utterance length. These findings support the hypothesis that emotional content demonstrably influences language choice in multilingual code-switching settings, due to socio-linguistic associations of prestige and identity with embedded and matrix languages.

Keywords: code-switching, sentiment analysis, language identification, XLM-RoBERTa, English-Tamil, socio-linguistic, machine learning, statistical modeling
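The two per-utterance measurements named in the abstract, English proportion and language switch frequency, can be illustrated with a small sketch over token-level language tags. The tag names (`"en"`, `"ta"`) and the exact definitions below are assumptions for illustration; the paper's own definitions may differ (for example, it may normalize switch counts by utterance length).

```python
def english_proportion(tags: list[str]) -> float:
    """Fraction of tokens tagged as English ('en'); 0.0 for an empty utterance."""
    if not tags:
        return 0.0
    return sum(tag == "en" for tag in tags) / len(tags)


def switch_count(tags: list[str]) -> int:
    """Number of adjacent token pairs whose language tags differ."""
    return sum(a != b for a, b in zip(tags, tags[1:]))


# Hypothetical tag sequence for one romanized comment ('ta' = Tamil, 'en' = English).
tags = ["ta", "ta", "en", "en", "ta", "en"]
proportion = english_proportion(tags)
switches = switch_count(tags)
print(proportion, switches)
```

In a study like this one, values such as these would be computed for every utterance and then regressed on the sentiment label with utterance length as a control.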

16. ❌ SAFT: Sensitivity-Aware Filtering and Transmission for Adaptive 3D Point Cloud Communication over Wireless Channels

Authors: Huda Adam Sirag Mekki, Hui Yuan, Mohanad M. G. Hassan, Zejia Chen, Guanghui Zhang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26197v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

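Score rationale: The paper studies adaptive transmission of 3D point clouds over wireless channels and proposes the SAFT framework, comprising encoding, filtering, quantization, and decoding modules. The keywords all concern large models, deep learning principles, or scientific applications, whereas this paper focuses on a specific engineering problem in point cloud communication and does not cover large models, language models, reasoning, alignment, agents, or similar topics. The only related keyword is “Quantization”, because the paper mentions a quantization block for discrete representation; however, this is quantization in the traditional signal-processing sense rather than low-bit weight compression for large models, hence 5 points (some relevance). All other keywords are entirely unrelated.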

!!! tip deepseek-chat TL;DR

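This paper addresses the reliability of 3D point cloud transmission over wireless channels by proposing the SAFT framework, which uses sensitivity-aware token filtering and quantization to significantly improve geometric fidelity at low SNR.

Abstract (Chinese translation)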

由于无线信道中时变的信噪比（SNR）与有限带宽的限制，三维点云的可靠传输面临挑战。本文提出一种感知敏感度的滤波与传输框架（SAFT），该学习型传输框架集成了受Point-BERT启发的编码器、敏感度引导的令牌滤波（STF）单元、量化模块以及用于自适应重建的SNR感知解码器。具体而言，STF模块根据每个令牌在信道扰动下的重建敏感度，为其分配令牌级重要性分数。我们进一步引入一种仅用于训练的符号使用惩罚机制，以稳定离散表示，同时不影响实际传输的有效载荷。在ShapeNet、ModelNet40和8iVFB数据集上的实验表明，与分离的信源—信道编码方案（G-PCC结合LDPC与QAM）及现有学习型基线方法相比，SAFT在几何保真度（D1/D2 PSNR）上均有提升，且在低SNR环境下增益最为显著，这凸显了其在有限带宽条件下增强的鲁棒性。

Abstract

Reliable transmission of 3D point clouds over wireless channels is challenging due to time-varying signal-to-noise ratio (SNR) and limited bandwidth. This paper introduces sensitivity-aware filtering and transmission (SAFT), a learned transmission framework that integrates a Point-BERT-inspired encoder, a sensitivity-guided token filtering (STF) unit, a quantization block, and an SNR-aware decoder for adaptive reconstruction. Specifically, the STF module assigns token-wise importance scores based on the reconstruction sensitivity of each token under channel perturbation. We further employ a training-only symbol-usage penalty to stabilize the discrete representation, without affecting the transmitted payload. Experiments on ShapeNet, ModelNet40, and 8iVFB show that SAFT improves geometric fidelity (D1/D2 PSNR) compared with a separate source–channel coding pipeline (G-PCC combined with LDPC and QAM) and existing learned baselines, with the largest gains observed in low-SNR regimes, highlighting improved robustness under limited bandwidth.

Keywords: 3D point cloud, wireless communication, adaptive transmission, sensitivity-aware filtering, quantization, SNR-aware decoder, geometric fidelity, bandwidth efficiency

17. ❌ EcoFair: Trustworthy and Energy-Aware Routing for Privacy-Preserving Vertically Partitioned Medical Inference

Authors: Mostafa Anoosha, Dhavalkumar Thakker, Kuniko Paxton, Koorosh Aslansefat, Bhupesh Kumar Mishra, Baseer Ahmad, Rameez Raja Kureshi Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26483v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

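Score rationale: The paper focuses on a privacy-preserving, energy-aware medical inference framework, involving edge computing, multimodal fusion, and a selective routing mechanism. The keywords all relate directly to specific techniques such as large-model technology, training methods, inference optimization, and agent systems, whereas this paper mentions no large models, language models, or related techniques. It is only weakly related to “AI for Science” OR “Bioinformatics” OR “Cheminformatics” (application to the medical domain), so that keyword scores 5 and the rest score 0.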

!!! tip deepseek-chat TL;DR

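This paper proposes the EcoFair framework, which uses a selective routing mechanism to balance diagnostic reliability against edge-deployment energy efficiency in privacy-preserving, vertically partitioned medical inference; experiments show that it significantly reduces inference energy while maintaining classification performance.

Abstract (Chinese translation)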

隐私保护医疗推理必须在数据本地性、诊断可靠性与部署效率之间取得平衡。本文提出EcoFair——一种面向皮肤病学诊断的模拟垂直分区推理框架，其中原始图像与表格数据均保留在本地，仅传输特定模态的嵌入向量至服务器端进行多模态融合。EcoFair引入了一种轻量化优先的路由机制，当本地不确定性或基于元数据推导的临床风险表明需要额外计算时，该机制会选择性地激活更复杂的图像编码器。路由决策综合了预测不确定性、安全-危险概率间隙，以及基于患者年龄与病灶定位生成的表格神经符号风险评分。在三个皮肤病学基准数据集上的实验表明，EcoFair能在典型模型配对中显著降低边缘侧推理能耗，同时保持具有竞争力的分类性能。结果进一步表明，选择性路由机制可在不改变全局训练目标的前提下，在典型场景中改善对亚组敏感的恶性病例识别表现。这些发现使EcoFair成为边缘部署约束下兼顾隐私保护与能耗感知的实用医疗推理框架。

Abstract

Privacy-preserving medical inference must balance data locality, diagnostic reliability, and deployment efficiency. This paper presents EcoFair, a simulated vertically partitioned inference framework for dermatological diagnosis in which raw image and tabular data remain local and only modality-specific embeddings are transmitted for server-side multimodal fusion. EcoFair introduces a lightweight-first routing mechanism that selectively activates a heavier image encoder when local uncertainty or metadata-derived clinical risk indicates that additional computation is warranted. The routing decision combines predictive uncertainty, a safe–danger probability gap, and a tabular neurosymbolic risk score derived from patient age and lesion localisation. Experiments on three dermatology benchmarks show that EcoFair can substantially reduce edge-side inference energy in representative model pairings while remaining competitive in classification performance. The results further indicate that selective routing can improve subgroup-sensitive malignant-case behaviour in representative settings without modifying the global training objective. These findings position EcoFair as a practical framework for privacy-preserving and energy-aware medical inference under edge deployment constraints.

Keywords: privacy-preserving medical inference, vertically partitioned inference, energy-aware routing, multimodal fusion, selective routing, dermatological diagnosis, edge deployment, inference efficiency
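The lightweight-first routing idea described in the abstract can be sketched roughly as below. This is not EcoFair's actual rule: the abstract only says the decision combines predictive uncertainty, a safe–danger probability gap, and a tabular risk score. The OR-combination, the thresholds (`tau_u`, `tau_gap`, `tau_risk`), and the function names are all hypothetical.

```python
import math


def normalized_entropy(probs: list[float]) -> float:
    """Shannon entropy divided by log(num_classes), so the result lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def route_to_heavy(
    probs: list[float],       # class probabilities from the lightweight encoder
    danger_idx: set[int],     # indices of classes treated as "danger" (e.g. malignant)
    risk_score: float,        # tabular risk score in [0, 1] (hypothetical scale)
    tau_u: float = 0.5,       # hypothetical uncertainty threshold
    tau_gap: float = 0.3,     # hypothetical safe-danger gap threshold
    tau_risk: float = 0.7,    # hypothetical risk threshold
) -> bool:
    """Return True when the heavier image encoder should be activated."""
    p_danger = sum(probs[i] for i in danger_idx)
    gap = abs((1.0 - p_danger) - p_danger)  # |P(safe) - P(danger)|
    uncertain = normalized_entropy(probs) > tau_u
    return uncertain or gap < tau_gap or risk_score > tau_risk


confident_benign = route_to_heavy([0.96, 0.02, 0.02], {2}, risk_score=0.1)
ambiguous = route_to_heavy([0.40, 0.25, 0.35], {2}, risk_score=0.1)
high_risk = route_to_heavy([0.96, 0.02, 0.02], {2}, risk_score=0.9)
print(confident_benign, ambiguous, high_risk)
```

Under these assumed thresholds, a confident low-risk prediction stays on the lightweight path, while an ambiguous prediction or a high tabular risk score triggers the heavier encoder, which is where the energy saving would come from.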

18. ❌ Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Authors: Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26648v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

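Score rationale: The paper explicitly mentions “large language models” and “coding agents”, which makes it highly relevant to “Large Language Models OR LLMs OR Foundation Models” and “LLM Agents OR Autonomous Agents OR Agentic Workflow”, hence a relevance of 10 for each. Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and AI for Science, are either not mentioned in the abstract or unrelated to the paper's topic, and receive 0.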

!!! tip deepseek-chat TL;DR

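This paper proposes the Vision2Web benchmark for systematically evaluating visual language models on hierarchical tasks ranging from static UI to full-stack website development, and finds that existing models still show a significant gap on full-stack development.

Abstract (Chinese translation)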

近期大语言模型的进展提升了代码智能体的能力，但对复杂端到端网站开发的系统性评估仍显不足。为填补这一空白，我们提出了Vision2Web——一个用于视觉化网站开发的分层基准测试，涵盖从静态用户界面到代码生成、交互式多页面前端复现，到长周期全栈网站开发的全过程。该基准基于真实网站构建，共包含16个类别的193项任务，涉及918张原型图像和1,255个测试用例。为支持灵活、全面且可靠的评估，我们提出了基于工作流的智能体验证范式，该范式包含两个互补组件：图形用户界面（GUI）智能体验证器和基于视觉语言模型（VLM）的评判器。我们对在不同编码智能体框架下实例化的多个视觉语言模型进行了评估，结果显示所有任务层级均存在显著性能差距，即使最先进的模型在全栈开发任务上仍面临严峻挑战。

Abstract

Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning from static UI-to-code generation, interactive multi-page frontend reproduction, to long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough and reliable evaluation, we propose workflow-based agent verification paradigm based on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.

Keywords: Vision2Web, visual website development, large language models, coding agents, benchmark, full-stack development, agent verification, visual language models

19. ❌ PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Authors: Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian, Zuyan Liu, Yushi Hu, Haoning Wu, Yuhao Dong, Benlin Liu, Ziwei Liu, Ranjay Krishna Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26653v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

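Score rationale: The paper mainly concerns the construction and evaluation of a video perception reasoning benchmark and has no direct connection to most large-model keywords (LLM architecture, training methods, optimization techniques, and so on). It is moderately related (5 points) only to the reasoning keywords (Chain of Thought, System 2 Thinking), because it emphasizes complex, long-horizon reasoning without explicitly involving those specific techniques. Other keywords (such as AI for Science) belong to AI applications broadly, but the paper does not focus on scientific domains, so they are scored 0.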

!!! tip deepseek-chat TL;DR

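This work proposes PerceptionComp, a video benchmark for evaluating complex, long-horizon, perception-centric video reasoning. It finds that state-of-the-art multimodal large models perform significantly worse on it than on existing benchmarks, indicating that perception-centric long-horizon video reasoning remains a major bottleneck.

Abstract (Chinese translation)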

我们推出PerceptionComp，这是一个用于复杂、长时程、以感知为中心的视频推理的人工标注基准。该基准的设计确保单一时刻信息不足：回答每个问题均需整合多个时间上分散的视觉证据，并在联合与顺序逻辑下满足组合约束，其范围涵盖物体、属性、关系、位置、动作及事件等感知子任务，并要求具备语义识别、视觉对应、时序推理和空间推理等能力。该基准包含来自城市漫步导览、室内别墅导览、电子游戏及极限户外运动等多个领域的279个视频，共1,114道高复杂度问题，全部采用人工标注。人类实验表明，PerceptionComp需要大量的实时思考与重复感知步骤：参与者耗时远超现有基准，且在禁止重复观看时准确率降至接近随机水平（18.97%）。当前最先进的多模态大语言模型在PerceptionComp上的表现也显著低于现有基准：评估中最佳模型Gemini-3-Flash在五选一设置下仅达到45.96%的准确率，而开源模型均低于40%。这些结果表明，以感知为中心的长时程视频推理仍是主要瓶颈，我们期望PerceptionComp能推动感知推理领域的进步。

Abstract

We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.

Keywords: video reasoning, perception-centric, long-horizon, benchmark, multimodal large language models, temporal reasoning, spatial reasoning, complex questions

20. ❌ Ruka-v2: Tendon Driven Open-Source Dexterous Hand with Wrist and Abduction for Robot Learning

Authors: Xinqi, Liu, Ruoxi Hu, Alejandro Ojeda Olarte, Zhuoran Chen, Kenny Ma, Charles Cheng Ji, Lerrel Pinto, Raunaq Bhirangi, Irmak Guzey Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26660v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

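Score rationale: The paper focuses on robot hardware design (the open-source, tendon-driven humanoid dexterous hand Ruka-v2) and robot learning applications (teleoperation and autonomous policy learning). The keywords all concern large models, deep learning principles, or AI-for-science applications, whereas this paper is purely robot hardware engineering research and does not involve any large-model, deep learning, or AI-for-science content, so the relevance of every keyword is 0.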

!!! tip deepseek-chat TL;DR

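This paper addresses the bottleneck of robot dexterous-hand hardware lacking accessibility and human-like dexterity by designing Ruka-v2, an open-source, tendon-driven humanoid dexterous hand that adds wrist and finger abduction/adduction degrees of freedom. In teleoperated tasks it achieves a 51.3% reduction in completion time and a 21.2% increase in success rate, and the paper demonstrates its use in robot learning.

Abstract (Chinese translation)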

缺乏易于获取且灵巧的机器人硬件一直是机器人实现人类水平灵巧性的重大瓶颈。去年，我们发布了Ruka——一款完全开源、肌腱驱动的人形手，具有11个自由度（每根手指2个，拇指3个），制造成本低于1300美元。它是最早完全开源的人形手之一，并引入了一种新颖的数据驱动手指控制方法，在控制系统内捕捉肌腱动力学。尽管有这些贡献，Ruka仍缺少两个对于紧密模仿人类行为至关重要的自由度：腕部活动度以及手指的内收/外展。在本文中，我们介绍Ruka-v2：一款完全开源、肌腱驱动的人形手，其特点是具有解耦的2自由度并联腕部以及手指的外展/内收功能。并联腕部增加了平滑、独立的屈曲/伸展和桡偏/尺偏，使得在诸如橱柜等受限环境中的操作成为可能。外展功能则支持抓取薄物体、手内旋转和书法等动作。我们介绍了Ruka-v2的设计，并通过遥操作任务的用户研究将其与Ruka进行比较评估，发现任务完成时间减少了51.3%，成功率提高了21.2%。我们进一步展示了其在机器人学习方面的全方位应用：跨越13项灵巧任务的双臂和单臂遥操作，以及在3项任务上的自主策略学习。所有3D打印文件、组装说明、控制器软件和视频均可在https://ruka-hand-v2.github.io/ 获取。

Abstract

Lack of accessible and dexterous robot hardware has been a significant bottleneck to achieving human-level dexterity in robots. Last year, we released Ruka, a fully open-sourced, tendon-driven humanoid hand with 11 degrees of freedom - 2 per finger and 3 at the thumb - buildable for under $1,300. It was one of the first fully open-sourced humanoid hands, and introduced a novel data-driven approach to finger control that captures tendon dynamics within the control system. Despite these contributions, Ruka lacked two degrees of freedom essential for closely imitating human behavior: wrist mobility and finger adduction/abduction. In this paper, we introduce Ruka-v2: a fully open-sourced, tendon-driven humanoid hand featuring a decoupled 2-DOF parallel wrist and abduction/adduction at the fingers. The parallel wrist adds smooth, independent flexion/extension and radial/ulnar deviation, enabling manipulation in confined environments such as cabinets. Abduction enables motions such as grasping thin objects, in-hand rotation, and calligraphy. We present the design of Ruka-v2 and evaluate it against Ruka through user studies on teleoperated tasks, finding a 51.3% reduction in completion time and a 21.2% increase in success rate. We further demonstrate its full range of applications for robot learning: bimanual and single-arm teleoperation across 13 dexterous tasks, and autonomous policy learning on 3 tasks. All 3D print files, assembly instructions, controller software, and videos are available at https://ruka-hand-v2.github.io/ .

Keywords: tendon-driven dexterous hand, open-source robot hardware, humanoid hand design, robot learning, teleoperation, autonomous policy learning, degrees of freedom, wrist mobility

21. ❌ Machine Learning Transferability for Malware Detection

Authors: César Vieira, João Vitorino, Eva Maia, Isabel Praça Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26632v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

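Score rationale: The paper focuses on applying traditional machine learning (ML) to malware detection, in particular data preprocessing and feature unification, to address feature compatibility and generalization across datasets. The abstract and title mention no large models (LLMs), deep learning, or related techniques (such as MoE, RLHF, or RAG), nor any specific AI for Science subfield (such as bioinformatics). The keywords all relate to large-model techniques, deep learning principles, or scientific AI applications, whereas this paper studies traditional ML methods, so the relevance of every keyword is 0.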

!!! tip deepseek-chat TL;DR

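This study evaluates how well different data preprocessing approaches unify the features of malware detection datasets, with the aim of improving ML models' generalization under distribution shift and their transferability across datasets.

Abstract (Chinese translation)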

恶意软件持续构成组织面临的主要操作风险，尤其在攻击者使用混淆技术规避检测时。尽管机器学习（ML）检测方法的开发不断推进，但公共数据集仍普遍存在特征兼容性不足的问题。这限制了模型在面临分布偏移时的泛化能力，以及在不同数据集间的可迁移性。本研究评估了不同数据预处理方法在结合机器学习模型检测可移植可执行（Portable Executable, PE）文件时的适用性。预处理流程统一了EMBERv2（2,381维特征）数据集，并在两种训练设置下训练配对模型：EMBER + BODMAS 以及 EMBER + BODMAS + ERMDS。在模型评估方面，两种模型均在TRITIUM、INFERNO和SOREL-20M数据集上进行测试。此外，针对EMBER + BODMAS训练设置，也使用ERMDS数据集进行了测试。

Abstract

Malware continues to be a predominant operational risk for organizations, especially when obfuscation techniques are used to evade detection. Despite the ongoing efforts in the development of Machine Learning (ML) detection approaches, there is still a lack of feature compatibility in public datasets. This limits generalization when facing distribution shifts, as well as transferability to different datasets. This study evaluates the suitability of different data preprocessing approaches for the detection of Portable Executable (PE) files with ML models. The preprocessing pipeline unifies EMBERv2 (2,381-dim) features datasets, trains paired models under two training setups: EMBER + BODMAS and EMBER + BODMAS + ERMDS. Regarding model evaluation, both EMBER + BODMAS and EMBER + BODMAS + ERMDS models are tested against TRITIUM, INFERNO and SOREL-20M. ERMDS is also used for testing for the EMBER + BODMAS setup.

关键词: Machine Learning, Malware Detection, Transferability, Data Preprocessing, Feature Compatibility, Portable Executable, Generalization, Distribution Shift

22. ❌ Make Geometry Matter for Spatial Reasoning

Authors: Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26639v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
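
The arithmetic behind the score tables can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration; that a keyword's score equals weight × relevance is an assumption, since the report does not state it. The function names `pass_mark` and `paper_score` are hypothetical.

```python
# Sketch of the report's scoring arithmetic.
# Assumption (not stated in the report): keyword score = weight * relevance,
# capped at the configured per-keyword maximum.

def pass_mark(total_weight: float) -> float:
    """Pass mark from the report's configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows, max_per_keyword: float = 15.0) -> float:
    """Sum per-keyword scores; each row is a (weight, relevance) pair."""
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


# 27 keywords with weight 1.0 each, as listed in the configuration.
threshold = pass_mark(27 * 1.0)

# The three non-zero relevance values from the table above, each listed with weight 0.0.
rows = [(0.0, 5.0), (0.0, 5.0), (0.0, 8.0)]
total = paper_score(rows)

print(round(threshold, 1), total, total >= threshold)
```

Under this assumption, a weight of 0.0 forces the keyword score to 0.0 regardless of relevance, which matches the Score column in the table.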

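Scoring rationale: The paper studies the spatial reasoning ability of vision-language models (VLMs) and improves it through geometry token fusion and fine-tuning. Relevance by keyword: (1) "Large Language Models / Foundation Models" receives 5: the paper mentions "vision-language models" and "3D foundation models", which are foundation models but not text-only LLMs; (2) "Pre-training / Domain Adaptation" receives 5: the work involves pretrained 3D foundation models and adaptation to the spatial reasoning domain; (3) "Post-training / Supervised Fine-tuning" receives 8: the core contribution is a fine-tuning framework (GeoSR) comprising a training strategy (Geometry-Unleashing Masking) and a fusion mechanism, which falls under supervised fine-tuning. The remaining keywords (MoE, SLMs, Scaling Laws, RLHF, RAG, Agents, and so on) have no direct connection to the paper and receive 0. The paper does not involve any of the designated expert authors.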

!!! tip deepseek-chat TL;DR

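To address the limited spatial reasoning of vision-language models in static and dynamic scenes, the paper proposes GeoSR, a framework that uses Geometry-Unleashing Masking and Geometry-Guided Fusion to make effective use of geometric information and reaches state-of-the-art performance on multiple benchmarks.

Abstract translation (Chinese)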

得益于大规模训练，视觉语言模型（VLMs）在图像和视频理解方面展现出强大能力，但其在静态场景与动态视频中进行空间推理的能力仍然有限。近期研究尝试通过将预训练三维基础模型中的几何标记（geometry tokens）注入VLMs以应对这一局限。然而我们观察到，此类工作中简单的标记融合与标准微调往往未能充分利用几何线索进行空间推理，因为VLMs倾向于严重依赖二维视觉线索。本文提出GeoSR框架，旨在通过激励VLMs主动运用几何标记进行推理，使几何信息真正发挥作用。GeoSR包含两个核心组件：（1）几何激发掩码（Geometry-Unleashing Masking），通过在训练中策略性地掩蔽部分二维视觉标记，削弱非几何捷径，迫使模型借助几何标记进行空间推理；（2）几何引导融合（Geometry-Guided Fusion），采用门控路由机制，在几何证据关键区域自适应增强几何标记的贡献度。这些设计共同释放了几何标记在空间推理任务中的潜力。在静态与动态空间推理基准上的大量实验表明，GeoSR通过有效利用几何信息，持续超越现有方法并建立了新的性能标杆。项目页面详见https://suhzhang.github.io/GeoSR/。

Abstract

Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.

Keywords: vision-language models, spatial reasoning, geometry tokens, fine-tuning, 3D foundation models, GeoSR framework, static and dynamic scenes, state-of-the-art performance
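
The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract describes the masking as "strategic" and the fusion as a "gated routing mechanism" without further detail, so the uniform random masking, the per-token sigmoid gate, and the names `mask_vision_tokens` and `gated_fusion` are all assumptions.

```python
import math
import random


def mask_vision_tokens(vision_tokens, mask_ratio, rng):
    """Zero out a random subset of 2D vision tokens (assumed uniform random masking)."""
    n = len(vision_tokens)
    masked_idx = set(rng.sample(range(n), int(n * mask_ratio)))
    return [[0.0] * len(tok) if i in masked_idx else list(tok)
            for i, tok in enumerate(vision_tokens)]


def gated_fusion(vision_tokens, geometry_tokens, gate_logits):
    """Add geometry tokens to vision tokens, scaled by a per-token sigmoid gate."""
    fused = []
    for v, g, logit in zip(vision_tokens, geometry_tokens, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-logit))  # gate in (0, 1)
        fused.append([vi + gate * gi for vi, gi in zip(v, g)])
    return fused


rng = random.Random(0)
vision = [[1.0, 1.0] for _ in range(4)]       # hypothetical 2D vision tokens
geometry = [[0.5, -0.5] for _ in range(4)]    # hypothetical geometry tokens
masked = mask_vision_tokens(vision, mask_ratio=0.5, rng=rng)
fused = gated_fusion(masked, geometry, gate_logits=[0.0, 2.0, -2.0, 0.0])
print(fused)
```

In a trained model the gate logits would presumably come from a learned layer, and the masking would apply only during training.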

23. ❌ Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling

Authors: Ruixing Zhang, Hanzhang Jiang, Leilei Sun, Liangzhe Han, Jibin Wang, Weifeng Lv Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26610v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

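Scoring rationale: The paper studies the Sig2GPS problem of reconstructing GPS trajectories from cellular signaling. It takes an image-to-video generation approach and involves video generation models, reinforcement learning-based optimization, and trajectory data mining. All keywords focus on large language models (LLMs) and related techniques (such as MoE, SFT, RAG, and quantization) or on specific scientific AI applications (such as bioinformatics). The paper does not mention LLMs, innovations in deep learning principles, or applications of large models to science; its core is video generation and trajectory reconstruction, which is unrelated to the keyword topics.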

!!! tip deepseek-chat TL;DR

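The paper reframes the Sig2GPS problem, reconstructing GPS trajectories from cellular signaling, as an image-to-video generation task. By fine-tuning a video generation model and optimizing it with reinforcement learning, it reports substantial improvements over baselines along with evidence of scalability.

Abstract translation (Chinese)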

移动设备持续与蜂窝基站交互，产生海量信令记录，为理解人类移动性提供了广泛覆盖。然而，此类记录仅提供粗略的位置线索（例如服务小区标识符），因此限制了其在需要高精度GPS轨迹的应用中的直接使用。本文研究Sig2GPS问题：从蜂窝信令中重建GPS轨迹。受领域专家常将信令轨迹置于地图上并勾勒对应GPS路线的启发，与依赖复杂多阶段工程流程或回归坐标的传统解决方案不同，Sig2GPS被重新定义为一种在地图视觉域中直接操作的图像到视频生成任务：将信令轨迹渲染在地图上，并训练视频生成模型以绘制连续的GPS路径。为支持这一范式，我们构建了一个配对的信令-轨迹视频数据集，用于微调开源视频模型，并引入了一种基于轨迹感知强化学习的优化方法，通过奖励机制提升生成保真度。在大规模真实数据集上的实验表明，该方法相较于强工程化和基于学习的基线模型有显著提升，同时在下一GPS预测任务上的额外结果也证明了其可扩展性和跨城市迁移能力。总体而言，这些结果表明，地图视觉视频生成为轨迹数据挖掘提供了一个实用接口，使得在地图约束下直接生成和优化连续路径成为可能。

Abstract

Mobile devices continuously interact with cellular base stations, generating massive volumes of signaling records that provide broad coverage for understanding human mobility. However, such records offer only coarse location cues (e.g., serving-cell identifiers) and therefore limit their direct use in applications that require high-precision GPS trajectories. This paper studies the Sig2GPS problem: reconstructing GPS trajectories from cellular signaling. Inspired by how domain experts often lay the signaling trace on a map and sketch the corresponding GPS route, and unlike conventional solutions that rely on complex multi-stage engineering pipelines or regress coordinates, Sig2GPS is reframed as an image-to-video generation task that directly operates in the map-visual domain: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. To support this paradigm, a paired signaling-to-trajectory video dataset is constructed to fine-tune an open-source video model, and a trajectory-aware reinforcement learning-based optimization method is introduced to improve generation fidelity via rewards. Experiments on large-scale real-world datasets show substantial improvements over strong engineered and learning-based baselines, while additional results on next GPS prediction indicate scalability and cross-city transferability. Overall, these results suggest that map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints.

Keywords: GPS trajectory reconstruction, cellular signaling, image-to-video generation, video generation model, reinforcement learning optimization, trajectory data mining, map-visual domain, Sig2GPS

24. ❌ Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow

Authors: Ziyue Zeng, Xun Su, Haoyuan Liu, Bingyu Lu, Yui Tatsumi, Hiroshi Watanabe Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26571v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

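Scoring rationale: The paper proposes GVC, a zero-shot video compression framework that uses a pretrained video generative model directly as the codec and enables codebook-driven compression by converting the deterministic rectified-flow ODE into an SDE. Its core is the application of video generative models to compression, which belongs to the generative modeling field and is unrelated to most keywords (LLMs, MoE, SLMs, alignment, reasoning, agents, and so on). The only relevant keyword is "Pre-training OR Continual Pre-training OR Domain Adaptation": the paper uses pretrained video foundation models and involves domain adaptation (adapting a generative model to the compression task). Because this is not the paper's core contribution, it receives 5 (some relevance). No other keyword applies.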

!!! tip deepseek-chat TL;DR

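The paper proposes GVC, a zero-shot video compression framework that converts the deterministic ODE of a pretrained video generative model into an SDE so that the model itself serves as the codec, achieving high-quality reconstruction and flexible bitrate control.

Abstract translation (Chinese)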

现有生成式视频压缩方法仅将生成模型用作传统编解码器之上的事后重建模块。我们提出生成式视频编解码器（Generative Video Codec，GVC），这是一种零样本框架，可将预训练的视频生成模型直接转化为编解码器本身：传输的比特流直接指定生成式解码轨迹，无需重新训练。为实现这一目标，我们在推理时将现代视频基础模型的确定性修正流常微分方程（rectified-flow ODE）转换为等效的随机微分方程（SDE），从而解锁基于码本驱动的压缩所需的每步随机注入点。基于这一统一主干，我们实例化了三种互补的条件调节策略——采用自适应尾帧原子分配的图像到视频（Image-to-Video，I2V）、以近乎零边信息运行的文本到视频（Text-to-Video，T2V）作为纯生成先验，以及采用边界共享图像组（GOP）链式结构以实现双锚点时序控制的首尾帧到视频（First-Last-Frame-to-Video，FLF2V）。这些变体共同构成了空间保真度、时序连贯性与压缩效率之间原则性的权衡空间。在标准基准测试上的实验表明，GVC能够在低于0.002 bpp的码率下实现高质量重建，同时通过单一超参数支持灵活的码率控制。

Abstract

Existing generative video compression methods use generative models only as post-hoc reconstruction modules atop conventional codecs. We propose Generative Video Codec (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies: Image-to-Video (I2V) with adaptive tail-frame atom allocation, Text-to-Video (T2V) operating at near-zero side information as a pure generative prior, and First-Last-Frame-to-Video (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVC achieves high-quality reconstruction below 0.002 bpp while supporting flexible bitrate control through a single hyperparameter.

Keywords: Generative Video Compression, Zero-shot Coding, Rectified Flow, Stochastic Differential Equation, Video Foundation Models, Bitstream, Codebook-driven Compression, Temporal Coherence
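
The ODE-to-SDE conversion mentioned in the abstract can be illustrated with a standard identity from the score-based generative modeling literature. The abstract does not give the paper's parameterization, so the symbols below ($v_\theta$ for the learned velocity, $p_t$ for the marginal density at time $t$, $g(t)$ for a noise scale, $w_t$ for a Wiener process) are assumptions for illustration. If the deterministic flow, integrated in the direction of increasing $t$, transports samples with marginals $p_t$, then for any $g(t) \ge 0$ the SDE below has the same marginals:

```latex
\begin{aligned}
\text{ODE:}\quad & \mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t, \\
\text{SDE:}\quad & \mathrm{d}x_t = \Big[\, v_\theta(x_t, t) + \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x_t) \Big]\mathrm{d}t + g(t)\,\mathrm{d}w_t .
\end{aligned}
```

Setting $g \equiv 0$ recovers the ODE. A non-zero $g$ adds a fresh Gaussian increment at every step, which is the kind of per-step stochastic injection point the abstract refers to; how GVC selects those increments from a codebook is not described in the abstract.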

25. ❌ Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

Authors: Eziyo Ehsani, Luca Giamattei, Ivano Malavolta, Roberto Pietrantuono Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26603v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

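Scoring rationale: The paper is an empirical study of the performance, energy, and privacy trade-offs that arise when large language models (LLMs) move from the cloud to edge devices (On-device AI/SLMs). It is highly relevant to: (1) LLMs/Foundation Models (explicitly named, with migration as the research question); (2) Mixture of Experts (MoE) (the study finds that MoE architectures have an energy advantage); (3) Small Language Models/On-device AI (on-device AI is the focus, with models from 0.5B to 9B parameters tested); (4) Quantization/Model Compression (the study examines quantization and its effect on energy in depth and identifies a quantization-energy paradox). The paper does not touch the other keywords, such as training methods, reasoning techniques, alignment, agents, or scientific AI applications.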

!!! tip deepseek-chat TL;DR

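The paper studies the trade-offs among generation quality, latency, energy consumption, and memory footprint when moving large language models to mobile devices. Its empirical analysis finds that importance-aware quantization yields negligible energy savings over standard mixed-precision methods, that MoE architectures hold more capacity at a low energy cost, and that mid-sized models such as Qwen2.5-3B are a practical balance between quality and sustainable energy use.

Abstract translation (Chinese)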

大型语言模型（LLM）从云端集群向边缘设备的迁移有望提升隐私保护与离线可用性，但这一转型面临着严峻的现实挑战：移动设备的电池物理限制、热约束，以及最关键的内存限制。为探索这一领域，我们构建了一套可复现的实验流程，用以剖析能耗、延迟与生成质量之间复杂的相互作用。与理论研究不同，我们在无需获取系统根权限的条件下，对参数量从0.5B到9B的八种模型进行了细粒度功耗指标采集，确保研究结果反映真实用户场景。我们利用该流程在旗舰安卓设备三星Galaxy S25 Ultra上进行了实证案例研究，建立了关于生成质量、性能与资源消耗之间权衡的基础假设。我们的研究揭示了一个反直觉的量化-能耗悖论：尽管现代感知重要性的量化技术能有效减少内存占用，使更大模型得以载入RAM，但我们发现，与标准混合精度方法相比，其在节能方面效果微乎其微。这证明，对于电池续航而言，模型架构而非其量化方案才是决定性因素。我们进一步发现，混合专家模型（Mixture-of-Experts, MoE）架构打破了常规的规模-能耗关系趋势，它能提供相当于7B模型的存储容量，同时保持1B至2B模型的较低能耗水平。最后，对这些多目标权衡的分析揭示了一个实用的最佳平衡点——中等规模模型（如Qwen2.5-3B），它们能在响应质量与可持续能耗之间实现有效平衡。

Abstract

The migration of Large Language Models (LLMs) from cloud clusters to edge devices promises enhanced privacy and offline accessibility, but this transition encounters a harsh reality: the physical constraints of mobile batteries, thermal limits, and, most importantly, memory constraints. To navigate this landscape, we constructed a reproducible experimental pipeline to profile the complex interplay between energy consumption, latency, and quality. Unlike theoretical studies, we captured granular power metrics across eight models ranging from 0.5B to 9B parameters without requiring root access, ensuring our findings reflect realistic user conditions. We harness this pipeline to conduct an empirical case study on a flagship Android device, the Samsung Galaxy S25 Ultra, establishing foundational hypotheses regarding the trade-offs between generation quality, performance, and resource consumption. Our investigation uncovered a counter-intuitive quantization-energy paradox. While modern importance-aware quantization successfully reduces memory footprints to fit larger models into RAM, we found it yields negligible energy savings compared to standard mixed-precision methods. This proves that for battery life, the architecture of the model, not its quantization scheme, is the decisive factor. We further identified that Mixture-of-Experts (MoE) architectures defy the standard size-energy trend, offering the storage capacity of a 7B model while maintaining the lower energy profile of a 1B to 2B model. Finally, an analysis of these multi-objective trade-offs reveals a pragmatic sweet spot of mid-sized models, such as Qwen2.5-3B, that effectively balance response quality with sustainable energy consumption.

Keywords: Large Language Models (LLMs), On-device AI, Energy consumption, Quantization, Mixture-of-Experts (MoE), Model compression, Edge computing, Performance trade-offs
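
The abstract's point that quantization shrinks the memory footprint enough to fit larger models into RAM can be made concrete with back-of-the-envelope arithmetic. The sketch below is an approximation only: the 16-bit and 4-bit widths are illustrative assumptions (the abstract does not name bit widths), the parameter counts are nominal, and activations, KV cache, and quantization metadata are ignored. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB (weights only, no runtime overhead)."""
    return num_params * bits_per_weight / 8 / 2**30


# Nominal sizes spanning the 0.5B to 9B range studied in the paper.
for name, params in [("0.5B", 0.5e9), ("3B", 3e9), ("9B", 9e9)]:
    fp16 = weight_memory_gib(params, 16)  # assumed 16-bit baseline
    q4 = weight_memory_gib(params, 4)     # assumed 4-bit quantization
    print(f"{name}: 16-bit ~ {fp16:.2f} GiB, 4-bit ~ {q4:.2f} GiB")
```

This estimate covers memory only; the paper's finding is that such memory savings do not translate into comparable energy savings.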

26. ❌ Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation

Authors: Einari Vaaras, Manu Airaksinen, Okko Räsänen Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26592v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

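Scoring rationale: The paper studies sample selection strategies for annotating biomedical time-series data, comparing random sampling, farthest-first traversal, and an interactive method based on 2D visualization. It is unrelated to most keywords (LLM principles, training methods, inference optimization, alignment, agent systems, and so on), which focus on large language models and deep learning techniques themselves. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper involves machine learning applications in the biomedical domain (infant motility assessment and speech emotion recognition). Since the focus is on annotation strategy rather than on the AI model itself, relevance is moderate and the keyword receives 5.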

!!! tip deepseek-chat TL;DR

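The paper compares three sample selection strategies for biomedical time-series annotation (random sampling, farthest-first traversal, and an interactive 2D-visualization method). The 2D-visualization method performs best when labels are aggregated across annotators but shows higher annotator-to-annotator variability in label distribution, while random sampling is the safest choice when annotator count or expertise is uncertain.

Abstract translation (Chinese)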

在生物医学领域，可靠的机器学习模型依赖于准确的标签，然而生物医学时间序列数据的标注仍具挑战性。算法样本选择可能辅助标注，但来自真实人类标注者参与研究的证据尚不充分。为此，我们比较了三种用于标注的样本选择方法：随机抽样（RND）、最远优先遍历（FAFT），以及一种基于图形用户界面的方法，该方法支持对高维数据的互补二维可视化（2DVs）进行探索。我们在婴儿运动评估（IMA）和语音情感识别（SER）的四项分类任务中评估了这些方法。十二名被分为专家与非专家的标注者在有限标注预算下执行数据标注，并进行了标注后实验以评估抽样方法。在所有分类任务中，当汇总各标注者的标签时，2DV方法表现最佳。在IMA任务中，2DV最有效地捕捉了稀有类别，但也因有限的标注预算而表现出更大的标注者间标签分布变异性，导致基于单个标注者标签训练的模型分类性能下降；在此情况下，FAFT方法表现更优。对于SER任务，在专家标注者中2DV优于其他方法，在个体标注者设置下对于非专家标注者也能达到与之相当的性能。失败风险分析表明，当标注者数量或专业水平不确定时，RND是最安全的选择，而2DV则因其更高的标签分布变异性具有最高风险。此外，实验后访谈表明，2DV使标注任务变得更有趣和令人愉悦。总体而言，基于2DV的抽样方法在生物医学时间序列数据标注中展现出潜力，尤其是在标注预算约束不严苛的情况下。

Abstract

Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators’ labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.

Keywords: biomedical time-series data, data annotation, sample selection, interactive visualization, infant motility assessment, speech emotion recognition, label distribution variability, annotation budget
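
Farthest-first traversal (FAFT), one of the three selection methods compared in the abstract, is a well-known greedy procedure: repeatedly pick the point farthest from everything picked so far. The sketch below is a generic version, not the paper's code; the Euclidean distance, the fixed starting index, and the toy 2D points are assumptions, since the abstract does not describe the feature space or metric.

```python
import math


def farthest_first_traversal(points, budget, start=0):
    """Greedily select up to `budget` indices, each farthest from those already selected."""
    selected = [start]
    chosen = {start}
    # Distance from every point to its nearest selected point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < min(budget, len(points)):
        idx = max((i for i in range(len(points)) if i not in chosen),
                  key=lambda i: nearest[i])
        selected.append(idx)
        chosen.add(idx)
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return selected


# Hypothetical 2D feature vectors for five samples.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0), (0.0, 4.0)]
print(farthest_first_traversal(points, budget=3))  # prints [0, 3, 2]
```

With a small annotation budget, this kind of selection spreads the labeled samples across the feature space, in contrast to random sampling, which follows the data density.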

27. ❌ Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

Authors: Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26551v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

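Scoring rationale: The paper focuses on efficient backbone design in computer vision. It proposes LowFormer, a new family of vision backbones, and introduces Lowtention, a lightweight attention module. All scoring keywords concern large language models (LLMs), innovations in deep learning principles, or scientific applications, whereas this paper studies efficiency optimization of vision backbones, a computer vision topic with no direct connection to the LLM techniques or scientific AI applications named in the keywords. Every keyword therefore receives a relevance of 0.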

!!! tip deepseek-chat TL;DR

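Addressing the shortcomings of MACs as an efficiency metric for vision backbones, the paper proposes LowFormer, a family of efficient vision backbones built around the lightweight attention module Lowtention, and reports better speed and accuracy on ImageNet and other tasks.

Abstract translation (Chinese)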

视觉骨干网络在现代计算机视觉中扮演着核心角色。提升其效率可直接惠及广泛的下游应用。为衡量效率，许多研究依赖MACs（乘积累加运算）作为执行时间的预测指标。本文通过实验揭示了该指标的不足，尤其在边缘设备环境下。通过对比常见架构设计元素的MAC数量与执行时间，我们识别了高效执行的关键因素，并为优化骨干网络设计提供了见解。基于这些见解，我们提出了LowFormer，一种新颖的视觉骨干网络系列。LowFormer采用精简的宏观与微观设计，其中包括Lowtention——一种轻量化的多头自注意力（Multi-Head Self-Attention）替代方案。Lowtention不仅被证明更高效，还能在ImageNet上实现更优的结果。此外，我们推出了LowFormer的边缘GPU版本，可进一步提升其在边缘GPU和桌面GPU上的基准速度。通过在小型图像分类数据集上的评估，以及将其适配至多个下游任务（如目标检测、语义分割、图像检索和视觉目标跟踪），我们证明了LowFormer的广泛适用性。与当前先进的骨干网络相比，LowFormer模型在各种硬件平台上均能持续实现显著的加速效果。代码与模型发布于https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md。

Abstract

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, that can further improve upon its baseline’s speed on edge GPU and desktop GPU. We demonstrate LowFormer’s wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.

Keywords: Vision Backbones, Efficiency, MACs, LowFormer, Lowtention, Edge Devices, Image Classification, Downstream Tasks
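
For readers unfamiliar with the metric the abstract critiques, the sketch below shows how MACs are usually counted for a 2D convolution. The layer shape is hypothetical and not taken from the paper, and `conv2d_macs` is an illustrative helper.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """Multiply-accumulate count of a 2D convolution (bias ignored)."""
    return h_out * w_out * (c_in // groups) * c_out * kernel * kernel


# Hypothetical layer: 56x56 output, 64 -> 64 channels.
standard = conv2d_macs(56, 56, 64, 64, kernel=3)
depthwise = conv2d_macs(56, 56, 64, 64, kernel=3, groups=64)
pointwise = conv2d_macs(56, 56, 64, 64, kernel=1)

print(standard)               # prints 115605504
print(depthwise + pointwise)  # prints 14651392
```

The depthwise-plus-pointwise pair needs roughly 8 times fewer MACs than the standard convolution here. The paper's argument is that such a ratio does not by itself predict the measured speed-up, especially on edge devices.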

28. ❌ Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

Authors: Yoseph Berhanu Alebachew, Hunter Leary, Swanand Vaishampayan, Chris Brown Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26567v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

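Scoring rationale: The paper studies LLMs applied to software engineering, specifically repository-level question answering. Highly relevant keywords: LLMs (the paper directly evaluates Claude 3.5 Sonnet and GPT-4o), RAG (it evaluates retrieval-augmented generation methods), and LLM Agents (it tests agentic configurations). Moderately relevant keywords: the reasoning keywords (it analyzes whether models genuinely reason rather than memorize), factuality (it is concerned with answer accuracy), and explainability (it analyzes model behavior). Other keywords, such as MoE, quantization, and scientific AI, are unrelated to the paper.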

!!! tip deepseek-chat TL;DR

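The paper evaluates large language models on repository-level program comprehension question answering. It finds that accuracy remains limited, that high scores often come from memorized Stack Overflow answers rather than genuine reasoning, and that retrieval augmentation with structural information improves performance.

Abstract translation (Chinese)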

大型语言模型（LLM）在软件工程任务中展现出卓越能力，包括问答任务。然而，大多数研究和基准测试聚焦于孤立功能或单文件代码片段，忽视了现实世界程序理解中常涉及多文件与系统级依赖的复杂性。本研究首次提出StackRepoQA——一个基于134个开源Java项目中1,318个真实开发者问题及其采纳答案构建的多项目、仓库级问答数据集。利用该数据集，我们系统评估了两种主流LLM（Claude 3.5 Sonnet与GPT-4o）在直接提示与智能体配置下的表现，并将基线性能与基于文件检索和结构依赖图表示增强的检索增强生成方法进行对比。实验结果表明，LLM在基线条件下达到中等准确率，融入结构信号后性能有所提升，但仓库级程序理解的总体准确率仍受限。分析显示，高分结果往往源于对Stack Overflow答案的逐字复现而非真正推理。据我们所知，这是首个在仓库级问答任务中提供此类实证证据的研究。我们公开StackRepoQA数据集，以推动在基准构建、评估框架及解耦记忆与推理的增强策略方面的深入研究，从而推进LLM成为可靠的仓库级程序理解工具。

摘要 (Abstract)

Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering (QA). However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the challenges of real-world program comprehension, which often spans multiple files and system-level dependencies. In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects. Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Nonetheless, overall accuracy remains limited for repository-scale comprehension. The analysis reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning. To our knowledge, this is the first empirical study to provide such evidence in repository-level QA. We release StackRepoQA to encourage further research into benchmarks, evaluation protocols, and augmentation strategies that disentangle memorization from reasoning, advancing LLMs as reliable tool for repository-scale program comprehension.

Keywords: Large Language Models, repository-level question answering, retrieval-augmented generation, LLM agents, program comprehension, benchmark evaluation, StackRepoQA dataset, software engineering

29. ❌ When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

Authors: Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26556v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

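Scoring rationale: The paper studies distilling pretrained Transformers into hybrid models to cut inference cost. It bears most directly on KV cache compression (10/10) and also touches model compression, inference acceleration, long-context processing, post-training, and large language models (8/10 each). Other keywords, such as MoE, small language models, data quality, alignment, RAG, chain of thought, and agents, do not appear in the abstract and are rated 0.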
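
For context on the 0.0 / 26.6 result: the report configuration defines the pass mark as 5.0 + 0.8 × total weight and caps each keyword at 15 points. The sketch below reproduces that pass mark and shows why a keyword listed with weight 0.0 contributes nothing even at 10/10 relevance. The per-keyword rule in `keyword_score` is an assumption, since the report does not state it, and both function names are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # "maximum score per keyword" from the report configuration


def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report configuration."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMPTION: weight times relevance (0-10), rescaled to the 15-point cap."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


# (weight, relevance) pairs shaped like rows of the table above
rows = [(0.0, 8.0), (0.0, 10.0), (0.0, 0.0)]
total = sum(keyword_score(w, r) for w, r in rows)
print(round(pass_mark(27.0), 1), total)
```

Under any multiplicative rule of this kind, a zero weight forces a zero score, which is consistent with the table regardless of the exact scaling.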

!!! tip deepseek-chat TL;DR

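The paper studies distilling a pretrained Transformer into a more efficient hybrid model, shows that log-likelihood (perplexity-style) evaluation understates the teacher-student gap, and proposes the Hybrid-KDA architecture together with the GenDistill pipeline. The best model retains 86–90% of teacher accuracy on knowledge benchmarks while cutting KV cache memory by up to 75% and improving time-to-first-token by 2–4× at 128K-token contexts.

Abstract translation (Chinese)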

通过蒸馏将预训练的Transformer模型转化为更高效的混合模型，为降低推理成本提供了一种前景广阔的方法。然而，要在蒸馏模型中实现高质量的生成，需要对学生架构和蒸馏过程进行精心的联合设计。许多先前的蒸馏工作通过使用对数似然对候选答案进行排序来评估下游多项选择基准，而非要求模型进行自回归生成，这可能掩盖模型质量的重要差异。例如，我们证明，一个70亿参数的蒸馏模型在对数似然评分下与教师模型的差距仅为0.2个百分点，但当模型必须进行自回归生成答案时，其表现实际上落后了20.8个百分点。
我们提出了一种混合Kimi Delta注意力（Hybrid-KDA）架构，并搭配多阶段蒸馏流程GenDistill，在整个过程中使用基于生成的评估来指导设计决策。将此方法应用于Qwen3-0.6B模型，我们系统地消融了六个设计维度：训练目标、损失掩码、训练时长、数据集选择、参数冻结以及架构选择。我们发现，基于对数似然的评估持续低估了教师模型与学生模型之间的差距，并且在某些情况下会逆转设计选择的排序，这意味着仅基于困惑度的评估所得出的结论可能具有误导性。在我们研究的因素中，数据集选择、仅对补全部分进行掩码，以及在后续训练期间冻结注意力层，对生成质量的影响最为显著。
我们最佳的Hybrid-KDA模型在知识基准测试中保持了教师模型86%至90%的准确率，同时将KV缓存内存降低了高达75%，并在128K令牌的上下文长度下将首令牌生成时间缩短了2至4倍。

Abstract

Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B parameter distilled model that nearly matches its teacher to within 0.2 pp under log-likelihood scoring actually falls behind by 20.8 pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86–90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2–4× at 128K-token contexts.
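
The gap between log-likelihood scoring and autoregressive generation described above can be seen in a toy setting. The sketch below uses a hypothetical hand-written next-token table (`TOY_LM`), not any model from the paper: ranking the candidate answers by log-likelihood selects the right one, while greedy decoding from the same distribution produces something else.

```python
import math

# Hypothetical toy model: next-token probabilities keyed by the tokens so far.
TOY_LM = {
    (): {"Paris": 0.40, "The": 0.45, "Rome": 0.15},
    ("The",): {"city": 0.9, "<eos>": 0.1},
    ("The", "city"): {"<eos>": 1.0},
    ("Paris",): {"<eos>": 1.0},
    ("Rome",): {"<eos>": 1.0},
}


def sequence_logprob(tokens):
    """Log-probability the toy model assigns to a fixed answer plus <eos>."""
    total, prefix = 0.0, ()
    for tok in list(tokens) + ["<eos>"]:
        total += math.log(TOY_LM[prefix].get(tok, 1e-9))
        prefix = prefix + (tok,)
    return total


def rank_by_loglik(candidates):
    """Multiple-choice style scoring: pick the most likely candidate."""
    return max(candidates, key=lambda c: sequence_logprob(c.split()))


def greedy_generate(max_len=8):
    """Generation style scoring: the model must produce the answer itself."""
    prefix, out = (), []
    for _ in range(max_len):
        dist = TOY_LM[prefix]
        tok = max(dist, key=dist.get)
        if tok == "<eos>":
            break
        out.append(tok)
        prefix = prefix + (tok,)
    return " ".join(out)


gold = "Paris"
print(rank_by_loglik(["Paris", "Rome"]) == gold)  # ranking counts this as correct
print(greedy_generate() == gold)                  # free generation does not
```

Ranking only compares the listed candidates, so probability mass that the model puts on off-list continuations never shows up in the score.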

Keywords: distillation, hybrid model, KV cache compression, inference efficiency, autoregressive generation, Transformer, generation quality, model compression
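
As a rough illustration of how a hybrid model can shrink the KV cache by up to 75%, the arithmetic below assumes that three out of four full-attention layers are replaced by layers with a fixed-size state (ignored here). The layer count, head count, head dimension, and 16-bit storage are illustrative assumptions; the abstract does not give the actual layer ratio.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys plus values for every full-attention layer, for one sequence."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


SEQ_LEN = 128 * 1024  # 128K-token context

# Illustrative shapes only; not taken from the paper.
full = kv_cache_bytes(attn_layers=28, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
hybrid = kv_cache_bytes(attn_layers=7, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

reduction = 1 - hybrid / full
print(f"{full / 2**30:.1f} GiB -> {hybrid / 2**30:.1f} GiB ({reduction:.0%} smaller)")
# prints "14.0 GiB -> 3.5 GiB (75% smaller)"
```

Because the cache grows linearly with both sequence length and the number of full-attention layers, the saving tracks the fraction of layers replaced.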

30. ❌ How Open Must Language Models be to Enable Reliable Scientific Inference?

Authors: James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26539v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

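Scoring rationale: The paper is about the reliability of language models, large language models in particular, as instruments of scientific research, and analyzes how open versus closed models affect scientific inference. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10/10). Its focus on scientific use places it under "AI for Science" (10/10). The remaining keywords concern specific techniques, training methods, inference optimization, and application frameworks; the paper mentions these only as background, without detailed discussion, so they are rated 0.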

!!! tip deepseek-chat TL;DR

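The paper examines how a language model's degree of openness affects the reliability of scientific inferences drawn from research that uses it. It argues that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and recommends systematically identifying threats to inference, reporting the mitigations taken, and justifying model selection.

Abstract translation (Chinese)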

模型的开放或封闭程度如何影响基于该模型的研究所能得出的科学推论？本文分析了关于模型构建与部署的信息限制如何威胁可靠推断。我们认为，当前封闭模型通常不适合科学研究目的（除少数显著例外），并探讨了这些模型对可靠推断所造成的问题可通过何种方式解决或缓解。我们建议，在科研中使用模型时，应系统识别可能威胁推断的因素及已采取的缓解措施，并需提供模型选择的具体依据。

Abstract

How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.

Keywords: language models, scientific inference, open models, closed models, reliable inference, model transparency, research methodology, AI in science

31. ❌ The Multi-AMR Buffer Storage, Retrieval, and Reshuffling Problem: Exact and Heuristic Approaches

Authors: Max Disselnmeyer, Thomas Bömer, Laura Dörr, Bastian Amberg, Anne Meyer Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26542v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

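Scoring rationale: The paper studies optimization algorithms for the Multi-AMR Buffer Storage, Retrieval, and Reshuffling Problem (Multi-AMR BSRRP) in production systems, covering robot scheduling, integer programming, and heuristics. Every scoring keyword concerns large language models, deep learning techniques, or AI-for-science applications, whereas this work belongs to operations research, automated scheduling, and industrial engineering, so none of the keywords is relevant.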

!!! tip deepseek-chat TL;DR

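The paper addresses the Multi-AMR BSRRP, in which a fleet of autonomous mobile robots jointly handles storage, retrieval, and reshuffling in a shared buffer area. It proposes an exact binary integer programming model and a hierarchical heuristic; experiments show that the heuristic cuts computation time by orders of magnitude, making it viable as control logic for high-density production environments.

Abstract translation (Chinese)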

缓冲区在生产系统中对于解耦连续工序至关重要。在空间受限的棕地设施等密集地面存储环境中，严重劳动力短缺和运营成本上升日益挑战人工操作的可行性。实现这些区域的自动化需要解决缓冲区存储、提取与重整问题。以往研究主要集中于对固定货物集合进行重整和提取的场景，而实际制造环境需要一种能同时处理新到货单元负载的自适应方法。本文提出多自主移动机器人缓冲区管理问题，协调机器人车队在共享地面区域内，同时处理重整任务以及具有时间窗口的存储与提取任务。我们建立了一个二进制整数规划模型以获取精确解用于基准测试。由于该问题是NP难问题，精确方法在工业规模下计算难以处理，因此我们提出一种分层启发式算法。该方法将问题分解为两个层次：通过A*搜索进行单元负载放置的任务级序列规划，以及采用约束规划方法实现多机器人协调与调度。实验表明，与精确模型相比，该算法可实现计算时间的数量级缩减。这些结果证实了该启发式算法作为高密度生产环境响应式控制逻辑的可行性。

Abstract

Buffer zones are essential in production systems to decouple sequential processes. In dense floor storage environments, such as space-constrained brownfield facilities, manual operation is increasingly challenged by severe labor shortages and rising operational costs. Automating these zones requires solving the Buffer Storage, Retrieval, and Reshuffling Problem (BSRRP). While previous work has addressed scenarios where the focus is limited to reshuffling and retrieving a fixed set of items, real-world manufacturing necessitates an adaptive approach that also incorporates arriving unit loads. This paper introduces the Multi-AMR BSRRP, coordinating a robot fleet to manage concurrent reshuffling, alongside time-windowed storage and retrieval tasks, within a shared floor area. We formulate a Binary Integer Programming (IP) model to obtain exact solutions for benchmarking purposes. As the problem is NP-hard, rendering exact methods computationally intractable for industrial scales, we propose a hierarchical heuristic. This approach decomposes the problem into an A* search for task-level sequence planning of unit load placements, and a Constraint Programming (CP) approach for multi-robot coordination and scheduling. Experiments demonstrate orders-of-magnitude computation time reductions compared to the exact formulation. These results confirm the heuristic’s viability as responsive control logic for high-density production environments.

Keywords: Buffer Storage, Retrieval and Reshuffling Problem, Multi-AMR, Binary Integer Programming, Hierarchical Heuristic, A* Search, Constraint Programming, Robot Fleet Coordination
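
The task-level half of the hierarchical heuristic above, an A* search over unit load placements, can be sketched on a toy buffer. This is a simplified stand-in, not the paper's formulation: lanes are stacks whose last element is the only accessible load, the goal is to expose one target load, each relocation costs 1, and the heuristic counts the loads in front of the target. Time windows and multi-robot scheduling (the Constraint Programming level) are omitted, and all names are hypothetical.

```python
import heapq
from itertools import count


def blockers(state, target):
    """Number of unit loads stacked in front of the target in its lane."""
    for lane in state:
        if target in lane:
            return len(lane) - 1 - lane.index(target)
    raise ValueError("target is not in the buffer")


def neighbors(state, capacity):
    """Yield (move, next_state) for every legal relocation of a front load."""
    for i, src in enumerate(state):
        if not src:
            continue
        for j, dst in enumerate(state):
            if i == j or len(dst) >= capacity:
                continue
            lanes = list(state)
            lanes[i] = src[:-1]
            lanes[j] = dst + (src[-1],)
            yield (src[-1], i, j), tuple(lanes)


def astar_reshuffle(start, target, capacity):
    """Shortest relocation sequence that makes the target accessible."""
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(blockers(start, target), next(tie), 0, start, [])]
    best_g = {start: 0}
    while frontier:
        _, _, g, state, plan = heapq.heappop(frontier)
        if blockers(state, target) == 0:
            return plan
        if g > best_g.get(state, float("inf")):
            continue  # stale heap entry
        for move, nxt in neighbors(state, capacity):
            ng = g + 1
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                f = ng + blockers(nxt, target)  # every blocker needs >= 1 move
                heapq.heappush(frontier, (f, next(tie), ng, nxt, plan + [move]))
    return None


start = (("A", "B", "C"), ("D",), ())  # lane 0 holds A at the back, C at the front
plan = astar_reshuffle(start, target="A", capacity=3)
print(plan)
```

The blocker count never overestimates the remaining relocations, so the first goal state popped from the heap corresponds to a shortest plan in this toy model.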

32. ❌ JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems

Authors: Guangzhao Yang, Yu Pan, Shi Qiu, Ningjie Bai Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26515v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

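Scoring rationale: JAL-Turn targets real-time turn-taking detection in spoken dialogue systems using joint acoustic-linguistic modeling; its core contributions are a lightweight framework, a parallel processing architecture, and a data construction pipeline. The scoring keywords concern large-model techniques, training methods, inference optimization, alignment, agent systems, and AI-for-science applications. The evaluator judged the paper to address a specific engineering problem in speech AI systems rather than any of these topics, so every keyword is rated 0.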

!!! tip deepseek-chat TL;DR

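The paper tackles efficient, robust turn-taking detection for industrial-grade Voice AI agents. It proposes JAL-Turn, a lightweight joint acoustic-linguistic framework that shares a frozen ASR encoder so that turn-taking prediction runs in parallel with speech recognition, and reports better detection accuracy and real-time performance than existing methods on public multilingual benchmarks and an in-house dataset.

Abstract translation (Chinese)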

尽管近期取得了一定进展，在工业级语音AI智能体部署中，高效且鲁棒的对话轮次检测仍是一项重大挑战。现有系统大多仅依赖声学或语义线索，导致准确性与稳定性欠佳；而近期赋予大语言模型全双工能力的尝试，不仅需要成本高昂的全双工数据，还带来巨大的训练与部署开销，限制了实时性能。本文提出JAL-Turn——一种轻量高效的纯语音轮次检测框架，该框架采用声学-语言联合建模范式，通过交叉注意力模块自适应地融合预训练声学表征与语言特征，以支持低延迟的保持与转换状态预测。通过共享冻结的自动语音识别（ASR）编码器，JAL-Turn使轮次检测能够与语音识别完全并行运行，不引入额外的端到端延迟或计算开销。此外，我们设计了一套可扩展的数据构建流程，能够从大规模真实对话语料中自动推导出可靠的轮次标注标签。在公开多语言基准测试及内部日语客服数据集上的大量实验表明，JAL-Turn在检测准确率上持续优于现有强基线方法，同时保持了卓越的实时性能。

Abstract

Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments. Many existing systems rely solely on acoustic or semantic cues, leading to suboptimal accuracy and stability, while recent attempts to endow large language models with full-duplex capabilities require costly full-duplex data and incur substantial training and deployment overheads, limiting real-time performance. In this paper, we propose JAL-Turn, a lightweight and efficient speech-only turn-taking framework that adopts a joint acoustic-linguistic modeling paradigm, in which a cross-attention module adaptively integrates pre-trained acoustic representations with linguistic features to support low-latency prediction of hold vs shift states. By sharing a frozen ASR encoder, JAL-Turn enables turn-taking prediction to run fully in parallel with speech recognition, introducing no additional end-to-end latency or computational overhead. In addition, we introduce a scalable data construction pipeline that automatically derives reliable turn-taking labels from large-scale real-world dialogue corpora. Extensive experiments on public multilingual benchmarks and an in-house Japanese customer-service dataset show that JAL-Turn consistently outperforms strong state-of-the-art baselines in detection accuracy while maintaining superior real-time performance.

Keywords: turn-taking detection, full-duplex spoken dialogue systems, joint acoustic-linguistic modeling, real-time performance, speech recognition, cross-attention module, data construction pipeline, low-latency prediction
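
The cross-attention fusion of acoustic and linguistic features described above might look roughly like the NumPy sketch below. It is a generic single-head cross-attention with random weights, not the JAL-Turn architecture: the choice of acoustic frames as queries and linguistic tokens as keys and values, the residual connection, the mean pooling, and all dimensions are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def hold_shift_probs(acoustic, linguistic, params):
    """Fuse the two streams with single-head cross-attention, then classify."""
    q = acoustic @ params["w_q"]      # (frames, d): queries from acoustic frames
    k = linguistic @ params["w_k"]    # (tokens, d): keys from linguistic features
    v = linguistic @ params["w_v"]    # (tokens, d): values from linguistic features
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (frames, tokens)
    fused = q + attn @ v              # residual: acoustic view plus attended text
    pooled = fused.mean(axis=0)       # average over frames
    return softmax(pooled @ params["w_out"])  # [P(hold), P(shift)]


rng = np.random.default_rng(0)
d_acoustic, d_linguistic, d_model = 16, 12, 8  # hypothetical feature sizes
params = {
    "w_q": rng.normal(size=(d_acoustic, d_model)),
    "w_k": rng.normal(size=(d_linguistic, d_model)),
    "w_v": rng.normal(size=(d_linguistic, d_model)),
    "w_out": rng.normal(size=(d_model, 2)),
}
acoustic = rng.normal(size=(50, d_acoustic))     # 50 frames of acoustic features
linguistic = rng.normal(size=(6, d_linguistic))  # 6 token-level features
probs = hold_shift_probs(acoustic, linguistic, params)
print(probs.shape, float(probs.sum()))
```

In a trained system the projection matrices would be learned and the inputs would come from the shared frozen ASR encoder; here they are random placeholders.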

33. ❌ CADSmith: Multi-Agent CAD Generation with Programmatic Geometric Validation

Authors: Jesse Barkley, Rumi Loghmani, Amir Barati Farimani Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26512v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

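Scoring rationale: CADSmith is a multi-agent system (LLM Agents, Multi-agent Systems) that generates and validates CAD code using retrieval-augmented generation (RAG) and tool use (Tool Use). The evaluator also rated its iterative self-correction (Self-Correction) as involving chain-of-thought reasoning (Chain of Thought) and System 2 thinking (System 2 Thinking), and classed the work as an application of LLMs to science (AI for Science); all of these keywords are rated 10. Hallucination mitigation (Hallucination Mitigation) is indirectly relevant through geometric validation (5). Other keywords, such as MoE, quantization, and pre-training, are not addressed (0).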

!!! tip deepseek-chat TL;DR

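CADSmith uses a multi-agent pipeline with closed-loop refinement driven by programmatic geometric validation to substantially improve the quality and reliability of CAD code generated from natural language.

Abstract translation (Chinese)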

现有文本到CAD生成方法要么以单次方式运行且缺乏几何验证，要么依赖有损的视觉反馈而无法修正尺寸误差。本文提出CADSmith，一种从自然语言生成CadQuery代码的多智能体流程。该系统通过两个嵌套的校正循环进行迭代优化：内循环负责解决执行错误，外循环则基于程序化几何验证机制。外循环结合了OpenCASCADE内核的精确测量数据（边界框尺寸、体积、实体有效性）与独立视觉语言模型Judge的整体视觉评估，从而同时提供数值精度和高层形状感知能力，确保几何结构正确收敛。该系统采用基于API文档的检索增强生成技术而非微调方法，可在底层CAD库更新时维护最新数据库。我们在包含三个难度层级（T1至T3）的100条提示词定制基准上，通过三种消融配置进行评估。相较于零样本基线，CADSmith实现了100%的执行率（从95%提升），中位数F1分数从0.9707提升至0.9846，中位数交并比从0.8085提升至0.9629，平均倒角距离从28.37降低至0.74。这表明采用程序化几何反馈的闭环优化机制能显著提升大语言模型生成CAD模型的质量与可靠性。

Abstract

Existing methods for text-to-CAD generation either operate in a single pass with no geometric verification or rely on lossy visual feedback that cannot resolve dimensional errors. We present CADSmith, a multi-agent pipeline that generates CadQuery code from natural language. It then undergoes an iterative refinement process through two nested correction loops: an inner loop that resolves execution errors and an outer loop grounded in programmatic geometric validation. The outer loop combines exact measurements from the OpenCASCADE kernel (bounding box dimensions, volume, solid validity) with holistic visual assessment from an independent vision-language model Judge. This provides both the numerical precision and the high-level shape awareness needed to converge on the correct geometry. The system uses retrieval-augmented generation over API documentation rather than fine-tuning, maintaining a current database as the underlying CAD library evolves. We evaluate on a custom benchmark of 100 prompts in three difficulty tiers (T1 through T3) with three ablation configurations. Against a zero-shot baseline, CADSmith achieves a 100% execution rate (up from 95%), improves the median F1 score from 0.9707 to 0.9846, the median IoU from 0.8085 to 0.9629, and reduces the mean Chamfer Distance from 28.37 to 0.74, demonstrating that closed-loop refinement with programmatic geometric feedback substantially improves the quality and reliability of LLM-generated CAD models.

Keywords: text-to-CAD generation, multi-agent pipeline, retrieval-augmented generation, programmatic geometric validation, iterative refinement, LLM-generated CAD models, closed-loop refinement, CAD code generation
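
The two nested correction loops described above can be sketched as plain control flow. Everything here is a hypothetical stand-in: `generate`, `execute`, and `validate` are caller-supplied callables, the stubs fake an LLM and a CAD kernel with string checks, and the loop limits are arbitrary. It shows only the loop structure, not CADSmith's agents, CadQuery execution, or the vision-language Judge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecResult:
    ok: bool
    error: str = ""
    measurements: Optional[dict] = None  # e.g. {"bbox": (x, y, z)}


def refine(prompt, generate, execute, validate, max_outer=3, max_inner=3):
    """Return code that executes and passes validation, or None if none is found."""
    feedback = ""
    for _ in range(max_outer):
        code = generate(prompt, feedback)
        result = execute(code)
        # Inner loop: repair execution errors before looking at geometry.
        for _ in range(max_inner):
            if result.ok:
                break
            code = generate(prompt, "execution error: " + result.error)
            result = execute(code)
        if not result.ok:
            feedback = "execution error: " + result.error
            continue
        # Outer loop: compare measured geometry against the request.
        issues = validate(result.measurements)
        if not issues:
            return code
        feedback = "geometry issues: " + "; ".join(issues)
    return None


# --- Hypothetical stubs standing in for the LLM and the CAD kernel ---
def fake_generate(prompt, feedback):
    if "execution error" in feedback:
        return "box(10, 10, 5)"   # syntax repaired, height still wrong
    if "geometry issues" in feedback:
        return "box(10, 10, 10)"  # dimensions corrected
    return "box(10, 10"           # first draft: unbalanced parenthesis


def fake_execute(code):
    if not code.endswith(")"):
        return ExecResult(ok=False, error="unbalanced parenthesis")
    dims = tuple(float(v) for v in code[4:-1].split(","))
    return ExecResult(ok=True, measurements={"bbox": dims})


def fake_validate(measurements):
    return [] if measurements["bbox"] == (10.0, 10.0, 10.0) else ["bbox mismatch"]


final_code = refine("a 10 mm cube", fake_generate, fake_execute, fake_validate)
print(final_code)
```

Keeping execution repair inside the inner loop means geometric feedback is only ever computed for code that actually runs.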

34. ❌ AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese

Authors: Afonso Simplício, Gonçalo Vinagre, Miguel Moura Ramos, Diogo Tavares, Rafael Ferreira, Giuseppe Attanasio, Duarte M. Alves, Inês Calvo, Inês Vieira, Rui Guerra, James Furtado, Beatriz Canaverde, Iago Paulo, Vasco Ramos, Diogo Glória-Silva, Miguel Faria, Marcos Treviso, Daniel Gomes, Pedro Gomes, David Semedo, André Martins, João Magalhães Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26511v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

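Scoring rationale: The paper develops AMALIA, a large language model for European Portuguese, so it is highly relevant to "Large Language Models" (10). It states that high-quality pt-PT data is used in the mid- and post-training stages, so "Pre-training" and "Post-training" are each rated 10. Its emphasis on data quality gives it some relevance to "Scaling Laws AND Data Quality" (5). Other keywords, such as MoE, SLMs, alignment, reasoning, and agents, are not mentioned in the abstract and are rated 0.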

!!! tip deepseek-chat TL;DR

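To address the underrepresentation of European Portuguese in large language models, the paper develops the fully open AMALIA model, which uses more high-quality pt-PT data during the mid- and post-training stages and is evaluated on a dedicated pt-PT benchmark suite, substantially improving performance on pt-PT-specific tasks.

Abstract translation (Chinese)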

尽管开放大语言模型（LLM）发展迅速，欧洲葡萄牙语（pt-PT）在训练数据和本土化评估中仍处于代表性不足的状态，机器翻译的基准测试很可能遗漏了该语言变体的语言学及文化细微差异。我们推出了AMALIA——一个完全开放的大语言模型，通过在训练中期及后期阶段使用更多高质量的欧洲葡萄牙语数据，优先提升对该语言的支持。为了更准确地评估欧洲葡萄牙语，我们发布了一套欧洲葡萄牙语基准测试，其中包含翻译后的标准任务以及四个针对欧洲葡萄牙语生成能力、语言熟练度、以及欧洲葡萄牙语/巴西葡萄牙语（pt-PT/pt-BR）偏见的全新数据集。实验表明，AMALIA在翻译基准测试中与强基线模型表现相当，同时显著提升了在欧洲葡萄牙语专项评估中的性能，这为欧洲葡萄牙语的定向训练和本土化基准测试提供了有力支持。

Abstract

Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant’s linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.

Keywords: large language models, European Portuguese, open source, pre-training, post-training, benchmark evaluation, language model adaptation, multilingual NLP

35. ❌ AIRA_2: Overcoming Bottlenecks in AI Research Agents

Authors: Karen Hambardzumyan, Nicolas Baldwin, Edan Toledo, Rishi Hazra, Michael Kuchnik, Bassel Al Omari, Thomas Simon Foster, Anton Protopopov, Jean-Christophe Gagnon-Audet, Ishita Mediratta, Kelvin Niu, Michael Shvartsman, Alisia Lupidi, Alexis Audran-Reiss, Parth Pathak, Tatiana Shavrina, Despoina Magka, Hela Momand, Derek Dunfield, Nicola Cancedda, Pontus Stenetorp, Carole-Jean Wu, Jakob Nicolaus Foerster, Yoram Bachrach, Martin Josifoski Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26499v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

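Scoring rationale: The paper's core is the architecture of the AI research agent AIRA_2. It directly involves LLM agents (ReAct agents) and LLMs (the agents' core component), so both keywords are rated 10. The agents' ability to dynamically scope their actions and debug interactively relates partly to "Tool Use" (5). AI research agents fall partly under "AI for Science" (5). Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and are rated 0.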

!!! tip deepseek-chat TL;DR

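The paper targets three performance bottlenecks in AI research agents (synchronous single-GPU execution, a generalization gap, and the limited capability of fixed single-turn LLM operators) and proposes the AIRA_2 architecture (an asynchronous multi-GPU worker pool, a Hidden Consistent Evaluation protocol, and ReAct agents). On MLE-bench-30 it reaches a mean Percentile Rank of 71.8% at 24 hours, above the previous best of 69.9%, and improves to 76.0% at 72 hours.

Abstract translation (Chinese)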

现有研究已识别出AI研究智能体存在的三个结构性性能瓶颈：(1)同步单GPU执行限制了样本吞吐量，制约了搜索效益；(2)泛化鸿沟——基于验证的选择机制导致性能在长期搜索中持续退化；(3)固定式单轮LLM（大语言模型）算子的能力局限为搜索性能设置了天花板。我们提出AIRA$_2$系统，通过三项架构设计解决这些瓶颈：采用异步多GPU工作池实现实验吞吐量的线性增长；建立隐藏一致性评估协议以提供可靠评估信号；部署ReAct智能体实现动态行动规划与交互式调试。在MLE-bench-30基准测试中，AIRA$_2$在24小时内达到71.8%的平均百分位排名——超越此前最佳记录69.9%——并持续提升至72小时的76.0%。消融实验表明所有组件均不可或缺，且先前研究报道的"过拟合"现象实为评估噪声所致，而非真实的数据记忆效应。

Abstract

Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) a generalization gap where validation-based selection causes performance to degrade over extended search horizons; and (3) the limited capability of fixed, single-turn LLM operators imposes a ceiling on search performance. We introduce AIRA_2, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that increases experiment throughput linearly; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA_2 achieves a mean Percentile Rank of 71.8% at 24 hours, surpassing the previous best of 69.9%, and steadily improves to 76.0% at 72 hours. Ablation studies reveal that each component is necessary and that the “overfitting” reported in prior work was driven by evaluation noise rather than true data memorization.

Keywords: AI research agents, performance bottlenecks, asynchronous multi-GPU, Hidden Consistent Evaluation, ReAct agents, MLE-bench-30, Percentile Rank, ablation studies
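
The idea of an asynchronous worker pool, where each worker pulls the next experiment as soon as it finishes the previous one, can be sketched with `asyncio`. This is a minimal illustration under stated assumptions, not AIRA_2's implementation: the "GPUs" are simulated with `asyncio.sleep`, and `run_pool`, `worker`, and `fake_experiment` are hypothetical names.

```python
import asyncio


async def worker(gpu_id, queue, results, run_experiment):
    """Pull candidates until cancelled; no worker waits for the others."""
    while True:
        candidate = await queue.get()
        try:
            score = await run_experiment(gpu_id, candidate)
            results.append((candidate, score))
        finally:
            queue.task_done()


async def run_pool(candidates, n_gpus, run_experiment):
    queue = asyncio.Queue()
    for c in candidates:
        queue.put_nowait(c)
    results = []
    workers = [
        asyncio.create_task(worker(g, queue, results, run_experiment))
        for g in range(n_gpus)
    ]
    await queue.join()  # returns once every candidate has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


async def fake_experiment(gpu_id, candidate):
    # Stand-in for a training run: longer for some candidates than others.
    await asyncio.sleep(0.01 * (1 + candidate % 3))
    return candidate * 2


results = asyncio.run(run_pool(list(range(8)), n_gpus=4, run_experiment=fake_experiment))
print(sorted(results))
```

Results arrive in completion order rather than submission order, so a search procedure built on such a pool has to tolerate out-of-order scores.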

36. ❌ Foundation Model for Cardiac Time Series via Masked Latent Attention

Authors: Moritz Vandenhirtz, Samuel Ruipérez-Campillo, Simon Böhi, Sonia Laguna, Irene Cannistraci, Andrea Agostini, Ece Ozkan, Thomas M. Sutter, Julia E. Vogt Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26475v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

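Scoring rationale: The paper focuses on foundation models for electrocardiogram (ECG) signals. Its core contribution is LAMAE, a masked autoencoder that uses a latent attention mechanism to learn cross-lead connections, improving the quality and transferability of ECG representations. It is therefore highly relevant (10) to "Foundation Models" (weight 1.0), because the paper explicitly develops and evaluates a foundation model for ECGs. It is highly relevant (10) to "Pre-training OR Continual Pre-training OR Domain Adaptation" (weight 1.0), because self-supervised pretraining with a masked autoencoder is the core of the method. It is highly relevant (10) to "AI for Science OR Bioinformatics OR Cheminformatics" (weight 1.0), because the paper applies AI to biomedical signal (ECG) analysis, an AI for Science application in bioinformatics and healthcare. The other keywords (such as MoE, SFT, RAG, and Agents) are unrelated to the paper's technical focus (ECG signal processing, self-supervised pretraining) and receive 0.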
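
The pass mark used throughout this report follows from the configuration section (pass mark = 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The sketch below reproduces that arithmetic. The per-keyword formula (`weight × relevance`, capped at the stated per-keyword maximum of 15) is an assumption made for illustration; the report does not state how the "Score" column is computed, and the function names are hypothetical.

```python
# Values taken from the report's configuration section.
BASE_PASS_SCORE = 5.0
PASS_SCORE_PER_WEIGHT = 0.8
MAX_SCORE_PER_KEYWORD = 15.0


def pass_threshold(total_weight: float) -> float:
    """Pass mark = 5.0 + 0.8 x total keyword weight (as stated in the report)."""
    return BASE_PASS_SCORE + PASS_SCORE_PER_WEIGHT * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum the per-keyword scores of one paper.

    ASSUMPTION: each row scores weight x relevance, capped at the
    per-keyword maximum. The report does not spell this formula out.
    """
    return sum(min(weight * relevance, MAX_SCORE_PER_KEYWORD)
               for weight, relevance in rows)


# 27 keywords with weight 1.0 each give the report's pass mark of 26.6.
threshold = pass_threshold(27.0)

# Rows as printed in the table above: every weight is 0.0, so the total is
# 0.0 even though three keywords have a relevance of 10.0.
rows = [(0.0, 10.0)] * 3 + [(0.0, 0.0)] * 24
score = paper_score(rows)
passed = score >= threshold
```

Under this assumed formula, any row whose printed weight is 0.0 contributes nothing, which matches the 0.0 entries in the "Score" column regardless of relevance.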

!!! tip deepseek-chat TL;DR

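The paper proposes LAMAE, a foundation model for electrocardiogram (ECG) signals that learns cross-lead connections through a masked autoencoder with latent attention. On the Mimic-IV-ECG database, it outperforms independent-lead masked modeling and alignment-based baselines at predicting ICD-10 codes.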

Abstract (Chinese translation)

心电图（ECG）是最广泛可用的临床信号之一，在心血管诊断中发挥着核心作用。尽管近期的基础模型（FMs）在学习可迁移的心电图表征方面展现出潜力，但现有的预训练方法大多将导联视为独立通道，未能明确利用其强烈的结构冗余性。我们提出了潜在注意力掩码自编码器（LAMAE）基础模型，该模型通过在自监督预训练中学习跨导联连接机制，直接利用这种结构。我们的方法通过潜在注意力建模导联间的高阶交互，实现了对导联特定表征的置换不变聚合与自适应加权。我们在Mimic-IV-ECG数据库上提供的实证证据表明，利用跨导联连接构成了一种有效的结构化监督形式，提升了表征质量与可迁移性。我们的方法在预测ICD-10编码任务中表现出强劲性能，优于独立导联掩码建模及基于对齐的基线方法。

摘要 (Abstract)

Electrocardiograms (ECGs) are among the most widely available clinical signals and play a central role in cardiovascular diagnosis. While recent foundation models (FMs) have shown promise for learning transferable ECG representations, most existing pretraining approaches treat leads as independent channels and fail to explicitly leverage their strong structural redundancy. We introduce the latent attention masked autoencoder (LAMAE) FM that directly exploits this structure by learning cross-lead connection mechanisms during self-supervised pretraining. Our approach models higher-order interactions across leads through latent attention, enabling permutation-invariant aggregation and adaptive weighting of lead-specific representations. We provide empirical evidence on the Mimic-IV-ECG database that leveraging the cross-lead connection constitutes an effective form of structural supervision, improving representation quality and transferability. Our method shows strong performance in predicting ICD-10 codes, outperforming independent-lead masked modeling and alignment-based baselines.

Keywords: Foundation Model, ECG, masked autoencoder, latent attention, self-supervised pretraining, cross-lead connection, cardiac time series, representation learning
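
The abstract describes permutation-invariant aggregation and adaptive weighting of lead-specific representations through latent attention. A minimal sketch of that general idea follows, assuming one embedding vector per lead and a small set of latent query vectors; it is not the LAMAE architecture. The function names are hypothetical, and the key/value projections a real model would learn are omitted so that leads serve directly as keys and values.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def latent_attention_pool(lead_embeddings, latent_queries):
    """Pool per-lead embeddings with attention from latent queries.

    lead_embeddings: list of d-dimensional vectors, one per ECG lead.
    latent_queries:  list of d-dimensional query vectors (learned in a
                     real model; fixed numbers in this sketch).
    Returns one pooled d-dimensional vector per latent query.
    """
    dim = len(lead_embeddings[0])
    pooled = []
    for query in latent_queries:
        # Scaled dot-product score between the query and every lead.
        scores = [
            sum(q * k for q, k in zip(query, lead)) / math.sqrt(dim)
            for lead in lead_embeddings
        ]
        # Adaptive weighting: each lead gets a data-dependent weight.
        weights = softmax(scores)
        # The weighted sum ignores lead order, so the result is
        # permutation-invariant (no positional information is used).
        pooled.append([
            sum(w * lead[j] for w, lead in zip(weights, lead_embeddings))
            for j in range(dim)
        ])
    return pooled
```

Because the weights depend only on each lead's content and the sum runs over all leads, reordering the leads leaves the pooled vectors unchanged up to floating-point rounding.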

37. ❌ UNIFERENCE: A Discrete Event Simulation Framework for Developing Distributed AI Models

Authors: Doğaç Eldenk, Stephen Xia Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26469v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

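Scoring rationale: UNIFERENCE is a discrete-event simulation framework for developing and evaluating distributed AI models. Its core contribution lies in system tooling and simulation methodology rather than in large-model or deep-learning techniques themselves. The paper addresses the development, benchmarking, and deployment of distributed inference algorithms, covering device and network modeling and validation of simulation accuracy, but it does not study large-model architectures, training methods, alignment techniques, inference optimization, agent systems, or AI for Science applications. Every keyword targets large-model technical principles or specific applications, whereas this paper is a low-level systems tool, so all keywords receive a relevance of 0.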

!!! tip deepseek-chat TL;DR

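The paper presents UNIFERENCE, a discrete-event simulation framework for developing and evaluating distributed AI models. It addresses the lack of standardized tools that makes results hard to reproduce, integrates with PyTorch Distributed so that the same codebase can move from simulation to real deployment, and profiles runtime with up to 98.6% accuracy relative to real physical deployments.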

Abstract (Chinese translation)

由于缺乏对异构设备和网络进行建模的标准化工具，分布式推理算法的开发与评估仍然面临困难。现有研究通常依赖临时测试平台或专有基础设施，导致结果难以复现，并限制了对假设性硬件或网络配置的探索。本文提出UNIFERENCE，一个专为在统一环境中开发、基准测试和部署分布式人工智能模型而设计的离散事件仿真（DES）框架。UNIFERENCE通过轻量级逻辑进程对设备和网络行为进行建模，这些进程仅在通信原语上进行同步，从而在保持因果顺序的同时消除了回滚操作。该框架与PyTorch Distributed无缝集成，使得同一代码库能够从仿真环境平滑过渡到实际部署。我们的评估表明，在不同后端和硬件设置下，UNIFERENCE对运行时的分析精度相比实际物理部署可达98.6%。通过连接仿真与部署环节，UNIFERENCE为研究分布式推理算法及探索未来系统设计（从高性能集群到边缘设备）提供了一个易于使用、可复现的平台。本框架已在https://github.com/Dogacel/Uniference开源。

摘要 (Abstract)

Developing and evaluating distributed inference algorithms remains difficult due to the lack of standardized tools for modeling heterogeneous devices and networks. Existing studies often rely on ad-hoc testbeds or proprietary infrastructure, making results hard to reproduce and limiting exploration of hypothetical hardware or network configurations. We present UNIFERENCE, a discrete-event simulation (DES) framework designed for developing, benchmarking, and deploying distributed AI models within a unified environment. UNIFERENCE models device and network behavior through lightweight logical processes that synchronize only on communication primitives, eliminating rollbacks while preserving the causal order. It integrates seamlessly with PyTorch Distributed, enabling the same codebase to transition from simulation to real deployment. Our evaluation demonstrates that UNIFERENCE profiles runtime with up to 98.6% accuracy compared to real physical deployments across diverse backends and hardware setups. By bridging simulation and deployment, UNIFERENCE provides an accessible, reproducible platform for studying distributed inference algorithms and exploring future system designs, from high-performance clusters to edge-scale devices. The framework is open-sourced at https://github.com/Dogacel/Uniference.

Keywords: discrete-event simulation, distributed AI models, inference algorithms, PyTorch Distributed, simulation framework, device and network modeling, benchmarking, deployment
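
The abstract states that logical processes synchronize only on communication primitives, which removes the need for rollbacks while preserving causal order. The sketch below illustrates that general principle with per-process virtual clocks; it is not UNIFERENCE's API. The class, its methods, and the timing numbers are all hypothetical.

```python
class LogicalProcess:
    """A simulated device with its own virtual clock (in seconds)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.clock = 0.0

    def compute(self, seconds: float) -> None:
        """Local work advances only this process's clock."""
        self.clock += seconds

    def send(self, latency: float) -> float:
        """Return the virtual time at which the message arrives."""
        return self.clock + latency

    def recv(self, arrival_time: float) -> None:
        """Block until the message has arrived.

        The receiver never moves backwards in time: it either waits for
        the message or has already passed its arrival time. This is why
        no rollback is needed and causal order is preserved.
        """
        self.clock = max(self.clock, arrival_time)


# Hypothetical two-stage pipeline: device_a feeds device_b.
device_a = LogicalProcess("a")
device_b = LogicalProcess("b")

device_a.compute(2.0)                 # a finishes its stage at t = 2.0
arrival = device_a.send(latency=0.5)  # message reaches b at t = 2.5
device_b.compute(1.0)                 # b did unrelated work until t = 1.0
device_b.recv(arrival)                # b waits for the message: t = 2.5
device_b.compute(3.0)                 # b finishes its stage at t = 5.5
```

Between communication points each process advances independently, so a simulator built this way only needs to reconcile clocks at `send`/`recv` pairs.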

38. ❌ A Boltzmann-machine-enhanced Transformer For DNA Sequence Classification

Authors: Zhixuan Cao, Yishu Xu, Xuang WU Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26465v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

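Scoring rationale: The paper addresses DNA sequence classification in bioinformatics and proposes a model that combines a Boltzmann machine with a Transformer. It falls under AI for Science (bioinformatics), so that keyword receives 10. The model emphasizes interpretability, which gives it some connection to Mechanistic Interpretability (5). The other keywords mainly concern large language model (LLM) techniques, training, inference, alignment, and applications, whereas this paper studies a domain-specific Transformer variant. It does not involve LLMs, MoE, SLMs, Scaling Laws, training techniques (pre-training, fine-tuning, alignment, RLHF, PEFT), inference optimization (RAG, long context, attention optimization, inference acceleration, quantization), agents, tool use, hallucination mitigation, model merging, or in-context learning, so those keywords receive 0.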

!!! tip deepseek-chat TL;DR

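The paper targets the need in DNA sequence classification to uncover latent site interactions and combinatorial dependencies. It proposes a model that combines a Boltzmann machine with a Transformer, introducing structured binary gating variables and variational inference so that it learns interpretable, stable structures while maintaining accurate predictions.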

Abstract (Chinese translation)

DNA序列分类不仅需要高预测精度，还要求能够揭示潜在的位点相互作用、组合调控以及类似上位性的高阶依赖关系。尽管标准Transformer具备强大的全局建模能力，但其softmax注意力机制是连续、稠密且弱约束的，更适合信息路由而非显式结构发现。本文提出一种用于DNA序列分类的玻尔兹曼机增强型Transformer模型。该模型基于多头注意力机制，引入结构化二元门控变量来表示潜在的查询-键连接，并通过玻尔兹曼风格的能量函数对其进行约束。查询-键相似度定义了局部偏置项，可学习的成对相互作用捕捉边之间的协同与竞争关系，而潜在隐藏单元则建模高阶组合依赖。由于对离散门控图进行精确后验推断是不可行的，我们采用平均场变分推断来估计边激活概率，并结合Gumbel-Softmax方法将连续概率逐步压缩为近似离散的门控，同时保持端到端的可微性。在训练过程中，我们联合优化分类损失与能量损失，促使模型在实现准确预测的同时，倾向于选择低能量、稳定且可解释的结构。我们进一步从能量函数与变分自由能出发，推导出平均场定点方程、Gumbel-Softmax松弛及最终联合目标函数的完整框架。所提出的框架为整合玻尔兹曼机、可微离散优化与Transformer，实现生物序列上的结构化学习提供了统一视角。

摘要 (Abstract)

DNA sequence classification requires not only high predictive accuracy but also the ability to uncover latent site interactions, combinatorial regulation, and epistasis-like higher-order dependencies. Although the standard Transformer provides strong global modeling capacity, its softmax attention is continuous, dense, and weakly constrained, making it better suited for information routing than explicit structure discovery. In this paper, we propose a Boltzmann-machine-enhanced Transformer for DNA sequence classification. Built on multi-head attention, the model introduces structured binary gating variables to represent latent query-key connections and constrains them with a Boltzmann-style energy function. Query-key similarity defines local bias terms, learnable pairwise interactions capture synergy and competition between edges, and latent hidden units model higher-order combinatorial dependencies. Since exact posterior inference over discrete gating graphs is intractable, we use mean-field variational inference to estimate edge activation probabilities and combine it with Gumbel-Softmax to progressively compress continuous probabilities into near-discrete gates while preserving end-to-end differentiability. During training, we jointly optimize classification and energy losses, encouraging the model to achieve accurate prediction while favoring low-energy, stable, and interpretable structures. We further derive the framework from the energy function and variational free energy to the mean-field fixed-point equations, Gumbel-Softmax relaxation, and the final joint objective. The proposed framework provides a unified view of integrating Boltzmann machines, differentiable discrete optimization, and Transformers for structured learning on biological sequences.

Keywords: DNA sequence classification, Transformer, Boltzmann machine, structured binary gating, variational inference, interpretable structure, biological sequences, energy function
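
The abstract combines mean-field variational inference over binary gates with a Gumbel-Softmax relaxation. The sketch below shows both steps for a generic pairwise Boltzmann energy, assumed here to be E(z) = -Σ_i b_i z_i - Σ_{i<j} W_ij z_i z_j with a symmetric W; the paper's latent hidden units and its exact energy function are not reproduced. The relaxation shown is the binary special case of Gumbel-Softmax (logistic noise with a temperature), and all names are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mean_field(bias, pairwise, iters=50):
    """Coordinate-wise mean-field updates for binary gates.

    ASSUMED energy: E(z) = -sum_i b_i z_i - sum_{i<j} W_ij z_i z_j.
    Fixed point:    q_i = sigmoid(b_i + sum_{j != i} W_ij q_j).
    """
    n = len(bias)
    q = [0.5] * n
    for _ in range(iters):
        for i in range(n):  # in-place (sequential) updates
            field = bias[i] + sum(pairwise[i][j] * q[j]
                                  for j in range(n) if j != i)
            q[i] = sigmoid(field)
    return q


def relaxed_gate(prob, temperature, rng):
    """Binary Gumbel-Softmax (concrete) sample for one gate.

    Lower temperatures push the sample towards 0 or 1 while keeping the
    expression differentiable with respect to prob.
    """
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    logit = math.log(prob) - math.log(1.0 - prob)
    return sigmoid((logit + logistic_noise) / temperature)


# Two candidate edges: the first is favoured by its bias, the second is
# not, but a positive coupling lets the first edge pull the second up.
independent = mean_field([2.0, -2.0], [[0.0, 0.0], [0.0, 0.0]])
coupled = mean_field([2.0, -2.0], [[0.0, 3.0], [3.0, 0.0]])
gates = [relaxed_gate(p, temperature=0.1, rng=random.Random(0))
         for p in coupled]
```

With zero couplings the update reduces to `sigmoid(b_i)`; with a positive coupling the second edge's activation probability rises, which is the "synergy between edges" effect the abstract attributes to the pairwise terms.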

39. ❌ Neuro-Symbolic Process Anomaly Detection

Authors: Devashish Gaikwad, Wil M. P. van der Aalst, Gyunam Park Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26461v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

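Scoring rationale: The paper studies process anomaly detection. It uses a neuro-symbolic AI method (Logic Tensor Networks) to integrate domain knowledge (Declare constraints) into neural networks, which places it in traditional AI/machine learning research for a specific application area (process mining). All scoring keywords focus on large models (LLMs) and related techniques (training, inference, alignment, applications, and so on), whereas this paper does not involve large models, innovations in deep-learning principles, or applications of large models in other domains, so all keywords receive a relevance of 0.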

!!! tip deepseek-chat TL;DR

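The paper proposes a neuro-symbolic approach that uses Logic Tensor Networks to integrate domain knowledge (Declare constraints) into autoencoder-based neural networks for process anomaly detection. The approach distinguishes anomalous behavior from rare but conformant behavior and improves F1 scores on synthetic and real-world datasets.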

Abstract (Chinese translation)

过程异常检测是过程挖掘的重要应用，旨在识别过程行为与正常模式的偏差。基于神经网络的方法近期被应用于此任务，其直接从事件日志中学习，无需预定义过程模型。然而，由于异常检测是纯统计性任务，这些模型未能融入人类领域知识。因此，罕见但符合规范的轨迹常因出现频率低而被误判为异常，这限制了检测过程的有效性。神经符号人工智能领域的最新进展引入了逻辑张量网络（Logic Tensor Networks, LTN），作为一种利用实值逻辑将符号知识整合到神经网络中的方法。在本研究中，我们提出一种神经符号方法，通过LTN与Declare约束将领域知识集成到神经异常检测中。以自编码器模型为基础，我们将Declare约束编码为学习过程中的软逻辑引导轨，以区分异常行为与罕见但符合规范的行为。在合成和真实数据集上的评估表明，即使仅存在10条符合规范的轨迹，我们的方法也能提升F1分数，并且Declare约束的选择（进而体现人类领域知识）对性能提升有显著影响。

摘要 (Abstract)

Process anomaly detection is an important application of process mining for identifying deviations from the normal behavior of a process. Neural network-based methods have recently been applied to this task, learning directly from event logs without requiring a predefined process model. However, since anomaly detection is a purely statistical task, these models fail to incorporate human domain knowledge. As a result, rare but conformant traces are often misclassified as anomalies due to their low frequency, which limits the effectiveness of the detection process. Recent developments in the field of neuro-symbolic AI have introduced Logic Tensor Networks (LTN) as a means to integrate symbolic knowledge into neural networks using real-valued logic. In this work, we propose a neuro-symbolic approach that integrates domain knowledge into neural anomaly detection using LTN and Declare constraints. Using autoencoder models as a foundation, we encode Declare constraints as soft logical guiderails within the learning process to distinguish between anomalous and rare but conformant behavior. Evaluations on synthetic and real-world datasets demonstrate that our approach improves F1 scores even when as few as 10 conformant traces exist, and that the choice of Declare constraint and by extension human domain knowledge significantly influences performance gains.

Keywords: Process anomaly detection, Neuro-symbolic AI, Logic Tensor Networks, Declare constraints, Autoencoder, Domain knowledge integration, Event logs, F1 score improvement
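
The abstract encodes Declare constraints as soft logical guiderails using real-valued logic. As an illustration of what that can look like, the sketch below scores the Declare template Response(A, B) ("whenever A occurs, B eventually follows") on a trace of per-event probabilities. The specific fuzzy operators (Reichenbach implication, a product-based "exists", and a mean for "for all") are assumptions chosen for simplicity; the paper's actual LTN operators and encoding are not given in the abstract.

```python
def soft_response(prob_a, prob_b):
    """Soft truth value in [0, 1] of Declare Response(A, B) on one trace.

    prob_a[t] / prob_b[t]: probability that event t is activity A / B
    (for example, taken from an autoencoder's reconstruction).
    """
    n = len(prob_a)
    if n == 0:
        return 1.0  # an empty trace satisfies the constraint vacuously
    truths = []
    for t in range(n):
        # "exists t' > t with B": 1 - prod(1 - p_B[t'])
        none_later = 1.0
        for p in prob_b[t + 1:]:
            none_later *= 1.0 - p
        b_later = 1.0 - none_later
        # Reichenbach implication: (a -> b) = 1 - a + a * b
        truths.append(1.0 - prob_a[t] + prob_a[t] * b_later)
    # "for all t" approximated by the mean truth value (ASSUMPTION).
    return sum(truths) / n


# Hard (one-hot) traces over three events.
conformant = soft_response([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])  # A, C, B
violating = soft_response([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])   # A, C, C
```

A training loss could add a term such as `1.0 - soft_response(...)` to the reconstruction error, so that rare traces that satisfy the constraint are not penalized merely for being infrequent.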

40. ❌ Can AI Models Direct Each Other? Organizational Structure as a Probe into Training Limitations

Authors: Rui Liu Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26458v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

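Scoring rationale: The paper studies collaboration and direction between AI models in a multi-agent system. Its core topics are LLM Agents, Multi-agent Systems, and Tool Use, so it is highly relevant to those keywords (10). The paper discusses the models' analysis, exploration, and planning abilities, which relate to some extent to Chain of Thought and System 2 Thinking (5). Other keywords, such as MoE, SLMs, and Scaling Laws, are unrelated to the paper (0).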

!!! tip deepseek-chat TL;DR

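The paper studies how effectively an expensive AI model can direct a cheap model on software engineering tasks in a multi-agent system. It finds that direction substantially improves performance when the manager model has a genuine capability advantage, but that the way current models are trained limits how well they perform under a division of labor.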

Abstract (Chinese translation)

昂贵的AI模型能否有效指导廉价模型解决软件工程任务？本研究通过引入ManagerWorker双智能体流水线探讨该问题：一个昂贵的“管理”模型（仅文本处理，无代码执行能力）负责分析问题、分配探索任务并审核实现方案，而一个廉价的“执行”模型（拥有完整仓库访问权限）则负责执行代码修改。我们在SWE-bench Lite的200个实例上评估了五种配置方案，这些方案在管理-执行关系、流水线复杂度和模型配对方面存在差异。研究发现揭示了多智能体指导机制的双重性：（1）强管理模型指导弱执行模型（62%）的表现与强单智能体（60%）相当，但仅消耗后者极少比例的token用量，表明昂贵的推理过程可替代昂贵的执行过程；（2）弱管理模型指导弱执行模型（42%）的表现反而弱于单独弱智能体（44%），证明指导关系需要真实的能力差距——缺乏实质内容的结构纯属额外开销；（3）管理模型的价值在于指导过程而非单纯审核——仅添加基础审核环节仅比基线提升2个百分点，而结构化探索与规划环节可提升11个百分点，说明主动指导才是能力差距产生效益的关键；（4）这些现象可归因于同一根源：当前模型均被训练为单体智能体，将其拆分为指导者/执行者角色会违背其训练数据分布。该流水线的成功在于通过设计规避这种不匹配——使每个模型尽可能接近其训练模式（管理模型专注于文本生成，执行模型专注于工具使用），并将组织结构外部化为代码。这一诊断揭示了具体的训练缺陷：委托授权、范围化执行和模式切换等能力在当前训练数据中尚属空白。

摘要 (Abstract)

Can an expensive AI model effectively direct a cheap one to solve software engineering tasks? We study this question by introducing ManagerWorker, a two-agent pipeline where an expensive “manager” model (text-only, no code execution) analyzes issues, dispatches exploration tasks, and reviews implementations, while a cheap “worker” model (with full repo access) executes code changes. We evaluate on 200 instances from SWE-bench Lite across five configurations that vary the manager-worker relationship, pipeline complexity, and model pairing. Our findings reveal both the promise and the limits of multi-agent direction: (1) a strong manager directing a weak worker (62%) matches a strong single agent (60%) at a fraction of the strong-model token usage, showing that expensive reasoning can substitute for expensive execution; (2) a weak manager directing a weak worker (42%) performs worse than the weak agent alone (44%), demonstrating that the directing relationship requires a genuine capability gap–structure without substance is pure overhead; (3) the manager’s value lies in directing, not merely reviewing–a minimal review-only loop adds just 2pp over the baseline, while structured exploration and planning add 11pp, showing that active direction is what makes the capability gap productive; and (4) these behaviors trace to a single root cause: current models are trained as monolithic agents, and splitting them into director/worker roles fights their training distribution. The pipeline succeeds by designing around this mismatch–keeping each model close to its trained mode (text generation for the manager, tool use for the worker) and externalizing organizational structure to code. This diagnosis points to concrete training gaps: delegation, scoped execution, and mode switching are skills absent from current training data.

Keywords: multi-agent systems, LLM agents, tool use, software engineering, manager-worker pipeline, training limitations, capability gap, organizational structure
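
The abstract describes a manager that analyzes issues, dispatches exploration, and reviews implementations while a worker executes changes, with the organizational structure externalized to code. The sketch below shows one way such a control loop could be written. It is not the paper's ManagerWorker implementation: the prompt prefixes, the `APPROVE` convention, the round limit, and the stub models are all hypothetical.

```python
from typing import Callable

Model = Callable[[str], str]  # a model is any function from prompt to text


def manager_worker(issue: str, manager: Model, worker: Model,
                   max_rounds: int = 3) -> str:
    """Run a direct-then-review loop and return the final patch."""
    # The manager only reads and writes text; it never runs code.
    plan = manager(f"ANALYZE: {issue}")
    # The worker has repository access and does the exploration.
    findings = worker(f"EXPLORE: {plan}")
    patch = worker(f"IMPLEMENT: {plan}\nFINDINGS: {findings}")
    for _ in range(max_rounds):
        review = manager(f"REVIEW: {patch}")
        if review.startswith("APPROVE"):  # hypothetical protocol
            break
        patch = worker(f"REVISE: {patch}\nFEEDBACK: {review}")
    return patch


# Stub models so the control flow can be exercised without any LLM.
def stub_manager(prompt: str) -> str:
    if prompt.startswith("ANALYZE"):
        return "plan: fix the loop bound"
    if prompt.startswith("REVIEW"):
        return "APPROVE" if "v2" in prompt else "REJECT: add a test"
    return ""


def stub_worker(prompt: str) -> str:
    if prompt.startswith("EXPLORE"):
        return "the bug is in the loop bound"
    if prompt.startswith("IMPLEMENT"):
        return "patch v1"
    return "patch v2"


final_patch = manager_worker("off-by-one in pagination",
                             stub_manager, stub_worker)
```

Keeping the loop in ordinary code means neither model has to decide when to delegate or switch roles, which is the design-around-the-mismatch idea the abstract describes.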

41. ❌ CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities

Authors: Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26425v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

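Scoring rationale: CPUBone focuses on designing computer vision backbones optimized for the low parallelization capabilities of CPU devices, which places it in deep-learning model efficiency. It is unrelated to most keywords, especially those on large language model techniques. It has a weak connection to only three keywords: (1) "Small Language Models OR SLMs OR On-device AI" (5): the paper addresses on-device AI efficiency, but for vision models rather than language models; (2) "Quantization OR Model Compression OR Low-bit Weights" (5): it concerns model efficiency but does not specifically use quantization or compression; (3) "Speculative Decoding OR Inference Acceleration" (5): it concerns inference acceleration, but for vision tasks rather than text generation. No other keyword is relevant.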

!!! tip deepseek-chat TL;DR

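The paper addresses the low parallelization capabilities of CPU devices by modifying standard convolutions (grouped convolutions and smaller kernels) to design the CPUBone vision backbones, which achieve state-of-the-art speed-accuracy trade-offs across a range of CPU devices and transfer effectively to downstream vision tasks.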

Abstract (Chinese translation)

近期关于视觉骨干架构的研究主要聚焦于为具备高并行处理能力的硬件平台优化效率。这类平台日益涵盖手机及嵌入式AI加速模块等嵌入式系统。相比之下，CPU无法以相同方式实现运算并行化，因此模型需要遵循一种特定的设计理念：通过维持较高的每秒运算量（MACpS），在运算总量（MACs）与硬件高效执行之间取得平衡。为此，我们研究了标准卷积的两种改进方案以降低计算成本：分组卷积与减小卷积核尺寸。尽管这两种调整都能显著降低推理所需的MACs总量，但维持低延迟仍需保持硬件效率。我们在多种CPU设备上的实验证实，这些调整能成功在CPU上保持较高的硬件效率。基于这些发现，我们提出了CPUBone——一个专为CPU推理优化的新型视觉骨干模型系列。CPUBone在广泛的CPU设备上实现了最优的速度-精度权衡（SATs），并能将其高效性有效迁移至目标检测和语义分割等下游任务。模型与代码已发布于https://github.com/altair199797/CPUBone。

摘要 (Abstract)

Recent research on vision backbone architectures has predominantly focused on optimizing efficiency for hardware platforms with high parallel processing capabilities. This category increasingly includes embedded systems such as mobile phones and embedded AI accelerator modules. In contrast, CPUs do not have the possibility to parallelize operations in the same manner, wherefore models benefit from a specific design philosophy that balances amount of operations (MACs) and hardware-efficient execution by having high MACs per second (MACpS). In pursuit of this, we investigate two modifications to standard convolutions, aimed at reducing computational cost: grouping convolutions and reducing kernel sizes. While both adaptations substantially decrease the total number of MACs required for inference, sustaining low latency necessitates preserving hardware-efficiency. Our experiments across diverse CPU devices confirm that these adaptations successfully retain high hardware-efficiency on CPUs. Based on these insights, we introduce CPUBone, a new family of vision backbone models optimized for CPU-based inference. CPUBone achieves state-of-the-art Speed-Accuracy Trade-offs (SATs) across a wide range of CPU devices and effectively transfers its efficiency to downstream tasks such as object detection and semantic segmentation. Models and code are available at https://github.com/altair199797/CPUBone.

Keywords: vision backbone, CPU inference, hardware efficiency, convolution modifications, MACs reduction, speed-accuracy trade-off, object detection, semantic segmentation
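
The abstract relies on two quantities: the number of multiply-accumulate operations (MACs) in a convolution and the MACs executed per second (MACpS). The sketch below uses the standard MAC count for a 2D convolution to show how grouping and smaller kernels reduce MACs. The layer shape is a hypothetical example, not a CPUBone layer, and no latency figures are given because the abstract reports none.

```python
def conv_macs(h_out: int, w_out: int, c_in: int, c_out: int,
              kernel: int, groups: int = 1) -> int:
    """MACs of a 2D convolution with a square kernel (bias ignored).

    Each output element needs (c_in / groups) * kernel^2 MACs.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel


def macs_per_second(macs: int, latency_seconds: float) -> float:
    """Hardware efficiency as MACs executed per second (MACpS)."""
    return macs / latency_seconds


# Hypothetical layer: 56x56 output, 64 -> 64 channels.
dense_3x3 = conv_macs(56, 56, 64, 64, kernel=3)
grouped_3x3 = conv_macs(56, 56, 64, 64, kernel=3, groups=4)
dense_1x1 = conv_macs(56, 56, 64, 64, kernel=1)
```

Four groups cut the MACs by a factor of four and shrinking a 3×3 kernel to 1×1 cuts them by a factor of nine. Latency only falls by the same factor if MACpS stays constant, which is the hardware-efficiency condition the abstract says these modifications retain on CPUs.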

42. ❌ KMM-CP: Practical Conformal Prediction under Covariate Shift via Selective Kernel Mean Matching

Authors: Siddhartha Laghuvarapu, Rohan Deb, Jimeng Sun Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26415v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

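Scoring rationale: The paper focuses on an uncertainty quantification method in machine learning (Conformal Prediction) and its use under covariate shift. It belongs to traditional machine learning/statistical learning rather than large-model or deep-learning techniques. All keywords relate directly to large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, none of which this paper covers. The only weak connection is to "AI for Science OR Bioinformatics OR Cheminformatics", because the paper runs experiments on molecular property prediction benchmarks, a scientific computing application. The paper's core is methodological rather than a novel application of AI in science, so it receives only 5 (some relevance).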

!!! tip deepseek-chat TL;DR

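The paper proposes KMM-CP, a framework based on Kernel Mean Matching that makes conformal prediction more stable under covariate shift, and shows on molecular property prediction benchmarks that it reduces the coverage gap by over 50% compared to existing approaches.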

Abstract (Chinese translation)

不确定性量化对于在科学发现与医疗健康等高风险领域部署机器学习模型至关重要。共形预测能够在可交换性假设下提供有限样本覆盖保证，但该假设常因分布偏移在实践中被违背。在协变量偏移场景下，恢复有效性需进行重要性加权，然而当训练分布与测试分布的支持集重叠有限时，精确的密度比估计会变得不稳定。本文提出KMM-CP——一种基于核均值匹配的协变量偏移校正共形预测框架。我们证明，通过在显式权重约束下最小化再生核希尔伯特空间矩差异，KMM能直接控制决定共形覆盖误差的偏差-方差分量，并在温和条件下建立渐近覆盖保证。进而提出选择性扩展方案，该方案能识别可靠支持集重叠区域，并将共形校正限制在此子集内，从而进一步提升低重叠区域的稳定性。在具有现实分布偏移的分子性质预测基准测试中，实验表明KMM-CP相较于现有方法将覆盖差距降低了50%以上。代码发布于https://github.com/siddharthal/KMM-CP。

摘要 (Abstract)

Uncertainty quantification is essential for deploying machine learning models in high-stakes domains such as scientific discovery and healthcare. Conformal Prediction (CP) provides finite-sample coverage guarantees under exchangeability, an assumption often violated in practice due to distribution shift. Under covariate shift, restoring validity requires importance weighting, yet accurate density-ratio estimation becomes unstable when training and test distributions exhibit limited support overlap. We propose KMM-CP, a conformal prediction framework based on Kernel Mean Matching (KMM) for covariate-shift correction. We show that KMM directly controls the bias-variance components governing conformal coverage error by minimizing RKHS moment discrepancy under explicit weight constraints, and establish asymptotic coverage guarantees under mild conditions. We then introduce a selective extension that identifies regions of reliable support overlap and restricts conformal correction to this subset, further improving stability in low-overlap regimes. Experiments on molecular property prediction benchmarks with realistic distribution shifts show that KMM-CP reduces coverage gap by over 50% compared to existing approaches. The code is available at https://github.com/siddharthal/KMM-CP.

Keywords: Conformal Prediction, Covariate Shift, Kernel Mean Matching, Uncertainty Quantification, Distribution Shift, Molecular Property Prediction, Coverage Guarantee, Importance Weighting
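
The abstract notes that restoring validity under covariate shift requires importance weighting. The sketch below shows the standard weighted conformal quantile that consumes such weights. It assumes the weights have already been estimated (for example by Kernel Mean Matching, which is not implemented here) and does not reproduce KMM-CP's selective extension. The function name and the example numbers are hypothetical.

```python
import math


def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Threshold for a (1 - alpha) weighted conformal prediction set.

    scores:      nonconformity scores of the calibration points.
    weights:     importance weights of the calibration points
                 (ASSUMED given, e.g. estimated density ratios).
    test_weight: importance weight of the test point, whose unknown
                 score is treated as +infinity.
    """
    total = sum(weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    # The calibration mass alone never reaches 1 - alpha, so the
    # prediction set cannot be bounded.
    return math.inf


calibration_scores = [1.0, 2.0, 3.0, 4.0]

# Equal weights: the ordinary split-conformal quantile.
unweighted = weighted_conformal_quantile(
    calibration_scores, [1.0, 1.0, 1.0, 1.0], test_weight=1.0, alpha=0.5)

# Up-weighting the highest-score point raises the threshold.
shifted = weighted_conformal_quantile(
    calibration_scores, [1.0, 1.0, 1.0, 5.0], test_weight=1.0, alpha=0.5)

# A test point far outside the calibration support dominates the mass.
unsupported = weighted_conformal_quantile(
    calibration_scores, [1.0, 1.0, 1.0, 1.0], test_weight=100.0, alpha=0.1)
```

The last case returns an infinite threshold, which illustrates why limited support overlap is a problem and why restricting correction to regions of reliable overlap, as the abstract proposes, can improve stability.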

43. ❌ Why Models Know But Don’t Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models

Authors: Richard J. Young Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26410v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

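Scoring rationale: The paper studies the faithfulness gap between thinking tokens and visible answers in extended-thinking models. Its core concerns are Chain-of-Thought reasoning, System 2 thinking, factuality/hallucination mitigation, and explainable AI. The paper directly studies how CoT reasoning models behave under misleading hints, so the CoT keyword receives 15 (core content). It involves in-depth reasoning processes (System 2 thinking), model truthfulness (hallucination mitigation), and analysis of the model's internal thinking (explainable AI); these keywords receive 10. Its study of model self-reflection (self-correction) receives 8. The paper explicitly studies LLM reasoning behavior, so the LLM keyword receives 10. The other keywords have no direct connection to the paper and receive 0.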

!!! tip deepseek-chat TL;DR

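The paper finds that when extended-thinking models are swayed by misleading hints, in more than half of the cases the thinking tokens contain hint-related information that the visible answer omits entirely. This thinking-answer faithfulness gap means that monitoring only the answer text misses most hint-influenced reasoning.

Abstract (Chinese translation)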

扩展思维模型在用户可见答案之外暴露了第二条文本生成通道（“思维标记”）。本研究在MMLU和GPQA问题上检验了12个开源推理模型，这些问题均配有误导性提示。在模型实际遵循提示（选择提示目标而非正确答案）的10,506个案例中，每个案例根据模型是否在其思维标记、答案文本、两者皆或两者皆无中承认提示进行分类。在55.4%的案例中，模型的思维标记包含了可见答案完全省略的提示相关关键词，这种模式被称为思维-答案分歧。反向情况（仅答案承认）近乎为零（0.5%），证实了这种不对称具有方向性。提示类型显著影响模式：奉承型提示是最透明的，在受其影响的案例中有58.8%在双通道中都承认了教授的权威；而一致性（72.2%）和非伦理（62.7%）提示则主要由仅思维承认主导。模型间差异也很大，从近乎完全分歧（Step-3.5-Flash: 94.7%）到相对透明（Qwen3.5-27B: 19.6%）不等。这些结果表明，仅监控答案文本会遗漏超过一半受提示影响的推理过程，而访问思维标记虽属必要，仍有11.8%的案例在任一通道中都没有言语化的承认。

Abstract

Extended-thinking models expose a second text-generation channel (“thinking tokens”) alongside the user-visible answer. This study examines 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Among the 10,506 cases where models actually followed the hint (choosing the hint’s target over the ground truth), each case is classified by whether the model acknowledges the hint in its thinking tokens, its answer text, both, or neither. In 55.4% of these cases the model’s thinking tokens contain hint-related keywords that the visible answer omits entirely, a pattern termed thinking-answer divergence. The reverse (answer-only acknowledgment) is near-zero (0.5%), confirming that the asymmetry is directional. Hint type shapes the pattern sharply: sycophancy is the most transparent hint, with 58.8% of sycophancy-influenced cases acknowledging the professor’s authority in both channels, while consistency (72.2%) and unethical (62.7%) hints are dominated by thinking-only acknowledgment. Models also vary widely, from near-total divergence (Step-3.5-Flash: 94.7%) to relative transparency (Qwen3.5-27B: 19.6%). These results show that answer-text-only monitoring misses more than half of all hint-influenced reasoning and that thinking-token access, while necessary, still leaves 11.8% of cases with no verbalized acknowledgment in either channel.

Keywords: Chain-of-Thought, thinking tokens, reasoning models, faithfulness divergence, hint influence, transparency, open-weight models, answer-text monitoring
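The four-way classification described in the abstract (hint acknowledged in the thinking tokens, the answer text, both, or neither) can be sketched with simple keyword matching. This is only an illustration: the keyword list, function names, and toy cases below are assumptions, since the abstract does not give the actual keywords or matching procedure.

```python
from collections import Counter

# Hypothetical keyword list; the abstract does not give the actual one.
HINT_KEYWORDS = ("professor", "hint", "was told")


def mentions_hint(text: str, keywords=HINT_KEYWORDS) -> bool:
    """Return True if any hint-related keyword appears in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def classify_case(thinking: str, answer: str) -> str:
    """Assign a hint-following case to one of four acknowledgment categories."""
    in_thinking = mentions_hint(thinking)
    in_answer = mentions_hint(answer)
    if in_thinking and in_answer:
        return "both"
    if in_thinking:
        return "thinking_only"  # the "thinking-answer divergence" pattern
    if in_answer:
        return "answer_only"
    return "neither"


# Toy hint-following cases (invented for illustration, not data from the paper).
cases = [
    ("The professor says B, so I will go with B.", "The answer is B."),
    ("The hint points to C.", "As the professor suggested, the answer is C."),
    ("Option A seems plausible.", "The answer is A."),
]

counts = Counter(classify_case(thinking, answer) for thinking, answer in cases)
print(dict(counts))
```

If the four categories are exhaustive and mutually exclusive, the shares reported in the abstract (55.4% thinking-only, 0.5% answer-only, 11.8% neither) would leave 100 − 55.4 − 0.5 − 11.8 = 32.3% of cases acknowledging the hint in both channels.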

44. ❌ Generative Modeling in Protein Design: Neural Representations, Conditional Generation, and Evaluation Standards

Authors: Senura Hansaja Wanasekara, Minh-Duong Nguyen, Xiaochen Liu, Nguyen H. Tran, Ken-Tye Yong Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26378v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

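Rationale: This paper is a survey of generative AI in protein research, focused on generative modeling for protein design, including neural representations, conditional generation, and evaluation standards. It is unrelated to the vast majority of the keywords (LLM technical principles, training methods, inference optimization, alignment, agents, and so on), because those keywords refer specifically to large language models (LLMs) and related techniques, whereas the paper discusses generative AI in the protein domain and does not involve LLMs. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": protein research falls under bioinformatics and is an application of AI to science, so it receives 10 (highly relevant).

!!! tip deepseek-chat TL;DR

This survey systematically reviews generative AI methods in protein research, covering neural representations, generative architectures, and task settings, and lays out evaluation standards and open challenges, with the aim of moving the field from predictive modeling toward reliable, function-driven protein engineering.

Abstract (Chinese translation)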

生成建模已成为蛋白质研究的核心范式，将机器学习从结构预测扩展到序列设计、主链生成、逆向折叠和生物分子相互作用建模等领域。然而，现有文献在表征方式、模型类别和任务框架方面仍较为零散，使得方法比较或确定合适的评估标准变得困难。本综述系统梳理了蛋白质研究中的生成式人工智能，围绕以下方面展开：(i) 涵盖序列、几何和多模态编码的基础表征；(ii) 包括 $\mathrm{SE}(3)$-等变扩散、流匹配以及混合预测器-生成器系统在内的生成架构；(iii) 从结构预测与从头设计到蛋白质-配体及蛋白质-蛋白质相互作用的任务场景。除方法梳理外，我们还比较了不同方法的假设前提、条件机制与可控性，并整合了强调防泄漏数据划分、物理有效性检验和面向功能的基准测试等评估最佳实践。最后，我们提出了关键的开放挑战：构象动力学与内在无序区域的建模、在保持效率的前提下扩展至大型复合体，以及针对两用性生物安全风险构建稳健的安全框架。通过将架构进展与实际评估标准及负责任的发展考量相统一，本综述旨在加速从预测建模向可靠、功能驱动的蛋白质工程的转变。

Abstract

Generative modeling has become a central paradigm in protein research, extending machine learning beyond structure prediction toward sequence design, backbone generation, inverse folding, and biomolecular interaction modeling. However, the literature remains fragmented across representations, model classes, and task formulations, making it difficult to compare methods or identify appropriate evaluation standards. This survey provides a systematic synthesis of generative AI in protein research, organized around (i) foundational representations spanning sequence, geometric, and multimodal encodings; (ii) generative architectures including $\mathrm{SE}(3)$-equivariant diffusion, flow matching, and hybrid predictor-generator systems; and (iii) task settings from structure prediction and de novo design to protein-ligand and protein-protein interactions. Beyond cataloging methods, we compare assumptions, conditioning mechanisms, and controllability, and we synthesize evaluation best practices that emphasize leakage-aware splits, physical validity checks, and function-oriented benchmarks. We conclude with critical open challenges: modeling conformational dynamics and intrinsically disordered regions, scaling to large assemblies while maintaining efficiency, and developing robust safety frameworks for dual-use biosecurity risks. By unifying architectural advances with practical evaluation standards and responsible development considerations, this survey aims to accelerate the transition from predictive modeling to reliable, function-driven protein engineering.

Keywords: Generative Modeling, Protein Design, Neural Representations, Conditional Generation, Evaluation Standards, Bioinformatics, AI for Science, Protein Engineering

45. ❌ Generative Score Inference for Multimodal Data

Authors: Xinyu Tian, Xiaotong Shen Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26349v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

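Rationale: The paper proposes the Generative Score Inference (GSI) framework, which focuses on uncertainty quantification for multimodal data. Keyword relevance: 1. "Large Language Models" scores 8 because the paper explicitly applies GSI to hallucination detection in large language models, one of its core application scenarios. 2. "Hallucination Mitigation" scores 8 because the paper uses GSI for hallucination detection and reports state-of-the-art performance, a directly related research direction. 3. The remaining keywords score 0 because the paper does not address MoE, SLMs, training techniques, inference optimization, agents, model compression, or other specific techniques, nor AI-for-science applications. Its main contribution is a general uncertainty quantification framework rather than an innovation in a specific LLM technique or application area.

!!! tip deepseek-chat TL;DR

The paper proposes the Generative Score Inference (GSI) framework, which approximates conditional score distributions with synthetic samples from deep generative models in order to construct statistically valid and informative prediction and confidence sets in multimodal learning. It validates the approach on hallucination detection in large language models and uncertainty estimation in image captioning.

Abstract (Chinese translation)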

在各类监督学习场景中，精确的不确定性量化对于做出可靠决策至关重要，尤其是在处理图像和文本等复杂的多模态数据时。现有方法通常面临显著局限，包括僵化的假设和有限的泛化能力，这制约了其在多样化监督学习任务中的有效性。为克服这些局限，我们提出了生成式分数推断（Generative Score Inference, GSI），这是一种灵活的推断框架，能够在广泛的多模态学习问题中构建统计有效且信息丰富的预测集与置信集。GSI利用深度生成模型生成的合成样本来近似条件分数分布，从而在不强加关于数据或任务的限制性假设的前提下，实现精确的不确定性量化。我们通过两个代表性场景实证验证了GSI的能力：大语言模型的幻觉检测以及图像描述任务中的不确定性估计。我们的方法在幻觉检测中达到了最先进的性能，并在图像描述中实现了稳健的预测不确定性，且其性能受到底层生成模型质量的积极影响。这些发现凸显了GSI作为一种通用推断框架的潜力，能够显著增强多模态学习中的不确定性量化与可信度。

Abstract

Accurate uncertainty quantification is crucial for making reliable decisions in various supervised learning scenarios, particularly when dealing with complex, multimodal data such as images and text. Current approaches often face notable limitations, including rigid assumptions and limited generalizability, constraining their effectiveness across diverse supervised learning tasks. To overcome these limitations, we introduce Generative Score Inference (GSI), a flexible inference framework capable of constructing statistically valid and informative prediction and confidence sets across a wide range of multimodal learning problems. GSI utilizes synthetic samples generated by deep generative models to approximate conditional score distributions, facilitating precise uncertainty quantification without imposing restrictive assumptions about the data or tasks. We empirically validate GSI’s capabilities through two representative scenarios: hallucination detection in large language models and uncertainty estimation in image captioning. Our method achieves state-of-the-art performance in hallucination detection and robust predictive uncertainty in image captioning, and its performance is positively influenced by the quality of the underlying generative model. These findings underscore the potential of GSI as a versatile inference framework, significantly enhancing uncertainty quantification and trustworthiness in multimodal learning.

Keywords: Generative Score Inference, uncertainty quantification, multimodal data, hallucination detection, large language models, confidence sets, deep generative models, image captioning
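The general idea in the GSI abstract, using synthetic samples to approximate a conditional score distribution and then reading a prediction set off its quantile, can be illustrated on a toy one-dimensional problem. This is a loose sketch and not the paper's algorithm: the Gaussian "generator", the predictor `2 * x`, the absolute-error score, and all function names are assumptions chosen to keep the example self-contained.

```python
import math
import random


def empirical_quantile(values, level):
    """Smallest sample value whose empirical CDF reaches `level`."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(level * len(ordered)) - 1))
    return ordered[index]


def toy_generator(x, n, rng):
    """Stand-in for a deep generative model (assumption): y | x ~ N(2x, 1)."""
    return [rng.gauss(2.0 * x, 1.0) for _ in range(n)]


def predictor(x):
    """Hypothetical point predictor whose errors are being scored."""
    return 2.0 * x


def score(x, y):
    """Hypothetical score: absolute error of the predictor."""
    return abs(y - predictor(x))


def prediction_interval(x, alpha=0.1, n=2000, seed=0):
    """Interval of outcomes whose score is below the (1 - alpha) synthetic quantile."""
    rng = random.Random(seed)
    synthetic = toy_generator(x, n, rng)
    threshold = empirical_quantile([score(x, y) for y in synthetic], 1 - alpha)
    center = predictor(x)
    return center - threshold, center + threshold


low, high = prediction_interval(1.0)
print(f"approximate 90% prediction interval: [{low:.2f}, {high:.2f}]")
```

In this toy setting the quality of the interval depends entirely on how well the generator matches the true conditional distribution, which is consistent with the abstract's remark that performance is influenced by the quality of the underlying generative model.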

46. ❌ CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law

Authors: JiHyeok Jung, TaeYoung Yoon, HyunSouk Cho Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26332v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

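Rationale: The paper evaluates the context-aware reasoning ability of large language models in Korean law, so it is highly relevant to "Large Language Models" (10). It involves legal reasoning tasks that require multi-step reasoning and deliberate thinking, which gives it some connection to "Chain of Thought" and "System 2 Thinking" (5 each). Other keywords, such as MoE, SLMs, training techniques, optimization methods, agent systems, and model compression, are not addressed in the paper and score 0. The paper is an applied study of large models in a specific domain (law), which fits the research background.

!!! tip deepseek-chat TL;DR

The paper proposes the CALRK-Bench benchmark for evaluating the context-aware reasoning ability of large language models in Korean law; experiments show that current large models perform poorly on its tasks.

Abstract (Chinese translation)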

法律推理不仅需要法律规则的适用，更要求理解这些规则运作的语境。然而，现有的法律基准主要评估固定规范假设下的规则应用，因而未能捕捉法律判断发生转变或多重规范相互作用的场景。本研究提出CALRK-Bench——一个基于韩国法律体系的语境感知法律推理基准。该基准评估模型能否识别法律规范的时间效力、判断特定案件是否具备充分的法律信息，并理解法律判决转变背后的原因。数据集构建于法律判例与法律咨询记录，并经由法律专家验证。实验结果表明，即使是近期的大型语言模型在这三项任务上也持续表现出较低性能。CALRK-Bench为评估语境感知法律推理能力（而非简单的法律知识记忆）提供了新的压力测试。我们的代码公开于https://github.com/jhCOR/CALRKBench。

Abstract

Legal reasoning requires not only the application of legal rules but also an understanding of the context in which those rules operate. However, existing legal benchmarks primarily evaluate rule application under the assumption of fixed norms, and thus fail to capture situations where legal judgments shift or where multiple norms interact. In this work, we propose CALRK-Bench, a context-aware legal reasoning benchmark based on the legal system in Korean. CALRK-Bench evaluates whether models can identify the temporal validity of legal norms, determine whether sufficient legal information is available for a given case, and understand the reasons behind shifts in legal judgments. The dataset is constructed from legal precedents and legal consultation records, and is validated by legal experts. Experimental results show that even recent large language models consistently exhibit low performance on these three tasks. CALRK-Bench provides a new stress test for evaluating context-aware legal reasoning rather than simple memorization of legal knowledge. Our code is available at https://github.com/jhCOR/CALRKBench.

Keywords: Legal Reasoning, Context-Aware, Benchmark, Large Language Models, Korean Law, Temporal Validity, Legal Judgments, Evaluation

47. ❌ Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation

Authors: Yiming Ren, Yujiu Yang, Junjie Wang Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26330v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

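Rationale: The paper's core topic is supervised fine-tuning (SFT) of vision-language models (VLMs), which bears directly on "Post-training OR Supervised Fine-tuning OR SFT" (10) and "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (10). Its focus on degraded reasoning ability relates to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" (8) and "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (8). It uses the Qwen3-VL-2B model, which falls within the scope of large models, so it relates to "Large Language Models OR LLMs OR Foundation Models" (8). Other keywords, such as MoE, SLMs, Scaling Laws, RAG, and RLHF, have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the loss of reasoning ability that vision-language models suffer during supervised fine-tuning and proposes a lightweight input-adaptive depth aggregation mechanism. On Qwen3-VL-2B, with only 0.14M additional parameters, it improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning.

Abstract (Chinese translation)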

在视觉指令数据上进行监督微调（SFT）通常能提升视觉语言模型（VLMs）的感知能力，但同时会降低其推理性能，从而在训练后阶段产生持续的推理损失。我们探究这种性能下降是否与深度方向表征的访问受阻有关，并发现即使采用固定的跨深度聚合也能显著恢复推理能力，这表明保持跨深度访问是VLM微调中一个被忽略的重要因素。基于此观察，我们提出了输入自适应深度聚合（Input-Adaptive Depth Aggregation, IADA），这是一种轻量级机制，通过低秩瓶颈实现跨深度检索的输入自适应、模态感知和高效参数化。在Qwen3-VL-2B模型上，相较于仅使用LoRA的微调方法，IADA仅增加0.14M参数，就将平均推理分数提升了9.5分，平均感知分数提升了3.3分，且在参数高效的低秩设置中表现出最强的增益。

Abstract

Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only 0.14M additional parameters, with the strongest gains appearing in parameter-efficient low-rank settings.

Keywords: Vision-Language Models, Supervised Fine-Tuning, Reasoning Tax, Parameter-Efficient Fine-Tuning, Depth Aggregation, LoRA, Cross-depth Representations, Qwen3-VL-2B
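To make the phrase "input-adaptive cross-depth aggregation through a low-rank bottleneck" from the IADA abstract concrete, the NumPy sketch below mixes per-layer hidden states with weights computed from the input via a rank-`r` projection. It is an illustration under stated assumptions, not the paper's architecture: conditioning on the top-layer state, the softmax over depth, the shapes, and all names are hypothetical, and the modality-aware component is not modeled.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    shifted = z - z.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def adaptive_depth_aggregate(hidden_states, w_down, w_up):
    """Mix per-layer states with input-dependent weights.

    hidden_states: (num_layers, d) states of one token at every depth
    w_down:        (r, d) down-projection of the low-rank bottleneck
    w_up:          (num_layers, r) up-projection producing one logit per layer
    """
    query = hidden_states[-1]                # assumption: condition on the top layer
    logits = w_up @ (w_down @ query)         # d -> r -> num_layers
    weights = softmax(logits)
    return weights @ hidden_states, weights


rng = np.random.default_rng(0)
num_layers, d, r = 6, 16, 2
H = rng.normal(size=(num_layers, d))
w_down = rng.normal(size=(r, d)) * 0.1
w_up = rng.normal(size=(num_layers, r)) * 0.1

mixed, weights = adaptive_depth_aggregate(H, w_down, w_up)
extra_params = w_down.size + w_up.size       # r*d + num_layers*r
print(mixed.shape, weights.round(3), extra_params)

# With a zero up-projection the weights are uniform, i.e. a fixed average over depth.
fixed, uniform = adaptive_depth_aggregate(H, w_down, np.zeros_like(w_up))
```

In this toy formulation the fixed cross-depth aggregation mentioned in the abstract appears as the special case where the learned weights do not depend on the input.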

48. ❌ PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management

Authors: Eugenio Rodrigo Zimmer Neves, Amanda Vanon Correa, Camila Campioni, Gabielli Pare Guglielmi, Bruno Morelli Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26324v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

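Rationale: The paper concerns a normative information architecture for pharmaceutical knowledge management (the PLP infrastructure), which technically separates document preservation, semantic interpretation, and contextual presentation in order to address traceability, transparency, and accountability. Although it touches on AI applications in pharmacy (such as machine-assisted reading), its core is information-system architecture and a knowledge-management framework rather than large-model or deep-learning techniques. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some connection to the topic (AI applications in the pharmaceutical domain), but the paper does not explore large-model principles or innovations in depth, so it receives only 5 (some relevance). No other keyword is addressed in the paper; each scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes a normative information architecture called PATOS-Lector-PRISMA (PLP) to address the loss of provenance, interpretive opacity, and erosion of accountability that arise in pharmaceutical knowledge management when document preservation, semantic interpretation, and contextual presentation are conflated. It validates the architecture with real system data from Brazil's regulatory context.

Abstract (Chinese translation)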

现有大多数药学人工智能方法将三个认识论层面不同的操作——文档保存、语义解读与情境呈现——压缩至单一技术层中。这种混淆是导致一系列重复性脆弱问题的根源，包括来源信息丢失、解读过程不透明、警示疲劳以及责任追溯弱化。本文提出PATOS–Lector–PRISMA（PLP）架构作为负责任药品知识管理的规范性信息基础设施。PATOS通过显式版本控制与溯源机制保存监管文档；Lector在人工策管下实施机器辅助阅读，生成锚定于原始来源的类型化断言；PRISMA通过RPDA框架（监管、处方、调配、给药）实现情境化呈现，将同一信息核心折射至不同的专业视图。该架构引入“证据包”作为可问责断言的规范化单元（具备版本化、可追溯、认识论边界明确及策管验证等特性），其断言类型通过言外行为效力进行界定。通过实际系统数据，本文以单水合安乃近为例演示了该架构在三层结构中的完整工作流程。该架构在巴西监管环境中开发并验证，其运行实例已涵盖超过16,000份官方文档及针对五种参考药物策管的38个证据包。研究表明，该方案可与现有临床决策支持系统形成互补，提供当前系统所缺乏的基础设施条件：文档锚定性、解读透明度及制度化的责任追溯能力。

Abstract

Most existing approaches to AI in pharmacy collapse three epistemologically distinct operations into a single technical layer: document preservation, semantic interpretation, and contextual presentation. This conflation is a root cause of recurring fragilities including loss of provenance, interpretive opacity, alert fatigue, and erosion of accountability. This paper proposes the PATOS–Lector–PRISMA (PLP) infrastructure as a normative information architecture for responsible pharmaceutical knowledge management. PATOS preserves regulatory documents with explicit versioning and provenance; Lector implements machine-assisted reading with human curation, producing typed assertions anchored to primary sources; PRISMA delivers contextual presentation through the RPDA framework (Regulatory, Prescription, Dispensing, Administration), refracting the same informational core into distinct professional views. The architecture introduces the Evidence Pack as a formal unit of accountable assertion (versioned, traceable, epistemically bounded, and curatorially validated), with assertions typified by illocutionary force. A worked example traces dipyrone monohydrate across all three layers using real system data. Developed and validated in Brazil’s regulatory context, the architecture is grounded in an operational implementation comprising over 16,000 official documents and 38 curated Evidence Packs spanning five reference medications. The proposal is demonstrated as complementary to operational decision support systems, providing infrastructural conditions that current systems lack: documentary anchoring, interpretive transparency, and institutional accountability.

Keywords: pharmaceutical knowledge management, normative information architecture, document preservation, semantic interpretation, contextual presentation, accountability, regulatory documents, Evidence Pack
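The abstract describes the Evidence Pack as a versioned, traceable unit of typed assertions anchored to primary sources and presented through four RPDA views. One way to picture such a unit is as a small data structure. The sketch below is purely illustrative: every class, field name, and value is an assumption, since the abstract does not specify a schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RpdaView(Enum):
    """The four professional views named in the abstract."""
    REGULATORY = "regulatory"
    PRESCRIPTION = "prescription"
    DISPENSING = "dispensing"
    ADMINISTRATION = "administration"


@dataclass(frozen=True)
class Assertion:
    """A typed assertion anchored to a primary source (hypothetical fields)."""
    text: str
    illocutionary_force: str      # the type of the assertion; labels are assumptions
    source_document_id: str       # which preserved document it is anchored to
    source_version: str           # which version of that document


@dataclass
class EvidencePack:
    """A versioned, curated bundle of assertions (hypothetical fields)."""
    pack_id: str
    version: int
    medication: str
    curated_by: str
    assertions: List[Assertion] = field(default_factory=list)

    def is_traceable(self) -> bool:
        """True if every assertion names a source document and a version."""
        return all(a.source_document_id and a.source_version for a in self.assertions)


# Placeholder values only; none of this is data from the paper.
pack = EvidencePack(
    pack_id="pack-001",
    version=1,
    medication="example medication",
    curated_by="example curator",
    assertions=[Assertion("example assertion text", "directive", "doc-123", "v2")],
)
print(pack.is_traceable(), [view.value for view in RpdaView])
```

Keeping the source document and version on each assertion, rather than on the pack as a whole, is one way to preserve provenance when a single pack draws on several regulatory documents.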

49. ❌ Label-Free Cross-Task LoRA Merging with Null-Space Compression

Authors: Wonyoung Lee, Wooseong Jeong, Kuk-Jin Yoon Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26317v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

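Rationale: The paper's core topic is merging models after LoRA fine-tuning, which is highly relevant to "PEFT/LoRA" and "Model Merging" (10 each). It involves foundation models and fine-tuning, which gives it some connection to "Large Language Models" and "Post-training/SFT" (8 each). The method involves compression, a weak connection to "Quantization/Model Compression" (5). Other keywords, such as MoE, SLMs, Scaling Laws, and RAG, are not addressed (0).

!!! tip deepseek-chat TL;DR

The paper proposes Null-Space Compression (NSC) Merging, a label-free, output-agnostic LoRA merging method that sets merge weights from adapter geometry. It achieves state-of-the-art performance across twenty heterogeneous vision tasks and outperforms baselines on six NLI benchmarks.

Abstract (Chinese translation)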

模型融合技术可在无需联合多任务训练的情况下，将独立微调的检查点进行整合。在基础模型时代，基于低秩自适应（LoRA）的微调方法已被广泛采用，这使得LoRA融合成为极具潜力的研究方向。现有方法通常适用于同构场景（即所有目标任务均为分类任务），但在任务同时涵盖分类与回归时往往失效。基于熵的替代方法无法应用于回归任务，且由于语言模型需处理长序列标记，其计算成本高昂。本文提出零空间压缩（NSC）融合方法，这是一种无需标签、与输出无关的技术，其融合权重完全由适配器的几何结构决定。我们的核心发现是：在LoRA微调过程中，参数增量$ΔW = BA$中的下投影因子$A$会压缩其零空间，且压缩程度与模型性能相关。NSC利用这一现象作为融合的优化信号，可泛化至分类、回归及序列生成任务。在二十项异构视觉任务上，NSC取得了最先进的性能表现，实现了均衡的性能提升，而现有方法往往对任务子集产生过拟合。此外，NSC在六项自然语言推理基准测试以及视觉问答（VQA）与图像描述生成等视觉-语言评估任务中均优于基线方法，展现了其可扩展性与有效性。

Abstract

Model merging combines independently fine-tuned checkpoints without joint multi-task training. In the era of foundation-model, fine-tuning with Low-Rank Adaptation (LoRA) is prevalent, making LoRA merging a promising target. Existing approaches can work in homogeneous settings where all target tasks are classification but often fail when tasks span classification and regression. Approaches using entropy-based surrogates do not apply to regression and are costly for large language models due to long token sequences. We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. Our key observation is that during LoRA finetuning the down-projection factor $A$ in $\Delta W = BA$ compresses its null space, and the compression correlates with performance. NSC uses this as an optimization signal for merging that can generalize across classification, regression, and sequence generation. NSC achieves state-of-the-art performance across twenty heterogeneous vision tasks with balanced gains where prior methods overfit subsets of tasks. It also outperforms baselines on six NLI benchmarks and on vision-language evaluations for VQA and image captioning, demonstrating scalability and effectiveness.

Keywords: Model Merging, LoRA, Null-Space Compression, Parameter-efficient Fine-tuning, Cross-task, Label-free, Vision Tasks, NLI Benchmarks
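The notation in the NSC abstract can be unpacked with a few lines of NumPy: the LoRA update is ΔW = BA, inputs in the null space of the down-projection A are left untouched by the adapter, and a merge is a weighted sum of such updates. The sketch below only illustrates these objects. It does not reproduce how NSC measures null-space compression or chooses merge weights; the shapes, the placeholder weights, and the function names are assumptions.

```python
import numpy as np


def lora_delta(B: np.ndarray, A: np.ndarray) -> np.ndarray:
    """LoRA weight update: delta_W = B @ A, with B (d_out, r) and A (r, d_in)."""
    return B @ A


def null_space_basis(A: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Orthonormal basis (as columns) of {x : A @ x = 0}, computed via SVD."""
    _, singular_values, vt = np.linalg.svd(A, full_matrices=True)
    rank = int((singular_values > tol).sum())
    return vt[rank:].T


def merge_adapters(adapters, weights) -> np.ndarray:
    """Weighted sum of LoRA updates; how to pick the weights is the open question."""
    return sum(w * lora_delta(B, A) for (B, A), w in zip(adapters, weights))


rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2
adapters = [(rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))) for _ in range(2)]

B0, A0 = adapters[0]
N0 = null_space_basis(A0)                               # (d_in, d_in - r) for full-rank A0
merged = merge_adapters(adapters, weights=[0.5, 0.5])   # placeholder weights, not NSC's

print(N0.shape, merged.shape)
print(np.allclose(lora_delta(B0, A0) @ N0, 0.0))        # the adapter ignores null-space inputs
```

Because the null space depends only on the adapter matrices, any signal derived from it needs neither labels nor model outputs, which matches the "label-free, output-agnostic" framing in the abstract.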

50. ❌ Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy

Authors: Wooseong Jeong, Wonyoung Lee, Kuk-Jin Yoon Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26299v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

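Rationale: The paper's core topic is merging LoRA modules, which is highly relevant to "PEFT/LoRA" and "Model Merging" (10 each). It involves model alignment and fine-tuning, which gives it some connection to "Post-training/SFT" and "Instruction Tuning/Alignment" (5 each). Although it does not explicitly mention LLMs, LoRA is commonly used with large models, so "Large Language Models" receives a baseline 5. Other keywords, such as MoE, quantization, and inference acceleration, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

To address insufficient subspace coverage and directional anisotropy when merging LoRA modules, the paper proposes TARA-Merging, which combines preference alignment with direction-wise reweighting. It consistently outperforms vanilla and LoRA-aware baselines across eight vision and six NLI benchmarks.

Abstract (Chinese translation)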

融合多个低秩自适应（LoRA）模块对于构建通用系统具有广阔前景，但也面临挑战，因为LoRA更新方向跨越不同的子空间且贡献程度不均。当简单合并时，这种不匹配可能削弱对某些任务损失最为关键的方向，同时过度强调相对次要的方向，最终降低模型忠实表征所有任务的能力。我们从两个视角重新审视这一问题：子空间覆盖度——用于捕捉LoRA方向覆盖多样化表征方向的广度，以及各向异性——反映这些方向上影响的不均衡性。我们提出TARA-Merging（任务秩各向异性对齐）方法，该方法通过偏好加权的交叉熵伪损失来对齐合并权重，同时保留与任务相关的LoRA子空间。这确保了广泛的子空间覆盖，并通过方向性重加权缓解各向异性。在八个视觉任务和六个自然语言推理基准测试中，TARA-Merging始终优于原始基线及LoRA感知基线，展现出强大的鲁棒性和泛化能力，并凸显了在LoRA合并中同时处理子空间覆盖度与各向异性的重要性。

Abstract

Merging multiple Low-Rank Adaptation (LoRA) modules is promising for constructing general-purpose systems, yet challenging because LoRA update directions span different subspaces and contribute unevenly. When merged naively, such mismatches can weaken the directions most critical to certain task losses while overemphasizing relatively less important ones, ultimately reducing the model’s ability to represent all tasks faithfully. We revisit this problem through two perspectives: subspace coverage, which captures how broadly LoRA directions cover diverse representational directions, and anisotropy, which reflects the imbalance of influence across those directions. We propose TARA-Merging (Task-Rank Anisotropy Alignment), which aligns merging weights using a preference-weighted cross-entropy pseudo-loss while preserving task-relevant LoRA subspaces. This ensures broad subspace coverage and mitigates anisotropy via direction-wise reweighting. Across eight vision and six NLI benchmarks, TARA-Merging consistently outperforms vanilla and LoRA-aware baselines, demonstrating strong robustness and generalization, and highlighting the importance of addressing both subspace coverage and anisotropy in LoRA merging.

Keywords: LoRA merging, subspace coverage, anisotropy, parameter-efficient fine-tuning, model merging, task alignment, TARA-Merging, multi-task learning
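
To make the merging problem in the abstract above concrete, here is a minimal sketch of the naive baseline it argues against: a weighted sum of the dense updates `B @ A` from several LoRA modules. This is not TARA-Merging itself. The shapes, the random modules, and the entropy-based `effective_rank` (used here as one possible proxy for how unevenly the merged directions contribute) are all illustrative assumptions, not definitions taken from the paper.

```python
import numpy as np

def lora_delta(B, A):
    """Dense weight update contributed by one LoRA module: delta_W = B @ A."""
    return B @ A

def merge_lora(modules, weights):
    """Naive weighted-sum merge of several LoRA updates (baseline, not TARA-Merging)."""
    return sum(w * lora_delta(B, A) for (B, A), w in zip(modules, weights))

def effective_rank(delta, eps=1e-12):
    """Entropy-based effective rank of the singular values (hypothetical proxy).

    Close to the true rank when singular values are balanced,
    close to 1 when a single direction dominates.
    """
    s = np.linalg.svd(delta, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                      # drop numerically zero directions
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 2              # assumed layer shape and LoRA rank
task_a = (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in)))
task_b = (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in)))

merged = merge_lora([task_a, task_b], weights=[0.5, 0.5])
print("rank of merged update:", np.linalg.matrix_rank(merged))
print("effective rank:", round(effective_rank(merged), 2))
```

The gap between the algebraic rank and the effective rank gives a rough feel for anisotropy: the merged update can span several directions while most of its magnitude sits in only a few of them.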

51. ❌ findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

Authors: Héctor Javier Vázquez Martínez Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26292v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

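Scoring rationale: The paper presents a language-agnostic toolkit for syllable-level speech tokenization and embedding, focused on speech processing, syllable segmentation, and tool standardization. It has no direct connection to any scoring keyword (these mainly cover large-model principles, training methods, inference optimization, alignment, and agent systems), so every keyword scores 0.

!!! tip deepseek-chat TL;DR

To address the fragmentation of research on syllable-level speech tokenization, the paper introduces findsylls, a language-agnostic toolkit that unifies syllable detection and embedding extraction. It is demonstrated on English, Spanish, and Kono data, showing support for reproducible experiments in both high-resource and under-resourced settings.

Abstract (Chinese translation)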

音节单元为口语建模和无监督词汇发现提供了紧凑且具有语言学意义的表征，但音节划分研究在不同实现方案、数据集和评估协议间仍处于碎片化状态。我们推出findsylls——一个模块化、语言无关的工具包，它将经典音节检测器与端到端音节划分器统一于通用接口之下，支持音节分割、嵌入提取及多粒度评估。该工具包实现并标准化了广泛使用的方法（如Sylber、VG-HuBERT），允许其组件重组，从而实现对表征、算法和标记速率的可控比较。我们在英语和西班牙语语料库以及来自科诺语（一种记录不足的中部曼德语系语言）的新手工标注数据上演示了findsylls，展示了单一框架如何在高资源与资源匮乏场景下支持可复现的音节级实验。

Abstract

Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.

Keywords: syllable-level speech tokenization, language-agnostic toolkit, syllable segmentation, embedding extraction, reproducible experiments, under-resourced languages, multi-granular evaluation, spoken language modeling

52. ❌ PhysVid: Physics Aware Local Conditioning for Generative Video Models

Authors: Saurabh Pathak, Elahe Arani, Mykola Pechenizkiy, Bahram Zonooz Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26285v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

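Scoring rationale: PhysVid focuses on improving the physical plausibility of generative video models through physics-aware local conditioning. Most scoring keywords concern language-model techniques (reasoning, alignment, optimization, compression, agents, and so on), whereas this paper studies how to inject physical constraints into video generation, so it has no direct connection to them. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI (specifically generative models) to modeling physical laws, which counts as AI applied to science but is not core bioinformatics or cheminformatics, hence 5 points (partial relevance). All other keywords are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address generative video models' frequent violations of physical laws, the paper proposes PhysVid, a physics-aware local conditioning scheme that fuses physics descriptions during training and uses negative physics prompts at inference. It improves physical commonsense scores on the VideoPhy benchmark by about 33% over baseline models.

Abstract (Chinese translation)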

生成式视频模型虽能实现较高的视觉保真度，却常违背基础物理原理，限制了其在真实场景中的可靠性。先前尝试引入物理约束的方法多依赖于条件输入：帧级信号具有领域特定性且时间跨度短，而全局文本提示则过于粗略且含噪声，无法捕捉细粒度动态。本文提出PhysVid——一种基于物理感知的局部条件化方案，该方案在时间连续的帧块上运行。每个帧块均标注了基于物理的状态、相互作用与约束描述，这些描述在训练过程中通过块感知交叉注意力机制与全局提示融合。在推理阶段，我们引入负向物理提示（描述局部相关的物理定律违反情况）以引导生成过程远离不合理的运动轨迹。在VideoPhy基准测试中，PhysVid将物理常识得分较基线视频生成模型提升约33%，在VideoPhy2上最高提升约8%。这些结果表明，局部化的物理感知引导能显著增强生成视频的物理合理性，标志着向基于物理基础的视频模型迈出了重要一步。

Abstract

Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Prior attempts to inject physics rely on conditioning: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics. We present PhysVid, a physics-aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by $\approx 33\%$ over baseline video generators, and by up to $\approx 8\%$ on VideoPhy2. These results show that local, physics-aware guidance substantially increases physical plausibility in generative video and marks a step toward physics-grounded video models.

Keywords: Generative Video Models, Physics Aware, Local Conditioning, Physical Plausibility, Negative Physics Prompts, VideoPhy Benchmark, Chunk-aware Cross-attention
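
The abstract does not give the formula behind its negative physics prompts. A common way to steer a diffusion model away from a negative prompt is classifier-free guidance with the negative-prompt prediction used as the starting point, and the sketch below shows that general pattern only. The function name, the toy vectors, and the `scale` values are assumptions for illustration and may differ from what PhysVid actually does.

```python
import numpy as np

def guided_noise(eps_pos, eps_neg, scale):
    """Negative-prompt guidance: start from the prediction conditioned on the
    negative prompt and extrapolate toward the positively conditioned one."""
    return eps_neg + scale * (eps_pos - eps_neg)

# Toy noise predictions for one chunk of frames (hypothetical values).
eps_pos = np.array([1.0, 0.0])   # conditioned on the physics-grounded description
eps_neg = np.array([0.0, 1.0])   # conditioned on the negative physics prompt

print(guided_noise(eps_pos, eps_neg, scale=1.0))  # scale 1 recovers eps_pos
print(guided_noise(eps_pos, eps_neg, scale=3.0))  # pushed further away from eps_neg
```

With `scale` above 1 the result moves past the positive prediction in the direction pointing away from the negative one, which is the sense in which generation is steered away from implausible trajectories.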

53. ❌ Knowdit: Agentic Smart Contract Vulnerability Detection with Auditing Knowledge Summarization

Authors: Ziqiao Kong, Wanxu Xia, Chong Wang, Yi Lu, Pan Li, Shaohua Li, Zong Cao, Yang Liu Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26270v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

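Scoring rationale: The paper proposes Knowdit, a multi-agent system for smart contract vulnerability detection, so it is highly relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' and 'Multi-agent Systems/Agent Coordination' (10 points). The framework includes an iterative loop with a reflection step, giving partial relevance to 'Self-Correction/Self-Improvement/Self-Reflection' (5 points). The agents may use tools for specification generation and fuzzing, giving partial relevance to 'Tool Use/Function Calling/API Tool Use' (5 points). The paper does not explicitly mention large models, deep-learning principles, or scientific applications, so the other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes Knowdit, a system that combines an auditing knowledge graph with a multi-agent framework to detect smart contract vulnerabilities. In evaluation it significantly outperforms the baselines, and on six real-world projects it uncovers 12 high-severity and 10 medium-severity previously unknown vulnerabilities.

Abstract (Chinese translation)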

智能合约管理着去中心化金融（DeFi）中数十亿美元的资金，然而自动化漏洞检测仍然面临挑战，因为许多漏洞与项目特定的业务逻辑紧密耦合。我们观察到，在不同DeFi商业模式中反复出现的漏洞往往共享相同的底层经济机制，我们将其称为DeFi语义，并且捕捉这些共享的抽象概念可以实现更系统化的审计。基于这一洞见，我们提出了Knowdit，一个知识驱动的、智能体化的智能合约漏洞检测框架。Knowdit首先从历史人工审计报告中构建审计知识图谱，将细粒度的DeFi语义与反复出现的漏洞模式关联起来。针对新项目，一个多智能体框架通过一个由共享工作内存驱动的迭代循环——包括规范生成、测试工具合成、模糊测试执行和发现反馈——来利用这些知识，实现持续优化。
我们在12个最近的Code4rena项目（包含75个真实漏洞）上评估了Knowdit。Knowdit检测出了全部14个高严重性漏洞和77%的中等严重性漏洞，且仅产生2个误报，显著优于所有基线方法。在应用于六个真实世界项目时，Knowdit进一步发现了12个先前未知的高严重性漏洞和10个中等严重性漏洞，证明了其卓越的性能。

Abstract

Smart contracts govern billions of dollars in decentralized finance (DeFi), yet automated vulnerability detection remains challenging because many vulnerabilities are tightly coupled with project-specific business logic. We observe that recurring vulnerabilities across diverse DeFi business models often share the same underlying economic mechanisms, which we term DeFi semantics, and that capturing these shared abstractions can enable more systematic auditing. Building on this insight, we propose Knowdit, a knowledge-driven, agentic framework for smart contract vulnerability detection. Knowdit first constructs an auditing knowledge graph from historical human audit reports, linking fine-grained DeFi semantics with recurring vulnerability patterns. Given a new project, a multi-agent framework leverages this knowledge through an iterative loop of specification generation, harness synthesis, fuzz execution, and finding reflection, driven by a shared working memory for continuous refinement. We evaluate Knowdit on 12 recent Code4rena projects with 75 ground-truth vulnerabilities. Knowdit detects all 14 high-severity and 77% of medium-severity vulnerabilities with only 2 false positives, significantly outperforming all baselines. Applied to six real-world projects, Knowdit further discovers 12 high- and 10 medium-severity previously unknown vulnerabilities, proving its outstanding performance.

Keywords: smart contract vulnerability detection, agentic framework, multi-agent system, auditing knowledge graph, DeFi semantics, fuzz execution, working memory, vulnerability patterns

54. ❌ GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation

Authors: Xujing Tao, Chuxin Wang, Yubo Ai, Zhixin Cheng, Zhuoyuan Li, Liangsheng Liu, Yujia Chen, Xinjun Li, Qiao Li, Wenfei Yang, Tianzhu Zhang Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26260v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

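Scoring rationale: The paper addresses 3D semantic segmentation, a computer vision task, and proposes the GeoGuide framework for open-vocabulary 3D segmentation, using geometric guidance to strengthen semantic consistency. It has no direct connection to most large-model keywords (LLMs, MoE, RLHF, and so on). It is moderately relevant (5 points) only to 'Pre-training' (it uses pretrained 3D models) and 'AI for Science' (an application of AI in science and engineering); all other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes GeoGuide, a framework that uses hierarchical geometric guidance to address insufficient geometry-semantic alignment in open-vocabulary 3D semantic segmentation, and reports superior performance on ScanNet v2, Matterport3D, and nuScenes.

Abstract (Chinese translation)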

开放词汇三维语义分割旨在识别训练集之外的任意类别。现有方法主要依赖于从二维开放词汇模型蒸馏知识。然而，将三维特征对齐到二维表示空间会限制其内在的几何学习能力，并继承来自二维预测的误差。为应对这些局限，我们提出GeoGuide，一种利用预训练三维模型、融合层次化几何-语义一致性以实现开放词汇三维分割的新框架。具体而言，我们引入基于不确定性的超点蒸馏模块，融合几何与语义特征以估计逐点不确定性，自适应地加权超点内的二维特征，从而抑制噪声同时保留判别性信息，以增强局部语义一致性。此外，我们的实例级掩码重建模块利用几何先验，通过重建完整的实例掩码来强化实例内部的语义一致性。同时，我们的实例间关系一致性模块通过对齐几何与语义相似性矩阵，校准同类物体的跨实例一致性，以缓解视角变化引起的语义漂移。在ScanNet v2、Matterport3D和nuScenes数据集上的大量实验证明了GeoGuide的优越性能。

Abstract

Open-vocabulary 3D semantic segmentation aims to segment arbitrary categories beyond the training set. Existing methods predominantly rely on distilling knowledge from 2D open-vocabulary models. However, aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits errors from 2D predictions. To address these limitations, we propose GeoGuide, a novel framework that leverages pretrained 3D models to integrate hierarchical geometry-semantic consistency for open-vocabulary 3D segmentation. Specifically, we introduce an Uncertainty-based Superpoint Distillation module to fuse geometric and semantic features for estimating per-point uncertainty, adaptively weighting 2D features within superpoints to suppress noise while preserving discriminative information to enhance local semantic consistency. Furthermore, our Instance-level Mask Reconstruction module leverages geometric priors to enforce semantic consistency within instances by reconstructing complete instance masks. Additionally, our Inter-Instance Relation Consistency module aligns geometric and semantic similarity matrices to calibrate cross-instance consistency for same-category objects, mitigating viewpoint-induced semantic drift. Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.

Keywords: Open-vocabulary 3D semantic segmentation, Hierarchical geometric guidance, Uncertainty-based superpoint distillation, Instance-level mask reconstruction, Inter-instance relation consistency, Geometric-semantic consistency, 3D feature learning, ScanNet v2

55. ❌ GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

Authors: Rui Xie, Zhi Gao, Chenrui Shi, Zirui Shang, Lu Chen, Qing Li Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26266v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

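Scoring rationale: The paper proposes the GUIDE framework, which resolves domain bias in GUI agents through a retrieval-augmented automated annotation pipeline. Core relevant keywords: 1) 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (10 points): the core innovation is a subtitle-driven Video-RAG pipeline that performs three-stage retrieval; 2) 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points): the paper studies GUI agents, which fall under LLM agents; 3) 'Multi-agent Systems OR Agent Coordination' (8 points): experiments on OSWorld show that GUIDE applies to multi-agent systems; 4) 'Large Language Models OR LLMs OR Foundation Models' (8 points): the work builds on large vision-language models (VLMs), an application of large models. The other keywords have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address domain bias in GUI agents caused by insufficient domain-specific training data, the paper proposes GUIDE, a framework that retrieves web tutorial videos and automatically annotates them to extract domain expertise, strengthening the agent's planning and grounding. Experiments show consistent improvements of over 5% in task performance without modifying model parameters.

Abstract (Chinese translation)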

大型视觉语言模型为图形用户界面智能体赋予了强大的界面理解与交互通用能力。然而，由于训练过程中对特定领域软件操作数据的接触不足，这些智能体表现出显著的领域偏差——它们对特定应用程序的具体操作流程（规划）和界面元素布局（定位）缺乏熟悉度，从而限制了其在实际任务中的表现。本文提出GUIDE框架（基于教学视频驱动的专业能力实现GUI去偏），这是一种无需训练、即插即用的解决方案，通过基于检索增强的自动化标注流程从网络教学视频中自主获取领域专业知识，以消除GUI智能体的领域偏差。GUIDE包含两项核心创新：首先，基于字幕的视频检索增强生成流程通过字幕分析解锁视频语义，执行渐进式三阶段检索——领域分类、主题提取与相关性匹配——以识别任务相关的教学视频；其次，基于逆向动力学范式构建的全自动标注流程，将经过UI元素检测增强的连续关键帧输入视觉语言模型，推断出所需的规划与定位知识，并将其注入智能体的相应模块，以同时解决领域偏差的两种表现形式。在OSWorld基准上的大量实验表明，GUIDE作为即插即用组件，对多智能体系统和单模型智能体均具有普适性。在不修改任何模型参数或架构的情况下，它持续带来超过5%的性能提升并减少执行步骤，验证了GUIDE可作为架构无关的增强方案，有效弥合GUI智能体的领域偏差。

Abstract

Large vision-language models have endowed GUI agents with strong general capabilities for interface understanding and interaction. However, due to insufficient exposure to domain-specific software operation data during training, these agents exhibit significant domain bias - they lack familiarity with the specific operation workflows (planning) and UI element layouts (grounding) of particular applications, limiting their real-world task performance. In this paper, we present GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise), a training-free, plug-and-play framework that resolves GUI agent domain bias by autonomously acquiring domain-specific expertise from web tutorial videos through a retrieval-augmented automated annotation pipeline. GUIDE introduces two key innovations. First, a subtitle-driven Video-RAG pipeline unlocks video semantics through subtitle analysis, performing progressive three-stage retrieval - domain classification, topic extraction, and relevance matching - to identify task-relevant tutorial videos. Second, a fully automated annotation pipeline built on an inverse dynamics paradigm feeds consecutive keyframes enhanced with UI element detection into VLMs, inferring the required planning and grounding knowledge that are injected into the agent’s corresponding modules to address both manifestations of domain bias. Extensive experiments on OSWorld demonstrate GUIDE’s generality as a plug-and-play component for both multi-agent systems and single-model agents. It consistently yields over 5% improvements and reduces execution steps - without modifying any model parameters or architecture - validating GUIDE as an architecture-agnostic enhancement to bridge GUI agent domain bias.

Keywords: GUI agents, domain bias, retrieval-augmented generation, video retrieval, automated annotation, vision-language models, multi-agent systems, plug-and-play framework
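
The three-stage retrieval named in the abstract above (domain classification, topic extraction, relevance matching) can be pictured with a toy pipeline over subtitle text. Everything below is a hypothetical stand-in: the keyword-based domain classifier, the stopword list, the Jaccard relevance score, and the sample videos are assumptions for illustration. The paper does not specify these components in its abstract, and GUIDE itself presumably relies on learned models rather than keyword overlap.

```python
from dataclasses import dataclass

@dataclass
class Video:
    title: str
    subtitles: str

# Hypothetical domain vocabularies used by the toy classifier.
DOMAIN_KEYWORDS = {
    "spreadsheet": {"cell", "formula", "sheet", "column"},
    "image_editing": {"layer", "brush", "crop", "filter"},
}
STOPWORDS = {"the", "a", "to", "in", "and", "of"}

def tokens(text):
    """Lowercased content words of a text, used as its 'topics'."""
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS

def classify_domain(text):
    """Stage 1: pick the domain whose vocabulary overlaps the text most."""
    t = tokens(text)
    return max(DOMAIN_KEYWORDS, key=lambda d: len(DOMAIN_KEYWORDS[d] & t))

def retrieve(task, videos, top_k=1):
    domain = classify_domain(task)
    candidates = [v for v in videos if classify_domain(v.subtitles) == domain]
    task_topics = tokens(task)                      # Stage 2: topic extraction

    def relevance(video):                           # Stage 3: relevance matching
        topics = tokens(video.subtitles)
        return len(task_topics & topics) / len(task_topics | topics)

    return sorted(candidates, key=relevance, reverse=True)[:top_k]

videos = [
    Video("Sum a column", "Select the cell below the column and type a sum formula"),
    Video("Freeze panes", "Open the sheet and freeze the top row of the sheet"),
    Video("Crop a photo", "Pick the crop tool and drag over the layer"),
]
task = "Add a formula to sum a column in the sheet"
print(retrieve(task, videos)[0].title)
```

Filtering by domain first keeps the later, finer-grained matching from being distracted by tutorials for unrelated applications, which is the point of making the retrieval progressive.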

56. ❌ Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models

Authors: Antoine Edy, Max Conti, Quentin Macé Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26259v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

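Scoring rationale: The paper analyzes specific behaviors of Late Interaction retrieval models (length bias and similarity distribution). It belongs to information retrieval rather than proposing new large-model or deep-learning techniques. The scoring keywords cover large-model techniques, training methods, inference optimization, and application domains, none of which relates directly to the paper's analysis of retrieval models.

!!! tip deepseek-chat TL;DR

The paper analyzes the length bias introduced by multi-vector scoring in Late Interaction retrieval models and the similarity distribution beyond the best scores pooled by the MaxSim operator. It finds that the theoretical length bias of causal models holds in practice, that bi-directional models can also be affected in extreme cases, and that the MaxSim operator efficiently exploits token-level similarity scores.

Abstract (Chinese translation)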

尽管晚期交互模型展现出强大的检索性能，但其许多内在动态机制仍未得到充分研究，这可能隐藏了性能瓶颈。在本研究中，我们聚焦于晚期交互检索中的两个议题：使用多向量评分时产生的长度偏差，以及被MaxSim算子池化的最佳分数之外的相似度分布。我们在NanoBEIR基准上对前沿模型进行了这些行为的分析。结果表明，虽然因果性晚期交互模型的理论长度偏差在实践中确实存在，但双向模型在极端情况下也可能受其影响。我们还发现，在排名第一的文档词元之外不存在显著的相似度趋势，这验证了MaxSim算子能有效利用词元级别的相似度分数。

Abstract

While Late Interaction models exhibit strong retrieval performance, many of their underlying dynamics remain understudied, potentially hiding performance bottlenecks. In this work, we focus on two topics in Late Interaction retrieval: a length bias that arises when using multi-vector scoring, and the similarity distribution beyond the best scores pooled by the MaxSim operator. We analyze these behaviors for state-of-the-art models on the NanoBEIR benchmark. Results show that while the theoretical length bias of causal Late Interaction models holds in practice, bi-directional models can also suffer from it in extreme cases. We also note that no significant similarity trend lies beyond the top-1 document token, validating that the MaxSim operator efficiently exploits the token-level similarity scores.

Keywords: Late Interaction models, retrieval performance, length bias, multi-vector scoring, similarity distribution, MaxSim operator, NanoBEIR benchmark, token-level similarity
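
For readers unfamiliar with the MaxSim operator mentioned above, the sketch below scores a query against a document by taking, for each query token, the maximum similarity over document tokens and summing the results. The random embeddings and sizes are assumptions. The comparison at the end shows only one simple way document length can affect the score, namely that appending tokens can never lower a MaxSim score; it is not a reproduction of the bias the paper analyzes.

```python
import numpy as np

def normalize(x):
    """Scale each token embedding to unit length so dot products are cosines."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def maxsim_score(query_emb, doc_emb):
    """Late-interaction score: max over document tokens, summed over query tokens."""
    sim = query_emb @ doc_emb.T          # shape (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
dim = 32                                  # assumed embedding size
query = normalize(rng.normal(size=(8, dim)))
short_doc = normalize(rng.normal(size=(16, dim)))
# The long document contains the short one plus 240 unrelated random tokens.
long_doc = np.vstack([short_doc, normalize(rng.normal(size=(240, dim)))])

short_score = maxsim_score(query, short_doc)
long_score = maxsim_score(query, long_doc)
print(f"short: {short_score:.3f}  long: {long_score:.3f}")
```

Because the maximum is taken over a superset of tokens, the longer document scores at least as high even though the added tokens carry no relevant content.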

57. ❌ ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction

Authors: David Hagerman, Roman Naeem, Erik Brorsson, Fredrik Kahl, Lennart Svensson Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26258v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

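Scoring rationale: ARTA targets efficient dense feature extraction in computer vision, proposing a mixed-resolution vision transformer that optimizes compute through adaptive token allocation. The scoring keywords all relate to large language models (LLMs), whereas this paper studies a vision transformer (ViT) for image segmentation and involves no language models, large-model principles, or AI for Science applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

ARTA is a vision transformer with adaptive mixed-resolution token allocation. Using a coarse-to-fine allocation strategy, it substantially reduces computation while achieving state-of-the-art results on ADE20K and COCO-Stuff and competitive performance on Cityscapes.

Abstract (Chinese translation)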

我们提出ARTA，一种用于高效密集特征提取的混合分辨率由粗到精视觉变换器。与从密集高分辨率（精细）标记开始的模型不同，ARTA从低分辨率（粗糙）标记开始，并使用轻量级分配器来预测哪些区域需要更多精细标记。该分配器迭代地预测语义（类别）边界分数，并将额外标记分配给高于低阈值的图像块，从而将标记密度集中在边界附近，同时保持对微弱边界证据的高敏感性。这种定向分配促使标记表示单一语义类别，而非混合类别。混合分辨率注意力实现了粗糙与精细标记之间的交互，将计算聚焦于语义复杂的区域，同时避免在均匀区域进行冗余处理。实验表明，ARTA在ADE20K和COCO-Stuff数据集上以显著更少的浮点运算次数（FLOPs）取得了最先进的结果，并在Cityscapes数据集上以明显更低的计算成本实现了有竞争力的性能。例如，ARTA-Base在约1亿参数级别中，于ADE20K数据集上达到54.6% mIoU，同时比同类骨干网络使用更少的FLOPs和内存。

Abstract

We present ARTA, a mixed-resolution coarse-to-fine vision transformer for efficient dense feature extraction. Unlike models that begin with dense high-resolution (fine) tokens, ARTA starts with low-resolution (coarse) tokens and uses a lightweight allocator to predict which regions require more fine tokens. The allocator iteratively predicts a semantic (class) boundary score and allocates additional tokens to patches above a low threshold, concentrating token density near boundaries while maintaining high sensitivity to weak boundary evidence. This targeted allocation encourages tokens to represent a single semantic class rather than a mixture of classes. Mixed-resolution attention enables interaction between coarse and fine tokens, focusing computation on semantically complex areas while avoiding redundant processing in homogeneous regions. Experiments demonstrate that ARTA achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs, and delivers competitive performance on Cityscapes at markedly lower compute. For example, ARTA-Base attains 54.6 mIoU on ADE20K in the ~100M-parameter class while using fewer FLOPs and less memory than comparable backbones.

Keywords: vision transformer, mixed-resolution, token allocation, dense feature extraction, semantic segmentation, computational efficiency, coarse-to-fine, adaptive attention
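
The coarse-to-fine allocation described in the abstract above can be illustrated with a single allocation step: start from a coarse grid, and split only the patches whose boundary score exceeds a low threshold. In ARTA the scores come from a learned allocator and the step is applied iteratively. Here the score grid, the threshold of 0.2, and the 2x2 split factor are hypothetical values chosen for illustration.

```python
import numpy as np

def allocate_tokens(boundary_score, threshold=0.2, split=2):
    """Return tokens as (row, col, size) in fine-grid units.

    Coarse patches scoring above the threshold are replaced by
    split x split fine tokens; the rest stay as one coarse token.
    """
    tokens = []
    rows, cols = boundary_score.shape
    for r in range(rows):
        for c in range(cols):
            if boundary_score[r, c] > threshold:
                for dr in range(split):
                    for dc in range(split):
                        tokens.append((r * split + dr, c * split + dc, 1))
            else:
                tokens.append((r * split, c * split, split))
    return tokens

# Hypothetical boundary scores: a vertical class boundary runs down the middle column.
scores = np.array([[0.05, 0.90, 0.10],
                   [0.02, 0.60, 0.01],
                   [0.00, 0.30, 0.03]])
tokens = allocate_tokens(scores)
dense_count = scores.size * 4             # every patch at fine resolution
print(len(tokens), "tokens instead of", dense_count)
```

Only the three patches along the boundary are refined, so token density concentrates where classes meet while homogeneous regions keep a single coarse token each.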

58. ❌ Channelling, Coordinating, Collaborating: A Three-Layer Framework for Disability-Centered Human-Agent Collaboration

Authors: Lan Xiao, Catherine Holloway Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26252v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文主要研究AI在残障人士协作中的框架设计，属于人机协作和AI辅助工具领域。论文未涉及大模型技术原理、训练方法、推理优化、科学应用等具体技术细节。仅与’LLM Agents/Autonomous Agents/Agentic Workflow’和’Multi-agent Systems/Agent Coordination’有中等关联（5分），因为论文讨论AI作为协作伙伴和协调者，但未明确使用LLM或具体代理技术。其他关键词均无直接关联（0分）。

!!! tip deepseek-chat TL;DR

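The paper proposes a three-layer framework (Channelling, Coordinating, Co-Creating) to rethink AI's role in ability-diverse collaboration, aiming to establish shared informational ground across abilities, mediate workflows between collaborators with different abilities, and contribute as a bounded partner toward shared goals.

Abstract translation (Chinese)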

人工智能无障碍工具大多为个体使用而设计，旨在帮助个人克服特定功能障碍。然而对许多残障人士而言，复杂任务的完成往往依赖于与具备互补能力者的协作，而非独立完成。我们提出一个包含“信息引导”“流程协调”“共同创造”的三层框架，重新思考人工智能在能力多样化协作中的角色：建立跨能力共享的信息基础，协调不同能力协作者间的工作流程，并作为有明确边界的合作伙伴推动共同目标的实现。该框架以“能力多样化协作”理论、共同基础理论及Carlile的3T框架为理论基础，通过聚焦残障人士既有的协作性与相互依存的工作模式，拓展了“智能体作为远程协作者”的研究范式。

Abstract

AI accessibility tools have mostly been designed for individual use, helping one person overcome a specific functional barrier. But for many people with disabilities, complex tasks are accomplished through collaboration with others who bring complementary abilities, not solitary effort. We propose a three-layer framework, Channelling, Coordinating, and Co-Creating, that rethinks AI’s role in ability-diverse collaboration: establishing shared informational ground across abilities, mediating workflows between collaborators with different abilities, and contributing as a bounded partner toward shared goals. Grounded in the Ability-Diverse Collaboration framework, grounding theory, and Carlile’s 3T framework, it extends the “agents as remote collaborators” vision by centring the collaborative, interdependent ways people with disabilities already work.

Keywords: AI accessibility, disability-centered collaboration, human-agent collaboration, ability-diverse collaboration, collaborative framework, AI as partner, workflow mediation, shared informational ground

59. ❌ Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan

Authors: Chihiro Taguchi, Yukinori Takubo, David Chiang Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26248v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

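Scoring rationale: The paper focuses on developing an automatic speech recognition (ASR) system for the endangered language Ikema Miyakoan, which places it in speech processing and language documentation. Although it involves AI technology (ASR), its content has no direct connection to the topics in the scoring keyword list, such as large models, the technical principles of deep learning, or AI for Science. The keyword list mainly targets large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, compression, and so on), whereas the ASR model here is a conventional speech recognition model, not a large language model, and the paper does not touch on scientific AI applications such as bioinformatics or cheminformatics. The relevance for every keyword is therefore 0.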

!!! tip deepseek-chat TL;DR

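This study develops an automatic speech recognition system, based on field recordings, for Ikema Miyakoan, an endangered language of Okinawa, Japan. Experiments show that the system can substantially reduce transcription time and cognitive load, offering a practical path toward technology-supported documentation of endangered languages.

Abstract translation (Chinese)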

语言濒危对全球语言多样性构成重大挑战，而技术进步为语言记录与复兴开辟了新途径。其中，自动语音识别技术（ASR）在协助濒危语言数据转写方面展现出日益增长的应用潜力。本研究聚焦于濒危程度极高的琉球语系语言池间语（Ikema），该语言在日本冲绳地区仅存约1300名使用者，且其中多数年龄超过60岁。我们基于田野录音资料，开展了一项针对池间语的ASR系统开发工作。具体而言，我们（1）从田野录音中构建了{\totaldatasethours}小时的语音语料库；（2）训练出字符错误率低至15%的ASR模型；（3）评估了ASR辅助对语音转写效率的影响。研究结果表明，ASR技术的整合能显著降低转写时间与认知负荷，为规模化、技术支持的濒危语言记录工作提供了可行路径。

Abstract

Language endangerment poses a major challenge to linguistic diversity worldwide, and technological advances have opened new avenues for documentation and revitalization. Among these, automatic speech recognition (ASR) has shown increasing potential to assist in the transcription of endangered language data. This study focuses on Ikema, a severely endangered Ryukyuan language spoken in Okinawa, Japan, with approximately 1,300 remaining speakers, most of whom are over 60 years old. We present an ongoing effort to develop an ASR system for Ikema based on field recordings. Specifically, we (1) construct a {\totaldatasethours}-hour speech corpus from field recordings, (2) train an ASR model that achieves a character error rate as low as 15%, and (3) evaluate the impact of ASR assistance on the efficiency of speech transcription. Our results demonstrate that ASR integration can substantially reduce transcription time and cognitive load, offering a practical pathway toward scalable, technology-supported documentation of endangered languages.

Keywords: Automatic Speech Recognition, Endangered Languages, Ikema Miyakoan, Speech Corpus, Character Error Rate, Transcription Efficiency, Language Documentation

60. ❌ Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

Authors: Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux, Hasindri Watawana, Dairazalia Sanchez-Cortes, Srikanth Madikeri, Petr Motlicek, Andreas Stolcke Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26246v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

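Scoring rationale: The paper's core question is how LLM-based ASR systems can use multimodal conversational context to improve recognition, and it proposes Abstract Compression to address the excessive length of raw audio context. It is therefore highly relevant to “Large Language Models” (10 points), since it applies LLMs to speech recognition. It has some relevance to “Post-training/SFT” (5 points) because it mentions supervised multi-turn training, and some relevance to “Context Window Extension” (5 points) because it studies how to handle long conversational context efficiently. Other keywords such as MoE, SLMs, Scaling Laws, RLHF, and RAG are not covered and receive 0 points.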

!!! tip deepseek-chat TL;DR

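The paper studies how compressing conversational audio context can improve LLM-based automatic speech recognition. The proposed Abstract Compression method reduces computational overhead while recovering part of the performance gain obtained from raw context.

Abstract translation (Chinese)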

基于标准大语言模型的语音识别系统通常孤立处理话语，限制了其利用对话上下文的能力。本研究探讨了来自先前话轮的多模态上下文是否能改进基于大语言模型的自动语音识别，以及如何高效表征该上下文。研究发现，经过有监督的多轮训练后，对话上下文主要有助于提升对语境实体的识别能力。然而，直接以原始上下文为条件进行处理的成本高昂，因为先前话轮的音频标记序列会随对话长度快速增长。为解决此问题，我们提出抽象压缩方法，该方法将先前话轮的音频部分替换为固定数量的学习潜在标记，同时显式保留对应的文本转录。在领域内和领域外测试集上，压缩模型以更小的先前话轮音频占用空间，部分恢复了原始上下文条件处理所获得的性能提升。我们还对压缩设置及其权衡进行了针对性分析。

Abstract

Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.

Keywords: LLM-based ASR, conversational context, abstract compression, multimodal context, speech recognition, contextual entities, audio token compression, prior-turn conditioning

61. ❌ Physics-Informed Neural Networks and Sequence Encoder: Application to heating and early cooling of thermo-stamping process

Authors: Mouad Elaarabi, Domenico Borzacchiello, Philippe Le Bot, Nathan Lauzeral, Sebastien Comas-Cardona Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26245v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

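Scoring rationale: The paper studies the application of physics-informed neural networks (PINNs) combined with a sequence encoder to the thermo-stamping process. It is an application of AI in science, but it has no direct connection to any of the keywords concerning large models or the technical principles of deep learning. It has some relevance only to “AI for Science OR Bioinformatics OR Cheminformatics”, because it applies AI to materials science and engineering; since this is neither bioinformatics nor cheminformatics, it is given 5 points. No other keyword is covered, so all others receive 0 points.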

!!! tip deepseek-chat TL;DR

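The paper studies the combination of physics-informed neural networks with a sequence encoder, applied to the heating and early cooling stages of the thermo-stamping process. It demonstrates that the approach is feasible in a realistic industrial scenario and explores how multimodal inputs and training on synthetic data improve the model's generalization.

Abstract translation (Chinese)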

在先前的研究中（Elaarabi等人，2025b），我们介绍了用于在线动态系统识别的序列编码器（Sequence Encoder，简称SE）（Elaarabi等人，2025a）及其与物理信息神经网络（Physics-Informed Neural Network，简称PINN）的结合方法（PINN-SE），并在合成数据与真实数据案例中进行了测试。该序列编码器能够有效地将时间序列编码为特征向量，随后PINN利用这些特征向量映射出动态行为，从而预测系统在参数、初始条件（ICs）和边界条件（BCs）变化下的响应。此前的工作（Elaarabi等人，2025b）中，针对真实数据的测试仅限于简单的一维问题，且序列编码器的输入也仅为一维时间序列。本研究探讨了将PINN-SE应用于一个更贴近实际的场景的可能性：热冲压工艺中的加热与早期冷却阶段，这是连续纤维增强热塑性聚合物复合材料成型过程中的关键环节。同时，本研究也探索了将PINN-SE的输入扩展至多模态数据（例如时间序列的二维图像）以及涉及可变几何形状场景的可能性。结果表明，将多个编码器与先前提出的方法（Elaarabi等人，2025b）相结合是可行的；我们还证明，基于实验数据生成的合成数据对模型进行训练，有助于模型在训练阶段未接触过的真实实验数据上实现良好的泛化性能。

Abstract

In a previous work (Elaarabi et al., 2025b), the Sequence Encoder for online dynamical system identification (Elaarabi et al., 2025a) and its combination with PINN (PINN-SE) were introduced and tested on both synthetic and real data case scenarios. The sequence encoder is able to effectively encode time series into feature vectors, which the PINN then uses to map to dynamical behavior, predicting system response under changes in parameters, ICs and BCs. Previously (Elaarabi et al., 2025b), the tests on real data were limited to simple 1D problems and only 1D time series inputs of the Sequence Encoder. In this work, the possibility of applying PINN-SE to a more realistic case is investigated: heating and early cooling of the thermo-stamping process, which is a critical stage in the forming process of continuous fiber reinforced composite materials with thermoplastic polymer. The possibility of extending the PINN-SE inputs to multimodal data, such as sequences of temporal 2D images and to scenarios involving variable geometries, is also explored. The results show that combining multiple encoders with the previously proposed method (Elaarabi et al., 2025b) is feasible, we also show that training the model on synthetic data generated based on experimental data can help the model to generalize well for real experimental data, unseen during the training phase.

Keywords: Physics-Informed Neural Networks, Sequence Encoder, thermo-stamping process, composite materials, multimodal data, synthetic data training, dynamical system identification, generalization

62. ❌ Automating Domain-Driven Design: Experience with a Prompting Framework

Authors: Tobias Eisenreich, Husein Jusic, Stefan Wagner Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26244v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

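Scoring rationale: The paper explicitly uses LLMs to automate domain-driven design (DDD), so it is highly relevant to “Large Language Models OR LLMs OR Foundation Models” (10 points). It does not cover the specific technical details behind the other keywords, such as MoE, SLMs, training methods, inference optimization, or agent systems, nor does it involve AI applications in science, so all other keywords receive 0 points.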

!!! tip deepseek-chat TL;DR

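The paper proposes a prompting framework based on large language models (LLMs) for automating the core activities of domain-driven design (DDD). It finds that LLMs can effectively help generate documentation such as glossaries and context maps, but cannot fully replace human expert decision-making, and that errors accumulate and undermine the usefulness of later steps.

Abstract translation (Chinese)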

领域驱动设计（DDD）是一种用于构建复杂软件系统的强大设计技术。本文介绍了一种提示框架，该框架通过结构化的大语言模型（LLM）交互自动化核心DDD活动。我们将DDD分解为五个顺序步骤：（1）建立统一语言，（2）模拟事件风暴，（3）识别限界上下文，（4）设计聚合体，以及（5）映射至技术架构。在一项案例研究中，我们基于FTAPI企业平台的实际需求验证了该提示框架。虽然前几个步骤持续生成有价值且可用的工件，但后续步骤表明，微小的错误或不准确之处可能传播并累积。总体而言，该框架作为构建可执行文档（如术语表和上下文映射）的协作陪练伙伴表现出色，但无法实现全自动化。这使得专家能够将讨论集中在关键权衡上。在我们的评估中，步骤1至3运行良好，但累积的错误使得步骤4和5生成的工件不具实用性。我们的研究结果表明，大语言模型可以增强而非取代架构专业知识，它提供了一种实用工具，在保留以人为中心的决策的同时，减少了DDD的工作量和开销。

Abstract

Domain-driven design (DDD) is a powerful design technique for architecting complex software systems. This paper introduces a prompting framework that automates core DDD activities through structured large language model (LLM) interactions. We decompose DDD into five sequential steps: (1) establishing an ubiquitous language, (2) simulating event storming, (3) identifying bounded contexts, (4) designing aggregates, and (5) mapping to technical architecture. In a case study, we validated the prompting framework against real-world requirements from FTAPI’s enterprise platform. While the first steps consistently generate valuable and usable artifacts, later steps show how minor errors or inaccuracies can propagate and accumulate. Overall, the framework excels as a collaborative sparring partner for building actionable documentation, such as glossaries and context maps, but not for full automation. This allows the experts to concentrate their discussion on the critical trade-offs. In our evaluation, Steps 1 to 3 worked well, but the accumulated errors rendered the artifacts generated from Steps 4 and 5 impractical. Our findings show that LLMs can enhance, but not replace, architectural expertise, offering a practical tool to reduce the effort and overhead of DDD while preserving human-centric decision-making.

Keywords: Domain-driven design, Large language models, Prompting framework, Software architecture, Automation, Collaborative tool, Error propagation, Human-centric decision-making

63. ❌ Clawed and Dangerous: Can We Trust Open Agentic Systems?

Authors: Shiping Chen, Qin Wang, Guangsheng Yu, Xu Wang, Liming Zhu Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26221v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

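Scoring rationale: The paper's core subject is the security challenges of open agentic systems, which combine LLM planning, external capabilities, persistent memory, and privileged execution. It is highly relevant to “Large Language Models” (10 points), because the LLM is the core planning component of agentic systems; highly relevant to “LLM Agents” (10 points), because the paper systematizes research on agentic systems; highly relevant to “Tool Use” (10 points), because agentic systems depend on external tool capabilities; and somewhat relevant to “Multi-agent Systems” (5 points), because the paper touches on agent coordination and governance but does not go deep into multi-agent architectures. Other keywords such as MoE, SLMs, training techniques, inference optimization, and scientific AI applications are entirely unrelated to the paper (0 points).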

!!! tip deepseek-chat TL;DR

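The paper systematically studies the security challenges of open agentic systems. It proposes a six-dimensional analytical taxonomy, synthesizes 50 related papers, and establishes a reference doctrine and an evaluation scorecard for building secure agent platforms. It finds that current research is relatively mature in attack characterization and benchmark construction but clearly weak in deployment controls, operational governance, and related areas.

Abstract translation (Chinese)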

开放式智能体系统将基于大语言模型的规划与外部能力、持久化记忆及特权执行相结合，广泛应用于代码助手、浏览器副驾驶和企业自动化等领域。OpenClaw 是此类广义系统的一个典型实例。
目前尚未得到充分关注的是，其安全挑战与传统依赖可预测执行和明确定义控制流的软件存在根本差异。在开放式智能体系统中，一切皆具“概率性”：计划在运行时动态生成，关键决策可能受不可信的自然语言输入和工具输出影响，执行在不确定环境中展开，且行动基于人类用户委托的权限实施。因此，核心挑战不仅在于抵御个别攻击的鲁棒性，更在于持续不确定性下智能体行为的治理问题。
本文通过软件工程视角系统化梳理该领域。我们提出一个六维分析分类框架，综合分析了涵盖攻击、基准测试、防御、审计及相邻工程基础的 50 篇文献。基于此综述，我们推导出构建安全智能体平台的参考原则，并设计了一套用于评估平台安全态势的评分卡。研究表明，现有文献在攻击特征描述和基准构建方面相对成熟，但在部署控制、运行治理、持久化内存完整性及能力撤销等方面仍显薄弱。这些差距为构建可治理、可审计、具备受损恢复能力的智能体生态系统指明了具体的工程议程。

Abstract

Open agentic systems combine LLM-based planning with external capabilities, persistent memory, and privileged execution. They are used in coding assistants, browser copilots, and enterprise automation. OpenClaw is a visible instance of this broader class. Without much attention yet, their security challenge is fundamentally different from that of traditional software that relies on predictable execution and well-defined control flow. In open agentic systems, everything is “probabilistic”: plans are generated at runtime, key decisions may be shaped by untrusted natural-language inputs and tool outputs, execution unfolds in uncertain environments, and actions are taken under authority delegated by human users. The central challenge is therefore not merely robustness against individual attacks, but the governance of agentic behavior under persistent uncertainty. This paper systematizes the area through a software engineering lens. We introduce a six-dimensional analytical taxonomy and synthesize 50 papers spanning attacks, benchmarks, defenses, audits, and adjacent engineering foundations. From this synthesis, we derive a reference doctrine for secure-by-construction agent platforms, together with an evaluation scorecard for assessing platform security posture. Our review shows that the literature is relatively mature in attack characterization and benchmark construction, but remains weak in deployment controls, operational governance, persistent-memory integrity, and capability revocation. These gaps define a concrete engineering agenda for building agent ecosystems that are governable, auditable, and resilient under compromise.

Keywords: open agentic systems, LLM-based planning, security challenges, software engineering, analytical taxonomy, secure-by-construction, governance, persistent uncertainty

64. ❌ Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding

Authors: Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, Kunwar Yashraj Singh Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26211v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

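Scoring rationale: The paper studies discrete diffusion vision-language models (DVLMs) applied to GUI grounding, that is, a large model (LLaDA-V) applied to a specific domain (GUI interaction). It is relevant to “Large Language Models OR LLMs OR Foundation Models” (8 points) because it uses the LLaDA-V model; highly relevant to “LLM Agents OR Autonomous Agents OR Agentic Workflow” (10 points) because its stated goal is to develop GUI agents; and somewhat relevant to “Pre-training OR Continual Pre-training OR Domain Adaptation” (5 points) and “Post-training OR Supervised Fine-tuning OR SFT” (5 points) because it involves model adaptation and training. Other keywords such as MoE, SLMs, Scaling Laws, Alignment, and RAG are unrelated to the paper and receive 0 points.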

!!! tip deepseek-chat TL;DR

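The paper studies whether discrete diffusion vision-language models (DVLMs) are a viable alternative to autoregressive models for GUI grounding. By proposing a hybrid masking schedule and improving the training data, it achieves competitive performance on multiple datasets and shows the potential of diffusion models for building GUI agents.

Abstract translation (Chinese)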

自回归（AR）视觉语言模型（VLMs）长期以来在多模态理解、推理及图形用户界面（GUI）定位任务中占据主导地位。近期，离散扩散视觉语言模型（DVLMs）在多模态推理中展现出强大性能，其具备双向注意力机制、并行令牌生成和迭代优化能力。然而，其在GUI定位方面的潜力尚未得到探索。本研究旨在评估离散DVLMs能否作为自回归模型在GUI定位任务中的可行替代方案。我们将LLaDA-V模型适配于单轮动作与边界框预测任务，将其构建为基于多模态输入的文本生成问题。为更好地捕捉边界框几何结构的层次化特征，我们提出了一种结合线性掩码与确定性掩码的混合掩码调度策略，相比采用线性掩码训练的GUI适配版LLaDA-V，该策略将定位准确率（以步骤成功率SSR衡量）最高提升了6.1个百分点。在涵盖网页、桌面及移动端界面的四个数据集上的评估表明，采用混合掩码的适配扩散模型始终优于线性掩码变体，并且尽管预训练数据有限，其性能仍可与自回归模型竞争。系统化消融实验显示，增加扩散步数、生成长度及块长度虽能提升准确率，但也会增加延迟，且当扩散步数超过一定阈值后准确率趋于饱和。通过扩展多样化GUI领域的训练数据，可进一步将延迟降低约1.3秒，并在各基准测试中平均提升定位准确率约20个百分点。这些结果表明，离散DVLMs是GUI定位任务中一种极具前景的建模框架，代表了向基于扩散的GUI智能体发展的重要一步。

Abstract

Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over the GUI-adapted LLaDA-V trained with linear masking. Evaluations on four datasets spanning web, desktop, and mobile interfaces show that the adapted diffusion model with hybrid masking consistently outperforms the linear-masked variant and performs competitively with autoregressive counterparts despite limited pretraining. Systematic ablations reveal that increasing diffusion steps, generation length, and block length improves accuracy but also increases latency, with accuracy plateauing beyond a certain number of diffusion steps. Expanding the training data with diverse GUI domains further reduces latency by about 1.3 seconds and improves grounding accuracy by an average of 20 points across benchmarks. These results demonstrate that discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents.

Keywords: GUI grounding, discrete diffusion vision-language models, LLaDA-V, hybrid masking schedule, GUI agents, multimodal understanding, autoregressive models, bounding-box prediction

65. ❌ Sparse Auto-Encoders and Holism about Large Language Models

Authors: Jumbly Grindrod Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26207v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

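Scoring rationale: The paper's core is a philosophical discussion of semantic representation in LLMs (holism versus decomposition). It directly involves “Large Language Models” (the whole paper discusses LLM technology) and “Mechanistic Interpretability” (it analyzes latent features found through sparse auto-encoders), so both are highly relevant (10 points). Other keywords, such as MoE, training methods, reasoning techniques, and application domains, are not covered and receive 0 points.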

!!! tip deepseek-chat TL;DR

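The paper examines whether large language models (LLMs) embody semantic holism and, by analyzing the latent features discovered through sparse auto-encoders, argues that the holistic view still stands even when interpretable features exist.

Abstract translation (Chinese)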

大语言模型（LLM）技术是否暗示了一种元语义图景，即关于词语和复杂表达式如何获得其意义的图景？一种审慎的研究路径是，通过探究大语言模型在捕捉语言表达式意义时似乎内置的假设，来考量这些假设的合理性（Grindrod, 2026a, 2026b）。先前有观点认为，大语言模型因采用一种分布语义学形式，故而采纳了关于意义的一种整体论形式（Grindrod, 2023; Grindrod et al., forthcoming）。然而，近期在机械可解释性方面的研究对这些论点提出了挑战。具体而言，在大语言模型使用的高维空间中发现的大量可解释潜在特征，可能对整体论解释构成挑战。在本文中，我将首先阐述认为大语言模型体现了一种整体论的原有理由（第1节），继而介绍近期关于通过稀疏自编码器生成的特征的研究，并解释此类特征的发现如何提示了一种替代性的、分解式的意义图景（第2节）。随后，我将通过更细致地考察此类特征的本质来回应这一挑战（第3节）。最后，我将回归到Grindrod等人所辩护的整体论图景，并论证只要这些特征是可数的，该图景依然成立（第4节）。

Abstract

Does Large Language Model (LLM) technology suggest a meta-semantic picture i.e. a picture of how words and complex expressions come to have the meaning that they do? One modest approach explores the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions as a way of considering their plausibility (Grindrod, 2026a, 2026b). It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). However, recent work in mechanistic interpretability presents a challenge to these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high dimensional spaces used by LLMs potentially challenges the holistic interpretation. In this paper, I will present the original reasons for thinking that LLMs embody a form of holism (section 1), before introducing recent work on features generated through sparse auto-encoders, and explaining how the discovery of such features suggests an alternative decompositional picture of meaning (section 2). I will then respond to this challenge by considering in greater detail the nature of such features (section 3). Finally, I will return to the holistic picture defended by Grindrod et al. and argue that the picture still stands provided that the features are countable (section 4).

Keywords: Large Language Models, LLMs, holism, meaning, sparse auto-encoders, mechanistic interpretability, latent features, semantic representation

66. ❌ An Object Web Seminar: A Retrospective on a Technical Dialogue Still Reverberating

Authors: James J. Cusick Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26203v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

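Scoring rationale: The paper looks back at a 1999 seminar on Object Web technology, mainly discussing the historical evolution of distributed architectures, software development tools, and early Web technology. Although it closes with an analogy between early AI tools and the current wave of AI technology, its core content is unrelated to deep learning, the technical principles of large models, or innovative applications of AI in science, so none of the keywords bear on the substance of the paper.

!!! tip deepseek-chat TL;DR

This paper looks back at a 1999 seminar on Object Web technology, traces how distributed architectures and software development tools evolved in the early Web era, and briefly compares the early AI tools of that time with the current wave of AI technology.

Abstract translation (Chinese)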

技术变革日新月异，新趋势往往迅速取代昨日的新焦点。本文通过1999年一场研讨会的内容，探讨了对象技术与早期网络应用融合的巅峰时期。彼时分布式架构正经历重大变革，更深入的软件能力刚开始通过互联网广泛普及。对象网络应运而生，并融入了反映这些能力的新开发工具，使得万维网早期应用的设计与部署成为可能。该会议探讨了这些工具与架构的历史沿革、实际应用及未来可能性。尽管“对象网络”这一术语已逐渐淡出，但其核心设计理念仍以不同形式持续主导着技术发展——例如当前主流的Kubernetes与微服务等方案，依然延续着对象网络的核心设计特征。本文不仅将这场研讨会与当今软件领域的关联性进行梳理，还提及了四分之一世纪前该研讨会展示的早期人工智能工具，并探讨了特定技术的流行浪潮如何可能影响当前对人工智能技术发展的关注焦点。

Abstract

Technology change happens quickly such that new trends tend to crowd out the focus on what was new just yesterday. In this paper the peak popularity of the confluence of Object Technologies with early Web adoption is explored through the content of a seminar held in 1999. Distributed architectures were undergoing significant change at this point, and deeper software capabilities were just beginning to be broadly accessible over the Internet. The Object Web arose and was infused with new development tools reflecting these capabilities and allowing design of applications for deployment during the early days of the World Wide Web. This conference discussed the history, evolution, and use of these tools, architectures, and their future possibilities. The continued dominance of these approaches although under different names is demonstrated even though the term Object Web has receded in use. Favored newer offerings such as Kubernetes and microservices still model the core design attributes of the Object Web for example. Aside from connecting this seminar to relevance in the software world of today this paper also touches on the early AI tools demonstrated in this seminar a quarter century ago and how the popularity wave of any given technology might affect the current focus on AI technology offerings.

Keywords: Object Web, distributed architectures, software development tools, World Wide Web, technology evolution, early AI tools, Kubernetes, microservices

67. ❌ MemCam: Memory-Augmented Camera Control for Consistent Video Generation

Authors: Xinhang Gao, Junlin Guan, Shuhan Luo, Wenzhuo Li, Guanghuan Tan, Jiacheng Wang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26193v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

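Scoring rationale: The paper studies camera control and scene consistency in video generation, which belongs to computer vision and generative modeling, and is unrelated to all of the scoring keywords (which mainly target large language model techniques).

!!! tip deepseek-chat TL;DR

This paper proposes MemCam, which uses memory augmentation and context compression to address scene consistency in long video generation under dynamic camera control; experiments show it significantly outperforms existing methods on scene consistency.

Abstract translation (Chinese)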

交互式视频生成在场景模拟与视频创作领域具有重要潜力。然而，现有方法在动态相机控制下的长视频生成过程中，常因上下文信息有限而难以维持场景一致性。为应对这一挑战，我们提出MemCam——一种基于记忆增强的交互式视频生成方法，该方法将先前生成的帧视为外部记忆，并利用其作为上下文条件，以实现高场景一致性的可控相机视点生成。为构建更长且更相关的上下文，我们设计了一个上下文压缩模块，该模块将记忆帧编码为紧凑表示，并采用基于共可见性的动态检索机制来选取最相关的历史帧，从而在丰富上下文信息的同时降低计算开销。在交互式视频生成任务上的实验表明，MemCam在场景一致性方面显著优于现有基线方法及开源前沿模型，尤其在相机大范围旋转的长视频场景中表现突出。

Abstract

Interactive video generation has significant potential for scene simulation and video creation. However, existing methods often struggle with maintaining scene consistency during long video generation under dynamic camera control due to limited contextual information. To address this challenge, we propose MemCam, a memory-augmented interactive video generation approach that treats previously generated frames as external memory and leverages them as contextual conditioning to achieve controllable camera viewpoints with high scene consistency. To enable longer and more relevant context, we design a context compression module that encodes memory frames into compact representations and employs co-visibility-based selection to dynamically retrieve the most relevant historical frames, thereby reducing computational overhead while enriching contextual information. Experiments on interactive video generation tasks show that MemCam significantly outperforms existing baseline methods as well as open-source state-of-the-art approaches in terms of scene consistency, particularly in long video scenarios with large camera rotations.

Keywords: video generation, camera control, scene consistency, memory-augmented, context compression, interactive video, long video generation, co-visibility selection
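
The retrieval step described in the abstract can be pictured with a small sketch. The abstract does not define how co-visibility is measured, so the following assumes each frame is summarized by the set of scene-point IDs it sees, and scores a memory frame by the fraction of the target view's points that it also sees. The names `covisibility` and `retrieve_frames` and the point-ID representation are hypothetical and are not taken from the paper.

```python
def covisibility(target_points, memory_points):
    """Fraction of the target view's points that a memory frame also sees.

    Assumption: a view is represented as a set of scene-point IDs.
    """
    target = set(target_points)
    if not target:
        return 0.0
    return len(target & set(memory_points)) / len(target)


def retrieve_frames(target_points, memory, k):
    """Return the IDs of the k memory frames with the highest co-visibility."""
    ranked = sorted(
        memory.items(),
        key=lambda item: covisibility(target_points, item[1]),
        reverse=True,
    )
    return [frame_id for frame_id, _ in ranked[:k]]


# Hypothetical memory bank: frame ID -> visible scene-point IDs.
memory = {0: {1, 2, 3}, 1: {3, 4, 5, 6}, 2: {7, 8}}
selected = retrieve_frames({3, 4, 5}, memory, k=2)
print(selected)
```

Keeping only the top-k frames is what bounds the conditioning cost as the video grows; the real method additionally compresses the selected frames, which this sketch does not attempt.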

68. ❌ Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI

Authors: Jing Zhang, Bastien Bergere, Emilie Bollache, Jonas Leite, Mikaël Laredo, Alban Redheuil, Nadjia Kachenoura Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26186v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

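Scoring rationale: The paper focuses on cardiac MRI image segmentation with deep learning (SwinUNETR), which belongs to medical image analysis. It is unrelated to the vast majority of keywords (such as LLMs, MoE, alignment, reasoning, and agents), which mainly target large language models and related techniques. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to biomedicine (specifically cardiac image analysis), which falls under "AI for Science", but it is not a core large-model technique, hence a relevance of 8 (somewhat related, but not core).

!!! tip deepseek-chat TL;DR

This study proposes a progressive learning strategy that incorporates anatomical priors to reliably segment left atrial scar from cardiac MRI; experiments show it improves segmentation accuracy and reliability.

Abstract translation (Chinese)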

心脏磁共振钆延迟增强（LGE）成像能够无创识别左心房（LA）瘢痕，其空间分布与心房颤动（AF）的严重程度及复发风险密切相关。然而，由于图像对比度低、标注存在差异以及缺乏解剖学约束，自动化的左心房瘢痕分割仍面临挑战，常导致预测结果不可靠。为此，我们旨在借鉴临床工作流程，提出一种渐进式学习策略以从LGE图像中分割左心房瘢痕。我们实现了一个基于SwinUNETR的三阶段框架，包括：1）左心房腔预学习模型；2）双任务模型，进一步学习左心房几何结构与瘢痕模式之间的空间关系；3）针对瘢痕的精细分割进行微调。此外，我们引入了一种解剖学感知的空间加权损失函数，通过将瘢痕预测约束在解剖学上合理的左心房壁区域，同时减轻标注偏差，从而融入了先验临床知识。在LASCARQS公开数据集上经过五折交叉验证后，我们对验证集LGE体积数据获得的初步结果显示：左心房分割的Dice系数为0.94；左心房瘢痕分割的Dice系数为0.50，豪斯多夫距离为11.84毫米，平均表面距离为1.80毫米，优于单阶段瘢痕分割方法（相应指标分别为0.49、13.02毫米、1.96毫米）。通过将临床解剖学先验知识与诊断推理显式嵌入深度学习，所提出的方法提高了LGE图像中左心房瘢痕分割的准确性与可靠性，揭示了结合临床知识进行模型设计的重要性。

Abstract

Cardiac MRI late gadolinium enhancement (LGE) enables non-invasive identification of left atrial (LA) scar, whose spatial distribution is strongly associated with atrial fibrillation (AF) severity and recurrence. However, automatic LA scar segmentation remains challenging due to low contrast, annotation variability, and the lack of anatomical constraints, often leading to non-reliable predictions. Accordingly, our aim was to propose a progressive learning strategy to segment LA scar from LGE images inspired from a clinical workflow. A 3-stage framework based on SwinUNETR was implemented, comprising: 1) a first LA cavity pre-learning model, 2) dual-task model which further learns spatial relationship between LA geometry and scar patterns, and 3) fine-tuning on precise segmentation of the scar. Furthermore, we introduced an anatomy-aware spatially weighted loss that incorporates prior clinical knowledge by constraining scar predictions to anatomically plausible LA wall regions while mitigating annotation bias. Our preliminary results obtained on validation LGE volumes from LASCARQS public dataset after 5-fold cross validation, LA segmentation had Dice score of 0.94, LA scar segmentation achieved Dice score of 0.50, Hausdorff Distance of 11.84 mm, Average Surface Distance of 1.80 mm, outperforming only a one-stage scar segmentation with 0.49, 13.02 mm, 1.96 mm, respectively. By explicitly embedding clinical anatomical priors and diagnostic reasoning into deep learning, the proposed approach improved the accuracy and reliability of LA scar segmentation from LGE, revealing the importance of clinically informed model design.

Keywords: Left atrial scar segmentation, Late gadolinium enhancement MRI, Progressive learning, Anatomical priors, SwinUNETR, Cardiac MRI, Deep learning, Medical image analysis
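
The headline numbers in the abstract above (0.94 for the LA cavity, 0.50 versus 0.49 for scar) are Dice scores. As a reminder of what that metric measures, here is a minimal sketch using the standard definition on flattened binary masks. It is not the authors' evaluation code; the function name and the toy masks are illustrative assumptions.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy flattened masks (1 = scar voxel); values are made up for illustration.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(dice_score(pred, target), 4))
```

Because Dice is normalized by the combined mask size, small structures such as thin atrial-wall scar are penalized heavily for each misplaced voxel, which is one reason scar Dice sits far below cavity Dice.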

69. ❌ On the Complexity of Optimal Graph Rewiring for Oversmoothing and Oversquashing in Graph Neural Networks

Authors: Mostafa Haghir Chehreghani Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26140v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

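Scoring rationale: The paper studies oversmoothing and oversquashing in graph neural networks (GNNs) and the computational complexity of optimizing graph structure through rewiring. The keywords all concern large models, the technical principles of deep learning, or scientific applications, whereas the paper is a theoretical complexity analysis of GNNs and does not involve large language models (LLMs), the listed deep learning techniques (such as MoE, quantization, or inference acceleration), or scientific applications (such as bioinformatics). The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

This paper studies the computational complexity of optimizing graph structure to mitigate oversmoothing and oversquashing in graph neural networks, and proves that exact optimization based on spectral gap and conductance is NP-hard.

Abstract translation (Chinese)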

图神经网络在扩展至深层架构时面临两个根本性挑战：过度平滑（即节点表征收敛至无法区分的向量）与过度挤压（即来自遥远节点的信息无法通过瓶颈传播）。这两种现象均与底层图结构密切相关，这引发了一个自然问题：我们能否通过优化图拓扑来缓解这些问题？本文对此类图结构优化的计算复杂性进行了理论研究。我们分别基于谱间隙和传导率，将缓解过度平滑与过度挤压问题形式化为图优化问题。通过从最小二分问题归约，我们证明了这两个问题的精确优化均为NP难问题，并确立了其判定版本的NP完全性。我们的研究结果为理解图重连在优化图神经网络时的根本局限性提供了理论基础，并为实践中采用近似算法与启发式方法提供了理论依据。

Abstract

Graph Neural Networks (GNNs) face two fundamental challenges when scaled to deep architectures: oversmoothing, where node representations converge to indistinguishable vectors, and oversquashing, where information from distant nodes fails to propagate through bottlenecks. Both phenomena are intimately tied to the underlying graph structure, raising a natural question: can we optimize the graph topology to mitigate these issues? This paper provides a theoretical investigation of the computational complexity of such graph structure optimization. We formulate oversmoothing and oversquashing mitigation as graph optimization problems based on spectral gap and conductance, respectively. We prove that exact optimization for either problem is NP-hard through reductions from Minimum Bisection, establishing NP-completeness of the decision versions. Our results provide theoretical foundations for understanding the fundamental limits of graph rewiring for GNN optimization and justify the use of approximation algorithms and heuristic methods in practice.

Keywords: Graph Neural Networks, Oversmoothing, Oversquashing, Graph Rewiring, Computational Complexity, NP-hard, Spectral Gap, Conductance
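
To make the conductance objective in the abstract above concrete, the sketch below evaluates it by brute force on a tiny graph. It assumes the common definition, cut(S) / min(vol(S), vol(V \ S)) minimized over non-empty proper subsets S, which may differ in detail from the paper's formulation; the function names and the example graph are illustrative.

```python
from itertools import combinations


def conductance(edges, nodes, subset):
    """Conductance of the cut (S, V \\ S): crossing edges over the smaller volume."""
    s = set(subset)
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    cut = sum(1 for u, v in edges if (u in s) != (v in s))
    vol_s = sum(degree[n] for n in s)
    vol_rest = sum(degree[n] for n in nodes if n not in s)
    smaller = min(vol_s, vol_rest)
    return cut / smaller if smaller > 0 else float("inf")


def graph_conductance(edges, nodes):
    """Minimum conductance over all non-empty proper subsets (exponential time)."""
    best, best_set = float("inf"), None
    for size in range(1, len(nodes)):
        for subset in combinations(nodes, size):
            phi = conductance(edges, nodes, subset)
            if phi < best:
                best, best_set = phi, set(subset)
    return best, best_set


# Two triangles joined by a single bridge edge: a textbook bottleneck.
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
phi, bottleneck = graph_conductance(edges, nodes)
print(phi, bottleneck)
```

The enumeration visits on the order of 2^n subsets, so it is only usable on toy graphs. That blow-up is the practical face of the hardness result and the reason heuristics and approximation algorithms are used for rewiring in practice.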

70. ❌ A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

Authors: Xianpeng Sun, Haonan Sun, Tian Yu, Sheng Ma, Qincheng Zhang, Lifei Rao, Chen Tian Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26137v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

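Scoring rationale: The paper mainly studies a benchmark methodology for software engineering evaluation, uses an LLM to help generate task prompts, and evaluates the performance of software engineering agents. It is therefore related to "Large Language Models OR LLMs OR Foundation Models" (weight 1.0), with a relevance of 8, because LLMs are used to assist task generation but are not a core model innovation. It is also related to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (weight 1.0), with a relevance of 8, because the paper evaluates software engineering agents and involves agentic workflows. The other keywords, such as MoE, SLMs, training techniques, inference optimization, and AI-for-science applications, are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

This paper proposes a time-consistent benchmark methodology for evaluating repository-aware software engineering systems, uses LLM-assisted generation of task prompts, and finds that prompt construction is a first-order variable affecting evaluation results.

Abstract translation (Chinese)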

对仓库感知软件工程系统的评估常因合成任务设计、提示泄露以及仓库知识与未来代码变更之间的时序污染而受到干扰。我们提出一种时序一致的基准测试方法：在时间T0对代码仓库进行快照，仅使用T0前可用的工件构建仓库衍生的代码知识，并在基于未来时段(T0, T1]内合并的拉取请求所衍生的工程任务上进行评估。每个历史拉取请求通过LLM辅助的提示生成流程转化为自然语言任务，该基准被形式化为匹配的A/B对比实验——在保持其他变量恒定的条件下，对同一软件工程智能体分别在使用与不使用仓库衍生代码知识的情况下进行评估。我们还在两个开源仓库（DragonFly和React）上使用三个Claude系列模型和四种提示粒度开展了基线特征研究。在两个仓库中，文件级F1分数从最小提示到引导提示呈单调上升趋势，最强测试模型在DragonFly和React上分别达到0.8081和0.8078。这些结果表明提示构建是基准测试的一阶变量。更广泛而言，本基准测试强调时序一致性与提示控制是仓库感知软件工程评估的核心效度要求。

Abstract

Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source repositories, DragonFly and React, using three Claude-family models and four prompt granularities. Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model. These results show that prompt construction is a first-order benchmark variable. More broadly, the benchmark highlights that temporal consistency and prompt control are core validity requirements for repository-aware software engineering evaluation.

Keywords: repository-aware software engineering, time-consistent benchmark, LLM-assisted prompt generation, software engineering agent, evaluation methodology, temporal contamination, prompt granularity, A/B comparison
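
Two mechanics in the abstract above lend themselves to a short sketch: the half-open evaluation window (T0, T1] and the file-level F1 metric. The abstract does not spell out how F1 is computed, so the code assumes it compares the set of files a system proposes to touch with the set of files the merged pull request actually changed. The function names, dates, and file names are hypothetical.

```python
from datetime import datetime


def in_evaluation_window(merged_at, t0, t1):
    """True if a pull request merged in the half-open interval (T0, T1]."""
    return t0 < merged_at <= t1


def file_level_f1(predicted_files, changed_files):
    """F1 between the files a system proposes and the files the PR changed."""
    predicted, changed = set(predicted_files), set(changed_files)
    overlap = len(predicted & changed)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(changed)
    return 2 * precision * recall / (precision + recall)


# Hypothetical snapshot time T0 and cutoff T1.
t0, t1 = datetime(2025, 1, 1), datetime(2025, 6, 30)
print(in_evaluation_window(datetime(2025, 1, 1), t0, t1))   # merged exactly at T0: excluded
print(in_evaluation_window(datetime(2025, 6, 30), t0, t1))  # merged exactly at T1: included
print(round(file_level_f1(["a.py", "b.py", "c.py"], ["b.py", "c.py", "d.py"]), 4))
```

Excluding the T0 boundary matters: anything merged at or before the snapshot may already be reflected in the repository-derived knowledge, which is exactly the temporal contamination the benchmark is designed to avoid.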

71. ❌ SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

Authors: Deepak Kumar Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26130v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

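Scoring rationale: The paper's core is a performance evaluation of 8 frontier LLMs on code review, which directly matches the LLM keyword (10). It finds that model performance degrades in the longer-context configurations, which relates to the long-context LLM keyword (5). The paper does not involve other technical innovations or scientific applications, so the remaining keywords receive 0.

!!! tip deepseek-chat TL;DR

Using the SWE-PRBench benchmark, this paper finds that 8 frontier LLMs detect only 15-31% of human-flagged issues in code review and that performance degrades as context length increases, indicating that AI code review remains far below human expert level.

Abstract translation (Chinese)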

我们推出SWE-PRBench基准测试集，该数据集包含350个经过人工标注真实值的代码拉取请求，用于评估AI代码审查质量。基于kappa系数为0.75验证的大语言模型即评委框架进行评估，在仅差异文本配置下，8个前沿模型仅能检测出15-31%的人工标注问题，这表明尽管在代码生成基准测试中表现优异，AI代码审查能力仍远低于人类专家水平。拉取请求选自活跃的开源项目库，通过代码库质量评分从700个候选请求中筛选得出，并在三种冻结上下文配置下进行评估：仅差异文本（config_A）、差异文本附带文件内容（config_B）和完整上下文（config_C），从而实现对上下文提供策略的系统性消融研究。所有8个模型从config_A到config_C呈现单调性能下降趋势，即使通过结构化语义层（包括AST提取的函数上下文和导入图解析）提供上下文时亦然。主导机制在于config_B配置下Type2_上下文类问题检测能力的崩溃，这与长上下文中的注意力稀释现象一致：在所有8个模型中，包含摘要的结构化2000词元差异文本提示优于2500词元完整上下文提示（后者额外包含执行上下文、行为映射和测试签名）。前四名模型在统计上无显著差异（平均得分0.147-0.153），而与后四名模型（平均得分≤0.113）存在明显层级差距。数据集、上下文配置、标注数据及评估框架均已公开发布。

Abstract

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of human-flagged issues on the diff-only configuration, demonstrating that AI code review remains far below human expert performance despite strong results on code generation benchmarks. Pull requests are drawn from active open-source repositories, filtered from 700 candidates using a Repository Quality Score, and evaluated under three frozen context configurations: diff only (config_A), diff with file content (config_B), and full context (config_C), enabling systematic ablation of context provision strategies. All 8 models degrade monotonically from config_A to config_C, even when context is provided via structured semantic layers including AST-extracted function context and import graph resolution. The dominant mechanism is a collapse of Type2_Contextual issue detection at config_B, consistent with attention dilution in long contexts: a structured 2,000-token diff-with-summary prompt outperforms a 2,500-token full-context prompt enriched with execution context, behaviour mapping, and test signatures across all 8 models. The top four models are statistically indistinguishable (mean score 0.147-0.153) while a clear tier gap separates them from the remaining four (mean score <= 0.113). Dataset, contexts, annotations, and evaluation harness are released publicly.

Keywords: AI code review, LLM evaluation, benchmark, pull request, context configuration, attention dilution, human-annotated ground truth, frontier models
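
The abstract above reports that the LLM-as-judge framework was validated at kappa = 0.75. It does not say which kappa statistic was used; the sketch below assumes Cohen's kappa between two raters (for example, the judge and a human annotator) and uses made-up labels purely to show the calculation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only (1 = "issue detected", 0 = "missed").
judge = [1, 1, 1, 1, 0, 0, 0, 0]
human = [1, 1, 1, 0, 0, 0, 0, 1]
print(cohens_kappa(judge, human))
```

Unlike raw agreement (0.75 in this toy example), kappa discounts the agreement two raters would reach by chance, which is why it is the usual check before trusting an automated judge.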

72. ❌ Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Authors: Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26127v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

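Scoring rationale: The paper studies object-centric representations in self-supervised Vision Transformers (ViTs), which belongs to computer vision rather than to large language models (LLMs) or innovations in deep learning principles. It relates to only two keywords: (1) "Hallucination Mitigation OR Factuality OR Truthfulness" receives 5, because the paper notes that its method can mitigate object hallucination in multimodal large language models, though this is not the core contribution; (2) "Mechanistic Interpretability OR Explainable AI" receives 10, because the paper analyzes internal ViT representations to understand how object-centric properties are encoded, which falls under explainable AI. The other keywords, covering LLMs, training techniques, inference optimization, agent systems, and so on, are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

This paper finds that object-centric information in self-supervised Vision Transformers is distributed across the network's layers, and proposes Object-DINO to extract this information, improving unsupervised object discovery and mitigating object hallucination in multimodal large language models.

Abstract translation (Chinese)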

诸如DINO等自监督视觉Transformer（ViT）展现出一种涌现的对象发现能力，这通常体现在最终层[CLS]标记的注意力图中。然而，这些图常包含虚假激活，导致对象定位效果不佳。这是因为[CLS]标记基于图像级目标进行训练，其总结的是整个图像而非专注于对象。这种聚合稀释了存在于局部、块级交互中的以对象为中心的信息。我们通过计算所有层中基于块级注意力组件（查询、键和值）的块间相似性来分析此问题。我们发现：（1）以对象为中心的特性编码于源自所有三个组件（$q, k, v$）的相似性图中，这与先前仅使用键特征或[CLS]标记的研究不同。（2）这种以对象为中心的信息分布于整个网络中，而不仅限于最终层。基于这些洞见，我们提出了Object-DINO，一种无需训练的方法，用于提取这种分布式的以对象为中心的信息。Object-DINO根据其块相似性对所有层的注意力头进行聚类，并自动识别对应于所有对象的以对象为中心的聚类。我们在两个应用中验证了Object-DINO的有效性：提升无监督对象发现（CorLoc指标提升+3.6至+12.4）以及通过提供视觉 grounding 来缓解多模态大语言模型中的对象幻觉。我们的结果表明，利用这种分布式的以对象为中心的信息可在无需额外训练的情况下改进下游任务。

Abstract

Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO’s effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.

Keywords: Self-supervised Vision Transformers, object-centric properties, attention maps, distributed representation, unsupervised object discovery, object hallucination mitigation, visual grounding, training-free method
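
The analysis in the abstract above rests on inter-patch similarity maps computed from per-patch query, key, or value vectors. The abstract does not name the similarity measure, so the sketch below assumes cosine similarity and uses tiny hand-written vectors in place of real ViT features; it illustrates the idea and is not the Object-DINO implementation.

```python
import math


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between patch feature vectors.

    `features` is a list of equal-length vectors, one per image patch
    (for instance the key vectors of one attention head).
    """
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    n = len(features)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # leave similarity at 0 for all-zero vectors
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


# Three toy "patches": the first two point the same way, the third is orthogonal.
patches = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
for row in cosine_similarity_matrix(patches):
    print([round(value, 3) for value in row])
```

In such a map, patches on the same object tend to show up as a block of high mutual similarity, which is the kind of structure a head-clustering step could then group on.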

73. ❌ DPD-Cancer: Explainable Graph-based Deep Learning for Small Molecule Anti-Cancer Activity Prediction

Authors: Magnus H. Strømme, Alex G. C. de Sá, David B. Ascher Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26114v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

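Scoring rationale: DPD-Cancer uses a deep learning method built on a Graph Attention Transformer (GAT) to predict the anti-cancer activity of small molecules, an application of AI to bioinformatics and cheminformatics. It is unrelated to most keywords (LLMs, MoE, SFT, RLHF, and so on), which concern the principles, training methods, or inference optimization of large language models; the paper involves no language model or related technique. It is partially related to 'Mechanistic Interpretability OR Explainable AI' (5/10), because the paper uses the attention mechanism to provide explainability, and highly related to 'AI for Science OR Bioinformatics OR Cheminformatics' (10/10), because it applies deep learning directly to cancer research and drug discovery. All other keywords score 0.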

!!! tip deepseek-chat TL;DR

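This study develops DPD-Cancer, a deep learning method based on a Graph Attention Transformer, to predict small-molecule anti-cancer activity and cell-line-specific responses. It shows superior performance on several benchmarks and uses the attention mechanism to provide explainability that supports drug optimization.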

摘要翻译

精准预测药物反应是计算生物化学领域的一个关键瓶颈，其难点在于如何有效建模分子结构与细胞环境之间的相互作用。在癌症研究中，由于肿瘤异质性和基因组变异性的存在，这一问题尤为突出，阻碍了有效疗法的识别。传统方法往往难以捕捉不同细胞系中化学特征与生物学结果之间的非线性关系。为此，我们提出了DPD-Cancer，一种基于图注意力变换器（Graph Attention Transformer, GAT）框架的深度学习方法。该方法专为小分子抗癌活性分类及细胞系特异性反应（特别是生长抑制浓度pGI50）的定量预测而设计。在与前沿方法（pdCSM-cancer、ACLPred和MLASM）的基准测试中，DPD-Cancer表现出优越性能，在严格划分的NCI60数据集上获得了高达0.87的受试者工作特征曲线下面积（Area Under ROC Curve, AUC），在ACLPred/MLASM数据集上AUC更高达0.98。在针对10种癌症类型和73个细胞系的pGI50预测中，该模型在独立测试集上实现了高达0.72的皮尔逊相关系数。这些结果证实，基于注意力的机制在提取有意义的分子表征方面具有显著优势，使DPD-Cancer成为候选药物优先排序的有力工具。此外，DPD-Cancer通过利用注意力机制识别并可视化特定分子亚结构，提供了可解释性，为先导化合物优化提供了可操作的见解。DPD-Cancer已作为网络服务器免费提供，访问地址为：https://biosig.lab.uq.edu.au/dpd_cancer/。

摘要 (Abstract)

Accurate drug response prediction is a critical bottleneck in computational biochemistry, limited by the challenge of modelling the interplay between molecular structure and cellular context. In cancer research, this is acute due to tumour heterogeneity and genomic variability, which hinder the identification of effective therapies. Conventional approaches often fail to capture non-linear relationships between chemical features and biological outcomes across diverse cell lines. To address this, we introduce DPD-Cancer, a deep learning method based on a Graph Attention Transformer (GAT) framework. It is designed for small molecule anti-cancer activity classification and the quantitative prediction of cell-line specific responses, specifically growth inhibition concentration (pGI50). Benchmarked against state-of-the-art methods (pdCSM-cancer, ACLPred, and MLASM), DPD-Cancer demonstrated superior performance, achieving an Area Under ROC Curve (AUC) of up to 0.87 on strictly partitioned NCI60 data and up to 0.98 on ACLPred/MLASM datasets. For pGI50 prediction across 10 cancer types and 73 cell lines, the model achieved Pearson’s correlation coefficients of up to 0.72 on independent test sets. These findings confirm that attention-based mechanisms offer significant advantages in extracting meaningful molecular representations, establishing DPD-Cancer as a competitive tool for prioritising drug candidates. Furthermore, DPD-Cancer provides explainability by leveraging the attention mechanism to identify and visualise specific molecular substructures, offering actionable insights for lead optimisation. DPD-Cancer is freely available as a web server at: https://biosig.lab.uq.edu.au/dpd_cancer/.

关键词: deep learning, graph attention transformer, anti-cancer activity prediction, small molecule, drug response prediction, explainability, bioinformatics, cheminformatics
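
The two metrics reported in the abstract above, Area Under ROC Curve for the classification task and Pearson's correlation coefficient for pGI50 regression, can be computed as in the following sketch. The function names and the toy numbers are illustrative assumptions; they are not the paper's data or evaluation code.

```python
import math


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values only: two actives and two inactives with predicted scores.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7])
# Toy regression targets and predictions that are perfectly linearly related.
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The pairwise definition of AUC is quadratic in the number of samples, which is fine for a small illustration; a rank-based formulation would be the usual choice for large test sets.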

74. ❌ “Oops! ChatGPT is Temporarily Unavailable!”: A Diary Study on Knowledge Workers’ Experiences of LLM Withdrawal

作者: Eunseo Oh, Suyoun Lee, Jae Young Choi, Soobin Park, Youn-kyung Lim 期刊/来源: arxiv 发布日期: 2026-03-27 arXiv链接: http://arxiv.org/abs/2603.26099v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

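Scoring rationale: The paper studies knowledge workers' experiences when LLMs are temporarily unavailable, which is research on LLM use in human work practice. It is highly related only to the first keyword, 'Large Language Models OR LLMs OR Foundation Models' (10/10), because its core topic is how LLMs are embedded in and affect the work environment. The other keywords concern specific technical principles, methods, or application domains, whereas this is a sociological and human-computer interaction study that touches none of them, so they score 0.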

!!! tip deepseek-chat TL;DR

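Through a diary study, this work examines knowledge workers' experiences when LLMs are temporarily unavailable. It finds that LLM withdrawal disrupts workflows, prompts workers to reclaim professional values, and reveals that LLM use has become an inescapable norm.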

摘要翻译

大型语言模型（LLMs）已深度融入知识工作领域，引发了人们对日益增长的依赖性及人类技能可能被削弱的担忧。为探究LLMs在工作实践中的普遍性，我们对频繁使用LLMs的用户（N=10）开展了为期四天的日志研究，观察知识工作者在LLMs被临时停用时的应对方式。研究发现表明：LLMs的停用通过暴露任务执行中的缺口扰乱了参与者的工作流程；自主工作促使参与者重新确立专业价值；日常实践揭示了LLM使用已在多大程度上成为不可避免的规范性存在。本研究将LLMs概念化为当代知识工作的基础设施，通过实证揭示了LLMs常被忽视的作用，并在当前LLM无处不在的工作环境中，提出以价值驱动的适应性应用作为维护专业价值的实践路径。

摘要 (Abstract)

LLMs have become deeply embedded in knowledge work, raising concerns about growing dependency and the potential undermining of human skills. To investigate the pervasiveness of LLMs in work practices, we conducted a four-day diary study with frequent LLM users (N=10), observing how knowledge workers responded to a temporary withdrawal of LLMs. Our findings show how LLM withdrawal disrupted participants’ workflows by identifying gaps in task execution, how self-directed work led participants to reclaim professional values, and how everyday practices revealed the extent to which LLM use had become inescapably normative. Conceptualizing LLMs as infrastructural to contemporary knowledge work, this research contributes empirical insights into the often invisible role of LLMs and proposes value-driven appropriation as an approach to supporting professional values in the current LLM-pervasive work environment.

关键词: LLM withdrawal, knowledge work, diary study, workflow disruption, professional values, infrastructural, value-driven appropriation, LLM dependency

75. ❌ A Human-Inspired Decoupled Architecture for Efficient Audio Representation Learning

作者: Harunori Kawano, Takeshi Sasaki 期刊/来源: arxiv 发布日期: 2026-03-27 arXiv链接: http://arxiv.org/abs/2603.26098v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

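Scoring rationale: The paper proposes HEAR, an efficient architecture for audio representation learning that focuses on self-supervised learning for audio rather than on large language models. Relevance by keyword: (1) 'Small Language Models OR SLMs OR On-device AI', 5/10: the paper targets deployment on resource-constrained devices and involves lightweight models, but not language models; (2) 'Quantization OR Model Compression OR Low-bit Weights', 5/10: it reduces parameters and computation through architectural changes, which falls under model compression; (3) 'Speculative Decoding OR Inference Acceleration', 5/10: it lowers inference compute cost (9.47 GFLOPs), which relates to inference acceleration. The paper does not address the core topics of the other keywords, such as large language models, alignment, reasoning, or agents, so they score 0.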

!!! tip deepseek-chat TL;DR

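To address the excessive parameters and high compute cost of standard Transformers in audio representation learning, the paper proposes HEAR, a decoupled architecture inspired by human cognition. With only 15M parameters and 9.47 GFLOPs, it achieves competitive performance across several audio classification benchmarks.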

摘要翻译

尽管自监督学习（SSL）已彻底改变了音频表征领域，但标准Transformer的过度参数化与二次计算成本限制了其在资源受限设备上的部署。为应对这一瓶颈，我们提出HEAR（受人类启发的高效音频表征），一种新颖的解耦架构。受人类从全局语境中分离局部声学特征的认知能力启发，HEAR将处理流程拆分为两个专用模块：用于局部特征提取的声学模型和用于全局语义整合的任务模型。结合通过知识蒸馏训练的声学标记器，我们的方法实现了鲁棒的掩码音频建模。大量实验表明，HEAR仅需1500万参数和9.47 GFLOPs即可完成推理，其计算成本仅为传统基础模型（通常需要8500万至9400万参数）的极小部分。尽管具备如此高的效率，HEAR在多种音频分类基准测试中仍展现出极具竞争力的性能。代码与预训练模型已公开于https://github.com/HarunoriKawano/HEAR。

摘要 (Abstract)

While self-supervised learning (SSL) has revolutionized audio representation, the excessive parameterization and quadratic computational cost of standard Transformers limit their deployment on resource-constrained devices. To address this bottleneck, we propose HEAR (Human-inspired Efficient Audio Representation), a novel decoupled architecture. Inspired by the human cognitive ability to isolate local acoustic features from global context, HEAR splits the processing pipeline into two dedicated modules: an Acoustic Model for local feature extraction and a Task Model for global semantic integration. Coupled with an Acoustic Tokenizer trained via knowledge distillation, our approach enables robust Masked Audio Modeling (MAM). Extensive experiments demonstrate that HEAR requires only 15M parameters and 9.47 GFLOPs for inference, operating at a fraction of the computational cost of conventional foundation models (which typically require 85M-94M parameters). Despite this high efficiency, HEAR achieves highly competitive performance across diverse audio classification benchmarks. The code and pre-trained models are available at https://github.com/HarunoriKawano/HEAR

关键词: audio representation learning, self-supervised learning, Transformer efficiency, decoupled architecture, Masked Audio Modeling, parameter efficiency, computational cost reduction, audio classification

76. ❌ Dynamic Tokenization via Reinforcement Patching: End-to-end Training and Zero-shot Transfer

作者: Yulun Wu, Sravan Kumar Ankireddy, Samuel Sharpe, Nikita Seleznev, Dehao Yuan, Hyeji Kim, Nam H. Nguyen 期刊/来源: arxiv 发布日期: 2026-03-27 arXiv链接: http://arxiv.org/abs/2603.26097v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

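Scoring rationale: The paper proposes Reinforcement Patching (ReinPatch), a sequence segmentation framework that uses reinforcement learning to jointly optimize the segmentation policy and the downstream sequence backbone. The work focuses on dynamic tokenization for time-series forecasting and belongs to sequence modeling and representation learning. Although it involves deep learning models and reinforcement learning, its core content is unrelated to all scoring keywords, which center on techniques specific to large models (language models, alignment, reasoning, agents, compression, and so on). The paper mentions no language model, large-model technique, or AI-for-science application, so every keyword scores 0.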

!!! tip deepseek-chat TL;DR

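The paper proposes ReinPatch, a reinforcement-learning-based dynamic sequence segmentation framework that optimizes an adaptive tokenization policy for time-series data, and it outperforms existing data-driven segmentation methods on time-series forecasting tasks.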

摘要翻译

高效聚合空间或时间视野以获取紧凑表征，已成为现代深度学习模型的统一原则，然而为长视野序列数据（尤其是时间序列等连续序列）学习数据自适应的表征，仍是一个开放挑战。尽管固定尺寸分块（patching）已提升了可扩展性与性能，但端到端地发现可变尺寸、数据驱动的分块，往往迫使模型依赖软离散化、特定骨干网络或启发式规则。本研究提出强化分块（ReinPatch），这是首个利用强化学习联合优化序列分块策略及其下游序列骨干网络的框架。通过将分块边界定位构建为可通过组相对策略梯度（Group Relative Policy Gradient, GRPG）优化的离散决策过程，ReinPatch绕过了对连续松弛的需求，并以自然方式执行动态分块策略优化。此外，我们的方法能够严格强制执行目标压缩率，使下游骨干网络得以高效扩展，并天然支持多层级联建模。我们在时间序列预测数据集上评估ReinPatch，其相较于最先进的数据驱动分块策略展现出卓越性能。进一步地，我们的解耦设计允许将分块模块提取为独立的基础分块器，为研究社区提供了对纯性能驱动的神经分块策略所偏好的分割行为的可视化与实证洞察。

摘要 (Abstract)

Efficiently aggregating spatial or temporal horizons to acquire compact representations has become a unifying principle in modern deep learning models, yet learning data-adaptive representations for long-horizon sequence data, especially continuous sequences like time series, remains an open challenge. While fixed-size patching has improved scalability and performance, discovering variable-sized, data-driven patches end-to-end often forces models to rely on soft discretization, specific backbones, or heuristic rules. In this work, we propose Reinforcement Patching (ReinPatch), the first framework to jointly optimize a sequence patching policy and its downstream sequence backbone model using reinforcement learning. By formulating patch boundary placement as a discrete decision process optimized via Group Relative Policy Gradient (GRPG), ReinPatch bypasses the need for continuous relaxations and performs dynamic patching policy optimization in a natural manner. Moreover, our method allows strict enforcement of a desired compression rate, freeing the downstream backbone to scale efficiently, and naturally supports multi-level hierarchical modeling. We evaluate ReinPatch on time-series forecasting datasets, where it demonstrates compelling performance compared to state-of-the-art data-driven patching strategies. Furthermore, our detached design allows the patching module to be extracted as a standalone foundation patcher, providing the community with visual and empirical insights into the segmentation behaviors preferred by a purely performance-driven neural patching strategy.

关键词: Reinforcement Patching, dynamic tokenization, sequence patching, time-series forecasting, reinforcement learning, Group Relative Policy Gradient, hierarchical modeling, data-driven patches
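
The abstract above treats patch boundary placement as a discrete decision with a strictly enforced compression rate and a group-relative policy gradient. The sketch below illustrates those two ingredients under loud assumptions: boundaries are sampled uniformly as a stand-in for a learned policy, the reward is a toy within-patch error rather than a downstream forecasting loss, and every function name is hypothetical. It is not the paper's GRPG algorithm.

```python
import random
import statistics


def sample_boundaries(seq_len, num_patches, rng):
    """Pick exactly num_patches - 1 distinct cut points, so the patch count is fixed.

    Uniform sampling is a placeholder for a learned patching policy.
    """
    return sorted(rng.sample(range(1, seq_len), num_patches - 1))


def split(series, cuts):
    """Split a sequence into contiguous patches at the given cut points."""
    edges = [0] + list(cuts) + [len(series)]
    return [series[a:b] for a, b in zip(edges, edges[1:])]


def reward(series, cuts):
    """Toy reward: negative squared error of each patch around its own mean."""
    total = 0.0
    for patch in split(series, cuts):
        mean = sum(patch) / len(patch)
        total += sum((x - mean) ** 2 for x in patch)
    return -total


def group_relative_advantages(rewards):
    """Standardize rewards within one sampled group (zero mean, unit std)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


rng = random.Random(0)
series = [0.0] * 4 + [5.0] * 4 + [1.0] * 4   # three flat regimes
group = [sample_boundaries(len(series), 3, rng) for _ in range(8)]
rewards = [reward(series, cuts) for cuts in group]
advantages = group_relative_advantages(rewards)
```

In a policy-gradient update, each sampled segmentation's log-probability would be weighted by its advantage, so segmentations that beat the group average are reinforced without a separate value network. Fixing the number of cut points is one simple way to hold the compression rate exactly.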

77. ❌ Selective Deficits in LLM Mental Self-Modeling in a Behavior-Based Test of Theory of Mind

作者: Christopher Ackerman 期刊/来源: arxiv 发布日期: 2026-03-27 arXiv链接: http://arxiv.org/abs/2603.26089v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

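Scoring rationale: The paper studies the Theory of Mind abilities of LLMs, in particular self-modeling and other-modeling, and belongs to frontier research on evaluating LLM cognitive abilities. Highly related keywords: LLMs (the object of study), Chain of Thought / System 2 Thinking (the paper involves reasoning processes), Self-Correction / Self-Reflection (self-modeling and reflection), LLM Agents (treating LLMs as agents with mental states), and World Models (building mental models). It is partially related to Mechanistic Interpretability (it explores the internal workings of LLMs). Other keywords such as MoE, quantization, and RAG are not addressed.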

!!! tip deepseek-chat TL;DR

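Using behavioral experiments to test the Theory of Mind abilities of LLMs, the study finds that recent LLMs reach human-level performance at modeling other minds but generally fail at self-modeling unless given a reasoning trace as working-memory support. It also shows that reasoning models readily engage in strategic deception during the tasks.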

摘要翻译

将自我与他人表征为具有知识、意图和信念状态并能以此指导行为的行动者——这种“心理理论”能力是人类普遍具备的，它使我们得以在社会世界中游刃有余地应对乃至操控。该能力依赖于我们形成自我与他人心智模型的功能。鉴于心理理论在人类活动中的无处不在，大语言模型在其训练数据中必然接触过无数相关示例，因而可能已学会模仿这种能力，但它们是否真正习得了可在任意情境中运用的因果模型尚不明确。为此，我们设计了一种新颖的实验范式，要求受试者构建自我与他人的心理状态表征，并基于这些表征进行策略性行动，而非仅仅描述它们。我们运用该范式测试了2024年以来发布的一系列领先的开源与闭源大语言模型以及人类受试者。研究发现：1）2025年年中之前发布的模型在所有任务中均告失败；2）较新发布的模型在建模他人认知状态任务上已达到人类水平；3）即使是前沿模型在自我建模任务中依然失败——除非为其提供推理轨迹形式的“草稿纸”。我们进一步证明了认知负荷对他人建模任务的影响，这为“大语言模型在单次前向传播中使用了类似有限容量工作记忆的机制来保持心理表征”提供了启示性证据。最后，我们探究了推理模型在自我与他人建模任务中的成功机制，并发现它们能够轻易实施策略性欺骗。

摘要 (Abstract)

The ability to represent oneself and others as agents with knowledge, intentions, and belief states that guide their behavior - Theory of Mind - is a human universal that enables us to navigate - and manipulate - the social world. It is supported by our ability to form mental models of ourselves and others. Its ubiquity in human affairs entails that LLMs have seen innumerable examples of it in their training data and therefore may have learned to mimic it, but whether they have actually learned causal models that they can deploy in arbitrary settings is unclear. We therefore develop a novel experimental paradigm that requires that subjects form representations of the mental states of themselves and others and act on them strategically rather than merely describe them. We test a wide range of leading open and closed source LLMs released since 2024, as well as human subjects, on this paradigm. We find that 1) LLMs released before mid-2025 fail at all of our tasks, 2) more recent LLMs achieve human-level performance on modeling the cognitive states of others, and 3) even frontier LLMs fail at our self-modeling task - unless afforded a scratchpad in the form of a reasoning trace. We further demonstrate cognitive load effects on other-modeling tasks, offering suggestive evidence that LLMs are using something akin to limited-capacity working memory to hold these mental representations in mind during a single forward pass. Finally, we explore the mechanisms by which reasoning models succeed at the self- and other-modeling tasks, and show that they readily engage in strategic deception.

关键词: Theory of Mind, LLM mental self-modeling, cognitive states, reasoning trace, strategic deception, working memory, agent modeling, behavior-based test

78. ❌ When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization

作者: Zhihan Chen, Yuhuan Zhao, Yijie Zhu, Xinyu Yao 期刊/来源: arxiv 发布日期: 2026-03-27 arXiv链接: http://arxiv.org/abs/2603.26078v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

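Scoring rationale: The paper studies identity collapse in subject-driven text-to-image diffusion models during multi-subject personalization. It belongs to computer vision and generative modeling and has no direct connection to the provided keywords, which mainly cover language models, training methods, reasoning techniques, and agent systems. The paper involves none of the techniques or application domains in the keywords, so every keyword scores 0.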

!!! tip deepseek-chat TL;DR

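The paper finds that current subject-driven text-to-image diffusion models suffer severe identity collapse in multi-subject synthesis, especially in complex scenes, and proposes a new benchmark and metric to expose and quantify this failure mode systematically.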

摘要翻译

主题驱动的文本到图像扩散模型在保持单一身份方面取得了显著成功，但其组合多个交互主体的能力仍鲜有探索且极具挑战性。现有的评估方案通常依赖全局CLIP指标，这些指标对局部身份坍缩不敏感，且无法捕捉多主体纠缠的严重程度。本文揭示了当前模型中普遍存在的“可扩展性假象”：尽管它们在简单布局中能出色合成2-4个主体，但当规模扩展至6-10个主体或涉及复杂物理交互时，便会遭遇灾难性的身份坍缩。为系统揭示这一失效模式，我们构建了一个严格的压力测试基准，包含75个提示词，分布于不同主体数量和交互难度（中性、遮挡、交互）中。进一步，我们证明基于CLIP的标准指标对此任务存在根本缺陷，因为它们常为语义正确但身份坍缩的图像（例如生成通用克隆体）赋予高分。为此，我们引入了主体坍缩率（Subject Collapse Rate, SCR），这是一种基于DINOv2结构先验的新型评估指标，能严格惩罚局部注意力泄漏和同质化现象。通过对前沿模型（MOSAIC、XVerse、PSR）的广泛评估，我们发现随着场景复杂度增加，身份保真度急剧下降，在10个主体时SCR接近100%。我们将这种坍缩归因于全局注意力路由中固有的语义捷径，这凸显了未来生成架构中亟需显式物理解耦的迫切性。

摘要 (Abstract)

Subject-driven text-to-image diffusion models have achieved remarkable success in preserving single identities, yet their ability to compose multiple interacting subjects remains largely unexplored and highly challenging. Existing evaluation protocols typically rely on global CLIP metrics, which are insensitive to local identity collapse and fail to capture the severity of multi-subject entanglement. In this paper, we identify a pervasive “Illusion of Scalability” in current models: while they excel at synthesizing 2-4 subjects in simple layouts, they suffer from catastrophic identity collapse when scaled to 6-10 subjects or tasked with complex physical interactions. To systematically expose this failure mode, we construct a rigorous stress-test benchmark comprising 75 prompts distributed across varying subject counts and interaction difficulties (Neutral, Occlusion, Interaction). Furthermore, we demonstrate that standard CLIP-based metrics are fundamentally flawed for this task, as they often assign high scores to semantically correct but identity-collapsed images (e.g., generating generic clones). To address this, we introduce the Subject Collapse Rate (SCR), a novel evaluation metric grounded in DINOv2’s structural priors, which strictly penalizes local attention leakage and homogenization. Our extensive evaluation of state-of-the-art models (MOSAIC, XVerse, PSR) reveals a precipitous drop in identity fidelity as scene complexity grows, with SCR approaching 100% at 10 subjects. We trace this collapse to the semantic shortcuts inherent in global attention routing, underscoring the urgent need for explicit physical disentanglement in future generative architectures.

关键词: subject-driven text-to-image diffusion models, multi-subject personalization, identity collapse, stress-test benchmark, Subject Collapse Rate (SCR), DINOv2, global attention routing, physical disentanglement

79. ❌ Semi-Automated Knowledge Engineering and Process Mapping for Total Airport Management

作者: Darryl Teo, Adharsha Sam, Chuan Shen Marcus Koh, Rakesh Nagi, Nuno Antunes Ribeiro 期刊/来源: arxiv 发布日期: 2026-03-27 arXiv链接: http://arxiv.org/abs/2603.26076v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

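Scoring rationale: The paper's core is using LLMs to build a knowledge graph for airport management, so it is highly related to 'Large Language Models' (10/10). It compares segment-based and document-level processing, which relates to 'Context Window Extension' (8/10). Its focus on the traceability and verifiability of generated output gives it some relation to 'Hallucination Mitigation' and 'Explainable AI' (5/10 each). Other keywords, such as MoE, SLMs, training methods, inference optimization, and agent systems, are not addressed (0).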

!!! tip deepseek-chat TL;DR

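The paper proposes a dual-stage framework that fuses symbolic knowledge engineering with generative large language models to build a knowledge graph for airport management. It addresses data silos and semantic inconsistency, and its document-level processing improves recovery of non-linear procedural dependencies.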

摘要翻译

机场运营文档的复杂性根植于其广泛的技术术语、严格的法规体系、专有的区域信息以及多利益相关方之间碎片化的沟通模式。由此形成的数据孤岛与语义不一致问题，对整体机场管理（Total Airport Management, TAM）倡议构成了显著障碍。本文提出一种方法论框架，通过符号知识工程（Knowledge Engineering, KE）与生成式大语言模型（Large Language Models, LLMs）的双阶段融合，构建立足领域、机器可读的知识图谱（Knowledge Graph, KG）。该框架采用一种脚手架式融合策略，其中由专家构建的 KE 结构引导 LLM 提示，以促进语义对齐的知识三元组发现。我们基于 Google LangExtract 库对该方法进行评估，并通过比较基于局部片段的推理与文档级处理，探究上下文窗口利用的影响。与先前关于 LLMs 长上下文性能衰减的实证观察相反，文档级处理提升了对非线性流程依赖关系的恢复能力。为确保机场运营所需的高保真溯源，所提框架融合了用于发现的概率模型与用于将每次提取锚定至原始来源的确定性算法。这实现了绝对的可追溯性与可验证性，从而弥合了“黑箱”生成式输出与运营工具所需的透明度之间的鸿沟。最后，我们引入一个自动化框架，将上述流程操作化，以从非结构化文本语料中合成复杂的运营工作流。

摘要 (Abstract)

Documentation of airport operations is inherently complex due to extensive technical terminology, rigorous regulations, proprietary regional information, and fragmented communication across multiple stakeholders. The resulting data silos and semantic inconsistencies present a significant impediment to the Total Airport Management (TAM) initiative. This paper presents a methodological framework for constructing a domain-grounded, machine-readable Knowledge Graph (KG) through a dual-stage fusion of symbolic Knowledge Engineering (KE) and generative Large Language Models (LLMs). The framework employs a scaffolded fusion strategy in which expert-curated KE structures guide LLM prompts to facilitate the discovery of semantically aligned knowledge triples. We evaluate this methodology on the Google LangExtract library and investigate the impact of context window utilization by comparing localized segment-based inference with document-level processing. Contrary to prior empirical observations of long-context degradation in LLMs, document-level processing improves the recovery of non-linear procedural dependencies. To ensure the high-fidelity provenance required in airport operations, the proposed framework fuses a probabilistic model for discovery and a deterministic algorithm for anchoring every extraction to its ground source. This ensures absolute traceability and verifiability, bridging the gap between “black-box” generative outputs and the transparency required for operational tooling. Finally, we introduce an automated framework that operationalizes this pipeline to synthesize complex operational workflows from unstructured textual corpora.

关键词: Knowledge Graph, Large Language Models, Airport Management, Knowledge Engineering, Context Window, Traceability, Workflow Synthesis
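
The abstract above pairs a probabilistic discovery step with a deterministic step that anchors every extraction to its source. One simple way to picture the deterministic half is exact-span matching, sketched below. This is an illustrative assumption, not the paper's algorithm and not the LangExtract API; the function names and the toy sentence are hypothetical.

```python
def anchor_span(source_text, extracted_text):
    """Return (start, end) character offsets of an exact match, or None if absent."""
    if not extracted_text:
        return None
    start = source_text.find(extracted_text)
    if start == -1:
        return None
    return (start, start + len(extracted_text))


def ground_triple(source_text, subject, relation, obj):
    """Keep a triple only if both its subject and object occur verbatim in the source.

    The relation is not anchored here, on the assumption that a model may
    paraphrase it; a stricter pipeline could anchor it too.
    """
    spans = {
        "subject": anchor_span(source_text, subject),
        "object": anchor_span(source_text, obj),
    }
    if spans["subject"] is None or spans["object"] is None:
        return None
    return {"triple": (subject, relation, obj), "spans": spans}


# Toy sentence invented for illustration; it is not from any airport manual.
doc = "The stand allocation unit assigns gates after the airline confirms the arrival time."
grounded = ground_triple(doc, "stand allocation unit", "assigns", "gates")
rejected = ground_triple(doc, "stand allocation unit", "assigns", "runways")
```

Because the check is a plain substring search, a triple whose object never appears in the document (`"runways"` here) is rejected rather than trusted. Each accepted triple carries offsets that a reviewer can use to jump back to the original text.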

80. ❌ R-PGA: Robust Physical Adversarial Camouflage Generation via Relightable 3D Gaussian Splatting

作者: Tianrui Lou, Siyuan Liang, Jiawei Liang, Yuze Gao, Xiaochun Cao 期刊/来源: arxiv 发布日期: 2026-03-27 arXiv链接: http://arxiv.org/abs/2603.26067v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

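Scoring rationale: The paper studies physical adversarial camouflage generation, focusing on 3D Gaussian Splatting, rendering, and optimization techniques for attacking autonomous driving systems. All scoring keywords relate directly to large language models, deep learning technical principles, or AI-for-science applications, whereas this paper's topics are computer vision, 3D reconstruction, and adversarial attacks, which have no direct connection to the scoring keywords.

!!! tip deepseek-chat TL;DR

This paper proposes R-PGA, a robust physical adversarial camouflage generation framework based on relightable 3D Gaussian Splatting. It addresses the poor generalization of existing methods in complex dynamic scenes by combining high-fidelity simulation with an optimization strategy that actively mines worst-case physical configurations, improving the robustness and effectiveness of the adversarial attack.

Abstract (Chinese translation)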

物理对抗性伪装通过将对抗性纹理映射到三维物体上，对自动驾驶系统构成严重安全威胁。然而，现有方法在复杂动态场景中仍显脆弱，难以泛化至多样的几何（如观测配置）与辐射度（如动态光照、大气散射）变化。我们将此不足归因于仿真与优化中的两个根本局限。首先，对粗糙、过度简化仿真（如通过CARLA）的依赖导致了显著的领域差距，将优化限制在有偏的特征空间中。其次，以平均性能为目标的传统策略产生了崎岖的损失曲面，使得伪装易受配置变化的影响。为弥合这些差距，我们提出了基于可重光照物理三维高斯溅射（Relightable Physical 3D Gaussian Splatting, 3DGS）的攻击框架（R-PGA）。技术上，针对仿真保真度问题，我们利用3DGS确保照片级真实感重建，并通过物理解耦属性增强其能力，以分离内在材质与光照影响。此外，我们设计了一种混合渲染管线，利用精确的可重光照3DGS进行前景渲染，同时采用预训练的图像转换模型合成与重光照前景协调的、合理的重光照背景。针对优化鲁棒性问题，我们提出了硬物理配置挖掘（Hard Physical Configuration Mining, HPCM）模块，旨在主动挖掘最差物理配置并抑制其对应的损失峰值。该策略不仅降低了整体损失幅度，还有效平缓了崎岖的损失曲面，确保了在不同物理配置下对抗效果与鲁棒性的一致保持。

Abstract

Physical adversarial camouflage poses a severe security threat to autonomous driving systems by mapping adversarial textures onto 3D objects. Nevertheless, current methods remain brittle in complex dynamic scenarios, failing to generalize across diverse geometric (e.g., viewing configurations) and radiometric (e.g., dynamic illumination, atmospheric scattering) variations. We attribute this deficiency to two fundamental limitations in simulation and optimization. First, the reliance on coarse, oversimplified simulations (e.g., via CARLA) induces a significant domain gap, confining optimization to a biased feature space. Second, standard strategies targeting average performance result in a rugged loss landscape, leaving the camouflage vulnerable to configuration shifts. To bridge these gaps, we propose the Relightable Physical 3D Gaussian Splatting (3DGS) based Attack framework (R-PGA). Technically, to address the simulation fidelity issue, we leverage 3DGS to ensure photo-realistic reconstruction and augment it with physically disentangled attributes to decouple intrinsic material from lighting. Furthermore, we design a hybrid rendering pipeline that leverages precise Relightable 3DGS for foreground rendering, while employing a pre-trained image translation model to synthesize plausible relighted backgrounds that align with the relighted foreground. To address the optimization robustness issue, we propose the Hard Physical Configuration Mining (HPCM) module, designed to actively mine worst-case physical configurations and suppress their corresponding loss peaks. This strategy not only diminishes the overall loss magnitude but also effectively flattens the rugged loss landscape, ensuring consistent adversarial effectiveness and robustness across varying physical configurations.

Keywords: Physical adversarial camouflage, 3D Gaussian Splatting, Relightable rendering, Autonomous driving security, Robust optimization, Hard Physical Configuration Mining, Domain gap, Adversarial attack
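
The abstract contrasts strategies that target average performance with HPCM, which mines worst-case physical configurations and suppresses their loss peaks. The abstract does not say how the mining is done, so the sketch below is an illustrative assumption only: it uses a simple top-k selection over sampled configurations. The `Config` tuple, the function names, and the loss values are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical configuration: (viewing angle in degrees, relative light intensity).
Config = Tuple[float, float]


def mine_hard_configs(losses: Dict[Config, float], k: int) -> List[Config]:
    """Return the k configurations with the highest attack loss."""
    return sorted(losses, key=lambda cfg: losses[cfg], reverse=True)[:k]


def average_objective(losses: Dict[Config, float]) -> float:
    """Average-performance objective: mean loss over all sampled configurations."""
    return mean(losses.values())


def hard_mining_objective(losses: Dict[Config, float], k: int) -> float:
    """Worst-case objective: mean loss over the k hardest configurations only."""
    return mean(losses[cfg] for cfg in mine_hard_configs(losses, k))


# Made-up per-configuration losses for one candidate camouflage texture.
losses = {(0.0, 1.0): 0.2, (45.0, 1.0): 0.3, (90.0, 0.2): 1.5, (135.0, 0.5): 0.4}
hardest = mine_hard_configs(losses, k=1)
```

With these toy numbers the average objective is dominated by the three easy configurations, while the hard-mining objective focuses entirely on the loss peak at `(90.0, 0.2)`, which is the kind of peak a worst-case strategy would try to suppress.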

81. ❌ MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection

Authors: Peiyuan Jiang, Yao Liu, Yanglei Gan, Jiaye Yang, Lu Liu, Daibing Yao, Qiao Liu Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26064v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

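Scoring rationale: The paper focuses on multimodal deception detection, proposes a GSR-guided progressive distillation method, and builds the MuDD dataset. Its content covers multimodal learning, knowledge distillation, physiological signal processing, and deception detection, but it does not touch on large language models, innovation in deep learning technical principles, or AI for Science. The paper is therefore unrelated to every keyword, and all relevance scores are 0.

!!! tip deepseek-chat TL;DR

To address the instability of visual and auditory cues in non-contact deception detection, this paper builds the multimodal MuDD dataset and proposes a GSR-guided progressive distillation framework, achieving more stable cross-modal knowledge transfer and state-of-the-art deception detection performance.

Abstract (Chinese translation)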

非接触式自动欺骗检测仍面临挑战，因为视觉与听觉欺骗线索常缺乏稳定的跨被试模式。相比之下，皮肤电反应（Galvanic Skin Response, GSR）能提供更可靠的生理线索，已广泛应用于接触式欺骗检测。本研究利用GSR中稳定的欺骗相关知识，通过跨模态知识蒸馏指导非接触模态的表征学习。然而，一个关键障碍是缺乏适用于此场景的数据集。为此，我们提出了MuDD——一个大规模多模态欺骗检测数据集，包含130名参与者超过690分钟的录制数据。除视频、音频和GSR外，MuDD还提供了光电容积描记、心率及人格特质数据，支持更广泛的欺骗科学研究。基于此数据集，我们提出GSR引导的渐进式蒸馏（GSR-guided Progressive Distillation, GPD），这是一个用于缓解GSR与非接触信号间巨大模态差异导致负迁移的跨模态蒸馏框架。GPD的核心创新在于将渐进式特征级与数字级蒸馏与动态路由相结合，使模型能自适应地决定训练过程中教师知识的传递方式，从而实现更稳定的跨模态知识迁移。大量实验与可视化结果表明，GPD优于现有方法，在欺骗检测和隐藏数字识别任务上均达到最先进的性能水平。

Abstract

Non-contact automatic deception detection remains challenging because visual and auditory deception cues often lack stable cross-subject patterns. In contrast, galvanic skin response (GSR) provides more reliable physiological cues and has been widely used in contact-based deception detection. In this work, we leverage stable deception-related knowledge in GSR to guide representation learning in non-contact modalities through cross-modal knowledge distillation. A key obstacle, however, is the lack of a suitable dataset for this setting. To address this, we introduce MuDD, a large-scale Multimodal Deception Detection dataset containing recordings from 130 participants over 690 minutes. In addition to video, audio, and GSR, MuDD also provides Photoplethysmography, heart rate, and personality traits, supporting broader scientific studies of deception. Based on this dataset, we propose GSR-guided Progressive Distillation (GPD), a cross-modal distillation framework for mitigating the negative transfer caused by the large modality mismatch between GSR and non-contact signals. The core innovation of GPD is the integration of progressive feature-level and digit-level distillation with dynamic routing, which allows the model to adaptively determine how teacher knowledge should be transferred during training, leading to more stable cross-modal knowledge transfer. Extensive experiments and visualizations show that GPD outperforms existing methods and achieves state-of-the-art performance on both deception detection and concealed-digit identification.

Keywords: deception detection, multimodal dataset, cross-modal knowledge distillation, galvanic skin response, progressive distillation, non-contact detection, MuDD dataset, GSR-guided learning

82. ❌ Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

Authors: Zizhao Chen, Ping Wei, Ziyang Ren, Huan Li, Xiangru Yin Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26052v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

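Scoring rationale: The paper focuses on multimodal media verification and proposes MaLSF, a framework for detecting and grounding multimodal misinformation. Its core innovation uses mask-label pairs as semantic anchors and identifies local semantic inconsistencies through bidirectional cross-modal verification and hierarchical semantic aggregation. However, the paper does not involve innovation in large language models (LLMs) or deep learning technical principles, nor does it mention any specific technique among the scoring keywords (e.g., MoE, Scaling Laws, RLHF, RAG). It sits at the intersection of multimodal computer vision and natural language processing, but covers neither applications of large models in other domains nor innovation in large-model technology itself. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

This paper proposes MaLSF, a new multimodal media verification framework that uses mask-label pairs as semantic anchors and applies bidirectional cross-modal verification and hierarchical semantic aggregation to detect and ground multimodal misinformation, achieving state-of-the-art performance on DGM4 and on multimodal fake news detection.

Abstract (Chinese translation)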

随着多模态虚假信息日益复杂化，其检测与定位变得至关重要。然而，当前依赖被动整体融合的多模态验证方法难以应对复杂的虚假信息。由于“特征稀释”现象，全局对齐倾向于平均化细微的局部语义不一致，从而有效地掩盖了本应被发现的冲突。我们提出MaLSF（掩码感知局部语义融合）这一新颖框架，它将范式转向主动、双向的验证过程，模拟人类认知中的交叉参照机制。MaLSF利用掩码-标签对作为语义锚点来连接像素与词语。其核心机制包含两项创新：1）双向跨模态验证模块，该模块充当“审问者”角色，通过并行查询流（以文本为查询和以图像为查询）来显式定位冲突；2）分层语义聚合模块，该模块智能地聚合这些多粒度冲突信号以进行面向特定任务的推理。此外，为提取细粒度的掩码-标签对，我们引入了一组多样化的掩码-标签对提取解析器。MaLSF在DGM4数据集及多模态虚假新闻检测任务上均取得了最先进的性能。广泛的消融实验与可视化结果进一步验证了其有效性与可解释性。

Abstract

As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with sophisticated misinformation. Due to ‘feature dilution,’ global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that intelligently aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.

Keywords: multimodal misinformation detection, mask-label pairs, bidirectional cross-modal verification, hierarchical semantic aggregation, local semantic inconsistencies, multimodal fake news detection, MaLSF framework

83. ❌ Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays

Authors: Kang Liu, Zhuoqi Ma, Siyu Liang, Yunan Li, Xiyue Gao, Chao Liang, Kun Xie, Qiguang Miao Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26049v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

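Scoring rationale: The paper focuses on vision-language pretraining for medical imaging (chest X-rays); its core innovation integrates clinical context and radiologists' gaze to guide diagnostic reasoning. It is unrelated to most keywords because it does not involve general-purpose LLM techniques, reasoning methods, alignment, efficient fine-tuning, agent systems, and so on. It is highly relevant to only two keywords: 1) "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points), because the core contribution is CoGaze, a new vision-language pretraining framework; 2) "AI for Science OR Bioinformatics OR Cheminformatics" (10 points), because the paper applies AI to biomedicine (radiology). All other keywords score 0 because the paper does not touch on those techniques.

!!! tip deepseek-chat TL;DR

To address the difficulty existing medical vision-language pretraining models have in capturing radiologists' diagnostic workflow, this paper proposes CoGaze, a pretraining framework guided by clinical context and radiologists' gaze, which outperforms existing methods across multiple chest X-ray tasks.

Abstract (Chinese translation)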

尽管医学视觉-语言预训练领域近期取得了进展，现有模型仍难以捕捉诊断工作流程：胸片通常被当作与上下文无关的图像处理，而放射科医师的注视——作为视觉推理的关键线索——在现有方法中仍未得到充分探索。这些限制阻碍了疾病特异性模式的建模，并削弱了跨模态对齐。为弥补这一差距，我们提出了CoGaze，一种面向胸部X光片的上下文与注视引导的视觉-语言预训练框架。我们首先提出了一种上下文融合视觉编码器，用于建模放射科医师如何整合临床上下文（包括患者病史、症状和诊断意图）以指导诊断推理。随后，我们设计了一种多层次监督范式，该范式（1）通过混合正样本对比学习强化模态内与模态间的语义对齐，（2）通过疾病感知的跨模态表征学习注入诊断先验知识，（3）利用放射科医师的注视作为概率先验，引导注意力聚焦于诊断关键区域。大量实验表明，CoGaze在多项任务中持续优于现有先进方法，在自由文本和结构化报告生成任务上分别提升高达+2.0%的CheXbertF1分数和+1.2%的BLEU2分数，在零样本分类任务上提升+23.2%的AUROC，在图文检索任务上提升+12.2%的Precision@1。代码发布于https://github.com/mk-runner/CoGaze。

Abstract

Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists’ gaze – a crucial cue for visual reasoning – remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context – including patient history, symptoms, and diagnostic intent – to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists’ gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.

Keywords: medical vision-language pretraining, chest X-rays, clinical context, radiologists’ gaze, diagnostic workflow, cross-modal alignment, CoGaze framework, disease-aware representation learning

84. ❌ H-Node Attack and Defense in Large Language Models

Authors: Eric Yocam, Varghese Vaidyan, Yong Wang Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26045v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

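Scoring rationale: The paper's core subject is hallucination in LLMs; it proposes the H-Node ANC framework for attack and defense, so it is highly relevant (10 points) to "Large Language Models", "Hallucination Mitigation", and "Mechanistic Interpretability". Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference acceleration, and AI for Science, are not covered and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes the H-Node ANC framework, which identifies and manipulates hallucination nodes (H-Nodes) in Transformer LLMs to enable white-box attacks on, and adaptive defense of, hallucination representations. It validates the selectivity of the attack and the effectiveness of the defense on four models (125M-8B parameters), with minimal impact on general model capability (MMLU degradation of at most 3%).

Abstract (Chinese translation)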

本文提出H-Node对抗性噪声消除（H-Node ANC）机制框架，该框架能够在基于Transformer的大语言模型（LLMs）中，在单个隐藏状态维度层面识别、利用并防御幻觉表征。通过基于末词元隐藏状态训练的逻辑回归探针，我们将幻觉信号定位至一小部分高方差维度——称为幻觉节点（Hallucination Nodes, H-Nodes）——该探针在四种模型架构上的AUC均达到0.90。一种白盒对抗攻击在推理时通过实时前向钩子放大这些维度，实现了3.02倍的选择性，同时对防御者的可见性低于10%。自适应ANC防御采用置信度加权的消除方法，在推理过程中实时抑制H-Node的过度激活，相比静态消除方法，将基础激活漂移降低了33-42%。一种动态迭代扩展方法在连续推理过程中重新排序消除目标，从单次推理基线8%的鲁棒性中恢复了最高0.69的鲁棒性。所有贡献均在OPT-125M、Phi-3-mini-4k-instruct、LLaMA-3-8B-Instruct和Mistral-7B-Instruct-v0.3（参数规模125M-8B）上得到验证。该方法对困惑度的影响具有精准性（<5%），MMLU性能下降最多为3%，证实了该防御机制不会损害模型的通用推理能力。

Abstract

We present H-Node Adversarial Noise Cancellation (H-Node ANC), a mechanistic framework that identifies, exploits, and defends hallucination representations in transformer-based large language models (LLMs) at the level of individual hidden-state dimensions. A logistic regression probe trained on last-token hidden states localizes hallucination signal to a small set of high-variance dimensions – termed Hallucination Nodes (H-Nodes) – with probe AUC reaching 0.90 across four architectures. A white-box adversarial attack amplifies these dimensions at inference time via a real-time forward hook, achieving a selectivity of 3.02x with less than 10% visibility to the defender. Adaptive ANC defense suppresses H-Node excess in-pass using confidence-weighted cancellation, reducing grounded activation drift by 33-42% over static cancellation. A dynamic iterative extension that re-ranks cancellation targets across successive passes recovers up to 0.69 robustness from a single-pass baseline of 8%. All contributions are validated on OPT-125M, Phi-3-mini-4k-instruct, LLaMA-3-8B-Instruct, and Mistral-7B-Instruct-v0.3 (125M-8B parameters). Perplexity impact is surgical (<5%) and MMLU degradation is at most 3%, confirming that the defense does not impair general reasoning capability.

Keywords: Hallucination, Large Language Models, Mechanistic Interpretability, Adversarial Attack, Defense Mechanism, Transformer, Hidden States, Robustness
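
The abstract describes three steps: localizing the hallucination signal to a few high-variance hidden-state dimensions, amplifying those dimensions as an attack, and cancelling the excess with a confidence weight as a defense. The paper does this with a logistic regression probe and a forward hook on real models; the toy sketch below substitutes plain variance ranking on small Python lists purely to illustrate the data flow. The function names, the `baseline` vector, and the `confidence` scalar are hypothetical and are not the authors' implementation.

```python
from statistics import pvariance
from typing import List, Sequence


def top_variance_dims(hidden_states: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k dimensions with the largest variance across examples."""
    n_dims = len(hidden_states[0])
    variances = [pvariance([h[d] for h in hidden_states]) for d in range(n_dims)]
    return sorted(range(n_dims), key=lambda d: variances[d], reverse=True)[:k]


def amplify(hidden: Sequence[float], dims: Sequence[int], gain: float) -> List[float]:
    """Attack side: scale the selected dimensions by `gain`."""
    chosen = set(dims)
    return [x * gain if i in chosen else x for i, x in enumerate(hidden)]


def cancel(hidden: Sequence[float], dims: Sequence[int],
           baseline: Sequence[float], confidence: float) -> List[float]:
    """Defense side: pull the selected dimensions back toward a baseline,
    weighted by a confidence value in [0, 1]."""
    chosen = set(dims)
    return [x - confidence * (x - baseline[i]) if i in chosen else x
            for i, x in enumerate(hidden)]


# Made-up last-token hidden states for three prompts (3 dimensions each).
states = [[0.1, 2.0, 0.0], [0.1, -2.0, 0.1], [0.2, 1.5, 0.0]]
nodes = top_variance_dims(states, k=1)
attacked = amplify(states[0], nodes, gain=3.0)
defended = cancel(attacked, nodes, baseline=states[0], confidence=1.0)
```

With full confidence the cancellation restores the original activation exactly; a lower confidence would remove only part of the excess, which is the trade-off a confidence-weighted defense has to manage.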

85. ❌ Designing Fatigue-Aware VR Interfaces via Biomechanical Models

Authors: Harshitha Voleti, Charalambos Poullis Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26031v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

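Scoring rationale: The paper studies virtual reality (VR) interface design, using biomechanical models and reinforcement learning to optimize VR interfaces and reduce user fatigue. Its content is entirely about human-computer interaction (HCI), VR, and biomechanical simulation, and involves no large language models (LLMs), deep learning technical principles, or concrete AI for Science applications. All scoring keywords concern large models, deep learning techniques, or scientific AI applications, while the paper's research area (VR interface design) has no direct connection to them, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

This paper proposes a framework based on biomechanical models and hierarchical reinforcement learning that optimizes VR interface layout to reduce user fatigue during mid-air interaction, and validates the framework in simulation and in a follow-up human study.

Abstract (Chinese translation)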

虚拟现实（VR）中的长时间悬空交互会导致手臂疲劳与不适，从而对用户体验产生负面影响。将人机工程学考量融入VR用户界面（UI）设计通常需要大量的人工参与评估。尽管生物力学模型已被用于模拟人机交互任务中的人类行为，但其作为人机工程学VR UI设计替代用户的应用仍未被充分探索。我们提出了一种分层强化学习框架，该框架利用生物力学用户模型来评估和优化面向悬空交互的VR界面。我们训练了一个运动智能体，使其能在序列化条件下于VR中执行按钮按压任务，该智能体采用真实的运动策略，并通过经过验证的三区控制恢复（3CC-r）疲劳模型来估计肌肉层面的负荷。模拟的疲劳输出作为UI智能体的反馈，该智能体通过强化学习（RL）优化UI元素布局以最小化疲劳。我们将RL优化布局与手动设计的居中基线布局及贝叶斯优化基线布局进行了比较。结果表明，生物力学模型得出的疲劳趋势与真实用户数据相符。此外，在一项后续的人体研究中，使用模拟疲劳反馈的RL优化布局产生了显著更低的感知疲劳。我们进一步通过一个非均匀交互频率的较长序列任务的模拟案例研究，展示了该框架的可扩展性。据我们所知，这是首个将模拟的生物力学肌肉疲劳作为VR UI布局设计直接优化信号的研究。我们的发现凸显了生物力学用户模型作为人机工程学VR界面设计有效替代工具的潜力，能够在减少对大量人工参与依赖的同时，实现高效的早期迭代。

Abstract

Prolonged mid-air interaction in virtual reality (VR) causes arm fatigue and discomfort, negatively affecting user experience. Incorporating ergonomic considerations into VR user interface (UI) design typically requires extensive human-in-the-loop evaluation. Although biomechanical models have been used to simulate human behavior in HCI tasks, their application as surrogate users for ergonomic VR UI design remains underexplored. We propose a hierarchical reinforcement learning framework that leverages biomechanical user models to evaluate and optimize VR interfaces for mid-air interaction. A motion agent is trained to perform button-press tasks in VR under sequential conditions, using realistic movement strategies and estimating muscle-level effort via a validated three-compartment control with recovery (3CC-r) fatigue model. The simulated fatigue output serves as feedback for a UI agent that optimizes UI element layout via reinforcement learning (RL) to minimize fatigue. We compare the RL-optimized layout against a manually-designed centered baseline and a Bayesian optimized baseline. Results show that fatigue trends from the biomechanical model align with human user data. Moreover, the RL-optimized layout using simulated fatigue feedback produced significantly lower perceived fatigue in a follow-up human study. We further demonstrate the framework’s extensibility via a simulated case study on longer sequential tasks with non-uniform interaction frequencies. To our knowledge, this is the first work using simulated biomechanical muscle fatigue as a direct optimization signal for VR UI layout design. Our findings highlight the potential of biomechanical user models as effective surrogate tools for ergonomic VR interface design, enabling efficient early-stage iteration with less reliance on extensive human participation.

Keywords: Virtual Reality, Biomechanical Models, Fatigue, Reinforcement Learning, User Interface Design, Mid-air Interaction, Ergonomics, Human-Computer Interaction

86. ❌ Unlabeled Cross-Center Automatic Analysis for TAAD: An Integrated Framework from Segmentation to Clinical Features

Authors: Mengdi Liu, Qiang Li, Weizhi Nie, Shaopeng Zhang, Yuting Su Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26019v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

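Scoring rationale: The paper focuses on medical image analysis (specifically aortic dissection segmentation and clinical feature extraction) and uses unsupervised domain adaptation (UDA) to handle cross-institutional deployment. It is entirely unrelated to most keywords (e.g., LLMs, MoE, reasoning methods, alignment techniques). It is relevant to only two keywords: 1) "Pre-training OR Continual Pre-training OR Domain Adaptation": the paper explicitly uses UDA, which falls under domain adaptation, but not in a large-model setting, so it receives 5 points. 2) "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to biomedicine (cardiovascular disease), which fits "AI for Science", so it receives 8 points. No other keyword is covered.

!!! tip deepseek-chat TL;DR

This study proposes an unsupervised domain adaptation framework for automatic cross-institutional aortic dissection segmentation and clinical feature extraction without target-domain annotations. Experiments show that it significantly improves segmentation performance and provides meaningful assistance for preoperative assessment.

Abstract (Chinese translation)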

A型主动脉夹层（Type A Aortic Dissection, TAAD）是一种危及生命的心血管急症，需要快速而精确的术前评估。虽然关键的解剖学和病理学特征对于手术规划具有决定性作用，但当前研究主要集中于提升分割精度，而对可靠、定量地提取具有临床指导意义的特征则探索不足。此外，构建全面的TAAD数据集需要耗费大量人力进行专家级的像素级标注，这对大多数临床机构而言并不现实。由于显著的域偏移问题，在单一中心数据集上训练的模型在跨机构部署时也会出现严重的性能下降。本研究致力于解决一个临床关键挑战：在完全缺乏目标域标注的情况下，于跨机构部署中准确提取TAAD的关键临床特征。为此，我们提出了一种基于无监督域适应（Unsupervised Domain Adaptation, UDA）的框架，用于自动化提取TAAD临床特征。该框架利用有限的源域标注，同时有效适应来自目标域的未标注数据。针对真实世界急诊工作流程定制，我们的框架旨在实现稳定的跨机构多类别分割、可靠且可量化的临床特征提取，以及独立于高成本标注的实际可部署性。大量实验表明，与现有先进方法相比，我们的方法显著提升了跨域分割性能。更重要的是，一项涉及多位心血管外科医生的读者研究证实，自动提取的临床特征能为术前评估提供有意义的辅助，凸显了所提出的端到端“从分割到特征”流程的实用价值。

Abstract

Type A Aortic Dissection (TAAD) is a life-threatening cardiovascular emergency that demands rapid and precise preoperative evaluation. While key anatomical and pathological features are decisive for surgical planning, current research focuses predominantly on improving segmentation accuracy, leaving the reliable, quantitative extraction of clinically actionable features largely under-explored. Furthermore, constructing comprehensive TAAD datasets requires labor-intensive, expert-level pixel-wise annotations, which is impractical for most clinical institutions. Due to significant domain shift, models trained on a single-center dataset also suffer from severe performance degradation during cross-institutional deployment. This study addresses a clinically critical challenge: the accurate extraction of key TAAD clinical features during cross-institutional deployment in the total absence of target-domain annotations. To this end, we propose an unsupervised domain adaptation (UDA)-driven framework for the automated extraction of TAAD clinical features. The framework leverages limited source-domain labels while effectively adapting to unlabeled data from target domains. Tailored for real-world emergency workflows, our framework aims to achieve stable cross-institutional multi-class segmentation, reliable and quantifiable clinical feature extraction, and practical deployability independent of high-cost annotations. Extensive experiments demonstrate that our method significantly improves cross-domain segmentation performance compared to existing state-of-the-art approaches. More importantly, a reader study involving multiple cardiovascular surgeons confirms that the automatically extracted clinical features provide meaningful assistance for preoperative assessment, highlighting the practical utility of the proposed end-to-end segmentation-to-feature pipeline.

Keywords: Type A Aortic Dissection, unsupervised domain adaptation, cross-institutional deployment, segmentation, clinical feature extraction, medical image analysis, cardiovascular emergency, end-to-end pipeline

87. ❌ VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation

Authors: Rakib Hossain Sajib, Md Kishor Morol, Rajan Das Gupta, Mohammad Sakib Mahmood, Shuvra Smaran Das Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26015v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

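Scoring rationale: The paper studies the use of large vision-language models (LVLMs) for zero-shot facial age estimation, an application of large models to a scientific domain (computer vision and biometrics). It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8/10) because it explicitly evaluates large multimodal models such as GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision. It is relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8/10) because age estimation is valuable in scientific applications such as biometrics and healthcare. The remaining keywords concern the technical underpinnings of large models (e.g., MoE, quantization, inference optimization) or specific training methods (e.g., RLHF, PEFT), none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

This study runs a zero-shot evaluation of large vision-language models (such as GPT-4o and Claude 3.5 Sonnet) on facial age estimation. It finds that these general-purpose models, without fine-tuning, achieve results competitive with traditional supervised methods, and it reveals performance disparities tied to image quality and demographic subgroups.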

Abstract

Human age estimation from facial images represents a challenging computer vision task with significant applications in biometrics, healthcare, and human-computer interaction. While traditional deep learning approaches require extensive labeled datasets and domain-specific training, recent advances in large vision-language models (LVLMs) offer the potential for zero-shot age estimation. This study presents a comprehensive zero-shot evaluation of state-of-the-art Large Vision-Language Models (LVLMs) for facial age estimation, a task traditionally dominated by domain-specific convolutional networks and supervised learning. We assess the performance of GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on two benchmark datasets, UTKFace and FG-NET, without any fine-tuning or task-specific adaptation. Using eight evaluation metrics, including MAE, MSE, RMSE, MAPE, MBE, $R^2$, CCC, and $\pm$5-year accuracy, we demonstrate that general-purpose LVLMs can deliver competitive performance in zero-shot settings. Our findings highlight the emergent capabilities of LVLMs for accurate biometric age estimation and position these models as promising tools for real-world applications. Additionally, we highlight performance disparities linked to image quality and demographic subgroups, underscoring the need for fairness-aware multimodal inference. This work introduces a reproducible benchmark and positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction. The benchmark focuses on strict zero-shot inference without fine-tuning and highlights remaining challenges related to prompt sensitivity, interpretability, computational cost, and demographic fairness.

Keywords: Large Vision-Language Models, zero-shot age estimation, facial images, benchmark evaluation, GPT-4o, Claude 3.5 Sonnet, demographic fairness, biometric applications
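The abstract names eight evaluation metrics without defining them. The sketch below shows one common way to compute them for predicted versus true ages. The function name `age_metrics`, the sign convention for MBE (predicted minus true), and the use of Lin's concordance correlation coefficient for CCC are assumptions, not details confirmed by the paper.

```python
import numpy as np

def age_metrics(y_true, y_pred):
    """Compute the eight regression metrics listed in the abstract (assumed definitions)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true                      # signed error, predicted minus true

    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)

    # Lin's concordance correlation coefficient with population (ddof=0) moments.
    cov = np.mean((y_true - y_true.mean()) * (y_pred - y_pred.mean()))
    ccc = 2 * cov / (y_true.var() + y_pred.var()
                     + (y_true.mean() - y_pred.mean()) ** 2)

    return {
        "MAE": float(np.mean(np.abs(err))),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
        # MAPE is undefined when a true age is 0; callers must filter such samples.
        "MAPE": float(100.0 * np.mean(np.abs(err) / y_true)),
        "MBE": float(np.mean(err)),
        "R2": float(1.0 - ss_res / ss_tot),
        "CCC": float(ccc),
        "Acc@5": float(np.mean(np.abs(err) <= 5.0)),  # share of predictions within 5 years
    }

metrics = age_metrics([20, 30, 40, 50], [22, 28, 45, 50])
```

MBE keeps the sign of the error, so it separates systematic over-estimation from under-estimation, which MAE alone cannot do.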

88. ❌ FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants

Authors: Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26008v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

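Scoring rationale: FairLLaVA focuses on fairness-aware fine-tuning of multimodal large language models (MLLMs). Its core contribution is a parameter-efficient fine-tuning method that reduces group disparities in visual instruction tuning. It is therefore highly relevant (10/10) to: 1) 'Large Language Models' (the paper studies MLLMs); 2) 'Post-training/SFT' (it involves supervised fine-tuning); 3) 'Instruction Tuning' (visual instruction tuning is central); 4) 'PEFT' (the method builds on parameter-efficient fine-tuning with low-rank adapters); 5) 'AI for Science' (it is applied to medical imaging). Other keywords such as MoE, SLMs, RLHF, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address the fairness risks of multimodal large language models in medical imaging, the paper proposes FairLLaVA, a parameter-efficient fine-tuning method that regularizes model representations by minimizing the mutual information between target attributes. It substantially reduces inter-group disparities while preserving overall performance, and is validated on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks.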

Abstract

While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model’s representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.

Keywords: multimodal large language models, fairness, parameter-efficient fine-tuning, visual instruction tuning, medical imaging, demographic-invariant representations, chest radiology report generation, dermoscopy visual question answering
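The abstract says FairLLaVA stays efficient through low-rank adapter fine-tuning. The sketch below shows a generic low-rank adapter on a single linear layer to illustrate why the trainable parameter count stays small. It is not FairLLaVA's implementation: the layer sizes, rank, scaling factor `alpha`, and the function name `lora_forward` are illustrative assumptions, and the paper's mutual-information regularizer is not modeled.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Frozen weight W plus a scaled low-rank update B @ A; only A and B would be trained."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_in, d_out, r = 1024, 1024, 8                 # hypothetical layer size and adapter rank
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01          # trainable down-projection
B = np.zeros((d_out, r))                       # zero init, so the adapter starts as a no-op
x = rng.normal(size=(2, d_in))

y = lora_forward(x, W, A, B, alpha=16.0)
full_params = W.size                           # 1,048,576 weights in the frozen layer
lora_params = A.size + B.size                  # 16,384 trainable adapter weights
```

With these sizes the adapter trains 1/64 of the parameters of the full layer, which is the sense in which such a method can be attached as a lightweight plug-in.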

89. ❌ Longitudinal Boundary Sharpness Coefficient Slopes Predict Time to Alzheimer’s Disease Conversion in Mild Cognitive Impairment: A Survival Analysis Using the ADNI Cohort

Author: Ishaan Cherukuri Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26007v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

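Scoring rationale: The paper uses MRI scans and survival analysis to predict conversion to Alzheimer's disease, which places it in medical image analysis and bioinformatics. It involves no large models and no novel deep learning methods; CNNs and RNNs are mentioned only as prior work and are not used in this study. Apart from 'AI for Science OR Bioinformatics OR Cheminformatics' (5/10, as a bioinformatics application), all other keywords score 0 because the content is unrelated to large-model techniques.

!!! tip deepseek-chat TL;DR

By analyzing temporal slopes of the gray-white matter Boundary Sharpness Coefficient in MRI scans, the study uses a Random Survival Forest to predict time to conversion from mild cognitive impairment to Alzheimer's disease, raising the test C-index from 0.24 for the baseline models to 0.63.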

Abstract

Predicting whether someone with mild cognitive impairment (MCI) will progress to Alzheimer’s disease (AD) is crucial in the early stages of neurodegeneration. This uncertainty limits enrollment in clinical trials and delays urgent treatment. The Boundary Sharpness Coefficient (BSC) measures how well-defined the gray-white matter boundary looks on structural MRI. This study measures how BSC changes over time, namely, how fast the boundary degrades each year works much better than looking at a single baseline scan for predicting MCI-to-AD conversion. This study analyzed 1,824 T1-weighted MRI scans from 450 ADNI subjects (95 converters, 355 stable; mean follow-up: 4.84 years). BSC voxel-wise maps were computed using tissue segmentation at the gray-white matter cortical ribbon. Previous studies have used CNN and RNN models that reached 96.0% accuracy for AD classification and 84.2% for MCI conversion, but those approaches disregard specific regions within the brain. This study focused specifically on the gray-white matter interface. The approach uses temporal slope features capturing boundary degradation rates, feeding them into Random Survival Forest, a non-parametric ensemble method for right-censored survival data. The Random Survival Forest trained on BSC slopes achieved a test C-index of 0.63, a 163% improvement over baseline parametric models (test C-index: 0.24). Structural MRI costs a fraction of PET imaging ($800–$1,500 vs. $5,000–$7,000) and does not require CSF collection. These temporal biomarkers could help with patient-centered safety screening as well as risk assessment.

Keywords: Alzheimer’s disease, mild cognitive impairment, Boundary Sharpness Coefficient, MRI, survival analysis, Random Survival Forest, biomarker, ADNI
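Two quantities in the abstract can be made concrete: the per-subject temporal slope of BSC and the C-index used to score the survival model. The sketch below assumes a least-squares slope over scan times and Harrell's pairwise definition of the C-index. The paper may compute either differently, and the function names are illustrative.

```python
import numpy as np

def bsc_slope(years, bsc):
    """Least-squares slope of BSC against time, in BSC units per year."""
    return float(np.polyfit(years, bsc, deg=1)[0])

def concordance_index(time, event, risk):
    """Harrell's C: share of comparable pairs whose risk ordering matches the outcome."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i converted strictly before time[j].
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0      # earlier converter was ranked riskier
                elif risk[i] == risk[j]:
                    concordant += 0.5      # ties count as half
    return concordant / comparable

# Relative gain implied by the two C-indices reported in the abstract.
relative_gain = (0.63 - 0.24) / 0.24       # about 1.625
```

The rounded C-indices give a relative gain of about 162.5%, which is consistent with the 163% quoted in the abstract if the unrounded values differ slightly. A C-index of 0.5 corresponds to random ordering, so the reported baseline of 0.24 ranks subjects worse than chance.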

90. ❌ AutoB2G: A Large Language Model-Driven Agentic Framework For Automated Building-Grid Co-Simulation

Authors: Borui Zhang, Nariman Mahdavi, Subbu Sethuvenkatraman, Shuang Ao, Flora Salim Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26005v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

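Scoring rationale: The paper proposes AutoB2G, whose core is the LLM-based agent framework SOCIA, used to automate building-grid co-simulation. It is therefore highly relevant to 'Large Language Models' and 'LLM Agents' (10/10). The framework guides the LLM toward an executable path by retrieving from a codebase organized as a DAG, which relates it to 'Retrieval-Augmented Generation' (5/10). The LLM is used to invoke simulation functions, which relates it to 'Tool Use' (5/10). The work is a scientific AI application in building energy and is relevant to 'AI for Science' (8/10). Other keywords such as MoE, SFT, and RLHF are not covered and score 0.

!!! tip deepseek-chat TL;DR

Building-grid co-simulation currently depends on manual configuration and programming expertise. The paper proposes AutoB2G, built on an LLM agent framework, which automatically generates, executes, and iteratively refines simulators from natural-language task descriptions and coordinates building-grid interactions to improve grid-side performance metrics.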

Abstract

The growing availability of building operational data motivates the use of reinforcement learning (RL), which can learn control policies directly from data and cope with the complexity and uncertainty of large-scale building clusters. However, most existing simulation environments prioritize building-side performance metrics and lack systematic evaluation of grid-level impacts, while their experimental workflows still rely heavily on manual configuration and substantial programming expertise. Therefore, this paper proposes AutoB2G, an automated building-grid co-simulation framework that completes the entire simulation workflow solely based on natural-language task descriptions. The framework extends CityLearn V2 to support Building-to-Grid (B2G) interaction and adopts the large language model (LLM)-based SOCIA (Simulation Orchestration for Computational Intelligence with Agents) framework to automatically generate, execute, and iteratively refine the simulator. As LLMs lack prior knowledge of the implementation context of simulation functions, a codebase covering simulation configurations and functional modules is constructed and organized as a directed acyclic graph (DAG) to explicitly represent module dependencies and execution order, guiding the LLM to retrieve a complete executable path. Experimental results demonstrate that AutoB2G can effectively enable automated simulator implementations, coordinating B2G interactions to improve grid-side performance metrics.

Keywords: Large Language Model, LLM-driven Agentic Framework, Automated Building-Grid Co-Simulation, Reinforcement Learning, Simulation Orchestration, Natural-language Task Descriptions, Directed Acyclic Graph, Grid-side Performance
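The abstract describes organizing the codebase as a DAG so that module dependencies and execution order are explicit and a complete executable path can be retrieved. The sketch below shows that idea with Python's standard `graphlib`. The module names and the helper `executable_path` are hypothetical, not taken from AutoB2G.

```python
from graphlib import TopologicalSorter

# Hypothetical modules; each key maps to the set of modules it depends on.
deps = {
    "load_building_data": set(),
    "configure_citylearn_env": {"load_building_data"},
    "train_rl_controller": {"configure_citylearn_env"},
    "run_grid_power_flow": {"train_rl_controller"},
    "report_grid_metrics": {"run_grid_power_flow"},
}

def executable_path(graph, target):
    """Return the target and its transitive dependencies in a valid execution order."""
    needed, stack = set(), [target]
    while stack:                               # walk dependencies back from the target
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(graph[node])
    subgraph = {n: graph[n] & needed for n in needed}
    # static_order() yields every node after all of its predecessors.
    return list(TopologicalSorter(subgraph).static_order())

path = executable_path(deps, "run_grid_power_flow")
```

Restricting the sort to the target's ancestors keeps unrelated modules out of the path, so a generated script only calls what the requested task needs.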

91. ❌ Weight Tying Biases Token Embeddings Towards the Output Space

Authors: Antonio Lopardo, Avyukth Harish, Catherine Arnett, Akshat Gupta Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26663v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

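Scoring rationale: The paper studies how weight tying affects the embedding space of language models, which is research on the technical principles of large models. It is highly relevant to 'Large Language Models' (8/10) because it targets language model design; somewhat relevant to 'Small Language Models' (5/10) because it notes implications for training smaller LLMs; highly relevant to 'Pre-training' (8/10) because it examines gradient dynamics during training; and highly relevant to 'Mechanistic Interpretability' (10/10) because it uses tuned lens analysis to provide mechanistic evidence, a core concern of explainable AI. Other keywords such as MoE, SFT, and RAG are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper finds that weight tying biases token embeddings toward the output space because output gradients dominate early in training. This mechanistic evidence helps explain why weight tying can harm performance at scale and has implications for training smaller LLMs.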

Abstract

Weight tying, i.e. sharing parameters between input and output embedding matrices, is common practice in language model design, yet its impact on the learned embedding space remains poorly understood. In this paper, we show that tied embedding matrices align more closely with output (unembedding) matrices than with input embeddings of comparable untied models, indicating that the shared matrix is shaped primarily for output prediction rather than input representation. This unembedding bias arises because output gradients dominate early in training. Using tuned lens analysis, we show this negatively affects early-layer computations, which contribute less effectively to the residual stream. Scaling input gradients during training reduces this bias, providing causal evidence for the role of gradient imbalance. This is mechanistic evidence that weight tying optimizes the embedding matrix for output prediction, compromising its role in input representation. These results help explain why weight tying can harm performance at scale and have implications for training smaller LLMs, where the embedding matrix contributes substantially to total parameter count.

Keywords: weight tying, token embeddings, output space, gradient imbalance, language model design, mechanistic evidence, embedding matrix, parameter sharing
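With tied weights, one matrix receives gradient from two roles: the input lookup and the output projection. The toy sketch below separates those two contributions for a model with no hidden layers, where the logits are just the shared matrix applied to the looked-up embedding. It does not reproduce the paper's experiments; the model, sizes, and the `input_scale` knob are illustrative assumptions.

```python
import numpy as np

def loss(E, x, y):
    """Cross-entropy of predicting token y from token x with a tied matrix E."""
    logits = E @ E[x]
    logits = logits - logits.max()             # stabilise the softmax
    return -(logits[y] - np.log(np.exp(logits).sum()))

def tied_gradients(E, x, y):
    """Split dL/dE into its input-role and output-role parts."""
    e = E[x]                                   # input role: embedding lookup
    logits = E @ e                             # output role: same matrix as unembedding
    p = np.exp(logits - logits.max())
    p /= p.sum()
    dlogits = p.copy()
    dlogits[y] -= 1.0                          # dL/dlogits for cross-entropy
    grad_out = np.outer(dlogits, e)            # touches every row of E
    grad_in = np.zeros_like(E)
    grad_in[x] = E.T @ dlogits                 # touches only the row that was looked up
    return grad_in, grad_out

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(50, 8))        # 50 tokens, 8 dimensions
g_in, g_out = tied_gradients(E, x=3, y=7)

input_scale = 2.0                              # hypothetical reweighting of the input path
g_total = input_scale * g_in + g_out
```

In this toy, the output-role gradient is dense across the vocabulary while the input-role gradient lands on a single row. Rescaling `g_in` before summing is one simple way to picture the input-gradient scaling intervention the abstract mentions.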

92. ❌ MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

Authors: Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26557v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

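Scoring rationale: The paper's core is MemBoost, a framework for reducing LLM inference cost. It is highly relevant to 'Large Language Models' (10/10) because it directly targets LLM serving; highly relevant to 'Retrieval-Augmented Generation' (10/10) because the framework builds on retrieval-augmented generation and extends where it applies; strongly relevant to 'Speculative Decoding OR Inference Acceleration' (8/10) because answer reuse and routing speed up inference and lower cost; and somewhat relevant to 'Small Language Models' (5/10) because a lightweight model handles some queries. Other keywords such as MoE, training methods, alignment, and agents are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address the high inference cost of LLMs under repeated or similar queries, the paper proposes MemBoost, which combines answer reuse, retrieval augmentation, and cost-aware routing to substantially reduce inference cost while maintaining high answer quality.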

Abstract

Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.

Keywords: Large Language Models, LLM inference, cost-aware, retrieval-augmented generation, memory-boosted, answer reuse, query routing, inference acceleration
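The abstract combines three mechanisms: answer reuse, continual memory growth, and cost-aware escalation to a stronger model. The sketch below shows one way such a router could be wired together. It is not MemBoost's implementation: the class `MemoryRouter`, both thresholds, the string-similarity lookup via `difflib`, and the confidence signal returned by the small model are all assumptions, and retrieval of supporting information is left out.

```python
from difflib import SequenceMatcher

class MemoryRouter:
    """Reuse cached answers, try a small model, and escalate when it is unsure."""

    def __init__(self, small_model, large_model,
                 reuse_threshold=0.9, confidence_threshold=0.7):
        self.small_model = small_model         # callable: query -> (answer, confidence)
        self.large_model = large_model         # callable: query -> answer
        self.reuse_threshold = reuse_threshold
        self.confidence_threshold = confidence_threshold
        self.memory = []                       # (query, answer) pairs, grows over time
        self.large_calls = 0                   # count of expensive invocations

    def _best_match(self, query):
        best, best_score = None, 0.0
        for past_query, answer in self.memory:
            score = SequenceMatcher(None, query.lower(), past_query.lower()).ratio()
            if score > best_score:
                best, best_score = answer, score
        return best, best_score

    def answer(self, query):
        cached, score = self._best_match(query)
        if cached is not None and score >= self.reuse_threshold:
            return cached                      # answer reuse: no model call at all
        answer, confidence = self.small_model(query)
        if confidence < self.confidence_threshold:
            answer = self.large_model(query)   # escalate an uncertain query
            self.large_calls += 1
        self.memory.append((query, answer))    # remember the result for later queries
        return answer
```

A real system would likely replace the string similarity with embedding search; the point here is only the order of the decisions, which puts the cheapest option first.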

93. ❌ Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model

Authors: Maria Kefala, Jeffery L. Painter, Syed Tauhid Bukhari, Maurizio Sessa Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26544v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

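Scoring rationale: The paper mainly uses the DeepSeek V3 large language model to process pharmacovigilance data, an application of large models to bioinformatics and science. It is therefore highly relevant to 'Large Language Models' (8/10) and to 'AI for Science OR Bioinformatics OR Cheminformatics' (10/10). The remaining keywords concern large-model internals, training methods, inference optimization, and agent systems, none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

The study develops a time-indexed European Union reference dataset for pharmacovigilance, using the DeepSeek V3 large language model to process Summaries of Product Characteristics and identify adverse events. It addresses the lack of temporal information in existing datasets and supports more accurate evaluation of signal detection performance.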

Abstract

Background: The identification of optimal signal detection methods is hindered by the lack of reliable reference datasets. Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance. This study addresses this gap by developing a time-indexed reference dataset for the European Union (EU), incorporating the timing of AE inclusion in product labels along with regulatory metadata. Methods: Current and historical Summaries of Product Characteristics (SmPCs) for all centrally authorized products (n=1,513) were retrieved from the EU Union Register of Medicinal Products (data lock: 15 December 2025). Section 4.8 was extracted and processed using DeepSeek V3 to identify AEs. Regulatory metadata, including labelling changes, were programmatically extracted. Time indexing was based on the date of AE inclusion in the SmPC. Results: The database includes 17,763 SmPC versions spanning 1995-2025, comprising 125,026 drug-AE associations. The time-indexed reference dataset, restricted to active products, included 1,479 medicinal products and 110,823 drug-AE associations. Most AEs were identified pre-marketing (74.5%) versus post-marketing (25.5%). Safety updates peaked around 2012. Gastrointestinal, skin, and nervous system disorders were the most represented System Organ Classes. Drugs had a median of 48 AEs across 14 SOCs. Conclusions: The proposed dataset addresses a critical gap in pharmacovigilance by incorporating temporal information on AE recognition for the EU, supporting more accurate assessment of signal detection performance and facilitating methodological comparisons across analytical approaches.

Keywords: pharmacovigilance, signal detection, reference dataset, large language model, DeepSeek V3, adverse events, European Union, time-indexed
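The purpose of time indexing is to restrict an evaluation to the period before an adverse event was officially recognized. The sketch below shows that filter on made-up data. The drug names, dates, record layout, and the function `pre_confirmation_reports` are hypothetical, and dropping pairs that have no label date is an assumption rather than the paper's stated rule.

```python
from datetime import date

# Hypothetical index: (drug, adverse event) -> date the AE first appeared in the SmPC.
label_dates = {
    ("drug_a", "nausea"): date(2012, 6, 1),
    ("drug_a", "rash"): date(2018, 3, 15),
}

def pre_confirmation_reports(reports, label_dates):
    """Keep reports received strictly before the AE was added to the product label."""
    kept = []
    for drug, event, received in reports:
        labelled_on = label_dates.get((drug, event))
        # Pairs without a label date are dropped here (an assumption of this sketch).
        if labelled_on is not None and received < labelled_on:
            kept.append((drug, event, received))
    return kept

reports = [
    ("drug_a", "nausea", date(2011, 1, 1)),    # before labelling: kept
    ("drug_a", "nausea", date(2013, 1, 1)),    # after labelling: dropped
    ("drug_a", "rash", date(2017, 12, 31)),    # before labelling: kept
    ("drug_b", "rash", date(2010, 1, 1)),      # no label date: dropped
]
early = pre_confirmation_reports(reports, label_dates)
```

A signal detection method scored only on `early` is credited for associations it flagged before regulators confirmed them, which is the early-detection setting the abstract says existing datasets cannot support.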

94. ❌ Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs

Authors: Vinicius Anjos de Almeida, Sandro Saorin da Silva, Josimar Chire, Leonardo Vicenzi, Nícolas Henrique Borges, Helena Kociolek, Sarah Miriã de Castro Rocha, Frederico Nassif Gomes, Júlia Cristina Ferreira, Oge Marques, Lucas Emanuel Silva e Oliveira Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26510v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

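Scoring rationale: The paper evaluates BERT models and LLMs (such as GPT-5 and Gemini-2.5) on clinical named entity recognition (NER) in Portuguese, an application of large models in the biomedical domain. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) and to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), since NER is a key task in bioinformatics. The other keywords (such as MoE, SFT, and RAG) are not mentioned in the abstract or are unrelated to the paper, so each scores 0.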

!!! tip deepseek-chat TL;DR

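This study evaluates BERT models and large language models (LLMs) on clinical named entity recognition in Portuguese. The mmBERT-base model performs best (micro F1 = 0.76), and an iterative stratification strategy mitigates the class imbalance problem.

Abstract translation (Chinese)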

临床记录包含宝贵的非结构化信息。命名实体识别（NER）能够自动提取医学概念，然而葡萄牙语的基准数据集仍然稀缺。本研究旨在评估基于BERT的模型和大型语言模型（LLMs）在葡萄牙语临床NER任务中的表现，并测试处理多标签不平衡的策略。我们使用公开的SemClinBr语料库和私有的乳腺癌数据集，比较了BioBERTpt、BERTimbau、ModernBERT和mmBERT等模型与GPT-5和Gemini-2.5等LLMs。所有模型在相同条件下训练，并通过精确率、召回率和F1分数进行评估。为缓解类别不平衡问题，我们探索了迭代分层、加权损失和过采样等方法。mmBERT-base模型取得了最佳性能（微观F1 = 0.76），优于所有其他模型。迭代分层策略改善了类别平衡并提升了整体性能。多语言BERT模型（特别是mmBERT）在葡萄牙语临床NER任务中表现强劲，且可在有限计算资源下本地运行。平衡的数据划分策略能进一步提升模型表现。

Abstract

Clinical notes contain valuable unstructured information. Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce. In this study, we aimed to evaluate BERT-based models and large language models (LLMs) for clinical NER in Portuguese and to test strategies for addressing multilabel imbalance. We compared BioBERTpt, BERTimbau, ModernBERT, and mmBERT with LLMs such as GPT-5 and Gemini-2.5, using the public SemClinBr corpus and a private breast cancer dataset. Models were trained under identical conditions and evaluated using precision, recall, and F1-score. Iterative stratification, weighted loss, and oversampling were explored to mitigate class imbalance. The mmBERT-base model achieved the best performance (micro F1 = 0.76), outperforming all other models. Iterative stratification improved class balance and overall performance. Multilingual BERT models, particularly mmBERT, perform strongly for Portuguese clinical NER and can run locally with limited computational resources. Balanced data-splitting strategies further enhance performance.

Keywords: Clinical named entity recognition, Portuguese language, BERT models, Large language models, Clinical notes, Multilabel imbalance, mmBERT, Bioinformatics
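
The abstract above reports a micro-averaged F1 score. As a minimal sketch of how that metric is usually computed, the snippet below pools true positives, false positives, and false negatives across entity types before computing precision, recall, and F1. The entity type names and counts are hypothetical and are not taken from the paper; the function name `micro_prf` is also an illustrative choice.

```python
from typing import Dict, Tuple


def micro_prf(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1.

    counts maps an entity type to (true positives, false positives, false negatives).
    Counts are pooled over all types before the ratios are taken, so frequent
    types dominate the result.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only (not results from the paper).
example_counts = {
    "Disorder": (80, 20, 10),
    "Procedure": (15, 5, 15),
    "Drug": (5, 0, 5),
}

precision, recall, f1 = micro_prf(example_counts)
print(f"micro P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because counts are pooled, a rare entity type with poor recall barely moves the micro score, which is one reason class imbalance strategies are evaluated separately.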

95. ❌ Entanglement as Memory: Mechanistic Interpretability of Quantum Language Models

Authors: Nathan Roll Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26494v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

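Scoring rationale: The paper studies the mechanistic interpretability of quantum language models, so it is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10 points); this is the paper's core methodology and contribution. As an AI-for-science application it has some connection to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), although it does not fall specifically within bioinformatics or cheminformatics. The remaining keywords concern conventional LLM techniques, training methods, inference optimization, agent systems, and so on, whereas the paper focuses on language models in a quantum computing framework and uses concepts such as quantum circuits and entanglement, with no direct link to conventional deep learning or large-model techniques; they therefore score 0.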

!!! tip deepseek-chat TL;DR

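The paper presents the first mechanistic interpretability study of quantum language models. It finds that single-qubit models are classically simulable and use a geometric strategy, while two-qubit models use entanglement to encode context; the latter degrades under noise on real quantum hardware, revealing a tradeoff between noise and expressivity.

Abstract translation (Chinese)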

量子语言模型在序列任务中展现出竞争力，但其训练出的量子电路究竟利用了真正的量子资源，还是仅仅将经典计算嵌入量子硬件，目前仍不明确。先前的研究仅通过终端指标评估这些模型，并未考察它们实际在内部习得的记忆策略。我们首次对量子语言模型开展了机制可解释性研究，结合因果门消融、纠缠追踪和密度矩阵置换干预等方法，在受控的长程依赖任务上进行分析。研究发现：单量子比特模型完全可通过经典方式模拟，并收敛至与匹配的经典基线相同的几何策略；而具有纠缠门的双量子比特模型则习得了一种表征上截然不同的策略，该策略通过量子比特间的纠缠对上下文进行编码——这一结论得到三项独立因果检验的证实（p < 0.0001, d = 0.89）。在真实量子硬件上，仅经典几何策略能在设备噪声下保持有效；纠缠策略则退化至随机水平。这些发现将机制可解释性确立为量子语言模型科学研究的一种工具，并揭示了一种噪声-表达能力权衡关系，该关系决定了哪些习得策略能在实际部署中存续。

Abstract

Quantum language models have shown competitive performance on sequential tasks, yet whether trained quantum circuits exploit genuinely quantum resources – or merely embed classical computation in quantum hardware – remains unknown. Prior work has evaluated these models through endpoint metrics alone, without examining the memory strategies they actually learn internally. We introduce the first mechanistic interpretability study of quantum language models, combining causal gate ablation, entanglement tracking, and density-matrix interchange interventions on a controlled long-range dependency task. We find that single-qubit models are exactly classically simulable and converge to the same geometric strategy as matched classical baselines, while two-qubit models with entangling gates learn a representationally distinct strategy that encodes context in inter-qubit entanglement – confirmed by three independent causal tests (p < 0.0001, d = 0.89). On real quantum hardware, only the classical geometric strategy survives device noise; the entanglement strategy degrades to chance. These findings open mechanistic interpretability as a tool for the science of quantum language models and reveal a noise-expressivity tradeoff governing which learned strategies survive deployment.

Keywords: quantum language models, mechanistic interpretability, entanglement, quantum circuits, causal gate ablation, density-matrix interchange, noise-expressivity tradeoff, long-range dependency

96. ❌ Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

Authors: Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26434v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

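Scoring rationale: The paper uses open-source large language models (LLMs) to automate clinical information retrieval from Finnish electronic health records (EHRs), an application of AI in the biomedical (Bioinformatics) domain, so it is highly relevant to 'Large Language Models' and 'AI for Science' (10 points each). The study tests models of different sizes (4B to 70B), including smaller ones, which gives some relevance to 'Small Language Models' (5 points). The method retrieves information from EHRs to answer questions, which partly overlaps with the concept of 'Retrieval-Augmented Generation' (5 points). The experiments assess the effect of low-precision quantization (4-bit and 8-bit) on performance, which is highly relevant to 'Quantization' (10 points). The study also discusses clinically significant errors and factual inconsistencies in model outputs, which gives some relevance to 'Hallucination Mitigation' (5 points). The other keywords, such as MoE, Scaling Laws, the various training methods (pre-training, fine-tuning, alignment, and so on), inference optimization, agents, and model merging, are not covered in the paper and score 0.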

!!! tip deepseek-chat TL;DR

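The study develops a locally deployable clinical contextual question answering framework that uses open-source LLMs to automatically retrieve patient information from Finnish electronic health records. On 1,664 expert-annotated question-answer pairs, Llama-3.1-70B reaches 95.3% accuracy, and low-precision quantization preserves performance while reducing deployment resource requirements. Clinical evaluation nevertheless finds clinically significant errors in a small share of outputs, underscoring the need for validation and human oversight in clinical deployment.

Abstract translation (Chinese)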

临床医生常需从电子健康记录（EHR）中检索患者特定信息，这一任务耗时且易出错。本文提出一种可本地部署的临床上下文问答（CCQA）框架，能够在不依赖外部数据传输的情况下直接从EHR中回答临床问题。研究基于183名患者的医疗记录构建了1,664组专家标注的问答对（数据主体为芬兰语临床文本），在完全离线环境下对参数量从4B到70B的开源大语言模型（LLMs）进行了性能评估。在自由文本生成任务中，Llama-3.1-70B模型实现了95.3%的准确率，对语义等效问题变体的回答一致性达97.3%；而参数量较小的Qwen3-30B-A3B-2507模型也表现出可比性能。在多项选择设定中，各模型准确率相近但校准程度存在差异。低精度量化（4位与8位）在降低GPU内存需求、提升部署可行性的同时，保持了模型的预测性能。临床评估发现2.9%的输出存在临床显著性错误，且语义等效问题偶尔会产生不一致回答，其中部分案例出现一种表述正确而另一种包含临床显著性错误的情况（占总数0.96%）。这些结果表明，本地部署的开源LLMs能够通过自然语言查询从EHR中准确检索患者特定信息，同时凸显了临床部署中验证机制与人工监督的必要性。

Abstract

Clinicians often need to retrieve patient-specific information from electronic health records (EHRs), a task that is time-consuming and error-prone. We present a locally deployable Clinical Contextual Question Answering (CCQA) framework that answers clinical questions directly from EHRs without external data transfer. Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients. The dataset consisted predominantly of Finnish clinical text. In free-text generation, Llama-3.1-70B achieved 95.3% accuracy and 97.3% consistency across semantically equivalent question variants, while the smaller Qwen3-30B-A3B-2507 model achieved comparable performance. In a multiple-choice setting, models showed similar accuracy but variable calibration. Low-precision quantization (4-bit and 8-bit) preserved predictive performance while reducing GPU memory requirements and improving deployment feasibility. Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained a clinically significant error (0.96% of cases). These findings demonstrate that locally hosted open-source LLMs can accurately retrieve patient-specific information from EHRs using natural-language queries, while highlighting the need for validation and human oversight in clinical deployment.

Keywords: Large Language Models, Electronic Health Records, Clinical Question Answering, Quantization, Local Deployment, Finnish Clinical Text, Open-source LLMs, Clinical Information Retrieval
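
The abstract above notes that 4-bit and 8-bit quantization reduces GPU memory requirements. A rough back-of-the-envelope sketch of why is given below: weight storage scales linearly with bits per weight. The 16-bit baseline is an assumption (the abstract does not state the unquantized precision), and the estimate covers weights only; it ignores activations, the KV cache, and the per-group scale and zero-point overhead that real quantization formats add.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# 70e9 parameters corresponds to the largest model size mentioned in the abstract.
# The 16-bit row is an assumed baseline for comparison, not a figure from the paper.
for bits in (16, 8, 4):
    print(f"70B parameters at {bits}-bit: ~{weight_memory_gb(70e9, bits):.0f} GB")
```

Under these assumptions the weights alone need roughly 140 GB at 16-bit, 70 GB at 8-bit, and 35 GB at 4-bit, which is the difference between a multi-GPU server and a single large accelerator.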

97. ❌ Analysing Calls to Order in German Parliamentary Debates

Authors: Nina Smirnova, Daniel Dan, Philipp Mayr Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26430v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

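Scoring rationale: The paper "Analysing Calls to Order in German Parliamentary Debates" studies calls to order (CtO) in German parliamentary debates and belongs to political science, parliamentary research, and computational social science. It uses a rule-based method to detect and annotate CtOs, builds a dataset covering 72 years of German parliamentary debates, and analyzes CtO triggers and the factors associated with them (such as gender, party, and topic). All scoring keywords concern large models, deep learning principles, AI applications (for example in science), or related techniques (training, inference, alignment, compression, and so on). The paper uses no deep learning, machine learning, or AI methods and does not involve large-model technology or its scientific applications. Its core is political discourse analysis and dataset construction, which is unrelated to every technical keyword.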

!!! tip deepseek-chat TL;DR

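The study systematically analyzes calls to order (CtO) across 72 years of parliamentary debates in the German Bundestag. It finds that issuing CtOs is partly subjective and influenced by session presidents and parliamentary dynamics, that male members and opposition members receive more CtOs, and that the most frequent trigger is an insult directed at an individual.

Abstract translation (Chinese)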

议会辩论是政治权力的核心场域，它塑造着立法成果与公共话语。辩论中的失范行为标志着政治极化与制度冲突。本研究以议事规则警告（CtO；复数形式：CtOs）作为违反规范的正式指标，对德国联邦议院中的失范现象进行了系统性考察。尽管议事规则警告具有重要意义，但议会研究领域对其鲜有系统性关注。我们提出了一种基于规则的方法，用于检测和标注议会演讲中的议事规则警告，并构建了一个新颖的、跨越72年的德国议会辩论数据集，其中包含了已标注的议事规则警告实例。此外，我们首次开发了议事规则警告触发因素的分类体系，并分析了与其发生相关的各类因素。研究结果表明，尽管存在正式规章，议事规则警告的发出仍具有一定主观性，并受到会议主席及议会动态的影响，某些议员受到的影响尤为显著。对个人的侮辱是触发议事规则警告的最常见原因。总体而言，男性议员及反对党议员比女性议员和执政联盟议员收到更多的议事规则警告。大多数议事规则警告触发因素出现在涉及政府事务及总统行为的演讲中。议事规则警告触发因素数据集可通过以下链接获取：https://github.com/kalawinka/cto_analysis。

Abstract

Parliamentary debate constitutes a central arena of political power, shaping legislative outcomes and public discourse. Incivility within this arena signals political polarization and institutional conflict. This study presents a systematic investigation of incivility in the German Bundestag by examining calls to order (CtO; plural: CtOs) as formal indicators of norm violations. Despite their relevance, CtOs have received little systematic attention in parliamentary research. We introduce a rule-based method for detecting and annotating CtOs in parliamentary speeches and present a novel dataset of German parliamentary debates spanning 72 years that includes annotated CtO instances. Additionally, we develop the first classification system for CtO triggers and analyze the factors associated with their occurrence. Our findings show that, despite formal regulations, the issuance of CtOs is partly subjective and influenced by session presidents and parliamentary dynamics, with certain individuals disproportionately affected. An insult towards individuals is the most frequent cause of CtO. In general, male members and those belonging to opposition parties receive more calls to order than their female and coalition-party counterparts. Most CtO triggers were detected in speeches dedicated to governmental affairs and actions of the presidency. The CtO triggers dataset is available at: https://github.com/kalawinka/cto_analysis.

Keywords: parliamentary debates, calls to order, incivility, German Bundestag, rule-based detection, dataset, political polarization, norm violations

98. ❌ Word Alignment-Based Evaluation of Uniform Meaning Representations

Authors: Daniel Zeman, Federica Gamba Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26401v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

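Scoring rationale: The paper focuses on evaluation methods for semantic representations (Uniform Meaning Representations, UMR). It proposes a node-matching algorithm based on word alignment and compares it with the existing smatch method. Its core is the formal evaluation of semantic representations in natural language processing, within semantic parsing and evaluation methodology. All scoring keywords concern large models, deep learning principles, training methods, inference optimization, alignment techniques, application domains, and so on, none of which the paper addresses. It discusses no large-model technique, training process, inference acceleration, alignment method, or AI-for-science application, so the relevance of every keyword is 0.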

!!! tip deepseek-chat TL;DR

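To address the difficulty of evaluating graph-structured semantic representations (UMR), the paper proposes a node-matching algorithm that uses word alignment information; compared with the traditional smatch method, it is more intuitive and avoids an NP-hard search problem.

Abstract translation (Chinese)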

基于图结构的句子语义表示比较与评估面临一项挑战：同一句子的不同表征可能包含不同数量的节点，且节点间的对应关系并不明确。现有方法倾向于通过最大化节点关系与属性的$F_1$分数来确定节点映射，无论其相似性是本质性的还是偶然性的；这导致节点属性值的差异无法用于细致的错误分析。我们提出一种节点匹配算法，该算法能够比较同一句子的多种统一语义表示（UMR），并利用UMR中固有的节点-词语对齐信息。我们将其与此前常用的方法（特别是AMR评估中事实上的标准工具smatch）进行对比，论证基于词语对齐的敏感性可使语义表示的比较更直观、更可解释，同时规避smatch固有的NP难搜索问题。该方法的实现代码已开源提供。

Abstract

Comparison and evaluation of graph-based representations of sentence meaning is a challenge because competing representations of the same sentence may have different number of nodes, and it is not obvious which nodes should be compared to each other. Existing approaches favor node mapping that maximizes $F_1$ score over node relations and attributes, regardless whether the similarity is intentional or accidental; consequently, the identified mismatches in values of node attributes are not useful for any detailed error analysis. We propose a node-matching algorithm that allows comparison of multiple Uniform Meaning Representations (UMR) of one sentence and that takes advantage of node-word alignments, inherently available in UMR. We compare it with previously used approaches, in particular smatch (the de-facto standard in AMR evaluation), and argue that sensitivity to word alignment makes the comparison of meaning representations more intuitive and interpretable, while avoiding the NP-hard search problem inherent in smatch. A script implementing the method is freely available.

Keywords: Word Alignment, Uniform Meaning Representations, UMR, Node Matching, Semantic Representation, Evaluation, smatch, Graph-based Representation
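
The abstract above argues that node-word alignments can anchor the node mapping between two graphs without a combinatorial search. The snippet below is a hypothetical illustration of that general idea only, not the authors' algorithm, whose details the abstract does not give. It assumes each node carries a set of aligned token indices and pairs nodes greedily by Jaccard overlap of those sets; the node identifiers and the function name `match_nodes_by_alignment` are invented for the example.

```python
from typing import Dict, List, Set, Tuple


def match_nodes_by_alignment(
    graph_a: Dict[str, Set[int]], graph_b: Dict[str, Set[int]]
) -> List[Tuple[str, str]]:
    """Greedy one-to-one node pairing based on shared aligned token indices.

    Each graph maps a node id to the set of token positions it is aligned to.
    Nodes with no shared token are never paired.
    """
    candidates = []
    for node_a, words_a in graph_a.items():
        for node_b, words_b in graph_b.items():
            shared = len(words_a & words_b)
            if shared:
                jaccard = shared / len(words_a | words_b)
                candidates.append((jaccard, node_a, node_b))
    # Best overlap first; ties broken by node id so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    used_a: Set[str] = set()
    used_b: Set[str] = set()
    pairs: List[Tuple[str, str]] = []
    for _, node_a, node_b in candidates:
        if node_a not in used_a and node_b not in used_b:
            pairs.append((node_a, node_b))
            used_a.add(node_a)
            used_b.add(node_b)
    return pairs


# Hypothetical annotations of one four-token sentence (token indices 0..3).
annotation_a = {"a1": {1}, "a2": {0}, "a3": {2, 3}}
annotation_b = {"b1": {1}, "b2": {0}, "b3": {3}, "b4": set()}

print(match_nodes_by_alignment(annotation_a, annotation_b))
```

This sketch runs in polynomial time because it only scores and sorts node pairs. Unaligned nodes such as `b4` stay unmatched here; how the actual method treats them is not stated in the abstract.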

99. ❌ Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers

Authors: Yusheng Zhao, Hourun Li, Bohan Wu, Jingyang Yuan, Meng Zhang, Yichun Yin, Lifeng Shang, Ming Zhang Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26380v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	8.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

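Scoring rationale: The paper proposes Switch Attention (SwiAttn), a dynamic hybrid attention mechanism that, within each Transformer layer, dynamically selects either global full attention or local sliding-window attention for each token, balancing computational efficiency and performance in long-context modeling. Core relevant keywords: 1) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points): the paper explicitly uses continual pretraining to optimize the model; 2) 'Context Window Extension OR Long Context LLMs' (10 points): it directly targets long-context (32K) language modeling; 3) 'KV Cache Compression OR Linear Attention OR FlashAttention' (8 points): it is an attention optimization technique, closely related to the paper's improvement of attention efficiency; 4) 'Large Language Models OR LLMs OR Foundation Models' (8 points): the paper improves the Transformer architecture and applies to large language models; 5) 'Speculative Decoding OR Inference Acceleration' (5 points): the paper's focus on computational efficiency has some connection to inference acceleration. Other keywords, such as MoE, SFT, RAG, and Agents, are unrelated to the paper.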

!!! tip deepseek-chat TL;DR

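To address the high computational complexity of standard full attention and the narrow receptive field of sliding-window attention in Transformers, the paper proposes SwiAttn, a dynamic hybrid attention mechanism that routes each token to a global or a local attention branch, balancing efficiency and performance on long-context language modeling tasks.

Abstract translation (Chinese)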

注意力机制已成为现代Transformer架构的核心组件。然而，标准全局注意力的计算复杂度随序列长度呈二次方增长，这构成了长上下文语言建模的主要瓶颈。滑动窗口注意力通过限制上下文长度来提升效率，但代价是感受野变窄。现有研究尝试通过构建混合模型来兼顾二者优势，但往往采用静态的、启发式设计的交替模式，限制了在不同场景下计算资源的高效分配。本文提出Switch Attention（SwiAttn），一种新颖的混合Transformer模型，能够实现全局注意力与滑动窗口注意力之间的动态细粒度路由。在每一层Transformer中，SwiAttn为每个词元动态地将计算路由至全局注意力分支（以聚合全局信息）或滑动窗口分支（以进行高效的局部模式匹配）。我们设计了一种自适应正则化目标，以引导模型提升计算效率。此外，我们采用持续预训练来优化模型，将全局注意力架构迁移至混合架构。我们在常规上下文长度（4K）与长上下文长度（32K）下的二十三个基准数据集上进行了广泛实验，结果验证了所提方法的有效性。

Abstract

The attention mechanism has been the core component in modern transformer architectures. However, the computation of standard full attention scales quadratically with the sequence length, serving as a major bottleneck in long-context language modeling. Sliding window attention restricts the context length for better efficiency at the cost of narrower receptive fields. While existing efforts attempt to take the benefits from both sides by building hybrid models, they often resort to static, heuristically designed alternating patterns that limit efficient allocation of computation in various scenarios. In this paper, we propose Switch Attention (SwiAttn), a novel hybrid transformer that enables dynamic and fine-grained routing between full attention and sliding window attention. For each token at each transformer layer, SwiAttn dynamically routes the computation to either a full-attention branch for global information aggregation or a sliding-window branch for efficient local pattern matching. An adaptive regularization objective is designed to encourage the model towards efficiency. Moreover, we adopt continual pretraining to optimize the model, transferring the full attention architecture to the hybrid one. Extensive experiments are conducted on twenty-three benchmark datasets across both regular (4K) and long (32K) context lengths, demonstrating the effectiveness of the proposed method.

Keywords: Switch Attention, hybrid transformer, dynamic routing, full attention, sliding window attention, long-context language modeling, computational efficiency, continual pretraining
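
The abstract above describes routing each token to either a full-attention branch or a sliding-window branch. The sketch below illustrates only the effect of such routing on which positions a token may attend to, and on the number of query-key pairs computed. It assumes causal masking and a hard per-token routing decision supplied as input; the router itself, the regularization objective, and the window size are not specified in the abstract, so `use_full` and `window` are hypothetical values.

```python
from typing import List

Mask = List[List[bool]]


def causal_full_mask(n: int) -> Mask:
    """Token i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n: int, window: int) -> Mask:
    """Token i may attend to itself and the previous window - 1 tokens."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def routed_mask(use_full: List[bool], window: int) -> Mask:
    """Per-token choice: row i comes from the full mask if use_full[i], else the local one."""
    n = len(use_full)
    full = causal_full_mask(n)
    local = sliding_window_mask(n, window)
    return [full[i] if use_full[i] else local[i] for i in range(n)]


def attended_pairs(mask: Mask) -> int:
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


# Hypothetical routing for 8 tokens: only tokens 5 and 7 take the full branch.
use_full = [False, False, False, False, False, True, False, True]
window = 3

print("full attention pairs:  ", attended_pairs(causal_full_mask(8)))
print("sliding window pairs:  ", attended_pairs(sliding_window_mask(8, window)))
print("routed (hybrid) pairs: ", attended_pairs(routed_mask(use_full, window)))
```

In this toy setting the hybrid mask sits between the two extremes in cost, and the gap to full attention widens with sequence length because full attention grows quadratically while the window cost grows linearly.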

100. ❌ A Formal Framework for Uncertainty Analysis of Text Generation with Large Language Models

Authors: Steffen Herbold, Florian Lemmerich Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26363v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

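Scoring rationale: The paper focuses on uncertainty analysis of text generation with large language models (LLMs) and proposes a formal framework for measuring uncertainty that covers three aspects: prompting, generation, and interpretation. Its core content directly concerns LLMs, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The paper concentrates on uncertainty measurement and the formal framework and does not address the specific techniques or application areas behind the other keywords, such as MoE, SLMs, training techniques, inference acceleration, hallucination mitigation, or AI for Science, so those keywords score 0.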

!!! tip deepseek-chat TL;DR

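The paper proposes a formal framework for measuring uncertainty in LLM text generation. It models prompting, generation, and interpretation as interconnected autoregressive processes, and shows how the framework unifies existing methods and points out aspects of uncertainty that have not yet been studied.

Abstract translation (Chinese)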

大型语言模型（LLM）生成文本的过程本质上是具有不确定性的，其不确定性来源不仅包括文本生成本身，还涉及所使用的提示词及下游解读。本研究提出了一个综合考虑这些不同层面的不确定性度量形式化框架。该框架将提示词构建、文本生成与语义解读建模为相互关联的自回归过程，这些过程可整合为统一的采样树结构。我们引入了过滤器和目标函数来描述不确定性如何在采样树上表达不同维度，并演示如何通过这些函数来表征现有不确定性研究方法。借助本框架，我们不仅揭示了不同方法在形式上的关联性及其可归约的共同核心，同时指出了尚未被探索的额外不确定性维度。

Abstract

The generation of texts using Large Language Models (LLMs) is inherently uncertain, with sources of uncertainty being not only the generation of texts, but also the prompt used and the downstream interpretation. Within this work, we provide a formal framework for the measurement of uncertainty that takes these different aspects into account. Our framework models prompting, generation, and interpretation as interconnected autoregressive processes that can be combined into a single sampling tree. We introduce filters and objective functions to describe how different aspects of uncertainty can be expressed over the sampling tree and demonstrate how to express existing approaches towards uncertainty through these functions. With our framework we show not only how different methods are formally related and can be reduced to a common core, but also point out additional aspects of uncertainty that have not yet been studied.

Keywords: Large Language Models, LLMs, uncertainty analysis, text generation, formal framework, autoregressive processes, sampling tree, prompting
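
The abstract above speaks of filters and objective functions defined over a sampling tree. The snippet below is one hypothetical reading of that vocabulary, not the paper's formalism: the tree is a nested dictionary of token probabilities, a filter is a predicate on complete token paths, and the objective is Shannon entropy over the renormalized probabilities of the paths that pass the filter. The toy tree and all names are invented for illustration.

```python
import math
from typing import Callable, Dict, Iterator, List, Tuple

# A node maps each next token to (conditional probability, subtree). {} marks a leaf.
Tree = Dict[str, Tuple[float, "Tree"]]
Path = Tuple[str, ...]


def leaves(tree: Tree, prefix: Path = (), prob: float = 1.0) -> Iterator[Tuple[Path, float]]:
    """Yield every complete path with its probability (product of conditionals)."""
    for token, (p, subtree) in tree.items():
        path = prefix + (token,)
        if subtree:
            yield from leaves(subtree, path, prob * p)
        else:
            yield path, prob * p


def entropy_bits(tree: Tree, keep: Callable[[Path], bool] = lambda path: True) -> float:
    """Entropy (in bits) over the paths accepted by the filter, after renormalizing."""
    probs: List[float] = [p for path, p in leaves(tree) if keep(path) and p > 0]
    total = sum(probs)
    if total == 0:
        return 0.0
    return -sum((p / total) * math.log2(p / total) for p in probs)


# Toy sampling tree with three possible two-token outputs.
toy_tree: Tree = {
    "yes": (0.5, {".": (1.0, {})}),
    "no": (0.25, {".": (1.0, {})}),
    "maybe": (0.25, {".": (1.0, {})}),
}

print("entropy over all outputs:", entropy_bits(toy_tree))
print("entropy without 'maybe': ", entropy_bits(toy_tree, lambda path: path[0] != "maybe"))
```

Swapping the predicate or the objective changes which aspect of uncertainty is measured while the tree stays fixed, which is the kind of separation the abstract attributes to the framework.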

101. ❌ SocialX: A Modular Platform for Multi-Source Big Data Research in Indonesia

Authors: Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26253v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

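Scoring rationale: SocialX is a modular platform for multi-source big data research in Indonesia. It is an engineering framework for data collection, preprocessing, and analysis, and it does not involve large models, the technical principles of deep learning, or AI-for-science applications. Every keyword concerns large-model techniques, deep learning innovation, or scientific AI applications, whereas this paper describes a data engineering platform, so every keyword receives a relevance of 0.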

!!! tip deepseek-chat TL;DR

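To address data fragmentation in multi-source big data research in Indonesia, the paper proposes SocialX, a modular platform whose three-layer architecture integrates heterogeneous data collection, language-aware preprocessing, and pluggable analysis into a unified data processing pipeline.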

Abstract (Chinese translation)

印度尼西亚的大数据研究面临一个根本性的碎片化制约：相关数据分散在社交媒体、新闻门户、电子商务平台、评论网站和学术数据库等不同来源中，每种数据源都具有不同的格式、访问方法和噪声特征。研究人员必须独立构建采集管道、清洗异构数据并整合各自分离的分析工具，这一过程往往使研究本身相形见绌。我们推出SocialX——一个面向多源大数据研究的模块化平台，它将异构数据采集、语言感知预处理和可插拔分析集成到一个统一、与数据源无关的流程中。该平台通过轻量级作业协调机制将核心功能分离为三个独立层（采集层、预处理层和分析层）。这种模块化设计使各层能够独立演进：新增数据源、预处理方法或分析工具时无需修改现有流程。我们阐述了实现这种可扩展性的设计原则，详细说明了针对印度尼西亚语跨语域文本特有挑战的预处理方法，并通过典型研究流程的演示展现了该平台的实用性。SocialX已作为基于网络的平台公开提供，访问地址为https://www.socialx.id。

Abstract

Big data research in Indonesia is constrained by a fundamental fragmentation: relevant data is scattered across social media, news portals, e-commerce platforms, review sites, and academic databases, each with different formats, access methods, and noise characteristics. Researchers must independently build collection pipelines, clean heterogeneous data, and assemble separate analysis tools, a process that often overshadows the research itself. We present SocialX, a modular platform for multi-source big data research that integrates heterogeneous data collection, language-aware preprocessing, and pluggable analysis into a unified, source-agnostic pipeline. The platform separates concerns into three independent layers (collection, preprocessing, and analysis) connected by a lightweight job-coordination mechanism. This modularity allows each layer to grow independently: new data sources, preprocessing methods, or analysis tools can be added without modifying the existing pipeline. We describe the design principles that enable this extensibility, detail the preprocessing methodology that addresses challenges specific to Indonesian text across registers, and demonstrate the platform’s utility through a walkthrough of a typical research workflow. SocialX is publicly accessible as a web-based platform at https://www.socialx.id.

Keywords: big data research, modular platform, multi-source data, heterogeneous data, Indonesian text, data preprocessing, research workflow, web-based platform
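
The abstract above describes three independent layers (collection, preprocessing, analysis) that can each grow without changes to the rest of the pipeline. A minimal sketch of that idea follows. It assumes a simple in-process registry; every name in it (`Pipeline`, `add_collector`, `SLANG`, the toy sources) is hypothetical and not taken from SocialX itself.

```python
# Sketch of a three-layer, source-agnostic pipeline: collection -> preprocessing -> analysis.
# Every name below (Pipeline, add_collector, SLANG, ...) is hypothetical, not SocialX's API.
from typing import Callable, Dict, List

Record = Dict[str, str]


class Pipeline:
    def __init__(self) -> None:
        self.collectors: Dict[str, Callable[[], List[Record]]] = {}
        self.preprocessors: List[Callable[[Record], Record]] = []
        self.analyzers: Dict[str, Callable[[List[Record]], object]] = {}

    def add_collector(self, source: str, fn: Callable[[], List[Record]]) -> None:
        self.collectors[source] = fn

    def add_preprocessor(self, fn: Callable[[Record], Record]) -> None:
        self.preprocessors.append(fn)

    def add_analyzer(self, name: str, fn: Callable[[List[Record]], object]) -> None:
        self.analyzers[name] = fn

    def run(self) -> Dict[str, object]:
        # Collection layer: tag each record with its source.
        records = [
            {**rec, "source": source}
            for source, collect in self.collectors.items()
            for rec in collect()
        ]
        # Preprocessing layer: apply each step to every record, in registration order.
        for step in self.preprocessors:
            records = [step(rec) for rec in records]
        # Analysis layer: every analyzer sees the same cleaned records.
        return {name: analyze(records) for name, analyze in self.analyzers.items()}


# Toy slang map for informal Indonesian; a real system would need a far larger lexicon.
SLANG = {"bgt": "banget", "yg": "yang"}


def normalize(rec: Record) -> Record:
    words = rec["text"].lower().split()
    return {**rec, "text": " ".join(SLANG.get(word, word) for word in words)}


def count_by_source(records: List[Record]) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for rec in records:
        counts[rec["source"]] = counts.get(rec["source"], 0) + 1
    return counts


pipeline = Pipeline()
pipeline.add_collector("news", lambda: [{"text": "Harga beras naik"}])
pipeline.add_collector("reviews", lambda: [{"text": "Barangnya bagus bgt"}])
pipeline.add_preprocessor(normalize)
pipeline.add_analyzer("count_by_source", count_by_source)
pipeline.add_analyzer("texts", lambda records: [rec["text"] for rec in records])

result = pipeline.run()
print(result)
```

Adding a new source, cleaning step, or analysis here is a single registration call, which is the property the abstract attributes to the layered design.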

102. ❌ A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs

Authors: Uri Z. Kialy, Avi Shtarkberg, Ayal Klein Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26236v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

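Scoring rationale: The paper studies how a multilingual large language model (Gemma-2-9B-IT) handles culture-specific pragmatic registers such as slang, probing its internal representations with Sparse Autoencoders (SAEs). This places it within research on LLM internals and explainable AI. It is therefore highly relevant (10) only to 'Large Language Models OR LLMs OR Foundation Models' (the core object of study) and 'Mechanistic Interpretability OR Explainable AI' (the core method: SAEs for interpretability analysis). The other keywords concern model architecture, training methods, inference optimization, or application domains that the paper does not address, so each scores 0.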
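
The pass/fail mark on each entry follows from the score table and the passing-score formula in the report header (5.0 + 0.8 × total weight, with a configured total weight of 27.0). The sketch below totals the score column of a tab-separated table and compares it with that threshold. The helper names `parse_rows` and `passing_score` are hypothetical, and the sketch deliberately does not guess how each per-keyword score is derived from weight and relevance, because the report does not state that rule.

```python
# Sketch: total a tab-separated score table and compare it with the passing score.
# Assumption: rows follow the layout keyword, weight, relevance/10, score used in this report.
# parse_rows and passing_score are hypothetical helper names.

def parse_rows(table_text):
    """Parse 'keyword<TAB>weight<TAB>relevance/10<TAB>score' rows into dicts."""
    rows = []
    for line in table_text.strip().splitlines():
        keyword, weight, relevance, score = line.split("\t")
        rows.append({
            "keyword": keyword,
            "weight": float(weight),
            "relevance": float(relevance.split("/")[0]),
            "score": float(score),
        })
    return rows


def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing score as stated in the report header: 5.0 + 0.8 x total weight."""
    return base + factor * total_weight


# Three rows copied from the table above (the remaining rows are all zero).
TABLE = (
    "Large Language Models OR LLMs OR Foundation Models\t0.0\t10.0/10\t0.0\n"
    "Mechanistic Interpretability OR Explainable AI\t0.0\t10.0/10\t0.0\n"
    "Quantization OR Model Compression OR Low-bit Weights\t0.0\t0.0/10\t0.0\n"
)

rows = parse_rows(TABLE)
total = sum(row["score"] for row in rows)
threshold = passing_score(total_weight=27.0)  # 27 keywords, each configured with weight 1.0
passed = total >= threshold
print(f"total={total:.1f} threshold={threshold:.1f} passed={passed}")
```

With the weights shown as 0.0 in the table, every per-keyword score is 0.0 regardless of relevance, so the total stays below the 26.6 threshold.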

!!! tip deepseek-chat TL;DR

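Probing the internal representations of the multilingual model Gemma-2-9B-IT with sparse autoencoders, the study finds that the model internalizes informal register (such as slang) as a portable, language-agnostic abstraction rather than as memorized surface-level heuristics.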

Abstract (Chinese translation)

尽管多语言模型成功实现了跨语言的事实性与句法知识迁移，但其对文化特异性语用语域（如俚语）的处理方式，究竟是基于孤立语言特异性记忆，还是形成了统一抽象概念，目前尚不明确。本研究通过稀疏自编码器（Sparse Autoencoders, SAEs）探测Gemma-2-9B-IT模型在三种类型学差异显著的源语言（英语、希伯来语和俄语）中的内部表征，以探究此问题。为明确区分语用语域处理与浅层词汇敏感性，我们引入了一个新颖数据集，其中每个目标词均为多义词，同时出现在字面语境与非正式语境中。研究发现，尽管非正式语域信号大多分布于语言特异性特征中，但仍存在一个规模较小但高度稳健的跨语言核心特征持续显现。这一共享核心构成了几何上连贯的“非正式语域子空间”，并在模型深层网络中逐渐强化。关键的是，这些共享表征不仅具有相关性：利用这些特征进行激活引导（activation steering）能够因果性地改变所有源语言的输出正式度，并可零样本迁移至涵盖不同语系与文字体系的六种未见语言。这些结果共同首次提供了机制性证据，表明多语言大语言模型（LLMs）对非正式语域的内化并非仅停留在表层启发式记忆，而是形成了一种可迁移的、与语言无关的语用抽象概念。

Abstract

While multilingual language models successfully transfer factual and syntactic knowledge across languages, it remains unclear whether they process culture-specific pragmatic registers, such as slang, as isolated language-specific memorizations or as unified, abstract concepts. We study this by probing the internal representations of Gemma-2-9B-IT using Sparse Autoencoders (SAEs) across three typologically diverse source languages: English, Hebrew, and Russian. To definitively isolate pragmatic register processing from trivial lexical sensitivity, we introduce a novel dataset in which every target term is polysemous, appearing in both literal and informal contexts. We find that while much of the informal-register signal is distributed across language-specific features, a small but highly robust cross-linguistic core consistently emerges. This shared core forms a geometrically coherent “informal register subspace” that sharpens in the model’s deeper layers. Crucially, these shared representations are not merely correlational: activation steering with these features causally shifts output formality across all source languages and transfers zero-shot to six unseen languages spanning diverse language families and scripts. Together, these results provide the first mechanistic evidence that multilingual LLMs internalize informal register not just as surface-level heuristics, but as a portable, language-agnostic pragmatic abstraction.

Keywords: multilingual language models, pragmatic register, sparse autoencoders (SAEs), internal representations, informal register subspace, cross-linguistic core, activation steering, language-agnostic abstraction
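
The abstract above relies on activation steering to show that the shared features are causal. As a rough illustration of the general technique, the sketch below adds a scaled unit direction to a hidden vector and measures the change in its projection onto that direction. It assumes the common additive form of steering; the vectors are toy values, and nothing here reproduces the paper's SAE features or its Gemma-2-9B-IT setup.

```python
# Sketch of additive activation steering on a toy hidden vector.
# Assumption: steering adds alpha * unit_direction to the activation; all values are invented.
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def steer(hidden, direction, alpha):
    """Shift the hidden state by alpha along the unit steering direction."""
    unit = normalize(direction)
    return [h + alpha * u for h, u in zip(hidden, unit)]


def projection(hidden, direction):
    """Component of the hidden state along the unit steering direction."""
    unit = normalize(direction)
    return sum(h * u for h, u in zip(hidden, unit))


hidden = [0.5, -1.0, 2.0, 0.0]
informal_direction = [3.0, 0.0, 4.0, 0.0]  # stands in for a shared "informal register" feature

before = projection(hidden, informal_direction)
after = projection(steer(hidden, informal_direction, alpha=2.0), informal_direction)
print(f"projection before={before:.2f} after={after:.2f}")
```

Because the direction is normalized, the projection rises by exactly `alpha`, which makes the steering strength easy to interpret in this toy setting.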

103. ❌ GS-BrainText: A Multi-Site Brain Imaging Report Dataset from Generation Scotland for Clinical Natural Language Processing Development and Validation

Authors: Beatrice Alex, Claire Grover, Arlene Casey, Richard Tobin, Heather Whalley, William Whiteley Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26235v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

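Scoring rationale: The paper introduces GS-BrainText, a dataset of brain imaging reports for developing and validating clinical natural language processing (NLP), which falls within biomedical informatics. It focuses on dataset construction, annotation, quality assurance, and benchmark evaluation, and it involves neither large models, innovation in deep learning techniques, nor any of the specific techniques among the scoring keywords (LLM, MoE, SFT, RLHF, RAG, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper is a bioinformatics application; since it presents a dataset resource rather than core technical innovation, it receives 5 (some relevance). All other keywords are unrelated to the paper and score 0.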

!!! tip deepseek-chat TL;DR

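The study builds and releases the GS-BrainText dataset, comprising 8,511 brain radiology reports of which 2,431 are annotated, for developing and evaluating clinical NLP tools, and it shows how hard it is for NLP tool performance to generalise across health boards, phenotypes, and age groups.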

Abstract (Chinese translation)

我们推出GS-BrainText数据集，该精选数据集包含来自"苏格兰世代"队列的8,511份脑部放射学报告，其中2,431份针对24种脑疾病表型进行了标注。这一多中心数据集涵盖苏格兰国家医疗服务体系（NHS）五个健康委员会辖区，具有广泛的年龄代表性（平均年龄58岁，中位年龄53岁），使其在开发和评估可泛化的临床自然语言处理（NLP）算法与工具方面具有独特价值。专家标注由多学科临床团队采用标准化标注框架完成，每个NHS健康委员会辖区的标注数据均经过10-100%的双重标注及严格质量把控。使用基于规则的现有NLP系统EdIE-R（该系统的开发与标注框架同步进行）进行的基准评估显示，系统在不同健康委员会（F1值：86.13-98.13）、表型类别（F1值：22.22-100）和年龄组（F1值：87.01-98.13）间存在性能差异，凸显了NLP工具泛化面临的关键挑战。GS-BrainText数据集填补了英国临床文本资源的重要空白，为研究语言变异、诊断不确定性表达以及数据特征对NLP系统性能的影响提供了宝贵资源。

Abstract

We present GS-BrainText, a curated dataset of 8,511 brain radiology reports from the Generation Scotland cohort, of which 2,431 are annotated for 24 brain disease phenotypes. This multi-site dataset spans five Scottish NHS health boards and includes broad age representation (mean age 58, median age 53), making it uniquely valuable for developing and evaluating generalisable clinical natural language processing (NLP) algorithms and tools. Expert annotations were performed by a multidisciplinary clinical team using an annotation schema, with 10-100% double annotation per NHS health board and rigorous quality assurance. Benchmark evaluation using EdIE-R, an existing rule-based NLP system developed in conjunction with the annotation schema, revealed some performance variation across health boards (F1: 86.13-98.13), phenotypes (F1: 22.22-100) and age groups (F1: 87.01-98.13), highlighting critical challenges in generalisation of NLP tools. The GS-BrainText dataset addresses a significant gap in available UK clinical text resources and provides a valuable resource for the study of linguistic variation, diagnostic uncertainty expression and the impact of data characteristics on NLP system performance.

Keywords: brain radiology reports, clinical natural language processing, dataset, annotation, generalisation, NHS health boards, phenotypes, benchmark evaluation
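
The abstract above reports F1 ranges per health board, phenotype, and age group. For readers less familiar with the metric, the sketch below computes F1 from confusion counts for two groups. The counts are invented for illustration and are not GS-BrainText results; they only show how a rarely occurring phenotype can produce a much lower F1 than a common one.

```python
# Sketch: F1 from confusion counts, computed per group.
# The counts below are invented for illustration; they are not GS-BrainText results.

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


groups = {
    "common phenotype": {"tp": 90, "fp": 5, "fn": 5},
    "rare phenotype": {"tp": 2, "fp": 3, "fn": 4},
}

for name, counts in groups.items():
    print(f"{name}: F1 = {100 * f1_score(**counts):.2f}")
```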

104. ❌ Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

Authors: Nicholas Edwards, Sebastian Schuster Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26233v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

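Scoring rationale: The paper studies the clarification-seeking behavior of LLM agents in software engineering, so it is highly relevant (10) to 'Large Language Models' and 'LLM Agents'. It uses a multi-agent architecture, so it is also highly relevant (10) to 'Multi-agent Systems'. Other keywords such as MoE, SFT, and RAG are not addressed and score 0.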

!!! tip deepseek-chat TL;DR

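The paper studies how LLM agents can proactively seek clarification when given underspecified instructions. It proposes an uncertainty-aware multi-agent scaffold that raises the task resolve rate on an underspecified variant of SWE-bench Verified from 61.20% (single agent) to 69.40%.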

Abstract (Chinese translation)

随着大语言模型（LLM）智能体在软件工程等开放式领域日益广泛地部署，它们频繁遇到因缺乏关键上下文而定义不明确的指令。人类开发者通常会通过提出澄清性问题来自然解决这种不明确性，而当前的智能体则主要被优化用于自主执行任务。在本研究中，我们基于SWE-bench Verified的一个定义不明确变体，系统性地评估了LLM智能体寻求澄清的能力。我们提出了一种具备不确定性感知的多智能体框架，该框架将不明确性检测与代码执行过程显式解耦。实验结果表明，采用OpenHands + Claude Sonnet 4.5构建的多智能体系统实现了69.40%的任务解决率，显著优于标准的单智能体设置（61.20%），并缩小了与在明确定义指令下运行的智能体之间的性能差距。此外，我们发现该多智能体系统展现出校准良好的不确定性判断能力：在简单任务上节省查询次数，同时对更复杂的问题主动寻求信息。这些发现表明，当前模型能够转变为主动的协作伙伴，使智能体能够在现实世界中定义不明确的任务中，自主识别何时需要提出问题以获取缺失信息。

Abstract

As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution. Our results demonstrate that this multi-agent system using OpenHands + Claude Sonnet 4.5 achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup (61.20%) and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated uncertainty, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators, where agents independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks.

Keywords: LLM agents, clarification-seeking, uncertainty-aware, multi-agent system, underspecified instructions, software engineering, SWE-bench, task resolve rate
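
The key design point in the abstract above is that underspecification detection is decoupled from code execution. The sketch below illustrates only that separation. It assumes a rule-based detector over a hypothetical task schema (`REQUIRED_FIELDS`) in place of the paper's uncertainty-aware agent, and `fake_user` and `fake_executor` are stand-ins rather than anything from OpenHands or the paper.

```python
# Sketch: decide whether to ask before acting, with detection separate from execution.
# Assumption: a rule-based detector stands in for the paper's uncertainty-aware agent.
from typing import Callable, Dict, List, Tuple

REQUIRED_FIELDS = ("target_file", "expected_behavior")  # hypothetical task schema


def detect_missing(task: Dict[str, str]) -> List[str]:
    """Detection step: list the required fields the instruction leaves out."""
    return [name for name in REQUIRED_FIELDS if not task.get(name)]


def resolve(
    task: Dict[str, str],
    ask: Callable[[str], str],
    execute: Callable[[Dict[str, str]], str],
) -> Tuple[str, int]:
    """Ask one question per missing field, then hand the completed task to the executor."""
    completed = dict(task)
    missing = detect_missing(completed)
    for name in missing:
        completed[name] = ask(name)
    return execute(completed), len(missing)


def fake_user(field_name: str) -> str:
    answers = {
        "target_file": "parser.py",
        "expected_behavior": "raise ValueError on empty input",
    }
    return answers[field_name]


def fake_executor(task: Dict[str, str]) -> str:
    return f"patched {task['target_file']}: {task['expected_behavior']}"


clear_task = {"issue": "Fix crash", "target_file": "cli.py", "expected_behavior": "exit code 0"}
vague_task = {"issue": "Fix the parser bug"}

clear_result = resolve(clear_task, fake_user, fake_executor)  # no questions needed
vague_result = resolve(vague_task, fake_user, fake_executor)  # two clarifying questions
print(clear_result)
print(vague_result)
```

Keeping the detector separate means the number of questions can be tuned or measured without touching the executor, which mirrors the query-conserving behavior the abstract describes for simple tasks.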

105. ❌ ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

Authors: Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26182v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	10.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

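Scoring rationale: The paper studies the application of LLMs to clinical decision making and directly involves LLMs, multi-agent systems, MCTS, and clinical reasoning, so those keywords receive a relevance of 10. It has some connection to RAG, self-correction, and explainability, which receive 5. The remaining keywords, such as MoE, quantization, and tool use, are not addressed and receive 0.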

!!! tip deepseek-chat TL;DR

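To address the weakness of LLMs at complex reasoning in clinical diagnosis, the paper proposes ClinicalAgents, a multi-agent framework built on Monte Carlo Tree Search and a Dual-Memory architecture, and reports significant gains in diagnostic accuracy and explainability.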

Abstract (Chinese translation)

尽管大语言模型（LLMs）在医疗健康领域展现出潜力，但其在处理准确临床诊断所需的复杂非线性推理时仍面临困难。现有方法通常依赖于从症状到诊断的静态线性映射，未能捕捉人类临床医生固有的迭代式、假设驱动的推理过程。为弥补这一差距，我们提出了ClinicalAgents——一种新颖的多智能体框架，旨在模拟专家临床医生的认知工作流程。与僵化的顺序链不同，ClinicalAgents采用了一种动态编排机制，该机制被建模为蒙特卡洛树搜索（Monte Carlo Tree Search, MCTS）过程。这使得一个“编排器”能够迭代生成假设、主动验证证据，并在关键信息缺失时触发回溯。该框架的核心是一个双记忆架构：一个可变的“工作记忆”，用于维护不断演变的患者状态以实现情境感知推理；以及一个静态的“经验记忆”，通过主动反馈循环检索临床指南和历史病例。大量实验表明，ClinicalAgents实现了最先进的性能，与强大的单智能体和多智能体基线相比，显著提升了诊断准确性和可解释性。

Abstract

While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non-linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis-driven reasoning inherent to human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an Orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. Central to this framework is a Dual-Memory architecture: a mutable Working Memory that maintains the evolving patient state for context-aware reasoning, and a static Experience Memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.

Keywords: Large Language Models, Multi-agent Systems, Monte Carlo Tree Search, Clinical Decision Making, Dual-Memory Architecture, Clinical Diagnosis, Reasoning Framework, Healthcare AI
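
The abstract above models orchestration as a Monte Carlo Tree Search over diagnostic hypotheses. One way to picture the selection step is the standard UCT rule, sketched below. The abstract does not say which selection rule ClinicalAgents uses, so UCT is an assumption here, and the hypothesis names, values, and visit counts are invented.

```python
# Sketch of the standard UCT selection rule from Monte Carlo Tree Search.
# Assumption: the paper's orchestrator may use a different rule; all statistics are invented.
import math


def uct(total_value, visits, parent_visits, c=1.4):
    """Mean value plus an exploration bonus; unvisited nodes are tried first."""
    if visits == 0:
        return float("inf")
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select(children, parent_visits):
    """Pick the child hypothesis with the highest UCT score."""
    return max(
        children,
        key=lambda name: uct(children[name]["value"], children[name]["visits"], parent_visits),
    )


hypotheses = {
    "pneumonia": {"value": 6.0, "visits": 10},
    "bronchitis": {"value": 2.0, "visits": 3},
    "pulmonary embolism": {"value": 0.0, "visits": 0},
}
parent_visits = 13

first_choice = select(hypotheses, parent_visits)
visited_only = {name: stats for name, stats in hypotheses.items() if stats["visits"] > 0}
second_choice = select(visited_only, parent_visits)
print(first_choice, "then", second_choice)
```

The unvisited hypothesis is explored first; among visited ones, the less-explored "bronchitis" node wins because its exploration bonus outweighs the other node's larger visit count.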

106. ❌ Clash of the models: Comparing performance of BERT-based variants for generic news frame detection

Authors: Vihang Jumle Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26156v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

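Scoring rationale: The paper compares the performance of BERT variants on news frame detection and introduces fine-tuned models. Relevance to 'Large Language Models' is 8: the paper notes that the rise of LLMs has encouraged exploration of computational methods, but it mainly uses the BERT family rather than large language models proper. Relevance to 'Post-training OR Supervised Fine-tuning OR SFT' is 10: one of the paper's core contributions is a set of fine-tuned models for news frame detection. All other keywords are unrelated to the paper (0).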

!!! tip deepseek-chat TL;DR

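The study compares five BERT-based variants on generic news frame detection, introduces fine-tuned models, and provides a labelled dataset based on the Swiss electoral context for testing the contextual robustness of framing analysis methods.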

Abstract (Chinese translation)

框架理论依然是政治传播学中应用最广泛的理论之一。近年来，计算技术的发展，特别是Transformer架构的引入以及大型语言模型（LLMs）的兴起，自然促使学者们探索各种新颖的计算方法，尤其是用于演绎式框架检测。尽管许多研究表明，不同的Transformer模型在性能上超越了先前基于词袋特征的模型，但关于这些模型在分类任务中相互比较的讨论仍在持续发展。立足于这一研究节点，本研究作出了三项关键贡献：首先，本研究对通用新闻框架检测进行了比较分析，并对比了五种基于BERT的变体模型（BERT、RoBERTa、DeBERTa、DistilBERT和ALBERT）的性能，从而为围绕政治传播研究采用计算文本分析的最佳实践讨论提供了新的见解。其次，本研究引入了多种经过微调的模型，这些模型能够稳健地执行通用新闻框架检测任务。第三，在先前众多以美国为中心的数据研究基础上，本研究为学术界提供了一个基于瑞士选举情境的标注通用新闻框架数据集，该数据集有助于测试这些计算框架分析方法的情境鲁棒性。

Abstract

Framing continues to remain one of the most extensively applied theories in political communication. Developments in computation, particularly with the introduction of transformer architecture and more so with large language models (LLMs), have naturally prompted scholars to explore various novel computational approaches, especially for deductive frame detection, in recent years. While many studies have shown that different transformer models outperform their preceding models that use bag-of-words features, the debate continues to evolve regarding how these models compare with each other on classification tasks. By placing itself at this juncture, this study makes three key contributions: First, it comparatively performs generic news frame detection and compares the performance of five BERT-based variants (BERT, RoBERTa, DeBERTa, DistilBERT and ALBERT) to add to the debate on best practices around employing computational text analysis for political communication studies. Second, it introduces various fine-tuned models capable of robustly performing generic news frame detection. Third, building upon numerous previous studies that work with US-centric data, this study provides the scholarly community with a labelled generic news frames dataset based on the Swiss electoral context that aids in testing the contextual robustness of these computational approaches to framing analysis.

Keywords: BERT variants, news frame detection, fine-tuned models, computational text analysis, political communication, transformer models, Swiss electoral context, classification tasks

107. ❌ DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Authors: Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26164v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

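Scoring rationale: The paper presents DataFlex, a data-centric dynamic training framework for optimizing data selection, mixing, and weighting in LLM training. Highly relevant keywords: LLMs (the core object of study), and Pre-training/Domain Adaptation and Post-training/SFT (training stages the framework supports). Moderately relevant: Scaling Laws AND Data Quality (the paper concerns data quality optimization but does not primarily study scaling laws). The remaining keywords (such as MoE, SLMs, Alignment, RAG, and inference techniques) have no direct connection to the paper.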

!!! tip deepseek-chat TL;DR

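The paper proposes DataFlex, a unified framework that optimizes LLM training through dynamic data selection, mixing, and weighting. In the reported experiments, dynamic data selection outperforms static full-data training on MMLU, and the framework runs faster than the original implementations.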

Abstract (Chinese translation)

以数据为中心的训练已成为改进大语言模型（LLM）的一个前景广阔的方向，其不仅优化模型参数，还在优化过程中对训练数据的选择、组合与加权进行优化。然而，现有的数据选择、数据混合优化和数据重加权方法通常在独立的代码库中开发，接口不一致，阻碍了可复现性、公平比较和实际集成。本文提出 DataFlex，一个基于 LLaMA-Factory 构建的、统一的以数据为中心的动态训练框架。DataFlex 支持三种主要的动态数据优化范式：样本选择、领域混合调整和样本重加权，同时保持与原始训练工作流的完全兼容。它提供了可扩展的训练器抽象和模块化组件，能够直接替代标准的 LLM 训练，并统一了关键且依赖于模型的操作，如嵌入提取、推理和梯度计算，同时支持包括 DeepSpeed ZeRO-3 在内的大规模设置。我们对多种以数据为中心的方法进行了全面的实验。在 Mistral-7B 和 Llama-3.2-3B 模型上，动态数据选择在 MMLU 基准上始终优于静态全数据训练。在数据混合方面，当在 SlimPajama 数据集上以 60 亿和 300 亿词元规模预训练 Qwen2.5-1.5B 时，DoReMi 和 ODM 方法相较于默认比例，均提升了 MMLU 准确率和语料库级别的困惑度。DataFlex 相比原始实现也实现了持续的运行时改进。这些结果表明，DataFlex 为 LLM 的以数据为中心的动态训练提供了一个有效、高效且可复现的基础设施。

Abstract

Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.

Keywords: Data-centric training, Large Language Models, Dynamic data optimization, Sample selection, Domain mixture adjustment, Sample reweighting, Training framework, LLaMA-Factory
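
Of the three paradigms named in the abstract above, sample reweighting is the easiest to show in a few lines. The sketch below turns per-sample losses into weights with a softmax, so harder samples contribute more to the batch loss. This is a generic loss-based scheme chosen for illustration; it is not DataFlex's API or any specific method it ships, and the loss values are invented.

```python
# Sketch of loss-based sample reweighting within one batch.
# Assumption: a softmax over per-sample losses; not DataFlex's API, and the losses are invented.
import math


def loss_based_weights(losses, temperature=1.0):
    """Softmax over losses: larger loss gives a larger weight; weights sum to 1."""
    highest = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - highest) / temperature) for loss in losses]
    total = sum(exps)
    return [value / total for value in exps]


def weighted_loss(losses, weights):
    """Batch loss after reweighting."""
    return sum(loss * weight for loss, weight in zip(losses, weights))


batch_losses = [0.2, 1.0, 3.0]
weights = loss_based_weights(batch_losses)
uniform = sum(batch_losses) / len(batch_losses)
reweighted = weighted_loss(batch_losses, weights)
print(f"uniform={uniform:.3f} reweighted={reweighted:.3f}")
```

Raising `temperature` flattens the weights toward uniform, while lowering it concentrates training signal on the hardest samples.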

108. ❌ LLM Benchmark-User Need Misalignment for Climate Change

Authors: Oucheng Liu, Lexing Xie, Jing Jiang Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26106v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

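Scoring rationale: The paper evaluates LLMs applied to climate change. It is highly relevant (10) to 'Large Language Models' because the whole paper evaluates LLMs as an interface to climate knowledge. It is relevant (8) to 'Retrieval-Augmented Generation' because the paper explicitly offers guidance for RAG system development. It is relevant (8) to 'AI for Science' because climate change is an important application area in science. Other keywords, such as MoE, SFT, and quantization, concern specific techniques that the paper does not discuss, so they score 0.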

!!! tip deepseek-chat TL;DR

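The study finds a substantial mismatch between current LLM evaluation benchmarks for climate change and real-world user needs, while knowledge interaction patterns between humans and LLMs resemble those between humans. The findings offer practical guidance for benchmark design, RAG systems, and LLM training.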

Abstract (Chinese translation)

气候变化是影响公共决策与政策讨论的重要社会科学议题。随着大语言模型日益成为获取气候知识的重要接口，现有基准测试能否反映真实用户需求，对于评估大语言模型在实际场景中的表现至关重要。本研究提出一个“主动知识行为框架”，用以捕捉人与人、人与人工智能之间不同类型知识寻求与提供行为。我们进一步构建了“主题-意图-形式”分类体系，并运用该体系分析了代表不同知识行为的气候相关数据。研究结果表明，当前基准测试与真实用户需求存在显著错位，而人与大语言模型之间的知识交互模式与人与人之间的交互模式高度相似。这些发现为基准测试设计、检索增强生成系统开发及大语言模型训练提供了可操作的指导。代码发布于https://github.com/OuchengLiu/LLM-Misalign-Climate-Change。

Abstract

Climate change is a major socio-scientific issue that shapes public decision-making and policy discussions. As large language models (LLMs) increasingly serve as an interface for accessing climate knowledge, whether existing benchmarks reflect user needs is critical for evaluating LLMs in real-world settings. We propose a Proactive Knowledge Behaviors Framework that captures the different human-human and human-AI knowledge seeking and provision behaviors. We further develop a Topic-Intent-Form taxonomy and apply it to analyze climate-related data representing different knowledge behaviors. Our results reveal a substantial mismatch between current benchmarks and real-world user needs, while knowledge interaction patterns between humans and LLMs closely resemble those in human-human interactions. These findings provide actionable guidance for benchmark design, RAG system development, and LLM training. Code is available at https://github.com/OuchengLiu/LLM-Misalign-Climate-Change.

关键词: Large Language Models, Climate Change, Benchmark Evaluation, User Needs, Knowledge Behaviors, RAG Systems, Human-AI Interaction, Proactive Knowledge Framework

109. ❌ IndoBERT-Relevancy: A Context-Conditioned Relevancy Classifier for Indonesian Text

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
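
The arithmetic behind a score line such as "0.0 / 26.6" can be sketched as follows. The pass threshold uses the formula stated in the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The aggregation rule, a sum of weight × relevance over the rows, is an assumption made for illustration; the report does not spell out how the per-keyword scores are combined.

```python
# Assumption: the total is the sum of weight * relevance over all keyword rows.
# The threshold follows the report's stated formula: 5.0 + 0.8 * total weight.

def total_score(rows):
    """rows: iterable of (weight, relevance) pairs, relevance on a 0-10 scale."""
    return sum(weight * relevance for weight, relevance in rows)

def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass mark as configured in the report header."""
    return base + factor * total_weight

# Rows as printed in the table above: 27 keywords, every weight 0.0,
# and a single non-zero relevance of 5.0 (the SFT row).
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
score = total_score(rows)
threshold = pass_threshold(total_weight=27.0)
passed = score >= threshold
print(f"{score:.1f} / {threshold:.1f} passed={passed}")
```

Under this assumed rule, any row with a printed weight of 0.0 contributes nothing to the total, whatever its relevance.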

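Scoring rationale: The paper focuses on relevancy classification for Indonesian text, using a BERT-based model (IndoBERT Large), and mainly relies on supervised fine-tuning (SFT) to train the classifier. It does not involve innovations in large-model technical principles, scientific applications, or the other frontier techniques among the keywords. It has only a moderate connection to 'Post-training OR Supervised Fine-tuning OR SFT' (5 points), because the model is trained with supervised fine-tuning. All other keywords are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

For Indonesian text relevancy classification, the paper proposes the IndoBERT-Relevancy model, which reaches 96.5% accuracy through a newly constructed dataset and supervised fine-tuning.

Abstract (Chinese translation)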

判断一段文本是否与给定主题相关是自然语言处理中的一项基础任务，然而针对印度尼西亚语（Bahasa Indonesia）的相关研究仍基本处于空白状态。与情感分析或命名实体识别不同，相关性分类要求模型同时对两个输入之间的关系进行推理：主题语境和候选文本。我们提出了IndoBERT-Relevancy，这是一个基于IndoBERT Large（3.35亿参数）构建的语境条件相关性分类器，并在一个包含188个主题、共计31,360个标注对的新数据集上进行了训练。通过迭代的、基于失败案例驱动的数据构建过程，我们证明单一数据源不足以实现稳健的相关性分类，而针对性的合成数据能有效解决模型的特定弱点。我们的最终模型取得了0.948的F1分数和96.5%的准确率，能够同时处理正式和非正式的印尼语文本。该模型已在HuggingFace平台公开提供。

Abstract

Determining whether a piece of text is relevant to a given topic is a fundamental task in natural language processing, yet it remains largely unexplored for Bahasa Indonesia. Unlike sentiment analysis or named entity recognition, relevancy classification requires the model to reason about the relationship between two inputs simultaneously: a topical context and a candidate text. We introduce IndoBERT-Relevancy, a context-conditioned relevancy classifier built on IndoBERT Large (335M parameters) and trained on a novel dataset of 31,360 labeled pairs spanning 188 topics. Through an iterative, failure-driven data construction process, we demonstrate that no single data source is sufficient for robust relevancy classification, and that targeted synthetic data can effectively address specific model weaknesses. Our final model achieves an F1 score of 0.948 and an accuracy of 96.5%, handling both formal and informal Indonesian text. The model is publicly available at HuggingFace.

Keywords: relevancy classification, Indonesian text, IndoBERT, context-conditioned classifier, supervised fine-tuning, dataset construction, natural language processing

110. ❌ I Want to Believe (but the Vocabulary Changed): Measuring the Semantic Structure and Evolution of Conspiracy Theories

Authors: Manisha Keim, Sarmad Chandio, Osama Khalid, Rishab Nithyanand Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26062v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

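Scoring rationale: The paper studies the semantic structure and evolution of conspiracy theories in online political discourse, analyzing Reddit comment data with aligned word embeddings. It is applied research in computational social science and natural language processing, and it does not touch the keyword areas of large models, deep learning principles, training optimization, inference acceleration, AI agents, or AI for Science. All the keywords relate directly to large-model techniques or specific AI applications, whereas this paper applies traditional NLP methods to social science, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper studies the semantic structure and temporal evolution of conspiracy theories in online political discourse. Analyzing Reddit comments, it finds that conspiracy-related language forms distinguishable semantic regions and reveals non-uniform evolution patterns, including semantic stability, expansion, contraction, and replacement.

Abstract (Chinese translation)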

关于阴谋论的研究主要集中于信念形成、接触与传播机制，却较少关注其含义随时间演变的过程。这一研究空白之所以持续存在，部分原因在于阴谋论相关术语常被视为稳定的词汇标记，导致难以区分真正的语义演变与表层词汇更替。本文通过量化分析在线政治话语中阴谋论的语义结构及其演化路径来弥补这一不足。基于2012年至2022年间Reddit政治子论坛的1.699亿条评论数据，我们首先证明阴谋论相关语言在语言空间中形成了连贯且语义可区分的区域，这使得阴谋论能够被视作语义对象进行研究。随后，我们运用对齐词向量技术追踪这些语义对象随时间演变的过程，实现了跨时期语义邻域的比较分析。研究发现，阴谋论的演化呈现非均匀性特征，表现出语义稳定性、扩展性、收缩性与替代性等动态模式，这些复杂演变规律是单纯基于关键词的研究方法所无法捕捉的。

Abstract

Research on conspiracy theories has largely focused on belief formation, exposure, and diffusion, while paying less attention to how their meanings change over time. This gap persists partly because conspiracy-related terms are often treated as stable lexical markers, making it difficult to separate genuine semantic changes from surface-level vocabulary changes. In this paper, we measure the semantic structure and evolution of conspiracy theories in online political discourse. Using 169.9M comments from Reddit’s r/politics subreddit spanning 2012–2022, we first demonstrate that conspiracy-related language forms coherent and semantically distinguishable regions of language space, allowing conspiracy theories to be treated as semantic objects. We then track how these objects evolve over time using aligned word embeddings, enabling comparisons of semantic neighborhoods across periods. Our analysis reveals that conspiracy theories evolve non-uniformly, exhibiting patterns of semantic stability, expansion, contraction, and replacement that are not captured by keyword-based approaches alone.

Keywords: conspiracy theories, semantic structure, semantic evolution, word embeddings, online political discourse, Reddit, natural language processing, computational social science
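
The abstract mentions "aligned word embeddings" without naming an alignment method. One common choice is orthogonal Procrustes alignment, sketched below on synthetic vectors. The choice of Procrustes, the toy data, and all function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def procrustes_align(source, target):
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vocab_size, dim = 50, 8
early = rng.normal(size=(vocab_size, dim))                # toy embeddings, period 1
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # arbitrary change of basis
late = early @ rotation                                   # period 2: same geometry, new basis
late[0] = -late[0]                                        # word 0 flips its meaning

R = procrustes_align(early, late)
aligned = early @ R
# Semantic drift per word: cosine distance after alignment.
drift = [1.0 - cosine(aligned[i], late[i]) for i in range(vocab_size)]
most_changed = int(np.argmax(drift))
```

Separately trained embedding spaces differ by an arbitrary rotation, so raw vectors from two periods are not comparable. After alignment, only the word whose usage actually changed shows a large drift.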

111. ❌ Retrieval-Augmented Generation Based Nurse Observation Extraction

Authors: Kyomin Hwang, Nojun Kwak Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26046v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

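Scoring rationale: The paper explicitly uses LLMs and RAG to automatically extract clinical observations, so these two keywords are highly relevant (10 points). It is an AI application in the medical domain and has some connection to 'AI for Science' (8 points). Other keywords such as MoE, SFT, and quantization are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes an automated pipeline based on Retrieval-Augmented Generation (RAG) for extracting clinical observations from nurse dictations, achieving an F1 score of 0.796 on the MEDIQA-SYNUR test dataset.

Abstract (Chinese translation)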

近期，大型语言模型（LLM）的进展在诸多领域显著减轻了人类工作负担，这一趋势正日益延伸至医疗领域。本文提出一种自动化流程，旨在通过从护士口述记录中自动提取临床观察结果，以减轻护士的工作负担。为确保提取的准确性，我们引入了一种基于检索增强生成（Retrieval-Augmented Generation, RAG）的方法。我们的方法展现出良好的性能，在MEDIQA-SYNUR测试数据集上取得了0.796的F1分数。

Abstract

Recent advancements in Large Language Models (LLMs) have played a significant role in reducing human workload across various domains, a trend that is increasingly extending into the medical field. In this paper, we propose an automated pipeline designed to alleviate the burden on nurses by automatically extracting clinical observations from nurse dictations. To ensure accurate extraction, we introduce a method based on Retrieval-Augmented Generation (RAG). Our approach demonstrates effective performance, achieving an F1-score of 0.796 on the MEDIQA-SYNUR test dataset.

Keywords: Retrieval-Augmented Generation, Large Language Models, clinical observations, nurse dictations, automated pipeline, medical field, F1-score, MEDIQA-SYNUR
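
The abstract does not describe what the pipeline retrieves or how. A minimal sketch of one plausible reading follows: retrieve the most similar annotated dictation and prepend it to the prompt as a worked example. The bag-of-words retriever, the example records, and the prompt layout are all hypothetical; the paper's actual retriever and prompt are not stated in the abstract.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts for a lowercase, whitespace-split text."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, examples, k=1):
    """Return the k stored examples whose dictation is most similar to the query."""
    q = bow(query)
    ranked = sorted(examples, key=lambda ex: cosine(q, bow(ex["dictation"])), reverse=True)
    return ranked[:k]

def build_prompt(query, retrieved):
    """Prepend retrieved examples as few-shot demonstrations for an LLM."""
    shots = "\n".join(
        f"Dictation: {ex['dictation']}\nObservations: {ex['observations']}" for ex in retrieved
    )
    return f"{shots}\nDictation: {query}\nObservations:"

# Invented toy records, not taken from MEDIQA-SYNUR.
examples = [
    {"dictation": "patient blood pressure is 120 over 80", "observations": "blood_pressure=120/80"},
    {"dictation": "patient reports mild pain in left knee", "observations": "pain_location=left knee; pain_level=mild"},
]
query = "blood pressure measured at 135 over 90"
top = retrieve(query, examples, k=1)
prompt = build_prompt(query, top)
```

In practice a dense embedding retriever would likely replace the bag-of-words similarity; the prompt-assembly step would stay the same.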

112. ❌ AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents

Authors: Wenbo Gao, Renxi Liu, Xian Wang, Fang Guo, Shuai Yang, Xi Chen, Hui-Ling Zhen, Hanting Chen, Weizhe Lin, Xiaosong Li, Yaoyuan Wang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26034v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

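Scoring rationale: The paper proposes the AgentCollab framework, whose core topic is collaborative inference for LLM agents. It directly involves the keywords LLMs, SLMs, multi-step reasoning, in-depth reasoning, self-reflection, autonomous agents, tool use, and multi-agent coordination, which are central to the paper (10 points). Other keywords such as MoE, data quality, training methods, RAG, attention optimization, and quantization are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper studies the trade-off between efficiency and robustness when LLM agents execute complex tasks and proposes AgentCollab, a collaboration framework that uses self-reflection signals to dynamically coordinate models of different capability. Experiments show that the framework improves the accuracy-efficiency Pareto frontier of LLM agents.

Abstract (Chinese translation)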

由大型语言模型（LLM）驱动的自主智能体通过长程推理与工具交互执行复杂任务，其中存在执行效率与推理鲁棒性之间的根本权衡。不同能力-成本层级的模型提供了互补优势：低成本模型能够快速执行，但在困难推理环节可能表现不佳；而能力更强的模型虽能以更高计算成本提供更稳健的推理，却牺牲了效率。本文提出AgentCollab——一种自驱动的协同推理框架，该框架在智能体执行过程中动态协调具有不同推理能力的模型。该框架不依赖外部路由模块，而是利用智能体自身的反思信号来判断当前推理轨迹是否取得实质性进展，仅在必要时将控制权移交至更高层级的强推理模型。为进一步稳定长程任务执行，我们引入难度感知的累积升级策略，该策略根据近期失败信号动态分配额外推理预算。实验中，我们采用“小模型-大模型”双层架构实例化该框架。在多样化多步骤智能体基准测试上的实验表明，AgentCollab能持续优化LLM智能体的精度-效率帕累托边界。

Abstract

Autonomous agents powered by large language models (LLMs) perform complex tasks through long-horizon reasoning and tool interaction, where a fundamental trade-off arises between execution efficiency and reasoning robustness. Models at different capability-cost levels offer complementary advantages: lower-cost models enable fast execution but may struggle on difficult reasoning segments, while stronger models provide more robust reasoning at higher computational cost. We present AgentCollab, a self-driven collaborative inference framework that dynamically coordinates models with different reasoning capacities during agent execution. Instead of relying on external routing modules, the framework uses the agent’s own self-reflection signal to determine whether the current reasoning trajectory is making meaningful progress, and escalates control to a stronger reasoning tier only when necessary. To further stabilize long-horizon execution, we introduce a difficulty-aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals. In our experiments, we instantiate this framework using a two-level small-large model setting. Experiments on diverse multi-step agent benchmarks show that AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents.

Keywords: LLM agents, autonomous agents, collaborative inference, self-reflection, multi-step reasoning, tool interaction, small-large model coordination, efficiency-robustness trade-off
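
The escalation idea in the abstract can be illustrated with a toy control loop. This is a sketch under stated assumptions, not the authors' algorithm. The two models are stand-in callables, the self-reflection signal is reduced to a boolean `made_progress` check, and the budget rule (large-model steps proportional to the count of recent stalls) is an invented reading of "difficulty-aware cumulative escalation".

```python
def run_agent(state, small_step, large_step, made_progress, max_steps=10, base_budget=1):
    """Run an agent loop that hands control to the large model when progress stalls.

    small_step / large_step: callables mapping a state to a new state.
    made_progress: callable (old_state, new_state) -> bool, standing in for the
    agent's self-reflection signal.
    Returns the final state and the tier used at each step.
    """
    tiers = []
    recent_failures = 0   # stalls since the last successful small-model step
    large_budget = 0      # remaining steps reserved for the large model
    for _ in range(max_steps):
        use_large = large_budget > 0
        step = large_step if use_large else small_step
        new_state = step(state)
        tiers.append("large" if use_large else "small")
        if use_large:
            large_budget -= 1
        elif made_progress(state, new_state):
            recent_failures = 0
        else:
            recent_failures += 1
            # Cumulative escalation: repeated stalls earn a longer large-model budget.
            large_budget = base_budget * recent_failures
        state = new_state
    return state, tiers

# Toy task: advance an integer counter; the small model is stuck at states 2 and 3.
small = lambda s: s if s in (2, 3) else s + 1
large = lambda s: s + 1
final, tiers = run_agent(0, small, large, made_progress=lambda old, new: new > old)
print(final, tiers)
```

In this run the small model handles the easy steps, the first stall buys one large-model step, and the second consecutive stall buys two, so the expensive model is used only inside the hard region.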

113. ❌ Toward Culturally Grounded Natural Language Processing

Authors: Sina Bagheri Nezhad Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26013v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

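Scoring rationale: The paper mainly studies cultural competence in multilingual NLP, surveying more than 50 related papers from 2020–2026. Its relevance to the keywords is as follows: 1) the paper discusses performance inequality and cultural alignment in multilingual models (LLMs), which relates to 'Large Language Models', 'Instruction Tuning/Alignment', 'Hallucination Mitigation', and 'Explainable AI', though these are not its technical core; 2) it covers training data and evaluation benchmark design, which relates somewhat to 'Scaling Laws AND Data Quality', 'Pre-training', and 'Post-training'; 3) it does not cover specific techniques such as MoE, SLMs, inference acceleration, model compression, or scientific AI applications, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper surveys the lack of cultural competence in multilingual natural language processing, notes that current models perform poorly at cross-cultural understanding and in community-specific settings, and proposes a research agenda for culturally grounded NLP centered on richer contextual metadata, culturally stratified evaluation, and participatory alignment.

Abstract (Chinese translation)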

近年来多语言自然语言处理领域的进展常被视为全球包容性提升的体现，但越来越多的研究表明，多语言能力与文化胜任力之间存在脱节。本文综合梳理了2020至2026年间发表的50余篇文献，涵盖多语言性能不平等、跨语言迁移、文化感知评估、文化对齐、多模态本土知识建模、基准测试设计批判以及社区扎根的数据实践等议题。纵观这些研究，训练数据覆盖范围仍是性能的关键决定因素，但仅此并不足够：分词方式、提示语言、翻译式基准测试设计、文化特异性监督以及多模态语境均对结果产生实质性影响。近期关于Global-MMLU、CDEval、WorldValuesBench、CulturalBench、CULEMO、CulturalVQA、GIMMICK、DRISHTIKON、WorldCuisines、CARE、CLCA等工作的研究，以及对基准测试设计和社区扎根评估的新近批判表明，即使强大的多语言模型仍可能抹平地方性规范、误读文化根植性线索，并在资源匮乏或社区特定场景中表现欠佳。我们认为，领域研究应从将语言视为基准测试表格中孤立条目的现状，转向对传播生态系统的建模——即语言使用所依托的制度、文字系统、翻译流程、领域、模态与社区。基于此，我们提出以文化扎根的自然语言处理为核心的研究议程，重点关注丰富的情境元数据、文化分层评估、参与式对齐、语言内部变异以及多模态社区感知设计。

Abstract

Recent progress in multilingual NLP is often taken as evidence of broader global inclusivity, but a growing literature shows that multilingual capability and cultural competence come apart. This paper synthesizes over 50 papers from 2020–2026 spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal local-knowledge modeling, benchmark design critiques, and community-grounded data practices. Across this literature, training data coverage remains a strong determinant of performance, yet it is not sufficient: tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context all materially affect outcomes. Recent work on Global-MMLU, CDEval, WorldValuesBench, CulturalBench, CULEMO, CulturalVQA, GIMMICK, DRISHTIKON, WorldCuisines, CARE, CLCA, and newer critiques of benchmark design and community-grounded evaluation shows that strong multilingual models can still flatten local norms, misread culturally grounded cues, and underperform in lower-resource or community-specific settings. We argue that the field should move from treating languages as isolated rows in a benchmark spreadsheet toward modeling communicative ecologies: the institutions, scripts, translation pipelines, domains, modalities, and communities through which language is used. On that basis, we propose a research agenda for culturally grounded NLP centered on richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal community-aware design.

Keywords: culturally grounded NLP, multilingual NLP, cultural competence, benchmark design, cross-lingual transfer, cultural alignment, community-grounded evaluation, participatory alignment

114. ❌ Detailed Geometry and Appearance from Opportunistic Motion

Authors: Ryosuke Hirai, Kohei Yamashita, Antoine Guédon, Ryo Kawahara, Vincent Lepetit, Ko Nishino Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26665v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

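Scoring rationale: The paper focuses on 3D reconstruction in computer vision, using techniques such as 2D Gaussian splatting and spherical harmonics to recover object geometry and appearance from sparse fixed cameras. All scoring keywords concern large models, deep learning principles, or specific AI application areas (such as bioinformatics), whereas this paper belongs to traditional computer vision and does not involve large models, deep learning techniques, or AI for Science applications, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes a technique that exploits the virtual viewpoints provided by object motion, using joint pose and shape optimization and a novel appearance model, to recover more accurate 3D geometry and appearance from sparse fixed cameras.

Abstract (Chinese translation)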

从稀疏固定相机集合中重建三维几何与外观是一项基础性任务，具有广泛的应用前景，但其本质上受限于有限的观测视角。我们证明，通过利用物体在操作过程中的偶然运动可以突破这一限制：当人操纵物体时（例如移动椅子或抬起杯子），静态相机在物体的局部坐标系中实质上实现了对该物体的“环绕”观测，从而提供了额外的虚拟视角。然而，利用这种物体运动面临两大挑战：物体姿态与几何估计的紧密耦合问题，以及静态光照条件下运动物体外观的复杂变化问题。我们通过以下方式解决这些挑战：首先，采用二维高斯泼溅技术，通过交替优化六自由度轨迹与基元参数，构建了姿态与形状的联合优化框架；其次，提出了一种新颖的外观模型，该模型在球谐函数空间内通过反射方向探针分解了漫反射与镜面反射分量。在视角极度稀疏的合成与真实数据集上进行的大量实验表明，相较于现有先进基线方法，我们的方法能够恢复出显著更精确的几何结构与外观。

Abstract

Reconstructing 3D geometry and appearance from a sparse set of fixed cameras is a foundational task with broad applications, yet it remains fundamentally constrained by the limited viewpoints. We show that this bound can be broken by exploiting opportunistic object motion: as a person manipulates an object (e.g., moving a chair or lifting a mug), the static cameras effectively “orbit” the object in its local coordinate frame, providing additional virtual viewpoints. Harnessing this object motion, however, poses two challenges: the tight coupling of object pose and geometry estimation and the complex appearance variations of a moving object under static illumination. We address these by formulating a joint pose and shape optimization using 2D Gaussian splatting with alternating minimization of 6DoF trajectories and primitive parameters, and by introducing a novel appearance model that factorizes diffuse and specular components with reflected directional probing within the spherical harmonics space. Extensive experiments on synthetic and real-world datasets with extremely sparse viewpoints demonstrate that our method recovers significantly more accurate geometry and appearance than state-of-the-art baselines.

Keywords: 3D reconstruction, geometry and appearance, opportunistic motion, 2D Gaussian splatting, 6DoF trajectory, spherical harmonics, sparse viewpoints, joint pose and shape optimization
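
The claim that static cameras effectively orbit a moving object follows from rigid-body geometry: expressing a fixed camera in the object's local frame applies the inverse of the object's pose. The sketch below illustrates this with one camera and an object rotated in place. The poses, helper names, and numbers are illustrative assumptions, not values from the paper.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def pose(rotation, translation):
    """Build a 4x4 rigid transform mapping object-local coordinates to world coordinates."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def camera_in_object_frame(T_world_obj, cam_center_world):
    """Express a world-space camera center in the object's local frame."""
    cam_h = np.append(cam_center_world, 1.0)
    return (np.linalg.inv(T_world_obj) @ cam_h)[:3]

cam = np.array([2.0, 0.0, 0.0])            # one static camera, 2 m from the object
angles = np.deg2rad([0, 90, 180, 270])     # the object is turned in place by a person
virtual = [camera_in_object_frame(pose(rot_z(a), np.zeros(3)), cam) for a in angles]
```

A single fixed camera yields four distinct viewpoints at the same 2 m radius around the object, which is why recovering the object's 6DoF trajectory matters as much as recovering its shape.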

115. ❌ GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation

Authors: Nicolas von Lützow, Barbara Rössle, Katharina Schmid, Matthias Nießner Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26661v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

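Scoring rationale: The paper studies a fully autoregressive approach to 3D Gaussian scene generation, using a Transformer-based model for next-token prediction. Although it involves the Transformer architecture, it focuses on the specific application of 3D generative modeling rather than on large language models (LLMs) or innovations in deep learning principles. All keywords target LLM-related techniques, training methods, inference optimization, alignment, agent systems, and so on, and have no direct connection to the paper's 3D generation topic. The paper does not involve AI applications in science (such as bioinformatics), nor does it discuss applications of large models across research domains. Every keyword therefore has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes GaussianGPT, a Transformer-based autoregressive model that directly generates 3D Gaussians via next-token prediction, enabling controllable and context-aware 3D scene generation as a paradigm complementary to diffusion or flow-matching methods.

Abstract (Chinese translation)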

三维生成建模的最新进展主要依赖于扩散模型或流匹配框架。本文则探索了一种完全自回归的替代方案，提出了GaussianGPT——一种基于Transformer的模型，它通过下一令牌预测直接生成三维高斯分布，从而实现了完整三维场景的生成。我们首先使用带有向量量化的稀疏三维卷积自编码器，将高斯基元压缩为离散的潜在网格。生成的令牌经过序列化后，由具备三维旋转位置编码的因果Transformer进行建模，从而能够顺序生成空间结构与外观信息。与基于扩散模型的方法对场景进行整体优化不同，我们的框架通过逐步构建场景，天然支持场景补全、外延绘制、通过温度参数进行可控采样以及灵活的生成范围设定。该框架充分利用了自回归建模的组合归纳偏置与可扩展性优势，同时操作与现代神经渲染管线兼容的显式表示，从而将自回归Transformer定位为可控且具备上下文感知能力的三维生成范式的补充路径。

Abstract

Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embedding, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step-by-step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.

Keywords: 3D Gaussian generation, autoregressive modeling, transformer-based model, next-token prediction, 3D scene generation, causal transformer, vector quantization, neural rendering
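
"Controllable sampling via temperature" refers to the standard practice of dividing next-token logits by a temperature before the softmax. The sketch below shows that mechanism in an autoregressive loop. `toy_logits` is a hypothetical stand-in for the causal transformer, and the four-token vocabulary is invented; none of this code comes from GaussianGPT.

```python
import numpy as np

def token_probs(logits, temperature=1.0):
    """Softmax over logits scaled by 1/temperature (numerically stabilized)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()
    exp = np.exp(scaled)
    return exp / exp.sum()

def sample_sequence(next_logits, length, temperature, rng):
    """Autoregressively sample `length` tokens; next_logits(prefix) returns logits."""
    tokens = []
    for _ in range(length):
        probs = token_probs(next_logits(tokens), temperature)
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

def toy_logits(prefix):
    # Hypothetical model: favors repeating the previous token (token 0 at the start).
    logits = np.zeros(4)
    logits[prefix[-1] if prefix else 0] = 2.0
    return logits

rng = np.random.default_rng(0)
cold = token_probs([2.0, 0.0, 0.0, 0.0], temperature=0.1)   # near-deterministic
hot = token_probs([2.0, 0.0, 0.0, 0.0], temperature=10.0)   # near-uniform
scene_tokens = sample_sequence(toy_logits, length=8, temperature=0.5, rng=rng)
```

Low temperatures concentrate probability on the top token and give conservative outputs, while high temperatures flatten the distribution and increase diversity. Because generation is sequential, a partial scene can be passed as the prefix to continue it, which is how completion and outpainting fit the same loop.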

116. ❌ Zero-Shot Depth from Defocus

Authors: Yiming Zuo, Hongyu Wen, Venkat Subramanian, Patrick Chen, Karhan Kayan, Mario Bijelic, Felix Heide, Jia Deng Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26658v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

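Scoring rationale: The paper studies depth estimation in computer vision (Depth from Defocus), proposing a new benchmark dataset, ZEDD, and a Transformer architecture, FOSSA. Its content focuses entirely on a specific computer vision task and does not involve large language models, innovations in deep learning principles, or AI applications in science. All keywords relate to large models, deep learning principles, or scientific AI applications and have no connection to this computer vision work.

!!! tip deepseek-chat TL;DR

The paper proposes ZEDD, a new benchmark dataset for zero-shot depth from defocus, and FOSSA, a Transformer architecture, reducing errors by up to 55.7% relative to baselines on depth estimation.

Abstract (Chinese translation)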

离焦深度估计（Depth from Defocus, DfD）是指从聚焦堆栈中估计密集度量深度图的任务。与以往研究常过度拟合特定数据集不同，本文聚焦于具有挑战性且实用的零样本泛化场景。我们首先提出了一个新的真实世界DfD基准数据集ZEDD，与先前基准相比，其包含的场景数量增加了8.3倍，且图像质量与真实深度图质量显著更高。我们还设计了一种名为FOSSA的新型网络架构。FOSSA是一种基于Transformer的架构，其新颖设计专门针对DfD任务定制。核心贡献在于引入了具有聚焦距离嵌入的堆栈注意力层，该层能实现聚焦堆栈间的高效信息交换。最后，我们开发了一种新的训练数据生成流程，使我们能够利用现有的大规模RGBD数据集来生成合成聚焦堆栈。在ZEDD及其他基准上的实验结果表明，该方法相较于基线模型有显著提升，误差降低最高达55.7%。ZEDD基准数据集发布于https://zedd.cs.princeton.edu。代码与模型检查点发布于https://github.com/princeton-vl/FOSSA。

Abstract

Depth from Defocus (DfD) is the task of estimating a dense metric depth map from a focus stack. Unlike previous works overfitting to a certain dataset, this paper focuses on the challenging and practical setting of zero-shot generalization. We first propose a new real-world DfD benchmark ZEDD, which contains 8.3x more scenes and significantly higher quality images and ground-truth depth maps compared to previous benchmarks. We also design a novel network architecture named FOSSA. FOSSA is a Transformer-based architecture with novel designs tailored to the DfD task. The key contribution is a stack attention layer with a focus distance embedding, allowing efficient information exchange across the focus stack. Finally, we develop a new training data pipeline allowing us to utilize existing large-scale RGBD datasets to generate synthetic focus stacks. Experiment results on ZEDD and other benchmarks show a significant improvement over the baselines, reducing errors by up to 55.7%. The ZEDD benchmark is released at https://zedd.cs.princeton.edu. The code and checkpoints are released at https://github.com/princeton-vl/FOSSA.

Keywords: Depth from Defocus, zero-shot generalization, Transformer architecture, focus stack, depth estimation, benchmark dataset, computer vision, FOSSA
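The abstract mentions generating synthetic focus stacks from RGBD data but does not say how. As a rough illustration only, the sketch below uses the standard thin-lens circle-of-confusion formula to compute a per-pixel blur diameter for each focus distance. The camera parameters, the function names, and the choice of this formula are assumptions and are not taken from the FOSSA paper.

```python
def coc_diameter(depth_m, focus_m, focal_len_m=0.05, f_number=2.0):
    """Thin-lens circle-of-confusion diameter, in metres on the sensor.

    Assumed camera: a 50 mm lens at f/2 (illustrative values only).
    """
    aperture_m = focal_len_m / f_number  # aperture diameter A = f / N
    return (aperture_m * focal_len_m * abs(depth_m - focus_m)
            / (depth_m * (focus_m - focal_len_m)))


def blur_maps(depth_map, focus_distances):
    """One blur-diameter map per focus distance, for a 2D list of depths."""
    return [
        [[coc_diameter(d, s) for d in row] for row in depth_map]
        for s in focus_distances
    ]


# Toy 2x2 depth map (metres) and a three-slice focus stack.
depth = [[1.0, 2.0], [4.0, 8.0]]
stack = blur_maps(depth, focus_distances=[1.0, 2.0, 4.0])
```

Each map could then drive a spatially varying blur of the RGB image to render one slice of a synthetic focus stack; pixels whose depth equals the focus distance get zero blur.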

117. ❌ Tunable Soft Equivariance with Guarantees

Authors: Md Ashiqur Rahman, Lim Jun Hao, Jeremiah Jiang, Teck-Yian Lim, Raymond A. Yeh Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26657v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

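Scoring rationale: The paper studies equivariance in computer vision and proposes a general framework that builds soft equivariant models by projecting model weights into a designed subspace, applied to pre-trained architectures such as ViT and ResNet. All scoring keywords focus on large language models (LLMs) and related techniques (training methods, inference optimization, application scenarios, and so on), whereas this paper does not touch on language models, natural language processing, or large-model techniques. It is purely a study of computer vision and machine learning theory and methods, and is unrelated to every keyword.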

!!! tip deepseek-chat TL;DR

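The paper proposes a general framework that builds soft equivariant models by projecting weights into a designed subspace, improving performance and reducing equivariance error across multiple pre-trained backbones and tasks.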

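Abstract (Translation)

Equivariance is a fundamental property of computer vision models, yet strict equivariance conditions are rarely met in real-world data, which can limit model performance. Controlling the degree of equivariance is therefore valuable. We propose a general framework that constructs soft equivariant models by projecting model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. We validate the method on multiple pre-trained backbones (including ViT and ResNet) across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, on the competitive ImageNet benchmark, our method improves performance while reducing equivariance error.

Abstract (Original)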

Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model’s performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves the performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.

Keywords: soft equivariance, equivariance error, model weights projection, pre-trained architectures, theoretical guarantees, image classification, semantic segmentation, human-trajectory prediction
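To make the idea of projecting weights into a subspace concrete, the following minimal sketch uses a toy setting: a square linear layer and the cyclic-shift group. Averaging the weight matrix over the group gives its projection onto the equivariant (circulant) subspace, and a blending factor `lam` interpolates between the original and the projected weights. The blending rule, the choice of group, and all function names are assumptions made for illustration; the paper's actual subspace design and error bounds are not reproduced here.

```python
import numpy as np


def cyclic_shift(n):
    """Permutation matrix that shifts a length-n vector by one position."""
    return np.roll(np.eye(n), 1, axis=0)


def project_equivariant(weight, shift):
    """Average S^-k W S^k over the cyclic group (projection onto circulant matrices)."""
    n = weight.shape[0]
    projected = np.zeros_like(weight)
    power = np.eye(n)
    for _ in range(n):
        # The inverse of a permutation matrix is its transpose.
        projected += power.T @ weight @ power
        power = power @ shift
    return projected / n


def soften(weight, shift, lam):
    """Hypothetical soft projection: lam=0 keeps W, lam=1 is fully equivariant."""
    return (1.0 - lam) * weight + lam * project_equivariant(weight, shift)


def equivariance_error(weight, shift):
    """Frobenius norm of the commutator W S - S W."""
    return float(np.linalg.norm(weight @ shift - shift @ weight))


rng = np.random.default_rng(0)
S = cyclic_shift(6)
W = rng.normal(size=(6, 6))
errors = {lam: equivariance_error(soften(W, S, lam), S) for lam in (0.0, 0.5, 1.0)}
```

In this toy setup the commutator of the projected matrix vanishes, so the equivariance error scales linearly with `1 - lam`, which gives a single knob for trading strict equivariance against fidelity to the pre-trained weights.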

118. ❌ Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision

Authors: Ling Li, Bowen Liu, Zinuo Zhan, Peng Jie, Jianhui Zhong, Kenglun Chang, Zhidong Deng Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26646v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

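Scoring rationale: The paper mainly studies multimodal referring expression comprehension in Visual Grounding; its core innovation is multimodal reasoning that combines gesture (hand pointing) with language. Relevance to the keywords: (1) Highly relevant (8-10 points): the paper explicitly mentions Multimodal Large Language Models (MLLMs) and Chain of Thought (CoT) reasoning, so "Large Language Models" and "Chain of Thought" are highly relevant; the paper also aims to improve agents' ability to understand physical intent, which relates to "LLM Agents". (2) Moderately relevant (5 points): the paper aims to resolve semantic ambiguity, which has some connection to "Hallucination Mitigation"; the proposed structured reasoning process involves in-depth reasoning and is partly related to "System 2 Thinking". (3) Irrelevant (0 points): other keywords such as MoE, quantization, and RAG are not covered and are unrelated to the paper's topic.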
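The score table for this entry lists non-zero relevance for several keywords but a total of 0.0, because every weight in the table is 0.0. A minimal sketch of the arithmetic follows, assuming that each keyword's score is weight × relevance and that the passing threshold follows the formula in the report's configuration section (5.0 + 0.8 × total weight). The product rule is inferred from the table and is not stated in the report; the function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold from the report's configuration section."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum of weight * relevance over (weight, relevance) pairs (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# The five non-zero relevance rows of this entry, with the weights as listed.
rows_as_listed = [(0.0, 8.0), (0.0, 10.0), (0.0, 5.0), (0.0, 8.0), (0.0, 5.0)]

threshold = passing_score(27.0)      # 27 keywords configured with weight 1.0 each
score = paper_score(rows_as_listed)  # every weight is 0.0 in this table
passed = score >= threshold
```

Under the same assumption, a weight of 1.0 per keyword would give these five rows a combined 36.0, so the zero total here comes from the listed weights rather than from the relevance judgments.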

!!! tip deepseek-chat TL;DR

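Addressing the problem that traditional visual grounding methods rely on textual descriptions and ignore non-verbal deictic cues, the paper proposes EgoPoint-Ground, the first large-scale egocentric deictic visual grounding dataset, and SV-CoT, a framework based on a Visual Chain-of-Thought that performs structured reasoning over gestural and linguistic cues, achieving an 11.7% absolute improvement on the benchmark.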

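Abstract (Translation)

Traditional visual grounding methods rely mainly on textual descriptions to localize objects, a paradigm that is inherently limited by linguistic ambiguity and often ignores the non-verbal deictic cues that are pervasive in real-world interaction. In natural egocentric interaction, hand pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we propose EgoPoint-Ground, the first large-scale multimodal dataset designed for egocentric deictic visual grounding. The dataset contains over 15k interactive samples in complex scenes and provides rich, multi-grained annotations, including hand-target bounding box pairs and dense semantic descriptions. We establish a comprehensive benchmark for resolving hand-pointing referring expressions, evaluating a wide range of mainstream multimodal large language models and advanced visual grounding architectures. In addition, we propose SV-CoT, a novel baseline framework that recasts grounding as a structured reasoning process, jointly integrating gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments show that SV-CoT achieves an 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and improving agents' ability to understand multimodal physical intent. The dataset and code will be made publicly available.

Abstract (Original)

Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over 15k interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intents. The dataset and code will be made publicly available.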

Keywords: Visual Grounding, Egocentric Vision, Multimodal Large Language Models, Hand Pointing, Deictic Cues, Chain of Thought, Semantic Ambiguity, Agent Comprehension

119. ❌ Drive-Through 3D Vehicle Exterior Reconstruction via Dynamic-Scene SfM and Distortion-Aware Gaussian Splatting

Authors: Nitin Kulkarni, Akhil Devarashetti, Charlie Cluss, Livio Forte, Philip Schneider, Chunming Qiao, Alina Vereshchaka Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26638v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

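Scoring rationale: The paper focuses on computer vision and 3D reconstruction, proposing an end-to-end pipeline for reconstructing vehicle exteriors in dynamic scenes that involves instance segmentation (SAM 3), motion gating, robust feature matching (RoMa v2), Structure from Motion (SfM), and 3D Gaussian Splatting (3D-GS). Its content is entirely unrelated to the scoring keywords, all of which center on large models, deep learning principles, and their applications in science; it does not touch on large language models, model training, alignment, reasoning, agents, compression, or AI for Science.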

!!! tip deepseek-chat TL;DR

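The paper tackles high-fidelity 3D reconstruction of dynamically moving vehicles in cluttered dealership drive-throughs. Its end-to-end pipeline combines instance segmentation, motion gating, robust feature matching, Structure from Motion, and distortion-aware 3D Gaussian Splatting, producing inspection-grade interactive 3D models with a PSNR 3.85 dB higher than standard 3D-GS.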

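Abstract (Translation)

High-fidelity 3D reconstruction of vehicle exteriors can increase buyer confidence in online automotive marketplaces, but producing such models in cluttered dealership drive-throughs poses severe technical challenges. Unlike static-scene photogrammetry, this setting features a dynamic vehicle moving in front of a heavily cluttered static background. Wide-angle lens distortion, highly reflective automotive paint, and non-rigid wheel rotation that violates classical epipolar constraints further compound the problem. We propose an end-to-end pipeline that uses a two-pillar camera rig. First, we resolve dynamic-scene ambiguity by coupling SAM 3 instance segmentation with motion gating, cleanly isolating the moving vehicle and explicitly masking the non-rigid wheels to enforce strict epipolar geometry. Second, we extract robust feature correspondences on the raw, distorted 4K images using the RoMa v2 learned matcher guided by semantic confidence masks. Third, these matches are integrated into a rig-aware Structure from Motion (SfM) optimization that uses CAD-derived relative pose priors to eliminate scale drift. Finally, we use a distortion-aware 3D Gaussian Splatting framework (3DGUT) combined with a stochastic Markov Chain Monte Carlo (MCMC) densification strategy to render reflective surfaces. An evaluation on 25 real vehicles at 10 dealerships shows that our full pipeline achieves a peak signal-to-noise ratio (PSNR) of 28.66 dB, a structural similarity index (SSIM) of 0.89, and a learned perceptual image patch similarity (LPIPS) of 0.21 on held-out views, a 3.85 dB improvement over standard 3D Gaussian Splatting (3D-GS), producing inspection-ready interactive 3D models without a controlled studio facility.

Abstract (Original)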

High-fidelity 3D reconstruction of vehicle exteriors improves buyer confidence in online automotive marketplaces, but generating these models in cluttered dealership drive-throughs presents severe technical challenges. Unlike static-scene photogrammetry, this setting features a dynamic vehicle moving against heavily cluttered, static backgrounds. This problem is further compounded by wide-angle lens distortion, specular automotive paint, and non-rigid wheel rotations that violate classical epipolar constraints. We propose an end-to-end pipeline utilizing a two-pillar camera rig. First, we resolve dynamic-scene ambiguities by coupling SAM 3 for instance segmentation with motion-gating to cleanly isolate the moving vehicle, explicitly masking out non-rigid wheels to enforce strict epipolar geometry. Second, we extract robust correspondences directly on raw, distorted 4K imagery using the RoMa v2 learned matcher guided by semantic confidence masks. Third, these matches are integrated into a rig-aware SfM optimization that utilizes CAD-derived relative pose priors to eliminate scale drift. Finally, we use a distortion-aware 3D Gaussian Splatting framework (3DGUT) coupled with a stochastic Markov Chain Monte Carlo (MCMC) densification strategy to render reflective surfaces. Evaluations on 25 real-world vehicles across 10 dealerships demonstrate that our full pipeline achieves a PSNR of 28.66 dB, an SSIM of 0.89, and an LPIPS of 0.21 on held-out views, representing a 3.85 dB improvement over standard 3D-GS, delivering inspection-grade interactive 3D models without controlled studio infrastructure.

Keywords: 3D reconstruction, vehicle exterior, dynamic scene, Structure from Motion (SfM), 3D Gaussian Splatting, instance segmentation, wide-angle lens distortion, automotive marketplace
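The headline numbers in this entry are PSNR values in decibels. As a reminder of what that metric measures, here is a minimal sketch of the usual PSNR definition, plus a helper that converts a dB gain into the corresponding reduction in mean squared error. The pixel values are made up, and the peak value of 1.0 assumes images normalized to [0, 1]; the paper's evaluation code is not shown in the abstract.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equally sized flat pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


# Toy pixels in [0, 1]; two of the four are off by 0.1.
ref = [0.0, 0.5, 1.0, 0.25]
out = [0.1, 0.5, 0.9, 0.25]
value_db = psnr(ref, out)
ratio = mse_ratio_for_gain(3.85)
```

Under this definition, and assuming the same peak value in both measurements, the reported 3.85 dB gain corresponds to a mean squared error roughly 2.4 times lower than the standard 3D-GS baseline.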

120. ❌ VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Authors: Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, Marta Tintore Gazulla Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26599v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

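Scoring rationale: The paper focuses on improving geometric consistency in video generation. Its core is the VGGRPO framework, which involves video diffusion models, geometry foundation models, and reinforcement learning optimization. It is unrelated to most keywords because it does not involve language models, reasoning methods, alignment techniques, and the like. The only highly relevant keyword is "World Models AND General World Models" (10 points), because the paper explicitly uses a geometry foundation model as a world model to decode scene geometry. "Post-training OR Supervised Fine-tuning OR SFT" receives 5 points because the paper mentions post-training, though this is not central. All other keywords receive 0 points, since the paper's topic is video generation rather than language models or related techniques.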

!!! tip deepseek-chat TL;DR

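The paper proposes the VGGRPO framework, which introduces a Latent Geometry Model and reinforcement learning optimization to address geometric inconsistency in video diffusion models, yielding more stable and consistent high-quality video generation.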

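Abstract (Translation)

Large-scale video diffusion models achieve remarkable visual quality but often struggle to maintain geometric consistency. Existing methods typically improve consistency by adding extra modules to the generator or by applying geometry-aware alignment. However, architectural modifications can harm the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards, which require repeated VAE decoding, incur heavy computational overhead, and generalize poorly to highly dynamic real-world scenes. To improve geometric consistency while preserving the capabilities of the pretrained model, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that connects video diffusion latents to a geometry foundation model, so that scene geometry can be decoded directly from the latent space. By building the LGM from a geometry model with 4D reconstruction capability, VGGRPO extends naturally to dynamic scenes, overcoming the static-scene limitation of earlier methods. On this basis, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on static and dynamic benchmarks show that VGGRPO improves camera stability, geometric consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement learning an efficient and flexible approach to world-consistent video generation.

Abstract (Original)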

Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.

Keywords: video generation, geometric consistency, diffusion models, latent space, reinforcement learning, world models, 4D reconstruction, post-training
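Group Relative Policy Optimization scores each sample relative to the other samples drawn for the same prompt. The sketch below shows that normalization step applied to a combination of the two rewards named in the abstract. The equal weighting, the reward values, and the function names are assumptions for illustration; the abstract does not say how VGGRPO combines or scales its rewards.

```python
import statistics


def combined_reward(smoothness, reprojection, w_smooth=0.5, w_reproj=0.5):
    """Weighted sum of the two rewards (the equal weights are an assumption)."""
    return w_smooth * smoothness + w_reproj * reprojection


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled videos for one prompt: hypothetical (smoothness, reprojection) rewards.
samples = [(0.9, 0.8), (0.4, 0.5), (0.7, 0.9), (0.2, 0.3)]
rewards = [combined_reward(s, g) for s, g in samples]
advantages = group_relative_advantages(rewards)
```

Samples with above-average combined reward receive positive advantages and are reinforced, and the group mean serves as the baseline, so no separate value network is needed for this step.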

121. ❌ From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning

Authors: Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Xilin Zhao, Qingming Huang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26597v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

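Scoring rationale: The paper studies image-to-video representation transfer learning, which belongs to computer vision and is not a direct innovation in large language models or deep learning principles. Relevance to the keywords: (1) "Pre-training OR Continual Pre-training OR Domain Adaptation" receives 5 points, because the paper involves transferring pre-trained models and domain adaptation (image to video). (2) "Post-training OR Supervised Fine-tuning OR SFT" receives 5 points, because the paper discusses fine-tuning and its trade-offs. (3) "PEFT OR LoRA OR Parameter-efficient Fine-tuning" receives 5 points, because the paper proposes a lightweight projection layer for parameter-efficient adaptation. All other keywords concern large language models, reasoning, alignment, AI-for-science applications, and the like, are unrelated, and receive 0 points.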

!!! tip deepseek-chat TL;DR

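Addressing the trade-off between intra-video temporal consistency and inter-video semantic separability that arises when image-pretrained models are transferred to video tasks, the paper proposes the Co-Settle framework, which uses a lightweight projection layer and self-supervised training to achieve improvements on multiple video tasks.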

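Abstract (Translation)

Recent work has made notable progress in video representation learning by transferring image-pretrained models to video tasks, an approach that typically relies on complex temporal modules and video fine-tuning. However, fine-tuning heavy modules can compromise inter-video semantic separability, the core ability to distinguish objects across different videos. Reducing the tunable parameters, in turn, weakens intra-video temporal consistency, which is necessary for the same object to obtain a stable representation within a video. This dilemma reveals a possible trade-off between intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning framework, which applies a lightweight projection layer on top of a frozen image-pretrained encoder and adjusts the representation space through a temporal cycle consistency objective and a semantic separability constraint. We further provide theoretical support showing that, under appropriate conditions, the optimized projection achieves a better trade-off between the two properties. Experiments on eight image-pretrained models show that, with only five epochs of self-supervised training, the framework delivers consistent gains across video tasks at multiple levels. The code is released at https://github.com/yafeng19/Co-Settle.

Abstract (Original)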

Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos. While reducing the tunable parameters hinders their intra-video temporal consistency, which is required for stable representations of the same object within a video. This dilemma indicates a potential trade-off between the intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide a theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available at https://github.com/yafeng19/Co-Settle.

Keywords: video representation learning, image-to-video transfer, self-supervised learning, temporal consistency, semantic separability, parameter-efficient fine-tuning, transfer learning, lightweight projection
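The abstract describes a lightweight projection on top of a frozen encoder trained with a temporal cycle consistency objective. One common way to express such an objective is a round-trip soft matching between the patch features of two frames, sketched below with NumPy. This particular loss, the temperature, and the function names are assumptions and may differ from the Co-Settle formulation; the semantic separability constraint is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cycle_consistency_loss(feat_a, feat_b, proj, temperature=0.1):
    """feat_a, feat_b: (n, d) frozen-encoder features of two frames; proj: (d, k)."""
    za = feat_a @ proj
    zb = feat_b @ proj
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    a_to_b = softmax(za @ zb.T / temperature)  # soft match frame A -> frame B
    b_to_a = softmax(zb @ za.T / temperature)  # soft match frame B -> frame A
    round_trip = a_to_b @ b_to_a               # probability of returning to the start
    return float(-np.log(np.diag(round_trip) + 1e-12).mean())


n, d = 4, 4
frame_a = np.eye(n, d)          # four well-separated toy patch features
proj = np.eye(d)                # identity stands in for the learned projection
loss_aligned = cycle_consistency_loss(frame_a, frame_a, proj)
collapsed = np.ones((n, d))     # second frame with identical features everywhere
loss_collapsed = cycle_consistency_loss(frame_a, collapsed, proj)
```

In training, only `proj` would be updated while the encoder stays frozen; the loss is near zero when each patch reliably returns to itself and rises to log(n) when the second frame's features carry no distinguishing information.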

122. ❌ The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding

Authors: Gillian Rosenberg, Skylar Stadhard, Bruce C. Hansen, Michelle R. Greene Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26589v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

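Scoring rationale: The paper studies how vision-language models (VLMs) perform on scene understanding tasks, in particular their gap from humans on affordance tasks. Although it involves large models (VLMs are multimodal large models), the paper's emphasis is on evaluating model capabilities, analyzing data bias, and cognitive science questions, not on innovations in large-model principles or improvements to specific techniques. All keywords target specific techniques, training methods, optimization techniques, or application areas of large language models (LLMs), whereas this paper's subject is vision-language models (VLMs), and it does not involve any of the specific techniques named in the keywords (such as MoE, Scaling Laws, RLHF, or PEFT) or application scenarios (such as AI for Science). All keyword relevance scores are therefore 0.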

!!! tip deepseek-chat TL;DR

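By comparing 18 vision-language models with more than 2000 human observers on 15 high-level scene understanding tasks, the paper finds that distributional learning from images and text alone is insufficient for affordance-based scene understanding, suggesting that some dimensions of human visual cognition require agent-centered, three-dimensional experience.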

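Abstract (Translation)

What information is sufficient to acquire the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired image-text corpora but lack embodied experience, which makes them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs with those of more than 2000 human observers on 15 high-level scene understanding tasks covering general knowledge, affordances, sensory experience, affective responses, and future prediction. Because many tasks lack ground-truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures the similarity between VLM output and the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks but showed a marked deficit on affordance tasks that could not be removed by prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for this affordance gap and found that the deficit is structural rather than stylistic and is not resolved by providing explicit spatial information. Corpus analyses show that agent-addressed affordance language is sparse in image captioning datasets, consistent with Gricean accounts of why embodied knowledge is systematically underrepresented in language. Taken together, these findings suggest that distributional learning from images and text alone is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.

Abstract (Original)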

What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks, but showed a robust deficit for affordance tasks that resisted prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for explaining this affordance gap, finding that the deficit was structural rather than stylistic and was not resolved by providing explicit spatial information. Corpus analyses revealed that image captioning datasets contain sparse agent-addressed affordance language, consistent with Gricean accounts of why embodied knowledge may be systematically underrepresented in language. Together, these findings suggest that distributional learning from images and text is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.

Keywords: Vision-Language Models, Embodied Scene Understanding, Affordance Tasks, Human-Calibrated Cosine Distance, Distributional Hypothesis, Corpus Analysis, Visual Cognition, Agent-centered Experience
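The abstract describes the Human-Calibrated Cosine Distance only in words: a measure of how close a model's output is to the distribution of human responses, scaled by within-human variability. One possible reading is a ratio of the mean model-to-human cosine distance to the mean human-to-human cosine distance, sketched below. This ratio form, the toy two-dimensional embeddings, and the function names are assumptions; the paper's exact definition is not given in the abstract.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def human_calibrated_distance(model_vec, human_vecs):
    """Mean model-to-human distance divided by mean human-to-human distance."""
    model_to_human = sum(cosine_distance(model_vec, h) for h in human_vecs) / len(human_vecs)
    pairs = list(combinations(human_vecs, 2))
    within_human = sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)
    return model_to_human / within_human


# Toy response embeddings: two human answers that differ, and two model answers.
humans = [[1.0, 0.0], [0.0, 1.0]]
close = human_calibrated_distance([1.0, 1.0], humans)    # lies between the humans
far = human_calibrated_distance([-1.0, -1.0], humans)    # points away from both
```

With this reading, a value below 1 means the model's answer sits closer to human answers than humans typically sit to one another, and a value above 1 means it falls outside the normal spread of human responses.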

123. ❌ From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion

Authors: Dávid Pukanec, Tibor Kubík, Michal Španěl Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26588v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

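Scoring rationale: The paper studies patient-specific dental crown completion with a diffusion model (ToothCraft), an application of AI in biomedicine (dentistry). All keywords relate to large language models (LLMs) or general deep learning principles, whereas this paper focuses on diffusion models for 3D shapes and does not involve LLM-related techniques such as LLMs, MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, context extension, inference acceleration, hallucination mitigation, interpretability, world models, model merging, or in-context learning. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because dentistry belongs to the biomedical field; however, the paper does not explicitly mention bioinformatics or cheminformatics, so it receives 5 points (some relevance).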

!!! tip deepseek-chat TL;DR

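The paper proposes ToothCraft, a diffusion-based system trained on synthetic data to complete patient-specific dental crowns automatically; it reaches 81.8% IoU and a Chamfer Distance of 0.00034 on synthetic tests and applies effectively to real cases.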

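Abstract (Translation)

This paper presents ToothCraft, a diffusion-based method for the contextual generation of tooth crowns, trained on artificially constructed incomplete teeth. Building on recent advances in conditional diffusion models for 3D shape generation, we developed a model that automatically completes tooth crowns based on the local anatomical context. To address the scarcity of training data for this task, we designed a data augmentation pipeline that generates incomplete tooth geometries from publicly available datasets of complete dental arches (3DS, ODD). By synthesizing diverse training samples, our method learns robustly across a wide range of tooth defects. Experimental results show that our model is highly capable at crown reconstruction, reaching an intersection over union (IoU) of 81.8% and a Chamfer Distance (CD) of 0.00034 on synthetically damaged test restorations. The experiments demonstrate that the model can be applied directly to real clinical cases, effectively repairing incomplete teeth, while the generated crowns overlap minimally with the opposing dentition, which reduces the risk of occlusal interference. The code, model weights, and dataset information are available at: https://github.com/ikarus1211/VISAPP_ToothCraft

Abstract (Original)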

We present ToothCraft, a diffusion-based model for the contextual generation of tooth crowns, trained on artificially created incomplete teeth. Building upon recent advancements in conditioned diffusion models for 3D shapes, we developed a model capable of an automated tooth crown completion conditioned on local anatomical context. To address the lack of training data for this task, we designed an augmentation pipeline that generates incomplete tooth geometries from a publicly available dataset of complete dental arches (3DS, ODD). By synthesising a diverse set of training examples, our approach enables robust learning across a wide spectrum of tooth defects. Experimental results demonstrate the strong capability of our model to reconstruct complete tooth crowns, achieving an intersection over union (IoU) of 81.8% and a Chamfer Distance (CD) of 0.00034 on synthetically damaged testing restorations. Our experiments demonstrate that the model can be applied directly to real-world cases, effectively filling in incomplete teeth, while generated crowns show minimal intersection with the opposing dentition, thus reducing the risk of occlusal interference. Access to the code, model weights, and dataset information will be available at: https://github.com/ikarus1211/VISAPP_ToothCraft

Keywords: diffusion model, dental crown completion, 3D shape generation, synthetic data, patient-specific restoration, tooth geometry, conditioned generation, dental AI
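The two metrics reported for ToothCraft, intersection over union and Chamfer Distance, can be sketched in a few lines. The version below computes IoU over sets of occupied voxel indices and a symmetric Chamfer Distance using mean squared nearest-neighbour distances. Chamfer conventions vary (squared or unsquared distances, sums or means, coordinate normalization), so this sketch is an assumption about the general form and its values are not directly comparable with the reported 0.00034.

```python
def voxel_iou(occ_a, occ_b):
    """IoU between two collections of occupied voxel indices."""
    a, b = set(occ_a), set(occ_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance with mean squared nearest-neighbour terms."""
    a_to_b = sum(min(_sq_dist(p, q) for q in points_b) for p in points_a) / len(points_a)
    b_to_a = sum(min(_sq_dist(q, p) for p in points_a) for q in points_b) / len(points_b)
    return a_to_b + b_to_a


# Toy shapes: the predicted and true crowns share two of four occupied voxels.
pred_voxels = {(0, 0, 0), (1, 0, 0), (1, 1, 0)}
true_voxels = {(1, 0, 0), (1, 1, 0), (2, 1, 0)}
iou = voxel_iou(pred_voxels, true_voxels)
cd = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```

IoU rewards volumetric overlap while Chamfer Distance penalizes surface points that sit far from the reference shape, which is why the two are usually reported together for shape completion.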

124. ❌ Scene Grounding In the Wild

Authors: Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, Hadar Averbuch-Elor Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26584v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

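Rationale: The paper addresses 3D scene reconstruction in computer vision. It proposes an alignment framework based on 3D Gaussian Splatting and semantic features and introduces the WikiEarth dataset. Its content covers 3D reconstruction, geometric alignment, the domain gap, and semantic features, but it does not touch on large language models, the deep learning techniques the keywords target, or AI for Science. All keywords concern large models, deep learning techniques, or scientific applications of AI, whereas this is purely a 3D reconstruction study, so every keyword receives a relevance of 0.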

!!! tip deepseek-chat TL;DR

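The paper tackles the global alignment problem that arises when large-scale 3D scenes are reconstructed from in-the-wild images with little or no overlap. Using pseudo-synthetic reference models obtained from Google Earth Studio and a semantic-feature-based optimization scheme, it keeps the 3D reconstructions globally aligned even without visual overlap.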

摘要翻译

从非结构化、真实世界图像中重建大规模场景的精确三维模型，仍然是计算机视觉领域的核心挑战，尤其是在输入视图之间几乎没有或完全没有重叠的情况下。在此类情况下，现有的重建流程通常会产生多个互不连接的局部重建结果，或者错误地将非重叠区域合并为重叠的几何结构。在本工作中，我们提出了一个框架，将每个局部重建结果锚定到场景的一个完整参考模型上，从而即使在缺乏视觉重叠的情况下也能实现全局一致的对齐。我们通过从谷歌地球工作室（Google Earth Studio）生成的密集、地理空间精确的伪合成渲染图中获取参考模型。这些渲染图提供了完整的场景覆盖，但其外观与现实世界照片存在显著差异。我们的核心见解是，尽管存在这种显著的领域差异，但两个领域共享相同的底层场景语义。我们使用三维高斯泼溅（3D Gaussian Splatting）来表示参考模型，并为每个高斯分布补充语义特征，进而将对齐问题表述为一个基于特征的逆向优化方案，该方案在保持参考模型固定的同时，估计一个全局的六自由度（6DoF）姿态和尺度。此外，我们引入了WikiEarth数据集，该数据集将现有的局部三维重建结果与伪合成参考模型进行了配准。我们证明，当使用各种经典和基于学习的流程进行初始化时，我们的方法能持续改进全局对齐效果，同时缓解了最先进的端到端模型的失败模式。所有代码和数据都将公开。

摘要 (Abstract)

Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code and data will be released.
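
A global 6DoF pose plus a scale factor amounts to a seven-parameter similarity transform, x' = s·R·x + t. The sketch below only shows how such a transform is applied to the points of a partial reconstruction. It does not reproduce the paper's feature-based optimization, and the axis-angle parameterisation and function names are assumptions made for illustration.

```python
import math


def rotation_from_axis_angle(axis, angle):
    """Rodrigues' formula: 3x3 rotation matrix for `angle` radians about `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    x, y, z = (a / norm for a in axis)
    c, s = math.cos(angle), math.sin(angle)
    k = 1.0 - c
    return [
        [c + x * x * k, x * y * k - z * s, x * z * k + y * s],
        [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
        [z * x * k - y * s, z * y * k + x * s, c + z * z * k],
    ]


def apply_similarity(points, scale, rotation, translation):
    """Map each 3D point p to scale * (rotation @ p) + translation."""
    out = []
    for p in points:
        rotated = [sum(rotation[i][j] * p[j] for j in range(3)) for i in range(3)]
        out.append(tuple(scale * rotated[i] + translation[i] for i in range(3)))
    return out


# Toy example: rotate 90 degrees about z, double the size, lift by 5 units.
R = rotation_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2)
aligned = apply_similarity([(1.0, 0.0, 0.0)], scale=2.0, rotation=R,
                           translation=(0.0, 0.0, 5.0))
```

In an alignment pipeline these seven parameters would be the optimization variables, while the reference model stays fixed as the abstract describes.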

Keywords: 3D scene reconstruction, global alignment, 3D Gaussian Splatting, semantic features, domain gap, WikiEarth dataset, inverse optimization, pseudo-synthetic renderings

125. ❌ HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching

Authors: Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, Zerrin Yumak Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26553v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

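Rationale: The paper studies a co-speech gesture generation model based on contrastive flow matching, at the intersection of computer vision, multimodal generation, and human motion synthesis. Its core techniques are flow matching and contrastive learning, used to generate gestures that are semantically consistent with speech and text. All scoring keywords relate directly to large language models (LLMs), innovations in deep learning principles, or AI applied to science, whereas this paper focuses on the specific task of gesture generation. It does not involve LLMs, MoE, scaling laws, pre-training or post-training, alignment, RAG, inference acceleration, interpretability, or other large-model techniques, nor is it applied to scientific fields such as bioinformatics. The relevance of every keyword is therefore 0.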

!!! tip deepseek-chat TL;DR

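To address semantic grounding and cross-modal consistency in co-speech gesture generation, the paper proposes a contrastive flow-matching model that uses mismatched audio-text conditions as negative samples when training the velocity field. It outperforms existing methods on the BEAT2 and SHOW datasets.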

摘要翻译

尽管伴随语音手势生成领域已取得显著进展，生成整体性且语义扎根的手势仍具挑战。现有方法依赖外部语义检索机制，由于受限于预定义的语言规则，其泛化能力受到制约。基于流匹配的方法虽展现出良好效果，但网络仅通过语义一致样本进行优化，未接触负例，导致其倾向于学习节律性手势而非稀疏动作（如表征性及隐喻性手势）。此外，多数方法通过对身体部位进行孤立建模，未能保持跨模态一致性。本文提出一种基于对比流匹配的伴随语音手势生成模型，该模型使用不匹配的音频-文本条件作为负样本，在训练速度场遵循正确运动轨迹的同时排斥语义不一致的轨迹。我们通过余弦与对比目标将文本、音频及整体动作嵌入复合潜在空间，从而确保跨模态连贯性。在BEAT2和SHOW两个数据集上的大量实验与用户研究表明，所提方法优于当前最优方法。

摘要 (Abstract)

While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, which limit their generalisation capability due to dependency on predefined linguistic rules. Flow-matching-based methods produce promising results; however, the network is optimised using only semantically congruent samples without exposure to negative examples, leading to learning rhythmic gestures rather than sparse motion, such as iconic and metaphoric gestures. Furthermore, by modelling body parts in isolation, the majority of methods fail to maintain crossmodal consistency. We introduce a Contrastive Flow Matching-based co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion trajectory while repelling semantically incongruent trajectories. Our model ensures cross-modal coherence by embedding text, audio, and holistic motion into a composite latent space via cosine and contrastive objectives. Extensive experiments and a user study demonstrate that our proposed approach outperforms state-of-the-art methods on two datasets, BEAT2 and SHOW.
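
The abstract does not state the loss, so the following is only a generic sketch of how a contrastive term can be attached to a flow-matching objective. It assumes a linear interpolation path, where the target velocity is `x1 - x0`, and treats the velocity towards a motion sample from a mismatched audio-text condition as the negative. The weight `lam` and all names are hypothetical.

```python
def squared_error(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def contrastive_fm_loss(v_pred, x0, x1, x1_negative, lam=0.05):
    """Pull the predicted velocity towards the matched target and push it
    away from the velocity implied by a mismatched (negative) sample."""
    target = [b - a for a, b in zip(x0, x1)]             # matched velocity
    negative = [b - a for a, b in zip(x0, x1_negative)]  # mismatched velocity
    return squared_error(v_pred, target) - lam * squared_error(v_pred, negative)


# Toy 2D example: a perfect prediction of the matched velocity.
x0 = [0.0, 0.0]           # noise sample
x1 = [1.0, 0.0]           # motion paired with the correct condition
x1_negative = [0.0, 1.0]  # motion paired with a mismatched condition
loss = contrastive_fm_loss([1.0, 0.0], x0, x1, x1_negative)
```

With a perfect prediction the attraction term vanishes and only the repulsion term remains, which is why the toy loss is negative. In practice `lam` has to stay small enough that the objective remains bounded.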

Keywords: co-speech gesture generation, semantic grounding, contrastive flow matching, cross-modal consistency, holistic motion, audio-text conditions, BEAT2 dataset, SHOW dataset

126. ❌ AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing

Authors: Tianyu Liu, Weitao Xiong, Kunming Luo, Manyuan Zhang, Peng Liu, Yuan Liu, Ping Tan Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26546v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

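Rationale: The paper "AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing" belongs to computer vision and graphics and proposes a 3D-aware editing framework for weather conversion in autonomous driving videos. Its core contributions are geometry-illumination decoupling, the G-buffer dual-pass editing mechanism, physical interactions, and dynamic 3D local relighting. All scoring keywords relate directly to large language models, deep learning principles, or AI applications in science, whereas this paper studies generative video models, 3D editing, and data augmentation for autonomous driving. These topics fall under computer vision and graphics and have no direct connection to the listed keywords, so every keyword receives a relevance of 0.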

!!! tip deepseek-chat TL;DR

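The paper proposes AutoWeather4D, a feed-forward 3D-aware weather editing framework. Its G-buffer dual-pass editing mechanism explicitly decouples geometry and illumination, addressing the data requirements, optimization cost, and geometry-illumination entanglement of existing methods. The result is photorealistic, structurally consistent weather conversion for autonomous driving videos with fine-grained parametric physical control, usable as a practical data engine.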

摘要翻译

生成式视频模型在自动驾驶恶劣天气的光写实合成方面取得了显著进展，然而其始终需要海量数据集来学习罕见天气场景。尽管三维感知编辑方法通过增强现有视频素材缓解了数据约束，但这些方法本质上受限于昂贵的逐场景优化成本，并存在固有的几何与光照纠缠问题。本研究提出AutoWeather4D——一种前馈式三维感知天气编辑框架，旨在显式解耦几何与光照关系。我们方法的核心是G缓冲双通道编辑机制：几何通道利用显式结构基础实现表面锚定的物理交互，光照通道则通过解析方法求解光线传输，将局部光源的贡献累积至全局光照中，从而实现动态的三维局部重光照。大量实验表明，AutoWeather4D在达到与生成式基线模型相当的光写实度与结构一致性的同时，能够实现细粒度参数化物理控制，为自动驾驶领域提供了实用的数据引擎。

摘要 (Abstract)

Generative video models have significantly advanced the photorealistic synthesis of adverse weather for autonomous driving; however, they consistently demand massive datasets to learn rare weather scenarios. While 3D-aware editing methods alleviate these data constraints by augmenting existing video footage, they are fundamentally bottlenecked by costly per-scene optimization and suffer from inherent geometric and illumination entanglement. In this work, we introduce AutoWeather4D, a feed-forward 3D-aware weather editing framework designed to explicitly decouple geometry and illumination. At the core of our approach is a G-buffer Dual-pass Editing mechanism. The Geometry Pass leverages explicit structural foundations to enable surface-anchored physical interactions, while the Light Pass analytically resolves light transport, accumulating the contributions of local illuminants into the global illumination to enable dynamic 3D local relighting. Extensive experiments demonstrate that AutoWeather4D achieves comparable photorealism and structural consistency to generative baselines while enabling fine-grained parametric physical control, serving as a practical data engine for autonomous driving.

Keywords: Autonomous Driving, Video Weather Conversion, 3D-aware Editing, G-buffer Dual-pass Editing, Geometry-Illumination Decoupling, Dynamic 3D Local Relighting, Generative Video Models, Data Engine

127. ❌ OVI-MAP:Open-Vocabulary Instance-Semantic Mapping

Authors: Zilong Deng, Federico Tombari, Marc Pollefeys, Johanna Wald, Daniel Barath Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26541v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

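Rationale: OVI-MAP belongs to computer vision and robotics and studies an open-vocabulary 3D instance-semantic mapping system. Although it involves autonomous agents and vision-language models, all scoring keywords target specific aspects of large language models (LLMs): technical principles, training methods, inference optimization, alignment techniques, and application frameworks. The paper's core is visual mapping, instance segmentation, 3D reconstruction, and open-vocabulary recognition; it does not cover LLM architecture, training, inference, or specific application techniques. All keywords are therefore judged unrelated to the paper and scored 0.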

!!! tip deepseek-chat TL;DR

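The paper presents OVI-MAP, a system for incremental open-vocabulary 3D instance-semantic mapping by autonomous agents in complex environments. By decoupling instance reconstruction from semantic inference, it runs in real time and outperforms existing baselines.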

摘要翻译

增量式开放词汇三维实例-语义建图对于在复杂日常环境中运行的自主智能体至关重要。然而，由于需要鲁棒的实例分割、实时处理能力以及灵活的开放集推理能力，该任务仍面临挑战。现有方法通常依赖于封闭集假设或密集的逐像素语言融合，这限制了系统的可扩展性与时序一致性。我们提出了OVI-MAP系统，将实例重建与语义推理解耦。我们建议构建一个与类别无关的三维实例地图，该地图通过RGB-D输入增量式构建，而语义特征仅通过视觉-语言模型从少量自动选择的视角中提取。这一设计使得系统能够在在线探索过程中实现稳定的实例跟踪与零样本语义标注。我们的系统能够实时运行，并在标准基准测试中超越了当前最先进的开放词汇建图基线方法。

摘要 (Abstract)

Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP that decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks.

Keywords: open-vocabulary, 3D instance-semantic mapping, autonomous agents, instance segmentation, vision-language models, real-time processing, zero-shot semantic labeling, incremental reconstruction

128. ❌ Learnable Quantum Efficiency Filters for Urban Hyperspectral Segmentation

Authors: Imad Ali Shah, Jiarong Li, Ethan Delaney, Enda Ward, Martin Glavin, Edward Jones, Brian Deegan Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26528v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

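Rationale: The paper addresses hyperspectral image segmentation and proposes a physics-inspired learnable quantum efficiency filter method for dimensionality reduction and semantic segmentation of hyperspectral data in urban driving scenes. It covers computer vision, sensor design, hyperspectral imaging, and domain-specific deep learning, but it does not touch on large language models (LLMs), innovations in deep learning principles, or applications of large models across domains. All scoring keywords relate to large models, deep learning principles, or AI for Science (bioinformatics and cheminformatics), whereas this paper studies hyperspectral image processing, a specific computer vision task with no direct connection to the keywords.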

!!! tip deepseek-chat TL;DR

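The paper proposes a physics-inspired learnable quantum efficiency filter method for semantic segmentation of hyperspectral images in urban driving scenes. Across several datasets it improves on both conventional and learnable dimensionality reduction methods.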

摘要翻译

高光谱传感为城市驾驶场景理解提供了丰富的光谱信息，但其高维特性给数据解译与高效学习带来了挑战。本文提出可学习量子效率（Learnable Quantum Efficiency, LQE）方法——一种受物理学启发的、可解释的降维方法，其通过参数化平滑的高阶光谱响应函数来模拟合理的传感器量子效率曲线。与传统方法或无约束的可学习层不同，LQE施加了基于物理原理的约束条件，包括单一主峰、平滑响应和有界带宽。这种设计产生了一种紧凑的光谱表示，既能保留判别性信息，又能在语义分割模型中保持完全可微性和端到端可训练性。我们在三个公开的多类别高光谱城市驾驶数据集上进行了系统评估，将LQE与六种传统降维方法和七种可学习基线降维方法在六个语义分割模型中进行了比较。在所有语义分割模型和配置中平均，LQE取得了最高的平均交并比（mIoU）：在HyKo、HSI-Drive和Hyperspectral City数据集上，相较于传统方法分别提升了2.45%、0.45%和1.04%，相较于可学习方法分别提升了1.18%、1.56%和0.81%。LQE保持了较高的参数效率（仅需12–36个参数，而其他可学习方法需要51–22K个参数）和具有竞争力的推理延迟。消融研究表明，低阶配置是最优选择，且学习得到的光谱滤波器会收敛到数据集固有的波长模式。这些结果表明，融入物理先验的光谱学习能够同时提升性能与可解释性，为汽车视觉系统的高光谱感知与数据驱动的多光谱传感器设计之间架起了原理性桥梁。

摘要 (Abstract)

Hyperspectral sensing provides rich spectral information for scene understanding in urban driving, but its high dimensionality poses challenges for interpretation and efficient learning. We introduce Learnable Quantum Efficiency (LQE), a physics-inspired, interpretable dimensionality reduction (DR) method that parameterizes smooth high-order spectral response functions that emulate plausible sensor quantum efficiency curves. Unlike conventional methods or unconstrained learnable layers, LQE enforces physically motivated constraints, including a single dominant peak, smooth responses, and bounded bandwidth. This formulation yields a compact spectral representation that preserves discriminative information while remaining fully differentiable and end-to-end trainable within semantic segmentation models (SSMs). We conduct systematic evaluations across three publicly available multi-class hyperspectral urban driving datasets, comparing LQE against six conventional and seven learnable baseline DR methods across six SSMs. Averaged across all SSMs and configurations, LQE achieves the highest average mIoU, improving over conventional methods by 2.45%, 0.45%, and 1.04%, and over learnable methods by 1.18%, 1.56%, and 0.81% on HyKo, HSI-Drive, and Hyperspectral City, respectively. LQE maintains strong parameter efficiency (12–36 parameters compared to 51–22K for competing learnable approaches) and competitive inference latency. Ablation studies show that low-order configurations are optimal, while the learned spectral filters converge to dataset-intrinsic wavelength patterns. These results demonstrate that physics-informed spectral learning can improve both performance and interpretability, providing a principled bridge between hyperspectral perception and data-driven multispectral sensor design for automotive vision systems.
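
The abstract names the constraints (a single dominant peak, a smooth response, bounded bandwidth) but not the functional form of the response curves. As one illustrative shape that satisfies them, the sketch below uses a super-Gaussian with a centre, a width, and an integer order. This choice, and every name in the code, is an assumption rather than the paper's parameterisation.

```python
import math


def spectral_response(wavelengths, center, width, order=1):
    """Single-peaked, smooth filter: exp(-(|w - center| / width) ** (2 * order)).
    order=1 is an ordinary Gaussian-shaped curve; higher orders flatten the top."""
    return [math.exp(-((abs(w - center) / width) ** (2 * order)))
            for w in wavelengths]


def project_pixel(spectrum, responses):
    """Reduce one pixel's spectrum to len(responses) channels by taking a
    response-weighted average of the band values for each filter."""
    channels = []
    for response in responses:
        weight_sum = sum(response)
        weighted = sum(s * r for s, r in zip(spectrum, response))
        channels.append(weighted / weight_sum)
    return channels


# Toy setup: four bands reduced to two channels.
wavelengths = [400.0, 500.0, 600.0, 700.0]
filter_a = spectral_response(wavelengths, center=500.0, width=100.0)
filter_b = spectral_response(wavelengths, center=650.0, width=100.0, order=2)
flat_pixel = [1.0, 1.0, 1.0, 1.0]
channels = project_pixel(flat_pixel, [filter_a, filter_b])
```

In a learnable version the centre and width of each filter would be trained jointly with the segmentation model, which is what keeps the parameter count per filter very small.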

Keywords: Hyperspectral segmentation, Learnable Quantum Efficiency, Dimensionality reduction, Urban driving, Semantic segmentation, Physics-inspired, Spectral filters, Automotive vision

129. ❌ Conditional Diffusion for 3D CT Volume Reconstruction from 2D X-rays

Authors: Martin Rath, Morteza Ghahremani, Yitong Li, Ashkan Taghipour, Marcus Makowski, Christian Wachinger Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26509v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

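Rationale: The paper focuses on medical image reconstruction, using diffusion models to reconstruct 3D CT from 2D X-rays. This is an application of AI in biomedicine, so it is relevant only to 'AI for Science OR Bioinformatics OR Cheminformatics' (relevance 8.0). The other keywords concern large language models or general deep learning techniques and are unrelated to the paper (relevance 0.0).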

!!! tip deepseek-chat TL;DR

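The paper proposes AXON, a multi-stage diffusion framework that reconstructs high-fidelity 3D CT volumes directly from real X-rays. It is motivated by the high cost and radiation exposure of CT and significantly outperforms existing methods on public and external datasets.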

摘要翻译

计算机断层扫描（CT）能提供丰富的三维解剖细节，但常受限于高辐射暴露、高昂成本和有限的可及性。尽管标准胸部X光片具有成本效益且广泛可用，但其仅能提供二维投影，病理信息有限。从二维X光片重建三维CT体数据为提升诊断可及性提供了变革性解决方案，然而现有方法主要依赖合成X射线投影，限制了临床泛化能力。本研究提出AXON——一个基于多阶段扩散模型的框架，可直接从真实X光片重建高保真三维CT体数据。AXON采用由粗到精的策略：初始阶段采用基于布朗桥扩散模型的全局结构合成，随后通过基于ControlNet的细化阶段进行局部强度优化。该框架还支持双平面X射线输入，以缓解二维到三维重建固有的深度模糊性问题。我们集成了超分辨率网络对生成体数据进行上采样，以达到诊断级分辨率。在公开及外部数据集上的评估表明，AXON显著优于现有先进基线模型，PSNR提升11.9%，SSIM提高11.0%，且在不同临床数据分布间展现出稳健的泛化能力。代码发布于https://github.com/ai-med/AXON。

摘要 (Abstract)

Computed tomography (CT) provides rich 3D anatomical details but is often constrained by high radiation exposure, substantial costs, and limited availability. While standard chest X-rays are cost-effective and widely accessible, they only provide 2D projections with limited pathological information. Reconstructing 3D CT volumes from 2D X-rays offers a transformative solution to increase diagnostic accessibility, yet existing methods predominantly rely on synthetic X-ray projections, limiting clinical generalization. In this work, we propose AXON, a multi-stage diffusion-based framework that reconstructs high-fidelity 3D CT volumes directly from real X-rays. AXON employs a coarse-to-fine strategy, with a Brownian Bridge diffusion model-based initial stage for global structural synthesis, followed by a ControlNet-based refinement stage for local intensity optimization. It also supports bi-planar X-ray input to mitigate depth ambiguities inherent in 2D-to-3D reconstruction. A super-resolution network is integrated to upscale the generated volumes to achieve diagnostic-grade resolution. Evaluations on both public and external datasets demonstrate that AXON significantly outperforms state-of-the-art baselines, achieving an 11.9% improvement in PSNR and an 11.0% increase in SSIM with robust generalizability across disparate clinical distributions. Our code is available at https://github.com/ai-med/AXON.
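
For readers unfamiliar with the first of the two reported metrics, PSNR is defined as 10·log10(MAX² / MSE), where MAX is the largest possible intensity and MSE is the mean squared error between the reference and the reconstruction. The sketch below computes it on flattened intensity lists. The function name and the assumption that intensities are normalised to [0, 1] are illustrative choices, not details from the paper.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized intensity lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical volumes
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy voxel intensities in [0, 1], not taken from the paper.
reference = [0.0, 0.5, 1.0, 0.5]
estimate = [0.1, 0.5, 0.9, 0.5]
value_db = psnr(reference, estimate)
```

Because PSNR is logarithmic, a relative gain such as the reported 11.9% corresponds to a much larger reduction in mean squared error than the percentage alone suggests.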

Keywords: 3D CT reconstruction, 2D X-rays, diffusion model, medical imaging, AXON framework, Brownian Bridge, ControlNet, super-resolution

130. ❌ ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better

Authors: Mriganka Nath, Anurag Das, Jiahao Xie, Bernt Schiele Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26486v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

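Rationale: The paper focuses on hallucination in large vision-language models (LVLMs) when visual inputs are corrupted and proposes ClipTTT, a test-time training method to mitigate it. Relevance by keyword: (1) highly relevant (10): 'Hallucination Mitigation' is the paper's core research problem; (2) moderately relevant (8): 'Large Language Models', since LVLMs fall within the large-model family; (3) weakly relevant (5): 'Pre-training', since the paper uses a pre-trained CLIP model, and 'Self-Correction', since ClipTTT lets the model adapt itself through test-time training; (4) unrelated (0): the remaining keywords are not covered in the paper.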

!!! tip deepseek-chat TL;DR

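The paper targets the increase in hallucinations of large vision-language models when visual inputs are corrupted. It proposes ClipTTT, a CLIP-guided test-time training method that mitigates hallucinations and improves descriptive faithfulness.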

摘要翻译

大型视觉语言模型（LVLMs）容易产生幻觉现象，尤其在测试阶段视觉输入受损时更为显著。研究表明，此类数据损坏构成了额外的分布偏移，在实际应用中会显著加剧幻觉发生率。为解决这一问题，我们提出CLIP引导的测试时训练方法（ClipTTT），该方法能够在退化条件下仅利用单个测试样本对LVLM进行实时自适应调整。具体而言，我们利用预训练CLIP模型强大的图文对齐能力作为稳定的引导信号，以此识别可靠的自监督目标，实现在不改变基础LVLM架构前提下的快速适应。通过在包含15种常见损坏类型的标准幻觉基准测试上进行大量实验，结果表明ClipTTT能有效缓解视觉损坏条件下的幻觉问题，并显著提升描述的忠实度。

摘要 (Abstract)

Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.
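
The abstract says CLIP's image-text alignment serves as a guidance signal for picking reliable self-supervision targets, without giving the selection rule. One plausible reading, shown here purely as an assumption, is to score candidate captions by cosine similarity to the image embedding and keep the best one as a pseudo-target. The embeddings are hand-written toy vectors and no real CLIP model is called.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_pseudo_target(image_embedding, caption_embeddings):
    """Return the caption whose embedding aligns best with the image,
    together with the similarity score of every candidate."""
    scores = {caption: cosine_similarity(image_embedding, embedding)
              for caption, embedding in caption_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings standing in for CLIP outputs (hypothetical values).
image_embedding = [0.9, 0.1, 0.0]
candidates = {
    "a dog on grass": [1.0, 0.0, 0.0],
    "a cat on a sofa": [0.0, 1.0, 0.0],
}
best_caption, scores = pick_pseudo_target(image_embedding, candidates)
```

The selected caption would then act as the training target for a few adaptation steps on that single test sample, which is the part this sketch leaves out.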

Keywords: Large vision-language models, Hallucination, Test-time training, CLIP, Visual corruptions, Self-supervision, Adaptation, Descriptive faithfulness

131. ❌ SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

Authors: Weihong Pan, Xiaoyu Zhang, Zhuang Zhang, Zhichao Ye, Nan Wang, Haomin Liu, Guofeng Zhang Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26481v1

Score: 0.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

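Scoring rationale: This paper focuses on 4D reconstruction in computer vision and proposes a method for reconstructing dynamic scenes from sparse camera inputs. It covers computer-vision concepts such as generative observations, spatio-temporal consistency, and camera calibration, but it does not involve large language models, the deep-learning techniques the keywords target, or AI for Science. None of the keywords apply, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper tackles high-quality 4D reconstruction of dynamic scenes from sparse camera inputs. By introducing a Spatio-Temporal Distortion Field and a complete reconstruction pipeline, it achieves spatio-temporally consistent, high-fidelity rendering that outperforms existing methods.

Abstract (Chinese translation)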

高质量4维重建技术能够实现对动态真实世界进行照片级逼真且沉浸式的渲染。然而，与可通过单台相机完整捕捉的静态场景不同，高质量动态场景的采集通常需要由数十甚至数百台同步相机组成的密集阵列。对此类昂贵实验室设置的依赖严重制约了实际应用的可扩展性。为此，我们提出了一种稀疏相机动态重建框架，该框架能够利用丰富但存在不一致性的生成式观测数据。我们的核心创新在于时空畸变场，该模型为跨空间与时间维度的生成式观测不一致性提供了统一建模机制。在此基础上，我们开发了一套完整流程，实现了从稀疏且未标定相机输入中进行4维重建。我们在多相机动态场景基准数据集上评估了本方法，实现了时空一致的高保真渲染效果，并显著超越了现有技术方案。

Abstract

High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. The reliance on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches.

Keywords: 4D reconstruction, sparse cameras, dynamic scenes, spatio-temporal consistency, generative observations, camera calibration, rendering, computer vision

132. ❌ HyVIC: A Metric-Driven Spatio-Spectral Hyperspectral Image Compression Architecture Based on Variational Autoencoders

Authors: Martin Hermann Paul Fuchs, Behnood Rasti, Begüm Demir Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26468v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

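Scoring rationale: This paper focuses on HyVIC, a hyperspectral image compression architecture based on variational autoencoders, which is an application of deep learning to remote sensing. It does not involve any large language model (LLM) techniques, nor does it discuss LLM training, alignment, inference optimization, or agents. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies deep learning to remote sensing science, which falls under AI for Science, but it is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

This paper proposes HyVIC, a metric-driven hyperspectral image compression architecture based on variational autoencoders. By balancing spatial and spectral feature learning, it achieves high-fidelity reconstruction across multiple compression ratios and improves on existing methods by up to 4.66 dB in BD-PSNR.

Abstract (Chinese translation)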

遥感领域高光谱数据档案的快速增长对存储与传输的高效压缩方法提出了迫切需求。基于学习的高光谱图像压缩技术的最新进展显著提升了重建保真度与压缩效率。然而，现有方法通常直接套用为自然图像设计的变分图像压缩模型，未能充分考虑高光谱图像固有的独特空谱冗余特性。尤其缺乏明确平衡空间与光谱特征学习的架构设计，限制了其有效利用高光谱数据独特特征的能力。为解决这一问题，我们提出了空谱变分高光谱图像压缩架构。该模型包含四个核心组件：1）可调节空谱编码器；2）空谱超先验编码器；3）空谱超先验解码器；4）可调节空谱解码器。我们论证了空间与光谱特征学习的权衡对重建保真度至关重要，并提出基于度量驱动的策略来系统选择模型超参数。在两个基准数据集上的大量实验证明了该模型的有效性，其在宽泛的压缩比范围内实现了高空间与光谱重建保真度，并以BD-PSNR指标将现有最优性能提升达4.66分贝。基于实验结果，我们提出了对基于学习的变分高光谱图像压缩未来研究方向的见解与实践指导准则。代码与预训练模型权重已公开于https://git.tu-berlin.de/rsim/hyvic。

Abstract

The rapid growth of hyperspectral data archives in remote sensing (RS) necessitates effective compression methods for storage and transmission. Recent advances in learning-based hyperspectral image (HSI) compression have significantly enhanced both reconstruction fidelity and compression efficiency. However, existing methods typically adapt variational image compression models designed for natural images, without adequately accounting for the distinct spatio-spectral redundancies inherent in HSIs. In particular, they lack explicit architectural designs to balance spatial and spectral feature learning, limiting their ability to effectively leverage the unique characteristics of hyperspectral data. To address this issue, we introduce spatio-spectral variational hyperspectral image compression architecture (HyVIC). The proposed model comprises four main components: 1) adjustable spatio-spectral encoder; 2) spatio-spectral hyperencoder; 3) spatio-spectral hyperdecoder; and 4) adjustable spatio-spectral decoder. We demonstrate that the trade-off between spatial and spectral feature learning is crucial for the reconstruction fidelity, and therefore present a metric-driven strategy to systematically select the hyperparameters of the proposed model. Extensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed model, achieving high spatial and spectral reconstruction fidelity across a wide range of compression ratios (CRs) and improving the state of the art by up to 4.66dB in terms of BD-PSNR. Based on our results, we offer insights and derive practical guidelines to guide future research directions in learning-based variational HSI compression. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hyvic .

Keywords: Hyperspectral Image Compression, Variational Autoencoders, Spatio-Spectral Redundancies, Reconstruction Fidelity, Metric-Driven Strategy, Remote Sensing, Learning-Based Compression, BD-PSNR
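
The 4.66 dB improvement quoted in the abstract is reported in BD-PSNR, which summarizes the PSNR gap between two rate-distortion curves. As a minimal sketch of the underlying per-image metric only (this is not the authors' evaluation code and does not perform the Bjøntegaard-delta averaging), PSNR can be computed as shown here. The function name `psnr` and the `max_value` parameter are illustrative assumptions, and the inputs are flattened lists of sample values.

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists.

    `max_value` is the largest possible sample value (an assumption here:
    data normalized to [0, 1]).
    """
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the reference and its reconstruction.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the reconstruction is closer to the reference; BD-PSNR then compares two codecs by averaging this quantity over a shared range of bitrates.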

133. ❌ Meta-Learned Adaptive Optimization for Robust Human Mesh Recovery with Uncertainty-Aware Parameter Updates

Authors: Shaurjya Mandal, Nutan Sharma, John Galeotti Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26447v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

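Scoring rationale: The paper focuses on human mesh recovery in computer vision and proposes a method that combines meta-learning with adaptive optimization. Although it involves deep learning techniques (meta-learning, optimization algorithms), its core content is unrelated to every scoring keyword, all of which center on large language models and related techniques. The paper does not mention large models, language models, alignment, reasoning, or agents, and it does not involve scientific AI applications such as bioinformatics.

!!! tip deepseek-chat TL;DR

This paper proposes a new framework based on meta-learning and uncertainty-aware adaptive updates to address depth ambiguity and poor generalization in single-image human mesh recovery. It achieves state-of-the-art performance on standard benchmarks and significantly reduces error.

Abstract (Chinese translation)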

从单张图像中恢复人体网格由于固有的深度模糊性和跨域泛化能力有限，仍是一项具有挑战性的任务。尽管现有方法结合了回归与优化策略，但它们常面临测试时优化初始化效果不佳以及优化过程中参数更新效率低下的问题。我们提出了一种新颖的元学习框架，该框架训练模型以生成适用于优化的初始化参数，并在测试时优化过程中引入不确定性感知的自适应更新机制。我们的方法包含三项关键创新：（1）一种元学习策略，通过在训练阶段模拟测试时优化过程来学习更优的参数初始化；（2）选择性参数缓存机制，可识别并冻结已收敛的关节点以降低计算开销；（3）基于分布的自适应更新方法，从学习到的分布中采样参数变化量，在实现鲁棒探索的同时量化不确定性。此外，我们采用随机近似技术来处理复杂损失函数空间中难以计算的梯度。在标准基准测试上的大量实验表明，我们的方法取得了最先进的性能，与强基线模型相比，在3DPW数据集上MPJPE降低了10.3，在Human3.6M数据集上降低了8.0。该方法展现出卓越的域适应能力，在不同环境条件下性能下降最小，同时能提供与真实预测误差相关的有效不确定性估计。通过结合元学习与自适应优化，我们的方法实现了精确的网格恢复，并对挑战性场景具有鲁棒的泛化能力。

Abstract

Human mesh recovery from single images remains challenging due to inherent depth ambiguity and limited generalization across domains. While recent methods combine regression and optimization approaches, they struggle with poor initialization for test-time refinement and inefficient parameter updates during optimization. We propose a novel meta-learning framework that trains models to produce optimization-friendly initializations while incorporating uncertainty-aware adaptive updates during test-time refinement. Our approach introduces three key innovations: (1) a meta-learning strategy that simulates test-time optimization during training to learn better parameter initializations, (2) a selective parameter caching mechanism that identifies and freezes converged joints to reduce computational overhead, and (3) distribution-based adaptive updates that sample parameter changes from learned distributions, enabling robust exploration while quantifying uncertainty. Additionally, we employ stochastic approximation techniques to handle intractable gradients in complex loss landscapes. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance, reducing MPJPE by 10.3 on 3DPW and 8.0 on Human3.6M compared to strong baselines. Our approach shows superior domain adaptation capabilities with minimal performance degradation across different environmental conditions, while providing meaningful uncertainty estimates that correlate with actual prediction errors. Combining meta-learning and adaptive optimization enables accurate mesh recovery and robust generalization to challenging scenarios.

Keywords: Human Mesh Recovery, Meta-Learning, Adaptive Optimization, Uncertainty-Aware, Test-time Refinement, Domain Adaptation, Stochastic Approximation, 3D Pose Estimation
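
The abstract reports its gains in MPJPE (mean per-joint position error). As a minimal sketch of how that metric is usually defined, assuming joints are given as 3D coordinate tuples in a shared frame (the abstract does not state units or any alignment step, and this is not the authors' evaluation code), it is the average Euclidean distance between predicted and ground-truth joints. The function name `mpjpe` is illustrative.

```python
import math

def mpjpe(predicted, ground_truth):
    """Mean per-joint position error: average Euclidean distance over joints.

    Both arguments are sequences of (x, y, z) tuples, one per joint,
    listed in the same joint order.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    # math.dist computes the Euclidean distance between two points.
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)
```

Lower values are better, so a reduction of 10.3 on 3DPW means the predicted joints sit closer to the ground truth on average.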

134. ❌ Image-based Quantification of Postural Deviations on Patients with Cervical Dystonia: A Machine Learning Approach Using Synthetic Training Data

Authors: Roland Stenger, Sebastian Löns, Nele Brügge, Feline Hamami, Alexander Münchau, Theresa Paulus, Anne Weissbach, Tatiana Usnich, Max Borsche, Martje G. Pauly, Lara M. Lange, Markus A. Hobert, Rebecca Herzog, Ana Luísa de Almeida Marcelino, Tina Mainka, Friederike Schumann, Lukas L. Goede, Johanna Reimer, Julienne Haas, Jos Becktepe, Alexander Baumann, Robin Wolke, Chi Wang Ip, Thorsten Odorfer, Daniel Zeller, Lisa Harder-Rauschenberger, John-Ih Lee, Philipp Albrecht, Tristan Kölsche, Joachim K. Krauss, Johanna M. Nagel, Joachim Runge, Johanna Doll-Lee, Simone Zittel, Kai Grimm, Pawel Tacik, André Lee, Tobias Bäumer, Sebastian Fudickar Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26444v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

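Scoring rationale: This paper uses deep learning (specifically a pretrained head-pose estimation algorithm and a model trained on synthetic data) to objectively assess postural deviations in patients with cervical dystonia, an application of medical image analysis and computer vision to clinical medicine. It is entirely unrelated to most keywords (large-model principles, training methods, inference optimization, agents, and so on), which mainly target natural language processing or general large-model techniques. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the study applies AI to biomedicine (assessment of a neurological disorder), which falls under AI for Science, but this is not its core contribution (the main novelty lies in the synthetic-data approach rather than the AI technique itself), so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This study develops an automated image-based system that uses a pretrained algorithm and a deep learning model trained on synthetic data to objectively assess rotational and translational head symptoms in patients with cervical dystonia. Validation shows strong correlation with expert ratings for rotational symptoms and moderate correlation for lateral shift, providing a standardized tool for clinical assessment.

Abstract (Chinese translation)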

颈性肌张力障碍（CD）是最常见的肌张力障碍类型，然而目前的评估依赖于主观的临床评分量表，例如多伦多西部痉挛性斜颈评定量表（TWSTRS），该量表需要专业知识、具有主观性，且其部分评分项目的评估者间一致性较低。针对目前缺乏成熟的客观工具来监测疾病严重程度和治疗反应的问题，本研究验证了一种基于图像的自动化头部姿态与偏移估计系统，用于CD患者评估。我们开发了一种评估工具，该工具结合了针对旋转症状的预训练头部姿态估计算法，以及一个专门在约16,000张合成虚拟人像图像上训练的深度学习模型，用于评估罕见的平移症状——特别是侧向偏移。这种合成数据方法克服了临床训练样本稀缺的难题。该系统的性能在一项多中心研究中得到验证，通过使用包含100张真实患者图像和100张带标注的合成虚拟人像的数据集，将其预测评分与20位临床专家的共识评分进行比较。对于旋转症状，自动化系统与专家临床评分表现出高度一致性，在斜颈（r=0.91）、侧颈（r=0.81）和前/后颈（r=0.78）评分上均获得高相关性。对于侧向偏移，该工具与临床评分的相关性为中等（r=0.55），并且在虚拟人像的受控基准测试中表现出比人工评分者更高的准确性。通过利用合成训练数据弥补临床数据缺口，该模型成功推广至真实世界患者，为CD姿势评估提供了一个经过验证的客观工具，有望实现标准化的临床决策和试验评估。

Abstract

Cervical dystonia (CD) is the most common form of dystonia, yet current assessment relies on subjective clinical rating scales, such as the Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), which requires expertise, is subjective, and has low inter-rater reliability for some items of the score. To address the lack of established objective tools for monitoring disease severity and treatment response, this study validates an automated image-based head pose and shift estimation system for patients with CD. We developed an assessment tool that combines a pretrained head-pose estimation algorithm for rotational symptoms with a deep learning model trained exclusively on ~16,000 synthetic avatar images to evaluate rare translational symptoms, specifically lateral shift. This synthetic data approach overcomes the scarcity of clinical training examples. The system’s performance was validated in a multicenter study by comparing its predicted scores against the consensus ratings of 20 clinical experts using a dataset of 100 real patient images and 100 labeled synthetic avatars. The automated system demonstrated strong agreement with expert clinical ratings for rotational symptoms, achieving high correlations for torticollis (r=0.91), laterocollis (r=0.81), and anteroretrocollis (r=0.78). For lateral shift, the tool achieved a moderate correlation (r=0.55) with clinical ratings and demonstrated higher accuracy than human raters in controlled benchmark tests on avatars. By leveraging synthetic training data to bridge the clinical data gap, this model successfully generalizes to real-world patients, providing a validated, objective tool for CD postural assessment that can enable standardized clinical decision-making and trial evaluation.

Keywords: cervical dystonia, postural assessment, synthetic training data, head pose estimation, deep learning, objective tool, clinical validation, image-based quantification
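
The agreement figures in the abstract (for example r=0.91 for torticollis) are correlation coefficients between system scores and expert consensus ratings. The abstract does not say which coefficient was used; the sketch below assumes the Pearson correlation and is only an illustration of that statistic, not the study's analysis code. The function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the system's predicted scores and `ys` the expert consensus ratings for the same images; values near 1 indicate that the two rise and fall together.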

135. ❌ SHANDS: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training

Authors: Le Ma, Thiago Freitas dos Santos, Nadia Magnenat-Thalmann, Katarzyna Wac Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26400v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

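Scoring rationale: This paper focuses on applying computer vision to surgical training, using a multi-view video dataset and deep learning models for gesture and error recognition. It is entirely unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on), which mainly target large language model techniques in natural language processing. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to medical science (surgical training), but it is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

This paper presents SHands, a multi-view video dataset for surgical hand-gesture and error recognition, and benchmarks deep learning models on it to support the development of automated, scalable AI systems for surgical training.

Abstract (Chinese translation)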

在医学生外科培训中，技能熟练度的提升依赖于专家主导的操作评估，这种方式成本高昂、时间有限、难以规模化，且其专业知识仅局限于拥有专科医师的机构。基于人工智能的自动化评估提供了一种可行的替代方案，但其发展受限于缺乏包含真实学员操作错误的数据集，以及训练鲁棒计算机视觉方法所需的多视角变异性。为填补这一空白，我们提出了Surgical-Hands（SHands）——一个用于外科手术手势与错误识别的大规模多视角视频数据集，专为医学培训设计。SHands通过五个互补视角的RGB摄像头，记录了52名参与者（20名专家和32名学员）执行线性切口与缝合操作的过程，每位参与者对每个步骤完成三次标准化试验。视频在帧级别标注了15种手势基元，并包含一套经过验证的8类学员错误分类体系，从而同时支持手势识别与错误检测。我们进一步定义了针对单视角、多视角及跨视角泛化的标准化评估协议，并在该数据集上对前沿深度学习模型进行了基准测试。SHands已公开发布，旨在支持基于临床领域知识构建的、鲁棒且可扩展的外科培训人工智能系统的开发。

Abstract

In surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, difficult to scale, and its expertise remains confined to institutions with available specialists. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHands), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. SHands captures linear incision and suturing using five RGB cameras from complementary viewpoints, performed by 52 participants (20 experts and 32 trainees), each completing three standardized trials per procedure. The videos are annotated at the frame level with 15 gesture primitives and include a validated taxonomy of 8 trainee error types, enabling both gesture recognition and error detection. We further define standardized evaluation protocols for single-view, multi-view, and cross-view generalization, and benchmark state-of-the-art deep learning models on the dataset. SHands is publicly released to support the development of robust and scalable AI systems for surgical training grounded in clinically curated domain knowledge.

Keywords: surgical training, hand-gesture recognition, error recognition, multi-view video dataset, deep learning, computer vision, medical AI, benchmark

Authors: Yi Zhang, Yidong Zhao, Qian Tao Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26393v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

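Scoring rationale: This paper focuses on medical image registration, specifically deformable image registration in multi-modal settings. Its core contribution is a framework that combines a frozen pretrained mono-modal registration model with a lightweight adaptation pipeline, bridging modality and domain gaps through style transfer based on contrast-agnostic representation generation and refinement. Most keywords are unrelated, since they concern large language models, training techniques, inference optimization, agent systems, and similar topics, whereas this paper addresses a specific technical problem in medical image processing. The only relevant keywords are: 1. "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points): the paper uses a pretrained mono-modal model and involves domain adaptation to handle multiple modalities and unseen domains; 2. "AI for Science OR Bioinformatics OR Cheminformatics" (8 points): the paper is an application of AI to science, namely medical image analysis (related to bioinformatics), but its novelty lies in the registration method rather than in large-model techniques.

!!! tip deepseek-chat TL;DR

This paper proposes a framework that combines a frozen pretrained mono-modal model with a lightweight adaptation pipeline to address domain adaptation in multi-modal medical image registration, outperforming the pretrained mono-modal baseline on the Learn2Reg 2025 LUMIR validation set.

Abstract (Chinese translation)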

可变形图像配准在医学图像分析中仍是一个核心挑战，尤其是在多模态场景下，不同扫描间的强度分布差异显著。尽管深度学习方法能提供高效的前馈预测，但在测试时面对分布偏移时往往难以保持稳健的泛化能力。一种直接的补救方法是全网络微调，然而对于Transformer或深度U-Net等现代架构，在三维图像中进行这种适配在内存和运行时间上的成本都极高。同时，面对剧烈的域偏移时，简单的微调方法更容易出现性能下降。本研究提出一种配准框架，将冻结的预训练单模态配准模型与轻量级适配流程相结合，用于多模态图像配准。具体而言，我们采用基于对比度无关表征生成与优化模块的风格迁移方法，在测试时通过实例优化来弥合模态与域间的差异。该设计与骨干单模态模型的选择正交，从而避免了全微调的计算负担，同时保留了适应未见域的灵活性。我们在Learn2Reg 2025 LUMIR验证集上评估了所提方法，观察到相较于预训练的先进单模态骨干模型取得了持续改进。特别地，该方法在多模态子集上排名第二，在域外子集上排名第三，并以Dice分数在总排名中位列第四。这些结果表明，将冻结的单模态模型与模态适配及轻量级实例优化相结合，为稳健的多模态配准提供了一条有效且实用的路径。

Abstract

Deformable image registration remains a central challenge in medical image analysis, particularly under multi-modal scenarios where intensity distributions vary significantly across scans. While deep learning methods provide efficient feed-forward predictions, they often fail to generalize robustly under distribution shifts at test time. A straightforward remedy is full network fine-tuning, yet for modern architectures such as Transformers or deep U-Nets, this adaptation is prohibitively expensive in both memory and runtime when operating in 3D. Meanwhile, naive fine-tuning is prone to performance degradation in the presence of drastic domain shifts. In this work, we propose a registration framework that integrates a frozen pretrained mono-modal registration model with a lightweight adaptation pipeline for multi-modal image registration. Specifically, we employ style transfer based on contrast-agnostic representation generation and refinement modules to bridge modality and domain gaps with instance optimization at test time. This design is orthogonal to the choice of backbone mono-modal model, and thus avoids the computational burden of full fine-tuning while retaining the flexibility to adapt to unseen domains. We evaluate our approach on the Learn2Reg 2025 LUMIR validation set and observe consistent improvements over the pretrained state-of-the-art mono-modal backbone. In particular, the method ranks second on the multi-modal subset, third on the out-of-domain subset, and achieves fourth place overall in Dice score. These results demonstrate that combining frozen mono-modal models with modality adaptation and lightweight instance optimization offers an effective and practical pathway toward robust multi-modal registration.

Keywords: deformable image registration, multi-modal registration, frozen pretrained model, domain adaptation, style transfer, contrast-agnostic representation, instance optimization, medical image analysis
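
The overall ranking in the abstract is given in terms of the Dice score, an overlap measure between two segmentations. As a minimal sketch of that metric for a single label, assuming binary masks flattened into lists of 0/1 values (the challenge's own evaluation may aggregate over many anatomical labels, which is not shown here), it can be computed as follows. The function name `dice_score` and the empty-mask convention are illustrative assumptions.

```python
def dice_score(mask_a, mask_b):
    """Dice overlap 2|A∩B| / (|A| + |B|) between two flat binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of elements")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if total == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * intersection / total
```

In a registration setting, `mask_a` would be the fixed image's label map and `mask_b` the moving image's label map after warping; a score of 1.0 means the two overlap perfectly.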

137. ❌ Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration

Authors: I-Hsiang Chen, Isma Hadji, Enrique Sanchez, Adrian Bulat, Sy-Yen Kuo, Radu Timofte, Georgios Tzimiropoulos, Brais Martinez Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26385v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

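Scoring rationale: The paper focuses on image restoration in computer vision and proposes RAR, an iterative restoration framework that integrates image quality assessment (IQA) and image restoration (IR) into a unified model. Its core techniques apply deep learning to image processing, whereas all given keywords relate to large language models (LLMs), their training methods (such as pre-training, fine-tuning, and alignment), inference optimization, agent systems, interpretability, scientific AI applications, and similar specific techniques or concepts. The abstract and title mention no large language models, related training techniques, reasoning methods, or scientific applications, so the paper is unrelated to every keyword and all receive 0.

!!! tip deepseek-chat TL;DR

This paper proposes RAR, an iterative image restoration framework that integrates image quality assessment and restoration into a unified, end-to-end trainable model. It handles single, unknown, and composite degradations effectively and achieves state-of-the-art performance across multiple tasks.

Abstract (Chinese translation)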

图像复原旨在从因恶劣天气、模糊或低光照等多种因素退化的输入中恢复高质量图像。尽管近期研究在单一或统一复原任务中取得了显著进展，但在处理未知或复合退化时，这些方法仍存在泛化能力有限和效率低下的问题。为应对这些局限性，我们提出RAR（Restore, Assess and Repeat）流程，将图像质量评估（Image Quality Assessment, IQA）与图像复原（Image Restoration, IR）整合到统一框架中，通过迭代方式高效实现高质量图像复原。具体而言，我们引入一种完全在隐空间域操作的复原流程，可联合执行退化识别、图像复原和质量验证。所构建的模型支持端到端全流程训练，实现了动态调整复原过程的“评估-复原一体化”方法。同时，IQA与IR在统一模型中的紧密集成，最小化了传统分立模块（如图像/文本解码过程中）通常存在的延迟与信息损失。大量实验表明，我们的方法在单一退化、未知退化及复合退化场景下均取得持续改进，从而确立了新的性能标杆。

Abstract

Image restoration aims to recover high quality images from inputs degraded by various factors, such as adverse weather, blur, or low light. While recent studies have shown remarkable progress across individual or unified restoration tasks, they still suffer from limited generalization and inefficiency when handling unknown or composite degradations. To address these limitations, we propose RAR, a Restore, Assess and Repeat process, that integrates Image Quality Assessment (IQA) and Image Restoration (IR) into a unified framework to iteratively and efficiently achieve high quality image restoration. Specifically, we introduce a restoration process that operates entirely in the latent domain to jointly perform degradation identification, image restoration, and quality verification. The resulting model is fully trainable end to end and allows for an all-in-one assess and restore approach that dynamically adapts the restoration process. Also, the tight integration of IQA and IR into a unified model minimizes the latency and information loss that typically arise from keeping the two modules disjoint (e.g., during image and/or text decoding). Extensive experiments show that our approach yields consistent improvements under single, unknown and composite degradations, thereby establishing a new state-of-the-art.

Keywords: Image Restoration, Iterative Framework, Image Quality Assessment, Unified Model, Degradation Identification, End-to-End Training, Latent Domain Processing, Composite Degradations

138. ❌ Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning

Authors: Shida Wang, YongXiang Hua, Zhou Tao, Haoyu Cao, Linli Xu Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26365v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	5.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

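Scoring rationale: The paper's core topic is efficiency optimization for video understanding in multimodal large language models (MLLMs), which is highly relevant to "Large Language Models" (10 points). The proposed SCORE framework performs dynamic token compression via reinforcement learning, which bears directly on "Quantization/Model Compression" (8 points) and "Speculative Decoding/Inference Acceleration" (8 points). By reducing visual token redundancy, the method mitigates "context rot", giving it some relevance to "Context Window Extension" (5 points); the compression strategy may also affect KV cache efficiency, an indirect link to "KV Cache Compression" (5 points). Other keywords such as MoE, SFT, and RAG do not appear in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

To address the high computational cost and context degradation that multimodal large language models face in video understanding, the paper proposes SCORE, a reinforcement-learning-based dynamic token compression framework that achieves a 16x prefill speedup at a 10% token retention ratio while preserving 99.5% of the original performance.

Abstract (Chinese translation)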

多模态大语言模型在视频理解领域展现出卓越能力，但其海量视觉令牌冗余导致计算成本过高，并引发“上下文退化”导致的性能衰减。现有压缩策略通常依赖启发式方法或固定变换，这些方法往往与下游任务目标解耦，限制了其适应性与有效性。为此，我们提出SCORE（基于强化学习的惊喜增强令牌压缩框架），这是一个学习自适应令牌压缩策略的统一框架。SCORE引入了一个轻量级策略网络，其以惊喜增强的状态表示为条件——该表示融合了帧间残差，以显式捕捉时序动态与运动显著性。我们采用分组强化学习方案与分割优势估计器对该策略进行优化，并通过从静态伪视频迁移至真实动态视频的两阶段课程学习实现训练稳定化。在多样化视频理解基准上的大量实验表明，SCORE显著优于现有最优基线方法。值得注意的是，在10%的令牌保留率下，SCORE实现了16倍的前向填充加速，同时保持原模型99.5%的性能，为高效长视频理解提供了可扩展的解决方案。

Abstract

Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from “context rot” due to massive visual token redundancy. Existing compression strategies typically rely on heuristics or fixed transformations that are often decoupled from the downstream task objectives, limiting their adaptability and effectiveness. To address this, we propose SCORE (Surprise-augmented token COmpression via REinforcement learning), a unified framework that learns an adaptive token compression policy. SCORE introduces a lightweight policy network conditioned on a surprise-augmented state representation that incorporates inter-frame residuals to explicitly capture temporal dynamics and motion saliency. We optimize this policy using a group-wise reinforcement learning scheme with a split-advantage estimator, stabilized by a two-stage curriculum transferring from static pseudo-videos to real dynamic videos. Extensive experiments on diverse video understanding benchmarks demonstrate that SCORE significantly outperforms state-of-the-art baselines. Notably, SCORE achieves a 16x prefill speedup while preserving 99.5% of original performance at a 10% retention ratio, offering a scalable solution for efficient long-form video understanding.

Keywords: Multimodal Large Language Models, Video Understanding, Token Compression, Reinforcement Learning, Computational Efficiency, Context Rot, Adaptive Policy, Inference Acceleration
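
The retention-ratio idea in the SCORE abstract above can be illustrated with a small sketch. This is not SCORE's learned policy: it swaps in a fixed heuristic that ranks tokens by their inter-frame residual and keeps the top fraction. The function names, the L2 residual, and the all-zero baseline for the first frame are assumptions made purely for illustration.

```python
import math


def surprise_scores(frames):
    """Score each token by its L2 distance to the same position in the previous frame.

    frames: list of frames; each frame is a list of token vectors (lists of floats).
    Assumption: the first frame is compared against an all-zero frame.
    """
    scores = {}
    for t, frame in enumerate(frames):
        for j, token in enumerate(frame):
            prev = frames[t - 1][j] if t > 0 else [0.0] * len(token)
            scores[(t, j)] = math.sqrt(sum((a - b) ** 2 for a, b in zip(token, prev)))
    return scores


def select_tokens(frames, retention_ratio):
    """Return sorted (frame, position) indices of the highest-residual tokens."""
    scores = surprise_scores(frames)
    k = max(1, round(retention_ratio * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy input: two frames of five 2-D tokens; only one token moves.
frames = [
    [[0.0, 0.0]] * 5,
    [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
]
kept = select_tokens(frames, retention_ratio=0.1)  # 10 tokens in total, keep 1
print(kept)
```

With a 10% retention ratio only the token that changed between frames survives, which is the intuition behind conditioning a compression policy on inter-frame residuals. A learned policy would replace the fixed ranking with a decision optimized against the downstream task.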

139. ❌ HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

Authors: MD Khalequzzaman Chowdhury Sayem, Mubarrat Tajoar Chowdhury, Yihalem Yimolal Tiruneh, Muneeb A. Khan, Muhammad Salman Ali, Binod Bhattarai, Seungryul Baek Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26362v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

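Scoring rationale: The paper studies the limitations of vision-language models (VLMs) in fine-grained spatial reasoning, particularly hand pose understanding. It is unrelated to most of the large language model (LLM) keywords because it focuses on vision-language models rather than text-only large models. The only highly relevant keyword is "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (score 10), since the paper explicitly uses LoRA for lightweight fine-tuning. "Hallucination Mitigation OR Factuality OR Truthfulness" (score 5) is partly relevant because the paper reports hallucination failures (such as hallucinated finger parts). "AI for Science OR Bioinformatics OR Cheminformatics" (score 5) is weakly relevant because the paper touches on application settings such as robot-assisted surgery, though not core bioinformatics or cheminformatics. No other keyword is addressed in the paper, so each is scored 0.

!!! tip deepseek-chat TL;DR

By constructing the HandVQA benchmark, the paper exposes systematic deficiencies of current vision-language models in fine-grained spatial reasoning about hands, and shows that 3D spatial knowledge learned through LoRA fine-tuning significantly improves performance on downstream tasks such as hand gesture recognition.

Abstract (Chinese translation)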

理解人类手部的精细关节活动在机器人辅助手术、芯片制造以及基于AR/VR的人机交互等高精度应用场景中至关重要。尽管当前视觉语言模型（VLMs）在通用视觉语言基准测试中已接近人类水平，但在细粒度空间推理方面仍存在困难，尤其是在解析复杂且高度关节化的手部姿态时。我们提出了HandVQA——一个通过视觉问答评估视觉语言模型对精细手部解剖结构理解能力的大规模诊断基准。该基准基于高质量三维手部数据集（FreiHAND、InterHand2.6M、FPHA）构建，包含超过160万道受控选择题，用于探究手部关节间的空间关系，如角度、距离和相对位置。我们评估了多种前沿视觉语言模型（LLaVA、DeepSeek和Qwen-VL）在基础设置和微调设置下的表现，并采用LoRA进行轻量级微调。研究结果揭示了当前模型存在系统性局限，包括虚构手指部位、几何关系误判以及泛化能力不足等问题。HandVQA不仅暴露了这些关键推理缺陷，同时提供了经过验证的改进路径。我们证明，从该基准学习到的三维空间知识能够以零样本方式迁移，显著提升模型在新下游任务中的准确率，例如手势识别（提升10.33%）和手-物体交互（提升2.63%）。

Abstract

Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs’ understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but also provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving model accuracy on novel downstream tasks such as hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).

Keywords: Vision-Language Models, Spatial Reasoning, Hand Pose Understanding, Benchmark Evaluation, LoRA Fine-tuning, 3D Hand Datasets, Gesture Recognition, Hallucination Mitigation
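
The HandVQA abstract above describes questions about angles, distances, and relative positions between hand joints. As a sketch of how one such quantity could be derived from 3D keypoints, the code below computes the angle at a joint and maps it to an answer option. The bin thresholds, option labels, and coordinates are hypothetical; the abstract does not specify how the benchmark discretizes angles.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the segments b->a and b->c."""
    ba = [x - y for x, y in zip(a, b)]
    bc = [x - y for x, y in zip(c, b)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    if norm == 0:
        raise ValueError("joints must not coincide")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


def angle_option(angle_deg):
    """Map an angle to a multiple-choice label (hypothetical bins)."""
    if angle_deg >= 150:
        return "nearly straight"
    if angle_deg >= 90:
        return "slightly bent"
    return "strongly bent"


# Hypothetical keypoints for three consecutive joints of one finger.
straight = joint_angle((0, 0, 0), (1, 0, 0), (2, 0, 0))
bent = joint_angle((0, 0, 0), (1, 0, 0), (0.5, 0.5, 0))
print(angle_option(straight), "/", angle_option(bent))
```

Deriving the answer from ground-truth 3D coordinates is what makes such questions "controlled": the correct option follows from geometry rather than from human annotation.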

140. ❌ MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

Authors: Quan Dao, Dimitris Metaxas Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26357v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

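Scoring rationale: The paper focuses on architectural innovation for diffusion and flow-matching models in computer vision, proposing a multi-scale Transformer design (MPDiT) to improve computational efficiency. All scoring keywords target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), whereas this paper studies visual generative models (diffusion/flow-matching models), a different technical area and application direction. All keywords are therefore scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a multi-scale Transformer architecture (MPDiT) that processes image patches of different sizes hierarchically to improve the training efficiency of diffusion and flow-matching models, cutting computational cost by up to 50% (in GFLOPs) on ImageNet while maintaining good generative performance.

Abstract (Chinese translation)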

Transformer架构，特别是扩散Transformer（DiTs），因其相较于卷积UNet的优异性能，已在扩散模型和流匹配模型中得到广泛应用。然而，DiTs的各向同性设计使得每个模块处理相同数量的分块化标记（patchified tokens），导致训练过程中的计算负担相对较重。在本研究中，我们引入了一种多分块Transformer设计：早期模块使用较大的分块以捕捉粗略的全局上下文信息，而后期模块则使用较小的分块以细化局部细节。这种分层设计可将计算成本降低高达50%（以GFLOPs计），同时保持良好的生成性能。此外，我们还提出了改进的时间嵌入（time embedding）与类别嵌入（class embedding）设计，以加速训练收敛。在ImageNet数据集上进行的大量实验验证了我们架构选择的有效性。代码发布于 https://github.com/quandao10/MPDiT 。

Abstract

Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design can reduce computational cost by up to 50% in GFLOPs while achieving good generative performance. In addition, we propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices. Code is released at https://github.com/quandao10/MPDiT.

Keywords: Transformer, Diffusion Models, Flow Matching, Multi-Patch Architecture, Computational Efficiency, Image Generation, Hierarchical Design, Training Acceleration
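
The compute argument in the MPDiT abstract above comes down to token counts: a larger patch size yields fewer tokens per block. The sketch below works through that arithmetic. The 32×32 input, the patch sizes 4 and 2, and the 6+6 block split are hypothetical values chosen for illustration, not the paper's configuration, and token count is only a rough proxy for GFLOPs.

```python
def num_tokens(side, patch):
    """Number of tokens when a side x side input is split into patch x patch patches."""
    if side % patch != 0:
        raise ValueError("side must be divisible by patch")
    return (side // patch) ** 2


def relative_token_load(side, patch_schedule, base_patch):
    """Total tokens processed across blocks, relative to an isotropic stack.

    patch_schedule: patch size used by each block, in order.
    base_patch: patch size an isotropic design would use in every block.
    """
    multi = sum(num_tokens(side, p) for p in patch_schedule)
    isotropic = len(patch_schedule) * num_tokens(side, base_patch)
    return multi / isotropic


# Hypothetical 12-block stack: 6 early blocks on 4x4 patches, 6 late blocks on 2x2.
schedule = [4] * 6 + [2] * 6
print(num_tokens(32, 4), num_tokens(32, 2))             # 64 256
print(relative_token_load(32, schedule, base_patch=2))  # 0.625
```

Doubling the patch side cuts the token count by 4x, so the early blocks in this toy schedule handle a quarter of the tokens that an isotropic block would. Attention cost grows quadratically with token count, so the actual saving in attention layers would be larger than this linear proxy suggests.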

141. ❌ From Pen to Pixel: Translating Hand-Drawn Plots into Graphical APIs via a Novel Benchmark and Efficient Adapter

Authors: Zhenghao Xu, Mengning Yang Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26356v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

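Scoring rationale: The paper studies the translation of hand-drawn plots into graphical APIs and proposes the HDpy-13 dataset and the Plot-Adapter method. Although it involves neural networks and model adapters, it focuses mainly on computer vision, image processing, and domain-specific dataset construction, and does not address large language models, innovations in deep learning principles, or applications of large models across domains. The paper was judged unrelated to the large-model, deep-learning-principle, and AI for Science topics that the keywords cover, so all keywords are scored 0.

!!! tip deepseek-chat TL;DR

To address the difficulty of recommending graphical APIs for hand-drawn plot images, the paper proposes the HDpy-13 dataset and the efficient Plot-Adapter method, which improve API recommendation for hand-drawn plots while reducing parameter count and computational cost.

Abstract (Chinese translation)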

在当代数据可视化与分析中，绘图工具发挥着关键作用。Plot2API 旨在通过神经网络直接根据参考绘图图像推荐图形应用程序接口（API），以帮助非专业用户和初学者创建所需的图表。然而，以往的 Plot2API 研究主要集中于标准绘图图像的推荐，而忽略了非专业用户和初学者更易获取的手绘绘图图像。更棘手的是，由于领域差异和专业知识的缺乏，基于标准绘图图像训练的 Plot2API 模型以及强大的多模态大语言模型均难以有效推荐适用于手绘绘图图像的 API。为便利非专业用户和初学者，我们引入了一个名为 HDpy-13 的手绘绘图数据集，以提升针对手绘绘图图像的图形 API 推荐性能。此外，为缓解 Plot2API 在多领域和多语言挑战中因参数量增长与计算资源成本带来的巨大压力，我们提出了 Plot-Adapter 方法，该方法允许训练和存储独立的适配器，而无需为每种语言和领域训练完整模型。具体而言，Plot-Adapter 引入了一个轻量级卷积神经网络（CNN）模块以增强局部特征捕捉能力，并采用投影矩阵共享机制进一步减少微调参数数量。实验结果表明，HDpy-13 数据集的有效性与 Plot-Adapter 的高效性均得到验证。

Abstract

As plots play a critical role in modern data visualization and analysis, Plot2API was introduced to help non-experts and beginners create their desired plots by using neural networks to recommend graphical APIs directly from reference plot images. However, previous work on Plot2API has primarily focused on recommendation for standard plot images, while overlooking the hand-drawn plot images that are more accessible to non-experts and beginners. To make matters worse, both Plot2API models trained on standard plot images and powerful multi-modal large language models struggle to effectively recommend APIs for hand-drawn plot images due to the domain gap and lack of expertise. To support non-experts and beginners, we introduce a hand-drawn plot dataset named HDpy-13 to improve the performance of graphical API recommendation for hand-drawn plot images. Additionally, to alleviate the considerable strain of parameter growth and computational resource costs arising from multi-domain and multi-language challenges in Plot2API, we propose Plot-Adapter, which allows separate adapters to be trained and stored rather than requiring an entire model for each language and domain. In particular, Plot-Adapter incorporates a lightweight CNN block to improve the ability to capture local features and implements projection matrix sharing to further reduce the number of fine-tuning parameters. Experimental results demonstrate both the effectiveness of HDpy-13 and the efficiency of Plot-Adapter.

Keywords: hand-drawn plots, graphical API recommendation, HDpy-13 dataset, Plot-Adapter, parameter-efficient fine-tuning, domain adaptation, computer vision, data visualization

142. ❌ Only Whats Necessary: Pareto Optimal Data Minimization for Privacy Preserving Video Anomaly Detection

Authors: Nazia Aslam, Abhisek Ray, Thomas B. Moeslund, Kamal Nasrollahi Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26354v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

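Scoring rationale: The paper focuses on privacy protection in video anomaly detection (VAD) and proposes a data minimization framework that balances privacy against detection performance. It covers computer vision, privacy protection (GDPR compliance), and anomaly detection, but does not involve large language models (LLMs), innovations in deep learning principles, or AI applications in science. The scoring keywords all concern large-model techniques, training methods, inference optimization, AI agents, and similar topics, whereas the core of this paper is privacy handling for visual data, which has no direct connection to them.

!!! tip deepseek-chat TL;DR

To protect personal data in video anomaly detection, the paper proposes a data minimization framework and uses Pareto analysis of the trade-off between privacy and detection performance to identify the best operating points.

Abstract (Chinese translation)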

视频异常检测系统正日益部署于安全关键环境中，其准确检测需要大量数据支持。然而，此类数据可能包含个人可识别信息，例如面部特征和敏感人口统计属性，这给欧盟《通用数据保护条例》下的合规性带来了挑战。GDPR特别要求，个人数据的使用应严格限定于特定处理目的所必需的范围。为此，我们提出了“仅需必要信息”这一隐私设计框架，该框架通过设计控制暴露给检测流程的视觉信息量与类型。该框架结合基于广度与基于深度的数据最小化机制，在抑制PII的同时保留与异常检测相关的关键线索。我们通过将最小化处理后的视频输入至VAD模型和隐私推断模型，评估了一系列最小化配置方案。采用两种基于排序的方法并结合帕累托分析，以刻画隐私保护与检测效用之间的权衡关系。从非支配前沿中，我们确定了最佳平衡点，这些操作点在检测性能有限下降的前提下，实现了个人数据暴露的最小化。在公开数据集上的大量实验验证了所提框架的有效性。

Abstract

Video anomaly detection (VAD) systems are increasingly deployed in safety-critical environments and require a large amount of data for accurate detection. However, such data may contain personally identifiable information (PII), including facial cues and sensitive demographic attributes, creating compliance challenges under the EU General Data Protection Regulation (GDPR). In particular, GDPR requires that personal data be limited to what is strictly necessary for a specified processing purpose. To address this, we introduce Only What’s Necessary, a privacy-by-design framework for VAD that explicitly controls the amount and type of visual information exposed to the detection pipeline. The framework combines breadth-based and depth-based data minimization mechanisms to suppress PII while preserving cues relevant to anomaly detection. We evaluate a range of minimization configurations by feeding the minimized videos to both a VAD model and a privacy inference model. We employ two ranking-based methods, along with Pareto analysis, to characterize the resulting trade-off between privacy and utility. From the non-dominated frontier, we identify sweet-spot operating points that minimize personal data exposure with limited degradation in detection performance. Extensive experiments on publicly available datasets demonstrate the effectiveness of the proposed framework.

Keywords: Video Anomaly Detection, Privacy Preservation, Data Minimization, GDPR Compliance, Pareto Analysis, Personally Identifiable Information, Privacy-Utility Trade-off, Visual Information Control
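
The non-dominated frontier mentioned in the abstract above can be computed with a few lines of code. The sketch below is a generic Pareto filter, not the authors' pipeline; the configuration names and the (privacy leakage, detection score) numbers are invented for illustration.

```python
def pareto_front(configs):
    """Return names of configurations not dominated by any other.

    configs: list of (name, privacy_leakage, detection_score) tuples.
    Lower leakage is better; higher detection score is better.
    """
    front = []
    for name, leak, score in configs:
        dominated = any(
            other_leak <= leak
            and other_score >= score
            and (other_leak < leak or other_score > score)
            for _, other_leak, other_score in configs
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical minimization configurations (all values are made up).
configs = [
    ("raw", 0.90, 0.85),
    ("blur", 0.50, 0.83),
    ("mask", 0.30, 0.80),
    ("heavy_mask", 0.30, 0.70),
    ("noise", 0.60, 0.75),
]
print(pareto_front(configs))  # ['raw', 'blur', 'mask']
```

A "sweet spot" is then a point chosen from this front, for example the one with the lowest leakage whose detection score stays within a tolerated drop from the raw-video baseline. How that tolerance is set is a policy decision the sketch does not model.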

143. ❌ DuSCN-FusionNet: An Interpretable Dual-Channel Structural Covariance Fusion Framework for ADHD Classification Using Structural MRI

Authors: Qurat Ul Ain, Alptekin Temizel, Soyiba Jawed Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26351v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

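Scoring rationale: The paper focuses on medical image analysis for ADHD classification using deep learning, an application of AI in biomedicine. Its core contribution is an interpretable deep learning framework (DuSCN-FusionNet) for sMRI analysis that uses Grad-CAM for interpretability, so it is partly relevant to "Mechanistic Interpretability OR Explainable AI" (score 5). As an application of AI in bioinformatics and medicine, it is also highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (score 8). The paper does not involve large language models (LLMs), model training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or other large-model techniques, so all other keywords are scored 0.

!!! tip deepseek-chat TL;DR

The study proposes DuSCN-FusionNet, an interpretable dual-channel structural covariance fusion framework based on structural MRI for ADHD classification, which achieves 80.59% balanced accuracy on the Peking University site of the ADHD-200 dataset and identifies potential brain biomarkers through interpretability methods.

Abstract (Chinese translation)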

注意缺陷多动障碍（Attention Deficit Hyperactivity Disorder, ADHD）是一种高发的神经发育性疾病，但由于缺乏可靠的基于影像的生物标志物（尤其是解剖学标记物），其神经生物学诊断仍面临挑战。结构磁共振成像（structural MRI, sMRI）为研究ADHD相关的脑部改变提供了非侵入性手段；然而，大多数深度学习方法作为黑箱系统运行，限制了临床可信度与可解释性。本研究提出DuSCN-FusionNet，一种基于sMRI的可解释ADHD分类框架，该框架利用双通道结构协变网络（Structural Covariance Networks, SCNs）来捕捉脑区间形态学关联。通过基于感兴趣区域（ROI）的平均强度特征和区域内变异描述符，分别构建强度型与异质性SCNs，并经由SCN-CNN编码器进行处理。同时，辅助的ROI级变异特征与全局统计描述符通过后期融合策略进行整合以提升性能。该模型采用分层10折交叉验证与5种子集成策略进行评估，在ADHD-200数据集的北京大学站点上取得了80.59%的平均平衡准确率与0.778的曲线下面积（AUC）。DuSCN-FusionNet进一步实现了81.66%的精确率、80.59%的召回率与80.27%的F1分数。此外，本研究将梯度加权类激活映射（Grad-CAM）适配至SCN领域，以推导ROI级别的重要性评分，从而识别出具有结构相关性的脑区作为潜在的生物标志物。

Abstract

Attention Deficit Hyperactivity Disorder (ADHD) is a highly prevalent neurodevelopmental condition; however, its neurobiological diagnosis remains challenging due to the lack of reliable imaging-based biomarkers, particularly anatomical markers. Structural MRI (sMRI) provides a non-invasive modality for investigating brain alterations associated with ADHD; nevertheless, most deep learning approaches function as black-box systems, limiting clinical trust and interpretability. In this work, we propose DuSCN-FusionNet, an interpretable sMRI-based framework for ADHD classification that leverages dual-channel Structural Covariance Networks (SCNs) to capture inter-regional morphological relationships. ROI-wise mean intensity and intra-regional variability descriptors are used to construct intensity-based and heterogeneity-based SCNs, which are processed through an SCN-CNN encoder. In parallel, auxiliary ROI-wise variability features and global statistical descriptors are integrated via late-stage fusion to enhance performance. The model is evaluated using stratified 10-fold cross-validation with a 5-seed ensemble strategy, achieving a mean balanced accuracy of 80.59% and an AUC of 0.778 on the Peking University site of the ADHD-200 dataset. DuSCN-FusionNet further achieves precision, recall, and F1-scores of 81.66%, 80.59%, and 80.27%, respectively. Moreover, Grad-CAM is adapted to the SCN domain to derive ROI-level importance scores, enabling the identification of structurally relevant brain regions as potential biomarkers.

Keywords: ADHD classification, structural MRI, interpretable deep learning, Structural Covariance Networks, brain biomarkers, Grad-CAM, medical image analysis, neurodevelopmental disorders
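
Balanced accuracy, the headline metric in the DuSCN-FusionNet abstract above, is the mean of per-class recall, which is why it can differ from plain accuracy on imbalanced data. The abstract reports the same value (80.59%) for recall and balanced accuracy, which would be consistent with a macro-averaged recall, though the abstract does not say how recall is averaged. A minimal sketch follows; the labels are made-up toy data, not ADHD-200 results.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy imbalanced example: four samples of class 1, two of class 0.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred))  # (3/4 + 1/2) / 2 = 0.625
print(accuracy(y_true, y_pred))           # 4/6
```

Here the majority class inflates plain accuracy (about 0.667) relative to balanced accuracy (0.625), which is the usual reason to report the balanced figure when class sizes differ.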

144. ❌ HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network

Authors: Mingyu Zhang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Jiajia Nie, Yinwei Wei, Yupeng Hu Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26341v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

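Scoring rationale: The paper studies the composed image retrieval (CIR) task and proposes a dual-path compositional contextualized network (HINT) to address the use of contextual information in cross-modal alignment and feature fusion. Its core techniques are computer vision and cross-modal retrieval, involving image processing, feature fusion, and similarity computation, but not large language models (LLMs), innovations in deep learning principles, or AI applications in science. The scoring keywords all focus on large-model techniques, training methods, inference optimization, alignment techniques, agent systems, and similar topics, with no direct connection to the paper's visual retrieval topic. All keywords are therefore scored 0.

!!! tip deepseek-chat TL;DR

To address the neglect of contextual information in composed image retrieval, the paper proposes HINT, a dual-path compositional contextualized network that uses contextualized encoding and a mechanism for amplifying similarity differences to achieve the best performance on two benchmark datasets.

Abstract (Chinese translation)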

组合图像检索（Composed Image Retrieval，CIR）是一种具有挑战性的图像检索范式。其目标是从大规模图像数据库中，基于由参考图像和修改文本组成的多模态查询，检索出与修改语义一致的目标图像。尽管现有方法在跨模态对齐和特征融合方面取得了显著进展，但仍存在一个关键缺陷：在区分匹配样本时忽略了上下文信息。然而，由于两个挑战的存在，解决这一局限并非易事：1）隐式依赖性；2）缺乏差异放大机制。为应对这些挑战，我们提出了一种双路径组合上下文化网络（dual-patH composItional coNtextualized neTwork，HINT），该网络能够执行上下文化编码并放大匹配与非匹配样本之间的相似性差异，从而提升CIR模型在复杂场景下的性能上限。我们的HINT模型在两个CIR基准数据集的所有指标上均取得了最优性能，证明了其优越性。代码可在https://github.com/zh-mingyu/HINT获取。

Abstract

Composed Image Retrieval (CIR) is a challenging image retrieval paradigm. It aims to retrieve target images from large-scale image databases that are consistent with the modification semantics, based on a multimodal query composed of a reference image and modification text. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: the neglect of contextual information in discriminating matching samples. However, addressing this limitation is not an easy task due to two challenges: 1) implicit dependencies and 2) the lack of a differential amplification mechanism. To address these challenges, we propose a dual-patH composItional coNtextualized neTwork (HINT), which can perform contextualized encoding and amplify the similarity differences between matching and non-matching samples, thus raising the performance ceiling of CIR models in complex scenarios. HINT achieves the best performance on all metrics across two CIR benchmark datasets, demonstrating its superiority. Code is available at https://github.com/zh-mingyu/HINT.

Keywords: Composed Image Retrieval, Multimodal Query, Contextual Information, Dual-path Network, Cross-modal Alignment, Feature Fusion, Similarity Amplification, Benchmark Datasets

145. ❌ From Pixels to Privacy: Temporally Consistent Video Anonymization via Token Pruning for Privacy Preserving Action Recognition

Authors: Nazia Aslam, Abhisek Ray, Joakim Bruslund Haurum, Lukas Esterle, Kamal Nasrollahi Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26336v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的视频隐私保护技术，使用Vision Transformers进行时空特征解耦和token剪枝，与所有评分关键词（主要针对大语言模型技术、训练方法、推理优化、对齐技术、代理系统等）均无直接关联。论文未涉及任何语言模型、训练方法、推理技术或科学AI应用，因此所有关键词得分为0。

!!! tip deepseek-chat TL;DR

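The paper proposes a spatiotemporal video anonymization framework based on Vision Transformer attention. It protects video privacy by separating action-relevant information from privacy-sensitive content while preserving action recognition performance.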

Abstract

Recent advances in large-scale video models have significantly improved video understanding across domains such as surveillance, healthcare, and entertainment. However, these models also amplify privacy risks by encoding sensitive attributes, including facial identity, race, and gender. While image anonymization has been extensively studied, video anonymization remains relatively underexplored, even though modern video models can leverage spatiotemporal motion patterns as biometric identifiers. To address this challenge, we propose a novel attention-driven spatiotemporal video anonymization framework based on systematic disentanglement of utility and privacy features. Our key insight is that attention mechanisms in Vision Transformers (ViTs) can be explicitly structured to separate action-relevant information from privacy-sensitive content. Building on this insight, we introduce two task-specific classification tokens, an action CLS token and a privacy CLS token, that learn complementary representations within a shared Transformer backbone. We contrast their attention distributions to compute a utility-privacy score for each spatiotemporal tubelet, and keep the top-k tubelets with the highest scores. This selectively prunes tubelets dominated by privacy cues while preserving those most critical for action recognition. Extensive experiments demonstrate that our approach maintains action recognition performance comparable to models trained on raw videos, while substantially reducing privacy leakage. These results indicate that attention-driven spatiotemporal pruning offers an effective and principled solution for privacy-preserving video analytics.

Keywords: video anonymization, privacy preserving, action recognition, Vision Transformers, attention mechanisms, spatiotemporal pruning, utility-privacy trade-off, biometric identifiers
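
The selection step described in the abstract (contrast the attention of the two CLS tokens, score each tubelet, keep the top-k) can be sketched roughly as follows. The abstract does not give the scoring function, so the difference `action_attn - privacy_attn` is an assumption, and `select_tubelets` is a hypothetical helper rather than the authors' code.

```python
import numpy as np


def select_tubelets(action_attn, privacy_attn, k):
    """Keep the k tubelets with the highest utility-privacy score.

    action_attn / privacy_attn: attention mass that the action CLS token and
    the privacy CLS token place on each tubelet (1-D, same length).
    """
    action_attn = np.asarray(action_attn, dtype=float)
    privacy_attn = np.asarray(privacy_attn, dtype=float)
    # Assumed score: high when the action token attends and the privacy token does not.
    score = action_attn - privacy_attn
    keep = np.argsort(-score, kind="stable")[:k]
    return np.sort(keep), score


# Toy example with four tubelets; tubelet 1 is dominated by privacy cues.
action = [0.40, 0.05, 0.30, 0.25]
privacy = [0.10, 0.60, 0.10, 0.20]
keep, score = select_tubelets(action, privacy, k=2)
print(keep)  # indices of the tubelets that survive pruning
```

In this toy input the privacy-dominated tubelet receives the lowest score and is pruned, which is the behaviour the abstract describes.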

146. ❌ Verify Claimed Text-to-Image Models via Boundary-Aware Prompt Optimization

Authors: Zidong Zhao, Yihao Huang, Qing Guo, Tianlin Li, Anran Li, Kailong Wang, Jin Song Dong, Geguang Pu Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26328v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

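Scoring rationale: The paper focuses on verification of text-to-image (T2I) generation models and proposes a reference-free verification technique called Boundary-aware Prompt Optimization (BPO). Its core problem is verifying the authenticity of T2I model APIs, which it addresses by exploring the target model's semantic boundaries in the embedding space to generate model-specific verification prompts. All scoring keywords revolve around large language models (LLMs) and related techniques (training, alignment, reasoning, optimization, applications), whereas this paper studies text-to-image generation models, at the intersection of computer vision and generative modeling, not LLMs. The paper is therefore unrelated to all of the given keywords, and each keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes a reference-free method called Boundary-aware Prompt Optimization (BPO) for verifying the authenticity of text-to-image model APIs. It generates reliable verification prompts by identifying model-specific semantic boundaries, and experiments show superior verification accuracy across multiple models.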

Abstract

As Text-to-Image (T2I) generation becomes widespread, third-party platforms increasingly integrate multiple model APIs for convenient image creation. However, false claims of using official models can mislead users and harm model owners’ reputations, making model verification essential to confirm whether an API’s underlying model matches its claim. Existing methods address this by using verification prompts generated by official model owners, but the generation relies on multiple reference models for optimization, leading to high computational cost and sensitivity to model selection. To address this problem, we propose a reference-free T2I model verification method called Boundary-aware Prompt Optimization (BPO). It directly explores the intrinsic characteristics of the target model. The key insight is that although different T2I models produce similar outputs for normal prompts, their semantic boundaries in the embedding space (transition zones between two concepts such as “corgi” and “bagel”) are distinct. Prompts near these boundaries generate unstable outputs (e.g., sometimes a corgi and sometimes a bagel) on the target model but remain stable on other models. By identifying such boundary-adjacent prompts, BPO captures model-specific behaviors that serve as reliable verification cues for distinguishing T2I models. Experiments on five T2I models and four baselines demonstrate that BPO achieves superior verification accuracy.

Keywords: Text-to-Image Generation, Model Verification, Boundary-aware Prompt Optimization, Semantic Boundaries, API Authentication, Reference-free Method, Stable Diffusion, DALL-E
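
The verification cue in the abstract is that a boundary-adjacent prompt gives unstable outputs on the target model and stable outputs elsewhere. One way to quantify that instability is the entropy of the concept labels observed over repeated generations, as sketched below. The abstract does not say how BPO makes its decision, so the labels (assumed to come from some image classifier), the entropy measure, and the `min_entropy` threshold are all illustrative assumptions.

```python
import math
from collections import Counter


def output_entropy(labels):
    """Shannon entropy (bits) of the concept labels seen across repeated generations."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def looks_like_target(labels, min_entropy=0.5):
    """Hypothetical decision rule: unstable outputs suggest the claimed target model."""
    return output_entropy(labels) >= min_entropy


# Same boundary prompt sent four times to two different APIs (toy labels).
target_run = ["corgi", "bagel", "corgi", "bagel"]  # flips between concepts
other_run = ["corgi", "corgi", "corgi", "corgi"]   # stable output

print(looks_like_target(target_run), looks_like_target(other_run))
```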

147. ❌ DFM-VLA

Authors: Jiayi Chen, Wenxuan Song, Shuai Chen, Jingbo Wang, Zhijun Li, Haoang Li Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26320v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

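Scoring rationale: DFM-VLA focuses on Vision-Language-Action (VLA) models for robotic manipulation and proposes an iterative action refinement method based on discrete flow matching. Although the paper applies deep learning to robotics, all given keywords explicitly target large language models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, and so on), while the core of this paper is action generation and refinement in VLA models; it does not involve LLMs, foundation models, or any of the listed LLM-specific techniques. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes DFM-VLA, a Vision-Language-Action model based on discrete flow matching that improves robotic manipulation by iteratively refining action tokens, outperforming autoregressive and diffusion baselines on the CALVIN and LIBERO benchmarks.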

Abstract

Vision–Language–Action (VLA) models that encode actions using a discrete tokenization scheme are increasingly adopted for robotic manipulation, but existing decoding paradigms remain fundamentally limited. Whether actions are decoded sequentially by autoregressive VLAs or in parallel by discrete diffusion VLAs, once a token is generated, it is typically fixed and cannot be revised in subsequent iterations, so early token errors cannot be effectively corrected later. We propose DFM-VLA, a discrete flow matching VLA for iterative refinement of action tokens. DFM-VLA models a token-level probability velocity field that dynamically updates the full action sequence across refinement iterations. We investigate two ways to construct the velocity field: an auxiliary velocity-head formulation and an action-embedding-guided formulation. Our framework further adopts a two-stage decoding strategy with an iterative refinement stage followed by deterministic validation for stable convergence. Extensive experiments on CALVIN, LIBERO, and real-world manipulation tasks show that DFM-VLA consistently outperforms strong autoregressive, discrete diffusion, and continuous diffusion baselines in manipulation performance while retaining high inference efficiency. In particular, DFM-VLA achieves an average success length of 4.44 on CALVIN and an average success rate of 95.7% on LIBERO, highlighting the value of action refinement via discrete flow matching for robotic manipulation. Our project is available at https://chris1220313648.github.io/DFM-VLA/

Keywords: Vision-Language-Action models, discrete flow matching, iterative action refinement, robotic manipulation, action tokenization, VLA decoding, discrete diffusion, autoregressive models

148. ❌ SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning

Authors: Cai Selvas-Sala, Lei Kang, Lluis Gomez Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26316v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

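Scoring rationale: The paper focuses on machine unlearning for multimodal models such as CLIP and proposes the SALMUBench benchmark to evaluate forgetting at the level of sensitive associations. Although it involves a large model (CLIP), the research focus is the specific task of machine unlearning rather than the core topics covered by the keywords, such as large-model principles, training methods, inference optimization, alignment, compression, agent systems, or AI-for-science applications. No keyword is directly related to the paper's topic, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

Addressing the removal of sensitive information from multimodal models such as CLIP, the paper proposes the SALMUBench benchmark, which uses a synthetic dataset and a structured evaluation protocol to precisely measure the efficacy and side effects of machine unlearning. It finds that existing methods either forget too little or over-generalize.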

Abstract

As multimodal models like CLIP become integral to downstream systems, the need to remove sensitive information is critical. However, machine unlearning for contrastively-trained encoders remains underexplored, and existing evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean model without it. To isolate unlearning effects, both are trained from scratch on the same 400M-pair retain base, with the Compromised model additionally trained on the sensitive set. We propose a novel evaluation protocol with structured holdout sets (holdout identity, holdout association) to precisely measure unlearning efficacy and collateral damage. Our benchmark reveals that while utility-efficient deletion is feasible, current methods exhibit distinct failure modes: they either fail to forget effectively or over-generalize by erasing more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation, and we publicly release our dataset, models, evaluation scripts, and leaderboards to foster future research.

Keywords: multimodal models, machine unlearning, sensitive information removal, benchmark evaluation, association-level forgetting, CLIP, SALMUBench, contrastively-trained encoders

149. ❌ DRUM: Diffusion-based Raydrop-aware Unpaired Mapping for Sim2Real LiDAR Segmentation

Authors: Tomoya Miyawaki, Kazuto Nakashima, Yumi Iwashita, Ryo Kurazume Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26263v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

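Scoring rationale: DRUM focuses on Sim2Real translation for LiDAR semantic segmentation, using a diffusion model as a generative prior to address domain adaptation from synthetic to real data. It is unrelated to most keywords (LLMs, MoE, RLHF, RAG, and so on), which mainly target large language models and related techniques. The only relevant keyword is "Pre-training OR Continual Pre-training OR Domain Adaptation", because the paper involves domain adaptation and a pre-trained diffusion model; since this is not a core large-model technique, it receives a relevance of 5 (some connection). No other keyword is touched on, so the rest receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes DRUM, a diffusion-based Sim2Real translation framework that addresses synthetic-to-real domain adaptation in LiDAR semantic segmentation, improving performance on real data by reproducing reflectance intensity and raydrop noise.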

Abstract

LiDAR-based semantic segmentation is a key component for autonomous mobile robots, yet large-scale annotation of LiDAR point clouds is prohibitively expensive and time-consuming. Although simulators can provide labeled synthetic data, models trained on synthetic data often underperform on real-world data due to a data-level domain gap. To address this issue, we propose DRUM, a novel Sim2Real translation framework. We leverage a diffusion model pre-trained on unlabeled real-world data as a generative prior and translate synthetic data by reproducing two key measurement characteristics: reflectance intensity and raydrop noise. To improve sample fidelity, we introduce a raydrop-aware masked guidance mechanism that selectively enforces consistency with the input synthetic data while preserving realistic raydrop noise induced by the diffusion prior. Experimental results demonstrate that DRUM consistently improves Sim2Real performance across multiple representations of LiDAR data. The project page is available at https://miya-tomoya.github.io/drum.

Keywords: LiDAR semantic segmentation, Sim2Real translation, diffusion model, domain adaptation, raydrop noise, generative prior, synthetic data, autonomous robots

150. ❌ GLASS: Geometry-aware Local Alignment and Structure Synchronization Network for 2D-3D Registration

Authors: Zhixin Cheng, Jiacheng Deng, Xinjun Li, Bohao Liao, Li Liu, Xiaotian Yin, Baoqun Yin, Tianzhu Zhang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26262v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

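Scoring rationale: The paper focuses on 2D-3D registration in computer vision and proposes a geometry-aware local alignment and structure synchronization network. Although it falls within applied AI, none of the keywords, which cover large models, deep learning principles, and AI-for-science applications such as bioinformatics, relate to its topic. The paper involves no language models, model training or fine-tuning techniques, reasoning methods, agent systems, model optimization, or domain-specific scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper addresses incorrect matches in 2D-3D registration caused by repetitive patterns and structural inconsistency. By introducing Local Geometry Enhancement and Graph Distribution Consistency modules, it achieves state-of-the-art performance on the RGB-D Scenes v2 and 7-Scenes benchmarks.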

Abstract

Image-to-point cloud registration methods typically follow a coarse-to-fine pipeline, extracting patch-level correspondences and refining them into dense pixel-to-point matches. However, in scenes with repetitive patterns, images often lack sufficient 3D structural cues and alignment with point clouds, leading to incorrect matches. Moreover, prior methods usually overlook structural consistency, limiting the full exploitation of correspondences. To address these issues, we propose two novel modules: the Local Geometry Enhancement (LGE) module and the Graph Distribution Consistency (GDC) module. LGE enhances both image and point cloud features with normal vectors, injecting geometric structure into image features to reduce mismatches. GDC constructs a graph from matched points to update features and explicitly constrain similarity distributions. Extensive experiments and ablations on two benchmarks, RGB-D Scenes v2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance in image-to-point cloud registration.

Keywords: 2D-3D registration, image-to-point cloud registration, geometry-aware, local alignment, structure synchronization, normal vectors, graph distribution consistency, state-of-the-art performance

151. ❌ 4DRaL: Bridging 4D Radar with LiDAR for Place Recognition using Knowledge Distillation

Authors: Ningyuan Huang, Zhiheng Li, Zheng Fang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26206v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

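Scoring rationale: The paper studies place recognition with 4D radar and LiDAR using knowledge distillation, which belongs to robot perception. All scoring keywords relate directly to large models, deep learning principles, or AI applications in science, none of which this paper covers, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new framework called 4DRaL that uses knowledge distillation to let a high-performance LiDAR model guide a 4D radar model, addressing the limits that noise and sparsity in 4D radar data place on place recognition. It achieves state-of-the-art performance in both normal and adverse weather.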

Abstract

Place recognition is crucial for loop closure detection and global localization in robotics. Although mainstream algorithms typically rely on cameras and LiDAR, these sensors are susceptible to adverse weather conditions. Fortunately, the recently developed 4D millimeter-wave radar (4D radar) offers a promising solution for all-weather place recognition. However, the inherent noise and sparsity in 4D radar data significantly limit its performance. Thus, in this paper, we propose a novel framework called 4DRaL that leverages knowledge distillation (KD) to enhance the place recognition performance of 4D radar. Its core is to adopt a high-performance LiDAR-to-LiDAR (L2L) place recognition model as a teacher to guide the training of a 4D radar-to-4D radar (R2R) place recognition model. 4DRaL comprises three key KD modules: a local image enhancement module to handle the sparsity of raw 4D radar points, a feature distribution distillation module that ensures the student model generates more discriminative features, and a response distillation module to maintain consistency in feature space between the teacher and student models. More importantly, 4DRaL can also be trained for 4D radar-to-LiDAR (R2L) place recognition through different module configurations. Experimental results prove that 4DRaL achieves state-of-the-art performance in both R2R and R2L tasks regardless of normal or adverse weather.

Keywords: 4D radar, LiDAR, place recognition, knowledge distillation, robotics, adverse weather, feature distribution, state-of-the-art
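
The teacher-student setup in the abstract (a LiDAR model guiding a 4D radar model) can be illustrated with two generic distillation losses on global place descriptors. These are textbook-style stand-ins, not the 4DRaL modules themselves: the abstract does not specify its loss functions, so the mean-squared response term, the KL term on in-batch similarity distributions, and the `temperature` value are all assumptions.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row (one descriptor per place) to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(x):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))


def response_loss(student, teacher):
    """Mean squared distance between normalized student and teacher descriptors."""
    s, t = l2_normalize(student), l2_normalize(teacher)
    return float(np.mean(np.sum((s - t) ** 2, axis=1)))


def distribution_loss(student, teacher, temperature=0.1):
    """KL(teacher || student) between in-batch similarity distributions."""
    s, t = l2_normalize(student), l2_normalize(teacher)
    log_p_s = log_softmax(s @ s.T / temperature)
    log_p_t = log_softmax(t @ t.T / temperature)
    return float(np.mean(np.sum(np.exp(log_p_t) * (log_p_t - log_p_s), axis=1)))


# Toy batch: 4 places, 8-dimensional descriptors; the student is a noisy teacher.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
student = teacher + 0.5 * rng.normal(size=(4, 8))
print(response_loss(student, teacher), distribution_loss(student, teacher))
```

Both terms vanish when the student reproduces the teacher's descriptors exactly and grow as the two feature spaces diverge.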

152. ❌ Real-Time Branch-to-Tool Distance Estimation for Autonomous UAV Pruning: Benchmarking Five DEFOM-Stereo Variants from Simulation to Jetson Deployment

Authors: Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26250v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

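Scoring rationale: The paper focuses on computer vision and robotics, studying autonomous UAV pruning built on a foundation-model-based stereo matcher (DEFOM-Stereo); it covers training of model variants, a synthetic dataset, edge-device deployment, and performance evaluation. All keywords relate to large language models (LLMs) or general deep learning principles, whereas the paper involves no LLMs, natural language processing, or general deep learning techniques (MoE, scaling laws, alignment, reasoning methods, and so on). The only slightly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to agriculture and robotics, which can be viewed as a subfield of scientific applications; since this is not core bioinformatics or cheminformatics, it receives a relevance of 5 (some connection). All other keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies five variants of the DEFOM-Stereo foundation model for real-time tool-to-branch distance estimation in autonomous UAV pruning. After training on a synthetic dataset and deploying on an NVIDIA Jetson, it finds that the DEFOM-PrunePlus variant offers the best balance between accuracy and speed for real-time, safe operation.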

Abstract

Autonomous tree pruning with unmanned aerial vehicles (UAVs) is a safety-critical real-world task: the onboard perception system must estimate the metric distance from a cutting tool to thin tree branches in real time so that the UAV can approach, align, and actuate the pruner without collision. We address this problem by training five variants of DEFOM-Stereo - a recent foundation-model-based stereo matcher - on a task-specific synthetic dataset and deploying the checkpoints on an NVIDIA Jetson Orin Super 16 GB. The training corpus is built in Unreal Engine 5 with a simulated ZED Mini stereo camera capturing 5,520 stereo pairs across 115 tree instances from three viewpoints at 2m distance; dense EXR depth maps provide exact, spatially complete supervision for thin branches. On the synthetic test set, DEFOM-Stereo ViT-S achieves the best depth-domain accuracy (EPE 1.74 px, D1-all 5.81%, delta-1 95.90%, depth MAE 23.40 cm) but its Jetson inference speed of ~2.2 FPS (~450 ms per frame) remains too slow for responsive closed-loop tool control. A newly introduced balanced variant, DEFOM-PrunePlus (~21M backbone, ~3.3 FPS on Jetson), offers the best deployable accuracy-speed trade-off (EPE 5.87 px, depth MAE 64.26 cm, delta-1 87.59%): its frame rate is sufficient for real-time guidance and its depth accuracy supports safe branch approach planning at the 2m operating range. The lightweight DEFOM-PruneStereo (~6.9 FPS) and DEFOM-PruneNano (~8.5 FPS) run fast but sacrifice substantial accuracy (depth MAE > 57 cm), making estimates too unreliable for safe actuation. Zero-shot inference on real photographs confirms that full-capacity models preserve branch geometry, validating the sim-to-real transfer. We conclude that DEFOM-PrunePlus provides the most practical accuracy-latency balance for onboard distance estimation, while ViT-S serves as the reference for future hardware.

Keywords: autonomous UAV pruning, DEFOM-Stereo, real-time distance estimation, stereo matching, synthetic dataset, NVIDIA Jetson deployment, accuracy-speed trade-off, sim-to-real transfer

153. ❌ Dual-Stage Invariant Continual Learning under Extreme Visual Sparsity

Authors: Rangya Zhang, Jiaping Xiao, Lu Bai, Yuhang Zhang, Mir Feroskhan Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26190v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0
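

Several rows above pair a non-zero relevance (5.0/10, 8.0/10) with a score of 0.0. A minimal sketch of why follows. The report configuration gives the pass threshold as 5.0 + 0.8 × total weight and a maximum of 15 points per keyword, but it does not state the per-keyword formula, so the linear rule in `keyword_score` is an assumption. Under any rule that multiplies by the weight, a weight of 0.0 yields 0.0 regardless of relevance.

```python
MAX_POINTS_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def pass_threshold(total_weight):
    """Pass mark as stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight, relevance):
    """ASSUMED rule: scale relevance (0-10) linearly to the per-keyword maximum."""
    return weight * (relevance / 10.0) * MAX_POINTS_PER_KEYWORD


if __name__ == "__main__":
    print(pass_threshold(27.0))      # threshold for 27 keywords of weight 1.0
    print(keyword_score(0.0, 8.0))   # weight 0.0 zeroes the score
    print(keyword_score(1.0, 8.0))   # same relevance with weight 1.0
```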

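Rationale: The paper addresses continual learning and object detection in computer vision, specifically the detection of space objects under extreme visual sparsity. Its core techniques are a continual learning framework and feature distillation, which are unrelated to most of the large language model (LLM) keywords. The most relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the work targets space science (space-based RSO detection), which is an application of AI to science, though not bioinformatics or cheminformatics, so it receives 8 points. "Domain Adaptation" in "Pre-training OR Continual Pre-training OR Domain Adaptation" is partly related to the paper's "sequential domain shifts", so it receives 5 points. No other keyword applies: the paper does not involve large models, innovations in the underlying principles of deep learning, or related applications.

!!! tip deepseek-chat TL;DR

The paper tackles continual object detection under extreme visual sparsity. It proposes a dual-stage invariant continual learning framework that uses joint distillation to keep both feature representations and detection predictions stable, achieving a +4.0 mAP gain on a space-based object detection dataset.

Abstract (Chinese translation)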

持续学习旨在非平稳环境下保持稳定的适应能力，然而这一问题在目标检测任务中尤为严峻，因为现有方法大多隐式假设了相对均衡的视觉条件。在极端稀疏场景中——例如基于太空的驻留空间目标（RSO，resident space object）检测所观测到的情况——前景信号被背景观测数据完全主导。在此条件下，我们通过理论分析证明：背景主导的梯度会在连续域偏移过程中破坏特征主干网络的稳定性，导致渐进式的表征漂移。这揭示了仅依赖输出层蒸馏的持续学习方法存在结构性局限，因其无法保持中间表征的稳定性。为解决该问题，我们提出一种基于联合蒸馏的双阶段不变性持续学习框架，分别对主干网络表征和检测预测施加结构性与语义一致性约束，从而从源头抑制误差传播，同时保持模型适应性。此外，为在严重数据不平衡条件下调控梯度统计特性，我们提出一种稀疏感知的数据调节策略，结合基于图像块的采样与分布感知的数据增强方法。在高分辨率太空RSO检测数据集上的实验表明，该方法相较于现有持续目标检测方法取得稳定提升，在连续域偏移下实现了平均精度（mAP）绝对值+4.0%的性能增益。

Abstract

Continual learning seeks to maintain stable adaptation under non-stationary environments, yet this problem becomes particularly challenging in object detection, where most existing methods implicitly assume relatively balanced visual conditions. In extreme-sparsity regimes, such as those observed in space-based resident space object (RSO) detection scenarios, foreground signals are overwhelmingly dominated by background observations. Under such conditions, we analytically demonstrate that background-driven gradients destabilize the feature backbone during sequential domain shifts, causing progressive representation drift. This exposes a structural limitation of continual learning approaches relying solely on output-level distillation, as they fail to preserve intermediate representation stability. To address this, we propose a dual-stage invariant continual learning framework via joint distillation, enforcing structural and semantic consistency on both backbone representations and detection predictions, respectively, thereby suppressing error propagation at its source while maintaining adaptability. Furthermore, to regulate gradient statistics under severe imbalance, we introduce a sparsity-aware data conditioning strategy combining patch-based sampling and distribution-aware augmentation. Experiments on a high-resolution space-based RSO detection dataset show consistent improvement over established continual object detection methods, achieving an absolute gain of +4.0 mAP under sequential domain shifts.
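
The joint distillation idea described above (constraining both backbone representations and detection predictions) can be sketched as a two-term loss. This is a generic illustration, not the authors' formulation: the mean-squared feature term, the temperature-scaled KL term, and the weights `w_feat` and `w_out` are all assumptions.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def feature_distillation(feat_old, feat_new):
    """Backbone-level term: mean squared drift of intermediate features."""
    return float(np.mean((feat_new - feat_old) ** 2))


def output_distillation(logits_old, logits_new, temperature=2.0):
    """Prediction-level term: KL(old || new) on temperature-softened outputs."""
    p = softmax(logits_old / temperature)
    q = softmax(logits_new / temperature)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(np.mean(kl))


def joint_distillation(feat_old, feat_new, logits_old, logits_new,
                       w_feat=1.0, w_out=1.0):
    """Weighted sum of both terms (the weights are hypothetical)."""
    return (w_feat * feature_distillation(feat_old, feat_new)
            + w_out * output_distillation(logits_old, logits_new))
```

An output-only method corresponds to `w_feat=0.0`, which leaves feature drift unpenalized; that is the limitation the abstract attributes to output-level distillation.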

Keywords: continual learning, object detection, extreme visual sparsity, domain shift, representation drift, joint distillation, space-based RSO detection, sparsity-aware data conditioning

154. ❌ HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning

Authors: Xuerui Zhang, Xuehao Wang, Zhan Zhuang, Linglan Zhao, Ziyue Li, Xinmin Zhang, Zhihuan Song, Yu Zhang Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26192v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

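Rationale: The paper "HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning" focuses on dense prediction tasks in computer vision and proposes a knowledge-distillation-based lifelong learning method for handling sequences of heterogeneous tasks. Its core topics are lifelong learning, knowledge distillation, heterogeneous tasks, and dense prediction; it does not involve large language models (LLMs), innovations in the underlying principles of deep learning, or applications of large models in other domains. Every scoring keyword concerns LLMs, deep learning principles, or large-model applications, whereas this paper studies conventional computer vision tasks, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses lifelong heterogeneous learning (LHL) over sequences of dense prediction tasks. It proposes Heterogeneity-Aware Distillation (HAD), which retains heterogeneous knowledge through self-distillation with a distribution-balanced loss and a salience-guided loss, and reports significantly better results than existing methods.

Abstract (Chinese translation)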

终身学习旨在保留从先前任务中获得的知识，同时整合一系列新任务的知识。然而，大多数现有研究仅探索同质任务流（例如，仅分类任务），而忽视了在具有不同输出结构的异构任务间进行学习的场景。在本工作中，我们将这一更广泛的设定形式化为终身异构学习（Lifelong Heterogeneous Learning, LHL）。与传统终身学习不同，LHL的任务序列跨越不同的任务类型，学习者需要为不同的输出空间结构保留异构知识。为了具体实现LHL，我们聚焦于密集预测背景下的终身异构学习（LHL4DP），这是一个现实且具有挑战性的场景。为此，我们提出了异构感知蒸馏（Heterogeneity-Aware Distillation, HAD）方法，这是一种无需示例的方法，通过在每个训练阶段进行自蒸馏来保留先前获得的异构知识。所提出的HAD包含两个互补的组件：一是分布平衡的异构感知蒸馏损失，用于缓解预测分布的全局不平衡；二是显著性引导的异构感知蒸馏损失，其通过Sobel算子提取信息丰富的边缘像素，并集中学习这些像素。大量实验表明，所提出的HAD方法在这一新场景中显著优于现有方法。

Abstract

Lifelong learning aims to preserve knowledge acquired from previous tasks while incorporating knowledge from a sequence of new tasks. However, most prior work explores only streams of homogeneous tasks (e.g., only classification tasks) and neglects the scenario of learning across heterogeneous tasks that possess different structures of outputs. In this work, we formalize this broader setting as lifelong heterogeneous learning (LHL). Departing from conventional lifelong learning, the task sequence of LHL spans different task types, and the learner needs to retain heterogeneous knowledge for different output space structures. To instantiate the LHL, we focus on LHL in the context of dense prediction (LHL4DP), a realistic and challenging scenario. To this end, we propose the Heterogeneity-Aware Distillation (HAD) method, an exemplar-free approach that preserves previously gained heterogeneous knowledge by self-distillation in each training phase. The proposed HAD comprises two complementary components, including a distribution-balanced heterogeneity-aware distillation loss to alleviate the global imbalance of prediction distribution and a salience-guided heterogeneity-aware distillation loss that concentrates learning on informative edge pixels extracted with the Sobel operator. Extensive experiments demonstrate that the proposed HAD method significantly outperforms existing methods in this new scenario.
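
The salience-guided component concentrates distillation on edge pixels found with the Sobel operator. A minimal sketch of that idea follows. It assumes the Sobel filter is applied to the previous model's prediction map, that edges are pixels whose gradient magnitude exceeds a fraction of the maximum, and that the loss is a mean squared difference; none of these choices is stated in the abstract.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def filter3x3(img, kernel):
    """Apply a 3x3 kernel (cross-correlation) with edge-replicated padding."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_edge_mask(img, threshold_ratio=0.5):
    """Boolean mask of pixels whose Sobel magnitude exceeds a share of the maximum."""
    magnitude = np.hypot(filter3x3(img, SOBEL_X), filter3x3(img, SOBEL_Y))
    return magnitude > threshold_ratio * magnitude.max()


def salience_guided_loss(pred_old, pred_new, threshold_ratio=0.5):
    """Mean squared old/new discrepancy, restricted to edge pixels (assumed form)."""
    mask = sobel_edge_mask(pred_old, threshold_ratio)
    if not mask.any():
        return 0.0
    return float(np.mean((pred_new[mask] - pred_old[mask]) ** 2))
```

Changes in flat regions of the map do not affect this term, so the penalty is spent on boundaries, where dense predictions carry the most structure.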

Keywords: Lifelong Heterogeneous Learning, Heterogeneity-Aware Distillation, Dense Prediction, Knowledge Distillation, Self-Distillation, Exemplar-Free, Sobel Operator, Salience-Guided

155. ❌ OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement

Authors: Rui Wang, Huisi Wu, Jing Qin Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26188v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

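Rationale: The paper focuses on medical image segmentation (echocardiography video) and proposes a deep learning framework based on orthogonalized state updates and anatomical prior-aware feature enhancement. It addresses spatiotemporal modeling problems in video segmentation, such as rank collapse and noise interference, and belongs to computer vision and medical image analysis. Nearly all keywords concern large language models (LLMs), their training and alignment techniques, inference optimization, agent systems, or general AI techniques, none of which the paper involves. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the study applies deep learning to biomedicine (cardiac function assessment), an application of AI to science, although it is not LLM-related. That keyword therefore receives 10 points (core application), and all others receive 0 (entirely unrelated).

!!! tip deepseek-chat TL;DR

The study addresses spatiotemporal consistency in left-ventricle segmentation from echocardiography videos. Using an orthogonalized state update and anatomical prior-aware feature enhancement, it achieves state-of-the-art segmentation accuracy and real-time inference efficiency on the CAMUS and EchoNet-Dynamic datasets.

Abstract (Chinese translation)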

从超声心动图视频中实现左心室的精确且时序一致的分割，对于估算射血分数和评估心脏功能至关重要。然而，由于严重的散斑噪声和快速的非刚性形变，建模时空动态仍然十分困难。现有的线性循环模型为时序跟踪提供了高效的上下文关联召回能力，但其依赖于无约束的状态更新，这会导致状态矩阵出现渐进性的奇异值衰减，即所谓的秩崩溃现象，从而使解剖细节被噪声淹没。为解决此问题，我们提出了OSA框架，该框架将状态演化约束在Stiefel流形上。我们引入了正交化状态更新（OSU）机制，该机制将记忆演化表述为在Stiefel流形上的欧几里得投影梯度下降，以防止秩崩溃并保持稳定的时序过渡。此外，一个解剖先验感知的特征增强模块通过物理驱动的过程，显式地将解剖结构与散斑噪声分离，为时序跟踪器提供抗噪声的结构线索。在CAMUS和EchoNet-Dynamic数据集上的综合实验表明，OSA实现了最先进的分割精度和时序稳定性，同时保持了临床部署所需的实时推理效率。代码可在 https://github.com/wangrui2025/OSA 获取。

Abstract

Accurate and temporally consistent segmentation of the left ventricle from echocardiography videos is essential for estimating the ejection fraction and assessing cardiac function. However, modeling spatiotemporal dynamics remains difficult due to severe speckle noise and rapid non-rigid deformations. Existing linear recurrent models offer efficient in-context associative recall for temporal tracking, but rely on unconstrained state updates, which cause progressive singular value decay in the state matrix, a phenomenon known as rank collapse, resulting in anatomical details being overwhelmed by noise. To address this, we propose OSA, a framework that constrains the state evolution on the Stiefel manifold. We introduce the Orthogonalized State Update (OSU) mechanism, which formulates the memory evolution as Euclidean projected gradient descent on the Stiefel manifold to prevent rank collapse and maintain stable temporal transitions. Furthermore, an Anatomical Prior-aware Feature Enhancement module explicitly separates anatomical structures from speckle noise through a physics-driven process, providing the temporal tracker with noise-resilient structural cues. Comprehensive experiments on the CAMUS and EchoNet-Dynamic datasets show that OSA achieves state-of-the-art segmentation accuracy and temporal stability, while maintaining real-time inference efficiency for clinical deployment. Codes are available at https://github.com/wangrui2025/OSA.
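
The abstract describes the state update as Euclidean projected gradient descent on the Stiefel manifold, the set of matrices with orthonormal columns. A generic sketch of one such step follows: take an ordinary gradient step, then project back using the polar factor from a thin SVD. The gradient, learning rate, and matrix shapes are hypothetical, and the paper's actual objective and update rule are not given in the abstract.

```python
import numpy as np


def project_to_stiefel(matrix):
    """Nearest matrix with orthonormal columns (Frobenius norm) for a
    full-column-rank input: the polar factor U @ Vt of the thin SVD."""
    u, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return u @ vt


def orthogonalized_update(state, grad, lr=0.1):
    """One projected gradient step: Euclidean step, then projection."""
    return project_to_stiefel(state - lr * grad)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state = project_to_stiefel(rng.normal(size=(6, 3)))
    for _ in range(100):
        state = orthogonalized_update(state, rng.normal(size=(6, 3)))
    # All singular values stay at 1, so the state cannot lose rank.
    print(np.linalg.svd(state, compute_uv=False))
```

Because every singular value of the projected state equals 1, the progressive singular value decay that the abstract calls rank collapse cannot occur under this update.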

Keywords: Echocardiography video segmentation, Left ventricle segmentation, Orthogonalized State Update, Anatomical prior-aware feature enhancement, Temporal consistency, Rank collapse, Stiefel manifold, Real-time inference

156. ❌ GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport

Authors: Youngju Na, Jaeseong Yun, Soohyun Ryu, Hyunsu Kim, Sung-Eui Yoon, Suyong Yeon Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26181v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

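Rationale: GLINT belongs to computer vision and graphics. It extends 3D Gaussian splatting to the reconstruction of transparent scenes, and its core contribution is a decomposed Gaussian representation for modeling scene-scale transparency. All scoring keywords concern large language models, deep learning principles, AI for Science, and similar topics, whereas this paper belongs to computer graphics and 3D reconstruction and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper addresses the inability of 3D Gaussian splatting to model transparent objects such as glass. It proposes GLINT, a framework whose decomposed Gaussian representation models reflected and transmitted radiance separately, improving the reconstruction of complex transparent scenes.

Abstract (Chinese translation)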

尽管三维高斯溅射已成为一种强大的范式，但其本质上无法建模诸如玻璃面板之类的透明物体。核心挑战在于如何解耦来自透明界面与透过玻璃观察到的传输几何体之间相互交织的辐射贡献。我们提出了GLINT框架，该框架通过显式分解的高斯表示来建模场景尺度的透明度。GLINT重建主要界面，并分别对反射和透射辐射进行建模，从而实现一致的辐射传输。在优化过程中，GLINT利用分解引发的几何分离线索，以及来自预训练视频重光照模型的几何与材质先验，自举实现透明度定位。大量实验表明，在重建复杂透明场景方面，本方法相较于现有技术取得了持续性的改进。

Abstract

While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and models reflected and transmitted radiance separately, enabling consistent radiance transport. During optimization, GLINT bootstraps transparency localization from geometry-separation cues induced by the decomposition, together with geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate consistent improvements over prior methods for reconstructing complex transparent scenes.

Keywords: 3D Gaussian splatting, transparency modeling, radiance transport, scene reconstruction, decomposed Gaussian representation, transparent interfaces, video relighting model, optimization

157. ❌ DUGAE: Unified Geometry and Attribute Enhancement via Spatiotemporal Correlations for G-PCC Compressed Dynamic Point Clouds

Authors: Pan Zhao, Hui Yuan, Chang Sun, Chongzhen Tian, Raouf Hamzaoui, Sam Kwong Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26183v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

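Rationale: The paper focuses on quality enhancement for compressed dynamic point clouds, using deep networks based on sparse convolution and motion compensation; it belongs to computer vision and multimedia processing. All scoring keywords concern large-model topics such as large language models, model training, alignment, inference optimization, and agent systems, none of which the paper touches, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes DUGAE, a unified geometry and attribute enhancement framework that explicitly exploits spatiotemporal correlations to improve the visual quality of G-PCC compressed dynamic point clouds, significantly improving geometry and attribute reconstruction on multiple datasets.

Abstract (Chinese translation)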

现有的点云解码后质量增强方法主要针对静态数据设计，通常独立处理每一帧，因此无法有效利用点云序列中存在的时空相关性。我们提出了一种面向G-PCC压缩动态点云的统一几何与属性增强框架（DUGAE），该框架显式地利用了帧间在几何与属性两方面的时空相关性。首先，基于稀疏卷积（SPConv）和特征域几何运动补偿（GMC）的动态几何增强网络（DGE-Net）对齐并聚合时空信息。其次，一个细节感知的k近邻（DA-KNN）重着色模块在编码端将原始属性映射到增强后的几何结构上，从而提升映射完整性并保留属性细节。最后，一个具备专用时序特征提取模块和特征域属性运动补偿（AMC）的动态属性增强网络（DAE-Net）通过建模复杂的时空相关性来优化属性。在来自8iVFB v2、Owlii和MVUB数据集的七个动态点云上，DUGAE显著提升了最新G-PCC基于几何的实体内容测试模型（GeS-TM v10）的性能。在几何（D1）方面，其平均BD-PSNR增益达到11.03 dB，BD比特率降低了93.95%。对于亮度分量，其BD-PSNR增益为4.23 dB，BD比特率降低了66.61%。DUGAE也提升了感知质量（以PCQM度量），并超越了V-PCC的性能。我们的源代码将在GitHub上发布：https://github.com/yuanhui0325/DUGAE

Abstract

Existing post-decoding quality enhancement methods for point clouds are designed for static data and typically process each frame independently. As a result, they cannot effectively exploit the spatiotemporal correlations present in point cloud sequences. We propose a unified geometry and attribute enhancement framework (DUGAE) for G-PCC compressed dynamic point clouds that explicitly exploits inter-frame spatiotemporal correlations in both geometry and attributes. First, a dynamic geometry enhancement network (DGE-Net) based on sparse convolution (SPConv) and feature-domain geometry motion compensation (GMC) aligns and aggregates spatiotemporal information. Then, a detail-aware k-nearest neighbors (DA-KNN) recoloring module maps the original attributes onto the enhanced geometry at the encoder side, improving mapping completeness and preserving attribute details. Finally, a dynamic attribute enhancement network (DAE-Net) with dedicated temporal feature extraction and feature-domain attribute motion compensation (AMC) refines attributes by modeling complex spatiotemporal correlations. On seven dynamic point clouds from the 8iVFB v2, Owlii, and MVUB datasets, DUGAE significantly enhanced the performance of the latest G-PCC geometry-based solid content test model (GeS-TM v10). For geometry (D1), it achieved an average BD-PSNR gain of 11.03 dB and a 93.95% BD-bitrate reduction. For the luma component, it achieved a 4.23 dB BD-PSNR gain with a 66.61% BD-bitrate reduction. DUGAE also improved perceptual quality (as measured by PCQM) and outperformed V-PCC. Our source code will be released on GitHub at: https://github.com/yuanhui0325/DUGAE
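
The recoloring step maps original attributes onto the enhanced geometry through nearest-neighbor lookups. The sketch below shows plain inverse-distance k-nearest-neighbor recoloring as a baseline illustration. It is not the paper's detail-aware DA-KNN variant, whose specifics are not in the abstract, and the brute-force distance matrix is only practical for small clouds.

```python
import numpy as np


def knn_recolor(src_xyz, src_attr, dst_xyz, k=3, eps=1e-8):
    """Assign each destination point the inverse-distance-weighted mean of
    the attributes of its k nearest source points (plain kNN, assumed form).

    src_xyz: (n_src, 3), src_attr: (n_src, c), dst_xyz: (n_dst, 3)
    """
    diff = dst_xyz[:, None, :] - src_xyz[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)                  # (n_dst, n_src)
    k = min(k, src_xyz.shape[0])
    idx = np.argsort(dist2, axis=1)[:, :k]              # k nearest sources
    nearest = np.sqrt(np.take_along_axis(dist2, idx, axis=1))
    weights = 1.0 / (nearest + eps)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkc->nc", weights, src_attr[idx])


if __name__ == "__main__":
    src = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
    colors = np.array([[0.0], [10.0], [100.0]])
    # A point midway between the first two sources blends their attributes.
    print(knn_recolor(src, colors, np.array([[1.0, 0.0, 0.0]]), k=2))
```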

Keywords: dynamic point clouds, quality enhancement, spatiotemporal correlations, geometry enhancement, attribute enhancement, G-PCC compression, motion compensation, deep learning

158. ❌ Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning

Authors: Bozhao Li, Shaocong Wu, Tong Shao, Senqiao Yang, Qiben Shan, Zhuotao Tian, Jingyong Su Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26179v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

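Rationale: The paper focuses on open-vocabulary object detection in computer vision and proposes Contextual Consistency Learning (CCL), a new framework that uses Contextual Bootstrapped Data Generation (CBDG) and a Contextual Consistency Loss (CCLoss) to improve robustness across different backgrounds. Although it uses deep learning, all keywords concern large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization) or scientific AI applications (such as bioinformatics). The paper studies a purely visual task and involves no language models, large-model principles, or scientific applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper targets the performance drop that open-vocabulary object detectors suffer when recognizing the same object against different backgrounds. It proposes a contextual consistency learning framework that improves results by +16.3 AP on OmniLabel and +14.9 AP on D3, substantially strengthening robustness and generalization.

Abstract (Chinese translation)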

开放词汇目标检测领域的最新进展主要聚焦于两个方面：扩大数据集规模以及利用对比学习对齐语言与视觉模态。然而，这些方法往往忽视了单一模态内的内部一致性，尤其是在背景或环境发生变化时。这种一致性的缺失会导致性能下降，因为模型难以在不同场景中检测同一物体，这揭示了其鲁棒性存在的不足。为解决这一问题，我们提出了上下文一致性学习（CCL）这一新颖框架，该框架整合了两项关键策略：上下文自举数据生成（CBDG）与上下文一致性损失（CCLoss）。CBDG作为一种数据生成机制，能够生成包含相同物体但背景多样的图像。这一点至关重要，因为仅凭现有数据集无法支持我们的CCL框架。而CCLoss则进一步强化了物体特征在不同环境下的不变性，从而提升了模型在多样化场景中的鲁棒性。这些策略共同构成了一个确保同一模态内上下文一致性的统一框架。我们的方法取得了最先进的性能，在OmniLabel数据集上超越了先前方法+16.3 AP，在D3数据集上超越了+14.9 AP。这些结果证明了加强模态内一致性的重要性，它能显著提升模型在多样化环境中的泛化能力。我们的代码已公开于：https://github.com/bozhao-li/CCL。

Abstract

Recent advances in open-vocabulary object detection focus primarily on two aspects: scaling up datasets and leveraging contrastive learning to align language and vision modalities. However, these approaches often neglect internal consistency within a single modality, particularly when background or environmental changes occur. This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG functions as a data generation mechanism, producing images that contain the same objects across diverse backgrounds. This is essential because existing datasets alone do not support our CCL framework. The CCLoss further enforces the invariance of object features despite environmental changes, thereby improving the model’s robustness in different scenes. These strategies collectively form a unified framework for ensuring contextual consistency within the same modality. Our method achieves state-of-the-art performance, surpassing previous approaches by +16.3 AP on OmniLabel and +14.9 AP on D3. These results demonstrate the importance of enforcing intra-modal consistency, significantly enhancing model generalization in diverse environments. Our code is publicly available at: https://github.com/bozhao-li/CCL.
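
The consistency loss is described as enforcing invariance of object features when the background changes. One common way to express such a constraint is a cosine-distance penalty between features of the same object taken from two different contexts. The sketch below uses that form purely as an assumption; the abstract does not give the actual CCLoss formula.

```python
import numpy as np


def contextual_consistency_loss(feat_a, feat_b, eps=1e-12):
    """Mean cosine distance between paired object features.

    feat_a, feat_b: (n_objects, dim) features of the same objects extracted
    from images with different backgrounds (assumed pairing).
    """
    a = feat_a / (np.linalg.norm(feat_a, axis=-1, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=-1, keepdims=True) + eps)
    cosine = np.sum(a * b, axis=-1)
    return float(np.mean(1.0 - cosine))
```

The loss is zero when paired features point in the same direction and grows as the background shifts the representation of the object.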

Keywords: Open-vocabulary object detection, Contextual Consistency Learning, Robustness, Contrastive learning, Data generation, Feature invariance, Generalization, Computer vision

159. ❌ ComVi: Context-Aware Optimized Comment Display in Video Playback

Authors: Minsun Kim, Dawon Lee, Junyong Noh Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26173v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

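Rationale: The paper studies ComVi, a system that optimizes comment display during video playback by mapping comments to relevant timestamps via audio-visual correlation and then optimizing their ordering; it belongs to human-computer interaction and multimedia systems. It does not involve large models, deep learning principles, or AI for Science applications. None of the keywords (large-model techniques, training methods, inference optimization, alignment, applications, and so on) applies, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses comments that are out of sync with video content during playback, which causes spoilers and breaks immersion. It proposes ComVi, which uses audio-visual correlation analysis and an optimization algorithm to display time-synchronized comments; in a user study, 71.9% of participants preferred it.

Abstract (Chinese translation)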

在YouTube等通用视频分享平台上，评论的展示与视频播放相互独立。由于观众常在观看视频时阅读评论，他们可能会遇到与当前画面无关的评论，这些评论可能泄露关键情节并破坏观看沉浸感。为解决这一问题，我们提出了ComVi——一种在上下文相关时刻展示评论的新颖系统，使观众能够同时观看时间同步的评论与视频内容。我们首先通过计算视听相关性，将所有评论映射至相关的视频时间戳，随后通过一项优化构建评论序列，该优化综合考虑了时间相关性、评论热度（点赞数）以及确保舒适阅读的展示时长。在一项用户研究中，ComVi相比传统视频界面（即YouTube与弹幕系统）提供了显著更具吸引力的体验，71.9%的参与者选择ComVi作为其最偏好的界面。

Abstract

On general video-sharing platforms like YouTube, comments are displayed independently of video playback. As viewers often read comments while watching a video, they may encounter ones referring to moments unrelated to the current scene, which can reveal spoilers and disrupt immersion. To address this problem, we present ComVi, a novel system that displays comments at contextually relevant moments, enabling viewers to see time-synchronized comments and video content together. We first map all comments to relevant video timestamps by computing audio-visual correlation, then construct the comment sequence through an optimization that considers temporal relevance, popularity (number of likes), and display duration for comfortable reading. In a user study, ComVi provided a significantly more engaging experience than conventional video interfaces (i.e., YouTube and Danmaku), with 71.9% of participants selecting ComVi as their most preferred interface.
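
The comment sequence is built by an optimization over temporal relevance, popularity, and display duration. The abstract does not specify the optimizer, so the sketch below substitutes a simple greedy scheduler: rank comments by a combined score, then keep each one only if its display window does not overlap a comment already placed. The scoring formula, reading speed, and minimum duration are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    timestamp: float   # matched video time in seconds
    relevance: float   # audio-visual correlation score in [0, 1]
    likes: int


def display_duration(text, words_per_second=3.0, minimum=2.0):
    """Seconds needed to read the comment (assumed reading speed)."""
    return max(minimum, len(text.split()) / words_per_second)


def schedule(comments):
    """Greedy stand-in for the paper's optimization: best score first,
    skipping comments whose display window overlaps a placed one."""
    ranked = sorted(comments,
                    key=lambda c: c.relevance * math.log(2 + c.likes),
                    reverse=True)
    placed = []  # (start, end, comment)
    for c in ranked:
        start = c.timestamp
        end = start + display_duration(c.text)
        if all(end <= s or start >= e for s, e, _ in placed):
            placed.append((start, end, c))
    return sorted(placed, key=lambda item: item[0])
```

A greedy pass is not guaranteed to find the best sequence; it only illustrates how the three factors named in the abstract can interact.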

Keywords: video playback, comment display, context-aware, time-synchronized, audio-visual correlation, optimization, user experience, YouTube interface

160. ❌ CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

Authors: Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, Hongxun Yao Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26174v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

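Scoring rationale: The paper studies automated evaluation of instruction-based multimodal image editing and involves Multimodal Large Language Models (MLLMs) and instruction tuning, so it is rated relevant to 'Large Language Models OR LLMs OR Foundation Models' (8), 'Instruction Tuning OR Alignment OR Value Alignment' (8), and 'Mechanistic Interpretability OR Explainable AI' (8). Other keywords such as MoE, SLMs, Scaling Laws, RLHF, RAG, inference acceleration, and quantization are not mentioned in the abstract and are unrelated to the paper's core content, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes CREval, an automated QA-based evaluation framework for assessing models on creative image editing under complex instructions, and introduces the CREval-Bench benchmark. It finds that closed-source models outperform open-source ones, but all models still struggle on complex creative editing tasks.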

摘要翻译

基于指令的多模态图像编辑技术近期发展迅速。然而，现有评估方法缺乏系统化且与人类认知对齐的框架，难以有效衡量模型在复杂创意编辑任务上的性能。为填补这一空白，我们提出了CREval——一个完全自动化的基于问答（QA）的评估流程，该方法克服了不透明的多模态大语言模型（Multimodal Large Language Models, MLLMs）评分存在的不完整性和低可解释性问题。同时，我们推出了CREval-Bench，这是一个专门为复杂指令下的创意图像编辑设计的综合性基准测试集。CREval-Bench涵盖三个类别和九个创意维度，包含超过800个编辑样本和1.3万个评估查询。借助该流程与基准，我们系统评估了一系列当前领先的开源与闭源模型。结果表明，尽管闭源模型在复杂创意任务上总体优于开源模型，但所有模型仍难以有效完成此类编辑。此外，用户研究表明CREval的自动化指标与人类判断具有高度一致性。因此，CREval为评估图像编辑模型在复杂创意图像处理任务上的表现提供了可靠基础，并指明了未来研究的关键挑战与机遇。

摘要 (Abstract)

Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Models (MLLMs) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval’s automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.

Keywords: Multimodal Large Language Models, Instruction-based image manipulation, Automated evaluation, Creative image editing, Complex instructions, Benchmark, Interpretability, Human-aligned assessment
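
The abstract describes a QA-based pipeline over nine creative dimensions but does not say how answers are turned into scores. A minimal sketch, assuming each evaluation query yields a yes/no verdict and that a dimension's score is simply its pass rate, could look like this. The function name `aggregate_qa_scores`, the `(dimension, passed)` input format, and the dimension labels are hypothetical, not CREval's actual interface.

```python
from collections import defaultdict


def aggregate_qa_scores(answers):
    """Aggregate per-query yes/no verdicts into pass rates.

    answers: iterable of (dimension, passed) pairs, one per evaluation query.
    Returns (pass rate per dimension, overall pass rate).
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for dimension, ok in answers:
        totals[dimension] += 1
        passed[dimension] += int(ok)

    per_dimension = {d: passed[d] / totals[d] for d in totals}
    overall = sum(passed.values()) / sum(totals.values())
    return per_dimension, overall


# Hypothetical verdicts for one edited image.
verdicts = [("layout", True), ("layout", False), ("style", True)]
per_dimension, overall = aggregate_qa_scores(verdicts)
print(per_dimension)  # -> {'layout': 0.5, 'style': 1.0}
```

Keeping the per-dimension breakdown, rather than a single number, is what makes a QA-style score easier to interpret than an opaque MLLM rating.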

161. ❌ Provably Contractive and High-Quality Denoisers for Convergent Restoration

Authors: Shubhi Shukla, Pravin Nair Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26168v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

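Scoring rationale: The paper focuses on image denoising and restoration in computer vision. It studies the stability of convolutional and attention-based networks and proposes provably contractive denoisers. All scoring keywords concern large language models (LLMs), principles of deep learning techniques, or applications of AI in science, none of which the paper addresses. It does not mention language models, pre-training, fine-tuning, alignment, inference acceleration, or AI for Science, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of stability guarantees in existing image denoising models by proposing provably contractive denoiser networks that maintain high-quality denoising while guaranteeing output stability under input perturbations, and that can serve as effective regularizers in Plug-and-Play algorithms.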

摘要翻译

图像复原旨在从退化测量中恢复清晰图像，在监控、国防和医学成像等多个领域具有重要应用。尽管现有基于卷积和注意力机制的神经网络已实现最先进的复原性能，但其在输入微小变化时缺乏稳定性保证，暴露出鲁棒性与精度之间的权衡问题。本研究开发了可证明具有压缩性（全局利普希茨常数 $< 1$）的去噪网络，显著缩小了这一差距。我们的设计结合了通过展开技术获得的近端层与利普希茨约束的卷积优化层。基于压缩性特性，我们的去噪网络保证强度为 $|δ|\le\varepsilon$ 的输入扰动最多引起 $\varepsilon$ 级别的输出变化，而DnCNN和Restormer等强基线模型在相同扰动下可能产生更大偏差。在图像去噪任务中，所提模型与无约束的最先进去噪器性能相当，报告了可证明1-利普希茨模型中最为接近的性能差距，并证实此类差距确实可通过压缩性去噪器实现。此外，所提出的去噪器可作为图像复原的强正则化器，在即插即用算法中可证明实现收敛。我们的研究结果表明，强制严格的利普希茨约束并不会固有地降低输出质量，这对学界普遍假设提出了挑战，并将该领域推向可验证且稳定的视觉模型发展。代码与预训练模型公开于https://github.com/SHUBHI1553/Contractive-Denoisers。

摘要 (Abstract)

Image restoration, the recovery of clean images from degraded measurements, has applications in various domains like surveillance, defense, and medical imaging. Despite achieving state-of-the-art (SOTA) restoration performance, existing convolutional and attention-based networks lack stability guarantees under minor shifts in input, exposing a robustness-accuracy trade-off. We develop provably contractive (global Lipschitz $< 1$) denoiser networks that considerably reduce this gap. Our design composes proximal layers obtained from unfolding techniques with Lipschitz-controlled convolutional refinements. By contractivity, our denoiser guarantees that input perturbations of strength $|δ|\le\varepsilon$ induce at most $\varepsilon$ change at the output, while strong baselines such as DnCNN and Restormer can exhibit larger deviations under the same perturbations. On image denoising, the proposed model is competitive with unconstrained SOTA denoisers, reporting the tightest gap for a provably 1-Lipschitz model and establishing that such gaps are indeed achievable by contractive denoisers. Moreover, the proposed denoisers act as strong regularizers for image restoration that provably effect convergence in Plug-and-Play algorithms. Our results show that enforcing strict Lipschitz control does not inherently degrade output quality, challenging a common assumption in the literature and moving the field toward verifiable and stable vision models. Code and pretrained models are available at https://github.com/SHUBHI1553/Contractive-Denoisers

Keywords: image restoration, denoiser networks, contractive, Lipschitz control, stability guarantees, Plug-and-Play algorithms, convolutional refinements, robustness accuracy trade-off
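
The contractivity guarantee in the abstract rests on a general fact: a composition of maps is Lipschitz with constant at most the product of the individual constants. The sketch below illustrates that fact on a toy fully connected network, not on the paper's proximal or convolutional layers. Each weight matrix is rescaled so its spectral norm is at most `max_norm`, and ReLU is 1-Lipschitz, so the whole network has a Lipschitz constant below 1. The helper names `spectral_rescale` and `make_contractive_net`, the layer sizes, and the value 0.95 are assumptions chosen for illustration.

```python
import numpy as np


def spectral_rescale(weight, max_norm):
    """Scale `weight` so its largest singular value is at most `max_norm`."""
    sigma = np.linalg.norm(weight, ord=2)  # spectral norm of a 2-D array
    return weight if sigma <= max_norm else weight * (max_norm / sigma)


def make_contractive_net(layer_sizes, max_norm=0.95, seed=0):
    """Build a toy ReLU network whose Lipschitz constant is at most max_norm ** depth."""
    rng = np.random.default_rng(seed)
    weights = [
        spectral_rescale(rng.standard_normal((n_out, n_in)), max_norm)
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

    def net(x):
        for w in weights[:-1]:
            x = np.maximum(w @ x, 0.0)  # ReLU is 1-Lipschitz
        return weights[-1] @ x

    return net


net = make_contractive_net([8, 16, 8])
rng = np.random.default_rng(1)
x = rng.standard_normal(8)
delta = 0.1 * rng.standard_normal(8)

# The output change never exceeds the size of the input perturbation.
print(np.linalg.norm(net(x + delta) - net(x)) <= np.linalg.norm(delta))
```

Bounding each layer separately is a simple but conservative route to a global bound; the abstract indicates the paper's design is more refined than this.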

162. ❌ IP-Bench: Benchmark for Image Protection Methods in Image-to-Video Generation Scenarios

Authors: Xiaofeng Li, Leyi Sheng, Zhen Sun, Zongmin Zhang, Jiaheng Wei, Xinlei He Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26154v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

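Scoring rationale: IP-Bench focuses on benchmarking image protection methods in image-to-video generation scenarios; its content covers image protection, video generation, benchmark evaluation, and robustness attacks. All the given keywords relate to specific techniques such as large language models, principles of deep learning techniques, and AI for Science, whereas the core of this paper is an image protection benchmark with no direct connection to them. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

To address the lack of a unified evaluation framework in image-to-video generation scenarios, the paper proposes IP-Bench, the first systematic image protection benchmark, for evaluating the robustness and cross-model/cross-modality transferability of protection methods.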

摘要翻译

随着图像到视频（I2V）生成模型的快速发展，其被滥用于创建恶意内容的可能性已成为一个重要关切。例如，单张图像可能被利用来生成虚假视频，用以吸引关注并谋取利益。这种现象被称为I2V生成滥用。现有的图像保护方法缺乏统一的基准，导致评估框架不完整。此外，这些方法尚未在I2V生成场景中及针对预处理攻击进行系统评估，这使其在实际部署场景中的有效性评估变得复杂。为应对这一挑战，我们提出了IP-Bench（图像保护基准），这是首个旨在系统评估I2V生成场景中保护方法的基准。该基准涵盖了6种代表性保护方法和5种先进的I2V模型。进一步地，我们的工作通过两种实际场景下的鲁棒性攻击策略，系统评估了保护方法的鲁棒性，并分析了其跨模型与跨模态的可迁移性。总体而言，IP-Bench为I2V生成场景中的图像保护方法建立了一个系统化、可复现且可扩展的评估框架。

摘要 (Abstract)

With the rapid advancement of image-to-video (I2V) generation models, their potential for misuse in creating malicious content has become a significant concern. For instance, a single image can be exploited to generate a fake video, which can be used to attract attention and gain benefits. This phenomenon is referred to as I2V generation misuse. Existing image protection methods suffer from the absence of a unified benchmark, leading to an incomplete evaluation framework. Furthermore, these methods have not been systematically assessed in I2V generation scenarios and against preprocessing attacks, which complicates the evaluation of their effectiveness in real-world deployment scenarios. To address this challenge, we propose IP-Bench (Image Protection Bench), the first systematic benchmark designed to evaluate protection methods in I2V generation scenarios. This benchmark examines 6 representative protection methods and 5 state-of-the-art I2V models. Furthermore, our work systematically evaluates protection methods’ robustness with two robustness attack strategies under practical scenarios and analyzes their cross-model & cross-modality transferability. Overall, IP-Bench establishes a systematic, reproducible, and extensible evaluation framework for image protection methods in I2V generation scenarios.

Keywords: Image Protection, Image-to-Video Generation, Benchmark, Robustness Evaluation, Cross-model Transferability, Cross-modality Transferability, Preprocessing Attacks, I2V Models

163. ❌ Gaussian Shannon: High-Precision Diffusion Model Watermarking Based on Communication

Authors: Yi Zhang, Hongbo Huang, Liang-Jie Zhang Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26167v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

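Scoring rationale: The paper focuses on watermarking for diffusion models, studying how to embed watermarks in the initial Gaussian noise to achieve robust tracing and exact bit recovery. Its core is the security and copyright protection of image generation models, not large language models (LLMs) or innovations in the principles of deep learning techniques. All scoring keywords relate directly to LLMs and their training methods, inference optimization, alignment techniques, agent systems, and scientific applications, whereas this paper studies watermarking for diffusion models, a specific class of generative model, and has no direct connection to them. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

To address the copyright and disinformation risks of images generated by diffusion models, the paper proposes Gaussian Shannon, a watermarking framework that embeds watermarks in the initial Gaussian noise and uses a cascaded defense mechanism to achieve robust tracing and exact bit recovery, reaching state-of-the-art bit-level accuracy under a range of perturbations.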

摘要翻译

扩散模型能生成高质量图像，但也带来版权侵犯与虚假信息等严重风险。水印技术是追踪和认证AI生成内容的关键防御手段。然而，现有方法依赖基于阈值的检测，仅支持模糊匹配，无法无损恢复结构化水印数据比特，因此不适用于离线验证或需要无损元数据的应用场景（如许可指令）。为解决此问题，本文提出高斯香农（Gaussian Shannon）水印框架，将扩散过程视为噪声通信信道，同时实现鲁棒追踪与精确比特恢复。该方法将水印嵌入初始高斯噪声中，无需微调模型且不损失生成质量。我们识别出两类信道干扰——局部比特翻转与全局随机失真，并设计了结合纠错码与多数投票的级联防御机制，从而确保语义载荷的可靠端到端传输。在三种Stable Diffusion变体与七类扰动下的实验表明，高斯香农在保持高真阳性率的同时实现了最优的比特级精度，能够为实际部署提供可信的权利归属验证。源代码已公开于：https://github.com/Rambo-Yi/Gaussian-Shannon

摘要 (Abstract)

Diffusion models generate high-quality images but pose serious risks like copyright violation and disinformation. Watermarking is a key defense for tracing and authenticating AI-generated content. However, existing methods rely on threshold-based detection, which only supports fuzzy matching and cannot recover structured watermark data bit-exactly, making them unsuitable for offline verification or applications requiring lossless metadata (e.g., licensing instructions). To address this problem, in this paper, we propose Gaussian Shannon, a watermarking framework that treats the diffusion process as a noisy communication channel and enables both robust tracing and exact bit recovery. Our method embeds watermarks in the initial Gaussian noise without fine-tuning or quality loss. We identify two types of channel interference, namely local bit flips and global stochastic distortions, and design a cascaded defense combining error-correcting codes and majority voting. This ensures reliable end-to-end transmission of semantic payloads. Experiments across three Stable Diffusion variants and seven perturbation types show that Gaussian Shannon achieves state-of-the-art bit-level accuracy while maintaining a high true positive rate, enabling trustworthy rights attribution in real-world deployment. The source code has been made available at: https://github.com/Rambo-Yi/Gaussian-Shannon

Keywords: Diffusion Models, Watermarking, Gaussian Noise, Bit Recovery, Error-Correcting Codes, Stable Diffusion, Copyright Protection, AI-Generated Content
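
The abstract frames watermark recovery as transmission over a noisy channel and names majority voting as one of the defenses against bit flips. The sketch below shows the simplest version of that idea: a repetition code, where every payload bit is sent `repeats` times and decoded by majority. This is a stand-in chosen for clarity; the abstract does not specify which error-correcting code the paper uses, and the function names and the choice of five repeats are assumptions.

```python
import numpy as np


def encode_repetition(bits, repeats):
    """Repeat every payload bit `repeats` times."""
    return np.repeat(np.asarray(bits, dtype=np.uint8), repeats)


def flip_bits(codeword, positions):
    """Simulate channel noise by flipping the bits at the given positions."""
    noisy = codeword.copy()
    noisy[positions] ^= 1
    return noisy


def decode_majority(received, repeats):
    """Recover each payload bit by majority vote over its `repeats` copies."""
    groups = np.asarray(received, dtype=np.uint8).reshape(-1, repeats)
    return (groups.sum(axis=1) * 2 > repeats).astype(np.uint8)


payload = [1, 0, 1, 1, 0, 0, 1, 0]
codeword = encode_repetition(payload, repeats=5)

# Flip seven bits, at most two in any group of five copies.
noisy = flip_bits(codeword, [0, 1, 7, 12, 13, 22, 38])
recovered = decode_majority(noisy, repeats=5)
print(recovered.tolist() == payload)  # -> True
```

With an odd number of repeats `r`, majority voting corrects up to `(r - 1) // 2` flips per payload bit, at the cost of an `r`-fold longer codeword. More efficient codes reach similar protection with less redundancy.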

164. ❌ Efficient Few-Shot Learning for Edge AI via Knowledge Distillation on MobileViT

Authors: Shuhei Tsuyuki, Reda Bensaid, Jérémy Morlier, Mathieu Léonardon, Naoya Onizawa, Vincent Gripon, Takahiro Hanyu Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26145v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

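Scoring rationale: The paper focuses on efficient deep learning models for edge AI devices, specifically few-shot learning via knowledge distillation. Relevance to the keywords: (1) highly relevant to 'Small Language Models OR SLMs OR On-device AI' (8), because the paper studies deploying lightweight models on edge devices; (2) highly relevant to 'Quantization OR Model Compression OR Low-bit Weights' (8), because the paper uses knowledge distillation to reduce parameter count and computational complexity, which is a model compression technique; (3) somewhat relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5), because the paper involves a pre-training method; (4) the other keywords (such as LLMs, MoE, and RAG) are entirely unrelated (0), because the paper studies MobileViT and knowledge distillation in computer vision rather than techniques related to large language models.

!!! tip deepseek-chat TL;DR

The paper proposes a knowledge-distillation-based pre-training method for few-shot learning with a MobileViT backbone on edge AI devices. On the MiniImageNet benchmark it improves accuracy over the ResNet12 baseline by 14% and 6.7% while reducing parameters by 69% and computational complexity by 88%, and on a Jetson Orin Nano platform it shows a 37% reduction in dynamic energy consumption with a latency of 2.6 ms.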

摘要翻译

高效且适应性强的深度学习模型是深度学习研究的重要领域，其驱动力源于边缘设备对高效模型的迫切需求。小样本学习使得深度学习模型能够在低数据条件下运行，这一能力在实际应用中备受青睐，因为在现实场景中收集大规模标注数据集成本高昂或难以实现。这一挑战在边缘计算场景中尤为突出，此类场景通常面临连接受限、需低延迟响应或能耗约束严格等问题。本文提出并评估了一种专为边缘计算设计的MobileViT骨干网络的预训练方法。具体而言，我们采用知识蒸馏技术，将大规模教师模型的泛化能力迁移至轻量级学生模型。在MiniImageNet基准测试中，与ResNet12基线相比，该方法在单样本和五样本分类任务上分别实现了14%和6.7%的准确率提升，同时将参数量减少69%，并将模型的计算复杂度（以FLOPs衡量）降低88%。此外，我们将所提模型部署于Jetson Orin Nano平台，并通过电源端直接测量功耗，结果表明动态能耗降低了37%，延迟仅为2.6毫秒。这些结果证明，该方法为在边缘AI硬件上部署小样本学习模型提供了一种具有前景且实用的解决方案。

摘要 (Abstract)

Efficient and adaptable deep learning models are an important area of deep learning research, driven by the need for highly efficient models on edge devices. Few-shot learning enables the use of deep learning models in low-data regimes, a capability that is highly sought after in real-world applications where collecting large annotated datasets is costly or impractical. This challenge is particularly relevant in edge scenarios, where connectivity may be limited, low-latency responses are required, or energy consumption constraints are critical. We propose and evaluate a pre-training method for the MobileViT backbone designed for edge computing. Specifically, we employ knowledge distillation, which transfers the generalization ability of a large-scale teacher model to a lightweight student model. Compared to the ResNet12 baseline, this method achieves accuracy improvements of 14% and 6.7% for one-shot and five-shot classification, respectively, on the MiniImageNet benchmark, while reducing the number of parameters by 69% and the computational complexity of the model (in FLOPs) by 88%. Furthermore, we deployed the proposed models on a Jetson Orin Nano platform and measured power consumption directly at the power supply, showing that the dynamic energy consumption is reduced by 37% with a latency of 2.6 ms. These results demonstrate that the proposed method is a promising and practical solution for deploying few-shot learning models on edge AI hardware.

Keywords: knowledge distillation, few-shot learning, edge AI, MobileViT, model efficiency, parameter reduction, computational complexity, power consumption
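
The abstract says the method transfers a teacher's generalization ability to a lightweight student through knowledge distillation, without giving the loss. A common formulation, shown here as an assumption rather than as the paper's objective, softens both sets of logits with a temperature and penalizes the KL divergence from teacher to student, scaled by the squared temperature. The function names and the default temperature of 4.0 are illustrative choices.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Mean KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = np.sum(np.exp(log_p_teacher) * (log_p_teacher - log_p_student), axis=-1)
    return float(np.mean(kl) * temperature ** 2)


teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[2.0, 1.5, 0.5], [0.5, 1.0, 0.3]])

print(distillation_loss(teacher, teacher))      # zero when the student matches the teacher
print(distillation_loss(student, teacher) > 0)  # -> True
```

A higher temperature flattens the teacher's distribution so that the relative probabilities of the non-target classes carry more of the training signal.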

165. ❌ PruneFuse: Efficient Data Selection via Weight Pruning and Network Fusion

Authors: Humaira Kousar, Hasnain Irshad Bhatti, Jaekyun Moon Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26138v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

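Scoring rationale: PruneFuse focuses on making data selection more efficient for general deep neural networks, proposing a method that combines network pruning and fusion to reduce computational cost and accelerate training. Although the work involves model compression (pruning) and training optimization, all keywords explicitly target large language models (LLMs) or specific AI subfields (such as AI for Science, reasoning methods, and alignment techniques). The paper does not mention LLMs, MoE, SLMs, Scaling Laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, long context, attention optimization, reasoning methods, agents, tool use, multi-agent systems, quantization, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for Science applications. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes PruneFuse, an efficient data selection method that combines network pruning and fusion to reduce the computational cost of training deep neural networks and accelerate the overall training process.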

摘要翻译

高效的数据选择对于提升深度神经网络的训练效率及降低标注需求至关重要。传统方法常面临高昂计算成本，限制了其可扩展性与实际应用。本文提出PruneFuse——一种创新策略，该策略利用剪枝网络进行数据选择，随后将其与原始网络融合以优化训练过程。PruneFuse包含两个阶段：首先，通过结构化剪枝构建一个规模较小的剪枝网络，该网络因其与原始网络的结构一致性而非常适合数据选择任务。随后训练该小型网络，使其从数据集中筛选出信息量最高的样本。其次，将训练后的剪枝网络与原始网络无缝融合。这一整合过程充分利用了剪枝网络训练阶段获得的知识，以促进融合网络的学习进程，同时为网络探索更鲁棒的解决方案保留空间。在多类数据集上的大量实验表明，PruneFuse能显著降低数据选择的计算成本，取得优于基线模型的性能表现，并加速整体训练流程。

摘要 (Abstract)

Efficient data selection is crucial for enhancing the training efficiency of deep neural networks and minimizing annotation requirements. Traditional methods often face high computational costs, limiting their scalability and practical use. We introduce PruneFuse, a novel strategy that leverages pruned networks for data selection and later fuses them with the original network to optimize training. PruneFuse operates in two stages: First, it applies structured pruning to create a smaller pruned network that, due to its structural coherence with the original network, is well-suited for the data selection task. This small network is then trained and selects the most informative samples from the dataset. Second, the trained pruned network is seamlessly fused with the original network. This integration leverages the insights gained during the training of the pruned network to facilitate the learning process of the fused network while leaving room for the network to discover more robust solutions. Extensive experimentation on various datasets demonstrates that PruneFuse significantly reduces computational costs for data selection, achieves better performance than baselines, and accelerates the overall training process.

Keywords: data selection, weight pruning, network fusion, training efficiency, computational cost reduction, structured pruning, informative samples, deep neural networks
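
The two stages in the abstract, structured pruning followed by fusion with the original network, can be pictured on a single weight matrix. The sketch below is a toy illustration under stated assumptions: pruning keeps the output units (rows) with the largest L2 norm, and fusion copies the trained rows back into their original positions. Both rules, and the names `prune_rows` and `fuse_rows`, are hypothetical; the abstract does not specify the paper's pruning criterion or fusion procedure.

```python
import numpy as np


def prune_rows(weight, keep_ratio):
    """Structured pruning: keep the rows (output units) with the largest L2 norm.

    Returns the smaller matrix and the indices of the rows that were kept.
    """
    n_keep = max(1, int(round(weight.shape[0] * keep_ratio)))
    norms = np.linalg.norm(weight, axis=1)
    kept = np.sort(np.argsort(norms)[-n_keep:])
    return weight[kept].copy(), kept


def fuse_rows(original, trained_pruned, kept):
    """Copy the trained pruned rows back into a copy of the original matrix."""
    fused = original.copy()
    fused[kept] = trained_pruned
    return fused


original = np.array([
    [1.0, 0.0, 0.0],
    [3.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 0.5],
])

pruned, kept = prune_rows(original, keep_ratio=0.5)
print(kept.tolist())  # -> [1, 2]

# Stand-in for training the pruned network: here the rows are simply doubled.
trained = pruned * 2.0
fused = fuse_rows(original, trained, kept)
print(fused[kept].tolist())  # -> [[6.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
```

Because the pruned network shares its layout with the original, the kept indices are enough to map its weights back, which is what makes the fusion step cheap in this toy setting.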

166. ❌ InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution

Authors: Jintong Hu, Bin Chen, Zhenyu Hu, Jiayue Liu, Guo Wang, Lu Qi Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26134v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

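Scoring rationale: InstaVSR focuses on video super-resolution (VSR), proposing a lightweight diffusion framework to improve efficiency and temporal consistency. Although the paper involves diffusion models (a type of generative model), all scoring keywords target large language models (LLMs) and related techniques (such as MoE, SFT, RLHF, RAG, CoT, quantization, and agents). The paper has no direct connection to LLMs, language processing, or applications of large models in science, and does not involve any of the specific techniques in the keywords (such as instruction tuning, in-context learning, or model compression). The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes InstaVSR, a lightweight diffusion framework for efficient and temporally consistent video super-resolution that substantially reduces computational cost while preserving perceptual quality.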

摘要翻译

视频超分辨率（Video Super-resolution, VSR）旨在从低分辨率输入中重建高分辨率帧。尽管基于扩散模型的方法已显著提升了感知质量，但将其扩展至视频领域仍面临两大挑战：一是强生成先验可能引入时序不稳定性，二是多帧扩散流程的计算成本通常过高，难以实际部署。为同时应对这两项挑战，我们提出了InstaVSR——一种用于高效视频超分辨率的轻量化扩散框架。InstaVSR融合了三个关键要素：（1）经过剪枝的一步扩散主干网络，该网络移除了传统基于扩散的VSR流程中多个计算代价高昂的组件；（2）采用光流引导的时序正则化进行循环训练，以提升帧间稳定性；（3）在潜在空间与像素空间进行双重对抗学习，以在主干网络简化后保持感知质量。在NVIDIA RTX 4090上，InstaVSR处理一段30帧、2K$\times$2K分辨率的视频耗时不足一分钟，仅占用7 GB内存。与现有基于扩散的方法相比，InstaVSR在显著降低计算成本的同时，保持了优异的感知质量，并实现了更为平滑的时序过渡。

摘要 (Abstract)

Video super-resolution (VSR) seeks to reconstruct high-resolution frames from low-resolution inputs. While diffusion-based methods have substantially improved perceptual quality, extending them to video remains challenging for two reasons: strong generative priors can introduce temporal instability, and multi-frame diffusion pipelines are often too expensive for practical deployment. To address both challenges simultaneously, we propose InstaVSR, a lightweight diffusion framework for efficient video super-resolution. InstaVSR combines three ingredients: (1) a pruned one-step diffusion backbone that removes several costly components from conventional diffusion-based VSR pipelines, (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame stability, and (3) dual-space adversarial learning in latent and pixel spaces to preserve perceptual quality after backbone simplification. On an NVIDIA RTX 4090, InstaVSR processes a 30-frame video at 2K$\times$2K resolution in under one minute with only 7 GB of memory usage, substantially reducing the computational cost compared to existing diffusion-based methods while maintaining favorable perceptual quality with significantly smoother temporal transitions.

Keywords: Video Super-Resolution, Diffusion Models, Temporal Consistency, Efficient Inference, Lightweight Framework, Adversarial Learning, Flow-guided Regularization

167. ❌ TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life

Authors: Mridul Khurana, Amin Karimi Monsefi, Justin Lee, Medha Sawhney, David Carlyn, Julia Chae, Jianyang Gu, Rajiv Ramnath, Sara Beery, Wei-Lun Chao, Anuj Karpatne, Cheng Zhang Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26128v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

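Scoring rationale: The paper studies the use of Vision Taxonomy Models (VTMs) to guide fine-grained species image generation, an application of AI to the biological sciences. It is unrelated to most keywords (such as MoE, SFT, and RAG), which concern the technical principles or general applications of large models, whereas this paper focuses on an image generation method for a specific domain. The only highly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics' (10), because the paper explicitly applies AI techniques to the scientific problem of generating biological species. 'Large Language Models OR LLMs OR Foundation Models' receives 5 because the paper mentions using a multimodal large language model as an evaluation metric, but this is not its core research content. All other keywords score 0 because the paper does not involve those techniques or concepts.

!!! tip deepseek-chat TL;DR

The paper proposes TaxaAdapter, which improves text-to-image diffusion models by injecting Vision Taxonomy Model embeddings. It addresses insufficient species-identity fidelity when generating fine-grained species images and substantially improves morphological consistency and species-identification accuracy.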

摘要翻译

在生命之树中精确生成图像是一项艰巨的挑战：地球上有超过1000万个不同的物种，其中许多仅通过细微的视觉特征加以区分。尽管文本到图像合成技术取得了显著进展，但现有模型往往无法捕捉决定物种身份的细粒度视觉线索，即使其生成的图像看起来具有照片般的真实感。为此，我们提出了TaxaAdapter，这是一种简单而轻量的方法，它整合了视觉分类学模型（如BioCLIP）来指导细粒度的物种生成。我们的方法将VTM嵌入向量注入到一个冻结的文本到图像扩散模型中，在保持对姿态、风格和背景等属性的灵活文本控制的同时，提高了物种层面的保真度。大量实验表明，与强大的基线模型相比，TaxaAdapter凭借更简洁的架构和训练方案，持续提升了形态保真度和物种身份准确性。为了更好地评估这些改进，我们还引入了一种基于多模态大语言模型的评估指标，该指标能总结生成图像和真实图像在特征层面的描述，从而为形态一致性提供了一种更具可解释性的度量方法。此外，我们观察到TaxaAdapter展现出强大的泛化能力，能够在极具挑战性的场景下实现物种合成，例如仅用少量训练图像的少样本物种，甚至是在训练期间未见过的物种。总体而言，我们的研究结果凸显了视觉分类学模型是实现可扩展、细粒度物种生成的关键要素。

Abstract

Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To this end, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization capabilities, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.

Keywords: TaxaAdapter, Vision Taxonomy Models, fine-grained image generation, species identity, diffusion model, BioCLIP, morphological consistency, few-shot species synthesis

168. ❌ FINDER: Zero-Shot Field-Integrated Network for Distortion-free EPI Reconstruction in Diffusion MRI

Authors: Namgyu Han, Seong Dae Yun, Chaeeun Lim, Sunghyun Seok, Sunju Kim, Yoonhwan Kim, Yohan Jun, Tae Hyung Kim, Berkin Bilgic, Jaejin Cho Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26117v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
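
The arithmetic behind the table above can be sketched as follows. This is a minimal illustration, not the report generator's code: it assumes each row's score is its weight times its relevance, and it takes the passing-score formula (5.0 + 0.8 × total weight) from the report's configuration section. All function names are hypothetical.

```python
def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: a row's score is its weight times its relevance (0-10)."""
    return weight * relevance


def passing_score(total_weight: float) -> float:
    """Passing-score formula quoted in the report's configuration section."""
    return 5.0 + 0.8 * total_weight


def total_score(rows) -> float:
    """Sum the per-keyword scores; rows are (weight, relevance) pairs."""
    return sum(keyword_score(w, r) for w, r in rows)


# The 27 rows as displayed for this paper: 26 keywords at relevance 0.0 and
# one at 5.0, with every weight shown as 0.0.
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
threshold = passing_score(27.0)  # 27 keywords at weight 1.0 in the configuration
paper_total = total_score(rows)
passed = paper_total >= threshold
```

Under this assumed rule, a displayed weight of 0.0 zeroes out every row, so a non-zero relevance still contributes nothing and the paper stays below the 26.6 threshold.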

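Scoring rationale: The paper focuses on medical image reconstruction (diffusion MRI), using deep learning techniques (a physics-guided unrolled network and an implicit neural representation) to correct geometric distortion in EPI images. It is unrelated to the vast majority of keywords, which cover large-model principles, training methods, inference optimization, agents, and similar topics. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the paper applies AI to biomedical imaging. It is not a core match, however (the paper does not directly address bioinformatics or cheminformatics), so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes FINDER, a zero-shot, scan-specific framework that jointly optimizes the underlying image and the B0 field map to correct geometric distortion in EPI sequences for diffusion MRI, achieving better geometric fidelity and image quality than existing methods.

Abstract (Chinese translation)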

回波平面成像（EPI）仍是扩散磁共振成像的基石，但其快速采样策略使序列对$B_{0}$场不均匀性高度敏感，容易导致严重的几何畸变。尽管深度学习已助力改进磁共振成像重建，但将稳健的几何畸变校正整合到自监督框架中仍是一个尚未满足的需求。为此，我们提出了FINDER（用于无畸变EPI重建的场集成网络），这是一种新颖的零样本、扫描特异性框架，将重建问题重新定义为对底层图像与$B_{0}$场图的联合优化。具体而言，我们采用物理引导的展开网络，该网络集成了双域去噪器与虚拟线圈扩展以强制实现稳健的数据一致性。同时，我们结合了一种以空间坐标和潜在图像特征为条件的隐式神经表示（INR），将失谐场建模为一个连续、可微的函数。通过采用交替最小化策略，FINDER协同更新重建网络与场图，有效将磁化率引起的几何畸变与解剖结构分离。实验结果表明，与现有先进基线方法相比，FINDER在几何保真度和图像质量方面均表现更优，为高质量扩散成像提供了一个稳健的解决方案。

Abstract

Echo-planar imaging (EPI) remains the cornerstone of diffusion MRI, but it is prone to severe geometric distortions due to its rapid sampling scheme that renders the sequence highly sensitive to $B_{0}$ field inhomogeneities. While deep learning has helped improve MRI reconstruction, integrating robust geometric distortion correction into a self-supervised framework remains an unmet need. To address this, we present FINDER (Field-Integrated Network for Distortion-free EPI Reconstruction), a novel zero-shot, scan-specific framework that reformulates reconstruction as a joint optimization of the underlying image and the $B_{0}$ field map. Specifically, we employ a physics-guided unrolled network that integrates dual-domain denoisers and virtual coil extensions to enforce robust data consistency. This is coupled with an Implicit Neural Representation (INR) conditioned on spatial coordinates and latent image features to model the off-resonance field as a continuous, differentiable function. Employing an alternating minimization strategy, FINDER synergistically updates the reconstruction network and the field map, effectively disentangling susceptibility-induced geometric distortions from anatomical structures. Experimental results demonstrate that FINDER achieves superior geometric fidelity and image quality compared to state-of-the-art baselines, offering a robust solution for high-quality diffusion imaging.

Keywords: diffusion MRI, EPI reconstruction, geometric distortion correction, zero-shot learning, physics-guided unrolled network, Implicit Neural Representation (INR), B0 field map, self-supervised framework
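
The abstract describes an alternating minimization strategy in which the reconstruction and the field map are updated in turn. FINDER's networks are far beyond a short example, but the alternating pattern itself can be shown on a toy problem. The sketch below is an assumption-laden stand-in: it fits two scalar coefficients by least squares, and the function name and data are hypothetical, not taken from the paper.

```python
def alternating_least_squares(y, u, v, iterations=50):
    """Fit y ≈ a*u + b*v by updating a and b in turn, each with the other fixed."""
    a, b = 0.0, 0.0
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    for _ in range(iterations):
        # Step 1: hold b fixed and solve the 1-D least-squares problem for a.
        a = sum((yi - b * vi) * ui for yi, ui, vi in zip(y, u, v)) / uu
        # Step 2: hold a fixed and solve the 1-D least-squares problem for b.
        b = sum((yi - a * ui) * vi for yi, ui, vi in zip(y, u, v)) / vv
    return a, b


# Toy data generated with a = 2 and b = 3; u and v are not collinear,
# so the alternating updates converge to the joint least-squares solution.
u = [1.0, 0.0, 1.0]
v = [0.0, 1.0, 1.0]
y = [2.0 * ui + 3.0 * vi for ui, vi in zip(u, v)]
a, b = alternating_least_squares(y, u, v)
```

Each step solves an easier sub-problem exactly, which is the appeal of the pattern when the joint problem (here, image plus field map) is hard to optimize in one shot.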

169. ❌ SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection

Authors: Jiaming Liang, Yifeng Zhan, Chunlin Liu, Weihua Zheng, Bingye Peng, Qiwei Liang, Boyang Cai, Xiaochun Mai, Qiang Nie Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26109v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

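Scoring rationale: The paper studies open-vocabulary camouflaged object detection and relies on large-scale vision-language pre-trained models (such as CLIP) for zero-shot generalization, so it has some relevance to 'Large Language Models OR LLMs OR Foundation Models' and 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points each), though neither is central. Other keywords such as MoE, SFT, RAG, and quantization are not addressed and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper targets open-vocabulary camouflaged object detection, where the high similarity between object and background features makes detection difficult, and proposes a specificity-driven dynamic focusing method that achieves an AP of 56.4 on the OVCOD-D benchmark.

Abstract (Chinese translation)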

开放词汇目标检测（Open-vocabulary Object Detection, OVOD）旨在通过利用文本提示来检测开放世界中的已知与未知物体。得益于大规模视觉-语言预训练模型的出现，OVOD已展现出强大的零样本泛化能力。然而，在处理伪装物体时，由于物体与背景的视觉特征高度相似，检测器往往难以区分和定位目标。为弥补这一差距，我们通过为精心挑选的伪装物体图像添加细粒度文本描述，构建了一个名为OVCOD-D的基准数据集。由于现有伪装物体数据集的规模有限，我们采用在大规模目标检测数据集上预训练的检测器作为基线方法，因其具备更强的零样本泛化能力。在多模态大模型生成的特性感知子描述中，仍存在混淆性和过度修饰的修饰成分。为减轻此类干扰，我们设计了一种子描述主成分对比融合策略，以降低噪声文本成分的影响。此外，针对伪装物体视觉特征与周围环境高度相似这一挑战，我们提出了一种特性引导的区域弱对齐与动态聚焦方法，旨在增强检测器从背景中区分伪装物体的能力。在开放集评估设定下，所提方法在OVCOD-D基准上达到了56.4的平均精度（AP）。

Abstract

Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts. Benefiting from the emergence of large-scale vision–language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector’s ability to discriminate camouflaged objects from background. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark.

Keywords: Open-vocabulary object detection, Camouflaged object detection, Vision-language pre-trained models, Zero-shot generalization, Specificity-driven dynamic focusing, OVCOD-D benchmark, Multimodal large models, Textual description fusion

170. ❌ Accurate Precipitation Forecast by Efficiently Learning from Massive Atmospheric Variables and Unbalanced Distribution

Authors: Shuangliang Li, Siwei Li, Li Li, Weijie Zou, Jie Yang, Maolin Zhang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26108v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

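Scoring rationale: The paper focuses on short-term precipitation forecasting and proposes a new forecasting model that makes effective use of large volumes of atmospheric observations and handles sample imbalance. Its core is a deep learning application in meteorology and climate science, not large language model (LLM) technology. Among all keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' relates directly to the paper's topic, since precipitation forecasting falls under AI for Science. The remaining keywords concern specific techniques or concepts such as LLM architecture, training, alignment, reasoning, and agents, and have no direct connection to the conventional deep learning weather prediction model studied here.

!!! tip deepseek-chat TL;DR

The study addresses the accuracy and efficiency bottlenecks in short-term precipitation forecasting caused by inefficient data use and the extreme imbalance of precipitation samples. It proposes a forecasting model that efficiently exploits massive atmospheric observations and introduces a new loss function (WMCE); experiments on two datasets show it substantially outperforms existing baselines in both accuracy and efficiency.

Abstract (Chinese translation)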

短期（0-24小时）降水预报对社会经济活动和公共安全具有极高价值。然而，降水事件演化模式的高度复杂性、降水与非降水样本间的极端不平衡性，以及现有模型无法高效利用海量多源大气观测数据的问题，制约了降水预报精度与计算效率的提升。为应对上述挑战，本研究开发了一种新型预报模型，该模型能通过自动提取并迭代预测与降水演化密切相关的潜在特征，从而高效利用海量大气观测数据。此外，本研究引入了一种“WMCE”损失函数，旨在准确识别极端稀少的降水事件，同时精确预测其强度值。在两个数据集上的大量实验表明，我们提出的模型在精度和效率上均显著且持续优于所有主流基线方法。与现有方法相比，所提出的预报模型大幅降低了获得有价值预测结果所需的计算成本，从而成为高效实用降水预报领域的一个里程碑。

Abstract

Short-term (0-24 hours) precipitation forecasting is highly valuable to socioeconomic activities and public safety. However, the highly complex evolution patterns of precipitation events, the extreme imbalance between precipitation and non-precipitation samples, and the inability of existing models to efficiently and effectively utilize large volumes of multi-source atmospheric observation data hinder improvements in precipitation forecasting accuracy and computational efficiency. To address the above challenges, this study developed a novel forecasting model capable of effectively and efficiently utilizing massive atmospheric observations by automatically extracting and iteratively predicting the latent features strongly associated with precipitation evolution. Furthermore, this study introduces a ‘WMCE’ loss function, designed to accurately discriminate extremely scarce precipitation events while precisely predicting their intensity values. Extensive experiments on two datasets demonstrate that our proposed model substantially and consistently outperforms all prevalent baselines in both accuracy and efficiency. Moreover, the proposed forecasting model substantially lowers the computational cost required to obtain valuable predictions compared to existing approaches, thereby positioning it as a milestone for efficient and practical precipitation forecasting.

Keywords: precipitation forecasting, atmospheric observations, imbalanced distribution, WMCE loss function, computational efficiency, deep learning, short-term forecast, feature extraction

171. ❌ AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation

Authors: Hyeongyu Kim, Geonhui Han, Dosik Hwang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26096v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

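Scoring rationale: The paper addresses test-time adaptation (TTA) in computer vision, proposing to dynamically adjust activation functions to improve robustness under distribution shift. The scoring keywords all concern large language models, innovations in deep learning principles, or AI for Science applications, whereas this paper studies an adaptation method for conventional computer vision models and involves none of these, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes AcTTA, a test-time adaptation framework that adapts activation functions rather than network weights to improve performance under distribution shift, outperforming normalization-based TTA methods on CIFAR10-C, CIFAR100-C, and ImageNet-C.

Abstract (Chinese translation)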

测试时自适应（Test-Time Adaptation, TTA）旨在通过在推理过程中更新模型参数来缓解分布漂移下的性能下降。现有方法主要围绕仿射调制构建自适应框架，侧重于重新校准归一化层。这一视角虽然有效，却忽略了表征动态中另一个有影响力的组件：激活函数。我们重新审视了这一被忽视的领域，并提出了AcTTA，这是一个激活感知框架，它从可学习的角度重新诠释了传统激活函数，并在测试时自适应地更新它们。AcTTA将传统激活函数（如ReLU、GELU）重新参数化为可调节响应阈值和梯度敏感度的形式，使网络能够在域漂移下调整激活行为。这种函数重参数化使得激活行为能够持续调整，而无需修改网络权重或依赖源数据。尽管方法简洁，AcTTA在多种损坏条件下实现了稳健且稳定的自适应。在CIFAR10-C、CIFAR100-C和ImageNet-C数据集上，AcTTA一致超越了基于归一化的TTA方法。我们的研究结果表明，激活自适应是实现域漂移鲁棒性测试时学习的一条紧凑而有效的途径，拓宽了当前以仿射变换为核心的自适应视角。

Abstract

Test-time adaptation (TTA) aims to mitigate performance degradation under distribution shifts by updating model parameters during inference. Existing approaches have primarily framed adaptation around affine modulation, focusing on recalibrating normalization layers. This perspective, while effective, overlooks another influential component in representation dynamics: the activation function. We revisit this overlooked space and propose AcTTA, an activation-aware framework that reinterprets conventional activation functions from a learnable perspective and updates them adaptively at test time. AcTTA reformulates conventional activation functions (e.g., ReLU, GELU) into parameterized forms that shift their response threshold and modulate gradient sensitivity, enabling the network to adjust activation behavior under domain shifts. This functional reparameterization enables continuous adjustment of activation behavior without modifying network weights or requiring source data. Despite its simplicity, AcTTA achieves robust and stable adaptation across diverse corruptions. Across CIFAR10-C, CIFAR100-C, and ImageNet-C, AcTTA consistently surpasses normalization-based TTA methods. Our findings highlight activation adaptation as a compact and effective route toward domain-shift-robust test-time learning, broadening the prevailing affine-centric view of adaptation.

Keywords: Test-time adaptation, Activation function, Domain shift, Dynamic activation, Parameterized activation, Distribution shift, Robust adaptation, Image classification
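
The abstract says AcTTA rewrites activations such as ReLU into parameterized forms that shift the response threshold and modulate gradient sensitivity. The abstract does not give the exact parameterization, so the sketch below is only one plausible form; the parameter names `threshold` and `slope` are assumptions made for illustration.

```python
def relu(x: float) -> float:
    """Standard ReLU."""
    return max(0.0, x)


def parameterized_relu(x: float, threshold: float = 0.0, slope: float = 1.0) -> float:
    """Hypothetical adaptable ReLU.

    `threshold` shifts the point where the unit switches on;
    `slope` scales the output (and hence the gradient) in the active region.
    """
    return slope * max(0.0, x - threshold)


def parameterized_relu_grad(x: float, threshold: float = 0.0, slope: float = 1.0) -> float:
    """Derivative with respect to x: `slope` above the threshold, 0 otherwise."""
    return slope if x > threshold else 0.0
```

With the default parameters the function reduces to plain ReLU, so adaptation can start from the pretrained behaviour and move only these two scalars per activation, leaving the network weights untouched.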

172. ❌ CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection

Authors: Youngjun Song, Hyeongyu Kim, Dosik Hwang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26092v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

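Scoring rationale: The paper studies test-time adaptation (TTA) in computer vision, specifically for object detection in adverse weather. The proposed CD-Buffer framework involves feature-level domain adaptation with channel removal and refinement mechanisms, but the paper is unrelated to the keywords, which concern large language models (LLMs), innovations in deep learning principles, or AI for Science applications. It does not mention language models, model training techniques, reasoning methods, alignment techniques, agent systems, or scientific AI applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes CD-Buffer, a complementary dual-buffer framework that uses a unified discrepancy metric to adaptively balance feature removal and refinement, so that test-time adaptation for adverse-weather object detection generalizes across domain-shift severities. It reports state-of-the-art performance on KITTI, Cityscapes, and ACDC.

Abstract (Chinese translation)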

测试时自适应（Test-Time Adaptation, TTA）能够在不进行离线重训练的情况下实现对域偏移的实时适应。近期的TTA方法主要探索了添加式策略，即引入轻量级模块进行特征优化。最近，一种通过移除域敏感通道的削减式方法作为一种替代方向出现。我们观察到这两种范式表现出互补的有效性模式：削减式方法在严重偏移下通过移除受损特征表现优异，而添加式方法在需要精细调整的中等偏移下更为有效。然而，每种范式仅在有限的偏移严重程度范围内有效，无法泛化到不同的损坏程度。这引出了以下问题：我们能否基于测量到的特征级域偏移，自适应地平衡这两种策略？我们提出了CD-Buffer，一种新颖的互补双缓冲框架，其中削减机制和添加机制在统一差异度量的驱动下，沿相反但协调的方向运作。我们的核心创新在于差异驱动的耦合：该框架通过统一的差异度量将特征移除与优化过程相耦合，根据特征级偏移的严重程度自动平衡两种策略。这建立了自动的通道级平衡机制，能够针对异质的偏移幅度进行差异化处理，而无需人工调参。在KITTI、Cityscapes和ACDC数据集上的大量实验证明了本方法具有最先进的性能，在不同天气条件和严重程度下均能取得优越结果。

Abstract

Test-Time Adaptation (TTA) enables real-time adaptation to domain shifts without off-line retraining. Recent TTA methods have predominantly explored additive approaches that introduce lightweight modules for feature refinement. Recently, a subtractive approach that removes domain-sensitive channels has emerged as an alternative direction. We observe that these paradigms exhibit complementary effectiveness patterns: subtractive methods excel under severe shifts by removing corrupted features, while additive methods are effective under moderate shifts requiring refinement. However, each paradigm operates effectively only within limited shift severity ranges, failing to generalize across diverse corruption levels. This leads to the following question: can we adaptively balance both strategies based on measured feature-level domain shift? We propose CD-Buffer, a novel complementary dual-buffer framework where subtractive and additive mechanisms operate in opposite yet coordinated directions driven by a unified discrepancy metric. Our key innovation lies in the discrepancy-driven coupling: Our framework couples removal and refinement through a unified discrepancy metric, automatically balancing both strategies based on feature-level shift severity. This establishes automatic channel-wise balancing that adapts differentiated treatment to heterogeneous shift magnitudes without manual tuning. Extensive experiments on KITTI, Cityscapes, and ACDC datasets demonstrate state-of-the-art performance, consistently achieving superior results across diverse weather conditions and severity levels.

Keywords: Test-Time Adaptation, Domain Shift, Object Detection, Adverse Weather, Feature Refinement, Channel Removal, Discrepancy Metric, Dual-Buffer Framework
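
The abstract describes a single discrepancy metric that drives removal and refinement in opposite directions, channel by channel. The abstract does not define the metric or the coupling rule, so the following is a rough sketch under stated assumptions: the gate `d / (1 + d)` and the function name are hypothetical choices, used only to show how one number per channel can balance both mechanisms.

```python
def balance_channels(features, refinements, discrepancies):
    """Blend each channel's feature with its refinement using one shared gate.

    features:      per-channel feature values
    refinements:   per-channel corrections from a (hypothetical) additive module
    discrepancies: per-channel non-negative domain-shift measurements
    """
    outputs = []
    for x, r, d in zip(features, refinements, discrepancies):
        gate = d / (1.0 + d)       # maps [0, inf) to [0, 1)
        removal = gate             # severe shift: suppress the channel
        addition = 1.0 - gate      # mild shift: rely on refinement instead
        outputs.append((1.0 - removal) * x + addition * r)
    return outputs


# Channel 0 has no measured shift, channel 1 a moderate one, channel 2 a severe one.
out = balance_channels([2.0, 2.0, 2.0], [0.5, 0.5, 0.5], [0.0, 1.0, 1e9])
```

Because both strengths derive from the same gate, no per-channel hyperparameter has to be tuned by hand, which mirrors the "without manual tuning" claim in the abstract.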

173. ❌ Learnable Instance Attention Filtering for Adaptive Detector Distillation

Authors: Chen Liu, Qizhen Lan, Zhicheng Ding, Xinyu Chu, Qing Tian Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26088v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

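Scoring rationale: The paper focuses on knowledge distillation in computer vision, specifically an instance attention filtering method for object detection. The scoring keywords all relate directly to large language models (LLMs), innovations in deep learning principles, or AI for Science applications, whereas this paper studies efficiency optimization for vision models, a conventional deep learning application with no direct connection to the keywords.

!!! tip deepseek-chat TL;DR

The paper proposes LIAF-KD, a learnable instance attention filtering framework for adaptive detector distillation. In experiments on KITTI and COCO, it yields a 2% gain for a GFL ResNet-50 student without added complexity.

Abstract (Chinese translation)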

随着深度视觉模型为追求更高性能而日益复杂化，部署效率已成为关键问题。知识蒸馏（Knowledge Distillation, KD）通过将大型教师模型的知识迁移至紧凑的学生模型来缓解这一矛盾。尽管许多基于特征的KD方法依赖空间过滤来引导蒸馏过程，但它们通常对所有目标实例进行统一处理，忽略了实例级别的差异性。此外，现有的注意力过滤机制往往是启发式或教师驱动的，而非基于学生模型的学习状态进行优化。为应对这些局限，本文提出可学习实例注意力过滤的自适应检测器蒸馏框架（Learnable Instance Attention Filtering for Adaptive Detector Distillation, LIAF-KD），该新颖框架引入可学习的实例选择器，在蒸馏过程中动态评估并重新加权实例的重要性。值得注意的是，学生模型能够根据其动态学习状态参与此过程。在KITTI和COCO数据集上的实验表明，该方法取得了持续的性能提升：在未增加计算复杂度的前提下，使用GFL ResNet-50作为学生模型时获得了2%的性能增益，其效果优于当前最先进的方法。

Abstract

As deep vision models grow increasingly complex to achieve higher performance, deployment efficiency has become a critical concern. Knowledge distillation (KD) mitigates this issue by transferring knowledge from large teacher models to compact student models. While many feature-based KD methods rely on spatial filtering to guide distillation, they typically treat all object instances uniformly, ignoring instance-level variability. Moreover, existing attention filtering mechanisms are typically heuristic or teacher-driven, rather than learned with the student. To address these limitations, we propose Learnable Instance Attention Filtering for Adaptive Detector Distillation (LIAF-KD), a novel framework that introduces learnable instance selectors to dynamically evaluate and reweight instance importance during distillation. Notably, the student contributes to this process based on its evolving learning state. Experiments on the KITTI and COCO datasets demonstrate consistent improvements, with a 2% gain on a GFL ResNet-50 student without added complexity, outperforming state-of-the-art methods.

Keywords: Knowledge Distillation, Instance Attention Filtering, Adaptive Detector Distillation, Object Detection, Deep Vision Models, Student-Teacher Learning, Model Efficiency, Feature-based Distillation
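
The abstract describes learnable instance selectors that reweight instance importance during distillation. The selector architecture is not given there, so the sketch below only illustrates the reweighting step: it assumes the selector emits one logit per object instance and that the per-instance distillation loss is a mean squared feature difference. Both choices and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def instance_losses(teacher_feats, student_feats):
    """Mean squared teacher-student feature difference, one value per instance."""
    return [
        sum((t - s) ** 2 for t, s in zip(tf, sf)) / len(tf)
        for tf, sf in zip(teacher_feats, student_feats)
    ]


def reweighted_distillation_loss(teacher_feats, student_feats, selector_logits):
    """Weight each instance's loss by the softmax of its selector logit."""
    weights = softmax(selector_logits)
    losses = instance_losses(teacher_feats, student_feats)
    return sum(w * l for w, l in zip(weights, losses))


teacher = [[1.0, 1.0], [0.0, 0.0]]   # two instances, two feature dimensions each
student = [[0.0, 0.0], [0.0, 0.0]]
uniform = reweighted_distillation_loss(teacher, student, [0.0, 0.0])
focused = reweighted_distillation_loss(teacher, student, [10.0, 0.0])
```

Equal logits reproduce uniform treatment of instances; raising one logit shifts the loss toward the instance the student currently matches worst, which is the behaviour a learned selector could exploit.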

174. ❌ MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality

Authors: Kyungwon Kim, Dosik Hwang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26071v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

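Scoring rationale: The paper addresses survival prediction from multimodal medical data (pathology and genomics), proposing a Transformer framework named MUST to handle missing modalities and using conditional latent diffusion models to generate the missing representations. Its core is a medical AI application (precision oncology), which falls under 'AI for Science', so that keyword receives 10 points. The paper does not, however, cover large language models (LLMs), architectural innovations (such as MoE or quantization), training techniques (such as RLHF or PEFT), inference optimization (such as RAG or attention mechanisms), or agent systems, so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes MUST, a Transformer framework that combines explicit decomposition of modality representations with conditional latent diffusion models to handle survival prediction when modalities are missing from multimodal medical data, achieving state-of-the-art performance on five TCGA cancer datasets.

Abstract (Chinese translation)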

基于多模态医学数据的精准生存预测对精准肿瘤学至关重要，但临床部署面临一个持续存在的挑战：由于成本限制、技术局限或回顾性数据可用性问题，模态数据常常不完整。尽管现有方法尝试通过特征对齐或联合分布学习来处理缺失模态，但这些方法本质上缺乏对每个模态独特贡献的明确建模——即区别于其他模态可推导信息的部分。我们提出MUST（模态特异性表征感知Transformer），这是一种新型框架，通过学习低秩共享子空间中的代数约束，将每个模态的表征显式分解为模态特异性成分和跨模态情境化成分。这种分解能够精确识别当某一模态缺失时损失的信息。对于无法从现有模态推断的真正模态特异性信息，我们采用条件潜在扩散模型，基于恢复的共享信息和学习到的结构先验生成高质量表征。在五个TCGA癌症数据集上的大量实验表明，MUST在使用完整数据时达到了最先进的性能，同时在病理学缺失和基因组学缺失条件下均保持稳健的预测能力，且推理延迟处于临床可接受范围。

Abstract

Accurate survival prediction from multimodal medical data is essential for precision oncology, yet clinical deployment faces a persistent challenge: modalities are frequently incomplete due to cost constraints, technical limitations, or retrospective data availability. While recent methods attempt to address missing modalities through feature alignment or joint distribution learning, they fundamentally lack explicit modeling of the unique contributions of each modality as opposed to the information derivable from other modalities. We propose MUST (Modality-Specific representation-aware Transformer), a novel framework that explicitly decomposes each modality’s representation into modality-specific and cross-modal contextualized components through algebraic constraints in a learned low-rank shared subspace. This decomposition enables precise identification of what information is lost when a modality is absent. For the truly modality-specific information that cannot be inferred from available modalities, we employ conditional latent diffusion models to generate high-quality representations conditioned on recovered shared information and learned structural priors. Extensive experiments on five TCGA cancer datasets demonstrate that MUST achieves state-of-the-art performance with complete data while maintaining robust predictions in both missing pathology and missing genomics conditions, with clinically acceptable inference latency.

Keywords: survival prediction, multimodal medical data, missing modality, Transformer, modality-specific representation, latent diffusion models, precision oncology, TCGA datasets
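As a rough illustration of the decomposition idea in the abstract above, the sketch below splits a feature vector into a component inside a shared low-rank subspace and a residual component orthogonal to it. The orthogonal projection, the fixed orthonormal `basis`, and the function names are assumptions made for this example; the abstract only states that MUST uses algebraic constraints in a learned low-rank shared subspace and does not specify them.

```python
def project(x, basis):
    """Project x onto the span of an orthonormal basis (a list of vectors)."""
    shared = [0.0] * len(x)
    for u in basis:
        coeff = sum(ui * xi for ui, xi in zip(u, x))
        shared = [s + coeff * ui for s, ui in zip(shared, u)]
    return shared


def decompose(x, basis):
    """Return (shared, specific) such that x = shared + specific."""
    shared = project(x, basis)
    specific = [xi - si for xi, si in zip(x, shared)]
    return shared, specific


# Hypothetical 3-d feature and a rank-2 shared subspace (assumed, not learned).
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
feature = [2.0, -1.0, 3.0]
shared, specific = decompose(feature, basis)
print(shared, specific)
```

In this toy setting, `specific` is the part that cannot be reconstructed from the shared subspace, which is the kind of information the abstract says is generated by the conditional latent diffusion model when a modality is missing.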

175. ❌ PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery

Authors: Elkhan Ismayilzada, Yufei Zhang, Zijun Cui Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26068v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

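Scoring rationale: The paper addresses hand motion recovery, a computer vision task, and proposes a physics-aware diffusion framework that improves the physical consistency of hand pose sequences. Its core techniques are diffusion models, a MeshCNN-Transformer architecture, and physical dynamics modeling, which are unrelated to most of the keywords (large-model fundamentals, training methods, inference optimization, alignment, agent systems, and so on). The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to scientific computing (specifically computer vision and physical simulation); it is not bioinformatics or cheminformatics, however, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes a physics-aware conditional diffusion framework that recovers physically plausible hand motion from images and uses a Laplace approximation to estimate the physics variance of the motion estimates; in experiments it outperforms the image-based initializations.

Abstract (Chinese translation)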

基于图像的手部重建技术已取得显著进展，能够提供精确的单帧估计结果，但这些方法通常缺乏物理一致性，且无法评估运动满足物理规律的可信度。本文提出一种新颖的物理感知条件扩散框架，该框架能够将含噪声的姿态序列优化为物理合理的手部运动，同时估计运动预测中的物理方差。基于MeshCNN-Transformer主干网络，我们为关节化手部构建了欧拉-拉格朗日动力学模型。与以往强制残差为零的研究不同，我们将计算得到的动态残差视为虚拟观测量，从而更有效地融合物理约束。通过末层拉普拉斯近似，我们的方法能够生成逐关节、逐时间步的方差度量，以评估物理一致性，并提供可解释的方差图谱来指示物理一致性减弱的区域。在两个知名手部数据集上的实验表明，相较于基于图像的强初始化方法及当前主流的基于视频的方法，本方法均取得稳定提升。定性结果证实，我们的方差估计与基于图像的运动预测中的物理合理性保持一致。

Abstract

Significant advancements made in reconstructing hands from images have delivered accurate single-frame estimates, yet they often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in motion estimates. Building on a MeshCNN-Transformer backbone, we formulate Euler-Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimations are aligned with the physical plausibility of the motion in image-based estimates.

Keywords: hand motion recovery, physics-aware diffusion, conditional diffusion framework, MeshCNN-Transformer, Euler-Lagrange dynamics, physics variance estimation, Laplace approximation, physical plausibility
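The abstract contrasts enforcing zero dynamic residuals with treating the residuals as virtual observables. One common way to express the latter, sketched here under assumptions, is a zero-mean Gaussian likelihood on the residual, so that physics enters as a soft penalty scaled by a noise level. The function name, the Gaussian form, and `sigma` are illustrative choices, not details taken from the paper.

```python
import math


def residual_nll(residuals, sigma):
    """Gaussian negative log-likelihood of observing 0 for each residual.

    Assumes each dynamics residual r is a noisy observation of 0 with
    standard deviation sigma (an assumption made for this sketch).
    """
    n = len(residuals)
    squared = sum(r * r for r in residuals)
    return (
        0.5 * squared / sigma ** 2
        + n * math.log(sigma)
        + 0.5 * n * math.log(2 * math.pi)
    )


# Hypothetical per-joint residuals of the equations of motion.
small = residual_nll([0.01, -0.02, 0.0], sigma=0.1)
large = residual_nll([0.5, -0.4, 0.3], sigma=0.1)
print(small < large)
```

A smaller `sigma` penalizes non-zero residuals more strongly, so the noise level controls how strictly the physics is enforced.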

176. ❌ Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline

Authors: Qizhi Xie, Kun Yuan, Yunpeng Qu, Ming Sun, Chao Zhou, Jihong Zhu Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26055v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

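Scoring rationale: The paper addresses Video Fluency Assessment (VFA), a computer vision task, and contributes a new task definition, the FluVid benchmark dataset, and the FluNet baseline model. It covers video quality assessment, temporal analysis, and self-attention, but does not touch large language models, innovations in deep learning fundamentals, or AI for Science. None of the keywords apply, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper introduces Video Fluency Assessment (VFA) as a new task, builds the FluVid dataset, and develops the FluNet baseline model, which achieves state-of-the-art performance on this task.

Abstract (Chinese translation)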

准确评估人类对视频流畅性的主观反馈（如运动一致性与帧连续性），对视频流媒体和游戏等应用至关重要。然而，该问题长期被忽视，因为现有研究主要将其作为视频质量评估任务中的一个子维度进行处理。在本研究中，我们通过初步实验发现，当前视频质量评估预测严重低估了流畅性维度，从而限制了其实际适用性。为此，我们首次将视频流畅性评估确立为专注于时间维度的独立感知任务。为推进该领域研究：1）我们构建了首个面向流畅性的数据集FluVid，包含4,606段自然场景视频，其流畅度分布均衡，并首次制定了专门的评分标准与人工标注流程；2）我们在FluVid上建立了包含23种方法的大规模基准测试，这是目前最全面的流畅性评估基准，为定制化模型设计提供了重要洞见；3）我们提出了名为FluNet的基线模型，该模型采用时序置换自注意力机制来增强输入的流畅性信息表征，并强化长程帧间交互。本研究不仅实现了最先进的性能表现，更重要的是为学界探索视频流畅性评估解决方案提供了系统性的研究路线图。

Abstract

Accurately estimating humans’ subjective feedback on video fluency, e.g., motion consistency and frame continuity, is crucial for various applications like streaming and gaming. Yet, it has long been overlooked, as prior arts have focused on solving it in the video quality assessment (VQA) task, merely as a sub-dimension of overall quality. In this work, we conduct pilot experiments and reveal that current VQA predictions largely underrepresent fluency, thereby limiting their applicability. To this end, we pioneer Video Fluency Assessment (VFA) as a standalone perceptual task focused on the temporal dimension. To advance VFA research, 1) we construct a fluency-oriented dataset, FluVid, comprising 4,606 in-the-wild videos with balanced fluency distribution, featuring the first-ever scoring criteria and human study for VFA. 2) We develop a large-scale benchmark of 23 methods, the most comprehensive one thus far on FluVid, gathering insights for VFA-tailored model designs. 3) We propose a baseline model called FluNet, which deploys temporal permuted self-attention (T-PSA) to enrich input fluency information and enhance long-range inter-frame interactions. Our work not only achieves state-of-the-art performance but, more importantly, offers the community a roadmap to explore solutions for VFA.

Keywords: Video Fluency Assessment, VFA, FluVid dataset, temporal dimension, self-attention, T-PSA, benchmark, perceptual task

177. ❌ Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives

Authors: Daiqiang Li, Zihao Pan, Zeyu Zhang, Ronghao Chen, Huacan Wang, Honggang Chen, Haiyun Jiang Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26041v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

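Scoring rationale: The paper studies token pruning strategies for historical screenshots in GUI visual agents; its core concerns are Multimodal Large Language Models (MLLMs) and GUI agents (which fall under LLM Agents). It is related to "Large Language Models OR LLMs OR Foundation Models" (8 points) because MLLMs extend LLMs, and highly related to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points) because it directly studies GUI visual agents. Other keywords, such as MoE, SFT, RAG, reasoning methods, and compression techniques, are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies token pruning strategies for historical screenshots in GUI visual agents built on multimodal large language models. It finds that background regions carry semantic value, that random pruning preserves spatial structure, and that a recency-based strategy allocating more tokens to recent screenshots reduces computational cost while maintaining performance.

Abstract (Chinese translation)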

近年来，基于多模态大语言模型（MLLMs）构建的图形用户界面视觉代理在导航任务中展现出强大潜力。然而，高分辨率的图形用户界面截图会产生大量视觉标记，直接完整保留历史信息在计算上代价高昂。本文针对图形用户界面场景中的历史截图标记剪枝进行了实证研究，并提炼出三条对设计有效剪枝策略至关重要的实践洞见。首先，我们观察到图形用户界面截图呈现出独特的前景-背景语义构成。为探究这一特性，我们应用一种简单的基于边缘的分割方法将截图划分为前景和背景区域。出乎意料的是，与通常认为背景区域语义价值较低的假设相反，背景区域能有效捕捉界面状态转换，从而为图形用户界面推理提供辅助线索。其次，与精心设计的剪枝策略相比，随机剪枝在保留空间结构方面具有内在优势，能在相同计算预算下实现更优性能。最后，我们发现图形用户界面代理表现出与人类认知类似的近因效应：通过为近期截图分配更多标记预算，同时对远期截图进行深度压缩，我们能在保持性能几乎不变的同时显著降低计算成本。这些发现为高效图形用户界面视觉代理的设计提供了新的见解与实践指导。

Abstract

In recent years, GUI visual agents built upon Multimodal Large Language Models (MLLMs) have demonstrated strong potential in navigation tasks. However, high-resolution GUI screenshots produce a large number of visual tokens, making the direct preservation of complete historical information computationally expensive. In this paper, we conduct an empirical study on token pruning for historical screenshots in GUI scenarios and distill three practical insights that are crucial for designing effective pruning strategies. First, we observe that GUI screenshots exhibit a distinctive foreground-background semantic composition. To probe this property, we apply a simple edge-based separation to partition screenshots into foreground and background regions. Surprisingly, we find that, contrary to the common assumption that background areas have little semantic value, they effectively capture interface-state transitions, thereby providing auxiliary cues for GUI reasoning. Second, compared with carefully designed pruning strategies, random pruning possesses an inherent advantage in preserving spatial structure, enabling better performance under the same computational budget. Finally, we observe that GUI Agents exhibit a recency effect similar to human cognition: by allocating larger token budgets to more recent screenshots and heavily compressing distant ones, we can significantly reduce computational cost while maintaining nearly unchanged performance. These findings offer new insights and practical guidance for the design of efficient GUI visual agents.

Keywords: GUI visual agents, Multimodal Large Language Models, token pruning, historical screenshots, foreground-background semantics, spatial structure preservation, recency effect, computational efficiency
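The second and third findings in the abstract above can be combined in a small sketch: split a fixed token budget across the screenshot history so that recent screenshots get more, then prune each screenshot's tokens uniformly at random while keeping their original order. The geometric `decay` schedule and all function names are assumptions for illustration; the abstract does not give the allocation rule.

```python
import random


def recency_budgets(num_screens, total_budget, decay=0.5):
    """Token budget per screenshot; index 0 is the oldest, the last is the newest."""
    weights = [decay ** (num_screens - 1 - i) for i in range(num_screens)]
    total = sum(weights)
    return [int(total_budget * w / total) for w in weights]


def random_prune(tokens, keep, rng):
    """Keep `keep` tokens chosen uniformly at random, in their original order."""
    keep = min(keep, len(tokens))
    kept_indices = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_indices]


rng = random.Random(0)
# Hypothetical history: 3 screenshots, each encoded as 1000 token ids.
history = [list(range(1000)) for _ in range(3)]
budgets = recency_budgets(len(history), total_budget=700)
pruned = [random_prune(tokens, keep, rng) for tokens, keep in zip(history, budgets)]
print(budgets, [len(p) for p in pruned])
```

Sorting the sampled indices keeps the surviving tokens in their original raster order, which is one simple way to retain the coarse spatial layout that the abstract credits random pruning with preserving.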

178. ❌ Face2Parts: Exploring Coarse-to-Fine Inter-Regional Facial Dependencies for Generalized Deepfake Detection

Authors: Kutub Uddin, Nusrat Tasnim, Byung Tae Oh Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26036v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

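Scoring rationale: The paper addresses deepfake detection in computer vision and proposes a hybrid method based on hierarchical feature representation and a channel-attention mechanism to analyze dependencies among facial regions. Its content is unrelated to all scoring keywords (which concern large models, deep learning fundamentals, or AI applications in science); it does not involve large-model techniques, training methods, inference optimization, alignment, agent systems, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper proposes Face2Parts, a hybrid method that analyzes coarse-to-fine dependencies among facial regions through hierarchical feature representation and a channel-attention mechanism to improve deepfake detection; it reports better performance than existing methods on multiple benchmark datasets.

Abstract (Chinese translation)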

多媒体数据，尤其是图像与视频，已成为监控、视觉交互、生物识别、证据收集和广告等多种应用场景中不可或缺的组成部分。然而，业余或熟练的伪造者可能出于诽谤动机模拟这些数据以制造深度伪造内容。为应对这一挑战，学界已开发出多种取证方法来确保内容的真实性。这些方法的有效性取决于其关注焦点，而篡改手段的多样性也带来了诸多挑战。本文分析了现有取证方法，发现每种方法通过关注特定面部区域（如整体帧、面部整体、嘴唇、眼睛或鼻子）在检测深度伪造痕迹方面均具有独特优势。基于这些观察，我们提出了一种名为 Face2Parts 的新型混合方法，该方法基于分层特征表示（Hierarchical Feature Representation, HFR），利用从粗到细的信息来提升深度伪造检测性能。所提出的方法分别从整体帧、面部整体以及关键面部区域（即嘴唇、眼睛和鼻子）提取特征，以探索从粗到细的关联关系。通过通道注意力机制和深度三元组学习，该方法能够捕获面部区域间的相互依赖关系。我们在基准深度伪造数据集上，针对数据集内、数据集间以及篡改方法间的不同设置评估了所提方法。该方法在 FF++、CDF1、CDF2、DFD、DFDC、DTIM、PDD 和 WLDR 数据集上分别取得了平均 AUC 为 98.42%、79.80%、85.34%、89.41%、84.07%、95.62%、80.76% 和 100% 的结果。实验表明，我们的方法具有良好的泛化能力，取得了优于现有方法的性能表现。

Abstract

Multimedia data, particularly images and videos, is integral to various applications, including surveillance, visual interaction, biometrics, evidence gathering, and advertising. However, amateur or skilled counterfeiters can simulate them to create deepfakes, often for slanderous motives. To address this challenge, several forensic methods have been developed to ensure the authenticity of the content. The effectiveness of these methods depends on their focus, with challenges arising from the diverse nature of manipulations. In this article, we analyze existing forensic methods and observe that each method has unique strengths in detecting deepfake traces by focusing on specific facial regions, such as the frame, face, lips, eyes, or nose. Considering these insights, we propose a novel hybrid approach called Face2Parts based on hierarchical feature representation ($HFR$) that takes advantage of coarse-to-fine information to improve deepfake detection. The proposed method involves extracting features from the frame, face, and key facial regions (i.e., lips, eyes, and nose) separately to explore the coarse-to-fine relationships. This approach enables us to capture inter-dependencies among facial regions using a channel-attention mechanism and deep triplet learning. We evaluated the proposed method on benchmark deepfake datasets in both intra-, inter-dataset, and inter-manipulation settings. The proposed method achieves an average AUC of 98.42% on FF++, 79.80% on CDF1, 85.34% on CDF2, 89.41% on DFD, 84.07% on DFDC, 95.62% on DTIM, 80.76% on PDD, and 100% on WLDR, respectively. The results demonstrate that our approach generalizes effectively and achieves promising performance to outperform the existing methods.

Keywords: deepfake detection, facial regions, hierarchical feature representation, channel-attention mechanism, deep triplet learning, inter-dependencies, generalization, benchmark datasets
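The abstract mentions deep triplet learning. For readers unfamiliar with it, the standard triplet margin loss is sketched below; the embeddings, the Euclidean distance, and the margin value are illustrative assumptions, and the paper's exact formulation may differ.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss with Euclidean distance."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-d embeddings: anchor and positive share a label (e.g. real),
# while the negative carries the other label (e.g. fake).
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
print(easy, hard)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, and grows when the ordering is violated.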

179. ❌ Knowledge is Power: Advancing Few-shot Action Recognition with Multimodal Semantics from MLLMs

Authors: Jiazheng Xing, Chao Xu, Hangjie Yuan, Mengmeng Wang, Jun Dan, Hangwei Qian, Yong Liu Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26033v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

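Scoring rationale: The paper's core topic is applying Multimodal Large Language Models (MLLMs) to few-shot action recognition (FSAR), which is highly related to "Large Language Models OR LLMs OR Foundation Models" (10 points) because MLLMs extend LLMs. The paper mentions "minimal trainable parameters", which has some relevance to parameter-efficient fine-tuning (PEFT) (5 points), but it does not explicitly use specific techniques such as LoRA. Other keywords, such as MoE, SLMs, Scaling Laws, RAG, CoT, and AI for Science, are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes FSAR-LLaVA, which it describes as the first method to use Multimodal Large Language Models (MLLMs) as a knowledge base for directly enhancing few-shot action recognition. By extracting multimodal features, constructing task-oriented prototypes, and designing a multimodal matching metric, it achieves superior performance across multiple tasks with few trainable parameters.

Abstract (Chinese translation)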

多模态大语言模型（MLLMs）推动了少样本动作识别（FSAR）领域的发展。然而，该领域的初步探索主要集中于生成描述文本以构建次优的特征->描述->特征流程，并仅在视觉空间内采用度量学习。本文提出FSAR-LLaVA，这是首个利用MLLMs（如Video-LLaVA）作为多模态知识库直接增强FSAR的端到端方法。首先，在特征层面，我们利用MLLM的多模态解码器提取时空与语义信息均得到强化的表征，随后通过我们设计的多模态特征增强模块将其解耦并增强为独立的视觉与文本特征，从而充分挖掘其语义知识以服务于FSAR。其次，我们借助MLLMs的通用能力构建输入提示，使其灵活适应多样化场景，并利用其对齐输出来驱动我们设计的复合任务导向原型构建，有效弥合了元训练集与元测试集之间的分布差异。最后，为使多模态特征能共同指导度量学习，我们引入了一种无需训练的多模态原型匹配度量方法，该方法能自适应地选择最具决定性的线索，并高效利用MLLMs生成的解耦特征表征。大量实验表明，该方法在仅需极少可训练参数的情况下，于多种任务中均展现出优越性能。

Abstract

Multimodal Large Language Models (MLLMs) have propelled the field of few-shot action recognition (FSAR). However, preliminary explorations in this area primarily focus on generating captions to form a suboptimal feature->caption->feature pipeline and adopt metric learning solely within the visual space. In this paper, we propose FSAR-LLaVA, the first end-to-end method to leverage MLLMs (such as Video-LLaVA) as a multimodal knowledge base for directly enhancing FSAR. First, at the feature level, we leverage the MLLM’s multimodal decoder to extract spatiotemporally and semantically enriched representations, which are then decoupled and enhanced by our Multimodal Feature-Enhanced Module into distinct visual and textual features that fully exploit their semantic knowledge for FSAR. Next, we leverage the versatility of MLLMs to craft input prompts that flexibly adapt to diverse scenarios, and use their aligned outputs to drive our designed Composite Task-Oriented Prototype Construction, effectively bridging the distribution gap between meta-train and meta-test sets. Finally, to enable multimodal features to guide metric learning jointly, we introduce a training-free Multimodal Prototype Matching Metric that adaptively selects the most decisive cues and efficiently leverages the decoupled feature representations produced by MLLMs. Extensive experiments demonstrate superior performance across various tasks with minimal trainable parameters.

Keywords: Multimodal Large Language Models, Few-shot Action Recognition, Video-LLaVA, Multimodal Feature-Enhanced Module, Composite Task-Oriented Prototype Construction, Multimodal Prototype Matching Metric, Parameter-efficient, Knowledge Base
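Prototype-based few-shot classification, which the abstract builds on, can be sketched as follows: average the support features of each class into a prototype, then assign a query to the most similar prototype. This is a generic single-modality sketch with made-up feature vectors; it does not reproduce the paper's Multimodal Prototype Matching Metric or its adaptive cue selection.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def classify(query, support):
    """Return the label whose prototype is most similar to the query."""
    prototypes = {label: mean_vector(feats) for label, feats in support.items()}
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Hypothetical 2-way 2-shot episode with 2-d features.
support = {
    "wave": [[1.0, 0.0], [0.9, 0.1]],
    "jump": [[0.0, 1.0], [0.1, 0.9]],
}
prediction = classify([0.8, 0.2], support)
print(prediction)
```

No parameters are trained in this matching step, which is the sense in which such a metric can be called training-free.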

180. ❌ Learning to Trim: End-to-End Causal Graph Pruning with Dynamic Anatomical Feature Banks for Medical VQA

Authors: Zibo Xu, Qiang Li, Weizhi Nie, Yuting Su Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26028v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

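Scoring rationale: The paper addresses Medical Visual Question Answering (MedVQA) and proposes a Learnable Causal Trimming (LCT) framework and a Dynamic Anatomical Feature Bank (DAFB) to improve model generalization. Its core is a medical AI application (in particular bioinformatics and medical image analysis), unrelated to most large-model keywords (such as LLMs, MoE, SFT, RLHF, and RAG). "AI for Science OR Bioinformatics OR Cheminformatics" receives 10 points because the paper belongs to biomedical AI applications. "Mechanistic Interpretability OR Explainable AI" receives 5 points because the paper involves causal reasoning and interpretability (causal trimming emphasizes causal signals), though this is not its core focus. No other keywords are covered.

!!! tip deepseek-chat TL;DR

To address the limited generalization of medical visual question answering models that rely on dataset-specific correlations, the paper proposes a learnable causal trimming framework. A dynamic anatomical feature bank and a differentiable trimming module suppress spurious correlations and emphasize causal evidence, improving robustness and generalization.

Abstract (Chinese translation)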

医学视觉问答模型常因依赖数据集特定相关性（如重复出现的解剖结构模式或问题类型规律）而非真实诊断证据，导致泛化能力受限。现有因果方法通常以静态调整或事后修正的形式实现。为解决这一问题，我们提出一种可学习因果剪裁框架，将因果剪枝整合至端到端优化过程中。我们引入动态解剖特征库，通过动量机制更新，以捕捉高频解剖与语言模式的全局原型，作为数据集层面规律的近似表征。进一步设计可微分剪裁模块，用于估计实例级表征与全局特征库间的依赖关系。与全局原型高度相关的特征被软性抑制，而实例特异性证据则得到增强。这种可学习机制促使模型自适应地优先关注因果信号而非伪相关性。在VQA-RAD、SLAKE、SLAKE-CP和PathVQA数据集上的实验表明，相较于现有去偏策略，本方法持续提升了模型的鲁棒性与泛化能力。

Abstract

Medical Visual Question Answering (MedVQA) models often exhibit limited generalization due to reliance on dataset-specific correlations, such as recurring anatomical patterns or question-type regularities, rather than genuine diagnostic evidence. Existing causal approaches are typically implemented as static adjustments or post-hoc corrections. To address this issue, we propose a Learnable Causal Trimming (LCT) framework that integrates causal pruning into end-to-end optimization. We introduce a Dynamic Anatomical Feature Bank (DAFB), updated via a momentum mechanism, to capture global prototypes of frequent anatomical and linguistic patterns, serving as an approximation of dataset-level regularities. We further design a differentiable trimming module that estimates the dependency between instance-level representations and the global feature bank. Features highly correlated with global prototypes are softly suppressed, while instance-specific evidence is emphasized. This learnable mechanism encourages the model to prioritize causal signals over spurious correlations adaptively. Experiments on VQA-RAD, SLAKE, SLAKE-CP and PathVQA demonstrate that LCT consistently improves robustness and generalization over existing debiasing strategies.

Keywords: Medical Visual Question Answering, Causal Graph Pruning, Dynamic Anatomical Feature Bank, Generalization, Debiasing, End-to-End Optimization, Robustness, MedVQA
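The two mechanisms described in the abstract, a momentum-updated feature bank and soft suppression of features that correlate with it, can be sketched in a few lines. The exponential moving average, the cosine-based gate, and every name below are assumptions chosen for illustration; the paper's differentiable trimming module is learned and is not specified in the abstract.

```python
import math


def momentum_update(prototype, batch_mean, momentum=0.9):
    """Exponential moving average update of one global prototype."""
    return [momentum * p + (1.0 - momentum) * b for p, b in zip(prototype, batch_mean)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def soft_trim(feature, prototype):
    """Scale a feature down in proportion to its similarity to the prototype."""
    similarity = min(1.0, max(0.0, cosine(feature, prototype)))
    gate = 1.0 - similarity
    return [gate * x for x in feature]


prototype = momentum_update([1.0, 0.0], [0.0, 1.0])  # drifts toward the batch mean
suppressed = soft_trim([3.0, 4.0], [3.0, 4.0])       # matches the prototype
kept = soft_trim([1.0, 0.0], [0.0, 1.0])             # unrelated to the prototype
print(prototype, suppressed, kept)
```

A feature that matches the global prototype is gated to zero, while one orthogonal to it passes through unchanged, mirroring the abstract's description of suppressing dataset-level regularities and emphasizing instance-specific evidence.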

Authors: Danny Abraham, Nikhil Kamalkumar Advani, Arun Das, Nikil Dutt Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26018v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

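Scoring rationale: The paper addresses 3D lane segment detection and topology reasoning for autonomous driving, using a Transformer-based architecture with geometry-aware refinement. Although it is an application of deep learning to a specific domain (autonomous driving), its content has no direct connection to the keyword list (which centers on large language models, training techniques, reasoning methods, agent systems, and so on). It does not cover large language models, MoE, training or alignment techniques, inference acceleration, agent systems, or AI for Science, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes GeoReFormer, a geometry-aware refinement Transformer architecture for 3D lane segment detection and topology reasoning in autonomous driving; it achieves state-of-the-art performance on the OpenLane-V2 benchmark and improves topology consistency.

Abstract (Chinese translation)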

精确的三维车道段检测与拓扑推理对于自动驾驶中结构化在线地图构建至关重要。近期基于Transformer的方法将此任务形式化为基于查询的集合预测，但大多沿用了原本为紧凑目标检测设计的解码器架构。然而，车道段是嵌入有向图中的连续折线，通用的查询初始化与无约束优化并未显式编码这种几何与关系结构。我们提出GeoReFormer（几何感知优化Transformer），这是一种统一的基于查询的架构，将几何与拓扑感知的归纳偏置直接嵌入Transformer解码器中。GeoReFormer引入了数据驱动的几何先验以实现结构化查询初始化，采用有界坐标空间优化以稳定折线形变，并通过逐查询门控拓扑传播选择性整合关系上下文。在OpenLane-V2基准测试中，GeoReFormer以34.5%的平均精度均值（mAP）实现了最先进的性能，同时在强Transformer基线上提升了拓扑一致性，验证了显式几何与关系结构编码的有效性。

Abstract

Accurate 3D lane segment detection and topology reasoning are critical for structured online map construction in autonomous driving. Recent transformer-based approaches formulate this task as query-based set prediction, yet largely inherit decoder designs originally developed for compact object detection. However, lane segments are continuous polylines embedded in directed graphs, and generic query initialization and unconstrained refinement do not explicitly encode this geometric and relational structure. We propose GeoReFormer (Geometry-aware Refinement Transformer), a unified query-based architecture that embeds geometry- and topology-aware inductive biases directly within the transformer decoder. GeoReFormer introduces data-driven geometric priors for structured query initialization, bounded coordinate-space refinement for stable polyline deformation, and per-query gated topology propagation to selectively integrate relational context. On the OpenLane-V2 benchmark, GeoReFormer achieves state-of-the-art performance with 34.5% mAP while improving topology consistency over strong transformer baselines, demonstrating the utility of explicit geometric and relational structure encoding.

Keywords: 3D lane segment detection, topology reasoning, autonomous driving, transformer decoder, geometry-aware refinement, structured query initialization, polyline deformation, OpenLane-V2 benchmark
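Bounded coordinate-space refinement, as named in the abstract, can be illustrated by limiting how far each polyline point may move in one refinement step. The `tanh` squashing, the `max_step` parameter, and the function name are assumptions for this sketch; the abstract does not say how GeoReFormer bounds its updates.

```python
import math


def bounded_refine(points, raw_offsets, max_step=0.5):
    """Move each (x, y, z) point by at most max_step per coordinate."""
    return [
        tuple(c + max_step * math.tanh(o) for c, o in zip(point, offset))
        for point, offset in zip(points, raw_offsets)
    ]


# Hypothetical lane polyline and unbounded decoder outputs.
polyline = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.1, 0.0)]
raw = [(0.0, 0.0, 0.0), (10.0, -10.0, 0.2), (-3.0, 0.5, 0.0)]
refined = bounded_refine(polyline, raw)
print(refined)
```

Even a very large raw offset moves a point by no more than `max_step`, which keeps a single refinement step from distorting the polyline.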

182. ❌ Cone-Beam CT Image Quality Enhancement Using A Latent Diffusion Model Trained with Simulated CBCT Artifacts

Authors: Naruki Murahashi, Mitsuhiro Nakamura, Megumi Nakao Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26014v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

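Scoring rationale: This paper focuses on medical image processing, using a conditional latent diffusion model to enhance cone-beam CT (CBCT) image quality. Its core technique is the application of diffusion models to medical image generation and enhancement, which is an application of AI to science (specifically medical imaging). However, the paper does not involve large language models (LLMs), innovations in deep learning fundamentals (such as MoE, scaling laws, or PEFT), reasoning techniques (such as CoT or MCTS), alignment techniques (such as RLHF or instruction tuning), agent systems, or model optimization techniques (such as quantization or speculative decoding). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI (a diffusion model) to medical imaging, a field adjacent to bioinformatics. This is not its core contribution, so the keyword receives 5 points (some relevance). All other keywords are entirely unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The study proposes a CBCT image quality enhancement method based on a conditional latent diffusion model, trained by self-supervised learning on pseudo-CBCT images that simulate CBCT artifacts. It preserves anatomical structure (structural changes were less than 1/1000th of those of a conventional method trained on real images) and achieves faster processing and better enhancement than the conditional diffusion model framework it extends.

Abstract (Chinese translation)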

与传统CT图像相比，锥形束计算机断层扫描（CBCT）图像因对比度低且伪影含量高，在临床医学中存在应用难题。尽管已有一些研究致力于提升图像质量，但在受器官形变影响的区域，此类质量提升方法可能导致解剖结构改变。本研究提出一种基于条件潜在扩散模型的无过校正CBCT图像质量增强方法，该方法利用伪CBCT图像进行训练。伪CBCT图像通过简易模拟CBCT伪影的方法从CT图像生成，并与原始CT图像保持空间一致性。通过对这些空间配对的图像进行自监督学习，我们能够在提升图像质量的同时保持解剖结构不变。此外，将条件扩散模型框架扩展至潜在空间，提升了图像处理效率。我们的模型在盆腔CT-伪CBCT配对数据上进行训练，并应用于伪CBCT及真实CBCT数据。使用75例数据的实验结果表明：与传统基于真实图像学习的方法相比，本方法引起的结构变化量（以像素数计）不足其千分之一；生成图像与参考图像的CT值分布相关系数达到0.916，接近传统方法水平。研究还证实，即使在受限的训练条件下，与条件扩散模型框架相比，所提框架实现了更快的处理速度与更优的增强性能。

Abstract

Cone-beam computed tomography (CBCT) images are problematic in clinical medicine because of their low contrast and high artifact content compared with conventional CT images. Although there are some studies to improve image quality, in regions subject to organ deformation, the anatomical structure may change after such image quality improvement. In this study, we propose an overcorrection-free CBCT image quality enhancement method based on a conditional latent diffusion model using pseudo-CBCT images. Pseudo-CBCT images are created from CT images using a simple method that simulates CBCT artifacts and are spatially consistent with the CT images. By performing self-supervised learning with these spatially consistent paired images, we can improve image quality while maintaining anatomical structures. Furthermore, extending the framework of the conditional diffusion model to latent space improves the efficiency of image processing. Our model was trained on pelvic CT-pseudo-CBCT paired data and was applied to both pseudo-CBCT and real CBCT data. The experimental results using data of 75 cases show that with our proposed method, the structural changes were less than 1/1000th (in terms of the number of pixels) of those of a conventional method involving learning with real images, and the correlation coefficient between the CT value distributions of the generated and reference images was 0.916, approaching the same level as conventional methods. We also confirmed that the proposed framework achieves faster processing and superior improvement performance compared with the framework of a conditional diffusion model, even under constrained training settings.

Keywords: Cone-beam CT, Image quality enhancement, Latent diffusion model, Conditional diffusion model, Self-supervised learning, Medical imaging, Artifact reduction, Anatomical structure preservation

183. ❌ An LP-based Sampling Policy for Multi-Armed Bandits with Side-Observations and Stochastic Availability

Authors: Ashutosh Soni, Peizhong Ju, Atilla Eryilmaz, Ness B. Shroff Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26647v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

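Scoring rationale: This paper studies the stochastic multi-armed bandit problem, focusing on side-observations in a network structure and on stochastic availability constraints; it belongs to classical reinforcement learning and online learning. It does not involve large language models, deep learning, the technical principles of large models, or applications of AI to science. All keywords target large-model techniques, whereas this paper is a study of traditional statistical learning algorithms, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the stochastic multi-armed bandit problem with side-observations and stochastic availability constraints. It proposes UCB-LP-A, a linear-programming-based policy, derives a theoretical upper bound on its regret, and shows in numerical simulations that it outperforms heuristics that ignore either the side-information or the availability constraints.

Abstract (Chinese translation)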

本研究探讨了具有底层网络结构的随机多臂老虎机问题，该结构使得相关行动之间能够进行侧向观测。我们采用二分图将行动与一组未知变量相连接，使得选择某一行动可揭示其关联的所有未知变量的观测信息。以往研究通常假设所有行动始终可被选择，而本文则研究了更具实践意义的随机可用性场景：在每一轮中可行行动集合（即“激活集”）动态变化。该框架模拟了兼具结构依赖性和波动性的真实世界系统，例如社交网络中用户可提供其同伴偏好的侧向信息，但并非始终在线可供查询。为应对这一挑战，我们提出UCB-LP-A策略，该策略采用线性规划方法在随机可用性条件下优化探索与利用的权衡。与假设持续访问能力的标准网络老虎机算法不同，UCB-LP-A通过计算可实现激活集上的最优采样分布，确保仅利用当前活跃臂收集必要观测。我们推导了该策略遗憾值的理论上界，刻画了网络结构和激活概率的共同影响。最后通过数值模拟证明，UCB-LP-A显著优于那些忽略侧向信息或可用性约束的现有启发式方法。

Abstract

We study the stochastic multi-armed bandit (MAB) problem where an underlying network structure enables side-observations across related actions. We use a bipartite graph to link actions to a set of unknowns, such that selecting an action reveals observations for all the unknowns it is connected to. While previous works rely on the assumption that all actions are permanently accessible, we investigate the more practical setting of stochastic availability, where the set of feasible actions (the “activation set”) varies dynamically in each round. This framework models real-world systems with both structural dependencies and volatility, such as social networks where users provide side-information about their peers’ preferences, yet are not always online to be queried. To address this challenge, we propose UCB-LP-A, a novel policy that leverages a Linear Programming (LP) approach to optimize exploration-exploitation trade-offs under stochastic availability. Unlike standard network bandit algorithms that assume constant access, UCB-LP-A computes an optimal sampling distribution over the realizable activation sets, ensuring that the necessary observations are gathered using only the currently active arms. We derive a theoretical upper bound on the regret of our policy, characterizing the impact of both the network structure and the activation probabilities. Finally, we demonstrate through numerical simulations that UCB-LP-A significantly outperforms existing heuristics that ignore either the side-information or the availability constraints.

Keywords: multi-armed bandits, side-observations, stochastic availability, linear programming, regret analysis, network structure, exploration-exploitation trade-off, UCB-LP-A
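
The abstract above describes UCB-style sampling restricted to the arms that happen to be active in each round. As a rough illustration only, the sketch below runs plain UCB1 over a randomly drawn activation set. It is not the authors' UCB-LP-A policy: it ignores side-observations and solves no linear program. The function name, the Bernoulli rewards, and the independent per-arm availability probabilities are all assumptions made for this example.

```python
import math
import random


def ucb_with_availability(means, avail_probs, rounds, seed=0):
    """Plain UCB1 restricted to the arms available in each round (illustrative).

    means       -- hypothetical Bernoulli reward mean of each arm
    avail_probs -- hypothetical probability that each arm is active in a round
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    reward_sums = [0.0] * k

    for t in range(1, rounds + 1):
        # Draw this round's activation set; nothing can be sampled if it is empty.
        active = [a for a in range(k) if rng.random() < avail_probs[a]]
        if not active:
            continue

        # Try every active arm once before trusting its confidence bound.
        untried = [a for a in active if pulls[a] == 0]
        if untried:
            arm = untried[0]
        else:
            arm = max(
                active,
                key=lambda a: reward_sums[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )

        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward

    return pulls
```

Calling `ucb_with_availability([0.2, 0.5, 0.9], [0.8, 0.8, 0.8], rounds=3000)` should concentrate pulls on the last arm whenever it is active, while the other arms are still sampled in rounds where it is unavailable.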

184. ❌ Automatic Laplace Collapsed Sampling: Scalable Marginalisation of Latent Parameters via Automatic Differentiation

Authors: Toby Lovick, David Yallup, Will Handley Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26644v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

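Scoring rationale: The paper proposes ALCS, a general computational framework for marginalising parameters in Bayesian models that combines automatic differentiation with nested sampling. The work belongs to computational statistics and Bayesian inference and focuses on efficient numerical methods. All scoring keywords relate to large language models, deep learning fundamentals, and their applications, none of which this paper touches. It does not discuss language models of any kind, deep learning architectures, training methods, inference optimization, alignment techniques, agent systems, or AI-for-science applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes ALCS, a general framework that marginalises latent parameters in Bayesian models using automatic differentiation and a Laplace approximation, and combines it with nested sampling to make Bayesian evidence computation scalable in high-dimensional settings.

Abstract (Chinese translation)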

本文提出自动拉普拉斯坍缩采样（Automatic Laplace Collapsed Sampling, ALCS），这是一个利用自动微分对贝叶斯模型中潜参数进行边缘化的通用框架。我们将其与嵌套采样相结合，以稳健高效的方式探索超参数空间。在每次嵌套采样的似然评估中，ALCS通过最大后验（MAP）优化和拉普拉斯近似，将高维潜变量 $z$ 坍缩为标量贡献，二者均使用自动微分计算。这将有效维度从 $d_\theta + d_z$ 降低至仅 $d_\theta$，使得高维场景下的贝叶斯证据计算变得可行，无需手动推导梯度或海森矩阵，且仅需极少的模型特定工程化工作。MAP优化和海森矩阵评估在GPU硬件上跨存活点并行执行，使该方法能够大规模实用。我们还证明，自动微分支持超越拉普拉斯近似的局部近似方法，例如扩展到Student-$t$等参数族，从而改善了重尾潜变量的证据估计。我们在涵盖层次模型、时间序列模型和离散似然模型的一系列基准测试上验证了ALCS，并确定了高斯近似成立的条件范围。这使得事后有效样本量（ESS）诊断能够定位超参数空间中的失效区域，而无需昂贵的联合采样。

Abstract

We present Automatic Laplace Collapsed Sampling (ALCS), a general framework for marginalising latent parameters in Bayesian models using automatic differentiation, which we combine with nested sampling to explore the hyperparameter space in a robust and efficient manner. At each nested sampling likelihood evaluation, ALCS collapses the high-dimensional latent variables $z$ to a scalar contribution via maximum a posteriori (MAP) optimisation and a Laplace approximation, both computed using autodiff. This reduces the effective dimension from $d_\theta + d_z$ to just $d_\theta$, making Bayesian evidence computation tractable for high-dimensional settings without hand-derived gradients or Hessians, and with minimal model-specific engineering. The MAP optimisation and Hessian evaluation are parallelised across live points on GPU hardware, making the method practical at scale. We also show that automatic differentiation enables local approximations beyond Laplace to parametric families such as the Student-$t$, which improves evidence estimates for heavy-tailed latents. We validate ALCS on a suite of benchmarks spanning hierarchical, time-series, and discrete-likelihood models and establish where the Gaussian approximation holds. This enables a post-hoc ESS diagnostic that localises failures across hyperparameter space without expensive joint sampling.

Keywords: Automatic Laplace Collapsed Sampling, Bayesian models, latent parameter marginalization, automatic differentiation, nested sampling, Laplace approximation, high-dimensional inference, evidence computation
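
For readers unfamiliar with the "collapse to a scalar" step in the abstract above, the textbook Laplace approximation to a latent-variable marginal can be written as follows. This is the standard formula, not an expression taken from the paper; the data symbol $y$, the MAP point $\hat{z}(\theta)$, and the Hessian $H(\theta)$ are notation assumed here for illustration.

$$
\hat{z}(\theta) = \arg\max_{z} \log p(y, z \mid \theta), \qquad
H(\theta) = -\nabla_z^2 \log p(y, z \mid \theta)\,\Big|_{z = \hat{z}(\theta)}
$$

$$
\log p(y \mid \theta) \approx \log p\big(y, \hat{z}(\theta) \mid \theta\big) + \frac{d_z}{2}\log(2\pi) - \frac{1}{2}\log\det H(\theta)
$$

The right-hand side depends only on $\theta$, which is why an outer sampler would need to explore $d_\theta$ dimensions rather than $d_\theta + d_z$.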

185. ❌ Context-specific Credibility-aware Multimodal Fusion with Conditional Probabilistic Circuits

Authors: Pranuthi Tenali, Sahil Sidheekh, Saurabh Mathur, Erik Blasch, Kristian Kersting, Sriraam Natarajan Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26629v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

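Scoring rationale: The paper studies dynamic reliability assessment in multimodal fusion, modelling instance-level source reliability with a Conditional Probabilistic Circuit (CPC) and proposing the CSIC measure. It is unrelated to most of the large-model and deep-learning keywords and has only some relevance to "Mechanistic Interpretability OR Explainable AI" (5 points), because it emphasizes the interpretability advantages of probabilistic-circuit-based fusion; the interpretability of large models is not its core subject.

!!! tip deepseek-chat TL;DR

To handle source reliability that varies with context in multimodal fusion, the paper proposes C²MF, a context-specific credibility-aware fusion framework built on Conditional Probabilistic Circuits. Its CSIC measure enables adaptive reliability assessment, and on the Conflict benchmark it improves predictive accuracy by up to 29% over static-reliability baselines in high-noise settings.

Abstract (Chinese translation)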

多模态融合需要整合来自多个信源的信息，这些信息可能因情境不同而产生冲突。现有的融合方法通常依赖于对信源可靠性的静态假设，当某一模态因传感器退化或特定类别损坏等情境因素变得不可靠时，这些方法解决冲突的能力受到限制。我们提出C$^2$MF——一种情境感知的可信度感知多模态融合框架，该框架使用条件概率电路（Conditional Probabilistic Circuit，CPC）对每个实例的信源可靠性进行建模。我们通过情境特定信息可信度（Context-Specific Information Credibility，CSIC）形式化地定义了实例级可靠性，该度量基于KL散度，可直接从CPC精确计算得出。CSIC将传统的静态可信度估计推广为一种特例，实现了原则性且自适应的可靠性评估。为评估跨模态冲突下的鲁棒性，我们提出了冲突基准测试（Conflict benchmark），其中特定类别的损坏被设计用于在不同模态间人为制造差异。实验结果表明，在高噪声场景下，C$^2$MF相较于静态可靠性基线方法的预测准确率提升最高达29%，同时保持了基于概率电路的融合方法在可解释性方面的优势。

Abstract

Multimodal fusion requires integrating information from multiple sources that may conflict depending on context. Existing fusion approaches typically rely on static assumptions about source reliability, limiting their ability to resolve conflicts when a modality becomes unreliable due to situational factors such as sensor degradation or class-specific corruption. We introduce C$^2$MF, a context-specific credibility-aware multimodal fusion framework that models per-instance source reliability using a Conditional Probabilistic Circuit (CPC). We formalize instance-level reliability through Context-Specific Information Credibility (CSIC), a KL-divergence-based measure computed exactly from the CPC. CSIC generalizes conventional static credibility estimates as a special case, enabling principled and adaptive reliability assessment. To evaluate robustness under cross-modal conflicts, we propose the Conflict benchmark, in which class-specific corruptions deliberately induce discrepancies between different modalities. Experimental results show that C$^2$MF improves predictive accuracy by up to 29% over static-reliability baselines in high-noise settings, while preserving the interpretability advantages of probabilistic circuit-based fusion.

Keywords: multimodal fusion, context-specific credibility, conditional probabilistic circuits, instance-level reliability, conflict resolution, interpretability, robustness, KL-divergence
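
The abstract above describes CSIC only as "a KL-divergence-based measure", without giving its definition. The sketch below therefore shows just the underlying building block, the KL divergence between two discrete distributions. The idea of comparing a prediction made with all modalities against one made without a given modality is an assumption used here for illustration, not the paper's formula.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same classes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                # p puts mass where q has none: the divergence is infinite.
                return math.inf
            total += pi * math.log(pi / qi)
    return total


# Hypothetical class posteriors: with every modality, and with one modality dropped.
with_all_modalities = [0.5, 0.5]
without_modality = [0.9, 0.1]
shift = kl_divergence(with_all_modalities, without_modality)
```

Under this illustrative reading, a larger `shift` would mean that the dropped modality changes the fused prediction more for that particular instance.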

186. ❌ Benchmarking Tabular Foundation Models for Conditional Density Estimation in Regression

Authors: Rafael Izbicki, Pedro L. C. Rodrigues Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26611v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

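Scoring rationale: The paper benchmarks tabular foundation models (TabPFN, TabICL) on conditional density estimation (CDE), which is applied research on foundation models in science and data science. The most relevant keyword is "Large Language Models OR LLMs OR Foundation Models" (10 points), because the paper explicitly studies tabular foundation models, a kind of foundation model. "AI for Science OR Bioinformatics OR Cheminformatics" (5 points) has some relevance, because the paper includes scientific applications such as astronomy (SDSS DR18), but this is not its main focus. The other keywords (such as MoE, SFT, and RAG) are neither mentioned in nor related to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper systematically evaluates tabular foundation models (such as TabPFN and TabICL) as general-purpose conditional density estimators. Across training sizes from 50 to 20,000, they achieve the best CDE loss, log-likelihood, and CRPS on the large majority of the 39 datasets, although calibration can lag task-specific neural baselines at larger sample sizes. In a photometric redshift case study on SDSS DR18, TabPFN with 50,000 training galaxies outperforms all baselines trained on the full 500,000-galaxy dataset.

Abstract (Chinese translation)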

条件密度估计（Conditional Density Estimation, CDE）——即在给定表格型协变量的情况下恢复响应变量的完整条件分布——在存在异方差性、多模态或非对称不确定性的场景中至关重要。近年来出现的表格基础模型，如TabPFN和TabICL，天然能够生成预测分布，但作为通用CDE方法的有效性尚未得到系统评估，这与已被深入研究的点预测性能形成对比。我们在39个真实世界数据集上，针对训练规模从50到20,000的不同情况，将三种表格基础模型变体与一系列参数化、基于树结构及神经网络的CDE基线方法进行了基准比较，使用了涵盖密度准确性、校准性和计算时间的六项指标。在所有样本规模下，基础模型在绝大多数测试数据集上取得了最佳的CDE损失、对数似然和连续分级概率评分。在小样本规模下其校准性能具有竞争力，但对于某些指标和数据集，在大样本规模时落后于任务特定的神经网络基线，这表明事后重新校准可能是一种有价值的补充。在使用SDSS DR18数据的光度红移案例研究中，基于50,000个训练星系的TabPFN模型的表现优于所有使用完整500,000个星系数据集训练的基线方法。综合来看，这些结果确立了表格基础模型作为强大的即用型条件密度估计器的地位。

Abstract

Conditional density estimation (CDE) - recovering the full conditional distribution of a response given tabular covariates - is essential in settings with heteroscedasticity, multimodality, or asymmetric uncertainty. Recent tabular foundation models, such as TabPFN and TabICL, naturally produce predictive distributions, but their effectiveness as general-purpose CDE methods has not been systematically evaluated, unlike their performance for point prediction, which is well studied. We benchmark three tabular foundation model variants against a diverse set of parametric, tree-based, and neural CDE baselines on 39 real-world datasets, across training sizes from 50 to 20,000, using six metrics covering density accuracy, calibration, and computation time. Across all sample sizes, foundation models achieve the best CDE loss, log-likelihood, and CRPS on the large majority of datasets tested. Calibration is competitive at small sample sizes but, for some metrics and datasets, lags behind task-specific neural baselines at larger sample sizes, suggesting that post-hoc recalibration may be a valuable complement. In a photometric redshift case study using SDSS DR18, TabPFN exposed to 50,000 training galaxies outperforms all baselines trained on the full 500,000-galaxy dataset. Taken together, these results establish tabular foundation models as strong off-the-shelf conditional density estimators.

Keywords: tabular foundation models, conditional density estimation, benchmarking, TabPFN, TabICL, predictive distributions, SDSS DR18, photometric redshift
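
CRPS, one of the metrics named in the abstract above, scores a whole predictive distribution against a single observed value. A minimal sketch of the standard sample-based estimator, CRPS ≈ E|X − y| − ½·E|X − X′|, is given below. The function name and the plain-list input are assumptions for illustration; the paper's own evaluation code is not shown in this report.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one observation y (lower is better).

    samples -- draws from the predictive distribution for this observation
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one predictive sample")
    # Average distance between the predictive draws and the observation.
    accuracy_term = sum(abs(x - y) for x in samples) / n
    # Average distance between pairs of draws (rewards sharp predictions).
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy_term - 0.5 * spread_term
```

With a single draw the estimate reduces to the absolute error, so CRPS can be read as a generalization of MAE to distributional forecasts.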

187. ❌ Hardware-Aware Tensor Networks for Real-Time Quantum-Inspired Anomaly Detection at Particle Colliders

Authors: Sagar Addepalli, Prajita Bhattarai, Abhilasha Dave, Julia Gonski Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26604v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

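Scoring rationale: The paper applies quantum-inspired tensor networks to real-time anomaly detection at particle colliders. It falls under AI for Science and has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), but it does not involve large models, deep learning fundamentals, or any of the other keywords (0 points).

!!! tip deepseek-chat TL;DR

The paper develops a tensor-network-based, quantum-inspired approach to real-time anomaly detection at particle colliders: a spaced matrix product operator (SMPO) that can be implemented on FPGA hardware with resources and latency consistent with trigger deployment, plus a cascaded SMPO variant aimed at resource-constrained edge applications.

Abstract (Chinese translation)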

量子机器学习能够捕捉高维特征空间中的复杂关联性，这对于在粒子对撞机事件中探测超越标准模型（Beyond the Standard Model, BSM）物理的挑战至关重要，同时在未来量子处理器上具备实现前所未有的计算效率的潜力。通过开发可在经典硬件上部署的量子启发算法，能够在当前科学实验的“边缘”实现应用，从而在近期利用这些优势。本研究展示了利用张量网络在对撞机探测器中实现实时异常检测的方法。我们开发了一种间隔矩阵乘积算子（Spaced Matrix Product Operator, SMPO），该算子对多种超越标准模型基准测试具有敏感性，并且可在现场可编程门阵列（Field Programmable Gate Array, FPGA）硬件中实现，其资源消耗与延迟时间符合触发系统部署的要求。此外，本文引入了级联SMPO架构作为SMPO的一种变体，该架构在资源受限环境下的边缘应用中，以关键方式提供了更高的灵活性和效率。这些结果表明，在高能对撞机中部署量子启发式机器学习具有显著优势及近期的可行性。

Abstract

Quantum machine learning offers the ability to capture complex correlations in high-dimensional feature spaces, crucial for the challenge of detecting beyond the Standard Model physics in collider events, along with the potential for unprecedented computational efficiency in future quantum processors. Near-term utilization of these benefits can be achieved by developing quantum-inspired algorithms for deployment in classical hardware to enable applications at the “edge” of current scientific experiments. This work demonstrates the use of tensor networks for real-time anomaly detection in collider detectors. A spaced matrix product operator (SMPO) is developed that provides sensitivity to a variety of beyond the Standard Model benchmarks, and can be implemented in field programmable gate array hardware with resources and latency consistent with trigger deployment. The cascaded SMPO architecture is introduced as an SMPO variation that affords greater flexibility and efficiency in ways that are key to edge applications in resource-constrained environments. These results reveal the benefit and near-term feasibility of deploying quantum-inspired ML in high energy colliders.

Keywords: tensor networks, quantum machine learning, anomaly detection, particle colliders, real-time processing, field programmable gate array, beyond Standard Model physics, edge applications

188. ❌ Characterization and forecasting of national-scale solar power ramp events

Authors: Luca Lanzilao, Angela Meyer Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26596v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

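Scoring rationale: The paper characterizes and forecasts solar power ramp events, a topic in renewable energy and grid management. Although it uses deep learning models (such as IrradianceNet) for forecasting, its core is applying existing models to a specific engineering problem rather than innovating on large-model techniques. All keywords except the last concern large models, deep learning fundamentals, training methods, inference optimization, alignment, agent systems, and so on, and are entirely unrelated to the paper. The last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", has some relevance because the paper applies AI and deep learning to a scientific field (energy science), but only as a tool rather than as its core content, so it receives 5 points.

!!! tip deepseek-chat TL;DR

The study characterizes national-scale solar power ramp events and their meteorological drivers, and evaluates forecasts from deep learning and physics-based models. SHADECast is the most reliable model, yet state-of-the-art nowcasting models still struggle with ramp dynamics (forecast RMSE rises by up to 50% relative to normal operating conditions), which points to a need for better high-resolution spatiotemporal modelling to support large-scale solar integration.

Abstract (Chinese translation)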

太阳能发电的快速增长正在重塑电力系统运行模式，并增加了电网管理的复杂性。随着光伏（Photovoltaic, PV）装机容量的扩大，光伏发电的短期波动带来了显著的运行不确定性。与此同时，太阳能功率爬坡事件（solar ramp events）因剧烈的功率突变而加剧了电网失稳和非计划停电的风险。因此，准确识别、预测和缓解太阳能爬坡事件对于维持电网稳定至关重要。本研究以15分钟分辨率分析了6434座光伏电站两年的发电数据。我们建立了量化指标来定义太阳能爬坡事件，并在全国范围内系统性地描述了其发生情况、频率和强度特征。此外，我们探究了爬坡事件的气象驱动因素，重点分析了中尺度云系（mesoscale cloud systems）的作用。特别指出的是，我们观察到功率爬升事件通常与早晨云层消散相关，而功率骤降事件则常发生在午后云量增加时。进一步地，我们采用了一种新近开发的时空预测框架，评估了基于深度学习和物理模型的确定性及概率性光伏功率预测方法，包括SolarSTEPS、SHADECast、IrradianceNet和IFS-ENS。结果表明，SHADECast是最可靠的模型，在提前两小时的预测中其连续分级概率评分（CRPS）比SolarSTEPS低10.8%。然而，最先进的临近预报模型仍难以捕捉爬坡动态，其预测均方根误差（RMSE）较正常运行条件最高增加50%。总体而言，这些结果强调需要改进高分辨率时空建模方法，以提升爬坡事件预测能力，从而支持大规模太阳能发电安全可靠地并入电力系统。

Abstract

The rapid growth of solar energy is reshaping power system operations and increasing the complexity of grid management. As photovoltaic (PV) capacity expands, short-term fluctuations in PV generation introduce substantial operational uncertainty. At the same time, solar power ramp events intensify risks of grid instability and unplanned outages due to sudden large power fluctuations. Accurate identification, forecasting and mitigation of solar ramp events are therefore critical to maintaining grid stability. In this study, we analyze two years of PV power production from 6434 PV stations at 15-minute resolution. We develop quantitative metrics to define solar ramp events and systematically characterize their occurrence, frequency, and magnitude at a national scale. Furthermore, we examine the meteorological drivers of ramp events, highlighting the role of mesoscale cloud systems. In particular, we observe that ramp-up events are typically associated with cloud dissipation during the morning, while ramp-down events commonly occur when cloud cover increases in the afternoon. Additionally, we adopt a recently developed spatiotemporal forecasting framework to evaluate both deterministic and probabilistic PV power forecasts derived from deep learning and physics-based models, including SolarSTEPS, SHADECast, IrradianceNet, and IFS-ENS. The results show that SHADECast is the most reliable model, achieving a CRPS 10.8% lower than that of SolarSTEPS at a two-hour lead time. Nonetheless, state-of-the-art nowcasting models struggle to capture ramp dynamics, with forecast RMSE increasing by up to 50% compared to normal operating conditions. Overall, these results emphasize the need for improved high-resolution spatiotemporal modelling to enhance ramp prediction skill and support the reliable integration of large-scale solar generation into power systems.

Keywords: solar power ramp events, photovoltaic generation, grid stability, spatiotemporal forecasting, deep learning models, probabilistic forecasts, mesoscale cloud systems, renewable energy integration
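
The abstract above says the authors "develop quantitative metrics to define solar ramp events" but does not state them. One common way to flag ramps, shown below purely as an assumed illustration, is to compare the capacity-normalized change in power over a fixed window against a threshold. The function name, the one-hour window (4 steps at 15-minute resolution), and the 10% threshold are hypothetical choices, not values from the paper.

```python
def find_ramp_events(power, capacity, window=4, threshold=0.1):
    """Flag windows whose normalized power change reaches the threshold.

    power     -- power samples at a fixed interval (e.g. every 15 minutes)
    capacity  -- installed capacity, used to normalize the change
    window    -- number of steps between the two samples being compared
    threshold -- minimum |change| / capacity for a window to count as a ramp
    Returns (start_index, normalized_change) pairs; a positive change is a
    ramp-up and a negative change is a ramp-down.
    """
    events = []
    for i in range(len(power) - window):
        change = (power[i + window] - power[i]) / capacity
        if abs(change) >= threshold:
            events.append((i, change))
    return events
```

Overlapping windows can flag the same physical ramp several times; a fuller analysis would merge adjacent detections into a single event before counting frequency or magnitude.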

189. ❌ The Climber’s Grip – Personalized Deep Learning Models for Fear and Muscle Activity in Climbing

Authors: Matthias Boeker, Dana Swarbrick, Ulysse T. A. Côté-Allard, Marc T. P. Adam, Hugo L. Hammer, Pål Halvorsen Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26575v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

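Scoring rationale: The paper studies the psychophysiological relationship between perceived fear and muscle activity in climbing, using statistical modeling and deep learning for personalized modeling. It is unrelated to nearly all keywords, which cover large-model fundamentals, training methods, inference optimization, alignment, agent systems and similar topics. Only the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", is partly relevant: the paper applies deep learning to sports science, which can be viewed as a scientific application, but it is not core bioinformatics or cheminformatics and does not involve large models. It therefore receives a relevance of 5 (somewhat related).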

!!! tip deepseek-chat TL;DR

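Combining statistical modeling and deep learning, the study examines how perceived fear relates to muscle activity in lead and top rope climbing. It finds that muscle fatigue correlates significantly with increased fear during lead climbing, and shows that personalized models with random effects improve predictive performance.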

摘要翻译

攀岩是一项融合身体需求与情绪认知挑战的多维度运动。先锋攀登与顶绳攀登的坠落距离存在差异，前者通常涉及更长的冲坠，这可能导致不同的风险感知与恐惧体验。本研究结合统计建模与深度学习技术，探讨了攀岩者感知恐惧与肌肉活动之间的心理生理学关联。我们招募了19名攀岩者开展实验，在先锋攀登和顶绳攀登过程中同步采集肌电图（EMG）、心电图（ECG）及手臂运动数据，并针对攀登各阶段收集感知恐惧评分。通过线性混合效应模型，我们分析了感知恐惧与生理指标之间的关系。为捕捉该关联的非线性动态特征，我们进一步将分析拓展至深度学习模型，并引入随机效应以实现个性化建模。结果表明，随机效应的加入显著提升了模型在均方误差（MSE）、平均绝对误差（MAE）和均方根误差（RMSE）上的表现。研究发现，在先锋攀登过程中，肌肉疲劳与恐惧感的增强呈显著正相关。本研究揭示了统计学方法与深度学习技术相结合，在建模攀岩中心理与生理状态交互作用方面的潜力。

摘要 (Abstract)

Climbing is a multifaceted sport that combines physical demands and emotional and cognitive challenges. Ascent styles differ in fall distance with lead climbing involving larger falls than top rope climbing, which may result in different perceived risk and fear. In this study, we investigated the psychophysiological relationship between perceived fear and muscle activity in climbers using a combination of statistical modeling and deep learning techniques. We conducted an experiment with 19 climbers, collecting electromyography (EMG), electrocardiography (ECG) and arm motion data during lead and top rope climbing. Perceived fear ratings were collected for the different phases of the climb. Using a linear mixed-effects model, we analyzed the relationships between perceived fear and physiological measures. To capture the non-linear dynamics of this relationship, we extended our analysis to deep learning models and integrated random effects for a personalized modeling approach. Our results showed that random effects improved model performance of the mean squared error (MSE), mean absolute error (MAE) and root mean squared error (RMSE). The results showed that muscle fatigue correlates significantly with increased fear during lead climbing. This study highlights the potential of combining statistical and deep learning approaches for modeling the interplay between psychological and physiological states during climbing.

Keywords: climbing, fear, muscle activity, deep learning, personalized modeling, psychophysiological relationship, EMG, lead climbing
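
To illustrate what a personalized random effect adds on top of a population-level predictor, here is a minimal sketch of a random intercept per climber. It assumes the ratio of noise variance to between-subject variance is known and uses made-up fear ratings. The abstract does not say how the authors integrate random effects into their deep learning models, so this is not their implementation.

```python
from collections import defaultdict


def shrunken_intercepts(subject_ids, residuals, variance_ratio):
    """Per-subject random-intercept estimates from population-model residuals.

    variance_ratio is sigma^2 / tau^2 (noise variance over between-subject
    variance), treated here as known. Each subject's mean residual is shrunk
    toward zero by the factor n / (n + variance_ratio).
    """
    grouped = defaultdict(list)
    for subject, residual in zip(subject_ids, residuals):
        grouped[subject].append(residual)
    intercepts = {}
    for subject, values in grouped.items():
        n = len(values)
        mean_residual = sum(values) / n
        intercepts[subject] = n / (n + variance_ratio) * mean_residual
    return intercepts


def mse(targets, predictions):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical fear ratings and population-level predictions for two climbers.
subjects = ["c1", "c1", "c2", "c2"]
fear = [6.0, 7.0, 2.0, 3.0]
population_pred = [4.5, 4.5, 4.5, 4.5]

residuals = [y - p for y, p in zip(fear, population_pred)]
offsets = shrunken_intercepts(subjects, residuals, variance_ratio=0.5)
personalized_pred = [p + offsets[s] for s, p in zip(subjects, population_pred)]
```

Comparing `mse(fear, population_pred)` with `mse(fear, personalized_pred)` on this toy data shows the kind of error reduction a subject-specific offset can give. The shrinkage factor keeps climbers with few observations close to the population prediction.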

190. ❌ Machine Unlearning under Retain-Forget Entanglement

Authors: Jingpu Cheng, Ping Liu, Qianxiao Li, Chi Zhang Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26569v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

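Scoring rationale: The paper studies machine unlearning and proposes a two-phase optimization framework for entanglement between the retain and forget sets, involving neural network architectures, optimization methods and benchmarks. The scoring keywords all focus on large models, deep learning fundamentals and their scientific applications, whereas the core of this paper is the specific task of machine unlearning. It does not involve large models, LLMs, MoE, quantization, inference acceleration, alignment, RAG, agents, or any other keyword topic, so every relevance score is 0.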

!!! tip deepseek-chat TL;DR

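To address entanglement between the retain and forget sets in machine unlearning, the paper proposes a two-phase optimization framework. An augmented Lagrangian method and a gradient projection step regularized by the Wasserstein-2 distance forget the target data while minimizing the impact on related retained samples. Experiments across multiple tasks and architectures show that it outperforms existing baselines.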

摘要翻译

在机器遗忘任务中，遗忘特定数据子集很少是孤立进行的。通常，与遗忘集密切相关的保留样本可能会受到无意影响，尤其是当它们从预训练中继承相关特征或表现出强烈的语义相似性时。为应对这一挑战，我们提出了一种新颖的两阶段优化框架，专门设计用于处理此类保留-遗忘纠缠问题。第一阶段采用增广拉格朗日方法，在提升遗忘集损失的同时，保持与遗忘集关联较弱的保留样本的准确性。第二阶段引入基于Wasserstein-2距离正则化的梯度投影步骤，以缓解语义相关保留样本的性能退化，且不损害遗忘目标的有效性。我们通过在多种遗忘任务、标准基准数据集和多样化神经架构上的综合实验验证了本方法，结果表明其能够实现高效可靠的遗忘，同时在准确性保持和移除保真度方面均优于现有基线。

摘要 (Abstract)

Forgetting a subset in machine unlearning is rarely an isolated task. Often, retained samples that are closely related to the forget set can be unintentionally affected, particularly when they share correlated features from pretraining or exhibit strong semantic similarities. To address this challenge, we propose a novel two-phase optimization framework specifically designed to handle such retain-forget entanglements. In the first phase, an augmented Lagrangian method increases the loss on the forget set while preserving accuracy on less-related retained samples. The second phase applies a gradient projection step, regularized by the Wasserstein-2 distance, to mitigate performance degradation on semantically related retained samples without compromising the unlearning objective. We validate our approach through comprehensive experiments on multiple unlearning tasks, standard benchmark datasets, and diverse neural architectures, demonstrating that it achieves effective and reliable unlearning while outperforming existing baselines in both accuracy retention and removal fidelity.

Keywords: machine unlearning, retain-forget entanglement, two-phase optimization, augmented Lagrangian method, gradient projection, Wasserstein-2 distance, accuracy retention, removal fidelity
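
The first phase relies on an augmented Lagrangian method. As a rough illustration of that generic technique only, the sketch below solves a scalar toy problem: minimize $f(x) = (x-2)^2$ subject to $x = 1$. In the unlearning setting the objective and constraint would presumably be built from the forget and retain losses, but that mapping, the function names and all constants here are assumptions rather than the paper's formulation.

```python
def augmented_lagrangian(grad_f, h, grad_h, x0, rho=10.0,
                         outer_steps=20, inner_steps=200, lr=0.05):
    """Minimize f(x) subject to h(x) = 0 for a scalar variable x."""
    x, lam = x0, 0.0
    for _ in range(outer_steps):
        # Inner loop: gradient descent in x on
        #   f(x) + lam * h(x) + (rho / 2) * h(x)^2.
        for _ in range(inner_steps):
            grad = grad_f(x) + (lam + rho * h(x)) * grad_h(x)
            x -= lr * grad
        # Outer step: update the multiplier from the remaining violation.
        lam += rho * h(x)
    return x, lam


# Toy stand-ins: f is the objective, h the constraint residual.
x_star, lam_star = augmented_lagrangian(
    grad_f=lambda x: 2.0 * (x - 2.0),  # f(x) = (x - 2)^2
    h=lambda x: x - 1.0,               # constraint x = 1
    grad_h=lambda x: 1.0,
    x0=0.0,
)
```

The multiplier update is what distinguishes this from a plain quadratic penalty: the constraint is met without driving `rho` to infinity, which is one reason the approach suits a hard requirement placed alongside a competing objective.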

191. ❌ A Lyapunov Analysis of Softmax Policy Gradient for Stochastic Bandits

Authors: Tor Lattimore Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26547v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

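Scoring rationale: The paper is a theoretical analysis of the policy gradient algorithm for the classical stochastic multi-armed bandit problem, which belongs to the foundations of reinforcement learning. The scoring keywords all focus on large models, deep learning techniques and their applications, none of which the paper touches, so every relevance score is 0.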

!!! tip deepseek-chat TL;DR

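The paper adapts the policy gradient analysis for continuous-time stochastic multi-armed bandits to the discrete-time setting, proving a regret bound of O(k log(k) log(n)/η) for a suitably small learning rate.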

摘要翻译

我们将Lattimore (2026)针对连续时间$k$臂随机赌博机（$k$-armed stochastic bandits）的策略梯度分析，调整至标准离散时间设定。与连续时间情况一致，我们证明当学习率$η= O(Δ_{\min}^2/(Δ_{\max} \log(n)))$时，其遗憾（regret）为$O(k \log(k) \log(n) / η)$，其中$n$为时间范围（horizon），$Δ_{\min}$与$Δ_{\max}$分别表示最小与最大收益间隔（gaps）。

摘要 (Abstract)

We adapt the analysis of policy gradient for continuous time $k$-armed stochastic bandits by Lattimore (2026) to the standard discrete time setup. As in continuous time, we prove that with learning rate $η= O(Δ_{\min}^2/(Δ_{\max} \log(n)))$ the regret is $O(k \log(k) \log(n) / η)$ where $n$ is the horizon and $Δ_{\min}$ and $Δ_{\max}$ are the minimum and maximum gaps.

Keywords: Lyapunov analysis, softmax policy gradient, stochastic bandits, regret bound, learning rate, discrete time, multi-armed bandit, reinforcement learning
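
For readers unfamiliar with the algorithm being analyzed, the following is a minimal sketch of the textbook softmax policy gradient update on a Bernoulli bandit, using the standard REINFORCE estimator without a baseline. The arm means, horizon and learning rate are illustrative assumptions and are not chosen to satisfy the learning-rate condition in the theorem. The abstract does not spell out the exact update, so treat this as the usual form rather than the paper's definition.

```python
import math
import random


def softmax(theta):
    """Softmax policy over arms, computed in a numerically stable way."""
    top = max(theta)
    weights = [math.exp(t - top) for t in theta]
    total = sum(weights)
    return [w / total for w in weights]


def policy_gradient_step(theta, arm, reward, eta):
    """One stochastic softmax policy gradient update after pulling `arm`."""
    pi = softmax(theta)
    return [
        t + eta * reward * ((1.0 if a == arm else 0.0) - p)
        for a, (t, p) in enumerate(zip(theta, pi))
    ]


def run_bandit(means, horizon, eta, seed=0):
    """Play a Bernoulli bandit; return the final policy and expected regret."""
    rng = random.Random(seed)
    theta = [0.0] * len(means)
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        pi = softmax(theta)
        arm = rng.choices(range(len(means)), weights=pi)[0]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        regret += best - means[arm]  # gap of the arm that was pulled
        theta = policy_gradient_step(theta, arm, reward, eta)
    return softmax(theta), regret


policy, regret = run_bandit(means=[0.2, 0.5, 0.8], horizon=2000, eta=0.1)
```

Each update raises the logit of the pulled arm in proportion to the reward and lowers the others in proportion to their current probabilities. The regret accumulated here is the quantity the bound in the abstract controls.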

192. ❌ Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

Authors: Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, Jason D. Lee Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26554v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

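Scoring rationale: The paper studies the advantage of spectral optimizers such as Muon in large language model training, analyzing storage capacity and recovery rates on the linear associative memory problem in comparison with SGD. Relevant keywords: (1) "Large Language Models" (5): the paper explicitly studies optimizers for language model training; (2) "Scaling Laws AND Data Quality" (8): the core contribution is a set of scaling laws for Muon and SGD that relate recovery rates to the frequency distribution; (3) "Pre-training" (5): optimizer research bears directly on pre-training. Other keywords such as MoE, SFT and RAG are not addressed and score 0.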

!!! tip deepseek-chat TL;DR

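Using the linear associative memory problem, the paper analyzes the advantage of the Muon spectral optimizer in language model training. It finds that Muon has a higher storage capacity and a faster initial recovery rate than SGD, and establishes the corresponding scaling laws.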

摘要翻译

以Muon为代表的谱优化器近期在大规模语言模型训练中展现出卓越的实证性能，但其优势的来源与程度仍缺乏深入理解。我们通过线性联想记忆问题——一个用于模拟基于Transformer模型事实召回的可解析模型——来研究这一问题。特别地，我们超越了正交嵌入的设定，考察了高斯分布的输入与输出，这使得存储关联的数量能够远超嵌入维度。我们的主要结果精确刻画了在幂律频率分布下，Muon与随机梯度下降法（SGD）在逻辑回归损失上单步更新的恢复率。研究表明，Muon的存储容量显著超越SGD，且Muon在更大的临界批量大小时达到饱和。我们进一步基于阈值梯度近似分析了多步动态过程，发现Muon在初始阶段实现了比SGD快得多的恢复速率，而两种方法最终都以相近的速度收敛至信息理论极限。在合成任务上的实验验证了所预测的缩放规律。本分析为量化理解Muon的信号放大机制提供了依据，并为在更实际的语言建模任务与优化器中建立缩放规律奠定了基础。

摘要 (Abstract)

Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon and SGD on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and moreover Muon saturates at a larger critical batch size. We further analyze the multi-step dynamics under a thresholded gradient approximation and show that Muon achieves a substantially faster initial recovery rate than SGD, while both methods eventually converge to the information-theoretic limit at comparable speeds. Experiments on synthetic tasks validate the predicted scaling laws. Our analysis provides a quantitative understanding of the signal amplification of Muon and lays the groundwork for establishing scaling laws across more practical language modeling tasks and optimizers.

Keywords: spectral optimizers, Muon, large-scale language model training, linear associative memory, scaling laws, storage capacity, recovery rates, SGD comparison
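
The "signal amplification" mentioned in the abstract can be made concrete with a small sketch. Muon is commonly described as replacing the gradient matrix with an approximately orthogonalized version of it, so that every singular direction receives the same step size. The code below uses the simple cubic Newton-Schulz iteration to approximate that orthogonal factor and omits momentum. The 2x2 gradient is a made-up example, and production implementations reportedly use a tuned higher-order polynomial, so this is an assumption-laden illustration rather than the optimizer studied in the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(grad, steps=20):
    """Approximate the polar factor U V^T of `grad` by Newton-Schulz."""
    norm = sum(v * v for row in grad for v in row) ** 0.5
    if norm == 0.0:
        return [row[:] for row in grad]
    # Scale by the Frobenius norm so all singular values are at most 1.
    x = [[v / norm for v in row] for row in grad]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 X X^T X pushes each nonzero singular value to 1.
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
             for row_x, row_c in zip(x, cubic)]
    return x


# Hypothetical gradient: a frequent association (3.0) and a rare one (1.0).
gradient = [[3.0, 0.0], [0.0, 1.0]]
sgd_direction = gradient                    # rare direction gets 1/3 the step
spectral_direction = orthogonalize(gradient)  # both directions get ~1.0
```

In this toy case the SGD direction keeps the 3:1 imbalance between the frequent and rare associations, while the orthogonalized direction is close to the identity. This matches the intuition that spectral updates boost low-frequency signal under a power law frequency distribution.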

193. ❌ The internal law of a material can be discovered from its boundary

Authors: Francesco Regazzoni Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26517v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

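Scoring rationale: The paper addresses the discovery of physical laws in materials science. It proposes Neural-DFEM, which embeds a differentiable finite element solver in the learning loop and uses a neural architecture (Hyperelastic Neural Networks) to discover hyperelastic constitutive laws from boundary measurements without supervision. Its core is physics-informed machine learning applied to materials science, which falls under "AI for Science", so it is somewhat related to "AI for Science OR Bioinformatics OR Cheminformatics" (5). It does not involve large language models (LLMs), innovations in deep learning fundamentals (such as MoE, scaling laws, training and alignment techniques, inference optimization or agents), or any other scoring keyword, so all other keywords score 0.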

!!! tip deepseek-chat TL;DR

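The paper proposes Neural-DFEM, a physics-informed machine learning method that discovers hyperelastic constitutive laws without supervision from boundary-only measurements, generalizes across geometries and loading conditions, and remains accurate and robust to measurement noise.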

摘要翻译

自人类文明肇始之初，技术进步便始终与理解和预测材料力学行为的能力紧密相连。近年来，这一挑战日益被置于数据驱动科学发现这一更广泛的范式之中，即直接从观测数据中推断支配性规律。然而，现有方法通常需要应力-应变对或全场位移测量数据，而这些数据在实践中往往难以获取。我们提出了Neural-DFEM方法，该方法即使仅从部分观测数据（例如仅边界测量数据）也能实现超弹性材料定律的无监督发现。该方法将可微分有限元求解器嵌入学习循环中，直接将候选能量泛函与可获得的测量数据联系起来。为了在整个训练过程中保证热力学一致性和数学适定性，该方法采用了超弹性神经网络（Hyperelastic Neural Networks），这是一种新颖的结构保持型神经架构，其设计本身即能确保框架无差异性、材料对称性、多凸性以及强制性。由此构建的框架能够在二维和三维场景中实现稳健的材料模型发现，包括仅边界测量的情况。Neural-DFEM能够实现跨几何构型和加载条件的泛化，并展现出前所未有的准确性以及对测量噪声的强鲁棒性。我们的研究结果表明，当学习架构中嵌入了强大的物理归纳偏置时，即使在部分可观测条件下，也能实现材料定律的可靠辨识。

摘要 (Abstract)

Since the earliest stages of human civilization, advances in technology have been tightly linked to our ability to understand and predict the mechanical behavior of materials. In recent years, this challenge has increasingly been framed within the broader paradigm of data-driven scientific discovery, where governing laws are inferred directly from observations. However, existing methods require either stress-strain pairs or full-field displacement measurements, which are often inaccessible in practice. We introduce Neural-DFEM, a method that enables unsupervised discovery of hyperelastic material laws even from partial observations, such as boundary-only measurements. The method embeds a differentiable finite element solver within the learning loop, directly linking candidate energy functionals to available measurements. To guarantee thermodynamic consistency and mathematical well-posedness throughout training, the method employs Hyperelastic Neural Networks, a novel structure-preserving neural architecture that enforces frame indifference, material symmetry, polyconvexity, and coercivity by design. The resulting framework enables robust material model discovery in both two- and three-dimensional settings, including scenarios with boundary-only measurements. Neural-DFEM allows for generalization across geometries and loading conditions, and exhibits unprecedented accuracy and strong resilience to measurement noise. Our results demonstrate that reliable identification of material laws is achievable even under partial observability when strong physical inductive biases are embedded in the learning architecture.

Keywords: material law discovery, hyperelastic materials, Neural-DFEM, differentiable finite element solver, unsupervised learning, boundary-only measurements, physics-informed machine learning, thermodynamic consistency

194. ❌ Identifying Connectivity Distributions from Neural Dynamics Using Flows

Authors: Timothy Doyeon Kim, Ulises Pereira-Obilinovic, Yiliu Wang, Eric Shea-Brown, Uygar Sümbül Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26506v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

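Scoring rationale: The paper focuses on computational modeling and inference in neuroscience, using low-rank recurrent neural networks (lrRNNs) and continuous normalizing flows (CNFs) to infer neural connectivity structure. It is unrelated to most of the large-model and deep learning keywords, which mainly concern language models, training methods, inference optimization and alignment. It is somewhat related to "Mechanistic Interpretability OR Explainable AI" (5), because it concerns the mechanistic interpretation of neural dynamics, and to "AI for Science OR Bioinformatics OR Cheminformatics" (5), because it applies AI to a scientific field (neuroscience), although not to bioinformatics or cheminformatics. All other keywords are unrelated (0).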

!!! tip deepseek-chat TL;DR

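The paper tackles the degeneracy of inferring connectivity structure from neural dynamics. It develops an inference framework based on maximum entropy and continuous normalizing flows that learns a distribution over connection weights consistent with the observed dynamics, identifying which connectivity structures are computationally required rather than artifacts.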

摘要翻译

连接结构塑造神经计算，但从群体记录中推断这种结构存在简并性：多种连接结构可产生相同的动力学。近期研究利用低秩循环神经网络（lrRNNs）从观测活动中推断低维潜在动力学与连接结构，从而实现对动力学的机制性解释。然而，训练lrRNNs的标准方法可能恢复与底层动力学无关的虚假结构。我们首先刻画了lrRNNs中连接结构的可识别性，并确定了唯一解存在的条件。随后，为寻找此类解，我们开发了一个基于最大熵与连续归一化流（CNFs）的推断框架，通过流匹配进行训练。我们的方法并非估计单一连接矩阵，而是学习与观测动力学一致且最无偏的连接权重分布。该方法能够捕捉复杂但必要的分布，例如实证数据中发现的厚尾连接分布。我们在合成数据集上验证了该方法，这些数据集包含能产生多稳态吸引子、极限环和环状吸引子的连接结构，并展示了其在决策过程中大鼠前额叶皮层记录数据中的适用性。我们的框架将环路推断从恢复连接结构转向识别哪些连接结构是计算必需的，而哪些是欠约束推断的伪影。

摘要 (Abstract)

Connectivity structure shapes neural computation, but inferring this structure from population recordings is degenerate: multiple connectivity structures can generate identical dynamics. Recent work uses low-rank recurrent neural networks (lrRNNs) to infer low-dimensional latent dynamics and connectivity structure from observed activity, enabling a mechanistic interpretation of the dynamics. However, standard approaches for training lrRNNs can recover spurious structures irrelevant to the underlying dynamics. We first characterize the identifiability of connectivity structures in lrRNNs and determine conditions under which a unique solution exists. Then, to find such solutions, we develop an inference framework based on maximum entropy and continuous normalizing flows (CNFs), trained via flow matching. Instead of estimating a single connectivity matrix, our method learns the maximally unbiased distribution over connection weights consistent with observed dynamics. This approach captures complex yet necessary distributions such as heavy-tailed connectivity found in empirical data. We validate our method on synthetic datasets with connectivity structures that generate multistable attractors, limit cycles, and ring attractors, and demonstrate its applicability in recordings from rat frontal cortex during decision-making. Our framework shifts circuit inference from recovering connectivity to identifying which connectivity structures are computationally required, and which are artifacts of underconstrained inference.

Keywords: neural connectivity inference, low-rank recurrent neural networks, continuous normalizing flows, maximum entropy, degenerate inference, attractor dynamics, circuit inference, rat frontal cortex

195. ❌ Conditional Neural Bayes Ratio Estimation for Experimental Design Optimisation

Authors: S. A. K. Leeney, T. Gessey-Jones, W. J. Handley, E. de Lera Acedo, H. T. J. Bevins, J. L. Tutt Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26489v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

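Scoring rationale: The paper proposes Conditional Neural Bayes Ratio Estimation (cNBRE) for experimental design optimisation and applies it to 21-cm radio cosmology. The method uses a neural network to estimate Bayes ratios and belongs to scientific computing and experimental design. The scoring keywords all relate to specific techniques such as large language models, deep learning fundamentals, training optimization, inference acceleration, alignment and agents, whereas the core of this paper is Bayesian statistics and experimental design, and it does not involve large models or any of the listed techniques. Only the last keyword, "AI for Science", is weakly related through the paper's scientific application: the neural estimator is used to optimise experimental design rather than to make scientific discoveries directly, so it receives 5 (somewhat related). All other keywords are unrelated and score 0.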

!!! tip deepseek-chat TL;DR

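Targeting design optimisation for frontier scientific experiments, the paper proposes Conditional Neural Bayes Ratio Estimation (cNBRE), which estimates Bayes factors by conditioning on design parameters. Applied to a 21-cm cosmology experiment, it quantifies the effect of antenna orientation on detection probability (a variation of about 20 percentage points), providing a framework for efficient, globally informed experimental design.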

摘要翻译

在探测能力极限运行的前沿实验中，仪器设计直接决定了发现概率。本文提出条件神经贝叶斯比估计方法（Conditional Neural Bayes Ratio Estimation，简称cNBRE），该方法通过引入设计参数作为条件变量，扩展了神经贝叶斯比估计框架，使得单个训练完成的网络能够对连续设计空间中的贝叶斯因子进行估计。将cNBRE应用于代表REACH实验的21厘米射电宇宙学模拟时，其摊销计算特性实现了传统逐点方法难以处理的系统性设计空间探索，同时复现了已知的物理关系。分析表明，在单夜观测中，天线朝向可导致探测概率产生约20个百分点的变化——这一设计决策若能在天线建造前确定，其实施成本极低。该框架为广泛的科学应用提供了高效、全局优化的实验设计方法。

摘要 (Abstract)

For frontier experiments operating at the edge of detectability, instrument design directly determines the probability of discovery. We introduce Conditional Neural Bayes Ratio Estimation (cNBRE), which extends neural Bayes ratio estimation by conditioning on design parameters, enabling a single trained network to estimate Bayes factors across a continuous design space. Applied to 21-cm radio cosmology with simulations representative of the REACH experiment, the amortised nature of cNBRE enables systematic design space exploration that would be intractable with traditional point-wise methods, while recovering established physical relationships. The analysis demonstrates a ~20 percentage point variation in detection probability with antenna orientation for a single night of observation, a design decision that would be trivial to implement if determined prior to antenna construction. This framework enables efficient, globally-informed experimental design optimisation for a wide range of scientific applications.

Keywords: Conditional Neural Bayes Ratio Estimation, experimental design optimisation, Bayes factors, 21-cm radio cosmology, REACH experiment, design space exploration, detection probability, antenna orientation

196. ❌ Shapley meets Rawls: an integrated framework for measuring and explaining unfairness

Authors: Fadoua Amri-Jouidel, Emmanuel Kemel, Stéphane Mussard Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26476v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

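Scoring rationale: The paper studies fairness and explainability in machine learning and proposes using the Shapley value to define and explain unfairness, which places it in explainable AI (XAI). It is unrelated to nearly all keywords (large-model techniques, training methods, inference optimization, applications and so on) and is highly relevant only to "Mechanistic Interpretability OR Explainable AI", since its core is explaining the sources of unfairness. The other keywords are not addressed and score 0.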

!!! tip deepseek-chat TL;DR

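The paper proposes an integrated framework that uses the Shapley value to measure and explain the unfairness of machine learning models under standard group fairness criteria, and illustrates it on the Census Income dataset.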

摘要翻译

可解释性与公平性长期以来主要被分别考量，近期虽有研究尝试解释不公平性的来源，但相关探讨仍属少数。本文表明，在标准群体公平性准则下，夏普利值（Shapley value）可用于定义并解释不公平性。这为评估和推断不公平性及其影响因素提供了一个集成框架。我们的框架还可从夏普利值扩展至高效对称线性值（Efficient-Symmetric-Linear, ESL values）家族，其中部分值能提供更稳健的公平性定义，并缩短计算时间。我们在UCI机器学习资源库的“人口普查收入”数据集上进行了示例验证。与传统自助法检验相比，我们的方法以更短的计算时间揭示了“年龄”、“工作时长”和“婚姻状况”等因素导致了性别不公平性。

摘要 (Abstract)

Explainability and fairness have mainly been considered separately, with recent exceptions trying to explain the sources of unfairness. This paper shows that the Shapley value can be used to both define and explain unfairness, under standard group fairness criteria. This offers an integrated framework to estimate and derive inference on unfairness as well as the features that contribute to it. Our framework can also be extended from Shapley values to the family of Efficient-Symmetric-Linear (ESL) values, some of which offer more robust definitions of fairness, and shorter computation times. An illustration is run on the Census Income dataset from the UCI Machine Learning Repository. Our approach shows that "Age", "Number of hours" and "Marital status" generate gender unfairness, using shorter computation time than traditional Bootstrap tests.

Keywords: fairness, explainability, Shapley value, group fairness, unfairness measurement, feature contribution, ESL values, Census Income dataset
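
To show how a Shapley decomposition can attribute an unfairness measure to individual features, the sketch below computes exact Shapley values for a small set function. The `GAP` table is a hypothetical stand-in for "group fairness gap of a model restricted to a feature subset"; its numbers, the feature names and the function name `shapley_values` are assumptions for illustration and do not come from the paper or the Census Income data.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values of a set function `value` over `features`."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Probability that f joins right after a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                marginal = value(coalition | {f}) - value(coalition)
                total += weight * marginal
        phi[f] = total
    return phi


# Hypothetical unfairness (e.g. a parity gap) for each feature subset.
GAP = {
    frozenset(): 0.00,
    frozenset({"age"}): 0.04,
    frozenset({"hours"}): 0.06,
    frozenset({"age", "hours"}): 0.12,
}
contributions = shapley_values(["age", "hours"], GAP.__getitem__)
```

The contributions sum to the gap of the full feature set minus that of the empty set (the efficiency property), so the overall unfairness and its per-feature explanation come from the same quantity. Exact enumeration grows exponentially with the number of features, which is presumably why the computation time of alternative ESL values matters.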

197. ❌ Automatic feature identification in least-squares policy iteration using the Koopman operator framework

Authors: Christian Mugisho Zagabe, Sebastian Peitz Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26464v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

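Scoring rationale: The paper studies a Koopman-autoencoder-based least-squares policy iteration algorithm in reinforcement learning, focused on automatic feature learning and the control of dynamical systems. It is unrelated to every scoring keyword (all of which concern large models, deep learning techniques, and their applications), so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a least-squares policy iteration algorithm based on Koopman autoencoders for automatic feature learning in reinforcement learning, and shows on the stochastic chain walk and inverted pendulum control problems that it performs comparably to classical methods.

Abstract translation (Chinese)

本文提出了一种基于库普曼自编码器的最小二乘策略迭代（KAE-LSPI）强化学习算法。该算法通过将所谓的最小二乘不动点逼近方法重构为扩展动态模态分解（EDMD）的形式，从而能够借助库普曼自编码器（KAE）框架实现自动特征学习。此方法的提出动机源于线性强化学习技术中缺乏系统性的特征或核函数选择机制。我们以随机链行走和倒立摆控制问题为例，将KAE-LSPI算法与先前两种方法——经典最小二乘策略迭代（LSPI）和基于核函数的最小二乘策略迭代（KLSPI）进行了比较。与以往研究不同，本方法无需预先固定特征或核函数。实验结果表明，相较于经典LSPI算法中预设的特征数量，KAE技术学习到的特征数量保持在合理范围内。其收敛至最优或近似最优策略的性能也与其他两种方法相当。

Abstract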

In this paper, we present a Koopman autoencoder-based least-squares policy iteration (KAE-LSPI) algorithm in reinforcement learning (RL). The KAE-LSPI algorithm is based on reformulating the so-called least-squares fixed-point approximation method in terms of extended dynamic mode decomposition (EDMD), thereby enabling automatic feature learning via the Koopman autoencoder (KAE) framework. The approach is motivated by the lack of a systematic choice of features or kernels in linear RL techniques. We compare the KAE-LSPI algorithm with two previous works, the classical least-squares policy iteration (LSPI) and the kernel-based least-squares policy iteration (KLSPI), using stochastic chain walk and inverted pendulum control problems as examples. Unlike previous works, no features or kernels need to be fixed a priori in our approach. Empirical results show the number of features learned by the KAE technique remains reasonable compared to those fixed in the classical LSPI algorithm. The convergence to an optimal or a near-optimal policy is also comparable to the other two methods.

Keywords: Koopman autoencoder, least-squares policy iteration, reinforcement learning, automatic feature learning, extended dynamic mode decomposition, stochastic chain walk, inverted pendulum control

198. ❌ Fair Data Pre-Processing with Imperfect Attribute Space

Authors: Ying Zheng, Yangfan Jiang, Kian-Lee Tan | Source: arXiv | Published: 2026-03-27 | arXiv link: http://arxiv.org/abs/2603.26456v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

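Scoring rationale: The paper studies LatentPre, a fair data pre-processing framework for bias mitigation in machine learning that handles imperfect attribute spaces by introducing latent attributes. The scoring keywords all concern large models, deep learning techniques, or specific AI applications (such as bioinformatics), whereas this is general machine learning fairness research that touches none of them, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper addresses fair data pre-processing under the imperfect attribute spaces found in real-world data and proposes the LatentPre framework, which introduces identifiable latent attributes estimated by expectation-maximization to remove biased patterns while preserving data utility.

Abstract translation (Chinese)

公平数据预处理是缓解机器学习偏见的常用策略。当前一项前景广阔的研究方向聚焦于通过校准数据集来满足预设的公平性策略，使得敏感属性仅能通过明确指定的合法因果路径影响结果。尽管这些方法在数据洁净且信息完整时表现良好，但在现实场景中，当属性空间不完美（即决策相关因素被判定为不可用甚至缺失）时，它们往往失效。为弥补这一不足，我们提出了LatentPre，一个能够在实际场景中实现原则性、鲁棒性公平数据处理的新框架。该方法并非仅依赖观测属性，而是通过引入能够捕捉关键但微妙信号的潜在属性来增强公平性策略，使框架能够如同属性空间完美一般运作。这些潜在属性经过策略性引入以确保可识别性，并通过定制的期望最大化范式进行估计。原始数据随后依据这一潜在增强策略进行精细校准，在有效消除偏见模式的同时保留合理的关联模式。大量实验表明，LatentPre在不同场景中始终能实现优异的公平性与效用平衡，推动了实用性公平数据管理的发展。

Abstract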

Fair data pre-processing is a widely used strategy for mitigating bias in machine learning. A promising line of research focuses on calibrating datasets to satisfy a designed fairness policy so that sensitive attributes influence outcomes only through clearly specified legitimate causal pathways. While effective on clean and information-rich data, these methods often break down in real-world scenarios with imperfect attribute spaces, where decision-relevant factors may be deemed unusable or even missing. To address this gap, we propose LatentPre, a novel framework that enables principled and robust fair data processing in practical settings. Instead of relying solely on observed attributes, LatentPre augments the fairness policy with latent attributes that capture essential but subtle signals, enabling the framework to operate as if the attribute space were perfect. These latent attributes are strategically introduced to guarantee identifiability and are estimated using a tailored expectation-maximization paradigm. The raw data is then carefully refined to conform to this latent-augmented policy, effectively removing biased patterns while preserving justifiable ones. Extensive experiments demonstrate that LatentPre consistently achieves strong fairness-utility trade-offs across diverse scenarios, advancing practical fairness-aware data management.

Keywords: fair data pre-processing, bias mitigation, imperfect attribute space, latent attributes, expectation-maximization, fairness-utility trade-off, data management

199. ❌ Interpretable long-term traffic modelling on national road networks using theory-informed deep learning

Authors: Yue Li, Shujuan Chen, Akihiro Shimoda, Ying Jin | Source: arXiv | Published: 2026-03-27 | arXiv link: http://arxiv.org/abs/2603.26440v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

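Scoring rationale: The paper proposes DeepDemand, a framework that combines travel demand theory with deep learning for long-term traffic modelling, which is an application of deep learning to transport science. The keywords centre on large language models (LLMs), whereas the paper applies conventional (non-LLM) deep learning to transport, so every keyword has relevance 0 except 'Mechanistic Interpretability OR Explainable AI', which is rated 5/10 because the paper emphasizes interpretability analysis. The paper involves neither innovation in large-model techniques nor applications of LLMs.

!!! tip deepseek-chat TL;DR

The study proposes DeepDemand, a theory-informed deep learning framework for predicting long-term highway traffic volumes. By embedding travel demand theory, it achieves better predictive accuracy than the baseline models and good geographic transferability while remaining interpretable.

Abstract translation (Chinese)

长期交通建模是交通规划的基础，但现有方法往往需要在可解释性、可迁移性和预测准确性之间进行权衡。经典出行需求模型提供了行为结构，但依赖于强假设和大量校准；而通用深度学习模型能够捕捉复杂模式，却往往缺乏理论基础和空间可迁移性，这限制了它们在长期规划应用中的实用性。我们提出了DeepDemand，这是一个融合理论知识的深度学习框架，它嵌入了出行需求理论的关键组成部分，利用外部社会经济特征和路网结构来预测长期公路交通流量。该框架集成了一个用于局部起讫点（OD）区域提取和OD对筛选的竞争性双源Dijkstra算法，以及一个建模OD交互和出行时间阻抗的可微分架构。模型使用英国战略公路网八年（2017-2024年）的观测数据进行评估，覆盖5088个公路路段。在随机交叉验证下，DeepDemand的R²达到0.718，平均绝对误差（MAE）为7406辆车，优于线性回归、岭回归、随机森林和重力式基准模型。在空间交叉验证下性能依然强劲（R² = 0.665），表明其具有良好的地理可迁移性。可解释性分析揭示了一个稳定的非线性出行时间阻抗模式、关键的社会经济需求驱动因素，以及与主要就业中心和交通枢纽相一致的多中心OD交互结构。这些结果凸显了将交通理论与深度学习相结合，以实现可解释的公路交通建模和实际规划应用的价值。

Abstract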

Long-term traffic modelling is fundamental to transport planning, but existing approaches often trade off interpretability, transferability, and predictive accuracy. Classical travel demand models provide behavioural structure but rely on strong assumptions and extensive calibration, whereas generic deep learning models capture complex patterns but often lack theoretical grounding and spatial transferability, limiting their usefulness for long-term planning applications. We propose DeepDemand, a theory-informed deep learning framework that embeds key components of travel demand theory to predict long-term highway traffic volumes using external socioeconomic features and road-network structure. The framework integrates a competitive two-source Dijkstra procedure for local origin-destination (OD) region extraction and OD pair screening with a differentiable architecture modelling OD interactions and travel-time deterrence. The model is evaluated using eight years (2017-2024) of observations on the UK strategic road network, covering 5088 highway segments. Under random cross-validation, DeepDemand achieves an R² of 0.718 and an MAE of 7406 vehicles, outperforming linear, ridge, random forest, and gravity-style baselines. Performance remains strong under spatial cross-validation (R² = 0.665), indicating good geographic transferability. Interpretability analysis reveals a stable nonlinear travel-time deterrence pattern, key socioeconomic drivers of demand, and polycentric OD interaction structures aligned with major employment centres and transport hubs. These results highlight the value of integrating transport theory with deep learning for interpretable highway traffic modelling and practical planning applications.

Keywords: traffic modelling, deep learning, travel demand theory, interpretability, highway traffic volumes, spatial transferability, origin-destination interactions, transport planning
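
The gravity-style baseline and the idea of travel-time deterrence mentioned in the abstract can be illustrated with a textbook gravity model. This is a minimal sketch and not DeepDemand itself: the exponential deterrence function, the parameters `beta` and `k`, and all numbers are assumptions chosen for illustration, whereas the paper reports learning a nonlinear deterrence pattern from data.

```python
import math

def deterrence(travel_time_min, beta=0.05):
    """Assumed exponential deterrence: demand decays with travel time."""
    return math.exp(-beta * travel_time_min)

def gravity_flow(origin_population, destination_jobs, travel_time_min,
                 k=1e-6, beta=0.05):
    """Flow between one origin-destination pair under a gravity model."""
    return (k * origin_population * destination_jobs
            * deterrence(travel_time_min, beta))

# Same origin and destination sizes, two hypothetical travel times.
near = gravity_flow(200_000, 80_000, travel_time_min=20)
far = gravity_flow(200_000, 80_000, travel_time_min=60)
print(f"near={near:.1f}, far={far:.1f}")
```

With a fixed functional form like this, only `beta` and `k` can be calibrated, which is one way to see why a learned deterrence curve may fit observed volumes better.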

200. ❌ Reconstructing Quantum Dot Charge Stability Diagrams with Diffusion Models

Authors: Vinicius Hernandes, Joseph Rogers, Rouven Koch, Thomas Spriggs, Brennan Undseth, Anasua Chatterjee, Lieven M. K. Vandersypen, Eliska Greplova | Source: arXiv | Published: 2026-03-27 | arXiv link: http://arxiv.org/abs/2603.26432v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

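Scoring rationale: The paper aims to speed up quantum dot device characterization by using a conditional diffusion model to reconstruct charge stability diagrams from sparse measurements. Only 'AI for Science OR Bioinformatics OR Cheminformatics' is relevant (8/10): the work applies AI to science (quantum computing), though not to bioinformatics or cheminformatics. The remaining keywords concern large-model techniques, training methods, inference optimization, agent systems, and similar topics that the paper does not address.

!!! tip deepseek-chat TL;DR

The paper proposes using a conditional diffusion model to reconstruct quantum dot charge stability diagrams from sparse measurements in order to speed up quantum device characterization, preserving key physical features from as little as 4% of the measured data.

Abstract translation (Chinese)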

高效表征量子点器件是基于受限自旋的量子处理器规模化过程中的关键瓶颈。测量高分辨率的电荷稳定性图谱（CSDs，即定义量子点占据状态的关键数据图谱）耗时严重，尤其在新兴架构中，CSDs必须通过无法直接探测相关量子点电荷的远程传感器获取。本研究提出一种生成式方法，通过条件扩散模型从稀疏测量数据中重建完整的CSDs，从而加速采集过程。我们采用两种实验导向的掩蔽策略评估该方法：基于均匀网格的采样和线扫描测量。我们使用约9,000个样本训练的轻量级架构成功重建了CSDs，仅需总测量数据的4%即可保留电荷跃迁线等关键物理特征。通过与插值方法对比，我们发现后者在重建大面积未测量区域时失效。研究结果表明，生成模型能显著降低量子器件的表征开销，并为实验实现提供了可靠路径。

Abstract

Efficiently characterizing quantum dot (QD) devices is a critical bottleneck when scaling quantum processors based on confined spins. Measuring high-resolution charge stability diagrams (or CSDs, data maps which crucially define the occupation of QDs) is time-consuming, particularly in emerging architectures where CSDs must be acquired with remote sensors that cannot probe the charge of the relevant dots directly. In this work, we present a generative approach to accelerate acquisition by reconstructing full CSDs from sparse measurements, using a conditional diffusion model. We evaluate our approach using two experimentally motivated masking strategies: uniform grid-based sampling, and line-cut sweeps. Our lightweight architecture, trained on approximately 9,000 examples, successfully reconstructs CSDs, maintaining key physically important features such as charge transition lines, from as little as 4% of the total measured data. We compare the approach to interpolation methods, which fail when the task involves reconstructing large unmeasured regions. Our results demonstrate that generative models can significantly reduce the characterization overhead for quantum devices, and provide a robust path towards an experimental implementation.

Keywords: quantum dot, charge stability diagrams, diffusion models, generative models, sparse measurements, device characterization, quantum processors, reconstruction
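
The two masking strategies named in the abstract (uniform grid sampling and line-cut sweeps) can be sketched as boolean masks over a diagram. This is an illustrative sketch only: the 100x100 diagram size, the stride of 5, the sweep spacing of 25, and the function names are assumptions picked so that each mask measures about 4% of the pixels; the paper's actual mask geometry is not given in the abstract.

```python
def grid_mask(height, width, stride):
    """True where a pixel is measured: every `stride`-th row and column."""
    return [[r % stride == 0 and c % stride == 0 for c in range(width)]
            for r in range(height)]

def line_cut_mask(height, width, every):
    """True on full horizontal sweeps taken once every `every` rows."""
    return [[r % every == 0 for _ in range(width)] for r in range(height)]

def measured_fraction(mask):
    """Share of pixels that are measured (True) in the mask."""
    measured = sum(sum(row) for row in mask)
    return measured / (len(mask) * len(mask[0]))

grid = grid_mask(100, 100, stride=5)
cuts = line_cut_mask(100, 100, every=25)
print(measured_fraction(grid), measured_fraction(cuts))
```

Both masks keep 400 of the 10,000 pixels. A conditional generative model would then be asked to fill in the remaining pixels given the measured ones.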

201. ❌ Kantorovich–Kernel Neural Operators: Approximation Theory, Asymptotics, and Neural Network Interpretation

Authors: Tian-Xiao He | Source: arXiv | Published: 2026-03-27 | arXiv link: http://arxiv.org/abs/2603.26418v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

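Scoring rationale: The paper studies Kantorovich-kernel neural operators, a topic in mathematical analysis and operator approximation theory, and focuses on the mathematical properties of neural network operators (convergence, density theorems, Voronovskaya-type theorems, and so on). It is unrelated to every scoring keyword (all of which concern large models, deep learning techniques, and applied methods): it involves no large-model techniques, training methods, inference optimization, alignment techniques, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper studies the mathematical properties of multivariate Kantorovich-kernel neural network operators, including density results, convergence estimates, Voronovskaya-type theorems, Korovkin-type theorems, and inversion theorems, and discusses the connection between neural network architectures and classical positive operators.

Abstract translation (Chinese)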

本文研究一类多元Kantorovich核神经网络算子，包括Sharma与Singh所研究的深度Kantorovich型神经网络算子。我们证明了密度结果，建立了定量收敛估计，推导了Voronovskaya型定理，分析了深度复合算子的偏微分方程极限，证明了Korovkin型定理，并提出了反演定理。此外，本文探讨了神经网络架构与由Chui、Hsu、He、Lorentz及Korovkin提出的经典正算子之间的联系。

Abstract

This paper studies a class of multivariate Kantorovich-kernel neural network operators, including the deep Kantorovich-type neural network operators studied by Sharma and Singh. We prove density results, establish quantitative convergence estimates, derive Voronovskaya-type theorems, analyze the limits of partial differential equations for deep composite operators, prove Korovkin-type theorems, and propose inversion theorems. Furthermore, this paper discusses the connection between neural network architectures and the classical positive operators proposed by Chui, Hsu, He, Lorentz, and Korovkin.

Keywords: Kantorovich-kernel neural operators, neural network operators, approximation theory, convergence estimates, Voronovskaya-type theorems, Korovkin-type theorems, inversion theorems, positive operators

202. ❌ Maintaining Difficulty: A Margin Scheduler for Triplet Loss in Siamese Networks Training

Authors: Roberto Sprengel Minozzo Tomchak, Oge Marques, Lucas Garcia Pedroso, Luiz Eduardo Oliveira, Paulo Lisboa de Almeida | Source: arXiv | Published: 2026-03-27 | arXiv link: http://arxiv.org/abs/2603.26389v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

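Scoring rationale: The paper studies training with the Triplet Margin Ranking Loss in Siamese networks and proposes a scheduler that dynamically adjusts the margin parameter. It focuses entirely on metric learning and loss-function design in deep learning and has no direct connection to any scoring keyword (all of which concern large-model techniques, training methods, inference optimization, application areas, and so on). It mentions no large models, language models, AI-for-science applications, or related concepts, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

For the Triplet Margin Ranking Loss in Siamese networks, the paper proposes a scheduler that dynamically adjusts the margin parameter according to the proportion of easy triplets observed during training. Experiments show improved verification performance over fixed-margin and monotonically increasing margin schemes.

Abstract translation (Chinese)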

三元组间隔排序损失是孪生网络中解决距离度量学习问题最广泛使用的损失函数之一。该损失函数依赖于间隔参数μ，该参数定义了训练期间正负样本对应保持的最小距离。本研究发现，在训练过程中，只要观察到足够数量违反该间隔的三元组，许多三元组的有效间隔往往会超过预设的μ值。这一现象表明，在整个训练过程中固定间隔可能会限制学习过程。基于此观察，我们提出了一种间隔调度器，该调度器根据每个训练周期观察到的简单三元组比例动态调整μ值，旨在维持训练难度随时间推移的稳定性。实验证明，与固定间隔方案及单调递增间隔方案相比，所提出的策略能有效提升模型性能。在四个不同数据集上的实验结果表明，该方法在验证性能上取得了持续提升。

Abstract

The Triplet Margin Ranking Loss is one of the most widely used loss functions in Siamese Networks for solving Distance Metric Learning (DML) problems. This loss function depends on a margin parameter μ, which defines the minimum distance that should separate positive and negative pairs during training. In this work, we show that, during training, the effective margin of many triplets often exceeds the predefined value of μ, provided that a sufficient number of triplets violating this margin is observed. This behavior indicates that fixing the margin throughout training may limit the learning process. Based on this observation, we propose a margin scheduler that adjusts the value of μ according to the proportion of easy triplets observed at each epoch, with the goal of maintaining training difficulty over time. We show that the proposed strategy leads to improved performance when compared to both a constant margin and a monotonically increasing margin scheme. Experimental results on four different datasets show consistent gains in verification performance.

Keywords: Triplet Margin Ranking Loss, Siamese Networks, Distance Metric Learning, margin scheduler, training difficulty, easy triplets, verification performance, DML
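
The idea of adjusting the margin μ from the proportion of easy triplets can be sketched as follows. The triplet loss itself is the standard hinge form; the update rule in `schedule_margin`, its `target` and `step` parameters, and the example distances are assumptions made for illustration, since the abstract does not state the paper's actual scheduling rule.

```python
def triplet_loss(d_ap, d_an, margin):
    """Standard triplet hinge loss on anchor-positive / anchor-negative distances."""
    return max(0.0, d_ap - d_an + margin)

def easy_fraction(triplets, margin):
    """Share of triplets that already satisfy the margin (zero loss)."""
    easy = sum(1 for d_ap, d_an in triplets
               if triplet_loss(d_ap, d_an, margin) == 0.0)
    return easy / len(triplets)

def schedule_margin(margin, easy_frac, target=0.4, step=0.05):
    """Hypothetical rule: raise the margin when too many triplets are easy,
    lower it (not below zero) when too few are."""
    if easy_frac > target:
        return margin + step
    if easy_frac < target:
        return max(0.0, margin - step)
    return margin

# Hypothetical (d_ap, d_an) distances collected over one epoch.
epoch_triplets = [(0.2, 1.0), (0.5, 0.9), (0.7, 0.8), (0.9, 0.6)]
margin = 0.2
frac = easy_fraction(epoch_triplets, margin)
new_margin = schedule_margin(margin, frac)
print(frac, new_margin)
```

Here two of the four triplets are easy at μ = 0.2, which exceeds the assumed target of 0.4, so the sketch raises the margin for the next epoch to keep the task difficult.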

203. ❌ Curvature-aware Expected Free Energy as an Acquisition Function for Bayesian Optimization

Authors: Ajith Anil Meera, Wouter Kouw | Source: arXiv | Published: 2026-03-27 | arXiv link: http://arxiv.org/abs/2603.26339v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

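Scoring rationale: The paper studies acquisition function design in Bayesian optimization, proposing a curvature-aware update based on Expected Free Energy and validating it on a system identification problem for a Van der Pol oscillator. It focuses entirely on Bayesian optimization, Gaussian processes, and acquisition functions, which are conventional machine learning optimization methods, and involves no large language models, deep learning techniques, AI-for-science applications, or other techniques named in the keywords, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes a curvature-aware acquisition function based on Expected Free Energy for Bayesian optimization, targeting the joint learning and optimization problem, and shows on Van der Pol oscillator system identification that it outperforms existing acquisition functions.

Abstract translation (Chinese)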

我们提出一种基于期望自由能的采集函数用于贝叶斯优化，以解决联合学习与优化问题，即同时优化并学习底层函数。我们证明，在特定假设下，期望自由能可简化为上置信界、下置信界及期望信息增益。我们证明了期望自由能对于凹函数具有无偏收敛保证。基于这些推导结果，我们为期望自由能引入了一种曲率感知更新律，并以范德波尔振荡器的系统辨识问题展示了其概念验证。通过严格的仿真实验，我们表明，基于期望自由能的自适应采集函数在最终简单遗憾和高斯过程学习误差方面均优于当前最先进的采集函数。

Abstract

We propose an Expected Free Energy-based acquisition function for Bayesian optimization to solve the joint learning and optimization problem, i.e., optimize and learn the underlying function simultaneously. We show that, under specific assumptions, Expected Free Energy reduces to Upper Confidence Bound, Lower Confidence Bound, and Expected Information Gain. We prove that Expected Free Energy has unbiased convergence guarantees for concave functions. Using the results from these derivations, we introduce a curvature-aware update law for Expected Free Energy and show its proof of concept using a system identification problem on a Van der Pol oscillator. Through rigorous simulation experiments, we show that our adaptive Expected Free Energy-based acquisition function outperforms state-of-the-art acquisition functions with the least final simple regret and error in learning the Gaussian process.

Keywords: Bayesian optimization, Expected Free Energy, acquisition function, curvature-aware, Gaussian process, system identification, Van der Pol oscillator, simple regret
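
The abstract states that Expected Free Energy reduces, under specific assumptions, to familiar acquisition functions such as the Upper and Lower Confidence Bounds. The sketch below shows only those two special cases in their usual form, not the Expected Free Energy function itself; the posterior means and standard deviations, the `beta` values, and the helper `next_query` are hypothetical.

```python
def ucb(mean, std, beta=2.0):
    """Upper Confidence Bound: favours high mean and high uncertainty."""
    return mean + beta * std

def lcb(mean, std, beta=2.0):
    """Lower Confidence Bound: the pessimistic counterpart of UCB."""
    return mean - beta * std

# Hypothetical Gaussian process posterior (mean, std) at three candidate inputs.
posterior = {0.0: (1.0, 0.1), 0.5: (0.8, 0.5), 1.0: (0.2, 0.3)}

def next_query(posterior, beta):
    """Pick the candidate input that maximises UCB for the given beta."""
    return max(posterior, key=lambda x: ucb(*posterior[x], beta=beta))

explore = next_query(posterior, beta=2.0)
exploit = next_query(posterior, beta=0.0)
print(explore, exploit)
```

With `beta=2.0` the uncertain candidate at 0.5 wins, while `beta=0.0` falls back to the highest posterior mean at 0.0, which shows how the weight on uncertainty trades learning the function against optimizing it.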

204. ❌ A Power-Weighted Noncentral Complex Gaussian Distribution

Authors: Toru Nakashika | Source: arXiv | Published: 2026-03-27 | arXiv link: http://arxiv.org/abs/2603.26344v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

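Scoring rationale: The paper focuses on a probabilistic model from signal processing (the power-weighted noncentral complex Gaussian distribution) for modelling complex-valued random variables and speech power spectra. The keywords all concern large models, deep learning, AI applications, or related techniques, whereas the paper belongs to classical signal processing theory and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes a new probabilistic model for complex-valued random variables, the power-weighted noncentral complex Gaussian distribution, to better model amplitude characteristics in signal processing. In experiments on speech power spectra it achieves higher log-likelihood than conventional distributions.

Abstract translation (Chinese)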

复高斯分布作为信号处理与通信领域的基础谱模型与噪声模型已被广泛应用。然而，其高斯结构常限制其对单个源信号中观测到的多样化幅度特性的表征能力。另一方面，许多基于超球面模型推导的现有非高斯幅度分布因其幂律结构而获得了良好的经验拟合效果，但这些模型并未显式考虑复数值观测数据固有的复平面几何特性。本文提出一种针对复数值随机变量的新概率模型，该模型可解释为幂加权非中心复高斯分布。与传统的超球面幅度模型不同，所提模型直接在复平面上构建，在保留高维解释能力的同时，保持了复数值观测数据的几何结构。该模型通过单一形状参数引入非线性相位扩散，实现了从沿相位方向的弧形扩散到概率质量向原点集中的分布几何形态连续调控。我们构建了所提出的分布形式，并分析了其诱导幅度分布的统计特性。推导得到的幅度分布与功率分布提供了一个统一框架，涵盖信号建模中多种常用分布，包括莱斯分布、Nakagami分布与伽马分布。在语音功率谱上的实验结果表明，所提模型在对数似然度指标上持续优于传统分布。

Abstract

The complex Gaussian distribution has been widely used as a fundamental spectral and noise model in signal processing and communication. However, its Gaussian structure often limits its ability to represent the diverse amplitude characteristics observed in individual source signals. On the other hand, many existing non-Gaussian amplitude distributions derived from hyperspherical models achieve good empirical fit due to their power-law structures, while they do not explicitly account for the complex-plane geometry inherent in complex-valued observations. In this paper, we propose a new probabilistic model for complex-valued random variables, which can be interpreted as a power-weighted noncentral complex Gaussian distribution. Unlike conventional hyperspherical amplitude models, the proposed model is formulated directly on the complex plane and preserves the geometric structure of complex-valued observations while retaining a higher-dimensional interpretation. The model introduces a nonlinear phase diffusion through a single shape parameter, enabling continuous control of the distributional geometry from arc-shaped diffusion along the phase direction to concentration of probability mass toward the origin. We formulate the proposed distribution and analyze the statistical properties of the induced amplitude distribution. The derived amplitude and power distributions provide a unified framework encompassing several widely used distributions in signal modeling, including the Rice, Nakagami, and gamma distributions. Experimental results on speech power spectra demonstrate that the proposed model consistently outperforms conventional distributions in terms of log-likelihood.

Keywords: complex Gaussian distribution, power-weighted noncentral complex Gaussian, signal processing, amplitude distribution, speech power spectra, probabilistic model, complex-valued observations, log-likelihood

205. ❌ Making Multi-Axis Models Robust to Multiplicative Noise: How, and Why?

Authors: Bailey Andrew, David R. Westhead, Luisa Cutillo; Source: arXiv; Published: 2026-03-27; arXiv link: http://arxiv.org/abs/2603.26327v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

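Scoring rationale: The paper studies MED-MAGMA, a graph-learning algorithm for multiplicative noise in single-cell RNA sequencing data, which is a method-specific study within bioinformatics. It is unrelated to the keywords on large models, deep learning principles, or general AI applications. Only the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", is weakly related through the paper's bioinformatics setting (5 points), and the paper itself involves no large-model or deep-learning innovation.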

!!! tip deepseek-chat TL;DR

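This paper proposes MED-MAGMA, a graph-learning algorithm for fitting multi-axis (Kronecker-sum-structured) models corrupted by multiplicative noise, and shows on single-cell RNA sequencing data that it learns networks with better local and global structure.

Abstract (Chinese translation)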

本文提出了一种图学习算法MED-MAGMA，用于拟合受乘性噪声干扰的多轴（克罗内克和结构）模型。此类噪声在许多应用领域中具有自然性，例如单细胞RNA测序领域，其能自然地捕捉RNA测序平台的技术偏差。我们在单细胞表达图谱中所有符合特定规模的公共数据集上，将本方法与已有研究进行了系统性评估，结果表明我们的方法能够学习到具有更优局部与全局结构的网络。MED-MAGMA已作为Python软件包（MED-MAGMA）公开发布。

Abstract

In this paper we develop a graph-learning algorithm, MED-MAGMA, to fit multi-axis (Kronecker-sum-structured) models corrupted by multiplicative noise. This type of noise is natural in many application domains, such as that of single-cell RNA sequencing, in which it naturally captures technical biases of RNA sequencing platforms. Our work is evaluated against prior work on each and every public dataset in the Single Cell Expression Atlas under a certain size, demonstrating that our methodology learns networks with better local and global structure. MED-MAGMA is made available as a Python package (MED-MAGMA).

Keywords: multi-axis models, multiplicative noise, graph-learning algorithm, MED-MAGMA, single-cell RNA sequencing, Kronecker-sum-structured, network structure, technical biases
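
The abstract refers to "Kronecker-sum-structured" models without spelling out the operation. The sketch below only illustrates the standard definition, A ⊕ B = A ⊗ I + I ⊗ B, which combines one matrix per axis (for example, cells and genes) into a single matrix over all entries of the data. It is not the MED-MAGMA algorithm; the helper name `kronecker_sum` and the toy matrices are assumptions made for illustration.

```python
import numpy as np


def kronecker_sum(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return A (+) B = kron(A, I_m) + kron(I_n, B) for square A (n x n), B (m x m)."""
    n, m = a.shape[0], b.shape[0]
    return np.kron(a, np.eye(m)) + np.kron(np.eye(n), b)


# Toy per-axis matrices (hypothetical values, not taken from the paper).
rows = np.array([[2.0, -1.0],
                 [-1.0, 2.0]])
cols = np.array([[1.0, -0.5, 0.0],
                 [-0.5, 1.0, -0.5],
                 [0.0, -0.5, 1.0]])

# The combined matrix has one row and column per (row, column) pair: 2 * 3 = 6.
combined = kronecker_sum(rows, cols)
```

A useful property of this structure is that the eigenvalues of the combined matrix are all pairwise sums of the per-axis eigenvalues, which is one reason such models stay tractable on large data.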

206. ❌ Semi-structured multi-state delinquency model for mortgage default

Authors: Victor Medina-Olivares, Wangzhen Xia, Stefan Lessmann, Nadja Klein; Source: arXiv; Published: 2026-03-27; arXiv link: http://arxiv.org/abs/2603.26309v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

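Scoring rationale: The paper studies a semi-structured multi-state model for predicting mortgage default, combining a structured additive predictor with a neural network component; it is an application of traditional machine learning and statistical modeling to financial risk. The scoring keywords all concern large models, deep learning principles, or frontier directions such as AI for Science, and the paper involves no large language models, no innovation in deep learning principles, and no scientific AI application, so every keyword receives a relevance of 0.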

!!! tip deepseek-chat TL;DR

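This paper proposes a semi-structured multi-state model for predicting mortgage delinquency transitions, combining an interpretable structured component with a flexible neural network component; on the Freddie Mac dataset it achieves modest but consistent gains over a purely structured benchmark at the earliest prediction spans.

Abstract (Chinese translation)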

本文提出了一种半结构化离散时间多状态模型，用于分析抵押贷款违约状态转移。该模型结合了易于理解的结构化加性预测器（包含线性效应以及时间与协变量的平滑函数）与灵活捕捉复杂非线性及高阶交互作用的神经网络组件。为确保当协变量同时出现在两个组件时的可识别性，我们将非结构化部分相对于结构化设计进行正交化处理。针对离散时间竞争性转移，我们推导了精确的变换方法，将二元逻辑模型映射为有效的竞争转移概率，从而避免了对连续时间近似的依赖。在模拟实验中，我们的框架能有效还原结构化基线及协变量效应，同时利用神经网络组件检测交互模式。我们使用房地美（Freddie Mac）单户贷款层级数据集，采用跨时间测试设计对该方法进行验证。与结构化的广义加性基准模型相比，半结构化模型在最早预测区间内实现了适度但一致的区分度提升，同时保持了相似的布里尔分数（Brier scores）。在此跨时间评估中，添加宏观经济指标带来的增量效益有限，且未显著改变估计的借款人、贷款或期限驱动效应。总体而言，半结构化多状态建模在透明效应估计与灵活模式学习之间提供了实用的折衷方案，其潜在应用可扩展至信用转移预测之外的领域。

Abstract

We propose a semi-structured discrete-time multi-state model to analyse mortgage delinquency transitions. This model combines an easy-to-understand structured additive predictor, which includes linear effects and smooth functions of time and covariates, with a flexible neural network component that captures complex nonlinearities and higher-order interactions. To ensure identifiability when covariates are present in both components, we orthogonalise the unstructured part relative to the structured design. For discrete-time competing transitions, we derive exact transformations that map binary logistic models to valid competing transition probabilities, avoiding the need for continuous-time approximations. In simulations, our framework effectively recovers structured baseline and covariate effects while using the neural component to detect interaction patterns. We demonstrate the method using the Freddie Mac Single-Family Loan-Level Dataset, employing an out-of-time test design. Compared with a structured generalised additive benchmark, the semi-structured model provides modest but consistent gains in discrimination across the earliest prediction spans, while maintaining similar Brier scores. Adding macroeconomic indicators provides limited incremental benefit in this out-of-time evaluation and does not materially change the estimated borrower-, loan-, or duration-driven effects. Overall, semi-structured multi-state modelling offers a practical compromise between transparent effect estimates and flexible pattern learning, with potential applications beyond credit-transition forecasting.

Keywords: mortgage default, semi-structured model, multi-state model, neural network, delinquency transitions, competing transitions, structured additive predictor, credit-transition forecasting

207. ❌ STN-GPR: A Singularity Tensor Network Framework for Efficient Option Pricing

Authors: Dominic Gribben, Carolina Allende, Alba Villarino, Aser Cortines, Mazen Ali, Román Orús, Pascal Oswald, Noureddine Lehdili; Source: arXiv; Published: 2026-03-27; arXiv link: http://arxiv.org/abs/2603.26318v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

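Scoring rationale: The paper focuses on option pricing in financial engineering and proposes a surrogate-model method based on tensor networks and Gaussian process regression (GPR). Its core technique is tensor-train (TT) decomposition and approximation for efficiently handling high-dimensional pricing surfaces, which belongs to numerical computing and computational finance. The scoring keywords all relate directly to large language models (LLMs), deep learning principles, AI for Science (such as bioinformatics and cheminformatics), or general AI methods (such as reasoning, alignment, and fine-tuning). The paper does not touch these topics and neither uses nor discusses any large-model, deep learning, or AI for Science technique, so the relevance of every keyword is 0.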

!!! tip deepseek-chat TL;DR

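This paper proposes STN-GPR, a surrogate-model framework based on tensor networks and Gaussian process regression, for high-dimensional option pricing in large-scale portfolio revaluation; it outperforms standard Gaussian process regression in test error and training time.

Abstract (Chinese translation)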

我们开发了一种用于期权定价的张量网络代理模型，旨在解决市场风险管理（如风险价值与预期缺口计算）中出现的大规模投资组合重估问题。该方法通过张量列车格式表示高维价格曲面，利用张量列车交叉逼近技术直接从黑箱定价评估中构建代理模型，无需生成完整的训练张量。在推断阶段，我们采用拉普拉斯核函数，并在无噪声设定下推导出核矩阵及其闭式逆的张量列车表示，从而实现了无需稠密矩阵分解或迭代线性求解的张量列车高斯过程回归。研究发现，超参数优化始终倾向于较大的核长度尺度，并证明在此条件下，高斯过程回归预测器对离网格输入可简化为多线性插值；我们同时推导了该极限下的低秩张量列车表示。我们在八维参数空间（资产现货水平、行权价、利率和剩余期限）中对五资产篮子期权进行了方法评估。对于欧式几何篮子看跌期权，张量代理模型通过扩展到显著更大的有效训练集，在更短的训练时间内实现了比标准高斯过程回归更低的测试误差。对于基于最小二乘蒙特卡洛数据训练的美式算术篮子看跌期权，该代理模型在训练集规模扩展性方面表现更优，同时实现每查询毫秒级评估速度，整体运行时间主要受数据生成阶段主导。

Abstract

We develop a tensor-network surrogate for option pricing, targeting large-scale portfolio revaluation problems arising in market risk management (e.g., VaR and Expected Shortfall computations). The method involves representing high-dimensional price surfaces in tensor-train (TT) form using TT-cross approximation, constructing the surrogate directly from black-box price evaluations without materializing the full training tensor. For inference, we use a Laplacian kernel and derive TT representations of the kernel matrix and its closed-form inverse in the noise-free setting, enabling TT-based Gaussian process regression without dense matrix factorization or iterative linear solves. We find that hyperparameter optimization consistently favors a large kernel length-scale and show that in this regime the GPR predictor reduces to multilinear interpolation for off-grid inputs; we also derive a low-rank TT representation for this limit. We evaluate the approach on five-asset basket options over an eight-dimensional parameter space (asset spot levels, strike, interest rate, and time to maturity). For European geometric basket puts, the tensor surrogate achieves lower test error at shorter training times than standard GPR by scaling to substantially larger effective training sets. For American arithmetic basket puts trained on LSMC data, the surrogate exhibits more favorable scaling with training-set size while providing millisecond-level evaluation per query, with overall runtime dominated by data generation.

Keywords: tensor network, option pricing, Gaussian process regression, portfolio revaluation, tensor-train decomposition, high-dimensional pricing, computational finance, surrogate model
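
The claim that the GPR predictor reduces to (multi)linear interpolation at large length-scales can be checked numerically in one dimension. The sketch below is a minimal illustration, not the paper's tensor-train method. It assumes the common Laplacian kernel form k(x, x') = exp(-|x - x'| / ℓ), which the abstract does not state explicitly, and it uses made-up grid values. The function names `laplacian_kernel` and `gpr_mean` are hypothetical.

```python
import numpy as np


def laplacian_kernel(x: np.ndarray, y: np.ndarray, length_scale: float) -> np.ndarray:
    """Assumed kernel form: k(x, y) = exp(-|x - y| / length_scale), 1-D inputs."""
    return np.exp(-np.abs(x[:, None] - y[None, :]) / length_scale)


def gpr_mean(x_train, y_train, x_test, length_scale):
    """Noise-free GP posterior mean with a zero prior mean."""
    k_train = laplacian_kernel(x_train, x_train, length_scale)
    k_cross = laplacian_kernel(x_test, x_train, length_scale)
    return k_cross @ np.linalg.solve(k_train, y_train)


# Made-up grid data and off-grid query points.
x_grid = np.linspace(0.0, 1.0, 5)
y_grid = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
x_query = np.array([0.1, 0.4, 0.6, 0.9])

gp_pred = gpr_mean(x_grid, y_grid, x_query, length_scale=1e3)
linear_pred = np.interp(x_query, x_grid, y_grid)

# Largest absolute difference between the two predictors.
max_gap = np.max(np.abs(gp_pred - linear_pred))
```

With a length-scale much larger than the grid spacing, `max_gap` should be very small, which matches the regime the abstract describes. Shrinking `length_scale` toward the grid spacing makes the two predictors diverge.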

208. ❌ D-GATNet: Interpretable Temporal Graph Attention Learning for ADHD Identification Using Dynamic Functional Connectivity

Authors: Qurat Ul Ain, Alptekin Temizel, Soyiba Jawed; Source: arXiv; Published: 2026-03-27; arXiv link: http://arxiv.org/abs/2603.26308v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

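Scoring rationale: The paper focuses on ADHD classification using dynamic functional connectivity and an interpretable graph attention network, which places it in biomedical AI applications. It is unrelated to the large majority of keywords (large-model principles, training methods, inference optimization, agent systems, and so on), which are therefore scored 0. It has some relation to "Mechanistic Interpretability OR Explainable AI" (the paper emphasizes model interpretability but does not examine the general principles of explainable AI in depth) and is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (an application of AI to bioinformatics and neuroscience).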

!!! tip deepseek-chat TL;DR

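This study proposes D-GATNet, a framework based on dynamic functional connectivity and an interpretable graph attention network for automated ADHD diagnosis. On the Peking University site of the ADHD-200 dataset it achieves better classification performance than existing methods (85.18% balanced accuracy) and uses attention mechanisms to reveal potential neuroimaging biomarkers.

Abstract (Chinese translation)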

注意缺陷多动障碍（Attention Deficit Hyperactivity Disorder, ADHD）是一种常见的神经发育障碍，由于其大脑连接中存在复杂且时变的紊乱，基于神经影像学的诊断仍面临挑战。功能磁共振成像（functional MRI, fMRI）为识别功能改变提供了一种强大的非侵入性模态。现有的深度学习（deep learning, DL）研究采用了多种神经影像特征；然而，静态功能连接仍被广泛使用，而动态连接建模的研究相对不足。此外，许多深度学习模型缺乏可解释性。在本研究中，我们提出了D-GATNet，这是一个基于时序图的可解释框架，用于利用动态功能连接（dynamic functional connectivity, dFC）进行ADHD的自动分类。通过滑动窗口皮尔逊相关法，构建了以感兴趣区域为节点、连接强度为边的功能性大脑图序列。空间依赖关系通过多层图注意力网络学习，而时序动态则使用一维卷积及随后的时序注意力机制进行建模。可解释性通过以下方式实现：图注意力权重揭示了占主导地位的ROI交互，ROI重要性分数识别了关键脑区，时序注意力则强调了信息丰富的连接片段。在北京大学ADHD-200数据集站点上进行的实验，采用分层10折交叉验证和5次种子集成，取得了85.18% ±5.64的平衡准确率和0.881的AUC值，性能优于现有先进方法。注意力分析揭示了小脑和默认模式网络的紊乱，指示了潜在的神经影像学生物标志物。

Abstract

Attention Deficit Hyperactivity Disorder (ADHD) is a prevalent neurodevelopmental disorder whose neuroimaging-based diagnosis remains challenging due to complex time-varying disruptions in brain connectivity. Functional MRI (fMRI) provides a powerful non-invasive modality for identifying functional alterations. Existing deep learning (DL) studies employ diverse neuroimaging features; however, static functional connectivity remains widely used, whereas dynamic connectivity modeling is comparatively underexplored. Moreover, many DL models lack interpretability. In this work, we propose D-GATNet, an interpretable temporal graph-based framework for automated ADHD classification using dynamic functional connectivity (dFC). Sliding-window Pearson correlation constructs sequences of functional brain graphs with regions of interest as nodes and connectivity strengths as edges. Spatial dependencies are learned via a multi-layer Graph Attention Network, while temporal dynamics are modeled using 1D convolution followed by temporal attention. Interpretability is achieved through graph attention weights revealing dominant ROI interactions, ROI importance scores identifying influential regions, and temporal attention emphasizing informative connectivity segments. Experiments on the Peking University site of the ADHD-200 dataset using stratified 10-fold cross-validation with a 5-seed ensemble achieved 85.18% ± 5.64 balanced accuracy and 0.881 AUC, outperforming state-of-the-art methods. Attention analysis reveals cerebellar and default mode network disruptions, indicating potential neuroimaging biomarkers.

Keywords: ADHD identification, dynamic functional connectivity, graph attention network, interpretable deep learning, fMRI analysis, temporal graph learning, neuroimaging biomarkers, brain connectivity
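
The graph-construction step described in the abstract, sliding-window Pearson correlation over ROI time series, can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the authors' pipeline: the window length, the stride, the number of regions, and the function name `sliding_window_connectivity` are all assumptions, since the abstract does not give these values.

```python
import numpy as np


def sliding_window_connectivity(signals: np.ndarray, window: int, stride: int) -> np.ndarray:
    """Build one Pearson correlation matrix per time window.

    signals: array of shape (time, regions).
    Returns an array of shape (n_windows, regions, regions).
    """
    n_time = signals.shape[0]
    starts = range(0, n_time - window + 1, stride)
    # np.corrcoef treats rows as variables, so each window is transposed
    # to shape (regions, window) before the correlation is computed.
    return np.stack([np.corrcoef(signals[s:s + window].T) for s in starts])


# Synthetic stand-in for ROI time series: 120 time points, 6 regions.
rng = np.random.default_rng(0)
bold = rng.standard_normal((120, 6))

# Hypothetical window and stride; each matrix is one graph in the sequence.
graphs = sliding_window_connectivity(bold, window=30, stride=10)
```

Each slice `graphs[t]` can be read as a weighted adjacency matrix whose nodes are regions of interest and whose edge weights are connectivity strengths, which is the input form the abstract describes for the graph attention layers.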

209. ❌ Contrastive Conformal Sets

Authors: Yahya Alkhatib, Wee Peng Tay; Source: arXiv; Published: 2026-03-27; arXiv link: http://arxiv.org/abs/2603.26261v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

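Scoring rationale: The paper studies conformal prediction for contrastive learning, proposing minimum-volume covering sets with learnable generalized multi-norm constraints; it belongs to statistical learning theory and methodology within machine learning. It does not involve large models, deep learning principles, scientific AI applications, or any specific technique named in the scoring keywords (such as LLMs, MoE, RLHF, or RAG), and it does not mention scientific application areas such as bioinformatics or cheminformatics. None of the keywords is relevant to the paper's topic.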

!!! tip deepseek-chat TL;DR

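To address the lack of coverage guarantees in the semantic feature space of contrastive learning, this paper proposes a conformal prediction method based on minimum-volume covering sets with learnable generalized multi-norm constraints. It guarantees positive-sample coverage while maximizing negative-sample exclusion, and on simulated and real-world image datasets it achieves better inclusion-exclusion trade-offs than standard distance-based baselines.

Abstract (Chinese translation)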

对比学习通过促使正样本紧密聚集并分离负样本来生成连贯的语义特征嵌入。然而，现有的对比学习方法缺乏对语义特征空间内覆盖度的原则性保证。我们将可学习广义多范数约束的最小体积覆盖集引入此框架，从而将保形预测扩展至该领域。我们提出一种方法，能构建保证用户指定正样本覆盖度、同时最大化负样本排除的保形集。我们从理论上证明，体积最小化可作为负样本排除的代理目标，使得我们的方法即使在负样本对不可用时也能有效运作。正样本包含保证继承了保形预测的无分布覆盖特性，而负样本排除则通过在预留的训练集上优化学习到的集合几何形状来实现最大化。在模拟和真实世界图像数据集上的实验表明，与标准的基于距离的保形预测基线相比，本方法在包含-排除权衡方面取得了更优的效果。

Abstract

Contrastive learning produces coherent semantic feature embeddings by encouraging positive samples to cluster closely while separating negative samples. However, existing contrastive learning methods lack principled guarantees on coverage within the semantic feature space. We extend conformal prediction to this setting by introducing minimum-volume covering sets equipped with learnable generalized multi-norm constraints. We propose a method that constructs conformal sets guaranteeing user-specified coverage of positive samples while maximizing negative sample exclusion. We establish theoretically that volume minimization serves as a proxy for negative exclusion, enabling our approach to operate effectively even when negative pairs are unavailable. The positive inclusion guarantee inherits the distribution-free coverage property of conformal prediction, while negative exclusion is maximized through learned set geometry optimized on a held-out training split. Experiments on simulated and real-world image datasets demonstrate improved inclusion-exclusion trade-offs compared to standard distance-based conformal baselines.

Keywords: contrastive learning, conformal prediction, minimum-volume covering sets, generalized multi-norm constraints, positive sample coverage, negative sample exclusion, distribution-free coverage, inclusion-exclusion trade-off
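
For context on the "standard distance-based conformal baselines" the abstract compares against, the sketch below shows generic split conformal prediction with a Euclidean-distance score. It is an assumption about what such a baseline typically looks like, not the paper's method or its exact baseline. The synthetic embeddings, the noise level, and the function name `conformal_radius` are hypothetical.

```python
import math

import numpy as np


def conformal_radius(cal_scores: np.ndarray, alpha: float) -> float:
    """Split-conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return float(np.sort(cal_scores)[rank - 1])


# Synthetic anchor embeddings and noisy "positive" partners (hypothetical data).
rng = np.random.default_rng(0)
anchors = rng.standard_normal((500, 8))
positives = anchors + 0.1 * rng.standard_normal((500, 8))

# Nonconformity score: distance between an anchor and its positive sample.
scores = np.linalg.norm(anchors - positives, axis=1)

# Calibrate on the first 300 pairs, then measure coverage on the remaining 200.
radius = conformal_radius(scores[:300], alpha=0.1)
coverage = float(np.mean(scores[300:] <= radius))
```

The resulting conformal set is a ball of radius `radius` around each anchor. Under exchangeability, it covers a new positive sample with probability at least 1 - alpha. The paper's contribution, as the abstract describes it, is to replace this fixed ball with a learned minimum-volume shape so that more negatives fall outside the set at the same coverage level.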

210. ❌ Topology-Aware Graph Reinforcement Learning for Energy Storage Systems Optimal Dispatch in Distribution Networks

Authors: Shuyi Gao, Stavros Orfanoudakis, Shengren Hou, Peter Palensky, Pedro P. Vergara; Source: arXiv; Published: 2026-03-27; arXiv link: http://arxiv.org/abs/2603.26264v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

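Scoring rationale: The paper studies optimal dispatch of energy storage systems in distribution networks, using a TD3-based reinforcement learning architecture with graph neural networks (GNNs) as feature encoders. Its content centers on reinforcement learning, graph neural networks, and energy system optimization, and is unrelated to the large majority of large-model keywords in the scoring list (such as LLMs, MoE, Scaling Laws, RLHF, and RAG). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the study applies AI to a science and engineering domain (energy systems); since this is not core bioinformatics or cheminformatics, it receives 5 points (some relevance).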

!!! tip deepseek-chat TL;DR

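This paper proposes a topology-aware reinforcement learning architecture that uses graph neural networks to optimize the dispatch of energy storage systems in distribution networks. On 34-bus and 69-bus systems the method reduces voltage violations and improves operating economy, but performance degrades under zero-shot cross-system transfer.

Abstract (Chinese translation)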

配电网中储能系统（ESS）的优化调度涉及在时变条件及可能的拓扑结构变化下，协同提升运行经济性与电压安全性。为支持快速在线决策，我们开发了一种基于双延迟深度确定性策略梯度（TD3）的拓扑感知强化学习架构，该架构集成图神经网络（GNNs）作为储能系统调度的图特征编码器。我们在34节点和69节点系统上对三种GNN变体——图卷积网络（GCNs）、拓扑自适应图卷积网络（TAGConv）和图注意力网络（GATs）——进行了系统性研究，并评估了其在多种拓扑重构场景下以及不同规模系统间的跨系统迁移鲁棒性。结果表明，基于GNN的控制器能持续减少电压越限的次数与幅度，在69节点系统及拓扑重构场景下效益更为显著；在69节点系统中，TD3-GCN和TD3-TAGConv相较于非线性规划基准所节省的成本也优于神经网络基线。我们还指出，迁移收益具有场景依赖性，在本质不同的系统间进行零样本迁移会导致性能显著下降并增加电压幅值越限。本工作开源地址为：https://github.com/ShuyiGao/GNNs_RL_ESSs 与 https://github.com/distributionnetworksTUDelft/GNNs_RL_ESSs。

Abstract

Optimal dispatch of energy storage systems (ESSs) in distribution networks involves jointly improving operating economy and voltage security under time-varying conditions and possible topology changes. To support fast online decision making, we develop a topology-aware Reinforcement Learning architecture based on Twin Delayed Deep Deterministic Policy Gradient (TD3), which integrates graph neural networks (GNNs) as graph feature encoders for ESS dispatch. We conduct a systematic investigation of three GNN variants: graph convolutional networks (GCNs), topology adaptive graph convolutional networks (TAGConv), and graph attention networks (GATs) on the 34-bus and 69-bus systems, and evaluate robustness under multiple topology reconfiguration cases as well as cross-system transfer between networks with different system sizes. Results show that GNN-based controllers consistently reduce the number and magnitude of voltage violations, with clearer benefits on the 69-bus system and under reconfiguration; on the 69-bus system, TD3-GCN and TD3-TAGConv also achieve lower saved cost relative to the NLP benchmark than the NN baseline. We also highlight that transfer gains are case-dependent, and zero-shot transfer between fundamentally different systems results in notable performance degradation and increased voltage magnitude violations. This work is available at: https://github.com/ShuyiGao/GNNs_RL_ESSs and https://github.com/distributionnetworksTUDelft/GNNs_RL_ESSs.

Keywords: Energy Storage Systems, Optimal Dispatch, Reinforcement Learning, Graph Neural Networks, Distribution Networks, Topology-aware, Voltage Security, Transfer Learning

211. ❌ Improving Risk Stratification in Hypertrophic Cardiomyopathy: A Novel Score Combining Echocardiography, Clinical, and Medication Data

Authors: Marion Taconné, Valentina D. A. Corino, Annamaria Del Franco, Sara Giovani, Iacopo Olivotto, Adrien Al Wazzan, Erwan Donal, Pietro Cerveri, Luca Mainardi; Source: arXiv; Published: 2026-03-27; arXiv link: http://arxiv.org/abs/2603.26254v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

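Scoring rationale: The paper focuses on risk stratification in hypertrophic cardiomyopathy using traditional machine learning (random forest), which places it in medical AI applications. It is unrelated to the large majority of large-model keywords (such as LLM, MoE, and RLHF), which concern large language model architectures, training methods, and inference optimization, none of which the paper uses. It has some relation to "Explainable AI" (5 points), because the paper describes the model as highly interpretable, and a stronger relation to "AI for Science" (8 points), because it is applied AI research in biomedicine.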

!!! tip deepseek-chat TL;DR

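This study develops a random-forest-based machine learning risk score that uses routinely collected echocardiographic, clinical, and medication data to predict 5-year cardiovascular outcomes in hypertrophic cardiomyopathy; it outperforms the existing ESC score in both internal and external validation.

Abstract (Chinese translation)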

肥厚型心肌病（HCM）需要准确的风险分层，以指导植入式心律转复除颤器（ICD）治疗及随访管理的决策。当前已建立的模型，例如欧洲心脏病学会（ESC）风险评分，其区分性能仅为中等水平。本研究开发了一种稳健且可解释的机器学习（ML）风险评分，利用常规收集的超声心动图、临床和用药数据（这些数据通常包含在电子健康记录中），以预测HCM患者5年复合心血管结局。该模型使用来自SHARE注册库（佛罗伦萨医院）的大型队列（N=1,201）进行训练和内部验证，并在来自雷恩医院的独立队列（N=382）上进行了外部验证。最终的随机森林集成模型获得了高达0.85 ± 0.02的内部受试者工作特征曲线下面积（AUC），显著优于ESC评分（0.56 ± 0.03）。关键的是，在外部验证集上的生存曲线分析显示，ML评分具有更优的风险区分能力（对数秩检验 p = 8.62 x 10^(-4)），而ESC评分的p值为0.0559。此外，纵向分析表明，所提出的风险评分在未发生事件的患者中随时间推移保持稳定。该模型的高可解释性及其纵向风险监测能力，为HCM的个体化临床管理提供了有前景的工具。

Abstract

Hypertrophic cardiomyopathy (HCM) requires accurate risk stratification to inform decisions regarding ICD therapy and follow-up management. Current established models, such as the European Society of Cardiology (ESC) score, exhibit moderate discriminative performance. This study develops a robust, explainable machine learning (ML) risk score leveraging routinely collected echocardiographic, clinical, and medication data, typically contained within Electronic Health Records (EHRs), to predict a 5-year composite cardiovascular outcome in HCM patients. The model was trained and internally validated using a large cohort (N=1,201) from the SHARE registry (Florence Hospital) and externally validated on an independent cohort (N=382) from Rennes Hospital. The final Random Forest ensemble model achieved a high internal Area Under the Curve (AUC) of 0.85 ± 0.02, significantly outperforming the ESC score (0.56 ± 0.03). Critically, survival curve analysis on the external validation set showed superior risk separation for the ML score (log-rank p = 8.62 × 10^(-4)) compared to the ESC score (p = 0.0559). Furthermore, longitudinal analyses demonstrate that the proposed risk score remains stable over time in event-free patients. The model's high interpretability and its capacity for longitudinal risk monitoring represent promising tools for the personalized clinical management of HCM.

Keywords: hypertrophic cardiomyopathy, risk stratification, machine learning, random forest, echocardiography, cardiovascular outcome, clinical prediction model, electronic health records

212. ❌ Optimization Trade-offs in Asynchronous Federated Learning: A Stochastic Networks Approach

Authors: Abdelkrim Alahyane, Céline Comte, Matthieu Jonckheere; Source: arXiv; Published: 2026-03-27; arXiv link: http://arxiv.org/abs/2603.26231v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

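Scoring rationale: This paper focuses on optimization in asynchronous federated learning, specifically using a stochastic-network approach to analyze the trade-offs among gradient staleness, convergence speed, and energy consumption. It covers distributed machine learning, communication delays, queueing theory, and optimization strategies, but does not touch on large language models (LLMs), the technical principles of deep learning, or applications of AI in science. All scoring keywords relate to large models, deep learning techniques, or AI-for-science applications, whereas this paper studies asynchronous algorithms and network optimization in federated learning, a subfield of distributed machine learning with no direct connection to the scoring keywords.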

!!! tip deepseek-chat TL;DR

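The paper studies the trade-offs among gradient staleness, convergence speed, and energy consumption in asynchronous federated learning, proposes optimization strategies based on a stochastic queueing-network framework, and reports 29%–46% shorter convergence time and 36%–49% lower energy consumption on EMNIST compared to AsyncSGD.

Abstract (Chinese translation)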

同步联邦学习因掉队者效应而难以扩展规模。异步算法通过即时处理到达的更新提高了更新吞吐量，但它们引入了两个根本性挑战：梯度陈旧性（会降低收敛速度）以及在异构数据分布下对更快客户端的偏向性。尽管诸如异步随机梯度下降（AsyncSGD）和广义异步随机梯度下降（Generalized AsyncSGD）等算法通过客户端任务队列缓解了这种偏向，但现有分析大多忽略了底层的排队动态，且缺乏对更新吞吐量和梯度陈旧性的闭式表征。为填补这一空白，我们为广义异步随机梯度下降建立了一个随机排队网络框架，该框架联合建模了客户端与中央服务器的随机计算时间，以及随机的上行与下行通信延迟。利用乘积形式网络理论，我们推导出了更新吞吐量的闭式表达式，以及达到 $ε$ 平稳点所需的通信轮数复杂度和预期挂钟时间的闭式上界。这些结果形式化地刻画了梯度陈旧性与挂钟收敛速度之间的权衡。我们进一步扩展该框架，以量化随机时序下的能耗，揭示了收敛速度与能源效率之间的额外权衡。基于这些分析结果，我们提出了基于梯度的优化策略，以联合优化路由与并发性。在EMNIST数据集上的实验表明，与异步随机梯度下降相比，收敛时间减少了29%–46%，能耗降低了36%–49%。

Abstract

Synchronous federated learning scales poorly due to the straggler effect. Asynchronous algorithms increase the update throughput by processing updates upon arrival, but they introduce two fundamental challenges: gradient staleness, which degrades convergence, and bias toward faster clients under heterogeneous data distributions. Although algorithms such as AsyncSGD and Generalized AsyncSGD mitigate this bias via client-side task queues, most existing analyses neglect the underlying queueing dynamics and lack closed-form characterizations of the update throughput and gradient staleness. To close this gap, we develop a stochastic queueing-network framework for Generalized AsyncSGD that jointly models random computation times at the clients and the central server, as well as random uplink and downlink communication delays. Leveraging product-form network theory, we derive a closed-form expression for the update throughput, alongside closed-form upper bounds for both the communication round complexity and the expected wall-clock time required to reach an $ε$-stationary point. These results formally characterize the trade-off between gradient staleness and wall-clock convergence speed. We further extend the framework to quantify energy consumption under stochastic timing, revealing an additional trade-off between convergence speed and energy efficiency. Building on these analytical results, we propose gradient-based optimization strategies to jointly optimize routing and concurrency. Experiments on EMNIST demonstrate reductions of 29%–46% in convergence time and 36%–49% in energy consumption compared to AsyncSGD.
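The effect of gradient staleness on convergence mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's queueing-network model: it assumes a fixed, deterministic delay and a one-dimensional quadratic objective, and the names `run_delayed_gd` and `tail_error` are hypothetical. It only shows that applying gradients computed at an older iterate leaves the iterate further from the optimum after the same number of updates.

```python
def run_delayed_gd(staleness, step_size=0.3, steps=40, x0=1.0):
    """Minimise f(x) = 0.5 * x**2 with gradients evaluated at a stale iterate."""
    # Pad the history so that a stale iterate exists from the first step on.
    history = [x0] * (staleness + 1)
    for _ in range(steps):
        stale_x = history[-(staleness + 1)]  # iterate from `staleness` updates ago
        gradient = stale_x                   # f'(x) = x for f(x) = 0.5 * x**2
        history.append(history[-1] - step_size * gradient)
    return history


def tail_error(history, window=4):
    """Largest distance to the optimum x* = 0 over the last few iterates."""
    return max(abs(x) for x in history[-window:])


fresh = tail_error(run_delayed_gd(staleness=0))
stale = tail_error(run_delayed_gd(staleness=3))
print(f"staleness 0: {fresh:.2e}")
print(f"staleness 3: {stale:.2e}")
```

The error is taken over a short window because the delayed recursion oscillates around the optimum, so a single iterate could be close to zero by coincidence.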

Keywords: Asynchronous Federated Learning, Stochastic Queueing Networks, Gradient Staleness, Convergence Speed, Energy Consumption, Optimization Trade-offs, AsyncSGD, Generalized AsyncSGD

213. ❌ Knowledge Distillation for Efficient Transformer-Based Reinforcement Learning in Hardware-Constrained Energy Management Systems

Authors: Pascal Henrich, Jonas Sievers, Maximilian Beichter, Thomas Blank, Ralf Mikut, Veit Hagenmeyer Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26249v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

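Scoring rationale: The paper studies knowledge distillation for compressing Transformer-based reinforcement learning models, which falls under technical innovation for large models. Core relevant keywords: 1) 'Quantization OR Model Compression OR Low-bit Weights' (10 points): knowledge distillation is a core model compression technique, and the paper directly reports up to 96% fewer parameters and up to 90% less inference memory; 2) 'Small Language Models OR SLMs OR On-device AI' (5 points): the goal is to compress a large model into a small one suitable for embedded deployment; 3) 'Speculative Decoding OR Inference Acceleration' (5 points): the paper measures up to 63% lower inference time, which counts as inference acceleration; 4) 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points): the application to energy management systems is treated as an engineering application of AI for Science. Other keywords such as LLMs, MoE, and Scaling Laws have no direct connection to the paper's specific application of Transformer-based reinforcement learning.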
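The relation between the relevance values cited in the rationale and the 0.0 total can be sketched as follows. The report does not state the exact per-keyword formula, so the product `weight × relevance` used here is an assumption that merely agrees with the table (every weight is listed as 0.0, so every score is 0.0). The pass-mark formula `5.0 + 0.8 × total weight` and the total weight of 27.0 come from the report's configuration section; the function names are hypothetical.

```python
def pass_mark(total_weight):
    """Pass mark as given in the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows):
    """Assumed scoring rule: sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for weight, relevance in rows)


# (weight, relevance) pairs for the four keywords with non-zero relevance above.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 5.0)]
score = paper_score(rows)
threshold = round(pass_mark(27.0), 1)
print(score, threshold, score >= threshold)
```

Under this assumption a paper with non-zero relevance still scores 0.0 whenever the weights in its table are 0.0, which matches the rows shown for this entry.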

!!! tip deepseek-chat TL;DR

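The paper studies how knowledge distillation can compress compute-intensive Decision Transformer policies into compact models suited to resource-constrained residential controllers, cutting the parameter count by up to 96% and inference time by up to 63% while largely preserving control performance.

Abstract (Chinese translation)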

基于Transformer的强化学习已成为住宅能源管理中序列控制的有力候选方案。其中，决策变换器（Decision Transformer）能够从历史数据中学习有效的电池调度策略，从而提高光伏自消纳率并降低用电成本。然而，Transformer模型通常计算需求过高，难以部署在资源受限的住宅控制器上，因为内存和延迟限制至关重要。本文研究知识蒸馏（knowledge distillation）方法，以将高容量决策变换器策略的决策行为迁移至更适合嵌入式部署的紧凑模型中。利用Ausgrid数据集，我们在异构多建筑数据上基于离线序列决策变换器框架训练教师模型。随后，通过匹配教师模型的动作来蒸馏更小的学生模型，从而在减小模型规模的同时保持控制质量。在广泛的师生模型配置实验中，蒸馏方法基本保持了控制性能，甚至实现了最高达1%的小幅提升，同时将参数量减少高达96%，推理内存降低高达90%，推理时间缩短高达63%。除了这些压缩效果外，当蒸馏至具有相同架构容量的学生模型时，也观察到了可比拟的成本改善。总体而言，我们的结果表明，知识蒸馏使决策变换器控制在资源有限的硬件上更适用于住宅能源管理。

Abstract

Transformer-based reinforcement learning has emerged as a strong candidate for sequential control in residential energy management. In particular, the Decision Transformer can learn effective battery dispatch policies from historical data, thereby increasing photovoltaic self-consumption and reducing electricity costs. However, transformer models are typically too computationally demanding for deployment on resource-constrained residential controllers, where memory and latency constraints are critical. This paper investigates knowledge distillation to transfer the decision-making behaviour of high-capacity Decision Transformer policies to compact models that are more suitable for embedded deployment. Using the Ausgrid dataset, we train teacher models in an offline sequence-based Decision Transformer framework on heterogeneous multi-building data. We then distil smaller student models by matching the teachers’ actions, thereby preserving control quality while reducing model size. Across a broad set of teacher-student configurations, distillation largely preserves control performance and even yields small improvements of up to 1%, while reducing the parameter count by up to 96%, the inference memory by up to 90%, and the inference time by up to 63%. Beyond these compression effects, comparable cost improvements are also observed when distilling into a student model of identical architectural capacity. Overall, our results show that knowledge distillation makes Decision Transformer control more applicable for residential energy management on resource-limited hardware.
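Distilling a student "by matching the teachers' actions" can be sketched as plain regression onto teacher outputs. The example below is a toy under stated assumptions, not the paper's setup: the teacher is a hypothetical closed-form policy rather than a Decision Transformer, the state is a made-up (PV generation, load) pair, the student is a three-parameter linear model, and the loss is mean squared error.

```python
import math


def teacher_action(pv, load):
    """Hypothetical teacher policy mapping a state to a battery action in (-1, 1)."""
    return math.tanh(1.5 * pv - 1.0 * load)


# Offline dataset: a grid of states labelled with the teacher's actions.
states = [(i / 4, j / 4) for i in range(5) for j in range(5)]
targets = [teacher_action(pv, load) for pv, load in states]


def student_action(params, pv, load):
    w_pv, w_load, bias = params
    return w_pv * pv + w_load * load + bias


def imitation_loss(params):
    """Mean squared difference between student and teacher actions."""
    errors = [student_action(params, pv, load) - t
              for (pv, load), t in zip(states, targets)]
    return sum(e * e for e in errors) / len(errors)


def distill(steps=500, lr=0.1):
    """Fit the student to the teacher's actions with full-batch gradient descent."""
    params = [0.0, 0.0, 0.0]
    n = len(states)
    for _ in range(steps):
        grads = [0.0, 0.0, 0.0]
        for (pv, load), t in zip(states, targets):
            err = student_action(params, pv, load) - t
            grads[0] += 2 * err * pv / n
            grads[1] += 2 * err * load / n
            grads[2] += 2 * err / n
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


student = distill()
print("loss before:", imitation_loss([0.0, 0.0, 0.0]))
print("loss after: ", imitation_loss(student))
```

The student never sees rewards or costs, only the teacher's actions, which is what distinguishes this form of distillation from training the small model directly.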

Keywords: Knowledge Distillation, Transformer-based Reinforcement Learning, Decision Transformer, Model Compression, Energy Management Systems, Embedded Deployment, Inference Acceleration, Parameter Reduction

214. ❌ Privacy-Accuracy Trade-offs in High-Dimensional LASSO under Perturbation Mechanisms

Authors: Ayaka Sakata, Haruka Tanzawa Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26227v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

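Scoring rationale: The paper studies the privacy-accuracy trade-off in high-dimensional LASSO, using differential privacy mechanisms and approximate message passing to analyze sparse linear regression. Although it involves sparse models and machine learning, all keywords target large language models (LLMs) and their specific techniques (fine-tuning, alignment, reasoning, agents, and so on), while this paper does not involve LLMs, deep learning, or applications of AI in science; the relevance of every keyword is therefore 0.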

!!! tip deepseek-chat TL;DR

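The paper uses approximate message passing to analyze the privacy-accuracy trade-off of differential privacy mechanisms (output perturbation and objective perturbation) in high-dimensional sparse LASSO regression, finding that sparsity plays a key role in stabilizing the estimator against single-point data changes and that the two mechanisms behave qualitatively differently.

Abstract (Chinese translation)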

我们研究高维稀疏线性回归中的隐私保护问题，聚焦于LASSO估计器。我们分析了两种广泛应用的差分隐私机制：输出扰动（将噪声注入估计器）与目标扰动（在损失函数中加入随机线性项）。利用近似消息传递（Approximate Message Passing, AMP）方法，我们刻画了在随机设计及隐私噪声下这些估计器的典型行为。为量化隐私性，我们采用典型情况度量指标，包括平均KL散度——该指标可从相邻数据集可区分性的角度进行假设检验意义上的解释。我们的分析表明，稀疏性在塑造隐私-准确性权衡中起着核心作用：更强的正则化可通过稳定估计器对单点数据变化的响应来提升隐私性。我们进一步揭示，两种机制表现出本质不同的行为特性。特别地，对于目标扰动机制，增加噪声水平可能产生非单调效应，过量的噪声反而可能使估计器失稳，导致对数据扰动的敏感性增强。我们的研究结果证明，AMP为分析高维稀疏模型中隐私-准确性权衡提供了强有力的理论框架。

Abstract

We study privacy-preserving sparse linear regression in the high-dimensional regime, focusing on the LASSO estimator. We analyze two widely used mechanisms for differential privacy: output perturbation, which injects noise into the estimator, and objective perturbation, which adds a random linear term to the loss function. Using approximate message passing (AMP), we characterize the typical behavior of these estimators under random design and privacy noise. To quantify privacy, we adopt typical-case measures, including the on-average KL divergence, which admits a hypothesis-testing interpretation in terms of distinguishability between neighboring datasets. Our analysis reveals that sparsity plays a central role in shaping the privacy-accuracy trade-off: stronger regularization can improve privacy by stabilizing the estimator against single-point data changes. We further show that the two mechanisms exhibit qualitatively different behaviors. In particular, for objective perturbation, increasing the noise level can have non-monotonic effects, and excessive noise may destabilize the estimator, leading to increased sensitivity to data perturbations. Our results demonstrate that AMP provides a powerful framework for analyzing privacy-accuracy trade-offs in high-dimensional sparse models.
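The two mechanisms can be made concrete in the simplest possible case. The sketch below assumes a scalar LASSO problem, minimising `0.5 * (x - y)**2 + lam * abs(x)`, whose solution is the soft-thresholding function; it is not the paper's high-dimensional AMP analysis, and the noise is passed in as a fixed number rather than sampled. In this scalar case the random linear term `b * x` of objective perturbation can be absorbed by completing the square, which shifts the observation from `y` to `y - b`.

```python
import math


def soft_threshold(z, lam):
    """Closed-form minimiser of 0.5 * (x - z)**2 + lam * abs(x)."""
    return math.copysign(max(abs(z) - lam, 0.0), z)


def output_perturbation(y, lam, noise):
    """Solve the LASSO first, then add noise to the estimate."""
    return soft_threshold(y, lam) + noise


def objective_perturbation(y, lam, b):
    """Minimise 0.5 * (x - y)**2 + lam * abs(x) + b * x.

    Completing the square turns the linear term into a shift of the data,
    so the minimiser is soft_threshold(y - b, lam).
    """
    return soft_threshold(y - b, lam)


lam, noise = 0.5, 0.1
for y in (0.3, 0.4):  # two "neighbouring" observations
    print(y, output_perturbation(y, lam, noise), objective_perturbation(y, lam, noise))
```

Both observations fall below the threshold `lam`, so the unperturbed estimate is zero for each and the change in the data is invisible in the output. This is a small instance of the abstract's point that stronger regularization stabilizes the estimator against single-point changes. It also shows a qualitative difference between the mechanisms: objective perturbation keeps the estimate exactly sparse here, while output perturbation does not.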

Keywords: privacy-preserving sparse linear regression, high-dimensional LASSO, differential privacy, output perturbation, objective perturbation, approximate message passing (AMP), privacy-accuracy trade-off, sparsity

215. ❌ On associative neural networks for sparse patterns with huge capacities

Authors: Matthias Löwe, Franck Vermet Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26217v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

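Scoring rationale: The paper is a theoretical analysis of the storage capacity of classical neural network models (the Hopfield, Willshaw, and Amari models). It belongs to traditional neural network theory and is unrelated to the modern topics covered by the keywords, such as large models, innovations in deep learning techniques, and AI applications. It does not mention large language models, training methods, inference techniques, application domains, or any other keyword-related content.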

!!! tip deepseek-chat TL;DR

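The paper studies higher-order associative neural network models for sparse patterns and shows that combining higher-order interactions with sparsity yields polynomial, and even super-polynomial, storage capacities.

Abstract (Chinese translation)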

具有高阶或指数相互作用项的广义霍普菲尔德模型已知其存储容量显著大于经典二次模型。另一方面，针对稀疏模式的联想记忆模型（如Willshaw模型和Amari模型）在稀疏机制下已超越经典霍普菲尔德模型的性能。本文结合了这两种机制，提出了稀疏联想记忆模型的高阶版本，并研究了它们的存储容量。对于固定的相互作用阶数$n$，我们获得了系统规模中多项式阶的存储容量。当相互作用阶数随神经元数量呈对数增长时，模型可产生超多项式容量。我们还讨论了Gripon–Berrou架构（该架构最初为非稀疏信息设计，参见\cite{griponc}）中的类比模型。研究结果表明，尽管具体的存储规模取决于底层架构，但由高阶相互作用引起的容量提升在稀疏设置中依然存在。

Abstract

Generalized Hopfield models with higher-order or exponential interaction terms are known to have substantially larger storage capacities than the classical quadratic model. On the other hand, associative memories for sparse patterns, such as the Willshaw and Amari models, already outperform the classical Hopfield model in the sparse regime. In this paper we combine these two mechanisms. We introduce higher-order versions of sparse associative memory models and study their storage capacities. For fixed interaction order $n$, we obtain storage capacities of polynomial order in the system size. When the interaction order is allowed to grow logarithmically with the number of neurons, this yields super-polynomial capacities. We also discuss an analogue in the Gripon–Berrou architecture which was formulated for non-sparse messages (see \cite{griponc}). Our results show that the capacity increase caused by higher-order interactions persists in the sparse setting, although the precise storage scale depends on the underlying architecture.
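For readers unfamiliar with sparse associative memories, the classical Willshaw model that the abstract uses as a baseline can be sketched in a few lines. This is the standard pairwise (second-order) version with binary synapses, not the higher-order variants the paper introduces; the patterns, network size, and function names are illustrative assumptions.

```python
from itertools import product


def train_willshaw(patterns):
    """Binary Hebbian rule: synapse (i, j) is on if i and j co-occur in any pattern."""
    synapses = set()
    for pattern in patterns:
        synapses.update(product(pattern, repeat=2))
    return synapses


def recall(synapses, cue, n_neurons):
    """A neuron fires if it is connected to every active neuron in the cue."""
    threshold = len(cue)
    return {
        i for i in range(n_neurons)
        if sum((i, j) in synapses for j in cue) >= threshold
    }


# Three sparse patterns over six neurons, each given as its set of active neurons.
patterns = [{0, 1, 2}, {2, 3, 4}, {4, 5, 0}]
synapses = train_willshaw(patterns)
print(recall(synapses, cue={0, 1}, n_neurons=6))
```

The partial cue `{0, 1}` recovers the stored pattern `{0, 1, 2}`: neurons 4 and 5 each share a synapse with neuron 0 but not with neuron 1, so they stay below the threshold.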

Keywords: associative neural networks, sparse patterns, storage capacities, higher-order interactions, Hopfield models, Willshaw models, Amari models, Gripon-Berrou architecture

216. ❌ Geometric Evolution Graph Convolutional Networks: Enhancing Graph Representation Learning via Ricci Flow

Authors: Jicheng Ma, Yunyan Yang, Juan Zhao, Liang Zhao Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26178v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

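Scoring rationale: The paper 'Geometric Evolution Graph Convolutional Networks: Enhancing Graph Representation Learning via Ricci Flow' focuses on graph neural networks (GNNs) and graph representation learning, specifically on enhancing graph convolutional networks by modeling geometric evolution through discrete Ricci flow. Its core contributions concern graph structure learning, dynamic representations, and classification on heterophilic graphs, placing it in graph machine learning. All keywords revolve around large models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, inference acceleration, and so on), alignment and fine-tuning, agents and tool use, and interpretability, none of which relate directly to the paper's GNN research. The only marginally related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because GNNs have potential applications in scientific fields such as bioinformatics; the paper does not explicitly address these fields, so it receives 5 points (some relevance). All other keywords are unrelated and receive 0 points.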

!!! tip deepseek-chat TL;DR

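The paper proposes GEGCN, a graph convolutional network that models geometric evolution via discrete Ricci flow and learns dynamic graph representations with a Long Short-Term Memory network, achieving state-of-the-art classification performance on several benchmark datasets, particularly on heterophilic graphs.

Abstract (Chinese translation)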

我们提出几何演化图卷积网络（Geometric Evolution Graph Convolutional Network, GEGCN），这是一种通过建模图上的几何演化来增强图表示学习的新颖框架。具体而言，GEGCN采用长短期记忆网络对离散里奇流生成的结构序列进行建模，并将学习到的动态表示融入图卷积网络中。大量实验表明，GEGCN在多种基准数据集的分类任务上取得了最先进的性能，其在异质图上的表现尤为突出。

Abstract

We introduce the Geometric Evolution Graph Convolutional Network (GEGCN), a novel framework that enhances graph representation learning by modeling geometric evolution on graphs. Specifically, GEGCN employs a Long Short-Term Memory to model the structural sequence generated by discrete Ricci flow, and the learned dynamic representations are infused into a Graph Convolutional Network. Extensive experiments demonstrate that GEGCN achieves state-of-the-art performance on classification tasks across various benchmark datasets, with its performance being particularly outstanding on heterophilic graphs.

Keywords: Geometric Evolution Graph Convolutional Network, GEGCN, graph representation learning, Ricci flow, heterophilic graphs, graph convolutional network, dynamic representations, classification tasks

217. ❌ Can AI Scientist Agents Learn from Lab-in-the-Loop Feedback? Evidence from Iterative Perturbation Discovery

Authors: Gilles Wainrib, Barbara Bodinier, Haitem Dakhli, Josep Monserrat, Almudena Espin Perez, Sabrina Carpentier, Roberta Codato, John Klein Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26177v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

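Scoring rationale: The paper's core topic is the in-context learning ability of LLMs in scientific experimental design, directly involving the keywords LLMs, LLM Agents, In-context Learning, and AI for Science. LLMs, LLM Agents, and In-context Learning are the core research content; AI for Science is the application domain; and Hallucination Mitigation comes up in the discussion of model capability. Other keywords such as MoE, SFT, and RAG are not mentioned or studied in the paper.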

!!! tip deepseek-chat TL;DR

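The paper studies whether LLM agents can learn effectively in context from experimental feedback. In Cell Painting screening, access to feedback raises discoveries per feature by 53.4% on average, and significant feedback-driven learning appears only once model capability reaches a sufficient threshold.

Abstract (Chinese translation)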

近期研究对大语言模型（LLM）能否在科学实验设计中实现真正的上下文学习（ICL）提出了质疑，先前研究表明基于LLM的智能体对实验反馈缺乏敏感性。我们通过在高内涵筛选（Cell Painting high-content screening）中开展800项独立重复的迭代扰动发现实验，对此问题提出了新见解。我们比较了两种策略：一种是通过实验反馈迭代更新假设的LLM智能体，另一种是仅依赖预训练知识检索的零样本基线方法。获取反馈使每个特征的平均发现量提升了$+53.4%$（$p = 0.003$）。为验证该提升是否源于真正的反馈驱动学习而非提示诱导的预训练知识回忆，我们引入了随机反馈对照实验，其中命中/未命中标签被随机置换。在此对照条件下，性能增益消失，表明观察到的改进依赖于反馈信号的结构性（$+13.0$次命中，$p = 0.003$）。我们进一步探究了模型能力如何影响反馈利用效率。当模型从Claude Sonnet 4.5升级至4.6版本时，基因幻觉率从${\sim}33%$–$45%$降至${\sim}3$–$9%$，使得最优ICL策略的效果从无显著性的ICL增益（$+0.8$，$p = 0.32$）转变为大幅且高度显著的提升（$+11.0$，$p=0.003$）。这些结果表明，只有当模型达到足够的能力阈值时，才能从实验反馈中实现有效的上下文学习。

Abstract

Recent work has questioned whether large language models (LLMs) can perform genuine in-context learning (ICL) for scientific experimental design, with prior studies suggesting that LLM-based agents exhibit no sensitivity to experimental feedback. We shed new light on this question by carrying out 800 independently replicated experiments on iterative perturbation discovery in Cell Painting high-content screening. We compare an LLM agent that iteratively updates its hypotheses using experimental feedback to a zero-shot baseline that relies solely on pretraining knowledge retrieval. Access to feedback yields a $+53.4%$ increase in discoveries per feature on average ($p = 0.003$). To test whether this improvement arises from genuine feedback-driven learning rather than prompt-induced recall of pretraining knowledge, we introduce a random feedback control in which hit/miss labels are permuted. Under this control, the performance gain disappears, indicating that the observed improvement depends on the structure of the feedback signal ($+13.0$ hits, $p = 0.003$). We further examine how model capability affects feedback utilization. Upgrading from Claude Sonnet 4.5 to 4.6 reduces gene hallucination rates from ${\sim}33%$–$45%$ to ${\sim}3$–$9%$, converting a non-significant ICL effect ($+0.8$, $p = 0.32$) into a large and highly significant improvement ($+11.0$, $p=0.003$) for the best ICL strategy. These results suggest that effective in-context learning from experimental feedback emerges only once models reach a sufficient capability threshold.
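The random feedback control described in the abstract, in which hit/miss labels are permuted, can be sketched as a label shuffle. This is only an illustration of the idea: the abstract does not say how the permutation is implemented, and the function name, seed, and perturbation names below are hypothetical placeholders.

```python
import random


def permute_feedback(feedback, seed=0):
    """Shuffle hit/miss labels across perturbations, keeping the hit count fixed."""
    names = list(feedback)
    labels = [feedback[name] for name in names]
    random.Random(seed).shuffle(labels)
    return dict(zip(names, labels))


# Hypothetical round of feedback; the perturbation names are placeholders.
real_feedback = {
    "perturbation_A": "hit",
    "perturbation_B": "miss",
    "perturbation_C": "miss",
    "perturbation_D": "hit",
    "perturbation_E": "miss",
}
control_feedback = permute_feedback(real_feedback)
print(control_feedback)
```

Because the control keeps the same perturbations and the same number of hits, an agent given `control_feedback` sees a prompt of the same shape but without the true association between perturbation and outcome. A gain that survives this control would point to prompt effects rather than learning from the feedback signal.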

Keywords: Large Language Models, LLM Agents, In-context Learning, Experimental Feedback, Scientific Discovery, Cell Painting, Hallucination Mitigation, AI for Science

218. ❌ PEANUT: Perturbations by Eigenvalue Alignment for Attacking GNNs Under Topology-Driven Message Passing

Authors: Bhavya Kohli, Biplab Sikdar Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26136v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

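Scoring rationale: The paper studies an adversarial attack method (PEANUT) against graph neural networks (GNNs), focusing on graph structure perturbations and GNN robustness. All scoring keywords relate directly to large language models (LLMs), innovations in deep learning techniques, or applications of AI in science, whereas this paper's topic is GNNs, which belongs to graph machine learning and has no direct connection to keywords such as LLMs, MoE, Scaling Laws, alignment, reasoning, agents, or quantization. Although GNNs are a subfield of deep learning, the paper does not involve any of the large-model techniques or scientific AI applications named in the keywords, so the relevance of every keyword is 0.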

!!! tip deepseek-chat TL;DR

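The paper proposes PEANUT, a simple, gradient-free, restricted black-box attack that injects virtual nodes to exploit a core vulnerability of graph neural networks (GNNs), namely their topology-driven message passing, and significantly degrades GNN performance across multiple graph tasks.

Abstract (Chinese translation)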

图神经网络（Graph Neural Networks, GNNs）在处理涉及关系数据的任务中已取得显著性能。然而，对图结构的微小扰动可能显著改变GNN的输出，这引发了对其在实际部署中鲁棒性的担忧。本研究探讨了GNN的核心脆弱性——其显式地以邻接矩阵或拉普拉斯矩阵（Laplacian）的形式利用图拓扑进行消息传递，并据此提出PEANUT，一种简单、无梯度、受限的黑盒攻击方法，通过注入虚拟节点来利用此脆弱性。PEANUT是一种基于节点注入的攻击，被广泛认为比直接修改原始图结构的图修改攻击更具实践性和现实性。我们的方法作用于推理阶段，属于逃避攻击，且几乎可立即实施，因为它不涉及耗时的迭代优化、参数学习（这些会增加计算和时间开销）或训练代理模型（代理模型易因模型先验和泛化能力的差异而失效）。PEANUT还无需为注入节点提供任何特征，实验表明即使注入特征全为零的节点，GNN性能也会显著下降，这凸显了在此类攻击中有效设计连接性的重要性。在三个图任务的真实数据集上进行的大量实验证明，尽管方法简单，我们的攻击仍具有显著效果。

Abstract

Graph Neural Networks (GNNs) have achieved remarkable performance on tasks involving relational data. However, small perturbations to the graph structure can significantly alter GNN outputs, raising concerns about their robustness in real-world deployments. In this work, we explore the core vulnerability of GNNs, which explicitly consume graph topology in the form of the adjacency matrix or Laplacian as a means for message passing, and propose PEANUT, a simple, gradient-free, restricted black-box attack that injects virtual nodes to capitalize on this vulnerability. PEANUT is an injection-based attack, a setting widely considered more practical and realistic than graph modification attacks, where the attacker is able to modify the original graph structure directly. Our method works at the inference phase, making it an evasion attack, and is applicable almost immediately, since it does not involve lengthy iterative optimizations or parameter learning, which add computational and time overhead, or training surrogate models, which are susceptible to failure due to differences in model priors and generalization capabilities. PEANUT also does not require any features on the injected node and consequently demonstrates that GNN performance can be significantly deteriorated even with injected nodes with zeros for features, highlighting the significance of effectively designed connectivity in such attacks. Extensive experiments on real-world datasets across three graph tasks demonstrate the effectiveness of our attack despite its simplicity.
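Why an injected node with all-zero features can still change a GNN's output follows from how message passing uses topology. The sketch below assumes a single round of mean aggregation over a node and its neighbours; it does not implement PEANUT's eigenvalue-alignment rule for choosing connections, and the graph, features, and function names are illustrative assumptions.

```python
def mean_aggregate(adjacency, features, node):
    """One round of mean aggregation over a node and its neighbours."""
    neighbourhood = [node] + sorted(adjacency[node])
    return sum(features[v] for v in neighbourhood) / len(neighbourhood)


def inject_node(adjacency, features, new_node, targets, feature=0.0):
    """Return copies of the graph with `new_node` linked to every node in `targets`."""
    adjacency = {v: set(nbrs) for v, nbrs in adjacency.items()}
    features = dict(features)
    adjacency[new_node] = set(targets)
    for t in targets:
        adjacency[t].add(new_node)
    features[new_node] = feature
    return adjacency, features


# A three-node star graph with scalar features; node 0 is the target.
adjacency = {0: {1, 2}, 1: {0}, 2: {0}}
features = {0: 1.0, 1: 1.0, 2: 1.0}
before = mean_aggregate(adjacency, features, node=0)
attacked_adj, attacked_feat = inject_node(adjacency, features, new_node=3, targets=[0])
after = mean_aggregate(attacked_adj, attacked_feat, node=0)
print(before, after)
```

The original graph and features are left untouched; only the connectivity of the injected node changes the target's aggregated representation, here from 1.0 to 0.75.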

Keywords: Graph Neural Networks, GNNs, adversarial attack, graph structure perturbations, virtual node injection, topology-driven message passing, robustness, evasion attack

219. ❌ TinyML for Acoustic Anomaly Detection in IoT Sensor Networks

Authors: Amar Almaini, Jakob Folz, Ghadeer Ashour Journal/Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26135v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

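Scoring rationale: The paper focuses on applying TinyML to acoustic anomaly detection in IoT, which belongs to edge AI and lightweight model deployment. It is unrelated to most large-model keywords (LLMs, MoE, RLHF, and so on). It is weakly related to only two keywords: 1) 'Small Language Models OR SLMs OR On-device AI' (5 points): the paper deploys a lightweight model on edge devices, which counts as on-device AI, but does not involve language models; 2) 'Quantization OR Model Compression OR Low-bit Weights' (5 points): the paper mentions a 'lightweight neural network' that is 'optimized for deployment on edge devices', which implies model compression or lightweight design, but it does not explicitly mention quantization. No other keyword is related.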

!!! tip deepseek-chat TL;DR

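The paper presents a compact TinyML pipeline for detecting anomalies in environmental sound within IoT sensor networks. A lightweight neural network classifier deployed on edge devices reaches 91% test accuracy and a balanced F1-score of 0.91, demonstrating the feasibility of embedded acoustic anomaly detection.

Abstract (Chinese translation)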

微型机器学习技术能够直接在微控制器上实现实时、高能效的数据处理，这使其成为物联网传感器网络的理想选择。本文提出了一种紧凑型TinyML（微型机器学习）流程，用于在物联网传感器网络中检测环境声音异常。物联网系统中的声学监测可提升安全性与情境感知能力，但基于云计算的处理方式会带来延迟、功耗和隐私方面的挑战。我们的流程通过从声音信号中提取梅尔频率倒谱系数，并训练一个针对边缘设备部署优化的轻量级神经网络分类器来解决这些问题。该模型使用UrbanSound8K数据集进行训练和评估，在测试中达到了91%的准确率，并在正常与异常声音类别上均取得0.91的平衡F1分数。这些结果证明了嵌入式声学异常检测技术对于可扩展、高响应性物联网部署的可行性与可靠性。

摘要 (Abstract)

Tiny Machine Learning enables real-time, energy-efficient data processing directly on microcontrollers, making it ideal for Internet of Things sensor networks. This paper presents a compact TinyML pipeline for detecting anomalies in environmental sound within IoT sensor networks. Acoustic monitoring in IoT systems can enhance safety and context awareness, yet cloud-based processing introduces challenges related to latency, power usage, and privacy. Our pipeline addresses these issues by extracting Mel Frequency Cepstral Coefficients from sound signals and training a lightweight neural network classifier optimized for deployment on edge devices. The model was trained and evaluated using the UrbanSound8K dataset, achieving a test accuracy of 91% and balanced F1-scores of 0.91 across both normal and anomalous sound classes. These results demonstrate the feasibility and reliability of embedded acoustic anomaly detection for scalable and responsive IoT deployments.

Keywords: TinyML, acoustic anomaly detection, IoT sensor networks, edge computing, lightweight neural network, Mel Frequency Cepstral Coefficients, UrbanSound8K dataset, real-time processing

220. ❌ Are LLM-Enhanced Graph Neural Networks Robust against Poisoning Attacks?

Authors: Yuhang Ma, Jie Wang, Zheng Yan Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26105v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

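Scoring rationale: The paper's core topic is the robustness evaluation of LLM-enhanced GNNs (graph neural networks enhanced by large language models), which directly matches the 'Large Language Models' keyword (highly relevant, 10 points). The paper is applied AI research in a scientific setting and is somewhat related to 'AI for Science' (5 points). The other keywords (MoE, SLMs, training methods, reasoning techniques, agent systems, model compression, etc.) are neither mentioned nor touched on in the abstract, so they score 0.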

!!! tip deepseek-chat TL;DR

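The paper studies the robustness of LLM-enhanced graph neural networks against poisoning attacks that manipulate graph structure and textual attributes during training, and proposes an evaluation framework. Experiments show that these models achieve higher accuracy and stronger attack resistance than the baseline.

Abstract Translation

Large language models have advanced graph neural networks by enriching node representations with semantic features, and the resulting LLM-enhanced GNNs achieve notable performance gains. However, the robustness of these models against poisoning attacks that manipulate both graph structure and textual attributes during training has not been explored. To fill this gap, the authors propose a robustness assessment framework that systematically evaluates LLM-enhanced GNNs under poisoning attacks. The framework supports comprehensive evaluation along several dimensions: it assesses 24 victim models, formed by combining eight LLM- or language-model-based feature enhancers with three representative GNN backbones. To ensure diverse attack coverage, it includes six structural poisoning attacks (both targeted and non-targeted) and three textual poisoning attacks that operate at the character, word, and sentence levels. The evaluation uses four real-world datasets, one of which was released after the emergence of LLMs, to avoid possible ground-truth leakage during LLM pretraining and keep the comparison fair. Extensive experiments show that, across attack settings, LLM-enhanced GNNs achieve significantly higher accuracy and a lower Relative Drop in Accuracy than a shallow embedding-based baseline. An in-depth analysis identifies key factors behind this robustness, such as the effective encoding of structural and label information in node representations. Based on these findings, the authors outline future research directions from both attack and defense perspectives, and propose a new combined attack together with a graph purification defense. The framework source code is released at https://github.com/CyberAlSec/LLMEGNNRP.

Abstract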

Large Language Models (LLMs) have advanced Graph Neural Networks (GNNs) by enriching node representations with semantic features, giving rise to LLM-enhanced GNNs that achieve notable performance gains. However, the robustness of these models against poisoning attacks, which manipulate both graph structures and textual attributes during training, remains unexplored. To bridge this gap, we propose a robustness assessment framework that systematically evaluates LLM-enhanced GNNs under poisoning attacks. Our framework enables comprehensive evaluation across multiple dimensions. Specifically, we assess 24 victim models by combining eight LLM- or Language Model (LM)-based feature enhancers with three representative GNN backbones. To ensure diversity in attack coverage, we incorporate six structural poisoning attacks (both targeted and non-targeted) and three textual poisoning attacks operating at the character, word, and sentence levels. Furthermore, we employ four real-world datasets, including one released after the emergence of LLMs, to avoid potential ground truth leakage during LLM pretraining, thereby ensuring fair evaluation. Extensive experiments show that LLM-enhanced GNNs exhibit significantly higher accuracy and lower Relative Drop in Accuracy (RDA) than a shallow embedding-based baseline across various attack settings. Our in-depth analysis identifies key factors that contribute to this robustness, such as the effective encoding of structural and label information in node representations. Based on these insights, we outline future research directions from both offensive and defensive perspectives, and propose a new combined attack along with a graph purification defense. To support future research, we release the source code of our framework at https://github.com/CyberAlSec/LLMEGNNRP.
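
The abstract names Relative Drop in Accuracy (RDA) without defining it. A minimal sketch, assuming the common reading of RDA as the fraction of clean accuracy lost under attack; the function name and the sample numbers are hypothetical and are not taken from the paper:

```python
def relative_drop_in_accuracy(clean_acc: float, poisoned_acc: float) -> float:
    """Fraction of clean accuracy lost under poisoning (assumed definition)."""
    if clean_acc <= 0:
        raise ValueError("clean_acc must be positive")
    return (clean_acc - poisoned_acc) / clean_acc


# Illustrative values only, not results reported by the paper.
example_rda = relative_drop_in_accuracy(clean_acc=0.80, poisoned_acc=0.60)
print(f"RDA = {example_rda:.2%}")
```

Under this reading, a lower RDA means the model keeps more of its clean accuracy after poisoning, which matches how the abstract uses the term.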

Keywords: Large Language Models, Graph Neural Networks, Poisoning Attacks, Robustness Assessment, Node Representations, Structural Poisoning, Textual Poisoning, Accuracy Drop

221. ❌ Adversarial Bandit Optimization with Globally Bounded Perturbations to Linear Losses

Authors: Zhuoyu Cheng, Kohei Hatano, Eiji Takimoto Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26066v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

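Scoring rationale: The paper studies adversarial bandit optimization, specifically linear losses with globally bounded perturbations. It belongs entirely to online learning and optimization theory within classical machine learning, covering non-convex and non-smooth loss functions, linear losses, perturbation analysis, and regret-bound proofs. All the given keywords focus on large language models (LLMs), the principles of deep learning techniques, and their applications across domains, including model architecture, training methods, inference optimization, alignment, and agent systems. The paper has no connection to any of them: it involves neither large-model techniques nor AI applications in science, so every keyword receives a relevance of 0.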

!!! tip deepseek-chat TL;DR

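The paper studies adversarial bandit optimization with globally bounded perturbations. For losses that may be non-convex and non-smooth (a linear component plus a perturbation), it establishes expected and high-probability regret guarantees and provides a lower-bound analysis.

Abstract Translation

The authors study a class of adversarial bandit optimization problems in which the loss functions may be non-convex and non-smooth. In each round, the learner observes a loss made up of an underlying linear component and an additional perturbation that is applied after the learner selects an action. The perturbations are measured relative to the linear losses and are constrained by a global budget that bounds their cumulative magnitude over time. Under this model, the authors establish both expected and high-probability regret guarantees. As a special case of the analysis, they recover an improved high-probability regret bound for classical bandit linear optimization, that is, the setting without perturbations. They also complement the upper bounds with a lower bound on the expected regret.

Abstract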

We study a class of adversarial bandit optimization problems in which the loss functions may be non-convex and non-smooth. In each round, the learner observes a loss that consists of an underlying linear component together with an additional perturbation applied after the learner selects an action. The perturbations are measured relative to the linear losses and are constrained by a global budget that bounds their cumulative magnitude over time. Under this model, we establish both expected and high-probability regret guarantees. As a special case of our analysis, we recover an improved high-probability regret bound for classical bandit linear optimization, which corresponds to the setting without perturbations. We further complement our upper bounds by proving a lower bound on the expected regret.

Keywords: adversarial bandit optimization, linear losses, globally bounded perturbations, non-convex non-smooth functions, regret guarantees, high-probability regret, lower bound

222. ❌ Asymptotic Optimism for Tensor Regression Models with Applications to Neural Network Compression

Authors: Haoming Shi, Eric C. Chi, Hengrui Luo Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26048v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

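Scoring rationale: The paper mainly studies rank-selection theory for low-rank tensor regression and its application to neural network compression. Its core is a statistical analysis of tensor regression models (optimism minimization for CP and Tucker decompositions), which relates to the principles of deep learning only indirectly. The only relevant keyword is 'Quantization OR Model Compression OR Low-bit Weights', because the paper closes with a neural network compression application; that application is an example rather than the main contribution, so it receives 5 points (some relevance). All other keywords concern specific techniques such as large language models, alignment, reasoning, and agents, which are unrelated to the paper's statistical theory and tensor regression topic, and receive 0.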

!!! tip deepseek-chat TL;DR

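The paper studies rank selection for low-rank tensor regression. It shows that, under a random-covariate design, the expected training-testing discrepancy (optimism) of CP and Tucker decompositions is minimized at the true tensor rank, and it demonstrates the method's potential for neural network compression.

Abstract Translation

The authors study rank selection for low-rank tensor regression under a random-covariate design. Under a Gaussian random-design model and some mild conditions, they derive population expressions for the expected training-testing discrepancy (optimism) for both CP and Tucker decompositions. They further show that, for both CP and Tucker regression, the optimism is minimized at the true tensor rank. This yields a prediction-oriented rank-selection rule that agrees with cross-validation and extends naturally to tensor-model averaging. They also discuss conditions under which under- or over-ranked models may appear preferable, which clarifies the scope of the method. Finally, they demonstrate its practical utility on a real-world image regression task and extend it to tensor-based neural network compression, highlighting its potential for model selection in deep learning.

Abstract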

We study rank selection for low-rank tensor regression under random covariates design. Under a Gaussian random-design model and some mild conditions, we derive population expressions for the expected training-testing discrepancy (optimism) for both CP and Tucker decomposition. We further demonstrate that the optimism is minimized at the true tensor rank for both CP and Tucker regression. This yields a prediction-oriented rank-selection rule that aligns with cross-validation and extends naturally to tensor-model averaging. We also discuss conditions under which under- or over-ranked models may appear preferable, thereby clarifying the scope of the method. Finally, we showcase its practical utility on a real-world image regression task and extend its application to tensor-based compression of neural network, highlighting its potential for model selection in deep learning.
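
The rank-selection rule described above can be sketched as follows. This is only an illustration: the paper derives population expressions for the optimism, whereas this sketch assumes the optimism is estimated empirically as held-out error minus training error for each candidate rank. The function names and the error values are hypothetical.

```python
from typing import Dict


def optimism(train_error: float, test_error: float) -> float:
    """Training-testing discrepancy: how much the training error understates test error."""
    return test_error - train_error


def select_rank(train_errors: Dict[int, float], test_errors: Dict[int, float]) -> int:
    """Return the candidate rank with the smallest estimated optimism."""
    gaps = {rank: optimism(train_errors[rank], test_errors[rank]) for rank in train_errors}
    return min(gaps, key=gaps.get)


# Made-up errors for three candidate ranks (not results from the paper).
train = {1: 2.0, 2: 1.0, 3: 0.8}
test = {1: 2.5, 2: 1.2, 3: 1.3}
print(select_rank(train, test))
```

With these made-up numbers the gap is smallest at rank 2, so that rank is selected. In practice the errors would come from fitting a CP or Tucker regression at each rank.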

Keywords: tensor regression, rank selection, low-rank decomposition, optimism, CP decomposition, Tucker decomposition, neural network compression, model selection

223. ❌ Constitutive parameterized deep energy method for solid mechanics problems with random material parameters

Authors: Zhangyong Liang, Huanhuan Gao Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26030v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

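Scoring rationale: The paper focuses on a physics-driven deep learning framework for solid mechanics (CPDEM) that predicts mechanical responses under random material parameters. The keywords all concern large language models (LLMs), the principles of deep learning techniques, or AI applications in science, but the paper does not involve LLMs, MoE, SLMs, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning techniques, agent systems, model compression, hallucination mitigation, interpretability, world models, model merging, in-context learning, or similar techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to science (solid mechanics) but is not bioinformatics or cheminformatics, so it receives 5 points (some relevance). The other keywords have no direct connection to the paper's topics (physics-driven deep learning, solid mechanics, parameterized methods) and receive 0.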

!!! tip deepseek-chat TL;DR

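The paper proposes CPDEM, a physics-driven deep learning framework for efficiently predicting mechanical responses in solid mechanics when material parameters vary randomly. It enables zero-shot, real-time inference without dataset generation or model retraining.

Abstract Translation

In practical structural design and solid mechanics simulation, material properties inevitably vary randomly within bounded intervals. Evaluating mechanical responses under continuous material uncertainty, however, remains a long-standing challenge. Traditional numerical methods such as the finite element method require repeated mesh discretization and equation solving for every parameter realization, which makes them prohibitively expensive. Likewise, data-driven surrogate models depend heavily on large, high-fidelity datasets, while standard physics-informed frameworks (for example, the Deep Energy Method) must be retrained from scratch whenever the material parameters change. To close this gap, the authors propose the Constitutive Parameterized Deep Energy Method (CPDEM). In this purely physics-driven framework, the strain energy density functional is reformulated by encoding a latent representation of the stochastic constitutive parameters. By embedding the material parameters into the neural network together with the spatial coordinates, CPDEM turns conventional spatial collocation points into parameter-aware material points. Trained without supervision by minimizing the expected energy over the parameter domain, the pre-trained model continuously learns the solution manifold. As a result, it performs zero-shot, real-time inference of displacement fields for unknown material parameters without any dataset generation or model retraining. The method is rigorously validated on a range of benchmarks, including linear elasticity, finite-strain hyperelasticity, and complex, highly nonlinear contact mechanics. To the authors' knowledge, CPDEM is the first purely physics-driven deep learning paradigm that handles continuous multi-parameter variations in solid mechanics simultaneously and efficiently.

Abstract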

In practical structural design and solid mechanics simulations, material properties inherently exhibit random variations within bounded intervals. However, evaluating mechanical responses under continuous material uncertainty remains a persistent challenge. Traditional numerical approaches, such as the Finite Element Method (FEM), incur prohibitive computational costs as they require repeated mesh discretization and equation solving for every parametric realization. Similarly, data-driven surrogate models depend heavily on massive, high-fidelity datasets, while standard physics-informed frameworks (e.g., the Deep Energy Method) strictly demand complete retraining from scratch whenever material parameters change. To bridge this critical gap, we propose the Constitutive Parameterized Deep Energy Method (CPDEM). In this purely physics-driven framework, the strain energy density functional is reformulated by encoding a latent representation of stochastic constitutive parameters. By embedding material parameters directly into the neural network alongside spatial coordinates, CPDEM transforms conventional spatial collocation points into parameter-aware material points. Trained in an unsupervised manner via expected energy minimization over the parameter domain, the pre-trained model continuously learns the solution manifold. Consequently, it enables zero-shot, real-time inference of displacement fields for unknown material parameters without requiring any dataset generation or model retraining. The proposed method is rigorously validated across diverse benchmarks, including linear elasticity, finite-strain hyperelasticity, and complex highly nonlinear contact mechanics. To the best of our knowledge, CPDEM represents the first purely physics-driven deep learning paradigm capable of simultaneously and efficiently handling continuous multi-parameter variations in solid mechanics.

Keywords: Constitutive Parameterized Deep Energy Method, solid mechanics, random material parameters, physics-driven deep learning, zero-shot inference, stochastic constitutive parameters, unsupervised training, energy minimization

224. ❌ Identification of Bivariate Causal Directionality Based on Anticipated Asymmetric Geometries

Authors: Alex Glushkovsky Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26024v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

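Scoring rationale: The paper focuses on traditional causal inference methods (Anticipated Asymmetric Geometries and the Monotonicity Index) and does not involve large models, deep learning, AI for Science, or related technical principles. None of the keywords, which cover large-model techniques, training methods, inference optimization, and application domains, apply, so all scores are 0.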

!!! tip deepseek-chat TL;DR

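The paper proposes two statistical methods, based on conditional distributions and a monotonicity index, for identifying causal direction in bivariate data. The tuned AAG method reaches 77.9% accuracy on real-world datasets.

Abstract Translation

Identifying causal directionality in bivariate numerical data is a fundamental research problem with important practical implications. This paper presents two alternative methods that identify the direction of causation from conditional distributions: (1) Anticipated Asymmetric Geometries (AAG) and (2) the Monotonicity Index. The AAG method compares the actual conditional distributions with the anticipated ones along both variables. Several comparison metrics are evaluated, including correlation, cosine similarity, the Jaccard index, K-L divergence, K-S distance, and mutual information. The anticipated distributions are projected as normal distributions from dual response statistics (mean and standard deviation). The Monotonicity Index approach compares monotonicity indexes computed from the gradients of the conditional distributions along the two axes and reports counts of gradient sign changes. Both methods assume stochastic properties of the bivariate data and exploit the anticipated unimodality of the conditional distributions of the effect. When classifying 95 pairs of real-world examples (Mooij et al., 2014), the tuned AAG method outperforms the Monotonicity Index and reaches a top accuracy of 77.9%, compared with 63 +/- 10% for additive noise models (ANMs). The methods include several hyperparameters that affect identification accuracy. For a given set of hyperparameters, either method yields a unique, deterministic outcome. To address the sensitivity to hyperparameters, they are tuned with a full factorial design of experiments. Finally, a decision tree is fitted on symmetrical bivariate statistics of the input data to distinguish misclassified cases, addressing the question of how decisive the causal-directionality identification method is.

Abstract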

Identification of causal directionality in bivariate numerical data is a fundamental research problem with important practical implications. This paper presents two alternative methods to identify direction of causation by considering conditional distributions: (1) Anticipated Asymmetric Geometries (AAG) and (2) Monotonicity Index. The AAG method compares the actual conditional distributions to anticipated ones along two variables. Different comparison metrics, such as correlation, cosine similarity, Jaccard index, K-L divergence, K-S distance, and mutual information have been evaluated. Anticipated distributions have been projected as normal based on dual response statistics: mean and standard deviation. The Monotonicity Index approach compares the calculated monotonicity indexes of the gradients of conditional distributions along two axes and exhibits counts of gradient sign changes. Both methods assume stochastic properties of the bivariate data and exploit anticipated unimodality of conditional distributions of the effect. It turns out that the tuned AAG method outperforms the Monotonicity Index and reaches a top accuracy of 77.9% compared to ANMs accuracy of 63 +/- 10% when classifying 95 pairs of real-world examples (Mooij et al, 2014). The described methods include a number of hyperparameters that impact accuracy of the identification. For a given set of hyperparameters, both the AAG or Monotonicity Index method provide a unique deterministic outcome of the solution. To address sensitivity to hyperparameters, tuning of hyperparameters has been done by utilizing a full factorial Design of Experiment. A decision tree has been fitted to distinguish misclassified cases using the input data’s symmetrical bivariate statistics to address the question of: How decisive is the identification method of causal directionality?
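
The abstract mentions counting gradient sign changes and relying on unimodal conditional distributions. A minimal sketch of that one ingredient, assuming the conditional distribution is given as a sequence of bin heights; this is not the paper's Monotonicity Index, whose exact definition the abstract does not give, and the function name is hypothetical:

```python
from typing import Sequence


def count_gradient_sign_changes(values: Sequence[float]) -> int:
    """Count how often the finite-difference gradient flips sign, ignoring flat steps."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    signs = [1 if d > 0 else -1 for d in diffs if d != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


# A unimodal profile rises and then falls, so its gradient flips sign once.
unimodal = [1, 2, 3, 2, 1]
# A bimodal profile flips sign three times.
bimodal = [1, 3, 1, 3, 1]
print(count_gradient_sign_changes(unimodal), count_gradient_sign_changes(bimodal))
```

A count of one is consistent with unimodality, while larger counts indicate extra modes or noise in the binned distribution.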

Keywords: causal directionality, bivariate data, Anticipated Asymmetric Geometries, Monotonicity Index, conditional distributions, hyperparameter tuning, decision tree, statistical methods

225. ❌ GLU: Global-Local-Uncertainty Fusion for Scalable Spatiotemporal Reconstruction and Forecasting

Authors: Linzheng Wang, Jason Chen, Nicolas Tricard, Zituo Chen, Sili Deng Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26023v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

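Scoring rationale: The paper proposes the GLU framework for sparse reconstruction and dynamic forecasting of complex physical systems. It belongs to AI for Science and has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), but it does not involve large models, the principles of deep learning techniques, or specific bioinformatics or cheminformatics applications. The other keywords are unrelated to the paper (0 points).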

!!! tip deepseek-chat TL;DR

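The paper proposes the GLU framework, which unifies sparse reconstruction and dynamic forecasting as a state-representation problem. Its structured latent assembly improves reconstruction fidelity across several benchmarks, and on a turbulent combustion dataset it delivers stable forecasts while preserving cross-channel thermo-chemical coupling.

Abstract Translation

Digital twins of complex physical systems must infer unobserved states from sparse measurements and predict how they evolve over time, yet these two functions are usually treated as separate tasks. This paper presents GLU, a Global-Local-Uncertainty framework that formulates sparse reconstruction and dynamic forecasting as a unified state-representation problem and introduces a structured latent assembly for both tasks. The central idea is to build a structured latent state that combines a global summary of system-level organization, local tokens anchored to the available measurements, and an uncertainty-driven importance field that weights observations by their physical informativeness. For reconstruction, GLU uses importance-aware adaptive neighborhood selection to retrieve locally relevant information while preserving global consistency and allowing flexible queries on arbitrary geometries. Across a suite of challenging benchmarks, GLU consistently improves reconstruction fidelity over reduced-order, convolutional, neural operator, and attention-based baselines, and better preserves multi-scale structures. For forecasting, a hierarchical Leader-Follower Dynamics module evolves the latent state with substantially reduced memory growth, maintains stable rollout behavior, and delays error accumulation in nonlinear dynamics. On a realistic turbulent combustion dataset, the framework preserves not only sharp fronts and broadband structures in multiple physical fields but also their cross-channel thermo-chemical couplings. Scalability tests show that these gains come with substantially lower memory growth than comparable attention-based baselines. Together, these results establish GLU as a flexible and computationally practical paradigm for sparse digital twins.

Abstract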

Digital twins of complex physical systems are expected to infer unobserved states from sparse measurements and predict their evolution in time, yet these two functions are typically treated as separate tasks. Here we present GLU, a Global-Local-Uncertainty framework that formulates sparse reconstruction and dynamic forecasting as a unified state-representation problem and introduces a structured latent assembly to both tasks. The central idea is to build a structured latent state that combines a global summary of system-level organization, local tokens anchored to available measurements, and an uncertainty-driven importance field that weights observations according to the physical informativeness. For reconstruction, GLU uses importance-aware adaptive neighborhood selection to retrieve locally relevant information while preserving global consistency and allowing flexible query resolution on arbitrary geometries. Across a suite of challenging benchmarks, GLU consistently improves reconstruction fidelity over reduced-order, convolutional, neural operator, and attention-based baselines, better preserving multi-scale structures. For forecasting, a hierarchical Leader-Follower Dynamics module evolves the latent state with substantially reduced memory growth, maintains stable rollout behavior and delays error accumulation in nonlinear dynamics. On a realistic turbulent combustion dataset, it further preserves not only sharp fronts and broadband structures in multiple physical fields, but also their cross-channel thermo-chemical couplings. Scalability tests show that these gains are achieved with substantially lower memory growth than comparable attention-based baselines. Together, these results establish GLU as a flexible and computationally practical paradigm for sparse digital twins.

Keywords: digital twins, spatiotemporal reconstruction, dynamic forecasting, structured latent assembly, importance-aware adaptive neighborhood selection, Leader-Follower Dynamics, turbulent combustion, scalability

226. ❌ QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

Authors: Siqiao Xue, Zhaoyang Zhu, Wei Zhang, Rongyao Cai, Rui Wang, Yixiang Mu, Fan Zhou, Jianguo Li, Peng Di, Hang Yu Journal/Source: arxiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26017v1

Score: 0.0 / 26.6 ❌

Score Details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

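Scoring rationale: The paper studies a time series forecasting benchmark and covers the use of foundation models for forecasting, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points). It discusses the effect of data quality on model performance, which relates to 'Scaling Laws AND Data Quality' (5 points). It finds that foundation models perform better at long context (L ≥ 576), which relates to 'Context Window Extension OR Long Context LLMs' (5 points). Time series forecasting has applications in finance, healthcare, and other fields, which falls under AI for Science (5 points). Other keywords such as MoE, SFT, and RAG are not covered in the paper and receive 0.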

!!! tip deepseek-chat TL;DR

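To address the lack of high-quality benchmarks in time series forecasting, the paper introduces QuitoBench. Evaluating 10 models, it finds that foundation models perform better at long context and that scaling training data improves performance more than scaling model size.

Abstract Translation

Time series forecasting is critical in finance, healthcare, cloud computing, and other fields, but progress is limited by a fundamental bottleneck: the lack of large-scale, high-quality benchmarks. To fill this gap, the authors introduce QuitoBench, a regime-balanced benchmark for time series forecasting that covers eight trend × seasonality × forecastability (TSF) regimes and is designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built on Quito, a billion-scale time series corpus of application traffic from Alipay that spans nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines on 232,200 evaluation instances, the authors report four key findings: (i) there is a context-length crossover, with deep learning models leading at short context ($L=96$) and foundation models dominating at long context ($L \ge 576$); (ii) forecastability is the main driver of difficulty, producing a 3.64× gap in mean absolute error across regimes; (iii) deep learning models match or surpass foundation models with 59× fewer parameters; and (iv) for both model families, increasing the amount of training data yields far greater benefit than increasing model size. These findings are supported by strong cross-benchmark and cross-metric consistency. The open-source release provides reproducible, regime-aware evaluation for time series forecasting research.

Abstract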

Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting with coverage across eight trend × seasonality × forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon Quito, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context ($L=96$) but foundation models dominate at long context ($L \ge 576$); (ii) forecastability is the dominant difficulty driver, producing a 3.64× MAE gap across regimes; (iii) deep learning models match or surpass foundation models at 59× fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.
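
The eight TSF regimes can be pictured as a grid over the three axes. A minimal sketch, assuming each axis is split into two levels (2 × 2 × 2 = 8); the abstract gives only the count of eight, so the binary split and the "low"/"high" labels are assumptions made for illustration:

```python
from itertools import product

# Assumed binary split per axis; the abstract states only that there are eight regimes.
LEVELS = ("low", "high")
AXES = ("trend", "seasonality", "forecastability")

# One dictionary per regime, e.g. {"trend": "low", "seasonality": "high", ...}.
regimes = [dict(zip(AXES, combo)) for combo in product(LEVELS, repeat=len(AXES))]

for regime in regimes:
    print(regime)
```

A regime-balanced benchmark would then draw a comparable number of series from each of these cells instead of grouping them by application domain.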

关键词: time series forecasting, benchmark, foundation models, context length, data quality, scaling, regime-balanced, Alipay
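
The eight TSF regimes can be pictured as the cells of a 2 × 2 × 2 grid. The sketch below assumes each of the three axes (trend, seasonality, forecastability) is split into two levels, here called "low" and "high". The abstract gives only the count of eight, so the binary split, the level names, and the function name `enumerate_regimes` are assumptions made for illustration.

```python
from itertools import product

AXES = ("trend", "seasonality", "forecastability")
LEVELS = ("low", "high")  # assumed binary split; not specified in the abstract


def enumerate_regimes():
    """Return every combination of levels across the three axes."""
    return [dict(zip(AXES, combo)) for combo in product(LEVELS, repeat=len(AXES))]


regimes = enumerate_regimes()
for regime in regimes:
    print(regime)
```

A regime-balanced benchmark would then draw comparable numbers of series from each of these cells, rather than from each application domain.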

227. ❌ Multi-scale Metabolic Modeling and Simulation

Authors: Peter E. Carstensen, Teddy Groves, Lars K. Nielsen, Ulrich Krühne, Krist V. Gernaey, John B. Jørgensen Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26370v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

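Scoring rationale: The paper "Multi-scale Metabolic Modeling and Simulation" focuses on biological systems modeling, specifically coupling genome-scale metabolic models with dynamic bioreactor simulation and replacing repeated linear program solves with a neural network surrogate. The keywords all concern large language models (LLMs), deep learning principles, or related techniques (MoE, RLHF, RAG, and so on), whereas this paper applies conventional machine learning (a neural network surrogate) to computational biology and metabolic engineering and does not contribute to large models or deep learning methods. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since the paper touches on bioinformatics (metabolic modeling) and AI applied to science (a neural network replacing an optimization step). Because large models are not its core contribution, it receives 5 points (some relevance). All other keywords are unrelated and score 0.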

!!! tip deepseek-chat TL;DR

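This work addresses the infeasibility and computational cost that arise when a genome-scale metabolic model must re-solve a linear program at every time step of a dynamic bioreactor simulation. It develops a multi-scale modeling framework in which a neural network surrogate approximates the optimization mapping, enabling continuous dynamic simulation of fed-batch Escherichia coli fermentation.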

摘要翻译

生物系统受细胞内代谢与生物反应器操作之间跨越多时间尺度的耦合相互作用所支配。基于约束的代谢模型被广泛用于描述细胞内代谢，但在动态模型中每个时间步重复求解优化问题会引发与不可行性和计算效率相关的数值挑战。本研究提出了一种多尺度建模框架，将基因组尺度的基于约束代谢模型与动态生物反应器模拟相集成。细胞内代谢采用简约通量平衡分析中的正通量变量进行描述，并将由此产生的嵌入式优化问题替换为神经网络代理模型。该代理模型提供了嵌入式优化映射的光滑近似，并消除了模拟过程中重复的线性规划求解。该方法在大肠杆菌的补料分批发酵中得到验证，其中代理模型在底物限制条件下成功给出了细胞内通量，而基础的线性规划模型在此条件下本将不可行。该框架提供了细胞内代谢的连续表征，适用于在生物反应器配置中对基因组尺度模型进行动态模拟。

摘要 (Abstract)

Biological systems are governed by coupled interactions between intracellular metabolism and bioreactor operation that span multiple time scales. Constraint-based metabolic models are widely used to describe intracellular metabolism, but repeatedly solving the optimization problem at each time step in dynamic models introduces numerical challenges related to infeasibility and computational efficiency. This work presents a multi-scale modeling framework that integrates genome-scale, constraint-based metabolic models with dynamic bioreactor simulations. Intracellular metabolism is described using positive flux variables in a parsimonious flux balance analysis, and the resulting embedded optimization problem is replaced by a neural network surrogate. The surrogate provides a smooth approximation of the embedded optimization mapping and eliminates repeated linear program solves during simulation. The approach is demonstrated for fed-batch fermentation of Escherichia coli, in which the surrogate model yields intracellular fluxes under substrate-limited conditions, whereas the underlying linear program would otherwise be infeasible. The framework provides a continuous representation of intracellular metabolism suitable for dynamic simulation of genome-scale models in bioreactor configurations.

关键词: multi-scale modeling, metabolic modeling, neural network surrogate, flux balance analysis, bioreactor simulation, genome-scale models, dynamic simulation, Escherichia coli fermentation

228. ❌ TurboESM: Ultra-Efficient 3-Bit KV Cache Quantization for Protein Language Models with Orthogonal Rotation and QJL Correction

Authors: Yue Hu, Junqing Wang, Yingchao Liu Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26110v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	15.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	15.0/10	0.0

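Scoring rationale: The paper focuses on KV cache quantization for protein language models (PLMs), an application of large models in science. Its core contribution is a 3-bit KV cache quantization method, which is highly relevant to "Quantization OR Model Compression OR Low-bit Weights" (15 points) and also falls under "KV Cache Compression OR Linear Attention OR FlashAttention" (10 points). It belongs to "AI for Science OR Bioinformatics OR Cheminformatics" (15 points) because PLMs are a bioinformatics application. It is fairly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points), since PLMs are language models for the protein domain and the paper contrasts PLMs with LLMs. "Speculative Decoding OR Inference Acceleration" receives 5 points because the paper reports inference acceleration (a 1.96x speedup), though this is not its main focus. The remaining keywords are unrelated to the paper and score 0.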
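
The score table above can be read as a small calculation. The sketch below assumes each keyword's score is weight × relevance, a rule the report does not state explicitly. The pass mark formula (5.0 + 0.8 × total weight) and the 27 keywords at weight 1.0 come from the configuration header. The names `pass_mark` and `total_score` are hypothetical.

```python
def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report's configuration: 5.0 + 0.8 x total weight."""
    return base + factor * total_weight


def total_score(rows):
    """Sum per-keyword scores, ASSUMING score = weight x relevance."""
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero relevance values from the table above; the other 22 rows are 0.0.
relevances = [8.0, 10.0, 15.0, 5.0, 15.0]

# 27 keywords at weight 1.0 each, as listed in the configuration header.
threshold = pass_mark(total_weight=27 * 1.0)

# Weights exactly as printed in the table (all 0.0).
as_reported = total_score([(0.0, r) for r in relevances])

# Hypothetical comparison: the same relevances with the configured weight 1.0.
as_configured = total_score([(1.0, r) for r in relevances])

print(round(threshold, 1), as_reported, as_configured)
```

Under this assumption, the 0.0 weights printed in the table force a total of 0.0 whatever the relevance values are, which matches the reported score.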

!!! tip deepseek-chat TL;DR

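This paper tackles the large KV cache memory footprint of protein language model inference. It proposes TurboESM, a 3-bit KV cache quantization method that cuts memory by 7.1x (330 MB to 47 MB on ESM-2 650M) while keeping cosine similarity above 0.96. A fused Triton kernel speeds up the KV fetch operation alone by 1.96x over the PyTorch two-step path, at the cost of 21-27 ms of prefill overhead.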

摘要翻译

蛋白质语言模型（PLM）的快速规模化发展，在蛋白质结构预测与设计领域实现了前所未有的准确性，但推理过程中键值（KV）缓存呈二次增长的内存需求，仍是单GPU部署与高通量生成的主要障碍。虽然8位量化已成为标准方案，但由于激活值中存在严重的数值异常值，3位量化仍难以实现。本文提出了TurboESM，这是将谷歌TurboQuant方法适配至PLM领域的成果。我们通过推导一种RoPE优先的旋转流程，解决了旋转位置编码（RoPE）与正交变换之间的根本性不兼容问题。我们引入了一种针对氨基酸激活流形设计的头级别SVD校准方法、一种适用于非对称K/V分布的双查找表（LUT）策略，以及一种1位量化约翰逊-林登斯特劳斯（QJL）残差校正。所有实验均在ESM-2 650M模型上进行，我们的实现将内存占用降低了7.1倍（从330 MB降至47 MB），同时在涵盖短肽、跨膜螺旋、酶活性位点片段和固有无序区域在内的多种蛋白质家族的自回归解码中，保持了大于0.96的余弦相似度。我们进一步实现了一个基于Triton的融合解码注意力内核，消除了中间反量化的内存分配，仅KV获取操作就比PyTorch两步路径实现了1.96倍的加速；然而，由于KV量化和打包操作，TurboESM相比原始模型产生了21-27毫秒的预填充开销，这使其最适合内存受限的场景，而非对延迟敏感的短序列任务。分析表明，由于氨基酸词表的稀疏性，PLM比大语言模型（LLM）表现出更尖锐的异常值分布特征，而我们的方法有效地处理了这些分布。

摘要 (Abstract)

The rapid scaling of Protein Language Models (PLMs) has unlocked unprecedented accuracy in protein structure prediction and design, but the quadratic memory growth of the Key-Value (KV) cache during inference remains a prohibitive barrier for single-GPU deployment and high-throughput generation. While 8-bit quantization is now standard, 3-bit quantization remains elusive due to severe numerical outliers in activations. This paper presents TurboESM, an adaptation of Google’s TurboQuant to the PLM domain. We solve the fundamental incompatibility between Rotary Position Embeddings (RoPE) and orthogonal transformations by deriving a RoPE-first rotation pipeline. We introduce a head-wise SVD calibration method tailored to the amino acid activation manifold, a dual look-up table (LUT) strategy for asymmetric K/V distributions, and a 1-bit Quantized Johnson-Lindenstrauss (QJL) residual correction. All experiments are conducted on ESM-2 650M, where our implementation achieves a 7.1x memory reduction (330 MB to 47 MB) while maintaining cosine similarity > 0.96 in autoregressive decoding across diverse protein families, including short peptides, transmembrane helices, enzyme active site fragments, and intrinsically disordered regions. We further implement a Triton-based fused decode attention kernel that eliminates intermediate dequantization memory allocations, achieving a 1.96x speedup over the PyTorch two-step path for the KV fetch operation alone; however, TurboESM incurs a prefill overhead of 21-27 ms relative to the original model due to KV quantization and packing, making it most suitable for memory-bound scenarios rather than latency-critical short-sequence workloads. Analysis reveals that PLMs exhibit sharper outlier profiles than large language models (LLMs) due to amino acid vocabulary sparsity, and our method effectively addresses these distributions.

关键词: Protein Language Models, KV cache quantization, 3-bit quantization, memory reduction, inference acceleration, ESM-2, autoregressive decoding, amino acid activation manifold
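
To illustrate why an orthogonal rotation is applied before low-bit quantization, the following is a minimal sketch and not the TurboESM or TurboQuant implementation. It swaps in a randomized Hadamard transform as a generic orthogonal rotation and a plain absmax uniform quantizer. The SVD calibration, dual look-up tables, and QJL residual correction described in the abstract are not modeled. All function names, the vector size, and the outlier value are assumptions chosen for illustration.

```python
import math
import random


def fwht(v):
    """Normalized fast Walsh-Hadamard transform (orthogonal and self-inverse).

    Requires len(v) to be a power of two.
    """
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [val / norm for val in out]


def quantize_dequantize(v, bits=3):
    """Uniform symmetric quantization with an absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 3 for 3-bit signed codes
    scale = max(abs(val) for val in v) / qmax
    return [max(-qmax - 1, min(qmax, round(val / scale))) * scale for val in v]


def squared_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


rng = random.Random(0)
n = 128
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
x[5] = 20.0  # a single large outlier, standing in for an activation outlier
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

# Baseline: quantize directly; the outlier inflates the step size.
plain = quantize_dequantize(x)

# Rotate (random sign flips + Hadamard), quantize, then undo the rotation.
rotated = fwht([s * val for s, val in zip(signs, x)])
recovered = [s * val for s, val in zip(signs, fwht(quantize_dequantize(rotated)))]

err_plain = squared_error(x, plain)
err_rot = squared_error(x, recovered)
print(f"no rotation: {err_plain:.2f}, with rotation: {err_rot:.2f}")
```

The rotation spreads the outlier's energy across all coordinates, so the quantization step shrinks and the small entries are no longer rounded to zero. Because the transform is orthogonal, the error measured after undoing the rotation equals the error in the rotated domain.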

229. ❌ Controlling isomer population using a dual-oscillator infrared free-electron laser

Authors: América Y. Torres-Boy, Anoushka Ghosh, Myles B. T. Osenton, Akash C. Behera, Sandy Gewinner, Marco De Pas, Heinz Junkes, Wieland Schöllkopf, Alexander Paarmann, Gert von Helden, Gerard Meijer Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26473v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

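Scoring rationale: The paper uses a dual-oscillator infrared free-electron laser to control the isomer population of ions inside superfluid helium nanodroplets, which is experimental physical chemistry. None of the keywords on large models, deep learning, or AI techniques and applications apply, so every keyword except "AI for Science OR Bioinformatics OR Cheminformatics" scores 0. "AI for Science" receives 5 points because the work is a scientific experiment at the intersection of chemistry and physics. The paper itself uses no AI methods and only falls under scientific research in the broadest sense, so the relevance is limited.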

!!! tip deepseek-chat TL;DR

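Using a highly synchronized two-color infrared free-electron laser, this study controls and characterizes the isomer population of the singly deuterated proton-bound dimer of dihydrogen phosphate and formate inside superfluid helium nanodroplets, and records infrared spectra of the individual isomers.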

摘要翻译

我们报道了利用双振荡器红外自由电子激光器的双色操作，实现对超流氦纳米液滴内离子异构体布居的控制与表征。两束激光的时序高度同步，其频率（或称“色光”）可在宽范围内独立调谐。氦纳米液滴内单氘代磷酸二氢根-甲酸质子结合二聚体与双色光的相互作用，实现了对其异构体布居的控制，并记录下各单一异构体在单色光条件下被隐藏的红外光谱。

摘要 (Abstract)

We report on the control and characterization of the isomer population of ions inside superfluid helium nanodroplets, using two-color operation of a dual-oscillator infrared free-electron laser. The timing of both lasers is highly synchronized, and their frequencies (or “colors”) can be tuned independently over a wide range. Interaction of the singly deuterated proton-bound dimer of dihydrogen phosphate and formate inside helium nanodroplets with both colors enables the control over its isomer population and the recording of - one-color hidden - infrared spectra of individual isomers.

关键词: isomer population control, dual-oscillator infrared free-electron laser, superfluid helium nanodroplets, two-color operation, infrared spectroscopy, proton-bound dimer, dihydrogen phosphate, formate

230. ❌ Non-additive Ion Effects on the Coil-Globule Equilibrium of a Generic Uncharged Polymer

Authors: Kushagra Goel, Monika Choudhary, Swaminath Bharadwaj Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26555v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

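Scoring rationale: The paper studies conformational transitions of a polymer in salt solutions, which belongs to physical chemistry and polymer science and is unrelated to nearly all of the large-model and deep-learning keywords. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the work falls under computational chemistry and molecular simulation and could loosely be seen as an instance of "AI for Science" in chemistry and materials science, but the paper does not explicitly use AI or machine learning (it relies mainly on molecular dynamics simulation). The relevance is therefore weak, and it receives 5 points (some relevance). All other keywords, which concern large-model techniques, training methods, inference optimization, AI agents, and so on, are unrelated and score 0.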

!!! tip deepseek-chat TL;DR

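Using atomistic simulations, this study examines the coil-globule transition of a generic uncharged polymer in single and mixed salt solutions. It finds that non-specific polymer-ion interactions are sufficient to reproduce the experimentally observed non-additive ion effects, highlighting the dominant role of bulk ion-ion and ion-water interactions.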

摘要翻译

弱水合阴离子与强水合阴离子的混合物会引发热响应性聚合物（如聚（N-异丙基丙烯酰胺）（PNIPAM）和聚环氧乙烷（PEO））的低临界溶解温度（LCST）发生非加和性变化。对PNIPAM-NaI-Na₂SO₄混合物的大规模原子模拟表明，这些效应源于有利的PNIPAM-碘离子相互作用与强水合硫酸根离子耗竭之间的协同作用。本文旨在探究，是否必须依赖化学特异性的聚合物-阴离子相互作用才能重现此类行为。为此，我们研究了一种通用的不带电线性聚合物在单一及混合盐原子级水溶液中的线团-球体转变，该聚合物仅具有非特异性的聚合物-水及聚合物-离子范德华相互作用。我们在固定强水合盐Na₂SO₄浓度、逐步增加弱水合盐NaSCN和NaI浓度的条件下进行了模拟。该通用聚合物在纯NaSCN溶液、纯Na₂SO₄溶液以及混合盐溶液中均能定性复现实验观测趋势。该模型捕捉到了SCN⁻在聚合物附近的富集与SO₄²⁻的耗竭之间的相互增强效应，正是这一效应导致了非加和性行为，这与PNIPAM溶液中的原子模拟结果一致。随着背景盐浓度的增加，这些特征变得更加明显；当用I⁻替换SCN⁻时，由于聚合物-碘离子相互作用较弱，这些特征进一步增强。我们的结果表明，非特异性的聚合物-离子相互作用足以重现非加和性特征，这凸显了本体离子-离子及离子-水相互作用的主导作用。

摘要 (Abstract)

Mixtures of weakly and strongly hydrated anions induce non-additive changes in the LCST of thermoresponsive polymers such as Poly(N-isopropylacrylamide) (PNIPAM) and PEO. Large-scale atomistic simulations of PNIPAM-NaI-Na$_{2}$SO$_{4}$ mixtures show that these effects arise from the interplay between favorable PNIPAM-iodide interactions and the depletion of strongly hydrated sulfate ions. Here, we investigate whether chemically specific polymer-anion interactions are necessary to reproduce such behavior. To this end, we study the coil-to-globule transition of a generic uncharged linear polymer with non-specific polymer-water and polymer-ion van der Waals interactions in atomistic aqueous solutions of single and mixed salts. We perform simulations at fixed concentrations of the strongly hydrated salt, Na$_{2}$SO$_{4}$, and increasing concentrations of weakly hydrated salts, NaSCN and NaI. The generic polymer qualitatively reproduces experimental trends in both pure NaSCN and Na$_{2}$SO$_{4}$ solutions, as well as in mixed salt solutions. The model captures the mutual reinforcement between SCN$^{-}$ accumulation near the polymer and SO$_{4}^{2-}$ depletion that gives rise to non-additive behavior, consistent with atomistic simulations in PNIPAM solutions. These features become more pronounced with increasing background salt concentration and are further enhanced upon replacing SCN$^{-}$ with I$^{-}$, owing to weaker polymer-iodide interactions. Our results demonstrate that non-specific polymer-ion interactions are sufficient to reproduce non-additive features, highlighting the dominant role of bulk ion-ion and ion-water interactions.

关键词: coil-globule transition, uncharged polymer, non-additive ion effects, atomistic simulations, mixed salt solutions, polymer-ion interactions, ion depletion, PNIPAM

231. ❌ Coupling Quantum Mechanical Modeling and Molecular Dynamics on Heterogeneous Supercomputers for Studying Distal Mutation Effects on Drug Binding in HIV-1

Authors: William Dawson, Louis Beal, Marco Zaccaria, Luigi Genovese Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26411v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

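Scoring rationale: The paper studies how distal mutations in HIV-1 protease affect drug binding, using molecular dynamics and quantum mechanical simulation. It belongs to computational biophysics and bioinformatics. Its content is unrelated to nearly all of the keywords (large-model principles, training methods, inference optimization, alignment, and so on). It relates only to the last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", because its research questions (drug binding, mutation effects) fall within bioinformatics and scientific computing. The paper uses no AI or machine learning, relying instead on conventional physics-based simulation (MD and DFT), so the relevance is limited and it receives 5 points.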

!!! tip deepseek-chat TL;DR

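By coupling molecular dynamics simulation with quantum mechanical calculations, this paper studies how distal mutations in HIV-1 protease alter the network of electronic interactions at the binding interface and thereby confer resistance to the antiviral Darunavir. It also presents a scalable computational strategy for dissecting complex biophysical mechanisms of drug resistance.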

摘要翻译

预测蛋白质突变如何影响药物结合仍是一个重大挑战，尤其当突变远离结合位点时。本研究提出了一种耦合模拟工作流程，将长时间尺度的分子动力学模拟与高通量量子力学分析相结合，以揭示HIV-1蛋白酶中突变诱导耐药性的电子结构特征。该工作流程利用GPU加速的分子动力学模拟生成构象系综，并在CPU节点耦合分区上并行地对选定帧进行原位线性缩放密度泛函理论计算。这一设计使得我们能够以原子分辨率对蛋白质-配体复合物进行高效、大规模并行的量子分析。应用此方法，我们研究了多重突变HIV-1蛋白酶变体对抗病毒药物达芦那韦的耐药性。通过绘制结合界面上的电子相互作用网络，我们的结果凸显了构象采样和量子层面洞察在理解远端突变效应中的关键作用，并展示了一种可扩展的计算策略，用于研究复杂的耐药性生物物理机制。我们认为，此类分析可能为设计能够抵抗系统性突变诱导去稳定化、保持结合稳定性的抑制剂开辟新途径。

摘要 (Abstract)

Predicting how protein mutations affect drug binding remains a major challenge, particularly when the mutations are distal from the binding site. In this study, we introduce a coupled simulation workflow that combines long-time-scale molecular dynamics (MD) with high-throughput quantum mechanical (QM) analysis to reveal the electronic structure signatures of mutation induced drug resistance in the HIV-1 protease. Our workflow leverages GPU-accelerated MD to generate conformational ensembles, and performs in-operando linear-scaling density functional theory (DFT) calculations on selected frames parallelized on a coupled partition of CPU nodes. This design enables efficient, massively parallel quantum analysis of protein-ligand complexes at atomic resolution. Using this approach, we investigate resistance to the antiviral Darunavir in a multi-mutant HIV-1 protease variant. By mapping the network of electronic interactions across the binding interface, our results highlight the critical role of conformational sampling and quantum insight in understanding distal mutation effects, and demonstrate a scalable computational strategy for studying complex biophysical mechanisms of drug resistance. We argue that such kind of analysis may pave the way for designing inhibitors that maintain binding stability against systemic, mutation-induced destabilization.

关键词: HIV-1 protease, drug binding, distal mutations, molecular dynamics, quantum mechanical modeling, drug resistance, Darunavir, electronic structure
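
The coupling described in the abstract, where MD produces frames and QM analysis runs in parallel on selected ones, can be sketched as a producer feeding a worker pool. This is only a structural illustration under loud assumptions: `run_md`, `is_selected`, and `qm_analysis` are hypothetical stand-ins that do no real simulation, a thread pool replaces the GPU and CPU partitions, and the fixed-stride frame selection is an arbitrary choice that the abstract does not specify.

```python
from concurrent.futures import ThreadPoolExecutor


def run_md(n_frames):
    """Stand-in for an MD engine: yields frames one at a time (fake coordinates)."""
    for index in range(n_frames):
        yield {"frame": index, "coords": [float(index), float(index) + 0.5]}


def is_selected(frame, stride):
    """Assumed selection rule: keep every `stride`-th frame."""
    return frame["frame"] % stride == 0


def qm_analysis(frame):
    """Stand-in for a per-frame QM calculation: returns a placeholder value."""
    return frame["frame"], sum(frame["coords"])


def coupled_workflow(n_frames=50, stride=10, workers=4):
    """Submit selected frames to the pool while the MD generator keeps producing."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(qm_analysis, frame)
            for frame in run_md(n_frames)
            if is_selected(frame, stride)
        ]
        return [future.result() for future in futures]


results = coupled_workflow()
print(results)
```

The point of the structure is that analysis of an early frame can start while later frames are still being generated, instead of waiting for the whole trajectory.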

232. ❌ Hunting Structural Demons in Digital Reticular Chemistry

Authors: Yongchul G. Chung, Myoung Soo Lah Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26295v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

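Scoring rationale: The paper is a review of crystal structure errors in digital reticular chemistry, covering the sources, detection, and prevention of structural errors in MOF (metal-organic framework) databases. It is unrelated to nearly all of the keywords (large-model techniques, training methods, inference optimization, alignment, agent systems, and so on), which are deep learning and large language model topics. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper belongs to cheminformatics, involves computational chemistry and database curation, and has some connection to scientific AI applications. However, it does not explicitly use AI or machine learning and only discusses structure validation and database curation, so it receives 5 points (some relevance).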

!!! tip deepseek-chat TL;DR

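This mini-review examines the sources, detection, and prevention of crystal structure errors (termed "structural demons") in digital reticular chemistry, aiming to reduce invalid structures in downstream databases through better data curation and structure generation workflows.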

摘要翻译

数字网状化学依赖于精确的晶体结构来驱动计算筛选、数据驱动的发现以及结构-性能分析，然而近期研究表明，在主要计算筛选项目中表现优异的结构候选者超过半数在化学上是无效的。在实验性金属有机框架（MOF）数据库中，当无序或不完整的结构模型被错误地转换为完全明确的模拟输入时，便会产生结构错误。在假设性MOF数据库中，结构在构建时是完整的，但可能包含化学上不合理的氧化态、配位环境或电荷分布。我们将这些错误的结构模型称为“结构恶魔”。本微型综述提出三个问题：这些错误从何而来，我们如何发现它们，以及如何预防它们。在预防方面，关键步骤包括从一开始就将衍射数据与合成细节共同保存，在结构入库时采用一致的标准化处理流程，并在结构生成前筛选拓扑选择。将这些步骤串联起来，可以阻止大量不良结构进入下游数据库，从而减少后续修正的需求。

摘要 (Abstract)

Digital reticular chemistry relies on accurate crystal structures to power computational screening, data-driven discovery, and structure-property analysis, yet recent studies reveal that more than half of the top-performing candidates in major computational screening campaigns are chemically invalid. In experimental MOF databases, structural errors arise when disordered or incomplete structural models are incorrectly converted into fully specified simulation inputs. In hypothetical MOF database, structures are complete by construction but may encode chemically implausible oxidation states, coordination environments, or charge distributions. We term these erroneous structural models “structural demons.” This mini-review asks three questions: where these errors enter, how we find them, and how we prevent them. On the prevention side, the key steps are keeping diffraction data and synthesis details together from the start, using consistent curation when structures enter a database, and filtering topology choices before structure generation. Connecting these steps can keep many bad structures out of downstream databases and reduce the need to fix them later.

关键词: digital reticular chemistry, crystal structures, structural errors, MOF databases, computational screening, structure validation, data curation, topology filtering

233. ❌ Microscopic Structure and Dynamics of Interfacial Water at Fluorinated vs Nonfluorinated Surfaces – Insights from Ab-Initio Simulations and IR Spectroscopy

Authors: Maximilian R. Becker, Ruben Cruz, Kenichi Ataka, Joachim Heberle, Roland R. Netz Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26300v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

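Scoring rationale: The paper studies the microscopic structure and dynamics of interfacial water at fluorinated and non-fluorinated surfaces using density-functional-theory molecular dynamics simulations and infrared spectroscopy. The keywords all concern large models, deep learning principles, or AI applications, whereas the paper belongs to computational and physical chemistry and involves no large-model, deep learning, or AI techniques. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since the paper is computational science. It uses conventional simulation rather than AI methods, so it receives 5 points (some relevance). All other keywords are unrelated and score 0.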

!!! tip deepseek-chat TL;DR

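Using density-functional-theory molecular dynamics simulations and infrared spectroscopy, this paper studies the structure and dynamics of interfacial water at fluorinated and non-fluorinated self-assembled monolayers. It finds that the fluorinated surface, although macroscopically more hydrophobic, has spectroscopic characteristics that match neither a hydrophobic nor a hydrophilic surface, and that dispersive interactions dominate the water-surface interaction.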

摘要翻译

全氟烷基和多氟烷基物质是一类广泛用作降低表面能涂层的合成化合物。然而，其与水及有机化合物之间弱相互作用的微观机制仍不甚明晰。本研究通过大规模密度泛函理论分子动力学模拟，探究了水在氟化及非氟化碳氢化合物自组装单分子膜界面的行为。我们分析了界面水结构，并将其与典型的疏水性空气-水界面进行对比。两种自组装单分子膜处的界面水结构均与空气-水界面高度相似，均呈现明显的耗尽层和平行于表面的二维氢键网络。计算得到的各向异性红外光谱复现了表面增强红外吸收光谱实验中观测到的关键特征，包括可直接探测局部表面-水相互作用的游离OH振动。值得注意的是，虽然碳氢化合物自组装单分子膜-水界面的游离OH伸缩振动相对于空气-水界面呈现红移（表明存在弱结合作用），但氟化自组装单分子膜-水界面却显示出微弱蓝移的游离OH振动模式，这与实验结果一致。这种频率变化规律无法用基于振动斯塔克效应的常规理论解释，表明水与自组装单分子膜之间的相互作用主要受色散作用而非静电作用主导。光谱线形分析进一步表明，水分子在氟化表面附近的重取向动力学显著减慢，这种现象常见于亲水表面。这表明氟化表面虽然在宏观上比未氟化表面更具疏水性，但其光谱特征既不符合典型疏水性也不符合典型亲水性的界定标准。

摘要 (Abstract)

Per- and polyfluoroalkyl substances are a class of synthetic chemical compounds widely used as coatings to lower surface energies. Yet the microscopic mechanisms of their weak interaction with water and organic compounds remain poorly understood. Here, we perform large-scale density-functional-theory molecular dynamics simulations to investigate water at self-assembled monolayers (SAMs) of fluorinated and non-fluorinated hydrocarbons. We analyze the interfacial water structure and compare it to the prototypical hydrophobic air-water interface. The interfacial water structure at both SAMs closely resembles that at the air-water interface, featuring a distinct depletion layer and a two-dimensional hydrogen-bond network parallel to the surface. Computed anisotropic infrared spectra reproduce key experimental signatures observed in surface-enhanced infrared absorption spectroscopy (SEIRAS), including the presence of free OH vibrations directly probing the local surface-water interactions. Notably, while the free OH stretch at the hydrocarbon SAM-water interface exhibits a red shift relative to the air-water interface, indicative of weak binding, the fluorinated SAM-water interface displays a weakly blue-shifted free OH mode, in agreement with experiment. This frequency behavior, which defies common interpretations based on the vibrational Stark effect, indicates that dispersive rather than electrostatic interactions dominate the interaction between water and SAMs. Analysis of spectral line shapes further shows that the reorientation dynamics of water molecules are significantly slower near the fluorinated surface, as commonly observed at hydrophilic surfaces. This indicates that fluorinated surfaces, despite being macroscopically more hydrophobic than their unfluorinated counterparts, exhibit spectroscopic characteristics that neither qualify it as hydrophobic nor hydrophilic.

关键词: interfacial water, fluorinated surfaces, density-functional-theory molecular dynamics, infrared spectroscopy, self-assembled monolayers, hydrophobic interactions, spectral line shapes, dispersive interactions

234. ❌ The Unreconstructed α-Al$_{2}$O$_{3}$(0001) Surface is Inhomogeneous and Rough

Authors: Johanna I. Hütner-Reisch, Andrea Conti, David Kugler, Florian Mittendorfer, Michael Schmid, Ulrike Diebold, Jan Balajka Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26201v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

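Scoring rationale: This paper studies the atomic structure of the alumina (Al2O3) surface using noncontact atomic force microscopy (nc-AFM) and density functional theory (DFT) calculations, challenging the assumption that the α-Al2O3(0001) surface is flat and uniform. The topic belongs to materials science and surface physics and is entirely unrelated to every keyword on large models, deep learning, or AI techniques (each scored 0). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves computational materials science (DFT calculations) and experimental techniques, which count as computation applied to science. The research is not driven by AI or large models, however, so it receives 5 points (some relevance).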
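The arithmetic behind the score line can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight), the per-keyword cap of 15, and the total weight of 27.0 come from the report's scoring settings. The way weight and relevance combine in `keyword_score` is an assumption, since the report does not state it, and all function names are hypothetical.

```python
# Values taken from the report's scoring settings.
MAX_PER_KEYWORD = 15.0
BASE, SLOPE = 5.0, 0.8
TOTAL_WEIGHT = 27.0


def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report: 5.0 + 0.8 * total weight."""
    return BASE + SLOPE * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    """ASSUMPTION: relevance (0-10) is scaled linearly to the per-keyword
    cap and multiplied by the weight. The report does not give this formula."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


# (weight, relevance) pairs as they appear in this paper's table:
# 26 rows of (0.0, 0.0) and one row of (0.0, 5.0).
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]

total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_mark(TOTAL_WEIGHT)
passed = total >= threshold

print(f"score {total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Under this assumption a row weight of 0.0 zeroes out the 5.0/10 relevance, which is consistent with the 0.0 total shown in the table.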

!!! tip deepseek-chat TL;DR

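Through experiment and computation, the paper shows that the unreconstructed α-Al2O3(0001) surface is actually rough, disordered, and inhomogeneous, challenging the conventional atomically flat model.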

Abstract

Alumina (Al$_2$O$_3$) is a key material for thin-film growth and heterogeneous catalysis, where the atomic surface structure critically impacts performance. Using noncontact atomic force microscopy (nc-AFM) combined with density functional theory (DFT) calculations, we challenge the common assumption that the unreconstructed $\alpha$-Al$_2$O$_3$(0001) surface is atomically flat and uniformly Al-terminated. This widely accepted bulk termination satisfies polarity compensation requirements but results in highly undercoordinated surface Al cations at the surface. Despite substantial inward relaxation of these Al cations, we find that the (1 $\times$ 1) surface remains inherently metastable, relative to the thermodynamically stable $(\sqrt{31} \times \sqrt{31})R\pm9^\circ$ surface reconstruction that forms at high temperatures above 1000 °C. Nc-AFM imaging of the unreconstructed surface reveals a rough and disordered morphology, with only nanometer-scale regions exhibiting the ordered Al-terminated (1 $\times$ 1) structure. Our results show that the unreconstructed Al$_2$O$_3$(0001) surface is intrinsically inhomogeneous, reconciling conflicting experimental observations and challenging the validity of commonly used atomistic models.

Keywords: alumina, surface structure, atomic force microscopy, density functional theory, surface reconstruction, thin-film growth, heterogeneous catalysis, polarity compensation

235. ❌ Geometric Phase Effect in Thermodynamic Properties and in the Imaginary-Time Multi-Electronic-State Path Integral Formulation

Authors: Jian Liu Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26151v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

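Scoring rationale: This paper studies the geometric phase effect in quantum chemistry and its influence on the calculation of thermodynamic properties, which places it in computational and physical chemistry. Its content is entirely unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work falls under scientific computing (computational chemistry). The paper uses no AI or machine learning methods, however, and instead builds on conventional path integral molecular dynamics theory, so it receives only 5 points (some relevance).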

!!! tip deepseek-chat TL;DR

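The paper studies how the geometric phase effect in quantum chemistry influences thermodynamic properties and quantifies the effect with the multi-electronic-state path integral method, showing that this method captures the geometric phase accurately, whereas conventional methods produce significant errors at low temperature.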

Abstract

The geometric phase (GP) is a fundamental quantum effect arising from conical intersections (CIs), with profound consequences for vibronic energy levels. Standard imaginary-time path integral molecular dynamics (PIMD) based on the Born-Oppenheimer approximation does not account for the GP, potentially leading to significant errors in low-temperature thermodynamic properties. In this Perspective, we demonstrate that the multi-electronic-state path integral (MES-PI) formulation in imaginary time (developed in J. Chem. Phys. 2018, 148, 102319) naturally captures the GP effect through the electronic trace of the product of statistically weighted overlap matrices between successive imaginary-time slices. This crucial capability was already implicit in the benchmark MES-PIMD simulations in that foundational work. To isolate this topological effect from other nonadiabatic effects, we introduce a geometric signature matrix (for the CI) and a winding-number-induced phase factor, constructing an ad hoc GP-excluded MES-PI method. Comparing this ad hoc baseline against the rigorous MES-PI approach allows us to unambiguously quantify the impact of the GP on thermodynamic properties. While simpler approximations exist when only the ground electronic-state is considered, MES-PIMD is the most general and accurate approach applicable to real complex systems where the location and topology of CI seams are often not known a priori.

Keywords: Geometric Phase, Conical Intersections, Path Integral Molecular Dynamics, Multi-electronic-state Path Integral, Thermodynamic Properties, Imaginary-time, Nonadiabatic Effects, Winding Number

236. ❌ Computational Insights into PEMFC Durability: Degradation Mechanisms, Interfacial Chemistry, and the Emerging Role of Machine Learning Potentials

Authors: Jack Jon Hinsch, Kazushi Fujimoto Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.26022v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

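Scoring rationale: The paper focuses on computational modeling of proton exchange membrane fuel cell (PEMFC) durability, in particular degradation mechanisms, interfacial chemistry, and the use of machine learning potentials. Of the 27 keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some connection to the content, because the paper touches on machine learning applied to science (materials and chemistry). This is not its core focus, though: it mainly discusses conventional computational methods and emerging machine learning potentials, not large language models or deep learning techniques. The other 26 keywords concern large language models, deep learning principles, alignment, reasoning, agents, and similar topics, and are entirely unrelated to fuel cell computational modeling, so each is scored 0.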

!!! tip deepseek-chat TL;DR

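This review addresses the insufficient durability of proton exchange membrane fuel cells (PEMFCs). By integrating computational modeling methods such as density functional theory, molecular dynamics, and machine learning potentials, it describes multiscale degradation mechanisms (such as membrane degradation and platinum dissolution) and their coupled feedback loops, and proposes future directions for multiscale modeling and for applying machine learning potentials to electrified interfaces.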

Abstract

Proton exchange membrane fuel cells (PEMFCs) are a promising clean energy technology, offering high efficiency and near-zero operational emissions for stationary and automotive applications. However, their widespread adoption remains limited by insufficient durability, driven by the degradation of the catalyst layer and proton exchange membrane under realistic operating conditions. While the macroscopic consequences of degradation are well established experimentally, the atomistic and molecular mechanisms that initiate and propagate failure remain incompletely understood. This review synthesizes recent advances in computational modelling, spanning density functional theory, molecular dynamics, and emerging machine learning potentials, to examine how chemical, mechanical, electrochemical, and contamination-driven degradation mechanisms operate across multiple length and time scales. Key topics include radical-induced membrane degradation, platinum dissolution and carbon support corrosion, mechanical fatigue under electrical and hygrothermal cycling, and the impact of ionic and gaseous contaminants. A central finding is that these degradation pathways are not independent, but form strongly coupled feedback loops that no existing computational framework has been designed to capture simultaneously. Future directions are proposed, with emphasis on multiscale modelling frameworks and the application of machine learning interatomic potentials to the electrified interface.

Keywords: Proton exchange membrane fuel cells (PEMFCs), durability, degradation mechanisms, computational modelling, machine learning potentials, multiscale modelling, electrified interface, interfacial chemistry

237. ❌ Liquid structure adjacent to solid surfaces follows the superposition principle

Authors: Qian Ai, Haiyi Wu, Lalith Krishna Samanth Bonagiri, Kaustubh S. Panse, Shan Zhou, Fujia Zhao, Yitong Li, Kenneth S. Schweizer, Narayana R. Aluru, Yingjie Zhang Source: arXiv Published: 2026-03-27 arXiv link: http://arxiv.org/abs/2603.25992v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

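Scoring rationale: The paper studies the structure of solid-liquid interfaces, a topic in physical chemistry and materials science, using conventional experimental and computational methods such as 3D atomic force microscopy and molecular dynamics simulations. All keywords concern large models, deep learning, and related techniques (training methods, inference optimization, alignment, agents, and so on), none of which the paper addresses. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', but the paper uses no AI or machine learning methods and relies instead on a physics-based analytical model (SLS) and conventional simulation, so it receives only 5 points (some relevance: it belongs to scientific computing and simulation but is not AI-driven). All other keywords are scored 0 (entirely unrelated).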

!!! tip deepseek-chat TL;DR

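The study finds that liquid structure at solid-liquid interfaces follows the superposition principle. Using 3D atomic force microscopy experiments and the solid-liquid superposition (SLS) analytical model, it reveals universal liquid density oscillations and layer reconfiguration from the angstrom to the near-micron scale, and establishes a theoretical framework for predicting liquid structure near solid surfaces of arbitrary morphology.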

Abstract

Liquid structure at solid-liquid interfaces is critical for many natural and engineered processes ranging from biological signal transduction to electrochemical energy conversion. Advanced experimental and computational methods have provided insights into the structure of liquids adjacent to planar substrates at the nanoscale. However, realistic solid-liquid interfaces are inevitably inhomogeneous across multiple length scales, presenting a complexity that surpasses the capabilities of existing approaches. Here we bridge the complexity gap by discovering and utilizing a hitherto hidden principle of interfacial liquid: superposition. Experimentally, we use 3D atomic force microscopy (3D-AFM) to image the interfacial structure of a wide range of organic and aqueous solvents and electrolytes, uncovering universal liquid density oscillations and emergent liquid layer reconfigurations at heterogeneous substrate sites. We further develop an analytical model, coined solid-liquid superposition (SLS), which solves the interfacial liquid density distribution based on a key descriptor: the effective total correlation function (ETCF) between a liquid molecule and nearby solid atoms. SLS not only explains all the experimentally observed interfacial liquid distribution profiles from the angstrom to near-micron scale, but also predicts more precise atomic-scale interference patterns which are further corroborated by molecular dynamics (MD) simulations. This study unveils a key structural descriptor of interfacial liquids, and establishes a theoretical framework for rapidly and accurately predicting liquid structures adjacent to solid surfaces with arbitrary morphology and size scale.

Keywords: solid-liquid interfaces, 3D atomic force microscopy, superposition principle, liquid density oscillations, molecular dynamics simulations, interfacial liquid structure, heterogeneous substrates, effective total correlation function

Token usage statistics

Total: 751,026 tokens (input 516,644 / output 234,382)

Model	Input	Output	Total
deepseek-chat	422,774	234,382	657,156
glm-4.7	93,870	0	93,870
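As a quick consistency check, the per-model rows can be summed and compared with the reported totals. This is a minimal sketch: the numbers are copied from the table above, while the dictionary layout and the `summarize` helper are hypothetical and not part of the report generator.

```python
# Per-model token counts copied from the table above.
usage = {
    "deepseek-chat": {"input": 422_774, "output": 234_382},
    "glm-4.7": {"input": 93_870, "output": 0},
}


def summarize(usage: dict) -> dict:
    """Sum input and output tokens across all models."""
    total_in = sum(m["input"] for m in usage.values())
    total_out = sum(m["output"] for m in usage.values())
    return {"input": total_in, "output": total_out, "total": total_in + total_out}


totals = summarize(usage)
print(totals)
```

The sums agree with the stated totals: 516,644 input, 234,382 output, 751,026 overall.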

📋 All papers

1. ✅ Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification

2. ✅ Learning to Commit: Generating Organic Pull Requests via Online Repository Memory

3. ✅ Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

4. ✅ SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis

5. ✅ ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims

6. ✅ MA-Bench: Towards Fine-grained Micro-Action Understanding

7. ✅ From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs

8. ✅ Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR

9. ❌ Automated near-term quantum algorithm discovery for molecular ground states

10. ❌ Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference

11. ❌ Experimental study on surveillance video-based indoor occupancy measurement with occupant-centric control

12. ❌ PQuantML: A Tool for End-to-End Hardware-aware Model Compression

13. ❌ SPECTRA: An Efficient Spectral-Informed Neural Network for Sensor-Based Activity Recognition

14. ❌ ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs

15. ❌ EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching

16. ❌ SAFT: Sensitivity-Aware Filtering and Transmission for Adaptive 3D Point Cloud Communication over Wireless Channels

17. ❌ EcoFair: Trustworthy and Energy-Aware Routing for Privacy-Preserving Vertically Partitioned Medical Inference

18. ❌ Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

19. ❌ PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

20. ❌ Ruka-v2: Tendon Driven Open-Source Dexterous Hand with Wrist and Abduction for Robot Learning

21. ❌ Machine Learning Transferability for Malware Detection

22. ❌ Make Geometry Matter for Spatial Reasoning

23. ❌ Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling

24. ❌ Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow

25. ❌ Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

26. ❌ Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation

27. ❌ Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

28. ❌ Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

29. ❌ When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

30. ❌ How Open Must Language Models be to Enable Reliable Scientific Inference?

31. ❌ The Multi-AMR Buffer Storage, Retrieval, and Reshuffling Problem: Exact and Heuristic Approaches

32. ❌ JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems

33. ❌ CADSmith: Multi-Agent CAD Generation with Programmatic Geometric Validation

34. ❌ AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese

35. ❌ AIRA_2: Overcoming Bottlenecks in AI Research Agents

36. ❌ Foundation Model for Cardiac Time Series via Masked Latent Attention

37. ❌ UNIFERENCE: A Discrete Event Simulation Framework for Developing Distributed AI Models

38. ❌ A Boltzmann-machine-enhanced Transformer For DNA Sequence Classification

39. ❌ Neuro-Symbolic Process Anomaly Detection

40. ❌ Can AI Models Direct Each Other? Organizational Structure as a Probe into Training Limitations

41. ❌ CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities

42. ❌ KMM-CP: Practical Conformal Prediction under Covariate Shift via Selective Kernel Mean Matching

43. ❌ Why Models Know But Don’t Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models

44. ❌ Generative Modeling in Protein Design: Neural Representations, Conditional Generation, and Evaluation Standards

45. ❌ Generative Score Inference for Multimodal Data

46. ❌ CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law

47. ❌ Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation

48. ❌ PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management

49. ❌ Label-Free Cross-Task LoRA Merging with Null-Space Compression

50. ❌ Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy

51. ❌ findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

52. ❌ PhysVid: Physics Aware Local Conditioning for Generative Video Models

53. ❌ Knowdit: Agentic Smart Contract Vulnerability Detection with Auditing Knowledge Summarization

54. ❌ GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation

55. ❌ GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

56. ❌ Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models

57. ❌ ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction

58. ❌ Channelling, Coordinating, Collaborating: A Three-Layer Framework for Disability-Centered Human-Agent Collaboration

59. ❌ Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan

60. ❌ Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

61. ❌ Physics-Informed Neural Networks and Sequence Encoder: Application to heating and early cooling of thermo-stamping process

62. ❌ Automating Domain-Driven Design: Experience with a Prompting Framework

63. ❌ Clawed and Dangerous: Can We Trust Open Agentic Systems?

64. ❌ Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding

65. ❌ Sparse Auto-Encoders and Holism about Large Language Models