📊 ArXiv Research Report (2026-03-12)

Generated: 2026-03-12 23:19:15 Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
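
The passing score follows directly from the formula and the total weight of 27.0. A minimal sketch of that arithmetic; the function and parameter names (`passing_score`, `base`, `per_weight`) are illustrative assumptions, not names taken from the report generator:

```python
def passing_score(total_weight: float, base: float = 5.0, per_weight: float = 0.8) -> float:
    """Passing score = base + per_weight * total_weight, as stated in the scoring settings."""
    return base + per_weight * total_weight


# 27 keyword groups, each with weight 1.0, give a total weight of 27.0.
total_weight = 27 * 1.0
threshold = passing_score(total_weight)
print(round(threshold, 1))
```

Because the threshold grows with the total weight, adding keyword groups raises the bar a paper must clear.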

📈 Paper statistics

Total fetched: 63
Passing papers: 1 (1.6%)
In-depth analysis: 1

⭐ Detailed analysis of passing papers

1. A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification

Authors: Yichi Zhu, Kan Ling, Xu Liu, Hengrun Zhang, Huiqun Yu, Guisheng Fan Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10891v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

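Score rationale: The paper's core is applying LLMs to safety problems in prescription verification, so it bears directly on LLMs, Chain of Thought reasoning, hallucination mitigation, and AI for Science (bioinformatics/medical applications). It proposes the KB-grounded Chain of Verification (CoV) method, which is closely related to Chain of Thought (10 points), and aims to address the factual unreliability of LLMs (Hallucination Mitigation, 10 points). Retrieval-Augmented Generation (RAG) receives 5 points because the method retrieves from a knowledge base but does not explicitly use RAG terminology. System 2 Thinking and Explainable AI each receive 5 points because the paper emphasizes transparent reasoning and interpretability. Other keywords such as MoE, SFT, and RLHF are unrelated to the paper's technical details and receive 0 points.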
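
The total of 55.0 can be reproduced from the score table. A minimal sketch, assuming each row's score is weight × relevance (this matches every row of the table, but the generator's actual rule, including how the per-keyword maximum of 15 is applied, is not shown in the report); the dictionary keys are shortened labels chosen for illustration:

```python
# (weight, relevance out of 10) for the keyword groups with nonzero relevance;
# every other keyword group has relevance 0.0 and adds nothing to the total.
nonzero_rows = {
    "Large Language Models": (1.0, 10.0),
    "Retrieval-Augmented Generation": (1.0, 5.0),
    "Chain of Thought": (1.0, 10.0),
    "System 2 Thinking": (1.0, 5.0),
    "Hallucination Mitigation": (1.0, 10.0),
    "Explainable AI": (1.0, 5.0),
    "AI for Science": (1.0, 10.0),
}


def total_score(rows):
    """Sum weight * relevance over all rows (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows.values())


score = total_score(nonzero_rows)
passed = score >= 26.6  # passing score from the scoring settings
print(score, passed)
```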

!!! tip deepseek-chat TL;DR

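To address the factual unreliability, lack of traceability, and weak complex reasoning of LLMs in prescription verification, the paper proposes the PharmGraph-Auditor system. By building a hybrid pharmaceutical knowledge base and a KB-grounded Chain of Verification method, it supports safe, traceable prescription auditing; the experiments indicate that it can help pharmacists verify prescriptions more safely and quickly.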

Abstract

Medication errors pose a significant threat to patient safety, making pharmacist verification (PV) a critical, yet heavily burdened, final safeguard. The direct application of Large Language Models (LLMs) to this zero-tolerance domain is untenable due to their inherent factual unreliability, lack of traceability, and weakness in complex reasoning. To address these challenges, we introduce PharmGraph-Auditor, a novel system designed for safe and evidence-grounded prescription auditing. The core of our system is a trustworthy Hybrid Pharmaceutical Knowledge Base (HPKB), implemented under the Virtual Knowledge Graph (VKG) paradigm. This architecture strategically unifies a relational component for set constraint satisfaction and a graph component for topological reasoning via a rigorous mapping layer. To construct this HPKB, we propose the Iterative Schema Refinement (ISR) algorithm, a framework that enables the co-evolution of both graph and relational schemas from medical texts. For auditing, we introduce the KB-grounded Chain of Verification (CoV), a new reasoning paradigm that transforms the LLM from an unreliable generator into a transparent reasoning engine. CoV decomposes the audit task into a sequence of verifiable queries against the HPKB, generating hybrid query plans to retrieve evidence from the most appropriate data store. Experimental results demonstrate robust knowledge extraction capabilities and show promises of using PharmGraph-Auditor to enable pharmacists to achieve safer and faster prescription verification.

Keywords: Large Language Models, Prescription Verification, Chain of Verification, Hybrid Pharmaceutical Knowledge Base, Hallucination Mitigation, AI for Healthcare, Knowledge-grounded Reasoning, PharmGraph-Auditor

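In-depth analysis:

A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification

Summary:

Medication errors threaten patient safety, and manual pharmacist verification is a heavy burden. In response, the paper proposes the PharmGraph-Auditor system, which targets the hallucination, lack of traceability, and weak complex reasoning that arise when LLMs are applied directly to prescription verification. Its core is a Hybrid Pharmaceutical Knowledge Base (HPKB) built under the Virtual Knowledge Graph paradigm, combining a relational component for numerical constraints with a graph component for semantic topology. Construction uses the Iterative Schema Refinement (ISR) algorithm; auditing uses a KB-grounded Chain of Verification that turns the LLM into a transparent reasoning engine. Experiments show that on real prescription data the system outperforms a traditional rule-based system, improving F1 by 13.4%, and makes prescription verification safer and more efficient.

Innovations:

Proposes the PharmGraph-Auditor system, which combines the strengths of relational and graph databases to build a Hybrid Pharmaceutical Knowledge Base (HPKB) that supports both strict numerical auditing and complex semantic reasoning.
Designs the Iterative Schema Refinement (ISR) algorithm, which uses human-machine collaboration to dynamically evolve and refine the hybrid schema from medical texts, addressing the heterogeneity of domain knowledge.
Introduces the KB-grounded Chain of Verification reasoning paradigm, which turns the LLM from an unreliable generator into a transparent reasoning engine that verifies prescriptions by decomposing the task and generating hybrid query plans.
Proposes the Patient Profile-driven Evidence Selection Tree (P-EST), which prunes irrelevant rules and explicitly flags missing information, prioritizing safety over hallucinated conclusions.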
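
The last innovation, pruning rules by patient profile while flagging missing information, can be illustrated with a small sketch. Everything here is a hypothetical illustration: the rule names, attributes, and thresholds are invented placeholders (not clinical guidance), and the flat filter below stands in for the tree structure that P-EST uses, which this report does not detail.

```python
# Hypothetical rules: each names the patient attribute it needs and a predicate on it.
RULES = [
    {"id": "renal_dose", "needs": "egfr", "applies": lambda v: v < 60},
    {"id": "pediatric_dose", "needs": "age", "applies": lambda v: v < 18},
    {"id": "pregnancy_contra", "needs": "pregnant", "applies": lambda v: v is True},
]


def select_evidence(profile: dict) -> dict:
    """Split rules into selected, pruned, and missing-information buckets."""
    selected, pruned, missing = [], [], []
    for rule in RULES:
        value = profile.get(rule["needs"])
        if value is None:
            # Unknown attribute: flag the gap instead of guessing a conclusion.
            missing.append(rule["id"])
        elif rule["applies"](value):
            selected.append(rule["id"])
        else:
            pruned.append(rule["id"])
    return {"selected": selected, "pruned": pruned, "missing": missing}


result = select_evidence({"age": 70, "egfr": 45})
print(result)
```

Keeping a separate `missing` bucket is the point of the sketch: a rule whose input is unknown is neither applied nor silently dropped.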

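Method

!!! info

The paper takes a hybrid-architecture approach. It first defines a theoretical model under the Virtual Knowledge Graph (VKG) paradigm, dividing knowledge into a relational component (for set constraints) and a graph component (for topological traversal). In the construction phase, it uses a stratified sampling strategy and a Section-Aware Multi-Agent framework, combining the LLM's gap-detection ability with experts' semantic abstraction, and builds a trustworthy HPKB through the ISR algorithm. In the application phase, it uses the KB-grounded Chain of Verification to decompose the audit task into verifiable subtasks, executes hybrid queries, and filters evidence with P-EST.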
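
To make the hybrid-query idea concrete, here is a minimal sketch under stated assumptions: an in-memory SQLite table stands in for the relational component and a plain adjacency dictionary stands in for the graph component. The schema, the placeholder drug names, and the dose values are all hypothetical, and the decomposition into checks is hard-coded here, whereas the paper has the LLM plan the queries.

```python
import sqlite3

# Relational component (hypothetical schema and values): numeric dose limits.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dose_limit (drug TEXT PRIMARY KEY, max_daily_mg REAL)")
conn.executemany("INSERT INTO dose_limit VALUES (?, ?)", [("drug_a", 100.0), ("drug_b", 40.0)])

# Graph component (hypothetical edges): drug -> drugs it interacts with.
INTERACTS = {"drug_a": {"drug_c"}, "drug_c": {"drug_a"}}


def check_dose(drug, daily_mg):
    """Verifiable step 1: compare the prescribed dose with the stored limit."""
    row = conn.execute("SELECT max_daily_mg FROM dose_limit WHERE drug = ?", (drug,)).fetchone()
    if row is None:
        # No evidence in the knowledge base: report the gap rather than a verdict.
        return {"check": "dose", "drug": drug, "status": "no_evidence"}
    status = "ok" if daily_mg <= row[0] else "violation"
    return {"check": "dose", "drug": drug, "status": status, "evidence": f"max_daily_mg={row[0]}"}


def check_interactions(drugs):
    """Verifiable step 2: look for interaction edges between prescribed drugs."""
    findings = []
    for a in drugs:
        for b in drugs:
            if a < b and b in INTERACTS.get(a, set()):
                findings.append({"check": "interaction", "pair": (a, b), "status": "violation"})
    return findings


def audit(prescription):
    """prescription maps drug name -> daily dose in mg; returns the list of steps."""
    steps = [check_dose(drug, mg) for drug, mg in prescription.items()]
    steps.extend(check_interactions(sorted(prescription)))
    return steps


report = audit({"drug_a": 150.0, "drug_c": 10.0})
for step in report:
    print(step)
```

Each step carries its own evidence or an explicit `no_evidence` status, which is what makes the resulting audit trail traceable.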

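Key results:

On a real-world inpatient prescription dataset, PharmGraph-Auditor improves F1 by 13.4% over a traditional rule-based CDSS.
The system significantly exceeds the recall of human experts while maintaining high precision, which helps reduce pharmacist alert fatigue.
The results support the hybrid architecture's effectiveness in balancing safety and efficiency; it handles complex drug-interaction and dose-limit checks.

Tech stack: Iterative Schema Refinement (ISR) algorithm, Chain of Verification (CoV) reasoning paradigm, Patient Profile-driven Evidence Selection Tree (P-EST), Virtual Knowledge Graph (VKG) architecture, Hybrid Pharmaceutical Knowledge Base (HPKB), relational database (RDBMS) and Labeled Property Graph, B-Tree indexing and index-free adjacency

Strengths

High safety: the hybrid knowledge base and verification-chain mechanism mitigate LLM hallucination, in line with the zero-tolerance requirements of the medical domain.
Strong traceability: every audit conclusion rests on verifiable knowledge-base queries, consistent with evidence-based medicine, and the reasoning process is transparent.
Architectural innovation: combines the efficient numerical queries of relational databases with the multi-hop reasoning of graph databases, overcoming the limits of a single data model.
Human-machine collaboration: using the LLM to assist experts during knowledge-base construction improves efficiency while preserving accuracy.

Limitations

Dependence on the initial schema: although ISR can evolve the schema, the quality of the initial “seed schema” may affect later construction efficiency.
System complexity: building and maintaining the hybrid knowledge base (the mapping layer in particular) may be more complex than a single-store system.
Domain transfer cost: although the architecture is adaptable, the specific ISR algorithm and stratification strategy are tailored to the pharmaceutical domain, so moving to other complex domains may require redesign.
Data dependence: the system flags information gaps when patient data is missing, which preserves safety but may leave it unable to reach a conclusion.

Relevance to the research focus:

The paper closely matches the research keywords. It is an in-depth application of LLMs in biomedicine and proposes a novel solution to the concrete pain point of prescription verification. Technically, rather than applying an LLM off the shelf, it proposes the Hybrid Pharmaceutical Knowledge Base (HPKB) and the KB-grounded Chain of Verification, a frontier approach that combines LLMs with symbolic knowledge (knowledge graphs) to address LLMs' factual unreliability and reasoning weaknesses. The paper is innovative in both application scenario and technical principle, which meets the criteria for a high score.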

📋 All papers

1. ✅ A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification

Authors: Yichi Zhu, Kan Ling, Xu Liu, Hengrun Zhang, Huiqun Yu, Guisheng Fan Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10891v1

Score: 55.0 / 26.6 ✅

2. ❌ A Systematic Study of Pseudo-Relevance Feedback with LLMs

Authors: Nour Jedidi, Jimmy Lin Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.11008v1

Score: 18.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	8.0/10	8.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

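Score rationale: The paper's core topic is pseudo-relevance feedback (PRF) methods built on large language models (LLMs), so it is highly relevant to the keyword "Large Language Models" (10 points). It touches on techniques related to Retrieval-Augmented Generation (RAG): PRF is a query-expansion method in information retrieval and is conceptually connected to RAG, but the paper does not study RAG systems directly, hence 8 points. The other keywords, such as MoE, SLMs, training methods, inference optimization, and AI for Science, are not covered in the paper and all receive 0 points.

!!! tip deepseek-chat TL;DR

Through systematic experiments, the paper studies how two design dimensions of LLM-based pseudo-relevance feedback, the feedback source and the feedback model, affect retrieval effectiveness. It finds that the choice of feedback model can be critical, that using only LLM-generated text as the feedback source is the most cost-effective option, and that corpus-derived feedback is most effective with a strong first-stage retriever.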

Abstract

Pseudo-relevance feedback (PRF) methods built on large language models (LLMs) can be organized along two key design dimensions: the feedback source, which is where the feedback text is derived from and the feedback model, which is how the given feedback text is used to refine the query representation. However, the independent role that each dimension plays is unclear, as both are often entangled in empirical evaluations. In this paper, we address this gap by systematically studying how the choice of feedback source and feedback model impact PRF effectiveness through controlled experimentation. Across 13 low-resource BEIR tasks with five LLM PRF methods, our results show: (1) the choice of feedback model can play a critical role in PRF effectiveness; (2) feedback derived solely from LLM-generated text provides the most cost-effective solution; and (3) feedback derived from the corpus is most beneficial when utilizing candidate documents from a strong first-stage retriever. Together, our findings provide a better understanding of which elements in the PRF design space are most important.

Keywords: Pseudo-relevance feedback, Large Language Models, LLMs, Information retrieval, Query expansion, Feedback source, Feedback model, BEIR benchmark
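
The two design dimensions named in the abstract can be separated in code. This is a minimal sketch that assumes a simple term-frequency query expansion as the feedback model; that choice is a classic PRF baseline picked for illustration and is not one of the five methods the paper evaluates. The example texts, the stopword list, and the hard-coded "LLM-generated" passage are all made up.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "for", "by"}


def tokenize(text):
    """Lowercase whitespace tokenization with a tiny illustrative stopword list."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def expand_query(query, feedback_texts, num_terms=3):
    """Feedback model (assumed): append the most frequent new terms from the feedback text."""
    query_terms = tokenize(query)
    counts = Counter(
        term
        for text in feedback_texts
        for term in tokenize(text)
        if term not in query_terms
    )
    new_terms = [term for term, _ in counts.most_common(num_terms)]
    return query_terms + new_terms


# Feedback source A (assumed): top-ranked corpus documents from a first-stage retriever.
corpus_feedback = [
    "solar panels convert sunlight to electricity",
    "photovoltaic panels and sunlight efficiency",
]
# Feedback source B (assumed): text an LLM might write for the query, hard-coded here.
llm_feedback = ["solar energy is generated by photovoltaic cells"]

expanded = expand_query("solar energy", corpus_feedback, num_terms=2)
expanded_llm = expand_query("solar energy", llm_feedback, num_terms=2)
print(expanded)
print(expanded_llm)
```

The same `expand_query` function serves both sources, which mirrors the controlled setup the abstract describes: the feedback model stays fixed while the feedback source varies.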

3. ❌ TOSSS: a CVE-based Software Security Benchmark for Large Language Models

Authors: Marc Damie, Murat Bilgehan Ertan, Domenico Essoussi, Angela Makhanu, Gaëtan Peter, Roos Wensveen Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10969v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

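Score rationale: The paper's core is evaluating LLMs in software security, so it is highly relevant to the keyword "Large Language Models" (10 points). It does not involve the specific techniques behind the other keywords (MoE, SLMs, training methods, reasoning techniques, agent systems, and so on), nor AI applications in science, so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies the capabilities of LLMs in software security and proposes TOSSS, a benchmark based on the CVE database. It evaluates 14 models on choosing the secure code snippet, with scores ranging from 0.48 to 0.89.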

Abstract

With their increasing capabilities, Large Language Models (LLMs) are now used across many industries. They have become useful tools for software engineers and support a wide range of development tasks. As LLMs are increasingly used in software development workflows, a critical question arises: are LLMs good at software security? At the same time, organizations worldwide invest heavily in cybersecurity to reduce exposure to disruptive attacks. The integration of LLMs into software engineering workflows may introduce new vulnerabilities and weaken existing security efforts. We introduce TOSSS (Two-Option Secure Snippet Selection), a benchmark that measures the ability of LLMs to choose between secure and vulnerable code snippets. Existing security benchmarks for LLMs cover only a limited range of vulnerabilities. In contrast, TOSSS relies on the CVE database and provides an extensible framework that can integrate newly disclosed vulnerabilities over time. Our benchmark gives each model a security score between 0 and 1 based on its behavior; a score of 1 indicates that the model always selects the secure snippet, while a score of 0 indicates that it always selects the vulnerable one. We evaluate 14 widely used open-source and closed-source models on C/C++ and Java code and observe scores ranging from 0.48 to 0.89. LLM providers already publish many benchmark scores for their models, and TOSSS could become a complementary security-focused score to include in these reports.

Keywords: Large Language Models, software security, benchmark, CVE database, vulnerable code, security score, TOSSS, code snippets
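
The abstract fixes the endpoints of the security score (1 when the secure snippet is always chosen, 0 when the vulnerable one always is) but not the exact formula. A minimal sketch, assuming the score is the plain fraction of trials in which the model picks the secure snippet; the function name and the sample data are illustrative:

```python
def security_score(choices):
    """choices: list of booleans, True when the model picked the secure snippet."""
    if not choices:
        raise ValueError("at least one trial is required")
    return sum(choices) / len(choices)


# Hypothetical outcomes for four snippet pairs: three secure picks, one vulnerable pick.
trials = [True, True, True, False]
print(security_score(trials))
```

Under this assumption, a model guessing at random between the two options would score about 0.5, which gives context for the low end of the reported 0.48 to 0.89 range.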

4. ❌ COMIC: Agentic Sketch Comedy Generation

Authors: Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.11048v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

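Score rationale: The paper proposes an automated comedy video generation framework based on a multi-agent system and explicitly uses LLMs as critics to evaluate humor, so it is highly relevant to "Large Language Models", "LLM Agents", and "Multi-agent Systems" (10 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The study proposes an automated multi-agent framework for generating short comedy videos in the style of Saturday Night Live, introducing LLM-based critics to evaluate humor; experiments show that the framework produces videos approaching professional quality.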

Abstract

We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.

Keywords: AI system, comedic videos, sketch shows, population of agents, LLM critics, humor evaluation, video generation, state-of-the-art performance

5. ❌ LiTo: Surface Light Field Tokenization

Authors: Jen-Hao Rick Chang, Xiaoming Zhao, Dorian Chan, Oncel Tuzel Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.11047v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

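Score rationale: The paper "LiTo: Surface Light Field Tokenization" focuses on computer vision and 3D reconstruction, proposing a 3D latent representation that jointly models object geometry and view-dependent appearance. Although it uses deep learning techniques (such as a latent flow matching model), its core content is unrelated to all scoring keywords, which center on large language models, innovations in deep learning technique, and their scientific applications. It does not involve language models, model training techniques, reasoning methods, alignment techniques, agent systems, or specific AI for Science applications.

!!! tip deepseek-chat TL;DR

The paper proposes a 3D latent representation based on surface light field tokenization that can generate, from a single input image, 3D objects with both geometry and view-dependent appearance (such as specular highlights and Fresnel reflections); experiments show better visual quality and input fidelity than existing methods.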

Abstract

We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.

Keywords: 3D latent representation, surface light field, view-dependent appearance, geometry reconstruction, latent flow matching, specular highlights, Fresnel reflections, single image 3D generation

6. ❌ Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation

Authors: Tao Zhong, Yixun Hu, Dongzhe Zheng, Aditya Sood, Christine Allen-Blanchette Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.11045v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

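Score rationale: The NeFTY paper focuses on a physics framework for thermal tomography using neural fields. It is an application of AI in science (specifically heat conduction and material defect detection), so it has only moderate relevance to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (5 points): it applies AI to physical science, not to bioinformatics or cheminformatics. All other keywords concern large language models, training techniques, reasoning methods, agent systems, and so on, and are unrelated to the paper's physics simulation and inverse-problem topic, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes NeFTY, a neural field thermal tomography framework built on a differentiable physics solver, which quantitatively reconstructs a material's 3D diffusivity field from transient surface temperature measurements and significantly improves the accuracy of subsurface defect localization.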

Abstract

We propose Neural Field Thermal Tomography (NeFTY), a differentiable physics framework for the quantitative 3D reconstruction of material properties from transient surface temperature measurements. While traditional thermography relies on pixel-wise 1D approximations that neglect lateral diffusion, and soft-constrained Physics-Informed Neural Networks (PINNs) often fail in transient diffusion scenarios due to gradient stiffness, NeFTY parameterizes the 3D diffusivity field as a continuous neural field optimized through a rigorous numerical solver. By leveraging a differentiable physics solver, our approach enforces thermodynamic laws as hard constraints while maintaining the memory efficiency required for high-resolution 3D tomography. Our discretize-then-optimize paradigm effectively mitigates the spectral bias and ill-posedness inherent in inverse heat conduction, enabling the recovery of subsurface defects at arbitrary scales. Experimental validation on synthetic data demonstrates that NeFTY significantly improves the accuracy of subsurface defect localization over baselines. Additional details at https://cab-lab-princeton.github.io/nefty/

Keywords: Neural Field Thermal Tomography, differentiable physics, 3D reconstruction, material properties, transient surface temperature, inverse heat conduction, subsurface defect localization, NeFTY
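
The discretize-then-optimize idea in the abstract can be illustrated on a much smaller problem. The sketch below is not NeFTY: it assumes a 1D rod, a single scalar diffusivity instead of a 3D neural field, an explicit finite-difference solver, and hand-written forward sensitivities with a Gauss-Newton update in place of automatic differentiation. The function names, grid size, time step, and sensor position are all illustrative assumptions.

```python
import math


def simulate(alpha, nx=21, steps=200, dt=1e-3, sensor=10):
    """Explicit finite-difference solve of u_t = alpha * u_xx on [0, 1].

    Returns the temperature trace at one sensor node and its exact
    derivative with respect to alpha (forward sensitivity of the scheme).
    """
    dx = 1.0 / (nx - 1)
    c = dt / dx ** 2
    u = [math.sin(math.pi * i * dx) for i in range(nx)]  # initial temperature
    s = [0.0] * nx                                       # du/dalpha, zero at t = 0
    trace, sens = [], []
    for _ in range(steps):
        lap_u = [0.0] * nx
        lap_s = [0.0] * nx
        for i in range(1, nx - 1):  # boundary nodes stay fixed
            lap_u[i] = u[i - 1] - 2.0 * u[i] + u[i + 1]
            lap_s[i] = s[i - 1] - 2.0 * s[i] + s[i + 1]
        # Differentiate the update u += alpha * c * lap_u with respect to alpha.
        s = [s[i] + c * (lap_u[i] + alpha * lap_s[i]) for i in range(nx)]
        u = [u[i] + alpha * c * lap_u[i] for i in range(nx)]
        trace.append(u[sensor])
        sens.append(s[sensor])
    return trace, sens


def recover_diffusivity(data, alpha0=0.2, iters=20):
    """Gauss-Newton fit of a scalar diffusivity to a sensor trace."""
    alpha = alpha0
    for _ in range(iters):
        trace, sens = simulate(alpha)
        num = sum((t - d) * g for t, d, g in zip(trace, data, sens))
        den = sum(g * g for g in sens)
        alpha -= num / den
    return alpha


true_alpha = 0.5                      # hypothetical ground-truth diffusivity
observed, _ = simulate(true_alpha)    # synthetic measurements
estimate = recover_diffusivity(observed)
print(f"recovered diffusivity: {estimate:.6f}")
```

Because the sensitivity recurrence differentiates the discrete update itself, the gradient is exact for the discretized model, which is the property a discretize-then-optimize approach relies on.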

7. ❌ V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Authors: Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.11042v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

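Scoring rationale: The paper studies video-to-music generation. Its core contribution is computing event curves from intra-modal similarity to capture temporal structure shared across modalities, which yields time-aligned video-to-music generation without paired data. It is unrelated to most of the large-model keywords and has only some relevance (5 points each) to 'Pre-training' and 'Post-training', because it uses pretrained music and video encoders and fine-tunes a text-to-music model. No other keyword applies (0 points).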
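
The 0.0 / 26.6 verdict can be reproduced from the table. A minimal sketch, assuming each row's score is weight × relevance, the paper's total is the sum of the row scores, and a paper passes when the total reaches the pass mark. The pass-mark formula (5.0 + 0.8 × total weight, with total weight 27.0) is the one stated in the report's configuration section; the function names are hypothetical.

```python
def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum of per-keyword scores, assuming score = weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)


# (weight, relevance) pairs as listed for V2M-Zero: 25 rows at 0.0/10 and
# two rows (Pre-training, Post-training) at 5.0/10, all with weight 0.0.
rows = [(0.0, 0.0)] * 25 + [(0.0, 5.0)] * 2

threshold = pass_mark(27.0)
total = paper_score(rows)
passed = total >= threshold  # ">=" is an assumption; the report does not say

print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

With every weight listed as 0.0, the total is 0.0 whatever the relevance values are, which is why a paper with nonzero relevance still fails under this reading of the table.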

!!! tip deepseek-chat TL;DR

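The paper proposes V2M-Zero, a zero-pair video-to-music generation method that captures shared temporal structure through intra-modal event curves and produces time-aligned music that outperforms paired-data baselines.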

Abstract (Chinese translation)

为视频生成时间同步的音乐对现有文本到音乐模型具有挑战性，这些模型缺乏细粒度的时间控制。我们提出了V2M-Zero，一种零配对视频到音乐生成方法，可为视频输出时间同步的音乐。我们的方法基于一个关键观察：时间同步需要匹配变化发生的时间和程度，而非变化的具体内容。尽管音乐和视觉事件在语义上不同，但它们展现出共享的时间结构，这种结构可以在各自模态内独立捕获。我们通过使用预训练的音乐和视频编码器，从模态内相似性计算事件曲线来捕获这种结构。通过独立测量每个模态内的时间变化，这些曲线提供了跨模态的可比表征。这使得一种简单的训练策略成为可能：在音乐事件曲线上微调文本到音乐模型，然后在推理时替换为视频事件曲线，无需跨模态训练或配对数据。在OES-Pub、MovieGenBench-Music和AIST++数据集上，V2M-Zero相比基于配对数据的基线方法取得了显著提升：音频质量提高5-21%，语义对齐提升13-15%，时间同步性改善21-52%，舞蹈视频的节拍对齐度提高28%。通过大规模众包主观听力测试，我们得到了相似的结果。总体而言，我们的结果验证了通过模态内特征（而非配对的跨模态监督）实现时间对齐，对于视频到音乐生成是有效的。结果可见于https://genjib.github.io/v2m_zero/。

Abstract

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/

Keywords: video-to-music generation, temporal alignment, zero-pair learning, event curves, intra-modal similarity, pretrained encoders, fine-tuning, temporal synchronization
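
The event-curve idea in the abstract can be made concrete with a toy example. A minimal sketch, assuming an event curve is one minus the cosine similarity between consecutive frame embeddings; the paper's exact definition and encoders are not given here, and the embeddings below are made-up vectors standing in for encoder outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def event_curve(embeddings):
    """Amount of change between consecutive frames, measured within one modality."""
    return [
        1.0 - cosine_similarity(embeddings[i - 1], embeddings[i])
        for i in range(1, len(embeddings))
    ]


def peak_index(curve):
    """Index of the largest change in the curve."""
    return max(range(len(curve)), key=curve.__getitem__)


# Hypothetical embeddings: the video lives in 3-D, the music in 2-D.
video = [[1, 0, 0], [1, 0, 0], [1, 0.1, 0], [0, 1, 0], [0, 1, 0]]
music = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]

video_curve = event_curve(video)
music_curve = event_curve(music)
video_peak = peak_index(video_curve)
music_peak = peak_index(music_curve)
print(video_peak, music_peak)
```

Both curves peak at the same step even though the embeddings live in different spaces, which is what lets a video curve stand in for a music curve at inference time.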

8. ❌ Instruction set for the representation of graphs

Authors: Ezequiel Lopez-Rubio, Mario Pascual-Gonzalez
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.11039v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

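Scoring rationale: The IsalGraph paper proposes a method for encoding graph structure as compact strings, which belongs to graph representation learning. Although the abstract notes that the encoding is language-model-compatible, the paper's core is the graph encoding algorithm itself; it does not address the technical principles, training methods, inference optimization, alignment, or applications of large models or deep learning. Every scoring keyword targets large-model or deep-learning techniques, whereas this is purely graph-theoretic and algorithmic work, so every keyword has a relevance of 0.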

!!! tip deepseek-chat TL;DR

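The paper proposes IsalGraph, a method that encodes the structure of any finite simple graph as a compact string. It shows that the edit distance between these strings correlates strongly with graph edit distance, yielding a sequential encoding suited to graph similarity search, graph generation, and graph-conditioned language modelling.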

Abstract (Chinese translation)

本文提出IsalGraph方法，可将任意有限简单图的结构表示为基于九字符指令字母表的紧凑字符串。该编码由一台小型虚拟机执行，该虚拟机包含一个稀疏图、一个存储图节点引用的循环双向链表（CDLL, circular doubly-linked list）以及两个遍历指针。指令功能包括移动指针遍历CDLL，或向图中插入节点或边。该方法的关键设计特性是：字母表上的任意字符串均可解码为有效图，且不会进入无效状态。一种贪婪的GraphToString算法可在节点数多项式时间内将任意连通图编码为字符串；其穷举回溯变体通过在所有起始节点和所有有效遍历顺序中选择字典序最小的最短字符串，生成规范字符串。我们在五个真实世界图基准数据集（IAM Letter LOW/MED/HIGH、LINUX和AIDS）上评估该表示方法，结果表明IsalGraph字符串间的莱文斯坦距离与图编辑距离（GED, graph edit distance）高度相关。这些特性共同使IsalGraph字符串成为一种紧凑的、同构不变的、且与语言模型兼容的图结构序列编码，可直接应用于图相似性搜索、图生成和图条件语言建模。

Abstract

We present IsalGraph, a method for representing the structure of any finite, simple graph as a compact string over a nine-character instruction alphabet. The encoding is executed by a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL) of graph-node references, and two traversal pointers. Instructions either move a pointer through the CDLL or insert a node or edge into the graph. A key design property is that every string over the alphabet decodes to a valid graph, with no invalid states reachable. A greedy GraphToString algorithm encodes any connected graph into a string in time polynomial in the number of nodes; an exhaustive-backtracking variant produces a canonical string by selecting the lexicographically smallest shortest string across all starting nodes and all valid traversal orders. We evaluate the representation on five real-world graph benchmark datasets (IAM Letter LOW/MED/HIGH, LINUX, and AIDS) and show that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling.

Keywords: graph representation, instruction set, compact string encoding, graph isomorphism, graph edit distance, language-model-compatible, graph similarity search, graph generation
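
The evaluation in the abstract compares strings by Levenshtein distance. For reference, a standard dynamic-programming implementation is sketched below. The example strings are arbitrary placeholders, not real IsalGraph instruction strings, since the nine-character alphabet is not listed here.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# Placeholder strings standing in for two encoded graphs.
distance = levenshtein("AABCA", "AABDA")
classic = levenshtein("kitten", "sitting")
print(distance, classic)
```

Keeping only two rows of the table makes memory use linear in the length of the second string, which matters when many string pairs are compared in a similarity search.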

9. ❌ Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style

Authors: Marvin Limpijankit, Milad Alshomary, Yassin Oulad Daoud, Amith Ananthram, Tim Trombley, Elias Stengel-Eskin, Mohit Bansal, Noam M. Elcott, Kathleen McKeown
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.11024v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

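Scoring rationale: The paper studies the mechanisms by which vision language models (VLMs) recognize artistic style and compares them with the criteria art historians use; it is an application of AI to the art domain. It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, and so on). It has some relevance (5 points) to 'Mechanistic Interpretability OR Explainable AI' because it involves interpretability analysis of the model, and some relevance (5 points) to 'AI for Science OR Bioinformatics OR Cheminformatics' because art analysis can be viewed as an application of AI to the humanities, although it is not core bioinformatics or cheminformatics. The remaining keywords have no direct connection and score 0.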

!!! tip deepseek-chat TL;DR

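Through an interdisciplinary collaboration, the paper studies how vision language models recognize artistic style. Art historians judged 73% of the extracted concepts to exhibit a coherent visual feature and 90% of the concepts used for style prediction to be relevant, revealing where the models and human experts agree and differ in art analysis.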

Abstract (Chinese translation)

视觉语言模型（VLMs）在一系列计算机视觉任务中（如视觉问答和物体检测）已展现出日益精熟的能力。这包括在艺术领域不断增强的潜力，从艺术品分析到艺术创作。在计算机科学家与艺术史学者的跨学科合作中，我们系统探究了视觉语言模型预测艺术风格的内在机制，并评估其与艺术史学者推理艺术风格所用标准的契合程度。我们采用潜在空间分解方法，识别驱动艺术风格预测的概念要素，并通过定量评估、因果分析和艺术史学者的专业评判进行验证。研究结果表明，提取出的概念中有73%被艺术史学者判定为具有连贯且语义明确的视觉特征，而在预测特定艺术品风格时使用的概念中，90%被认定为相关。对于少数不相关概念却能成功预测风格的情况，艺术史学者指出了其可能有效的缘由：例如，模型可能以更形式化的方式“理解”某些概念（如明暗对比）。

Abstract

VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs’ ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature and 90% of concepts used to predict style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might “understand” a concept in more formal terms, such as dark/light contrasts.

Keywords: Vision Language Models, Artistic Style Recognition, Art History, Interpretability, Latent-space Decomposition, Concept Analysis, Interdisciplinary Collaboration, Model Evaluation

10. ❌ Artificial Intelligence as a Catalyst for Innovation in Software Engineering

Authors: Carlos Alberto Fernández-y-Fernández, Jorge R. Aguilar-Cisneros
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.10994v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

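Scoring rationale: The paper discusses applications of AI in software engineering, mainly involving machine learning and natural language processing, but does not specifically address large-model principles, training methods, inference optimization, alignment techniques, model compression, or AI-for-science applications. All keywords target large-model techniques or specific scientific domains, whereas this paper is an overview of AI applied to software engineering and has no direct connection to them.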

!!! tip deepseek-chat TL;DR

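The paper examines how AI can act as a catalyst for agility and innovation in software engineering. Drawing on a literature review and an empirical survey, it finds that AI-driven tools (particularly ML and NLP) can automate tedious tasks and optimize Agile practices.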

Abstract (Chinese translation)

现代软件需求的快速演进与内在复杂性，要求开发方法具备高度的灵活性与响应能力。尽管敏捷框架已成为业界优先迭代、协作与适应性的标准，软件开发团队在管理持续变化的需求、并在紧迫时限下保持产品质量方面，仍面临持续挑战。本文探讨了人工智能（Artificial Intelligence, AI）与软件工程（Software Engineering, SE）的交叉领域，旨在分析AI如何作为提升敏捷性与促进创新的强大催化剂。本研究综合了现有文献综述与实证研究，通过对软件工程专业人士进行问卷调查，评估了AI驱动工具的认知度、采纳度及其影响。关键发现表明，AI（特别是通过机器学习（Machine Learning, ML）和自然语言处理（Natural Language Processing, NLP））的集成，促进了从需求管理到代码生成与测试等一系列繁琐任务的自动化。本文论证了AI不仅优化了当前的敏捷实践，还引入了新的能力，这些能力对于在未来软件开发格局中持续保持质量、速度与创新至关重要。

Abstract

The rapid evolution and inherent complexity of modern software requirements demand highly flexible and responsive development methodologies. While Agile frameworks have become the industry standard for prioritizing iteration, collaboration, and adaptability, software development teams continue to face persistent challenges in managing constantly evolving requirements and maintaining product quality under tight deadlines. This article explores the intersection of Artificial Intelligence (AI) and Software Engineering (SE), to analyze how AI serves as a powerful catalyst for enhancing agility and fostering innovation. The research combines a comprehensive review of existing literature with an empirical study, utilizing a survey directed at Software Engineering professionals to assess the perception, adoption, and impact of AI-driven tools. Key findings reveal that the integration of AI (specifically through Machine Learning (ML) and Natural Language Processing (NLP)) facilitates the automation of tedious tasks, from requirement management to code generation and testing. This paper demonstrates that AI not only optimizes current Agile practices but also introduces new capabilities essential for sustaining quality, speed, and innovation in the future landscape of software development.

Keywords: Artificial Intelligence, Software Engineering, Agile frameworks, Machine Learning, Natural Language Processing, automation, code generation, innovation

11. ❌ RCTs & Human Uplift Studies: Methodological Challenges and Practical Solutions for Frontier AI Evaluation

Authors: Patricia Paskov, Kevin Wei, Shen Zhou Hong, Dan Bateyko, Xavier Roberts-Gaal, Carson Ezell, Gailius Praninskas, Valerie Chen, Umang Bhatt, Ella Guest
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.11001v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

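Scoring rationale: The paper examines the methodological challenges of human uplift studies in evaluating frontier AI systems, in particular the problems of applying randomized controlled trial (RCT) methodology in domains such as biosecurity, cybersecurity, education, and labor. It focuses on AI evaluation methodology and the tension between causal inference assumptions and real-world complexity, not on specific large-model principles, architectural innovations, or applied techniques (training methods, inference optimization, agent systems, and so on). The only marginally related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper mentions biosecurity as one application domain; this is not its technical focus, so it receives 5 points (some relevance). All other keywords concern specific large-model techniques, methods, or applications that are unrelated to the paper's methodological subject, so they score 0.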

!!! tip deepseek-chat TL;DR

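The paper studies the methodological challenges of using randomized controlled trial (RCT) methodology to evaluate how frontier AI systems affect human performance (human uplift studies). Rapidly evolving AI systems, shifting baselines, and user heterogeneity strain the assumptions behind internal, external, and construct validity; drawing on expert interviews, the paper maps these challenges to practitioner-reported solutions.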

Abstract (Chinese translation)

人类提升研究——即通过随机对照试验等方法，测量人工智能相对于现状对人类表现的影响——正日益为前沿人工智能系统的部署、治理与安全决策提供依据。尽管这类研究采用的方法论已较为成熟，但其与前沿AI系统特有属性之间的相互作用仍未得到充分审视，尤其在研究结果被用于高风险决策时。本文通过对16位在生物安全、网络安全、教育和劳动力等领域具有人类提升研究实践经验的专家进行访谈，发现专家们普遍描述了标准因果推断假设与研究客体之间存在的持续张力。快速演进的人工智能系统、动态变化的基准线、用户能力的异质性与可变性，以及现实场景的边界渗透性，均对内部效度、外部效度和建构效度的基础假设构成压力，从而使得提升证据的解释与合理运用复杂化。我们将这些挑战归纳至人类提升研究生命周期的关键阶段，并与实践者提出的解决方案进行对应分析，以此阐明在高风险决策中人类提升研究证据的局限性及适用边界。

Abstract

Human uplift studies - or studies that measure AI effects on human performance relative to a status quo, typically using randomized controlled trial (RCT) methodology - are increasingly used to inform deployment, governance, and safety decisions for frontier AI systems. While the methods underlying these studies are well-established, their interaction with the distinctive properties of frontier AI systems remains underexamined, particularly when results are used to inform high-stakes decisions. We present findings from interviews with 16 expert practitioners with experience conducting human uplift studies in domains including biosecurity, cybersecurity, education, and labor. Across interviews, experts described a recurring tension between standard causal inference assumptions and the object of study itself. Rapidly evolving AI systems, shifting baselines, heterogeneous and changing user proficiency, and porous real-world settings strain assumptions underlying internal, external, and construct validity, complicating the interpretation and appropriate use of uplift evidence. We synthesize these challenges across key stages of the human uplift research lifecycle and map them to practitioner-reported solutions, clarifying both the limits and the appropriate uses of evidence from human uplift studies in high-stakes decision-making.

Keywords: human uplift studies, randomized controlled trials, frontier AI systems, causal inference, methodological challenges, AI evaluation, validity, high-stakes decision-making
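
As a point of reference for what an uplift study estimates, the simplest RCT analysis is a difference in mean outcomes between the AI-assisted arm and the control arm. The sketch below is a generic textbook estimator, not a method taken from the paper, and the task scores are invented purely for illustration.

```python
import math
import statistics


def uplift_estimate(treatment, control):
    """Difference in mean outcome (treatment minus control) and its
    standard error, using the unequal-variance (Welch) formula."""
    diff = statistics.mean(treatment) - statistics.mean(control)
    se = math.sqrt(
        statistics.variance(treatment) / len(treatment)
        + statistics.variance(control) / len(control)
    )
    return diff, se


# Invented task scores for two randomized arms (illustration only).
with_ai = [7, 8, 6, 9, 10]
without_ai = [5, 6, 7, 4, 8]

uplift, std_err = uplift_estimate(with_ai, without_ai)
print(f"estimated uplift: {uplift:.2f} (standard error {std_err:.2f})")
```

The validity threats described in the abstract (shifting baselines, changing user proficiency, porous real-world settings) affect how far such an estimate can be interpreted, not how it is computed.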

12. ❌ Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation

Authors: Zixuan Liu, Ruoyi Qiao, Chenrui Tie, Xuanwei Liu, Yunfan Lou, Chongkai Gao, Zhixuan Xu, Lin Shao
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.10971v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

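Scoring rationale: The paper addresses dexterous robotic manipulation. It proposes Contact Coverage-Guided Exploration (CCGE), which uses deep reinforcement learning (DRL) to solve general-purpose dexterous manipulation tasks. Its content revolves around robot control, reinforcement learning, contact modeling, and exploration strategies, and does not involve large language models (LLMs), innovations in deep-learning fundamentals, or specific AI-for-science applications such as bioinformatics or cheminformatics. All scoring keywords relate to large models, deep-learning techniques, or scientific AI applications, whereas this is robotics research, so every keyword has a relevance of 0.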

!!! tip deepseek-chat TL;DR

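To address the lack of general-purpose reward functions for dexterous manipulation, the paper proposes Contact Coverage-Guided Exploration (CCGE), which encourages robot hands to discover diverse contact patterns, significantly improves training efficiency and success rates, and transfers the learned contact patterns to real robotic systems.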

Abstract (Chinese translation)

深度强化学习（DRL）在奖励结构明确的领域（如雅达利游戏和运动控制）已取得显著成功。相比之下，灵巧操作任务缺乏通用的奖励设计框架，通常依赖于针对特定任务手工设计的先验知识来引导手与物体的交互。我们提出接触覆盖引导探索（Contact Coverage-Guided Exploration, CCGE），这是一种专为通用灵巧操作任务设计的探索方法。CCGE将接触状态表示为物体表面点与预定义手部关键点之间的交集，激励灵巧手发现多样且新颖的接触模式，即哪些手指接触物体的哪些区域。该方法通过基于学习哈希码获取的离散化物体状态，维护一个条件接触计数器，以记录每个手指与不同物体区域交互的频率。该计数器通过两种互补方式被利用：（1）分配基于计数的接触覆盖奖励，以促进探索新颖的接触模式；（2）设计基于能量的接近奖励，引导智能体朝向未充分探索的接触区域移动。我们在多种灵巧操作任务上评估CCGE，包括杂乱物体分离、受限物体抓取、手内重定向以及双手操作。实验结果表明，相较于现有探索方法，CCGE显著提升了训练效率和成功率，并且通过CCGE学习到的接触模式能够稳健地迁移到真实世界机器人系统中。项目页面为 https://contact-coverage-guided-exploration.github.io。

Abstract

Deep reinforcement learning (DRL) has achieved remarkable success in domains with well-defined reward structures, such as Atari games and locomotion. In contrast, dexterous manipulation lacks general-purpose reward formulations and typically depends on task-specific, handcrafted priors to guide hand-object interactions. We propose Contact Coverage-Guided Exploration (CCGE), a general exploration method designed for general-purpose dexterous manipulation tasks. CCGE represents contact state as the intersection between object surface points and predefined hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) to assign a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) to define an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate CCGE on a diverse set of dexterous manipulation tasks, including cluttered object singulation, constrained object retrieval, in-hand reorientation, and bimanual manipulation. Experimental results show that CCGE substantially improves training efficiency and success rates over existing exploration methods, and that the contact patterns learned with CCGE transfer robustly to real-world robotic systems. Project page is https://contact-coverage-guided-exploration.github.io.

Keywords: Dexterous Manipulation, Contact Coverage, Exploration Method, Deep Reinforcement Learning, Robotic Systems, Hand-Object Interaction, Training Efficiency, Real-world Transfer
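
The count-based contact coverage reward described in the abstract can be sketched as a counter keyed by object state, finger, and object region. This is a minimal illustration under stated assumptions: the 1/sqrt(count) bonus is a common count-based exploration bonus and is not confirmed as the paper's formula, a plain string stands in for the learned hash code, and the class and method names are hypothetical. The energy-based reaching reward is not covered.

```python
import math
from collections import defaultdict


class ContactCounter:
    """Counts (object state, finger, region) contacts and pays a novelty bonus."""

    def __init__(self):
        self.counts = defaultdict(int)

    def coverage_reward(self, state_code, contacts):
        """contacts: iterable of (finger, region) pairs touching at this step."""
        reward = 0.0
        for finger, region in contacts:
            key = (state_code, finger, region)
            self.counts[key] += 1
            # The bonus shrinks as the same contact pattern is revisited.
            reward += 1.0 / math.sqrt(self.counts[key])
        return reward


counter = ContactCounter()
first = counter.coverage_reward("state_a", [("thumb", 3)])   # new pattern
repeat = counter.coverage_reward("state_a", [("thumb", 3)])  # seen before
novel = counter.coverage_reward("state_a", [("index", 7)])   # new pattern
print(first, repeat, novel)
```

Conditioning the key on the discretized object state means the same finger-region contact counts as new again once the object has moved to a different state, so exploration is rewarded per object configuration and not just once overall.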

13. ❌ GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations

Authors: Boyuan Chen, Minghao Shao, Siddharth Garg, Ramesh Karri, Muhammad Shafique
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.10978v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

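Scoring rationale: The paper studies hallucinations of vision language models (VLMs) in counting tasks and proposes mitigating them by adding spatial grounding from object detection models (ODMs). The most relevant keywords are 'Hallucination Mitigation' (highly relevant, 10 points), because the paper directly targets hallucination mitigation; 'Large Language Models' (8 points), because VLMs are visual extensions of large language models; and 'Chain of Thought', 'System 2 Thinking', 'Self-Correction', and 'Mechanistic Interpretability' (5 points each), because the paper touches on reasoning mechanisms, reflection, and interpretability analysis. Other keywords such as MoE, quantization, and RAG are unrelated to the paper and score 0.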

!!! tip deepseek-chat TL;DR

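To address hallucinations of vision language models in counting tasks, the paper proposes GroundCount, a framework that adds spatial grounding from object detection models. In the best case it reaches 81.3% counting accuracy on Ovis2.5-2B (a 6.6pp improvement) while reducing inference time by 22%.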

Abstract (Chinese translation)

视觉语言模型（VLMs）在计数任务中持续存在幻觉现象，其准确率显著低于其他视觉推理任务（情感分析除外）。即使在当前最先进的具备推理能力的VLMs中，这一现象依然存在。相比之下，基于CNN的目标检测模型（ODMs，如YOLO）擅长空间定位和实例计数，且计算开销极小。我们提出GroundCount框架，该框架通过ODMs提供的显式空间定位信息来增强VLMs，以缓解计数幻觉。在最佳情况下，我们基于提示的增强策略在性能最佳的模型（Ovis2.5-2B）上实现了81.3%的计数准确率——提升了6.6个百分点——同时通过对较强模型消除幻觉驱动的推理循环，将推理时间减少了22%。我们进行了全面的消融研究，证明位置编码是一个关键组件，对较强模型有益，但对较弱模型有害。相比之下，置信度分数在大多数架构中引入了噪声，移除该分数在五个评估模型中的四个上提升了性能。我们进一步评估了特征级融合架构，发现尽管存在复杂的交叉注意力机制，但通过结构化提示实现的显式符号定位仍优于隐式特征融合。我们的方法在五个评估的VLM架构中的四个上取得了一致的性能提升（6.2-7.5个百分点），其中一个架构因迭代反思机制与结构化提示不兼容而出现性能下降。这些结果表明，计数失败源于根本性的空间-语义整合局限，而非特定架构缺陷，同时凸显了架构兼容性在增强策略中的重要性。

Abstract

Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B) - a 6.6pp improvement - while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2–7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.

Keywords: Vision Language Models, Hallucination Mitigation, Object Detection, Counting Accuracy, Spatial Grounding, Inference Acceleration, Reasoning Mechanisms, Architectural Compatibility
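
The prompt-based augmentation described in the abstract can be pictured with a minimal sketch that turns detector output into a structured text prompt. The `Detection` type, the 3x3 grid used as a positional encoding, and the prompt wording are all assumptions made for illustration; the abstract does not give GroundCount's actual prompt format. Confidence scores are left out by default, in line with the ablation finding that removing them helped four of five models, and positions can be switched off for weaker models.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # class name reported by the object detector
    box: tuple         # (x_min, y_min, x_max, y_max) in pixels
    confidence: float  # detector score in [0, 1]


def grid_cell(det, width, height):
    """Map the box centre to a coarse 3x3 cell name (hypothetical positional encoding)."""
    cx = (det.box[0] + det.box[2]) / 2
    cy = (det.box[1] + det.box[3]) / 2
    col = ["left", "center", "right"][min(int(3 * cx / width), 2)]
    row = ["top", "middle", "bottom"][min(int(3 * cy / height), 2)]
    return f"{row}-{col}"


def build_prompt(question, detections, width, height,
                 use_position=True, use_confidence=False):
    """Prepend per-class counts and per-instance grounding to the question."""
    counts = Counter(d.label for d in detections)
    lines = ["Detected objects:"]
    for label, n in sorted(counts.items()):
        lines.append(f"- {label}: {n}")
    if use_position or use_confidence:
        lines.append("Instances:")
        for i, d in enumerate(detections, start=1):
            parts = [f"{i}. {d.label}"]
            if use_position:
                parts.append(f"at {grid_cell(d, width, height)}")
            if use_confidence:
                parts.append(f"(confidence {d.confidence:.2f})")
            lines.append(" ".join(parts))
    lines.append(f"Question: {question}")
    return "\n".join(lines)


detections = [
    Detection("dog", (10, 20, 110, 120), 0.91),
    Detection("dog", (400, 300, 600, 470), 0.78),
    Detection("cat", (250, 200, 350, 280), 0.66),
]
prompt = build_prompt("How many dogs are in the image?", detections, 640, 480)
print(prompt)
```

The resulting text would be passed to the VLM together with the image; the model call itself is out of scope for this sketch.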

14. ❌ Historical Consensus: Preventing Posterior Collapse via Iterative Selection of Gaussian Mixture Priors

Authors: Zegu Zhang, Jian Zhang Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10935v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

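Scoring rationale: The paper addresses posterior collapse in variational autoencoders (VAEs) and proposes Historical Consensus Training, a method built on Gaussian mixture model priors. Its content belongs entirely to generative modeling and variational inference in conventional deep learning and does not touch on large language models (LLMs), the technical principles or applications of large models, or AI for Science. Every scoring keyword concerns large models, large-model techniques, or their scientific applications, whereas this paper studies a foundational problem of one specific architecture, the VAE, so it is unrelated to all of them.

!!! tip deepseek-chat TL;DR

The paper proposes Historical Consensus Training, a new method that iteratively selects Gaussian mixture model priors to rule out posterior collapse in variational autoencoders, and supports its effectiveness with both theory and experiments.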

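Translated abstract

Variational autoencoders (VAEs) often suffer from posterior collapse, in which the latent variables become uninformative and the approximate posterior degenerates to the prior. Recent work characterizes this phenomenon as a phase transition governed by the spectral properties of the data covariance matrix. We take a fundamentally different approach: rather than avoiding collapse through architectural constraints or hyperparameter tuning, we use the multiplicity of Gaussian mixture model (GMM) clusterings to rule out collapse altogether. We introduce Historical Consensus Training, an iterative selection procedure that progressively refines a set of candidate GMM priors by alternating optimization and selection. The key insight is that a model trained under several distinct clustering constraints forms a historical barrier, a region of parameter space that stays stable even when the model is later trained with a single objective. We prove that this barrier excludes the collapsed solution, and extensive experiments on synthetic and real-world datasets show that the method yields non-collapsed representations regardless of decoder variance or regularization strength. The method needs no explicit stability condition (e.g., $σ^{\prime 2} < λ_{\max}$) and applies to arbitrary neural architectures. Code is available at https://github.com/tsegoochang/historical-consensus-vae.

Abstract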

Variational autoencoders (VAEs) frequently suffer from posterior collapse, where latent variables become uninformative and the approximate posterior degenerates to the prior. Recent work has characterized this phenomenon as a phase transition governed by the spectral properties of the data covariance matrix. In this paper, we propose a fundamentally different approach: instead of avoiding collapse through architectural constraints or hyperparameter tuning, we eliminate the possibility of collapse altogether by leveraging the multiplicity of Gaussian mixture model (GMM) clusterings. We introduce Historical Consensus Training, an iterative selection procedure that progressively refines a set of candidate GMM priors through alternating optimization and selection. The key insight is that models trained to satisfy multiple distinct clustering constraints develop a historical barrier – a region in parameter space that remains stable even when subsequently trained with a single objective. We prove that this barrier excludes the collapsed solution, and demonstrate through extensive experiments on synthetic and real-world datasets that our method achieves non-collapsed representations regardless of decoder variance or regularization strength. Our approach requires no explicit stability conditions (e.g., $σ^{\prime 2} < λ_{\max}$) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/historical-consensus-vae.

Keywords: Variational autoencoders, Posterior collapse, Gaussian mixture model, Historical Consensus Training, Latent variables, Clustering constraints, Parameter space barrier, Non-collapsed representations
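
To make "the approximate posterior degenerates to the prior" concrete, the sketch below computes the per-dimension KL divergence between a diagonal Gaussian posterior and a prior, and flags dimensions whose KL is close to zero. It assumes a standard normal prior for simplicity, not the GMM priors used in the paper, and the function names and the threshold are hypothetical. It is a common diagnostic for collapse, not an implementation of Historical Consensus Training.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Per-dimension KL( N(mu, exp(log_var)) || N(0, 1) ) in closed form."""
    return [0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, log_var)]


def collapsed_dims(mu, log_var, threshold=1e-2):
    """Indices of latent dimensions whose posterior is nearly identical to the prior."""
    kl = kl_to_standard_normal(mu, log_var)
    return [i for i, value in enumerate(kl) if value < threshold]


# Encoder output for one input: dimensions 0 and 2 carry almost no information.
mu = [0.0, 1.5, 0.01]
log_var = [0.0, -2.0, 0.0]
print(kl_to_standard_normal(mu, log_var))
print(collapsed_dims(mu, log_var))
```

In practice the KL values would be averaged over a dataset before thresholding; a single input is used here only to keep the example short.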

15. ❌ Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control

Authors: Yaswanth Chittepu, Ativ Joshi, Rajarshi Bhattacharjee, Scott Niekum Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10938v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	15.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

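Scoring rationale: The paper's core contribution is RAD, a new RLHF alignment framework that replaces the conventional expected-cost constraint with a first-order stochastic dominance constraint to better control tail risk and out-of-distribution failures. It is therefore highly relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (15 points) and relevant to "Instruction Tuning OR Alignment OR Value Alignment" (10 points), since it focuses on alignment. Large language models (LLMs) form the application setting but are not the technical core, hence 5 points. The remaining keywords have no direct connection to the paper's subject matter (stochastic dominance, optimal transport, spectral risk measures) and score 0.

!!! tip deepseek-chat TL;DR

Conventional expectation-constrained safe RLHF cannot effectively control tail risk. To address this, the paper proposes RAD, a risk-sensitive alignment framework based on first-order stochastic dominance constraints and made differentiable through optimal transport; experiments show it improves harmlessness and out-of-distribution robustness while preserving helpfulness.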

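Translated abstract

Safe Reinforcement Learning from Human Feedback (RLHF) usually enforces safety through an expected-cost constraint, but the expectation captures only a single statistic of the cost distribution and cannot reflect distributional uncertainty, a limitation that is especially pronounced under heavy tails or rare catastrophic events. This shortcoming causes serious problems when robustness and risk sensitivity are critical. Stochastic dominance offers a principled alternative: by comparing entire cost distributions rather than only their means, it gives direct control over tail risk and over potential out-of-distribution failures that expectation-based constraints may miss. We propose Risk-sensitive Alignment via Dominance (RAD), a new alignment framework that replaces the conventional scalar expected-cost constraint with a First-Order Stochastic Dominance (FSD) constraint. We implement the constraint by comparing the cost distribution of the target policy with that of a reference policy within an Optimal Transport (OT) framework, and we use entropic regularization and Sinkhorn iterations to obtain a differentiable, computationally efficient objective that supports stable end-to-end optimization. We also introduce quantile-weighted FSD constraints and show that weighted FSD universally controls a broad class of Spectral Risk Measures (SRMs), so an improvement under weighted dominance guarantees an improvement in the corresponding spectral risk. This provides a principled mechanism for tuning a model's risk profile through the quantile weighting function. Experiments show that RAD improves harmlessness over baselines while staying competitive in helpfulness, and that it is more robust in out-of-distribution harmlessness evaluations.

Abstract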

Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected cost constraints, but the expectation captures only a single statistic of the cost distribution and fails to account for distributional uncertainty, particularly under heavy tails or rare catastrophic events. This limitation is problematic when robustness and risk sensitivity are critical. Stochastic dominance offers a principled alternative by comparing entire cost distributions rather than just their averages, enabling direct control over tail risks and potential out-of-distribution failures that expectation-based constraints may overlook. In this work, we propose Risk-sensitive Alignment via Dominance (RAD), a novel alignment framework that replaces scalar expected cost constraints with First-Order Stochastic Dominance (FSD) constraints. We operationalize this constraint by comparing the target policy’s cost distribution to that of a reference policy within an Optimal Transport (OT) framework, using entropic regularization and Sinkhorn iterations to obtain a differentiable and computationally efficient objective for stable end-to-end optimization. Furthermore, we introduce quantile-weighted FSD constraints and show that weighted FSD universally controls a broad class of Spectral Risk Measures (SRMs), so that improvements under weighted dominance imply guaranteed improvements in the corresponding spectral risk. This provides a principled mechanism for tuning a model’s risk profile via the quantile weighting function. Empirical results demonstrate that RAD improves harmlessness over baselines while remaining competitive in helpfulness, and exhibits greater robustness on out-of-distribution harmlessness evaluations.

Keywords: Safe RLHF, Stochastic Dominance, Risk-sensitive Alignment, Spectral Risk Measures, Optimal Transport, Tail Risk Control, Alignment Framework, First-Order Stochastic Dominance
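
The difference between an expected-cost constraint and a dominance constraint can be seen on a toy example. The sketch below checks first-order stochastic dominance empirically on equal-size cost samples by comparing sorted values, and computes CVaR, a standard example of a spectral risk measure. The function names and the sample values are hypothetical, and this non-differentiable check is not the paper's optimal-transport objective with Sinkhorn iterations.

```python
import math


def dominates_in_cost(target_costs, reference_costs):
    """Empirical FSD check for equal-size samples: every sorted target cost
    must be no larger than the reference cost at the same rank."""
    if len(target_costs) != len(reference_costs):
        raise ValueError("this simple check needs equal sample sizes")
    pairs = zip(sorted(target_costs), sorted(reference_costs))
    return all(t <= r for t, r in pairs)


def cvar(costs, alpha):
    """Mean of the worst (1 - alpha) fraction of costs (higher cost is worse)."""
    ordered = sorted(costs)
    k = max(1, math.ceil((1 - alpha) * len(ordered)))
    tail = ordered[-k:]
    return sum(tail) / len(tail)


reference = [0.1, 0.2, 0.3, 5.0]
policy_a = [0.1, 0.1, 0.2, 1.0]   # lower cost at every rank
policy_b = [0.0, 0.0, 0.0, 5.6]   # same mean cost as the reference, heavier tail

print(dominates_in_cost(policy_a, reference))
print(dominates_in_cost(policy_b, reference))
print(cvar(reference, 0.75), cvar(policy_b, 0.75))
```

Here `policy_b` matches the reference in mean cost, so an expectation-based constraint cannot tell them apart, yet it fails the dominance check and has a higher CVaR because its worst case is worse.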

16. ❌ When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS

Authors: Anupam Purwar, Aditya Choudhary Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10904v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

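Scoring rationale: The paper's core topic is fine-tuning LLMs in TTS systems, LoRA fine-tuning in particular, so it is highly relevant to "Large Language Models" (10 points) and "PEFT/LoRA" (15 points). It mentions using a quantized model (GGUF), which gives some connection to "Quantization" (5 points). Other keywords such as MoE, SLMs, Scaling Laws, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how LoRA fine-tuning, combined with diverse training data, improves speech quality, speaker similarity, and signal-to-noise ratio in LLM-based text-to-speech systems; the results show that LoRA fine-tuning adapts effectively to speaker characteristics without harming language modeling ability.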

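Translated abstract

Large language models are increasingly used as the semantic backbone of neural text-to-speech (TTS) systems. Frozen LLM representations, however, are not enough to model speaker-specific acoustic and perceptual characteristics. Our experiments on fine-tuning the language model backbone of a TTS system suggest that this approach can improve voice consistency and signal-to-noise ratio (SNR) in voice cloning. Across multiple speakers, LoRA fine-tuning consistently outperforms the non-fine-tuned Qwen-0.5B base model on three complementary dimensions of speech quality. First, perceptual quality rises significantly for speakers whose training data has enough acoustic variability, with DNS-MOS gains of up to 0.42 points. Second, speaker fidelity improves for every evaluated speaker, with steady gains in voice similarity, indicating that LoRA adapts speaker identity representations effectively without degrading language modeling. Third, signal-level quality improves in most cases, with SNR rising by up to 34%. Crucially, these gains depend strongly on the characteristics of the training data: speakers with high variability in acoustic energy and perceptual quality gain in DNS-MOS, voice similarity, and SNR at the same time. Overall, the study establishes that LoRA fine-tuning is not only a parameter-efficient optimization technique but also an effective mechanism for better speaker-level adaptation in compact LLM-based TTS systems. Given sufficiently diverse training data, the LoRA-adapted Qwen-0.5B model, deployed as a quantized GGUF model, consistently surpasses its frozen base model in perceptual quality and speaker similarity at low latency.

Abstract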

Large language models are increasingly adopted as semantic backbones for neural text-to-speech (TTS) systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments on fine-tuning the language model backbone of a TTS system show promise in improving voice consistency and signal-to-noise ratio (SNR) in the voice cloning task. Across multiple speakers, LoRA fine-tuning consistently outperforms the non-fine-tuned base Qwen-0.5B model across three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to 0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity, indicating that LoRA effectively adapts speaker identity representations without degrading linguistic modeling. Third, signal-level quality improves in most cases, with SNR increasing by as much as 34 percent. Crucially, these improvements are strongly governed by the characteristics of the training data. Speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS, voice similarity, and SNR. Overall, this work establishes that LoRA fine-tuning is not merely a parameter-efficient optimization technique but an effective mechanism for better speaker-level adaptation in compact LLM-based TTS systems. When supported by sufficiently diverse training data, the LoRA-adapted Qwen-0.5B consistently surpasses its frozen base model in perceptual quality and speaker similarity, with low latency, using a GGUF model hosted in quantized form.

Keywords: Large Language Models, Fine-tuning, LoRA, Text-to-Speech, Speaker Adaptation, Data Diversity, Parameter-efficient Fine-tuning, Voice Cloning
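
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update on a single weight matrix: the frozen weight W is left untouched and the effective weight is W + (alpha / r) * B A, where only the small matrices A and B are trained. The matrix sizes, values, and function names are illustrative assumptions; the abstract does not state which layers, rank, or scaling the authors used.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, with W: d_out x d_in, B: d_out x r, A: r x d_in."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]      # frozen base weight (2 x 3)
a = [[1.0, 2.0, 3.0]]      # trainable, rank r = 1 (1 x 3)
b = [[0.5],
     [0.0]]                # trainable (2 x 1)

print(lora_effective_weight(w, a, b, alpha=2.0))
```

In the usual LoRA setup B starts at zero, so the adapted model initially reproduces the frozen base model exactly; training then moves only A and B.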

17. ❌ LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

Authors: Jinwoo Ahn, Ingyu Seong, Akhil Kedia, Junhan Kim, Hyemi Jang, Kangwook Lee, Yongkweon Jeon Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10899v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

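Scoring rationale: The paper's core topic is a KV cache eviction mechanism that addresses the efficiency of long-context LLM inference. It is highly relevant to "KV Cache Compression OR Linear Attention OR FlashAttention" (15 points), since it directly concerns KV cache optimization; closely related to "Large Language Models OR LLMs OR Foundation Models" (10 points) and "Context Window Extension OR Long Context LLMs" (10 points), since it targets long-context LLM tasks; related to "Speculative Decoding OR Inference Acceleration" (10 points), since it aims to speed up inference; and indirectly related to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (5 points), since it uses parameter-efficient modules. The other keywords are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes LookaheadKV, a framework that evicts KV cache entries efficiently by predicting importance scores instead of generating a draft. It addresses the efficiency bottleneck caused by cache growth in long-context LLM inference, keeping accuracy high while sharply reducing eviction cost and shortening time-to-first-token.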

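Translated abstract

Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. Although this mechanism greatly improves efficiency, the cache grows linearly with the input sequence length and quickly becomes a bottleneck for long-context tasks. Existing solutions ease the problem by evicting prompt KV entries judged unimportant according to estimated importance scores. Notably, a recent line of work improves eviction quality by "glimpsing into the future": a draft generator produces a surrogate future response that approximates the target model's true response, and the surrogate is then used to estimate the importance of cached KV entries more accurately. These methods, however, depend on computationally expensive draft generation, which adds substantial prefilling overhead and limits their practicality in real deployments. To address this, we propose LookaheadKV, a lightweight eviction framework that gains the benefit of a surrogate future response without explicit draft generation. LookaheadKV augments the transformer layers with parameter-efficient modules trained to predict the true importance scores with high accuracy. The design keeps runtime overhead negligible and comparable to existing inexpensive heuristics, while achieving better accuracy than more costly approximation methods. Experiments on long-context understanding benchmarks across a range of models show that the method not only outperforms recent competitive baselines on a variety of long-context understanding tasks but also cuts eviction cost by up to 14.5x, which significantly shortens time-to-first-token. Code is available at https://github.com/SamsungLabs/LookaheadKV.

Abstract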

Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by “glimpsing into the future”, in which a draft generator produces a surrogate future response approximating the target model’s true response, and this surrogate is subsequently used to estimate the importance of cached KV more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of surrogate future response without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict true importance scores with high accuracy. Our design ensures negligible runtime overhead comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines in various long-context understanding tasks, but also reduces the eviction cost by up to 14.5x, leading to significantly faster time-to-first-token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.

Keywords: KV cache eviction, large language models, long-context inference, inference acceleration, parameter-efficient modules, autoregressive inference, time-to-first-token, transformer layers
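
The eviction step that the abstract builds on can be sketched as a simple top-k selection: given one importance score per cached prompt position, keep the highest-scoring positions up to a budget and drop the rest. The function name and the scores below are hypothetical. In LookaheadKV the scores would come from the trained parameter-efficient modules; here they are simply given as input.

```python
def evict_kv(keys, values, importance, budget):
    """Keep the `budget` cache entries with the highest importance scores,
    preserving their original order in the sequence."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [keys[i] for i in keep], [values[i] for i in keep], keep


keys = ["k0", "k1", "k2", "k3", "k4"]        # stand-ins for key vectors
values = ["v0", "v1", "v2", "v3", "v4"]      # stand-ins for value vectors
importance = [0.90, 0.05, 0.40, 0.01, 0.70]  # one score per prompt position

kept_keys, kept_values, kept_positions = evict_kv(keys, values, importance, budget=3)
print(kept_positions, kept_keys)
```

Real systems typically apply this per layer and per attention head on tensors; the list-based version only shows the selection logic.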

18. ❌ Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models

Authors: Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10887v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

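Scoring rationale: The paper's core topic is RL fine-tuning of LLMs, which is highly relevant to "Large Language Models" and "RLHF" (10 points each). It focuses on improving the reasoning ability of LLMs, which is highly relevant to "Chain of Thought" and "System 2 Thinking" (10 points each). Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RAG, and Quantization are not mentioned in the abstract and are unrelated (0 points).

!!! tip deepseek-chat TL;DR

Selecting training data for reinforcement learning fine-tuning of large language models is inefficient. The paper proposes Dynamics-Predictive Sampling, which predicts learning dynamics to pick informative prompts efficiently, reducing redundant computation and improving reasoning performance.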

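Translated abstract

Reinforcement learning (RL) fine-tuning has become a key technique for improving the reasoning ability of large language models (LLMs). Its effectiveness, however, depends heavily on the choice of training data. Recent work highlights the importance of online prompt selection methods, which typically concentrate training on examples that are partially solved or moderately challenging under the current policy and thereby produce more effective model updates. Although these methods substantially speed up RL fine-tuning in terms of training steps, they also carry heavy computational overhead: identifying informative samples requires many LLM rollouts over large candidate batches, a cost that can exceed that of fine-tuning itself. To address this, we propose Dynamics-Predictive Sampling (DPS), which predicts and selects informative prompts online by inferring their learning dynamics before any costly rollout. Specifically, we introduce a new perspective that models each prompt's solving progress during RL fine-tuning as a dynamical system, in which the extent of solving is the state and state transitions are described by a hidden Markov model. Using historical rollout reward signals, we perform online Bayesian inference to estimate the evolving state distribution, and the result serves as a predictive prior for efficient prompt selection without rollout-intensive filtering. Experiments on diverse reasoning tasks, including mathematics, planning, and visual geometry, show that DPS substantially reduces redundant rollouts, speeds up training, and achieves better reasoning performance.

Abstract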

Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). However, its effectiveness critically depends on the selection of training data. Recent advances underscore the importance of online prompt selection methods, which typically concentrate training on partially solved or moderately challenging examples under the current policy, thereby yielding more effective model updates. While significantly accelerating RL finetuning in terms of training steps, they also incur substantial computational overhead by requiring extensive LLM rollouts over large candidate batches to identify informative samples, an expense that can outweigh the finetuning process itself. To address this challenge, this work proposes Dynamics-Predictive Sampling (DPS), which online predicts and selects informative prompts by inferring their learning dynamics prior to costly rollouts. Specifically, we introduce a new perspective by modeling each prompt’s solving progress during RL finetuning as a dynamical system, where the extent of solving is represented as the state and the transition is characterized by a hidden Markov model. Using historical rollout reward signals, we perform online Bayesian inference to estimate evolving state distributions, and the inference outcome provides a predictive prior for efficient prompt selection without rollout-intensive filtering. Empirical results across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that DPS substantially reduces redundant rollouts, accelerates the training process, and achieves superior reasoning performance.

Keywords: Reinforcement Learning, Finetuning, Large Language Models, Reasoning, Prompt Selection, Dynamics-Predictive Sampling, Training Efficiency
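
The idea of treating each prompt's solving progress as a hidden Markov model can be sketched with a small forward filter. Everything specific below is an assumption for illustration: the three states, the bucketing of rollout rewards into "none", "some", and "all" correct, the transition and emission probabilities, and the rule of picking prompts with the highest predicted probability of being partially solved. The abstract does not specify these details of DPS.

```python
STATES = ("unsolved", "partial", "solved")
OBSERVATIONS = ("none", "some", "all")  # bucketed share of correct rollouts

# Hypothetical transition probabilities P(next state | current state).
TRANSITION = [
    [0.80, 0.20, 0.00],
    [0.05, 0.75, 0.20],
    [0.00, 0.05, 0.95],
]
# Hypothetical emission probabilities P(observation | state).
EMISSION = [
    [0.90, 0.09, 0.01],
    [0.10, 0.80, 0.10],
    [0.01, 0.09, 0.90],
]


def predict(belief):
    """One transition step: belief over the next state before new rewards arrive."""
    n = len(STATES)
    return [sum(belief[i] * TRANSITION[i][j] for i in range(n)) for j in range(n)]


def update(prior, observation):
    """Bayes update of the state belief with one bucketed rollout outcome."""
    k = OBSERVATIONS.index(observation)
    unnormalized = [p * EMISSION[s][k] for s, p in enumerate(prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def predictive_prior(history):
    """Filter through past outcomes, then predict the state at the next step."""
    belief = [1 / 3, 1 / 3, 1 / 3]
    for observation in history:
        belief = predict(update(belief, observation))
    return belief


def select_prompts(histories, k):
    """Pick the k prompts most likely to be partially solved, without new rollouts."""
    partial = STATES.index("partial")
    ranked = sorted(histories,
                    key=lambda name: predictive_prior(histories[name])[partial],
                    reverse=True)
    return ranked[:k]


histories = {
    "p1": ["none", "none", "none"],   # still too hard
    "p2": ["none", "some", "some"],   # in the informative middle
    "p3": ["all", "all", "all"],      # already solved
}
print(select_prompts(histories, 1))
```

The point of the sketch is that selection uses only stored reward history, so no candidate prompt needs a fresh rollout before it is chosen.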

19. ❌ Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements

Authors: Jonathan Liu, Kia Ghods Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10885v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

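Scoring rationale: The paper studies the use of a Diffusion Transformer to generate synthetic regulatory DNA sequences, an application of AI to bioinformatics. It is unrelated to most keywords (such as LLMs, MoE, and SFT), which target language models or general deep learning techniques. The main relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI directly to bioinformatics (generating DNA sequences); it scores 10 points (highly relevant). In addition, the paper fine-tunes with DDPO (Denoising Diffusion Policy Optimization), a reinforcement-learning fine-tuning method for diffusion models, which gives it some connection to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO"; since DDPO is one part of the method rather than its core, it scores 5 points. The other keywords are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes a parameter-efficient Diffusion Transformer (DiT) for generating cell-type-specific regulatory DNA sequences and uses DDPO fine-tuning to raise predicted regulatory activity substantially.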

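Translated abstract

We present a parameter-efficient Diffusion Transformer (DiT) for generating 200 bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion with a transformer denoiser that has a 2D convolutional neural network (CNN) input encoder, our model matches the U-Net's best validation loss within 13 epochs (60 times fewer epochs) and converges to a loss 39% lower, while the memorization rate, measured as the share of generated sequences that align to the training data under BLAT, falls from 5.3% to 1.7%. Ablations show that the CNN encoder is essential: without it, validation loss rises by 70% whichever positional embedding is used. We further fine-tune with DDPO using Enformer as the reward model, which raises predicted regulatory activity 38-fold. Cross-validation against DRAKES on an independent prediction task confirms that these gains reflect genuine regulatory signal rather than overfitting to the reward model.

Abstract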

We present a parameter-efficient Diffusion Transformer (DiT) for generating 200bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion with a transformer denoiser equipped with a 2D CNN input encoder, our model matches the U-Net’s best validation loss in 13 epochs (60$\times$ fewer) and converges 39% lower, while reducing memorization from 5.3% to 1.7% of generated sequences aligning to training data via BLAT. Ablations show the CNN encoder is essential: without it, validation loss increases 70% regardless of positional embedding choice. We further apply DDPO finetuning using Enformer as a reward model, achieving a 38$\times$ improvement in predicted regulatory activity. Cross-validation against DRAKES on an independent prediction task confirms that improvements reflect genuine regulatory signal rather than reward model overfitting.

Keywords: Diffusion Transformer, regulatory DNA sequences, parameter-efficient, DDPO finetuning, bioinformatics, synthetic biology, Enformer reward model, cell-type-specific

20. ❌ An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took “Use of Practical AI in Digital Libraries” seriously?

Authors: Jennifer D’Souza, Sameer Sadruddin, Maximilian Kähler, Andrea Salfinger, Luca Zaccagna, Francesca Incitti, Lauro Snidaro, Osma Suominen Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10876v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

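Scoring rationale: The paper addresses subject indexing in digital libraries and releases a bilingual corpus and a machine-readable taxonomy that support multi-label classification, mapping text to authority terms, and agent-assisted cataloging. It concerns the application of AI in digital libraries but does not explicitly mention or examine large models, the principles of deep learning techniques, or AI applications in the sciences. All keywords concern large-model techniques, deep learning principles, or specific scientific AI applications, whereas the paper focuses on conventional AI applications in library science (classification, cataloging), so it is unrelated to every keyword.

!!! tip deepseek-chat TL;DR

The paper releases a large bilingual (English/German) corpus of catalog records and a machine-readable GND taxonomy to tackle large-scale cross-lingual subject indexing and to support authority-grounded evaluation of AI-assisted cataloging.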

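Translated abstract

Subject indexing is essential for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), together with a machine-actionable GND taxonomy. The resource supports ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and a qualitative error analysis of three systems. We call on the community to assess not only accuracy but also usefulness and transparency, with the goal of authority-anchored AI co-pilots that amplify catalogers' work.

Abstract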

Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers’ work.

Keywords: Extreme Multi-label Text Classification, Digital Libraries, Subject Indexing, Integrated Authority File (GND), Ontology-aware Classification, Agent-assisted Cataloging, Bilingual Corpus, Authority-grounded Evaluation
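
One way to score a system on a corpus like this is with ranking metrics over predicted authority terms. The sketch below computes precision@k and recall@k for a single record; these are common choices for extreme multi-label classification, but the abstract does not say which metrics the authors report, and the term strings are made-up placeholders, not real GND identifiers.

```python
def precision_at_k(ranked_terms, gold_terms, k):
    """Share of the top-k predicted terms that a cataloger also assigned."""
    hits = sum(1 for term in ranked_terms[:k] if term in gold_terms)
    return hits / k


def recall_at_k(ranked_terms, gold_terms, k):
    """Share of the cataloger's terms recovered within the top-k predictions."""
    if not gold_terms:
        return 0.0
    hits = sum(1 for term in ranked_terms[:k] if term in gold_terms)
    return hits / len(gold_terms)


# Hypothetical system output (best first) and cataloger-assigned terms.
ranked = ["machine learning", "library science", "cataloging", "physics"]
gold = {"cataloging", "library science", "subject indexing"}

print(precision_at_k(ranked, gold, 3), recall_at_k(ranked, gold, 3))
```

Exact-match metrics like these treat a near-miss in the taxonomy the same as an unrelated term, which is one reason an ontology-aware evaluation may be preferable.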

21. ❌ GRACE: A Unified 2D Multi-Robot Path Planning Simulator & Benchmark for Grid, Roadmap, And Continuous Environments

Authors: Chuanlong Zang, Anna Mannucci, Isabelle Barz, Philipp Schillinger, Florian Lier, Wolfgang Hönig Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10858v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

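Scoring rationale: The paper develops a simulation and benchmarking platform for multi-robot path planning (MAPF/MRMP) and has no direct connection to most of the large-model and deep-learning keywords. It is partly related to "Multi-agent Systems OR Agent Coordination" (5 points) because it involves coordinated multi-robot planning, and to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points) as an application of AI in science and engineering. All other keywords are unrelated (0 points).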
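
The total of 0.0 despite two relevance scores of 5 follows from the weight column in the table above. A minimal sketch of that arithmetic, assuming each keyword's score is weight × relevance (the report does not state the per-keyword formula, so this is a guess) and using the pass-mark formula from the configuration section (5.0 + 0.8 × total weight). The function names are hypothetical.

```python
def keyword_score(weight: float, relevance: float) -> float:
    """Assumed per-keyword score: weight times relevance (0-10 scale)."""
    return weight * relevance


def pass_mark(total_weight: float) -> float:
    """Pass mark as given in the report's configuration section."""
    return 5.0 + 0.8 * total_weight


# (weight, relevance) pairs from the GRACE table: two keywords rated 5, 25 rated 0.
rows = [(0.0, 5.0), (0.0, 5.0)] + [(0.0, 0.0)] * 25
total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_mark(27.0)  # 27 keywords, configured weight 1.0 each
print(total, round(threshold, 1), total >= threshold)
```

Under this assumption a weight of 0.0 removes any contribution from the relevance column, which matches the reported total.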

!!! tip deepseek-chat TL;DR

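The paper presents GRACE, a unified 2D simulator and benchmark for multi-robot path planning that supports grid, roadmap, and continuous environments. It empirically quantifies the trade-off between representation fidelity and performance, aiming to make cross-representation studies comparable and to advance multi-robot planning research.

Abstract (Chinese translation)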

推进多智能体路径规划（MAPF）与多机器人运动规划（MRMP）研究需要能够实现建模选择间透明、可复现比较的平台。现有工具要么在简化假设（网格、同构智能体）下具备扩展性，要么提供更高仿真保真度但缺乏可比性的评估工具。我们提出GRACE——一个统一的二维仿真与基准测试平台，它通过显式、可复现的操作器及通用评估协议，在多个抽象层级（网格、路网、连续空间）上实例化同一任务。我们在公开地图和代表性规划器上的实证结果，使得在共享实例集上进行对等比较成为可能。此外，我们量化了表征方式与保真度之间的预期权衡（MRMP以更高保真度但较低速度求解实例，而网格/路网规划器具有更优扩展性）。通过整合表征、执行与评估环节，GRACE旨在提升跨表征研究的可比性，并为推动多机器人规划研究及其实际应用转化提供有效工具。

Abstract

Advancing Multi-Agent Pathfinding (MAPF) and Multi-Robot Motion Planning (MRMP) requires platforms that enable transparent, reproducible comparisons across modeling choices. Existing tools either scale under simplifying assumptions (grids, homogeneous agents) or offer higher fidelity with less comparable instrumentation. We present GRACE, a unified 2D simulator+benchmark that instantiates the same task at multiple abstraction levels (grid, roadmap, continuous) via explicit, reproducible operators and a common evaluation protocol. Our empirical results on public maps and representative planners enable commensurate comparisons on a shared instance set. Furthermore, we quantify the expected representation-fidelity trade-offs (MRMP solves instances at higher fidelity but lower speed, while grid/roadmap planners scale farther). By consolidating representation, execution, and evaluation, GRACE thereby aims to make cross-representation studies more comparable and provides a means to advance multi-robot planning research and its translation to practice.

Keywords: Multi-Agent Pathfinding, Multi-Robot Motion Planning, simulator, benchmark, grid environments, roadmap environments, continuous environments, representation-fidelity trade-offs

22. ❌ $V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

Authors: Yi-Kai Zhang, Yueqing Sun, Hongyan Hao, Qi Gu, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10848v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

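Scoring rationale: The paper focuses on value models and sparse sampling in reinforcement learning (RL), proposing V0.5 to fuse a pre-trained value-model prior with the empirical mean of sparse rollouts into a robust baseline. The keywords all concern large models, deep-learning principles, or scientific AI applications, but the paper does not involve any large model (such as an LLM), deep-learning technique (such as MoE, SFT, or RAG), or scientific application (such as bioinformatics). With no direct link to any keyword, every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes V0.5, which dynamically fuses a pre-trained value-model prior with the empirical mean of sparse rollouts to build a robust baseline. On mathematical reasoning benchmarks it significantly improves reinforcement-learning performance, with faster convergence and roughly 10% better results.

Abstract (Chinese translation)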

在可验证奖励的强化学习（RLVR）中，构建稳健的优势基线对于策略梯度至关重要，它能有效引导策略模型强化期望行为。近期研究引入了通用价值模型（例如 $V_0$），该模型通过显式地在上下文中编码模型能力来实现预训练的价值估计，从而无需与策略模型同步更新价值模型。本文提出 $V_{0.5}$，它自适应地融合此类价值模型预测的基线（作为先验）与稀疏 rollout 获得的经验均值，从而构建了一个在计算效率与极低方差之间取得平衡的稳健基线。具体而言，我们引入了实时统计检验与动态预算分配机制。该方法平衡了稀疏采样导致的高方差与价值模型先验固有的系统偏差（或幻觉）。通过构建假设检验以实时评估先验的可靠性，系统能够按需动态分配额外的 rollout 预算。这一机制最小化了基线估计器的均方误差（MSE），即使在分组大小为 4 的极端稀疏条件下，也能保证策略梯度的稳定性。在六个数学推理基准上的广泛评估表明，$V_{0.5}$ 显著优于 GRPO 和 DAPO，实现了更快的收敛速度以及约 10% 以上的性能提升。

Abstract

In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as $V_0$), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to synchronously update the value model alongside the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such value model (acting as a prior) with the empirical mean derived from sparse rollouts. This constructs a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce a real-time statistical testing and dynamic budget allocation. This balances the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model’s prior. By constructing a hypothesis test to evaluate the prior’s reliability in real-time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator’s Mean Squared Error (MSE), guaranteeing stable policy gradients, even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that $V_{0.5}$ significantly outperforms GRPO and DAPO, achieving faster convergence and over some 10% performance improvement.

Keywords: Reinforcement Learning, Value Model, Sparse Rollouts, Policy Gradients, Mathematical Reasoning, Baseline Estimation, Generalist Value Model, V0.5
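
The abstract describes fusing a value-model prior with a sparse empirical mean, and buying extra rollouts only when a hypothesis test rejects the prior. The sketch below illustrates that idea and is not the paper's algorithm. It assumes binary rewards, a two-sided z-test, and pseudo-count shrinkage as a stand-in for the MSE-minimising fusion. All names and default values (`prior_strength`, `z_crit`, `extra`, `max_rollouts`) are hypothetical.

```python
import math
import random


def fused_baseline(prior, rewards, prior_strength=4.0):
    """Shrink the empirical mean toward the prior (pseudo-count weighting)."""
    n = len(rewards)
    mean = sum(rewards) / n
    lam = prior_strength / (prior_strength + n)
    return lam * prior + (1.0 - lam) * mean


def prior_rejected(prior, rewards, z_crit=1.96):
    """Two-sided z-test of H0: the true success rate equals the prior."""
    n = len(rewards)
    mean = sum(rewards) / n
    p = min(max(prior, 1e-6), 1.0 - 1e-6)  # keep the variance positive
    z = abs(mean - p) / math.sqrt(p * (1.0 - p) / n)
    return z > z_crit


def estimate_baseline(prior, sample_reward, group_size=4, extra=4, max_rollouts=16):
    """Start with a sparse group; request more rollouts only if the prior looks wrong."""
    rewards = [sample_reward() for _ in range(group_size)]
    while prior_rejected(prior, rewards) and len(rewards) < max_rollouts:
        rewards += [sample_reward() for _ in range(extra)]
    if prior_rejected(prior, rewards):
        return sum(rewards) / len(rewards), len(rewards)  # fall back to the data
    return fused_baseline(prior, rewards), len(rewards)


rng = random.Random(0)
baseline, used = estimate_baseline(0.45, lambda: float(rng.random() < 0.5))
print(baseline, used)
```

When the prior is plausible the estimate stays at the initial group size of 4, the sparse setting the abstract highlights. A rejected prior triggers additional rollouts up to the cap.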

23. ❌ Semantic Landmark Particle Filter for Robot Localisation in Vineyards

Authors: Rajitha de Silva, Jonathan Cox, James R. Heselden, Marija Popović, Cesar Cadena, Riccardo Polvara Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10847v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

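Scoring rationale: The paper studies robot localisation, focusing on sensor fusion and semantic SLAM in an agricultural setting (vineyards) using LiDAR, GNSS, and semantic detections (trunks, poles). The scoring keywords all concern large language models (LLMs), deep-learning principles, or AI for science, none of which the paper touches. Its content belongs to robotics, computer vision, and agricultural automation, with no connection to LLMs, model training, inference optimisation, AI alignment, or the other keywords, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses unreliable robot localisation in vineyards caused by row-level perceptual aliasing. It proposes a particle filter that fuses semantic landmark detections with LiDAR, and field experiments show lower localisation error than the geometry-only and vision-based baselines.

Abstract (Chinese translation)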

葡萄园中的可靠定位因行级感知混淆而受阻：平行的作物行产生几乎完全相同的激光雷达观测数据，导致仅基于几何特征和视觉的SLAM系统倾向于收敛至错误的行间通道，尤其在端头转弯区域。本文提出一种语义地标粒子滤波器（SLPF），该框架在概率定位系统中将树干与立柱地标检测结果与二维激光雷达数据相融合。检测到的树干被转换为语义墙体，形成嵌入测量模型的结构化行边界，从而增强相邻行间的区分能力。系统将GNSS作为轻量级先验信息纳入，可在语义观测稀疏时稳定定位性能。

在十行葡萄园进行的实地实验表明，该方法相较于仅基于几何的定位方法（AMCL）、基于视觉的方法（RTAB-Map）以及GNSS基线方案均有持续改进。与AMCL相比，SLPF在两个遍历方向上分别将绝对位姿误差降低22%和65%；相对于带噪声的GNSS基线，绝对位姿误差减少65%和61%。行识别正确率从0.67提升至0.73，平均横向跟踪误差从1.40米降至1.26米。这些结果表明，在测量模型中嵌入行级结构语义信息，能够在高度重复的户外农业环境中实现鲁棒的定位。

Abstract

Reliable localisation in vineyards is hindered by row-level perceptual aliasing: parallel crop rows produce nearly identical LiDAR observations, causing geometry-only and vision-based SLAM systems to converge towards incorrect corridors, particularly during headland transitions. We present a Semantic Landmark Particle Filter (SLPF) that integrates trunk and pole landmark detections with 2D LiDAR within a probabilistic localisation framework. Detected trunks are converted into semantic walls, forming structural row boundaries embedded in the measurement model to improve discrimination between adjacent rows. GNSS is incorporated as a lightweight prior that stabilises localisation when semantic observations are sparse. Field experiments in a 10-row vineyard demonstrate consistent improvements over geometry-only (AMCL), vision-based (RTAB-Map), and GNSS baselines. Compared to AMCL, SLPF reduces Absolute Pose Error by 22% and 65% across two traversal directions; relative to a NoisyGNSS baseline, APE decreases by 65% and 61%. Row correctness improves from 0.67 to 0.73, while mean cross-track error decreases from 1.40 m to 1.26 m. These results show that embedding row-level structural semantics within the measurement model enables robust localisation in highly repetitive outdoor agricultural environments.

Keywords: robot localisation, vineyards, semantic landmark, particle filter, LiDAR, perceptual aliasing, agricultural environments, SLAM
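
Row-level aliasing and the role of a GNSS prior can be seen in a one-dimensional toy particle filter. This is a sketch under stated assumptions, not the SLPF measurement model. The robot's lateral position is a single number, rows are assumed to be 2.5 m apart, and a measured offset to the nearest row line stands in for the semantic-wall observation. The noise levels are also invented for illustration.

```python
import math
import random


def gaussian(x, mu, sigma):
    """Unnormalised Gaussian likelihood."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def row_offset(y, spacing):
    """Signed lateral offset from the nearest row centre line."""
    return y - spacing * round(y / spacing)


def update_weights(particles, observed_offset, gnss_y,
                   spacing=2.5, wall_sigma=0.15, gnss_sigma=0.7):
    """Weight each particle by a periodic row-offset term times a GNSS prior."""
    weights = []
    for y in particles:
        # Identical for every row: this term alone cannot tell rows apart.
        wall_term = gaussian(row_offset(y, spacing), observed_offset, wall_sigma)
        # Broad, noisy, but not periodic: it selects the row.
        gnss_term = gaussian(y, gnss_y, gnss_sigma)
        weights.append(wall_term * gnss_term)
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
particles = [rng.uniform(0.0, 25.0) for _ in range(2000)]  # lateral positions (m)
# Robot is truly at y = 5.1 m (row 2, offset +0.1 m); GNSS reads 5.6 m.
weights = update_weights(particles, observed_offset=0.1, gnss_y=5.6)
estimate = sum(w * y for w, y in zip(weights, particles))
particles = rng.choices(particles, weights=weights, k=len(particles))  # resample
print(round(estimate / 2.5))  # index of the estimated row
```

The row-offset term fixes the position within a row precisely, and the GNSS term decides which row it is. Neither is sufficient alone in this toy setting.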

24. ❌ Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis

Authors: Yujie Zheng, Zhuo Li, Shengtao Zhang, Hanjing Wang, Junjie Sheng, Jiaqian Wang, Junchi Yan, Weinan Zhang, Ying Wen, Bo Tang, Muning Wen Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10846v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

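Scoring rationale: The paper, EvoKernel, applies LLMs to programming, specifically NPU kernel synthesis. It explicitly uses an LLM as the base model and builds an autonomous agent framework (LLM Agents) that automates the workflow from drafting to continual refinement. Its core contribution is a value-driven memory method framed as a reinforcement-learning task; it does not cover other keywords such as MoE, SLMs, fine-tuning techniques, or inference acceleration. Although it is an AI application, it does not clearly fall under a scientific field such as bioinformatics.

!!! tip deepseek-chat TL;DR

Targeting the data-scarce NPU programming domain, the paper proposes EvoKernel, a framework that uses value-driven memory and reinforcement learning so that large language models can synthesize kernels automatically. It raises correctness from 11.0% to 83.0% and achieves a 3.60x median speedup over initial drafts.

Abstract (Chinese translation)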

将大型语言模型部署到数据稀缺的编程领域面临重大挑战，尤其在新兴的特定领域架构（Domain-Specific Architectures）上进行内核合成时，“数据墙”限制了可用训练数据。尽管模型在CUDA等数据丰富的平台上表现优异，但在如NPU编程这类数据稀缺的生态系统中，其性能会出现灾难性下降。为了在不进行昂贵微调的情况下克服这一冷启动障碍，我们提出了EvoKernel，一个自我进化的智能体框架，实现了从初始草拟到持续优化的内核合成全生命周期自动化。EvoKernel通过将合成过程构建为基于记忆的强化学习任务来解决此问题。通过一种新颖的价值驱动检索机制，它学习阶段特定的Q值，根据经验对当前目标（无论是引导出可行草案还是迭代优化延迟）的贡献来优先选择经验。此外，通过实现跨任务记忆共享，智能体能够将简单算子的洞察泛化至复杂算子。通过构建KernelBench的NPU变体并在其上评估，EvoKernel将前沿模型的正确率从11.0%提升至83.0%，并通过迭代优化实现了相对于初始草案中位数3.60倍的加速。这表明，价值引导的经验积累使得通用模型能够掌握小众硬件生态系统上的内核合成任务。我们的官方页面位于https://evokernel.zhuo.li。

Abstract

Deploying Large Language Models to data-scarce programming domains poses significant challenges, particularly for kernel synthesis on emerging Domain-Specific Architectures where a “Data Wall” limits available training data. While models excel on data-rich platforms like CUDA, they suffer catastrophic performance drops on data-scarce ecosystems such as NPU programming. To overcome this cold-start barrier without expensive fine-tuning, we introduce EvoKernel, a self-evolving agentic framework that automates the lifecycle of kernel synthesis from initial drafting to continual refining. EvoKernel addresses this by formulating the synthesis process as a memory-based reinforcement learning task. Through a novel value-driven retrieval mechanism, it learns stage-specific Q-values that prioritize experiences based on their contribution to the current objective, whether bootstrapping a feasible draft or iteratively refining latency. Furthermore, by enabling cross-task memory sharing, the agent generalizes insights from simple to complex operators. By building an NPU variant of KernelBench and evaluating on it, EvoKernel improves frontier models’ correctness from 11.0% to 83.0% and achieves a median speedup of 3.60x over initial drafts through iterative refinement. This demonstrates that value-guided experience accumulation allows general-purpose models to master the kernel synthesis task on niche hardware ecosystems. Our official page is available at https://evokernel.zhuo.li.

Keywords: Large Language Models, LLM Agents, Kernel Synthesis, NPU Programming, Reinforcement Learning, Value-Driven Memory, Cold-Start, Continual Refining
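
The value-driven retrieval idea, stage-specific Q-values that rank stored experiences by how much they helped the current objective, can be sketched with a small memory class. This is an illustration only, not EvoKernel's implementation. The stage names, the epsilon-greedy exploration, the incremental update rule, and the example notes are all assumptions.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Experience:
    text: str
    # One Q-value per stage; "draft" and "refine" are hypothetical stage names.
    q: dict = field(default_factory=lambda: {"draft": 0.0, "refine": 0.0})


class ValueMemory:
    def __init__(self, alpha=0.5, epsilon=0.1, seed=0):
        self.items = []
        self.alpha = alpha      # step size for Q-value updates
        self.epsilon = epsilon  # probability of a random (exploratory) retrieval
        self.rng = random.Random(seed)

    def add(self, text):
        self.items.append(Experience(text))

    def retrieve(self, stage, k=2):
        """Top-k by stage-specific Q-value, with occasional random exploration."""
        if self.rng.random() < self.epsilon:
            return self.rng.sample(self.items, min(k, len(self.items)))
        ranked = sorted(self.items, key=lambda e: e.q[stage], reverse=True)
        return ranked[:k]

    def update(self, used, stage, reward):
        """Move each used experience's Q-value toward the observed reward."""
        for exp in used:
            exp.q[stage] += self.alpha * (reward - exp.q[stage])


memory = ValueMemory(epsilon=0.0)  # greedy retrieval for a repeatable demo
for note in ["tile the loop", "fix buffer alignment", "fuse elementwise ops"]:
    memory.add(note)
used = [memory.items[1]]
memory.update(used, stage="draft", reward=1.0)  # e.g. the draft compiled and passed
best = memory.retrieve("draft", k=1)[0]
print(best.text, best.q)
```

Keeping separate values per stage lets the same stored note rank highly while bootstrapping a draft and poorly while optimising latency, or the reverse.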

25. ❌ Human Presence Detection via Wi-Fi Range-Filtered Doppler Spectrum on Commodity Laptops

Authors: Jessica Sanson, Rahul C. Shah, Valerio Frascolla Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10845v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

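Scoring rationale: The paper studies Wi-Fi-based human presence detection, which belongs to wireless sensing and embedded systems. It does not involve large models, deep learning, AI for science, or any technical concept among the scoring keywords. All keywords are unrelated to its topic, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes Range-Filtered Doppler Spectrum, a new Wi-Fi sensing technique that the authors describe as the first to detect human presence using only a commodity laptop's built-in Wi-Fi hardware, with no external devices. Range filtering before Doppler analysis reduces computational complexity, and an adaptive multi-rate framework adjusts the CSI sampling rate to activity.

Abstract (Chinese translation)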

人体存在检测是实现日常设备智能功耗管理与安防功能的关键技术。本文提出了首个基于单站Wi-Fi感知的人体存在检测方案，该方案仅利用设备内置Wi-Fi硬件即可检测用户位置，无需外部设备、接入点或额外传感器。相比之下，现有笔记本电脑人体存在检测方案需依赖增加成本与复杂度的外部专用传感器，或采用引发严重隐私担忧的摄像头方案。我们在此提出距离滤波多普勒频谱——一种用于存在估计的新型Wi-Fi感知技术，能够实现距离选择性及时间窗式的人体存在检测。通过在多普勒分析前对信道脉冲响应域进行定向距离区域滤波，我们的方法将处理聚焦于任务相关空间区域，显著降低了计算复杂度。此外，与传统的二维距离-多普勒检测器相比，在频谱域采用时间窗设计使估计器具有更高的稳定性。进一步地，我们提出自适应多速率处理框架，可动态调整信道状态信息采样率——在空闲时段以低帧率运行，仅在检测到运动时切换至高帧率。据我们所知，这是首个基于商用现成笔记本电脑内置Wi-Fi网络接口控制器的低复杂度单站Wi-Fi感知占位检测方案，无需外部网络基础设施或专用传感器。我们的解决方案可在不同环境与设备间扩展，且无需校准或重新训练。

Abstract

Human Presence Detection (HPD) is key to enable intelligent power management and security features in everyday devices. In this paper we propose the first HPD solution that leverages monostatic Wi-Fi sensing and detects user position using only the built-in Wi-Fi hardware of a device, with no need for external devices, access points, or additional sensors. In contrast, existing HPD solutions for laptops require external dedicated sensors which add cost and complexity, or rely on camera-based approaches that introduce significant privacy concerns. We herewith introduce the Range-Filtered Doppler Spectrum (RF-DS), a novel Wi-Fi sensing technique for presence estimation that enables both range-selective and temporally windowed detection of user presence. By applying targeted range-area filtering in the Channel Impulse Response (CIR) domain before Doppler analysis, our method focuses processing on task-relevant spatial zones, significantly reducing computational complexity. In addition, the use of temporal windows in the spectrum domain provides greater estimator stability compared to conventional 2D Range-Doppler detectors. Furthermore, we propose an adaptive multi-rate processing framework that dynamically adjusts Channel State Information (CSI) sampling rates, operating at low frame rates (10 Hz) during idle periods and high rates (100 Hz) only when motion is detected. To our knowledge, this is the first low-complexity solution for occupancy detection using monostatic Wi-Fi sensing on a built-in Wi-Fi network interface controller (NIC) of a commercial off-the-shelf laptop that requires no external network infrastructure or specialized sensors. Our solution can scale across different environments and devices without calibration or retraining.

Keywords: Human Presence Detection, Wi-Fi Sensing, Range-Filtered Doppler Spectrum, Monostatic Wi-Fi, Channel Impulse Response, Adaptive Multi-rate Processing, Commodity Laptops, Low-complexity Solution
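
The adaptive multi-rate behaviour in the abstract (10 Hz while idle, 100 Hz once motion is detected) amounts to a small state machine. A minimal sketch follows. The two rates come from the abstract, while the motion-energy input, the threshold, and the hold-off before dropping back to idle are hypothetical choices made only to keep the example runnable.

```python
IDLE_HZ = 10     # low CSI sampling rate during idle periods (from the abstract)
ACTIVE_HZ = 100  # high CSI sampling rate while motion is detected (from the abstract)


class RateController:
    def __init__(self, threshold=1.0, hold_frames=3):
        self.threshold = threshold      # hypothetical motion-energy threshold
        self.hold_frames = hold_frames  # hypothetical quiet frames before going idle
        self.quiet = 0
        self.rate_hz = IDLE_HZ

    def step(self, motion_energy):
        """Return the sampling rate to use after seeing one motion-energy value."""
        if motion_energy > self.threshold:
            self.quiet = 0
            self.rate_hz = ACTIVE_HZ
        elif self.rate_hz == ACTIVE_HZ:
            self.quiet += 1
            if self.quiet >= self.hold_frames:
                self.rate_hz = IDLE_HZ
        return self.rate_hz


controller = RateController()
energies = [0.1, 0.2, 3.0, 0.1, 0.1, 0.1, 0.1]
rates = [controller.step(e) for e in energies]
print(rates)  # [10, 10, 100, 100, 100, 10, 10]
```

The hold-off avoids flapping between rates when motion briefly pauses. Whether the paper uses a comparable mechanism is not stated in the abstract.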

26. ❌ Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

Authors: Mingyang Song, Mao Zheng, Chenning Xu Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.11027v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

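Scoring rationale: The paper's core topic is the reliability of LLM-as-a-judge, which directly concerns the application and evaluation of large language models, so it is highly related to "Large Language Models OR LLMs OR Foundation Models" (10 points). The end of the abstract notes implications for reward modeling in RLAIF, giving partial relevance to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (5 points). The paper does not cover the specific techniques or applications of the other keywords, which score 0.

!!! tip deepseek-chat TL;DR

The paper finds that high agreement among LLM judges is often illusory, and shows that generating evaluation rubrics dynamically from domain knowledge yields more meaningful assessment.

Abstract (Chinese translation)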

LLM即评判者范式依赖于一个关键假设，即高评分者间一致性意味着可靠且客观的评估。我们提出两项互补的研究发现，对此假设提出了挑战。首先，我们证明这种共识常常是虚幻的。我们识别并形式化了评估幻觉现象，即LLM评判者能生成复杂的评语，但其评分却锚定在共通的表面启发式特征上，而非实质质量。通过对105,600个评估实例（32个LLM × 3个前沿评判模型 × 100项任务 × 11个温度参数）的大规模研究，我们发现模型层面的一致性（斯皮尔曼ρ=0.99）掩盖了脆弱的样本层面一致性（皮尔逊平均r=0.72；绝对一致性ICC=0.67），仅共享评估准则结构即可恢复总一致性的62%，且高质量输出反而会得到最不一致的评价。其次，我们证明基于领域知识动态生成评估准则能产生更有意义的评估。我们提出了MERG（元认知增强准则生成框架），这是一个知识驱动的评估准则生成框架，其领域选择性效应证实了这一点。在知识能将评估者锚定于共同标准的规范化领域（教育领域+22%，学术领域+27%），一致性提升；而在真正评估多元性出现的主观领域，一致性则下降。这些发现表明，评估准则应通过专家知识动态丰富，而非依赖通用标准，这对RLAIF中的奖励建模具有重要启示。

Abstract

The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. First, we demonstrate that this consensus is frequently illusory. We identify and formalize Evaluation Illusion, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs $\times$ 3 frontier judges $\times$ 100 tasks $\times$ 11 temperatures), we show that model-level agreement (Spearman $\rho = 0.99$) masks fragile sample-level agreement (Pearson $\bar{r} = 0.72$; absolute agreement ICC $= 0.67$), that merely sharing rubric structure restores 62% of total agreement, and that high-quality outputs paradoxically receive the least consistent evaluations. Second, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement increases in codified domains (Education +22%, Academic +27%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges. These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.

Keywords: LLM-as-a-judge, Evaluation Illusion, Inter-evaluator agreement, MERG, Knowledge-driven rubric generation, RLAIF, Reward modeling, Domain knowledge
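
The gap between model-level and sample-level agreement reported in the abstract can be reproduced on toy data. In the sketch below, two hypothetical judges rank three models identically by mean score while disagreeing on individual samples. The scores are invented, and pooling all samples into one Pearson coefficient is a simplification of the paper's averaged sample-level statistic.

```python
import math
from statistics import mean


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based), so ties share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


# Invented scores: model -> per-sample scores from each judge.
judge_a = {"m1": [3, 5, 4, 6], "m2": [6, 7, 5, 8], "m3": [8, 9, 7, 9]}
judge_b = {"m1": [5, 3, 6, 4], "m2": [7, 5, 8, 6], "m3": [9, 7, 9, 8]}

models = sorted(judge_a)
model_level = spearman([mean(judge_a[m]) for m in models],
                       [mean(judge_b[m]) for m in models])
sample_level = pearson([s for m in models for s in judge_a[m]],
                       [s for m in models for s in judge_b[m]])
print(round(model_level, 2), round(sample_level, 2))
```

Here the model ranking agrees essentially perfectly while the pooled sample-level correlation is below 0.5, the same qualitative pattern as the reported $\rho = 0.99$ versus $\bar{r} = 0.72$.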

27. ❌ LLM2Vec-Gen: Generative Embeddings from Large Language Models

Authors: Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10913v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

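Scoring rationale: The paper proposes LLM2Vec-Gen, which generates text embeddings with a frozen LLM, so it is highly related to "Large Language Models" (10 points). The method trains special tokens to represent the LLM's response, which falls under parameter-efficient fine-tuning (PEFT) (5 points). The paper mentions transferring the LLM's safety alignment and reasoning abilities to embedding tasks, giving partial relevance to "Instruction Tuning/Alignment" (5 points) and "Chain of Thought Reasoning" (5 points). Decoding the embeddings into interpretable text relates to "Explainable AI" (5 points). Other keywords such as MoE, SLMs, RAG, and quantization are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes LLM2Vec-Gen, a self-supervised method that trains special tokens to produce text embeddings representing a large language model's response. It achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB) and markedly improves safety and reasoning on embedding tasks.

Abstract (Chinese translation)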

基于大语言模型（LLM）的文本嵌入器通常对其输入的语义内容进行编码。然而，嵌入任务要求将多样化的输入映射到相似的输出。通常，这种输入-输出映射问题是通过使用对比学习与配对数据训练嵌入模型来解决的。在本研究中，我们提出了一种新颖的自监督方法——LLM2Vec-Gen，它采用了一种不同的范式：我们并非对输入进行编码，而是学习表示模型潜在的响应。具体而言，我们在LLM的词表中添加可训练的特殊标记，将其附加到输入之后，并通过优化使其在一个固定长度的序列中表示LLM的响应。训练由LLM自身对查询的补全结果以及一个提供蒸馏目标的无监督嵌入教师模型共同指导。这种设计有助于弥合输入-输出之间的差距，并将LLM的安全对齐、推理等能力迁移到嵌入任务中。关键的是，LLM主干网络保持冻结状态，且训练仅需未标注的查询数据。LLM2Vec-Gen在Massive Text Embedding Benchmark（MTEB）上实现了最先进的自监督性能，相比最佳的无监督嵌入教师模型提升了9.3%。我们还观察到，在嵌入任务中，有害内容检索减少了高达43.2%，推理能力提升了29.3%。最后，学习到的嵌入表示具有可解释性，可被解码为文本以揭示其语义内容。

Abstract

LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model’s potential response. Specifically, we add trainable special tokens to the LLM’s vocabulary, append them to input, and optimize them to represent the LLM’s response in a fixed-length sequence. Training is guided by the LLM’s own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.

Keywords: LLM2Vec-Gen, text embeddings, self-supervised learning, large language models, parameter-efficient fine-tuning, embedding tasks, MTEB benchmark, interpretable embeddings

28. ❌ GLM-OCR Technical Report

Authors: Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, Jie Tang Journal/source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10910v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
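
The arithmetic behind these tables can be sketched as follows. This is a minimal reconstruction, not the report generator's actual code: it assumes each keyword's score is weight × relevance, capped at the per-keyword maximum of 15, and it uses the pass-mark formula stated in the report's configuration (5.0 + 0.8 × total weight). The function names are hypothetical.

```python
# Hypothetical reconstruction of the report's scoring arithmetic.
# Assumption: per-keyword score = weight * relevance, capped at MAX_PER_KEYWORD.
from typing import List, Tuple

MAX_PER_KEYWORD = 15.0

def pass_mark(total_weight: float) -> float:
    """Pass mark as stated in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight

def paper_score(rows: List[Tuple[float, float]]) -> float:
    """Sum weight * relevance over (weight, relevance) rows, capping each term."""
    return sum(min(w * r, MAX_PER_KEYWORD) for w, r in rows)

# 27 keywords at weight 1.0 give the report's pass mark of 26.6.
threshold = pass_mark(27 * 1.0)

# Non-zero relevance rows from the GLM-OCR table: every weight is 0.0,
# so the total is 0.0 regardless of the relevance values.
glm_ocr_rows = [(0.0, 8.0), (0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 5.0), (0.0, 8.0)]
print(round(threshold, 1), paper_score(glm_ocr_rows))
```

Under this assumed formula, a weight of 0.0 forces every score to 0.0, which is consistent with the totals shown in the tables, even though the configuration section lists each keyword weight as 1.0.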

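Scoring rationale: GLM-OCR is a 0.9B-parameter multimodal model focused on document understanding. The most relevant keywords are: 1) 'Small Language Models OR SLMs OR On-device AI' (10): the model is a compact 0.9B architecture that emphasizes edge deployment, so it is highly relevant; 2) 'Large Language Models OR LLMs OR Foundation Models' (8): it uses a GLM language decoder, which places it in the large-model family; 3) 'Speculative Decoding OR Inference Acceleration' (8): it introduces a Multi-Token Prediction mechanism to raise decoding throughput, which is directly relevant; 4) 'Quantization OR Model Compression OR Low-bit Weights' (5): the compact architecture involves efficiency optimization, so there is some connection; 5) 'Pre-training OR Continual Pre-training OR Domain Adaptation' and 'Post-training OR Supervised Fine-tuning OR SFT' (5 each): model development likely involves pre-training and fine-tuning, but the abstract does not say so explicitly. Other keywords such as MoE, RAG, and CoT have no direct connection to document OCR and are scored 0.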

!!! tip deepseek-chat TL;DR

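GLM-OCR is an efficient, compact 0.9B-parameter multimodal model that uses a Multi-Token Prediction mechanism and a two-stage pipeline to achieve competitive performance on document understanding tasks while remaining suitable for resource-constrained edge deployment.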

Abstract

GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
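
To make the throughput argument concrete, the toy sketch below counts decoder forward passes when each step emits several tokens instead of one. It is only an illustration under stated assumptions: `toy_predict` is a hypothetical stand-in that reads from a known target string, and nothing here reflects GLM-OCR's actual MTP heads or how it validates predicted tokens.

```python
from typing import Callable, List, Tuple

def decode(predict: Callable[[List[str], int], List[str]],
           tokens_per_step: int, eos: str = "<eos>") -> Tuple[List[str], int]:
    """Greedy loop that accepts up to `tokens_per_step` tokens per forward pass.

    Returns the generated tokens (without EOS) and the number of forward passes.
    """
    out: List[str] = []
    steps = 0
    while True:
        steps += 1
        for tok in predict(out, tokens_per_step):
            if tok == eos:
                return out, steps
            out.append(tok)

# Hypothetical "model": a deterministic OCR target, so every prediction is correct.
TARGET = list("INVOICE 2026") + ["<eos>"]

def toy_predict(prefix: List[str], k: int) -> List[str]:
    """Return the next k tokens of TARGET after the current prefix."""
    return TARGET[len(prefix):len(prefix) + k]

text_1, steps_1 = decode(toy_predict, tokens_per_step=1)
text_4, steps_4 = decode(toy_predict, tokens_per_step=4)
print("".join(text_4), steps_1, steps_4)
```

Both runs produce the same text; the only difference is the number of forward passes, which is the quantity that multi-token prediction aims to reduce.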

Keywords: GLM-OCR, compact multimodal model, document understanding, Multi-Token Prediction, efficient decoding, layout analysis, edge deployment, 0.9B parameters

29. ❌ ARMADA

Authors: Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, Tanmoy Chakraborty
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.10877v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

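Scoring rationale: The paper proposes ARMADA, a framework for knowledge distillation from vision-language models (including black-box models) to language-only models, which falls under large-model application and optimization. The most relevant keywords are: 1) 'Large Language Models' (10): the paper involves large models such as DeBERTa, OPT, and LLaMA; 2) 'Small Language Models' (8): knowledge distillation aims to compress large models into smaller ones; 3) 'Post-training/SFT' (8): the model is optimized through distillation, which resembles fine-tuning; 4) 'Quantization/Model Compression' (8): knowledge distillation is a key model-compression technique; 5) 'Pre-training' (5): pre-trained models are mentioned as the starting point. Other keywords such as MoE, Scaling Laws, and RAG have no direct connection to the paper and are scored 0.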

!!! tip deepseek-chat TL;DR

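The paper proposes ARMADA, an efficient cross-modal knowledge distillation framework that transfers knowledge from vision-language models (including black-box models) to language-only models, improving performance on a range of natural language understanding and generation tasks without expensive multimodal pre-training or teacher fine-tuning.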

Abstract

Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-{3B, 7B, 8B}. ARMADA achieves up to 3.4% improvement on language understanding tasks and 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.
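
For readers unfamiliar with the baseline that ARMADA departs from, the sketch below shows the standard temperature-scaled distillation loss (a KL divergence between softened teacher and student distributions). This is the conventional same-modality setup, not ARMADA's alignment technique, which the abstract does not specify; the logits are made-up numbers.

```python
import math
from typing import List

def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits: List[float], teacher_logits: List[float],
            temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

teacher = [4.0, 1.0, 0.5]  # illustrative values only
student = [2.5, 1.5, 0.2]
print(kd_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. Note that it needs the teacher's output distribution over a shared label space, which is the kind of requirement that becomes a problem when the teacher is a black-box model of a different modality.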

Keywords: Knowledge Distillation, Cross-modal, Vision-Language Models, Black-box Teachers, Language Models, Model Compression, ARMADA, Efficient Framework

30. ❌ SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0

Authors: Nevidu Jayatilleke, Nisansa de Silva, Uthpala Nimanthi, Gagani Kulathilaka, Azra Safrullah, Johan Sofalas
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.10861v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

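Scoring rationale: The paper builds SiDiaC-v.2.0, a historical corpus for a low-resource language (Sinhala), covering text collection, preprocessing, annotation, and categorization; it belongs to corpus construction and NLP resource development. Its content has no direct connection to any of the scoring keywords (which all center on large models, deep learning techniques, optimization methods, and applications), and it does not touch on model training, optimization, inference, alignment, agents, or AI-for-science applications, so every keyword receives a relevance of 0.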

!!! tip deepseek-chat TL;DR

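The study builds SiDiaC-v.2.0, a Sinhala historical corpus of 244k words across 185 literary works with genre categorization and annotation, providing a comprehensive resource for Sinhala NLP.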

Abstract

SiDiaC-v.2.0 is the largest comprehensive Sinhala Diachronic Corpus to date, covering a period from 1800 CE to 1955 CE in terms of publication dates, and a historical span from the 5th to the 20th century CE in terms of written dates. The corpus consists of 244k words across 185 literary works that underwent thorough filtering, preprocessing, and copyright compliance checks, followed by extensive post-processing. Additionally, a subset of 59 documents totalling 70k words was annotated based on their written dates. Texts from the National Library of Sri Lanka were selected from the SiDiaC-v.1.0 non-filtered list, which was digitised using Google Document AI OCR. This was followed by post-processing to correct formatting issues, address code-mixing, include special tokens, and fix malformed tokens. The construction of SiDiaC-v.2.0 was informed by practices from other corpora, such as FarPaHC, SiDiaC-v.1.0, and CCOHA. This was particularly relevant for syntactic annotation and text normalisation strategies, given that Faroese shares Sinhala's low-resource language status and that CCOHA uses similar cleaning strategies. This corpus is categorised into two layers based on genres: primary and secondary. The primary categorisation is binary, assigning each book to either Non-Fiction or Fiction. The secondary categorisation is more detailed, grouping texts under specific genres such as Religious, History, Poetry, Language, and Medical. Despite facing challenges due to limited resources, SiDiaC-v.2.0 serves as a comprehensive resource for Sinhala NLP, building upon the work previously done in SiDiaC-v.1.0.
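
The two-layer genre scheme could be represented as below. This is a hypothetical data model for illustration only: the class and field names are invented, the secondary genres are just the five examples the abstract names, the century field is an assumed granularity for the written-date annotation, and the sample record is not taken from the corpus.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Primary(Enum):
    """Binary primary categorisation."""
    FICTION = "Fiction"
    NON_FICTION = "Non-Fiction"

class Secondary(Enum):
    """Only the genres named in the abstract; the real list may be longer."""
    RELIGIOUS = "Religious"
    HISTORY = "History"
    POETRY = "Poetry"
    LANGUAGE = "Language"
    MEDICAL = "Medical"

@dataclass(frozen=True)
class CorpusEntry:
    title: str
    primary: Primary
    secondary: Secondary
    # Assumed field: set only for documents in the date-annotated subset.
    written_century: Optional[int] = None

# Placeholder record, not a real corpus entry.
entry = CorpusEntry("placeholder title", Primary.NON_FICTION, Secondary.MEDICAL,
                    written_century=12)
print(entry.primary.value, "/", entry.secondary.value)
```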

Keywords: Sinhala, diachronic corpus, low-resource language, text preprocessing, syntactic annotation, genre categorization, NLP resource, historical texts

31. ❌ Agentar-Fin-OCR

Authors: Siyi Qian, Xiongfei Bai, Bingtao Fu, Yichen Lu, Gaoyang Zhang, Xudong Yang, Peng Zhang
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.11044v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

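Scoring rationale: The paper focuses on a financial document parsing system (Agentar-Fin-OCR) and a financial document benchmark (FinDocBench), covering document layout analysis, table parsing, cross-page content consolidation, and a curriculum-learning training strategy, but it does not mention large models, innovations in deep learning principles, or AI-for-science applications. The keywords all concern large-model techniques, training methods, inference optimization, AI agents, and scientific AI, whereas this paper is a domain-specific (financial) document processing system with no direct connection to them.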

!!! tip deepseek-chat TL;DR

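The paper proposes Agentar-Fin-OCR, a parsing system for financial-domain documents that handles complex layouts and table parsing through cross-page content consolidation and a curriculum learning strategy. It also introduces the FinDocBench financial document benchmark, reports high performance on the table parsing metrics of OmniDocBench, and evaluates state-of-the-art models on FinDocBench.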

Abstract

In this paper, we propose Agentar-Fin-OCR, a document parsing system tailored to financial-domain documents, transforming ultra-long financial PDFs into semantically consistent, highly accurate, structured outputs with auditing-grade provenance. To address finance-specific challenges such as complex layouts, cross-page structural discontinuities, and cell-level referencing capability, Agentar-Fin-OCR combines (1) a Cross-page Contents Consolidation algorithm to restore continuity across pages and a Document-level Heading Hierarchy Reconstruction (DHR) module to build a globally consistent Table of Contents (TOC) tree for structure-aware retrieval, and (2) a difficulty-adaptive curriculum learning training strategy for table parsing, together with a CellBBoxRegressor module that uses structural anchor tokens to localize table cells from decoder hidden states without external detectors. Experiments demonstrate that our model shows high performance on the table parsing metrics of OmniDocBench. To enable realistic evaluation in the financial vertical, we further introduce FinDocBench, a benchmark that includes six financial document categories with expert-verified annotations and evaluation metrics including Table of Contents edit-distance-based similarity (TocEDS), cross-page concatenated TEDS, and Table Cell Intersection over Union (C-IoU). We evaluate a wide range of state-of-the-art models on FinDocBench to assess their capabilities and remaining limitations on financial documents. Overall, Agentar-Fin-OCR and FinDocBench provide a practical foundation for reliable downstream financial document applications.
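
The cell-level metric builds on intersection over union. A minimal sketch for axis-aligned boxes is shown below. How FinDocBench matches predicted cells to reference cells and aggregates per-cell values into C-IoU is not described in the abstract, so only the basic IoU computation is shown; the (x1, y1, x2, y2) box format and the sample coordinates are assumptions.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

predicted_cell = (10.0, 10.0, 50.0, 30.0)  # illustrative coordinates
reference_cell = (20.0, 10.0, 60.0, 30.0)
print(iou(predicted_cell, reference_cell))
```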

Keywords: financial document parsing, table parsing, cross-page content consolidation, curriculum learning, FinDocBench benchmark, document structure reconstruction, cell localization, financial PDF processing

32. ❌ DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving

Authors: Shuyao Shang, Bing Zhan, Yunfei Yan, Yuqi Wang, Yingyan Li, Yasong An, Xiaoman Wang, Jierui Liu, Lu Hou, Lue Fan, Zhaoxiang Zhang, Tieniu Tan
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.11041v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	15.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

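Scoring rationale: DynVLA proposes a new CoT paradigm (Dynamics CoT) for action reasoning in autonomous driving, built around learning world dynamics. The highly relevant keywords are: 1) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (15): the core contribution is the Dynamics CoT paradigm, an extension of CoT reasoning; 2) 'World Models AND General World Models' (15): the paper explicitly learns world dynamics for decision-making, which is central to it; 3) 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10): it is applied to decision-making by an autonomous driving agent; 4) 'Post-training OR Supervised Fine-tuning OR SFT' (10): the model is trained with SFT and RFT; 5) 'Large Language Models OR LLMs OR Foundation Models' (8): it builds on a VLA (vision-language-action) architecture, an application of large models; 6) 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8): Dynamics CoT involves deeper physical reasoning; 7) 'Mechanistic Interpretability OR Explainable AI' (5): the dynamics tokens offer interpretability. Other keywords such as MoE, quantization, and RAG are not covered and are scored 0.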

!!! tip deepseek-chat TL;DR

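The paper proposes the Dynamics CoT paradigm, which improves autonomous driving decisions by forecasting compact world dynamics before generating actions; experiments show it outperforms Textual CoT and Visual CoT methods.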

Abstract

We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.

Keywords: Dynamics CoT, world dynamics, autonomous driving, action reasoning, VLA model, dynamics tokenizer, SFT, RFT

33. ❌ Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity

Authors: Zhengyao Fang, Zexi Jia, Yijia Zhong, Pengcheng Luo, Jinchao Zhang, Guangming Lu, Jun Yu, Wenjie Pei
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.10990v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

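Scoring rationale: The paper addresses color fidelity in text-to-image (T2I) generation and proposes a dataset (CFD), an evaluation metric (CFM), and a refinement method (CFR). The keywords all relate to large language models (LLMs) or deep learning principles, whereas this paper studies image generation in computer vision and does not involve LLMs, MoE, SLMs, scaling laws, pre-training, post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agent systems, model compression, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or scientific AI applications. All keywords are therefore scored 0.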

!!! tip deepseek-chat TL;DR

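The paper tackles the problem of text-to-image outputs being too vivid to look real, and proposes a progressive framework comprising a dataset, an evaluation metric, and a training-free refinement method to assess and improve the color fidelity of realistic-style images.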

Abstract

Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at https://github.com/ZhengyaoFang/CFM.

Keywords: text-to-image generation, color fidelity, realistic-style images, evaluation metric, dataset, training-free refinement, multimodal encoder, spatial-temporal guidance

34. ❌ Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI

Authors: Joan Perramon-Llussà, Amelia Jiménez-Sánchez, Grzegorz Skorupko, Fotis Avgoustidis, Carlos Martín-Isla, Karim Lekadir, Polyxeni Gkontra
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.10967v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

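Scoring rationale: The paper's core topic is federated fine-tuning of foundation models for medical imaging. It is highly relevant to 'Foundation Models' (10); 'PEFT/LoRA' is the core method (15); 'Supervised Fine-tuning' is the main technique (10); 'AI for Science' is the application area (10); 'Domain Adaptation' has some connection (5); the remaining keywords are unrelated to the paper (0).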

!!! tip deepseek-chat TL;DR

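To address multi-center data privacy and heterogeneity in medical imaging, the paper proposes Med-DualLoRA, a federated learning framework that separates global and local LoRA adapters, significantly improving 3D cardiac MRI disease detection while maintaining communication efficiency.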

Abstract

Foundation models (FMs) show great promise for robust downstream performance across medical imaging tasks and modalities, including cardiac magnetic resonance (CMR), following task-specific adaptation. However, adaptation using single-site data may lead to suboptimal performance and increased model bias, while centralized fine-tuning on clinical data is often infeasible due to privacy constraints. Federated fine-tuning offers a privacy-preserving alternative; yet conventional approaches struggle under heterogeneous, non-IID multi-center data and incur substantial communication overhead when adapting large models. In this work, we study federated FM fine-tuning for 3D CMR disease detection and propose Med-DualLoRA, a client-aware parameter-efficient fine-tuning (PEFT) federated framework that disentangles globally shared and local low-rank adaptations (LoRA) through additive decomposition. Global and local LoRA modules are trained locally, but only the global component is shared and aggregated across sites, keeping local adapters private. This design improves personalization while significantly reducing communication cost, and experiments show that adapting only two transformer blocks preserves performance while further improving efficiency. We evaluate our method on a multi-center state-of-the-art cine 3D CMR FM fine-tuned for disease detection using ACDC and combined M&Ms datasets, treating each vendor as a federated client. Med-DualLoRA achieves statistically significant improved performance (balanced accuracy 0.768, specificity 0.612) compared to other federated PEFT baselines, while maintaining communication efficiency. Our approach provides a scalable solution for local federated adaptation of medical FMs under realistic clinical constraints.
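
The additive decomposition and the share-only-global rule can be sketched as follows. This is an illustrative toy in plain Python, not the authors' implementation: it assumes the effective weight is W0 + B_g·A_g + B_l·A_l and that the server averages the global factors element-wise with equal client weights (the abstract does not state the aggregation rule). All class and function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(*ms: Matrix) -> Matrix:
    """Element-wise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*ms)]

def mean(ms: List[Matrix]) -> Matrix:
    """Element-wise mean of equally shaped matrices."""
    return [[sum(vals) / len(ms) for vals in zip(*rows)] for rows in zip(*ms)]

class Client:
    """One federated site holding a shared (global) and a private (local) LoRA pair."""

    def __init__(self, b_global: Matrix, a_global: Matrix,
                 b_local: Matrix, a_local: Matrix) -> None:
        self.b_global, self.a_global = b_global, a_global
        self.b_local, self.a_local = b_local, a_local

    def effective_weight(self, w0: Matrix) -> Matrix:
        # Assumed additive decomposition: W = W0 + B_g A_g + B_l A_l
        return add(w0, matmul(self.b_global, self.a_global),
                   matmul(self.b_local, self.a_local))

def aggregate_global(clients: List[Client]) -> None:
    """Average only the global factors; local factors never leave the client."""
    b_avg = mean([c.b_global for c in clients])
    a_avg = mean([c.a_global for c in clients])
    for c in clients:
        c.b_global = [row[:] for row in b_avg]
        c.a_global = [row[:] for row in a_avg]

w0 = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (toy 2x2, rank-1 adapters)
clients = [
    Client([[1.0], [0.0]], [[1.0, 0.0]], [[0.0], [1.0]], [[0.0, 1.0]]),
    Client([[3.0], [0.0]], [[1.0, 2.0]], [[0.0], [2.0]], [[1.0, 0.0]]),
]
aggregate_global(clients)
print(clients[0].effective_weight(w0), clients[1].effective_weight(w0))
```

After aggregation both clients hold identical global factors but still produce different effective weights, because each keeps its own local adapter. Only the small global factors would cross the network in this sketch, which is where the communication saving comes from.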

Keywords: Foundation Models, Federated Learning, Parameter-efficient Fine-tuning, LoRA, Medical Imaging, Cardiac MRI, 3D CMR, Disease Detection

35. ❌ Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition

Authors: Jian Sun, Mohammad H. Mahoor
Journal/Source: arXiv
Published: 2026-03-11
arXiv link: http://arxiv.org/abs/2603.10965v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

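Scoring rationale: The paper combines video quality assessment with video classification, using self-supervised learning and a Vision Transformer architecture, and belongs to computer vision. All keywords concern large language models, deep learning techniques, or AI applications in science, but the paper does not involve any LLM technology, model training method, inference optimization, alignment technique, or agent system. It has only some connection to 'AI for Science OR Bioinformatics OR Cheminformatics', because it is applied in healthcare (Mild Cognitive Impairment diagnosis), but it is not core bioinformatics or cheminformatics research, so that keyword receives 5 points (some relevance). All other keywords are unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes SSL-V3, a Video Vision Transformer method that combines self-supervised learning with no-reference video quality assessment to make video classification more robust. For Mild Cognitive Impairment diagnosis, it reaches 94.87% accuracy on some interview videos in the I-CONECT healthcare dataset.

Abstract (Chinese translation)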

视频质量显著影响视频分类效果。我们在对轻度认知障碍进行分类时发现，清晰视频的分类效果良好，而模糊视频的分类效果较差。自此，我们意识到参考视频质量评估可能提升视频分类性能。本文提出了一种结合无参考视频质量评估的自监督学习视频视觉变换器分类方法（SSL-V3）以实现这一目标。SSL-V3利用组合式自监督学习机制将VQA融入视频分类，并解决了视频数据集中常见的VQA标签短缺问题——该问题导致无法提供准确的视频质量分数。简言之，组合式自监督学习将视频质量分数作为直接调整视频分类特征图的因子，该分数作为交叉连接点，将VQA与分类任务相链接，利用有监督的分类任务来优化VQA参数。SSL-V3在两个数据集上取得了稳健的实验结果。例如，在I-CONECT（一个包含面部视频的医疗数据集）的部分访谈视频中达到了94.87%的准确率，验证了SSL-V3的有效性。

Abstract

Video quality significantly affects video classification. We found this problem when we classified Mild Cognitive Impairment well from clear videos, but worse from blurred ones. From then, we realized that referring to Video Quality Assessment (VQA) may improve video classification. This paper proposed Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3) to fulfill the goal. SSL-V3 leverages Combined-SSL mechanism to join VQA into video classification and address the label shortage of VQA, which commonly occurs in video datasets, making it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL takes video quality score as a factor to directly tune the feature map of the video classification. Then, the score, as an intersected point, links VQA and classification, using the supervised classification task to tune the parameters of VQA. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on some interview videos in the I-CONECT (a facial video-involved healthcare dataset), verifying SSL-V3’s effectiveness.

Keywords: Video Quality Assessment, Video Vision Transformer, Self-Supervised Learning, Video Classification, Mild Cognitive Impairment, Healthcare Dataset, No-reference VQA, Combined-SSL

36. ❌ Pointy - A Lightweight Transformer for Point Cloud Foundation Models

Authors: Konrad Szafer, Marek Kraft, Dominik Belter Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10963v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

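Scoring rationale: The paper proposes a lightweight Transformer foundation model for point cloud data. It is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points) because it explicitly studies point cloud foundation models. It is relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points) because it involves pre-training and architecture design. It has some connection to 'Scaling Laws AND Data Quality' (5 points) because it discusses the relationship between training data (only 39k point clouds) and model performance. It has some connection to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) because point cloud processing can be viewed as scientific computing or an AI for Science application. Other keywords such as MoE, SFT, and RLHF are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a lightweight Transformer architecture for point cloud foundation models. Trained on only 39k point clouds, it outperforms several larger models trained on more data, and a standardized experimental framework is used to validate the design.

Abstract (Chinese translation)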

点云数据的基础模型近年来能力不断增强，通常依赖于从语言或视觉模态进行的大规模表征学习。在本研究中，我们采用了一种更为可控的方法，引入了一种基于轻量级Transformer的点云架构。与高度依赖跨模态监督的方法不同，我们的模型仅使用3.9万个点云进行训练，但其性能却优于多个使用超过20万个训练样本的大型基础模型。值得注意的是，我们的方法取得了与那些训练过超百万点云、图像和文本样本的模型相媲美的先进结果，这证明了精心设计的训练方案与架构的价值。为确保严谨评估，我们开展了全面的复现研究，统一了训练流程并对多种点云架构进行了标准化基准测试。这一统一的实验框架分离了架构选择的影响，使得透明比较成为可能，并凸显了我们设计及其他无标记化架构的优势。我们的结果表明，简单的骨干网络能够取得与更复杂或数据更丰富的策略相竞争的结果。相关实现（包括代码、预训练模型和训练协议）已在https://github.com/KonradSzafer/Pointy 开源。

Abstract

Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. In contrast to the heavy reliance on cross-modal supervision, our model is trained only on 39k point clouds - yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches state-of-the-art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer-free architectures. Our results show that simple backbones can deliver competitive results to more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at https://github.com/KonradSzafer/Pointy.

Keywords: point cloud, foundation models, transformer, lightweight architecture, representation learning, training regime, benchmarking, tokenizer-free

37. ❌ Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD

Authors: Qinxin Wu, Fucheng Niu, Hengchuan Zhu, Yifan Sun, Ye Shen, Xu Li, Han Wu, Leqi Liu, Zhiwen Pan, Zuozhu Liu, Fudong Zhu, Bin Feng Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10933v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

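Scoring rationale: The paper focuses on medical imaging report generation, an application of AI in biomedicine, and is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), since CBCT is a medical imaging modality and is treated here as a bioinformatics application. The other keywords concern large-model techniques, training methods, inference optimization, agent systems, and so on. The paper gives no large-model or deep learning technical details and refers only generically to 'Generative AI', without covering LLMs, MoE, training techniques, alignment, or reasoning methods, so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The study addresses the scarcity of high-quality paired data and the complexity of volumetric image interpretation in oral and maxillofacial CBCT report generation. By developing the CBCTRepD system and building a large-scale dataset, it produces report drafts comparable in quality to those of intermediate radiologists and, in radiologist-AI collaboration, improves report accuracy for radiologists at every experience level.

Abstract (Chinese translation)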

生成式人工智能在医学报告生成领域发展迅速，然而其在口腔颌面锥形束CT报告生成中的应用仍十分有限，这主要源于高质量配对的CBCT-报告数据稀缺，以及三维CBCT影像解读固有的复杂性。为此，我们推出了CBCTRepD，一个双语口腔颌面CBCT报告生成系统，旨在融入常规的放射科医师与AI协同撰写工作流。我们构建了一个大规模、高质量的配对CBCT-报告数据集，包含约7,408项研究，涵盖不同采集条件下的55种口腔疾病实体，并以此为基础开发了本系统。我们进一步建立了一个基于临床实践的多层次评估框架，该框架结合自动化指标以及以放射科医师和临床医生为中心的评价，对AI直接生成的报告草稿和经放射科医师编辑后的协作报告均进行评估。利用该框架，我们证明CBCTRepD在报告生成性能上表现优异，其生成的草稿在书写质量和标准化程度上可与中级放射科医师的报告相媲美。更重要的是，在放射科医师与AI的协作中，CBCTRepD为不同经验水平的医师均提供了持续且具有临床意义的助益：它帮助初级放射科医师提升至中级报告水平，使中级放射科医师能够接近高级医师的表现，甚至通过减少遗漏相关错误（包括临床上重要的漏诊病灶）来辅助高级放射科医师。通过改善报告结构、减少遗漏、并促进对跨解剖区域共存病灶的关注，CBCTRepD展现出强大而可靠的潜力，有望成为多层级医疗场景中现实世界CBCT报告撰写的实用助手。

Abstract

Generative AI has advanced rapidly in medical report generation; however, its application to oral and maxillofacial CBCT reporting remains limited, largely because of the scarcity of high-quality paired CBCT-report data and the intrinsic complexity of volumetric CBCT interpretation. To address this, we introduce CBCTRepD, a bilingual oral and maxillofacial CBCT report-generation system designed for integration into routine radiologist-AI co-authoring workflows. We curated a large-scale, high-quality paired CBCT-report dataset comprising approximately 7,408 studies, covering 55 oral disease entities across diverse acquisition settings, and used it to develop the system. We further established a clinically grounded, multi-level evaluation framework that assesses both direct AI-generated drafts and radiologist-edited collaboration reports using automatic metrics together with radiologist- and clinician-centered evaluation. Using this framework, we show that CBCTRepD achieves superior report-generation performance and produces drafts with writing quality and standardization comparable to those of intermediate radiologists. More importantly, in radiologist-AI collaboration, CBCTRepD provides consistent and clinically meaningful benefits across experience levels: it helps novice radiologists improve toward intermediate-level reporting, enables intermediate radiologists to approach senior-level performance, and even assists senior radiologists by reducing omission-related errors, including clinically important missed lesions. By improving report structure, reducing omissions, and promoting attention to co-existing lesions across anatomical regions, CBCTRepD shows strong and reliable potential as a practical assistant for real-world CBCT reporting across multi-level care settings.

Keywords: CBCT report generation, oral and maxillofacial, generative AI, radiologist-AI collaboration, clinical evaluation, medical imaging, dataset curation, multi-level assessment

38. ❌ Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment

Authors: Fanqi Yu, Matteo Tiezzi, Tommaso Apicella, Cigdem Beyan, Vittorio Murino Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10929v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

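Scoring rationale: The paper studies a lifelong imitation learning framework focused on continual refinement of robot policies, using multimodal latent replay and incremental adjustment. Although it involves AI and machine learning, its content has no direct connection to any of the scoring keywords, which center on LLM technology, training methods, reasoning, alignment, compression, and AI for Science applications. The paper does not mention LLMs, MoE, SLMs, scaling laws, pre-training, fine-tuning, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agents, tool use, quantization, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for Science applications. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a lifelong imitation learning framework that uses multimodal latent replay and an incremental feature adjustment mechanism to refine policies continually across sequential tasks under memory and data constraints. It sets a new state of the art on the LIBERO benchmarks, with 10-17 point AUC gains and up to 65% less forgetting.

Abstract (Chinese translation)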

我们提出了一种终身模仿学习框架，该框架能够在现实的内存与数据约束下，实现跨序列任务的持续策略优化。我们的方法完全在多模态潜在空间中运行，从而区别于传统的经验回放机制；在该空间中，视觉、语言及机器人状态信息的紧凑表示被存储并复用，以支持未来的学习。为进一步稳定适应过程，我们引入了一种增量特征调整机制，通过角度间隔约束对任务嵌入的演化进行正则化，从而保持任务间的区分性。我们的方法在LIBERO基准测试中确立了新的性能标杆，与先前领先方法相比，AUC指标提升了10-17个点，且遗忘率降低了高达65%。消融实验验证了各组成部分的有效性，显示出相较于替代策略的稳定性能提升。代码发布于：https://github.com/yfqi/lifelong_mlr_ifa。

Abstract

We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot’s state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.

Keywords: lifelong imitation learning, multimodal latent replay, incremental adjustment, continual policy refinement, sequential tasks, memory constraints, LIBERO benchmarks, forgetting reduction

39. ❌ Novel Architecture of RPA In Oral Cancer Lesion Detection

Authors: Revana Magdy, Joy Naoum, Ali Hamdi Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10928v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

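Scoring rationale: The paper studies the optimization of an RPA (robotic process automation) architecture for oral cancer lesion detection, in the medical image processing domain. All keywords concern large models, deep learning techniques, or AI applications in science, but the abstract mentions no large models, deep learning, or AI techniques. It covers only conventional software engineering methods such as RPA, design patterns, and batch processing, and is unrelated to every keyword.

!!! tip deepseek-chat TL;DR

The study optimizes an RPA architecture for oral cancer lesion detection with the Singleton design pattern and batch processing, cutting per-image prediction time from 0.29 s (OC-RPAv1) to 0.06 s (OC-RPAv2), roughly a 4.8x speedup between the two versions. The abstract additionally reports a 60-100x efficiency improvement relative to standard RPA methods.

Abstract (Chinese translation)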

口腔癌病灶的准确早期检测对有效诊断与治疗至关重要。本研究使用包含31张图像的测试集评估了两种RPA实现方案——OC-RPAv1与OC-RPAv2。OC-RPAv1平均每幅图像预测耗时0.29秒，而OC-RPAv2采用单例设计模式（Singleton design pattern）与批量处理技术，将单幅图像预测时间缩短至0.06秒。相较于标准RPA方法，该方案实现了60-100倍的效率提升，证明设计模式与批量处理能显著增强口腔癌检测系统的可扩展性并降低应用成本。

Abstract

Accurate and early detection of oral cancer lesions is crucial for effective diagnosis and treatment. This study evaluates two RPA implementations, OC-RPAv1 and OC-RPAv2, using a test set of 31 images. OC-RPAv1 processes one image per prediction in an average of 0.29 seconds, while OC-RPAv2 employs a Singleton design pattern and batch processing, reducing prediction time to just 0.06 seconds per image. This represents a 60-100x efficiency improvement over standard RPA methods, showcasing that design patterns and batch processing can enhance scalability and reduce costs in oral cancer detection.

Keywords: oral cancer lesion detection, RPA, Singleton design pattern, batch processing, prediction time, efficiency improvement, scalability, cost reduction
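The two techniques named in the abstract above, a Singleton and batch processing, can be sketched as follows. This is a generic illustration and not the OC-RPAv2 code: `LesionModel`, `ModelRegistry`, the brightness-threshold "classifier", and the batch size are hypothetical stand-ins. The point is only that the expensive model object is built once and then reused for whole batches of images.

```python
import threading
from typing import List, Optional, Sequence

Image = Sequence[float]  # hypothetical stand-in for pixel data


class LesionModel:
    """Placeholder for a classifier that is expensive to load."""

    load_count = 0  # counts how many times the "expensive" load happens

    def __init__(self) -> None:
        LesionModel.load_count += 1

    def predict_batch(self, images: Sequence[Image]) -> List[str]:
        # Toy rule standing in for real inference; assumes non-empty images.
        return ["lesion" if sum(img) / len(img) > 0.5 else "normal" for img in images]


class ModelRegistry:
    """Singleton access point: every caller shares one LesionModel instance."""

    _instance: Optional[LesionModel] = None
    _lock = threading.Lock()

    @classmethod
    def get(cls) -> LesionModel:
        if cls._instance is None:
            with cls._lock:
                # Re-check inside the lock so concurrent callers load only once.
                if cls._instance is None:
                    cls._instance = LesionModel()
        return cls._instance


def predict_in_batches(images: Sequence[Image], batch_size: int = 8) -> List[str]:
    """Run inference batch by batch through the shared model."""
    model = ModelRegistry.get()
    results: List[str] = []
    for start in range(0, len(images), batch_size):
        results.extend(model.predict_batch(images[start:start + batch_size]))
    return results
```

Calling `predict_in_batches` repeatedly never reloads the model, which is the kind of saving the Singleton is meant to provide. How much of the reported 0.29 s to 0.06 s reduction comes from each technique is not stated in the abstract.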

40. ❌ S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs

Authors: Yuzhou Ji, Qijian Tian, He Zhu, Xiaoqi Jiang, Guangzhi Cao, Lizhuang Ma, Yuan Xie, Xin Tan Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10893v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

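Scoring rationale: The paper focuses on 3D reconstruction in computer vision, proposing a new method that goes from sparse inputs to dense 3D Gaussian Splatting reconstruction. Although it involves deep learning (a diffusion model) and 3D representation learning, all of the given keywords target large language models (LLMs) and related techniques such as training methods, inference optimization, alignment, and agent systems. The paper has no direct connection to LLMs, language model techniques, or AI for Science (bioinformatics/cheminformatics), so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes S2D, a pipeline that combines an efficient one-step diffusion model with a dedicated reconstruction strategy to achieve high-quality 3D Gaussian Splatting reconstruction from sparse input views, enabling stable scene reconstruction with minimal inputs.

Abstract (Chinese translation)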

显式三维表示已成为三维仿真与理解的重要媒介。然而，最常用的点云与三维高斯泼溅（3DGS）各自存在渲染非真实感及在稀疏输入下性能显著退化的问题。本文提出稀疏到稠密提升（S2D）这一新颖流程，它桥接了两种表示方式，并能够以极简输入实现高质量的三维高斯泼溅重建。具体而言，S2D提升包含双重机制：我们首先提出一种高效的单步扩散模型，用于提升稀疏点云以实现高保真度的图像伪影修复；同时，为重建三维一致场景，我们还设计了一种相应的重建策略，结合随机采样丢弃与加权梯度方法，以实现从稀疏输入视角到稠密新视角的鲁棒模型拟合。大量实验表明，在不同输入稀疏度条件下，S2D在生成新视角引导方面具有最佳的一致性，并在稀疏视角重建质量上达到第一梯队水平。通过在现有方法中以最少的采集量重建稳定场景，S2D为三维高斯泼溅应用实现了最小化的输入需求。

Abstract

Explicit 3D representations have already become an essential medium for 3D simulation and understanding. However, the most commonly used point cloud and 3D Gaussian Splatting (3DGS) each suffer from non-photorealistic rendering and significant degradation under sparse inputs. In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs. Specifically, the S2D lifting is two-fold. We first present an efficient one-step diffusion model that lifts sparse point cloud for high-fidelity image artifact fixing. Meanwhile, to reconstruct 3D consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradient for robust model fitting from sparse input views to dense novel views. Extensive experiments show that S2D achieves the best consistency in generating novel view guidance and first-tier sparse view reconstruction quality under different input sparsity. By reconstructing stable scenes with the least possible captures among existing methods, S2D enables minimal input requirements for 3DGS applications.

Keywords: 3D reconstruction, sparse inputs, 3D Gaussian Splatting, diffusion model, novel view synthesis, point cloud, minimal input, scene consistency

41. ❌ Bilevel Layer-Positioning LoRA for Real Image Dehazing

Authors: Yan Zhang, Long Ma, Yuxin Feng, Zhe Huang, Fan Zhou, Zhuo Su Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10872v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

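Scoring rationale: The paper addresses image dehazing in computer vision and proposes BiLaLoRA, a LoRA parameter-efficient fine-tuning strategy that automatically searches for the network layers in which to inject adapters. It is therefore highly relevant only to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points), because its core contribution is an improved application of LoRA. It has some connection to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because it involves domain adaptation to different haze scenes. All other keywords are unrelated (0 points), because the paper does not cover large language models, reasoning, alignment, AI for Science, or similar topics.

!!! tip deepseek-chat TL;DR

To address the difficulty models have in adapting to diverse haze scenes in real image dehazing, the paper proposes a CLIP-based cross-modal unsupervised loss and the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, and reports results superior to existing methods on multiple real-world dehazing benchmarks.

Abstract (Chinese translation)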

基于学习的真实图像去雾方法已取得显著进展，但在多样化的真实雾霾场景中仍面临适应性问题。这些挑战主要源于缺乏针对无标签数据的有效无监督机制，以及完整模型微调的高昂成本。为解决这些问题，我们提出了雾到清文本导向损失，该损失利用CLIP的跨模态能力，将真实图像去雾重新定义为潜在空间中的语义对齐问题，从而在缺乏参考图像的情况下提供显式的无监督跨模态指导。此外，我们引入了双层定位LoRA策略，该策略不仅学习LoRA参数，还能自动搜索注入层，实现对关键网络层的针对性适应。大量实验证明，在多个真实世界去雾基准测试中，我们的方法优于当前最先进的技术。代码已公开于https://github.com/YanZhang-zy/BiLaLoRA。

Abstract

Learning-based real image dehazing methods have achieved notable progress, yet they still face adaptation challenges in diverse real haze scenes. These challenges mainly stem from the lack of effective unsupervised mechanisms for unlabeled data and the heavy cost of full model fine-tuning. To address these challenges, we propose the haze-to-clear text-directed loss that leverages CLIP’s cross-modal capabilities to reformulate real image dehazing as a semantic alignment problem in latent space, thereby providing explicit unsupervised cross-modal guidance in the absence of reference images. Furthermore, we introduce the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, which both learns the LoRA parameters and automatically searches the injection layers, enabling targeted adaptation of critical network layers. Extensive experiments demonstrate our superiority against state-of-the-art methods on multiple real-world dehazing benchmarks. The code is publicly available at https://github.com/YanZhang-zy/BiLaLoRA.

Keywords: real image dehazing, LoRA, parameter-efficient fine-tuning, unsupervised learning, cross-modal guidance, CLIP, domain adaptation, layer positioning

Authors: Lin Chen, Bolin Ni, Qi Yang, Zili Wang, Kun Ding, Ying Wang, Houwen Peng, Shiming Xiang Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10863v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

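Scoring rationale: The paper proposes DIPE, a new position encoding method that addresses visual fading in Multimodal Large Language Models (MLLMs) in long-context scenarios. It is highly relevant to 'Large Language Models' (10 points), because MLLMs are an important extension of LLMs, and to 'Context Window Extension' (10 points), because it specifically targets performance in long-context scenarios. Other keywords such as MoE, SLMs, Scaling Laws, training methods, reasoning techniques, and application domains are not covered by the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the visual fading that Multimodal Large Language Models exhibit in long-context scenarios, the paper proposes DIPE, a position encoding method that disentangles position encoding by modality interaction, keeping visual signals perceptually consistent and mitigating visual fading.

Abstract (Chinese translation)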

尽管多模态大语言模型（MLLMs）具备卓越的能力，但在长上下文场景中仍存在视觉信息衰减的问题。具体而言，随着文本序列长度的增加，模型对视觉标记的关注度逐渐减弱，导致生成的文本脱离视觉约束。我们将这种性能下降归因于多模态旋转位置编码（Multimodal RoPE）固有的归纳偏置，该偏置会随着视觉标记与文本标记之间距离的增加而抑制跨模态注意力。为解决这一问题，我们提出了跨模态距离不变位置编码（DIPE），这是一种简单而有效的机制，能够基于模态交互解耦位置编码。DIPE 保留了模态内交互的自然相对位置关系以维持局部结构，同时为跨模态交互强制锚定的感知邻近性。该策略有效缓解了基于跨模态距离的抑制效应，确保无论上下文长度如何变化，视觉信号在感知上保持一致。实验结果表明，通过将 DIPE 与多模态 RoPE 结合，模型在长上下文场景中保持了稳定的视觉基础，显著缓解了视觉衰减问题，同时在标准短上下文基准测试中保持了性能。代码可在 https://github.com/lchen1019/DIPE 获取。

Abstract

Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at https://github.com/lchen1019/DIPE.

Keywords: Multimodal Large Language Models, Position Encoding, Long-context Scenarios, Visual Fading, Inter-modal Attention, RoPE, Visual Grounding, DIPE

43. ❌ UltrasoundAgents: Hierarchical Multi-Agent Evidence-Chain Reasoning for Breast Ultrasound Diagnosis

Authors: Yali Zhu, Kang Zhou, Dingbang Wu, Gaofeng Meng Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10852v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

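Scoring rationale: The paper proposes a hierarchical multi-agent framework for breast ultrasound diagnosis and is highly relevant to several keywords: 1) the agents are trained with supervised fine-tuning (SFT); 2) diagnosis uses evidence-chain reasoning, which is similar to chain of thought; 3) it implements an in-depth reasoning process; 4) it coordinates work through a multi-agent system; 5) it produces explainable AI outputs; 6) it is an application of AI in the biomedical domain. Other keywords, such as large language models, MoE, and quantization, have no direct connection to the paper's computer-vision and multi-agent framework.

!!! tip deepseek-chat TL;DR

This paper addresses the lack of fine-grained evidence and auditability in existing breast ultrasound diagnosis methods. It proposes a hierarchical multi-agent framework that uses local attribute analysis and evidence-chain reasoning to improve diagnostic accuracy and interpretability.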

摘要翻译

乳腺超声诊断通常遵循从整体病灶定位到局部征象评估，再整合证据以确定BI-RADS分类及良恶性的流程。现有方法多依赖端到端预测或仅提供弱证据支持，可能遗漏细粒度病灶特征，且可审计性与临床复核性有限。为契合临床工作流程并提升证据可追溯性，我们提出一种分层多智能体框架，称为UltrasoundAgents。主智能体在全图像中定位病灶并触发裁剪放大操作；子智能体分析局部视图并预测四个临床相关属性：回声模式、钙化、边界类型及边缘形态。主智能体随后整合这些结构化属性进行循证推理，输出BI-RADS分类与恶性预测，同时生成可复核的中间证据。此外，分层多智能体训练常面临误差传播、信用分配困难及奖励稀疏等问题。为缓解此问题并提升训练稳定性，我们引入解耦渐进式训练策略：首先训练属性智能体，随后使用真实属性训练主智能体以学习稳健的基于属性的推理，最后通过空间监督的校正轨迹自蒸馏构建高质量轨迹进行监督微调，从而获得可部署的端到端策略。实验表明，本方法在诊断准确率与属性一致性上均优于强视觉语言基线模型，同时提供结构化证据与可追溯的推理过程。

摘要 (Abstract)

Breast ultrasound diagnosis typically proceeds from global lesion localization to local sign assessment and then evidence integration to assign a BI-RADS category and determine benignity or malignancy. Many existing methods rely on end-to-end prediction or provide only weakly grounded evidence, which can miss fine-grained lesion cues and limit auditability and clinical review. To align with the clinical workflow and improve evidence traceability, we propose a hierarchical multi-agent framework, termed UltrasoundAgents. A main agent localizes the lesion in the full image and triggers a crop-and-zoom operation. A sub-agent analyzes the local view and predicts four clinically relevant attributes, namely echogenicity pattern, calcification, boundary type, and edge (margin) morphology. The main agent then integrates these structured attributes to perform evidence-based reasoning and output the BI-RADS category and the malignancy prediction, while producing reviewable intermediate evidence. Furthermore, hierarchical multi-agent training often suffers from error propagation, difficult credit assignment, and sparse rewards. To alleviate this and improve training stability, we introduce a decoupled progressive training strategy. We first train the attribute agent, then train the main agent with oracle attributes to learn robust attribute-based reasoning, and finally apply corrective trajectory self-distillation with spatial supervision to build high-quality trajectories for supervised fine-tuning, yielding a deployable end-to-end policy. Experiments show consistent gains over strong vision-language baselines in diagnostic accuracy and attribute agreement, together with structured evidence and traceable reasoning.

Keywords: multi-agent systems, evidence-chain reasoning, breast ultrasound diagnosis, supervised fine-tuning, explainable AI, hierarchical agents, clinical workflow alignment, BI-RADS prediction

44. ❌ Leech Lattice Vector Quantization for Efficient LLM Compression

Authors: Tycho F. A. van der Ouderaa, Mart van Baalen, Paul Whatmough, Markus Nagel Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.11021v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

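Scoring rationale: The paper focuses on quantization-based compression of large language models (LLMs); its core contribution is a vector quantization method based on the Leech lattice (LLVQ). It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'Quantization OR Model Compression OR Low-bit Weights' (10 points each). It does not touch the other keywords, such as MoE, SLMs, training methods, inference acceleration, alignment, agents, or AI-for-science applications, so those score 0.

!!! tip deepseek-chat TL;DR

This paper targets the information-theoretic limits of conventional scalar quantization in LLM compression. It proposes Leech Lattice Vector Quantization (LLVQ), which supports efficient indexing without explicit codebook storage and parallelizable dequantization, and outperforms existing state-of-the-art methods.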

摘要翻译

大语言模型（LLM）的标量量化从根本上受到信息论界限的限制。虽然向量量化（VQ）通过对参数块进行联合编码克服了这些限制，但实际实现必须避免使用昂贵的查找机制或其他显式码本存储。格点方法通过高度结构化且密集的填充解决了这一问题。本文探讨了利奇格（Leech lattice），该格在24维上具有最优球体填充和接触配置，是已知具有此类最优性质的最高维格点。为使利奇格能用于LLM量化，我们扩展了一种基于扩展戈莱码构造的现有搜索算法，以：i）支持索引功能，实现在无需具体化码本的情况下与比特字符串相互转换；ii）允许在利奇格壳层的并集上进行角度搜索；iii）提出完全可并行化的反量化内核。这些共同构成了一种实用算法，即利奇格向量量化（LLVQ）。LLVQ实现了最先进的LLM量化性能，超越了近期如Quip#、QTIP和PVQ等方法。这些结果凸显了高维格点对于可扩展、理论坚实的模型压缩的重要性。

摘要 (Abstract)

Scalar quantization of large language models (LLMs) is fundamentally limited by information-theoretic bounds. While vector quantization (VQ) overcomes these limits by encoding blocks of parameters jointly, practical implementations must avoid the need for expensive lookup mechanisms or other explicit codebook storage. Lattice approaches address this through highly structured and dense packing. This paper explores the Leech lattice, which, with its optimal sphere packing and kissing configurations at 24 dimensions, is the highest dimensional lattice known with such optimal properties. To make the Leech lattice usable for LLM quantization, we extend an existing search algorithm based on the extended Golay code construction to i) support indexing, enabling conversion to and from bitstrings without materializing the codebook, ii) allow angular search over a union of Leech lattice shells, and iii) provide a fully parallelisable dequantization kernel. Together this yields a practical algorithm, namely Leech Lattice Vector Quantization (LLVQ). LLVQ delivers state-of-the-art LLM quantization performance, outperforming recent methods such as QuIP#, QTIP, and PVQ. These results highlight the importance of high-dimensional lattices for scalable, theoretically grounded model compression.
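
The idea of quantizing a block of weights to a lattice point without storing a codebook can be sketched on a much simpler lattice. The example below uses D_n (integer vectors with an even coordinate sum) and the standard round-then-fix-parity nearest-point rule. It is not the Leech lattice and not the paper's Golay-code-based search; the function names and the `scale` parameter are illustrative assumptions.

```python
def nearest_dn_point(x):
    """Nearest point of D_n = {integer vectors whose coordinates sum to an even number}."""
    rounded = [round(v) for v in x]
    if sum(rounded) % 2 == 0:
        return rounded
    # Parity is odd: re-round the coordinate with the largest rounding error
    # in the other direction, which restores even parity at minimal extra cost.
    worst = max(range(len(x)), key=lambda i: abs(x[i] - rounded[i]))
    rounded[worst] += 1 if x[worst] >= rounded[worst] else -1
    return rounded

def quantize(weights, scale):
    """Quantize one block of weights to the scaled lattice scale * D_n.

    Returns the integer codes and the dequantized block. No codebook is stored:
    the lattice structure itself defines the set of representable points.
    """
    codes = nearest_dn_point([w / scale for w in weights])
    return codes, [c * scale for c in codes]

# Illustrative block of four weights (hypothetical values).
codes, dequantized = quantize([0.31, -0.12, 0.05, 0.44], scale=0.25)
print(codes, dequantized)
```

The joint parity constraint is what makes this a vector quantizer rather than four independent scalar quantizers: the second coordinate is pushed to -1 even though 0 is closer, because that is the cheapest way to reach a valid lattice point.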

Keywords: Large Language Models, LLM Quantization, Vector Quantization, Leech Lattice, Model Compression, Codebook, Parameter Compression, Information-theoretic Bounds

45. ❌ Cross-Species Transfer Learning for Electrophysiology-to-Transcriptomics Mapping in Cortical GABAergic Interneurons

Authors: Theo Schwider, Ramin Ramezani Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.11000v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

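Scoring rationale: The paper studies cross-species transfer learning in neuroscience, using an attention-based BiLSTM to predict the transcriptomic identity of GABAergic inhibitory interneurons. It is unrelated to most keywords (large-model techniques, training methods, inference optimization, and so on), because those target general LLM technology whereas this work uses a domain-specific deep learning model (BiLSTM). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper is an AI application in bioinformatics and neuroscience but not core large-model technology, hence 8 points (some relevance, but not highly relevant).

!!! tip deepseek-chat TL;DR

This paper uses an attention-based BiLSTM with cross-species transfer learning to predict, from electrophysiological data, the transcriptomic identity of cortical GABAergic inhibitory interneurons in mouse and human, and shows that mouse-to-human transfer improves human subclass prediction.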

摘要翻译

单细胞电生理记录为神经元功能多样性提供了强大的观察窗口，并为将内在生理特性与转录组身份关联提供了可解释的路径。本研究基于艾伦研究所公开的小鼠与人类皮层Patch-seq数据集，复现并拓展了Gouwens等人（2020）提出的电生理-转录组学关联框架。我们聚焦于GABA能抑制性中间神经元，以研究跨物种可比且保守的亚类结构（Lamp5、Pvalb、Sst、Vip）。经过质量控制，我们分析了来自小鼠视觉皮层的3,699个神经元以及神经外科切除获取的506个人类新皮层神经元。通过标准化电生理特征与稀疏主成分分析（sPCA），我们复现了原小鼠研究中报告的主要类别层次分离结果。在有监督预测任务中，类别平衡随机森林模型在小鼠数据中提供了强力的特征工程基线，在人类数据中虽有所降低但仍具信息量。随后，我们开发了一种基于注意力机制的双向长短期记忆网络（BiLSTM），该模型直接处理结构化的IPFX特征族表示，避免了稀疏主成分分析的使用，并通过学习到的注意力权重提供特征族层面的可解释性。最后，我们评估了跨物种迁移学习场景：该序列模型在小鼠数据上进行预训练，随后在人类数据上针对对齐的四分类任务进行微调，相较于仅使用人类数据训练的基线，其宏观F1分数得到提升。综上，这些结果证实了Gouwens流程在小鼠数据中的可复现性，证明了序列模型能够匹配特征工程基线的性能，并表明从小鼠到人类的迁移学习可为人类神经元亚类预测带来可量化的性能提升。

摘要 (Abstract)

Single-cell electrophysiological recordings provide a powerful window into neuronal functional diversity and offer an interpretable route for linking intrinsic physiology to transcriptomic identity. Here, we replicate and extend the electrophysiology-to-transcriptomics framework introduced by Gouwens et al. (2020) using publicly available Allen Institute Patch-seq datasets from both mouse and human cortex. We focus on GABAergic inhibitory interneurons to target a subclass structure (Lamp5, Pvalb, Sst, Vip) that is comparable and conserved across species. After quality control, we analyzed 3,699 mouse visual cortex neurons and 506 human neocortical neurons from neurosurgical resections. Using standardized electrophysiological features and sparse PCA, we reproduced the major class-level separations reported in the original mouse study. For supervised prediction, a class-balanced random forest provided a strong feature-engineered baseline in mouse data and a reduced but still informative baseline in human data. We then developed an attention-based BiLSTM that operates directly on the structured IPFX feature-family representation, avoiding sPCA and providing feature-family-level interpretability via learned attention weights. Finally, we evaluated a cross-species transfer setting in which the sequence model is pretrained on mouse data and fine-tuned on human data for an aligned 4-class task, improving human macro-F1 relative to a human-only training baseline. Together, these results confirm reproducibility of the Gouwens pipeline in mouse data, demonstrate that sequence models can match feature-engineered baselines, and show that mouse-to-human transfer learning can provide measurable gains for human subclass prediction.

Keywords: cross-species transfer learning, electrophysiology-to-transcriptomics mapping, GABAergic interneurons, attention-based BiLSTM, mouse-to-human transfer, Patch-seq datasets, supervised prediction, feature-family interpretability

46. ❌ Factorized Neural Implicit DMD for Parametric Dynamics

Authors: Siyuan Chen, Zhecheng Wang, Yixin Chen, Yue Chang, Peter Yichen Chen, Eitan Grinspun, Jonathan Panuelos Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10995v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

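Scoring rationale: The paper studies dynamics modeling of physical systems based on neural fields and the Koopman operator. This falls under AI for Science, so it has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). It does not involve large language models (LLMs), innovations in the underlying principles of deep learning, or any of the specific techniques named in the other keywords (MoE, SFT, RAG, quantization, and so on), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

This paper proposes a method for modeling parametric dynamical systems based on a physics-coded neural field and the spectral decomposition of the Koopman operator. It supports stable long-term prediction, interpolation across parameter space, and spectral analysis, and is validated on a range of dynamics problems.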

摘要翻译

一种数据驱动的、无模型的方法来模拟物理系统的时间演化，减少了对显式控制方程知识的依赖。即使存在偏微分方程等物理先验信息，此类系统也常处于高维状态空间并呈现非线性动力学特性，使得传统数值求解器计算成本高昂且难以适用于实时分析与控制。考虑学习动力系统参数化流的问题：给定初始场和一组物理参数，我们的目标是预测系统随时间演化的过程，同时支持长时程推演、对未见参数的泛化以及谱分析。

我们提出了一种基于物理编码神经场的库普曼算子谱分解参数化方法。与拟合单一解曲面的物理约束神经场不同，也区别于直接近似固定时间范围解算子的神经算子，我们的模型学习一种分解的流算子，将空间模态与时间演化解耦。这种结构揭示了底层物理过程的本征值、模态和稳定性，从而实现了稳定的长期推演、参数空间插值以及谱分析。我们在一系列动力学问题上验证了该方法的有效性，展示了其准确预测复杂时空现象的能力，同时为系统动态行为提供了可解释的洞察。

摘要 (Abstract)

A data-driven, model-free approach to modeling the temporal evolution of physical systems mitigates the need for explicit knowledge of the governing equations. Even when physical priors such as partial differential equations are available, such systems often reside in high-dimensional state spaces and exhibit nonlinear dynamics, making traditional numerical solvers computationally expensive and ill-suited for real-time analysis and control. Consider the problem of learning a parametric flow of a dynamical system: with an initial field and a set of physical parameters, we aim to predict the system’s evolution over time in a way that supports long-horizon rollouts, generalization to unseen parameters, and spectral analysis. We propose a physics-coded neural field parameterization of the Koopman operator’s spectral decomposition. Unlike a physics-constrained neural field, which fits a single solution surface, and neural operators, which directly approximate the solution operator at fixed time horizons, our model learns a factorized flow operator that decouples spatial modes and temporal evolution. This structure exposes underlying eigenvalues, modes, and stability of the underlying physical process to enable stable long-term rollouts, interpolation across parameter spaces, and spectral analysis. We demonstrate the efficacy of our method on a range of dynamics problems, showcasing its ability to accurately predict complex spatiotemporal phenomena while providing insights into the system’s dynamic behavior.
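
The spectral view that the abstract builds on can be illustrated with the classical, non-neural form of dynamic mode decomposition in its simplest full-rank case: fit a linear operator to consecutive snapshots by least squares and read decay rate and frequency off its eigenvalues. This is a sketch under stated assumptions, not the paper's factorized neural-field model; the 2-D test system and all function names are hypothetical.

```python
import cmath
import math

def simulate(a_matrix, x0, steps):
    """Roll out x_{t+1} = A x_t for a 2-D state and return all snapshots."""
    (a, b), (c, d) = a_matrix
    xs = [x0]
    for _ in range(steps):
        x, y = xs[-1]
        xs.append((a * x + b * y, c * x + d * y))
    return xs

def fit_linear_operator(snapshots):
    """Least-squares A with x_{t+1} ~= A x_t, i.e. A = C G^{-1} (2x2 case)."""
    # G = sum of x_t x_t^T, C = sum of x_{t+1} x_t^T.
    g11 = g12 = g22 = 0.0
    c11 = c12 = c21 = c22 = 0.0
    for (x, y), (xn, yn) in zip(snapshots[:-1], snapshots[1:]):
        g11 += x * x
        g12 += x * y
        g22 += y * y
        c11 += xn * x
        c12 += xn * y
        c21 += yn * x
        c22 += yn * y
    det = g11 * g22 - g12 * g12
    return (
        ((c11 * g22 - c12 * g12) / det, (c12 * g11 - c11 * g12) / det),
        ((c21 * g22 - c22 * g12) / det, (c22 * g11 - c21 * g12) / det),
    )

def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0

# Hypothetical ground truth: a decaying rotation with eigenvalues 0.95 * exp(+/- 0.3i).
r, theta = 0.95, 0.3
true_a = ((r * math.cos(theta), -r * math.sin(theta)),
          (r * math.sin(theta), r * math.cos(theta)))
snapshots = simulate(true_a, (1.0, 0.0), steps=40)
lam_plus, lam_minus = eigenvalues_2x2(fit_linear_operator(snapshots))
# Modulus below 1 indicates a stable (decaying) mode; the phase is its frequency.
print(abs(lam_plus), abs(cmath.phase(lam_plus)))
```

Once the eigenvalues are known, a long-horizon rollout reduces to raising them to a power, which is why a spectral representation lends itself to stable long-term prediction.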

Keywords: neural field, Koopman operator, parametric dynamics, spectral decomposition, physics-coded, factorized flow operator, spatiotemporal phenomena, dynamic behavior

47. ❌ Bayesian Optimization with Gaussian Processes to Accelerate Stationary Point Searches

Authors: Rohit Goswami Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10992v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

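Scoring rationale: The paper focuses on using Bayesian optimization and Gaussian processes to accelerate stationary-point searches on potential energy surfaces, which places it in scientific computing and optimization. Although it applies AI methods to scientific computing (Bayesian optimization, Gaussian process regression), it does not involve large language models (LLMs), the underlying principles of deep learning, or any of the listed LLM-related keywords (MoE, SFT, RAG, quantization, and so on). The only loosely relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to science (specifically computational chemistry and physics); it is not a core match, hence 5 points (some relevance). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes a unified Bayesian optimization framework that uses Gaussian process regression and active learning to accelerate minimization, single-point saddle searches, and double-ended saddle searches on potential energy surfaces. A six-step surrogate loop reduces the number of evaluations required and improves efficiency.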

摘要翻译

加速势能面上驻点的探索，构建局部替代模型已历经数十年的努力。若实施得当，替代模型能在保持底层理论精度的同时，将所需计算量降低一个数量级。我们提出了一种统一的贝叶斯优化视角，通过一个统一的六步替代循环来处理极小化、单点鞍点搜索和双端鞍点搜索，其区别仅在于内部优化目标和采集准则的不同。该框架采用包含导数观测的高斯过程回归、反距离核以及主动学习。通过结合最远点采样与推土机距离的最优传输高斯过程扩展、基于方差屏障和振荡检测的最大后验概率正则化，以及自适应信任半径，构成了同一基础方法的具体扩展，从而提升了精度与效率。我们还展示了随机傅里叶特征能够将超参数训练与预测解耦，为高维系统提供了有利的扩展性。随附的教学性Rust代码表明，所有应用均使用完全相同的贝叶斯优化循环，从而弥合了理论表述与实际执行之间的鸿沟。

摘要 (Abstract)

Accelerating the exploration of stationary points on potential energy surfaces by building local surrogates spans decades of effort. Done correctly, surrogates reduce required evaluations by an order of magnitude while preserving the accuracy of the underlying theory. We present a unified Bayesian optimization view of minimization, single-point saddle searches, and double-ended saddle searches through a single six-step surrogate loop, differing only in the inner optimization target and acquisition criterion. The framework uses Gaussian process regression with derivative observations, inverse-distance kernels, and active learning. The Optimal Transport GP extensions of farthest point sampling with Earth mover's distance, MAP regularization via variance barrier and oscillation detection, and adaptive trust radius form concrete extensions of the same basic methodology, improving accuracy and efficiency. We also demonstrate that random Fourier features decouple hyperparameter training from prediction, enabling favorable scaling for high-dimensional systems. Accompanying pedagogical Rust code demonstrates that all applications use the exact same Bayesian optimization loop, bridging the gap between theoretical formulation and practical execution.

Keywords: Bayesian Optimization, Gaussian Process Regression, Stationary Point Search, Potential Energy Surfaces, Active Learning, Surrogate Modeling, Derivative Observations, Computational Chemistry

48. ❌ ForwardFlow: Simulation only statistical inference using deep learning

Authors: Stefan Böhringer Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10991v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

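Scoring rationale: The paper studies deep-learning-based analysis of parametric statistical models, training neural networks on simulated data for parameter estimation; it is an application of deep learning to scientific computing and statistical inference. None of the keywords is directly related to the paper. Only 'AI for Science OR Bioinformatics OR Cheminformatics' receives 5 points (some relevance), because the paper applies deep learning in a scientific setting (statistical inference), although it does not explicitly address bioinformatics or cheminformatics. The other keywords mainly target LLM-related techniques (pre-training, alignment, reasoning, agents, and so on), whereas this paper focuses on a specific application of conventional deep neural networks to statistical inference and does not cover LLMs, MoE, scaling laws, fine-tuning, RAG, attention optimization, reasoning techniques, agent systems, or model compression.

!!! tip deepseek-chat TL;DR

This paper proposes ForwardFlow, a deep-learning method for simulation-only statistical inference. It uses a branched network structure to learn parameter estimation from simulated data, and in simulations shows desirable properties including finite-sample exactness, robustness to data contamination, and algorithm approximation.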

摘要翻译

深度学习模型正被用于基于纯仿真框架的参数化统计分析。采用归一化流的贝叶斯模型从先验分布中模拟数据，并由两个深度神经网络构成：一个用于学习参数充分统计量的摘要网络，以及一个以摘要网络为条件、能够近似后验分布的归一化流。本文探索基于单一摘要网络的频率学派模型。在训练过程中，网络的输入是基于参数生成的模拟数据集，损失函数通过最小化学习得到的摘要统计量与参数之间的均方误差，使网络能够解决参数估计的逆问题。我们提出一种分支网络结构，其中包含压缩层——这些层将数据集降维为摘要统计量，再通过全连接层进一步映射以近似参数估计值。我们通过理论依据论证了该网络结构的选择。

在仿真实验中，我们证明了参数估计具备三个理想特性：有限样本精确性、对数据污染的鲁棒性以及算法近似能力。这些特性是通过在训练阶段向网络提供不同样本量、受污染数据以及需要算法重构的数据而实现的。在我们的仿真中，针对遗传数据的EM算法被网络自动近似。

纯仿真方法在复杂建模任务中展现出实用优势：研究者仅需负责较简单的数据模拟部分，而将解决逆问题这一更复杂的任务交由神经网络完成。未来的挑战性工作包括提供可广泛应用于多种场景的预训练模型。

摘要 (Abstract)

Deep learning models are being used for the analysis of parametric statistical models based on simulation-only frameworks. Bayesian models using normalizing flows simulate data from a prior distribution and are composed of two deep neural networks: a summary network that learns a sufficient statistic for the parameter and a normalizing flow that, conditional on the summary network, can approximate the posterior distribution. Here, we explore frequentist models that are based on a single summary network. During training, the input of the network is a simulated data set based on a parameter and the loss function minimizes the mean-square error between learned summary and parameter. The network thereby solves the inverse problem of parameter estimation. We propose a branched network structure that contains collapsing layers that reduce a data set to summary statistics that are further mapped through fully connected layers to approximate the parameter estimate. We motivate our choice of network structure by theoretical considerations. In simulations we demonstrate three desirable properties of parameter estimates: finite sample exactness, robustness to data contamination, and algorithm approximation. These properties are achieved by offering the network varying sample size, contaminated data, and data needing algorithmic reconstruction during the training phase. In our simulations an EM-algorithm for genetic data is automatically approximated by the network. Simulation-only approaches seem to offer practical advantages in complex modeling tasks where the simpler data simulation part is left to the researcher and the more complex problem of solving the inverse problem is left to the neural network. Challenging future work includes offering pre-trained models that can be used in a wide variety of applications.
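
The training recipe in the abstract (simulate a data set from a parameter, collapse it to a summary, minimize the squared error between the output and the parameter) can be sketched without a neural network. In the example below a single mean-pooling step stands in for the collapsing layers and a closed-form linear least-squares fit stands in for the fully connected layers. The Gaussian simulator, the parameter range, and all function names are assumptions made for illustration; this is not the ForwardFlow architecture.

```python
import random

def simulate_dataset(theta, n, rng):
    """Forward simulator (assumed): n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def collapse(dataset):
    """Permutation-invariant collapsing step: average over observations."""
    return sum(dataset) / len(dataset)

def fit_estimator(num_sims, rng):
    """Fit theta_hat = a * summary + b by minimizing MSE on simulated pairs."""
    thetas, summaries = [], []
    for _ in range(num_sims):
        theta = rng.uniform(-2.0, 2.0)  # parameter used to simulate this data set
        n = rng.randint(5, 50)          # varying sample size during training
        thetas.append(theta)
        summaries.append(collapse(simulate_dataset(theta, n, rng)))
    mean_s = sum(summaries) / num_sims
    mean_t = sum(thetas) / num_sims
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(summaries, thetas))
    var = sum((s - mean_s) ** 2 for s in summaries)
    a = cov / var
    b = mean_t - a * mean_s
    return a, b

rng = random.Random(0)
a, b = fit_estimator(5000, rng)

# Apply the fitted estimator to a fresh data set with a known parameter.
new_data = simulate_dataset(1.0, 200, rng)
estimate = a * collapse(new_data) + b
print(a, b, estimate)
```

Only the simulator has to be written by hand; the mapping from data back to the parameter is learned from simulated pairs, which is the division of labour the abstract describes.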

Keywords: deep learning, statistical inference, simulation-only, parameter estimation, normalizing flows, summary network, inverse problem, EM-algorithm

49. ❌ MCMC Informed Neural Emulators for Uncertainty Quantification in Dynamical Systems

Authors: Heikki Haario, Zhi-Song Liu, Martin Simon, Hendrik Weichel Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10987v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

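Scoring rationale: The paper studies neural networks as surrogates for physical models and uses MCMC to feed the model-parameter distribution into training, enabling efficient uncertainty quantification. Its core is the application of neural networks to scientific computing (specifically uncertainty quantification for dynamical systems), which falls under AI for Science, so it has some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). However, it does not involve large language models (LLMs), innovations in the underlying principles of deep learning (MoE, scaling laws, training methods, inference optimization, agents, and so on), or any of the specific techniques described by the other keywords, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

This paper proposes training neural network surrogates with the model-parameter distribution supplied as an input via MCMC. The surrogate achieves the same uncertainty quantification as the underlying physical model for dynamical systems, with substantially reduced computation time.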

摘要翻译

神经网络是一种常用方法，能以计算成本低廉的代理模型替代物理模型。若假设模型参数的先验分布准确，参数不确定性量化可被纳入训练过程。本文研究的是常见的相反情况：直接筛选或随机采样模型参数会导致训练时间过长，并在非物理参数值上进行评估。我们的解决方案是将不确定性量化与网络架构解耦。我们不再对网络权重进行采样，而是通过马尔可夫链蒙特卡洛（Markov chain Monte Carlo，MCMC）方法将模型参数分布作为输入引入网络训练。通过这种方式，代理模型能够实现与底层物理模型相同的不确定性量化，同时显著减少计算时间。该方法对神经网络的选择完全无关。在示例中，我们展示了一种用于预测的分位数仿真器，以及一种基于自编码器的新型常微分方程（ODE）网络仿真器，该仿真器能够灵活估计对应于不同ODE模型参数的不同轨迹路径。此外，我们提出了一种数学分析，以透明的方式将潜在性能损失与可测量的分布失配关联起来。

摘要 (Abstract)

Neural networks are a commonly used approach to replace physical models with computationally cheap surrogates. Parametric uncertainty quantification can be included in training, assuming that an accurate prior distribution of the model parameters is available. Here we study the common opposite situation, where direct screening or random sampling of model parameters leads to exhaustive training times and evaluations at unphysical parameter values. Our solution is to decouple uncertainty quantification from network architecture. Instead of sampling network weights, we introduce the model-parameter distribution as an input to network training via Markov chain Monte Carlo (MCMC). In this way, the surrogate achieves the same uncertainty quantification as the underlying physical model, but with substantially reduced computation time. The approach is fully agnostic with respect to the neural network choice. In our examples, we present a quantile emulator for prediction and a novel autoencoder-based ODE network emulator that can flexibly estimate different trajectory paths corresponding to different ODE model parameters. Moreover, we present a mathematical analysis that provides a transparent way to relate potential performance loss to measurable distribution mismatch.
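
The step of obtaining physically plausible parameter values by MCMC, then using them as inputs for surrogate training, can be sketched with a random-walk Metropolis sampler. The example is a minimal illustration under stated assumptions: the one-parameter decay model, the flat prior on positive `k`, the noise level, and all names are hypothetical, and no surrogate network is actually trained.

```python
import math
import random

def model(k, t):
    """Closed-form solution of dx/dt = -k x with x(0) = 1 (assumed physical model)."""
    return math.exp(-k * t)

def log_posterior(k, times, observations, noise_std):
    """Gaussian log-likelihood with a flat prior on k > 0 (up to a constant)."""
    if k <= 0.0:
        return -math.inf
    sq = sum((model(k, t) - y) ** 2 for t, y in zip(times, observations))
    return -0.5 * sq / noise_std ** 2

def metropolis(log_prob, start, step, num_samples, rng):
    """Random-walk Metropolis sampler for a scalar parameter."""
    samples, current, current_lp = [], start, log_prob(start)
    for _ in range(num_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_prob(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

rng = random.Random(1)
times = [0.5 * i for i in range(1, 11)]
noise_std = 0.05
# Synthetic observations generated from a hypothetical true value k = 0.7.
observations = [model(0.7, t) + rng.gauss(0.0, noise_std) for t in times]

samples = metropolis(lambda k: log_posterior(k, times, observations, noise_std),
                     start=1.0, step=0.05, num_samples=5000, rng=rng)
posterior = samples[1000:]  # drop burn-in

# Thinned (parameter, model output) pairs that a surrogate could be trained on.
training_inputs = [(k, model(k, 1.0)) for k in posterior[::100]]
print(sum(posterior) / len(posterior), len(training_inputs))
```

Because the training pairs come from the posterior rather than from a grid or a broad prior, the surrogate is only asked to be accurate where the parameters are plausible, which avoids evaluations at unphysical values.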

Keywords: neural networks, uncertainty quantification, dynamical systems, MCMC, surrogate models, parameter distribution, autoencoder, ODE network emulator

50. ❌ The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers

Authors: Peter Balogh Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10985v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

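Scoring rationale: The paper studies the internal mechanics of MLP layers in Transformer language models, specifically a binary routing mechanism in GPT-2 Small. It is relevant to 'Large Language Models OR LLMs OR Foundation Models' (8/10) because the study is based on GPT-2, and highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10/10) because its core contribution is explaining the internal workings and routing decisions of MLP layers. The other keywords (MoE, SLMs, training methods, inference optimization, agent systems, scientific applications, and so on) are not addressed in the paper and therefore score 0.

!!! tip deepseek-chat TL;DR

The paper shows that MLP layers in Transformer language models route continuous signals via binary neuron activations, identifies a consensus/exception-handling architecture in GPT-2 Small, and validates that this routing is functionally important.

Abstract (Chinese translation)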

我们发现，Transformer语言模型中的多层感知机（MLP）层执行连续信号的二元路由：尽管被路由的信号是连续的，但关于某个词元是否需要非线性处理的决策，可以通过神经元的二元激活状态来准确捕捉。在GPT-2 Small（1.24亿参数）模型中，我们发现特定神经元实现了一种共识架构——七个“默认开启”神经元和一个异常处理器（位于第11层的N2123神经元）以93-98%的互斥性协同工作，形成了一个二元路由开关。跨层分析揭示了一个发展弧线：早期层（L1-3）使用单个网关神经元来路由异常，不涉及共识机制；中间层（L4-6）呈现弥散式处理，既无网关也无共识；而后期层（L7-11）则结晶出完整的共识/异常架构，共识神经元的数量逐步增加（从1个到3个再到7个）。因果验证证实了该路由的功能性：在共识失效时移除MLP会导致困惑度上升43.3%，而在完全共识状态下移除MLP仅导致10.1%的上升——差异超过4倍。比较路由决策中二元特征与连续特征的效果，证实二值化几乎不损失信息（准确率79.2% vs. 78.8%），而连续激活则携带额外的幅度信息（R^2 = 0.36 vs. 0.22）。这种二元路由结构解释了为何平滑多项式逼近会失效：对于高度非线性层，交叉验证的多项式拟合（2-7次）的R^2从未超过0.06。我们提出，深度网络中已确立的分段仿射特性描述，可以用路由特性来补充：沿着自然数据流形，分段边界实现了关于哪些词元需要非线性处理的二元决策，从而将连续信号路由到性质不同的计算路径中。

Abstract

We show that MLP layers in transformer language models perform binary routing of continuous signals: the decision of whether a token needs nonlinear processing is well-captured by binary neuron activations, even though the signals being routed are continuous. In GPT-2 Small (124M parameters), we find that specific neurons implement a consensus architecture – seven “default-ON” neurons and one exception handler (N2123 in Layer 11) that are 93-98% mutually exclusive – creating a binary routing switch. A cross-layer analysis reveals a developmental arc: early layers (L1-3) use single gateway neurons to route exceptions without consensus quorums; middle layers (L4-6) show diffuse processing with neither gateway nor consensus; and late layers (L7-11) crystallize full consensus/exception architectures with increasing quorum size (1 to 3 to 7 consensus neurons). Causal validation confirms the routing is functional: removing the MLP at consensus breakdown costs 43.3% perplexity, while at full consensus removing it costs only 10.1% – exceeding a 4x difference. Comparing binary vs. continuous features for the routing decision confirms that binarization loses essentially no information (79.2% vs. 78.8% accuracy), while continuous activations carry additional magnitude information (R^2 = 0.36 vs. 0.22). This binary routing structure explains why smooth polynomial approximation fails: cross-validated polynomial fits (degrees 2-7) never exceed R^2 = 0.06 for highly nonlinear layers. We propose that the well-established piecewise-affine characterization of deep networks can be complemented by a routing characterization: along the natural data manifold, the piecewise boundaries implement binary decisions about which tokens need nonlinear processing, routing continuous signals through qualitatively different computational paths.

Keywords: Transformer, MLP layers, binary routing, GPT-2, neuron activations, consensus architecture, mechanistic interpretability, perplexity analysis
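
The abstract reports that the consensus neurons and the exception handler are "93-98% mutually exclusive" and that the perplexity cost differs by more than 4x. A minimal sketch of how such numbers could be computed, assuming exclusivity means the fraction of tokens on which exactly one of two binarized neurons is on. That definition, the zero threshold, and the activation values are all hypothetical; only the 43.3% and 10.1% figures come from the abstract.

```python
def binarize(activations, threshold=0.0):
    """Reduce continuous activations to an on/off state (threshold is an assumption)."""
    return [a > threshold for a in activations]


def mutual_exclusivity(a_on, b_on):
    """Fraction of tokens on which exactly one of the two neurons is on."""
    if len(a_on) != len(b_on) or not a_on:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a_on, b_on)) / len(a_on)


# Made-up per-token activations for one consensus neuron and one exception neuron.
consensus_acts = [1.2, 0.8, 0.5, -0.1, 0.9, 1.1, -0.3, 0.7]
exception_acts = [-0.4, -0.2, -0.6, 2.0, -0.1, 0.3, 1.5, -0.5]

exclusivity = mutual_exclusivity(binarize(consensus_acts), binarize(exception_acts))

# Ratio of the two perplexity increases quoted in the abstract.
perplexity_ratio = 43.3 / 10.1
```

With the toy activations above, the two neurons disagree on 7 of 8 tokens, and the quoted perplexity increases give a ratio of roughly 4.3, consistent with the "exceeding a 4x difference" claim.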

51. ❌ Federated Learning-driven Beam Management in LEO 6G Non-Terrestrial Networks

Authors: Maria Lamprini Bartsioka, Ioannis A. Bartsiokas, Athanasios D. Panagopoulos, Dimitra I. Kaklamani, Iakovos S. Venieris Journal/Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10983v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

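Scoring rationale: The paper studies federated-learning-based beam management for low Earth orbit satellite networks, using a multi-layer perceptron and a graph neural network for beam prediction. The scoring keywords all concern large language models, the technical principles of deep learning, or AI applied to science, whereas this paper focuses on federated learning and conventional neural networks in communication networks. It is unrelated to every scoring keyword, so all keywords score 0.

!!! tip deepseek-chat TL;DR

The paper studies beam management in low Earth orbit non-terrestrial networks using federated-learning-driven multi-layer perceptron and graph neural network models, and finds that the GNN outperforms the MLP in beam prediction accuracy and stability.

Abstract (Chinese translation)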

低地球轨道（Low Earth Orbit, LEO）非地面网络（Non-Terrestrial Networks, NTNs）需要在动态传播条件下实现高效的波束管理。本研究探讨了基于联邦学习（Federated Learning, FL）的LEO卫星星座波束选择方案，其中各轨道平面通过利用高空平台站（High-Altitude Platform Stations, HAPS）作为分布式学习节点进行协作。研究采用真实的信道与波束赋形数据，对多层感知机（Multi-Layer Perceptron, MLP）和图神经网络（Graph Neural Network, GNN）两种模型进行了评估。结果表明，GNN在波束预测准确性和稳定性方面均优于MLP，尤其在低仰角场景下表现突出，从而为未来NTN部署提供了轻量且智能的波束管理方案。

Abstract

Low Earth Orbit (LEO) Non-Terrestrial Networks (NTNs) require efficient beam management under dynamic propagation conditions. This work investigates Federated Learning (FL)-based beam selection in LEO satellite constellations, where orbital planes operate as distributed learners through the utilization of High-Altitude Platform Stations (HAPS). Two models, a Multi-Layer Perceptron (MLP) and a Graph Neural Network (GNN), are evaluated using realistic channel and beamforming data. Results demonstrate that GNN surpasses MLP in beam prediction accuracy and stability, particularly at low elevation angles, enabling lightweight and intelligent beam management for future NTN deployments.

Keywords: Federated Learning, Beam Management, LEO Satellite Networks, Graph Neural Network, Multi-Layer Perceptron, Non-Terrestrial Networks, Beam Selection, High-Altitude Platform Stations

52. ❌ FRIEND: Federated Learning for Joint Optimization of multi-RIS Configuration and Eavesdropper Intelligent Detection in B5G Networks

Authors: Maria Lamprini A. Bartsioka, Ioannis A. Bartsiokas, Anastasios K. Papazafeiropoulos, Maria A. Seimeni, Dimitra I. Kaklamani, Iakovos S. Venieris Journal/Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10977v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

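Scoring rationale: The paper studies federated-learning-based eavesdropper detection and RIS configuration optimization in B5G networks, using a deep convolutional neural network (DCNN) to process channel state information (CSI). The keywords all relate directly to large language models (LLMs) or the technical principles of deep learning, whereas this paper only applies a conventional DCNN and federated learning to wireless communication security; it does not touch on core large-model topics such as LLMs, MoE, scaling laws, training methods, inference optimization, or agents. The only related keyword is 'AI for Science', because the paper applies AI to communications engineering (which can be viewed as a scientific application), but it is not core bioinformatics or cheminformatics, hence 5/10 (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes a federated-learning-based framework for jointly optimizing multi-RIS configuration and eavesdropper detection in RIS-enhanced cell-free mmWave networks; experiments show about a 30% improvement in secrecy rate over baseline methods while maintaining near-optimal detection accuracy.

Abstract (Chinese translation)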

随着无线系统向超5G（B5G）演进，无蜂窝（CF）毫米波（mmWave）架构与可重构智能表面（RIS）的结合，正成为实现超可靠、大容量、可扩展且安全的工业物联网（IIoT）通信的关键赋能技术。然而，保护这些复杂分布式环境免受窃听仍是一个严峻挑战，尤其是在传统安全机制难以克服可扩展性和时延限制的情况下。本文提出了一种新颖的框架，利用联邦学习（FL）在RIS增强的无蜂窝毫米波网络中检测恶意用户。该设想方案采用多个在无传统蜂窝边界下运行的接入点（APs），并借助RIS节点动态塑造无线传播环境。边缘设备基于本地观测的信道状态信息（CSI）协作训练一个深度卷积神经网络（DCNN），从而无需交换原始数据。此外，该模型引入了提前退出机制，以共同满足计算复杂度的要求。性能评估表明，与未使用RIS辅助的基线方法相比，FL与多RIS协同的集成将可达保密速率（SR）提升了约30%，同时保持了接近最优的检测准确率水平。这项工作为下一代IIoT部署建立了一种分布式、保护隐私的物理层窃听检测方法。

Abstract

As wireless systems evolve toward Beyond 5G (B5G), the adoption of cell-free (CF) millimeter-wave (mmWave) architectures combined with Reconfigurable Intelligent Surfaces (RIS) is emerging as a key enabler for ultra-reliable, high-capacity, scalable, and secure Industrial Internet of Things (IIoT) communications. However, safeguarding these complex and distributed environments against eavesdropping remains a critical challenge, particularly when conventional security mechanisms struggle to overcome scalability, and latency constraints. In this paper, a novel framework for detecting malicious users in RIS-enhanced cell-free mmWave networks using Federated Learning (FL) is presented. The envisioned setup features multiple access points (APs) operating without traditional cell boundaries, assisted by RIS nodes to dynamically shape the wireless propagation environment. Edge devices collaboratively train a Deep Convolutional Neural Network (DCNN) on locally observed Channel State Information (CSI), eliminating the need for raw data exchange. Moreover, an early-exit mechanism is incorporated in that model to jointly satisfy computational complexity requirements. Performance evaluation indicates that the integration of FL and multi-RIS coordination improves approximately 30% the achieved secrecy rate (SR) compared to baseline non-RIS-assisted methods while maintaining near-optimal detection accuracy levels. This work establishes a distributed, privacy-preserving approach to physical layer eavesdropping detection tailored for next-generation IIoT deployments.

Keywords: Federated Learning, Reconfigurable Intelligent Surfaces, Cell-free mmWave Networks, Eavesdropping Detection, Deep Convolutional Neural Network, Channel State Information, Secrecy Rate, Industrial Internet of Things
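
The abstract says edge devices train a shared DCNN on local CSI without exchanging raw data, but it does not state the aggregation rule. A minimal sketch of one common choice, FedAvg-style averaging weighted by local dataset size, is shown below. Treat it as an assumption rather than the paper's confirmed method; the flat parameter lists and the function name are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighting each client by its local sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical edge devices with 1 and 3 local samples respectively.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only parameter vectors cross the network in this scheme; the locally observed CSI stays on each device, which is the privacy property the abstract highlights.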

53. ❌ Bio-Inspired Self-Supervised Learning for Wrist-worn IMU Signals

Authors: Prithviraj Tarale, Kiet Chu, Abhishek Varghese, Kai-Chun Liu, Maxwell A Xu, Mohit Iyyer, Sunghoon I. Lee Journal/Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10961v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

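Scoring rationale: The paper focuses on self-supervised learning for wearable sensor signals and is unrelated to most large-model keywords. It has some relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5/10) because it uses a pre-training method. It is strongly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8/10) because it involves bioinformatics and AI for health monitoring, which falls under AI for Science. All other keywords are unrelated (0).

!!! tip deepseek-chat TL;DR

The paper proposes a new tokenization strategy, grounded in the submovement theory of biological motion, for self-supervised learning on wrist accelerometer signals. A Transformer encoder pretrained on the NHANES dataset learns human-activity representations that outperform existing baselines on multiple HAR benchmarks and show stronger data efficiency.

Abstract (Chinese translation)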

可穿戴加速度计已实现大规模健康监测，但标记数据的稀缺性制约了稳健人体活动表征的学习。尽管自监督学习提供了潜在的解决方案，现有方法将传感器数据流视为非结构化时间序列，忽略了人体运动的内在生物结构——我们认为这一因素对有效的人体活动识别至关重要。基于运动控制中的子运动理论，我们提出一种新颖的标记化策略，该理论认为连续手腕运动由被称为子运动的基本基函数叠加构成。我们将标记定义为运动片段——一种由有限子运动序列构成的运动单元，可直接从手腕加速度计信号中提取。通过将这些片段作为标记，我们采用掩码运动片段重建任务预训练Transformer编码器，以建模运动片段间的时间依赖关系，使学习重点超越局部波形形态。基于NHANES数据集（约2.8万小时；约1.1万名参与者；约1000万个数据窗口）预训练后，我们的表征在六个受试者独立的HAR基准测试中均优于现有可穿戴自监督学习方法，并在数据稀缺场景中展现出更强的数据效率。代码与预训练权重将公开提供。

Abstract

Wearable accelerometers have enabled large-scale health and wellness monitoring, yet learning robust human-activity representations has been constrained by the scarcity of labeled data. While self-supervised learning offers a potential remedy, existing approaches treat sensor streams as unstructured time series, overlooking the underlying biological structure of human movement, a factor we argue is critical for effective Human Activity Recognition (HAR). We introduce a novel tokenization strategy grounded in the submovement theory of motor control, which posits that continuous wrist motion is composed of superposed elementary basis functions called submovements. We define our token as the movement segment, a unit of motion composed of a finite sequence of submovements that is readily extractable from wrist accelerometer signals. By treating these segments as tokens, we pretrain a Transformer encoder via masked movement-segment reconstruction to model the temporal dependencies of movement segments, shifting the learning focus beyond local waveform morphology. Pretrained on the NHANES corpus (approximately 28k hours; approximately 11k participants; approximately 10M windows), our representations outperform strong wearable SSL baselines across six subject-disjoint HAR benchmarks. Furthermore, they demonstrate stronger data efficiency in data-scarce settings. Code and pretrained weights will be made publicly available.

Keywords: self-supervised learning, wearable accelerometers, human activity recognition, submovement theory, Transformer encoder, masked reconstruction, NHANES dataset, data efficiency

54. ❌ Ranking Reasoning LLMs under Test-Time Scaling

Authors: Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary Journal/Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10960v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

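Scoring rationale: The paper focuses on evaluating and ranking reasoning large language models (LLMs) under test-time scaling. It is highly relevant to 'Large Language Models' and to 'Chain of Thought' / 'System 2 Thinking', because it explicitly studies reasoning LLMs and involves the evaluation of multi-step and in-depth reasoning. Other keywords such as MoE, SLMs, training techniques, alignment, RAG, compression, and agents are not mentioned in the abstract and therefore score 0. The paper does not address applications in a specific scientific field, so 'AI for Science' and related keywords score 0.

!!! tip deepseek-chat TL;DR

The paper studies how to rank reasoning LLMs under test-time scaling (sampling multiple outputs per prompt), introduces the Scorio library, and validates a range of statistical ranking methods, identifying ones that rank models reliably in both high-budget and low-budget settings.

Abstract (Chinese translation)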

测试时扩展通过为每个提示采样多个输出来评估推理大语言模型，但在此机制下的模型排序研究仍不充分。我们形式化了测试时扩展下的密集基准排序问题，并推出Scorio库，该库实现了统计排序方法，包括配对比较模型、项目反应理论模型、投票规则，以及基于图论和谱分析的方法。在四个奥林匹克风格数学基准（AIME'24、AIME'25、HMMT'25和BrUMO'25；最多$N=80$次试验）上对$20$个推理模型进行评估，大多数全试验排序结果与贝叶斯黄金标准$\mathrm{Bayes}_{\mathcal{U}}@80$高度一致（平均肯德尔$\tau_b = 0.93$–$0.95$），且$19$至$34$种方法能完全复现相同排序。在单试验机制下，最佳方法可达$\tau_b \approx 0.86$。使用贪婪解码作为经验先验（$\mathrm{Bayes}_{\mathbf{R}_0}@N$）可在$N=1$时降低$16$–$52\%$的方差，但当贪婪采样与随机采样结果不一致时可能引入排序偏差。这些结果为高预算和低预算的测试时扩展场景确定了可靠的排序方法。我们在https://github.com/mohsenhariri/scorio开源发布Scorio库。

Abstract

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall’s $\tau_b = 0.93$–$0.95$), and $19$–$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $\tau_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$–$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.

Keywords: reasoning LLMs, test-time scaling, model ranking, statistical ranking methods, paired-comparison models, item response theory, Olympiad-style math benchmarks, Scorio library
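
The abstract measures agreement between rankings with Kendall's tau-b. A minimal pure-Python sketch of that statistic follows, using the standard tie-corrected definition. It is not Scorio's implementation, and the model scores are invented for illustration.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two score lists, with the usual correction for ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from both denominator terms
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_in_x = concordant + discordant + ties_y_only
    untied_in_y = concordant + discordant + ties_x_only
    if untied_in_x == 0 or untied_in_y == 0:
        raise ValueError("tau-b is undefined when one list is constant")
    return (concordant - discordant) / math.sqrt(untied_in_x * untied_in_y)


# Hypothetical accuracies of five models: full-budget reference vs. a single trial.
reference_scores = [0.90, 0.80, 0.70, 0.60, 0.50]
single_trial_scores = [0.85, 0.70, 0.75, 0.55, 0.40]

tau = kendall_tau_b(reference_scores, single_trial_scores)
```

In this toy case only one of the ten model pairs swaps order, so the two rankings agree closely but not perfectly.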

55. ❌ When should we trust the annotation? Selective prediction for molecular structure retrieval from mass spectra

Authors: Mira Jürgens, Gaetan De Waele, Morteza Rakhshaninejad, Willem Waegeman Journal/Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10950v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

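Scoring rationale: The paper focuses on selective prediction and uncertainty quantification for molecular structure retrieval, an application of AI to science (specifically bioinformatics/cheminformatics). It covers machine learning methods, uncertainty estimation, and risk control, but does not involve large language models (LLMs), innovations in deep learning principles, model training/alignment/inference optimization, agent systems, or the other keywords. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper processes mass-spectrometry data to identify molecular structures, which is a bio/cheminformatics application; since its core contribution is not in large-model technology, it receives 5/10 (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the high error rates in retrieving molecular structures from tandem mass spectra (MS/MS), the paper proposes a selective prediction framework in which uncertainty quantification lets the model abstain when confidence is low, and shows that first-order confidence measures achieve good risk-coverage tradeoffs and can satisfy a specified error-rate constraint.

Abstract (Chinese translation)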

基于串联质谱（MS/MS）识别分子结构的机器学习方法发展迅速，但现有方法仍存在较高的错误率。在临床代谢组学与环境筛查等高风险应用中，错误的注释可能引发严重后果，因此必须确定何时可以信任预测结果。本文提出了一种从MS/MS谱图中检索分子结构的选择性预测框架，使模型在不确定性过高时能够拒绝预测。我们将该问题置于风险-覆盖权衡框架内，并在两个粒度级别上全面评估不确定性量化策略：针对预测分子指纹位点的指纹级不确定性，以及针对候选分子排序的检索级不确定性。我们比较了多种评分函数，包括一阶置信度度量、来自二阶分布的数据不确定性与认知不确定性估计，以及潜在空间中的基于距离的度量。所有实验均在MassSpecGym基准测试平台上进行。分析表明，虽然指纹级不确定性评分难以有效反映检索成功率，但计算成本低廉的一阶置信度度量与检索级数据不确定性能够在不同评估设定下实现优越的风险-覆盖权衡。我们进一步证明，通过基于泛化界的无分布风险控制方法，实践者可以指定可容忍的错误率，并以高概率获得满足该约束的注释子集。

Abstract

Machine learning methods for identifying molecular structures from tandem mass spectra (MS/MS) have advanced rapidly, yet current approaches still exhibit significant error rates. In high-stakes applications such as clinical metabolomics and environmental screening, incorrect annotations can have serious consequences, making it essential to determine when a prediction can be trusted. We introduce a selective prediction framework for molecular structure retrieval from MS/MS spectra, enabling models to abstain from predictions when uncertainty is too high. We formulate the problem within the risk-coverage tradeoff framework and comprehensively evaluate uncertainty quantification strategies at two levels of granularity: fingerprint-level uncertainty over predicted molecular fingerprint bits, and retrieval-level uncertainty over candidate rankings. We compare scoring functions including first-order confidence measures, aleatoric and epistemic uncertainty estimates from second-order distributions, as well as distance-based measures in the latent space. All experiments are conducted on the MassSpecGym benchmark. Our analysis reveals that while fingerprint-level uncertainty scores are poor proxies for retrieval success, computationally inexpensive first-order confidence measures and retrieval-level aleatoric uncertainty achieve strong risk-coverage tradeoffs across evaluation settings. We demonstrate that by applying distribution-free risk control via generalization bounds, practitioners can specify a tolerable error rate and obtain a subset of annotations satisfying that constraint with high probability.

Keywords: selective prediction, molecular structure retrieval, mass spectra, uncertainty quantification, risk-coverage tradeoff, MS/MS, confidence measures, error control
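
The risk-coverage tradeoff named in the abstract can be made concrete with a small sketch: rank predictions by confidence, keep the top fraction, and measure the error rate on what is kept. The code below computes only the empirical curve and the largest coverage whose empirical risk stays under a tolerance. It does not implement the distribution-free risk control via generalization bounds that the paper applies, and the confidences and labels are made up.

```python
def risk_coverage_curve(confidences, correct):
    """Return (coverage, risk) points obtained by keeping the k most confident predictions."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    curve = []
    errors = 0
    for kept, index in enumerate(order, start=1):
        if not correct[index]:
            errors += 1
        curve.append((kept / len(order), errors / kept))
    return curve


def max_coverage_under_risk(curve, tolerable_risk):
    """Largest coverage whose empirical risk does not exceed the tolerance (0.0 if none)."""
    feasible = [coverage for coverage, risk in curve if risk <= tolerable_risk]
    return max(feasible, default=0.0)


# Hypothetical retrieval confidences and whether the top-ranked candidate was correct.
confidences = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30]
correct = [True, True, False, True, False, False]

curve = risk_coverage_curve(confidences, correct)
coverage_at_25_percent = max_coverage_under_risk(curve, 0.25)
```

Here a practitioner who tolerates a 25% empirical error rate could keep the four most confident annotations out of six and abstain on the rest.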

56. ❌ Quantifying Membership Disclosure Risk for Tabular Synthetic Data Using Kernel Density Estimators

Authors: Rajdeep Pathak, Sayantee Jana Journal/Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10937v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

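Scoring rationale: The paper focuses on quantifying membership disclosure risk for tabular synthetic data using kernel density estimators, which belongs to privacy-preserving technology. Its content does not involve large models, deep learning principles, AI-for-science applications, or any of the technical topics in the scoring keywords (such as LLMs, MoE, scaling laws, fine-tuning, alignment, reasoning, agents, or compression), so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a practical method for quantifying membership disclosure risk in tabular synthetic data using kernel density estimators, and shows with two attack models on multiple datasets that it achieves higher F1 scores and sharper risk characterization than a baseline method.

Abstract (Chinese translation)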

合成数据的使用作为一种保护隐私的真实数据共享替代方案日益普及，尤其在医疗、金融和人口统计等敏感领域。然而，合成数据的隐私保障并非绝对，仍易受到成员推理攻击的影响——攻击者旨在判断特定个体是否存在于用于训练生成器的数据集中。本研究提出一种实用且有效的方法，利用核密度估计器量化表格型合成数据中的成员披露风险。我们基于KDE的方法建模了合成数据与训练记录之间最近邻距离的分布，从而支持成员关系的概率推断，并通过ROC曲线实现稳健评估。我们提出了两种攻击模型：一是假设拥有训练数据特权的“真实分布攻击”，另一种是更现实且可实施的“现实攻击”，该模型使用无真实成员标签的辅助数据。在四个真实世界数据集和六种合成数据生成器上的实证评估表明，相较于先前的基线方法，我们的方法始终能获得更高的F1分数和更精确的风险特征刻画，且无需计算成本高昂的影子模型。所提出的方法为量化合成数据中的成员披露风险提供了一个实用框架和度量标准，使数据管理者能够在发布合成数据供下游使用前进行生成后风险评估。本研究的全部数据集和代码可在https://github.com/PyCoder913/MIA-KDE获取。

Abstract

The use of synthetic data has become increasingly popular as a privacy-preserving alternative to sharing real datasets, especially in sensitive domains such as healthcare, finance, and demography. However, the privacy assurances of synthetic data are not absolute, and remain susceptible to membership inference attacks (MIAs), where adversaries aim to determine whether a specific individual was present in the dataset used to train the generator. In this work, we propose a practical and effective method to quantify membership disclosure risk in tabular synthetic datasets using kernel density estimators (KDEs). Our KDE-based approach models the distribution of nearest-neighbour distances between synthetic data and the training records, allowing probabilistic inference of membership and enabling robust evaluation via ROC curves. We propose two attack models: a ‘True Distribution Attack’, which assumes privileged access to training data, and a more realistic, implementable ‘Realistic Attack’ that uses auxiliary data without true membership labels. Empirical evaluations across four real-world datasets and six synthetic data generators demonstrate that our method consistently achieves higher F1 scores and sharper risk characterization than a prior baseline approach, without requiring computationally expensive shadow models. The proposed method provides a practical framework and metric for quantifying membership disclosure risk in synthetic data, which enables data custodians to conduct a post-generation risk assessment prior to releasing their synthetic datasets for downstream use. The datasets and codes for this study are available at https://github.com/PyCoder913/MIA-KDE.

Keywords: synthetic data, membership inference attacks, kernel density estimators, privacy-preserving, tabular data, risk assessment, nearest-neighbour distances, ROC curves
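
The abstract describes modelling nearest-neighbour distances with KDEs to obtain a membership probability. The sketch below is one plausible reading of that idea, closest in spirit to the privileged "True Distribution Attack": fit one Gaussian KDE to the distances of known members and one to known non-members, then apply Bayes' rule. The fixed bandwidth, the equal prior, the Euclidean metric, and all records are assumptions for illustration, not details from the paper or its repository.

```python
import math


def nearest_distance(record, synthetic_records):
    """Euclidean distance from a record to its nearest synthetic record."""
    return min(math.dist(record, s) for s in synthetic_records)


def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate as a callable."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def membership_probability(distance, member_density, nonmember_density, prior=0.5):
    """Posterior probability of membership given a nearest-neighbour distance."""
    numerator = prior * member_density(distance)
    denominator = numerator + (1.0 - prior) * nonmember_density(distance)
    return numerator / denominator if denominator > 0 else prior


# Hypothetical two-column records.
synthetic = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
members = [(0.1, 0.0), (1.0, 0.9), (2.1, 0.5)]
non_members = [(3.0, 3.0), (-1.0, 2.0), (0.5, -1.5)]

member_kde = gaussian_kde([nearest_distance(r, synthetic) for r in members], 0.3)
nonmember_kde = gaussian_kde([nearest_distance(r, synthetic) for r in non_members], 0.3)

p_close = membership_probability(0.05, member_kde, nonmember_kde)
p_far = membership_probability(2.5, member_kde, nonmember_kde)
```

Sweeping a decision threshold over these probabilities is what would produce the ROC curve mentioned in the abstract.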

57. ❌ ECoLAD: Deployment-Oriented Evaluation for Automotive Time-Series Anomaly Detection

Authors: Kadir-Kaan Özer, René Ebeling, Markus Enzweiler Journal/Source: arxiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10926v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

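Scoring rationale: The paper focuses on deployment-oriented evaluation of automotive time-series anomaly detection, assessing how detection methods, both classical and deep-learning-based, perform in compute-constrained environments. The scoring keywords all concern large language models, the technical principles of deep learning, or AI-for-science applications, whereas this paper addresses time-series anomaly detection in one specific domain (automotive) and involves no large models, no innovation in deep learning principles, and no scientific application of AI. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes ECoLAD, an evaluation protocol for assessing whether time-series anomaly detection methods are practically feasible in automotive deployment settings, and finds that lightweight classical methods hold a deployment advantage over deep methods under compute constraints.

Abstract translation (Chinese)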

时序异常检测器通常在无约束执行的工作站级硬件上进行性能比较。然而，车载监控需要在有限的CPU并行性下具备可预测的延迟和稳定的行为。因此，仅以准确性为标准的排行榜可能无法准确反映哪些方法在符合实际部署约束的条件下仍然可行。

我们提出了ECoLAD（异常检测效率计算阶梯），这是一种面向部署的评估协议，具体体现为对专有汽车遥测数据（异常率约0.022）及补充性公开基准的实证研究。ECoLAD通过机械确定的、仅使用整数的缩放规则和明确的CPU线程上限，在异构检测器家族上应用单调递减的计算资源阶梯，同时记录所有应用的配置变更。通过扫描目标评分速率并报告（i）覆盖率（达到目标的实体比例）以及（ii）在满足目标的已测阶梯配置中可达到的最佳AUC-PR（精确率-召回率曲线下面积），来表征吞吐量受限下的行为。在受限的汽车遥测数据上，轻量级经典检测器能在整个吞吐量扫描范围内，同时维持高于随机基线的覆盖率和检测提升度。而多种深度学习方法在尚未损失准确性之前，便已丧失了部署可行性。

Abstract

Time-series anomaly detectors are commonly compared on workstation-class hardware under unconstrained execution. In-vehicle monitoring, however, requires predictable latency and stable behavior under limited CPU parallelism. Accuracy-only leaderboards can therefore misrepresent which methods remain feasible under deployment-relevant constraints. We present ECoLAD (Efficiency Compute Ladder for Anomaly Detection), a deployment-oriented evaluation protocol instantiated as an empirical study on proprietary automotive telemetry (anomaly rate $\approx 0.022$) and complementary public benchmarks. ECoLAD applies a monotone compute-reduction ladder across heterogeneous detector families using mechanically determined, integer-only scaling rules and explicit CPU thread caps, while logging every applied configuration change. Throughput-constrained behavior is characterized by sweeping target scoring rates and reporting (i) coverage (the fraction of entities meeting the target) and (ii) the best AUC-PR achievable among measured ladder configurations satisfying the target. On constrained automotive telemetry, lightweight classical detectors sustain both coverage and detection lift above the random baseline across the full throughput sweep. Several deep methods lose feasibility before they lose accuracy.

Keywords: time-series anomaly detection, automotive telemetry, deployment-oriented evaluation, efficiency compute ladder, throughput-constrained behavior, lightweight classical detectors, deep methods, AUC-PR
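The two throughput-constrained metrics named in the abstract, coverage and best feasible AUC-PR, can be sketched from a table of measured ladder configurations. This is an illustration under stated assumptions rather than the ECoLAD implementation: the data layout, the entity names, the numbers, and the reading that an entity "meets the target" when at least one measured tier reaches the target scoring rate are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Tier = Tuple[float, float]            # (scoring rate, AUC-PR) for one ladder tier
Measurements = Dict[str, List[Tier]]  # entity name -> measured tiers

def coverage(measurements: Measurements, target_rate: float) -> float:
    """Fraction of entities with at least one tier reaching the target rate."""
    feasible = sum(
        any(rate >= target_rate for rate, _ in tiers)
        for tiers in measurements.values()
    )
    return feasible / len(measurements)

def best_feasible_auc_pr(tiers: List[Tier], target_rate: float) -> Optional[float]:
    """Best AUC-PR among tiers that satisfy the target, or None if none does."""
    scores = [auc for rate, auc in tiers if rate >= target_rate]
    return max(scores) if scores else None

# Hypothetical measurements: lower tiers run faster but score worse.
measurements = {
    "entity_1": [(50.0, 0.30), (200.0, 0.25), (800.0, 0.20)],
    "entity_2": [(40.0, 0.35), (150.0, 0.28)],
}

for target in (100.0, 500.0):
    best = {name: best_feasible_auc_pr(tiers, target) for name, tiers in measurements.items()}
    print(target, coverage(measurements, target), best)
```

At the higher target, `entity_2` has no feasible tier, so coverage drops even though its accuracy at slower tiers is the best in the table. That is the "feasibility is lost before accuracy" pattern the abstract reports for several deep methods.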

58. ❌ NCAA Bracket Prediction Using Machine Learning and Combinatorial Fusion Analysis

Authors: Yuanhong Wu, Isaiah Smith, Tushar Marwah, Michael Schroeter, Mohamed Rahouti, D. Frank Hsu Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10916v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

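Scoring rationale: The paper focuses on predicting sports outcomes with machine learning, specifically Combinatorial Fusion Analysis, which falls under traditional machine learning applications. It does not involve large models, the technical principles of deep learning, AI-for-science applications, or any technical concept among the scoring keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how Combinatorial Fusion Analysis (CFA) can improve NCAA basketball tournament predictions. By combining multiple ranking systems it reaches 74.60% accuracy, outperforming the best result of any single ranking system.

Abstract translation (Chinese)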

过去几年中，机器学习模型在体育预测领域取得了显著成功，通常将体育预测视为该领域内的分类任务。本文引入了分析体育数据以更准确预测结果的新视角。我们利用排名，通过组合融合分析（Combinatorial Fusion Analysis, CFA）这一新范式，结合秩-分特征（RSC）函数与认知多样性（CD），为2024年数据集生成球队排名。基于球队排名的秩组合方法，我们的预测准确率达到74.60%，高于十种常用公开排名系统中最佳结果的73.02%。这展现了CFA通过不同视角提升体育预测精度的有效性。

Abstract

Machine learning models have demonstrated remarkable success in sports prediction in the past years, often treating sports prediction as a classification task within the field. This paper introduces new perspectives for analyzing sports data to predict outcomes more accurately. We leverage rankings to generate team rankings for the 2024 dataset using Combinatorial Fusion Analysis (CFA), a new paradigm for combining multiple scoring systems through the rank-score characteristic (RSC) function and cognitive diversity (CD). Our result based on rank combination with respect to team ranking has an accuracy rate of $74.60\%$, which is higher than the best of the ten popular public ranking systems ($73.02\%$). This exhibits the efficacy of CFA in enhancing the precision of sports prediction through different lenses.

Keywords: NCAA bracket prediction, machine learning, Combinatorial Fusion Analysis, sports prediction, team ranking, rank-score characteristic, cognitive diversity, accuracy improvement
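To make "rank combination" concrete, the sketch below averages each team's rank across several ranking systems and predicts that the better-ranked team wins. It illustrates plain average rank combination only. The abstract does not specify how the paper selects or weights systems using the RSC function and cognitive diversity, so none of that is modelled here, and the teams, ranks, and game results are made up.

```python
def rank_combination(rankings):
    """Average each team's rank across systems, then re-rank (1 = best)."""
    teams = list(rankings[0])
    mean_rank = {t: sum(r[t] for r in rankings) / len(rankings) for t in teams}
    # Ties on mean rank are broken alphabetically to keep the result deterministic.
    ordered = sorted(teams, key=lambda t: (mean_rank[t], t))
    return {team: position for position, team in enumerate(ordered, start=1)}

def predict_winner(team_a, team_b, combined):
    """Pick the team with the better (smaller) combined rank."""
    return team_a if combined[team_a] < combined[team_b] else team_b

def accuracy(games, combined):
    """Share of games whose actual winner matches the prediction."""
    hits = sum(predict_winner(a, b, combined) == winner for a, b, winner in games)
    return hits / len(games)

# Hypothetical ranking systems over four teams.
system_1 = {"A": 1, "B": 2, "C": 3, "D": 4}
system_2 = {"A": 2, "B": 1, "C": 4, "D": 3}
system_3 = {"A": 1, "B": 3, "C": 2, "D": 4}
combined = rank_combination([system_1, system_2, system_3])

# Hypothetical games as (team, team, actual winner).
games = [("A", "D", "A"), ("B", "C", "C"), ("A", "C", "A")]
print(combined, accuracy(games, combined))
```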

59. ❌ Ergodicity in reinforcement learning

Authors: Dominik Baumann, Erfaun Noorani, Arsenii Mustafin, Xinyi Sheng, Bert Verbruggen, Arne Vanhoyweghen, Vincent Ginis, Thomas B. Schön Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10895v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

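Scoring rationale: The paper studies non-ergodic reward processes in reinforcement learning, discusses their effect on the optimization objective, and presents existing solutions. The scoring keywords all concern large models, the technical principles of deep learning, or scientific applications, whereas this paper is a theoretical analysis of reinforcement learning and involves no large models, deep learning techniques, or applications in a specific scientific field. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper examines how non-ergodic reward processes in reinforcement learning affect the long-term performance of an individual agent. Through a worked example, links to the notion of ergodic Markov chains, and a review of existing solutions, it discusses how to optimize the long-term performance of a single trajectory under non-ergodic reward dynamics.

Abstract translation (Chinese)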

在强化学习中，我们通常旨在优化智能体在一条轨迹上收集的奖励总和的期望值。然而，如果生成这些奖励的过程是非遍历的，那么期望值——即给定策略下无限多条轨迹的平均值——对于单条无限长轨迹的平均值而言并不具有参考意义。因此，如果我们关注个体智能体在部署过程中的表现，期望值就不是一个良好的优化目标。本文通过一个启发性示例，探讨了非遍历奖励过程对强化学习智能体的影响，将遍历奖励过程的概念与更广泛使用的遍历马尔可夫链概念联系起来，并介绍了在非遍历奖励动态下优化单条轨迹长期性能的现有解决方案。

Abstract

In reinforcement learning, we typically aim to optimize the expected value of the sum of rewards an agent collects over a trajectory. However, if the process generating these rewards is non-ergodic, the expected value, i.e., the average over infinitely many trajectories with a given policy, is uninformative for the average over a single, but infinitely long trajectory. Thus, if we care about how the individual agent performs during deployment, the expected value is not a good optimization objective. In this paper, we discuss the impact of non-ergodic reward processes on reinforcement learning agents through an instructive example, relate the notion of ergodic reward processes to more widely used notions of ergodic Markov chains, and present existing solutions that optimize long-term performance of individual trajectories under non-ergodic reward dynamics.

Keywords: reinforcement learning, ergodicity, non-ergodic reward processes, expected value optimization, individual agent performance, long-term trajectory performance, Markov chains
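The gap between the expectation over many trajectories and the average along one long trajectory can be shown with a standard multiplicative gamble. The abstract does not say which example the paper uses, so this one is an assumption, as are the multipliers 1.5 and 0.6. Each step multiplies the agent's wealth by one of the two factors with equal probability.

```python
import math
import random

UP, DOWN = 1.5, 0.6  # hypothetical per-step reward multipliers, each with probability 0.5

# Ensemble view: expected multiplier per step, averaged over infinitely many trajectories.
ensemble_growth = 0.5 * UP + 0.5 * DOWN

# Time view: per-step growth factor along one infinitely long trajectory
# (the geometric mean of the multipliers).
time_average_growth = math.exp(0.5 * math.log(UP) + 0.5 * math.log(DOWN))

def simulate(steps, rng):
    """Wealth of a single agent after a number of multiplicative steps."""
    wealth = 1.0
    for _ in range(steps):
        wealth *= UP if rng.random() < 0.5 else DOWN
    return wealth

rng = random.Random(0)
final_wealth = [simulate(1000, rng) for _ in range(200)]
share_below_start = sum(w < 1.0 for w in final_wealth) / len(final_wealth)

print(f"ensemble growth per step:     {ensemble_growth:.4f}")
print(f"time-average growth per step: {time_average_growth:.4f}")
print(f"share of agents below start:  {share_below_start:.2f}")
```

The expected value grows by 5% per step, yet the typical single trajectory shrinks by about 5% per step, so a policy chosen to maximize the expectation can be a poor choice for an individual deployed agent.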

60. ❌ Kernel Tests of Equivalence

Authors: Xing Liu, Axel Gandy Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10886v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

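Scoring rationale: The paper "Kernel Tests of Equivalence" focuses on kernel methods in statistics (kernel Stein discrepancy, Maximum Mean Discrepancy) and on equivalence testing, and belongs to statistical inference and hypothesis testing. The scoring keywords all revolve around large models, the technical principles of deep learning, and their applications (such as AI for Science), whereas this paper involves no large models, deep learning, AI techniques, or related applications. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes kernel-based statistical tests, built on the kernel Stein discrepancy and the Maximum Mean Discrepancy, for assessing whether two distributions are equivalent, addressing the inability of traditional goodness-of-fit tests to establish the absence of a distributional difference.

Abstract translation (Chinese)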

我们提出了一种基于核函数的新型检验方法，用于评估分布间的等价性。传统的拟合优度检验不适用于判定分布差异不存在的情况，因为未能拒绝原假设可能仅仅是检验功效不足（即第二类错误）的结果。这促使了等价性检验的发展，其目标是在控制错误率的前提下评估统计意义上显著效应的不存在性。然而，现有的等价性检验要么局限于参数化分布，要么仅关注特定矩而非完整分布。我们通过两种基于核函数的统计差异度量——核斯坦因差异与最大均值差异——来应对这些局限。我们所提出检验的原假设假定候选分布与目标分布之间的差异至少超过一个预先设定的边界值，该边界值通过上述差异度量进行量化。我们提出了两种计算检验临界值的方法：一种采用渐近正态性近似，另一种基于自助法。我们通过数值实验评估了这些检验方法的性能。

Abstract

We propose novel kernel-based tests for assessing the equivalence between distributions. Traditional goodness-of-fit testing is inappropriate for concluding the absence of distributional differences, because failure to reject the null hypothesis may simply be a result of lack of test power, also known as the Type-II error. This motivates equivalence testing, which aims to assess the absence of a statistically meaningful effect under controlled error rates. However, existing equivalence tests are either limited to parametric distributions or focus only on specific moments rather than the full distribution. We address these limitations using two kernel-based statistical discrepancies: the kernel Stein discrepancy and the Maximum Mean Discrepancy. The null hypothesis of our proposed tests assumes the candidate distribution differs from the nominal distribution by at least a pre-defined margin, which is measured by these discrepancies. We propose two approaches for computing the critical values of the tests, one using an asymptotic normality approximation, and another based on bootstrapping. Numerical experiments are conducted to assess the performance of these tests.

Keywords: equivalence testing, kernel Stein discrepancy, Maximum Mean Discrepancy, statistical hypothesis testing, distribution comparison, bootstrapping, asymptotic normality
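One of the two discrepancies named in the abstract, the Maximum Mean Discrepancy, has a standard unbiased sample estimator, sketched here for one-dimensional samples with a Gaussian kernel. This shows only the statistic. The paper's critical values (asymptotic normal or bootstrap) are not reproduced, and the bandwidth and sample values are assumptions of this example.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Unbiased estimate of the squared MMD between two 1-D samples."""
    m, n = len(xs), len(ys)
    # Within-sample terms exclude the diagonal (i == j) to remove bias.
    k_xx = sum(
        gaussian_kernel(a, b, bandwidth)
        for i, a in enumerate(xs) for j, b in enumerate(xs) if i != j
    ) / (m * (m - 1))
    k_yy = sum(
        gaussian_kernel(a, b, bandwidth)
        for i, a in enumerate(ys) for j, b in enumerate(ys) if i != j
    ) / (n * (n - 1))
    k_xy = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy

# Hypothetical samples: one nearly identical to xs, one shifted far away.
xs = [i / 10 for i in range(20)]
ys_near = [x + 0.05 for x in xs]
ys_far = [x + 3.0 for x in xs]

mmd_near = mmd_squared(xs, ys_near)
mmd_far = mmd_squared(xs, ys_far)
print(f"near: {mmd_near:.4f}, far: {mmd_far:.4f}")
```

In an equivalence test the roles of the hypotheses are reversed relative to a two-sample test. The null states that the discrepancy is at least the margin, so equivalence is concluded only when the estimate falls sufficiently far below that margin.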

61. ❌ LAtte: Hyperbolic Lorentz Attention for Cross-Subject EEG Classification

Authors: Johannes Burchert, Ahmad Bdeir, Tom Hanika, Lars Schmidt-Thieme, Niels Landwehr Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10881v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

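Scoring rationale: The LAtte paper focuses on EEG signal classification, which falls under AI for Science (bioinformatics and medical AI), so it is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points). It uses pretraining and finetuning, which gives it some relevance to "Pre-training OR Continual Pre-training OR Domain Adaptation" and "Post-training OR Supervised Fine-tuning OR SFT" (5 points each). Its proposed Lorentz low-rank adapters are a parameter-efficient fine-tuning technique, relevant to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (5 points). The remaining keywords mainly concern large language model (LLM) techniques, reasoning, alignment, and agents, which have no direct bearing on this paper's EEG deep learning model, and all receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes LAtte, a new framework that combines a Lorentz Attention Module with an InceptionTime encoder and uses pretraining plus Lorentz low-rank adapters to address the high signal noise and large inter-subject variability in cross-subject EEG classification, substantially outperforming existing methods on three datasets.

Abstract translation (Chinese)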

脑电图（EEG）分类对于从医学诊断到脑机接口的应用至关重要，但由于其固有的低信噪比（SNR）和高被试间变异性，该任务仍具挑战性。为解决这些问题，我们提出了LAtte，一种新颖的框架，它将洛伦兹注意力模块与基于InceptionTime的编码器相结合，以实现稳健且可泛化的EEG分类。与先前主要评估单被试性能的研究不同，LAtte专注于跨被试训练。首先，我们通过预训练任务学习所有被试共享的基线信号，以捕捉共同的潜在模式。随后，我们利用新颖的洛伦兹低秩适配器来学习建模个体差异的被试特定嵌入。这使得我们能够学习一个在被试间表现稳健的共享模型，该模型随后可针对个体被试进行微调，或用于泛化至未见过的被试。我们在三个成熟的EEG数据集上评估LAtte，其性能相比当前最先进方法取得了显著提升。

Abstract

Electroencephalogram (EEG) classification is critical for applications ranging from medical diagnostics to brain-computer interfaces, yet it remains challenging due to the inherently low signal-to-noise ratio (SNR) and high inter-subject variability. To address these issues, we propose LAtte, a novel framework that integrates a Lorentz Attention Module with an InceptionTime-based encoder to enable robust and generalizable EEG classification. Unlike prior work, which evaluates primarily on single-subject performance, LAtte focuses on cross-subject training. First, we learn a shared baseline signal across all subjects using pretraining tasks to capture common underlying patterns. Then, we utilize novel Lorentz low-rank adapters to learn subject-specific embeddings that model individual differences. This allows us to learn a shared model that performs robustly across subjects, and can be subsequently finetuned for individual subjects or used to generalize to unseen subjects. We evaluate LAtte on three well-established EEG datasets, achieving a substantial improvement in performance over current state-of-the-art methods.

Keywords: EEG classification, cross-subject training, Lorentz Attention, low-rank adapters, pretraining, finetuning, InceptionTime, brain-computer interfaces
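The abstract does not define the Lorentz Attention Module, so the sketch below shows only the generic building blocks of attention in the Lorentz (hyperboloid) model of hyperbolic space: lifting a Euclidean vector onto the unit hyperboloid, the Lorentzian inner product, the geodesic distance, and attention weights taken as a softmax over negative squared distances. Unit curvature, the distance-based scoring rule, and every function name are assumptions of this illustration, not the paper's design.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the unit hyperboloid via x0 = sqrt(1 + |v|^2)."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0, *v]

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance on the unit hyperboloid (clamped for float safety)."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

def attention_weights(query, keys, temperature=1.0):
    """Softmax over negative squared Lorentz distances (an assumed scoring rule)."""
    scores = [-lorentz_distance(query, k) ** 2 / temperature for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

query = lift_to_hyperboloid([0.1, 0.2])
keys = [lift_to_hyperboloid([0.1, 0.2]), lift_to_hyperboloid([2.0, -1.0])]
weights = attention_weights(query, keys)
print(weights)
```

Keys closer to the query in hyperbolic distance receive larger weights, which is the property a distance-based attention score relies on.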

62. ❌ SNPgen: Phenotype-Supervised Genotype Representation and Synthetic Data Generation via Latent Diffusion

Authors: Andrea Lampis, Michela Carlotta Massi, Nicola Pirastu, Francesca Ieva, Matteo Matteucci, Emanuele Di Angelantonio Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10873v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

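Scoring rationale: The SNPgen paper focuses on generating synthetic genotype data with a variational autoencoder and a latent diffusion model. It belongs to bioinformatics and is highly relevant to AI for Science/Bioinformatics (10 points). However, it does not involve large language models, innovation in deep learning principles, or any other scoring keyword (such as MoE, Scaling Laws, or RLHF), so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes SNPgen, a two-stage conditional latent diffusion framework for generating phenotype-supervised synthetic genotype data to work around privacy restrictions on sharing genomic data, and shows on UK Biobank data that models trained on its synthetic data reach disease-prediction performance close to that obtained with real data.

Abstract translation (Chinese)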

多基因风险评分及其他基因组分析需要大规模个体层面的基因型数据集，但严格的数据访问限制阻碍了数据共享。合成基因型生成提供了一种隐私保护的替代方案，但现有方法大多为无条件生成，产生的样本缺乏表型对齐，或依赖于无监督压缩，导致统计保真度与下游任务效用之间存在差距。本文提出SNPgen——一种用于生成表型监督合成基因型的双阶段条件潜在扩散框架。SNPgen结合了全基因组关联研究（GWAS）指导的变异位点筛选（选取1,024-2,048个性状相关单核苷酸多态性）、用于基因型压缩的变分自编码器，以及通过无分类器引导实现二元疾病标签条件化的潜在扩散模型。在包含458,724名英国生物银行（UK Biobank）参与者、涵盖四种复杂疾病（冠状动脉疾病、乳腺癌、1型糖尿病和2型糖尿病）的数据集上评估显示，采用“合成数据训练、真实数据测试”协议时，基于合成数据训练的模型达到了与真实数据相当的预测性能，其效果接近使用多2-6倍变异位点的全基因组PRS方法。隐私分析证实合成数据与原始数据零完全匹配，成员推断攻击接近随机水平（AUC≈0.50），同时保持了连锁不平衡结构，并与源数据保持高度等位基因频率相关性（r≥0.95）。通过已知因果效应的受控模拟实验，验证了该方法能准确还原预设的遗传关联结构。

Abstract

Polygenic risk scores and other genomic analyses require large individual-level genotype datasets, yet strict data access restrictions impede sharing. Synthetic genotype generation offers a privacy-preserving alternative, but most existing methods operate unconditionally, producing samples without phenotype alignment, or rely on unsupervised compression, creating a gap between statistical fidelity and downstream task utility. We present SNPgen, a two-stage conditional latent diffusion framework for generating phenotype-supervised synthetic genotypes. SNPgen combines GWAS-guided variant selection (1,024-2,048 trait-associated SNPs) with a variational autoencoder for genotype compression and a latent diffusion model conditioned on binary disease labels via classifier-free guidance. Evaluated on 458,724 UK Biobank individuals across four complex diseases (coronary artery disease, breast cancer, type 1 and type 2 diabetes), models trained on synthetic data matched real-data predictive performance in a train-on-synthetic, test-on-real protocol, approaching genome-wide PRS methods that use $2$-$6\times$ more variants. Privacy analysis confirmed zero identical matches, near-random membership inference (AUC $\approx 0.50$), preserved linkage disequilibrium structure, and high allele frequency correlation ($r \geq 0.95$) with source data. A controlled simulation with known causal effects verified faithful recovery of the imposed genetic association structure.

Keywords: synthetic genotype generation, latent diffusion model, phenotype-supervised, privacy-preserving, GWAS-guided variant selection, variational autoencoder, UK Biobank, polygenic risk scores
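The abstract states that the latent diffusion model is conditioned on binary disease labels via classifier-free guidance. The core of that technique is a single interpolation between the unconditional and conditional noise predictions at each denoising step, sketched below. The noise vectors are placeholder numbers, the guidance convention shown is one of several in common use, and nothing here reflects SNPgen's actual network or guidance scale.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond).

    w = 0 ignores the label, w = 1 uses the conditional prediction as is,
    and w > 1 pushes samples further toward the conditioning label.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Placeholder noise predictions for a 3-dimensional latent (hypothetical values).
eps_uncond = [0.0, 1.0, -0.5]
eps_cond = [0.5, 2.0, -1.0]  # prediction given a disease label, for example "case"

for w in (0.0, 1.0, 2.0):
    print(w, guided_noise(eps_uncond, eps_cond, w))
```

During training, the label is typically dropped at random so that one network learns both predictions; that part is omitted here.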

63. ❌ 6ABOS: An Open-Source Atmospheric Correction Framework for the EnMAP Hyperspectral Mission Based on 6S

Authors: Gabriel Caballero Cañas, Bárbara Alvado Arranz, Xavier Sòria-Perpinyà, Antonio Ruiz-Verdú, Jesús Delegido, José Moreno Journal/Source: arXiv Published: 2026-03-11 arXiv link: http://arxiv.org/abs/2603.10856v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

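Scoring rationale: The paper focuses on atmospheric correction of remote sensing imagery, using the 6S radiative transfer model and the Google Earth Engine API to build the open-source framework 6ABOS for processing EnMAP hyperspectral images. Its content is unrelated to the vast majority of keywords (large models, deep learning, AI alignment, inference optimization, and so on). It has some relevance only to "AI for Science OR Bioinformatics OR Cheminformatics", because the work is an application in scientific computing and Earth science, although it does not involve bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The paper addresses the atmospheric correction challenge in retrieving water-surface reflectance from EnMAP hyperspectral imagery. It proposes 6ABOS, an open-source framework based on the 6S radiative transfer model, and validation over two Mediterranean reservoirs shows that it retrieves water reflectance spectra in close agreement with in situ measurements.

Abstract translation (Chinese)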

环境制图与分析计划（EnMAP）任务为光学复杂环境监测开辟了新前沿。然而，水体表面反射率的精确反演仍是一项重大挑战，因为离水信号通常仅占总辐射的很小部分，极易被大气散射和表面反射效应所掩盖。本文介绍了6ABOS（基于6S的大气背景偏移扣除法），这是一个新颖的开源Python框架，旨在实现EnMAP高光谱影像大气校正（AC）的自动化。通过利用太阳光谱卫星信号二次模拟（6S）辐射传输模型，6ABOS实施了一种基于物理的反演方案，该方案考虑了瑞利散射、气溶胶相互作用及气体吸收。该框架将自动化的EnMAP元数据解析与通过谷歌地球引擎（GEE）应用程序接口（API）的动态大气参数反演相结合。验证工作在两个营养状态迥异的地中海内陆水库进行：贫营养的贝纳赫韦尔水库和超富营养的贝柳斯水库。结果表明，现场测量数据与EnMAP反演的离水反射率之间具有高度的光谱相似性。在两个研究区域，光谱角制图（SAM）值始终保持较低水平（SAM < 10°）。6ABOS通过conda-forge平台分发，为科学界提供了一个可扩展、透明且可复现的开源科学工具，以推动云计算时代的高光谱水生研究。

Abstract

The Environmental Mapping and Analysis Program (EnMAP) mission has opened new frontiers in the monitoring of optically complex environments. However, the accurate retrieval of surface reflectance over water bodies remains a significant challenge, as the water-leaving signal typically accounts for only a small fraction of the total radiance, being easily obscured by atmospheric scattering and surface reflection effects. This paper introduces 6ABOS (6S-based Atmospheric Background Offset Subtraction), a novel open-source Python framework designed to automate the atmospheric correction (AC) of EnMAP hyperspectral imagery. By leveraging the Second Simulation of the Satellite Signal in the Solar Spectrum (6S) radiative transfer model, 6ABOS implements a physically-based inversion scheme that accounts for Rayleigh scattering, aerosol interactions, and gaseous absorption. The framework integrates automated EnMAP metadata parsing with dynamic atmospheric parameter retrieval via the Google Earth Engine (GEE) Application Programming Interface (API). Validation was conducted over two Mediterranean inland water reservoirs with contrasting trophic states: the oligotrophic Benagéber and the hypertrophic Bellús. Results demonstrate a high degree of spectral similarity between in situ measurements and EnMAP-derived water-leaving reflectances. The Spectral Angle Mapper (SAM) values remained consistently low (SAM $< 10^\circ$) across both study sites. 6ABOS is distributed via conda-forge, providing the scientific community with a scalable, transparent, and reproducible open-science tool for advancing hyperspectral aquatic research in the cloud-computing era.

Keywords: atmospheric correction, hyperspectral imagery, EnMAP, 6S radiative transfer model, water-leaving reflectance, open-source framework, Google Earth Engine, spectral validation
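The validation metric quoted in the abstract, the Spectral Angle Mapper (SAM), treats two spectra as vectors and reports the angle between them. A minimal sketch follows. The reflectance values are invented for illustration and are not taken from the paper, and this is not the 6ABOS code.

```python
import math

def spectral_angle_degrees(reference, target):
    """Angle in degrees between two spectra sampled at the same bands."""
    dot = sum(r * t for r, t in zip(reference, target))
    norm_r = math.sqrt(sum(r * r for r in reference))
    norm_t = math.sqrt(sum(t * t for t in target))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_r * norm_t)))
    return math.degrees(math.acos(cosine))

# Hypothetical reflectance spectra over four bands.
in_situ = [0.020, 0.030, 0.050, 0.040]
retrieved = [0.022, 0.031, 0.047, 0.043]
brighter_copy = [1.2 * value for value in in_situ]

print(f"retrieved vs in situ: {spectral_angle_degrees(in_situ, retrieved):.2f} deg")
print(f"scaled copy vs in situ: {spectral_angle_degrees(in_situ, brighter_copy):.6f} deg")
```

Because the angle ignores vector length, SAM compares spectral shape and is insensitive to a uniform brightness offset in scale, which is why a low SAM alone does not guarantee matching absolute reflectance.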

Token usage statistics

Total: 182,188 tokens (input 116,785 / output 65,403)

Model	Input	Output	Total
deepseek-chat	113,052	62,058	175,110
glm-4.7	3,733	3,345	7,078


63. ❌ 6ABOS: An Open-Source Atmospheric Correction Framework for the EnMAP Hyperspectral Mission Based on 6S#

Token Usage Statistics
