📊 ArXiv Research Report (2026-03-27)

Generated: 2026-03-27 09:42:30 | Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
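
The threshold arithmetic can be checked with a short sketch. The function names, the `(weight, relevance)` row layout, and the way the per-keyword cap is applied are assumptions made for illustration; the report does not show the actual scoring code. The sample rows are the non-zero entries from the first paper's score table.

```python
def passing_score(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Passing score = base + slope * total keyword weight (formula from the report)."""
    return base + slope * total_weight


def paper_score(rows, max_per_keyword: float = 15.0) -> float:
    """Sum weight * relevance over keywords, capping each keyword's contribution.

    Assumption: each row is (weight, relevance) and the cap mirrors the
    "maximum score per keyword" setting. The real pipeline may differ.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


threshold = passing_score(total_weight=27.0)

# Non-zero rows of the first paper's score table: (weight, relevance).
first_paper = [(1.0, 10.0), (1.0, 5.0), (1.0, 5.0), (1.0, 10.0), (1.0, 10.0),
               (1.0, 5.0), (1.0, 10.0), (1.0, 10.0), (1.0, 10.0)]
total = paper_score(first_paper)

print(round(threshold, 1), total, total >= threshold)
```

With 27 keywords of weight 1.0 each, the threshold works out to 26.6, which matches the configured passing score.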

📈 Paper statistics

Total fetched: 291 papers
Passing papers: 13 (4.5%)
Deep analysis: 13 papers

⭐ Detailed analysis of passing papers

1. AI-Supervisor: Autonomous AI Research Supervision via a Persistent Research World Model

Authors: Yunbo Long | Venue/source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.24402v1

Score: 75.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

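Scoring rationale: The paper proposes AutoProf, a multi-agent framework for AI research supervision whose core is a continuously evolving Research World Model. The framework explicitly supports mainstream large language models (LLMs), so it is highly relevant to "Large Language Models" (10 points). Its core is a multi-agent system involving autonomous agents, coordination, and a consensus mechanism, so it is highly relevant to "LLM Agents", "Multi-agent Systems", and "Self-Correction/Self-Improvement" (10 points each). One of its central innovations is the "Research World Model", which is highly relevant to "World Models" (10 points). The application domain is AI research supervision, which falls under "AI for Science" (10 points). The framework involves structured analysis, self-correcting loops, and iterative improvement, reflecting some aspects of "Chain of Thought" and "System 2 Thinking" (5 points each). Agents may need to call tools or APIs to carry out research tasks, which gives some connection to "Tool Use" (5 points). The paper does not cover the technical details of the other keywords, such as MoE, model compression, training methods, inference acceleration, or alignment techniques, so those keywords score 0.

!!! tip deepseek-chat TL;DR

To address the lack of persistent understanding and self-correction in existing automated research systems, the paper proposes AutoProf, a multi-agent system driven by a continuously evolving Research World Model. It provides end-to-end autonomous AI research supervision from literature review to paper writing, and improves the research process through structured gap discovery and self-correcting loops.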

Abstract

Existing automated research systems operate as stateless, linear pipelines, generating outputs without maintaining a persistent understanding of the research landscape. They process papers sequentially, propose ideas without structured gap analysis, and lack mechanisms for agents to verify or refine each other’s findings. We present AutoProf (Autonomous Professor), a multi-agent orchestration framework where specialized agents provide end-to-end AI research supervision driven by human interests, from literature review through gap discovery, method development, evaluation, and paper writing, via autonomous exploration and self-correcting updates. Unlike sequential pipelines, AutoProf maintains a continuously evolving Research World Model implemented as a Knowledge Graph, capturing methods, benchmarks, limitations, and unexplored gaps as shared memory across agents. The framework introduces three contributions: first, structured gap discovery that decomposes methods into modules, evaluates them across benchmarks, and identifies module-level gaps; second, self-correcting discovery loops that analyze why modules succeed or fail, detect benchmark biases, and assess evaluation adequacy; third, self-improving development loops using cross-domain mechanism search to iteratively address failing components. All agents operate under a consensus mechanism where findings are validated before being committed to the shared model. The framework is model-agnostic, supports mainstream large language models, and scales elastically with token budget from lightweight exploration to full-scale investigation.

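Keywords: Autonomous AI Research, Multi-agent System, Research World Model, Knowledge Graph, Self-correcting Discovery, Gap Analysis, AI for Science, Agent Coordination

Deep analysis:

AI-Supervisor: Autonomous AI Research Supervision via a Persistent Research World Model

Summary:

Existing automated research systems are mostly stateless, linear pipelines that lack a persistent understanding of the research landscape. The paper proposes AutoProf (Autonomous Professor), a multi-agent orchestration framework driven by human interests for end-to-end AI research supervision. At its core is a continuously evolving Research World Model, implemented as a knowledge graph, that serves as shared memory across agents. The main contributions are structured gap discovery, self-correcting discovery loops, and self-improving development loops. All agents operate under a consensus mechanism so that findings are validated before they are committed. The framework is model-agnostic, supports mainstream LLMs, and scales elastically with the token budget.

Innovations:

Proposes AutoProf, a multi-agent framework built on a persistent Research World Model (a knowledge graph), addressing the statelessness and lack of structured analysis in existing systems.
Designs a structured gap discovery mechanism that decomposes methods into modules and evaluates them across benchmarks to identify module-level gaps.
Introduces self-correcting discovery loops and self-improving development loops that analyze why modules succeed or fail and search for cross-domain mechanisms to repair failing components.
Establishes a consensus mechanism among agents so that every finding is validated before it updates the shared model.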
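
A toy illustration of module-level gap discovery may help make the idea concrete. Everything here is hypothetical: the record layout, the method and benchmark names, the 0.5 threshold, and the use of a mean score are assumptions for the sketch and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical evaluation records: (method, module, benchmark, score in [0, 1]).
records = [
    ("MethodA", "retriever", "BenchX", 0.82),
    ("MethodA", "planner", "BenchX", 0.41),
    ("MethodB", "retriever", "BenchY", 0.77),
    ("MethodB", "planner", "BenchY", 0.38),
    ("MethodB", "verifier", "BenchY", 0.90),
]


def find_module_gaps(records, threshold=0.5):
    """Return modules whose mean score across all benchmarks is below threshold."""
    scores = defaultdict(list)
    for _method, module, _benchmark, score in records:
        scores[module].append(score)
    return sorted(m for m, s in scores.items() if sum(s) / len(s) < threshold)


gaps = find_module_gaps(records)
print(gaps)
```

Grouping by module rather than by method is what lets a weakness shared across several methods show up as a single gap.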

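Method

!!! info

The system is a multi-agent framework that uses a knowledge graph to build a Research World Model serving as shared memory. It applies a structured decomposition strategy that breaks complex methods into modules for independent evaluation, and runs iterative loops (a discovery loop and a development loop) for self-correction and improvement. All agents validate their outputs through a consensus protocol.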
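
The consensus-gated shared memory can be sketched as follows. This is a minimal sketch under stated assumptions: the `ResearchWorldModel` class, the triple representation, the majority quorum, and the three validator functions are all invented for illustration, and the paper's actual protocol may work quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchWorldModel:
    """Toy shared memory: a set of (subject, relation, object) triples."""
    triples: set = field(default_factory=set)

    def commit(self, triple, votes, quorum=0.5):
        """Add the triple only if the share of approving votes exceeds quorum."""
        approved = bool(votes) and (sum(votes) / len(votes) > quorum)
        if approved:
            self.triples.add(triple)
        return approved


model = ResearchWorldModel()

# Hypothetical validator agents, each a callable returning True or False.
validators = [
    lambda t: t[1] in {"evaluated_on", "has_limitation"},  # known relation
    lambda t: bool(t[0]) and bool(t[2]),                   # non-empty endpoints
    lambda t: len(t) == 3,                                 # well-formed triple
]


def propose(triple):
    """Collect one vote per validator, then attempt the commit."""
    return model.commit(triple, [v(triple) for v in validators])


accepted = propose(("MethodA", "evaluated_on", "BenchX"))
rejected = propose(("", "unknown_relation", "BenchX"))
print(accepted, rejected, model.triples)
```

As the limitations further down point out, a vote like this only helps when the validators fail independently; correlated errors would still pass.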

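Key results:

AutoProf provides end-to-end AI research supervision from literature review through paper writing.
The Research World Model, as shared memory, maintains a persistent understanding of the research landscape.
Structured gap discovery and self-correction identify and address defects in methods.
The framework is model-agnostic, supports mainstream LLMs, and scales elastically with available resources.

Tech stack: multi-agent systems, knowledge graphs, large language models, consensus mechanism, modular decomposition, cross-domain mechanism search

Strengths

Persistent memory: the knowledge graph addresses the statelessness of conventional AI agents and accumulates research insights.
Structured analysis: goes beyond text generation to perform module-level gap analysis and evaluation.
Self-correction: can detect and repair its own logical flaws or evaluation biases.
Flexibility: model-agnostic and elastically scalable across budgets of different sizes.

Limitations

Complexity: building and maintaining the knowledge graph and the multi-agent consensus mechanism may be very complex.
LLM dependence: performance depends heavily on the capabilities of the underlying LLM, and hallucinations remain possible.
Compute cost: multi-agent interaction and iterative loops may consume substantial compute and tokens.
Validation accuracy: even with a consensus mechanism, a finding can still pass validation if all agents share the same hallucination.

Relevance to research direction:

The paper is highly relevant. It uses LLMs as the core engine for building agents, which counts as innovation in LLM technical principles. It also explores the application of AI to the scientific research process (AI for Science), that is, using AI to do AI research, which matches the interest in LLM applications in science. The proposed persistent world model and multi-agent collaboration mechanism are highly innovative.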

2. Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Authors: Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, Tunazzina Islam | Venue/source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.24580v1

Score: 66.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	15.0/10	15.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

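Scoring rationale: The paper's core topic is the application of RAG systems to AI policy analysis, so "Retrieval-Augmented Generation" receives the maximum 15 points. The paper explicitly uses DPO for alignment, so "RLHF/DPO" receives 10 points. The study focuses on hallucination, so "Hallucination Mitigation" receives 10 points. The paper involves LLM applications, so "Large Language Models" receives 8 points; it is applied to the AI policy domain, so "AI for Science" receives 8 points. The paper mentions domain adaptation and fine-tuning, so "Pre-training/Domain Adaptation" and "Post-training/SFT" receive 5 points each; it involves alignment, so "Instruction Tuning/Alignment" receives 5 points. Other keywords such as MoE, quantization, and inference acceleration are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies retrieval-augmented generation (RAG) for AI policy analysis and finds that better retrieval quality does not always improve end-to-end question answering, and can sometimes lead to more confident hallucinations. It offers practical insights for designing QA systems over dynamic regulatory corpora.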

Abstract

Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.

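Keywords: Retrieval-augmented generation, RAG, AI policy analysis, Direct Preference Optimization, DPO, hallucinations, domain adaptation, question answering

Deep analysis:

Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Summary:

This paper studies the application of retrieval-augmented generation (RAG) to AI governance and policy analysis. Using the AGORA corpus, it builds a RAG system that combines a ColBERT retriever fine-tuned with contrastive learning and a generator aligned with Direct Preference Optimization (DPO). Synthetic queries and pairwise preference data are used to adapt the system to the policy domain. Experiments evaluate retrieval quality, answer relevance, and faithfulness. Domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end QA performance. In some cases, stronger retrieval even leads to more confident hallucinations when relevant documents are missing from the corpus. Component-level improvements therefore do not necessarily translate into more reliable answers.

Innovations:

Builds a RAG system specific to the AI governance domain, combining a retriever fine-tuned with contrastive learning and a DPO-aligned generator.
Finds and verifies that gains in retrieval metrics do not necessarily translate into better end-to-end QA performance, especially for dense legal text.
Reveals the counterintuitive effect that stronger retrieval can lead to more confident hallucinations: when the corpus lacks relevant documents, an improved retriever may amplify the model's misplaced confidence.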
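
For readers unfamiliar with DPO, the per-pair objective can be written in a few lines. This is a sketch of the standard DPO loss for a single preference pair, not the paper's training code; the `beta` value and the example log-probabilities are assumptions chosen for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin).

    Each argument is a summed sequence log-probability, under the policy being
    trained (logp_*) or under the frozen reference model (ref_logp_*).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) is equal to log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: zero margin, so the loss equals log 2.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy prefers the chosen answer more than the reference does: lower loss.
good = dpo_loss(-1.0, -3.0, -2.0, -2.0)

# Policy prefers the rejected answer: higher loss.
bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)

print(neutral, good, bad)
```

The loss only rewards the policy for widening the gap between chosen and rejected answers relative to the reference model, which is why it needs pairwise preference data rather than single labelled answers.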

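Method

!!! info

The study uses the AGORA dataset (947 documents). The retriever is based on ColBERTv2 and fine-tuned with contrastive learning on LLM-generated synthetic queries and human-annotated positive and negative examples. The generator is based on Mistral-7B-Instruct and fine-tuned with Direct Preference Optimization (DPO) on human-annotated pairwise preference data. For evaluation, the RAGAS framework measures answer relevance and accuracy along with a faithfulness score, retrieval is measured with metrics such as MRR and Recall@k, and experts provide a qualitative review.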
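
The two retrieval metrics named above are simple to compute. The sketch below uses the usual definitions of MRR and Recall@k; the document IDs and the two toy queries are made up and are unrelated to the AGORA corpus.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


# Toy queries: (ranked retrieval result, set of relevant document IDs).
queries = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9", "d4"], {"d4", "d8"}),
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
recall_at_2 = sum(recall_at_k(r, rel, 2) for r, rel in queries) / len(queries)

print(mrr, recall_at_2)
```

Both metrics score the retriever in isolation, which is consistent with the paper's observation that they can improve without the final answers getting better.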

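Key results:

Domain-specific fine-tuning improved retrieval metrics (such as MRR and Recall@5) but did not significantly improve end-to-end QA accuracy.
The best RAG configuration (Base ColBERT + DPO Mistral) improved only slightly over the baseline, while the GPT-5.4 baseline was substantially more accurate.
The DPO-aligned generator slightly improved faithfulness (0.80 vs 0.78).
Expert review found that the system captures key themes but is limited in precise policy interpretation and cross-document citation.

Tech stack: ColBERTv2 (retriever), Mistral-7B-Instruct (generator), contrastive learning, Direct Preference Optimization (DPO), LoRA (parameter-efficient fine-tuning), RAGAS (evaluation framework), AGORA dataset

Strengths

Explores AI governance, an emerging and complex domain, in depth and fills a gap in interactive QA systems for it.
Provides counterintuitive insight into the relationship between component improvements and overall RAG performance, with clear practical value.
Combines retriever fine-tuning with generator alignment in a complete end-to-end experimental pipeline evaluated along multiple dimensions.

Limitations

End-to-end gains are limited, and domain-specific fine-tuning is less effective than expected.
The system is prone to confident hallucinations when the corpus lacks relevant documents.
Expert review notes remaining weaknesses in precise policy interpretation and cross-document citation, and difficulty with complex nested legal definitions.

Relevance to research direction:

The paper is highly relevant. It studies LLMs (in a RAG system) applied to a science/policy domain, which falls under research applications of LLMs across fields. It also touches on innovation in deep learning techniques, such as contrastive learning, DPO applied to a specific domain, and an in-depth analysis of the interaction between retriever and generator. Although the application area is policy analysis rather than natural science, its tech stack, methodology, and discussion of LLM limitations fit the user's interest in LLM principles and applications.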

3. LensWalk: Agentic Video Understanding by Planning How You See in Videos

Authors: Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan | Venue/source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.24558v1

Score: 51.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

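Scoring rationale: The paper proposes LensWalk, a framework in which a large language model (LLM) acts as the reasoner and actively controls visual observation for video understanding. It is highly relevant to "Large Language Models" (10 points) because the LLM is the core reasoning component; highly relevant to "LLM Agents" (10 points) because the framework is essentially an agent system; highly relevant to "Chain of Thought" (10 points) because it emphasizes progressive evidence gathering in service of the reasoning chain; strongly relevant to "System 2 Thinking" (8 points) because it involves in-depth reasoning; strongly relevant to "Tool Use" (8 points) because it uses vision-language model tools; and somewhat relevant to "Explainable AI" (5 points) because it mentions interpretability. Other keywords such as MoE, SFT, and RAG are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the disconnect between reasoning and perception in video understanding. It proposes LensWalk, an agentic framework in which a large language model actively controls visual observation, delivering plug-and-play gains without fine-tuning and improving accuracy by over 5% on long-video benchmarks.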

Abstract

The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent’s evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.

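Keywords: Agentic Framework, Large Language Model, Video Understanding, Dynamic Observation, Chain of Thought, Vision-Language Models, Reason-Plan-Observe Loop, Plug-and-play Performance

Deep analysis:

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Summary:

To address the disconnect between reasoning and perception in video understanding, the paper proposes the LensWalk framework. It gives an LLM reasoner active control over visual observation and establishes a tight reason-plan-observe loop. With tools such as Scan Search, Segment Focus, and Stitch Verify, the agent dynamically specifies the temporal scope and sampling density of what it observes, gathering evidence on demand from broad scans down to focused detail. Experiments show substantial accuracy gains on long-video benchmarks such as LVBench and Video-MME without fine-tuning, while remaining efficient and balancing high accuracy against low cost.

Innovations:

Proposes the LensWalk agentic framework, which tightly couples reasoning and observation and lets the model plan its visual observation strategy dynamically based on its reasoning state.
Designs a versatile set of observation tools (Scan Search, Segment Focus, Stitch Verify) that support broad scanning, focused inspection, and holistic verification.
Introduces timestamp anchors and a global subject memory table to keep evidence coherent and context remembered across multi-turn interaction.
Delivers plug-and-play gains without fine-tuning, substantially reducing token consumption while keeping accuracy high, and outperforms conventional static sampling and retrieval methods.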

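Method

!!! info

The paper takes an agent-based approach. The core architecture consists of an LLM reasoner and a VLM observer. The method builds an iterative loop: the reasoner analyzes the current question and evidence, plans the next action, and calls an observation tool; the tool extracts visual evidence from the raw video according to its parameters (temporal scope, sampling rate); and timestamp anchors and the memory table update the global state. The whole process is dynamic and on-demand rather than statically pre-processed.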
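
The reason-plan-observe loop can be sketched with stubbed components. The `Observation` type, the `plan_next` and `observe` callables, and the fixed two-step plan are hypothetical stand-ins for the LLM reasoner and VLM tools; they show only the control flow and the (temporal scope, sampling density) parameterization.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    start: float  # seconds
    end: float    # seconds
    fps: float    # sampling density in frames per second


def sample_timestamps(obs: Observation) -> list[float]:
    """Evenly spaced timestamps in [start, end) at the requested density."""
    n = max(1, round((obs.end - obs.start) * obs.fps))
    step = (obs.end - obs.start) / n
    return [round(obs.start + i * step, 3) for i in range(n)]


def run_loop(plan_next, observe, max_steps=5):
    """plan_next(evidence) returns an Observation or None to stop;
    observe(timestamps) returns one piece of evidence."""
    evidence = []
    for _ in range(max_steps):
        obs = plan_next(evidence)
        if obs is None:
            break
        evidence.append((obs, observe(sample_timestamps(obs))))
    return evidence


# Stub reasoner: one coarse scan of a 10-minute video, then a dense look at 120-130 s.
def plan_next(evidence):
    plans = [Observation(0, 600, 0.01), Observation(120, 130, 1.0)]
    return plans[len(evidence)] if len(evidence) < len(plans) else None


# Stub observer: a real system would pass the sampled frames to a VLM here.
def observe(timestamps):
    return f"{len(timestamps)} frames"


trace = run_loop(plan_next, observe)
print([result for _obs, result in trace])
```

In this toy run the coarse scan samples 6 frames from the whole video and the focused step samples 10 frames from a 10-second window, which is the budget trade-off the framework relies on.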

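Key results:

On challenging benchmarks such as LVBench and Video-MME (long-video split), LensWalk improved the accuracy of several models (for example, o3) by more than 5% and up to 11.5%.
Compared with retrieval agents that consume millions of tokens, LensWalk's total token usage is only slightly higher than a single dense pass, showing strong efficiency.
The agent exhibits human-like strategies, such as strategic reflection, progressive focusing, and integrated verification, rather than repeating degenerate actions.

Tech stack: large language models (LLM), vision-language models (VLM), agent framework, tool-use mechanism, dynamic sampling strategy

Strengths

Proactivity: moves beyond static observation to dynamic visual perception driven by reasoning needs.
Efficiency: allocates the observation budget on demand, avoiding waste on irrelevant content and controlling compute cost while improving accuracy.
Generality: requires no fine-tuning and works as a plug-in that improves many existing VLM/LLM combinations.
Interpretability: reasoning traces show a clear observation strategy, making the model's decisions easier to understand.

Limitations

Depends on the LLM's planning ability; planning errors can bias what is observed.
Multi-turn iteration may increase inference latency: total tokens are low, but the number of interaction rounds can be high.
For very short or very simple videos, a complex agentic framework may be redundant.
The paper focuses on long-video understanding and does not explicitly address real-time video streams.

Relevance to research direction:

The paper is highly relevant. It belongs to "innovation in LLM and deep learning technical principles". Specifically, it innovates on multimodal agent architecture, tightly combining LLM reasoning with visual perception, and proposes a new "plan how you observe" paradigm. This is not only an application of LLMs to video but also an advance, at the level of technical principles, in how agents use tools to perceive and reason, and it is highly innovative.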

4. Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Authors: Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang | Venue/source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.24472v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

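Scoring rationale: The paper studies self-distillation, a post-training technique for large language models (LLMs), and analyzes in depth its effect on reasoning ability, especially mathematical reasoning. It finds that self-distillation suppresses the model's expression of uncertainty during reasoning (epistemic verbalization), which harms robustness, particularly on out-of-distribution (OOD) tasks. This is highly relevant to "Chain of Thought", "System 2 Thinking", and "Self-Correction", which concern multi-step reasoning, in-depth thinking, and self-reflection or self-improvement. The paper explicitly frames self-distillation as a "post-training paradigm", so it is highly relevant to "Post-training". Its subject is LLMs, so it is highly relevant to "Large Language Models". The paper does not cover the specific techniques, application domains, or experts described by the other keywords.

!!! tip deepseek-chat TL;DR

The study finds that self-distillation as a post-training technique for LLMs shortens reasoning traces but suppresses the model's expression of uncertainty during reasoning, degrading performance on mathematical reasoning tasks, especially out-of-distribution problems. It shows that preserving an appropriate level of uncertainty expression matters for robust reasoning.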

Abstract

Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model’s expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

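Keywords: Self-distillation, Large Language Models (LLMs), Reasoning capability, Epistemic verbalization, Uncertainty expression, Mathematical reasoning, Post-training, Out-of-distribution (OOD) performance

Deep analysis:

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Summary:

The paper studies why self-distillation degrades performance on mathematical reasoning tasks. Self-distillation usually improves performance and shortens reasoning chains, but in mathematics it can degrade performance. The study attributes this to the suppression of epistemic verbalization, the model's expression of uncertainty during reasoning (for example, words such as "Wait"). When the teacher is conditioned on rich information (such as the reference answer), it produces overconfident, short reasoning traces that lack uncertainty expression. This style helps in-domain optimization but harms out-of-distribution (OOD) performance, because solving unseen problems requires expressing uncertainty in order to adjust the reasoning path. Experiments on Qwen, DeepSeek, and Olmo models show performance drops of up to 40% from this suppression. The paper stresses that optimizing reasoning behavior should preserve appropriate uncertainty expression in addition to reinforcing correct answers.

Innovations:

Identifies the root cause of self-distillation's degradation in mathematical reasoning: the suppression of epistemic verbalization.
Proposes that information richness and task coverage are the two key factors determining how well self-distillation works.
Shows through controlled experiments that training data lacking uncertainty expression substantially harms generalization even when it contains correct answers.
Emphasizes preserving uncertainty-aware reasoning behavior in LLM post-training, rather than pursuing only answer correctness and length compression.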

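Method

!!! info

The paper uses comparative analysis and controlled experiments. First, it compares how self-distillation behaves in the chemistry and mathematics domains. Second, it defines conditional mutual information to quantify the richness of the conditioning context, and sets up four generation settings (unguided, answer-guided, answer-guided with thinking content removed, and regeneration-conditioned) to observe changes in reasoning behavior. Finally, it runs supervised fine-tuning of DeepSeek-R1-Distill-Qwen-7B on correct-answer datasets with high uncertainty (unguided) and low uncertainty (answer-guided), and evaluates generalization on benchmarks such as AIME, AMC, and MATH500.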
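
One simple way to measure epistemic verbalization is to count uncertainty markers in a reasoning trace. The marker list and the two example traces below are hypothetical; the report does not give the paper's actual lexicon or counting procedure, so this is only an illustration of the kind of measurement involved.

```python
import re

# Hypothetical marker list; the paper's exact lexicon is not given in this report.
EPISTEMIC_MARKERS = ("wait", "hmm", "maybe", "perhaps", "let me check")


def count_epistemic_markers(trace: str) -> int:
    """Count whole-word, case-insensitive occurrences of the markers."""
    text = trace.lower()
    return sum(len(re.findall(rf"\b{re.escape(m)}\b", text)) for m in EPISTEMIC_MARKERS)


# Made-up traces in the style of the unguided and answer-guided settings.
unguided = "Try x = 2. Wait, that fails the second equation. Maybe x = 3? Let me check: yes."
answer_guided = "Set x = 3. Both equations hold. The answer is 3."

unguided_count = count_epistemic_markers(unguided)
guided_count = count_epistemic_markers(answer_guided)
print(unguided_count, guided_count)
```

A count like this captures the contrast the paper describes: traces generated with access to the answer tend to be shorter and contain fewer such markers.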

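Key results:

As the conditioning context becomes richer, reasoning length and the number of epistemic tokens decrease monotonically; reasoning becomes more confident but loses uncertainty expression.
Training on the low-uncertainty (answer-guided) dataset causes severe degradation (for example, AIME24 accuracy falls from 54.79% to 20.21%), whereas training on the high-uncertainty dataset preserves the original performance.
By imitating a teacher that has extra information, self-distillation leads the student to presuppose information that is unavailable at inference time, which disrupts its own Bayesian inference process.
Significant drops caused by suppressed epistemic verbalization are observed on Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct.

Tech stack: Self-Distillation (SDPO), Reinforcement Learning from Verifiable Rewards (RLVR), Group Relative Policy Optimization (GRPO), KL divergence, conditional mutual information, DeepSeek-R1-Distill-Qwen-7B, Qwen3-8B, Olmo3-7B-Instruct, DAPO-Math-17k

Strengths

Insightful: it not only observes the performance drop but also uncovers the mechanism behind it (suppression of epistemic verbalization).
Rigorous design: controlled variables (information richness) and comparative experiments (different domains, different training data) support the hypothesis well.
Practical value: gives useful guidance for LLM post-training, showing that reasoning chains should not be compressed and made more confident indiscriminately, and that uncertainty expression must be preserved for robustness.
Generality: the conclusions are validated on several mainstream open-source models.

Limitations

The study concentrates on mathematical reasoning; whether the findings hold for other complex reasoning tasks (such as code generation or logical reasoning) needs further validation.
The paper mainly analyzes the cause and does not propose a concrete algorithm for automatically preserving or enhancing epistemic verbalization.
Experiments are mainly on 7B-8B models; whether the suppression effect remains significant at larger scales is unclear.

Relevance to research direction:

The paper directly studies core LLM techniques, namely self-distillation and reasoning ability, and falls under innovation in deep learning technical principles. It examines the relationship between the model's internal reasoning mechanism (epistemic verbalization) and the training objective, which matters for understanding how LLMs perform scientific reasoning, especially in mathematics. Although validated mainly on math datasets, its findings offer key guidance for making LLMs more robust in scientific applications. The paper is therefore highly relevant to "innovation in LLM and deep learning technical principles" and also touches on "LLM applications in science".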

5. CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Authors: Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Bechard, Spandana Gella, Sai Rajeswar | Venue/source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.24440v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	5.0/10	5.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

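Scoring rationale: The paper's core topic is computer-use agents (CUAs), which belong to autonomous agents and tool use, so it is highly relevant to "LLM Agents" and "Tool Use" (10 points each). The paper mentions foundation action models and visual world models, which gives some connection to "Large Language Models" and "World Models" (5 points each). The dataset aims to remove a data-scarcity bottleneck and support scaling, which relates to "Scaling Laws" AND "Data Quality" (5 points). The annotations include multi-step reasoning, which relates to "Chain of Thought" and "System 2 Thinking" (5 points each). Other keywords, such as MoE, SLMs, training methods, inference optimization, and AI for science, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the shortage of continuous, high-quality human demonstration videos that limits progress on computer-use agents. It releases CUA-Suite, a large-scale dataset (about 55 hours of expert video with dense annotations) to support agent evaluation and training, and finds that current foundation action models fail often on professional desktop applications.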

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite’s rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.

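Keywords: Computer-use agents, Video demonstrations, Desktop workflows, Foundation action models, UI element annotations, Visual world models, Continuous spatial control, Multimodal corpus

In-depth analysis:

CUA-Suite: A Large-Scale Human-Annotated Video Demonstration Dataset for Computer-Use Agents

Summary:

To address the scarcity of continuous, high-quality demonstration data that computer-use agents (CUAs) face in complex desktop workflows, the paper presents the CUA-Suite ecosystem. Its core is VideoCUA: about 55 hours (6 million frames) of continuous 30 fps video covering 87 professional desktop applications and roughly 10,000 tasks, with cursor traces and multi-layered reasoning annotations. The suite also includes the UI-Vision benchmark for evaluation and the GroundCUA dataset for grounding. The study reports that current foundation action models fail on about 60% of tasks in professional desktop applications. By providing dense, causally supervised video data, CUA-Suite supports emerging directions such as continuous spatial control and visual world models. All data and models are publicly released.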

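Innovations:

Builds VideoCUA, described as the largest open-source corpus of expert video to date: 55 hours of continuous 30 fps recordings that preserve full temporal dynamics, well beyond existing sparse-screenshot datasets.
Provides dense multi-layered annotations that combine kinematic cursor traces, detailed reasoning text (497 words per step on average), and pixel-level UI element grounding, yielding dense causal supervision.
Proposes CUA-Suite as a full-stack ecosystem that integrates demonstration data (VideoCUA), grounding data (GroundCUA), and an evaluation benchmark (UI-Vision), covering the full technical stack of computer-use intelligence.
Focuses on professional desktop applications (e.g., 3D modeling tools and IDEs) across 87 programs, filling the gap left by existing datasets that mainly target web applications.

Method

!!! info

The study combines human expert demonstrations with dense annotation. First, human experts perform tasks in 87 desktop applications while screen recordings and action logs are captured. Key frames are then annotated with bounding boxes, OCR, and interaction labels, and pass through multi-step expert verification for quality control. The result is the CUA-Suite ecosystem, which contains continuous video trajectories, UI element grounding, and reasoning text. Existing models are then evaluated on the UI-Vision benchmark.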

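Key results:

Builds the VideoCUA dataset with about 55 hours (6 million frames) of expert demonstration video, more than 2.5 times the size of the largest existing open dataset.
Releases the GroundCUA dataset with 56K annotated screenshots and over 3.6 million UI element annotations.
Preliminary evaluation shows that current foundation action models fail on about 60% of tasks in professional desktop applications, indicating that existing techniques still face a significant bottleneck in complex environments.
Highlights the advantage of continuous video streams over sparse screenshots for capturing the dynamics of human interaction and for learning continuous spatial control policies.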
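
The headline figures can be sanity-checked with simple arithmetic. The sketch below uses only numbers quoted in the abstract (55 hours, 30 fps, and "less than 20 hours" for ScaleCUA); the constant and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract; the 20-hour value is an upper bound for ScaleCUA.
FPS = 30
VIDEOCUA_HOURS = 55
SCALECUA_HOURS_UPPER_BOUND = 20


def frames_for(hours: float, fps: int = FPS) -> int:
    """Number of frames in `hours` of continuous recording at `fps`."""
    return int(hours * 3600 * fps)


videocua_frames = frames_for(VIDEOCUA_HOURS)
# Because ScaleCUA is *less than* 20 hours, this ratio is a lower bound.
min_scale_ratio = VIDEOCUA_HOURS / SCALECUA_HOURS_UPPER_BOUND

print(videocua_frames)  # 5940000, i.e. roughly 6 million frames
print(min_scale_ratio)  # 2.75
```

The frame count agrees with the "approximately 6 million frames" in the abstract, and the duration ratio is consistent with the "more than 2.5 times" claim.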

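Tech stack: Screen recording and action logging (30 fps), Bounding-box annotation and OCR, Multimodal learning (video, text, trajectories), UI element grounding, Benchmarking and evaluation framework

Strengths

High-quality, large-scale data verified by human experts, which avoids the heavy noise of automatically generated data.
Emphasizes temporal continuity: continuous video instead of sparse screenshots helps in learning visual world models and continuous control policies.
A comprehensive ecosystem that provides not only training data but also an evaluation benchmark and grounding data, forming a closed research loop.
Fully open source, which should substantially advance research on computer-use agents.

Limitations

Data collection is very expensive: it depends on manual annotation and expert demonstrations, so it is hard to extend quickly to more applications.
Although 87 applications are covered, this is still a small slice of the desktop software ecosystem, so distribution bias is possible.
Current models still perform poorly on this data (about 60% failure rate), which suggests that data alone may not solve the problem and that algorithms and architectures need further innovation.

Relevance to research focus:

The paper is highly relevant. It falls under "innovation in the technical principles of large models and deep learning", specifically dataset construction for embodied and agentic intelligence. It addresses the data bottleneck that large models face in the concrete setting of computer control, and touches on video understanding, multimodal learning, and trajectory optimization. Its novelty lies in moving from sparse screenshots to learning from continuous video, which matters for improving the spatiotemporal reasoning and continuous control abilities of large models, and it meets the user's high-score criteria for new techniques and innovation.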

6. Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing

Authors: Michael Somma, Markus Großpointner, Paul Zabalegui, Eppu Heilimo, Branka Stojanović Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24221v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

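Scoring rationale: The paper's core is the use of large language models (LLMs) to build an environment-grounded multi-agent system for automated penetration testing in robotic environments. It is therefore highly relevant to “Large Language Models” (10 points), since it explicitly explores LLMs for this application. It is highly relevant to “LLM Agents” and “Multi-agent Systems” (10 points each), since it proposes a multi-agent architecture. It is partly relevant to “Tool Use” (5 points): penetration testing relies on tools for exploitation, but the paper does not explicitly discuss LLM API tool calling. Other keywords such as MoE, SFT, RAG, and quantization are not covered in technical detail or as contributions, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies how to use large language models to build an environment-grounded multi-agent workflow that automates penetration testing of robotic systems, and reports a 100% success rate in a ROS/ROS2 Capture-the-Flag scenario.

Abstract (Chinese translation)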

数字基础设施日益增长的复杂性与互联性，使得可扩展且可靠的安全评估方法变得至关重要。机器人系统作为一类尤为重要的运营技术（Operational Technology），其现代形态是高度网络化的信息物理系统，广泛应用于工业自动化、物流及自主服务等领域。本文探讨了在机器人环境中利用大语言模型进行自动化渗透测试的方法。我们提出了一种专为机器人系统定制的、基于环境的多智能体架构。该方法在执行过程中动态构建一个基于图的共享记忆，用以捕获可观测的系统状态，包括网络拓扑、通信通道、漏洞及已尝试的攻击手段。这实现了结构化的自动化测试，同时在整个测试过程中保持了可追溯性和有效的上下文管理。通过在专门的机器人夺旗场景（ROS/ROS2）中进行多轮迭代评估，该系统表现出高可靠性，在全部测试运行（n=5）中均成功完成挑战，成功率达100%。这一性能显著超越了文献中的基准水平，同时满足了如《欧盟人工智能法案》等框架所要求的可追溯性与人工监督。

Abstract

The increasing complexity and interconnectivity of digital infrastructures make scalable and reliable security assessment methods essential. Robotic systems represent a particularly important class of operational technology, as modern robots are highly networked cyber-physical systems deployed in domains such as industrial automation, logistics, and autonomous services. This paper explores the use of large language models for automated penetration testing in robotic environments. We propose an environment-grounded multi-agent architecture tailored to Robotics-based systems. The approach dynamically constructs a shared graph-based memory during execution that captures the observable system state, including network topology, communication channels, vulnerabilities, and attempted exploits. This enables structured automation while maintaining traceability and effective context management throughout the testing process. Evaluated across multiple iterations within a specialized robotics Capture-the-Flag scenario (ROS/ROS2), the system demonstrated high reliability, successfully completing the challenge in 100% of test runs (n=5). This performance significantly exceeds literature benchmarks while maintaining the traceability and human oversight required by frameworks like the EU AI Act.

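Keywords: Large Language Models, Multi-agent Systems, Autonomous Penetration Testing, Robotic Environments, Environment-grounded Architecture, Graph-based Memory, ROS/ROS2, Capture-the-Flag

In-depth analysis:

Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing

Summary:

As digital infrastructure grows more complex, especially in operational technology (OT) such as robotic systems, scalable and reliable security assessment methods are urgently needed. The paper explores large language models for automated penetration testing in robotic environments. To address the shortcomings of existing LLM-driven tools in automation, persistent context management, and traceability, the authors propose an environment-grounded multi-agent architecture. It consists of a planner, an executor, and a memory agent, and uses LangGraph to build a closed-loop workflow. During execution the system dynamically constructs a shared graph-based memory that captures the observable system state, including network topology, communication channels, vulnerabilities, and attempted exploits. Repeated evaluation in a specialized robotics Capture-the-Flag scenario (ROS/ROS2) shows high reliability: the system completed the challenge in 100% of test runs (n=5). This exceeds the benchmarks reported in the literature while meeting the traceability and human-oversight requirements of frameworks such as the EU AI Act.

Innovations:

Proposes an environment-grounded dynamic graph memory that builds a shared knowledge graph during execution and captures system state instead of relying on preset knowledge.
Designs a closed-loop multi-agent architecture on LangGraph in which a planner, an executor, and a memory agent cooperate to automate penetration testing in a structured way.
Uses a persistent graph memory to improve the traceability of decisions and the management of context, meeting regulatory requirements (such as the EU AI Act) in safety-critical domains.
Tailors the workflow to robotic systems (ROS/ROS2) by integrating network scanning with ROS-specific exploitation tools, automating the transition from reconnaissance to exploitation.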

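Method

!!! info

The paper uses a multi-agent workflow architecture built on LangGraph. The system has three main stages: in the planning stage, the planner generates a structured task list; in the execution stage, the executor turns tasks into Nmap scans or ROS exploitation commands; in the memory stage, the memory agent writes the results into a persistent knowledge graph. Context is managed by separating the transient workflow state from the persistent graph memory. Evaluation takes place in a specialized robotics CTF environment (ROS/ROS2): success rate is measured over repeated runs and compared with benchmarks from the literature, and the effect of different LLM architectures and model sizes on performance is analyzed.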
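
The planner, executor, and memory loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `GraphMemory`, `plan`, `execute`, and `remember` are hypothetical names, LangGraph is deliberately not used, and the executor is a stub that returns a canned observation instead of running any tool.

```python
from dataclasses import dataclass, field


@dataclass
class GraphMemory:
    """Persistent shared memory: nodes are observed entities, edges are relations."""
    nodes: dict = field(default_factory=dict)     # node_id -> attributes
    edges: set = field(default_factory=set)       # (source, relation, target)
    attempts: list = field(default_factory=list)  # audit trail of executed tasks

    def add_node(self, node_id, **attrs):
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_edge(self, source, relation, target):
        self.edges.add((source, relation, target))

    def record_attempt(self, task, outcome):
        self.attempts.append({"task": task, "outcome": outcome})

    def already_tried(self, task):
        return any(a["task"] == task for a in self.attempts)


def plan(memory):
    """Hypothetical planner: propose only tasks that were not attempted yet."""
    candidates = [
        f"enumerate services on {node_id}"
        for node_id, attrs in memory.nodes.items()
        if attrs.get("kind") == "host"
    ]
    return [task for task in candidates if not memory.already_tried(task)]


def execute(task):
    """Stub executor: returns a canned observation instead of running a tool."""
    host = task.rsplit(" ", 1)[-1]
    return {"host": host, "service": "ros-master", "port": 11311}


def remember(memory, task, observation):
    """Memory agent: write the observation into the persistent graph."""
    service_id = f'{observation["host"]}:{observation["port"]}'
    memory.add_node(service_id, kind="service", name=observation["service"])
    memory.add_edge(observation["host"], "exposes", service_id)
    memory.record_attempt(task, "ok")


memory = GraphMemory()
memory.add_node("robot-1", kind="host")
for _ in range(2):  # the second pass finds nothing new to do
    for task in plan(memory):
        remember(memory, task, execute(task))

print(len(memory.attempts))  # 1
```

Because every executed task is stored in `attempts`, the planner skips redundant work on later passes and the full sequence of actions can be replayed for audit, which is the traceability property the paper emphasizes.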

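Key results:

In the specialized robotics CTF scenario, the system achieved a 100% success rate over 5 test runs, showing high reliability on this task.
The system's performance exceeds the benchmarks reported in the literature.
The graph-based memory reduced redundant exploration and made it possible to reconstruct the full attack path.
Demonstrates that highly automated penetration testing is feasible while preserving human oversight and traceability.

Tech stack: LangGraph (multi-agent orchestration framework), Large Language Models (LLMs), Nmap (network scanner), ROS/ROS2 (Robot Operating System), Knowledge Graph (graph database), Bash Scripts (custom ROS exploitation scripts)

Strengths

High reliability and automation: a 100% success rate in testing, better than existing methods.
Strong traceability: the graph memory provides a transparent record of decisions that is easy to audit and supervise, in line with regulatory requirements.
Context awareness: the dynamically built memory lets agents generate follow-up tasks from what has already been discovered, avoiding information loss.
Domain focus: optimized for the complexity and specific protocols (such as ROS) of robotic systems, filling a gap in OT security automation.

Limitations

Evaluation is based mainly on a simulated CTF scenario, so the complexity and interference of real industrial environments may not be fully covered.
The system still needs human intervention to switch operating modes (for example, from scanning to exploitation), so it is not yet fully end-to-end autonomous.
The toolset is relatively limited (mainly Nmap and custom scripts), which may fall short against unknown complex vulnerabilities or specific defenses.
The hallucination risk and context-window limits of the underlying LLM are mitigated by the architecture but remain potential bottlenecks.

Relevance to research focus:

The paper is highly relevant. It is applied research on large models (LLMs) in a scientific domain (cybersecurity and robotics), and it also contributes to large-model techniques through multi-agent collaboration and memory mechanisms. The environment-grounded graph memory and multi-agent workflow address the pain points of context management and traceability for LLMs in complex tasks, and show clear technical novelty. It matches the user's scoring criteria for large-model applications and technical innovation.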

7. HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

Authors: Ken Ding Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23871v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

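Scoring rationale: The paper studies the vanishing-gradient problem that large language models (LLMs) encounter when trained with reinforcement learning (RL) on mathematical reasoning, and proposes a hybrid distillation policy optimization method called HDPO. The method augments standard RL with privileged self-distillation, targeting “cliff” prompts that the model cannot solve at all. It is therefore highly relevant to “Large Language Models” and “RLHF” (10 points each), because its core is the optimization of RL training for LLMs. It is partly relevant to “Chain of Thought” and “Self-Correction” (5 points each), because mathematical reasoning involves multi-step reasoning and self-distillation can be seen as a form of self-improvement. It is also partly relevant to “AI for Science” (5 points), because mathematical reasoning is a subfield of scientific AI. Other keywords such as MoE, SFT, RAG, and quantization are not covered, so they score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the vanishing gradient on “cliff” prompts, which the model cannot solve at all, when LLMs are trained with reinforcement learning for mathematical reasoning. It proposes HDPO, a hybrid distillation policy optimization method that strengthens the learning signal through privileged self-distillation, and reports improved coverage on math problems while greedy accuracy is maintained.

Abstract (Chinese translation)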

采用强化学习（RL）训练的大语言模型在数学推理任务中面临一个根本性挑战：对于模型完全无法解决的“悬崖”式问题，RL梯度会完全消失，导致这些失败模式无法获得任何学习信号。我们提出了混合蒸馏策略优化（Hybrid Distillation Policy Optimization, HDPO），该方法通过针对悬崖式问题的特权自蒸馏来增强标准RL训练。在每一步训练中，HDPO首先识别所有采样轨迹均失败的提示，随后通过向模型提供真实信息生成特权轨迹，筛选出正确解，并将教师模型在词元级别的分布蒸馏至学生模型。由于教师与学生模型共享权重（仅输入不同），与跨模型蒸馏不同，其可实现性差距在理论上是有界的。我们证明，在硬阈值极限下，采用R=1过滤的特权生成能够恢复最优的KL正则化RL策略。在OpenMathInstruct-2数据集上使用Qwen2.5-Math-1.5B-Instruct模型进行的实验表明，HDPO在保持贪婪准确率的同时，持续提升了覆盖度指标（pass@4提升0.8-1.1%，pass@8提升0.4-1.7%），其中蒸馏权重λ为探索-利用权衡提供了直接控制。

Abstract

Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - “cliff” prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher’s token-level distribution into the student. Because teacher and student share the same weights - differing only in their input - the realizability gap is provably bounded, unlike cross-model distillation. We prove that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintaining greedy accuracy, with the distillation weight lambda providing direct control over the exploration-exploitation tradeoff.

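Keywords: Large Language Models, Reinforcement Learning, Mathematical Reasoning, Policy Optimization, Self-Distillation, Privileged Information, KL-regularized RL, Exploration-Exploitation Tradeoff

In-depth analysis:

HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

Summary:

To address the “cliff” problem in RL training of LLMs for mathematical reasoning (prompts the model cannot solve at all, which make the gradient vanish), the paper proposes Hybrid Distillation Policy Optimization (HDPO). The method combines standard RL with privileged self-distillation: during training it identifies prompts on which every attempt fails, uses the ground truth as privileged information to generate correct solutions, and distills the privileged distribution back into the original model with the Jensen-Shannon divergence (JSD). Experiments on OpenMathInstruct-2 show that HDPO consistently improves coverage (pass@4 and pass@8) while maintaining greedy accuracy. The method needs no complex curriculum scheduling or external model. The paper proves that same-model distillation has a smaller realizability gap and that R=1 filtering recovers the optimal RL policy.

Innovations:

Proposes the HDPO framework, which uses privileged self-distillation to resolve the zero-gradient “cliff” problem in RL training and provides a learning signal for failed samples.
Proves that same-model privileged distillation has a tighter realizability gap than cross-model distillation, because the model-mismatch term is eliminated.
Proves that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy, giving a theoretical basis for constructing the teacher.
On mathematical reasoning, improves solution coverage (pass@4 and pass@8) without sacrificing greedy accuracy (pass@1).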

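Method

!!! info

The paper uses a hybrid training strategy that combines Group Relative Policy Optimization (GRPO) with privileged self-distillation. It first performs a standard GRPO update. It then identifies “cliff” prompts on which all rollouts fail. Next, it appends the ground truth to the input as privileged information, generates privileged rollouts, and keeps only the correct ones (R=1). Finally, it uses the Jensen-Shannon divergence (JSD) as the loss to distill the token-level distribution of the privileged model (teacher) back into the original model (student). The whole procedure runs on the same model weights; only the input context differs.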
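
Two pieces of this procedure are easy to illustrate: detecting a cliff prompt and computing a token-level JSD distillation term. The sketch below is a toy version in pure Python. The function names, and the choice to combine the terms as `rl_loss + lam * distill`, are assumptions for illustration; the report does not give the paper's exact objective.

```python
import math


def is_cliff(rewards):
    """A prompt is a 'cliff' when every sampled rollout fails (reward 0)."""
    return all(r == 0 for r in rewards)


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def hybrid_loss(rl_loss, teacher_dists, student_dists, lam):
    """Assumed combination: RL loss plus lam times the mean token-level JSD."""
    pairs = list(zip(teacher_dists, student_dists))
    distill = sum(jsd(t, s) for t, s in pairs) / len(pairs)
    return rl_loss + lam * distill


# Toy example: all four rollouts failed, so the prompt is a cliff.
print(is_cliff([0, 0, 0, 0]))  # True
print(is_cliff([0, 1, 0, 0]))  # False

# Teacher (sees the ground truth) and student (does not) over a 3-token vocabulary.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.4, 0.4, 0.2], [0.3, 0.4, 0.3]]
loss = hybrid_loss(rl_loss=0.0, teacher_dists=teacher, student_dists=student, lam=0.5)
```

On a cliff prompt the RL term carries no signal (modeled here as `rl_loss=0.0`), so the distillation term is the only source of gradient, which is the motivation given in the abstract.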

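Key results:

On OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct, HDPO improves pass@4 by 0.8–1.1% and pass@8 by 0.4–1.7%.
Coverage improves while greedy accuracy (pass@1) stays on par with the baseline.
The distillation weight λ gives direct control over the exploration-exploitation tradeoff.
The theoretical analysis indicates that the method can fill the gap at the boundary of model capability that RL training leaves untouched.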
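
For readers unfamiliar with the coverage metrics, the sketch below shows the widely used unbiased pass@k estimator (from n samples with c correct). Whether the paper computes pass@4 and pass@8 with exactly this estimator is not stated in this report, so treat it as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(8, 0, 4))  # 0.0   (a cliff prompt: nothing solved)
print(pass_at_k(8, 1, 4))  # 0.5   (one correct sample out of eight)
print(pass_at_k(8, 1, 1))  # 0.125
```

Moving a prompt from zero correct samples to even one raises pass@4 far more than pass@1, which illustrates how coverage can improve while greedy accuracy stays roughly flat.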

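Tech stack: Group Relative Policy Optimization (GRPO), Proximal Policy Optimization (PPO), Knowledge Distillation (Self-Distillation), Privileged Information Learning, Jensen-Shannon Divergence (JSD), KL Divergence, Qwen2.5-Math-1.5B-Instruct, OpenMathInstruct-2

Strengths

Simple and efficient: it needs only extra forward passes and a standard JSD loss, with no complex curriculum scheduling, replay buffer, or prompt generator.
Solid theoretical guarantees: rigorous proofs are given for the realizability gap and for recovery of the optimal policy.
Targets a core pain point: it directly addresses the dead zone where the RL gradient vanishes, and uses the model's own capability to generate training data.
Improves diversity: the JSD loss helps preserve solution diversity and avoid mode collapse, which raises pass@k.

Limitations

Depends on ground truth: the method needs the ground truth of each problem for privileged generation. This is common in mathematics but may be restrictive in open-domain tasks.
Extra compute: cliff prompts require additional privileged generation and distillation steps, which increases training cost.
Limited domain validation: the method has been validated mainly on mathematical reasoning, and generalization to other reasoning domains needs further study.

Relevance to research focus:

The paper is highly relevant. It is an innovation in large-model techniques that focuses on improving reinforcement learning (RL) for reasoning training. Although the experiments concentrate on mathematical reasoning (an application in science), the core contribution is a general training algorithm (HDPO) that addresses the vanishing-gradient bottleneck in RL training. It involves low-level innovation in deep learning techniques and also improves large-model performance in scientific computing, which matches the research focus on large-model innovation and scientific applications.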

8. Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

Authors: Nour Bouchouchi, Thibault Laugel, Xavier Renard, Christophe Marsala, Marie-Jeanne Lesot, Marcin Detyniecki Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24125v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

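Scoring rationale: The paper studies gender bias in LLMs, in particular how alignment affects the expression and the encoding of bias. Highly relevant keywords (10 points each): 1) “Large Language Models” OR “LLMs” OR “Foundation Models” (the paper explicitly studies LLMs); 2) “Post-training” OR “Supervised Fine-tuning” OR “SFT” (alignment is performed through supervised fine-tuning); 3) “Instruction Tuning” OR “Alignment” OR “Value Alignment” (the effect of alignment on bias is the core topic). Moderately relevant (5 points): “Mechanistic Interpretability” OR “Explainable AI” (the paper analyzes internal representations and encoded bias, which touches on interpretability). Other keywords have no direct connection to the paper's topic (bias analysis and alignment effects) and score 0.

!!! tip deepseek-chat TL;DR

The paper studies gender bias in large language models (LLMs). Using a unified framework, it analyzes how alignment (such as supervised fine-tuning) affects bias expressed in outputs and bias encoded in internal representations. It finds that alignment reduces expressed bias but does not eliminate encoded bias, and that debiasing observed on benchmarks does not necessarily generalize to realistic settings.

Abstract (Chinese translation)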

在训练过程中，大语言模型（LLMs）习得的社会规律可能导致下游应用中出现性别偏见。大多数缓解措施侧重于减少生成输出中的偏见，通常通过结构化基准进行评估，这引发了两个问题：输出层面的评估无法揭示对齐过程是否改变了模型的内在表征，且结构化基准可能无法反映真实使用场景。我们提出一个统一框架，利用相同的中性提示词联合分析大语言模型的内在和外显性别偏见，从而能够直接比较内部表征中编码的性别相关信息与生成输出中表达的偏见。与先前研究报道的弱相关或不一致关联不同，我们在统一测量协议下发现潜在性别信息与表达出的偏见之间存在稳定关联。我们进一步通过旨在减少性别偏见的有监督微调来检验对齐效果。结果表明，尽管微调确实减少了外显偏见，但内部表征中仍存在可测量的性别相关关联，且这些关联在对抗性提示下可能被重新激活。最后，我们考察两种现实场景并证明，在结构化基准上观察到的去偏见效果不一定能推广到其他情境，例如故事生成任务中。

Abstract

During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model’s underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while the latter indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.

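Keywords: Large Language Models, Gender Bias, Alignment, Supervised Fine-tuning, Internal Representations, Expressed Bias, Debiasing, Benchmark Evaluation

In-depth analysis:

Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

Summary:

Existing work on gender bias in large language models (LLMs) mostly focuses on reducing bias in generated outputs through alignment methods such as supervised fine-tuning, and often overlooks changes in the model's internal representations. The paper proposes a unified framework that uses identical neutral prompts to jointly analyze intrinsic bias (information encoded in internal representations) and extrinsic bias (bias expressed in generated outputs). Under this unified protocol, it finds a consistent association between latent gender information and expressed bias, in contrast to earlier reports of weak correlation. Experiments show that supervised fine-tuning reduces extrinsic, expressed bias, yet measurable gender associations remain in internal representations and can be reactivated by adversarial prompts. The study also shows that debiasing observed on structured benchmarks does not necessarily generalize to realistic tasks such as story generation.

Innovations:

Proposes a unified analysis framework that measures intrinsic and extrinsic gender bias with the same neutral prompts, removing the incomparability caused by heterogeneous protocols in earlier studies.
Finds a consistent association between intrinsically encoded gender information and extrinsically expressed bias, challenging earlier reports that the two are only weakly correlated.
Shows that supervised fine-tuning (alignment) acts mainly as a behavioral control that suppresses biased outputs, and does not remove the bias knowledge from internal representations; the bias can be reactivated by adversarial prompts.
Uses directional ablation to confirm that the implicitly encoded gender associations functionally drive gendered generation, and that these associations persist after fine-tuning.
Shows that debiasing on structured benchmarks (such as BBQ) does not always generalize to realistic settings (such as story generation), which underlines the need for diverse evaluation methods.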

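Method

!!! info

The paper builds a prompt set from concepts (such as occupations and diseases) and neutral personas (such as "my friend"). For extrinsic bias, the LLM generates completions, an LLM-as-a-judge classifies them by gender cues (pronouns, nouns), and entity-level and concept-level bias scores are computed. For intrinsic bias, the hidden state of the final token is extracted while the model processes the prompt, and the gender information encoded in it is analyzed. The study also applies supervised fine-tuning (SFT) aimed at reducing gender bias, and uses adversarial prompts to test whether the bias can be revived. Finally, directional ablation removes a specific gender direction to verify its causal role.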
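
Directional ablation, the last step above, amounts to projecting a hidden state onto the subspace orthogonal to a chosen direction. The sketch below is a toy illustration in pure Python. Estimating the direction as a difference of group means is an assumption made for this example; the report does not say how the paper obtains its gender direction.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def mean_difference_direction(group_a, group_b):
    """Unit vector pointing from the mean of group_b to the mean of group_a."""
    dim = len(group_a[0])
    mean_a = [sum(h[i] for h in group_a) / len(group_a) for i in range(dim)]
    mean_b = [sum(h[i] for h in group_b) / len(group_b) for i in range(dim)]
    return normalize([a - b for a, b in zip(mean_a, mean_b)])


def ablate(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`."""
    coeff = dot(hidden, direction)
    return [h - coeff * d for h, d in zip(hidden, direction)]


# Toy 2-d hidden states for two groups of prompts (hypothetical values).
group_a = [[2.0, 1.0], [2.0, 3.0]]
group_b = [[0.0, 1.0], [0.0, 3.0]]
direction = mean_difference_direction(group_a, group_b)

print(direction)                      # [1.0, 0.0]
print(ablate([3.0, 4.0], direction))  # [0.0, 4.0]
```

After ablation the hidden state has no component along the direction, so any change in generated text can be attributed to the removed information; this is the logic behind the causal claim.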

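Key results:

Under the unified measurement protocol, gender information in internal representations is positively and consistently associated with bias in generated outputs.
Supervised fine-tuning reduces gender bias in generated outputs but does not eliminate gender associations in internal representations.
Under adversarial (jailbreak) prompts, the bias of the fine-tuned model is reactivated and resembles the bias observed before fine-tuning.
Directional ablation shows that removing the gender direction from internal representations directly debiases generated outputs, confirming a causal link.
Models that look debiased on structured benchmarks can still show substantial gender bias in open-ended story generation.

Tech stack: Large Language Models (LLMs), Supervised Fine-Tuning (SFT), Hidden State Representation Analysis, LLM-as-a-Judge, Adversarial Prompting / Jailbreaking, Directional Ablation, Transformer Architecture

Strengths

Rigorous framework design: using identical prompts removes confounds between the intrinsic and extrinsic analyses and makes the conclusions more credible.
Clarifies what alignment actually does by separating "behavior suppression" from "knowledge removal", which matters for understanding LLM safety.
Goes beyond benchmarks to realistic scenarios such as story generation, so the conclusions offer more practical guidance.
Combines correlation analysis, adversarial testing, and ablation experiments into a tightly argued case.

Limitations

Focuses mainly on implicit gender bias (stereotypes), with little coverage of explicit bias or other bias types (such as race or religion).
The adversarial prompts may not cover all possible attacks, and the threshold and scope of bias reactivation need further study.
Relies on LLM-as-a-judge to classify outputs, so results may be affected by the judge model's own bias.
Focuses mainly on specific concepts (such as occupations); bias in more complex contexts may require extended analysis.

Relevance to research focus:

The paper is highly relevant to the keywords. It examines the technical principles of large models (LLMs), in particular model alignment, internal representation mechanisms, and the relationship between knowledge storage and behavioral control. This is novel research on large-model principles that exposes the limits of current alignment techniques (such as SFT) in removing deep-seated bias. Although it does not address specific scientific applications such as biomedicine, its findings on model safety and bias are important for deploying large models in any serious application, including science, and have high reference value.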

9. MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning

Authors: Andrea Manzoni Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24044v1

Score: 33.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

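Scoring rationale: The paper studies parameter-efficient fine-tuning of MoE models. It is highly relevant to “Mixture of Experts” and “PEFT/LoRA” (10 points each), because it directly studies the MoE architecture and LoRA fine-tuning. It is relevant to “Large Language Models” (8 points), because MoE models are typically an LLM architecture. It is partly relevant to “Post-training/SFT” (5 points), because it concerns the fine-tuning stage. Other keywords such as SLMs, Scaling Laws, RAG, and Agents are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the inefficiency of standard LoRA fine-tuning for MoE models, the paper proposes MoE-Sieve, a routing-guided method that applies LoRA adapters only to the most-routed experts in each layer. It preserves performance while substantially reducing trainable parameters, storage, and training time.

Abstract (Chinese translation)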

对混合专家（Mixture-of-Experts, MoE）模型进行标准LoRA微调时，通常将适配器应用于所有专家。然而，我们的性能分析表明，每层的专家路由分布高度倾斜：每层中只有一小部分专家处理大多数令牌，而许多其他专家很少被激活（即“冷”专家）。我们提出了MoE-Sieve，一种简单的路由引导式LoRA微调框架，并结合了对不同架构和任务中专家路由的系统性分析研究。该方法很简单：在一个小型校准集上分析路由计数，每层选择前k个被路由最多的专家，并仅对这些专家应用LoRA。在两个架构不同的MoE模型和三项多样化任务中，每层仅微调前25%被路由最多的专家，其性能仍与完整LoRA微调相当，所有条件下的平均差异在+/-1个百分点以内。这使LoRA可训练参数减少了70-73%，适配器检查点大小减少了71-73%，实际训练时间最多减少了50%。我们还观察到专家数量与种子间方差之间存在非单调关系，这与以下假设一致：调整冷专家可能引入梯度噪声而不会提升准确性。进一步的消融实验表明，在相同参数预算下随机选择专家的性能约差2.5个百分点，这说明路由信号至关重要，而逐层贪婪预算优化并未优于均匀的前k选择。

Abstract

Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert, yet our profiling shows that per-layer expert routing is highly skewed: a small subset of experts handles most tokens in each layer, while many others are rarely activated (“cold”). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is simple: profile routing counts on a small calibration set, select the top-k most-routed experts per layer, and apply LoRA only to those experts. Across two architecturally distinct MoE models and three diverse tasks, tuning only the top 25% routed experts per layer remains competitive with full LoRA, with mean differences within +/-1 percentage point across all conditions. This reduces LoRA trainable parameters by 70-73%, adapter checkpoint size by 71-73%, and wall-clock training time by up to 50%. We also observe a non-monotonic relationship between expert count and seed-to-seed variance, consistent with the hypothesis that adapting cold experts can introduce gradient noise without improving accuracy. Further ablations show that random expert selection at matched budget is about 2.5 percentage points worse, indicating that the routing signal matters, while greedy per-layer budget optimization does not improve over uniform top-k.

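Keywords: Mixture-of-Experts, MoE, LoRA, Parameter-efficient Fine-tuning, Expert Routing, Fine-tuning Efficiency, Sparse Models, Adapter Compression

In-depth analysis:

MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning

Summary:

To address parameter redundancy in LoRA fine-tuning of Mixture-of-Experts models, the paper proposes the MoE-Sieve framework. It finds that per-layer expert routing in MoE models is highly skewed: a few "hot" experts handle most tokens, while many "cold" experts are rarely activated. MoE-Sieve profiles routing counts on a calibration set, selects the top-k most-routed experts in each layer, and applies LoRA adapters only to them. Experiments show that tuning only the top 25% of experts matches full LoRA (mean differences within ±1 percentage point) while cutting trainable parameters by 70–73%, adapter checkpoint size by 71–73%, and training time by up to 50%. The method is simple and efficient and needs no hyperparameter search.

Innovations:

Proposes the MoE-Sieve framework, which uses routing guidance to apply LoRA only to frequently activated experts and greatly reduces fine-tuning cost.
Systematically quantifies the "globally balanced but locally imbalanced" routing phenomenon in MoE models, finding that per-layer routing skew is 4.0–4.9 times higher than the global skew.
Shows that tuning only the top 25% of experts preserves accuracy while cutting LoRA trainable parameters by more than 70% and training time by up to 50%.
Observes that tuning "cold" experts appears to add gradient noise and increase variance, which offers an explanation for why selective tuning works.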

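Method

!!! info

The paper follows a three-step procedure. First, it runs a single forward pass over a small calibration set and counts how often each expert in each layer is activated. Second, it uses these counts to select the top-k most-routed experts per layer. Third, it applies LoRA only to the selected experts plus the always-active attention, router, and shared-expert modules. The method is validated on architecturally different models such as OLMoE-1B-7B and Qwen1.5-MoE-A2.7B, on tasks including Spider, GSM8K, and HellaSwag, and is compared with full LoRA and a random-selection baseline.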
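
The first two steps (profile, then pick the top-k per layer) can be sketched without any deep learning framework. In the example below, `routing_events` stands in for the (layer, expert) assignments a real calibration pass would record; the function names and toy data are illustrative assumptions, and attaching LoRA adapters to the selected experts is left out.

```python
import math
from collections import Counter


def profile_routing(routing_events, num_layers):
    """Count how many tokens each expert receives in each layer."""
    counts = [Counter() for _ in range(num_layers)]
    for layer, expert in routing_events:
        counts[layer][expert] += 1
    return counts


def select_top_experts(counts, num_experts, fraction=0.25):
    """Keep the `fraction` most-routed experts in every layer (uniform top-k)."""
    k = max(1, math.ceil(num_experts * fraction))
    return [sorted(e for e, _ in layer.most_common(k)) for layer in counts]


def coefficient_of_variation(layer_counts, num_experts):
    """Std/mean of per-expert load in one layer; higher means more skew."""
    loads = [layer_counts.get(e, 0) for e in range(num_experts)]
    mean = sum(loads) / num_experts
    variance = sum((x - mean) ** 2 for x in loads) / num_experts
    return math.sqrt(variance) / mean


# Toy calibration log for 2 layers with 8 experts each.
routing_events = (
    [(0, 3)] * 5 + [(0, 1)] * 3 + [(0, 0)]
    + [(1, 7)] * 4 + [(1, 2)] * 2 + [(1, 5)]
)
counts = profile_routing(routing_events, num_layers=2)
selected = select_top_experts(counts, num_experts=8, fraction=0.25)

print(selected)  # [[1, 3], [2, 7]]
```

Only the experts listed in `selected` would receive adapters. Because selection happens once before fine-tuning, it inherits the static-selection limitation noted further down.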

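Key results:

Tuning only the top 25% of experts matches full LoRA, with mean differences within ±1 percentage point.
LoRA trainable parameters drop by 70–73% and adapter checkpoint size by 71–73%.
Wall-clock training time drops by up to 50%.
The coefficient of variation (CV) of the per-layer expert activation distribution is 4.0–4.9 times the global CV, confirming that local routing is highly skewed.
At a matched budget, routing-based selection outperforms random selection by about 2.5 percentage points.

Tech stack: Mixture-of-Experts (MoE) architecture, Low-Rank Adaptation (LoRA), OLMoE-1B-7B, Qwen1.5-MoE-A2.7B, and DeepSeek-MoE-16B models, Load-balancing loss, Coefficient of Variation (CV), Jaccard Index (to measure overlap between expert sets)

Strengths

Substantially improves the parameter and compute efficiency of MoE fine-tuning and lowers the barrier to deployment.
The method is very simple (Profile -> Count -> Pick Top-k -> Fine-tune), easy to implement, and needs no complex hyperparameter search.
Generalizes well: it performs well across architecturally different MoE models and across different tasks.
Provides a thorough empirical analysis of MoE routing dynamics that later work can build on.

Limitations

Depends on how representative the calibration set is: if the calibration data and the fine-tuning data differ substantially in distribution, the selected experts may not be optimal.
The best expert budget (such as the top 25%) may vary with model architecture and may need adjustment for new models.
Expert selection is fixed before fine-tuning, so it cannot adapt to routing changes that may occur during fine-tuning.

Relevance to research focus:

The paper is highly relevant to "innovation in the technical principles of large models and deep learning". It proposes a new solution to the fine-tuning efficiency of the MoE (Mixture-of-Experts) architecture, which is popular in current large models. By analyzing the routing mechanism, MoE-Sieve optimizes LoRA fine-tuning. This is a core contribution to low-level training techniques for large models, with high technical value and practicality.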

10. The Diminishing Returns of Early-Exit Decoding in Modern LLMs

Authors: Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li, Zhaozhuo Xu, Hao Wang Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23701v1

Score: 31.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	8.0/10	8.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

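Scoring rationale: The paper studies early-exit decoding in LLM inference, which directly involves “Large Language Models” (10 points) and “Speculative Decoding” (8 points, since early exit is a form of inference acceleration). It explicitly compares Mixture-of-Experts models, so “Mixture of Experts” scores 8. It discusses pretrained models and architectural improvements, so it is partly relevant to “Pre-training” (5 points). Other keywords such as SLMs, SFT, and RAG have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that as pretraining recipes and architectures of modern LLMs improve, layer redundancy shrinks, so early-exit decoding yields diminishing returns in latency and cost reduction. Dense Transformers show greater early-exit potential than MoE and state space models.

Abstract (Chinese translation)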

在大语言模型推理中，早期退出是指在预测达到足够置信度时于中间层停止计算，从而降低延迟与成本。然而，近期的大语言模型采用了改进的预训练方案和架构，减少了层间冗余，这可能限制了早期退出的机会。我们重新评估了现代大语言模型中的分层早期退出机制，并分析了训练过程中中间表示的演化规律。我们引入了一种量化模型内在早期退出适用性的指标，并提出了一个基准测试，供研究者探索不同模型和工作负载上潜在的早期退出收益。我们的研究结果显示，在新一代模型中，早期退出的有效性呈下降趋势。进一步发现，稠密Transformer通常比混合专家模型和状态空间模型具有更大的早期退出潜力。此外，参数量更大的模型（特别是超过200亿参数的模型）以及未经专门调优的预训练基础模型，往往表现出更高的早期退出潜力。

Abstract

In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, potentially limiting early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric to quantify a model’s intrinsic suitability for early-exit and propose a benchmark for researchers to explore the potential early-exit benefits on different models and workloads. Our results show a diminishing trend in early-exit effectiveness across newer model generations. We further find that dense transformers generally offer greater early-exit potential than Mixture-of-Experts and State Space Models. In addition, larger models, particularly those with more than 20 billion parameters, and base pretrained models without specialized tuning tend to exhibit higher early-exit potential.

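Keywords: Large Language Models, Early-exit Decoding, Inference Acceleration, Mixture-of-Experts, Transformer Architecture, Model Redundancy, Layer-wise Prediction, Benchmark Evaluation

In-depth analysis:

The Diminishing Returns of Early-Exit Decoding in Modern LLMs

Summary:

The paper examines how effective early-exit decoding is in modern large language models (LLMs). As LLM architectures and pretraining methods improve, layer redundancy decreases and early-exit opportunities may shrink. The authors re-evaluate the layer-wise early-exit potential of modern LLMs and introduce a new metric, the early-exit adaptability score (EAS), together with a benchmark framework, to quantify a model's intrinsic suitability for early exit. They find that early-exit effectiveness declines across model generations, that dense Transformers are generally better suited to early exit than Mixture-of-Experts and state space models, and that models with more than 20 billion parameters and base pretrained models show higher early-exit potential.

Innovations:

Proposes the early-exit adaptability score (EAS), which quantifies a model's intrinsic suitability for early exit and estimates the upper bound on speedup.
Builds a benchmark framework with oracle early-exit evaluation to systematically assess early-exit potential across models and workloads.
Identifies a "diminishing returns" trend in modern LLMs: newer generations have less layer redundancy and are harder to accelerate with early exit.
Systematically analyzes four factors that affect early-exit behavior: model scale, architecture type (dense vs MoE vs SSM), training stage (base vs fine-tuned), and workload characteristics.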

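Method

!!! info

Built on the OpenCompass framework, the study selects diverse datasets including GPQA, GSM8K, HumanEval, and MMLU, and systematically evaluates modern LLMs of different architectures and scales, including the Llama series, the Qwen series, OpenAI OSS, and Mamba. It computes the cosine similarity between the hidden states or logits of intermediate layers and those of the final layer, together with the skip ratio, and combines them with a weighted geometric mean into the early-exit adaptability score (EAS), which quantifies a model's early-exit potential.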
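
The ingredients named above (cosine similarity to the final layer, a skip ratio, and a weighted geometric mean) can be illustrated on toy vectors. The sketch below is not the paper's EAS: the 0.95 threshold, the equal weights, and the way the two quantities are combined are all assumptions, since this report does not give the exact definition.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def oracle_exit_layer(hidden_states, threshold=0.95):
    """Earliest layer whose state is already close to the final layer's state."""
    final = hidden_states[-1]
    for index, state in enumerate(hidden_states):
        if cosine(state, final) >= threshold:
            return index
    return len(hidden_states) - 1


def skip_ratio(exit_index, num_layers):
    """Fraction of layers that are skipped when exiting after `exit_index`."""
    return 1.0 - (exit_index + 1) / num_layers


def weighted_geometric_mean(values, weights):
    total = sum(weights)
    return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)) / total)


# Toy per-layer hidden states for one token (4 layers, 2 dimensions).
states = [[1.0, 0.0], [1.0, 1.0], [0.1, 1.0], [0.0, 1.0]]
exit_index = oracle_exit_layer(states)
ratio = skip_ratio(exit_index, len(states))
similarity = cosine(states[exit_index], states[-1])
score = weighted_geometric_mean([similarity, ratio], [1.0, 1.0])

print(exit_index)  # 2
print(ratio)       # 0.25
```

A geometric mean rewards models that are both similar at the exit point and able to skip many layers: if either factor is near zero, the combined score collapses, unlike with an arithmetic mean.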

关键结果:

现代LLM的早退有效性随着模型代际更新呈现下降趋势。
密集Transformer架构的早退潜力普遍高于混合专家模型和状态空间模型。
参数量超过200亿的大模型以及未经过专门微调的基础预训练模型通常具有更高的早退潜力。
早退模式主要取决于模型本身，受具体工作负载的影响较小。

技术栈: OpenCompass (benchmarking platform), Cosine Similarity (metric), Weighted Geometric Mean (metric calculation), Layer-wise Early-Exit decoding strategy, Dense Transformers, Mixture-of-Experts (MoE), State Space Models (SSMs)

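Strengths

Fresh perspective: challenges the conventional assumption that early exit remains effective in modern state-of-the-art LLMs and documents diminishing returns.
Quantitative metric: EAS offers a general, training-free standard for assessing a model's early-exit potential.
Broad coverage: the evaluation spans multiple architectures (dense, MoE, SSM), scales, and training stages, so the conclusions are widely applicable.

Limitations

Evaluation scope: results rely mainly on oracle evaluation (which assumes the best exit point is known), so practical dynamic exit strategies may fall short of the theoretical speedup ceiling.
Model coverage: although many models are covered, the latest closed-source models and domain-specific fine-tuned models may be missing.
No proposed remedy: the paper mainly exposes the problem (diminishing returns) and does not propose a new early-exit algorithm for modern LLMs to reverse the trend.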

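Relevance to research direction:

The paper studies innovation in the technical principles of LLMs, focusing on an inference-time efficiency technique (early-exit decoding). Although it does not address scientific applications directly, it analyzes intrinsic properties of modern LLM architectures (Transformer, MoE, SSM) in depth and therefore counts as research on the underlying technical principles of large models. It is valuable for understanding computational redundancy in large models and for reducing inference cost, and it matches the topic "innovation in the technical principles of large models and deep learning".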

11. Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

Authors: Han Sun, Qin Li, Peixin Wang, Min Zhang Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24058v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

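Scoring rationale: The paper focuses on object hallucination in large vision-language models (LVLMs) and proposes a lightweight decoding-time intervention based on attention imbalance rectification. Keyword relevance: (1) "Large Language Models" is highly relevant (10 points) because LVLMs extend LLMs to the vision-language domain; (2) "Hallucination Mitigation" is highly relevant (10 points) as the paper's core research problem; (3) "Mechanistic Interpretability" is partially relevant (5 points) because the paper explains the causes of hallucination by analyzing attention patterns; (4) "AI for Science" is partially relevant (5 points) because the paper mentions applications in high-stakes scenarios such as autonomous driving and medical image analysis; (5) other keywords such as MoE, SFT, and RAG have no direct connection to the paper and score 0.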
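As a worked check of the arithmetic behind the score line, the sketch below recomputes this paper's total and the passing threshold from the report's configuration (maximum of 15 per keyword, passing score of 5.0 + 0.8 × total weight, 27 keywords of weight 1.0). It assumes each keyword score is weight × relevance capped at the maximum, which matches the tables but is not stated explicitly in the report; the function names are hypothetical.

```python
MAX_KEYWORD_SCORE = 15.0  # per-keyword cap from the report configuration

def keyword_score(weight, relevance):
    # Assumed rule: weight times relevance, capped at the configured maximum.
    return min(weight * relevance, MAX_KEYWORD_SCORE)

def passing_score(total_weight):
    # Formula stated in the report configuration: 5.0 + 0.8 * total weight.
    return 5.0 + 0.8 * total_weight

def evaluate(rows, total_weight):
    """rows: (weight, relevance) pairs, one per keyword."""
    total = sum(keyword_score(w, r) for w, r in rows)
    return total, total >= passing_score(total_weight)

# Four non-zero keywords for this paper (10, 10, 5, 5) and 23 zero rows.
rows = [(1.0, 10.0), (1.0, 10.0), (1.0, 5.0), (1.0, 5.0)] + [(1.0, 0.0)] * 23
total, passed = evaluate(rows, total_weight=27.0)
```

With these inputs the total is 30.0 against a threshold of 26.6, so the paper passes by a small margin.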

!!! tip deepseek-chat TL;DR

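The paper studies object hallucination in large vision-language models. It identifies imbalanced attention allocation as a key cause and proposes a lightweight attention imbalance rectification method that substantially lowers hallucination rates and improves model performance on several benchmarks.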

Abstract

Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we identify that the imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving up to 15.9% of LVLMs’ general capability across diverse vision-language tasks.

Keywords: Large Vision-Language Models, Object Hallucination, Attention Imbalance, Attention Imbalance Rectification, Decoding-time Intervention, Modality-wise Imbalance, Token-wise Imbalance, Vision-Language Tasks

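In-depth analysis:

Mitigating Object Hallucinations in Large Vision-Language Models via Attention Imbalance Rectification

Summary:

To address the widespread problem of object hallucination in large vision-language models (LVLMs), the paper uses empirical analysis to introduce the concept of attention imbalance, covering both the modality-wise and the token-wise dimension. It finds that imbalanced attention allocation shows a strong causal correlation with object hallucination, and that hallucination-prone attention heads inherit patterns from the base language model. On this basis the authors propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention that adjusts attention distributions through modality-balanced attention reallocation and variance-constrained projection regularization. Experiments show that AIR substantially reduces hallucination rates and improves general capability without additional training.

Contributions:

Introduces the concept of attention imbalance (MAI and TAI), which quantifies and explains the attention allocation patterns that lead to object hallucination.
Finds that hallucination-prone attention heads tend to inherit the attention patterns of the base language model, producing modality-wise imbalance.
Proposes AIR, a training-free decoding-time intervention that mitigates hallucination by reallocating attention and constraining its variance.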

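Method

!!! info

The paper first runs a systematic empirical analysis, computing modality-wise attention imbalance (MAI) and token-wise attention imbalance (TAI) and establishing their correlation with hallucination. It then proposes AIR, which intervenes on the attention matrices at inference time in two ways: modality-balanced attention reallocation, which reduces excessive cross-modal imbalance, and variance-constrained projection regularization, which makes the attention distribution more uniform.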
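A rough sketch of the first AIR step (modality-balanced reallocation) for a single attention row follows. The report does not give the paper's formulas, so the share-based imbalance measure, the `min_visual_share` target, and both function names are illustrative assumptions; the variance-constrained projection step is not modeled here.

```python
def modality_mass(attn, is_visual):
    """Total attention mass on visual tokens and on text tokens for one query row."""
    visual = sum(a for a, v in zip(attn, is_visual) if v)
    text = sum(a for a, v in zip(attn, is_visual) if not v)
    return visual, text

def rebalance(attn, is_visual, min_visual_share=0.5):
    """Rescale one attention row so visual tokens receive at least min_visual_share.

    The relative weights inside each modality are preserved and the row keeps
    its original total mass. min_visual_share is a hypothetical knob.
    """
    visual, text = modality_mass(attn, is_visual)
    total = visual + text
    if visual == 0.0 or text == 0.0 or visual / total >= min_visual_share:
        return list(attn)  # nothing to fix, or nothing to rescale against
    v_scale = min_visual_share * total / visual
    t_scale = (1.0 - min_visual_share) * total / text
    return [a * (v_scale if v else t_scale) for a, v in zip(attn, is_visual)]

row = [0.1, 0.1, 0.5, 0.3]            # softmax output for one query token
mask = [True, True, False, False]     # which key positions are image tokens
fixed = rebalance(row, mask)
```

Scaling each modality by a single factor keeps the within-modality ranking of tokens intact, so the intervention changes only how much mass each modality receives.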

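Key results:

Across four mainstream LVLMs and three benchmarks (CHAIR, POPE, MM-Vet), AIR reduces object hallucination rates by up to 35.1%.
AIR improves the general capability of LVLMs by up to 15.9%.
The study reports a causal correlation between attention imbalance and object hallucination.

Tech stack: large vision-language models (LVLMs), Transformer architecture, self-attention, conditional mutual information, softmax, matrix operations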

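Strengths

Insightful explanation: accounts for the causes of hallucination from the perspective of the attention mechanism.
Lightweight and efficient: requires no additional training or fine-tuning and intervenes only at decoding time.
Good generalization: applies to multiple LVLM architectures and reduces hallucination without harming general performance, in some cases improving it.

Limitations

Although the intervention is lightweight, modifying attention weights during decoding may still add some latency.
The method targets object hallucination; its effect on other hallucination types (such as attribute or relation hallucination) may be limited.
It requires specific thresholds (such as the TAI threshold τ), which may need tuning per model.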

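Relevance to research direction:

The paper is highly relevant to "innovation in the technical principles of large models and deep learning". It examines attention allocation, a core mechanism of the Transformer architecture, and proposes a concrete algorithmic improvement (AIR), which makes it an in-depth study of the underlying principles of large models. Although its motivation touches on applications such as autonomous driving and medicine, its core contribution is a technical optimization, so it fits the user's interest in new technical principles well.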

12. Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition

Authors: Aleix Sant, Jordi Luque, Carlos Escolano Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24242v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

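Scoring rationale: The paper's core topic is multilingual instruction tuning of large language models (LLMs) under federated learning (FL), which maps directly to the keywords "Large Language Models" and "Instruction Tuning". The work fine-tunes LLMs, which falls under "Post-training" or "Supervised Fine-tuning". The paper does not address the specific techniques or domains behind the other keywords, such as MoE, quantization, inference acceleration, or AI for science.

!!! tip deepseek-chat TL;DR

The study examines how client language composition (from monolingual to multilingual) affects instruction-tuning quality, fairness, and training cost for multilingual LLMs under federated learning. It finds that greater within-client multilinguality improves global model performance and fairness, especially for low-resource languages, at the cost of more optimization steps.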

Abstract

Federated Learning (FL) of Large Language Models (LLMs) in multilingual environments presents significant challenges stemming from heterogeneous language distributions across clients and disparities in language resource availability. To address these challenges, we extended the FederatedScope-LLM framework to support multilingual instruction-tuning experiments with LLMs. We also introduced a novel client-specific early stopping mechanism, Local Dynamic Early Stopping (LDES-FL), which allows clients to pause and resume local training based on client-side validation performance, enhancing training efficiency and sustainability. Through a series of experiments, we studied how client language composition - from fully monolingual to increasingly multilingual clients - affects multilingual quality, fairness and training cost. Monolingual local fine-tuning remains the most effective for single-language specialization, whereas federated training is better suited to learning a single balanced multilingual model. In FL, increasing within-client multilinguality leads to stronger and fairer global models, narrows the gap to centralized multilingual fine-tuning, and yields the largest gains for lower-resource languages, albeit at the cost of more optimization steps. Overall, our results identify client language composition as a key design variable in multilingual FL, shaping performance, fairness and efficiency.

Keywords: Federated Learning, Large Language Models, Multilingual, Instruction Tuning, Client Language Composition, Fine-tuning, Fairness, Training Efficiency

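In-depth analysis:

Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition

Summary:

The paper examines how client language composition affects model performance, fairness, and training cost when LLMs are fine-tuned with federated learning (FL) in multilingual settings. The authors extend the FederatedScope-LLM framework to support multilingual instruction tuning and propose Local Dynamic Early Stopping (LDES-FL) to improve training efficiency. Across scenarios ranging from fully monolingual to highly multilingual clients, they find that increasing within-client multilinguality raises the global model's average performance and fairness and markedly narrows the gap to centralized fine-tuning, especially for low-resource languages, at the cost of more optimization steps. Monolingual fine-tuning suits single-language specialization, whereas federated training is better suited to building a balanced multilingual model.

Contributions:

Extends FederatedScope-LLM with explicit support for multilingual federated fine-tuning, including flexible prompt integration and a language-aware data processing pipeline.
Proposes LDES-FL, which lets clients pause and resume training dynamically based on local validation loss, reducing unnecessary computation and improving efficiency.
Systematically studies how client language composition (monolingual to multilingual) affects performance, fairness, and convergence in multilingual FL, addressing a gap in the literature.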

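Method

!!! info

The paper extends FederatedScope-LLM into an experimental platform for multilingual FL. It uses parameter-efficient fine-tuning (PEFT), specifically LoRA (Low-Rank Adaptation), to reduce communication and computation overhead. The experiments cover scenarios with different language compositions (for example, 100% mono and 85% mono), using controlled comparisons to simulate data distributions ranging from non-IID to approximately IID. LDES-FL monitors loss on a local validation set to trigger dynamic early stopping. FedAvg is the aggregation strategy, and the comparisons hold each client's data volume constant.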
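The two mechanisms named above can be sketched as follows. `fedavg` is the standard sample-weighted average of client parameters; `LocalEarlyStopper` is only a guess at how a pause-and-resume rule might look, since this report does not describe the actual LDES-FL criteria. The `patience` and `resume_margin` parameters, the class name, and the use of scalar parameters in place of LoRA tensors are all assumptions.

```python
def fedavg(client_params, client_sizes):
    """Federated averaging: weight each client's parameters by its sample count."""
    total = sum(client_sizes)
    keys = client_params[0].keys()
    return {
        k: sum(p[k] * n for p, n in zip(client_params, client_sizes)) / total
        for k in keys
    }

class LocalEarlyStopper:
    """Hypothetical client-side rule: pause after `patience` rounds without
    improvement; resume once validation loss drifts above best + resume_margin."""

    def __init__(self, patience=2, resume_margin=0.05):
        self.patience = patience
        self.resume_margin = resume_margin
        self.best = float("inf")
        self.bad_rounds = 0
        self.paused = False

    def should_train(self, val_loss):
        """Return True if the client should run local training this round."""
        if self.paused:
            if val_loss > self.best + self.resume_margin:
                self.paused = False
                self.bad_rounds = 0
            return not self.paused
        if val_loss < self.best:
            self.best = val_loss
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
            if self.bad_rounds >= self.patience:
                self.paused = True
        return not self.paused

global_params = fedavg([{"w": 1.0}, {"w": 3.0}], client_sizes=[1, 3])
stopper = LocalEarlyStopper()
decisions = [stopper.should_train(x) for x in [1.0, 0.9, 0.95, 0.92, 0.93, 1.2]]
```

In this toy trace the client trains for three rounds, pauses after two rounds without improvement, stays paused while its loss remains near the best value, and resumes once the loss rises past the margin.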

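Key results:

Monolingual local fine-tuning is most effective for single-language specialization, whereas federated training is better suited to learning a single balanced multilingual model.
In FL, increasing within-client multilinguality (that is, reducing heterogeneity across clients) yields stronger and fairer global models.
Higher client multilinguality markedly narrows the performance gap between FL and centralized multilingual fine-tuning, with the largest gains for low-resource languages.
These gains come at the cost of more optimization steps, which raises training cost.

Tech stack: FederatedScope-LLM (framework), Low-Rank Adaptation (LoRA), Federated Averaging (FedAvg), Local Dynamic Early Stopping (LDES-FL), salamandra-2b-instruct (model)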

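Strengths

Well targeted: addresses a gap in research on federated learning for multilingual LLMs and focuses on the key issue of heterogeneous language distributions.
Practical: LDES-FL can improve training efficiency and sustainability for resource-constrained clients.
Rigorous experimental design: controlled comparisons systematically analyze how language composition affects performance and fairness, giving actionable conclusions.
Open-source contribution: the extended codebase is released, making it easier for the community to reproduce and extend the work.

Limitations

The experiments assume that all clients have datasets of the same size, which differs from the highly imbalanced data found in practice.
The work focuses on the instruction-tuning stage and does not explore multilingual federated learning during pretraining.
Although LDES-FL is proposed, convergence guarantees and theoretical analysis under extremely heterogeneous data may be insufficient.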

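Relevance to research direction:

The paper is highly relevant. It deals directly with innovation in the technical principles of LLMs, specifically the combination of federated learning with LLMs, which is frontier work on deep learning principles. Its focus on multilingual NLP makes it an application of deep learning to a specific domain. LDES-FL and the analysis of client language composition are algorithm-level contributions, matching the user's interest in "innovation in the technical principles of large models and deep learning".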

13. A^3: Towards Advertising Aesthetic Assessment

Authors: Kaiyuan Ji, Yixuan Gao, Lu Sun, Yushuo Zheng, Zijian Chen, Jianbo Zhang, Xiangyang Zhu, Yuan Tian, Zicheng Zhang, Guangtao Zhai Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24037v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

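Scoring rationale: The paper develops A^3-Align, a multimodal large language model for advertising aesthetic assessment, so it is highly relevant to "Large Language Models" (10 points). The model is trained with instruction tuning to align with the A^3-Law paradigm, so it is highly relevant to "Instruction Tuning" or "Alignment" (10 points). The dataset contains Chain-of-Thought (CoT) rationales and training uses CoT-guided learning, so it is highly relevant to "Chain of Thought" (10 points). The paper does not address the specific techniques or applications behind the other keywords, which therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the A^3 framework, comprising a theoretical paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench), to address the lack of scalability, standardization, and interpretability in aesthetic assessment of advertising images. Experiments show that A^3-Align outperforms existing models in aesthetic alignment and task generalization.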

Abstract

Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present A^3 (Advertising Aesthetic Assessment), a comprehensive framework encompassing four components: a paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench). Central to A^3 is a theory-driven paradigm, A^3-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A^3-Law, we construct A^3-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A^3-Align, trained under A^3-Law with CoT-guided learning on A^3-Dataset. Extensive experiments on A^3-Bench demonstrate that A^3-Align achieves superior alignment with A^3-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment. Dataset, code, and models can be found at: https://github.com/euleryuan/A3-Align.

Keywords: Advertising Aesthetic Assessment, Multimodal Large Language Model, Chain-of-Thought, Instruction Tuning, A^3-Law, A^3-Align, A^3-Dataset, A^3-Bench

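In-depth analysis:

A^3: Towards Advertising Aesthetic Assessment

Summary:

To address the lack of scalability and standardized criteria in existing advertising aesthetic assessment, the paper proposes the A^3 framework. It has four parts: the theory-driven A^3-Law paradigm, the A^3-Dataset with 120K instruction-response pairs, the A^3-Align model built on a multimodal large language model, and the A^3-Bench benchmark. A^3-Law divides assessment into three stages: Perceptual Attention, Formal Interest, and Desire Impact. Through supervised fine-tuning and reinforcement learning on A^3-Dataset, A^3-Align is aligned with the A^3-Law rules. Experiments show that A^3-Align outperforms existing models on A^3-Bench and transfers to practical tasks such as quality advertisement selection and prescriptive advertisement critique.

Contributions:

Proposes A^3-Law, a theory-driven hierarchical assessment paradigm that decomposes advertising aesthetics into Perceptual Attention, Formal Interest, and Desire Impact.
Builds A^3-Dataset, a large-scale dataset of 30K images and 120K instruction-response pairs, richly annotated with Chain-of-Thought rationales.
Develops A^3-Align, a multimodal large language model aligned with the A^3-Law rules through supervised fine-tuning and reinforcement learning (GRPO).
Establishes A^3-Bench, a comprehensive benchmark for evaluating multimodal large language models on advertising aesthetics.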

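Method

!!! info

The paper first proposes the A^3-Law paradigm, which layers the assessment according to signal detection theory, Gestalt psychology, and semiotics. It then builds A^3-Dataset in two stages, one human-centered and one model-augmented. Training follows a two-stage pipeline: in the first stage, supervised fine-tuning (SFT) teaches the model the A^3-Law rules and Chain-of-Thought reasoning; in the second stage, reinforcement learning (GRPO) optimizes the model against multi-signal rewards, producing A^3-Align.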
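The second training stage relies on GRPO, whose core idea is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization plus a hypothetical weighted sum for combining reward signals. The signal names and weights are invented for illustration, the report does not say how the paper combines its rewards, and GRPO implementations differ on whether they use the population or the sample standard deviation.

```python
import statistics

def combine_rewards(signals, weights):
    """Hypothetical multi-signal reward: weighted sum of named reward signals."""
    return sum(weights[name] * value for name, value in signals.items())

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an implementation choice
    return [(r - mean) / (std + eps) for r in rewards]

# Three sampled responses to the same prompt, each scored on two assumed signals.
weights = {"format": 0.2, "accuracy": 0.8}
group = [
    {"format": 1.0, "accuracy": 0.0},
    {"format": 1.0, "accuracy": 0.5},
    {"format": 1.0, "accuracy": 1.0},
]
rewards = [combine_rewards(s, weights) for s in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are centered within the group, no separate value network is needed; responses are pushed up or down only relative to their siblings.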

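Key results:

On A^3-Bench, A^3-Align shows strong agreement with the A^3-Law rules and outperforms existing mainstream multimodal large language models.
A^3-Law turns abstract aesthetic theory into an executable hierarchy that supports both training and evaluation.
A^3-Align generalizes well to practical tasks such as quality advertisement selection and prescriptive advertisement critique.

Tech stack: multimodal large language models, Chain-of-Thought, supervised fine-tuning, reinforcement learning (GRPO), signal detection theory, Gestalt psychology, semiotics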

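Strengths

Theory-driven: turns abstract theories from aesthetics and cognitive psychology (such as the AIDA model) into concrete, executable assessment rules.
Interpretable: uses Chain-of-Thought reasoning to give step-by-step diagnostic feedback, addressing the black-box behavior and unstable reasoning of conventional models.
Systematic: covers the full loop from theoretical paradigm and dataset construction to model training and benchmarking.
Practical value: produces concrete improvement suggestions in addition to scores, directly supporting ad optimization.

Limitations

Even with explicit rules, aesthetic assessment remains partly subjective, and full standardization is still difficult.
The study focuses on advertising images; generalization to other image types (such as art photography) remains to be verified.
Building a large, high-quality dataset and training large models requires substantial compute and human effort.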

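Relevance to research direction:

The paper is highly relevant. It is an application of large models to a specific vertical (advertising and commerce) and proposes a training paradigm for multimodal large language models that draws on cognitive science theory. It shows how deep learning can be applied to complex assessment tasks (aesthetics, persuasiveness), matching the user's interest in large-model applications and in innovation in technical principles.

📋 All Papers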

1. ✅ AI-Supervisor: Autonomous AI Research Supervision via a Persistent Research World Model

Authors: Yunbo Long Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24402v1

Score: 75.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

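Scoring rationale: The paper proposes AutoProf, a multi-agent system for AI research supervision built around a continuously evolving Research World Model. The framework explicitly supports mainstream large language models, so it is highly relevant to "Large Language Models" (10 points). Its core is a multi-agent system involving autonomous agents, coordination, and a consensus mechanism, so it is highly relevant to "LLM Agents", "Multi-agent Systems", and "Self-Correction/Self-Improvement" (10 points each). One of its main contributions is the Research World Model, which is highly relevant to "World Models" (10 points). The application area is AI research supervision, which falls under "AI for Science" (10 points). The framework involves structured analysis, self-correcting loops, and iterative improvement, reflecting aspects of "Chain of Thought" and "System 2 Thinking" (5 points each). Agents may need to call tools or APIs to carry out research tasks, which gives some relevance to "Tool Use" (5 points). The paper does not cover the technical details behind the other keywords, such as MoE, model compression, training methods, inference acceleration, or alignment techniques, which therefore score 0.

!!! tip deepseek-chat TL;DR

Existing automated research systems lack persistent understanding and self-correction. The paper proposes AutoProf, a multi-agent system driven by a continuously evolving Research World Model that provides end-to-end autonomous AI research supervision, from literature review to paper writing, and improves the research process through structured gap discovery and self-correcting loops.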

Abstract

Existing automated research systems operate as stateless, linear pipelines, generating outputs without maintaining a persistent understanding of the research landscape. They process papers sequentially, propose ideas without structured gap analysis, and lack mechanisms for agents to verify or refine each other’s findings. We present AutoProf (Autonomous Professor), a multi-agent orchestration framework where specialized agents provide end-to-end AI research supervision driven by human interests, from literature review through gap discovery, method development, evaluation, and paper writing, via autonomous exploration and self-correcting updates. Unlike sequential pipelines, AutoProf maintains a continuously evolving Research World Model implemented as a Knowledge Graph, capturing methods, benchmarks, limitations, and unexplored gaps as shared memory across agents. The framework introduces three contributions: first, structured gap discovery that decomposes methods into modules, evaluates them across benchmarks, and identifies module-level gaps; second, self-correcting discovery loops that analyze why modules succeed or fail, detect benchmark biases, and assess evaluation adequacy; third, self-improving development loops using cross-domain mechanism search to iteratively address failing components. All agents operate under a consensus mechanism where findings are validated before being committed to the shared model. The framework is model-agnostic, supports mainstream large language models, and scales elastically with token budget from lightweight exploration to full-scale investigation.

Keywords: Autonomous AI Research, Multi-agent System, Research World Model, Knowledge Graph, Self-correcting Discovery, Gap Analysis, AI for Science, Agent Coordination
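The abstract's combination of a shared knowledge graph with a consensus gate can be illustrated with a toy data structure. This is not AutoProf's implementation: the class name, the triple representation, and the simple quorum rule are assumptions chosen only to show how findings might be validated before they enter shared memory.

```python
class ResearchWorldModel:
    """Toy shared memory: (subject, relation, object) triples are committed
    only after `quorum` distinct agents have endorsed them."""

    def __init__(self, quorum=2):
        self.quorum = quorum
        self.edges = set()   # committed triples, visible to all agents
        self.pending = {}    # triple -> set of agent names that endorsed it

    def propose(self, triple, agent):
        """Record an endorsement; return True once the triple is committed."""
        if triple in self.edges:
            return True
        self.pending.setdefault(triple, set()).add(agent)
        if len(self.pending[triple]) >= self.quorum:
            self.edges.add(triple)
            del self.pending[triple]
            return True
        return False

    def gaps(self):
        """Committed objects of 'has_gap' relations, i.e. known open problems."""
        return sorted(o for (_, rel, o) in self.edges if rel == "has_gap")

model = ResearchWorldModel(quorum=2)
finding = ("method_x", "has_gap", "fails on long inputs")
first = model.propose(finding, agent="reviewer")
repeat = model.propose(finding, agent="reviewer")   # same agent, no new vote
second = model.propose(finding, agent="evaluator")
```

Storing endorsements as a set means a single agent cannot push a finding through by repeating itself, which is the minimal property a consensus gate needs.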

2. ✅ Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Score: 66.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	15.0/10	15.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

!!! tip deepseek-chat TL;DR

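The paper studies retrieval-augmented generation (RAG) for AI policy analysis and finds that better retrieval quality does not always improve end-to-end question-answering performance and can sometimes lead to more confident hallucinations. It offers practical insights for designing QA systems over dynamic regulatory corpora.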

Abstract

Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.

Keywords: Retrieval-augmented generation, RAG, AI policy analysis, Direct Preference Optimization, DPO, hallucinations, domain adaptation, question answering
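For readers unfamiliar with the retriever mentioned in the abstract, ColBERT ranks documents by late interaction: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The sketch below shows only that scoring rule on hand-made vectors; the embeddings and function names are illustrative, and it does not reproduce the paper's fine-tuned system.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: sum over query tokens of the maximum
    similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def rank(query_vecs, corpus):
    """Return document ids sorted by descending MaxSim score."""
    scored = {doc_id: maxsim_score(query_vecs, vecs) for doc_id, vecs in corpus.items()}
    return sorted(scored, key=scored.get, reverse=True)

query = [[1.0, 0.0], [0.0, 1.0]]
corpus = {
    "doc_a": [[1.0, 0.0], [0.5, 0.5]],
    "doc_b": [[0.0, 0.2], [0.1, 0.1]],
}
score_a = maxsim_score(query, corpus["doc_a"])
order = rank(query, corpus)
```

A ranker of this kind always returns a top result, even when no document is truly relevant, which is consistent with the abstract's observation that stronger retrieval can feed more confident hallucinations when the answer is absent from the corpus.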

3. ✅ LensWalk: Agentic Video Understanding by Planning How You See in Videos

Authors: Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24558v1

Score: 51.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	8.0/10	8.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

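The paper addresses the disconnect between reasoning and perception in video understanding. Its LensWalk agentic framework lets a large language model actively control its visual observation, delivering plug-and-play gains without fine-tuning and improving accuracy by more than 5% on long-video benchmarks.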

Abstract

The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent’s evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.

Keywords: Agentic Framework, Large Language Model, Video Understanding, Dynamic Observation, Chain of Thought, Vision-Language Models, Reason-Plan-Observe Loop, Plug-and-play Performance
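The abstract says the agent chooses a temporal scope and a sampling density at each step. A minimal way to picture that is a function that turns such a request into frame timestamps, followed by a coarse-to-fine plan. The function name, the use of frames per second as the density unit, and the example plan are assumptions; the real framework's tools and planner are not described in this report.

```python
def sample_timestamps(start_s, end_s, fps):
    """Evenly spaced frame timestamps (seconds) in [start_s, end_s) at `fps`."""
    if end_s <= start_s or fps <= 0:
        return []
    step = 1.0 / fps
    count = int((end_s - start_s) * fps)
    return [start_s + i * step for i in range(count)]

# A hypothetical two-step plan: scan the whole clip sparsely, then zoom in.
plan = [
    {"goal": "broad scan for cues", "start_s": 0, "end_s": 10, "fps": 0.5},
    {"goal": "inspect key segment", "start_s": 120, "end_s": 124, "fps": 2},
]
observations = [
    (step["goal"], sample_timestamps(step["start_s"], step["end_s"], step["fps"]))
    for step in plan
]
```

Separating scope from density lets the same frame budget cover either a long span coarsely or a short span finely, which is the trade-off the reason-plan-observe loop is meant to exploit.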

4. ✅ Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Score: 50.0 / 26.6 ✅

Score Details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
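
The total in the table above is consistent with a simple weighted sum, and the pass mark follows the formula given in the report configuration (5.0 + 0.8 × total weight). A minimal sketch, assuming the score is the sum of weight × relevance over all keywords (the report does not state the aggregation rule explicitly, and the function names are illustrative):

```python
def paper_score(rows):
    """Sum weight * relevance over (weight, relevance) pairs; relevance is on a 0-10 scale."""
    return sum(weight * relevance for weight, relevance in rows)


def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


# Paper 4: five keywords rated 10.0, the remaining 22 rated 0.0, all weights 1.0.
rows = [(1.0, 10.0)] * 5 + [(1.0, 0.0)] * 22
score = paper_score(rows)
threshold = pass_threshold(total_weight=27.0)
passed = score >= threshold
```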

!!! tip deepseek-chat TL;DR

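This study finds that self-distillation post-training for LLMs, although it shortens reasoning traces, suppresses the model's expression of uncertainty during reasoning, which degrades performance on mathematical reasoning tasks (especially out-of-distribution problems). It highlights the importance of preserving appropriate uncertainty expression for robust reasoning.

Abstract Translation

Self-distillation has become an effective post-training paradigm for large language models, often improving performance while shortening reasoning traces. In mathematical reasoning, however, we find that it can shorten responses while degrading performance. We attribute this degradation to the suppression of epistemic verbalization, that is, a weakening of the model's expression of uncertainty during reasoning. Through controlled experiments that vary the richness of the conditioning context and the task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression. This enables rapid in-domain optimization under limited task coverage but harms out-of-distribution (OOD) performance, because on unseen problems the model performs better when it expresses uncertainty and adjusts accordingly. Experiments on Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct show performance drops of up to 40%. Our findings indicate that exposing an appropriate level of uncertainty is crucial for robust reasoning, and that optimizing reasoning behavior should go beyond merely reinforcing correct answer traces.

Abstract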

Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model’s expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

5. ✅ CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Score: 45.0 / 26.6 ✅

Score Details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	5.0/10	5.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

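This paper addresses the scarcity of continuous, high-quality human demonstration videos that limits progress on computer-use agents. It releases CUA-Suite, a large-scale dataset (about 55 hours of expert video with dense annotations) to support agent evaluation and training, and finds that current foundation action models fail on roughly 60% of tasks in professional desktop applications.

Abstract Translation

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, but progress toward general-purpose agents is limited by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, rather than sparse screenshots, is the key missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equivalent to less than 20 hours of video. To break this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is the VideoCUA dataset, which provides about 10,000 human-demonstrated tasks across 87 diverse applications, with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling about 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly converted into the formats required by existing agent frameworks. CUA-Suite also provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating the grounding and planning capabilities of CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and more than 3.6 million UI element annotations. Preliminary evaluation shows that current foundation action models struggle substantially with professional desktop applications (about 60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions, including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.

Abstract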

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite’s rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.

Keywords: Computer-use agents, Video demonstrations, Desktop workflows, Foundation action models, UI element annotations, Visual world models, Continuous spatial control, Multimodal corpus
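
The frame and duration figures in the abstract can be cross-checked with simple arithmetic. The sketch uses the stated 30 fps for VideoCUA and, as an assumption not confirmed by the abstract, converts ScaleCUA's screenshots to video time at the same 30 fps.

```python
FPS = 30  # stated frame rate of the VideoCUA screen recordings

# VideoCUA: about 55 hours of video at 30 fps.
videocua_hours = 55
videocua_frames = videocua_hours * 3600 * FPS  # 5,940,000, i.e. roughly 6 million

# ScaleCUA: 2 million screenshots, treated as 30 fps frames (assumption).
scalecua_screenshots = 2_000_000
scalecua_hours = scalecua_screenshots / FPS / 3600  # about 18.5 hours

print(videocua_frames, round(scalecua_hours, 1))
```

Under that assumption both claims line up: 55 hours corresponds to just under 6 million frames, and 2 million screenshots to less than 20 hours.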

6. ✅ Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing

Authors: Michael Somma, Markus Großpointner, Paul Zabalegui, Eppu Heilimo, Branka Stojanović Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24221v1

Score: 35.0 / 26.6 ✅

Score Details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

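This paper studies how to use large language models to build an environment-grounded multi-agent workflow that automates penetration testing of robotic systems, achieving a 100% success rate (n=5 runs) in a ROS/ROS2 Capture-the-Flag scenario.

Abstract Translation

The growing complexity and interconnectivity of digital infrastructures make scalable and reliable security assessment methods essential. Robotic systems are a particularly important class of operational technology: modern robots are highly networked cyber-physical systems deployed in domains such as industrial automation, logistics, and autonomous services. This paper explores the use of large language models for automated penetration testing in robotic environments. We propose an environment-grounded multi-agent architecture tailored to robotic systems. During execution, the approach dynamically builds a shared graph-based memory that captures the observable system state, including network topology, communication channels, vulnerabilities, and attempted exploits. This enables structured automation while maintaining traceability and effective context management throughout the testing process. Evaluated over multiple iterations in a specialized robotics Capture-the-Flag scenario (ROS/ROS2), the system showed high reliability, completing the challenge in 100% of test runs (n=5). This performance significantly exceeds literature benchmarks while maintaining the traceability and human oversight required by frameworks such as the EU AI Act.

Abstract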

The increasing complexity and interconnectivity of digital infrastructures make scalable and reliable security assessment methods essential. Robotic systems represent a particularly important class of operational technology, as modern robots are highly networked cyber-physical systems deployed in domains such as industrial automation, logistics, and autonomous services. This paper explores the use of large language models for automated penetration testing in robotic environments. We propose an environment-grounded multi-agent architecture tailored to Robotics-based systems. The approach dynamically constructs a shared graph-based memory during execution that captures the observable system state, including network topology, communication channels, vulnerabilities, and attempted exploits. This enables structured automation while maintaining traceability and effective context management throughout the testing process. Evaluated across multiple iterations within a specialized robotics Capture-the-Flag scenario (ROS/ROS2), the system demonstrated high reliability, successfully completing the challenge in 100% of test runs (n=5). This performance significantly exceeds literature benchmarks while maintaining the traceability and human oversight required by frameworks like the EU AI Act.

Keywords: Large Language Models, Multi-agent Systems, Autonomous Penetration Testing, Robotic Environments, Environment-grounded Architecture, Graph-based Memory, ROS/ROS2, Capture-the-Flag
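
The shared graph-based memory can be pictured as typed nodes connected by labelled edges. The sketch below is a toy data structure only; the class name, node kinds, and relation names are all hypothetical, since the abstract does not specify the paper's schema.

```python
class GraphMemory:
    """Toy shared memory: typed nodes plus labelled directed edges (hypothetical schema)."""

    def __init__(self):
        self.nodes = {}   # node_id -> {"kind": ..., plus free-form attributes}
        self.edges = []   # (source_id, relation, target_id) triples

    def add_node(self, node_id, kind, **attrs):
        self.nodes[node_id] = {"kind": kind, **attrs}

    def add_edge(self, source_id, relation, target_id):
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((source_id, relation, target_id))

    def neighbors(self, node_id, relation):
        """Targets reachable from node_id over edges with the given relation."""
        return [dst for src, rel, dst in self.edges
                if src == node_id and rel == relation]

    def untested_findings(self):
        """Findings that no recorded attempt targets yet."""
        tested = {dst for _, rel, dst in self.edges if rel == "targets"}
        return sorted(node_id for node_id, attrs in self.nodes.items()
                      if attrs["kind"] == "finding" and node_id not in tested)


memory = GraphMemory()
memory.add_node("host-a", "host")
memory.add_node("channel-1", "channel")
memory.add_node("finding-1", "finding", note="placeholder description")
memory.add_node("finding-2", "finding", note="placeholder description")
memory.add_node("attempt-1", "attempt", outcome="failed")
memory.add_edge("host-a", "exposes", "channel-1")
memory.add_edge("channel-1", "has_finding", "finding-1")
memory.add_edge("channel-1", "has_finding", "finding-2")
memory.add_edge("attempt-1", "targets", "finding-1")
```

Recording attempts as nodes keeps the history queryable, which is one plausible way to support the traceability the abstract emphasizes.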

7. ✅ HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

Authors: Ken Ding Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23871v1

Score: 35.0 / 26.6 ✅

Score Details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

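This paper targets the vanishing gradient on “cliff” prompts, problems the model cannot solve at all, when LLMs are trained with reinforcement learning for mathematical reasoning. It proposes Hybrid Distillation Policy Optimization (HDPO), which strengthens the learning signal through privileged self-distillation; experiments show improved coverage (pass@4 and pass@8) on math problems while greedy accuracy is maintained.

Abstract Translation

Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on “cliff” prompts, problems the model cannot solve at all, the RL gradient vanishes entirely, so these failure modes receive no learning signal. We propose Hybrid Distillation Policy Optimization (HDPO), which augments standard RL training with privileged self-distillation targeting cliff prompts. At each training step, HDPO identifies prompts on which all sampled rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher's token-level distribution into the student. Because the teacher and student share the same weights and differ only in their input, the realizability gap is provably bounded, unlike in cross-model distillation. We prove that privileged generation with R=1 filtering recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintaining greedy accuracy, with the distillation weight λ providing direct control over the exploration-exploitation tradeoff.

Abstract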

Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - “cliff” prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher’s token-level distribution into the student. Because teacher and student share the same weights - differing only in their input - the realizability gap is provably bounded, unlike cross-model distillation. We prove that R=1 filtered privileged generation recovers the optimal KL-regularized RL policy in the hard-threshold limit. Experiments on OpenMathInstruct-2 with Qwen2.5-Math-1.5B-Instruct show that HDPO consistently improves coverage metrics (pass@4 by +0.8-1.1%, pass@8 by +0.4-1.7%) while maintaining greedy accuracy, with the distillation weight lambda providing direct control over the exploration-exploitation tradeoff.
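
To see why cliff prompts contribute no gradient, consider a group-relative baseline, where each rollout's advantage is its reward minus the group mean. The abstract does not name the underlying RL algorithm, so this baseline is an assumption, and the function names are illustrative. If every rollout for a prompt receives reward 0, every advantage is 0 and the policy-gradient term vanishes; these are the prompts HDPO routes to privileged self-distillation.

```python
def group_advantages(rewards):
    """Reward minus group mean (assumed baseline, not confirmed by the abstract)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def find_cliff_prompts(rollout_rewards):
    """Return ids of prompts whose rollouts all failed (binary 0/1 rewards)."""
    return [pid for pid, rewards in rollout_rewards.items() if not any(rewards)]


rollout_rewards = {
    "prompt-1": [0, 0, 0, 0],   # cliff prompt: every rollout fails
    "prompt-2": [0, 1, 0, 1],   # partially solved: non-zero advantages
}
cliff = find_cliff_prompts(rollout_rewards)
cliff_adv = group_advantages(rollout_rewards["prompt-1"])
mixed_adv = group_advantages(rollout_rewards["prompt-2"])
```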

8. ✅ Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

Score: 35.0 / 26.6 ✅

Score Details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

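This paper studies gender bias in large language models (LLMs). Using a unified framework, it analyzes how alignment (here, supervised fine-tuning) affects bias expressed in outputs and bias encoded in internal representations. It finds that alignment reduces expressed bias but does not eliminate encoded bias, and that debiasing effects measured on benchmarks do not necessarily generalize to realistic settings.

Abstract Translation

During training, large language models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, usually evaluated on structured benchmarks. This raises two concerns: output-level evaluation cannot reveal whether alignment changes the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework that jointly analyzes intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling a direct comparison between the gender-related information encoded in internal representations and the bias expressed in generated outputs. In contrast to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias under the unified measurement protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. The results suggest that although fine-tuning does reduce expressed bias, measurable gender-related associations remain in internal representations and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, for example to story generation.

Abstract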

During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model’s underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while the latter indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.

Keywords: Large Language Models, Gender Bias, Alignment, Supervised Fine-tuning, Internal Representations, Expressed Bias, Debiasing, Benchmark Evaluation

9. ✅ MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning

Authors: Andrea Manzoni Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24044v1

Score: 33.0 / 26.6 ✅

Score Details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

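This paper addresses the inefficiency of standard LoRA fine-tuning for MoE models. It proposes MoE-Sieve, a routing-guided method that applies LoRA adapters only to the most-routed experts in each layer, keeping performance competitive while reducing trainable parameters by 70-73%, adapter checkpoint size by 71-73%, and wall-clock training time by up to 50%.

Abstract Translation

Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert. However, our profiling shows that per-layer expert routing is highly skewed: in each layer a small subset of experts handles most tokens, while many others are rarely activated (“cold” experts). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is simple: profile routing counts on a small calibration set, select the top-k most-routed experts per layer, and apply LoRA only to those experts. Across two architecturally distinct MoE models and three diverse tasks, tuning only the top 25% most-routed experts per layer remains competitive with full LoRA, with mean differences within ±1 percentage point across all conditions. This reduces LoRA trainable parameters by 70-73%, adapter checkpoint size by 71-73%, and wall-clock training time by up to 50%. We also observe a non-monotonic relationship between expert count and seed-to-seed variance, consistent with the hypothesis that adapting cold experts can introduce gradient noise without improving accuracy. Further ablations show that random expert selection at a matched parameter budget is about 2.5 percentage points worse, indicating that the routing signal matters, while greedy per-layer budget optimization does not improve over uniform top-k selection.

Abstract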

Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert, yet our profiling shows that per-layer expert routing is highly skewed: a small subset of experts handles most tokens in each layer, while many others are rarely activated (“cold”). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is simple: profile routing counts on a small calibration set, select the top-k most-routed experts per layer, and apply LoRA only to those experts. Across two architecturally distinct MoE models and three diverse tasks, tuning only the top 25% routed experts per layer remains competitive with full LoRA, with mean differences within +/-1 percentage point across all conditions. This reduces LoRA trainable parameters by 70-73%, adapter checkpoint size by 71-73%, and wall-clock training time by up to 50%. We also observe a non-monotonic relationship between expert count and seed-to-seed variance, consistent with the hypothesis that adapting cold experts can introduce gradient noise without improving accuracy. Further ablations show that random expert selection at matched budget is about 2.5 percentage points worse, indicating that the routing signal matters, while greedy per-layer budget optimization does not improve over uniform top-k.

Keywords: Mixture-of-Experts, MoE, LoRA, Parameter-efficient Fine-tuning, Expert Routing, Fine-tuning Efficiency, Sparse Models, Adapter Compression
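
The selection step described in the abstract (profile routing counts, then keep the top-k most-routed experts per layer) can be sketched as follows. This is a minimal illustration under assumptions: `routing_counts` is presumed to come from a profiling pass over a calibration set, the function name is hypothetical, and attaching LoRA adapters to the selected experts is library-specific and omitted.

```python
import math


def select_hot_experts(routing_counts, fraction=0.25):
    """Pick the most-routed experts in each layer.

    routing_counts: one list per layer, holding the number of tokens routed
    to each expert (assumed to be measured on a small calibration set).
    Returns, per layer, the sorted indices of the selected experts.
    """
    selected = []
    for counts in routing_counts:
        k = max(1, math.ceil(fraction * len(counts)))
        ranked = sorted(range(len(counts)), key=lambda e: counts[e], reverse=True)
        selected.append(sorted(ranked[:k]))
    return selected


# Two layers with 8 experts each and a skewed routing distribution (made-up counts).
routing_counts = [
    [900, 20, 15, 700, 5, 30, 10, 40],
    [10, 800, 12, 9, 650, 7, 3, 20],
]
hot = select_hot_experts(routing_counts, fraction=0.25)
```

With 8 experts per layer and `fraction=0.25`, two experts per layer would receive adapters and the remaining six would stay frozen.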

10. ✅ The Diminishing Returns of Early-Exit Decoding in Modern LLMs

Authors: Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li, Zhaozhuo Xu, Hao Wang Journal/Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23701v1

Score: 31.0 / 26.6 ✅

Score Details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	8.0/10	8.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

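This paper finds that as the pretraining recipes and architectures of modern LLMs improve, layer redundancy decreases, so early-exit decoding yields diminishing returns in reducing inference latency and cost; dense Transformers show greater early-exit potential than MoE and state space models.

Abstract Translation

In large language model (LLM) inference, early-exit means stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, which may limit early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric that quantifies a model's intrinsic suitability for early-exit and propose a benchmark for researchers to explore potential early-exit benefits across models and workloads. Our results show a diminishing trend in early-exit effectiveness across newer model generations. We further find that dense Transformers generally offer greater early-exit potential than Mixture-of-Experts and state space models. In addition, larger models, particularly those with more than 20 billion parameters, and base pretrained models without specialized tuning tend to exhibit higher early-exit potential.

Abstract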

In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, potentially limiting early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric to quantify a model’s intrinsic suitability for early-exit and propose a benchmark for researchers to explore the potential early-exit benefits on different models and workloads. Our results show a diminishing trend in early-exit effectiveness across newer model generations. We further find that dense transformers generally offer greater early-exit potential than Mixture-of-Experts and State Space Models. In addition, larger models, particularly those with more than 20 billion parameters, and base pretrained models without specialized tuning tend to exhibit higher early-exit potential.

Keywords: Large Language Models, Early-exit Decoding, Inference Acceleration, Mixture-of-Experts, Transformer Architecture, Model Redundancy, Layer-wise Prediction, Benchmark Evaluation
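
The early-exit rule defined in the abstract (stop at an intermediate layer once the prediction is sufficiently confident) can be sketched with a confidence threshold. Two assumptions are made here: confidence is measured as the maximum softmax probability, and per-layer logits are available from some intermediate prediction head. Neither detail is given in the abstract, and the paper's own suitability metric is not reproduced.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit(per_layer_logits, threshold=0.9):
    """Return (layer_index, token_id) for the first sufficiently confident layer."""
    if not per_layer_logits:
        raise ValueError("need logits for at least one layer")
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        token = max(range(len(probs)), key=probs.__getitem__)
        if probs[token] >= threshold:
            return layer, token
    # No layer was confident enough: fall back to the final layer's prediction.
    return layer, token


# Toy logits over a 3-token vocabulary for a 4-layer model (made-up values).
toy_logits = [
    [0.1, 0.2, 0.0],
    [1.0, 3.0, 0.5],
    [0.0, 6.0, 0.0],
    [0.0, 8.0, 0.0],
]
exit_layer, exit_token = early_exit(toy_logits, threshold=0.9)
```

A model with little layer redundancy would rarely cross the threshold before the final layers, which is the effect the paper reports for newer model generations.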

11. ✅ Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

Authors: Han Sun, Qin Li, Peixin Wang, Min Zhang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24058v1

Score: 30.0 / 26.6 ✅

Score Details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

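This paper studies object hallucination in large vision-language models (LVLMs). It identifies imbalanced attention allocation as a key factor behind hallucination and proposes Attention Imbalance Rectification (AIR), a lightweight method that reduces hallucination rates by up to 35.1% and improves general capability by up to 15.9% across several benchmarks.

Abstract Translation

Object hallucination in large vision-language models (LVLMs) severely undermines their reliability in real-world applications and is a critical barrier to deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we find that imbalanced attention allocation, both across modalities (vision and language) and within modalities (among individual tokens), shows a strong causal correlation with the occurrence of object hallucination. Building on this insight, we introduce a new concept, attention imbalance, which not only quantifies the degree of attention disparity but also visually reveals the underlying patterns that drive object hallucination (for example, over-attention to irrelevant language tokens or under-attention to discriminative visual features). To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention that reallocates attention weights and adjusts attention distributions to correct modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) against seven baselines show that AIR consistently reduces object hallucination rates, by up to 35.1% relative to the baselines, while improving the general capability of LVLMs by up to 15.9% across diverse vision-language tasks.

Abstract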

Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we identify that the imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving up to 15.9% of LVLMs’ general capability across diverse vision-language tasks.
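
As a toy picture of what reallocating attention between modalities could look like, the sketch below rescales one attention distribution so that visual tokens receive a chosen share of the total mass. This is not the paper's AIR procedure, which the abstract does not specify; the function names and the `target_share` parameter are hypothetical.

```python
def visual_share(attention, is_visual):
    """Fraction of attention mass that falls on visual tokens."""
    return sum(a for a, v in zip(attention, is_visual) if v)


def rebalance(attention, is_visual, target_share):
    """Rescale weights so visual tokens hold `target_share` of the mass.

    Relative weights inside each modality are preserved; an input that sums
    to 1 still sums to 1 afterwards.
    """
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")
    vis = visual_share(attention, is_visual)
    txt = sum(attention) - vis
    if vis <= 0.0 or txt <= 0.0:
        return list(attention)  # only one modality present: nothing to rebalance
    vis_scale = target_share / vis
    txt_scale = (1.0 - target_share) / txt
    return [a * (vis_scale if v else txt_scale)
            for a, v in zip(attention, is_visual)]


# Made-up attention row: two visual tokens receive only 10% of the mass.
attention = [0.05, 0.05, 0.4, 0.5]
is_visual = [True, True, False, False]
balanced = rebalance(attention, is_visual, target_share=0.4)
```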

12. ✅ Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition

Authors: Aleix Sant, Jordi Luque, Carlos Escolano Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24242v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
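
The pass/fail verdict above follows from the report's configuration (pass mark = 5.0 + 0.8 × total weight, with a per-keyword maximum of 15). The sketch below reproduces that arithmetic for a paper with three keywords rated 10.0/10. It assumes that each keyword's score is weight × relevance, capped at the per-keyword maximum; the function names are hypothetical and not taken from the report generator.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report configuration: 5.0 + 0.8 x total weight."""
    return base + factor * total_weight


def paper_score(rows, max_per_keyword: float = 15.0) -> float:
    """Sum weight x relevance over all keywords.

    Assumption: each keyword's contribution is capped at max_per_keyword.
    rows is a list of (weight, relevance) pairs, relevance on a 0-10 scale.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


# 27 keywords with weight 1.0: three rated 10.0, the remaining 24 rated 0.0.
rows = [(1.0, 10.0)] * 3 + [(1.0, 0.0)] * 24

threshold = pass_mark(sum(weight for weight, _ in rows))
score = paper_score(rows)
passed = score >= threshold

print(f"score={score:.1f} threshold={threshold:.1f} passed={passed}")
```

With all weights equal to 1.0, the table cannot distinguish a weighted product from other rules, so the weight × relevance reading should be treated as a guess.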

!!! tip deepseek-chat TL;DR

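This study examines how client language composition, from monolingual to multilingual clients, affects instruction-tuning quality, fairness, and training cost for multilingual LLMs under federated learning. It finds that greater within-client multilinguality yields stronger and fairer global models, with the largest gains for lower-resource languages, at the cost of more optimization steps.

Abstract (Chinese translation)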

在多语言环境下进行大语言模型的联邦学习面临着显著挑战，这些挑战主要源于客户端间异构的语言分布以及语言资源可用性的差异。为解决这些问题，我们扩展了FederatedScope-LLM框架，以支持大语言模型的多语言指令微调实验。同时，我们引入了一种新颖的客户端特定早停机制——本地动态早停（Local Dynamic Early Stopping, LDES-FL），该机制允许客户端根据本地验证性能暂停和恢复本地训练，从而提升训练效率和可持续性。通过一系列实验，我们研究了客户端的语言构成——从完全单语到日益多语化的客户端——如何影响多语言质量、公平性和训练成本。单语本地微调对于单一语言专业化仍然最为有效，而联邦训练则更适合学习单一平衡的多语言模型。在联邦学习中，增加客户端内部的多语言性能够产生更强、更公平的全局模型，缩小与集中式多语言微调的差距，并为资源较少的语言带来最大的收益，尽管这是以更多的优化步骤为代价的。总体而言，我们的研究结果表明，客户端的语言构成是多语言联邦学习中的一个关键设计变量，它塑造了性能、公平性和效率。

Abstract

Federated Learning (FL) of Large Language Models (LLMs) in multilingual environments presents significant challenges stemming from heterogeneous language distributions across clients and disparities in language resource availability. To address these challenges, we extended the FederatedScope-LLM framework to support multilingual instruction-tuning experiments with LLMs. We also introduced a novel client-specific early stopping mechanism, Local Dynamic Early Stopping (LDES-FL), which allows clients to pause and resume local training based on client-side validation performance, enhancing training efficiency and sustainability. Through a series of experiments, we studied how client language composition - from fully monolingual to increasingly multilingual clients - affects multilingual quality, fairness and training cost. Monolingual local fine-tuning remains the most effective for single-language specialization, whereas federated training is better suited to learning a single balanced multilingual model. In FL, increasing within-client multilinguality leads to stronger and fairer global models, narrows the gap to centralized multilingual fine-tuning, and yields the largest gains for lower-resource languages, albeit at the cost of more optimization steps. Overall, our results identify client language composition as a key design variable in multilingual FL, shaping performance, fairness and efficiency.

Keywords: Federated Learning, Large Language Models, Multilingual, Instruction Tuning, Client Language Composition, Fine-tuning, Fairness, Training Efficiency
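
The abstract describes LDES-FL only as a mechanism that lets a client pause and resume local training based on client-side validation performance. The sketch below is one possible reading of that idea, not the authors' algorithm. The class name, the `patience` rule for pausing, and the `resume_margin` rule for resuming are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LocalEarlyStopper:
    """Hypothetical per-client controller that pauses and resumes local training."""

    patience: int = 2            # assumed: rounds without improvement before pausing
    resume_margin: float = 0.05  # assumed: loss increase over the best that triggers a resume
    best: float = float("inf")   # best local validation loss seen so far
    stale_rounds: int = 0
    paused: bool = False

    def update(self, val_loss: float) -> bool:
        """Record this round's local validation loss; return True if the client should train."""
        if self.paused:
            # Resume only if the incoming model has become clearly worse on local data.
            if val_loss > self.best + self.resume_margin:
                self.paused = False
                self.stale_rounds = 0
            return not self.paused
        if val_loss < self.best:
            self.best = val_loss
            self.stale_rounds = 0
        else:
            self.stale_rounds += 1
            if self.stale_rounds >= self.patience:
                self.paused = True
        return not self.paused


# Toy validation losses over seven federated rounds (made-up numbers).
stopper = LocalEarlyStopper()
losses = [1.0, 0.8, 0.8, 0.81, 0.82, 0.9, 0.7]
decisions = [stopper.update(loss) for loss in losses]
print(decisions)
```

In this toy run the client trains for three rounds, pauses for two once its loss plateaus, and resumes when the loss rises above the best value by more than the margin. Skipped rounds are where any saving in local computation would come from under this reading.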

13. ✅ A^3: Towards Advertising Aesthetic Assessment

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

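This paper proposes the A^3 framework, comprising a theoretical paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench), to address the lack of scalability, standardized criteria, and interpretability in aesthetic assessment of advertising images. Experiments show that A^3-Align aligns with A^3-Law better than existing models and that this alignment generalizes to advertisement selection and critique tasks.

Abstract (Chinese translation)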

广告图像对商业转化率与品牌资产具有显著影响，然而当前的评估方法依赖主观判断，缺乏可扩展性、标准化准则与可解释性。为应对这些挑战，我们提出了A³（广告美学评估）框架，该综合框架包含四个组成部分：一个范式（A³-Law）、一个数据集（A³-Dataset）、一个多模态大语言模型（A³-Align）以及一个基准测试集（A³-Bench）。A³的核心是理论驱动的范式A³-Law，它包含三个层次化阶段：（1）感知注意（Perceptual Attention），评估感知图像信号吸引注意力的能力；（2）形式兴趣（Formal Interest），评估图像色彩与空间布局的形式构成在引发兴趣方面的表现；（3）欲望影响（Desire Impact），衡量图像唤起的欲望及其说服性影响。基于A³-Law，我们构建了A³-Dataset，其中包含来自3万张广告图像的12万条指令-响应对，每张图像均附有丰富的多维度标签与思维链（Chain-of-Thought, CoT）推理标注。我们进一步开发了A³-Align模型，该模型在A³-Law指导下，基于A³-Dataset通过思维链引导学习进行训练。在A³-Bench上的大量实验表明，与现有模型相比，A³-Align实现了与A³-Law更优的对齐，且这种对齐能力能够良好泛化至优质广告筛选与规范性广告批评任务，显示出其广泛部署的潜力。数据集、代码与模型可通过以下链接获取：https://github.com/euleryuan/A3-Align。

Abstract

Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present A^3 (Advertising Aesthetic Assessment), a comprehensive framework encompassing four components: a paradigm (A^3-Law), a dataset (A^3-Dataset), a multimodal large language model (A^3-Align), and a benchmark (A^3-Bench). Central to A^3 is a theory-driven paradigm, A^3-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A^3-Law, we construct A^3-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A^3-Align, trained under A^3-Law with CoT-guided learning on A^3-Dataset. Extensive experiments on A^3-Bench demonstrate that A^3-Align achieves superior alignment with A^3-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment. Dataset, code, and models can be found at: https://github.com/euleryuan/A3-Align.

Keywords: Advertising Aesthetic Assessment, Multimodal Large Language Model, Chain-of-Thought, Instruction Tuning, A^3-Law, A^3-Align, A^3-Dataset, A^3-Bench

14. ❌ Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization

Authors: Fei Bai, Zhipeng Chen, Chuan Hao, Ming Yang, Ran Tao, Bryan Dai, Wayne Xin Zhao, Jian Yang, Hongteng Xu Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24093v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	5.0/10	5.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

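Scoring rationale: The paper's core topic is reinforcement learning training for LLMs (RLVR); it proposes the DGO framework to improve experience utilization and knowledge internalization, so it is highly relevant to "Large Language Models" (10 points). It targets improvements on reasoning tasks, which gives it some relevance to "Chain of Thought" and "System 2 Thinking" (5 points each). The framework involves reflecting on and refining experience, which relates to "Self-Correction" (5 points). Other keywords such as MoE, SFT, and RAG do not appear in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

This paper addresses the underuse of experience in reinforcement learning training of large language models. It proposes Dual Guidance Optimization (DGO), a framework that guides training with both external experience and internal knowledge, and experiments show that it improves reasoning ability.

Abstract (Chinese translation)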

近年来，强化学习已成为提升大语言模型能力的重要途径。其中，基于可验证奖励的强化学习已成为推理任务中一种颇具前景的范式。然而，现有的基于强化学习的训练方法仍只是对人类学习过程的粗略近似。人类学习者能够同时利用外部经验和内部经验来引导探索，并将有用的轨迹逐步内化为稳定的知识。受此差距启发，我们提出：在基于可验证奖励的强化学习训练中，大语言模型如何能更好地利用并内化经验？为回答此问题，我们提出了双重引导优化，这是一个利用外部经验与内部经验来提升训练效果的统一框架。具体而言，DGO首先从先前探索过的轨迹中构建一个经验库。随后，策略在经验库与模型内部知识的共同引导下进行探索。由此产生的轨迹将进一步用于优化经验库并更新模型参数，从而形成一个经验利用与内化的闭环。实验表明，DGO在各项基准测试中均持续优于基线方法，这表明对经验更有效的利用与内化能够带来更优越的推理能力。

Abstract

Recently, reinforcement learning (RL) has become an important approach for improving the capabilities of large language models (LLMs). In particular, reinforcement learning from verifiable rewards (RLVR) has emerged as a promising paradigm for reasoning tasks. However, existing RL-based training still remains only a rough approximation to human learning. Human learners leverage both external and internal experience to guide exploration and gradually internalize useful trajectories into stable knowledge. Motivated by this gap, we ask: how can LLMs better utilize and internalize experience during RLVR training? To answer this question, we propose Dual Guidance Optimization (DGO), a unified framework that leverages external and internal experience to improve training effectiveness. Specifically, DGO first constructs an experience bank from previously explored trajectories. The policy then performs exploration under the joint guidance of the experience bank and the model’s internal knowledge. The resulting trajectories are further used to refine the experience bank and optimize model parameters, forming a closed loop of experience utilization and internalization. Experiments show that DGO consistently outperforms baseline methods, suggesting that better utilization and internalization of experience lead to more effective reasoning.

Keywords: reinforcement learning, large language models, RLVR, reasoning tasks, experience utilization, knowledge internalization, DGO, trajectory optimization

15. ❌ Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

Authors: Somaya Eltanbouly, Samer Rashwani Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23972v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

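Scoring rationale: The paper's core contribution is a retrieval-augmented generation (RAG) framework built on the Doha Historical Dictionary of Arabic, aimed at improving how Arabic large language models (LLMs) understand complex historical and religious texts such as the Quran and Hadith. It is therefore highly relevant to "Large Language Models" and "Retrieval-Augmented Generation" (10 points each). Because the RAG framework aims to improve factual accuracy, it indirectly touches on hallucination mitigation, so "Hallucination Mitigation" receives 5 points. Other keywords such as MoE, SFT, and quantization are not mentioned in the abstract or are unrelated to the core method, and are scored 0.

!!! tip deepseek-chat TL;DR

This study addresses the difficulty Arabic large language models have with complex historical and religious texts such as the Quran and Hadith. It proposes a retrieval-augmented generation framework grounded in the Doha Historical Dictionary of Arabic, which raises the accuracy of Arabic-native LLMs to over 85%.

Abstract (Chinese translation)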

大型语言模型（LLM）在许多语言任务中取得了显著进展，但在处理《古兰经》和圣训等复杂的历史与宗教阿拉伯语文本时仍面临困难。为应对这一局限，我们开发了一种基于历时词典知识的检索增强生成（RAG）框架。与以往依赖通用语料库的RAG系统不同，我们的方法从《多哈阿拉伯语历史词典》（DHDA）中检索证据，该词典是记录阿拉伯语词汇历史演变的大规模资源。所提出的流程将混合检索与基于意图的路由机制相结合，为LLM提供精确且符合语境的历史信息。实验表明，该方法将包括Fanar和ALLaM在内的阿拉伯语原生LLM的准确率提升至85%以上，显著缩小了其与专有大规模模型Gemini的性能差距。Gemini在我们的实验中还充当了LLM即评判员系统以进行自动评估。自动评判结果通过人工评估验证，显示出高度一致性（kappa = 0.87）。错误分析进一步揭示了关键的语言学挑战，包括变音符号和复合表达问题。这些发现证明了将历时词典资源整合到检索增强生成框架中对提升阿拉伯语理解能力——尤其是针对历史与宗教文本——的价值。代码与资源已公开于：https://github.com/somayaeltanbouly/Doha-Dictionary-RAG。

Abstract

Large language models (LLMs) have achieved remarkable progress in many language tasks, yet they continue to struggle with complex historical and religious Arabic texts such as the Quran and Hadith. To address this limitation, we develop a retrieval-augmented generation (RAG) framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general-purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary of Arabic (DHDA), a large-scale resource documenting the historical development of Arabic vocabulary. The proposed pipeline combines hybrid retrieval with an intent-based routing mechanism to provide LLMs with precise, contextually relevant historical information. Our experiments show that this approach improves the accuracy of Arabic-native LLMs, including Fanar and ALLaM, to over 85%, substantially reducing the performance gap with Gemini, a proprietary large-scale model. Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments. The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87). An error analysis further highlights key linguistic challenges, including diacritics and compound expressions. These findings demonstrate the value of integrating diachronic lexicographic resources into retrieval-augmented generation frameworks to enhance Arabic language understanding, particularly for historical and religious texts. The code and resources are publicly available at: https://github.com/somayaeltanbouly/Doha-Dictionary-RAG.

Keywords: Large Language Models, Retrieval-Augmented Generation, Arabic language understanding, Historical texts, Quran and Hadith, Doha Historical Dictionary, Hybrid retrieval, LLM-as-a-judge
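
The abstract reports agreement between the LLM judge and human evaluators as kappa = 0.87. Assuming this refers to Cohen's kappa for two raters (the abstract does not name the variant), the statistic corrects observed agreement for the agreement expected by chance. The sketch below computes it on made-up labels, not the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Toy verdicts for ten answers (illustrative only).
judge = ["correct"] * 4 + ["incorrect"] * 4 + ["correct", "incorrect"]
human = ["correct"] * 4 + ["incorrect"] * 4 + ["incorrect", "incorrect"]

kappa = cohens_kappa(judge, human)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 9 of 10 items (observed agreement 0.9) while chance agreement is 0.5, giving a kappa of about 0.8.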

16. ❌ From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

Authors: Lijing Luo, Yiben Luo, Alexey Gorbatovski, Sergey Kovalchuk, Xiaodan Liang Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23964v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

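Scoring rationale: The paper studies the evolution of reinforcement learning environments, with particular attention to the paradigm shift toward language-driven foundation agents (LLM agents). The abstract explicitly mentions "Large Language Models (LLMs)" and language-driven foundation agents, so the "Large Language Models" and "LLM Agents" keywords are rated highly relevant (10 points each). Other keywords such as MoE, SFT, RAG, and quantization are not mentioned in the abstract or are not directly related to the paper's topic, and are scored 0.

!!! tip deepseek-chat TL;DR

Through a large-scale, data-driven empirical study, this paper analyzes the evolution of reinforcement learning environments from physical simulations to language-driven foundation agents. It identifies a paradigm shift toward a "Semantic Prior" ecosystem dominated by large language models and offers a roadmap for designing the next generation of Embodied Semantic Simulators.

Abstract (Chinese translation)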

强化学习（RL）领域的显著进展，本质上与用于训练和评估智能体的环境密不可分。本研究超越了传统的定性综述，对RL环境的演进进行了大规模、数据驱动的实证研究。通过程序化处理海量学术文献并严格提炼超过2000篇核心出版物，我们提出了一种量化方法，以描绘从孤立的物理模拟到通用、语言驱动的智能体（Foundation Agents）的转变路径。我们采用一种新颖的多维度分类法，针对不同的应用领域和所需的认知能力，对基准测试进行了系统分析。我们的自动化语义与统计分析揭示了一个深刻的、数据验证的范式转变：该领域已分化为由大语言模型（LLMs）主导的“语义先验”生态系统和“领域特定泛化”生态系统。此外，我们刻画了这些不同领域的“认知指纹”，以揭示跨任务协同、多领域干扰以及零样本泛化的内在机制。最终，本研究为设计下一代具身语义模拟器（Embodied Semantic Simulators）提供了一份严谨的、量化的路线图，旨在弥合连续物理控制与高层逻辑推理之间的鸿沟。

Abstract

The remarkable progress of reinforcement learning (RL) is intrinsically tied to the environments used to train and evaluate artificial agents. Moving beyond traditional qualitative reviews, this work presents a large-scale, data-driven empirical investigation into the evolution of RL environments. By programmatically processing a massive corpus of academic literature and rigorously distilling over 2,000 core publications, we propose a quantitative methodology to map the transition from isolated physical simulations to generalist, language-driven foundation agents. Implementing a novel, multi-dimensional taxonomy, we systematically analyze benchmarks against diverse application domains and requisite cognitive capabilities. Our automated semantic and statistical analysis reveals a profound, data-verified paradigm shift: the bifurcation of the field into a “Semantic Prior” ecosystem dominated by Large Language Models (LLMs) and a “Domain-Specific Generalization” ecosystem. Furthermore, we characterize the “cognitive fingerprints” of these distinct domains to uncover the underlying mechanisms of cross-task synergy, multi-domain interference, and zero-shot generalization. Ultimately, this study offers a rigorous, quantitative roadmap for designing the next generation of Embodied Semantic Simulators, bridging the gap between continuous physical control and high-level logical reasoning.

Keywords: reinforcement learning environments, large language models, LLM agents, foundation agents, semantic prior ecosystem, embodied semantic simulators, cognitive fingerprints, zero-shot generalization

17. ❌ Self-Distillation for Multi-Token Prediction

Authors: Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23911v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

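Scoring rationale: The paper addresses the inference efficiency of large language models (LLMs) and proposes MTP-D, a self-distillation method for multi-token prediction, to accelerate inference. It is therefore highly relevant to "Large Language Models" (10 points), since LLMs are explicitly the object of study, and to "Speculative Decoding" OR "Inference Acceleration" (10 points), since MTP is a form of speculative decoding that aims to accelerate LLM inference by predicting multiple future tokens in parallel. The paper does not touch on the other keywords, such as MoE, SLMs, training methods, alignment, agents, or quantization, so these are scored 0.

!!! tip deepseek-chat TL;DR

This paper targets the inference efficiency bottleneck of large language models. It proposes MTP-D, a self-distillation method for multi-token prediction that raises MTP-head acceptance rates (+7.5%) and, combined with a looped extension strategy, delivers a further inference speedup over 1-head MTP (+220.4%).

Abstract (Chinese translation)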

随着大语言模型（LLM）规模的扩大，推理效率成为关键瓶颈。多令牌预测（MTP）技术通过并行预测多个未来令牌，有望加速LLM推理。然而，现有MTP方法仍面临两大挑战：MTP头的接受率有限，以及多个MTP头的联合训练困难。为此，我们提出MTP-D，一种简单高效且附加训练成本极低的自蒸馏方法，该方法在最大程度保持主头性能的同时，显著提升了MTP头的接受率（+7.5%）。我们还为MTP-D引入了循环扩展策略，实现了高效且经济的MTP头扩展，并相较于单头MTP进一步实现了显著的推理加速（+220.4%）。此外，通过在七个基准测试上进行大量实验，我们系统性地探索并验证了关于蒸馏策略的关键见解以及MTP潜在的可扩展性。这些结果表明，我们的MTP-D及循环扩展策略有效提升了MTP头的性能与推理效率，推动了MTP在LLM中的实际应用。

Abstract

As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5%) while maximumly preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP head extension and further significant inference speedup to 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on the distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that our MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical usage of MTP in LLMs.

Keywords: Large Language Models, LLMs, Multi-Token Prediction, MTP, Inference Efficiency, Self-Distillation, Inference Acceleration, Speculative Decoding
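
The abstract reports gains in MTP-head acceptance rate without defining the metric. The sketch below shows one common way such a rate is measured, assuming the tokens drafted by MTP heads are verified greedily against the main head's tokens, as in speculative decoding. The function names and token ids are hypothetical, and the paper's exact definition may differ.

```python
def accepted_prefix_len(draft, verified) -> int:
    """Number of leading draft tokens that match the main head's tokens."""
    count = 0
    for draft_token, verified_token in zip(draft, verified):
        if draft_token != verified_token:
            break  # the first mismatch rejects this token and everything after it
        count += 1
    return count


def acceptance_stats(steps):
    """Return (acceptance rate, average tokens emitted per verification pass).

    steps is a list of (draft_tokens, verified_tokens) pairs, one per pass.
    Each pass emits the accepted draft tokens plus one token from the main head.
    """
    drafted = sum(len(draft) for draft, _ in steps)
    accepted = sum(accepted_prefix_len(draft, verified) for draft, verified in steps)
    rate = accepted / drafted if drafted else 0.0
    tokens_per_pass = (accepted + len(steps)) / len(steps) if steps else 0.0
    return rate, tokens_per_pass


# Three toy verification passes with made-up token ids.
steps = [
    ([5, 7, 9], [5, 7, 2]),  # first two drafts accepted, third rejected
    ([4, 4], [4, 4]),        # both drafts accepted
    ([1], [3]),              # draft rejected
]
rate, tokens_per_pass = acceptance_stats(steps)
print(f"acceptance rate = {rate:.3f}, tokens per pass = {tokens_per_pass:.3f}")
```

Under this reading a higher acceptance rate raises the number of tokens emitted per forward pass, which is where the inference speedup comes from.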

18. ❌ LLMpedia: A Transparent Framework to Materialize an LLM’s Encyclopedic Knowledge at Scale

Authors: Muhammed Saeed, Simon Razniewski Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24080v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

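Scoring rationale: LLMpedia's core aim is to evaluate the factuality of large language models (LLMs) by generating encyclopedia articles that test the accuracy of knowledge stored in the model's parametric memory. It is therefore highly relevant to "Large Language Models" OR "LLMs" OR "Foundation Models" (10 points), since the study targets LLMs directly. The paper also focuses on exposing factuality shortfalls, which makes it highly relevant to "Hallucination Mitigation" OR "Factuality" OR "Truthfulness" (10 points). Other keywords, such as MoE, SFT, RAG, reasoning methods, agents, and compression, concern techniques the paper does not address, and are scored 0.

!!! tip deepseek-chat TL;DR

This paper proposes LLMpedia, a framework that evaluates factuality by generating encyclopedia articles from a large language model's parametric memory. It finds that the verifiable true rate on Wikipedia-covered subjects is only 74.7% for gpt-5-mini, well below the optimistic picture from benchmarks, which exposes the limits of fixed-question evaluation. It also releases the first fully open parametric encyclopedia.

Abstract (Chinese translation)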

MMLU等基准测试表明，旗舰语言模型的事实准确性已接近饱和，得分超过90%。我们指出这一结论并不完整。LLMpedia完全基于参数化记忆生成百科全书式文章，在未使用检索增强的情况下，为三个模型系列生成了约100万篇文章。对于gpt-5-mini模型，在维基百科覆盖的主题上可验证的真实率仅为74.7%——比基准测试反映的情况低15个百分点以上，这印证了固定问题评估存在的可得性偏差。在维基百科覆盖范围之外，仅能通过精选网络证据验证的前沿主题真实率进一步降至63.2%。维基百科仅涵盖已浮现主题的61%，且三个模型系列在主题选择上的重叠率仅为7.3%。在受先前Grokipedia分析启发的捕获陷阱基准测试中，LLMpedia在文本相似度约为维基百科一半的情况下实现了显著更高的事实准确性。与Grokipedia不同，本研究的每个提示、生成文本及评估结论均已公开，使LLMpedia成为首个完全开放的参数化百科全书——架起了事实性评估与知识具象化之间的桥梁。所有数据、代码及可浏览界面详见https://llmpedia.net。

Abstract

Benchmarks such as MMLU suggest flagship language models approach factuality saturation, with scores above 90%. We show this picture is incomplete. LLMpedia generates encyclopedic articles entirely from parametric memory, producing ~1M articles across three model families without retrieval. For gpt-5-mini, the verifiable true rate on Wikipedia-covered subjects is only 74.7% – more than 15 percentage points below the benchmark-based picture, consistent with the availability bias of fixed-question evaluation. Beyond Wikipedia, frontier subjects verifiable only through curated web evidence fall further to 63.2% true rate. Wikipedia covers just 61% of surfaced subjects, and three model families overlap by only 7.3% in subject choice. In a capture-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia achieves substantially higher factuality at roughly half the textual similarity to Wikipedia. Unlike Grokipedia, every prompt, artifact, and evaluation verdict is publicly released, making LLMpedia the first fully open parametric encyclopedia – bridging factuality evaluation and knowledge materialization. All data, code, and a browsable interface are at https://llmpedia.net.

Keywords: Large Language Models, Factuality Evaluation, Parametric Memory, Encyclopedic Knowledge, Hallucination, Benchmark Analysis, Knowledge Materialization, Open Framework

19. ❌ Can we generate portable representations for clinical time series data using LLMs?

Authors: Zongliang Ji, Yifei Sun, Andre Amaral, Anna Goldenberg, Rahul G. Krishnan Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23987v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

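Rationale: The paper's core topic is the application of large language models (LLMs) in clinical medicine. It uses a frozen LLM to convert ICU time series into natural-language summaries and then derives portable patient embeddings from them, targeting the distribution shift that arises when models are deployed across hospitals. It is therefore highly relevant to “Large Language Models” (10 points) and to the bioinformatics/medical applications under “AI for Science” (10 points). The paper does not address the technical principles or methods behind the other keywords, such as MoE, model compression, inference acceleration, or alignment tuning, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The study explores using large language models to generate portable patient embeddings from clinical time series, addressing distribution shift when predictive models are deployed across hospitals. Experiments show the approach is simple and effective: it stays competitive while reducing performance degradation.

Abstract (Chinese translation)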

临床机器学习部署缓慢且脆弱：在一家医院有效的模型，在另一家医院面临分布偏移时性能常会下降。本研究探讨一个简单问题——大型语言模型能否创建可移植的患者嵌入，即一种患者表征，使得基于一家医院构建的下游预测器能够在其他地方使用，且只需极少甚至无需重新训练与微调。为此，我们使用一个冻结的LLM将不规则的重症监护室时间序列映射为简洁的自然语言摘要，随后通过一个冻结的文本嵌入模型对每个摘要进行嵌入，以获得固定长度的向量，该向量能够作为多种下游预测器的输入。在三个队列（MIMIC-IV、HIRID、PPICU）中，针对多项临床预测与分类任务，我们发现该方法简单易用，在分布内性能上与网格插补、自监督表征学习以及时间序列基础模型相比具有竞争力，同时在迁移至新医院时表现出更小的相对性能下降。我们研究了提示设计对性能的影响，发现结构化提示对于降低预测模型的方差至关重要，且不改变平均准确率。使用这些可移植表征能够改进少样本学习，并且相较于基线，并未增加年龄或性别等人口统计信息的可复原性，这表明其带来的额外隐私风险很小。我们的工作揭示了LLMs作为工具的潜力，可通过降低工程开销，实现生产级预测模型的可扩展部署。

Abstract

Deploying clinical ML is slow and brittle: models that work at one hospital often degrade under distribution shifts at the next. In this work, we study a simple question – can large language models (LLMs) create portable patient embeddings i.e. representations of patients enable a downstream predictor built on one hospital to be used elsewhere with minimal-to-no retraining and fine-tuning. To do so, we map from irregular ICU time series onto concise natural language summaries using a frozen LLM, then embed each summary with a frozen text embedding model to obtain a fixed length vector capable of serving as input to a variety of downstream predictors. Across three cohorts (MIMIC-IV, HIRID, PPICU), on multiple clinically grounded forecasting and classification tasks, we find that our approach is simple, easy to use and competitive with in-distribution with grid imputation, self-supervised representation learning, and time series foundation models, while exhibiting smaller relative performance drops when transferring to new hospitals. We study the variation in performance across prompt design, with structured prompts being crucial to reducing the variance of the predictive models without altering mean accuracy. We find that using these portable representations improves few-shot learning and does not increase demographic recoverability of age or sex relative to baselines, suggesting little additional privacy risk. Our work points to the potential that LLMs hold as tools to enable the scalable deployment of production grade predictive models by reducing the engineering overhead.

Keywords: Large Language Models, clinical time series, patient embeddings, portable representations, distribution shift, ICU data, predictive models, hospital transfer
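
The pipeline described in the abstract above (irregular ICU time series, then a structured prompt, then a frozen LLM summary, then a frozen text embedding) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_structured_prompt`, `portable_embedding`, the event tuple layout, and the prompt wording are all hypothetical, and `summarize` and `embed` stand in for whatever frozen LLM and text embedding model are used.

```python
from collections import defaultdict


def build_structured_prompt(events):
    """Turn irregular measurements into a fixed-layout text prompt.

    events: iterable of (hours_since_admission, variable_name, value).
    The layout and wording here are hypothetical, chosen only for illustration.
    """
    by_variable = defaultdict(list)
    for t, name, value in sorted(events):
        by_variable[name].append((t, value))

    lines = [
        "Summarize this ICU stay concisely.",
        "Measurements (variable: time=value):",
    ]
    # Emit variables in a fixed (alphabetical) order so the prompt structure
    # does not depend on the order in which measurements arrived.
    for name in sorted(by_variable):
        series = ", ".join(f"{t:g}h={v:g}" for t, v in by_variable[name])
        lines.append(f"- {name}: {series}")
    return "\n".join(lines)


def portable_embedding(events, summarize, embed):
    """Compose the two frozen models into a fixed-length patient vector.

    summarize: callable str -> str (stand-in for a frozen LLM).
    embed: callable str -> list[float] (stand-in for a frozen text embedder).
    """
    return embed(summarize(build_structured_prompt(events)))
```

Because both models are frozen, only the downstream predictor that consumes the returned vector would need training in this sketch.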

20. ❌ PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

Authors: Rohan Khetan, Ashna Khetan Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23841v1

Score: 18.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

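Rationale: The paper's core topic is evaluating political bias in LLMs, which makes it directly and highly relevant to “Large Language Models” and to “Instruction Tuning” OR “Alignment” OR “Value Alignment” (through value alignment). Other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, CoT, Agents, and Quantization, are neither mentioned in the abstract nor relevant to it, so they score 0.

!!! tip deepseek-chat TL;DR

Using the PoliticsBench multi-turn roleplay framework, the study evaluates political bias in eight prominent large language models. Seven lean left and Grok leans right, and the study characterizes how the models' values deviate in free-text interactions.

Abstract (Chinese translation)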

随着大型语言模型（LLM）日益成为主要信息来源，其潜在的政治偏见可能影响其客观性。现有的LLM社会偏见基准主要评估性别和种族刻板印象；即使涉及政治偏见，也通常仅进行粗略层面的衡量，忽视了塑造社会政治倾向的具体价值观。本研究采用PoliticsBench——一种基于EQ-Bench-v3心理测量基准改进的新型多轮角色扮演框架，对八个主流LLM（Claude、Deepseek、Gemini、GPT、Grok、Llama、Qwen Base、Qwen Instruction-Tuned）的政治偏见进行探究。我们测试了商业开发的LLM是否表现出系统性左倾偏见，且该偏见在多阶段角色扮演的后期更为显著。通过二十个渐进式情境模拟，每个模型陈述其立场并决定行动方案。我们依据十项政治价值观量表对这些回答进行评分，探究聊天机器人偏离中立标准背后的价值取向。八个模型中有七个呈现左倾倾向，仅Grok表现为右倾。每个左倾LLM均强烈表现出自由主义特征，同时适度体现保守主义特质。研究发现角色扮演各阶段的对齐分数存在轻微波动，但未呈现特定规律。尽管大多数模型采用基于后果的推理方式，Grok则频繁运用事实与统计数据展开论证。本研究首次通过多阶段自由文本交互，实现了对LLM政治价值观的心理测量学评估。

Abstract

While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate gender and racial stereotypes. When political bias is included, it is typically measured at a coarse level, neglecting the specific values that shape sociopolitical leanings. This study investigates political bias in eight prominent LLMs (Claude, Deepseek, Gemini, GPT, Grok, Llama, Qwen Base, Qwen Instruction-Tuned) using PoliticsBench: a novel multi-turn roleplay framework adapted from the EQ-Bench-v3 psychometric benchmark. We test whether commercially developed LLMs display a systematic left-leaning bias that becomes more pronounced in later stages of multi-stage roleplay. Through twenty evolving scenarios, each model reported its stance and determined its course of action. Scoring these responses on a scale of ten political values, we explored the values underlying chatbots’ deviations from unbiased standards. Seven of our eight models leaned left, while Grok leaned right. Each left-leaning LLM strongly exhibited liberal traits and moderately exhibited conservative ones. We discovered slight variations in alignment scores across stages of roleplay, with no particular pattern. Though most models used consequence-based reasoning, Grok frequently argued with facts and statistics. Our study presents the first psychometric evaluation of political values in LLMs through multi-stage, free-text interactions.

Keywords: Large Language Models, Political Bias, Multi-turn Roleplay, Value Alignment, Psychometric Evaluation, Benchmarking, Left-leaning Bias, Free-text Interactions

21. ❌ Evaluating a Multi-Agent Voice-Enabled Smart Speaker for Care Homes: A Safety-Focused Framework

Authors: Zeinab Dehghani, Rameez Raja Kureshi, Koorosh Aslansefat, Faezeh Alsadat Abedi, Dhavalkumar Thakker, Lisa Greaves, Bhupesh Kumar Mishra, Baseer Ahmad, Tanaya Maslekar Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23625v1

Score: 18.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

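Rationale: The paper studies a voice-enabled smart speaker system for care homes that combines Whisper-based speech recognition with retrieval-augmented generation (RAG) approaches (hybrid, sparse, and dense). It is therefore highly relevant to “Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation” (10 points), since RAG is one of the core methods. The paper mentions using GPT-5.2, so it has some relevance to “Large Language Models” OR “LLMs” OR “Foundation Models” (8 points), but LLMs serve as a system component rather than the main innovation. Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and agent systems, are not mentioned in the abstract and score 0. The paper concerns a healthcare application but does not clearly fall under “AI for Science” OR “Bioinformatics” OR “Cheminformatics” (which usually refer to bioinformatics or cheminformatics), so that keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper presents and evaluates a voice-enabled smart speaker system for care homes that combines Whisper speech recognition with RAG. Under a safety-focused framework, it achieves high accuracy in resident identification, reminder extraction, and scheduling, indicating that AI can support accurate documentation and task management in care settings.

Abstract (Chinese translation)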

人工智能（AI）在健康与社会照护领域正日益受到关注，其应用旨在减轻行政工作负担，使工作人员能将更多时间投入患者照护。本文评估了一款支持语音功能的护理院智能音箱，该设备旨在协助养老院的日常活动，包括通过语音查询居民记录、设置提醒及安排任务。研究提出了一套以安全为核心的评估框架，对系统进行端到端测试，结合了基于Whisper的语音识别与检索增强生成（RAG）方法（混合式、稀疏式与密集式）。通过监督式护理院实地试验与受控测试，我们评估了涵盖11个照护类别的330份语音转录文本，其中包含184次涉及提醒功能的交互。评估重点包括：（1）对居民身份与照护类别的正确识别；（2）提醒事项的识别与提取；（3）不确定情境下端到端任务安排的准确性（包括安全延迟/澄清机制）。鉴于护理院环境对安全性的高度要求，研究特别关注系统在嘈杂环境及不同口音下的可靠性，并通过置信度评分、澄清提示和人机协同监督机制予以支持。在最优配置（GPT-5.2）下，居民身份与照护类别匹配率达到100%（95%置信区间：98.86-100），提醒识别率达到89.09%（95%置信区间：83.81-92.80），且未遗漏任何提醒事项（召回率100%），但存在少量误报。通过日历集成实现的端到端任务安排达成84.65%的提醒数量精确匹配率（95%置信区间：78.00-89.56），表明将非正式口语指令转化为可执行事件时仍存在边界案例。研究结果表明，经过严谨评估并配备适当保障措施的语音交互系统，能够在养老院场景中支持准确的记录保存、高效的任务管理，并促进人工智能的可信应用。

Abstract

Artificial intelligence (AI) is increasingly being explored in health and social care to reduce administrative workload and allow staff to spend more time on patient care. This paper evaluates a voice-enabled Care Home Smart Speaker designed to support everyday activities in residential care homes, including spoken access to resident records, reminders, and scheduling tasks. A safety-focused evaluation framework is presented that examines the system end-to-end, combining Whisper-based speech recognition with retrieval-augmented generation (RAG) approaches (hybrid, sparse, and dense). Using supervised care-home trials and controlled testing, we evaluated 330 spoken transcripts across 11 care categories, including 184 reminder-containing interactions. These evaluations focus on (i) correct identification of residents and care categories, (ii) reminder recognition and extraction, and (iii) end-to-end scheduling correctness under uncertainty (including safe deferral/clarification). Given the safety-critical nature of care homes, particular attention is also paid to reliability in noisy environments and across diverse accents, supported by confidence scoring, clarification prompts, and human-in-the-loop oversight. In the best-performing configuration (GPT-5.2), resident ID and care category matching reached 100% (95% CI: 98.86-100), while reminder recognition reached 89.09% (95% CI: 83.81-92.80) with zero missed reminders (100% recall) but some false positives. End-to-end scheduling via calendar integration achieved 84.65% exact reminder-count agreement (95% CI: 78.00-89.56), indicating remaining edge cases in converting informal spoken instructions into actionable events. The findings suggest that voice-enabled systems, when carefully evaluated and appropriately safeguarded, can support accurate documentation, effective task management, and trustworthy use of AI in care home settings.

Keywords: voice-enabled smart speaker, care homes, retrieval-augmented generation (RAG), speech recognition, safety evaluation, task scheduling, AI in healthcare, Whisper
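
The abstract above reports 95% confidence intervals for proportions (for example, 100% with 95% CI 98.86-100) but does not say how they were computed. As one common choice, the Wilson score interval can be sketched as below. This is an assumption for illustration only: `wilson_interval` is a hypothetical helper, the counts in the usage lines are assumed, and the output need not match the reported bounds exactly.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to a two-sided 95% interval.
    Returns (lower, upper), clamped to [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Assumed counts for illustration: 330 correct out of 330 transcripts.
low, high = wilson_interval(330, 330)
print(f"{low:.4f} - {high:.4f}")
```

Unlike the plain normal approximation, the Wilson interval stays informative when the observed proportion is exactly 0 or 1, which is why it is often preferred for results such as a 100% match rate.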

22. ❌ Enes Causal Discovery

Authors: Alexis Kafantaris Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24436v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

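Rationale: The paper “Enes Causal Discovery” proposes a Mixture of Experts (MoE) architecture for causal discovery, so it is highly relevant to “Mixture of Experts” OR “MoE” OR “Sparse Models” (10 points). It applies AI to a scientific problem (causal discovery), which gives it some relevance to “AI for Science” OR “Bioinformatics” OR “Cheminformatics” (5 points), although it does not explicitly involve bioinformatics or cheminformatics. The paper does not mention large language models (LLMs), innovations in deep learning principles, or any other scored keyword, so the remaining keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a Mixture of Experts (MoE) model for causal discovery from observational data, aiming to address data limitations and improve on traditional linear methods.

Abstract (Chinese translation)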

所提出的架构采用专家混合模型，使得模型实体（如因果关系）能够被进一步参数化。具体而言，本研究尝试利用神经网络实现神经元，因为对该数据集而言，实现神经元构成了一项重大挑战。需要说明的是，通常使用简单快速的皮尔逊系数线性模型即可获得良好评分，这构成了一个需要优秀模型才能超越的强基准线。此外，在观测数据的因果发现方面存在主要限制：与萨克斯研究不同，本研究未使用干预措施而仅依赖先验知识；其中最具制约性的限制是数据本身的问题，本文对此进行了处理。随后，本文描述了方法与模型，并展示了实验结果。

Abstract

Enes The proposed architecture is a mixture of experts, which allows for the model entities, such as the causal relationships, to be further parameterized. More specifically, an attempt is made to exploit a neural net as implementing neurons poses a great challenge for this dataset. To explain, a simple and fast Pearson coefficient linear model usually achieves good scores. An aggressive baseline that requires a really good model to overcome that is. Moreover, there are major limitations when it comes to causal discovery of observational data. Unlike the sachs one did not use interventions but only prior knowledge; the most prohibiting limitation is that of the data which is addressed. Thereafter, the method and the model are described and after that the results are presented.

Keywords: causal discovery, mixture of experts, MoE, observational data, neural network, Pearson coefficient, baseline model
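
The abstract above notes that a simple Pearson-coefficient linear model is a strong baseline. The paper's exact baseline is not specified, so the following is only a generic sketch of a correlation-based baseline: compute pairwise Pearson correlations and keep the variable pairs whose absolute correlation reaches a threshold. The function names, the threshold, and the toy data are hypothetical, and the resulting edges are undirected, since correlation alone does not orient them.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear association
    return cov / (norm_x * norm_y)


def correlation_edges(columns, threshold=0.5):
    """Propose undirected edges between strongly correlated variables.

    columns: dict mapping variable name -> list of observations.
    threshold: hypothetical cut-off on |r|; it would need tuning per dataset.
    """
    names = list(columns)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                edges.append((a, b, r))
    return edges


# Toy data: y is an exact linear function of x, z is only weakly related.
toy = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [1, -1, 1, -1]}
print(correlation_edges(toy))
```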

23. ❌ Project and Generate: Divergence-Free Neural Operators for Incompressible Flows

Authors: Xigui Li, Hongwei Zhang, Ruoxi Jiang, Deshu Chen, Chensen Lin, Limei Han, Yuan Qi, Xin Guo, Yuan Cheng Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24500v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

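Rationale: The paper “Project and Generate: Divergence-Free Neural Operators for Incompressible Flows” focuses on machine learning models for fluid dynamics. It introduces a unified framework that enforces the incompressible continuity equation as a hard constraint, addressing physically inadmissible simulations. Its core content covers neural operators, physical constraints, fluid simulation, and generative modeling, placing it in scientific computing and physics-informed machine learning. Of all the keywords, only “AI for Science” OR “Bioinformatics” OR “Cheminformatics” is highly relevant, because the paper applies AI directly to a scientific field (fluid dynamics), which falls under “AI for Science”. The other keywords mainly concern large language models (LLMs), model training, inference optimization, and agent systems; they are unrelated to the paper's fluid-dynamics and physics-constrained modeling topic and score 0. Weighted total: only the “AI for Science” keyword scores, with 10 points at weight 1.0, for a total of 10.0. The author list does not include any of the designated experts.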
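
The weighted total and the pass/fail mark used throughout this report follow from simple arithmetic: each keyword contributes weight × relevance, and the passing score is given in the report's configuration as 5.0 + 0.8 × total weight (26.6 for 27 keywords at weight 1.0). A minimal sketch of that calculation, with hypothetical function names:

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold from the report's configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


def weighted_score(rows):
    """Sum of weight * relevance over (weight, relevance) pairs.

    Relevance is on the 0-10 scale shown in the score tables.
    """
    return sum(weight * relevance for weight, relevance in rows)


# The table above: 26 keywords at relevance 0 and "AI for Science" at relevance 10.
rows = [(1.0, 0.0)] * 26 + [(1.0, 10.0)]
total = weighted_score(rows)
threshold = passing_score(sum(weight for weight, _ in rows))
print(f"{total:.1f} / {threshold:.1f}", "pass" if total >= threshold else "fail")
# prints "10.0 / 26.6 fail"
```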

!!! tip deepseek-chat TL;DR

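The paper addresses the instability and physical inadmissibility of learning-based fluid dynamics models that lack physical constraints. It introduces a unified framework that enforces the incompressible continuity equation as a hard constraint, achieving exact incompressibility and substantially improved stability and physical consistency.

Abstract (Chinese translation)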

基于学习的流体动力学模型通常在无约束函数空间中运行，导致物理上不可接受的不稳定模拟。虽然基于惩罚的方法提供了软正则化，但它们无法提供结构性保证，从而产生虚假散度和长期崩溃问题。在本研究中，我们提出了一个统一框架，将不可压缩连续性方程作为确定性和生成性建模的硬性内在约束加以强制执行。首先，为使确定性模型投影至无散子空间，我们集成了基于亥姆霍兹-霍奇分解的可微分谱Leray投影，将回归假设空间限制在物理可容许的速度场中。其次，在生成物理一致的分布时，我们发现当先验分布不兼容时，简单投影模型输出是不够的。为此，我们通过基于旋度的前推映射构建了无散高斯参考测度，确保整个概率流在构造上始终保持子空间一致性。在二维纳维-斯托克斯方程上的实验表明，该方法在离散误差范围内实现了精确的不可压缩性，并显著提升了稳定性和物理一致性。

Abstract

Learning-based models for fluid dynamics often operate in unconstrained function spaces, leading to physically inadmissible, unstable simulations. While penalty-based methods offer soft regularization, they provide no structural guarantees, resulting in spurious divergence and long-term collapse. In this work, we introduce a unified framework that enforces the incompressible continuity equation as a hard, intrinsic constraint for both deterministic and generative modeling. First, to project deterministic models onto the divergence-free subspace, we integrate a differentiable spectral Leray projection grounded in the Helmholtz-Hodge decomposition, which restricts the regression hypothesis space to physically admissible velocity fields. Second, to generate physically consistent distributions, we show that simply projecting model outputs is insufficient when the prior is incompatible. To address this, we construct a divergence-free Gaussian reference measure via a curl-based pushforward, ensuring the entire probability flow remains subspace-consistent by construction. Experiments on 2D Navier-Stokes equations demonstrate exact incompressibility up to discretization error and substantially improved stability and physical consistency.

Keywords: neural operators, incompressible flows, divergence-free, physical consistency, Helmholtz-Hodge decomposition, Navier-Stokes equations, generative modeling, fluid dynamics
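
The spectral Leray projection mentioned in the abstract above has a compact form on a periodic domain: in Fourier space, each velocity mode û(k) loses its component along the wavevector k, giving û − k (k·û) / |k|². The NumPy sketch below illustrates that standard formula in 2D. It is not the authors' code: `leray_project_2d` is a hypothetical name, the grid is assumed periodic with equal spacing in both directions, and the field is assumed to carry no energy at the Nyquist frequency. A differentiable version would use an autodiff framework instead of NumPy.

```python
import numpy as np


def leray_project_2d(u, v):
    """Project a periodic 2D velocity field (u, v) onto its divergence-free part.

    u, v: real arrays of the same shape, with axis 0 as x and axis 1 as y.
    Assumes equal grid spacing in x and y, so only the direction of k matters.
    """
    n0, n1 = u.shape
    kx = np.fft.fftfreq(n0)[:, None]  # wavenumbers along x (up to a common scale)
    ky = np.fft.fftfreq(n1)[None, :]  # wavenumbers along y (same scale)
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0  # avoid 0/0; the mean mode has k = 0, so nothing is removed

    u_hat, v_hat = np.fft.fft2(u), np.fft.fft2(v)
    k_dot_u = kx * u_hat + ky * v_hat  # component of each mode along k
    u_hat = u_hat - kx * k_dot_u / k2
    v_hat = v_hat - ky * k_dot_u / k2
    return np.fft.ifft2(u_hat).real, np.fft.ifft2(v_hat).real


# Usage: a divergence-free field plus a pure gradient field on a 16x16 grid.
n = 16
x = 2 * np.pi * np.arange(n) / n
X, Y = np.meshgrid(x, x, indexing="ij")
u_df, v_df = np.sin(X) * np.cos(Y), -np.cos(X) * np.sin(Y)         # from a stream function
u_gr, v_gr = -np.sin(X) * np.cos(2 * Y), -2 * np.cos(X) * np.sin(2 * Y)  # gradient of cos(x)cos(2y)
pu, pv = leray_project_2d(u_df + u_gr, v_df + v_gr)
print(np.max(np.abs(pu - u_df)), np.max(np.abs(pv - v_df)))  # both near machine precision
```

In the usage lines, the projection removes the gradient part and returns the divergence-free part, which is the Helmholtz-Hodge split the abstract refers to.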

24. ❌ Robust Multilingual Text-to-Pictogram Mapping for Scalable Reading Rehabilitation

Authors: Soufiane Jhilal, Martina Galletti Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24536v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

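Rationale: The paper develops a multilingual, AI-powered interface that automatically adds visual scaffolding to text for children with special educational needs, which is an application of AI in education. The abstract mentions no specific large-model technique (such as LLMs, MoE, or SFT), model optimization method (such as quantization or inference acceleration), or advanced AI capability (such as reasoning or agents). The only relevant keyword is “AI for Science”: the study applies AI to special education, which can be viewed as a scientific application area, but it is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The study develops a multilingual AI system that automatically maps key concepts in text to contextually relevant pictograms to support reading rehabilitation for children with special educational needs. Across five languages, it validates high coverage, semantic appropriateness, and real-time interactive performance.

Abstract (Chinese translation)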

阅读理解对有特殊教育需求与残障（Special Educational Needs and Disabilities, SEND）的儿童构成显著挑战，通常需要密集的一对一阅读支持。为帮助治疗师扩大此类支持的规模，我们开发了一个多语言、人工智能驱动的交互界面，能自动为文本添加视觉支架。该系统动态识别关键概念，并将其映射到语境相关的象形图，从而支持跨语言学习者。我们通过多语言覆盖度分析、言语治疗师与特殊教育专家的临床评审，以及延迟评估，在五种类型学上差异显著的语言（英语、法语、意大利语、西班牙语和阿拉伯语）中对系统进行了评估。评估结果显示，在五种语言中系统均实现了较高的象形图覆盖率和视觉支架密度。专家评审表明，自动选择的象形图在语义上是恰当的，四种欧洲语言的正确与可接受评级合计超过95%，阿拉伯语虽因象形图库覆盖度较低，仍达到约90%。系统延迟保持在适合实时教育应用的交互阈值内。这些发现证明了自动化多模态支架在技术可行性、语义安全性和可接受性方面的潜力，有助于提升神经多样性学习者的学习可及性。

Abstract

Reading comprehension presents a significant challenge for children with Special Educational Needs and Disabilities (SEND), often requiring intensive one-on-one reading support. To assist therapists in scaling this support, we developed a multilingual, AI-powered interface that automatically enhances text with visual scaffolding. This system dynamically identifies key concepts and maps them to contextually relevant pictograms, supporting learners across languages. We evaluated the system across five typologically diverse languages (English, French, Italian, Spanish, and Arabic), through multilingual coverage analysis, expert clinical review by speech therapists and special education professionals, and latency assessment. Evaluation results indicate high pictogram coverage and visual scaffolding density across the five languages. Expert audits suggested that automatically selected pictograms were semantically appropriate, with combined correct and acceptable ratings exceeding 95% for the four European languages and approximately 90% for Arabic despite reduced pictogram repository coverage. System latency remained within interactive thresholds suitable for real-time educational use. These findings support the technical viability, semantic safety, and acceptability of automated multimodal scaffolding to improve accessibility for neurodiverse learners.

Keywords: multilingual text-to-pictogram mapping, reading rehabilitation, special educational needs, visual scaffolding, AI-powered interface, real-time educational use, semantic appropriateness, accessibility for neurodiverse learners

25. ❌ Representation Learning to Study Temporal Dynamics in Tutorial Scaffolding

Authors: Conrad Borchers, Jiayi Zhang, Ashish Gurung Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24535v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

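Rationale: The paper studies how to measure adaptive scaffolding in mathematics tutoring dialogue and proposes an embedding-based method for analyzing dialogue dynamics. It mentions the rise of remote human tutoring and large language model (LLM)-based systems, so it has some relevance to “Large Language Models” (5 points), but it does not examine the technical principles or novel applications of LLMs. The other keywords concern specific large-model techniques (such as MoE, quantization, and inference acceleration), training methods (such as pre-training, alignment, and RLHF), or specific application areas (such as AI for science), none of which the paper addresses, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes an embedding-based method for measuring adaptive scaffolding dynamics in mathematics tutoring dialogue. It finds systematic differences between tutors and students in semantic alignment with the task, and shows that role-specific semantic alignment predicts tutorial progression.

Abstract (Chinese translation)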

适应性支架能有效促进学习，但该领域仍缺乏在真实辅导对话中测量支架强度的可靠方法。随着远程人工辅导和基于大语言模型的系统兴起，这一研究缺口日益凸显。我们提出一种基于嵌入向量的分析方法，通过对齐对话轮次、问题陈述与正确答案的语义来解析支架动态。具体而言，我们通过计算辅导者与学习者的对话贡献同任务相关内容的余弦相似度来实现语义对齐操作化。我们将此框架应用于Eedi问题锚定辅导对话数据集中的1,576个真实数学辅导对话。分析揭示了任务对齐的系统性差异，以及参与者将对话内容锚定于问题与解决方案的独特时序模式。混合效应模型进一步表明，角色特异的语义对齐对辅导进程的预测力超越了消息顺序、长度等基线特征。辅导者在互动早期的对话内容表现出更强的问题锚定性，而学习者与解决方案的对齐程度则与辅导进程呈适度正相关。这些发现印证了支架教学是基于任务语义、持续且角色敏感的互动过程。通过捕捉随时间演变的角色特异性对齐，本方法为分析教学对话和评估会话式辅导系统提供了理论依据的测量工具。

Abstract

Adaptive scaffolding enhances learning, yet the field lacks robust methods for measuring it within authentic tutoring dialogue. This gap has become more pressing with the rise of remote human tutoring and large language model-based systems. We introduce an embedding-based approach that analyzes scaffolding dynamics by aligning the semantics of dialogue turns, problem statements, and correct solutions. Specifically, we operationalize alignment by computing cosine similarity between tutor and student contributions and task-relevant content. We apply this framework to 1,576 real-world mathematics tutoring dialogues from the Eedi Question Anchored Tutoring Dialogues dataset. The analysis reveals systematic differences in task alignment and distinct temporal patterns in how participants ground their contributions in problem and solution content. Further, mixed-effects models show that role-specific semantic alignment predicts tutorial progression beyond baseline features such as message order and length. Tutor contributions exhibited stronger grounding in problem content early in interactions. In contrast, student solution alignment was modestly positively associated with progression. These findings support scaffolding as a continuous, role-sensitive process grounded in task semantics. By capturing role-specific alignment over time, this approach provides a principled method for analyzing instructional dialogue and evaluating conversational tutoring systems.

Keywords: adaptive scaffolding, tutoring dialogue, embedding-based approach, semantic alignment, temporal dynamics, mathematics tutoring, Eedi dataset, mixed-effects models
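
The alignment measure described in the abstract above, cosine similarity between each dialogue turn and the task content, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the embeddings are assumed to come from some text embedding model that is not specified here, and `cosine_similarity` and `alignment_profile` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_profile(turns, problem_vec, solution_vec):
    """Per-turn alignment with the problem statement and the correct solution.

    turns: list of (role, embedding) pairs in dialogue order, e.g. ("tutor", [...]).
    Returns a list of (turn_index, role, problem_alignment, solution_alignment).
    """
    return [
        (i, role, cosine_similarity(vec, problem_vec), cosine_similarity(vec, solution_vec))
        for i, (role, vec) in enumerate(turns)
    ]
```

Keeping the role and turn index alongside each similarity is what allows role-specific, time-ordered comparisons of the kind the abstract describes, for example as inputs to a mixed-effects model.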

26. ❌ AMIF: Authorizable Medical Image Fusion Model with Built-in Authentication

Authors: Jie Song, Jun Jia, Wei Sun, Wangqiu Zhou, Tao Tan, Guangtao Zhai Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24296v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

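Scoring rationale: The paper "AMIF: Authorizable Medical Image Fusion Model with Built-in Authentication" focuses on copyright protection for medical image fusion models and belongs to computer vision and medical image processing. Its core contribution is a medical image fusion model with a built-in authentication mechanism that protects intellectual property by embedding copyright identifiers. The keywords all concern large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on) or general AI-for-science applications. The paper is entirely unrelated to these core LLM techniques, so every keyword except the last scores 0. The last keyword, “AI for Science” OR “Bioinformatics” OR “Cheminformatics”, scores 5 because the paper applies AI to medical imaging (a field adjacent to bioinformatics), though this is only the application setting and not the core innovation.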
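The arithmetic behind the pass/fail mark can be sketched as below. The threshold formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration; the per-keyword rule `weight × relevance` is an assumption inferred from the table rows (for example, weight 1.0 and relevance 5.0 give a score of 5.0), and all function names are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap from the report configuration


def pass_threshold(total_weight):
    """Passing score as configured: 5.0 + 0.8 x total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight, relevance):
    """Assumed rule: weight x relevance (0-10), capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_PER_KEYWORD)


def total_score(rows):
    """Sum keyword scores over (weight, relevance) pairs."""
    return sum(keyword_score(weight, relevance) for weight, relevance in rows)


# 27 keywords at weight 1.0; only the last has non-zero relevance,
# mirroring the AMIF table above.
rows = [(1.0, 0.0)] * 26 + [(1.0, 5.0)]
score = total_score(rows)
threshold = pass_threshold(sum(weight for weight, _ in rows))
passed = score >= threshold
```

With 27 keywords at weight 1.0 the threshold works out to 26.6, so a single keyword at relevance 5 leaves the paper well below the pass mark.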

!!! tip deepseek-chat TL;DR

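This paper addresses the lack of intellectual property protection in existing medical image fusion models by proposing AMIF, an authorizable medical image fusion model with built-in authentication, which embeds visible copyright identifiers into fusion results to deter unauthorized use.

Abstract (Chinese translation)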

多模态图像融合技术能够实现病灶的精准定位与特征描述，从而辅助精确诊断，强化临床决策支持，这使其在医学影像研究中日益受到重视。一个强大的多模态图像融合模型依赖于高质量、具有临床代表性的多模态训练数据以及经过严格设计的模型架构。因此，此类专业影像组学模型的开发是标准化数据采集、临床专业知识与算法设计能力共同协作的成果，其相关知识产权需要得到保护。然而，当前的多模态图像融合模型在生成融合结果时缺乏内置的知识产权保护机制，无意中通过推理泄露暴露了专有模型知识与敏感训练数据。例如，恶意用户可能利用融合输出，通过模型蒸馏或其他基于推理的反向工程技术来模仿专有模型的融合性能。为解决这一问题，我们提出了AMIF（Authorizable Medical Image Fusion），首个具备内置认证功能的可授权医学图像融合模型，它将授权访问控制集成到图像融合目标中。对于未经授权的使用，AMIF会在融合结果中嵌入显式且可见的版权标识符；反之，通过基于密钥的成功认证后，用户方可获得高质量的融合结果。

Abstract

Multimodal image fusion enables precise lesion localization and characterization for accurate diagnosis, thereby strengthening clinical decision-making and driving its growing prominence in medical imaging research. A powerful multimodal image fusion model relies on high-quality, clinically representative multimodal training data and a rigorously engineered model architecture. Therefore, the development of such professional radiomics models represents a collaborative achievement grounded in standardized acquisition, clinical-specific expertise, and algorithmic design proficiency, which necessitates protection of associated intellectual property rights. However, current multimodal image fusion models generate fused outputs without built-in mechanisms to safeguard intellectual property rights, inadvertently exposing proprietary model knowledge and sensitive training data through inference leakage. For example, malicious users can exploit fusion outputs and model distillation or other inference-based reverse engineering techniques to approximate the fusion performance of proprietary models. To address this issue, we propose AMIF, the first Authorizable Medical Image Fusion model with built-in authentication, which integrates authorization access control into the image fusion objective. For unauthorized usage, AMIF embeds explicit and visible copyright identifiers into fusion results. In contrast, high-quality fusion results are accessible upon successful key-based authentication.

Keywords: Medical Image Fusion, Authorization, Authentication, Intellectual Property Protection, Copyright Identifier, Multimodal Imaging, Model Security, Radiomics

27. ❌ PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation

Authors: Yuheng Feng, Wen Zhang, Haodong Duan, Xingxing Zou Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24078v1

Score: 5.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

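Scoring rationale: The paper "PosterIQ" studies poster understanding and generation and belongs to the vision-language multimodal field. It builds a benchmark dataset of image annotations and generation prompts and evaluates state-of-the-art multimodal large language models (MLLMs) and diffusion-based generators. Its core is design-driven visual understanding and generation rather than innovation in the principles of large-model technology. Most keywords (MoE, Scaling Laws, RLHF, PEFT, etc.) are therefore entirely unrelated and score 0. The only relevant keyword is “Large Language Models” OR “LLMs” OR “Foundation Models”: the paper evaluates MLLMs, which are typically built on LLMs or foundation models, but it treats them only as evaluation subjects and does not examine LLM technical details, so the keyword receives 5 (some relevance). Other keywords such as AI for Science score 0 because the paper concerns design, not an application of AI to science.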

!!! tip deepseek-chat TL;DR

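This paper proposes PosterIQ, a design-driven benchmark for poster understanding and generation. By evaluating multimodal large language models and diffusion-based generators, it exposes their gaps in visual hierarchy, typographic semantics, and intention communication, with the aim of integrating human-centred design principles into generative vision-language systems.

Abstract (Chinese translation)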

我们提出PosterIQ，一个面向海报理解与生成的设计驱动型基准数据集，其标注涵盖构图结构、版式层级与语义意图三个维度。该数据集包含7,765个图像-标注实例及822个生成提示，覆盖真实场景、专业设计及合成案例。为连接视觉设计认知与生成建模，我们定义了布局解析、图文对应、版式/可读性与字体感知、设计质量评估，以及基于隐喻的可控构图感知生成等任务。通过对前沿多模态大语言模型与扩散生成模型的评估，我们发现现有模型在视觉层级理解、版式语义感知、显著性控制及意图传达方面仍存在明显不足：商业模型在高层推理任务上表现领先，但其评估机制对设计细节敏感度不足；生成模型虽能较好渲染文字，却在构图感知合成方面存在困难。深入分析表明，PosterIQ既可作为量化评估基准，也能作为设计推理的诊断工具，提供可复现的、面向具体任务的评估指标。我们期望以此推动生成模型的创造力发展，并将以人为本的设计原则融入视觉-语言生成系统。

Abstract

We present PosterIQ, a design-driven benchmark for poster understanding and generation, annotated across composition structure, typographic hierarchy, and semantic intent. It includes 7,765 image-annotation instances and 822 generation prompts spanning real, professional, and synthetic cases. To bridge visual design cognition and generative modeling, we define tasks for layout parsing, text-image correspondence, typography/readability and font perception, design quality assessment, and controllable, composition-aware generation with metaphor. We evaluate state-of-the-art MLLMs and diffusion-based generators, finding persistent gaps in visual hierarchy, typographic semantics, saliency control, and intention communication; commercial models lead on high-level reasoning but act as insensitive automatic raters, while generators render text well yet struggle with composition-aware synthesis. Extensive analyses show PosterIQ is both a quantitative benchmark and a diagnostic tool for design reasoning, offering reproducible, task-specific metrics. We aim to catalyze models’ creativity and integrate human-centred design principles into generative vision-language systems.

Keywords: PosterIQ, poster understanding, poster generation, multimodal large language models, diffusion-based generators, design benchmark, visual hierarchy, typographic semantics

28. ❌ EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Authors: Falong Fan, Yi Xie, Arnis Lektauers, Bo Liu, Jerzy Rozenblit Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24577v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

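Scoring rationale: The paper focuses on surgical 3D reconstruction within computer vision and graphics, using a GNN and attention mechanisms to handle soft-tissue deformation. It is entirely unrelated to the vast majority of large-model keywords (LLM, MoE, RLHF, RAG, etc.). It has some relevance only to 'AI for Science', since it involves medical image analysis, an application of AI in science, but this is not its core content, so that keyword is rated 5.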

!!! tip deepseek-chat TL;DR

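This paper proposes EndoVGGT, a geometry-centric framework whose DeGAT module dynamically constructs feature-space semantic graphs. It addresses the fragmentation of geometric continuity caused by low-texture surfaces, specular highlights, and instrument occlusions in surgical soft-tissue 3D reconstruction, significantly improves reconstruction fidelity on the SCARED dataset, and shows strong zero-shot cross-dataset generalization.

Abstract (Chinese translation)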

可变形软组织的精确三维重建对于手术机器人感知至关重要。然而，低纹理表面、镜面高光和器械遮挡常常破坏几何连续性，这对现有的固定拓扑方法构成了挑战。为此，我们提出了EndoVGGT，这是一个以几何为中心的框架，配备了一个变形感知图注意力模块。DeGAT模块不依赖静态空间邻域，而是动态构建特征空间语义图，以捕捉连贯组织区域之间的长程关联。这使得结构线索能够在遮挡区域间进行鲁棒传播，从而保证全局一致性并改善非刚性变形恢复。在SCARED数据集上的大量实验表明，我们的方法显著提高了重建保真度，与先前的最先进技术相比，PSNR提升了24.6%，SSIM提升了9.1%。至关重要的是，EndoVGGT在未见过的SCARED和EndoNeRF数据集上展现出强大的零样本跨数据集泛化能力，证实了DeGAT学习到的是与领域无关的几何先验。这些结果凸显了动态特征空间建模在实现一致性手术三维重建方面的有效性。

Abstract

Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.

Keywords: 3D reconstruction, surgical robotics, graph neural networks, deformation-aware attention, feature-space semantic graphs, zero-shot generalization, soft tissue, geometric consistency
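For readers unfamiliar with the PSNR metric quoted in the abstract, the standard definition, 10 · log10(MAX² / MSE), can be sketched as follows. This is the textbook formula only; the abstract does not describe the paper's evaluation protocol, and the flat-list pixel representation and 8-bit `max_value` are assumptions made for illustration.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical pixel values, for illustration only.
close_match = psnr([100, 100], [110, 90])
worst_case = psnr([0, 255], [255, 0])
```

Higher values indicate a closer match to the reference, and the scale is logarithmic in the mean squared error.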

29. ❌ The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Authors: Biplab Pal, Santanu Bhattacharya Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24582v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

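Scoring rationale: The paper studies reliability auditing and oversight cost for agentic AI in organizations and proposes a measurement approach based on a Markov framework. It is unrelated to most keywords, which concern large-model technical details (training methods, architecture optimization, inference techniques, etc.), whereas this paper focuses on a reliability assessment framework for agentic AI and does not involve concrete large-model implementations. It has some relevance (5 points) only to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' and 'Tool Use OR Function Calling OR API Tool Use', because it discusses agentic workflows and tool calls, but it does not explicitly involve large language models or specific technical implementations.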

!!! tip deepseek-chat TL;DR

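The paper proposes a Markov-framework-based measurement approach for assessing the reliability of agentic AI in workflows and computing expected oversight cost. Empirically, it finds that refining the operational state can substantially increase state-action blind mass.

Abstract (Chinese translation)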

组织中的代理人工智能（AI）是一个受可靠性与监督成本约束的序列决策问题。当确定性工作流被基于行动与工具调用的随机策略取代时，关键问题不在于下一步是否看似合理，而在于由此产生的轨迹是否在统计上可支持、局部无歧义且在经济上可管控。为此，我们构建了一个测度论的马尔可夫框架。其核心量包括：状态盲点质量 B_n(tau)、状态-行动盲质量 B^SA_{pi,n}(tau)、一个基于熵的人机协同升级门控机制，以及基于工作流访问测度的期望监督成本恒等式。
我们在2019年业务流程智能挑战赛的采购到付款日志（251,734个案例，1,595,923个事件，42种不同的工作流动作）上实例化了该框架，并利用同一流程按时间顺序划分的80/20数据构建了一个日志驱动的模拟代理。主要实证发现是：一个大型工作流可能在状态层面看似得到良好支持，却在下一步决策上保留显著的盲质量——将操作状态细化为包含案例上下文、经济规模与执行者类别后，状态空间从42扩展至668，状态-行动盲质量在 tau=50 时从0.0165上升至 tau=1000 时的0.1253。在保留的测试集上，m(s) = max_a pi-hat(a|s) 所预测的自主步骤准确率与实际观测值的平均偏差在3.4个百分点以内。
界定统计可信自主性的同一组量也决定了期望监督负担。本框架已在一个大规模企业采购工作流中得到验证，其设计可直接应用于具备操作事件日志的工程流程。

Abstract

Agentic artificial intelligence (AI) in organizations is a sequential decision problem constrained by reliability and oversight cost. When deterministic workflows are replaced by stochastic policies over actions and tool calls, the key question is not whether a next step appears plausible, but whether the resulting trajectory remains statistically supported, locally unambiguous, and economically governable. We develop a measure-theoretic Markov framework for this setting. The core quantities are state blind-spot mass B_n(tau), state-action blind mass B^SA_{pi,n}(tau), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure. We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process. The main empirical finding is that a large workflow can appear well supported at the state level while retaining substantial blind mass over next-step decisions: refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668 and raises state-action blind mass from 0.0165 at tau=50 to 0.1253 at tau=1000. On the held-out split, m(s) = max_a pi-hat(a|s) tracks realized autonomous step accuracy within 3.4 percentage points on average. The same quantities that delimit statistically credible autonomy also determine expected oversight burden. The framework is demonstrated on a large-scale enterprise procurement workflow and is designed for direct application to engineering processes for which operational event logs are available.

Keywords: Agentic Artificial Intelligence, Markov Framework, Reliability Auditing, Oversight Cost, Workflow Analysis, Stochastic Policies, State-Action Blind Mass, Enterprise Procurement
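Two of the quantities named in the abstract can be illustrated with a small sketch: the confidence m(s) = max_a pi-hat(a|s) estimated from logged (state, action) pairs, and an entropy-based escalation gate. The empirical-frequency estimator, the natural-log entropy, the `max_entropy` threshold, the rule of escalating unseen states, and the state and action labels are all assumptions made for illustration; the abstract does not specify the paper's exact definitions.

```python
import math
from collections import Counter, defaultdict


def estimate_policy(transitions):
    """Empirical pi-hat(a|s) from logged (state, action) pairs."""
    counts = defaultdict(Counter)
    for state, action in transitions:
        counts[state][action] += 1
    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = {a: c / total for a, c in action_counts.items()}
    return policy


def confidence(policy, state):
    """m(s) = max_a pi-hat(a|s)."""
    return max(policy[state].values())


def entropy(policy, state):
    """Shannon entropy (nats) of the next-action distribution at a state."""
    return -sum(p * math.log(p) for p in policy[state].values() if p > 0)


def should_escalate(policy, state, max_entropy):
    """Assumed gate: hand off to a human if the state is unseen or too ambiguous."""
    if state not in policy:
        return True
    return entropy(policy, state) > max_entropy


# Hypothetical log: state and action labels are placeholders.
log = [("s1", "approve")] * 3 + [("s1", "reject")] + [("s2", "approve")]
policy = estimate_policy(log)
```

In this toy log, state `s1` has m(s) = 0.75 and an entropy of about 0.56 nats, so a gate set at 0.5 would escalate it, while the deterministic state `s2` would run autonomously.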

30. ❌ Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

Authors: Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Ying Sun, Yang Xiao, Yuhang Han, Jianfei Yang Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24576v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

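Scoring rationale: The paper focuses on memory mechanisms for robotic manipulation, proposing Chameleon, a memory system based on geometry-grounded multimodal tokens, and creating a real-robot dataset. Although it involves AI applied to robotics, its content has no direct connection to any of the scoring keywords (which center on large language model techniques, training methods, inference techniques, alignment, etc.), and it mentions no language models, training techniques, or related AI methods, so every keyword has relevance 0.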

!!! tip deepseek-chat TL;DR

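This paper studies the problem that, in long-horizon robotic manipulation, perceptual aliasing lets the same observation correspond to different interaction histories. It proposes Chameleon, a geometry-grounded multimodal memory system inspired by human episodic memory, and shows on a real-robot dataset that it significantly improves decision reliability and long-horizon control.

Abstract (Chinese translation)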

机器人操作常需依赖记忆：遮挡与状态变化可能导致决策时刻的观测存在感知混淆，使得在观测层面上动作选择呈现非马尔可夫性——相同的观测可能源于不同的交互历史。多数具身智能体通过语义压缩轨迹和基于相似性的检索来实现记忆，这种方法会丢弃用于消除歧义的细粒度感知线索，并可能返回感知相似但与决策无关的历史片段。受人类情景记忆启发，我们提出Chameleon系统，该系统通过写入几何基础的多模态标记来保留消除歧义的上下文信息，并借助可微分记忆栈实现目标导向的回忆。我们还引入了Camo-Dataset，这是一个基于真实机器人UR5e的数据集，涵盖感知混淆情境下的情景回忆、空间追踪与序列化操作任务。在多项任务中，Chameleon在感知易混淆环境中持续提升了决策可靠性与长程控制能力，其性能显著优于现有强基线模型。

Abstract

Robotic manipulation often requires memory: occlusion and state changes can make decision-time observations perceptually aliased, making action selection non-Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discards disambiguating fine-grained perceptual cues and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and produces goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.

Keywords: robotic manipulation, episodic memory, perceptual aliasing, multimodal tokens, long-horizon control, decision reliability, geometry-grounded, differentiable memory stack

31. ❌ Completeness of Unbounded Best-First Minimax and Descent Minimax

Authors: Quentin Cohen-Solal Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24572v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

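Scoring rationale: The paper studies search algorithms for two-player perfect information games (Unbounded Best-First Minimax and Descent Minimax) and belongs to classical algorithm theory. It concentrates on proving algorithm completeness and on experimental validation, and does not involve large models, deep learning, AI for Science, or any modern large-model keyword. All keywords relate to the principles, training methods, application areas, or associated techniques (reasoning, alignment, compression, etc.) of large models, whereas this is purely theoretical algorithm research and is entirely unrelated to them.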

!!! tip deepseek-chat TL;DR

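This paper settles the open question of whether the Unbounded Best-First Minimax and Descent Minimax algorithms can always determine a winning strategy given unlimited search time, showing through theoretical proof and experiments that the improved algorithms are complete.

Abstract (Chinese translation)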

本文聚焦于双人完全信息博弈的搜索算法，其目标是确定最优策略，理想情况下是必胜策略。
遗憾的是，现有文献中的部分博弈搜索算法即使在无限搜索时间下，也无法始终确定必胜策略。例如，无界最佳优先极小化极大算法（Unbounded Best-First Minimax）和下降极小化极大算法（Descent Minimax）便是如此，它们是当前先进的无知识强化学习中的核心算法。
这些算法随后通过所谓的“补全技术”进行了改进。然而，该技术是否足以使这些算法始终确定必胜策略，此前一直是一个悬而未决的问题。
为回答这一问题，我们对这两种算法（采用补全技术的版本）进行了推广，并证明了此类算法中的任意算法均可计算出最优策略。
最后，我们通过实验证明，补全技术确实提升了获胜性能。

Abstract

In this article, we focus on search algorithms for two-player perfect information games, whose objective is to determine the best possible strategy, and ideally a winning strategy. Unfortunately, some search algorithms for games in the literature are not able to always determine a winning strategy, even with an infinite search time. This is the case, for example, of the following algorithms: Unbounded Best-First Minimax and Descent Minimax, which are core algorithms in state-of-the-art knowledge-free reinforcement learning. They were then improved with the so-called completion technique. However, whether this technique sufficiently improves these algorithms to allow them to always determine a winning strategy remained an open question until now. To answer this question, we generalize the two algorithms (their versions using the completion technique), and we show that any algorithm of this class of algorithms computes the best strategy. Finally, we experimentally show that the completion technique improves winning performance.

Keywords: search algorithms, two-player perfect information games, winning strategy, Unbounded Best-First Minimax, Descent Minimax, completion technique, algorithm completeness, reinforcement learning

32. ❌ VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Authors: Qijia He, Xunmei Liu, Hammaad Memon, Ziang Li, Zixian Ma, Jaemin Cho, Jason Ren, Daniel S Weld, Ranjay Krishna Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24575v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

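Scoring rationale: The paper proposes VFIG, a vision-language model for converting complex figures to SVG, an applied innovation of large models in a specific domain (graphics processing). Core relevant keywords: 1) “Supervised Fine-tuning (SFT)” (10 points): the paper explicitly uses SFT as a training method; 2) “Large Language Models” (5 points): VFIG is a vision-language model, a kind of large model; 3) “Scaling Laws AND Data Quality” (5 points): the paper builds the large-scale dataset VFIG-DATA (66K pairs) and attends to data quality; 4) “AI for Science” (5 points): the paper processes figures from scientific papers, an application of AI in science. Other keywords such as MoE, RLHF, and RAG are not involved and score 0.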

!!! tip deepseek-chat TL;DR

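This paper tackles the problem of reconstructing editable vector graphics (SVG) from rasterized figures (such as PNG). By proposing the VFIG vision-language models, building the large-scale dataset VFIG-DATA, and adopting a training curriculum that moves from SFT to reinforcement learning, it achieves state-of-the-art figure-to-SVG conversion performance among open-source models.

Abstract (Chinese translation)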

可缩放矢量图形（SVG）是技术插图和数字设计的关键格式，具有精确的分辨率独立性和灵活的语义可编辑性。然而在实际应用中，原始矢量源文件常常丢失或无法获取，仅留下难以修改或缩放的“扁平化”栅格化版本（如PNG或JPEG）。手动重建这些图形是极其耗费人力的过程，需要专业知识才能还原原始几何意图。为弥合这一鸿沟，我们提出了VFIG——一个专为复杂高保真图形到SVG转换而训练的视觉-语言模型系列。尽管该任务本质上是数据驱动的，但现有数据集通常规模较小且缺乏专业图表的复杂性。为此我们引入了VFIG-DATA，这是一个包含6.6万组高质量图形-SVG配对的大规模数据集，其内容来源于真实学术论文图表与程序生成图表的多样化混合。基于SVG由重复图元和层次化局部结构组成的特点，我们提出了由粗到精的训练课程：首先通过监督微调（SFT）学习原子图元，继而采用强化学习（RL）优化阶段来提升整体图表保真度、布局一致性和拓扑边缘案例处理能力。最后，我们建立了VFIG-BENCH综合评估体系，其中包含专门设计用于衡量复杂图形结构完整性的新型指标。VFIG在开源模型中实现了最先进的性能，与GPT-5.2表现相当，在VFIG-BENCH上获得了0.829的VLM-Judge评分。

Abstract

Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only “flat” rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.

Keywords: Vision-Language Models, SVG conversion, supervised fine-tuning (SFT), reinforcement learning (RL), large-scale dataset, figure reconstruction, vector graphics, evaluation benchmark

33. ❌ Anti-I2V: Safeguarding your photos from malicious image-to-video generation

Authors: Duc Vu, Anh Nguyen, Chi Tran, Anh Tran Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24570v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

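Scoring rationale: The paper studies an adversarial defense that protects photos from misuse by malicious image-to-video generation models, which places it in computer vision and AI safety. Every keyword targets large language models (LLMs) and related techniques (training, inference, alignment, applications), whereas this paper focuses on defending against video generation by diffusion models, particularly DiT architectures, and does not involve LLMs or related techniques. All keywords therefore score 0.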

!!! tip deepseek-chat TL;DR

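The paper proposes Anti-I2V, a new method that operates in the L*a*b* color space and the frequency domain and designs its training objectives around key network layers. It defends effectively against malicious diffusion-based image-to-video generation and achieves state-of-the-art defense performance across a range of video diffusion models.

Abstract (Chinese translation)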

基于扩散的视频生成模型虽显著提升了人体动画质量，却同时带来了滥用风险：仅凭特定人物的单张照片和文本提示即可伪造虚假视频。近期研究集中于通过对抗性攻击引入人工扰动，以保护图像免受扩散模型滥用。然而，现有方法大多针对图像生成领域，明确针对图像到视频扩散模型（VDMs）的研究相对有限，且主要集中于基于UNet的架构。由于扩散Transformer（DiT）模型凭借更大参数量与先进注意力机制展现出更强的特征保持能力与时间一致性，现有方法对其防御效果尚未得到充分探索。为此，我们提出Anti-I2V，一种适用于多种扩散骨干网络的新型防御方法，专门针对恶意的人体图像到视频生成任务。区别于将噪声更新局限于RGB空间的做法，Anti-I2V同时在$L^*a^*b^*$色彩空间与频域进行操作，从而提升鲁棒性并聚焦于显著性像素区域。我们进一步识别了去噪过程中最能捕捉关键语义特征的网络层，据此设计训练目标以最大化破坏生成视频的时间连贯性与保真度。大量实验验证表明，Anti-I2V在针对多种视频扩散模型的防御任务中取得了最先进的性能，为该问题提供了有效的解决方案。

Abstract

Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person’s photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the $L^*a^*b^*$ and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.

Keywords: adversarial defense, image-to-video generation, diffusion models, Diffusion Transformer (DiT), temporal coherence, generation fidelity, video diffusion models (VDMs), malicious video generation

34. ❌ The Free-Market Algorithm: Self-Organizing Optimization for Open-Ended Complex Systems

Authors: Martin Jaraiz Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24559v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

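Scoring rationale: The paper proposes the Free-Market Algorithm (FMA), a metaheuristic inspired by free-market economics, for self-organizing optimization of open-ended complex systems, and validates it in chemistry and economics. Every keyword targets deep learning, large-model techniques, or AI methods, whereas the core of this paper is an optimization algorithm and complex-systems simulation rather than deep learning or large-model research. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the algorithm is applied to chemistry (amino acid and nucleobase synthesis) and economics (GDP forecasting), which counts as AI applied to science but not as a deep learning or large-model method, so it receives 5 points (some relevance). Other keywords such as LLMs, MoE, training methods, reasoning techniques, and agent systems are not addressed and score 0.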

!!! tip deepseek-chat TL;DR

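The paper proposes the Free-Market Algorithm (FMA), a metaheuristic inspired by free-market economics for self-organizing optimization of open-ended complex systems, and validates it on chemical synthesis and macroeconomic forecasting.

Abstract (Chinese translation)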

我们提出自由市场算法（Free-Market Algorithm, FMA），这是一种受自由市场经济启发的新型元启发式算法。与遗传算法、粒子群优化和模拟退火等需要预设适应度函数和固定搜索空间的方法不同，FMA采用分布式供需动态机制，其中适应度是涌现的，搜索空间是开放式的，解决方案以分层路径网络的形式呈现。自主代理发现规则、交易商品、开设与关闭企业、并在无中央控制器的情况下竞争需求。
FMA通过三层架构运行：通用市场机制（供给、需求、竞争、选择）、可插拔的领域特定行为规则，以及领域特定观察。市场机制在不同应用中保持一致，仅行为规则发生变化。
该算法在两个无关领域得到验证。在生命起源前化学中，从900个基础原子（C、H、O、N）出发，FMA在笔记本电脑上5分钟内发现了全部12种可行氨基酸分子式、全部5种核苷碱基、甲醛聚糖链以及克雷布斯循环中间体——每种产物最多产生240条独立合成路径。在宏观经济预测中，仅读取单一投入产出表且无需估计任何参数，FMA对非危机时期GDP预测的平均绝对误差达到0.42个百分点，与专业预测机构水平相当，并可推广至33个国家。
组装理论（Assembly Theory）对齐表明，FMA为Sharma等人（《自然》，2023年）描述的选择特征提供了首个可显式调控的机制。其事件驱动的组装动力学与物理学基础理论——因果集理论、关系量子力学、构造器理论——产生共鸣，暗示达尔文式市场动态可能反映了导致自然本身展开的更深层组织原则。

Abstract

We introduce the Free-Market Algorithm (FMA), a novel metaheuristic inspired by free-market economics. Unlike Genetic Algorithms, Particle Swarm Optimization, and Simulated Annealing – which require prescribed fitness functions and fixed search spaces – FMA uses distributed supply-and-demand dynamics where fitness is emergent, the search space is open-ended, and solutions take the form of hierarchical pathway networks. Autonomous agents discover rules, trade goods, open and close firms, and compete for demand with no centralized controller. FMA operates through a three-layer architecture: a universal market mechanism (supply, demand, competition, selection), pluggable domain-specific behavioral rules, and domain-specific observation. The market mechanism is identical across applications; only the behavioral rules change. FMA is validated in two unrelated domains. In prebiotic chemistry, starting from 900 bare atoms (C, H, O, N), FMA discovers all 12 feasible amino acid formulas, all 5 nucleobases, the formose sugar chain, and Krebs cycle intermediates in under 5 minutes on a laptop – with up to 240 independent synthesis routes per product. In macroeconomic forecasting, reading a single input-output table with zero estimated parameters, FMA achieves a Mean Absolute Error of 0.42 percentage points for non-crisis GDP prediction, comparable to professional forecasters and portable to 33 countries. Assembly Theory alignment shows that FMA provides the first explicit, tunable mechanism for the selection signatures described by Sharma et al. (Nature, 2023). The event-driven assembly dynamics resonate with foundational programs in physics – causal set theory, relational quantum mechanics, constructor theory – suggesting that Darwinian market dynamics may reflect a deeper organizational principle that leads to the unfolding of Nature itself.

Keywords: Free-Market Algorithm, metaheuristic, self-organizing optimization, open-ended complex systems, autonomous agents, prebiotic chemistry, macroeconomic forecasting, Assembly Theory

35. ❌ Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents

Authors: Samuel Taiwo, Mohd Amaluddin Yusoff Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24556v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

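Scoring rationale: The paper's core topic is the RAG framework applied to oil and gas enterprise documents, so it is highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (10 points): the abstract names the RAG framework explicitly and studies its document chunking strategies. It is fairly relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points) because the paper presents RAG as a way to address LLM limitations. It is weakly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points) because oil and gas can be seen as a subfield of scientific applications, although the paper does not mention bioinformatics or cheminformatics. The remaining keywords are unrelated to the paper and score 0.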

!!! tip deepseek-chat TL;DR

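The paper studies how different document chunking strategies affect retrieval-augmented generation (RAG) performance on oil and gas enterprise documents. It finds that structure-aware chunking beats the other methods on retrieval effectiveness and computational cost, but that all methods have limited effectiveness on visually and spatially encoded documents such as P&ID diagrams.

Abstract (Chinese translation)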

检索增强生成（Retrieval-Augmented Generation，简称RAG）已成为应对大语言模型（Large Language Models，简称LLMs）局限性的重要框架。然而，其效能根本上取决于文档分块，这是一个常被忽视的质量决定因素。本文通过实证研究，量化了四种分块策略的性能差异：固定尺寸滑动窗口法、递归法、基于断点的语义分块法以及结构感知分块法。我们使用一个包含油气企业文档的专有语料库对这些方法进行了评估，语料涵盖文本密集的手册、表格繁多的技术规范以及管道与仪表流程图（Piping and Instrumentation Diagrams，简称P&IDs）。研究结果表明，结构感知分块法在整体检索效能上表现更优，尤其在Top-K指标上，且其计算成本显著低于语义分块法或基线策略。关键的是，所有四种方法在处理P&IDs时均表现出有限的效果，这凸显了纯文本RAG在处理视觉与空间编码文档时的核心局限。我们的结论是，尽管在专业领域中保持显式结构至关重要，但未来的研究必须整合多模态模型以克服当前的局限性。

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet, its effectiveness fundamentally hinges on document chunking, an often-overlooked determinant of its quality. This paper presents an empirical study quantifying performance differences across four chunking strategies: fixed-size sliding window, recursive, breakpoint-based semantic, and structure-aware. We evaluated these methods using a proprietary corpus of oil and gas enterprise documents, including text-heavy manuals, table-heavy specifications, and piping and instrumentation diagrams (P&IDs). Our findings show that structure-aware chunking yields higher overall retrieval effectiveness, particularly in top-K metrics, and incurs significantly lower computational costs than semantic or baseline strategies. Crucially, all four methods demonstrated limited effectiveness on P&IDs, underscoring a core limitation of purely text-based RAG within visually and spatially encoded documents. We conclude that while explicit structure preservation is essential for specialised domains, future work must integrate multimodal models to overcome current limitations.

Keywords: Retrieval-Augmented Generation, RAG, document chunking, oil and gas, enterprise documents, structure-aware chunking, retrieval effectiveness, P&ID diagrams
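
The fixed-size sliding window strategy named in the abstract can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function name `sliding_window_chunks`, the whitespace tokenization, and the `size` and `overlap` defaults are all assumptions chosen for the example.

```python
def sliding_window_chunks(text, size=200, overlap=50):
    """Split text into overlapping chunks of at most `size` whitespace tokens.

    Hypothetical helper: the paper does not state its tokenizer or window sizes.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap  # how far the window advances on each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps some shared context on both sides of every boundary. A structure-aware chunker would instead cut at headings, table boundaries, or similar document markers.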

36. ❌ A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English

Authors: Dana Serditova, Kevin Tang Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24549v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

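Scoring rationale: The paper studies bias in automatic speech recognition (ASR) systems on a dialect (Newcastle English), using sociolinguistic analysis to evaluate the transcription errors of a commercial ASR system. Every scoring keyword concerns large models, the technical principles of deep learning, or AI applications in science, whereas this paper focuses on evaluating a speech recognition system and on sociolinguistic analysis. It does not involve large-model techniques, deep learning innovations, or AI applications in scientific fields such as bioinformatics or cheminformatics, so every keyword has a relevance of 0.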

!!! tip deepseek-chat TL;DR

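Through sociolinguistic analysis, the study shows that a commercial automatic speech recognition system is systematically biased when processing the Newcastle English dialect: phonological variation is the main source of errors, and error rates differ across social groups (for example by gender and age).

Abstract (Chinese translation)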

自动语音识别系统在日常交流、教育、医疗和工业领域得到广泛应用，但其对不同说话者的识别性能仍存在差异，尤其在方言变体与训练数据所代表的主流口音不同时更为明显。本研究通过对纽卡斯尔英语（英格兰东北部的一种地区方言）进行社会语言学分析，探讨自动语音识别系统的偏见问题。该方言已被证明对当前语音识别技术构成挑战。利用泰恩赛德英语历时电子语料库中的自然口语材料，我们评估了一款先进商业自动语音识别系统的输出结果，并对超过3000条转录错误进行了细粒度分析。错误按语言学领域分类，并结合性别、年龄和社会经济地位等社会变量进行考察。此外，通过对选定元音特征的声学案例研究，揭示了渐变式语音变异如何直接导致识别错误。
结果表明，语音变异是错误的主要来源，反复出现的识别失败与方言特有特征（如元音音质和喉塞音）、地方词汇及非标准语法形式相关。错误率在不同社会群体间也存在差异，男性和年龄谱两端的说话者表现出更高的错误频率。这些发现表明，自动语音识别错误并非随机产生，而是具有社会模式性，可以从社会语言学角度进行解释。因此，本研究论证了将社会语言学专业知识纳入语音技术评估与开发的重要性，并指出要建立更公平的自动语音识别系统，必须明确关注方言变异和基于社群的语音数据。

Abstract

Automatic Speech Recognition (ASR) systems are widely used in everyday communication, education, healthcare, and industry, yet their performance remains uneven across speakers, particularly when dialectal variation diverges from the mainstream accents represented in training data. This study investigates ASR bias through a sociolinguistic analysis of Newcastle English, a regional variety of North-East England that has been shown to challenge current speech recognition technologies. Using spontaneous speech from the Diachronic Electronic Corpus of Tyneside English (DECTE), we evaluate the output of a state-of-the-art commercial ASR system and conduct a fine-grained analysis of more than 3,000 transcription errors. Errors are classified by linguistic domain and examined in relation to social variables including gender, age, and socioeconomic status. In addition, an acoustic case study of selected vowel features demonstrates how gradient phonetic variation contributes directly to misrecognition. The results show that phonological variation accounts for the majority of errors, with recurrent failures linked to dialect-specific features like vowel quality and glottalisation, as well as local vocabulary and non-standard grammatical forms. Error rates also vary across social groups, with higher error frequencies observed for men and for speakers at the extremes of the age spectrum. These findings indicate that ASR errors are not random but socially patterned and can be explained from a sociolinguistic perspective. Thus, the study demonstrates the importance of incorporating sociolinguistic expertise into the evaluation and development of speech technologies and argues that more equitable ASR systems require explicit attention to dialectal variation and community-based speech data.

Keywords: Automatic Speech Recognition, ASR bias, sociolinguistic analysis, Newcastle English, dialectal variation, transcription errors, phonological variation, social variables

37. ❌ SEGAR: Selective Enhancement for Generative Augmented Reality

Authors: Fanjun Bu, Chenyang Yuan, Hiroshi Yasuda Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24541v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

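Scoring rationale: SEGAR focuses on applying generative world models to augmented reality (AR), specifically combining a diffusion model with a selective correction stage to generate and correct augmented future frames. Generative world models are the core of the paper, which makes it highly relevant to the keyword "World Models AND General World Models" (10 points): the abstract explicitly names "generative world models" as the foundation for AR applications. However, the paper does not involve large language models (LLMs), the technical principles of deep learning (MoE, scaling laws, training methods), reasoning techniques (CoT, MCTS), agent systems, model optimization (quantization, inference acceleration), or scientific AI applications, so every other keyword scores 0. The paper belongs to computer vision and AR rather than to innovation in large-model or deep learning principles, so it does not meet the research-background requirements of "applications of large models and deep learning in science" or "innovations in the technical principles of large models and deep learning". It receives partial relevance on the strength of the "World Models" keyword.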

!!! tip deepseek-chat TL;DR

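SEGAR proposes a framework that combines a diffusion-based world model with selective correction to generate and correct future image sequences for augmented reality, supporting real-time AR applications in driving scenarios.

Abstract (Chinese translation)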

生成式世界模型为增强现实（AR）应用提供了一个引人注目的基础：通过预测包含刻意视觉编辑的未来图像序列，它们能够生成时间连贯的增强未来帧，这些帧可提前计算并缓存，从而避免实时逐帧从头渲染。在本研究中，我们提出了SEGAR，这是一个初步框架，它将基于扩散的世界模型与选择性校正阶段相结合，以支持这一愿景。该世界模型生成具有特定区域编辑的增强未来帧，同时保持其他区域不变；校正阶段随后将安全关键区域与现实世界观测对齐，同时保留其他区域的预期增强效果。我们以驾驶场景作为代表性环境展示了该流程，其中语义区域结构定义明确且现实世界反馈易于获取。我们将此视为生成式世界模型迈向实用AR基础设施的早期步骤，未来帧可按需生成、缓存并选择性校正。

Abstract

Generative world models offer a compelling foundation for augmented-reality (AR) applications: by predicting future image sequences that incorporate deliberate visual edits, they enable temporally coherent, augmented future frames that can be computed ahead of time and cached, avoiding per-frame rendering from scratch in real time. In this work, we present SEGAR, a preliminary framework that combines a diffusion-based world model with a selective correction stage to support this vision. The world model generates augmented future frames with region-specific edits while preserving others, and the correction stage subsequently aligns safety-critical regions with real-world observations while preserving intended augmentations elsewhere. We demonstrate this pipeline in driving scenarios as a representative setting where semantic region structure is well defined and real-world feedback is readily available. We view this as an early step toward generative world models as practical AR infrastructure, where future frames can be generated, cached, and selectively corrected on demand.

Keywords: Generative World Models, Augmented Reality, Diffusion Models, Selective Correction, Future Frame Generation, Driving Scenarios, Temporal Coherence, Real-world Feedback
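
The intended effect of the selective correction stage can be pictured as a masked composite: real observations overwrite the safety-critical regions while the augmented content is kept elsewhere. The abstract does not say how the correction stage is implemented, so the sketch below only illustrates that input/output contract. The function name `selective_correction`, the boolean `safety_mask`, and the nested-list frame format are assumptions.

```python
def selective_correction(augmented, observed, safety_mask):
    """Composite two equal-sized 2D frames.

    Takes the observed pixel where safety_mask is True and the augmented
    pixel elsewhere. Hypothetical stand-in for SEGAR's correction stage.
    """
    if not (len(augmented) == len(observed) == len(safety_mask)):
        raise ValueError("frames and mask must have the same height")
    corrected = []
    for aug_row, obs_row, mask_row in zip(augmented, observed, safety_mask):
        if not (len(aug_row) == len(obs_row) == len(mask_row)):
            raise ValueError("frames and mask must have the same width")
        corrected.append([
            obs if keep_real else aug
            for aug, obs, keep_real in zip(aug_row, obs_row, mask_row)
        ])
    return corrected
```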

38. ❌ CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition

Authors: Florian Stilz, Vinkle Srivastav, Nassir Navab, Nicolas Padoy Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24539v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

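Scoring rationale: The paper focuses on pretraining video-language foundation models, specifically fine-grained temporal understanding of long-form surgical videos. It is highly relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (10 points) because its core contribution is a novel pretraining framework and set of strategies (VTC_CTX, COP, Cycle-Consistency Alignment, FTM). It is fairly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points) because the application domain is surgical procedures (biomedical AI), although the paper emphasizes methods over specific scientific discoveries. The other keywords mainly concern large language model (LLM) techniques, reasoning, alignment, and optimization, whereas this paper studies a multimodal video-language model, so they are not relevant (0 points).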

!!! tip deepseek-chat TL;DR

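The paper proposes CliPPER, a contextual video-language pretraining framework for long-form surgical videos. Novel pretraining objectives such as VTC_CTX and COP improve multimodal alignment and fine-grained temporal understanding, and the model achieves state-of-the-art zero-shot recognition on multiple surgical benchmarks.

Abstract (Chinese translation)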

视频-语言基础模型已被证明在广泛任务中的零样本应用方面具有高效能。手术室内操作领域是一个尤为复杂的挑战区域，该领域标注数据稀缺，且复杂下游任务通常需要精确的时间理解能力。为应对这一挑战，我们提出了CliPPER（面向事件识别的长时程手术操作视频上下文视频-语言预训练框架），这是一种基于手术教学视频训练的新型视频-语言预训练框架。本方法专为细粒度时序视频-文本识别设计，并引入了多项新颖的预训练策略以提升长时程手术视频中的多模态对齐能力。具体而言，我们提出了上下文视频-文本对比学习（VTC_CTX）与片段顺序预测（COP）预训练目标，二者均利用时序与上下文依赖关系来增强局部视频理解。此外，我们在同一手术视频内引入视频-文本匹配的循环一致性对齐机制，以强化双向一致性并提升整体表征连贯性。同时，我们引入了更精细的对齐损失函数——帧-文本匹配（FTM），以改善视频帧与文本之间的对齐效果。实验表明，我们的模型在多个公开手术基准测试中创造了全新的性能纪录，包括对手术阶段、步骤、器械及三元组的零样本识别。源代码与预训练标注可通过https://github.com/CAMMA-public/CliPPER获取。

Abstract

Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state-of-the-art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at https://github.com/CAMMA-public/CliPPER.

Keywords: video-language foundation models, surgical procedure domain, pretraining framework, temporal understanding, multimodal alignment, zero-shot recognition, long-form videos, contextual learning
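
For orientation, the sketch below shows the standard symmetric video-text contrastive (InfoNCE) loss that objectives of this family usually build on. It is not the paper's VTC_CTX objective, whose contextual terms the abstract does not specify. The function names, the `temperature` default, and the convention that matched clip-caption pairs lie on the diagonal of the similarity matrix are assumptions.

```python
import math


def info_nce(sim, temperature=0.07):
    """Mean cross-entropy over the rows of a square similarity matrix.

    Row i holds the similarities of item i to every candidate; the matched
    candidate is assumed to sit on the diagonal (column i).
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Average of the video-to-text and text-to-video InfoNCE terms."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (info_nce(sim, temperature) + info_nce(transposed, temperature))
```

With an uninformative similarity matrix (all entries equal) the loss is the log of the batch size, and it falls as the diagonal entries come to dominate their rows and columns.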

39. ❌ UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

Authors: Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, Deheng Ye, Jie Jiang Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24533v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

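Scoring rationale: The paper proposes UI-Voyager, a self-evolving mobile GUI agent built on multimodal large language models (MLLMs). Its core involves LLM agents (autonomous agents) and self-improvement: it achieves co-evolution of model and data through Rejection Fine-Tuning (RFT) and Group Relative Self-Distillation (GRSD), which falls under supervised fine-tuning (SFT). Other keywords such as MoE, quantization, and RAG do not appear in the abstract and score 0.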

!!! tip deepseek-chat TL;DR

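The paper addresses inefficient learning and ambiguous credit assignment under sparse rewards for mobile GUI agents on long-horizon tasks. It proposes UI-Voyager, a two-stage self-evolving agent that uses Rejection Fine-Tuning and Group Relative Self-Distillation to reach an 81.0% success rate on AndroidWorld, surpassing human-level performance and existing baselines.

Abstract (Chinese translation)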

随着多模态大语言模型（MLLMs）的进步，自主移动图形用户界面（GUI）智能体日益受到关注。然而，现有方法在长视野GUI任务中，仍面临从失败轨迹中学习效率低下、稀疏奖励下信用分配模糊的问题。为此，我们提出UI-Voyager，一种新颖的两阶段自进化移动GUI智能体。在第一阶段，我们采用拒绝微调（Rejection Fine-Tuning, RFT），使数据与模型能在完全自主的循环中持续协同进化。第二阶段引入组相对自蒸馏（Group Relative Self-Distillation, GRSD），该方法通过识别组探索中的关键分叉点，并从成功轨迹中构建密集的步骤级监督，以修正失败轨迹。在AndroidWorld平台上的大量实验表明，我们的40亿参数模型实现了81.0%的Pass@1成功率，超越了近期众多基线方法并超过了人类水平。消融实验与案例研究进一步验证了GRSD的有效性。我们的方法代表了在不依赖昂贵人工数据标注的情况下，向高效、自进化、高性能移动GUI自动化迈出的重要一步。

Abstract

Autonomous mobile GUI agents have attracted increasing attention along with the advancement of Multimodal Large Language Models (MLLMs). However, existing methods still suffer from inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards for long-horizon GUI tasks. To that end, we propose UI-Voyager, a novel two-stage self-evolving mobile GUI agent. In the first stage, we employ Rejection Fine-Tuning (RFT), which enables the continuous co-evolution of data and models in a fully autonomous loop. The second stage introduces Group Relative Self-Distillation (GRSD), which identifies critical fork points in group rollouts and constructs dense step-level supervision from successful trajectories to correct failed ones. Extensive experiments on AndroidWorld show that our 4B model achieves an 81.0% Pass@1 success rate, outperforming numerous recent baselines and exceeding human-level performance. Ablation and case studies further verify the effectiveness of GRSD. Our method represents a significant leap toward efficient, self-evolving, and high-performance mobile GUI automation without expensive manual data annotation.

Keywords: GUI Agent, Self-evolving, Multimodal Large Language Models, Rejection Fine-Tuning, Group Relative Self-Distillation, Mobile Automation, AndroidWorld, Pass@1
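
The two stages described in the abstract can be pictured with a toy sketch: keep only successful rollouts as fine-tuning data, then, for each failed rollout in a group, locate the first step where it departs from a successful one and use the successful action there as a step-level target. This is one hypothetical reading and not the authors' algorithm. The rollout dictionaries, the comparison of raw action strings (a real agent would more plausibly compare screen states), and all function names are assumptions.

```python
def rejection_filter(rollouts):
    """Keep only successful rollouts, as in rejection-style data selection."""
    return [r for r in rollouts if r["success"]]


def first_fork_point(success_actions, failed_actions):
    """Index of the first step where the two action sequences differ, else None."""
    for step, (good, bad) in enumerate(zip(success_actions, failed_actions)):
        if good != bad:
            return step
    return None


def step_level_targets(group):
    """Build one step-level correction per failed rollout in a group.

    Uses the first successful rollout in the group as the reference
    (an arbitrary choice made for this illustration).
    """
    successes = rejection_filter(group)
    if not successes:
        return []
    reference = successes[0]["actions"]
    targets = []
    for rollout in group:
        if rollout["success"]:
            continue
        fork = first_fork_point(reference, rollout["actions"])
        if fork is not None:
            targets.append({
                "step": fork,
                "prefix": rollout["actions"][:fork],
                "target_action": reference[fork],
            })
    return targets
```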

40. ❌ From Liar Paradox to Incongruent Sets: A Normal Form for Self-Reference

Authors: Shalender Singh, Vishnu Priya Singh Parmar Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24527v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

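Scoring rationale: The paper studies a logical form for self-referential semantic sentences (the incongruent normal form) and its theoretical properties in model theory and semantic information theory; it is foundational work in mathematical logic and formal semantics. All scoring keywords target large models, deep learning techniques, and their applications (training methods, reasoning techniques, optimization, application domains, and so on). The paper touches none of these topics and discusses no machine learning model, algorithm, or application, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a structural form for representing self-referential semantic sentences (the incongruent normal form), proves that semantic completeness precludes informativeness while incongruence preserves it, and builds a quantitative semantic framework showing that semantic informativeness cannot collapse into a single determinate state without unbounded energy cost.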

Abstract (Chinese translation)

我们引入非协调正规形式（incongruent normal form，简称INF），一种用于自指语义语句的结构化表征。INF将自指语句替换为一个有限的非自指语句族，这些语句各自可满足但无法联合满足。该转换在局部保持经典语义的同时，隔离了由自指性产生的语义障碍，并辅以刻画“局部相容的语义承诺何时导致全局不一致性”的正确性定理。随后，我们研究非协调性作为语义信息性的结构来源。基于一个极简的模型论信息性概念——即语句在可容许模型间进行区分的能力——我们证明语义完备性会排除信息性，而非协调性则能保留信息性。此外，非协调性并不局限于悖论性构造：任何一致但不完备的一阶理论都容许由不相容的完备扩张产生的有限非协调族。在此意义上，不完备性在结构上表现为局部可实现但全局不相容的语义承诺，从而为语义知识提供了最简形式基础。最后，我们引入一个量化语义框架。在一个典范的有限语义状态设定中，我们将语义承诺建模为布尔函数，并基于总影响定义了一种傅里叶分析的语义能量概念。我们推导出关联语义确定性、信息性与谱简洁性的不确定性式界限，并建立了一个矩阵不等式，以总语义能量约束聚合语义方差。这些结果定量表明，若无无限的能量代价，语义信息性无法坍缩为单一确定状态，从而将非协调性确立为语义表征的一个基本结构与量化特征。

Abstract

We introduce incongruent normal form (INF), a structural representation for self-referential semantic sentences. An INF replaces a self-referential sentence with a finite family of non-self-referential sentences that are individually satisfiable but not jointly satisfiable. This transformation isolates the semantic obstruction created by self-reference while preserving classical semantics locally and is accompanied by correctness theorems characterizing when global inconsistency arises from locally compatible commitments. We then study the role of incongruence as a structural source of semantic informativeness. Using a minimal model-theoretic notion of informativeness (understood as the ability of sentences to distinguish among admissible models), we show that semantic completeness precludes informativeness, while incongruence preserves it. Moreover, incongruence is not confined to paradoxical constructions: any consistent incomplete first-order theory admits finite incongruent families arising from incompatible complete extensions. In this sense, incompleteness manifests structurally as locally realizable but globally incompatible semantic commitments, providing a minimal formal basis for semantic knowledge. Finally, we introduce a quantitative semantic framework. In a canonical finite semantic-state setting, we model semantic commitments as Boolean functions and define a Fourier-analytic notion of semantic energy based on total influence. We derive uncertainty-style bounds relating semantic determinacy, informativeness, and spectral simplicity, and establish a matrix inequality bounding aggregate semantic variance by total semantic energy. These results show quantitatively that semantic informativeness cannot collapse into a single determinate state without unbounded energy cost, identifying incongruence as a fundamental structural and quantitative feature of semantic representation.
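The abstract bases its "semantic energy" on the total influence of a Boolean function but does not give the exact definition. As a rough illustration only, the sketch below uses the standard textbook definitions for functions f: {-1,+1}^n -> {-1,+1} and checks the known identity that total influence equals the sum over subsets S of |S| times the squared Fourier coefficient. The function names and the `majority3` example are assumptions made for this sketch, not taken from the paper.

```python
import itertools
import numpy as np


def fourier_coefficients(f, n):
    """Return {S: f_hat(S)} for f: {-1,+1}^n -> {-1,+1}; S is a tuple of indices."""
    points = list(itertools.product([-1, 1], repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            # chi_S(x) is the product of the coordinates in S (empty product = 1).
            total = sum(f(x) * np.prod([x[i] for i in S]) for x in points)
            coeffs[S] = total / len(points)
    return coeffs


def total_influence(f, n):
    """Sum over coordinates i of Pr_x[f(x) != f(x with coordinate i flipped)]."""
    points = list(itertools.product([-1, 1], repeat=n))
    influence = 0.0
    for i in range(n):
        flips = 0
        for x in points:
            y = list(x)
            y[i] = -y[i]
            flips += f(x) != f(tuple(y))
        influence += flips / len(points)
    return influence


def majority3(x):
    """Hypothetical example commitment: majority vote over three +/-1 inputs."""
    return 1 if sum(x) > 0 else -1


coeffs = fourier_coefficients(majority3, 3)
spectral = sum(len(S) * c ** 2 for S, c in coeffs.items())
print(total_influence(majority3, 3), spectral)
```

Both printed quantities agree for any such function; whether the paper normalizes its energy the same way is not stated in the abstract.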

Keywords: self-reference, incongruent normal form, semantic informativeness, model theory, first-order theory, semantic energy, Boolean functions, Fourier analysis

41. ❌ No Single Metric Tells the Whole Story: A Multi-Dimensional Evaluation Framework for Uncertainty Attributions

Authors: Emily Schiller, Teodor Chiaburu, Marco Zullich, Luca Longo Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24524v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

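Scoring rationale: The paper belongs to explainable AI (XAI) and specifically addresses an evaluation framework for uncertainty attributions, that is, methods that explain a model's predictive uncertainty. Its core contribution is a multi-dimensional evaluation framework, built on the Co-12 framework, for systematically assessing the quality of uncertainty attribution methods. The paper is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agent systems, and so on), because those keywords target large language models (LLMs) and related techniques, whereas this paper studies a general XAI evaluation framework and does not involve LLMs or specific deep learning model techniques. The only relevant keyword is "Mechanistic Interpretability OR Explainable AI": the paper directly studies XAI evaluation and so falls within explainable AI, but its focus is the evaluation framework rather than the explanation mechanism itself, so it is given 10 points (highly relevant, though not the core content).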
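The table above shows a relevance of 10.0/10 yet a row score of 0.0, which follows if a row's score is its weight times its relevance. The sketch below spells out that arithmetic. The multiplication rule is an assumption inferred from the table rows; the per-keyword cap of 15 and the pass-mark formula `5.0 + 0.8 × total weight` come from the report's configuration header. The function names are hypothetical.

```python
def keyword_score(weight: float, relevance: float, max_score: float = 15.0) -> float:
    """Assumed rule: score = weight * relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, max_score)


def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


# Row "Mechanistic Interpretability OR Explainable AI": weight 0.0, relevance 10.0.
row_score = keyword_score(weight=0.0, relevance=10.0)
threshold = pass_mark(total_weight=27.0)
print(row_score, round(threshold, 1))
```

Under this assumed rule, the same row with the configured weight of 1.0 would score 10.0, still well below the pass mark on its own.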

!!! tip deepseek-chat TL;DR

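To address inconsistent evaluation of uncertainty attribution methods, the paper proposes a multi-dimensional evaluation framework based on the Co-12 framework. Experiments show that no single metric fully captures uncertainty attribution quality, and that gradient-based methods outperform perturbation-based methods in consistency and conveyance.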

Abstract (Chinese translation)

可解释人工智能（XAI）的研究常聚焦于解释模型预测。近期，学界提出了通过将预测不确定性归因于输入特征（不确定性归因）来解释不确定性的方法。然而，由于现有研究依赖各异的代理任务与评估指标，对这些方法的评估仍缺乏一致性，阻碍了可比性。为此，我们将不确定性归因与成熟的XAI评估Co-12框架对齐，针对正确性、一致性、连续性及紧凑性提出了具体实施方案。此外，我们引入了专为不确定性归因设计的“传递性”属性，用于评估认知不确定性的受控增加是否能可靠地传递至特征层面的归因结果。我们通过表格数据与图像数据，结合不确定性量化与特征归因方法，使用八项指标验证了该评估框架。实验表明，基于梯度的方法在一致性与传递性上持续优于基于扰动的方法，而蒙特卡洛Dropconnect在多数指标上超越蒙特卡洛Dropout。尽管多数指标在不同样本间对方法的排序保持一致，但方法间的评估共识度仍较低。这表明单一指标不足以全面评估不确定性归因的质量。本研究提出的评估框架为系统性比较与发展不确定性归因方法奠定了基础，从而丰富了该领域的知识体系。

Abstract

Research on explainable AI (XAI) has frequently focused on explaining model predictions. More recently, methods have been proposed to explain prediction uncertainty by attributing it to input features (uncertainty attributions). However, the evaluation of these methods remains inconsistent as studies rely on heterogeneous proxy tasks and metrics, hindering comparability. We address this by aligning uncertainty attributions with the well-established Co-12 framework for XAI evaluation. We propose concrete implementations for the correctness, consistency, continuity, and compactness properties. Additionally, we introduce conveyance, a property tailored to uncertainty attributions that evaluates whether controlled increases in epistemic uncertainty reliably propagate to feature-level attributions. We demonstrate our evaluation framework with eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data. Our experiments show that gradient-based methods consistently outperform perturbation-based approaches in consistency and conveyance, while Monte-Carlo dropconnect outperforms Monte-Carlo dropout in most metrics. Although most metrics rank the methods consistently across samples, inter-method agreement remains low. This suggests no single metric sufficiently evaluates uncertainty attribution quality. The proposed evaluation framework contributes to the body of knowledge by establishing a foundation for systematic comparison and development of uncertainty attribution methods.
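The abstract refers to epistemic uncertainty estimated with Monte-Carlo dropout or dropconnect, which is the quantity an uncertainty attribution then assigns to input features. As a minimal sketch, assuming the common entropy-based decomposition (the abstract does not say which uncertainty measure the authors use), the epistemic part can be computed from the softmax outputs of several stochastic forward passes. The function name and the example arrays are hypothetical.

```python
import numpy as np


def uncertainty_decomposition(probs):
    """Split predictive uncertainty for one input.

    probs: array of shape (T, C) holding softmax outputs from T stochastic
    forward passes (e.g. with dropout kept active at inference time).
    Returns (total, aleatoric, epistemic) in nats.
    """
    probs = np.asarray(probs, dtype=float)
    eps = 1e-12  # guards against log(0)
    mean_probs = probs.mean(axis=0)
    # Total uncertainty: entropy of the averaged prediction.
    total = -np.sum(mean_probs * np.log(mean_probs + eps))
    # Aleatoric part: average entropy of the individual passes.
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # Epistemic part: mutual information between prediction and model sample.
    epistemic = total - aleatoric
    return total, aleatoric, epistemic


agreeing = [[0.7, 0.3], [0.7, 0.3]]        # passes agree: little epistemic uncertainty
disagreeing = [[0.99, 0.01], [0.01, 0.99]]  # passes disagree: high epistemic uncertainty
print(uncertainty_decomposition(agreeing)[2])
print(uncertainty_decomposition(disagreeing)[2])
```

A feature attribution method would then be applied to the epistemic term instead of the class score; that step depends on the chosen attribution method and is not shown here.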

Keywords: Explainable AI, Uncertainty Attribution, Evaluation Framework, Co-12 Framework, Multi-dimensional Evaluation, Gradient-based Methods, Monte-Carlo Dropconnect, Feature Attribution

42. ❌ Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Authors: Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24511v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

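Scoring rationale: The paper's core is the use of an LLM agent (Claude Code) for automated AI research that discovers new adversarial attack algorithms, so it is highly relevant (10 points) to "Large Language Models OR LLMs OR Foundation Models" and "LLM Agents OR Autonomous Agents OR Agentic Workflow". Other keywords, such as MoE, SLMs, training methods, reasoning techniques, and AI for Science, are neither mentioned in the abstract nor relevant, so they receive 0 points.

!!! tip deepseek-chat TL;DR

The paper uses LLM agents (such as Claude Code) for automated AI research and discovers new white-box adversarial attack algorithms that significantly outperform more than 30 existing methods in jailbreaking and prompt injection evaluations, reaching a 100% attack success rate against Meta-SecAlign-70B.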

Abstract (Chinese translation)

以Claude Code为代表的大型语言模型智能体不仅能编写代码，还可用于自主人工智能研究与工程实践 [rank2026posttrainbench, novikov2025alphaevolve]。本文证明，由Claude Code驱动的自主研究（autoresearch）式流程 [karpathy2026autoresearch] 能够发现全新的白盒对抗攻击算法，在越狱和提示注入评估中显著超越所有现有（30余种）方法。该智能体以现有攻击实现（如GCG [zou2023universal]）为起点，通过迭代生成新算法，在针对GPT-OSS-Safeguard-20B模型的CBRN查询中实现了高达40%的攻击成功率，而现有算法的成功率仅为≤10%（原论文teaser图，左图）。所发现的算法具有泛化能力：在代理模型上优化的攻击可直接迁移至预留测试模型，对Meta-SecAlign-70B模型实现了100%的攻击成功率 [chen2025secalign]，而最佳基线方法仅为56%（原论文teaser图，中图）。本研究拓展了 [carlini2025autoadvexbench] 的发现，初步论证了利用大型语言模型智能体可实现渐进式安全研究的自动化。白盒对抗红队测试尤其适合此场景：现有方法提供了坚实的起点，且优化目标能产生密集的量化反馈。我们在 https://github.com/romovpa/claudini 公开了所有发现的攻击方法、基线实现及评估代码。

Abstract

LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering [rank2026posttrainbench, novikov2025alphaevolve]. We show that an autoresearch-style pipeline [karpathy2026autoresearch] powered by Claude Code discovers novel white-box adversarial attack algorithms that significantly outperform all existing (30+) methods in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG [zou2023universal], the agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to ≤10% for existing algorithms (teaser figure, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving 100% ASR against Meta-SecAlign-70B [chen2025secalign] versus 56% for the best baseline (teaser figure, middle). Extending the findings of [carlini2025autoadvexbench], our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at https://github.com/romovpa/claudini.

Keywords: LLM agents, autonomous AI research, adversarial attack algorithms, jailbreaking, prompt injection, white-box attacks, Claude Code, autoresearch

43. ❌ Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

Authors: John Ray B. Martinez Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24481v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

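Scoring rationale: The paper's core is a multi-agent framework applied to medical question answering, with Qwen2.5-7B-Instruct as the base model, so it is highly relevant to "Large Language Models" (10 points). The work involves multi-agent systems (10 points), agent coordination (10 points), and a self-correction mechanism (10 points), which are its core innovations. It is applied to the medical domain and is therefore highly relevant to "AI for Science" (10 points). The paper improves uncertainty calibration through consistency verification, which touches on the reasoning process (5 points), in-depth thinking (5 points), factuality (5 points), and explainability (5 points). Other keywords, such as MoE, quantization, and RAG, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a multi-agent framework that combines domain specialist agents with two-phase verification, substantially improving uncertainty calibration in medical multiple-choice question answering and reducing ECE by 49-74% on the MedQA and MedMCQA datasets.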

Abstract (Chinese translation)

置信度分数校准失准是人工智能在临床环境中部署的实际障碍。始终过度自信的模型无法为转诊决策提供有效信号。本文提出一种多智能体框架，通过结合领域特异性专科智能体、两阶段验证和S分数加权融合技术，以提升医学多项选择题回答任务的校准能力与判别性能。四个专科智能体（呼吸科、心内科、神经科、胃肠科）基于Qwen2.5-7B-Instruct模型生成独立诊断。每个诊断随后经过两阶段自验证流程，该流程评估内部一致性并生成专科置信度分数（S-score）。S分数驱动加权融合策略，从而选择最终答案并校准报告的置信度。我们在四个实验场景中进行评估，涵盖MedQA-USMLE和MedMCQA数据集的100题与250题高争议子集。校准性能提升是核心发现：所有四种场景的预期校准误差（ECE）均降低49-74%，包括难度更高的MedMCQA基准测试——即使绝对准确率受知识密集型记忆需求限制，这些增益依然存在。在MedQA-250数据集上，完整系统实现了ECE=0.091（较单专科基线降低74.4%），AUROC=0.630（提升0.056），准确率达59.2%。消融分析表明：两阶段验证是校准性能的主要驱动因素，而多智能体推理是准确率提升的核心驱动力。这些结果证实，基于一致性的验证能为不同医学题型生成更可靠的不确定性估计，为安全关键型临床人工智能应用中的转诊决策提供实用的置信度信号。

Abstract

Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49-74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.
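The headline metric in this abstract is Expected Calibration Error (ECE). For readers unfamiliar with it, the sketch below computes ECE in its usual form: predictions are grouped into confidence bins, and the gap between average confidence and accuracy in each bin is weighted by the bin's share of samples. The choice of ten equal-width bins and the function name are assumptions of this sketch; the abstract does not state the binning the author used.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins (lo, hi]; a confidence of exactly 0 joins the first bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            in_bin |= confidences == 0.0
        if in_bin.any():
            # |accuracy - mean confidence| weighted by the fraction of samples in the bin.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece


# An always-overconfident model: 95% stated confidence, 50% actual accuracy.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 0, 1, 0]))

# If the reported 74.4% is a relative reduction, ECE = 0.091 implies this baseline:
implied_baseline = 0.091 / (1 - 0.744)
print(round(implied_baseline, 3))
```

The implied baseline is derived arithmetic from the two figures in the abstract, not a number the abstract reports.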

Keywords: multi-agent systems, uncertainty calibration, medical question answering, consistency verification, LLM agents, confidence scoring, clinical AI, self-verification

44. ❌ Counting Without Numbers & Finding Without Words

Authors: Badri Narayana Patro Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24470v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

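Scoring rationale: The paper studies a multimodal AI system grounded in biological communication principles for reuniting pets in animal shelters, integrating visual and acoustic biometrics. All keywords focus on large language models (LLMs) and related techniques (fine-tuning, reasoning, alignment, agents, and so on), whereas this paper does not involve LLMs, the principles of deep learning techniques, or any large-model technique. The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to a science/biology-related problem (animal cognition and communication); this is not a core match, so it receives 5 points to indicate some relevance. All other keywords are entirely unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

Addressing the problem that 70% of lost pets entering shelters are never reunited with their owners because matching relies on appearance alone, the paper proposes the first multimodal reunification system grounded in animal cognitive science, integrating visual and acoustic biometrics, and demonstrates that AI grounded in biological communication principles can serve vulnerable populations that lack human language.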

Abstract (Chinese translation)

每年有1000万只宠物进入收容所，与家庭分离。尽管监护人和走失动物都在竭力寻找，仍有70%的宠物无法与家人团聚——并非因为匹配不存在，而是因为现有系统仅依赖外观识别，而动物主要通过声音辨识彼此。我们提出疑问：为何计算机视觉将能够发声的物种视为无声的视觉对象？基于五十年来认知科学的研究成果（这些研究表明动物通过近似感知数量并通过声音交流身份），我们提出了首个融合视觉与声学生物特征的多模态重聚系统。我们的物种自适应架构能够处理从10赫兹大象低频吼叫到4千赫兹幼犬哀鸣的各类发声，并结合概率视觉匹配技术，以应对压力引起的外观变化。这项研究表明，基于生物交流原则的人工智能能够为缺乏人类语言的弱势群体提供服务。

Abstract

Every year, 10 million pets enter shelters, separated from their families. Despite desperate searches by both guardians and lost animals, 70% never reunite, not because matches do not exist, but because current systems look only at appearance, while animals recognize each other through sound. We ask, why does computer vision treat vocalizing species as silent visual objects? Drawing on five decades of cognitive science showing that animals perceive quantity approximately and communicate identity acoustically, we present the first multimodal reunification system integrating visual and acoustic biometrics. Our species-adaptive architecture processes vocalizations from 10Hz elephant rumbles to 4kHz puppy whines, paired with probabilistic visual matching that tolerates stress-induced appearance changes. This work demonstrates that AI grounded in biological communication principles can serve vulnerable populations that lack human language.

Keywords: multimodal reunification system, visual and acoustic biometrics, animal communication, species-adaptive architecture, biological communication principles, pet shelters, lost animal reunification, AI for vulnerable populations

45. ❌ Integrating Causal Machine Learning into Clinical Decision Support Systems: Insights from Literature and Practice

Authors: Domenique Zipperling, Lukas Schmidt, Benedikt Hahn, Niklas Kühl, Steven Kimbrough Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24448v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

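Scoring rationale: The paper studies the application and design principles of causal machine learning in clinical decision support systems (CDSSs), an application of AI in healthcare. It is unrelated to the vast majority of large-model keywords (LLMs, MoE, training methods, inference optimization, and so on), which are therefore scored 0. It is only weakly related to two keywords: (1) "Mechanistic Interpretability OR Explainable AI": the paper emphasizes that causal ML provides interpretable reasoning, which has some connection to explainable AI, scored 5; (2) "AI for Science OR Bioinformatics OR Cheminformatics": the paper involves AI applied to healthcare (a field adjacent to bioinformatics), scored 5. No other keyword is involved.

!!! tip deepseek-chat TL;DR

The paper studies how to design clinical decision support systems based on causal machine learning to improve collaboration and interpretability in clinical decision-making, and derives design requirements, principles, and features from a literature review and physician interviews.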

Abstract (Chinese translation)

当前临床决策支持系统（CDSSs）的预测通常基于相关性而非因果关系。近年来，因果机器学习（Causal Machine Learning, ML）作为一种有前景的方法兴起，它通过提供可解释的、针对特定治疗的推理，有望改善CDSSs的决策质量。然而，现有研究往往侧重于模型开发，而非面向临床医生的界面设计。为弥补这一空白，我们探讨了基于因果机器学习的CDSSs应如何设计，以有效支持协作式临床决策。采用设计科学研究方法，我们进行了系统的文献综述并访谈了经验丰富的医师。基于此，我们归纳出八项基于实证的设计需求，提出了七条设计原则，并构建了九项实用设计特性。我们的研究结果为设计CDSSs提供了指导，使其能够提供因果洞察、无缝融入临床工作流程，并支持信任度、可用性及人机协作。同时，我们也揭示了围绕自动化、责任与监管之间的张力，强调了对基于机器学习的医疗产品建立适应性认证流程的必要性。

Abstract

Current clinical decision support systems (CDSSs) typically base their predictions on correlation, not causation. In recent years, causal machine learning (ML) has emerged as a promising way to improve decision-making with CDSSs by offering interpretable, treatment-specific reasoning. However, existing research often emphasizes model development rather than designing clinician-facing interfaces. To address this gap, we investigated how CDSSs based on causal ML should be designed to effectively support collaborative clinical decision-making. Using a design science research methodology, we conducted a structured literature review and interviewed experienced physicians. From these, we derived eight empirically grounded design requirements, developed seven design principles, and proposed nine practical design features. Our results establish guidance for designing CDSSs that deliver causal insights, integrate seamlessly into clinical workflows, and support trust, usability, and human-AI collaboration. We also reveal tensions around automation, responsibility, and regulation, highlighting the need for an adaptive certification process for ML-based medical products.

Keywords: clinical decision support systems, causal machine learning, design principles, human-AI collaboration, interpretability, clinical workflows, medical AI, design science research

46. ❌ OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework

Authors: Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, Zhipeng Qian, Xinyu Sun, Zhixin Zhai, Yang Zhao, Bochao Liu, Jingshan Lv, Xiao Liang, Hui Kong, Jing Chen, Han Li, Chenyi Lei, Wenwu Ou, Kun Gai Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24422v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

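Scoring rationale: The paper proposes OneSearch-V2, a latent-reasoning-enhanced self-distillation generative search framework. Its core innovations are: (1) a thought-augmented complex query understanding module that enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized self-distillation training pipeline that uncovers latent user intent through implicit in-context learning; (3) a behavior preference alignment optimization system that mitigates reward hacking caused by a single conversion metric. The paper is related to LLMs (8 points) because generative retrieval is an application of large models to search; highly related to Chain of Thought / System 2 Thinking (10 points) because it emphasizes deep reasoning, thought augmentation, and overcoming shallow matching; and highly related to In-context Learning (10 points) because the training pipeline uses implicit in-context learning. Other keywords, such as MoE, SLMs, Scaling Laws, Alignment, and RAG, are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address existing generative search frameworks' inadequate understanding of complex queries, inefficient mining of latent user intent, and overfitting to historical preferences, the paper proposes the OneSearch-V2 framework. Through thought-augmented query understanding, reasoning-internalized self-distillation training, and behavior preference alignment optimization, it improves search performance; online A/B tests show +3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume.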

Abstract (Chinese translation)

生成式检索（Generative Retrieval，GR）已成为现代搜索系统中一种前景广阔的研究范式。相较于多阶段级联架构，它具有端到端联合优化与高计算效率等优势。OneSearch作为工业级部署的生成式搜索框架代表，已带来显著的商业与运营效益。然而，其对复杂查询的理解不足、对潜在用户意图的低效挖掘，以及对狭窄历史偏好的过拟合问题，限制了其性能的进一步提升。为应对这些挑战，我们提出OneSearch-V2——一种潜在推理增强的自蒸馏生成式搜索框架。其包含三项核心创新：（1）思维增强的复杂查询理解模块，通过深度查询理解克服直接推理的浅层语义匹配局限；（2）推理内化的自蒸馏训练流程，通过隐式上下文学习挖掘用户超越日志拟合的潜在精准电商意图；（3）行为偏好对齐优化系统，缓解单一转化指标导致的奖励破解问题，并通过直接用户反馈优化个性化偏好。大量离线实验证明OneSearch-V2具备强大的查询识别与用户画像能力。在线A/B测试进一步验证其业务有效性，实现商品点击率提升3.98%、买家转化率提升3.05%、订单量增长2.11%。人工评估亦证实搜索体验质量提升，页面优质率提高1.65%，查询-商品相关性提升1.37%。更重要的是，OneSearch-V2有效缓解了信息茧房与长尾稀疏性等常见搜索系统问题，且未引入额外推理成本或服务延迟。

Abstract

Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi-stage cascaded architecture, it offers advantages such as end-to-end joint optimization and high computational efficiency. OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement. To address these challenges, we propose OneSearch-V2, a latent reasoning enhanced self-distillation generative search framework. It contains three key innovations: (1) a thought-augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized self-distillation training pipeline, which uncovers users’ potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from the single conversion metric, and addresses personal preference via direct user feedback. Extensive offline evaluations demonstrate OneSearch-V2’s strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98% item CTR, +3.05% buyer conversion rate, and +2.11% order volume. Manual evaluation further confirms gains in search experience quality, with +1.65% in page good rate and +1.37% in query-item relevance. More importantly, OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.
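The abstract names a reasoning-internalized self-distillation pipeline but gives no loss function. Purely as a generic illustration of knowledge distillation (not OneSearch-V2's actual objective), the sketch below computes a temperature-scaled KL divergence between a teacher distribution and a student distribution over candidate items. One plausible reading, stated here as an assumption, is that the teacher scores items with the reasoning text in its input while the student sees only the plain query. All names and the example logits are hypothetical.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(np.mean(kl) * temperature ** 2)


# Hypothetical scores over four candidate items for one query.
teacher = [[2.0, 0.5, -1.0, 0.0]]   # assumed: input includes reasoning text
student = [[1.0, 1.0, -0.5, 0.0]]   # assumed: input is the plain query only
print(distillation_loss(teacher, student))
```

Minimizing this term pulls the student's item distribution toward the teacher's, so no reasoning text needs to be generated at serving time; this is consistent with the abstract's claim of no additional inference cost, though the abstract does not describe the mechanism.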

Keywords: Generative Retrieval, Generative Search Framework, Latent Reasoning, Self-distillation, Complex Query Understanding, In-context Learning, Behavior Preference Alignment, E-commerce Search

47. ❌ ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers

Authors: Songyang Liu, Chaozhuo Li, Chenxu Wang, Jinyu Hou, Zejian Chen, Litian Zhang, Zheng Liu, Qiwei Ye, Yiming Hei, Xi Zhang, Zhongyuan Wang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24414v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

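Scoring rationale: The paper centers on ClawKeeper, a security framework for the OpenClaw autonomous agent system. It is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points) because its core goal is protecting an autonomous agent runtime; relevant to "Tool Use OR Function Calling OR API Tool Use" (8 points) because OpenClaw involves tool integration and shell command execution; and somewhat relevant to "Large Language Models OR LLMs OR Foundation Models" (5 points) because autonomous agents are typically built on large models, although the paper does not discuss the model itself in depth. The remaining keywords, such as MoE, training methods, and inference optimization, are unrelated to the paper's security focus and score 0.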
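The arithmetic behind the score line can be sketched as follows. The pass threshold (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration section; the per-keyword formula (weight × relevance, capped) is an assumption, since the report does not state it, and the names `passing_score` and `paper_score` are hypothetical.

```python
from typing import Dict, Tuple

MAX_PER_KEYWORD = 15.0  # per-keyword cap stated in the report configuration


def passing_score(total_weight: float) -> float:
    """Pass threshold from the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows: Dict[str, Tuple[float, float]]) -> float:
    """Sum per-keyword scores over (weight, relevance) pairs.

    ASSUMPTION: each keyword contributes weight * relevance, capped at
    MAX_PER_KEYWORD. The report does not state this formula explicitly.
    """
    return sum(min(w * r, MAX_PER_KEYWORD) for w, r in rows.values())


# Non-zero relevance rows copied from the ClawKeeper table, weights as shown there.
clawkeeper = {
    "LLMs": (0.0, 5.0),
    "LLM Agents": (0.0, 10.0),
    "Tool Use": (0.0, 8.0),
}

threshold = passing_score(27.0)
score = paper_score(clawkeeper)
print(f"{score:.1f} / {threshold:.1f}")
```

Under this assumed formula, a weight of 0.0 in every row forces a total of 0.0 regardless of relevance, which matches the table above.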

!!! tip deepseek-chat TL;DR

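To address security vulnerabilities in the OpenClaw autonomous agent system, the paper proposes the ClawKeeper framework, which provides real-time protection through multi-dimensional mechanisms at the skill, plugin, and watcher layers, effectively guarding against threats such as data leakage and malicious operations.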

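Abstract translation

OpenClaw has quickly become a leading open-source autonomous agent runtime framework, offering powerful capabilities including tool integration, local file access, and shell command execution. These broad operational privileges, however, introduce serious security vulnerabilities that let model errors turn into real system-level threats such as sensitive data leakage, privilege escalation, and execution of malicious third-party skills. Current security measures in the OpenClaw ecosystem remain highly fragmented, protecting only isolated stages of the agent lifecycle and lacking a holistic solution. To close this gap, we propose ClawKeeper, a real-time security framework that integrates multi-dimensional protection across three complementary architectural layers. (1) Skill-based protection operates at the instruction level, injecting structured security policies directly into the agent context to enforce environment-specific constraints and cross-platform boundaries. (2) Plugin-based protection acts as an internal runtime enforcer, providing configuration hardening, proactive threat detection, and continuous behavioral monitoring throughout the execution pipeline. (3) Watcher-based protection introduces a novel, decoupled system-level security middleware that continuously verifies how the agent's state evolves. It supports real-time intervention in execution without coupling to the agent's internal logic, for example by pausing high-risk operations or requiring human confirmation. We argue that this Watcher paradigm has strong potential as a foundational building block for securing next-generation autonomous agent systems. Extensive qualitative and quantitative evaluations show that ClawKeeper is effective and robust across a variety of threat scenarios. We have released the code.

Abstract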

OpenClaw has rapidly established itself as a leading open-source autonomous agent runtime, offering powerful capabilities including tool integration, local file access, and shell command execution. However, these broad operational privileges introduce critical security vulnerabilities, transforming model errors into tangible system-level threats such as sensitive data leakage, privilege escalation, and malicious third-party skill execution. Existing security measures for the OpenClaw ecosystem remain highly fragmented, addressing only isolated stages of the agent lifecycle rather than providing holistic protection. To bridge this gap, we present ClawKeeper, a real-time security framework that integrates multi-dimensional protection mechanisms across three complementary architectural layers. (1) Skill-based protection operates at the instruction level, injecting structured security policies directly into the agent context to enforce environment-specific constraints and cross-platform boundaries. (2) Plugin-based protection serves as an internal runtime enforcer, providing configuration hardening, proactive threat detection, and continuous behavioral monitoring throughout the execution pipeline. (3) Watcher-based protection introduces a novel, decoupled system-level security middleware that continuously verifies agent state evolution. It enables real-time execution intervention without coupling to the agent’s internal logic, supporting operations such as halting high-risk actions or enforcing human confirmation. We argue that this Watcher paradigm holds strong potential to serve as a foundational building block for securing next-generation autonomous agent systems. Extensive qualitative and quantitative evaluations demonstrate the effectiveness and robustness of ClawKeeper across diverse threat scenarios. We release our code.

Keywords: autonomous agents, security framework, real-time protection, tool integration, system-level threats, behavioral monitoring, execution intervention, OpenClaw
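The watcher idea described in the abstract, a component outside the agent that reviews each proposed action and can halt it or demand human confirmation, might look roughly like the sketch below. This is not ClawKeeper's code or API: the `Watcher`, `Action`, and `Verdict` names, the action kinds, and both rules are hypothetical illustrations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"  # require human confirmation before running
    HALT = "halt"        # block the action outright


@dataclass(frozen=True)
class Action:
    kind: str    # hypothetical action type, e.g. "shell" or "file_read"
    target: str  # command line or file path


# A rule inspects one action and returns a verdict. Both rules are illustrative.
Rule = Callable[[Action], Verdict]


def block_destructive_shell(action: Action) -> Verdict:
    if action.kind == "shell" and "rm -rf" in action.target:
        return Verdict.HALT
    return Verdict.ALLOW


def confirm_secret_reads(action: Action) -> Verdict:
    if action.kind == "file_read" and ".ssh" in action.target:
        return Verdict.CONFIRM
    return Verdict.ALLOW


class Watcher:
    """Runs outside the agent and judges each proposed action."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules

    def review(self, action: Action) -> Verdict:
        # The strictest verdict wins: HALT > CONFIRM > ALLOW.
        verdicts = [rule(action) for rule in self.rules]
        if Verdict.HALT in verdicts:
            return Verdict.HALT
        if Verdict.CONFIRM in verdicts:
            return Verdict.CONFIRM
        return Verdict.ALLOW


watcher = Watcher([block_destructive_shell, confirm_secret_reads])
```

Because the watcher only sees `Action` values, it needs no knowledge of how the agent produced them, which is the decoupling the abstract emphasizes.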

48. ❌ Real Talk, Virtual Faces: A Formal Concept Analysis of Personality and Sentiment in Influencer Audiences

Authors: Shahram Chaudhry, Sidahmed Benabderrahmane, Talal Rahwan Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24410v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

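Scoring rationale: The paper studies structural differences between audience comments on virtual and human influencers, using Formal Concept Analysis and association rule mining to analyze co-occurrence patterns of sentiment, personality traits, and topic tags. It does not touch on large models, deep learning techniques, or AI for Science, so none of the keywords is relevant.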

!!! tip deepseek-chat TL;DR

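Using Formal Concept Analysis and association rule mining, the paper finds structural differences between audience comments on virtual and human influencers: comments on human influencers concentrate in a single, emotionally stable pattern, whereas comments on virtual influencers show three distinct discourse modes and more negative sentiment in psychologically sensitive domains.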

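Abstract translation

Virtual influencers (VIs), digitally synthesized social-media personas, attract audiences whose discussions appear qualitatively different from those around human influencers (HIs). Existing studies describe this difference through surveys or aggregate engagement statistics, which reveal what audiences say but not how multiple signals co-occur. We propose a two-layer, structure-first framework based on Formal Concept Analysis (FCA) and association rule mining. The first layer applies FCA with support-based iceberg filtering to weekly aggregated comment data, extracting discourse profiles: combinations of sentiment, Big Five personality cues, and topic tags that co-occur within a week. The second layer mines association rules at the comment level, revealing personality-sentiment-topic dependencies that frequency-table analysis cannot observe.
Applied to YouTube comments from three VI-HI influencer pairs, the two-layer analysis reveals a consistent structural divergence: HI discussion concentrates in a single, emotionally regulated (stability-centered) mode in which low neuroticism anchors positive sentiment, whereas VI discussion supports three structurally distinct discourse modes, including an appearance-topic cluster that is almost absent from HI discussion despite a similar marginal frequency. Further topic-specific analysis shows that VI contexts exhibit more negative sentiment than HI contexts in psychologically sensitive domains (mental health, body image, artificial identity). Our study establishes FCA as a principled tool for multi-signal discourse analysis and shows that virtuality changes not only what audiences say but also the underlying grammar of how signals co-occur in their reactions.

Abstract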

Virtual influencers (VIs) – digitally synthetic social-media personas – attract audiences whose discourse appears qualitatively different from discourse around human influencers (HIs). Existing work characterises this difference through surveys or aggregate engagement statistics, which reveal what audiences say but not how multiple signals co-occur. We propose a two-layer, structure-first framework grounded in Formal Concept Analysis (FCA) and association rule mining. The first layer applies FCA with support-based iceberg filtering to weekly-aggregated comment data, extracting discourse profiles – weekly co-occurrence bundles of sentiment, Big Five personality cues, and topic tags. The second layer mines association rules at the comment level, revealing personality–sentiment–topic dependencies invisible to frequency-table analysis. Applied to YouTube comments from three VI–HI influencer pairs, the two-layer analysis reveals a consistent structural divergence: HI discourse concentrates into a single, emotionally regulated (stability-centred) regime (low neuroticism anchoring positivity), while VI discourse supports three structurally distinct discourse modes, including an appearance-discourse cluster absent from HI despite near-equal marginal prevalence. Topic-specific analyses further show that VI contexts exhibit negative sentiment in psychologically sensitive domains (mental health, body image, artificial identity) relative to HI contexts. Our results position FCA as a principled tool for multi-signal discourse analysis and demonstrate that virtuality reshapes not just what audiences say, but the underlying grammar of how signals co-occur in their reactions.

Keywords: virtual influencers, human influencers, formal concept analysis, association rule mining, sentiment analysis, personality cues, discourse analysis, YouTube comments
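To make "FCA with support-based iceberg filtering" concrete, here is a minimal brute-force sketch. A formal concept pairs a set of objects (here, weeks) with exactly the attributes they all share; iceberg filtering keeps only concepts whose object set covers at least a minimum share of all objects. The toy context and tag names are invented, and this is not the authors' pipeline; enumerating every attribute subset is only practical for very small contexts.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

# Toy context: each week (object) maps to the tags observed in it.
# The weeks and tags are invented for illustration only.
context: Dict[str, Set[str]] = {
    "week1": {"positive", "low_neuroticism"},
    "week2": {"positive", "low_neuroticism", "appearance"},
    "week3": {"negative", "appearance"},
    "week4": {"positive", "low_neuroticism"},
}

Concept = Tuple[FrozenSet[str], FrozenSet[str]]


def extent(attrs: FrozenSet[str]) -> FrozenSet[str]:
    """Objects that have every attribute in attrs."""
    return frozenset(o for o, tags in context.items() if attrs <= tags)


def intent(objs: FrozenSet[str]) -> FrozenSet[str]:
    """Attributes shared by every object in objs."""
    all_attrs = set().union(*context.values())
    return frozenset(a for a in all_attrs if all(a in context[o] for o in objs))


def iceberg_concepts(min_support: float) -> List[Concept]:
    """Enumerate formal concepts whose extent covers >= min_support of objects."""
    all_attrs = sorted(set().union(*context.values()))
    found = set()
    for size in range(len(all_attrs) + 1):
        for combo in combinations(all_attrs, size):
            ext = extent(frozenset(combo))
            if len(ext) / len(context) >= min_support:
                found.add((ext, intent(ext)))  # closing the extent yields the concept
    return sorted(found, key=lambda c: (-len(c[0]), sorted(c[1])))


concepts = iceberg_concepts(0.5)
```

In this toy context, `positive` and `low_neuroticism` always appear together, so they collapse into one concept, which is the kind of co-occurrence bundle the abstract calls a discourse profile.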

49. ❌ Exploring How Fair Model Representations Relate to Fair Recommendations

Authors: Bjørnar Vassøy, Benjamin Kille, Helge Langseth Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24396v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

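Scoring rationale: The paper studies fairness in recommender systems, specifically the relationship between fairness in model representations and fairness of recommendations. It focuses entirely on fairness evaluation methods for recommender systems and does not involve large models, deep learning techniques, AI-for-science applications, or any of the related technical keywords. All scoring keywords target the large-model technology stack and AI-for-science applications, whereas this is traditional recommender-system research, so every keyword has a relevance of 0.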

!!! tip deepseek-chat TL;DR

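The paper challenges the assumption in recommender systems that fairness of model representations is a valid way to assess fairness of recommendations. Its experiments show that optimizing for representation-level fairness does improve recommendation fairness, but that representation-level evaluation is not a good proxy for measuring this effect when comparing models.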

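Abstract translation

One of the many fairness definitions pursued in recent recommender-system research aims to reduce the demographic information encoded in model representations. Models optimized for this definition are usually evaluated by how accurately demographic attributes can be classified from the model representations, under the (implicit) assumption that this measure accurately reflects recommendation parity, that is, how similar the recommendations given to different users are. We question this assumption by comparing the amount of demographic information encoded in representations with several measures of recommendation difference. We propose two new methods for measuring how accurately demographic information can be classified from ranked recommendation lists. Extensive tests of multiple models on one real dataset and several synthetically generated datasets show that optimizing for fair representations does have a positive effect on recommendation parity, but that representation-level evaluation is not a good proxy for this effect when comparing models. By evaluating each model on a range of generated datasets with different properties, we also provide detailed insight into how recommendation-level fairness metrics behave across models.

Abstract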

One of the many fairness definitions pursued in recent recommender system research targets mitigating demographic information encoded in model representations. Models optimized for this definition are typically evaluated on how well demographic attributes can be classified given model representations, with the (implicit) assumption that this measure accurately reflects recommendation parity, i.e., how similar recommendations given to different users are. We challenge this assumption by comparing the amount of demographic information encoded in representations with various measures of how the recommendations differ. We propose two new approaches for measuring how well demographic information can be classified given ranked recommendations. Our results from extensive testing of multiple models on one real and multiple synthetically generated datasets indicate that optimizing for fair representations positively affects recommendation parity, but also that evaluation at the representation level is not a good proxy for measuring this effect when comparing models. We also provide extensive insight into how recommendation-level fairness metrics behave for various models by evaluating their performances on numerous generated datasets with different properties.

Keywords: fairness, recommender systems, model representations, demographic information, recommendation parity, fairness metrics, evaluation methods, algorithmic fairness
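The contrast between representation-level and recommendation-level evaluation can be illustrated with two toy measures. Neither is taken from the paper: the nearest-centroid probe and the cross-group Jaccard overlap are stand-ins chosen for brevity, and all data below is invented.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(reps: List[Vector], groups: List[str]) -> float:
    """Nearest-centroid probe: how often is the group recoverable from the representation?

    Fitted and scored on the same users for brevity; a real probe needs a held-out split.
    """
    by_group: Dict[str, List[Vector]] = {}
    for rep, g in zip(reps, groups):
        by_group.setdefault(g, []).append(rep)
    centroids = {g: centroid(vs) for g, vs in by_group.items()}
    hits = sum(
        min(centroids, key=lambda g: sq_dist(rep, centroids[g])) == true_g
        for rep, true_g in zip(reps, groups)
    )
    return hits / len(reps)


def jaccard(a: List[str], b: List[str]) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def cross_group_overlap(recs: List[List[str]], groups: List[str]) -> float:
    """Mean Jaccard overlap of top-k lists over user pairs from different groups.

    Assumes at least two groups are present, so at least one pair exists.
    """
    pairs = [
        jaccard(recs[i], recs[j])
        for i in range(len(recs))
        for j in range(i + 1, len(recs))
        if groups[i] != groups[j]
    ]
    return sum(pairs) / len(pairs)


# Invented users: two per group, with 2-d representations and top-2 lists.
groups = ["a", "a", "b", "b"]
reps = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
recs = [["i1", "i2"], ["i1", "i3"], ["i1", "i2"], ["i2", "i3"]]
```

In this toy data the probe separates the groups perfectly while recommendations across groups still overlap substantially, a small reminder that the two levels of measurement need not move together.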

50. ❌ When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools

Authors: Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao, Qingyong Hu Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24389v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

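Scoring rationale: The paper's core contribution is Interaction2Eval, an LLM-based framework for assessing preschool education quality, so it is highly relevant to "Large Language Models" (10 points). As an application of AI to education, it is somewhat related to "AI for Science" (5 points), although it is not a scientific-domain application in the strict sense. The other keywords, such as MoE, SFT, and RAG, are neither mentioned nor addressed in the abstract and therefore score 0.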

!!! tip deepseek-chat TL;DR

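The study addresses the scalability of assessing teacher-child interaction quality in Chinese preschools. It develops Interaction2Eval, an LLM-based framework that reaches up to 88% agreement with expert assessments and makes the assessment workflow 18 times more efficient.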

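Abstract translation

High-quality teacher-child interaction is a cornerstone of early childhood development, yet the traditional expert-based assessment model faces a severe scalability challenge. In an education system as large as China's, which covers more than 250,000 kindergartens and serves 36 million children, the time and financial cost of manual observation makes continuous quality monitoring impractical, so assessment is often limited to occasional spot checks that make timely intervention and tracking of improvement difficult.
This paper investigates whether AI can act as a scalable assessment collaborator by extracting structured quality indicators and validating their agreement with human expert judgments. Our contributions are: (1) the TEPE-TCI-370h (Tracing Effective Preschool Education) dataset, the first large-scale dataset of naturalistic teacher-child interactions in Chinese kindergartens (370 hours, 105 classrooms), with standardized ECQRS-EC and SSTEW annotations; (2) the Interaction2Eval framework, a specialized LLM-based system that tackles domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning) and reaches up to 88% agreement with experts; (3) a deployment validation across 43 classrooms showing an 18-fold efficiency gain in the assessment workflow, which highlights the system's potential to move from annual expert audits to monthly AI-assisted monitoring with targeted human oversight. The study demonstrates that scalable, AI-augmented quality assessment is technically feasible and lays the groundwork for a new paradigm in preschool education in which continuous, inclusive, AI-assisted assessment drives systemic improvement and equitable development.

Abstract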

High-quality teacher-child interaction (TCI) is fundamental to early childhood development, yet traditional expert-based assessment faces a critical scalability challenge. In large systems like China’s, serving 36 million children across 250,000+ kindergartens, the cost and time requirements of manual observation make continuous quality monitoring infeasible, relegating assessment to infrequent episodic audits that limit timely intervention and improvement tracking. In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments. Our contributions include: (1) TEPE-TCI-370h (Tracing Effective Preschool Education), the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations; (2) Interaction2Eval, a specialized LLM-based framework addressing domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning) and achieving up to 88% agreement; (3) deployment validation across 43 classrooms demonstrating an 18x efficiency gain in the assessment workflow, highlighting its potential for shifting from annual expert audits to monthly AI-assisted monitoring with targeted human oversight. This work not only demonstrates the technical feasibility of scalable, AI-augmented quality assessment but also lays the foundation for a new paradigm in early childhood education, one where continuous, inclusive, AI-assisted evaluation becomes the engine of systemic improvement and equitable growth.

Keywords: Large Language Models, Early Childhood Education, Teacher-Child Interaction, AI Assessment, Scalable Evaluation, Preschool Education, Quality Monitoring, AI-assisted Monitoring
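The abstract reports agreement with experts but does not define the statistic. One plausible reading, shown here purely as an assumption, is the share of rubric items on which the model's score matches the expert's, optionally within a tolerance. The function name and the scores are invented.

```python
from typing import List


def agreement(model: List[int], expert: List[int], tolerance: int = 0) -> float:
    """Share of items where model and expert scores differ by at most `tolerance`."""
    if len(model) != len(expert) or not model:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(m - e) <= tolerance for m, e in zip(model, expert)) / len(model)


# Invented rubric scores for eight items; not data from the paper.
model_scores = [5, 4, 6, 3, 5, 7, 2, 4]
expert_scores = [5, 4, 5, 3, 5, 7, 3, 4]

exact = agreement(model_scores, expert_scores)
within_one = agreement(model_scores, expert_scores, tolerance=1)
```

Raw percent agreement does not correct for matches that occur by chance, so a reported figure is easiest to interpret alongside the scale and tolerance used.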

51. ❌ MolEvolve: LLM-Guided Evolutionary Search for Interpretable Molecular Optimization

Authors: Xiangsen Chen, Ruilong Wu, Yanyan Lan, Ting Ma, Yang Liu Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24382v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	10.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

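Scoring rationale: The paper's core is LLM-guided evolutionary search for molecular optimization, an AI for Science application. Highly relevant keywords: LLMs (core component), MCTS with LLM (planning engine), LLM Agents (autonomous search framework), Tool Use (uses tools such as RDKit), Explainable AI (transparent reasoning chains), and AI for Science (application to chemistry). Moderately relevant keywords: Chain of Thought (reasoning process), System 2 Thinking (planning search), and Self-Improvement (autonomous evolution). The remaining keywords are unrelated to the paper.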

!!! tip deepseek-chat TL;DR

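The paper proposes MolEvolve, a framework that uses a large language model to guide evolutionary search and addresses the interpretability and activity-cliff problems in molecular optimization. Experiments show that it outperforms baseline methods on molecular optimization tasks.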

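Abstract translation

Despite the success of deep learning in chemistry, its impact is limited by a lack of interpretability and an inability to handle activity cliffs, where small structural differences cause drastic changes in properties. Constrained by the similarity principle, existing representation learning methods often fail to capture these structure-activity discontinuities. To address this, we propose MolEvolve, an evolutionary framework that recasts molecular discovery as an autonomous look-ahead planning problem. Unlike traditional methods that rely on hand-engineered features or rigid prior knowledge, MolEvolve uses a large language model (LLM) to actively explore and evolve a library of executable symbolic chemical operations. By using the LLM for a cold start and a Monte Carlo Tree Search (MCTS) engine with external tools such as RDKit for test-time planning, the system autonomously discovers optimal evolution trajectories. This process produces transparent reasoning chains that turn complex structural transformations into actionable, human-readable chemical insights. Experimental results show that MolEvolve's autonomous search not only evolves transparent, readable chemical insights but also outperforms baseline methods on both property prediction and molecular optimization tasks.

Abstract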

Despite deep learning’s success in chemistry, its impact is hindered by a lack of interpretability and an inability to resolve activity cliffs, where minor structural nuances trigger drastic property shifts. Current representation learning, bound by the similarity principle, often fails to capture these structural-activity discontinuities. To address this, we introduce MolEvolve, an evolutionary framework that reformulates molecular discovery as an autonomous, look-ahead planning problem. Unlike traditional methods that depend on human-engineered features or rigid prior knowledge, MolEvolve leverages a Large Language Model (LLM) to actively explore and evolve a library of executable chemical symbolic operations. By utilizing the LLM to cold start and a Monte Carlo Tree Search (MCTS) engine for test-time planning with external tools (e.g. RDKit), the system self-discovers optimal trajectories autonomously. This process evolves transparent reasoning chains that translate complex structural transformations into actionable, human-readable chemical insights. Experimental results demonstrate that MolEvolve’s autonomous search not only evolves transparent, human-readable chemical insights, but also outperforms baselines in both property prediction and molecule optimization tasks.

Keywords: Molecular Optimization, Large Language Model, Evolutionary Search, Monte Carlo Tree Search, Interpretability, AI for Science, Autonomous Agents, Chemical Discovery
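For readers unfamiliar with MCTS as a planner over a library of operations, the sketch below runs the standard four phases (selection with UCT, expansion, random rollout, backpropagation) on a toy problem. It is not MolEvolve: the integer "molecule", the three operations, the depth limit, and the reward function are all hypothetical stand-ins for LLM-proposed chemical operations and an external evaluator such as RDKit.

```python
import math
import random
from typing import Callable, Dict, Optional

# Hypothetical symbolic edit operations on a toy integer "molecule".
OPS: Dict[str, Callable[[int], int]] = {
    "add1": lambda x: x + 1,
    "add3": lambda x: x + 3,
    "double": lambda x: x * 2,
}
MAX_DEPTH = 3
TARGET = 11


def reward(state: int) -> float:
    """Toy property score in (0, 1]; stands in for an external evaluator."""
    return 1.0 / (1.0 + abs(TARGET - state))


class Node:
    def __init__(self, state: int, depth: int, parent: Optional["Node"] = None) -> None:
        self.state, self.depth, self.parent = state, depth, parent
        self.children: Dict[str, "Node"] = {}
        self.visits = 0
        self.value = 0.0

    def uct_child(self, c: float) -> "Node":
        # UCT: mean value plus an exploration bonus for rarely visited children.
        return max(
            self.children.values(),
            key=lambda n: n.value / n.visits + c * math.sqrt(math.log(self.visits) / n.visits),
        )


def mcts(root_state: int, iterations: int, c: float = 1.4, seed: int = 0) -> str:
    """Return the name of the most-visited first operation (iterations >= 1)."""
    rng = random.Random(seed)
    root = Node(root_state, 0)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while node.depth < MAX_DEPTH and len(node.children) == len(OPS):
            node = node.uct_child(c)
        # 2. Expansion: add one untried operation.
        if node.depth < MAX_DEPTH:
            name = rng.choice([n for n in OPS if n not in node.children])
            child = Node(OPS[name](node.state), node.depth + 1, node)
            node.children[name] = child
            node = child
        # 3. Rollout: apply random operations up to the depth limit.
        state, depth = node.state, node.depth
        while depth < MAX_DEPTH:
            state, depth = OPS[rng.choice(list(OPS))](state), depth + 1
        # 4. Backpropagation: credit every node on the path to the root.
        r = reward(state)
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return max(root.children, key=lambda k: root.children[k].visits)


best_first_op = mcts(root_state=1, iterations=200)
```

The path from the root through the most-visited children doubles as a readable trace of which operations were applied, which is loosely analogous to the transparent reasoning chains the abstract describes.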

52. ❌ Language-Guided Structure-Aware Network for Camouflaged Object Detection

Authors: Min Zhang Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24355v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

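Scoring rationale: The paper focuses on camouflaged object detection (COD) in computer vision and proposes a vision network that combines CLIP text guidance with structure awareness. Although it uses CLIP, a multimodal foundation model, to generate masks from text prompts, its core contribution is the design of the vision architecture (PVT-v2 backbone, Fourier edge enhancement, structure-aware attention, and so on) rather than an innovation in large models or deep learning principles. All scoring keywords relate directly to large-model technology, training methods, inference optimization, alignment, agent systems, and similar topics, none of which the paper addresses; it only uses CLIP as an off-the-shelf text-image alignment tool without improving or innovating on its underlying technique. All keywords therefore have a relevance of 0.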

!!! tip deepseek-chat TL;DR

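To address the lack of textual semantic priors in camouflaged object detection, the paper proposes a Language-Guided Structure-Aware Network (LGSAN) that combines CLIP text prompts with multi-scale visual feature enhancement and achieves competitive detection performance on multiple datasets.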

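Abstract translation

Camouflaged Object Detection (COD) aims to segment objects that blend closely with the background in color, texture, and structure, which makes it a highly challenging computer vision task. Existing methods introduce multi-scale fusion and attention mechanisms to ease these difficulties, but they generally lack guidance from textual semantic priors, which limits the model's ability to focus on camouflaged regions in complex scenes. To address this, the paper proposes a Language-Guided Structure-Aware Network (LGSAN). Specifically, building on the PVT-v2 visual backbone, we introduce CLIP to generate masks from text prompts and RGB images, guiding the multi-scale features extracted by PVT-v2 toward potential target regions. On this basis, we design a Fourier Edge Enhancement Module (FEEM) that combines multi-scale features with high-frequency information in the frequency domain to extract edge-enhanced features. We also propose a Structure-Aware Attention Module (SAAM) to strengthen the model's perception of object structure and boundaries. Finally, we introduce a Coarse-Guided Local Refinement Module (CGLRM) to improve fine-grained reconstruction and boundary integrity in camouflaged object regions. Extensive experiments show that our method consistently achieves highly competitive performance on multiple COD datasets, confirming its effectiveness and robustness.

Abstract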

Camouflaged Object Detection (COD) aims to segment objects that are highly integrated with the background in terms of color, texture, and structure, making it a highly challenging task in computer vision. Although existing methods introduce multi-scale fusion and attention mechanisms to alleviate the above issues, they generally lack the guidance of textual semantic priors, which limits the model’s ability to focus on camouflaged regions in complex scenes. To address this issue, this paper proposes a Language-Guided Structure-Aware Network (LGSAN). Specifically, based on the visual backbone PVT-v2, we introduce CLIP to generate masks from text prompts and RGB images, thereby guiding the multi-scale features extracted by PVT-v2 to focus on potential target regions. On this foundation, we further design a Fourier Edge Enhancement Module (FEEM), which integrates multi-scale features with high-frequency information in the frequency domain to extract edge enhancement features. Furthermore, we propose a Structure-Aware Attention Module (SAAM) to effectively enhance the model’s perception of object structures and boundaries. Finally, we introduce a Coarse-Guided Local Refinement Module (CGLRM) to enhance fine-grained reconstruction and boundary integrity of camouflaged object regions. Extensive experiments demonstrate that our method consistently achieves highly competitive performance across multiple COD datasets, validating its effectiveness and robustness.

Keywords: Camouflaged Object Detection, Language-Guided, Structure-Aware Network, CLIP, Fourier Edge Enhancement, Multi-scale Features, Computer Vision, Semantic Priors
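The general idea behind using high-frequency information for edges can be shown on a single row of values: transform to the frequency domain, zero the low-frequency bins, and transform back, so that only rapid changes remain. This is a one-dimensional, pure-Python illustration of high-pass filtering, not the paper's FEEM, which operates on multi-scale feature maps; the cutoff and the signal are arbitrary.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform (O(n^2), fine for tiny inputs)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) for k in range(n)]


def idft(spectrum: List[complex]) -> List[float]:
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def high_pass(signal: List[float], cutoff: int) -> List[float]:
    """Zero every frequency bin below `cutoff`, keeping only high frequencies."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        if min(k, n - k) < cutoff:  # bins k and n-k carry the same frequency
            spectrum[k] = 0
    return idft(spectrum)


row = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]  # a step edge between index 3 and 4
edges = high_pass(row, cutoff=2)
flat = high_pass([2.0] * 8, cutoff=1)  # a constant row has no edges
```

A constant row filters to zeros because all of its energy sits in the zero-frequency bin, while the step row keeps a nonzero response.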

53. ❌ Evidence of an Emergent “Self” in Continual Robot Learning

Authors: Adidev Jhunjhunwala, Judah Goldfeder, Hod Lipson Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24350v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

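Scoring rationale: The paper studies how a concept of self forms during continual robot learning, identifying the "self" by analyzing the stability of cognitive structures; it is basic research on cognitive AI systems. It is unrelated to the vast majority of keywords (large-model technology, training methods, inference optimization, application domains, and so on). It has only some connection to "Self-Correction OR Self-Improvement OR Self-Reflection" (5 points), because it involves identifying a self-related cognitive structure, but it does not directly study self-correction or self-improvement techniques.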

!!! tip deepseek-chat TL;DR

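The paper proposes quantifying the concept of "self" in intelligent systems by identifying the relatively invariant part of the cognitive process. In experiments with continually learning robots it finds a significantly more stable invariant subnetwork, offering a new way to explore self-awareness in AI systems.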

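Abstract translation

A key challenge in understanding self-awareness is how to quantify, in a principled way, whether an intelligent system has a concept of "self" and how to distinguish the "self" from other cognitive structures. We propose that the "self" can be isolated by looking for the relatively invariant part of the cognitive process, which changes far less than rapidly acquired cognitive knowledge and skills, because the self is the most persistent aspect of our experience. We apply this principle to analyze the cognitive structure of robots under two conditions: one robot learns a fixed task, while another undergoes continual learning across varying tasks. We find that robots undergoing continual learning develop an invariant subnetwork that is significantly more stable (p < 0.001) than in the control. We suggest that this principle offers a new lens for exploring selfhood in other cognitive AI systems.

Abstract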

A key challenge to understanding self-awareness has been a principled way of quantifying whether an intelligent system has a concept of a “self,” and if so how to differentiate the “self” from other cognitive structures. We propose that the “self” can be isolated by seeking the invariant portion of cognitive process that changes relatively little compared to more rapidly acquired cognitive knowledge and skills, because our self is the most persistent aspect of our experiences. We used this principle to analyze the cognitive structure of robots under two conditions: One robot learns a constant task, while a second robot is subjected to continual learning under variable tasks. We find that robots subjected to continual learning develop an invariant subnetwork that is significantly more stable (p < 0.001) compared to the control. We suggest that this principle can offer a window into exploring selfhood in other cognitive AI systems.

Keywords: self-awareness, continual learning, cognitive structure, invariant subnetwork, robot learning, selfhood, AI systems, persistent experiences
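One simple way to look for a part of a network that "changes relatively little" is to track each weight across training snapshots and keep the lowest-variance ones. The abstract does not say how the authors identify their invariant subnetwork, so the procedure, the `keep_fraction` parameter, and the weight names below are all assumptions for illustration.

```python
from statistics import pvariance
from typing import Dict, List, Set


def invariant_subnetwork(snapshots: List[Dict[str, float]], keep_fraction: float) -> Set[str]:
    """Return the names of the weights with the lowest variance across snapshots."""
    names = sorted(snapshots[0])
    variance = {n: pvariance([s[n] for s in snapshots]) for n in names}
    k = max(1, int(len(names) * keep_fraction))
    return set(sorted(names, key=lambda n: variance[n])[:k])


# Invented weight values at three points during training.
snapshots = [
    {"w1": 0.5, "w2": 0.1, "w3": -1.0},
    {"w1": 0.5, "w2": 0.9, "w3": -1.1},
    {"w1": 0.5, "w2": -0.4, "w3": -0.9},
]

most_stable = invariant_subnetwork(snapshots, keep_fraction=0.34)
two_most_stable = invariant_subnetwork(snapshots, keep_fraction=0.67)
```

Comparing how stable the selected set is under a fixed task versus a sequence of varying tasks would mirror the two conditions described in the abstract.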

54. ❌ Enhancing Efficiency and Performance in Deepfake Audio Detection through Neuron-level dropin & Neuroplasticity Mechanisms

Authors: Yupei Li, Shuaijie Shao, Manuel Milling, Björn Schuller Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24343v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

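Scoring rationale: The paper studies audio deepfake detection and proposes dropin and plasticity algorithms that dynamically adjust the number of neurons to improve efficiency and performance. Relevance to the keywords: (1) The paper notes that the success of large language models (LLMs) demonstrates the benefits of parameter scaling but does not examine LLMs themselves, hence 5 points. (2) The proposed dropin and plasticity algorithms are in essence a parameter-efficient fine-tuning method: they flexibly modulate model parameters by dynamically adjusting the number of neurons, which is in the spirit of PEFT/LoRA (adapting a model efficiently without full retraining), hence 5 points. Other keywords such as MoE, SLMs, Scaling Laws, RAG, and inference acceleration are either absent from the paper or mentioned only as background, hence 0 points. The AI for Science keyword covers scientific applications, but the paper focuses on audio deepfake detection, a specific application rather than AI for science in the broad sense, hence 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes dropin and plasticity algorithms, inspired by neuroplasticity, that dynamically adjust the number of neurons to improve the efficiency and performance of audio deepfake detection models, achieving better computational efficiency and a marked reduction in error rate on several datasets.

Abstract (Chinese translation)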

当前音频深度伪造检测领域通过采用ResNet等多种深度学习架构已取得显著性能，而Wav2Vec等大型模型的引入进一步推动了性能提升。大型语言模型的成功不仅印证了扩展模型参数规模的益处，同时也凸显了性能增长受参数数量制约的瓶颈。若如当前大型语言模型般简单堆叠更多层数，将导致高昂的计算成本并需进行完整模型重训练。此外，现有的低秩自适应方法主要应用于基于注意力机制的架构，这限制了其适用范围。受哺乳动物大脑神经元可塑性的启发，我们提出了两种新颖算法——dropin与进阶可塑性算法，它们能动态调整特定层的神经元数量以灵活调控模型参数。我们在包括ResNet、门控循环神经网络及Wav2Vec在内的多种架构上评估了这些算法。使用广泛认可的ASVSpoof2019 LA、PA及FakeorReal数据集的实验结果表明：dropin方法能持续提升计算效率；在该系列数据集中，dropin与可塑性方法分别实现了最高约39%和66%的等错误率相对降低。相关代码与补充材料已发布于Github链接。

Abstract

Current audio deepfake detection has achieved remarkable performance using diverse deep learning architectures such as ResNet, and has seen further improvements with the introduction of large models (LMs) like Wav2Vec. The success of large language models (LLMs) further demonstrates the benefits of scaling model parameters, but also highlights one bottleneck where performance gains are constrained by parameter counts. Simply stacking additional layers, as done in current LLMs, is computationally expensive and requires full retraining. Furthermore, existing low-rank adaptation methods are primarily applied to attention-based architectures, which limits their scope. Inspired by the neuronal plasticity observed in mammalian brains, we propose novel algorithms, dropin and further plasticity, that dynamically adjust the number of neurons in certain layers to flexibly modulate model parameters. We evaluate these algorithms on multiple architectures, including ResNet, Gated Recurrent Neural Networks, and Wav2Vec. Experimental results using the widely recognised ASVSpoof2019 LA, PA, and FakeorReal datasets demonstrate consistent improvements in computational efficiency with the dropin approach and a maximum of around 39% and 66% relative reduction in Equal Error Rate with the dropin and plasticity approaches across these datasets, respectively. The code and supplementary material are available at Github link.
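The abstract describes dropin only as dynamically adjusting the number of neurons in certain layers. As a rough illustration of that idea, and not the paper's algorithm, the NumPy sketch below widens the hidden layer of a small two-layer ReLU network. The helper names `forward` and `widen_hidden_layer`, the zero-initialized outgoing weights, and the layer sizes are all assumptions made for this example; zero outgoing weights are one common way to add capacity without changing the current output.

```python
import numpy as np


def forward(x, w1, b1, w2):
    """Two-layer ReLU network: w1 is (hidden, in), w2 is (out, hidden)."""
    return w2 @ np.maximum(w1 @ x + b1, 0.0)


def widen_hidden_layer(w1, b1, w2, extra, rng):
    """Append `extra` hidden neurons (hypothetical helper, not from the paper).

    Incoming weights of the new neurons are small random values; their
    outgoing weights are zero, so the network output is unchanged until
    training updates them.
    """
    new_in = rng.normal(scale=0.01, size=(extra, w1.shape[1]))
    w1_wide = np.vstack([w1, new_in])
    b1_wide = np.concatenate([b1, np.zeros(extra)])
    w2_wide = np.hstack([w2, np.zeros((w2.shape[0], extra))])
    return w1_wide, b1_wide, w2_wide


rng = np.random.default_rng(0)
w1, b1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Grow the hidden layer from 4 to 6 neurons and compare outputs.
w1_wide, b1_wide, w2_wide = widen_hidden_layer(w1, b1, w2, extra=2, rng=rng)
same_output = np.allclose(forward(x, w1, b1, w2), forward(x, w1_wide, b1_wide, w2_wide))
```

Because the added neurons start with zero outgoing weights, `same_output` holds by construction; any efficiency or accuracy effect would come only from subsequent training, which this sketch does not cover.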

Keywords: deepfake audio detection, neuron-level dropin, neuroplasticity mechanisms, parameter-efficient fine-tuning, Wav2Vec, computational efficiency, Equal Error Rate reduction, dynamic neuron adjustment

55. ❌ GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Authors: Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi, Soham Hans, Volkan Ustun Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24329v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

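Scoring rationale: The paper studies multimodal large language models as perceptual backbones for autonomous agents in 3D environments and proposes the GameplayQA benchmark to evaluate agent-centric perception and reasoning. The core relevant keywords are: "LLM Agents/Autonomous Agents/Agentic Workflow" (highly relevant; the paper focuses on autonomous agents), "Multi-agent Systems/Agent Coordination" (highly relevant; it studies concurrent behaviors in multi-agent environments), "World Models AND General World Models" (highly relevant; the paper explicitly mentions world modeling), "Large Language Models/LLMs/Foundation Models" (relevant; multimodal LLMs serve as the perceptual backbone), "Hallucination Mitigation/Factuality/Truthfulness" (relevant; model hallucination is analyzed through structured distractors), and "Chain of Thought/CoT Reasoning/Multi-step Reasoning" and "System 2 Thinking/Slow Thinking/In-depth Reasoning" (somewhat related; the questions involve reasoning at several levels of cognitive complexity). Other keywords, such as MoE, quantization, and training methods, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes the GameplayQA benchmarking framework for evaluating the video understanding of multimodal large language models used as perceptual backbones for autonomous agents in 3D multi-agent environments, and finds that frontier models fall well short of human performance in temporal grounding, cross-video grounding, and agent-role attribution.

Abstract (Chinese translation)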

多模态大语言模型正日益作为自主智能体在三维环境（从机器人学到虚拟世界）中的感知主干被部署。这些应用要求智能体能够感知快速的状态变化、将行动归因于正确的实体，并从第一人称视角推理并发的多智能体行为——这些能力是现有基准测试未能充分评估的。我们提出了GameplayQA，一个通过视频理解来评估以智能体为中心的感知与推理能力的框架。具体而言，我们以每秒1.22个标签的密度对多人三维游戏视频进行了密集标注，提供了时间同步、并行的状态、行动和事件描述。这些描述围绕一个三元系统（自我、其他智能体与世界）构建，这是对多智能体环境的一种自然分解。基于这些标注，我们提炼出2.4K个诊断性问答对，并将其组织为三个认知复杂度层级，同时配以一个结构化的干扰项分类法，使得能够对模型产生幻觉的具体环节进行细粒度分析。对前沿多模态大语言模型的评估揭示了其与人类表现之间存在显著差距，常见失败包括时间和跨视频定位、智能体角色归因，以及处理游戏中的决策密度等方面。我们希望GameplayQA能够推动具身人工智能、智能体感知与世界建模交叉领域的未来研究。

Abstract

Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.

Keywords: Multimodal LLMs, Autonomous Agents, 3D Environments, Video Understanding, Multi-agent Systems, World Modeling, Benchmark Evaluation, Hallucination Analysis

56. ❌ Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing

Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Jing Zhang, Jun Zhang, Xing Wei, Yi Liu, Dianhai Yu, Yanjun Ma Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24326v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

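Scoring rationale: The paper addresses the computer-vision task of document parsing and proposes a coarse-to-fine visual processing architecture (PaddleOCR-VL) centered on applying a vision-language model (VLM) to document understanding. Although a vision-language model is involved, the emphasis is on making visual processing more efficient (for example, reducing the number of vision tokens and designing lightweight modules) rather than on large language model (LLM) techniques as such. All scoring keywords target LLM principles, training methods, inference optimization, alignment, and application paradigms, which have no direct connection to this visual document-parsing task. The paper does not cover LLM architecture innovations, training techniques, inference acceleration, alignment methods, agent systems, or any other topic named in the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper proposes PaddleOCR-VL, a coarse-to-fine visual processing architecture that uses a lightweight Valid Region Focus Module to cut redundant vision tokens, substantially improving both the efficiency and the performance of document parsing.

Abstract (Chinese translation)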

文档解析是一项细粒度任务，图像分辨率对其性能有显著影响。尽管利用视觉语言模型的先进研究受益于高分辨率输入以提升模型性能，但这通常会导致视觉标记数量呈二次增长，并显著增加计算成本。我们将这种低效归因于文档图像中存在大量冗余视觉区域（如背景）。为解决此问题，我们提出了PaddleOCR-VL，一种新颖的由粗到精架构，该架构聚焦于语义相关区域并抑制冗余区域，从而同时提升效率与性能。具体而言，我们引入了一个轻量级的有效区域聚焦模块（Valid Region Focus Module, VRFM），该模块利用定位和上下文关系预测能力来识别有效视觉标记。随后，我们设计并训练了一个紧凑而强大的9亿参数视觉语言模型（PaddleOCR-VL-0.9B）以执行细粒度识别，在VRFM输出的引导下避免直接处理整个大尺寸图像。大量实验表明，PaddleOCR-VL在页面级解析和元素级识别任务中均达到了最先进的性能。它显著优于现有解决方案，在与顶尖视觉语言模型的对比中展现出强大竞争力，并在使用更少视觉标记和参数的同时实现了快速推理，凸显了针对性由粗到精解析方法对于实现准确高效文档理解的有效性。源代码与模型已公开于https://github.com/PaddlePaddle/PaddleOCR。

Abstract

Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial redundancy in the visual regions of document images, such as background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.
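To make the quadratic growth concrete: an encoder that cuts the image into fixed-size, non-overlapping square patches produces one token per patch, so doubling both image sides quadruples the token count. The sketch below assumes a patch size of 14 pixels and uses a hypothetical boolean mask to stand in for the regions a module such as VRFM might keep; neither value comes from the paper.

```python
import numpy as np


def vision_token_count(height, width, patch=14):
    """Number of patch tokens, assuming non-overlapping square patches."""
    return (height // patch) * (width // patch)


def kept_token_count(valid_mask):
    """Tokens left after dropping patches marked redundant (False) in the mask."""
    return int(np.count_nonzero(valid_mask))


low = vision_token_count(448, 448)    # 32 x 32 patch grid
high = vision_token_count(896, 896)   # 64 x 64 patch grid

# Hypothetical mask: treat only the top quarter of the patch grid as valid.
mask = np.zeros((64, 64), dtype=bool)
mask[:16, :] = True
kept = kept_token_count(mask)
```

In this toy setting, filtering the high-resolution grid down to a quarter of its patches brings the token count back to that of the low-resolution image while the kept regions retain full resolution.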

Keywords: Document Parsing, Vision-Language Model, Coarse-to-Fine Architecture, Valid Region Focus Module, Efficiency Optimization, Visual Token Reduction, Page-level Parsing, Element-level Recognition

57. ❌ Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning

Authors: Dogan Urgun, Gokhan Gungor Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24324v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

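Scoring rationale: The paper's core contribution is using large language models (LLMs) to automatically design reward functions for cooperative multi-agent reinforcement learning systems, so it is highly relevant to "Large Language Models" (10 points). The study centers on coordination in multi-agent systems, making it highly relevant to "Multi-agent Systems" and "LLM Agents" (10 points each). Other keywords, such as MoE, SLMs, Scaling Laws, training methods, reasoning techniques, compression methods, and AI-for-science applications, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The study proposes a framework that uses large language models to automatically design reward functions, addressing incentive alignment in cooperative multi-agent reinforcement learning; experiments show that the framework improves task return and coordination efficiency across several environment layouts.

Abstract (Chinese translation)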

为合作多智能体系统设计有效的辅助奖励仍是一项艰巨任务；激励错位可能导致次优协调，尤其在稀疏任务反馈无法提供充分依据的场景中。本研究引入一种自动化奖励设计框架，利用大语言模型从环境监测数据中合成可执行的奖励程序。该流程将候选程序约束在形式化有效范围内，并通过在固定计算预算下从头训练策略来评估其效能；选择过程完全依赖于稀疏任务回报。该框架在四种不同的Overcooked-AI布局中进行评估，这些布局以差异化走廊拥堵度、交接依赖性和结构不对称性为特征。迭代搜索过程持续产生更优的任务回报与配送数量，其中在交互瓶颈主导的环境中增益最为显著。对合成奖励组件的诊断分析表明，在协调密集型任务中，动作选择的相互依赖性增强且信号对齐性得到改善。这些结果证明，通过搜索基于客观事实的奖励程序，既能减轻人工设计的负担，又能在有限预算下生成与合作学习兼容的塑形信号。

Abstract

Designing effective auxiliary rewards for cooperative multi-agent systems remains a precarious task; misaligned incentives risk inducing suboptimal coordination, especially where sparse task feedback fails to provide sufficient grounding. This study introduces an automated reward design framework that leverages large language models to synthesize executable reward programs from environment instrumentation. The procedure constrains candidate programs within a formal validity envelope and evaluates their efficacy by training policies from scratch under a fixed computational budget; selection depends exclusively on the sparse task return. The framework is evaluated across four distinct Overcooked-AI layouts characterized by varied corridor congestion, handoff dependencies, and structural asymmetries. Iterative search generations consistently yield superior task returns and delivery counts, with the most pronounced gains occurring in environments dominated by interaction bottlenecks. Diagnostic analysis of the synthesized shaping components indicates increased interdependence in action selection and improved signal alignment in coordination-intensive tasks. These results demonstrate that the search for objective-grounded reward programs can mitigate the burden of manual engineering while producing shaping signals compatible with cooperative learning under finite budgets.
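The procedure in the abstract (generate candidates, keep those inside the validity envelope, train under a fixed budget, select on sparse task return) can be sketched as a plain search loop. Everything named here is hypothetical: `select_reward_program`, the `compiles` check, and the lookup used as `train_and_evaluate` stand in for the LLM proposals, the formal validity check, and the from-scratch policy training, none of which the abstract specifies in detail.

```python
from typing import Callable, Iterable, Optional, Tuple


def select_reward_program(
    candidates: Iterable[str],
    is_valid: Callable[[str], bool],
    train_and_evaluate: Callable[[str], float],
) -> Tuple[Optional[str], float]:
    """Return the valid candidate with the highest sparse task return."""
    best_program, best_return = None, float("-inf")
    for program in candidates:
        if not is_valid(program):
            continue  # outside the validity envelope: never trained
        sparse_return = train_and_evaluate(program)  # same budget per candidate
        if sparse_return > best_return:
            best_program, best_return = program, sparse_return
    return best_program, best_return


def compiles(program: str) -> bool:
    """Toy validity check: the candidate must at least be parseable Python."""
    try:
        compile(program, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


# Toy stand-ins: fixed returns instead of real policy training.
toy_returns = {
    "def shaped(s): return 0.0": 12.0,
    "def shaped(s): return s.get('handoffs', 0)": 40.0,
}
candidates = list(toy_returns) + ["def shaped(s) return 1"]  # last one is invalid
best, best_return = select_reward_program(candidates, compiles, toy_returns.__getitem__)
```

Selecting only on the sparse return, as the abstract states, keeps the shaped reward from being judged by its own signal; an iterative version would feed the best candidates back to the generator for the next round.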

Keywords: Large Language Models, Multi-Agent Reinforcement Learning, Reward Design, Cooperative Systems, Automated Framework, Incentive Alignment, Overcooked-AI, Coordination

58. ❌ Toward Generalist Neural Motion Planners for Robotic Manipulators: Challenges and Opportunities

Authors: Davood Soleymanzadeh, Ivan Lopez-Sanchez, Hao Su, Yunzhu Li, Xiao Liang, Minghui Zheng Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24318v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

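Scoring rationale: This is a survey of neural motion planners for robotic manipulators. It covers the technical challenges of robot motion planning, the strengths and weaknesses of existing neural motion planners, and future directions. The content is confined to robot motion planning and does not touch on large language models, deep learning principles, model training methods, inference optimization, alignment techniques, AI agents, or AI-for-science applications. All scoring keywords concern large models and deep learning techniques, whereas the paper addresses the separate field of robot motion planning, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper surveys the state of neural motion planners for robotic manipulators, analyzes their limited generalization in complex environments, and outlines a path toward generalist neural motion planners.

Abstract (Chinese translation)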

当前最先进的通用操作策略已使机器人操纵器能够在非结构化的人类环境中部署。然而，这些框架在杂乱环境中表现不佳，主要因为它们依赖辅助模块进行底层运动规划与控制。由于机器人构型空间的高维性以及工作空间中障碍物的存在，运动规划仍然具有挑战性。神经运动规划器通过提供快速推理能力并有效处理运动规划问题固有的多模态特性，提升了运动规划效率。尽管具备这些优势，现有的神经运动规划器往往难以泛化至未见过的、分布外的规划场景。本文回顾并分析了最先进的神经运动规划器，着重阐述了其优势与局限性。同时，本文还勾勒了建立通用神经运动规划器的路径，以应对特定领域的挑战。所评论文的列表请参阅 https://davoodsz.github.io/planning-manip-survey.github.io/。

Abstract

State-of-the-art generalist manipulation policies have enabled the deployment of robotic manipulators in unstructured human environments. However, these frameworks struggle in cluttered environments primarily because they utilize auxiliary modules for low-level motion planning and control. Motion planning remains challenging due to the high dimensionality of the robot’s configuration space and the presence of workspace obstacles. Neural motion planners have enhanced motion planning efficiency by offering fast inference and effectively handling the inherent multi-modality of the motion planning problem. Despite such benefits, current neural motion planners often struggle to generalize to unseen, out-of-distribution planning settings. This paper reviews and analyzes the state-of-the-art neural motion planners, highlighting both their benefits and limitations. It also outlines a path toward establishing generalist neural motion planners capable of handling domain-specific challenges. For a list of the reviewed papers, please refer to https://davoodsz.github.io/planning-manip-survey.github.io/.

Keywords: neural motion planners, robotic manipulators, motion planning, generalist manipulation policies, configuration space, workspace obstacles, out-of-distribution generalization, domain-specific challenges

59. ❌ Cost-Sensitive Neighborhood Aggregation for Heterophilous Graphs: When Does Per-Edge Routing Help?

Authors: Eyal Weiss Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24291v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

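Scoring rationale: The paper studies classification on heterophilous graphs with graph neural networks (GNNs) and proposes Cost-Sensitive Neighborhood Aggregation (CSNA) to improve message passing. Its subject is graph neural networks and graph representation learning, with no direct connection to any scoring keyword (all of which concern large models, deep learning principles, AI applications, and the like). The paper does not mention large models, language models, training techniques, reasoning methods, AI agents, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper asks when per-edge message routing is needed on heterophilous graphs and proposes a cost-sensitive neighborhood aggregation method. The method is competitive with state-of-the-art approaches on adversarial-heterophily datasets but underperforms on informative-heterophily datasets, which marks the boundary of where fine-grained routing adds value.

Abstract (Chinese translation)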

近期研究区分了两种异质性机制：对抗性异质性（跨类边稀释类别信号并损害分类性能）与信息性异质性（异质结构本身携带有效信号）。本文探讨：何时基于单边的消息路由策略具有增益？何时均匀的谱通道已足够？为系统研究该问题，我们提出成本敏感邻域聚合（CSNA），这是一种图神经网络层，其通过学习的投影空间计算成对节点距离，并利用该距离将每条消息通过独立变换的协调通道与不协调通道进行软路由。在上下文随机块模型下，我们证明当满足 $w_+/w_- > q/p$ 条件时，成本敏感加权能保留类别判别信号，而均值聚合在该条件下可证明会衰减此类信号。在六个基准数据集上采用统一调参的实验表明，CSNA 在对抗性异质性数据集（Texas、Wisconsin、Cornell、Actor）上与前沿方法竞争力相当，但在信息性异质性数据集（Chameleon、Squirrel）上表现欠佳——这恰恰对应了单边路由缺乏有效可分解信息的机制。该模式本身即是研究发现：成本函数区分边类型的能力可作为异质性机制的诊断工具，揭示细粒度路由在何种场景下优于均匀通道，在何种场景下无效。代码发布于 https://github.com/eyal-weiss/CSNA-public。

Abstract

Recent work distinguishes two heterophily regimes: adversarial, where cross-class edges dilute class signal and harm classification, and informative, where the heterophilous structure itself carries useful signal. We ask: when does per-edge message routing help, and when is a uniform spectral channel sufficient? To operationalize this question we introduce Cost-Sensitive Neighborhood Aggregation (CSNA), a GNN layer that computes pairwise distance in a learned projection and uses it to soft-route each message through concordant and discordant channels with independent transformations. Under a contextual stochastic block model we show that cost-sensitive weighting preserves class-discriminative signal where mean aggregation provably attenuates it, provided $w_+/w_- > q/p$. On six benchmarks with uniform tuning, CSNA is competitive with state-of-the-art methods on adversarial-heterophily datasets (Texas, Wisconsin, Cornell, Actor) but underperforms on informative-heterophily datasets (Chameleon, Squirrel) – precisely the regime where per-edge routing has no useful decomposition to exploit. The pattern is itself the finding: the cost function’s ability to separate edge types serves as a diagnostic for the heterophily regime, revealing when fine-grained routing adds value over uniform channels and when it does not. Code is available at https://github.com/eyal-weiss/CSNA-public .
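The role of the condition $w_+/w_- > q/p$ can be seen in a simplified two-class picture. Assume class means $\pm\mu$, equally many potential same-class and cross-class neighbors, same-class edge probability $p$, and cross-class edge probability $q$. These modeling choices, the function name, and the reading of $w_+$ and $w_-$ as weights on same-class and cross-class messages are assumptions for illustration, not the paper's derivation. Under them, the expected aggregated signal is proportional to $(w_+ p - w_- q)/(w_+ p + w_- q)$, which keeps the sign of the node's own class exactly when $w_+/w_- > q/p$.

```python
def expected_class_signal(p, q, w_plus=1.0, w_minus=1.0, mu=1.0):
    """Expected weighted-mean neighbor signal for a node of the +mu class.

    Same-class neighbors contribute +mu with edge probability p and weight
    w_plus; cross-class neighbors contribute -mu with probability q and
    weight w_minus. Plain mean aggregation is the case w_plus == w_minus.
    """
    numerator = w_plus * p - w_minus * q
    denominator = w_plus * p + w_minus * q
    return mu * numerator / denominator


p, q = 0.2, 0.6  # heterophilous setting: cross-class edges dominate

mean_signal = expected_class_signal(p, q)                               # plain mean
weighted_signal = expected_class_signal(p, q, w_plus=4.0, w_minus=1.0)  # 4 > q/p = 3
boundary_signal = expected_class_signal(p, q, w_plus=3.0, w_minus=1.0)  # ratio equals q/p
```

With these toy numbers the plain mean points toward the wrong class, the weighted mean with ratio 4 recovers the correct sign, and the ratio 3 sits on the boundary where the signal vanishes.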

Keywords: heterophilous graphs, graph neural networks, neighborhood aggregation, message routing, cost-sensitive weighting, adversarial heterophily, informative heterophily, contextual stochastic block model

60. ❌ The Specification Gap: Coordination Failure Under Partial Knowledge in Code Agents

Authors: Camilo Chacón Sartori Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24284v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

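Scoring rationale: The paper studies coordination failure among multiple LLM code agents under partial knowledge. Its core topics are LLM agents and multi-agent systems, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models", "LLM Agents OR Autonomous Agents OR Agentic Workflow", and "Multi-agent Systems OR Agent Coordination" (10 points each). Other keywords, such as MoE, SFT, RAG, reasoning methods, compression techniques, and AI-for-science applications, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies the coordination failures that arise from incomplete specifications when multiple LLM-based code agents independently implement different parts of the same class, and finds that the level of specification detail is a key factor in the integration accuracy of multi-agent code generation.

Abstract (Chinese translation)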

当多个基于大语言模型的代码代理独立实现同一类的不同部分时，即使规范未明确定义相关选择，它们也必须在共享的内部表征上达成一致。本研究通过51个类生成任务探讨这一协调问题：逐步从完整文档字符串（L0）剥离规范细节至仅存签名（L3），并引入对立的结构偏好（列表与字典）以压力测试集成效果。主要发现有三点：首先，存在持续的规范缺口——随着细节减少，双代理集成准确率从58%降至25%，而单代理基线下降更为平缓（89%至56%），形成25-39个百分点的稳定协调缺口，该现象在两个Claude模型（Sonnet、Haiku）和三次独立运行中保持一致。其次，基于抽象语法树（AST）的冲突检测器在最弱规范层级达到97%的精确度且无需额外调用大语言模型，但析因恢复实验表明，仅恢复完整规范即可达到单代理上限准确率（89%），而提供冲突报告未带来可测量的增益。第三，将缺口分解为协调成本（+16个百分点）与信息不对称（+11个百分点）显示，这两种效应相互独立且近似可叠加。该缺口不仅是隐藏信息的后果，更反映了缺乏共同决策时生成兼容代码的固有难度。这些结果支持多代理代码生成的“规范优先”观点：更丰富的规范既是核心协调机制，也是充分的恢复工具。

Abstract

When multiple LLM-based code agents independently implement parts of the same class, they must agree on shared internal representations, even when the specification leaves those choices implicit. We study this coordination problem across 51 class-generation tasks, progressively stripping specification detail from full docstrings (L0) to bare signatures (L3), and introducing opposing structural biases (lists vs. dictionaries) to stress-test integration. Three findings emerge. First, a persistent specification gap: two-agent integration accuracy drops from 58% to 25% as detail is removed, while a single-agent baseline degrades more gracefully (89% to 56%), leaving a 25–39 pp coordination gap that is consistent across two Claude models (Sonnet, Haiku) and three independent runs. Second, an AST-based conflict detector achieves 97% precision at the weakest specification level without additional LLM calls, yet a factorial recovery experiment shows that restoring the full specification alone recovers the single-agent ceiling (89%), while providing conflict reports adds no measurable benefit. Third, decomposing the gap into coordination cost (+16 pp) and information asymmetry (+11 pp) suggests that the two effects are independent and approximately additive. The gap is not merely a consequence of hidden information, but reflects the difficulty of producing compatible code without shared decisions. These results support a specification-first view of multi-agent code generation: richer specifications are both the primary coordination mechanism and the sufficient recovery instrument.
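As an illustration of what an AST-based conflict check can look like, the sketch below uses Python's standard `ast` module to flag `self` attributes that one code fragment initializes as a list and another as a dictionary, mirroring the lists-versus-dictionaries bias used in the study. The function names and the two toy fragments are hypothetical; this is not the paper's detector, whose rules the abstract does not describe.

```python
import ast


def attribute_container_kinds(source):
    """Map each `self.<name>` to the container kinds ("list"/"dict") assigned to it."""
    kinds = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        if isinstance(node.value, (ast.List, ast.ListComp)):
            kind = "list"
        elif isinstance(node.value, (ast.Dict, ast.DictComp)):
            kind = "dict"
        else:
            continue  # only literal list/dict assignments are inspected
        for target in node.targets:
            if (
                isinstance(target, ast.Attribute)
                and isinstance(target.value, ast.Name)
                and target.value.id == "self"
            ):
                kinds.setdefault(target.attr, set()).add(kind)
    return kinds


def find_conflicts(source_a, source_b):
    """Attributes that the two fragments assign incompatible container kinds."""
    a = attribute_container_kinds(source_a)
    b = attribute_container_kinds(source_b)
    return sorted(name for name in a.keys() & b.keys() if len(a[name] | b[name]) > 1)


AGENT_A = """
class Inventory:
    def __init__(self):
        self.items = []
"""

AGENT_B = """
class Inventory:
    def reset(self):
        self.items = {}
        self.count = 0
"""

conflicts = find_conflicts(AGENT_A, AGENT_B)
```

A static check like this needs no extra LLM calls, which matches the setting in the abstract; note that the abstract also reports that surfacing such conflict reports did not measurably improve recovery compared with restoring the full specification.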

Keywords: LLM-based code agents, multi-agent coordination, specification gap, code generation, coordination failure, partial knowledge, integration accuracy, AST-based conflict detection

61. ❌ Bridging Biological Hearing and Neuromorphic Computing: End-to-End Time-Domain Audio Signal Processing with Reservoir Computing

Authors: Rinku Sebastian, Simon O’Keefe, Martin Trefzer Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24283v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

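Scoring rationale: The paper focuses on audio signal processing, specifically time-domain audio processing and MFCC feature extraction with reservoir computing, aimed at real-time, efficient speech analysis. The work is unrelated to deep learning or large-model techniques and touches none of the techniques in the scoring keywords (e.g., LLMs, MoE, SFT, RAG, quantization) or AI-for-science applications (e.g., bioinformatics). Every keyword is entirely irrelevant, so the score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes an end-to-end time-domain audio signal processing method based on reservoir computing that simplifies MFCC feature extraction and enables a real-time, efficient speech processing system.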

Abstract

Despite the advancements in cutting-edge technologies, audio signal processing continues to pose challenges and lacks the precision of a human speech processing system. To address these challenges, we propose a novel approach to simplify audio signal processing by leveraging time-domain techniques and reservoir computing. Through our research, we have developed a real-time audio signal processing system by simplifying audio signal processing through the utilization of reservoir computers, which are significantly easier to train. Feature extraction is a fundamental step in speech signal processing, with Mel Frequency Cepstral Coefficients (MFCCs) being a dominant choice due to their perceptual relevance to human hearing. However, conventional MFCC extraction relies on computationally intensive time-frequency transformations, limiting efficiency in real-time applications. To address this, we propose a novel approach that leverages reservoir computing to streamline MFCC extraction. By replacing traditional frequency-domain conversions with convolution operations, we eliminate the need for complex transformations while maintaining feature discriminability. We present an end-to-end audio processing framework that integrates this method, demonstrating its potential for efficient and real-time speech analysis. Our results contribute to the advancement of energy-efficient audio processing technologies, enabling seamless deployment in embedded systems and voice-driven applications. This work bridges the gap between biologically inspired feature extraction and modern neuromorphic computing, offering a scalable solution for next-generation speech recognition systems.

Keywords: audio signal processing, reservoir computing, time-domain processing, MFCC extraction, real-time speech analysis, neuromorphic computing, energy-efficient processing, speech recognition
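
The abstract's remark that reservoir computers are "significantly easier to train" follows from how they are built: the recurrent weights stay fixed and random, and only a linear readout is fitted. The following is a minimal echo state network sketch of that idea in plain Python, not the authors' system. The reservoir size, weight scaling, ridge value, and next-sample prediction task are all illustrative assumptions, and no MFCC extraction is attempted.

```python
import math
import random

def make_reservoir(n_units, seed=0, scale=0.9):
    """Fixed random input and recurrent weights; these are never trained."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-0.5, 0.5) for _ in range(n_units)]
    # Each row's absolute sum stays below `scale` (< 1), so the update is contractive.
    w_res = [[rng.uniform(-1, 1) * scale / n_units for _ in range(n_units)]
             for _ in range(n_units)]
    return w_in, w_res

def run_reservoir(signal, w_in, w_res):
    """Drive the reservoir with a 1-D signal and collect its state at every step."""
    n = len(w_in)
    state = [0.0] * n
    states = []
    for u in signal:
        state = [math.tanh(w_in[i] * u + sum(w_res[i][j] * state[j] for j in range(n)))
                 for i in range(n)]
        states.append(state)
    return states

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def train_readout(states, targets, ridge=1e-3):
    """Ridge regression on the collected states: the only trained component."""
    n = len(states[0])
    gram = [[sum(s[i] * s[j] for s in states) + (ridge if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    rhs = [sum(s[i] * t for s, t in zip(states, targets)) for i in range(n)]
    return solve(gram, rhs)

# Toy task (an assumption for illustration): predict the next sample of a sine wave.
signal = [math.sin(0.2 * t) for t in range(200)]
targets = signal[1:]
w_in, w_res = make_reservoir(10)
states = run_reservoir(signal[:-1], w_in, w_res)
w_out = train_readout(states, targets)
preds = [sum(w * s for w, s in zip(w_out, st)) for st in states]
sse = sum((p - t) ** 2 for p, t in zip(preds, targets))
```

Training here reduces to a single linear solve over the collected states, which is the sense in which a reservoir is cheaper to fit than a fully trained recurrent network.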

62. ❌ Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep

Authors: Tianyi Liu, Ye Lu, Linfeng Zhang, Chen Cai, Jianjun Gao, Yi Wang, Kim-Hui Yap, Lap-Pui Chau Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24260v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

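Scoring rationale: The paper focuses on accelerating Diffusion Transformers (DiT) for video editing, using heterogeneous caching (HetCache) to reduce redundant attention computation. All scoring keywords concern large language models (LLMs), whereas the paper studies diffusion models, a different class of generative model. Although both belong to deep learning, the paper does not address any LLM technique, training method, inference optimization, alignment, agent system, or AI-for-science application. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes HetCache, a training-free diffusion acceleration framework that assesses the contextual relevance and interaction strength of spatio-temporal tokens and selectively caches the most semantically representative ones, reducing redundant attention computation in Diffusion Transformers (DiT) for video editing. It reports a 2.67× latency speedup and a reduction in FLOPs while preserving editing quality.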

Abstract

Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, Diffusion Transformers (DiT) remain computationally expensive due to the iterative denoising process, posing challenges for practical deployment. Existing video diffusion acceleration methods primarily exploit denoising timestep-level feature reuse, which mitigates the redundancy in the denoising process but overlooks the architectural redundancy within the DiT, where many attention operations over spatio-temporal tokens are redundantly executed, offering little to no incremental contribution to the model output. This work introduces HetCache, a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity in diffusion-based masked video-to-video (MV2V) generation and editing. Instead of uniformly reusing or randomly sampling tokens, HetCache assesses the contextual relevance and interaction strength among various types of tokens in designated computing steps. Guided by spatial priors, it divides the spatio-temporal tokens in the DiT model into context and generative tokens, and selectively caches the context tokens that exhibit the strongest correlation and most representative semantics with generative ones. This strategy reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves a noticeable acceleration, including a 2.67$\times$ latency speedup and FLOPs reduction over commonly used foundation models, with negligible degradation in editing quality.

Keywords: Diffusion-based video editing, Diffusion Transformers (DiT), Heterogeneous caching, Attention redundancy reduction, Training-free acceleration, Spatio-temporal tokens, Context and generative tokens, Latency speedup
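
The token-selection idea described in the abstract can be pictured roughly as follows. This is only an illustrative sketch, not HetCache itself: the cosine-similarity score against the mean generative token, the `keep` budget, and the function name `select_context_tokens` are assumptions made for the example, and the actual method operates inside DiT attention layers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_context_tokens(tokens, edit_mask, keep):
    """Split tokens with a spatial mask, then keep the most relevant context tokens."""
    generative = [t for t, masked in zip(tokens, edit_mask) if masked]
    context = [(i, t) for i, (t, masked) in enumerate(zip(tokens, edit_mask)) if not masked]
    # Mean generative feature: a simple stand-in for "what the edited region looks like".
    dim = len(tokens[0])
    centroid = [sum(g[d] for g in generative) / len(generative) for d in range(dim)]
    ranked = sorted(context, key=lambda item: cosine(item[1], centroid), reverse=True)
    return sorted(index for index, _ in ranked[:keep])

# Toy 2-D token features; only token 0 lies inside the edited (masked) region.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.2], [-1.0, 0.0]]
edit_mask = [True, False, False, False, False]
selected = select_context_tokens(tokens, edit_mask, keep=2)
```

In this toy case the two context tokens that point in nearly the same direction as the generative token are retained, while the orthogonal and opposite ones are dropped.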

63. ❌ DVM: Real-Time Kernel Generation for Dynamic AI Models

Authors: Jingzhi Fang, Xiong Gao, Renwei Zhang, Zichun Ye, Lei Chen, Jie Zhao, Chengnuo Huang, Hui Xu, Xuefeng Jin Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24239v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

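Scoring rationale: The paper focuses on compilation and runtime optimization for AI models, specifically real-time kernel generation for dynamic tensor shapes and control flow. The scoring keywords cover specific directions such as large-model principles, training methods, alignment, inference optimization, and applications, whereas this work studies low-level compiler and runtime system optimization and has no direct connection to them. The paper mentions no large models, no innovations in deep learning principles, and no scientific applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes DVM, a real-time compiler that uses a bytecode virtual machine and runtime operator fusion to address the efficiency loss that long compilation times cause for dynamic AI models. It reports up to 11.77× better operator/model efficiency and up to five orders of magnitude shorter maximum compilation time than existing approaches.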

Abstract

Dynamism is common in AI computation, e.g., the dynamic tensor shapes and the dynamic control flows in models. Due to the long compilation time, existing runtime compilation damages the model efficiency, while the offline compilers either suffer from the long compilation time and device memory footprint to cover all the possible execution instances of a dynamic model, or sacrifice optimization opportunities for usability. In this paper, we rethink the feasibility of runtime compilation for dynamic models and identify that the key for it to work is to speed up the compilation or hide the compilation overhead. To do this, we propose a real-time compiler, DVM. In DVM, we design a runtime operator compiler based on a bytecode virtual machine to perform effective and efficient compilation for each dynamic operator instance given its input. Specifically, instead of compiling programs into machine code, we encode the operator program into bytecode on the CPU and decode the bytecode into virtual instructions for direct execution on the NPU. Based on the runtime operator compiler, we further propose an operator fuser, which performs symbol-deduction-based fusion on static graphs and runtime fusion on dynamic graphs. Both pattern- and stacking-based fusion are supported to increase fusion opportunities. Evaluation on operators, subgraphs, and models shows that, compared with TorchInductor, PyTorch-eager and MindSpore-graph-O0, we are up to 11.77$\times$ better in terms of the operator/model efficiency and up to 5 orders of magnitude faster in terms of the maximum compilation time.

Keywords: real-time compilation, dynamic AI models, runtime operator compiler, bytecode virtual machine, operator fusion, kernel generation, NPU execution, compilation efficiency
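
To make the "encode to bytecode, then decode and execute" idea concrete, here is a toy sketch in plain Python. It is not DVM's instruction set or runtime: the opcodes, the two-word instruction format, and the `encode`/`execute` names are hypothetical, and the "device side" is just an interpreter loop standing in for NPU execution.

```python
# Hypothetical opcodes for a toy elementwise instruction set.
OP_LOAD, OP_ADD_CONST, OP_MUL_CONST, OP_RELU, OP_STORE = range(5)

def encode(program):
    """'CPU side': flatten (opcode, argument) pairs into a flat bytecode list."""
    code = []
    for op, arg in program:
        code.extend([op, arg])
    return code

def execute(code, tensor):
    """'Device side': decode each instruction and apply it to the whole tensor."""
    reg, out = None, None
    for pc in range(0, len(code), 2):
        op, arg = code[pc], code[pc + 1]
        if op == OP_LOAD:
            reg = list(tensor)
        elif op == OP_ADD_CONST:
            reg = [x + arg for x in reg]
        elif op == OP_MUL_CONST:
            reg = [x * arg for x in reg]
        elif op == OP_RELU:
            reg = [max(x, 0) for x in reg]
        elif op == OP_STORE:
            out = reg
        else:
            raise ValueError(f"unknown opcode {op}")
    return out

# A fused operator relu(2 * x - 3), encoded once and reused for any input length.
fused = encode([(OP_LOAD, 0), (OP_MUL_CONST, 2), (OP_ADD_CONST, -3),
                (OP_RELU, 0), (OP_STORE, 0)])
result_short = execute(fused, [1, 2])
result_long = execute(fused, [0, 1, 2, 3, 4])
```

Because the bytecode never hard-codes the tensor length, the same encoded operator serves inputs of different shapes without recompilation, which is the property the abstract highlights for dynamic models.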

64. ❌ Embracing Heteroscedasticity for Probabilistic Time Series Forecasting

Authors: Yijun Wang, Qiyuan Zhuang, Xiu-Shen Wei Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24254v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

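Scoring rationale: The paper focuses on probabilistic time series forecasting (PTSF) and proposes LSG-VAE, a variational autoencoder framework that explicitly parameterizes the predictive mean and time-varying variance through a location-scale likelihood in order to better capture heteroscedasticity. Its core content concerns time series analysis, generative models (VAEs), and uncertainty quantification. All of the scoring keywords relate directly to large language models (LLMs), innovations in deep learning principles, or specific scientific applications of AI (e.g., bioinformatics). The paper's topic has no direct connection to these keywords: it neither mentions LLMs, MoE, scaling laws, training or fine-tuning techniques, inference optimization, agents, or model compression, nor addresses AI applications in bio- or cheminformatics. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the inability of existing probabilistic time series forecasting methods to model time-varying heteroscedasticity, which stems from the homoscedastic assumption implicit in their MSE-based training objectives. It proposes the Location-Scale Gaussian VAE framework, which explicitly parameterizes the predictive mean and time-varying variance; experiments show it outperforms existing generative baselines on multiple benchmark datasets while remaining efficient.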

Abstract

Probabilistic time series forecasting (PTSF) aims to model the full predictive distribution of future observations, enabling both accurate forecasting and principled uncertainty quantification. A central requirement of PTSF is to embrace heteroscedasticity, as real-world time series exhibit time-varying conditional variances induced by nonstationary dynamics, regime changes, and evolving external conditions. However, most existing non-autoregressive generative approaches to PTSF, such as TimeVAE and $K^2$VAE, rely on MSE-based training objectives that implicitly impose a homoscedastic assumption, thereby fundamentally limiting their ability to model temporal heteroscedasticity. To address this limitation, we propose the Location-Scale Gaussian VAE (LSG-VAE), a simple but effective framework that explicitly parameterizes both the predictive mean and time-dependent variance through a location-scale likelihood formulation. This design enables LSG-VAE to faithfully capture heteroscedastic aleatoric uncertainty and introduces an adaptive attenuation mechanism that automatically down-weights highly volatile observations during training, leading to improved robustness in trend prediction. Extensive experiments on nine benchmark datasets demonstrate that LSG-VAE consistently outperforms fifteen strong generative baselines while maintaining high computational efficiency suitable for real-time deployment.

Keywords: Probabilistic Time Series Forecasting, Heteroscedasticity, Location-Scale Gaussian VAE, Uncertainty Quantification, Generative Models, Non-autoregressive, Time-varying Variance, Robust Trend Prediction
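
The link between an MSE objective and homoscedasticity, and the "attenuation" effect of a learned variance, can be seen from the plain Gaussian negative log-likelihood. The sketch below uses only that textbook formula; whether LSG-VAE's training loss takes exactly this form is an assumption, and the numbers are toy values.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2) for a single time step."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def grad_wrt_mu(y, mu, sigma):
    """d NLL / d mu = (mu - y) / sigma^2: the residual is scaled by 1 / sigma^2."""
    return (mu - y) / sigma ** 2

observation, predicted_mean = 3.0, 1.0

# With sigma fixed at 1, the NLL is half the squared error plus a constant (MSE-like).
nll_fixed = gaussian_nll(observation, predicted_mean, sigma=1.0)

# The same residual pulls on the mean four times less when sigma doubles.
grad_calm = grad_wrt_mu(observation, predicted_mean, sigma=1.0)
grad_volatile = grad_wrt_mu(observation, predicted_mean, sigma=2.0)
```

A step predicted to be volatile (large sigma) therefore contributes a smaller gradient to the mean, which is one way to read the abstract's description of automatically down-weighting highly volatile observations.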

65. ❌ Who Benefits from RAG? The Role of Exposure, Utility and Attribution Bias

Authors: Mahdi Dehghan, Graham McDonald Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24218v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	15.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

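Scoring rationale: The paper's core topic is fairness in RAG systems built on LLMs. It is highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (15 points), directly involves LLMs (10 points), and indirectly touches on accuracy/factuality (5 points). No other keyword is addressed.

!!! tip deepseek-chat TL;DR

The paper studies query group fairness in retrieval-augmented generation (RAG) systems. It finds that RAG systems amplify disparities in average accuracy across queries from different groups, and that group utility, exposure, and attribution can correlate strongly, positively or negatively, with accuracy.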

Abstract

Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) have achieved substantial improvements in accuracy by grounding their responses in external documents that are relevant to the user’s query. However, relatively little work has investigated the impact of RAG in terms of fairness. Particularly, it is not yet known if queries that are associated with certain groups within a fairness category systematically receive higher accuracy, or accuracy improvements in RAG systems compared to LLM-only, a phenomenon we refer to as query group fairness. In this work, we conduct extensive experiments to investigate the impact of three key factors on query group fairness in RAG, namely: Group exposure, i.e., the proportion of documents from each group appearing in the retrieved set, determined by the retriever; Group utility, i.e., the degree to which documents from each group contribute to improving answer accuracy, capturing retriever-generator interactions; and Group attribution, i.e., the extent to which the generator relies on documents from each group when producing responses. We examine group-level average accuracy and accuracy improvements disparities across four fairness categories using three datasets derived from the TREC 2022 Fair Ranking Track for two tasks: article generation and title generation. Our findings show that RAG systems suffer from the query group fairness problem and amplify disparities in terms of average accuracy across queries from different groups, compared to an LLM-only setting. Moreover, group utility, exposure, and attribution can exhibit strong positive or negative correlations with average accuracy or accuracy improvements of queries from that group, highlighting their important role in fair RAG. Our data and code are publicly available from Github.

Keywords: Retrieval-Augmented Generation, Large Language Models, fairness, query group fairness, accuracy disparities, retriever-generator interactions, TREC 2022 Fair Ranking Track
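
Of the three factors in the abstract, group exposure has the most direct definition: the proportion of retrieved documents that come from each group. A minimal sketch of that quantity, together with a simple between-group accuracy gap, might look like the following. The function names, the max-minus-min gap measure, and all input values are assumptions for illustration and are not taken from the paper's code or results.

```python
from collections import Counter

def group_exposure(retrieved_groups):
    """Share of retrieved documents that belong to each group."""
    counts = Counter(retrieved_groups)
    total = len(retrieved_groups)
    return {group: count / total for group, count in counts.items()}

def accuracy_gap(accuracy_by_group):
    """Largest difference in average accuracy between any two groups."""
    values = list(accuracy_by_group.values())
    return max(values) - min(values)

# Toy input: group labels of the top-4 retrieved documents for one query.
exposure = group_exposure(["A", "A", "B", "A"])

# Toy per-group average accuracies, invented purely to exercise the function.
gap_llm_only = accuracy_gap({"A": 0.5, "B": 0.25})
gap_rag = accuracy_gap({"A": 0.75, "B": 0.25})
```

Comparing the gap under an LLM-only setting with the gap under RAG mirrors the comparison described in the abstract; group utility and attribution would need access to the generator and are not sketched here.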

66. ❌ Where Do Your Citations Come From? Citation-Constellation: A Free, Open-Source, No-Code, and Auditable Tool for Citation Network Decomposition with Complementary BARON and HEROCON Scores

Authors: Mahbub Ul Alam Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24216v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	2.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	2.0/10	0.0

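Scoring rationale: The paper is mainly about a tool for analyzing academic citation networks and is unrelated to most large-model keywords. It has only a weak connection to "Large Language Models" (Phase 4 of the tool plans to use a local LLM for AI-agent-driven venue governance extraction, but that feature is still under development) and some connection to "AI for Science" (it is an AI tool applied to the scientific domain). No other keyword is addressed.

!!! tip deepseek-chat TL;DR

The paper presents Citation-Constellation, a tool that decomposes a researcher's citation network using two complementary bibliometric scores, BARON and HEROCON, to reveal the social pathways through which scholarly influence propagates rather than to assess research quality.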

Abstract

Standard citation metrics treat all citations as equal, obscuring the social and structural pathways through which scholarly influence propagates. I introduce Citation-Constellation, a freely available no-code tool for citation network analysis with two complementary bibliometric scores that decompose a researcher’s citation profile by network proximity between citing and cited authors. BARON (Boundary-Anchored Research Outreach Network score) is a strict binary metric counting only citations from outside the detected collaborative network. HEROCON (Holistic Equilibrated Research Outreach CONstellation score) applies graduated weights assigning partial credit to in-group citations based on relationship proximity. The gap between scores serves as a diagnostic of inner-circle dependence. An extended abstract with full details appears in the paper. The tool implements this through a phased architecture: (1) self-citation analysis, (2) co-authorship graph traversal, (3) temporal institutional affiliation matching via ROR, and (4) AI-agent-driven venue governance extraction using a local LLM. Phases 1-3 are fully operational; Phase 4 is under development. Key design choices include ORCID-validated author identity resolution, an UNKNOWN classification for citations with insufficient metadata, and comprehensive audit trails documenting every classification decision. A no-code web interface enables researchers to compute scores without programming, installation, or registration. I present these scores as structural diagnostics, not quality indicators. BARON and HEROCON describe where in the social graph citations originate. They should not be used for hiring, promotion, or funding decisions. HEROCON weights are experimental and require empirical calibration.

Keywords: citation network analysis, bibliometric scores, BARON, HEROCON, academic influence, no-code tool, social graph, audit trails
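
The difference between a strict binary score and a graduated one can be illustrated with a few lines of Python. This is a sketch under several assumptions, not the tool's implementation: the category labels, the weight values, the normalization to a fraction of classified citations, and the choice to drop UNKNOWN citations are all hypothetical (the abstract itself describes the HEROCON weights as experimental).

```python
# Hypothetical proximity weights, chosen only for illustration.
HEROCON_WEIGHTS = {"external": 1.0, "institution": 0.5, "coauthor": 0.25, "self": 0.0}

def classified(labels):
    """Keep only citations with a known relationship label (UNKNOWN is dropped here)."""
    return [label for label in labels if label in HEROCON_WEIGHTS]

def baron(labels):
    """Strict binary score: only citations from outside the detected network count."""
    known = classified(labels)
    return sum(1 for label in known if label == "external") / len(known)

def herocon(labels):
    """Graduated score: in-group citations earn partial credit by proximity."""
    known = classified(labels)
    return sum(HEROCON_WEIGHTS[label] for label in known) / len(known)

# Toy citation profile for one researcher.
citations = ["external", "external", "coauthor", "institution", "UNKNOWN"]
baron_score = baron(citations)
herocon_score = herocon(citations)
gap = herocon_score - baron_score
```

A larger gap means more of the profile's credit comes from in-group citations, which is how the abstract frames the difference between the two scores as a diagnostic of inner-circle dependence.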

67. ❌ Uncovering Memorization in Timeseries Imputation models: LBRM Membership Inference and its link to attribute Leakage

Authors: Faiz Taleb, Ivan Gazeau, Maryline Laurent Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24213v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

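Scoring rationale: The paper studies privacy attacks (membership inference and attribute inference) on time series imputation models. It belongs to deep learning security but does not touch large models, MoE, SLMs, scaling laws, pre-/post-training, alignment, RLHF, PEFT, RAG, long context, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for science. All keywords relate directly to large-model principles, training methods, inference optimization, alignment, or applications, whereas this paper focuses on privacy attacks against conventional deep learning models (attention-based and autoencoder architectures) and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes a two-stage privacy attack framework against time series imputation models, comprising an improved membership inference attack and the first attribute inference attack; experiments show the attacks can effectively identify training data and predict sensitive attributes.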

Abstract

Deep learning models for time series imputation are now essential in fields such as healthcare, the Internet of Things (IoT), and finance. However, their deployment raises critical privacy concerns. Beyond the well-known issue of unintended memorization, which has been extensively studied in generative models, we demonstrate that time series models are vulnerable to inference attacks in a black-box setting. In this work, we introduce a two-stage attack framework comprising: (1) a novel membership inference attack based on a reference model that improves detection accuracy, even for models robust to overfitting-based attacks, and (2) the first attribute inference attack that predicts sensitive characteristics of the training data for time series imputation models. We evaluate these attacks on attention-based and autoencoder architectures in two scenarios: models that are trained from scratch, and fine-tuned models where the adversary has access to the initial weights. Our experimental results demonstrate that the proposed membership attack retrieves a significant portion of the training data with a tpr@top25% score significantly higher than a naive attack baseline. We show that our membership attack also provides good insight into whether attribute inference will work (with a precision of 90% instead of 78% in the general case).

Keywords: time series imputation, membership inference attack, attribute inference attack, privacy concerns, deep learning models, attention-based architectures, autoencoder architectures, black-box setting
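
The general idea of calibrating a membership score against a reference model can be sketched as follows. This illustrates reference-model calibration in general, not the paper's LBRM attack: the score (reference error minus target error), the top-fraction flagging rule, and all error values are assumptions invented for the example.

```python
def calibrated_scores(target_errors, reference_errors):
    """Higher score: the target model fits the sample unusually well vs. the reference."""
    return [ref - tgt for tgt, ref in zip(target_errors, reference_errors)]

def flag_top_fraction(scores, fraction):
    """Return the indices of the highest-scoring samples (at least one)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

# Toy imputation errors per sample. Sample 2 is an "easy" non-member: both the
# target and the reference model reconstruct it well.
target_errors = [0.10, 0.50, 0.08, 0.45, 0.05, 0.40, 0.42, 0.48]
reference_errors = [0.40, 0.52, 0.09, 0.44, 0.45, 0.41, 0.40, 0.50]

# Naive attack: flag the samples with the lowest target error.
naive_flagged = flag_top_fraction([-e for e in target_errors], 0.25)

# Calibrated attack: flag the samples whose error drops most relative to the reference.
calibrated_flagged = flag_top_fraction(
    calibrated_scores(target_errors, reference_errors), 0.25)
```

In this toy setup the naive rule flags the easy non-member (sample 2), while the calibrated rule flags samples 0 and 4, whose low error is specific to the target model.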

68. ❌ Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement

Authors: Xin Zhang, Jianyang Xu, Hao Peng, Dongjing Wang, Jingyuan Zheng, Yu Li, Yuyu Yin, Hongbo Wang Venue/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24208v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

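Scoring rationale: The paper studies knowledge distillation and proposes TMKD, which uses a visual teacher and a text teacher (CLIP) for multi-view knowledge distillation; it belongs to computer vision and model compression. All scoring keywords target LLM-related techniques, including model architecture, training methods, inference optimization, alignment, and applications, whereas this paper does not involve LLMs at all. It only uses CLIP as the text teacher, and CLIP is a multimodal model rather than an LLM; the paper discusses no aspect of LLMs. Every keyword therefore has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes TMKD, a text-guided multi-view knowledge distillation method that enhances the visual teacher with multi-view inputs and uses a text teacher to generate semantic weights that guide adaptive feature fusion, improving knowledge distillation performance by up to 4.49% on five benchmarks.

Abstract translation (Chinese)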

知识蒸馏通过将大型教师模型的知识迁移至更小的学生模型以实现高效推理。现有方法主要关注蒸馏策略，却往往忽视了提升教师知识质量的重要性。本文提出文本引导的多视角知识蒸馏方法，该方法利用双模态教师——视觉教师与文本教师，以提供更丰富的监督信号。具体而言，我们通过融入视觉先验信息的多视角输入来增强视觉教师，同时文本教师通过先验感知提示生成语义权重，以指导自适应特征融合。此外，我们引入视觉-语言对比正则化来强化学生模型中的语义知识。在五个基准数据集上的大量实验表明，TMKD 能持续提升知识蒸馏性能，最高可达 4.49%，验证了我们所提出的双教师多视角增强策略的有效性。代码发布于 https://anonymous.4open.science/r/TMKD-main-44D1。

Abstract

Knowledge distillation transfers knowledge from large teacher models to smaller students for efficient inference. While existing methods primarily focus on distillation strategies, they often overlook the importance of enhancing teacher knowledge quality. In this paper, we propose Text-guided Multi-view Knowledge Distillation (TMKD), which leverages dual-modality teachers, a visual teacher and a text teacher (CLIP), to provide richer supervisory signals. Specifically, we enhance the visual teacher with multi-view inputs incorporating visual priors (edge and high-frequency features), while the text teacher generates semantic weights through prior-aware prompts to guide adaptive feature fusion. Additionally, we introduce vision-language contrastive regularization to strengthen semantic knowledge in the student model. Extensive experiments on five benchmarks demonstrate that TMKD consistently improves knowledge distillation performance by up to 4.49%, validating the effectiveness of our dual-teacher multi-view enhancement strategy. Code is available at https://anonymous.4open.science/r/TMKD-main-44D1.

Keywords: Knowledge Distillation, Multi-view Learning, Visual Prior Enhancement, Dual-modality Teachers, CLIP, Vision-Language Contrastive Regularization, Model Compression, Efficient Inference
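The abstract above says the text teacher produces semantic weights that guide adaptive feature fusion across views. A minimal sketch of what such a fusion could look like follows, assuming the weights come from a softmax over one relevance score per view and that fusion is a weighted sum of per-view feature vectors. The function names, the softmax choice, and the numbers are hypothetical; the source does not specify how TMKD computes or applies its weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_features, view_scores):
    """Weighted sum of per-view feature vectors.

    view_features: one feature vector per view (e.g. original, edge, high-frequency)
    view_scores:   one relevance score per view (assumed to come from a text teacher)
    """
    weights = softmax(view_scores)
    dim = len(view_features[0])
    fused = [0.0] * dim
    for weight, features in zip(weights, view_features):
        for j in range(dim):
            fused[j] += weight * features[j]
    return weights, fused


# Two hypothetical views with equal scores get equal weight.
weights, fused = fuse_views([[1.0, 3.0], [3.0, 5.0]], [0.0, 0.0])
print(weights, fused)
```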

69. ❌ Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search

Authors: Yulin Shen, Xudong Pan, Geng Hong, Min Yang Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24203v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

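Scoring rationale: The paper studies security vulnerabilities of LLM agents under the Model Context Protocol (MCP). Its core is an attack method involving LLM agents and tool use, so it is highly relevant to those two keywords (10 points each). It uses an LLM as the attacker model, so it is also relevant to the LLM keyword (10 points). Other keywords, such as MoE, quantization, inference acceleration, and AI for science, are not mentioned in the abstract and are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes TIP, a black-box attack on the Model Context Protocol (MCP) that uses tree-structured search to generate stealthy injection payloads capable of taking control of MCP-augmented LLM agents, with high attack success rates both without defenses and under several defenses.

Abstract translation (Chinese)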

模型上下文协议（Model Context Protocol, MCP）的最新进展使得大语言模型（LLMs）能够以前所未有的便捷性调用外部工具，从而催生了一类新型的强大工具增强智能体。然而，这种能力也引入了一个尚未被充分探索的攻击面，即对工具响应的恶意操控。现有针对MCP的间接提示注入技术存在部署成本高、语义连贯性弱或对白盒条件依赖性强等问题，且往往容易被近期提出的防御机制所检测。本文提出了一种新颖的黑盒攻击方法——树结构载荷注入（Tree-structured Injection for Payloads, TIP），该方法能够生成自然的载荷，即使在防御条件下也能可靠地夺取MCP智能体的控制权。在技术上，我们将载荷生成建模为一个树结构搜索问题，并通过一个在我们提出的由粗到精优化框架下运行的攻击者LLM来引导搜索。为稳定学习过程并避免局部最优，我们引入了一种路径感知反馈机制，仅将高质量的历史轨迹呈现给攻击者模型。该框架还通过显式地将搜索条件与可观测的防御信号相结合，并动态重新分配探索预算，进一步强化了对防御性变换的抵抗能力。在四种主流LLM上的大量实验表明，TIP在无防御场景下实现了超过95%的攻击成功率，且所需查询量比先前的自适应攻击少一个数量级。针对四种代表性防御方法，TIP仍能保持超过50%的有效性，并显著优于现有最先进的攻击技术。通过在真实世界的MCP系统中实施该攻击，我们的研究结果揭示了MCP部署中一个隐蔽但实际存在的威胁向量。本文亦讨论了应对这一关键安全漏洞的潜在缓解方案。

Abstract

Recent advances in the Model Context Protocol (MCP) have enabled large language models (LLMs) to invoke external tools with unprecedented ease. This creates a new class of powerful, tool-augmented agents. Unfortunately, this capability also introduces an under-explored attack surface, specifically the malicious manipulation of tool responses. Existing techniques for indirect prompt injection that target MCP suffer from high deployment costs, weak semantic coherence, or heavy white-box requirements. Furthermore, they are often easily detected by recently proposed defenses. In this paper, we propose Tree structured Injection for Payloads (TIP), a novel black-box attack which generates natural payloads to reliably seize control of MCP-enabled agents even under defense. Technically, we cast payload generation as a tree-structured search problem and guide the search with an attacker LLM operating under our proposed coarse-to-fine optimization framework. To stabilize learning and avoid local optima, we introduce a path-aware feedback mechanism that surfaces only high-quality historical trajectories to the attacker model. The framework is further hardened against defensive transformations by explicitly conditioning the search on observable defense signals and dynamically reallocating the exploration budget. Extensive experiments on four mainstream LLMs show that TIP attains over 95% attack success in undefended settings while requiring an order of magnitude fewer queries than prior adaptive attacks. Against four representative defense approaches, TIP preserves more than 50% effectiveness and significantly outperforms the state-of-the-art attacks. By implementing the attack on real-world MCP systems, our results expose an invisible but practical threat vector in MCP deployments. We also discuss potential mitigation approaches to address this critical security gap.

Keywords: Model Context Protocol, LLM agents, tool invocation, black-box attack, payload generation, tree-structured search, security vulnerability, indirect prompt injection

70. ❌ A Deep Dive into Scaling RL for Code Generation with Synthetic Data and Curricula

Authors: Cansu Sancaktar, David Zhang, Gabriel Synnaeve, Taco Cohen Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24202v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

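Scoring rationale: The paper's core is reinforcement learning training of large language models (Llama3.1-8B, Qwen3-8B, Qwen2.5-32B), so it is highly relevant to 'Large Language Models' (10 points). Its focus is scaling RL training through synthetic data and curriculum learning, which directly involves 'RLHF' (10 points), since RLHF is one form of RL training. The paper emphasizes data diversity and structure over sheer data volume, which relates to 'Scaling Laws AND Data Quality' (8 points). The teacher model iteratively generates problems from in-context summaries of student performance, which uses 'In-context Learning' (8 points). The paper mentions supervised fine-tuning (SFT) as a comparison baseline, so 'Post-training OR Supervised Fine-tuning OR SFT' gets 5 points. Other keywords such as MoE, SLMs, PEFT, and RAG are not covered in the paper and get 0 points.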
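To make the arithmetic behind the score table easier to follow, here is a small sketch. The report's configuration states a per-keyword maximum of 15 and a pass threshold of 5.0 + 0.8 × total weight (26.6 for a total weight of 27.0). How weight and relevance are combined into each row's score is not stated, so the product used in `keyword_score` is an assumption, as are all function names. The rows reproduce the non-zero relevance values from the table above together with the weights exactly as the table shows them.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword maximum stated in the report configuration


def pass_threshold(total_weight, base=5.0, factor=0.8):
    """Pass threshold formula from the report configuration."""
    return base + factor * total_weight


def keyword_score(weight, relevance, max_relevance=10.0):
    """ASSUMED combination rule: weight x normalized relevance x per-keyword max."""
    return weight * (relevance / max_relevance) * MAX_PER_KEYWORD


def total_score(rows):
    """rows: iterable of (weight, relevance) pairs, one per keyword."""
    return sum(keyword_score(weight, relevance) for weight, relevance in rows)


# (weight, relevance) for the five keywords with non-zero relevance in the table.
rows_as_reported = [(0.0, 10.0), (0.0, 8.0), (0.0, 5.0), (0.0, 10.0), (0.0, 8.0)]

threshold = pass_threshold(27.0)
reported_total = total_score(rows_as_reported)
print(f"threshold={threshold:.1f}, total={reported_total:.1f}")
```

Under this assumed rule, a weight of 0.0 in every row forces the total to 0.0 regardless of the relevance values, which matches the 0.0 / 26.6 result shown for this entry.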

!!! tip deepseek-chat TL;DR

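The paper studies how to scale reinforcement learning training of large language models through multi-turn synthetic data generation and curriculum learning in order to improve code generation, and finds that the approach consistently improves in-domain code generation and, in most cases, out-of-domain math performance.

Abstract translation (Chinese)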

强化学习（Reinforcement Learning, RL）已成为超越监督微调、提升大语言模型性能的强大范式，然而如何在规模化应用中持续保持性能提升仍是一个开放挑战——因为数据多样性与结构（而非单纯的数据量）正成为关键制约因素。为此，我们提出一种可扩展的多轮次合成数据生成流程：教师模型根据上下文中的学生表现摘要迭代优化问题，无需对教师模型进行微调即可生成结构化的难度递进序列。与单轮次生成相比，这种多轮次方法显著提高了有效合成问题的产出率，并自然地生成“垫脚石”——即同一核心任务的较易与较难变体——从而支持基于课程的学习训练。我们以Llama3.1-8B Instruct和Qwen3-8B Base模型系列为主要对象，并在Qwen2.5-32B上进行扩展实验，系统研究了RL训练过程中任务难度、课程安排与环境多样性之间的相互作用。结果表明，合成数据增强能持续提升领域内代码任务性能，并在多数情况下提升领域外数学任务表现；我们进一步通过实证分析揭示了课程设计与数据多样性如何共同影响RL训练的动态过程。

Abstract

Reinforcement learning (RL) has emerged as a powerful paradigm for improving large language models beyond supervised fine-tuning, yet sustaining performance gains at scale remains an open challenge, as data diversity and structure, rather than volume alone, become the limiting factor. We address this by introducing a scalable multi-turn synthetic data generation pipeline in which a teacher model iteratively refines problems based on in-context student performance summaries, producing structured difficulty progressions without any teacher fine-tuning. Compared to single-turn generation, this multi-turn approach substantially improves the yield of valid synthetic problems and naturally produces stepping stones, i.e. easier and harder variants of the same core task, that support curriculum-based training. We systematically study how task difficulty, curriculum scheduling, and environment diversity interact during RL training across the Llama3.1-8B Instruct and Qwen3-8B Base model families, with additional scaling experiments on Qwen2.5-32B. Our results show that synthetic augmentation consistently improves in-domain code and in most cases out-of-domain math performance, and we provide empirical insights into how curriculum design and data diversity jointly shape RL training dynamics.

Keywords: Reinforcement Learning, Large Language Models, Synthetic Data, Curriculum Learning, Code Generation, Scaling, Multi-turn Generation, Data Diversity

71. ❌ MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare

Authors: Shubham Kumar Nigam, Suparnojit Sarkar, Piyush Patel Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24132v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

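Scoring rationale: The paper's core contributions are: 1) using large language models (LLMs) to generate the multilingual medical dialogue dataset MedAidDialog (highly relevant); 2) developing the medical dialogue model MedAidLM by applying parameter-efficient fine-tuning (PEFT) to quantized small language models (SLMs) (highly relevant); 3) deploying the model with computational efficiency in mind, which involves quantization (highly relevant); 4) applying AI to biomedicine and healthcare (relevant to AI for Science). Other keywords such as MoE, Scaling Laws, and RLHF are not mentioned or involved in the abstract, so they score 0.

!!! tip deepseek-chat TL;DR

The study builds MedAidDialog, a multilingual multi-turn medical dialogue dataset, and develops MedAidLM, a deployable medical dialogue model trained with parameter-efficient fine-tuning on quantized small language models; experiments show it can effectively perform symptom elicitation and produce diagnostic recommendations.

Abstract translation (Chinese)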

会话式人工智能具备在初步医疗咨询中辅助用户的潜力，尤其在医疗专业人员资源有限的环境中。然而，现有许多医疗对话系统仍采用单轮问答模式或依赖基于模板的数据集，这限制了对话的真实性与多语言适用性。本研究提出了MedAidDialog——一个旨在模拟真实医患咨询的多语言多轮医疗对话数据集。该数据集基于MDDial语料库，通过使用大语言模型生成合成咨询对话，并将其进一步扩展为涵盖七种语言的平行多语料库，包括英语、印地语、泰卢固语、泰米尔语、孟加拉语、马拉地语和阿拉伯语。
基于此数据集，我们开发了MedAidLM模型，这是一个通过对量化后的小语言模型进行参数高效微调而训练的会话医疗模型，使其无需高端计算基础设施即可部署。我们的框架还整合了可选的患者预置背景信息（如年龄、性别、过敏史），以实现咨询过程的个性化。实验结果表明，所提出的系统能够通过多轮对话有效进行症状采集，并生成诊断建议。我们进一步开展了医学专家评估，以衡量生成对话的合理性与连贯性。

Abstract

Conversational artificial intelligence has the potential to assist users in preliminary medical consultations, particularly in settings where access to healthcare professionals is limited. However, many existing medical dialogue systems operate in a single-turn question–answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. In this work, we introduce MedAidDialog, a multilingual multi-turn medical dialogue dataset designed to simulate realistic physician–patient consultations. The dataset extends the MDDial corpus by generating synthetic consultations using large language models and further expands them into a parallel multilingual corpus covering seven languages: English, Hindi, Telugu, Tamil, Bengali, Marathi, and Arabic. Building on this dataset, we develop MedAidLM, a conversational medical model trained using parameter-efficient fine-tuning on quantized small language models, enabling deployment without high-end computational infrastructure. Our framework additionally incorporates optional patient pre-context information (e.g., age, gender, allergies) to personalize the consultation process. Experimental results demonstrate that the proposed system can effectively perform symptom elicitation through multi-turn dialogue and generate diagnostic recommendations. We further conduct medical expert evaluation to assess the plausibility and coherence of the generated consultations.

Keywords: medical dialogue dataset, multilingual, large language models, small language models, parameter-efficient fine-tuning, quantization, AI for healthcare, synthetic consultations

Authors: Iris Dumeur, Jérémy Anger, Gabriele Facciolo Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24109v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	5.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

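Scoring rationale: The paper focuses on satellite image time series analysis, mainly studying dual-form attention mechanisms (including linear attention) for remote sensing, which falls under AI for Science. It is unrelated to the vast majority of large-model keywords. It has some relevance to 'KV Cache Compression OR Linear Attention OR FlashAttention' (the paper studies linear attention) and is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (an AI application in earth science and remote sensing).

!!! tip deepseek-chat TL;DR

The paper studies dual-form attention mechanisms for efficient analysis of multi-modal satellite image time series, compares linear attention and retention mechanisms, and shows on practical tasks such as solar panel monitoring that they perform comparably to standard Transformers while supporting efficient incremental inference.

Abstract translation (Chinese)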

多模态卫星图像时间序列分析在实时地表监测应用中面临显著的计算挑战。尽管Transformer架构在捕捉时序依赖性和融合多模态数据方面表现优异，但其二次计算复杂度以及对每个新增数据点需重新处理整个序列的要求，限制了其在常态化大范围监测任务中的部署。本文研究了多种对偶形式注意力机制，以实现高效的多模态卫星图像时间序列分析，这些机制在支持增量处理所需的循环推理的同时，能够进行并行训练。我们在多模态光谱-时序编码器框架内比较了线性注意力与保留机制。针对卫星图像时间序列特有的时序不规则性和不对齐问题，我们开发了对偶形式机制的时序自适应方法，该方法依据实际采集日期而非序列索引来计算标记距离。我们使用哨兵一号和哨兵二号数据，通过两项任务评估所提方法：作为代理任务的多模态卫星图像时间序列预测，以及真实场景下的太阳能电池板建设监测。实验结果表明，对偶形式机制在实现高效循环推理的同时，达到了与标准Transformer相当的性能。在多模态框架下，两项任务的表现均持续优于单模态方法，证明了对偶机制在传感器融合中的有效性。本研究成果为需要在大范围地理区域进行定期更新的业务化地表监测系统开辟了新的可能性。

Abstract

Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges for live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess entire sequences for each new acquisition limit their deployment for regular, large-area monitoring. This paper studies various dual-form attention mechanisms for efficient multi-modal SITS analysis that enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address SITS-specific challenges of temporal irregularity and misalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances based on actual acquisition dates rather than sequence indices. Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multi-modal framework consistently outperforms mono-modal approaches across both tasks, demonstrating the effectiveness of dual mechanisms for sensor fusion. The results presented in this work open new opportunities for operational land monitoring systems requiring regular updates over large geographic areas.

Keywords: dual-form attention, multi-modal satellite image time series, linear attention, retention mechanisms, temporal irregularity, sensor fusion, recurrent inference, land monitoring
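The "dual form" in the abstract above means the same attention output can be computed either over all token pairs (convenient for training) or with a running state updated once per new acquisition (convenient for incremental inference). The sketch below illustrates this for an unnormalized, retention-style causal attention whose decay depends on the gap in acquisition days rather than on sequence position. The decay rule `gamma ** day_gap`, the absence of normalization and feature maps, and all names and numbers are assumptions for illustration, not the paper's formulation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pairwise_form(q, k, v, days, gamma=0.9):
    """Quadratic form: each output sums over all earlier tokens."""
    outputs = []
    for t in range(len(q)):
        acc = [0.0] * len(v[0])
        for s in range(t + 1):
            # Weight decays with the real time gap between acquisitions.
            weight = dot(q[t], k[s]) * gamma ** (days[t] - days[s])
            for j in range(len(acc)):
                acc[j] += weight * v[s][j]
        outputs.append(acc)
    return outputs


def recurrent_form(q, k, v, days, gamma=0.9):
    """Recurrent form: one state update per new acquisition."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    prev_day = days[0]
    for t in range(len(q)):
        decay = gamma ** (days[t] - prev_day)
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay * state[i][j] + k[t][i] * v[t][j]
        outputs.append([sum(q[t][i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
        prev_day = days[t]
    return outputs


# Irregularly spaced acquisitions (hypothetical day offsets and features).
days = [0, 6, 12, 30]
q = [[1.0, 0.5], [0.2, -0.3], [0.7, 0.1], [-0.4, 0.9]]
k = [[0.3, 0.8], [-0.5, 0.4], [0.6, 0.6], [0.1, -0.2]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]

out_pairwise = pairwise_form(q, k, v, days)
out_recurrent = recurrent_form(q, k, v, days)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(out_pairwise, out_recurrent)
               for a, b in zip(row_a, row_b))
print(f"max difference between the two forms: {max_diff:.2e}")
```

The two forms agree up to floating-point error. The recurrent one touches only a fixed-size state per new acquisition, so a new image does not require reprocessing the whole sequence.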

73. ❌ The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation

Authors: Mingyi Liu Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24124v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

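Scoring rationale: The paper's core is the response homogenization phenomenon in RLHF-aligned language models and its impact on uncertainty estimation. It is highly relevant to 'Large Language Models', 'Instruction Tuning/Alignment', and 'RLHF/DPO' (10 points each); it concerns post-alignment model behavior and counts as innovation in the technical principles of large models. It is relevant to 'SFT' (5 points) because the paper runs an SFT-stage ablation. It has some relevance to 'Factuality' (5 points) because it uses TruthfulQA to evaluate truthfulness-related uncertainty. Other keywords such as MoE, quantization, and inference acceleration are not covered and get 0 points.

!!! tip deepseek-chat TL;DR

The paper finds that RLHF-aligned language models suffer from response homogenization, which causes sampling-based uncertainty estimation methods to fail; its experiments identify DPO as the main cause, and it proposes a cascade method that improves selective prediction.

Abstract translation (Chinese)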

经过RLHF对齐的语言模型表现出回答同质化现象：在TruthfulQA数据集（n=790）上，40-79%的问题在10个独立同分布样本中产生单一语义聚类。在受影响的问题上，基于采样的不确定性方法完全失去判别能力（AUROC=0.500），而词元熵（token entropy）仍保留信号（0.603）。这种对齐代价具有任务依赖性：在GSM8K数据集（n=500）上，词元熵达到0.724（科恩d值=0.81）。
基础模型与指令模型的消融实验证实了对齐的因果作用：基础模型的单一聚类率为1.0%，而指令模型为28.5%（p < 10^{-6}）。训练阶段消融（基础模型0.0% → 监督微调[SFT] 1.5% → 直接偏好优化[DPO] 4.0% 单一聚类率）表明问题根源在于DPO而非SFT。在四个模型系列中的跨家族复现显示，对齐代价的严重程度因模型家族和规模而异。我们在22项实验、5个基准测试、4个模型家族和3种模型规模（3B-14B）中进行了验证，并采用基于杰卡德系数、嵌入和自然语言推理（NLI）的基线方法，在三种DeBERTa规模下均得到约0.51的AUROC。使用两个独立嵌入家族的跨编码器验证排除了耦合偏差。在WebQuestions数据集上的跨数据集验证（58.0%单一聚类率）证实了该现象在TruthfulQA之外的泛化性。核心发现——回答同质化——具有实现无关性和无标签特性。
基于此诊断，我们探索了在正交不确定性信号上构建的最廉价优先级联策略（UCBD）。选择性预测在50%覆盖率下将GSM8K准确率从84.4%提升至93.2%；弱相关边界（|r| ≤ 0.12）可实现57%的成本节约。

Abstract

RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (0.603). This alignment tax is task-dependent: on GSM8K (n=500), token entropy achieves 0.724 (Cohen’s d=0.81). A base-vs-instruct ablation confirms the causal role of alignment: the base model shows 1.0% single-cluster rate vs. 28.5% for the instruct model (p < 10^{-6}). A training stage ablation (Base 0.0% -> SFT 1.5% -> DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross-family replication on four model families reveals alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B-14B), with Jaccard, embedding, and NLI-based baselines at three DeBERTa scales (all ~0.51 AUROC). Cross-embedder validation with two independent embedding families rules out coupling bias. Cross-dataset validation on WebQuestions (58.0% SCR) confirms generalization beyond TruthfulQA. The central finding – response homogenization – is implementation-independent and label-free. Motivated by this diagnosis, we explore a cheapest-first cascade (UCBD) over orthogonal uncertainty signals. Selective prediction raises GSM8K accuracy from 84.4% to 93.2% at 50% coverage; weakly dependent boundaries (|r| <= 0.12) enable 57% cost savings.

Keywords: RLHF, alignment, response homogenization, uncertainty estimation, DPO, TruthfulQA, selective prediction, language models
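Two quantities recur in the abstract above: the single-cluster rate (the share of questions whose sampled answers all fall into one semantic cluster) and selective-prediction accuracy at a given coverage. The sketch below shows one straightforward way to compute both, assuming cluster labels and per-question uncertainty scores are already available. The function names and toy data are hypothetical, and the paper's clustering and cascade (UCBD) are not reproduced here.

```python
def single_cluster_rate(cluster_labels_per_question):
    """Fraction of questions whose sampled answers share one cluster label."""
    single = sum(1 for labels in cluster_labels_per_question
                 if len(set(labels)) == 1)
    return single / len(cluster_labels_per_question)


def selective_accuracy(uncertainties, correct, coverage=0.5):
    """Accuracy on the `coverage` fraction of questions with lowest uncertainty."""
    k = max(1, int(len(uncertainties) * coverage))
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    kept = order[:k]
    return sum(1 for i in kept if correct[i]) / k


# Toy data: 4 questions, 10 sampled answers each, labelled by semantic cluster.
clusters = [
    [0] * 10,             # homogenized: one cluster
    [0] * 7 + [1] * 3,    # two clusters
    [2] * 10,             # homogenized: one cluster
    [0, 1, 2, 3, 0, 1, 2, 3, 0, 1],
]
scr = single_cluster_rate(clusters)

# Hypothetical per-question uncertainty (e.g. token entropy) and correctness.
uncertainties = [0.1, 0.9, 0.2, 0.8]
correct = [True, False, True, True]
acc_full = selective_accuracy(uncertainties, correct, coverage=1.0)
acc_half = selective_accuracy(uncertainties, correct, coverage=0.5)
print(f"SCR={scr:.2f}, accuracy@100%={acc_full:.2f}, accuracy@50%={acc_half:.2f}")
```

When every sampled answer lands in the same cluster, a sampling-based uncertainty score is identical across those questions and cannot rank them, which is why the abstract turns to a signal such as token entropy for the selective step.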

74. ❌ KCLNet: Electrically Equivalence-Oriented Graph Representation Learning for Analog Circuits

Authors: Peng Xu, Yapeng Li, Tinghuan Chen, Tsung-Yi Ho, Bei Yu Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24101v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

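Scoring rationale: KCLNet focuses on graph representation learning for analog circuits and proposes an asynchronous graph neural network framework based on Kirchhoff's Current Law. The scoring keywords all concern large language models, deep learning technical principles, or AI applications in science, whereas the paper only covers analog circuit representation learning in electronic design automation and does not involve large models, deep learning technical principles, or AI applications in scientific fields such as biomedicine. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because electronic design automation can be viewed as an application of AI in engineering science; the paper does not explicitly mention bioinformatics or cheminformatics, so it gets 5 points (some relevance). The other keywords are unrelated to the paper's topic and get 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes KCLNet, an analog circuit representation learning framework that uses a graph neural network approach based on Kirchhoff's Current Law to address the representation learning challenges posed by the continuous electrical characteristics of analog circuits, and reports significant performance gains on downstream tasks such as classification, detection, and prediction.

Abstract translation (Chinese)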

数字电路表示学习在电子设计自动化领域已取得显著进展，有效支撑了可测试性分析和逻辑推理等关键任务。然而，与数字电路的离散状态相比，模拟电路因其连续的电气特性，其表示学习仍面临挑战。本文提出了一种面向直流（DC）电气等效的模拟电路表示学习框架，命名为 KCLNet。该框架包含一个具有电气仿真消息传递机制的异步图神经网络结构，以及一种受基尔霍夫电流定律（Kirchhoff’s Current Law, KCL）启发的表示学习方法。该方法通过强制每个深度上流出与流入电流嵌入之和相等，保持了电路嵌入空间的有序性，从而显著增强了电路嵌入的泛化能力。KCLNet 为在保持电气约束的条件下进行模拟电路表示学习提供了一种新颖而有效的解决方案。实验结果表明，我们的方法在多种下游任务中取得了显著性能，例如模拟电路分类、子电路检测和电路编辑距离预测。

Abstract

Digital circuit representation learning has made remarkable progress in the electronic design automation domain, effectively supporting critical tasks such as testability analysis and logic reasoning. However, representation learning for analog circuits remains challenging due to their continuous electrical characteristics compared to the discrete states of digital circuits. This paper presents a direct current (DC) electrically equivalent-oriented analog representation learning framework, named KCLNet. It comprises an asynchronous graph neural network structure with electrically-simulated message passing and a representation learning method inspired by Kirchhoff’s Current Law (KCL). This method maintains the orderliness of the circuit embedding space by enforcing the equality of the sum of outgoing and incoming current embeddings at each depth, which significantly enhances the generalization ability of circuit embeddings. KCLNet offers a novel and effective solution for analog circuit representation learning with electrical constraints preserved. Experimental results demonstrate that our method achieves significant performance in a variety of downstream tasks, e.g., analog circuit classification, subcircuit detection, and circuit edit distance prediction.

Keywords: analog circuits, graph representation learning, Kirchhoff’s Current Law, graph neural network, electronic design automation, circuit embeddings, electrical equivalence, downstream tasks
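The abstract above describes a constraint that the sum of outgoing current embeddings equals the sum of incoming ones at each depth. One way such a constraint could be turned into a training penalty is sketched below, assuming embeddings are plain vectors grouped by depth and the mismatch is measured with a squared Euclidean distance. The grouping, the distance, and the function names are assumptions; the source does not say how KCLNet enforces the equality.

```python
def vector_sum(vectors, dim):
    """Element-wise sum of a list of vectors of length `dim`."""
    total = [0.0] * dim
    for vec in vectors:
        for j, value in enumerate(vec):
            total[j] += value
    return total


def kcl_penalty(incoming_by_depth, outgoing_by_depth, dim):
    """Sum over depths of ||sum(incoming) - sum(outgoing)||^2 (assumed form)."""
    penalty = 0.0
    for incoming, outgoing in zip(incoming_by_depth, outgoing_by_depth):
        sum_in = vector_sum(incoming, dim)
        sum_out = vector_sum(outgoing, dim)
        penalty += sum((a - b) ** 2 for a, b in zip(sum_in, sum_out))
    return penalty


# One depth, 2-dimensional current embeddings (hypothetical values).
incoming = [[[1.0, 2.0], [0.5, -1.0]]]   # sums to [1.5, 1.0]
balanced_out = [[[1.5, 1.0]]]            # matches the incoming sum
unbalanced_out = [[[1.0, 1.0]]]          # off by 0.5 in the first component

balanced = kcl_penalty(incoming, balanced_out, dim=2)
unbalanced = kcl_penalty(incoming, unbalanced_out, dim=2)
print(balanced, unbalanced)
```

A penalty of this shape is zero exactly when the embedding sums balance at every depth, mirroring how Kirchhoff's Current Law requires currents into a node to equal currents out of it.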

75. ❌ Bridging the Evaluation Gap: Standardized Benchmarks for Multi-Objective Search

Authors: Hadar Peer, Carlos Hernandez, Sven Koenig, Ariel Felner, Oren Salzman Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24084v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

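Scoring rationale: The paper focuses on standardized benchmarks for multi-objective search (MOS), touching on graph theory, path planning, and algorithm evaluation, but it does not involve large language models, deep learning, core AI techniques, or AI applications in science. Its content has no connection to any of the scoring keywords, all of which center on large-model techniques, training methods, inference optimization, and AI applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

To address the lack of standardized evaluation benchmarks in multi-objective search, the paper proposes the first comprehensive standardized benchmark suite, covering road networks, synthetic graphs, grid environments, and robotic motion planning, to support robust, reproducible, and structurally comprehensive future evaluations.

Abstract translation

Empirical evaluation in multi-objective search has long been fragmented: it relies on heterogeneous problem instances with mutually incompatible objective definitions, which makes cross-study comparison difficult. The lack of standardization is compounded by the recognition that DIMACS road networks, the field's traditional default benchmark, have highly correlated objectives and cannot cover diverse Pareto-front structures. In response, we present the first comprehensive standardized benchmark suite for exact and approximate multi-objective search. The suite covers four structurally distinct domains: real-world road networks, structured synthetic graphs, game-based grid environments, and high-dimensional robotic motion-planning roadmaps. By providing fixed graph instances, standardized start-goal queries, and exact and approximate reference Pareto-optimal solution sets, the suite captures the full spectrum of objective interactions, from strongly correlated to strictly independent. Ultimately, the benchmark establishes a common foundation for future multi-objective search evaluation, ensuring robustness, reproducibility, and structural comprehensiveness.

Abstract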

Empirical evaluation in multi-objective search (MOS) has historically suffered from fragmentation, relying on heterogeneous problem instances with incompatible objective definitions that make cross-study comparisons difficult. This standardization gap is further exacerbated by the realization that DIMACS road networks, a historical default benchmark for the field, exhibit highly correlated objectives that fail to capture diverse Pareto-front structures. To address this, we introduce the first comprehensive, standardized benchmark suite for exact and approximate MOS. Our suite spans four structurally diverse domains: real-world road networks, structured synthetic graphs, game-based grid environments, and high-dimensional robotic motion-planning roadmaps. By providing fixed graph instances, standardized start-goal queries, and both exact and approximate reference Pareto-optimal solution sets, this suite captures a full spectrum of objective interactions: from strongly correlated to strictly independent. Ultimately, this benchmark provides a common foundation to ensure future MOS evaluations are robust, reproducible, and structurally comprehensive.

Keywords: multi-objective search, benchmark suite, standardized evaluation, Pareto-optimal solutions, road networks, robotic motion planning, graph instances, objective interactions
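
The abstract's point that highly correlated objectives fail to produce diverse Pareto fronts can be seen on a toy example. The sketch below is illustrative only: `dominates` and `pareto_front` are hypothetical helper names, the cost vectors are made-up values rather than benchmark data, and minimization of every objective is assumed.

```python
from typing import List, Sequence, Tuple

Cost = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(costs: List[Cost]) -> List[Cost]:
    """Return the non-dominated cost vectors, sorted and de-duplicated."""
    unique = sorted(set(costs))
    return [c for c in unique if not any(dominates(o, c) for o in unique)]


# Made-up path costs (objective 1, objective 2) between one start-goal pair.
correlated = [(1.0, 1.1), (2.0, 2.1), (3.0, 2.9)]               # objectives move together
independent = [(1.0, 9.0), (4.0, 5.0), (9.0, 1.0), (5.0, 6.0)]  # objectives trade off

print(pareto_front(correlated))   # [(1.0, 1.1)]
print(pareto_front(independent))  # [(1.0, 9.0), (4.0, 5.0), (9.0, 1.0)]
```

With correlated objectives a single path dominates the rest, so the front collapses to one point; with conflicting objectives several trade-off solutions survive, which is the kind of structure a benchmark needs in order to stress a multi-objective search algorithm.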

76. ❌ Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning

Authors: Aditya Narendra, Mukhammadrizo Maribjonov, Dmitry Makarov, Dmitry Yudin, Aleksandr Panov Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24083v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

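Scoring rationale: The paper focuses on a multi-task reinforcement learning framework for robotic manipulation that combines knowledge graphs, 3D scene graphs, and graph neural networks, but it involves no large language models (LLMs), no innovation in core deep learning techniques, and no concrete AI for Science application. The paper is unrelated to all of the keywords, which center on large-model techniques, deep learning innovation, and AI applications in science, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper proposes KG-M3PO, a knowledge-guided multi-task reinforcement learning framework for robotic manipulation in partially observable environments; by combining perception, knowledge, and policy, it achieves higher success rates, better sample efficiency, and stronger generalization.

Abstract translation

This paper proposes Knowledge Graph based Massively Multi-task Model-based Policy Optimization (KG-M3PO), a framework for multi-task robotic manipulation in partially observable environments that unifies perception, knowledge, and policy. The method augments egocentric visual perception with an online-built 3D scene graph that maps open-vocabulary detections into a metric, relational representation. A dynamic-relation mechanism updates spatial, containment, and affordance edges at every step, and a graph neural network encoder is trained end-to-end through the reinforcement learning objective so that relational features are shaped directly by control performance. Multiple observation modalities (visual, proprioceptive, linguistic, and graph-structured) are encoded into a shared latent space, in which the reinforcement learning agent drives the control loop. The policy conditions on visual and proprioceptive inputs together with lightweight graph queries, yielding a compact, semantically informed decision state.
Experiments on a suite of manipulation tasks with occlusions, distractors, and layout changes show consistent gains over strong baselines: the knowledge-conditioned agent achieves higher success rates, better sample efficiency, and stronger generalization to novel objects and unseen scene configurations. These results support the view that structured, continuously maintained world knowledge is a powerful inductive bias for scalable, generalizable manipulation: when the knowledge module participates in the reinforcement learning computation graph, relational representations align with the control objective, enabling robust long-horizon behavior under partial observability.

Abstract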

This paper introduces Knowledge Graph based Massively Multi-task Model-based Policy Optimization (KG-M3PO), a framework for multi-task robotic manipulation in partially observable settings that unifies Perception, Knowledge, and Policy. The method augments egocentric vision with an online 3D scene graph that grounds open-vocabulary detections into a metric, relational representation. A dynamic-relation mechanism updates spatial, containment, and affordance edges at every step, and a graph neural encoder is trained end-to-end through the RL objective so that relational features are shaped directly by control performance. Multiple observation modalities (visual, proprioceptive, linguistic, and graph-based) are encoded into a shared latent space, upon which the RL agent operates to drive the control loop. The policy conditions on lightweight graph queries alongside visual and proprioceptive inputs, yielding a compact, semantically informed state for decision making. Experiments on a suite of manipulation tasks with occlusions, distractors, and layout shifts demonstrate consistent gains over strong baselines: the knowledge-conditioned agent achieves higher success rates, improved sample efficiency, and stronger generalization to novel objects and unseen scene configurations. These results support the premise that structured, continuously maintained world knowledge is a powerful inductive bias for scalable, generalizable manipulation: when the knowledge module participates in the RL computation graph, relational representations align with control, enabling robust long-horizon behavior under partial observability.

Keywords: multi-task reinforcement learning, robotic manipulation, knowledge graph, 3D scene graph, graph neural network, partial observability, generalization, control performance

77. ❌ When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm

Authors: Ye Leng, Junjie Chu, Mingjie Li, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24079v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

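Scoring rationale: The paper studies the safety risks of multimodal large language models (MLLMs) in image generation. It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10) because MLLMs extend LLMs. It is moderately relevant to "Hallucination Mitigation OR Factuality OR Truthfulness" (5) because it covers unsafe content generation and fake image synthesis, which fall under factuality and authenticity risks. The other keywords, such as MoE, SLMs, training techniques, inference optimization, and agent systems, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that, compared with diffusion models, emerging multimodal large language models (MLLMs) produce more unsafe content and harder-to-detect fake images in image generation because of their stronger semantic understanding, creating safety risks that have not been sufficiently recognized.

Abstract translation

In recent years, multimodal large language models (MLLMs) have become a unified paradigm for language and image generation. Compared with diffusion models, MLLMs have stronger semantic understanding, allowing them to handle more complex textual inputs and grasp richer contextual meaning. However, this enhanced semantic ability may also bring new and potentially greater safety risks. Using diffusion models as a reference, we systematically analyze and compare the safety risks of emerging MLLMs along two dimensions: unsafe content generation and fake image synthesis. Tests on multiple unsafe-generation benchmark datasets show that MLLMs tend to generate more unsafe images than diffusion models. Part of this difference arises because diffusion models often fail to parse abstract prompts and produce corrupted outputs, whereas MLLMs understand these prompts and generate unsafe content. MLLM-generated images are also markedly harder for current advanced fake image detectors to identify. Even when detectors are retrained on MLLM-specific data, they can still be bypassed simply by giving MLLMs longer, more descriptive inputs. Our evaluation indicates that the emerging safety risks posed by MLLMs, the cutting-edge generative paradigm, have not been sufficiently recognized, which raises new challenges for real-world safety.

Abstract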

Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability may also introduce new and potentially greater safety risks. Taking diffusion models as a reference point, we systematically analyze and compare the safety risks of emerging MLLMs along two dimensions: unsafe content generation and fake image synthesis. Across multiple unsafe generation benchmark datasets, we observe that MLLMs tend to generate more unsafe images than diffusion models. This difference partly arises because diffusion models often fail to interpret abstract prompts, producing corrupted outputs, whereas MLLMs can comprehend these prompts and generate unsafe content. For current advanced fake image detectors, MLLM-generated images are also notably harder to identify. Even when detectors are retrained with MLLMs-specific data, they can still be bypassed by simply providing MLLMs with longer and more descriptive inputs. Our measurements indicate that the emerging safety risks of the cutting-edge generative paradigm, MLLMs, have not been sufficiently recognized, posing new challenges to real-world safety.

Keywords: multimodal large language models, MLLMs, image generation, safety risks, unsafe content generation, fake image synthesis, semantic understanding, diffusion models

78. ❌ Enhanced Mycelium of Thought (EMoT): A Bio-Inspired Hierarchical Reasoning Architecture with Strategic Dormancy and Mnemonic Encoding

Authors: Florian Odi Stummer Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24065v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

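Scoring rationale: The paper proposes EMoT, a new reasoning architecture aimed at improving the reasoning ability of large language models (LLMs). It is highly relevant to "Large Language Models" and "Chain of Thought", because the paper explicitly compares EMoT with CoT and proposes a novel architecture for the LLM reasoning process. It is fairly relevant to "System 2 Thinking", because EMoT involves hierarchical cognitive processing and in-depth reasoning. It is moderately relevant to "LLM Agents" and "In-context Learning", because the framework involves complex cognitive agent workflows and in-context learning mechanisms. It is slightly relevant to "Self-Correction", because the strategic dormancy and reactivation mechanism may involve self-adjustment. Other keywords, such as model training, optimization, compression, and domain-specific applications, are unrelated to the paper, which focuses on reasoning architecture rather than these technical aspects.

!!! tip deepseek-chat TL;DR

Targeting the lack of persistent memory and strategic dormancy in existing LLM reasoning methods such as Chain-of-Thought, the paper proposes EMoT, a bio-inspired hierarchical reasoning architecture. On complex cross-domain tasks it performs close to CoT with higher stability, but on simple tasks it performs worse because of overthinking.

Abstract translation

Current mainstream prompting paradigms for large language models (LLMs), including Chain-of-Thought (CoT) and Tree-of-Thoughts (ToT), follow linear or tree-structured reasoning paths and lack persistent memory, strategic dormancy, and cross-domain synthesis. This paper proposes the Enhanced Mycelium of Thought (EMoT) framework, a bio-inspired reasoning architecture. It organizes cognitive processing into four levels (Micro, Meso, Macro, Meta), implements strategic dormancy and reactivation of reasoning nodes, and integrates a Memory Palace with five mnemonic encoding styles. EMoT is a research prototype for complex, multi-domain problems, not a general-purpose prompting enhancement. Two complementary evaluations reveal a characteristic trade-off: in a blind LLM-as-Judge evaluation across three domains, EMoT reached near-parity with CoT (4.20 vs. 4.33/5.0) with higher stability and outperformed CoT on Cross-Domain Synthesis (4.8 vs. 4.4). Ablation studies show that strategic dormancy is architecturally essential (quality collapsed from 4.2 to 1.0 when it was disabled). However, on a 15-item short-answer benchmark, EMoT (27%) substantially underperformed simpler baselines, confirming systematic overthinking on simple problems. These results are subject to important limitations: small sample sizes (n=3 complex cases, n=15 short-answer items), LLM-as-Judge evaluation with potential self-preference bias, and roughly 33-fold computational cost overhead. To our knowledge, EMoT is the first reasoning framework to combine hierarchical topology, strategic thought dormancy with reactivation, and mnemonic memory encoding in a single architecture.

Abstract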

Current prompting paradigms for large language models (LLMs), including Chain-of-Thought (CoT) and Tree-of-Thoughts (ToT), follow linear or tree-structured reasoning paths that lack persistent memory, strategic dormancy, and cross-domain synthesis. We present the Enhanced Mycelium of Thought (EMoT) framework, a bio-inspired reasoning architecture that organises cognitive processing into a four-level hierarchy (Micro, Meso, Macro, Meta), implements strategic dormancy and reactivation of reasoning nodes, and integrates a Memory Palace with five mnemonic encoding styles. EMoT is a research prototype for complex, multi-domain problems, not a general-purpose prompting enhancement. Two complementary evaluations reveal a characteristic trade-off. In a blind LLM-as-Judge evaluation across three domains, EMoT achieved near-parity with CoT (4.20 vs. 4.33/5.0) with higher stability, and outperformed CoT on Cross-Domain Synthesis (4.8 vs. 4.4). Ablation studies show that strategic dormancy is architecturally essential (quality collapsed from 4.2 to 1.0 when disabled). On a 15-item short-answer benchmark, EMoT (27%) substantially underperformed simpler baselines, confirming systematic overthinking on simple problems. These results are subject to important limitations: small sample sizes (n=3 complex cases, n=15 short-answer items), LLM-as-Judge evaluation with potential self-preference bias, and approximately 33-fold computational cost overhead. To our knowledge, EMoT is the first reasoning framework to combine hierarchical topology, strategic thought dormancy with reactivation, and mnemonic memory encoding in a single architecture.

Keywords: Large Language Models, Chain-of-Thought, reasoning architecture, hierarchical reasoning, strategic dormancy, mnemonic encoding, cross-domain synthesis, LLM-as-Judge evaluation

79. ❌ From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

Authors: Xiaoyong Guo, Nanjie Li, Zijie Zeng, Kai Wang, Hao Huang, Haihua Xu, Wei Shi Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24034v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

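Scoring rationale: The paper focuses on Speech-LLMs (speech large language models) for contextual automatic speech recognition; its core contribution is a unified training framework that mitigates contextual exposure bias. The framework has three key components: Teacher Error Knowledge, which uses Whisper hypotheses as training-time history; Context Dropout regularization; and Direct Preference Optimization (DPO). The paper is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10), because Speech-LLMs apply LLMs to the speech domain; highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (10), because it explicitly trains with SFT; and highly relevant to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (10), because DPO is one of its core methods. Other keywords, such as MoE, SLMs, Scaling Laws, RAG, and Agents, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

Speech-LLMs for contextual automatic speech recognition are trained with perfect history but rely on erroneous history at inference, which causes contextual exposure bias. To address this, the paper proposes a unified training framework comprising Teacher Error Knowledge, Context Dropout, and Direct Preference Optimization; experiments show that under predicted-history decoding the framework significantly lowers word error rate and improves robustness to misleading context.

Abstract translation

Contextual automatic speech recognition (ASR) based on Speech-LLMs is typically trained with accurate conversation history but relies on error-prone history at inference. This creates a train-test mismatch in the context channel, which we call contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge, introduced by using Whisper large-v3 recognition hypotheses as training-time history; (ii) Context Dropout, to regularize over-reliance on history; and (iii) Direct Preference Optimization on curated failure cases. Experiments on TED-LIUM 3 (in-domain) and zero-shot LibriSpeech (out-of-domain) show consistent gains under predicted-history decoding. With a two-utterance history as context, supervised fine-tuning with Whisper hypotheses reduces word error rate from 5.59% (training with accurate history) to 5.47%, and Direct Preference Optimization further improves it to 5.17%. Under irrelevant-context attacks, Direct Preference Optimization yields the smallest degradation (5.17% -> 5.63%), indicating improved robustness to misleading context. Our code and models are released at https://github.com/XYGuo1996/Contextual_Speech_LLMs.

Abstract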

Contextual automatic speech recognition (ASR) with Speech-LLMs is typically trained with oracle conversation history, but relies on error-prone history at inference, causing a train-test mismatch in the context channel that we term contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large-v3 hypotheses as training-time history, (ii) Context Dropout to regularize over-reliance on history, and (iii) Direct Preference Optimization (DPO) on curated failure cases. Experiments on TED-LIUM 3 (in-domain) and zero-shot LibriSpeech (out-of-domain) show consistent gains under predicted-history decoding. With a two-utterance history as context, SFT with Whisper hypotheses reduces WER from 5.59% (oracle-history training) to 5.47%, and DPO further improves to 5.17%. Under irrelevant-context attacks, DPO yields the smallest degradation (5.17% -> 5.63%), indicating improved robustness to misleading context. Our code and models are published on https://github.com/XYGuo1996/Contextual_Speech_LLMs.

Keywords: Speech-LLMs, contextual automatic speech recognition, contextual exposure bias, Supervised Fine-tuning (SFT), Direct Preference Optimization (DPO), robustness, Whisper, word error rate (WER)
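
The WER figures quoted in the abstract are absolute percentages, so it can help to restate them as relative changes. The short calculation below uses only the four numbers from the abstract; the dictionary keys and the helper name `relative_change_pct` are labels made up for this illustration.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent (negative = WER went down)."""
    return (after - before) / before * 100.0


# WER values (%) as reported in the abstract; key names are illustrative labels.
wer = {
    "oracle_sft": 5.59,    # SFT with oracle history
    "whisper_sft": 5.47,   # SFT with Whisper hypotheses as history
    "dpo": 5.17,           # after DPO
    "dpo_attacked": 5.63,  # DPO model under irrelevant-context attack
}

steps = [
    ("oracle_sft", "whisper_sft"),
    ("whisper_sft", "dpo"),
    ("oracle_sft", "dpo"),
    ("dpo", "dpo_attacked"),
]
for a, b in steps:
    print(f"{a} -> {b}: {relative_change_pct(wer[a], wer[b]):+.1f}%")
```

Read this way, the full pipeline lowers WER by roughly 7.5% relative to oracle-history training, while the irrelevant-context attack raises the DPO model's WER by roughly 8.9% relative.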

80. ❌ Schema on the Inside: A Two-Phase Fine-Tuning Method for High-Efficiency Text-to-SQL at Scale

Authors: Chinmay Soni, Shivam Chourasia, Gaurav Kumar, Hitesh Kapoor Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24023v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

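Scoring rationale: The paper's core topic is applying large language models (LLMs) to text-to-SQL, in particular optimizing an 8B-parameter model with supervised fine-tuning (SFT) for efficient local inference and less reliance on expensive APIs. It is therefore highly relevant to "Large Language Models" and "Supervised Fine-tuning" (10). The model has 8B parameters, a relatively small scale, so it is fairly relevant to "Small Language Models" (8). Other keywords, such as MoE, Scaling Laws, and RAG, are not addressed in the paper and score 0. The paper does not mention any of the designated expert authors.

!!! tip deepseek-chat TL;DR

The paper proposes a two-phase supervised fine-tuning method that lets a self-hosted 8B-parameter language model internalize the database schema, cutting input tokens by over 99% on text-to-SQL while achieving high accuracy (98.4% execution success) and replacing costly external API calls.

Abstract translation

Applying large, proprietary API-based language models to text-to-SQL poses a significant industry challenge: reliance on massive, schema-heavy prompts leads to prohibitive per-token API costs and severe latency, which hinders scalable production deployment. We present a self-hosted 8-billion-parameter model built for the CriQ conversational bot. CriQ is a sister app to Dream11, India's largest fantasy sports platform with over 250 million users, and answers user queries about cricket statistics. Our novel two-phase supervised fine-tuning method enables the model to internalize the entire database schema, removing the need for long-context prompts. This cuts input tokens by more than 99%, from a 17k-token baseline to fewer than 100, and replaces costly external API calls with efficient local inference. The final system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google's Gemini Flash 2.0 (95.6% execution success, 89.4% semantic accuracy). These results show a practical path to high-precision, low-latency text-to-SQL in large-scale production using domain-specialized, self-hosted language models.

Abstract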

Applying large, proprietary API-based language models to text-to-SQL tasks poses a significant industry challenge: reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment. We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11, India’s largest fantasy sports platform with over 250 million users, that answers user queries about cricket statistics. Our novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts. This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100, and replaces costly external API calls with efficient local inference. The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google’s Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy). These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments.

Keywords: text-to-SQL, supervised fine-tuning, large language models, self-hosted model, database schema internalization, low-latency inference, domain specialization, API cost reduction
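
As a quick sanity check on the headline numbers in the abstract, the token reduction and the accuracy gaps can be recomputed directly. The figures below come from the abstract; treating "17k" as exactly 17,000 tokens and "fewer than 100" as an upper bound of 100 are assumptions of this sketch, and `reduction_pct` is a hypothetical helper name.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction when going from `before` to `after`."""
    return (before - after) / before * 100.0


baseline_tokens = 17_000  # "17k-token baseline"; the exact count is an assumption
finetuned_tokens = 100    # upper bound taken from "fewer than 100"

token_cut = reduction_pct(baseline_tokens, finetuned_tokens)
print(f"input-token reduction: at least {token_cut:.2f}%")

# Accuracy gaps versus the prompt-engineered baseline, in percentage points.
exec_gain = 98.4 - 95.6
semantic_gain = 92.5 - 89.4
print(f"execution success: +{exec_gain:.1f} points")
print(f"semantic accuracy: +{semantic_gain:.1f} points")
```

Under these assumptions the reduction is at least about 99.41%, consistent with the "over 99%" claim, and the fine-tuned model leads the baseline by about 2.8 points on execution success and 3.1 points on semantic accuracy.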

81. ❌ ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents

Authors: Bingqing Wei, Zhongyu Xia, Dingai Liu, Xiaoyu Zhou, Zhiwei Lin, Yongtao Wang Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24018v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

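Scoring rationale: The paper studies ELITE, an embodied agent framework that learns continually from environment interaction experience through self-reflective knowledge construction and intent-aware retrieval. It is highly relevant to "Self-Correction OR Self-Improvement OR Self-Reflection" (10), because its core mechanism includes self-reflective knowledge construction. It is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10), because it studies an embodied agent framework. It is moderately relevant to "Large Language Models OR LLMs OR Foundation Models" (5), because the agents are built on vision-language models (VLMs), although LLM technology itself is not the paper's focus. Other keywords, such as MoE, SFT, RAG, and CoT, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the ELITE framework, which uses self-reflective knowledge construction and intent-aware retrieval to address failures of embodied agents on complex tasks caused by the gap between static training data and physical interaction, achieving 9% and 5% performance improvements on the EB-ALFRED and EB-Habitat benchmarks.

Abstract translation

Vision-language models (VLMs) have shown remarkable general capabilities, yet embodied agents built on them still fail frequently at complex tasks, often skipping critical steps, proposing invalid actions, and repeating mistakes. These failures stem from a fundamental gap between the static training data of VLMs and the physical interaction that embodied tasks require. VLMs can learn rich semantic knowledge from static data but lack the ability to interact with the world. To address this, we propose ELITE, an embodied agent framework with experiential learning and intent-aware transfer that lets agents learn continually from their own environment interaction experience and transfer the acquired knowledge to procedurally similar tasks. ELITE operates through two synergistic mechanisms: self-reflective knowledge construction and intent-aware retrieval. Specifically, self-reflective knowledge construction extracts reusable strategies from execution trajectories and maintains a continually evolving strategy pool through structured refinement operations. Intent-aware retrieval then identifies relevant strategies in the pool and applies them to the current task. Experiments on the EB-ALFRED and EB-Habitat benchmarks show that, in the online setting without any supervision, ELITE improves performance over base VLMs by 9% and 5%. In the supervised setting, ELITE generalizes effectively to unseen task categories and outperforms state-of-the-art training-based methods. These results demonstrate ELITE's effectiveness in bridging the gap between semantic understanding and reliable action execution.

Abstract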

Vision-language models (VLMs) have shown remarkable general capabilities, yet embodied agents built on them fail at complex tasks, often skipping critical steps, proposing invalid actions, and repeating mistakes. These failures arise from a fundamental gap between the static training data of VLMs and the physical interaction for embodied tasks. VLMs can learn rich semantic knowledge from static data but lack the ability to interact with the world. To address this issue, we introduce ELITE, an embodied agent framework with Experiential Learning and Intent-aware Transfer that enables agents to continuously learn from their own environment interaction experiences, and transfer acquired knowledge to procedurally similar tasks. ELITE operates through two synergistic mechanisms, i.e., self-reflective knowledge construction and intent-aware retrieval. Specifically, self-reflective knowledge construction extracts reusable strategies from execution trajectories and maintains an evolving strategy pool through structured refinement operations. Then, intent-aware retrieval identifies relevant strategies from the pool and applies them to current tasks. Experiments on the EB-ALFRED and EB-Habitat benchmarks show that ELITE achieves 9% and 5% performance improvement over base VLMs in the online setting without any supervision. In the supervised setting, ELITE generalizes effectively to unseen task categories, achieving better performance compared to state-of-the-art training-based methods. These results demonstrate the effectiveness of ELITE for bridging the gap between semantic understanding and reliable action execution.

Keywords: embodied agents, experiential learning, self-reflection, vision-language models, knowledge transfer, intent-aware retrieval, strategy pool, environment interaction

82. ❌ Language-Grounded Multi-Agent Planning for Personalized and Fair Participatory Urban Sensing

Authors: Xusen Guo, Mingxing Peng, Hongliang Lu, Hai Yang, Jun Ma, Yuxuan Liang Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24014v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
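
The rows above can be rolled up with a few lines of arithmetic. This is a minimal sketch, not the report generator's actual code: the pass-mark formula (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report configuration, while the per-row formula `weight × (relevance / 10) × 15` is an assumption, since the report does not state how a row score is derived. The names `row_score` and `pass_mark` are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword maximum from the report configuration


def row_score(weight: float, relevance: float) -> float:
    """Assumed row formula (not confirmed by the report):
    weight * (relevance / 10) * per-keyword maximum."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def pass_mark(total_weight: float) -> float:
    """Pass-mark formula as stated in the report configuration."""
    return 5.0 + 0.8 * total_weight


# (weight, relevance) for the non-zero-relevance rows of the table above;
# every row in the table carries weight 0.0.
rows = [(0.0, 8.0), (0.0, 10.0), (0.0, 10.0), (0.0, 5.0)]

total = sum(row_score(w, r) for w, r in rows)
threshold = pass_mark(27.0)  # 27 keywords with a configured weight of 1.0 each
passed = total >= threshold
```

Under any formula that multiplies by the row weight, a weight of 0.0 yields a row score of 0.0 regardless of relevance, which matches the 0.0 total shown for this paper.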

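Scoring rationale: The paper proposes the MAPUS framework and explicitly uses LLMs as its underlying technology (the abstract states “LLM-based multi-agent framework”), so it is highly relevant to “Large Language Models” (8). The core of the framework is a multi-agent system in which participants are modeled as autonomous agents and a coordinator agent performs fairness-aware selection and language-based negotiation, which fully matches “LLM Agents” and “Multi-agent Systems” (10 each). The paper targets urban sensing, a scientific domain, so as an application of AI to science and urban science it has some connection to “AI for Science” (5). Other keywords such as MoE, SFT, RAG, reasoning methods, and compression techniques are neither mentioned nor implied in the abstract and are therefore scored 0.

!!! tip deepseek-chat TL;DR

Traditional participatory urban sensing methods ignore participants' personal preferences and urban heterogeneity, which leads to rigid assignments. The paper proposes MAPUS, an LLM-based multi-agent framework that achieves personalized and fair task assignment through language-based negotiation. Experiments show that the framework substantially improves participant satisfaction and fairness while maintaining sensing coverage.

Abstract (Chinese translation)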

参与式城市感知利用人类移动性进行大规模城市数据采集，但现有方法通常依赖集中式优化并假设参与者同质化，导致任务分配僵化，忽视个人偏好与城市环境的异质性。我们提出MAPUS，一种基于大语言模型的多智能体框架，用于实现个性化且公平的参与式城市感知。在该框架中，参与者被建模为具有个人档案与行程安排的自主智能体，而协调智能体则执行公平感知的任务选择，并通过基于语言的协商机制优化感知路线。在真实数据集上的实验表明，MAPUS在保持竞争力的感知覆盖率的同时，显著提升了参与者满意度与公平性，有助于构建更加以人为本且可持续的城市感知系统。

Abstract

Participatory urban sensing leverages human mobility for large-scale urban data collection, yet existing methods typically rely on centralized optimization and assume homogeneous participants, resulting in rigid assignments that overlook personal preferences and heterogeneous urban contexts. We propose MAPUS, an LLM-based multi-agent framework for personalized and fair participatory urban sensing. In our framework, participants are modeled as autonomous agents with individual profiles and schedules, while a coordinator agent performs fairness-aware selection and refines sensing routes through language-based negotiation. Experiments on real-world datasets show that MAPUS achieves competitive sensing coverage while substantially improving participant satisfaction and fairness, promoting more human-centric and sustainable urban sensing systems.

Keywords: Participatory Urban Sensing, Multi-Agent Systems, Large Language Models, Personalization, Fairness, Agent Coordination, Human-Centric Systems, Language-Based Negotiation

83. ❌ Understanding the Challenges in Iterative Generative Optimization with LLMs

Authors: Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, Ching-An Cheng Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23994v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

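Scoring rationale: The paper's core subject is the use of LLMs in generative optimization, in particular for building self-improving agents, so it is highly relevant to “Large Language Models”, “Self-Correction/Self-Improvement”, and “LLM Agents” (10 each). It uses execution feedback to optimize workflows, which has some connection to “Tool Use” (5). Other keywords such as MoE, quantization, inference acceleration, and AI for Science are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies the challenges of iterative generative optimization with large language models. It finds that hidden design choices, namely the starting artifact, the credit horizon, and the batching of trials, determine whether optimization succeeds, and it offers practical guidance.

Abstract (Chinese translation)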

生成式优化利用大型语言模型（LLMs），通过执行反馈迭代改进人工制品（如代码、工作流程或提示）。这是一种构建自我改进智能体的有前景的方法，但在实践中仍显脆弱：尽管研究活跃，但调查显示仅有9%的智能体采用了任何自动化优化。我们认为这种脆弱性源于工程师在建立学习循环时必须做出“隐性”设计选择：优化器可以编辑什么？每次更新时应提供何种“恰当”的学习证据？我们研究了影响大多数应用的三个因素：起始人工制品、执行轨迹的信用分配范围，以及将试错过程批处理为学习证据的方式。通过在MLAgentBench、Atari和BigBench Extra Hard（BBEH）中的案例研究，我们发现这些设计决策能决定生成式优化的成败，但在先前工作中很少被明确讨论。不同的起始人工制品决定了MLAgentBench中可达到的解决方案范围，截断的轨迹仍能改进Atari智能体，而更大的小批量在BBEH上并不能单调提升泛化性能。我们的结论是，缺乏一种简单、通用的跨领域学习循环建立方法是该方法走向产品化和广泛应用的主要障碍。我们为如何做出这些选择提供了实用指导。

Abstract

Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make “hidden” design choices: What can the optimizer edit and what is the “right” learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and batching trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard, we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.

Keywords: Generative Optimization, Large Language Models, Self-improving Agents, Iterative Improvement, Execution Feedback, Learning Loop, Design Choices, MLAgentBench
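
The three factors named in the abstract (starting artifact, credit horizon, batching) can be read as parameters of a learning loop. The sketch below is a toy illustration of that reading, not the authors' implementation: the `optimize` function, the integer "artifact", the `TARGET` value, and the rule-based `propose_edit` (which stands in for an LLM call) are all hypothetical.

```python
from collections import Counter


def optimize(start_artifact, run_trial, propose_edit, steps, credit_horizon, batch_size):
    """Toy learning loop exposing the three design choices as parameters."""
    artifact = start_artifact  # design choice 1: the starting artifact
    for _ in range(steps):
        evidence = []
        for _ in range(batch_size):  # design choice 3: trials batched per update
            trace, score = run_trial(artifact)
            # design choice 2: keep only the last `credit_horizon` trace entries
            kept = trace[max(0, len(trace) - credit_horizon):]
            evidence.append((kept, score))
        artifact = propose_edit(artifact, evidence)
    return artifact


TARGET = 3  # hypothetical optimum of the toy task


def run_trial(x):
    """Execute the artifact and return (execution trace, score)."""
    feedback = "ok" if x == TARGET else ("too low" if x < TARGET else "too high")
    return ["start", "step", feedback], -abs(x - TARGET)


def propose_edit(x, evidence):
    """Stand-in for the LLM optimizer: follow the majority of final feedback lines."""
    votes = Counter(trace[-1] for trace, _ in evidence if trace)
    if not votes:
        return x
    top = votes.most_common(1)[0][0]
    if top == "too low":
        return x + 1
    if top == "too high":
        return x - 1
    return x


result = optimize(0, run_trial, propose_edit, steps=5, credit_horizon=1, batch_size=2)
```

In this toy task a credit horizon of 1 is enough because the final trace entry carries all the feedback; the abstract's point is that such choices are task-dependent and usually left implicit.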

84. ❌ From Untamed Black Box to Interpretable Pedagogical Orchestration: The Ensemble of Specialized LLMs Architecture for Adaptive Tutoring

Authors: Nizam Kadir Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23990v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

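Scoring rationale: The paper's core contribution is the ES-LLMs architecture, which applies LLMs to education and uses a multi-agent system (LLM Agents, Multi-agent Systems) to build interpretable, controllable pedagogical agents. It directly involves LLMs as the underlying technology and emphasizes interpretability (Explainable AI), but it does not address the technical specifics of the other keywords.

!!! tip deepseek-chat TL;DR

The paper addresses the problem that a single LLM used as a black box in educational dialogue violates pedagogical constraints. It proposes the ES-LLMs architecture, which separates decision-making from wording and uses multi-agent coordination and interpretable rules to achieve higher pedagogical quality, constraint adherence, and resource efficiency.

Abstract (Chinese translation)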

教育对话中使用的单体大语言模型（LLM）通常表现为“黑箱”，其教学决策过程隐晦且难以审查，常因过早提供答案而违反教学约束。我们提出专业化大语言模型集成架构，该架构将决策制定与语言表达相分离。教学行动由一个基于确定性规则的编排器选择，该编排器协调覆盖辅导、评估、反馈、支架式教学、动机激励及伦理规范的专业化智能体，这些智能体的运作由可解释的贝叶斯知识追踪学生模型指导。一个大语言模型渲染器将选定的行动以自然语言进行表层实现。此设计强调可靠性与可控性：诸如“尝试优先于提示”和提示次数上限等约束被作为显式规则强制执行，系统会记录每轮对话的智能体轨迹和约束检查结果。通过人类专家评审员（N=6）和多LLM评委组（六个前沿模型）对教学质量的验证表明，ES-LLMs分别在91.7%和79.2%的案例中更受青睐。该架构在所有七个评估维度上均显著优于单体基线模型，尤其在支架式指导以及信任与可解释性方面。此外，一项蒙特卡洛模拟（N=2,400）揭示了一个“掌握增益悖论”：单体辅导模型通过过度辅助人为提升了短期表现。相比之下，ES-LLMs实现了对教学约束（如尝试优先于提示）的100%遵守，并将提示效率提升了3.3倍。在运行层面，ES-LLMs通过采用无状态提示，将成本降低了54%，延迟减少了22%。我们得出结论：结构解耦对于将随机性模型转化为可信赖、可验证且资源高效的教学智能体至关重要。

Abstract

Monolithic Large Language Models (LLMs) used in educational dialogue often behave as “black boxes,” where pedagogical decisions are implicit and difficult to audit, frequently violating instructional constraints by providing answers too early. We introduce the Ensemble of Specialized LLMs (ES-LLMs) architecture that separates decision-making from wording. Pedagogical actions are selected by a deterministic rules-based orchestrator coordinating specialized agents covering tutoring, assessment, feedback, scaffolding, motivation, and ethics, guided by an interpretable Bayesian Knowledge Tracing (BKT) student model. An LLM renderer surface-realizes the chosen action in natural language. This design emphasizes reliability and controllability: constraints such as “attempt-before-hint” and hint caps are enforced as explicit rules, and the system logs per-turn agent traces and constraint checks. Validation of pedagogical quality via human expert reviewers (N=6) and a multi-LLM-as-Judge panel (six state-of-the-art models) showed that ES-LLMs were preferred in 91.7% and 79.2% of cases, respectively. The architecture significantly outperformed monolithic baselines across all seven dimensions, particularly in Scaffolding & Guidance, and Trust & Explainability. Furthermore, a Monte Carlo simulation (N=2,400) exposed a “Mastery Gain Paradox,” where monolithic tutors inflated short-term performance through over-assistance. In contrast, ES-LLMs achieved 100% adherence to pedagogical constraints (e.g., attempt-before-hint) and a 3.3x increase in hint efficiency. Operationally, ES-LLMs reduced costs by 54% and latency by 22% by utilizing stateless prompts. We conclude that structural decoupling is essential for transforming stochastic models into trustworthy, verifiable and resource-efficient pedagogical agents.

Keywords: Ensemble of Specialized LLMs, pedagogical orchestration, interpretable tutoring, multi-agent system, Bayesian Knowledge Tracing, constraint adherence, educational dialogue, trustworthy AI
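
The abstract combines a Bayesian Knowledge Tracing student model with explicit rules such as attempt-before-hint and a hint cap. The sketch below illustrates that combination using the standard BKT update equations; the parameter values, the action names, the mastery threshold, and the `choose_action` rule order are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    p_transit: float = 0.15  # chance of learning the skill after a practice step
    p_slip: float = 0.10     # chance of answering wrong despite knowing the skill
    p_guess: float = 0.20    # chance of answering right without knowing the skill


def bkt_update(p_known: float, correct: bool, params: BKTParams) -> float:
    """Standard BKT: Bayes update on the observation, then the learning transition."""
    if correct:
        num = p_known * (1 - params.p_slip)
        den = num + (1 - p_known) * params.p_guess
    else:
        num = p_known * params.p_slip
        den = num + (1 - p_known) * (1 - params.p_guess)
    posterior = num / den
    return posterior + (1 - posterior) * params.p_transit


def choose_action(p_known, attempts, hints_used, hint_requested,
                  hint_cap=2, mastery=0.95):
    """Hypothetical deterministic rules; an LLM would only word the chosen action."""
    if p_known >= mastery:
        return "advance"
    if hint_requested and attempts == 0:
        return "ask_for_attempt"  # attempt-before-hint
    if hint_requested and hints_used >= hint_cap:
        return "encourage"        # hint cap reached, no further hints
    if hint_requested:
        return "give_hint"
    return "pose_problem"


params = BKTParams()
after_correct = bkt_update(0.3, True, params)
after_wrong = bkt_update(0.3, False, params)
```

Because the action is selected by plain rules before any text is generated, each decision can be logged and checked against the constraints, which is the auditability property the abstract emphasizes.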

85. ❌ SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating

Authors: Hanbyel Cho, Sang-Hun Kim, Jeonguk Kang, Donghan Koo Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23983v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

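Scoring rationale: The paper studies a text-driven humanoid whole-body control framework whose core is physics-guided motion generation and a safety gating mechanism, using techniques such as Rectified Flow Matching, VAEs, and diffusion models. All scoring keywords relate to large language models (LLMs), whereas this paper focuses on robot control, motion generation, and physics simulation and does not involve LLM techniques, training methods, inference optimization, or specific AI for Science applications. All keywords therefore have a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the SafeFlow framework, which uses Physics-Guided Rectified Flow Matching and a 3-Stage Safety Gate to address physical infeasibility and safety problems in text-driven humanoid motion generation, achieving higher success rate, physical compliance, and inference speed on the Unitree G1.

Abstract (Chinese translation)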

实时交互式文本驱动运动生成技术的最新进展已使人形机器人能够执行多样化行为。然而，仅基于运动学的生成器常出现物理幻觉现象，产生下游运动跟踪控制器难以追踪的物理不可行轨迹，或在现实部署中存在安全隐患。这些失效通常源于缺乏面向真实机器人执行的显式物理感知目标，且在分布外用户输入下问题更为严重。为此，我们提出SafeFlow——一个结合物理引导运动生成与三级安全闸门的文本驱动人形机器人全身控制框架，该安全闸门由显式风险指标驱动。SafeFlow采用双层架构：在高层级，我们通过VAE潜在空间中的物理引导整流流匹配生成运动轨迹以提升真实机器人可执行性，并利用回流技术加速采样以减少实时控制所需函数评估次数；三级安全闸门通过文本嵌入空间中的马氏距离检测语义分布外指令，借助方向敏感度差异度量过滤不稳定生成结果，并在将轨迹传递至底层运动跟踪控制器前强制执行关节限位与速度限制等硬性运动学约束，从而实现选择性执行。在Unitree G1机器人上的大量实验表明，SafeFlow在成功率、物理合规性与推理速度方面均优于现有基于扩散模型的方法，同时保持了丰富的动作表现力。

Abstract

Recent advances in real-time interactive text-driven motion generation have enabled humanoids to perform diverse behaviors. However, kinematics-only generators often exhibit physical hallucinations, producing motion trajectories that are physically infeasible to track with a downstream motion tracking controller or unsafe for real-world deployment. These failures often arise from the lack of explicit physics-aware objectives for real-robot execution and become more severe under out-of-distribution (OOD) user inputs. Hence, we propose SafeFlow, a text-driven humanoid whole-body control framework that combines physics-guided motion generation with a 3-Stage Safety Gate driven by explicit risk indicators. SafeFlow adopts a two-level architecture. At the high level, we generate motion trajectories using Physics-Guided Rectified Flow Matching in a VAE latent space to improve real-robot executability, and further accelerate sampling via Reflow to reduce the number of function evaluations (NFE) for real-time control. The 3-Stage Safety Gate enables selective execution by detecting semantic OOD prompts using a Mahalanobis score in text-embedding space, filtering unstable generations via a directional sensitivity discrepancy metric, and enforcing final hard kinematic constraints such as joint and velocity limits before passing the generated trajectory to a low-level motion tracking controller. Extensive experiments on the Unitree G1 demonstrate that SafeFlow outperforms prior diffusion-based methods in success rate, physical compliance, and inference speed, while maintaining diverse expressiveness.

Keywords: text-driven motion generation, humanoid whole-body control, physics-guided rectified flow, safety gating, real-time control, motion tracking controller, diffusion-based methods, physical compliance
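
Two of the gate stages described in the abstract, the Mahalanobis score on the prompt embedding and the final hard kinematic limits, are simple enough to sketch. The code below is an illustration under stated assumptions rather than SafeFlow's implementation: it uses a diagonal covariance to stay dependency-free, all thresholds and function names are hypothetical, and the middle stage (the directional sensitivity discrepancy metric) is omitted because the abstract does not define it.

```python
import math


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (a simplifying assumption)."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def within_limits(trajectory, joint_limits, max_velocity, dt):
    """Check joint-position limits per frame and finite-difference velocity limits."""
    for frame in trajectory:
        for q, (lo, hi) in zip(frame, joint_limits):
            if not lo <= q <= hi:
                return False
    for prev, cur in zip(trajectory, trajectory[1:]):
        for q0, q1 in zip(prev, cur):
            if abs(q1 - q0) / dt > max_velocity:
                return False
    return True


def safety_gate(embedding, mean, var, trajectory, joint_limits,
                max_velocity, dt, ood_threshold):
    """Return 'execute' only if the prompt looks in-distribution and the motion is feasible."""
    if mahalanobis_diag(embedding, mean, var) > ood_threshold:
        return "reject_ood"
    if not within_limits(trajectory, joint_limits, max_velocity, dt):
        return "reject_kinematics"
    return "execute"


# Hypothetical statistics and limits for a two-joint, two-dimensional toy case.
mean, var = [0.0, 0.0], [1.0, 4.0]
limits = [(-1.0, 1.0), (-1.0, 1.0)]
slow = [[0.0, 0.0], [0.1, 0.05]]
fast = [[0.0, 0.0], [0.5, 0.0]]

ok = safety_gate([0.5, 0.5], mean, var, slow, limits, 2.0, 0.1, 3.0)
ood = safety_gate([10.0, 0.0], mean, var, slow, limits, 2.0, 0.1, 3.0)
too_fast = safety_gate([0.5, 0.5], mean, var, fast, limits, 2.0, 0.1, 3.0)
```

Ordering the cheap prompt-level check before the trajectory checks means an out-of-distribution request is rejected without inspecting the generated motion at all.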

86. ❌ Kirchhoff-Inspired Neural Networks for Evolving High-Order Perception

Authors: Tongfei Chen, Jingying Yang, Linlin Yang, Jinhu Lü, David Doermann, Chunyu Xie, Long He, Tian Wang, Juan Zhang, Guodong Guo, Baochang Zhang Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23977v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

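Scoring rationale: The paper proposes KINN, a new neural network architecture based on Kirchhoff's law, which is an innovation in deep learning principles. It does not involve large language models (LLMs), training methods (such as pre-training or fine-tuning), inference optimization, alignment, or agents. The only relevant keywords are “Mechanistic Interpretability OR Explainable AI” (KINN emphasizes physical consistency and interpretability) and “AI for Science” (it is applied to PDE solving). All other keywords are neither mentioned nor relevant.

!!! tip deepseek-chat TL;DR

The paper proposes KINN, a neural network architecture based on Kirchhoff's law, to address the inability of conventional deep learning to jointly characterize signal intensity, coupling structure, and state evolution, and it reports better performance than existing methods on PDE solving and ImageNet classification.

Abstract (Chinese translation)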

深度学习架构从根本上受到神经科学的启发，尤其是大脑感觉通路的结构，并在学习信息丰富的数据表征方面取得了显著成功。尽管这些架构模仿了生物神经元的通信机制，但其信息编码与传输策略存在本质差异。生物系统依赖于膜电位的动态波动；相比之下，传统的深度网络通过调整神经元间连接的强度来优化权重与偏置，缺乏一个系统化的机制来共同表征信号强度、耦合结构和状态演化之间的相互作用。为应对这一局限，我们提出了基尔霍夫启发的神经网络（Kirchhoff-Inspired Neural Network, KINN），这是一种基于基尔霍夫电流定律构建的、以状态变量为基础的网络架构。KINN从基本常微分方程推导出数值稳定的状态更新规则，能够在单层内显式解耦并编码高阶演化分量，同时保持物理一致性、可解释性以及端到端的可训练性。在偏微分方程（PDE）求解和ImageNet图像分类上的大量实验验证了KINN优于现有的先进方法。

Abstract

Deep learning architectures are fundamentally inspired by neuroscience, particularly the structure of the brain’s sensory pathways, and have achieved remarkable success in learning informative data representations. Although these architectures mimic the communication mechanisms of biological neurons, their strategies for information encoding and transmission are fundamentally distinct. Biological systems depend on dynamic fluctuations in membrane potential; by contrast, conventional deep networks optimize weights and biases by adjusting the strengths of inter-neural connections, lacking a systematic mechanism to jointly characterize the interplay among signal intensity, coupling structure, and state evolution. To tackle this limitation, we propose the Kirchhoff-Inspired Neural Network (KINN), a state-variable-based network architecture constructed based on Kirchhoff’s current law. KINN derives numerically stable state updates from fundamental ordinary differential equations, enabling the explicit decoupling and encoding of higher-order evolutionary components within a single layer while preserving physical consistency, interpretability, and end-to-end trainability. Extensive experiments on partial differential equation (PDE) solving and ImageNet image classification validate that KINN outperforms state-of-the-art existing methods.

Keywords: Kirchhoff-Inspired Neural Network, state-variable-based architecture, higher-order evolutionary components, physical consistency, interpretability, PDE solving, ImageNet classification, deep learning

87. ❌ The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More

Authors: Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, James Zou Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23971v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

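Scoring rationale: The paper studies the relationship between API pricing and actual inference cost for reasoning language models (RLMs) and identifies a price reversal phenomenon. Its core is highly relevant to reasoning techniques: it explicitly studies reasoning models, which involves Chain of Thought/CoT Reasoning/Multi-step Reasoning (10) and System 2 Thinking/Slow Thinking/In-depth Reasoning (10), because the paper identifies thinking token consumption as the key cause of cost differences. It has some connection to Large Language Models/LLMs/Foundation Models (8), because the study is based on frontier RLMs (such as GPT-5.2 and Gemini 3 Flash), which fall under large models. Other keywords such as MoE, SLMs, Scaling Laws, training methods, alignment, RAG, compression, and agents are scored 0 because the paper contributes no technical detail or innovation on them. Application-domain keywords such as AI for Science are not relevant: the paper's tasks cover math, science QA, and code, but it does not focus on innovation in scientific applications.

!!! tip deepseek-chat TL;DR

The paper finds that API pricing for reasoning language models does not accurately reflect actual inference cost and that price reversals occur: a cheaper model can end up costing more because it consumes more thinking tokens. This reveals an inconsistency between listed price and actual cost.

Abstract (Chinese translation)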

开发者和消费者日益倾向于依据所列API价格来选择推理语言模型（RLMs）。然而，这些价格在多大程度上准确反映了实际推理成本？我们针对此问题开展了首次系统性研究，评估了8个前沿RLM在涵盖竞赛数学、科学问答、代码生成及多领域推理的9项多样化任务上的表现。我们揭示了价格倒挂现象：在21.8%的模型对比较中，标价较低的模型实际产生了更高的总成本，倒挂幅度最高可达28倍。例如，Gemini 3 Flash的标价比GPT-5.2低78%，但其在所有任务中的实际成本却高出22%。我们追溯其根本原因在于思考令牌（thinking token）消耗的巨大异质性：对于同一查询，一个模型可能比另一个模型多消耗900%的思考令牌。事实上，若剔除思考令牌成本，排名倒挂现象可减少70%，并使价格与成本排名间的等级相关性（肯德尔$τ$系数）从0.563提升至0.873。我们进一步证明，单次查询的成本预测本质上极为困难：同一查询的多次运行产生的思考令牌数量差异最高可达9.7倍，这为任何预测器设定了不可降低的噪声基底。我们的研究结果表明，所列API价格并非实际成本的可靠参照，这呼吁需要采用成本感知的模型选择方法，并建立透明的按请求成本监控机制。

Abstract

Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash’s listed price is 78% cheaper than GPT-5.2’s, yet its actual cost across all tasks is 22% higher. We trace the root cause to vast heterogeneity in thinking token consumption: on the same query, one model may use 900% more thinking tokens than another. In fact, removing thinking token costs reduces ranking reversals by 70% and raises the rank correlation (Kendall’s $\tau$) between price and cost rankings from 0.563 to 0.873. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.

Keywords: reasoning language models, API pricing, inference cost, thinking tokens, cost prediction, model selection, pricing reversal, token consumption
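
The two statistics in the abstract, the share of model pairs whose price order and cost order disagree and Kendall's τ between the two rankings, can be computed directly from pairwise comparisons. The sketch below uses made-up prices and token counts for four hypothetical models (A to D) purely to show the mechanics; none of the numbers come from the paper, and the τ here is the simple tau-a variant without tie correction.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs, no tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def reversal_rate(listed_price, actual_cost):
    """Fraction of model pairs where the lower-priced model has the higher actual cost."""
    pairs = list(combinations(range(len(listed_price)), 2))
    reversed_pairs = sum(
        1 for i, j in pairs
        if (listed_price[i] - listed_price[j]) * (actual_cost[i] - actual_cost[j]) < 0
    )
    return reversed_pairs / len(pairs)


# Hypothetical models: (price per 1k output tokens, answer tokens, thinking tokens)
models = {
    "A": (1.0, 500, 6000),  # cheapest listed price, but thinks a lot
    "B": (2.0, 500, 1000),
    "C": (4.0, 500, 1500),
    "D": (8.0, 500, 1000),
}

prices = [p for p, _, _ in models.values()]
costs = [p * (ans + think) / 1000 for p, ans, think in models.values()]
costs_no_thinking = [p * ans / 1000 for p, ans, _ in models.values()]

rate = reversal_rate(prices, costs)
tau_with_thinking = kendall_tau(prices, costs)
tau_without_thinking = kendall_tau(prices, costs_no_thinking)
```

In this toy data the only reversal is between A and B, and it disappears once thinking tokens are excluded, which mirrors in miniature the direction of the effect the abstract reports.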

88. ❌ Policy-Guided Threat Hunting: An LLM enabled Framework with Splunk SOC Triage

Authors: Rishikesh Sahay, Bell Eapen, Weizhi Meng, Md Rasel Al Mamun, Nikhil Kumar Dora, Manjusha Sumasadan, Sumit Kumar Tetarave, Rod Soto Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23966v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

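Scoring rationale: The paper proposes a cybersecurity threat hunting framework based on LLMs and Agentic AI. Its core content covers the application of LLMs to cybersecurity (for contextual analysis) and the integration of Agentic AI (autonomous agents), so it is highly relevant to “Large Language Models” and “LLM Agents” (10 each). Other keywords such as MoE, SLMs, training methods, inference optimization, and AI for Science are not mentioned in the abstract and are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The study proposes an automated threat hunting framework that integrates Agentic AI and a large language model (LLM) to monitor evolving cyber threats, adapt to changing network conditions, and mitigate suspicious and malicious traffic through risk-based prioritization. Experiments show that the framework can autonomously adapt to different SOC objectives and identify threats.

Abstract (Chinese translation)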

随着网络空间中高级持续性威胁（APT）的持续演变，传统安全解决方案已难以满足组织的威胁狩猎需求。此外，安全运营中心（SOC）分析师常因需要分析来自组织内各类设备的海量日志而不堪重负。为应对这些挑战，我们提出一种自动化、动态的威胁狩猎框架，用于监控不断演变的威胁，适应变化的网络条件，并对可疑与恶意流量的缓解进行基于风险的优先级排序。通过将智能体人工智能（Agentic AI）与成熟的SIEM平台Splunk集成，我们开发了一种独特的威胁狩猎框架。该框架系统且无缝地整合了从流量采集到异常评估（使用基于重构的自编码器）、初步分级（采用双层深度强化学习，DRL）以及上下文分析（利用大语言模型，LLM）等不同威胁狩猎模块。我们使用公开的基准数据集以及模拟数据集对该框架进行了评估。实验结果表明，该框架能够有效自主适应不同的SOC目标，并识别可疑与恶意流量。该框架通过支持SOC分析师做出阻断、允许或监控网络流量的决策，提升了运营效率。因此，本研究通过提出这一用于安全决策的新型威胁狩猎框架，丰富了网络安全与威胁狩猎领域的文献，并推动累积性研究，以开发更有效的框架来应对持续演变的网络威胁。

Abstract

With frequently evolving Advanced Persistent Threats (APTs) in cyberspace, traditional security approaches have become inadequate for organizational threat hunting. Moreover, Security Operations Center (SOC) analysts are often overwhelmed and struggle to analyze the huge volume of logs received from diverse devices in organizations. To address these challenges, we propose an automated and dynamic threat hunting framework for monitoring evolving threats, adapting to changing network conditions, and performing risk-based prioritization for the mitigation of suspicious and malicious traffic. By integrating Agentic AI with Splunk, an established SIEM platform, we developed a unique threat hunting framework. The framework systematically and seamlessly integrates different threat hunting modules, ranging from traffic ingestion to anomaly assessment using a reconstruction-based autoencoder, deep reinforcement learning (DRL) with two layers for initial triage, and a large language model (LLM) for contextual analysis. We evaluated the framework against a publicly available benchmark dataset, as well as against a simulated dataset. The experimental results show that the framework can effectively adapt to different SOC objectives autonomously and identify suspicious and malicious traffic. The framework enhances operational effectiveness by supporting SOC analysts in their decision-making to block, allow, or monitor network traffic. This study thus enhances cybersecurity and threat hunting literature by presenting the novel threat hunting framework for security decision-making, as well as promoting cumulative research efforts to develop more effective frameworks to battle continuously evolving cyber threats.

Keywords: threat hunting, LLM, Agentic AI, cybersecurity, SOC, deep reinforcement learning, autoencoder, Splunk

89. ❌ Variable-Length Audio Fingerprinting

Authors: Hongjie Chen, Hanyu Meng, Huimin Zeng, Ryan A. Rossi, Lie Lu, Josh Kimball Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23947v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

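Scoring rationale: The paper studies audio fingerprinting and proposes VLAFP, a deep learning method that handles variable-length audio. The scoring keywords all concern large language models, the technical principles of deep learning, or AI-for-science applications, whereas the paper focuses on audio signal processing. It does not involve large models, innovations in deep learning principles, or AI-for-science applications, so it is rated as unrelated to every keyword.

!!! tip deepseek-chat TL;DR

The paper proposes VLAFP, a variable-length audio fingerprinting method that removes the restriction of existing methods to fixed-length audio segments, and reports better audio identification and retrieval performance than prior methods on three real-world datasets.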

Abstract

Audio fingerprinting converts audio to much lower-dimensional representations, allowing distorted recordings to still be recognized as their originals through similar fingerprints. Existing deep learning approaches rigidly fingerprint fixed-length audio segments, thereby neglecting temporal dynamics during segmentation. To address limitations due to this rigidity, we propose Variable-Length Audio FingerPrinting (VLAFP), a novel method that supports variable-length fingerprinting. To the best of our knowledge, VLAFP is the first deep audio fingerprinting model capable of processing audio of variable length, for both training and testing. Our experiments show that VLAFP outperforms existing state-of-the-art methods in live audio identification and audio retrieval across three real-world datasets.

Keywords: audio fingerprinting, variable-length, deep learning, audio identification, audio retrieval, VLAFP, temporal dynamics
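
The abstract's premise, that a distorted recording is matched to its original through similar fingerprints, can be illustrated with a nearest-neighbor lookup over fingerprint vectors. The sketch below is not VLAFP: the fingerprints are random vectors standing in for model outputs, and the function name `identify`, the cosine-similarity metric, and the noise level are all illustrative assumptions.

```python
import numpy as np

def identify(query: np.ndarray, database: np.ndarray) -> int:
    """Return the index of the database fingerprint most similar to the query.

    Cosine similarity is an assumed metric; the abstract does not describe
    the paper's actual matching procedure.
    """
    db_norm = database / np.linalg.norm(database, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    return int(np.argmax(db_norm @ q_norm))

rng = np.random.default_rng(0)
database = rng.normal(size=(100, 128))   # 100 hypothetical reference fingerprints
# A "distorted" query: reference 42 plus mild noise (illustrative only).
query = database[42] + 0.1 * rng.normal(size=128)
best = identify(query, database)
```

With mild noise the query stays far closer to its own reference than to any unrelated vector, which is the property a fingerprinting model is trained to preserve under real distortions.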

90. ❌ High-Fidelity Face Content Recovery via Tamper-Resilient Versatile Watermarking

Authors: Peipeng Yu, Jinfeng Xie, Chengfu Ou, Xiaoyu Zhou, Jianwei Fei, Yunshu Dai, Zhihua Xia, Chip Hong Chang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23940v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

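Scoring rationale: The paper studies digital watermarking for face images threatened by AIGC-driven manipulation, covering watermark embedding, tamper localization, and content recovery, and belongs to multimedia security and digital forensics. The scoring keywords all relate directly to large-model and deep learning principles, training methods, inference optimization, alignment, agent systems, or AI-for-science applications, whereas the paper's core techniques (the watermarking framework, the semantic latent watermark, and the AIGC attack simulator) do not involve the large-model technologies or application scenarios those keywords describe. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes VeriFi, a versatile digital watermarking framework that addresses copyright protection, tamper localization, and content recovery for face images subject to AIGC-driven manipulation, and reports experiments showing better robustness, localization accuracy, and recovery quality than baselines.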

Abstract

The proliferation of AIGC-driven face manipulation and deepfakes poses severe threats to media provenance, integrity, and copyright protection. Prior versatile watermarking systems typically rely on embedding explicit localization payloads, which introduces a fidelity–functionality trade-off: larger localization signals degrade visual quality and often reduce decoding robustness under strong generative edits. Moreover, existing methods rarely support content recovery, limiting their forensic value when original evidence must be reconstructed. To address these challenges, we present VeriFi, a versatile watermarking framework that unifies copyright protection, pixel-level manipulation localization, and high-fidelity face content recovery. VeriFi makes three key contributions: (1) it embeds a compact semantic latent watermark that serves as a content-preserving prior, enabling faithful restoration even after severe manipulations; (2) it achieves fine-grained localization without embedding localization-specific artifacts by correlating image features with decoded provenance signals; and (3) it introduces an AIGC attack simulator that combines latent-space mixing with seamless blending to improve robustness to realistic deepfake pipelines. Extensive experiments on CelebA-HQ and FFHQ show that VeriFi consistently outperforms strong baselines in watermark robustness, localization accuracy, and recovery quality, providing a practical and verifiable defense for deepfake forensics.

Keywords: versatile watermarking, face content recovery, tamper localization, AIGC attack simulation, deepfake forensics, semantic latent watermark, copyright protection, manipulation detection

91. ❌ Revealing Multi-View Hallucination in Large Vision-Language Models

Authors: Wooje Park, Insu Lee, Soohyun Kim, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23934v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

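Scoring rationale: The paper studies hallucination in large vision-language models (LVLMs) given multi-view image inputs and proposes a decoding technique, RSCD, to mitigate it. Relevance to the keywords: (1) 'Large Language Models OR LLMs OR Foundation Models' is highly relevant (8), because LVLMs extend large language models; (2) 'Hallucination Mitigation OR Factuality OR Truthfulness' is highly relevant (10), because mitigating hallucination is the paper's core; (3) 'Mechanistic Interpretability OR Explainable AI' is partly relevant (5), because the paper analyzes the mechanism behind the hallucination and proposes an interpretable remedy; (4) the remaining keywords are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper reveals cross-instance and cross-view hallucination in large vision-language models given multi-view image inputs and proposes Reference Shift Contrastive Decoding, a training-free technique that improves on existing methods by up to 34.6 points on the benchmark.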

Abstract

Large vision-language models (LVLMs) are increasingly being applied to multi-view image inputs captured from diverse viewpoints. However, despite this growing use, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance by up to 21.1 and 34.6 points over existing hallucination mitigation methods, highlighting the effectiveness of our approach.

Keywords: Large Vision-Language Models, Multi-view Hallucination, Hallucination Mitigation, Reference Shift Contrastive Decoding, MVH-Bench, Cross-instance Hallucination, Cross-view Hallucination, Training-free Decoding
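
The abstract describes RSCD as suppressing visual interference with negative logits obtained through attention masking, but it does not give the rule for combining the two passes. As a hedged illustration, the sketch below uses the generic contrastive-decoding form `(1 + alpha) * positive - alpha * negative`, which is an assumption and not necessarily the paper's formula; `contrastive_logits`, `alpha`, and the toy logit values are hypothetical.

```python
import numpy as np

def contrastive_logits(pos: np.ndarray, neg: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Amplify what the normal pass prefers over the distractor pass.

    pos: next-token logits from the normal forward pass.
    neg: logits from a pass meant to expose the interference (per the
         abstract, RSCD produces this pass via attention masking).
    """
    return (1.0 + alpha) * pos - alpha * neg

# Toy vocabulary of three answer tokens.
pos = np.array([2.0, 2.2, 0.1])   # token 1 narrowly wins without correction
neg = np.array([0.5, 2.5, 0.1])   # the distractor pass strongly favors token 1
adjusted = contrastive_logits(pos, neg, alpha=1.0)
choice = int(np.argmax(adjusted))
```

In this toy case the uncorrected pass would pick token 1, while the contrastive adjustment penalizes the token that the distractor pass favors and picks token 0 instead.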

92. ❌ DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning

Authors: Jiajian Huang, Dongliang Zhu, Zitong YU, Hui Ma, Jiayu Zhang, Chunmei Zhu, Xiaochun Cao Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23916v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

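Scoring rationale: The paper focuses on multimodal deception detection, covering dataset construction (T4-Deception), multimodal representation learning (the SICS module), and knowledge distillation (the DMC module), but it mentions no large language model (LLM), no innovation in deep learning principles, and no concrete AI-for-science application. The scoring keywords all concern large-model technology, deep learning principles, or AI-for-science applications, whereas this paper addresses a conventional multimodal machine learning task (audiovisual deception detection) with no direct connection to the given keywords.

!!! tip deepseek-chat TL;DR

The paper targets the lack of interpretable reasoning chains and cross-cultural generalization in multimodal deception detection. It builds datasets with structured cue descriptions, releases the multicultural dataset T4-Deception, and proposes two modules, SICS and DMC, reporting state-of-the-art performance in both in-domain and cross-domain settings.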

Abstract

Multimodal deception detection aims to identify deceptive behavior by analyzing audiovisual cues for forensics and security. In these high-stakes settings, investigators need verifiable evidence connecting audiovisual cues to final decisions, along with reliable generalization across domains and cultural contexts. However, existing benchmarks provide only binary labels without intermediate reasoning cues. Datasets are also small with limited scenario coverage, leading to shortcut learning. We address these issues through three contributions. First, we construct reasoning datasets by augmenting existing benchmarks with structured cue-level descriptions and reasoning chains, enabling models to output auditable reports. Second, we release T4-Deception, a multicultural dataset based on the unified "To Tell The Truth" television format implemented across four countries. With 1695 samples, it is the largest non-laboratory deception detection dataset. Third, we propose two modules for robust learning under small-data conditions. Stabilized Individuality-Commonality Synergy (SICS) refines multimodal representations by synergizing learnable global priors with sample-adaptive residuals, followed by a polarity-aware adjustment that bi-directionally recalibrates representations. Distilled Modality Consistency (DMC) aligns modality-specific predictions with the fused multimodal predictions via knowledge distillation to prevent unimodal shortcut learning. Experiments on three established benchmarks and our novel dataset demonstrate that our method achieves state-of-the-art performance in both in-domain and cross-domain scenarios, while exhibiting superior transferability across diverse cultural contexts. The datasets and code will be released.

Keywords: multimodal deception detection, audiovisual cues, reasoning chains, multicultural dataset, robust learning, cross-domain generalization, knowledge distillation, shortcut learning
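
The DMC module is described as aligning modality-specific predictions with the fused multimodal prediction through knowledge distillation. A minimal sketch of such a consistency term follows, assuming a temperature-scaled KL divergence with the fused prediction acting as teacher. The abstract does not state the loss, so `consistency_loss`, the temperature, and the logits are hypothetical.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max()                      # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def consistency_loss(unimodal_logits, fused_logits, temperature: float = 2.0) -> float:
    """KL(fused || unimodal): zero when the unimodal head matches the fused prediction."""
    p = softmax(np.asarray(fused_logits, dtype=float), temperature)     # teacher
    q = softmax(np.asarray(unimodal_logits, dtype=float), temperature)  # student
    return float(np.sum(p * (np.log(p) - np.log(q))))

fused = [2.0, -1.0]            # fused audio-visual logits: [deceptive, truthful]
audio_only = [-0.5, 1.5]       # an audio head that disagrees with the fused output
loss_disagree = consistency_loss(audio_only, fused)
loss_agree = consistency_loss(fused, fused)
```

Minimizing a term of this kind pulls each single-modality head toward the fused decision, which is one plausible way to discourage a model from relying on a shortcut in one modality.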

93. ❌ AnalogAgent: Self-Improving Analog Circuit Design Automation with LLM Agents

Authors: Zhixuan Bao, Zhuoyi Lin, Jiageng Wang, Jinhai Hu, Yuan Gao, Yaoxin Wu, Xiaoli Li, Xun Xu Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23910v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

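Scoring rationale: The paper's core is automating analog circuit design with an LLM-based multi-agent system (MAS). Highly relevant keywords: LLMs (used explicitly in the paper), LLM Agents (the core of the framework), Multi-agent Systems (coordination of several agents), Self-Improvement (achieved through the self-evolving memory), and AI for Science (applied to analog circuit design). Small Language Models is partly relevant (the paper mentions compact models such as Qwen-8B). Other keywords, such as MoE, Scaling Laws, fine-tuning methods, inference optimization, and interpretability, are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes AnalogAgent, a framework that combines an LLM-based multi-agent system with self-evolving memory to address the context loss and lack of domain insight in earlier LLM approaches to analog circuit design, substantially improving circuit design automation without additional training.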

Abstract

Recent advances in large language models (LLMs) suggest strong potential for automating analog circuit design. Yet most LLM-based approaches rely on a single-model loop of generation, diagnosis, and correction, which favors succinct summaries over domain-specific insight and suffers from context attrition that erases critical technical details. To address these limitations, we propose AnalogAgent, a training-free agentic framework that integrates an LLM-based multi-agent system (MAS) with self-evolving memory (SEM) for analog circuit design automation. AnalogAgent coordinates a Code Generator, Design Optimizer, and Knowledge Curator to distill execution feedback into an adaptive playbook in SEM and retrieve targeted guidance for subsequent generation, enabling cross-task transfer without additional expert feedback, databases, or libraries. Across established benchmarks, AnalogAgent achieves 92% Pass@1 with Gemini and 97.4% Pass@1 with GPT-5. Moreover, with compact models (e.g., Qwen-8B), it yields a +48.8% average Pass@1 gain across tasks and reaches 72.1% Pass@1 overall, indicating that AnalogAgent substantially strengthens open-weight models for high-quality analog circuit design automation.

Keywords: LLM agents, multi-agent system, analog circuit design, self-improving, training-free, automation, memory, AI for science
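
The reported Pass@1 figures are functional-correctness rates. Assuming the widely used unbiased pass@k estimator (the abstract does not say how the metric was computed), pass@k for a task with n samples of which c pass is 1 - C(n - c, k) / C(n, k), which reduces to c / n for k = 1. The sample counts below are invented for illustration and are not results from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if n - c < k:
        return 1.0            # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task results: (samples drawn, samples that passed).
results = [(10, 10), (10, 7), (10, 0)]
per_task = [pass_at_k(n, c, k=1) for n, c in results]
mean_pass_at_1 = sum(per_task) / len(per_task)
```

Averaging the per-task values gives a benchmark-level number of the kind quoted in the abstract.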

94. ❌ DUPLEX: Agentic Dual-System Planning via LLM-Driven Information Extraction

Authors: Keru Hua, Ding Wang, Yaoying Gu, Xiaoguang Ma Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23909v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

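Scoring rationale: The paper studies the use of LLMs in robotic task planning and proposes DUPLEX, a dual-system architecture that addresses LLM hallucination and logical inconsistency. Highly relevant keywords: LLMs (the core underlying technology), LLM Agents (the paper studies an agentic architecture), and Hallucination Mitigation (addressed directly). Moderately relevant: System 2 Thinking (the Slow System embodies deliberate reasoning) and Self-Correction (the iterative reflection-and-repair mechanism). Other keywords, such as MoE, SLMs, and Scaling Laws, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper targets hallucination and logical inconsistency of LLMs in robotic task planning and proposes DUPLEX, a dual-system neuro-symbolic architecture that strictly confines the LLM to structured semantic extraction rather than end-to-end planning, significantly improving planning success rate and reliability.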

Abstract

While Large Language Models (LLMs) provide semantic flexibility for robotic task planning, their susceptibility to hallucination and logical inconsistency limits their reliability in long-horizon domains. To bridge the gap between unstructured environments and rigorous plan synthesis, we propose DUPLEX, an agentic dual-system neuro-symbolic architecture that strictly confines the LLM to schema-guided information extraction rather than end-to-end planning or code generation. In our framework, a feed-forward Fast System utilizes a lightweight LLM to extract entities, relations, etc. from natural language, deterministically mapping them into a Planning Domain Definition Language (PDDL) problem file for a classical symbolic planner. To resolve complex or underspecified scenarios, a Slow System is activated exclusively upon planning failure, leveraging solver diagnostics to drive a high-capacity LLM in iterative reflection and repair. Extensive evaluations across 12 classical and household planning domains demonstrate that DUPLEX significantly outperforms existing end-to-end and hybrid LLM baselines in both success rate and reliability. These results confirm that the key is not to make the LLM plan better, but to restrict the LLM to the part it is good at - structured semantic grounding - and leave logical plan synthesis to a symbolic planner.

Keywords: Large Language Models, Agentic Architecture, Dual-System Planning, Hallucination Mitigation, Neuro-Symbolic, PDDL, Task Planning, Robotic Planning
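
The Fast System is described as deterministically mapping extracted entities and relations into a PDDL problem file. The sketch below shows what such a mapping could look like. The extraction schema (`objects`, `init`, `goal`), the function `to_pddl_problem`, and the blocks-world names are illustrative assumptions, not the paper's schema.

```python
def to_pddl_problem(name: str, domain: str, extraction: dict) -> str:
    """Render a schema-shaped extraction as a PDDL problem, with no LLM involved."""
    objects = " ".join(
        f"{obj} - {typ}" for obj, typ in sorted(extraction["objects"].items())
    )
    init = " ".join(f"({' '.join(fact)})" for fact in extraction["init"])
    goal = " ".join(f"({' '.join(fact)})" for fact in extraction["goal"])
    return (
        f"(define (problem {name})\n"
        f"  (:domain {domain})\n"
        f"  (:objects {objects})\n"
        f"  (:init {init})\n"
        f"  (:goal (and {goal})))"
    )

# Hypothetical output of the extraction step for "put block a on block b".
extraction = {
    "objects": {"a": "block", "b": "block"},
    "init": [("ontable", "a"), ("ontable", "b"), ("clear", "a"), ("clear", "b"), ("handempty",)],
    "goal": [("on", "a", "b")],
}
problem = to_pddl_problem("stack-ab", "blocksworld", extraction)
```

Because the rendering is plain string templating, the same extraction always yields the same problem file, and any planning failure can be traced back to the extracted facts. That traceability is presumably what makes solver diagnostics usable as feedback for the Slow System.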

95. ❌ Latent Bias Alignment for High-Fidelity Diffusion Inversion in Real-World Image Reconstruction and Manipulation

Authors: Weiming Chen, Qifan Liu, Siyi Liu, Yushun Tang, Yijia Wang, Zhihan Zhu, Zhihai He Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23903v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

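Scoring rationale: The paper focuses on image reconstruction and editing with diffusion models, specifically the diffusion inversion problem. Its proposed methods, Latent Bias Optimization (LBO) and Image Latent Boosting (ILB), target trajectory misalignment and reconstruction mismatch in diffusion inversion. The scoring keywords, however, explicitly target large language models (LLMs) and related techniques (such as MoE, SFT, RLHF, RAG, and agents), specific reasoning methods (such as CoT and MCTS), or AI-for-science applications (such as bioinformatics). The paper involves no language models, text generation, alignment, fine-tuning, agent systems, or AI-for-science applications, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper addresses the diffusion inversion problem in real-world image reconstruction and editing with diffusion models, proposing Latent Bias Optimization and Image Latent Boosting, which significantly improve image reconstruction quality and downstream task performance.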

Abstract

Recent research has shown that text-to-image diffusion models are capable of generating high-quality images guided by text prompts. But can they be used to generate or approximate real-world images from the seed noise? This is known as the diffusion inversion problem, which serves as a fundamental building block for bridging diffusion models and real-world scenarios. However, existing diffusion inversion methods often suffer from low reconstruction quality or weak robustness. Two major challenges need to be carefully addressed: (1) the misalignment between the inversion and generation trajectories during the diffusion process, and (2) the mismatch between the diffusion inversion process and the VQ autoencoder (VQAE) reconstruction. To address these challenges, we introduce a latent bias vector at each inversion step, which is learned to reduce the misalignment between inversion and generation trajectories. We refer to this strategy as Latent Bias Optimization (LBO). Furthermore, we perform an approximate joint optimization of the diffusion inversion and VQAE reconstruction processes by learning to adjust the image latent representation, which serves as the connecting interface between them. We refer to this technique as Image Latent Boosting (ILB). Extensive experimental results demonstrate that the proposed method significantly improves the image reconstruction quality of the diffusion model, as well as the performance of downstream tasks, including image editing and rare concept generation.

Keywords: Diffusion Inversion, Image Reconstruction, Latent Bias Optimization, Image Latent Boosting, Diffusion Models, VQ Autoencoder, Image Editing, Real-world Images

96. ❌ Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval

Authors: Junkai Yang, Qirui Wang, Yaoqing Jin, Shuai Ma, Minghan Xu, Shanmin Pang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23902v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

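Scoring rationale: The paper focuses on video retrieval and proposes KDC-Net, a network architecture that uses hierarchical semantic aggregation and dynamic temporal attention to address the text-video information-density mismatch and the limits of existing attention mechanisms. Although the paper involves knowledge distillation and the CLIP model, the scoring keywords all relate directly to large-model principles, training methods, inference optimization, alignment techniques, agent systems, or AI-for-science applications, whereas the paper's core is a video retrieval method in computer vision that involves no large-model technology or innovation in deep learning principles. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a knowledge-refined dual context-aware network (KDC-Net) that uses hierarchical semantic aggregation and dynamic temporal attention to tackle the retrieval of partially relevant segments from untrimmed videos, outperforming existing methods on PRVR benchmarks.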

Abstract

Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.

Keywords: video retrieval, partially relevant segments, knowledge-refined network, dual context-aware, hierarchical semantic aggregation, dynamic temporal attention, CLIP-based distillation, temporal-continuity-aware refinement

97. ❌ SM-Net: Learning a Continuous Spectral Manifold from Multiple Stellar Libraries

Authors: Omar Anwar, Aaron S. G. Robotham, Luca Cortese, Kevin Vinsen | Source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.23899v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

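Scoring rationale: SM-Net is an astronomy paper. A machine-learning model learns a continuous spectral manifold from multiple stellar spectral libraries and generates stellar spectra. The keywords all target specific techniques such as large language models (LLMs), deep learning principles, training and alignment methods, inference optimization, and agent systems, whereas this paper applies conventional supervised machine learning to astrophysics and does not involve large models or frontier deep learning innovations. The only loosely related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to a science (astronomy). It is not a core match for large-model innovation in science, so it receives 5 points (some relevance). All other keywords are unrelated and receive 0.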
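The pass/fail arithmetic behind this entry can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight) is the one stated in the report configuration. The per-row rule (row score = weight × relevance) is only an assumption that happens to be consistent with the rows shown, and the function names are hypothetical.

```python
# Sketch of the report's pass/fail arithmetic (function names are hypothetical).

def pass_mark(total_weight, base=5.0, slope=0.8):
    """Pass mark = 5.0 + 0.8 x total keyword weight, as stated in the report configuration."""
    return base + slope * total_weight


def paper_score(rows):
    """Sum the per-keyword scores; each row is (weight, relevance out of 10).

    ASSUMPTION: a row's score is weight * relevance. The report does not state
    the per-row rule; this is merely consistent with the rows in the table.
    """
    return sum(weight * relevance for weight, relevance in rows)


# 26 keywords with relevance 0 and one with relevance 5, all listed with weight 0.0.
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
score = paper_score(rows)
threshold = pass_mark(total_weight=27.0)
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```

Under this reading, a non-zero relevance cannot lift the total while the listed weight is 0.0, which matches the 0.0 total shown for this paper.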

!!! tip deepseek-chat TL;DR

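This work develops SM-Net, a machine-learning model that combines multiple high-resolution stellar spectral libraries to learn a continuous spectral manifold. It generates stellar spectra quickly and directly from the fundamental stellar parameters (effective temperature, surface gravity, metallicity), providing a data-driven complement to traditional stellar population synthesis libraries.

Abstract (Chinese translation)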

我们提出SM-Net，这是一种从多个高分辨率恒星光谱库中学习连续谱流形的机器学习模型。SM-Net能够直接从基本恒星参数——有效温度（Teff）、表面重力（log g）和金属丰度（log Z）——生成恒星光谱。该模型在整合PHOENIX-Husser、C3K-Conroy、OB-PoWR和TMAP-Werner光谱库构建的联合网格上进行训练。通过融合这些光谱库的参数空间，我们构建了一个复合数据集，其覆盖的恒星参数空间范围比任何单一光谱库都更广且更连续。统一网格覆盖Teff = 2,000-190,000 K、log g = -1至9、log Z = -4至1，光谱范围达3,000-100,000埃。在此区域内，SM-Net能够在异质光谱库边界间实现平滑插值。在采样区域外，模型虽无法直接通过参考模型验证，但仍可生成数值上平滑的探索性预测。零值或掩码通量值被视作未知而非物理零值，这使得网络能够利用从相邻网格点学习到的相关性来推断缺失区域。在3,538条训练光谱和11,530条测试光谱上，SM-Net在经log1p变换的通量表示中，训练集均方误差为1.47×10^-5，测试集均方误差为2.34×10^-5。在单GPU上推理吞吐量超过每秒14,000条光谱。我们同时发布了模型及交互式网络仪表板，支持实时光谱生成与可视化。SM-Net为传统恒星种群合成光谱库提供了一个快速、稳健且灵活的数据驱动补充工具。

Abstract

We present SM-Net, a machine-learning model that learns a continuous spectral manifold from multiple high-resolution stellar libraries. SM-Net generates stellar spectra directly from the fundamental stellar parameters effective temperature (Teff), surface gravity (log g), and metallicity (log Z). It is trained on a combined grid derived from the PHOENIX-Husser, C3K-Conroy, OB-PoWR, and TMAP-Werner libraries. By combining their parameter spaces, we construct a composite dataset that spans a broader and more continuous region of stellar parameter space than any individual library. The unified grid covers Teff = 2,000-190,000 K, log g = -1 to 9, and log Z = -4 to 1, with spectra spanning 3,000-100,000 Angstrom. Within this domain, SM-Net provides smooth interpolation across heterogeneous library boundaries. Outside the sampled region, it can produce numerically smooth exploratory predictions, although these extrapolations are not directly validated against reference models. Zero or masked flux values are treated as unknowns rather than physical zeros, allowing the network to infer missing regions using correlations learned from neighbouring grid points. Across 3,538 training and 11,530 test spectra, SM-Net achieves mean squared errors of 1.47 x 10^-5 on the training set and 2.34 x 10^-5 on the test set in the transformed log1p-scaled flux representation. Inference throughput exceeds 14,000 spectra per second on a single GPU. We also release the model together with an interactive web dashboard for real-time spectral generation and visualisation. SM-Net provides a fast, robust, and flexible data-driven complement to traditional stellar population synthesis libraries.

Keywords: SM-Net, stellar spectra, spectral manifold, machine learning, stellar parameters, interpolation, astrophysics, data-driven model
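The abstract reports errors in a log1p-scaled flux representation and says that zero or masked flux values are treated as unknowns. A minimal sketch of such a metric is shown below. It assumes that unknown bins are simply excluded from the average, which the abstract does not specify, and the function name is hypothetical rather than part of the SM-Net release.

```python
import math


def masked_log1p_mse(predicted, target):
    """Mean squared error in log1p-scaled flux over the known bins only.

    ASSUMPTION: bins whose target flux is zero or missing (None) are skipped
    entirely; the abstract only says such values are treated as unknowns.
    """
    squared_errors = []
    for p, t in zip(predicted, target):
        if t is None or t == 0.0:
            continue  # unknown bin: contributes nothing to the error
        squared_errors.append((math.log1p(p) - math.log1p(t)) ** 2)
    if not squared_errors:
        raise ValueError("no valid flux bins to compare")
    return sum(squared_errors) / len(squared_errors)


# Toy spectrum: the second and fourth bins are unknown, so the mismatched
# predictions there do not affect the result.
target = [1.0, 0.0, 3.0, None]
predicted = [1.0, 9.9, 3.0, 5.0]
print(masked_log1p_mse(predicted, target))
```

Excluding unknown bins keeps the model from being penalised for predicting plausible flux where a library has no data.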

98. ❌ AgentChemist: A Multi-Agent Experimental Robotic Platform Integrating Chemical Perception and Precise Control

Authors: Xiangyi Wei, Fei Wang, Haotian Zhang, Xin An, Haitian Zhu, Lianrui Hu, Yang Li, Changbo Wang, Xiao He | Source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.23886v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

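Scoring rationale: The paper studies a multi-agent robotic platform for chemical laboratory automation and is unrelated to most large-model keywords (LLMs, MoE, training methods, inference optimization, and so on). Two keywords receive full marks: (1) 'Multi-agent Systems OR Agent Coordination' (10 points), because the core contribution is a multi-agent platform with collaborative task decomposition and dynamic scheduling; and (2) 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points), because the paper applies AI to a science (chemistry). The score table also lists 8.0/10 for 'LLM Agents OR Autonomous Agents OR Agentic Workflow', for which no justification is given. The remaining keywords are not addressed and score 0.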

!!! tip deepseek-chat TL;DR

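To address rigid workflows that adapt poorly to long-tail experimental tasks in chemical laboratory automation, the paper proposes a multi-agent robotic platform that integrates chemical perception with precise control. A validation on acid-base titration demonstrates autonomous progress tracking, adaptive dispensing control, and reliable end-to-end experiment execution.

Abstract (Chinese translation)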

化学实验室自动化长期以来受限于僵化的工作流程以及对实验任务长尾分布的适应性不足。尽管大多数自动化平台在有限的标准化流程上表现良好，但实际实验室涉及多样化、低频且不断演变的操作，这些操作往往超出预定义规程的范围。这种不匹配导致现有系统难以推广至新的反应条件、非常规仪器配置以及意外的流程变化。我们提出了一种多智能体机器人平台，旨在通过协同任务分解、动态调度与自适应控制来解决这一长尾挑战。该系统将用于实时反应监测的化学感知能力与反馈驱动执行相结合，使其能够根据实验状态的动态变化而非固定脚本调整操作。通过酸碱滴定的验证实验，平台展示了自主进度跟踪、自适应加液控制以及可靠的端到端实验执行能力。通过提升在多样化实验室场景中的泛化能力，该平台为实现智能化、灵活且可扩展的实验室自动化提供了一条实用路径。

Abstract

Chemical laboratory automation has long been constrained by rigid workflows and poor adaptability to the long-tail distribution of experimental tasks. While most automated platforms perform well on a narrow set of standardized procedures, real laboratories involve diverse, infrequent, and evolving operations that fall outside predefined protocols. This mismatch prevents existing systems from generalizing to novel reaction conditions, uncommon instrument configurations, and unexpected procedural variations. We present a multi-agent robotic platform designed to address this long-tail challenge through collaborative task decomposition, dynamic scheduling, and adaptive control. The system integrates chemical perception for real-time reaction monitoring with feedback-driven execution, enabling it to adjust actions based on evolving experimental states rather than fixed scripts. Validation via acid-base titration demonstrates autonomous progress tracking, adaptive dispensing control, and reliable end-to-end experiment execution. By improving generalization across diverse laboratory scenarios, this platform provides a practical pathway toward intelligent, flexible, and scalable laboratory automation.

Keywords: multi-agent robotic platform, chemical laboratory automation, chemical perception, adaptive control, task decomposition, dynamic scheduling, acid-base titration, experimental robotics

99. ❌ The Luna Bound Propagator for Formal Analysis of Neural Networks

Authors: Henry LeCates, Haoze Wu | Source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.23878v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

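Scoring rationale: The paper focuses on formal verification of neural networks (specifically a C++ implementation of alpha-CROWN), which is foundational tooling for deep learning. The keywords all concern large-model techniques, training methods, inference optimization, alignment, and applications, and have no direct connection to neural network verification. The paper does not touch on large models, language models, training techniques, inference acceleration, alignment methods, or AI-for-science applications.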

!!! tip deepseek-chat TL;DR

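The paper presents Luna, a bound propagator for formal verification of neural networks implemented in C++. It is designed to integrate more easily into existing verifiers than Python-only implementations, with bound tightness and computational efficiency competitive with the state-of-the-art alpha-CROWN implementation.

Abstract (Chinese translation)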

参数化CROWN分析（亦称alpha-CROWN）已成为神经网络验证中一种实际效果卓越的边界传播方法。然而，现有的alpha-CROWN实现仅限于Python环境，这使其难以集成到现有的深度神经网络验证器及长期生产级系统中。本文提出Luna——一种基于C++实现的新型边界传播器。Luna支持在通用计算图上进行区间边界传播、CROWN分析及alpha-CROWN分析。我们阐述了Luna的体系架构，并通过VNN-COMP 2025基准测试表明，其在边界紧致度和计算效率方面均与当前最先进的alpha-CROWN实现具有可比性。

Abstract

The parameterized CROWN analysis, a.k.a., alpha-CROWN, has emerged as a practically successful bound propagation method for neural network verification. However, existing implementations of alpha-CROWN are limited to Python, which complicates integration into existing DNN verifiers and long-term production-level systems. We introduce Luna, a new bound propagator implemented in C++. Luna supports Interval Bound Propagation, the CROWN analysis, and the alpha-CROWN analysis over a general computational graph. We describe the architecture of Luna and show that it is competitive with the state-of-the-art alpha-CROWN implementation in terms of both bound tightness and computational efficiency on benchmarks from VNN-COMP 2025.

Keywords: neural network verification, bound propagation, alpha-CROWN, C++ implementation, computational efficiency, VNN-COMP benchmarks, formal analysis, Luna
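Interval Bound Propagation, the simplest of the three analyses the abstract lists, can be illustrated on a toy network. The sketch below is plain Python for readability and is not Luna's C++ API. The function names and the 2-2-1 network weights are made up for illustration.

```python
def linear_bounds(lower, upper, weights, bias):
    """Propagate elementwise input bounds through y = W x + b.

    A non-negative weight takes the matching input bound; a negative weight
    takes the opposite one.
    """
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


# Hypothetical 2-2-1 ReLU network with inputs in the box [-1, 1] x [-1, 1].
W1, b1 = [[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]
W2, b2 = [[1.0, -1.0]], [0.5]

lo, hi = [-1.0, -1.0], [1.0, 1.0]
lo, hi = relu_bounds(*linear_bounds(lo, hi, W1, b1))
lo, hi = linear_bounds(lo, hi, W2, b2)
print(lo, hi)  # sound (possibly loose) bounds on the network output
```

CROWN and alpha-CROWN tighten such bounds by propagating linear relaxations instead of plain intervals. That refinement is not shown here.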

100. ❌ The DeepXube Software Package for Solving Pathfinding Problems with Learned Heuristic Functions and Search

Author: Forest Agostinelli | Source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.23873v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

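Scoring rationale: DeepXube uses deep reinforcement learning and heuristic search to solve pathfinding problems and does not involve large language models (LLMs). The techniques named in the abstract (deep neural networks, reinforcement learning, heuristic search) are established learning and search methods with no direct connection to the large-model keywords (LLMs, MoE, scaling laws, alignment, RAG, and so on). The paper also does not cover AI-for-science applications such as bioinformatics. Every keyword therefore has a relevance of 0.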

!!! tip deepseek-chat TL;DR

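DeepXube is an open-source Python package that automates solving pathfinding problems by training heuristic functions with deep reinforcement learning. It parallelizes training-data generation across CPUs and learning updates across GPUs for efficiency.

Abstract (Chinese translation)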

DeepXube 是一款免费开源的 Python 软件包及命令行工具，旨在通过利用机器学习学习启发式函数，以自动化解决路径规划问题。这些启发式函数用于指导专为深度神经网络（Deep Neural Networks, DNNs）定制的启发式搜索算法。DeepXube 集成了解决路径规划问题领域在深度强化学习、启发式搜索和形式逻辑方面的最新进展，包括基于有限视野的贝尔曼学习、后见经验回放、批处理启发式搜索，以及使用答案集编程来指定目标。其健壮的多重继承结构简化了路径规划领域的定义和训练数据的生成。通过跨中央处理器（CPUs）自动并行化生成训练数据，以及跨图形处理器（GPUs）并行化强化学习更新，训练启发式函数的效率得以显著提升。该工具可轻松通过命令行参数调用，以利用 GPUs 和 DNN 架构并行性的路径规划算法（例如批量加权 A* 搜索、Q* 搜索和集束搜索）来解决路径规划问题。最后，该工具还提供了若干便捷功能，用于在训练和求解过程中进行可视化、代码性能分析和进度监控。其 GitHub 代码库公开于 https://github.com/forestagostinelli/deepxube。

Abstract

DeepXube is a free and open-source Python package and command-line tool that seeks to automate the solution of pathfinding problems by using machine learning to learn heuristic functions that guide heuristic search algorithms tailored to deep neural networks (DNNs). DeepXube is comprised of the latest advances in deep reinforcement learning, heuristic search, and formal logic for solving pathfinding problems. This includes limited-horizon Bellman-based learning, hindsight experience replay, batched heuristic search, and specifying goals with answer-set programming. A robust multiple-inheritance structure simplifies the definition of pathfinding domains and the generation of training data. Training heuristic functions is made efficient through the automatic parallelization of the generation of training data across central processing units (CPUs) and reinforcement learning updates across graphics processing units (GPUs). Pathfinding algorithms that take advantage of the parallelism of GPUs and DNN architectures, such as batch weighted A* and Q* search and beam search, are easily employed to solve pathfinding problems through command-line arguments. Finally, several convenient features for visualization, code profiling, and progress monitoring during training and solving are available. The GitHub repository is publicly available at https://github.com/forestagostinelli/deepxube.

Keywords: pathfinding problems, heuristic functions, deep reinforcement learning, heuristic search, GPU parallelization, batch weighted A*, Q* search, answer-set programming
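Weighted A*, one of the search algorithms the abstract names, orders expansions by f = g + w·h. The sketch below is a plain, unbatched version and does not use the DeepXube API. A Manhattan-distance function stands in for the learned DNN heuristic, and the grid domain and all names are hypothetical.

```python
import heapq


def weighted_a_star(start, goal, neighbors, heuristic, weight=1.0):
    """Weighted A*: expand states in order of f = g + weight * h.

    `neighbors(state)` yields (next_state, step_cost) pairs. With weight > 1
    the search is greedier, trading optimality guarantees for speed.
    """
    frontier = [(weight * heuristic(start), 0.0, start)]
    best_cost = {start: 0.0}
    parent = {start: None}
    while frontier:
        _, g, state = heapq.heappop(frontier)
        if state == goal:
            path = []
            while state is not None:  # walk parents back to the start
                path.append(state)
                state = parent[state]
            return path[::-1], g
        if g > best_cost[state]:
            continue  # stale queue entry, a cheaper route was found later
        for nxt, cost in neighbors(state):
            new_g = g + cost
            if new_g < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_g
                parent[nxt] = state
                heapq.heappush(frontier, (new_g + weight * heuristic(nxt), new_g, nxt))
    return None, float("inf")


def grid_neighbors(state, size=4):
    """Four-connected moves on an open size x size grid, unit cost each."""
    x, y = state
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:
            yield (nx, ny), 1.0


GOAL = (3, 3)


def manhattan(state):
    """Stand-in for a learned heuristic (ASSUMPTION: not a DNN here)."""
    return abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1])


path, cost = weighted_a_star((0, 0), GOAL, grid_neighbors, manhattan, weight=2.0)
print(path, cost)
```

A batched variant would pop several states per iteration and evaluate the heuristic on all of their successors in one DNN forward pass. That batching is what makes GPU evaluation worthwhile.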

101. ❌ Can VLMs Reason Robustly? A Neuro-Symbolic Investigation

Authors: Weixin Chen, Antonio Vergari, Han Zhao | Source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.23867v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

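Scoring rationale: The paper studies whether vision-language models (VLMs) reason robustly under distribution shift and proposes a neuro-symbolic method, VLC. Keyword relevance: (1) 'Post-training OR Supervised Fine-tuning OR SFT' (5 points), because the paper evaluates VLMs fine-tuned end to end; (2) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10 points), because it studies visual deductive reasoning tasks that involve multi-step reasoning; (3) 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (10 points), because it focuses on in-depth and robust reasoning; (4) 'Mechanistic Interpretability OR Explainable AI' (5 points), because neuro-symbolic methods aim to improve interpretability. The other keywords are unrelated to the paper and score 0.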

!!! tip deepseek-chat TL;DR

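The paper studies how robustly vision-language models reason under distribution shift, finds that conventional fine-tuning generalizes poorly, and proposes VLC, a neuro-symbolic method that combines VLM-based concept recognition with circuit-based symbolic reasoning. VLC achieves consistently strong performance on three visual deductive reasoning tasks.

Abstract (Chinese translation)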

视觉语言模型（Vision-Language Models, VLMs）已被广泛应用于各类推理任务，但其在分布变化下是否具备稳健的推理能力仍不明确。本文研究了协变量偏移问题，即感知输入分布发生变化而底层预测规则保持不变的情况。为探讨此问题，我们聚焦于视觉演绎推理任务，要求模型根据给定图像及图像中物体概念定义的逻辑规则来回答问题。实验发现，通过基于梯度的端到端训练微调的视觉语言模型虽能在分布内取得高准确率，却无法在此类偏移下有效泛化，这表明微调并不能可靠地引导出底层的推理函数。这促使我们从神经符号视角出发，将感知与推理解耦。然而，我们进一步观察到，近期依赖黑盒组件进行推理的神经符号方法在不同任务间仍可能表现出不一致的稳健性。为解决这一问题，我们提出了VLC——一种结合基于VLM的概念识别与基于电路的符号推理的神经符号方法。具体而言，任务规则被编译为符号程序（特别是电路），该程序在VLM识别的物体概念上精确执行规则。在三个具有不同规则集的视觉演绎推理任务上的实验表明，VLC在协变量偏移下持续表现出强劲性能，凸显了其支持稳健推理的能力。

Abstract

Vision-Language Models (VLMs) have been applied to a wide range of reasoning tasks, yet it remains unclear whether they can reason robustly under distribution shifts. In this paper, we study covariate shifts in which the perceptual input distribution changes while the underlying prediction rules do not. To investigate this question, we consider visual deductive reasoning tasks, where a model is required to answer a query given an image and logical rules defined over the object concepts in the image. Empirically, we find that VLMs fine-tuned through gradient-based end-to-end training can achieve high in-distribution accuracy but fail to generalize under such shifts, suggesting that fine-tuning does not reliably induce the underlying reasoning function. This motivates a neuro-symbolic perspective that decouples perception from reasoning. However, we further observe that recent neuro-symbolic approaches that rely on black-box components for reasoning can still exhibit inconsistent robustness across tasks. To address this issue, we propose VLC, a neuro-symbolic method that combines VLM-based concept recognition with circuit-based symbolic reasoning. In particular, task rules are compiled into a symbolic program, specifically a circuit, which executes the rules exactly over the object concepts recognized by the VLM. Experiments on three visual deductive reasoning tasks with distinct rule sets show that VLC consistently achieves strong performance under covariate shifts, highlighting its ability to support robust reasoning.

Keywords: Vision-Language Models, Robust Reasoning, Distribution Shifts, Neuro-Symbolic Methods, Visual Deductive Reasoning, Covariate Shifts, Concept Recognition, Symbolic Reasoning
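The abstract's idea of decoupling perception from reasoning can be sketched as a Boolean circuit that is evaluated exactly over recognized concepts. Everything below is illustrative. The rule, the concept names, and the `recognize_concepts` stub (which stands in for a VLM) are assumptions and not the paper's implementation.

```python
# Circuits are nested tuples: ("var", name), ("not", c), ("and", c1, ...), ("or", c1, ...).

def var(name):
    return ("var", name)


def NOT(circuit):
    return ("not", circuit)


def AND(*circuits):
    return ("and",) + circuits


def OR(*circuits):
    return ("or",) + circuits


def evaluate(circuit, concepts):
    """Evaluate the circuit exactly over a dict of concept name -> bool."""
    op, args = circuit[0], circuit[1:]
    if op == "var":
        return bool(concepts[args[0]])
    if op == "not":
        return not evaluate(args[0], concepts)
    if op == "and":
        return all(evaluate(a, concepts) for a in args)
    if op == "or":
        return any(evaluate(a, concepts) for a in args)
    raise ValueError(f"unknown gate: {op}")


# Hypothetical rule: answer yes if (red cube and no sphere) or a cylinder is present.
RULE = OR(AND(var("red_cube"), NOT(var("sphere"))), var("cylinder"))


def recognize_concepts(image):
    """Stub for the perception step (ASSUMPTION: a VLM would fill this in).

    Here the "image" is already a dict of detected concepts.
    """
    return dict(image)


image = {"red_cube": True, "sphere": False, "cylinder": False}
print(evaluate(RULE, recognize_concepts(image)))
```

Because the circuit is fixed once the rules are compiled, a covariate shift can only affect the perception step. The rule itself is applied identically to every input.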

102. ❌ Generative AI User Experience: Developing Human–AI Epistemic Partnership

Author: Xiaoming Zhai | Source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.23863v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

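Scoring rationale: The paper builds a theory of user experience for generative AI (especially ChatGPT-like systems) in education, rather than contributing to large-model principles or algorithms. It relates mainly to 'Large Language Models OR LLMs OR Foundation Models' (5 points), since it discusses generative AI systems built on LLMs. It has some relevance to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5 points), since it examines how cognition and accountability are distributed in human-AI collaboration and agentic workflows. It is weakly related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), since it mentions AI-facilitated scientific argumentation, which falls under a broad reading of AI for Science. The other keywords concern specific technical methods, training strategies, or optimization algorithms that the paper does not address, so they score 0.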

!!! tip deepseek-chat TL;DR

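Generative AI such as ChatGPT now goes beyond the role of a tool in education and participates in knowledge construction, which raises user-experience issues such as negotiated authority and redistributed cognition. The paper proposes the Human–AI Epistemic Partnership Theory (HAEPT), which explains this experience as a dynamic negotiation of three contracts (epistemic, agency, and accountability), and uses case analyses of AI-supported collaborative learning and AI-facilitated scientific argumentation to show the theory's value.

Abstract (Chinese translation)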

生成式人工智能（GenAI）已迅速进入教育领域，但其用户体验常被通过有用性、易用性和参与度等以采纳为导向的构念来解释。我们认为这些构念已不再充分，因为像ChatGPT这样的系统不仅支持学习任务，还参与了知识建构。现有理论无法解释为何GenAI频繁催生出以协商式权威、分布式认知和责任张力为特征的体验。为填补这一空白，本文提出了“人—AI认知伙伴关系理论”（Human–AI Epistemic Partnership Theory, HAEPT），将GenAI用户体验解释为一种认知伙伴关系形式，其特征体现在三个相互关联的契约——认知契约、能动性契约与责任契约——的动态协商中。我们认为，关于GenAI的信任、过度依赖、学术诚信、教师谨慎态度以及关系性互动等研究发现，可被重新解读为这些契约内部的张力，而非孤立的问题。用户并非对GenAI持有单一、稳定的看法，而是通过校准循环随时间调整其与AI的关系模式。这些反复的互动解释了为何信任与怀疑常共存，也说明了伙伴关系模式如何描述跨任务中人—AI协作的反复出现的配置。为展示HAEPT的实用性，我们将其应用于分析AI发言者支持的协作学习以及AI辅助科学论证的用户体验，阐释了不同的契约配置。

Abstract

Generative AI (GenAI) has rapidly entered education, yet its user experience is often explained through adoption-oriented constructs such as usefulness, ease of use, and engagement. We argue that these constructs are no longer sufficient because systems such as ChatGPT do not merely support learning tasks but also participate in knowledge construction. Existing theories cannot explain why GenAI frequently produces experiences characterized by negotiated authority, redistributed cognition, and accountability tension. To address this gap, this paper develops the Human–AI Epistemic Partnership Theory (HAEPT), explaining the GenAI user experience as a form of epistemic partnership that features a dynamic negotiation of three interlocking contracts: epistemic, agency, and accountability. We argue that findings on trust, over-reliance, academic integrity, teacher caution, and relational interaction about GenAI can be reinterpreted as tensions within these contracts rather than as isolated issues. Instead of holding a single, stable view of GenAI, users adjust how they relate to it over time through calibration cycles. These repeated interactions account for why trust and skepticism often coexist and for how partnership modes describe recurrent configurations of human–AI collaboration across tasks. To demonstrate the usefulness of HAEPT, we applied it to analyze the UX of collaborative learning with AI speakers and AI-facilitated scientific argumentation, illustrating different contract configurations.

Keywords: Generative AI, Human-AI Interaction, Epistemic Partnership, User Experience, Education, Knowledge Construction, Accountability, Collaborative Learning

103. ❌ Deep Convolutional Neural Networks for predicting highest priority functional group in organic molecules

Authors: Kunal Khatri, Vineet Mehta | Source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.23862v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

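Scoring rationale: The paper uses deep convolutional neural networks (CNNs) to predict the highest-priority functional group of organic molecules from Fourier-transform infrared (FTIR) spectra, a conventional deep learning application in cheminformatics. It does not involve large language models (LLMs), MoE, scaling laws, pre-training or post-training, alignment, RAG, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or other large-model techniques. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper falls under cheminformatics. Since it is not a large-model application in that field, it receives 5 points (some relevance). All other keywords are unrelated and receive 0.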

!!! tip deepseek-chat TL;DR

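The study proposes using deep convolutional neural networks to predict the highest-priority functional group from the Fourier-transform infrared spectra of organic molecules, and argues that they outperform the previously used support vector machine approach.

Abstract (Chinese translation)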

本研究致力于解决有机分子中最高优先级官能团的预测问题。官能团是指决定有机分子物理与化学性质的成键原子基团。当分子中存在多个官能团时，主导官能团决定了化合物的特性。傅里叶变换红外光谱（Fourier-transform Infrared spectroscopy, FTIR）是一种常用于鉴定化合物中官能团存在与否的光谱学方法。我们提出采用深度卷积神经网络（Deep Convolutional Neural Networks, CNN），依据有机分子的傅里叶变换红外光谱（FTIR）来预测其最高优先级官能团。我们将所提出的模型与先前应用的机器学习（Machine Learning, ML）方法——支持向量机（Support Vector Machine, SVM）进行了比较，并论证了卷积神经网络性能更优的原因。

Abstract

Our work addresses the problem of predicting the highest priority functional group present in an organic molecule. Functional groups are groups of bound atoms that determine the physical and chemical properties of organic molecules. In the presence of multiple functional groups, the dominant functional group determines the compound’s properties. Fourier-transform infrared spectroscopy (FTIR) is a commonly used spectroscopic method for identifying the presence or absence of functional groups within a compound. We propose using a deep convolutional neural network (CNN) to predict the highest priority functional group from the Fourier-transform infrared (FTIR) spectrum of the organic molecule. We compare our model with a previously applied machine learning (ML) method, the support vector machine (SVM), and explain why the CNN outperforms it.

Keywords: Deep Convolutional Neural Networks, functional group prediction, organic molecules, Fourier-transform infrared spectroscopy, FTIR, machine learning, Support Vector Machine, cheminformatics

104. ❌ Why the Maximum Second Derivative of Activations Matters for Adversarial Robustness

Authors: Yunrui Yu, Hang Su, Jun Zhu | Source: arXiv | Published: 2026-03-25 | arXiv link: http://arxiv.org/abs/2603.23860v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

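Scoring rationale: The paper studies the relationship between activation-function curvature (the maximum second derivative) and adversarial robustness, which is foundational deep learning theory. Every scoring keyword targets LLM-related techniques, applications, or optimization methods, whereas this paper does not touch large models, language models, AI-for-science applications, or any specific technique named in the keyword list. The paper is unrelated to the keyword list, so every keyword receives a relevance of 0.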

!!! tip deepseek-chat TL;DR

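The paper finds that the maximum second derivative of the activation function has an optimal range (4 to 10) within which neural networks achieve the best adversarial robustness, revealing a non-monotonic relationship between activation curvature and robustness.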

Abstract

This work investigates the critical role of activation function curvature – quantified by the maximum second derivative $\max|\sigma''|$ – in adversarial robustness. Using the Recursive Curvature-Tunable Activation Family (RCT-AF), which enables precise control over curvature through parameters $\alpha$ and $\beta$, we systematically analyze this relationship. Our study reveals a fundamental trade-off: insufficient curvature limits model expressivity, while excessive curvature amplifies the normalized Hessian diagonal norm of the loss, leading to sharper minima that hinder robust generalization. This results in a non-monotonic relationship where optimal adversarial robustness consistently occurs when $\max|\sigma''|$ falls within 4 to 10, a finding that holds across diverse network architectures, datasets, and adversarial training methods. We provide theoretical insights into how activation curvature affects the diagonal elements of the Hessian matrix of the loss, and experimentally demonstrate that the normalized Hessian diagonal norm exhibits a U-shaped dependence on $\max|\sigma''|$, with its minimum within the optimal robustness range, thereby validating the proposed mechanism.

Keywords: adversarial robustness, activation function curvature, maximum second derivative, RCT-AF, Hessian diagonal norm, sharp minima, robust generalization
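
To make the quantity $\max|\sigma''|$ concrete, the following minimal sketch estimates it numerically for two standard activations. The abstract does not define RCT-AF, so softplus with a sharpness parameter `beta` is used here purely as a stand-in (an assumption, not the paper's family); for softplus the analytic maximum is `beta / 4`, which gives a quick sanity check on the estimate.

```python
import math

def max_abs_second_derivative(f, lo=-10.0, hi=10.0, steps=4001, h=1e-3):
    """Grid-search max |f''(x)| on [lo, hi] with a central finite difference."""
    best = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        second = (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)
        best = max(best, abs(second))
    return best

def softplus(beta):
    """Softplus with sharpness beta: log(1 + exp(beta * x)) / beta.

    Stand-in activation (assumption); this is NOT the paper's RCT-AF.
    """
    def f(x):
        z = beta * x
        # Numerically stable evaluation of log(1 + exp(z)).
        return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta
    return f

if __name__ == "__main__":
    for beta in (1.0, 4.0, 16.0, 40.0):
        estimate = max_abs_second_derivative(softplus(beta))
        print(f"softplus(beta={beta}): max|f''| ~ {estimate:.3f}, analytic {beta / 4:.3f}")
    print(f"tanh: max|f''| ~ {max_abs_second_derivative(math.tanh):.3f}")
```

Under this stand-in, a softplus with `beta` between 16 and 40 would fall in the 4 to 10 curvature range the abstract reports as optimal, while plain tanh sits well below it.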

105. ❌ When AI output tips to bad but nobody notices: Legal implications of AI’s mistakes

Authors: Dylan J. Restrepo, Nicholas J. Restrepo, Frank Y. Huo, Neil F. Johnson Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23857v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

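Scoring rationale: The paper studies how generative AI produces fabricated content (hallucinations) in legal applications and analyzes both the technical mechanism and the legal implications. It is relevant to 'Large Language Models' (8) because it discusses Transformer-based generative AI in legal practice; highly relevant to 'Hallucination Mitigation' (10) because hallucination is the paper's core problem; and relevant to 'Mechanistic Interpretability' (8) because it analyzes the deterministic component of the Transformer mechanism that leads to hallucination. Other keywords such as MoE, SLMs, training methods, reasoning techniques, and compression or acceleration are not addressed and score 0.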

!!! tip deepseek-chat TL;DR

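The paper examines how generative AI produces authoritative-sounding fabrications (hallucinations) in legal applications, argues from a physics-based analysis of the mechanism that this is a foreseeable consequence of the Transformer design, and proposes replacing the 'black box' mental model with verification protocols to meet the legal profession's duty of technological competence.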

Abstract

The adoption of generative AI across commercial and legal professions offers dramatic efficiency gains – yet for law in particular, it introduces a perilous failure mode in which the AI fabricates fictitious case law, statutes, and judicial holdings that appear entirely authentic. Attorneys who unknowingly file such fabrications face professional sanctions, malpractice exposure, and reputational harm, while courts confront a novel threat to the integrity of the adversarial process. This failure mode is commonly dismissed as random 'hallucination', but recent physics-based analysis of the Transformer's core mechanism reveals a deterministic component: the AI's internal state can cross a calculable threshold, causing its output to flip from reliable legal reasoning to authoritative-sounding fabrication. Here we present this science in a legal-industry setting, walking through a simulated brief-drafting scenario. Our analysis suggests that fabrication risk is not an anomalous glitch but a foreseeable consequence of the technology's design, with direct implications for the evolving duty of technological competence. We propose that legal professionals, courts, and regulators replace the outdated 'black box' mental model with verification protocols based on how these systems actually fail.

Keywords: generative AI, legal applications, hallucination, fabricated case law, Transformer mechanism, verification protocols, technological competence, deterministic failure

106. ❌ SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems

Authors: Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23853v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

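Scoring rationale: The paper focuses on uncertainty quantification and hallucination detection in systems of multiple vision-language models (VLMs). It is highly relevant to 'Hallucination Mitigation OR Factuality OR Truthfulness' (10) because its core contribution is reducing hallucination risk. All other keywords are unrelated (0): the paper does not address LLM techniques, training methods, reasoning techniques, model optimization, AI agents, or AI-for-science applications, and instead targets an aggregation problem specific to VLMs.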

!!! tip deepseek-chat TL;DR

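The paper proposes SCoOP, a framework that quantifies uncertainty in multi-VLM systems through semantic-consistent opinion pooling; it detects hallucinations effectively and reaches an AUROC on ScienceQA that is about 10-13% above the baselines.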

Abstract

Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models’ outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Semantic-Consistent Opinion Pooling), a training-free uncertainty quantification (UQ) framework for multi-VLM systems through uncertainty-weighted linear opinion pooling. Unlike prior UQ methods designed for single models, SCoOP explicitly measures collective, system-level uncertainty across multiple VLMs, enabling effective hallucination detection and abstention for highly uncertain samples. On ScienceQA, SCoOP achieves an AUROC of 0.866 for hallucination detection, outperforming baselines (0.732-0.757) by approximately 10-13%. For abstention, it attains an AURAC of 0.907, exceeding baselines (0.818-0.840) by 7-9%. Despite these gains, SCoOP introduces only microsecond-level aggregation overhead relative to the baselines, which is trivial compared to typical VLM inference time (on the order of seconds). These results demonstrate that SCoOP provides an efficient and principled mechanism for uncertainty-aware aggregation, advancing the reliability of multimodal AI systems.

Keywords: Vision-Language Models, Uncertainty Quantification, Hallucination Detection, Multi-model Systems, Opinion Pooling, Semantic Consistency, Abstention, ScienceQA
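
The idea of an uncertainty-weighted linear opinion pool can be sketched as follows. The abstract does not give SCoOP's weighting rule or its semantic-consistency step, so the choice below (weight each model by one minus its normalized entropy, then abstain when the pooled entropy is high) is an illustrative assumption, and the names `pool_opinions` and `should_abstain` are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def pool_opinions(distributions):
    """Linear opinion pool over per-model answer distributions.

    Assumed weighting: each model gets weight 1 - normalized entropy,
    so confident models count more. This is not taken from the paper.
    """
    k = len(distributions[0])
    max_h = math.log(k)
    raw = [max(1.0 - entropy(p) / max_h, 1e-6) for p in distributions]
    total = sum(raw)
    weights = [w / total for w in raw]
    pooled = [sum(w * p[i] for w, p in zip(weights, distributions)) for i in range(k)]
    return pooled, weights

def should_abstain(pooled, threshold=0.6):
    """Abstain when the pooled distribution is close to uniform."""
    return entropy(pooled) / math.log(len(pooled)) > threshold

if __name__ == "__main__":
    # Three hypothetical VLMs answering a four-option question.
    votes = [
        [0.90, 0.05, 0.03, 0.02],
        [0.80, 0.10, 0.05, 0.05],
        [0.25, 0.25, 0.25, 0.25],  # a maximally uncertain model
    ]
    pooled, weights = pool_opinions(votes)
    print("weights:", [round(w, 3) for w in weights])
    print("pooled:", [round(p, 3) for p in pooled])
    print("abstain:", should_abstain(pooled))
```

The pooling itself is a handful of multiply-adds per answer option, which is consistent with the abstract's point that aggregation overhead is negligible next to VLM inference.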

107. ❌ VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents

Authors: Yuhao Chen, Yi Xu, Xinyun Ding, Xiang Fang, Shuochen Liu, Luxi Lin, Qingyu Zhang, Ya Li, Quan Liu, Tong Xu Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23840v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

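Scoring rationale: The paper studies multi-user long-term memory in in-vehicle agents and develops the VehicleMemBench benchmark. Relevance to the keywords is as follows: (1) highly relevant to 'LLM Agents' and 'Tool Use' (10), because the paper explicitly studies in-vehicle agents and their tool-use ability; (2) moderately relevant to 'Large Language Models' and 'Long Context LLMs' (5), because the paper reports that existing models perform poorly in memory-evolution scenarios, which implies that LLMs may serve as the agent backbone, and because it involves long-term memory (long context); (3) all other keywords have no direct connection (0), because the paper is a benchmark for a specific application scenario and not a contribution to underlying techniques such as model principles, training methods, or inference optimization.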

!!! tip deepseek-chat TL;DR

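The paper addresses the weakness of in-vehicle agents at managing multi-user long-term memory. It introduces the VehicleMemBench benchmark and finds that existing models perform poorly in memory-evolution scenarios where user preferences change dynamically, which calls for more specialized memory management mechanisms.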

Abstract

With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions. This evolution requires agents to continuously model multi-user preferences and make reliable decisions in the face of inter-user preference conflicts and changing habits over time. However, existing benchmarks are largely limited to single-user, static question-answer settings, failing to capture the temporal evolution of preferences and the multi-user, tool-interactive nature of real vehicle environments. To address this gap, we introduce VehicleMemBench, a multi-user long-context memory benchmark built on an executable in-vehicle simulation environment. The benchmark evaluates tool use and memory by comparing the post-action environment state with a predefined target state, enabling objective and reproducible evaluation without LLM-based or human scoring. VehicleMemBench includes 23 tool modules, and each sample contains over 80 historical memory events. Experiments show that powerful models perform well on direct instruction tasks but struggle in scenarios involving memory evolution, particularly when user preferences change dynamically. Even advanced memory systems struggle to handle domain-specific memory requirements in this environment. These findings highlight the need for more robust and specialized memory management mechanisms to support long-term adaptive decision-making in real-world in-vehicle systems. To facilitate future research, we release the data and code.

Keywords: in-vehicle agents, multi-user long-term memory, memory benchmark, tool use, preference evolution, executable simulation, adaptive decision-making, memory management
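
The state-based evaluation described in the abstract (compare the post-action environment state with a predefined target state, with no LLM or human judge) can be sketched roughly as below. The state fields, tool names, and helper functions are hypothetical; the benchmark's real simulator and its 23 tool modules are not described in enough detail here to reproduce.

```python
def apply_tool_calls(state, calls):
    """Apply (setting, value) tool calls to a copy of the environment state."""
    new_state = dict(state)
    for setting, value in calls:
        new_state[setting] = value
    return new_state

def state_match(final_state, target_state):
    """Return (exact_match, fraction of target fields the agent got right)."""
    hits = sum(1 for key, want in target_state.items() if final_state.get(key) == want)
    return hits == len(target_state), hits / len(target_state)

if __name__ == "__main__":
    # Hypothetical cabin state before the agent acts.
    initial = {"seat_heat": 0, "ac_temp_c": 22, "ambient_light": "white"}
    # Hypothetical agent output: it recalls the temperature preference
    # correctly but uses an outdated seat-heating preference.
    agent_calls = [("ac_temp_c", 20), ("seat_heat", 2)]
    target = {"ac_temp_c": 20, "seat_heat": 1}

    final = apply_tool_calls(initial, agent_calls)
    print(state_match(final, target))
```

Because the check is a plain comparison of states, the score is deterministic and reproducible, which is the property the abstract highlights.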

108. ❌ Learning-guided Prioritized Planning for Lifelong Multi-Agent Path Finding in Warehouse Automation

Authors: Han Zheng, Yining Ma, Brandon Araki, Jingkai Chen, Cathy Wu Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23838v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

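Scoring rationale: The paper focuses on multi-agent path finding in warehouse automation, combining reinforcement learning (RL) with search-based planning. It is highly relevant only to 'Multi-agent Systems OR Agent Coordination' (10), because its core is the coordination and path planning of a multi-agent system. The other keywords concern large models, the principles of deep learning techniques, or specific AI application areas (such as bioinformatics), whereas this paper uses conventional RL and planning methods and does not involve large models, LLM-related techniques, or concrete AI-for-science applications, so they score 0.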

!!! tip deepseek-chat TL;DR

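The paper tackles lifelong multi-agent path finding in warehouse automation with a framework that combines reinforcement learning and rolling-horizon prioritized planning; in simulation it achieves higher total throughput than the baselines and generalizes well.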

Abstract

Lifelong Multi-Agent Path Finding (MAPF) is critical for modern warehouse automation, which requires multiple robots to continuously navigate conflict-free paths to optimize the overall system throughput. However, the complexity of warehouse environments and the long-term dynamics of lifelong MAPF often demand costly adaptations to classical search-based solvers. While machine learning methods have been explored, their superiority over search-based methods remains inconclusive. In this paper, we introduce Reinforcement Learning (RL) guided Rolling Horizon Prioritized Planning (RL-RH-PP), the first framework integrating RL with search-based planning for lifelong MAPF. Specifically, we leverage classical Prioritized Planning (PP) as a backbone for its simplicity and flexibility in integrating with a learning-based priority assignment policy. By formulating dynamic priority assignment as a Partially Observable Markov Decision Process (POMDP), RL-RH-PP exploits the sequential decision-making nature of lifelong planning while delegating complex spatial-temporal interactions among agents to reinforcement learning. An attention-based neural network autoregressively decodes priority orders on-the-fly, enabling efficient sequential single-agent planning by the PP planner. Evaluations in realistic warehouse simulations show that RL-RH-PP achieves the highest total throughput among baselines and generalizes effectively across agent densities, planning horizons, and warehouse layouts. Our interpretive analyses reveal that RL-RH-PP proactively prioritizes congested agents and strategically redirects agents from congestion, easing traffic flow and boosting throughput. These findings highlight the potential of learning-guided approaches to augment traditional heuristics in modern warehouse automation.

Keywords: Multi-Agent Path Finding, Warehouse Automation, Reinforcement Learning, Prioritized Planning, Lifelong Planning, Throughput Optimization, Agent Coordination, Rolling Horizon
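
For readers unfamiliar with Prioritized Planning, the sketch below shows the classical backbone on a small grid: agents are planned one at a time in a given priority order, and each later agent must avoid the space-time cells and edge swaps reserved by earlier ones. In the paper the priority order is produced by a learned policy; here it is just a fixed list, and the grid, function names, and horizon are illustrative assumptions.

```python
from collections import deque

MOVES = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # wait, then four grid moves

def plan_one(grid, start, goal, reserved_cells, reserved_edges, horizon):
    """Space-time BFS for one agent, avoiding reserved cells and swaps."""
    queue = deque([(start, 0, [start])])
    seen = {(start, 0)}
    while queue:
        pos, t, path = queue.popleft()
        # Accept the goal only if the agent can stay parked there afterwards.
        if pos == goal and all((goal, s) not in reserved_cells
                               for s in range(t + 1, horizon + 1)):
            return path
        if t == horizon:
            continue
        for dr, dc in MOVES:
            nxt = (pos[0] + dr, pos[1] + dc)
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue
            if grid[nxt[0]][nxt[1]] == "#":
                continue
            if (nxt, t + 1) in reserved_cells:   # vertex conflict
                continue
            if (nxt, pos, t) in reserved_edges:  # swap (edge) conflict
                continue
            if (nxt, t + 1) not in seen:
                seen.add((nxt, t + 1))
                queue.append((nxt, t + 1, path + [nxt]))
    return None  # prioritized planning is incomplete; an agent may fail

def prioritized_plan(grid, starts, goals, priority_order, horizon=20):
    """Plan agents sequentially; earlier agents reserve space-time first."""
    reserved_cells, reserved_edges, paths = set(), set(), {}
    for agent in priority_order:
        path = plan_one(grid, starts[agent], goals[agent],
                        reserved_cells, reserved_edges, horizon)
        paths[agent] = path
        if path is None:
            continue
        # The agent waits at its goal until the end of the horizon.
        padded = path + [path[-1]] * (horizon + 1 - len(path))
        for t, cell in enumerate(padded):
            reserved_cells.add((cell, t))
            if t > 0:
                reserved_edges.add((padded[t - 1], cell, t - 1))
    return paths

if __name__ == "__main__":
    grid = ["...", "...", "..."]
    starts = {"a": (1, 0), "b": (1, 2)}
    goals = {"a": (1, 2), "b": (1, 0)}
    for agent, path in prioritized_plan(grid, starts, goals, ["a", "b"]).items():
        print(agent, path)
```

In this toy instance agent `a` takes the direct route and agent `b` has to detour around it, which illustrates why the choice of priority order matters for throughput.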

109. ❌ Circuit Complexity of Hierarchical Knowledge Tracing and Implications for Log-Precision Transformers

Authors: Naiming Liu, Richard Baraniuk, Shashank Sonkar Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23823v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

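Scoring rationale: The paper studies hierarchical prerequisite propagation in knowledge tracing and uses circuit complexity theory to analyze the computational power of Transformers, placing it at the intersection of theoretical computer science and machine learning theory. Although it involves Transformer models, it concentrates on computational complexity analysis (circuit classes such as TC^0 and NC^1), hierarchical structure modeling, and theoretical proofs, not on innovations in large-model techniques or application development. All keywords target the large-model technology stack, training methods, application scenarios, or specific problems (such as hallucination mitigation or inference acceleration), which are unrelated to the theoretical nature of this paper.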

!!! tip deepseek-chat TL;DR

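The paper uses circuit complexity theory to analyze the computational power of Transformers on hierarchical prerequisite knowledge-tracing tasks. It shows that recursive-majority propagation lies in NC^1, while separating it from uniform TC^0 remains open, and finds that Transformers in practice learn shortcuts instead of exploiting structure, although auxiliary supervision improves performance.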

Abstract

Knowledge tracing models mastery over interconnected concepts, often organized by prerequisites. We analyze hierarchical prerequisite propagation through a circuit-complexity lens to clarify what is provable about transformer-style computation on deep concept hierarchies. Using recent results that log-precision transformers lie in logspace-uniform $\mathsf{TC}^0$, we formalize prerequisite-tree tasks including recursive-majority mastery propagation. Unconditionally, recursive-majority propagation lies in $\mathsf{NC}^1$ via $O(\log n)$-depth bounded-fanin circuits, while separating it from uniform $\mathsf{TC}^0$ would require major progress on open lower bounds. Under a monotonicity restriction, we obtain an unconditional barrier: alternating ALL/ANY prerequisite trees yield a strict depth hierarchy for *monotone* threshold circuits. Empirically, transformer encoders trained on recursive-majority trees converge to permutation-invariant shortcuts; explicit structure alone does not prevent this, but auxiliary supervision on intermediate subtrees elicits structure-dependent computation and achieves near-perfect accuracy at depths 3–4. These findings motivate structure-aware objectives and iterative mechanisms for prerequisite-sensitive knowledge tracing on deep hierarchies.

Keywords: knowledge tracing, hierarchical prerequisites, circuit complexity, log-precision transformers, TC^0, NC^1, recursive-majority propagation, structure-aware objectives
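
A small sketch may help show why a permutation-invariant shortcut is not the same function as recursive-majority propagation. It assumes a complete ternary tree with binary mastery values at the leaves (the arity is an assumption; the abstract does not fix it) and compares the tree evaluation with a flat majority vote that ignores structure.

```python
def majority3(a, b, c):
    """Majority of three binary mastery values."""
    return int(a + b + c >= 2)

def recursive_majority(leaves):
    """Evaluate a complete ternary majority tree bottom-up.

    Expects len(leaves) == 3 ** depth. The loop runs once per tree level,
    i.e. a number of rounds logarithmic in the number of leaves.
    """
    level = list(leaves)
    while len(level) > 1:
        level = [majority3(*level[i:i + 3]) for i in range(0, len(level), 3)]
    return level[0]

def flat_majority(leaves):
    """Permutation-invariant shortcut: count mastered leaves, ignore the tree."""
    return int(2 * sum(leaves) > len(leaves))

if __name__ == "__main__":
    grouped = [1, 1, 0, 1, 1, 0, 0, 0, 0]    # four mastered leaves, spread over two subtrees
    clustered = [1, 1, 1, 1, 0, 0, 0, 0, 0]  # same count, different arrangement
    for leaves in (grouped, clustered):
        print(leaves, recursive_majority(leaves), flat_majority(leaves))
```

Both inputs contain the same number of mastered leaves, so the flat vote cannot tell them apart, while the tree evaluation returns different answers. A model that has converged to the shortcut will be wrong on at least one of them.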

110. ❌ Perturbation: A simple and efficient adversarial tracer for representation learning in language models

Authors: Joshua Rozner, Cory Shain Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23821v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

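Scoring rationale: The paper studies representation learning in language models and proposes tracing representations by perturbing the model through adversarial fine-tuning. It is relevant to 'Large Language Models' (8) because it studies representation learning in language models; highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (10) because the core method fine-tunes the model on a single adversarial example; and highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10) because the goal is to understand the internal representational mechanisms of language models. Other keywords such as MoE, quantization, inference acceleration, and AI for Science are unrelated (0).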

!!! tip deepseek-chat TL;DR

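The paper proposes tracing representation learning in language models by perturbing them through adversarial fine-tuning, and shows that trained language models generalize along representational lines and acquire linguistic abstractions from experience.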

Abstract

Linguistic representation learning in deep neural language models (LMs) has been studied for decades, for both practical and theoretical reasons. However, finding representations in LMs remains an unsolved problem, in part due to a dilemma between enforcing implausible constraints on representations (e.g., linearity; Arora et al., 2024) and trivializing the notion of representation altogether (Sutter et al., 2025). Here we escape this dilemma by reconceptualizing representations not as patterns of activation but as conduits for learning. Our approach is simple: we perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples. Perturbation makes no geometric assumptions, and unlike other methods, it does not find representations where it should not (e.g., in untrained LMs). But in trained LMs, perturbation reveals structured transfer at multiple linguistic grain sizes, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone.

Keywords: representation learning, language models, adversarial perturbation, fine-tuning, linguistic abstractions, transfer learning, neural networks, interpretability
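
The perturb-then-measure loop can be illustrated on a toy model. The sketch below uses a three-feature logistic classifier in place of a language model, which is a heavy simplification and an assumption of this example: it updates the model on one adversarial example (a label that contradicts what the model has learned) and reports how much the prediction shifts on other probe inputs. All names and numbers are hypothetical.

```python
import math

def predict(weights, x):
    """Probability of the positive class under a logistic model."""
    return 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(weights, x))))

def sgd_steps(weights, x, y, lr=0.5, steps=20):
    """Return new weights after `steps` log-loss gradient steps on one example."""
    w = list(weights)
    for _ in range(steps):
        err = predict(w, x) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def infection(weights, adversarial_x, adversarial_y, probes):
    """Shift in predicted probability on each probe after the perturbation."""
    perturbed = sgd_steps(weights, adversarial_x, adversarial_y)
    return {name: predict(perturbed, x) - predict(weights, x)
            for name, x in probes.items()}

if __name__ == "__main__":
    trained = [2.0, -2.0, 0.0]         # stand-in for a trained model
    adversarial = ([1.0, 0.0, 0.0], 0.0)  # label contradicts the learned weight
    probes = {
        "shares_feature": [1.0, 0.0, 1.0],
        "unrelated": [0.0, 1.0, 0.0],
    }
    for name, shift in infection(trained, *adversarial, probes).items():
        print(f"{name}: {shift:+.3f}")
```

In this toy setting the probe that shares a feature with the adversarial example shifts, while the unrelated probe does not. The paper applies the same logic to real language models, where the pattern of transfer is what is interpreted as evidence of shared representations.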

111. ❌ Willful Disobedience: Automatically Detecting Failures in Agentic Traces

Authors: Reshabh K Sharma, Shraddha Barke, Benjamin Zorn Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23806v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

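Scoring rationale: The paper's core topic is evaluating the execution traces of AI agents (LLM agents) in multi-step workflows. It is highly relevant to 'LLM Agents' (10), involves tool invocation ('Tool Use', 8), and implicitly relies on large models ('LLMs', 8) for the analysis. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and score 0.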

!!! tip deepseek-chat TL;DR

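The paper presents AgentPex, a tool that extracts behavioral rules from agent prompts and automatically evaluates whether agent execution traces in multi-step workflows comply with them. It addresses procedural failures that outcome-only benchmarks miss and is validated across several domains.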

Abstract

AI agents are increasingly embedded in real software systems, where they execute multi-step workflows through multi-turn dialogue, tool invocations, and intermediate decisions. These long execution histories, called agentic traces, make validation difficult. Outcome-only benchmarks can miss critical procedural failures, such as incorrect workflow routing, unsafe tool usage, or violations of prompt-specified rules. This paper presents AgentPex, an AI-powered tool designed to systematically evaluate agentic traces. AgentPex extracts behavioral rules from agent prompts and system instructions, then uses these specifications to automatically evaluate traces for compliance. We evaluate AgentPex on 424 traces from τ2-bench across models in telecom, retail, and airline customer service. Our results show that AgentPex distinguishes agent behavior across models and surfaces specification violations that are not captured by outcome-only scoring. It also provides fine-grained analysis by domain and metric, enabling developers to understand agent strengths and weaknesses at scale.

Keywords: AI agents, agentic traces, tool invocations, multi-step workflows, behavioral rules, specification violations, outcome-only scoring, fine-grained analysis
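The rule-then-check flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not AgentPex's implementation: the `Step` and `Rule` types, the `tool_requires_prior` helper, and the tool names are all hypothetical, and in the paper the rules are extracted from the agent prompt by an AI model rather than written by hand.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    kind: str          # "message" or "tool_call" (hypothetical trace schema)
    name: str          # tool name, or speaker for messages
    content: str = ""


@dataclass
class Rule:
    description: str
    check: Callable[[List[Step]], bool]  # returns True when the trace complies


def tool_requires_prior(tool: str, prerequisite: str) -> Callable[[List[Step]], bool]:
    """Build a check: every call to `tool` must follow a call to `prerequisite`."""
    def check(trace: List[Step]) -> bool:
        seen_prerequisite = False
        for step in trace:
            if step.kind != "tool_call":
                continue
            if step.name == prerequisite:
                seen_prerequisite = True
            elif step.name == tool and not seen_prerequisite:
                return False
        return True
    return check


def evaluate(trace: List[Step], rules: List[Rule]) -> List[str]:
    """Return the descriptions of all rules the trace violates."""
    return [rule.description for rule in rules if not rule.check(trace)]


# Hypothetical rule and trace: the refund is issued without identity verification.
rules = [Rule("verify identity before refund",
              tool_requires_prior("issue_refund", "verify_identity"))]
trace = [Step("message", "user", "I want a refund"),
         Step("tool_call", "issue_refund")]
print(evaluate(trace, rules))
```

A trace like this could still end in a "successful" refund, which is the kind of procedural violation that outcome-only scoring would not surface.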

112. ❌ Deep Neural Regression Collapse

Authors: Akshay Rangamani, Altay Unal Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23805v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

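Scoring rationale: The paper studies the Neural Regression Collapse phenomenon in deep regression models, which belongs to the theoretical analysis of deep learning, whereas every scoring keyword targets techniques, applications, and optimization methods specific to large language models (LLMs). The paper does not touch on LLMs, MoE, SLMs, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, inference optimization, agents, quantization, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for science. It examines structural properties of general deep neural networks on regression tasks and has no connection to the LLM-specific techniques in the keyword list.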

!!! tip deepseek-chat TL;DR

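This paper studies Neural Regression Collapse in deep regression models and finds that the phenomenon appears not only at the last layer but also in intermediate layers, revealing the low-rank structure and intrinsic dimension that deep networks learn on regression tasks.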

摘要翻译

神经坍缩是一种有助于识别深度分类器中稀疏与低秩结构的现象。近期研究已将神经坍缩的定义扩展至回归问题，但仅测量了最后一层的该现象。本文证实神经回归坍缩（Neural Regression Collapse, NRC）在不同类型模型的最后一层之前同样存在。我们证明，在神经回归模型的坍缩层中，特征位于与目标维度对应的子空间内，特征协方差与目标协方差对齐，层权重的输入子空间与特征子空间对齐，且特征的线性预测误差接近模型的整体预测误差。除确立深度神经回归坍缩外，本文还表明呈现深度神经回归坍缩的模型能够学习低秩目标的内在维度，并探讨了权重衰减在诱导深度神经回归坍缩中的必要性。本研究为回归背景下深度网络所学习的简单结构提供了更完整的图景。

摘要 (Abstract)

Neural Collapse is a phenomenon that helps identify sparse and low rank structures in deep classifiers. Recent work has extended the definition of neural collapse to regression problems, albeit only measuring the phenomenon at the last layer. In this paper, we establish that Neural Regression Collapse (NRC) also occurs below the last layer across different types of models. We show that in the collapsed layers of neural regression models, features lie in a subspace that corresponds to the target dimension, the feature covariance aligns with the target covariance, the input subspace of the layer weights aligns with the feature subspace, and the linear prediction error of the features is close to the overall prediction error of the model. In addition to establishing Deep NRC, we also show that models that exhibit Deep NRC learn the intrinsic dimension of low rank targets and explore the necessity of weight decay in inducing Deep NRC. This paper provides a more complete picture of the simple structure learned by deep networks in the context of regression.

Keywords: Neural Regression Collapse, deep neural networks, regression problems, low rank structures, feature covariance, weight decay, intrinsic dimension, prediction error

113. ❌ Object Search in Partially-Known Environments via LLM-informed Model-based Planning and Prompt Selection

Authors: Abhishek Paudel, Abhish Khanal, Raihan I. Arnob, Shahriar Hossain, Gregory J. Stein Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23800v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

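Scoring rationale: The paper proposes an LLM-informed model-based planning framework and a prompt selection method for object search in partially-known environments. Its core idea is to use an LLM to estimate the likelihood of finding the target object at different locations and to combine those estimates with travel costs from the environment map to instantiate a model for effective search planning. The paper is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), since the LLM is its core component. It also applies LLMs to autonomous planning, making it highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points). Other keywords, such as MoE, SFT, RAG, reasoning methods, and compression techniques, are not covered or are only indirectly related, so they score 0.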

!!! tip deepseek-chat TL;DR

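This paper proposes an LLM-informed model-based planning framework and a prompt selection method for object search in partially-known environments. In both simulation and real-robot experiments, the approach outperforms the baseline strategies.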

摘要翻译

本文提出了一种新颖的基于大语言模型（LLM）信息的模型化规划框架，以及一种创新的提示选择方法，用于在部分已知环境中进行目标物体搜索。我们的方法利用大语言模型来估计在场景中不同位置搜索到目标物体的可能性统计量，这些统计量与从环境地图中提取的移动成本相结合，用于实例化一个规划模型，从而借助大语言模型的信息辅助规划，实现高效的搜索性能。此外，本方法所依赖的抽象表示适用于通过近期提出的离线回放方法进行部署时的模型选择，我们利用这一特性，实现了在部署过程中快速选择提示词和大语言模型。仿真实验表明，我们这种基于大语言模型信息的模型化规划方法，其性能分别优于完全依赖大语言模型的基线规划策略和乐观策略，提升幅度最高可达11.8%和39.2%；同时，我们提出的类赌博机选择方法能够快速选择最佳提示词和大语言模型，与基线UCB赌博机选择方法相比，平均成本降低了6.5%，平均累积遗憾降低了33.8%。在一套公寓环境中进行的真实机器人实验也显示出类似的性能提升，从而进一步验证了我们的方法。

摘要 (Abstract)

We present a novel LLM-informed model-based planning framework, and a novel prompt selection method, for object search in partially-known environments. Our approach uses an LLM to estimate statistics about the likelihood of finding the target object when searching various locations throughout the scene that, combined with travel costs extracted from the environment map, are used to instantiate a model, thus using the LLM to inform planning and achieve effective search performance. Moreover, the abstraction upon which our approach relies is amenable to deployment-time model selection via the recent offline replay approach, an insight we leverage to enable fast prompt and LLM selection during deployment. Simulation experiments demonstrate that our LLM-informed model-based planning approach outperforms the baseline planning strategy that fully relies on LLM and optimistic strategy with as much as 11.8% and 39.2% improvements respectively, and our bandit-like selection approach enables quick selection of best prompts and LLMs resulting in 6.5% lower average cost and 33.8% lower average cumulative regret over baseline UCB bandit selection. Real-robot experiments in an apartment demonstrate similar improvements and so further validate our approach.

Keywords: LLM-informed planning, model-based planning, object search, partially-known environments, prompt selection, autonomous agents, simulation experiments, real-robot experiments
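The idea of combining LLM-estimated find likelihoods with travel costs can be illustrated with a toy planner. This is a simplified sketch, not the authors' model: it assumes independent per-location probabilities, brute-forces all visit orders, and uses made-up location names, probabilities, and costs where the paper would use LLM estimates and map-derived distances.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple

Cost = Dict[Tuple[str, str], float]


def expected_search_cost(order: Sequence[str], find_prob: Dict[str, float],
                         travel_cost: Cost, start: str = "start") -> float:
    """Expected travel cost of visiting locations in `order`, stopping once the object is found."""
    cost, p_not_found_yet, here = 0.0, 1.0, start
    for loc in order:
        # We only pay for this leg if the object was not found earlier.
        cost += p_not_found_yet * travel_cost[(here, loc)]
        p_not_found_yet *= 1.0 - find_prob[loc]
        here = loc
    return cost


def best_search_order(find_prob: Dict[str, float], travel_cost: Cost,
                      start: str = "start") -> Tuple[str, ...]:
    """Brute-force the visit order with the lowest expected cost (fine for a handful of locations)."""
    return min(permutations(find_prob),
               key=lambda order: expected_search_cost(order, find_prob, travel_cost, start))


# Hypothetical numbers standing in for LLM estimates and map distances.
find_prob = {"kitchen": 0.7, "bedroom": 0.2}
travel_cost = {("start", "kitchen"): 4.0, ("start", "bedroom"): 3.0,
               ("kitchen", "bedroom"): 5.0, ("bedroom", "kitchen"): 5.0}
print(best_search_order(find_prob, travel_cost))
```

Here the nearer bedroom is visited second because the higher estimated likelihood in the kitchen outweighs the extra travel.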

114. ❌ The Cognitive Firewall: Securing Browser Based AI Agents Against Indirect Prompt Injection Via Hybrid Edge Cloud Defense

Authors: Qianlong Lan, Anuj Kaul Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23791v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

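Scoring rationale: The paper's core topic is a security defense architecture for LLMs deployed as autonomous browser agents, so it is highly relevant to "Large Language Models" and "LLM Agents" (10 points each), since it explicitly studies the security of LLM agents. Other keywords, such as MoE, SLMs, training methods, inference optimization, and scientific applications, are not covered (0 points).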

!!! tip deepseek-chat TL;DR

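The paper studies a hybrid edge-cloud defense architecture that protects browser-based LLM agents against indirect prompt injection, reducing the attack success rate to below 1% while achieving an approximately 17,000x latency advantage over cloud-only baselines.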

摘要翻译

将大型语言模型（LLM）部署为自主浏览器代理会以间接提示注入（IPI）的形式暴露出显著的攻击面。基于云的防御方案能够提供强大的语义分析，但会引入延迟并引发隐私担忧。我们提出认知防火墙，这是一种三阶段的分割计算架构，将安全检查分布在客户端与云端。该系统由本地视觉哨兵、基于云的深度规划器以及强制执行时策略的确定性守卫组成。在1000个对抗性样本测试中，纯边缘防御方案未能检测出86.9%的语义攻击。相比之下，完整的混合架构将总体攻击成功率（ASR）降低至1%以下（静态评估下为0.88%，自适应评估下为0.67%），同时对具有副作用的操作保持确定性约束。通过在本地过滤表示层攻击，该系统避免了不必要的云端推理，相比纯云端基线实现了约17,000倍的延迟优势。这些结果表明，在执行边界实施确定性约束可以补充概率性语言模型，且分割计算为保护交互式LLM代理提供了实用基础。

摘要 (Abstract)

Deploying large language models (LLMs) as autonomous browser agents exposes a significant attack surface in the form of Indirect Prompt Injection (IPI). Cloud-based defenses can provide strong semantic analysis, but they introduce latency and raise privacy concerns. We present the Cognitive Firewall, a three-stage split-compute architecture that distributes security checks across the client and the cloud. The system consists of a local visual Sentinel, a cloud-based Deep Planner, and a deterministic Guard that enforces execution-time policies. Across 1,000 adversarial samples, edge-only defenses fail to detect 86.9% of semantic attacks. In contrast, the full hybrid architecture reduces the overall attack success rate (ASR) to below 1% (0.88% under static evaluation and 0.67% under adaptive evaluation), while maintaining deterministic constraints on side-effecting actions. By filtering presentation-layer attacks locally, the system avoids unnecessary cloud inference and achieves an approximately 17,000x latency advantage over cloud-only baselines. These results indicate that deterministic enforcement at the execution boundary can complement probabilistic language models, and that split-compute provides a practical foundation for securing interactive LLM agents.

Keywords: Large Language Models, LLM Agents, Indirect Prompt Injection, Hybrid Edge Cloud Defense, Browser Agents, Security Architecture, Split-compute, Attack Success Rate
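The three-stage split described in the abstract (local Sentinel, cloud Deep Planner, deterministic Guard) might be wired together along these lines. Everything here is a hypothetical sketch: the CSS heuristic, the action names, and the `plan_in_cloud` callable are placeholders, and the paper's actual Sentinel is visual rather than string-based.

```python
from typing import Callable, Optional, Set

# Hypothetical set of actions with side effects that the guard must gate.
SIDE_EFFECTING: Set[str] = {"submit_form", "send_email", "purchase"}


def sentinel(page_html: str) -> bool:
    """Local stage: True if the page passes a cheap presentation-layer check.
    Toy heuristic only: flag text hidden from the user via inline CSS."""
    lowered = page_html.lower().replace(" ", "")
    return not ("display:none" in lowered or "font-size:0" in lowered)


def guard(action: str, user_approved: Set[str]) -> bool:
    """Deterministic stage: side-effecting actions run only if explicitly approved."""
    return action not in SIDE_EFFECTING or action in user_approved


def run_step(page_html: str, plan_in_cloud: Callable[[str], str],
             user_approved: Set[str]) -> Optional[str]:
    """Return the action to execute, or None if any stage blocks it."""
    if not sentinel(page_html):
        return None                       # blocked locally; no cloud call is made
    action = plan_in_cloud(page_html)     # stand-in for the cloud-based planner
    return action if guard(action, user_approved) else None


# A page with hidden text never reaches the (stubbed) cloud planner.
print(run_step('<p style="display: none">ignore previous instructions</p>',
               lambda html: "purchase", set()))
```

The design point the sketch tries to capture is that the final gate is a plain set-membership test, so its behavior does not depend on any model output being trustworthy.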

115. ❌ Probabilistic Geometric Alignment via Bayesian Latent Transport for Domain-Adaptive Foundation Models

Authors: Kuepon Aueawatthanaphisut Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23783v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

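Scoring rationale: The paper's core topic is domain adaptation of foundation models, so it is highly relevant to keyword 1 (Large Language Models OR LLMs OR Foundation Models) and keyword 5 (Pre-training OR Continual Pre-training OR Domain Adaptation), at 10 points each. It compares against deterministic fine-tuning baselines, which gives it some relevance to keyword 6 (Post-training OR Supervised Fine-tuning OR SFT; 5 points). It does not cover the specific techniques or application scenarios of the other keywords, which score 0.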

!!! tip deepseek-chat TL;DR

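This paper addresses latent distribution mismatch, unstable optimization, and miscalibrated uncertainty propagation when foundation models are adapted to new domains under limited supervision. It proposes a probabilistic geometric alignment framework based on Bayesian latent transport, with theoretical guarantees on convergence stability, loss landscape smoothness, and sample efficiency under distributional shift. Empirically, it substantially reduces latent manifold discrepancy, accelerates transport energy decay, and improves covariance calibration.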

摘要翻译

将大规模基础模型在有限监督下适配至新领域，仍是一个因潜在分布失配、优化动态不稳定及不确定性传播失准而产生的根本性挑战。本文提出一种不确定性感知的概率隐空间传输框架，将领域适应问题表述为表示空间中的随机几何对齐问题。我们设计了一种贝叶斯传输算子，用于沿Wasserstein型测地线轨迹重新分配隐空间概率质量，同时引入PAC-Bayesian正则化机制以约束后验模型复杂度，从而缓解灾难性过拟合。该框架为分布偏移下的收敛稳定性、损失景观平滑性及样本效率提供了理论保证。实证分析表明，相较于确定性微调与对抗性领域适应基线方法，本方法显著降低了隐流形差异，加速了传输能量衰减，并改善了协方差校准。此外，有界的后验不确定性演化表明跨领域迁移过程中概率可靠性得到增强。通过建立随机最优传输几何与统计泛化理论之间的原则性关联，本框架为异构环境下现代基础架构的鲁棒适应提供了新见解。这些发现表明，不确定性感知的概率对齐构成了下一代深度表示系统中实现可靠迁移学习的一种前景广阔的范式。

摘要 (Abstract)

Adapting large-scale foundation models to new domains with limited supervision remains a fundamental challenge due to latent distribution mismatch, unstable optimization dynamics, and miscalibrated uncertainty propagation. This paper introduces an uncertainty-aware probabilistic latent transport framework that formulates domain adaptation as a stochastic geometric alignment problem in representation space. A Bayesian transport operator is proposed to redistribute latent probability mass along Wasserstein-type geodesic trajectories, while a PAC-Bayesian regularization mechanism constrains posterior model complexity to mitigate catastrophic overfitting. The proposed formulation yields theoretical guarantees on convergence stability, loss landscape smoothness, and sample efficiency under distributional shift. Empirical analyses demonstrate substantial reduction in latent manifold discrepancy, accelerated transport energy decay, and improved covariance calibration compared with deterministic fine-tuning and adversarial domain adaptation baselines. Furthermore, bounded posterior uncertainty evolution indicates enhanced probabilistic reliability during cross-domain transfer. By establishing a principled connection between stochastic optimal transport geometry and statistical generalization theory, the proposed framework provides new insights into robust adaptation of modern foundation architectures operating in heterogeneous environments. These findings suggest that uncertainty-aware probabilistic alignment constitutes a promising paradigm for reliable transfer learning in next-generation deep representation systems.

Keywords: Foundation Models, Domain Adaptation, Probabilistic Alignment, Bayesian Transport, Latent Distribution Mismatch, Uncertainty Propagation, Wasserstein Geodesic, PAC-Bayesian Regularization

116. ❌ Comparing Developer and LLM Biases in Code Evaluation

Authors: Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donahue, Ameet Talwalkar, Wayne Chi, Valerie Chen Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24586v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

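Scoring rationale: The paper's core topic is how well LLMs acting as code-evaluation judges align with human preferences, which directly involves LLM technology and alignment evaluation, so it is highly relevant to "Large Language Models" and "Alignment" (10 points each). Other keywords, such as MoE, SLMs, training methods, inference optimization, agent systems, and AI-for-science applications, are not covered in the paper and score 0.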

!!! tip deepseek-chat TL;DR

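This paper studies how well LLM judges of code align with developer preferences. It finds that the best LLM judges underperform human annotators by 12-23% and identifies 35 significant sources of misalignment.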

摘要翻译

随着大语言模型在代码应用中被越来越多地用作评判者，应在能捕捉局部上下文和模糊意图的现实交互环境中对其评估。我们提出TRACE（代码评估中的评分标准分析工具），这是一个评估大语言模型评判者预测人类偏好能力的框架，并能自动提取评分项以揭示人类与模型在各项权重分配上的系统性偏差。通过三种交互模态——基于聊天的编程、IDE自动补全和指令化代码编辑——我们使用TRACE测量大语言模型评判者与开发者偏好的对齐程度。在13个不同模型中，最佳评判者的表现仍低于人类标注者12-23%。TRACE识别出跨交互模态下人类与评判者之间35个显著的对齐偏差来源，其中大部分对应于现有的软件工程代码质量标准。例如，在基于聊天的编程中，评判者偏向更长的代码解释，而人类更偏好简短的说明。我们发现现有大多数代码质量维度上均存在显著偏差，这表明在现实编码应用中，大语言模型评判者与人类偏好之间存在明显的对齐差距。

摘要 (Abstract)

As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges’ ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh each item. Across three modalities – chat-based programming, IDE autocompletion, and instructed code editing – we use TRACE to measure how well LLM judges align with developer preferences. Among 13 different models, the best judges underperform human annotators by 12-23%. TRACE identifies 35 significant sources of misalignment between humans and judges across interaction modalities, the majority of which correspond to existing software engineering code quality criteria. For example, in chat-based coding, judges are biased towards longer code explanations while humans prefer shorter ones. We find significant misalignment on the majority of existing code quality dimensions, showing alignment gaps between LLM judges and human preference in realistic coding applications.

Keywords: LLM judges, code evaluation, human preference alignment, systematic biases, TRACE framework, software engineering, code quality criteria, interactive settings

117. ❌ MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Authors: Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie Hu, Yu Qin, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24579v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

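Scoring rationale: MARCH targets LLM hallucination in RAG systems. Its core contribution is a multi-agent reinforcement learning framework with three agents (Solver, Proposer, and Checker) that breaks self-confirmation bias through deliberate information asymmetry. It is therefore highly relevant (10 points each) to the following keywords: Large Language Models (the object of study), Retrieval-Augmented Generation (the application setting), Self-Correction/Self-Improvement (the framework's goal), LLM Agents and Multi-agent Systems (the core of the method), and Hallucination Mitigation (the research problem). Other keywords are not covered or are only marginally mentioned, so they score 0.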

!!! tip deepseek-chat TL;DR

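This paper proposes MARCH, a multi-agent reinforcement learning framework whose information-asymmetric Solver-Proposer-Checker pipeline reduces hallucination in retrieval-augmented generation with large language models, enabling an 8B-parameter model to reach performance competitive with closed-source models.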

摘要翻译

幻觉问题仍是大型语言模型（LLM）面临的关键瓶颈，削弱了其在实际应用中的可靠性，尤其在检索增强生成（RAG）系统中。现有幻觉检测方法通常采用“以LLM作为评判者”的方式，依据检索证据验证LLM输出，但这些方法存在固有的确认偏差——验证者会无意中复现原始生成中的错误。为解决此问题，我们提出多智能体强化自检幻觉框架（MARCH），该框架通过精心设计的信息不对称机制来强化事实对齐。MARCH协调三个专用智能体的协作流程：求解器（Solver）、提议器（Proposer）和检查器（Checker）。求解器首先生成初始RAG响应，随后提议器将其分解为可验证的原子命题。关键创新在于，检查器在隔离求解器原始输出的条件下，独立依据检索证据验证这些命题。这种精心设计的信息不对称机制打破了自我确认偏差的循环。通过多智能体强化学习（MARL）训练该流程，我们使智能体能够协同进化并优化事实一致性。在多个幻觉基准测试上的广泛实验表明，MARCH显著降低了幻觉率。值得注意的是，搭载MARCH的80亿参数LLM达到了与强大闭源模型相竞争的性能。MARCH通过协同进化为LLM的事实性自我改进提供了一条可扩展的路径。代码发布于https://github.com/Qwen-Applications/MARCH。

摘要 (Abstract)

Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver’s original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is at https://github.com/Qwen-Applications/MARCH.

Keywords: Hallucination Mitigation, Retrieval-Augmented Generation, Multi-agent Systems, Self-Correction, Large Language Models, Reinforcement Learning, Factual Alignment, Information Asymmetry
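The information asymmetry at the heart of the Solver-Proposer-Checker pipeline can be shown in a few lines. This is a structural sketch only, under the assumption that each agent is an opaque callable: the function name and the toy stand-in agents are hypothetical, and the multi-agent reinforcement learning that the paper uses to train the agents is omitted entirely.

```python
from typing import Callable, List, Tuple


def march_style_check(
    question: str,
    evidence: List[str],
    solver: Callable[[str, List[str]], str],
    proposer: Callable[[str], List[str]],
    checker: Callable[[str, List[str]], bool],
) -> Tuple[str, List[Tuple[str, bool]]]:
    """Run one pass of a Solver -> Proposer -> Checker pipeline."""
    answer = solver(question, evidence)   # Solver sees the question and the evidence
    claims = proposer(answer)             # Proposer splits the answer into atomic claims
    # Checker receives each claim plus the evidence only; the Solver's
    # full answer is deliberately withheld from it.
    verdicts = [(claim, checker(claim, evidence)) for claim in claims]
    return answer, verdicts


# Toy stand-ins for the three agents (real ones would be LLM calls).
evidence = ["Paris is the capital of France"]
answer, verdicts = march_style_check(
    "What is the capital of France?",
    evidence,
    solver=lambda q, ev: "Paris is the capital of France. Paris has 90 million residents",
    proposer=lambda a: [c.strip(". ") for c in a.split(". ")],
    checker=lambda claim, ev: claim in ev,
)
for claim, supported in verdicts:
    print(supported, claim)
```

Because the checker's signature has no parameter for the original answer, it cannot inherit the Solver's framing, which is the property the abstract credits with breaking self-confirmation bias.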

118. ❌ Analysing the Safety Pitfalls of Steering Vectors

Authors: Yuxiao Li, Alina Fastowski, Efstratios Zaradoukas, Bardh Prenkaj, Gjergji Kasneci Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24543v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

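Scoring rationale: The paper's core topic is the effect of activation steering on LLM safety, which falls within LLM safety alignment and interpretability. It is highly relevant to "Large Language Models" (10 points), since the whole paper concerns controlling LLM behavior. It is relevant to "Instruction Tuning OR Alignment OR Value Alignment" (8 points), since it involves behavior steering and safety alignment, and to "Mechanistic Interpretability OR Explainable AI" (8 points), since it offers a traceable explanation in terms of refusal behavior. Other keywords, such as MoE, quantization, and inference acceleration, are unrelated to the paper and score 0.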

!!! tip deepseek-chat TL;DR

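Through a systematic safety audit, this paper finds that steering vectors obtained with Contrastive Activation Addition substantially change the success rate of jailbreak attacks against LLMs (increasing it by up to 57% or decreasing it by up to 50%), revealing a trade-off between controllability and safety.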

摘要翻译

激活导向已成为一种无需更新权重即可塑造大语言模型行为的强大工具。尽管其固有的脆弱性与不可靠性已得到充分记载，但其安全影响仍未得到充分探索。在本研究中，我们对通过对比激活加法这一广泛使用的导向方法获得的导向向量进行了系统性安全审计，并采用统一的评估协议。以JailbreakBench为基准，我们发现导向向量能持续影响越狱攻击的成功率，在基于简单模板的攻击下表现出更强的放大效应。在不同系列和规模的LLM中，沿特定方向引导模型会使其攻击成功率（ASR）出现剧烈波动——根据目标行为的不同，可显著提升（最高达57%）或降低（最高达50%）。我们将此现象归因于导向向量与拒绝行为的潜在方向之间的重叠性，从而为这一发现提供了可追溯的解释。综合而言，我们的研究揭示了LLM中此前未被观测到的安全漏洞根源，凸显了可控性与安全性之间存在的权衡关系。

摘要 (Abstract)

Activation steering has emerged as a powerful tool to shape LLM behavior without the need for weight updates. While its inherent brittleness and unreliability are well-documented, its safety implications remain underexplored. In this work, we present a systematic safety audit of steering vectors obtained with Contrastive Activation Addition (CAA), a widely used steering approach, under a unified evaluation protocol. Using JailbreakBench as benchmark, we show that steering vectors consistently influence the success rate of jailbreak attacks, with stronger amplification under simple template-based attacks. Across LLM families and sizes, steering the model in specific directions can drastically increase (up to 57%) or decrease (up to 50%) its attack success rate (ASR), depending on the targeted behavior. We attribute this phenomenon to the overlap between the steering vectors and the latent directions of refusal behavior. Thus, we offer a traceable explanation for this discovery. Together, our findings reveal the previously unobserved origin of this safety gap in LLMs, highlighting a trade-off between controllability and safety.

Keywords: Activation Steering, LLM Safety, Jailbreak Attacks, Contrastive Activation Addition, Safety Audit, Refusal Behavior, Steering Vectors, Attack Success Rate
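For readers unfamiliar with Contrastive Activation Addition, the basic arithmetic is a difference of mean activations that is later added, scaled, to a hidden state. The sketch below uses plain Python lists and made-up two-dimensional activations; the function names are illustrative assumptions, and using cosine similarity as the overlap measure with a refusal direction is an assumption here, not necessarily the metric used in the paper.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Element-wise mean of a set of activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(positive_acts: Sequence[Vector], negative_acts: Sequence[Vector]) -> Vector:
    """Mean activation on 'positive' prompts minus mean activation on 'negative' prompts."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def steer(hidden: Vector, v: Vector, alpha: float) -> Vector:
    """Add the scaled steering vector to a hidden state; no weights are updated."""
    return [h + alpha * x for h, x in zip(hidden, v)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, e.g. between a steering vector and a refusal direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Made-up activations for two contrastive prompt sets.
v = steering_vector([[1.0, 2.0], [3.0, 2.0]], [[0.0, 0.0], [0.0, 2.0]])
refusal_direction = [1.0, 0.0]   # hypothetical latent direction
print(v, cosine(v, refusal_direction))
```

A high cosine value in this toy setting mirrors the paper's explanation: steering along a vector that overlaps the refusal direction also moves the model's refusal behavior.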

119. ❌ Mechanic: Sorrifier-Driven Formal Decomposition Workflow for Automated Theorem Proving

Authors: Ruichen Qiu, Yichuan Cao, Junqi Liu, Dakai Guo, Xiao-Shan Gao, Lihong Zhi, Ruyong Feng Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24465v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

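Scoring rationale: The paper's core subject is LLM-based agents for automated theorem proving, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each). The work involves complex mathematical reasoning and is therefore moderately related to 'Chain of Thought', 'System 2 Thinking', and 'Self-Correction' (5 points each). The paper discusses the long-context problem, which relates to 'Context Window Extension' (5 points). As an application of AI to a scientific domain, it is highly relevant to 'AI for Science' (10 points). Other keywords such as MoE, SFT, and RAG do not appear in the abstract and receive 0 points.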

!!! tip deepseek-chat TL;DR

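The paper targets a weakness of LLM-based agents in automated theorem proving, where a localized error forces either a full retry or an inefficient long-context repair. It proposes Mechanic, a formal decomposition workflow built on the sorry placeholder, and reports significantly better proving efficiency on math-competition benchmarks such as IMO and Putnam.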

Abstract translation (Chinese)

近期，大规模语言模型（LLM）及基于LLM的智能体在自动定理证明能力方面取得了显著进展。然而，对于需要复杂数学推理的问题，现有系统很少能一次证明成功，而必须反复调整其证明策略。当前处理失败尝试的方法通常要么完全丢弃原有证明并从头开始重新生成，要么在证明内部迭代修正错误。前者效率低下，因为它可能因局部错误而放弃基本正确的推理过程；后者虽然保留了先前进展，但会导致上下文长度不断增加，从而逐渐削弱模型对剩余未解决子问题的关注能力。为解决这一困境，我们提出了Mechanic——一种采用占位符驱动形式化分解策略的新型智能体系统。通过利用Lean证明辅助系统中的sorry占位符来精确隔离未解决的子目标，同时保留周围已验证的证明结构，Mechanic将每个失败的子问题提取到独立、自包含的清洁上下文中进行独立求解。这种方法既避免了完全重新生成造成的资源浪费，也规避了重复修复导致上下文过长的弊端。在包括IMO 2025和Putnam 2025在内的挑战性数学竞赛基准测试上的实验结果表明，我们的智能体在证明效率方面取得了显著优势。

Abstract

Recent advances in large language models (LLMs) and LLM-based agents have substantially improved the capabilities of automated theorem proving. However, for problems requiring complex mathematical reasoning, current systems rarely succeed on the first try and must repeatedly modify their proof strategies. Existing approaches for handling failed attempts typically either discard the entire proof and regenerate it from scratch or iteratively fix errors within the proof. The former is inefficient, as it may abandon mostly correct reasoning due to localized errors, while the latter, although preserving prior progress, leads to progressively longer contexts, which degrade the model’s ability to attend to the remaining unresolved subproblems. To address this dilemma, we propose Mechanic, a novel agent system that employs a sorry-driven formal decomposition strategy. By leveraging the sorry placeholder in Lean to precisely isolate unresolved subgoals while preserving the surrounding verified proof structure, Mechanic extracts each failed subproblem into a clean, self-contained context and resolves it independently. This avoids both the waste of full regeneration and the excessive context length induced by repeated repairs. Experimental results on challenging mathematical competition benchmarks, including IMO 2025 and Putnam 2025, demonstrate that our agent achieves significant advantages in proving efficiency.

Keywords: Large Language Models, LLM-based agents, Automated Theorem Proving, Mathematical Reasoning, Formal Decomposition, Sorry-driven Strategy, Proof Efficiency, Lean Theorem Prover
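
To make the sorry-driven idea concrete, here is a small Lean 4 sketch. It is a toy example and not taken from the paper: the theorem names `toy` and `toy_sub` are hypothetical, and the exact way Mechanic extracts subgoals is not described in the abstract. The first theorem keeps a verified step and marks the unresolved subgoal with `sorry`; the second restates that subgoal as a self-contained lemma that could be attempted in a fresh context.

```lean
-- A partially verified proof: the first conjunct is closed,
-- the second is isolated behind a `sorry` placeholder.
theorem toy (a b : Nat) (h : a ≤ b) : a ≤ b + 1 ∧ a * 0 = 0 := by
  constructor
  · -- verified step, preserved as is
    exact Nat.le_succ_of_le h
  · -- unresolved subgoal, left as a placeholder
    sorry

-- The isolated subgoal, restated without the surrounding proof
-- so it can be solved independently.
theorem toy_sub (a : Nat) : a * 0 = 0 := by
  simp
```

Once `toy_sub` is proved, the `sorry` in `toy` can be replaced by `exact toy_sub a`, leaving the already verified part of the proof untouched.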

120. ❌ What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification

Authors: Massa Baali, Sarthak Bisht, Rita Singh, Bhiksha Raj Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24432v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

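Scoring rationale: The paper focuses on large-scale speaker verification and proposes CURriculum Ranking (Curry), an adaptive loss that improves training by estimating sample difficulty online. Its core contributions are loss-function design and a large-scale speaker verification system, whereas every configured keyword concerns large language models (LLMs): their technical principles, training methods, inference optimization, alignment techniques, and applications. Because the paper studies deep learning models for speaker recognition rather than LLMs, it is unrelated to all of the keywords.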

!!! tip deepseek-chat TL;DR

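The paper addresses the problem that fixed-margin losses treat all samples equally in large-scale speaker verification. It proposes Curry, a curriculum-based adaptive loss that estimates sample difficulty online and handles samples by tier, reducing the equal error rate (EER) by 86.8% on VoxCeleb1-O and by 60.0% on SITW.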

Abstract translation (Chinese)

大规模说话人验证仍是一个开放挑战，因为固定间隔损失函数对所有样本进行同等处理，忽略了其质量差异。我们假设错误标注或质量退化的样本会引入噪声梯度，从而破坏紧凑的说话人流形。本文提出Curry（课程排序）自适应损失函数，该方法通过子中心ArcFace在线估计样本难度：利用主导子中心余弦相似度产生的置信度分数，结合运行批次统计量将样本动态划分为简单、中等和困难三个层级，无需额外标注。可学习的权重参数引导模型从稳定的身份特征基础出发，经过流形优化阶段，最终实现边界锐化。据我们所知，这是迄今训练规模最大的说话人验证系统。在VoxCeleb1-O和SITW数据集上的评估表明，Curry相比子中心ArcFace基线将等错误率（EER）分别降低了86.8%和60.0%，为不完善大规模数据下的鲁棒性说话人验证建立了新范式。

Abstract

Speaker verification at large scale remains an open challenge as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores from dominant sub-center cosine similarity rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable weights guide the model from stable identity foundations through manifold refinement to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. Evaluated on VoxCeleb1-O and SITW, Curry reduces EER by 86.8% and 60.0%, respectively, over the Sub-center ArcFace baseline, establishing a new paradigm for robust speaker verification on imperfect large-scale data.

Keywords: speaker verification, large-scale, curriculum learning, adaptive loss, Sub-center ArcFace, sample difficulty estimation, noisy gradients, robust verification
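
The tiering step in the abstract (confidence scores ranked into easy, medium, and hard tiers using running batch statistics) can be sketched as follows. This is only one plausible reading: the exponential-moving-average form of the running statistics, the `mean ± k·std` thresholds, and all names are assumptions, since the abstract does not give the actual rule.

```python
class RunningStats:
    """Running mean and variance of confidence scores (assumed EMA form)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.mean = None
        self.var = 0.0

    def update(self, values):
        batch_mean = sum(values) / len(values)
        batch_var = sum((v - batch_mean) ** 2 for v in values) / len(values)
        if self.mean is None:
            # First batch initializes the statistics directly.
            self.mean, self.var = batch_mean, batch_var
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * batch_mean
            self.var = m * self.var + (1 - m) * batch_var


def assign_tiers(confidences, stats, k=0.5):
    """Label each sample easy/medium/hard relative to the running statistics.

    `confidences` stands in for the dominant sub-center cosine similarity
    of each sample; `k` is a hypothetical threshold width.
    """
    stats.update(confidences)
    std = stats.var ** 0.5
    upper, lower = stats.mean + k * std, stats.mean - k * std
    tiers = []
    for c in confidences:
        if c >= upper:
            tiers.append("easy")
        elif c <= lower:
            tiers.append("hard")
        else:
            tiers.append("medium")
    return tiers
```

No extra annotations are needed here because the tiers come only from the model's own confidence scores, which matches the "without auxiliary annotations" claim in the abstract.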

121. ❌ PINGALA: Prosody-Aware Decoding for Sanskrit Poetry Generation

Authors: Manoj Balaji Jagadeeshan, Atul Singh, Nallani Chakravartula Sahith, Amrith Krishna, Pawan Goyal Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24413v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

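Scoring rationale: The paper's core is Sanskrit poetry generation with a large language model (Phi-4), using instruction fine-tuning to improve performance. It is therefore highly relevant to 'Large Language Models' and 'Instruction Tuning' (10 points each). Other keywords such as MoE, SLMs, Scaling Laws, RAG, CoT, Agents, and Quantization are either not mentioned in the abstract or unrelated to the topic, and receive 0 points. Although the paper applies AI to a specific domain (poetry generation), it does not clearly fall under the bioinformatics or cheminformatics subfields of 'AI for Science', so that keyword also receives 0 points.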

!!! tip deepseek-chat TL;DR

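The paper proposes PINGALA, a decoding approach that combines grouped-line processing with a phonetically aware transliteration scheme. It significantly improves the ability of instruction fine-tuned large language models such as Phi-4 to generate semantically coherent poetry that follows Sanskrit prosodic rules.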

Abstract translation (Chinese)

梵语诗歌生成通常要求诗句在语义上连贯并遵循严格的韵律规则。在梵语诗律中，每行诗句通常是由固定数量的音节组成，这些音节需符合规定的轻重音节二元模式。我们观察到，与其将整节诗视为一个整体序列，不如将其按分组诗行进行分割，这能在保持相近格律遵循度的同时，将语义连贯性显著提升10%。具体而言，我们提出的解码方法PINGALA旨在促使每一行诗都形成结构完整的词语，并通过优先选择更长的词元来引导模型实现这一目标。梵语书写遵循音位正字法，因此采用具有语音感知的转写方案SLP1，对于像Phi-4这类经过指令微调的大型语言模型，能在保持相近语义相似度的前提下，将格律对齐度提高46%。我们还引入了一种基于交叉编码器的无参考评估新方法，该方法与真实诗歌实例取得了更好的对齐效果。

Abstract

Poetry generation in Sanskrit typically requires the verse to be semantically coherent and adhere to strict prosodic rules. In Sanskrit prosody, every line of a verse is typically a fixed-length sequence of syllables adhering to prescribed binary patterns of syllable weights. We observe that instead of treating a verse as a monolithic sequence, segmenting it into grouped lines leads to a significant improvement in semantic coherence of 10% with comparable metrical adherence. Specifically, PINGALA, our proposed decoding approach, is designed to encourage every line to have well-formed words, and our token selection biases the model towards it by preferring longer tokens. Writing in Sanskrit follows phonemic orthography, hence using a phonetically aware transliteration scheme, SLP1, increased the metrical alignment by 46% with comparable semantic similarity, for an instruction fine-tuned large language model such as Phi-4. We also introduce a new approach for reference-free evaluation using cross-encoders which achieved better alignment with true poetry instances.

Keywords: Sanskrit poetry generation, prosody-aware decoding, large language models, instruction fine-tuning, phonetic transliteration, metrical alignment, semantic coherence, reference-free evaluation
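
The "binary patterns of syllable weights" mentioned in the abstract can be illustrated with a small scanner over SLP1 text, where each sound is a single ASCII character. This is a simplified sketch of the textbook light/heavy rules, not the paper's decoder: the function names are hypothetical, and edge cases such as avagraha or optional line-final weight are ignored.

```python
SHORT_VOWELS = set("aiufx")
LONG_VOWELS = set("AIUFXeEoO")  # long vowels and diphthongs in SLP1
VOWELS = SHORT_VOWELS | LONG_VOWELS


def syllable_weights(line):
    """Return a string of 'G' (heavy/guru) and 'L' (light/laghu), one per syllable.

    Simplified rules: a syllable is heavy if its vowel is long, if the vowel is
    followed by anusvara (M) or visarga (H), if two or more consonants follow
    before the next vowel, or if a consonant closes the line.
    """
    chars = [c for c in line if not c.isspace()]  # clusters span word boundaries
    weights = []
    for i, ch in enumerate(chars):
        if ch not in VOWELS:
            continue
        # Collect everything between this vowel and the next one.
        j = i + 1
        while j < len(chars) and chars[j] not in VOWELS:
            j += 1
        cluster = chars[i + 1:j]
        heavy = (
            ch in LONG_VOWELS
            or any(c in "MH" for c in cluster)
            or len(cluster) >= 2
            or (j == len(chars) and len(cluster) >= 1)
        )
        weights.append("G" if heavy else "L")
    return "".join(weights)


def matches_meter(line, pattern):
    """Check a line against a target weight pattern such as 'GGGGLGGG'."""
    return syllable_weights(line) == pattern
```

Because SLP1 maps every phoneme to one character, the scan needs no multi-character lookahead for aspirates or diphthongs, which hints at why a phonetically aware transliteration could make metrical structure easier for a model to track.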

122. ❌ Towards Reward Modeling for AI Tutors in Math Mistake Remediation

Authors: Kseniia Petukhova, Ekaterina Kochmar Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24375v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

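Scoring rationale: The paper focuses on reward modeling for AI tutors in math mistake remediation. It is related to RLHF/DPO (8 points) because it develops Bradley-Terry preference models from human preferences, which falls under alignment and preference learning. It is related to LLMs (5 points) because AI tutors are likely built on large models, although the paper does not state the model type explicitly. It is related to CoT reasoning (5 points) because it involves mistake identification and reasoning scaffolds, though these are not the core topic. Other keywords such as MoE, SFT, and RAG are not covered and receive 0 points.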

!!! tip deepseek-chat TL;DR

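The study addresses the difficulty of evaluating the pedagogical quality of AI tutors in math mistake remediation. It derives a hierarchy of pedagogical aspects from human preferences, synthesizes contrastive response pairs, and trains reward models that reach 0.69 pairwise accuracy with synthetic data alone and 0.74 when weighted-sum data is combined with targeted synthetic groups, outperforming larger general-purpose reward models.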

Abstract translation (Chinese)

评估AI导师的教学质量仍具挑战性：标准自然语言生成指标无法判定其回应是否识别错误、搭建推理框架或避免直接揭示答案。针对错误纠正任务，我们基于MRBench中人工成对偏好数据推导出教学要素的层级结构，并合成了沿关键维度（如错误识别与定位、针对性、支架式引导、可操作性、清晰度与连贯性）存在最小差异的对比回应对。我们开发并发布了基于加权排序训练的Bradley-Terry偏好模型，该排序数据通过自动整合MRBench数据集、合成对比对及组合数据构建而成。仅使用合成数据时，我们最优模型在人工偏好测试中达到0.69的成对准确率；而将加权排序数据与定向合成组结合后，准确率提升至0.74，在使用仅0.5B参数骨干网络的情况下，其表现超越了规模更大的通用奖励模型。

Abstract

Evaluating the pedagogical quality of AI tutors remains challenging: standard NLG metrics do not determine whether responses identify mistakes, scaffold reasoning, or avoid revealing the answers. For the task of mistake remediation, we derive a hierarchy of pedagogical aspects from human pairwise preferences on MRBench, and synthesize minimally contrastive response pairs that differ along key aspects (e.g., mistake identification and location, targetedness, scaffolding, actionability, clarity, and coherence). We develop and release Bradley-Terry preference models trained on weighted-sum rankings that we automatically create from MRBench, synthetic pairs, and data combinations. Using only synthetic data, our best model reaches 0.69 pairwise accuracy on a human preference test, and combining weighted-sum data with targeted synthetic groups improves accuracy to 0.74, outperforming larger general-purpose reward models while using only a 0.5B-parameter backbone.

Keywords: AI tutors, mistake remediation, reward modeling, pedagogical quality, Bradley-Terry preference models, pairwise preferences, synthetic data, MRBench
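
For readers unfamiliar with the two quantities named in the abstract, the sketch below shows the standard Bradley-Terry preference probability, the corresponding training loss, and pairwise accuracy. It assumes a reward model that assigns each response a scalar score; the function names and example scores are illustrative and not taken from the paper.

```python
import math


def bt_probability(score_a, score_b):
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def bt_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the observed preference."""
    return -math.log(bt_probability(score_chosen, score_rejected))


def pairwise_accuracy(score_pairs):
    """Fraction of (chosen, rejected) score pairs where the chosen response scores higher."""
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)


# Hypothetical reward-model scores for four human-labeled preference pairs.
pairs = [(1.0, 0.0), (0.2, 0.5), (3.0, 1.0), (2.0, 2.5)]
accuracy = pairwise_accuracy(pairs)
```

Under this definition, the 0.69 and 0.74 figures in the abstract would correspond to the share of human-labeled pairs in which the model ranks the preferred tutor response higher.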

123. ❌ Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning

Author: Arsen Shebzukhov Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24372v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

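Scoring rationale: The paper uses Qwen3.5-2B to automatically translate natural-language mathematical text into the Lean4 formal language, so its core is the application of LLMs to mathematics. Highly relevant keywords: 1) 'Large Language Models' (uses Qwen3.5-2B; weight 1.0, rating 10.0); 2) 'Small Language Models' (a 2B-parameter model; weight 1.0, rating 8.0); 3) 'Post-training OR Supervised Fine-tuning OR SFT' (fine-tunes with SFT; weight 1.0, rating 10.0); 4) 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (uses GRPO reinforcement learning; weight 1.0, rating 8.0); 5) 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (uses LoRA for parameter-efficient fine-tuning; weight 1.0, rating 10.0); 6) 'AI for Science OR Bioinformatics OR Cheminformatics' (applied to mathematical research; weight 1.0, rating 10.0). Other keywords such as MoE, Scaling Laws, and RAG are not covered by the paper and are rated 0.0.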

!!! tip deepseek-chat TL;DR

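The paper studies cycle-consistency fine-tuning (with SFT and GRPO reinforcement learning) to improve Qwen3.5-2B at translating natural-language mathematical text into Lean4. The reinforcement learning approach substantially outperforms supervised fine-tuning at preserving semantic consistency.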

Abstract translation (Chinese)

自动形式化——将自然语言数学文本自动转换为如Lean4等形式化证明语言——能够通过证明验证或证明搜索，助力加速人工智能辅助的数学研究。本研究基于FineLeanCorpus数据集，采用LoRA方法对Qwen3.5-2B模型进行自然语言到Lean4形式化的微调，并比较三种训练策略：采用课程学习（难度从1到10递增）的监督微调、无课程排序的监督微调，以及使用基于循环一致性奖励的组相对策略优化的强化学习。循环一致性通过计算现成句子嵌入的余弦相似度，衡量语句在“自然语言→Lean4→自然语言”循环中意义保留的程度。在FineLeanCorpus未见子集和PutnamBench上的实验表明，强化学习方法显著优于两种监督微调变体（在FineLeanCorpus上平均循环一致性为0.669对比0.513；在PutnamBench上为0.561对比0.422），同时仅使交叉熵损失增加0.011纳特，且对形式化质量影响甚微。课程排序相比随机打乱训练未带来可测量的收益。

Abstract

Autoformalization - automatically translating natural language mathematical texts into a formal proof language such as Lean4 - can help accelerate AI-assisted mathematical research, be it via proof verification or proof search. I fine-tune Qwen3.5-2B with LoRA for natural language to Lean4 formalization on FineLeanCorpus and consider three training regimes: supervised fine-tuning (SFT) with curriculum learning (difficulty 1 to 10), SFT without curriculum ordering, and reinforcement learning using group relative policy optimization (GRPO) with a cycle consistency reward. Cycle consistency measures how well the meaning of a statement is preserved through a NL to Lean4 to NL’ loop, computed as cosine similarity of off-the-shelf sentence embeddings. On an unseen subset of FineLeanCorpus (FLC) and on PutnamBench, RL substantially outperforms both SFT variants (mean cycle consistency 0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench), while increasing cross-entropy loss by only 0.011 nats, with minimal impact on formalization quality. Curriculum ordering provides no measurable benefit over shuffled training.

Keywords: Autoformalization, Lean4, Qwen3.5-2B, LoRA, Supervised Fine-tuning, GRPO, Cycle Consistency, Mathematical Research
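
The cycle consistency reward described in the abstract (cosine similarity between embeddings of the original statement and its round-trip through Lean4) can be sketched as below. The `formalize` and `informalize` callables stand in for the two model passes, and `toy_embed` is a bag-of-words placeholder rather than a real sentence encoder; all of these names are assumptions for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Placeholder embedding: word counts. A real setup would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cycle_consistency(statement, formalize, informalize, embed=toy_embed):
    """Score how well meaning survives NL -> Lean4 -> NL'."""
    lean_code = formalize(statement)        # NL -> Lean4 (model under training)
    round_trip = informalize(lean_code)     # Lean4 -> NL' (back-translation)
    return cosine_similarity(embed(statement), embed(round_trip))
```

One design consequence worth noting: this reward only checks that the round-trip text resembles the input, so it says nothing by itself about whether the intermediate Lean4 statement type-checks.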

124. ❌ Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation

Authors: N J Karthika, Keerthana Suryanarayanan, Jahanvi Purohit, Ganesh Ramakrishnan, Jitin Singla, Anil Kumar Gourishetty Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24307v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

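Scoring rationale: The paper's main contribution is a new Hindi-Sanskrit parallel corpus, whose utility is evaluated by fine-tuning existing models (ByT5, NLLB, IndicTrans-v2). Its core is dataset creation and machine translation benchmarking rather than innovation in large models or deep learning techniques. The only relevant keyword is 'Post-training OR Supervised Fine-tuning OR SFT', since the paper fine-tunes models, but this applies existing techniques to evaluate the dataset and is not the core contribution. All other keywords concern specific techniques such as large-model architectures, training methods, inference optimization, alignment, and agent systems, and are unrelated to this dataset paper.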

!!! tip deepseek-chat TL;DR

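The paper releases Samasāmayik, a novel large-scale Hindi-Sanskrit parallel corpus, and shows by fine-tuning existing machine translation models that the dataset significantly improves translation of contemporary text, establishing a new performance baseline for low-resource Indic language machine translation.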

Abstract translation (Chinese)

我们发布了Samasāmayik——一个新颖、精心构建的大规模印地语-梵语平行语料库，包含92,196组平行句对。与现有大多数聚焦古典时期文本和诗歌的梵语数据不同，该语料库聚合了涵盖当代材料的多样化来源数据，包括口语教程、儿童杂志、广播对话和教学材料。我们通过微调三种互补模型——ByT5、NLLB和IndicTrans-v2对该数据集进行基准测试，以验证其实用性。实验表明，基于Samasamayik语料库训练的模型在领域内测试数据上取得显著性能提升，同时在其它广泛使用的测试集上保持相当性能，为当代印地语-梵语翻译建立了强大的新性能基线。此外，与现有语料库的对比分析显示，本数据集在语义和词汇层面重叠度极低，证实了其作为低资源印度语言机器翻译（MT）新型资源的创新性与非冗余性。

Abstract

We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus, comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical era text and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children’s magazines, radio conversations, and instruction materials. We benchmark this new dataset by fine-tuning three complementary models (ByT5, NLLB, and IndicTrans-v2) to demonstrate its utility. Our experiments demonstrate that models trained on the Samasāmayik corpus achieve significant performance gains on in-domain test data, while achieving comparable performance on other widely used test sets, establishing a strong new performance baseline for contemporary Hindi-Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals minimal semantic and lexical overlap, confirming the novelty and non-redundancy of our dataset as a robust new resource for low-resource Indic language MT.

Keywords: Hindi-Sanskrit machine translation, parallel corpus, low-resource language, dataset creation, fine-tuning, Indic languages, contemporary text, benchmark evaluation

125. ❌ SpinGQE: A Generative Quantum Eigensolver for Spin Hamiltonians

Authors: Alexander Holden, Moinul Hossain Rahat, Nii Osae Osae Dade Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24298v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

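Scoring rationale: The paper "SpinGQE: A Generative Quantum Eigensolver for Spin Hamiltonians" belongs to quantum computing, specifically the use of generative methods for the ground-state search problem of spin Hamiltonians. Its core technique extends the Generative Quantum Eigensolver (GQE) framework and uses a transformer-based decoder to learn a distribution over quantum circuits. Although the paper uses a Transformer architecture, it applies it only to quantum circuit generation and does not involve large language models (LLMs) or conventional deep learning applications such as natural language processing. It is therefore unrelated to most keywords (LLM techniques, training methods, inference optimization, alignment, agent systems, and so on). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to science (quantum physics); it is not a core match such as bioinformatics or cheminformatics, so it receives 5 points (moderately related).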

!!! tip deepseek-chat TL;DR

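The paper proposes SpinGQE, a quantum eigensolver based on a generative model for the ground-state search problem of spin Hamiltonians, and validates it on the four-qubit Heisenberg model.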

Abstract translation (Chinese)

基态搜索问题是量子计算的核心，其应用涵盖量子化学、凝聚态物理和优化领域。变分量子本征求解器（VQE）在小规模系统中展现出潜力，但面临显著局限，包括贫瘠高原、受限的拟设表达能力以及对领域特定结构的依赖。本文提出SpinGQE，将生成式量子本征求解器（Generative Quantum Eigensolver, GQE）框架扩展至自旋哈密顿量。我们的方法将电路设计重新构建为生成式建模任务。我们采用基于Transformer的解码器来学习能产生低能态量子电路的分布。训练通过模型逻辑值与每个门子序列评估的电路能量之间的加权均方误差损失来引导。我们在四量子比特海森堡模型上验证了该方法，展示了成功收敛至近基态的结果。通过系统化的超参数探索，我们确定了最优配置：较小的模型架构（12层，8个注意力头）、较长的序列长度（12个门）以及精心选择的算子池能产生最可靠的收敛。我们的结果表明，生成式方法能有效探索复杂的能量景观，而无需依赖问题特定的对称性或结构。这为一般量子系统提供了一种可扩展的传统变分方法替代方案。开源实现可在https://github.com/Mindbeam-AI/SpinGQE获取。

Abstract

The ground state search problem is central to quantum computing, with applications spanning quantum chemistry, condensed matter physics, and optimization. The Variational Quantum Eigensolver (VQE) has shown promise for small systems but faces significant limitations. These include barren plateaus, restricted ansatz expressivity, and reliance on domain-specific structure. We present SpinGQE, an extension of the Generative Quantum Eigensolver (GQE) framework to spin Hamiltonians. Our approach reframes circuit design as a generative modeling task. We employ a transformer-based decoder to learn distributions over quantum circuits that produce low-energy states. Training is guided by a weighted mean-squared error loss between model logits and circuit energies evaluated at each gate subsequence. We validate our method on the four-qubit Heisenberg model, demonstrating successful convergence to near-ground states. Through systematic hyperparameter exploration, we identify optimal configurations: smaller model architectures (12 layers, 8 attention heads), longer sequence lengths (12 gates), and carefully chosen operator pools yield the most reliable convergence. Our results show that generative approaches can effectively navigate complex energy landscapes without relying on problem-specific symmetries or structure. This provides a scalable alternative to traditional variational methods for general quantum systems. An open-source implementation is available at https://github.com/Mindbeam-AI/SpinGQE.

Keywords: SpinGQE, Generative Quantum Eigensolver, spin Hamiltonians, ground state search, transformer-based decoder, quantum circuits, Heisenberg model, variational methods
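
The training signal named in the abstract (a weighted mean-squared error between model logits and circuit energies evaluated at each gate subsequence) might look roughly like the sketch below. It assumes that the prediction for each circuit prefix is the running sum of the logits of the gates selected so far, and that weights default to uniform; both choices, along with the function names and sample numbers, are assumptions and not details confirmed by the abstract.

```python
from itertools import accumulate


def prefix_logit_sums(selected_logits):
    """Running sum of the logits of the chosen gates, one value per circuit prefix."""
    return list(accumulate(selected_logits))


def weighted_mse_loss(selected_logits, prefix_energies, weights=None):
    """Weighted MSE between per-prefix logit sums and measured prefix energies."""
    predictions = prefix_logit_sums(selected_logits)
    if weights is None:
        weights = [1.0] * len(predictions)  # uniform weighting by default
    squared_errors = (
        w * (p - e) ** 2
        for w, p, e in zip(weights, predictions, prefix_energies)
    )
    return sum(squared_errors) / sum(weights)


# Hypothetical three-gate circuit: logits of the sampled gates and the
# energy measured after each gate is appended.
loss = weighted_mse_loss([-1.0, -2.0, -0.5], [-1.0, -2.5, -3.5])
```

Supervising every prefix rather than only the full circuit gives the decoder a denser signal, since each partial circuit contributes its own energy target.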

126. ❌ Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning

Author: He Huang Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24258v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

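Scoring rationale: The paper studies word-level semantic alignment across four historical stages of Ancient Egyptian, using a compact encoder-decoder model trained on multiple tasks (MLM, TLM, translation, part-of-speech tagging) with auxiliary views (Latin transliteration, IPA reconstruction). All keywords concern large models, deep learning techniques, or innovative applications of AI to science, whereas this paper focuses on a specific task in historical linguistics and does not use or involve large models (such as LLMs), newer deep learning techniques (such as MoE, LoRA, or RAG), or AI for bioinformatics or cheminformatics. The relevance of every keyword is therefore 0.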

!!! tip deepseek-chat TL;DR

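The paper studies word-level semantic alignment across four historical stages of Ancient Egyptian under scarce parallel data, using multitask learning and normalization. The translation task contributes the largest alignment gains, but overall alignment remains limited; the work provides a reproducible baseline for modeling historical languages.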

Abstract translation (Chinese)

本研究探讨了古埃及语四个历史阶段间的词汇级语义对齐问题。这些阶段在文字系统和正字法上存在差异，且平行数据稀缺。我们采用共享的字节级分词器，联合训练了一个紧凑的编码器-解码器模型，该模型在固定权重和基于不确定性的缩放任务感知损失下，结合了掩码语言建模（MLM）、翻译语言建模（TLM）、序列到序列翻译以及词性标注任务。为减少表层差异，我们添加了拉丁转写和IPA（国际音标）重构作为辅助视图，并通过基于KL散度的一致性约束和嵌入层融合来整合这些视图。我们在精心构建的埃及语-英语及埃及语内部同源词数据集上，使用配对度量指标（特别是ROC-AUC和三元组准确率）评估对齐质量。结果表明，翻译任务带来的提升最为显著；结合KL一致性的IPA视图改善了跨语支对齐，而早期融合策略效果有限。尽管整体对齐程度仍有局限，但本研究为在现实约束下建模历史语言提供了可复现的基线结果与实践指导，同时揭示了在类型学距离较远的语境中，归一化方法与任务设计如何影响对齐的实质内涵。

Abstract

We study word-level semantic alignment across four historical stages of Ancient Egyptian. These stages differ in script and orthography, and parallel data are scarce. We jointly train a compact encoder-decoder model with a shared byte-level tokenizer on all four stages, combining masked language modeling (MLM), translation language modeling (TLM), sequence-to-sequence translation, and part-of-speech tagging under a task-aware loss with fixed weights and uncertainty-based scaling. To reduce surface divergence we add Latin transliteration and IPA reconstruction as auxiliary views. We integrate these views through KL-based consistency and through embedding-level fusion. We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets. Translation yields the strongest gains. IPA with KL consistency improves cross-branch alignment, while early fusion demonstrates limited efficacy. Although the overall alignment remains limited, the findings provide a reproducible baseline and practical guidance for modeling historical languages under real constraints. They also show how normalization and task design shape what counts as alignment in typologically distant settings.

Keywords: Semantic Alignment, Ancient Egyptian Language, Multitask Learning, Normalization, Encoder-Decoder Model, Translation Language Modeling, IPA Reconstruction, Historical Linguistics
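
Two pieces of the abstract lend themselves to a short sketch: the task-aware loss "with fixed weights and uncertainty-based scaling", and triplet accuracy as an alignment metric. The loss below uses a common uncertainty-weighting form, `exp(-s) * loss + s` with a learned log-variance `s` per task, which is an assumption since the abstract does not give the formula. The task names, weights, and similarity function are likewise illustrative.

```python
import math


def combined_loss(task_losses, fixed_weights, log_vars):
    """Sum per-task losses with a fixed weight and an uncertainty-based scale.

    Each task contributes fixed_weight * (exp(-s) * loss + s), where s is the
    task's log-variance (learned during training in a real setup).
    """
    total = 0.0
    for name, loss in task_losses.items():
        s = log_vars[name]
        total += fixed_weights[name] * (math.exp(-s) * loss + s)
    return total


def triplet_accuracy(triplets, similarity):
    """Share of (anchor, positive, negative) triplets ranked correctly."""
    hits = sum(
        1 for anchor, positive, negative in triplets
        if similarity(anchor, positive) > similarity(anchor, negative)
    )
    return hits / len(triplets)


# Hypothetical per-task losses for one training step.
step_loss = combined_loss(
    {"mlm": 2.0, "tlm": 1.5, "translation": 3.0, "pos": 1.0},
    {"mlm": 1.0, "tlm": 1.0, "translation": 1.0, "pos": 0.5},
    {"mlm": 0.0, "tlm": 0.0, "translation": 0.0, "pos": 0.0},
)
```

In practice `similarity` would compare learned word embeddings (for example by cosine similarity), with cognates across stages serving as the positive examples.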

127. ❌ Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution

Authors: Julia Matela, Frank Krüger Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24246v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

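Scoring rationale: The paper focuses on cross-document coreference resolution of software mentions, using a pre-trained Sentence-BERT model for semantic embeddings combined with knowledge-base lookup and HDBSCAN clustering. It is mainly about information extraction and clustering in NLP and has no direct connection to most of the large-model keywords (LLM, MoE, RLHF, RAG, and so on). The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper handles software mentions in scientific literature, which is an application of AI to science but not its core contribution, so it receives 5 points (some relevance). No other keyword is touched, and each scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes a hybrid framework combining semantic embeddings, knowledge-base lookup, and density-based clustering to resolve cross-document coreference of software mentions in scientific literature, achieving CoNLL F1 scores of 0.98, 0.98, and 0.96 on the three subtasks of the SOMD 2026 shared task.

Abstract translation

This paper describes the system submitted to the SOMD 2026 shared task on Cross-Document Coreference Resolution (CDCR) of software mentions. Our approach targets the challenge of identifying and clustering inconsistent software mentions across scientific corpora. We propose a hybrid framework that combines dense semantic embeddings from a pre-trained Sentence-BERT model; a Knowledge Base (KB) lookup strategy built from training-set cluster centroids, with FAISS for efficient retrieval; and HDBSCAN density-based clustering for mentions that cannot be confidently assigned to an existing cluster. Surface-form normalization and abbreviation resolution are applied to improve canonical name matching. The same core pipeline is used for Subtasks 1 and 2. For the large-scale setting of Subtask 3, the pipeline is adapted with a blocking strategy based on entity types and canonicalized surface forms. The system achieves CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3, respectively.

Abstract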

This paper describes the system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) of software mentions. Our approach addresses the challenge of identifying and clustering inconsistent software mentions across scientific corpora. We propose a hybrid framework that combines dense semantic embeddings from a pre-trained Sentence-BERT model, Knowledge Base (KB) lookup strategy built from training-set cluster centroids using FAISS for efficient retrieval, and HDBSCAN density-based clustering for mentions that cannot be confidently assigned to existing clusters. Surface-form normalization and abbreviation resolution are applied to improve canonical name matching. The same core pipeline is applied to Subtasks 1 and 2. To address the large scale settings of Subtask 3, the pipeline was adapted by utilising a blocking strategy based on entity types and canonicalized surface forms. Our system achieved CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3 respectively.

Keywords: Cross-Document Coreference Resolution, Software Mentions, Semantic Embeddings, Sentence-BERT, HDBSCAN Clustering, Knowledge Base Lookup, Scientific Corpora, CoNLL F1 Score
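
The centroid-lookup step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' system: toy 2-D vectors stand in for Sentence-BERT embeddings, a brute-force loop stands in for FAISS retrieval, and the function names (`build_kb`, `assign`) and the 0.8 similarity threshold are hypothetical. Mentions for which `assign` returns `None` are the ones that would be handed to HDBSCAN, which is not implemented here.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def build_kb(train_clusters: Dict[str, List[Vector]]) -> Dict[str, List[float]]:
    """Average the embeddings of each training cluster into one centroid."""
    kb = {}
    for name, vectors in train_clusters.items():
        dim = len(vectors[0])
        kb[name] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return kb


def assign(mention: Vector, kb: Dict[str, List[float]],
           threshold: float = 0.8) -> Optional[str]:
    """Return the best-matching cluster, or None if no centroid is close enough."""
    best_name, best_sim = None, -1.0
    for name, centroid in kb.items():
        sim = cosine(mention, centroid)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None


# Toy embeddings (assumed); real ones would come from a sentence encoder.
kb = build_kb({"numpy": [[1.0, 0.0], [0.9, 0.1]], "scipy": [[0.0, 1.0]]})
confident = assign([1.0, 0.05], kb)  # close to the "numpy" centroid
uncertain = assign([0.7, 0.7], kb)   # below threshold: left for density-based clustering
```

The threshold controls the trade-off between reusing known clusters and deferring to clustering; a real system would tune it on held-out data.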

128. ❌ Stance Labels Fail When They Matter Most: The Projection Problem in Stance Detection

Authors: Bowen Zhang Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24231v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

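Scoring rationale: The paper studies annotation problems in stance detection and focuses on social media text analysis. It does not involve large models, the principles of deep learning techniques, or scientific applications, and has no direct connection to any of the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper identifies a projection problem in stance detection: when a text's multi-dimensional attitudes conflict, forcing them into a single label makes label agreement collapse, while agreement on the individual dimensions holds up.

Abstract translation

Stance detection is almost always formulated as classifying text as Favor, Against, or Neutral, a convention inherited from debate analysis and applied unchanged to social media analysis since SemEval-2016. Yet attitudes toward complex targets are not unitary: a person can accept climate science while opposing carbon taxes, expressing support on one dimension and opposition on another. When annotators must compress such multi-dimensional attitudes into a single label, different annotators weight different dimensions, and the resulting disagreement reflects not confusion but different compression choices. We call this the projection problem and show that its cost is conditional: when a text's dimensions align, any weighting yields the same label and three-way annotation works well; when dimensions conflict, label agreement collapses while agreement on the individual dimensions is preserved. A pilot study on SemEval-2016 Task 6 confirms this crossover: on dimension-consistent texts, label agreement (Krippendorff's $\alpha = 0.307$) exceeds dimensional agreement ($\alpha = 0.082$); on dimension-conflicting texts the pattern reverses, with label $\alpha$ dropping to $0.085$ while dimensional $\alpha$ rises to $0.334$ and the Policy dimension reaches $0.572$. The projection problem is real, but it activates precisely where it matters most.

Abstract

Stance detection is nearly always formulated as classifying text into Favor, Against, or Neutral – a convention inherited from debate analysis and applied without modification to social media since SemEval-2016. But attitudes toward complex targets are not unitary: a person can accept climate science while opposing carbon taxes, expressing support on one dimension and opposition on another. When annotators must compress such multi-dimensional attitudes into a single label, different annotators weight different dimensions – producing disagreement that reflects not confusion but different compression choices. We call this the projection problem, and show that its cost is conditional: when a text’s dimensions align, any weighting yields the same label and three-way annotation works well; when dimensions conflict, label agreement collapses while agreement on individual dimensions remains intact. A pilot study on SemEval-2016 Task 6 confirms this crossover: on dimension-consistent texts, label agreement (Krippendorff’s $\alpha = 0.307$) exceeds dimensional agreement ($\alpha = 0.082$); on dimension-conflicting texts, the pattern reverses – label $\alpha$ drops to $0.085$ while dimensional $\alpha$ rises to $0.334$, with Policy reaching $0.572$. The projection problem is real – but it activates precisely where it matters most.

Keywords: stance detection, projection problem, annotation disagreement, multi-dimensional attitudes, label agreement, dimensional agreement, SemEval-2016, social media analysis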
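
The conditional cost described in this abstract can be illustrated with a toy model. The sketch below assumes, purely for illustration, that each annotator collapses per-dimension stances (+1 favor, -1 against, 0 neutral) into one label by taking the sign of a weighted sum; the abstract does not specify a projection rule, and the dimension names and weights are hypothetical.

```python
from typing import Dict


def project(stances: Dict[str, int], weights: Dict[str, float]) -> str:
    """Collapse per-dimension stances into a single three-way label."""
    total = sum(weights[dim] * value for dim, value in stances.items())
    if total > 0:
        return "Favor"
    if total < 0:
        return "Against"
    return "Neutral"


# Two hypothetical annotators who weight the dimensions differently.
annotator_a = {"science": 0.8, "policy": 0.2}
annotator_b = {"science": 0.2, "policy": 0.8}

aligned = {"science": 1, "policy": 1}       # dimensions agree
conflicting = {"science": 1, "policy": -1}  # accepts the science, opposes the policy

aligned_labels = (project(aligned, annotator_a), project(aligned, annotator_b))
conflict_labels = (project(conflicting, annotator_a), project(conflicting, annotator_b))
```

With aligned dimensions both annotators produce the same label whatever their weights; with conflicting dimensions the label depends on the weighting, even though both annotators read each individual dimension identically.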

129. ❌ Variation is the Norm: Embracing Sociolinguistics in NLP

Authors: Anne-Marie Lutgen, Alistair Plum, Verena Blaschke, Barbara Plank, Christoph Purschke Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24222v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

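Scoring rationale: The paper studies how NLP handles language variation (such as spelling variation), proposes a framework that brings in sociolinguistics, and uses a Luxembourgish case study to show how variation affects model performance and how fine-tuning can improve it. Its core is the treatment of language variation in NLP combined with sociolinguistics, not an innovation in large models or deep learning techniques. It has some relevance (5 points) only to "Post-training OR Supervised Fine-tuning OR SFT", because the paper mentions a fine-tuning process to improve performance, but this is not its core contribution. The other keywords are unrelated to the paper and score 0. The weighted total is computed as 5.0 (5.0 × 1.0).

!!! tip deepseek-chat TL;DR

The paper studies how to integrate the language variation studied in sociolinguistics into an NLP framework; a Luxembourgish case study shows that models are not robust to orthographic variation, and the paper proposes improving performance through fine-tuning.

Abstract translation

In Natural Language Processing (NLP), language variation is usually treated as noise and "normalised away" before processing, even though it is an integral part of language. By contrast, studying language variation in social contexts is a central concern of sociolinguistics. This paper presents a framework that combines the sociolinguistic dimension of language with the technical dimension of NLP. We argue that, by embracing a sociolinguistic perspective, variation can be actively built into the research setup and in turn inform NLP. To illustrate this, we present a case study on Luxembourgish, an evolving language with a large amount of orthographic variation, and show how this variation affects NLP performance. The results show large discrepancies in the performance of models tested and fine-tuned on data with heavy orthographic variation compared with data closer to the (orthographic) standard. We also offer a possible way to improve performance by including variation in the fine-tuning process. The case study underlines the importance of including variation in the research setup, since current models are not robust to the variation that actually occurs. Our framework helps bring variation into technical thinking while remaining grounded in sociolinguistic theory.

Abstract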

In Natural Language Processing (NLP), variation is typically seen as noise and “normalised away” before processing, even though it is an integral part of language. Conversely, studying language variation in social contexts is central to sociolinguistics. We present a framework to combine the sociolinguistic dimension of language with the technical dimension of NLP. We argue that by embracing sociolinguistics, variation can actively be included in a research setup, in turn informing the NLP side. To illustrate this, we provide a case study on Luxembourgish, an evolving language featuring a large amount of orthographic variation, demonstrating how NLP performance is impacted. The results show large discrepancies in the performance of models tested and fine-tuned on data with a large amount of orthographic variation in comparison to data closer to the (orthographic) standard. Furthermore, we provide a possible solution to improve the performance by including variation in the fine-tuning process. This case study highlights the importance of including variation in the research setup, as models are currently not robust to occurring variation. Our framework facilitates the inclusion of variation in the thought-process while also being grounded in the theoretical framework of sociolinguistics.

Keywords: Natural Language Processing, sociolinguistics, language variation, orthographic variation, fine-tuning, Luxembourgish, model robustness, NLP framework

130. ❌ A visual observation on the geometry of UMAP projections of the difference vectors of antonym and synonym word pair embeddings

Authors: Rami Luisto Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24150v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

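Scoring rationale: The paper is a visual analysis of the geometry of word embeddings, specifically UMAP projections of the difference vectors of antonym and synonym pairs. Although it mentions transformer models, its focus is the geometric analysis of word embeddings rather than the principles, training methods, inference optimization, alignment, or applications of large models. All keywords concern specific parts of the large-model stack (training, inference, alignment, applications, and so on), whereas this paper is a basic geometric analysis of word embeddings and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper uses UMAP projections to visualize the geometry of the difference vectors of antonym and synonym pair embeddings and finds a specific "swirl" pattern that appears across embedding models.

Abstract translation

Antonyms are sometimes defined as word pairs that share all contextually relevant properties but one. Given that transformer models seem to encode concepts as directions, this raises the question of whether "antonymity" can be detected from the geometry of the embedding vectors of word pairs, especially their difference vectors. Such geometric studies are naturally framed by contrasting antonym pairs with their opposites, synonyms. This paper began as an exploratory project on the complexity of the systems needed to detect the geometry of the embedding vectors of antonymic word pairs. What we now report is a curious "swirl" that appears across multiple embedding models in a fairly specific projection configuration.

Abstract

Antonyms, or opposites, are sometimes defined as word pairs that have all of the same contextually relevant properties but one. Seeing how transformer models seem to encode concepts as directions, this begs the question if one can detect "antonymity" in the geometry of the embedding vectors of word pairs, especially based on their difference vectors. Such geometrical studies are then naturally contrasted by comparing antonymic pairs to their opposites; synonyms. This paper started as an exploratory project on the complexity of the systems needed to detect the geometry of the embedding vectors of antonymic word pairs. What we now report is a curious "swirl" that appears across embedding models in a somewhat specific projection configuration.

Keywords: word embeddings, antonyms, synonyms, UMAP projection, geometry, difference vectors, transformer models, visual analysis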

131. ❌ ConceptKT: A Benchmark for Concept-Level Deficiency Prediction in Knowledge Tracing

Authors: Yu-Chen Kang, Yu-Chien Tang, An-Zi Yen Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24073v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

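Scoring rationale: The paper explicitly uses Large Language Models (LLMs) for knowledge-tracing evaluation and explores in-context learning approaches, so these two keywords are highly relevant (10 points each). The paper concerns an AI application in education, which falls under AI for Science in a broad sense and has some relevance (5 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper's technical content (0 points).

!!! tip deepseek-chat TL;DR

The paper introduces the task of concept-level deficiency prediction, builds the ConceptKT dataset, explores in-context learning based on conceptual alignment and semantic similarity, and evaluates how well large language models predict students' concept-level deficiencies in knowledge tracing.

Abstract translation

Knowledge Tracing (KT) is a key technique for modeling students' knowledge states to support personalized learning. However, most KT systems focus only on binary correctness prediction and cannot diagnose the underlying conceptual misunderstandings that lead to errors. Such fine-grained diagnostic feedback is essential for designing targeted instruction and effective remediation. In this work, we introduce the task of concept-level deficiency prediction, which extends the traditional KT framework by identifying the specific concepts a student is likely to struggle with on future problems. We present the ConceptKT dataset, whose annotations cover both the concepts required to solve each question and the missing concepts behind incorrect answers. We investigate in-context learning approaches to KT and evaluate the diagnostic capabilities of a range of Large Language Models (LLMs) and Large Reasoning Models (LRMs). We also explore different strategies for selecting informative historical records. Experimental results show that selecting response histories based on conceptual alignment and semantic similarity improves both correctness prediction and concept-level deficiency identification.

Abstract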

Knowledge Tracing (KT) is a critical technique for modeling student knowledge to support personalized learning. However, most KT systems focus on binary correctness prediction and cannot diagnose the underlying conceptual misunderstandings that lead to errors. Such fine-grained diagnostic feedback is essential for designing targeted instruction and effective remediation. In this work, we introduce the task of concept-level deficiency prediction, which extends traditional KT by identifying the specific concepts a student is likely to struggle with on future problems. We present ConceptKT, a dataset annotated with labels that capture both the concepts required to solve each question and the missing concepts underlying incorrect responses. We investigate in-context learning approaches to KT and evaluate the diagnostic capabilities of various Large Language Models (LLMs) and Large Reasoning Models (LRMs). Different strategies for selecting informative historical records are explored. Experimental results demonstrate that selecting response histories based on conceptual alignment and semantic similarity leads to improved performance on both correctness prediction and concept-level deficiency identification.

Keywords: Knowledge Tracing, Concept-level Deficiency Prediction, Large Language Models, In-context Learning, Conceptual Alignment, Semantic Similarity, Educational AI

132. ❌ FinToolSyn: A forward synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval

Authors: Caishuang Huang, Yang Qiao, Rongyu Zhang, Junjie Ye, Pu Lu, Wenxi Wu, Meng Zhou, Xiku Du, Tao Gui, Qi Zhang, Xuanjing Huang Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24051v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

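Scoring rationale: The paper's core is the tool-use capability of LLMs in finance, which is highly relevant to "Large Language Models", "Tool Use", and "LLM Agents" (10 points each). Its dynamic tool retrieval has some connection to RAG (5 points). Other keywords such as MoE, SFT, and quantization are not covered (0 points).

!!! tip deepseek-chat TL;DR

Addressing the limitations of existing methods for synthesizing LLM tool-use data in finance, the paper proposes FinToolSyn, a forward synthesis framework that generates high-quality dialogue data through dynamic tool retrieval; experiments show that models trained on this data improve by 21.06% in financial tool-calling capability.

Abstract translation

In finance, a domain with a vast number of investment targets and data-intensive queries, tool-use capabilities are vital for Large Language Models (LLMs). However, existing data synthesis methods typically rely on a reverse synthesis paradigm that generates user queries from pre-sampled tools. This approach inevitably introduces artificial explicitness, so the generated queries fail to capture the implicit, event-driven nature of real-world needs. Its reliance on static tool sets also overlooks the dynamic retrieval process needed to navigate a massive tool space. To address these challenges, we propose FinToolSyn, a forward synthesis framework designed to generate high-quality financial dialogues. Progressing from persona instruction and atomic tool synthesis to dynamic retrieval dialogue generation, our pipeline builds a repository of 43,066 tools and synthesizes more than 148,000 dialogue instances, using dynamic retrieval to emulate the noisy candidate sets typical of massive tool spaces. We also establish a dedicated benchmark for evaluating tool-calling capabilities in realistic financial scenarios. Extensive experiments show that models trained on FinToolSyn achieve a 21.06% improvement, providing a solid foundation for tool learning in financial scenarios.

Abstract

Tool-use capabilities are vital for Large Language Models (LLMs) in finance, a domain characterized by massive investment targets and data-intensive inquiries. However, existing data synthesis methods typically rely on a reverse synthesis paradigm, generating user queries from pre-sampled tools. This approach inevitably introduces artificial explicitness, yielding queries that fail to capture the implicit, event-driven nature of real-world needs. Moreover, its reliance on static tool sets overlooks the dynamic retrieval process required to navigate massive tool spaces. To address these challenges, we introduce FinToolSyn, a forward synthesis framework designed to generate high-quality financial dialogues. Progressing from persona instruction and atomic tool synthesis to dynamic retrieval dialogue generation, our pipeline constructs a repository of 43,066 tools and synthesizes over 148k dialogue instances, incorporating dynamic retrieval to emulate the noisy candidate sets typical of massive tool spaces. We also establish a dedicated benchmark to evaluate tool-calling capabilities in realistic financial scenarios. Extensive experiments demonstrate that models trained on FinToolSyn achieve a 21.06% improvement, providing a robust foundation for tool learning in financial scenarios.

Keywords: Large Language Models, Tool-use, Financial dialogues, Dynamic tool retrieval, Data synthesis, Forward synthesis framework, Tool-calling capabilities, Benchmark evaluation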

Authors: Wassim Swaileh, Mohammed-En-Nadhir Zighem, Hichem Telli, Salah Eddine Bekhouche, Abdellah Zakaria Sellam, Fadi Dornaika, Dimitrios Kotzinos Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24012v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

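Scoring rationale: The paper's core is multi-stage legal reasoning over Islamic inheritance law with a RAG-augmented LLM. Highly relevant keywords: LLMs (core model), RAG (core method), CoT Reasoning (multi-stage reasoning), and System 2 Thinking (in-depth legal reasoning). Moderately relevant: Scaling Laws AND Data Quality (generation of high-quality synthetic data), Hallucination Mitigation (improved reliability), and AI for Science (application to legal science). The remaining keywords are not covered.

!!! tip deepseek-chat TL;DR

The paper proposes an LLM system based on retrieval-augmented generation (RAG) for complex multi-stage legal reasoning in Islamic inheritance law; it achieved a MIR-E score of 0.935 in the official evaluation and ranked first.

Abstract translation

Islamic inheritance law (Ilm al-Mawarith) is a multi-stage legal reasoning task that requires identifying eligible heirs, resolving blocking rules (hajb), assigning fixed and residual shares, handling adjustments such as awl and radd, and producing a consistent final distribution. Differences across legal schools and civil-law codifications complicate the task further and require models to operate under explicit legal configurations. We present a retrieval-augmented generation (RAG) pipeline for this setting that combines rule-grounded synthetic data generation, hybrid retrieval (dense retrieval and BM25) with cross-encoder reranking, and schema-constrained output validation. A symbolic inheritance calculator generates a large, high-quality synthetic corpus with full intermediate reasoning traces, ensuring legal and numerical consistency. The proposed system achieves a MIR-E score of 0.935 and ranks first on the official QIAS 2026 blind-test leaderboard. The results show that retrieval-grounded, schema-aware generation significantly improves reliability in high-precision Arabic legal reasoning tasks.

Abstract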

Islamic inheritance (Ilm al-Mawarith) is a multi-stage legal reasoning task requiring the identification of eligible heirs, resolution of blocking rules (hajb), assignment of fixed and residual shares, handling of adjustments such as awl and radd, and generation of a consistent final distribution. The task is further complicated by variations across legal schools and civil-law codifications, requiring models to operate under explicit legal configurations. We present a retrieval-augmented generation (RAG) pipeline for this setting, combining rule-grounded synthetic data generation, hybrid retrieval (dense and BM25) with cross-encoder reranking, and schema-constrained output validation. A symbolic inheritance calculator is used to generate a large high-quality synthetic corpus with full intermediate reasoning traces, ensuring legal and numerical consistency. The proposed system achieves a MIR-E score of 0.935 and ranks first on the official QIAS 2026 blind-test leaderboard. Results demonstrate that retrieval-grounded, schema-aware generation significantly improves reliability in high-precision Arabic legal reasoning tasks.

Keywords: Retrieval-Augmented Generation, Large Language Models, Legal Reasoning, Islamic Inheritance, Multi-stage Reasoning, Schema-constrained Validation, Synthetic Data Generation, Arabic Legal AI
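
The abstract mentions hybrid retrieval over dense and BM25 results but does not say how the two rankings are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the paper's method. The document IDs are made up, `k = 60` is the constant conventionally used with RRF, and the cross-encoder reranking stage is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for one inheritance question.
dense_ranking = ["rule_hajb", "rule_awl", "rule_radd"]
bm25_ranking = ["rule_hajb", "rule_fixed_shares", "rule_awl"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

Documents returned by both retrievers rise to the top, while a document found by only one retriever is still kept as a candidate for a later reranking step.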

Authors: Kun-Yang Yu, Zhi Zhou, Shi-Yu Tian, Xiao-Wen Yang, Zi-Yi Jia, Ming Yang, Zi-Jian Cheng, Lan-Zhe Guo, Yu-Feng Li Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24004v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

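Scoring rationale: The paper proposes the TWT method, whose core is strengthening the ability of Multimodal Large Language Models (MLLMs) to understand tabular data; it is an application study of large models in a specific domain (table understanding). The paper explicitly uses MLLMs, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). The method uses a neuro-symbolic reasoning mechanism with program-aided, code-based reasoning, which is in essence a multi-step, in-depth reasoning process, so it is highly relevant to "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" and "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (10 points each). The paper does not involve the specific techniques behind the other keywords (MoE, quantization, alignment, and so on), nor does it clearly belong to a scientific domain such as bioinformatics, so the other keywords score 0.

!!! tip deepseek-chat TL;DR

For the Tabular-Vision Multi-Modal Understanding task, the paper proposes Thinking with Tables, a neuro-symbolic reasoning method that substantially improves how well multimodal large language models understand tabular data, raising average accuracy by 10% across several datasets and matching or surpassing commercial SOTA models.

Abstract translation

Multimodal Large Language Models (MLLMs) have shown remarkable reasoning capabilities across modalities such as images and text. However, tabular data, a critical real-world modality, remains relatively underexplored in multimodal learning. This paper focuses on the Tabular-Vision Multi-Modal Understanding (TVMU) task and identifies three core challenges: (1) high structural variability and data incompleteness in tables, (2) implicit and complex dependencies among features, and (3) significant heterogeneity in problem-solving pipelines across downstream tasks. To address these issues, we propose Thinking with Tables (TWT). TWT uses a program-aided, code-based neuro-symbolic reasoning mechanism that supports key operations such as information extraction and element modeling by interacting with external environments. We evaluate TWT on eight representative datasets. Experimental results show that TWT outperforms existing baselines by an average of 10% in accuracy and achieves performance comparable to, or better than, proprietary commercial SOTA LLMs on TVMU tasks. Models and code are available at https://github.com/kunyang-YU/Thinking-with-Tables.

Abstract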

Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities across modalities such as images and text. However, tabular data, despite being a critical real-world modality, remains relatively underexplored in multimodal learning. In this paper, we focus on the task of Tabular-Vision Multi-Modal Understanding (TVMU) and identify three core challenges: (1) high structural variability and data incompleteness in tables, (2) implicit and complex feature dependencies, and (3) significant heterogeneity in problem-solving pipelines across downstream tasks. To address these issues, we propose Thinking with Tables (TWT). TWT employs a program-aided code-based neuro-symbolic reasoning mechanism that facilitates key operations, such as information extraction and element modeling, by interacting with external environments. We evaluate TWT on eight representative datasets. Experimental results demonstrate that TWT consistently outperforms existing baselines by an average of 10% in accuracy, achieving performance comparable to, or even surpassing, proprietary commercial SOTA LLMs on TVMU tasks. Models and codes are available at https://github.com/kunyang-YU/Thinking-with-Tables

Keywords: Multimodal Large Language Models, Tabular-Vision Multi-Modal Understanding, Neuro-Symbolic Reasoning, Program-aided Code-based Reasoning, Table Understanding, MLLMs, TVMU, TWT
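
The abstract describes program-aided, code-based reasoning in which the model interacts with an external environment. As a rough illustration of that general pattern (not the TWT implementation, whose details the abstract does not give), the sketch below executes a model-written snippet against a table and reads back a result. The table contents, the `GENERATED_PROGRAM` string, and the helper names `load_rows` and `run_program` are all hypothetical.

```python
import csv
import io

# Hypothetical table with a missing value, standing in for extracted table content.
TABLE_CSV = """name,age,income
Ann,34,52000
Bob,,48000
Cid,29,61000
"""

# Stand-in for code a model might emit; a real system would obtain this string
# from the MLLM rather than hard-coding it.
GENERATED_PROGRAM = """
ages = [int(r["age"]) for r in rows if r["age"]]
result = sum(ages) / len(ages)
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per table row."""
    return list(csv.DictReader(io.StringIO(text)))


def run_program(program, rows):
    """Run model-written code with a small whitelist of builtins and return `result`.

    Limiting builtins only narrows what the snippet can call; it is not a
    security sandbox.
    """
    scope = {
        "rows": rows,
        "__builtins__": {"int": int, "float": float, "sum": sum, "len": len},
    }
    exec(program, scope)
    return scope["result"]


answer = run_program(GENERATED_PROGRAM, load_rows(TABLE_CSV))
print(answer)  # mean age over rows that have an age
```

The symbolic step (the executed program) handles the incomplete row explicitly, which is the kind of operation that is awkward to do reliably in free-form text.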

135. ❌ Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

Authors: Yao Chen, Yilong Chen, Yinqi Yang, Junyuan Shang, Zhenyu Zhang, Zefeng Zhang, Shuaiyi Nie, Shuohuan Wang, Yu Sun, Hua Wu, HaiFeng Wang, Tingwen Liu Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23998v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

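Scoring rationale: The paper proposes the Sparse Growing Transformer (SGT), which focuses on training-time sparse depth allocation achieved through progressive attention looping. Core relevant keywords: (1) 'Mixture of Experts OR MoE OR Sparse Models' (10 points): the paper centers on sparse models and sparse depth allocation; (2) 'Large Language Models OR LLMs OR Foundation Models' (8 points): it builds on the Transformer architecture and is an innovation in the technical principles of large models; (3) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points): it concerns optimization of the training process. Other keywords, such as inference acceleration, alignment, and AI-for-science applications, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

To address the computational redundancy caused by statically preset depth allocation in Transformer training, the paper proposes the Sparse Growing Transformer (SGT), which allocates sparse depth dynamically during training through progressive attention looping. It outperforms static block-level looping baselines while cutting the additional training FLOPs overhead from roughly 16–20% to 1–3%.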

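Abstract (translated)

Mainstream methods for increasing the effective depth of Transformers rely mainly on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure stays static along the training timeline, and additional computational depth is assigned uniformly to entire blocks at the parameter level. This rigidity across training time and parameter space causes substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset but a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, in which high-entropy attention heads play a key role in semantic integration. Building on this finding, we propose the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that extends recurrence progressively from deeper to shallower layers through targeted looping on informative attention heads. By selectively increasing depth for only a small subset of parameters as training proceeds, the mechanism introduces structural sparsity. Extensive experiments across multiple parameter scales show that, under comparable settings, SGT consistently outperforms training-time static block-level looping baselines while reducing the additional training FLOPs overhead, measured relative to a standard Transformer backbone, from about 16–20% to only 1–3%.

Abstract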

Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16–20% to only 1–3% relative to a standard Transformer backbone.

Keywords: Sparse Growing Transformer, training-time sparse depth allocation, progressive attention looping, Transformer depth, sparse models, attention heads, computational redundancy, FLOPs reduction
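
The abstract singles out high-entropy attention heads as the targets for looping. A minimal sketch of how such heads could be ranked is shown below, assuming attention weights are available as per-head lists of probability rows. The top-k rule and the function names are illustrative assumptions; the abstract does not state SGT's actual selection criterion.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over keys."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)


def mean_head_entropy(head_attn):
    """Average entropy over all query positions of a single head."""
    return sum(attention_entropy(row) for row in head_attn) / len(head_attn)


def select_heads_to_loop(layer_attn, k):
    """Return the indices of the k highest-entropy heads in a layer (assumed rule)."""
    scores = [mean_head_entropy(head) for head in layer_attn]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy layer: 3 heads, 2 query positions, 4 keys each.
layer = [
    [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]],  # sharply peaked head
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],  # uniform head
    [[0.40, 0.30, 0.20, 0.10], [0.10, 0.20, 0.30, 0.40]],  # intermediate head
]

chosen = select_heads_to_loop(layer, k=2)
print(chosen)  # indices of the two most diffuse heads
```

A uniform distribution over four keys has entropy ln 4, the maximum for this toy case, so the uniform head always ranks first here.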

136. ❌ CoCR-RAG: Enhancing Retrieval-Augmented Generation in Web Q&A via Concept-oriented Context Reconstruction

Authors: Kaize Shi, Xueyao Sun, Qika Lin, Firoj Alam, Qing Li, Xiaohui Tao, Guandong Xu Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23989v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

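Scoring rationale: The paper's core contribution is an improvement to the RAG framework, so it is highly relevant to 'Retrieval-Augmented Generation' (10 points); it uses LLMs to fuse concepts, so it is highly relevant to 'Large Language Models' (10 points). Its focus on fusing multi-source documents to improve factual consistency gives it some connection to 'Hallucination Mitigation' (5 points). Other keywords, such as MoE, SLMs, Scaling Laws, and Alignment, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the CoCR-RAG framework, which tackles multi-source information fusion through concept-oriented context reconstruction and significantly outperforms existing context-reconstruction methods on the PopQA and EntityQuestions Web Q&A benchmarks.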

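Abstract (translated)

Retrieval-augmented generation has shown notable potential for improving question-answering systems by incorporating information from the web and other external sources. However, the supporting documents retrieved from the heterogeneous web usually come from multiple sources and differ considerably in writing style, text format, and information granularity. Fusing such multi-source documents into a coherent, knowledge-intensive context remains a major challenge, because irrelevant and redundant content can undermine the factual consistency of the generated answers. This paper proposes a concept-oriented context reconstruction RAG framework (CoCR-RAG) that addresses multi-source information fusion through linguistically grounded, concept-level integration. Specifically, we design a concept distillation algorithm that extracts core concepts from Abstract Meaning Representation (AMR), a stable semantic representation that structures the meaning of text as a logical graph. A large language model then fuses the distilled concepts from multiple retrieved documents and reconstructs them into a unified, information-dense context, adding only the sentence elements needed to highlight the core knowledge. Experiments on the PopQA and EntityQuestions datasets show that the framework significantly outperforms existing context-reconstruction methods on these Web Q&A benchmarks. The framework is also robust across different backbone LLMs, making it a flexible plug-and-play component that can be adapted to a variety of RAG frameworks.

Abstract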

Retrieval-augmented generation (RAG) has shown promising results in enhancing Q&A by incorporating information from the web and other external sources. However, the supporting documents retrieved from the heterogeneous web often originate from multiple sources with diverse writing styles, varying formats, and inconsistent granularity. Fusing such multi-source documents into a coherent and knowledge-intensive context remains a significant challenge, as the presence of irrelevant and redundant information can compromise the factual consistency of the inferred answers. This paper proposes the Concept-oriented Context Reconstruction RAG (CoCR-RAG), a framework that addresses the multi-source information fusion problem in RAG through linguistically grounded concept-level integration. Specifically, we introduce a concept distillation algorithm that extracts essential concepts from Abstract Meaning Representation (AMR), a stable semantic representation that structures the meaning of texts as logical graphs. The distilled concepts from multiple retrieved documents are then fused and reconstructed into a unified, information-intensive context by Large Language Models, which supplement only the necessary sentence elements to highlight the core knowledge. Experiments on the PopQA and EntityQuestions datasets demonstrate that CoCR-RAG significantly outperforms existing context-reconstruction methods across these Web Q&A benchmarks. Furthermore, CoCR-RAG shows robustness across various backbone LLMs, establishing itself as a flexible, plug-and-play component adaptable to different RAG frameworks.

Keywords: Retrieval-Augmented Generation, RAG, Large Language Models, Concept-oriented Context Reconstruction, Multi-source Information Fusion, Abstract Meaning Representation, Web Q&A, Factual Consistency
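
The abstract says concepts are distilled from AMR graphs and then fused across retrieved documents. To make that concrete, the sketch below pulls concept labels out of AMR graphs written in the usual PENMAN text notation and merges them without duplicates. This is only a surface-level illustration under stated assumptions: the regex-based extraction, the function names, and the example graphs are hypothetical, and the paper's concept distillation algorithm is not reproduced here.

```python
import re

# In PENMAN notation each node is written "(variable / concept ...)".
CONCEPT = re.compile(r"/\s*([^\s()]+)")
# PropBank-style sense suffixes such as "-01" are stripped for readability.
SENSE = re.compile(r"-\d+$")


def concepts(amr):
    """Return concept labels in order of appearance, without sense suffixes.

    Simplification: a "/" inside a quoted string literal would also match.
    """
    return [SENSE.sub("", c) for c in CONCEPT.findall(amr)]


def merge_concepts(amr_graphs):
    """Merge concepts from several graphs, keeping first-seen order, no repeats."""
    seen, merged = set(), []
    for graph in amr_graphs:
        for concept in concepts(graph):
            if concept not in seen:
                seen.add(concept)
                merged.append(concept)
    return merged


# Hypothetical AMR graphs for two retrieved sentences.
docs = [
    "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))",
    "(g / go-02 :ARG0 (b / boy) :ARG4 (s / school))",
]

merged = merge_concepts(docs)
print(merged)
```

In the pipeline the abstract describes, a list like `merged` would then be handed to an LLM to reconstruct a single context; that step is omitted here.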

137. ❌ From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

Authors: Sirui Xia, Yikai Zhang, Aili Chen, Siye Wu, Siyu Yuan, Yanghua Xiao Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23951v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

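Scoring rationale: The paper's core is an LLM-driven automated discovery framework (POISE) for improving policy optimization algorithms for language models. It is highly relevant to 'Large Language Models' (10 points) because the research is built on LLMs; highly relevant to 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (10 points) because it focuses on policy optimization algorithms, a core component of alignment techniques such as RLHF and DPO; and highly relevant to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points) because the framework uses LLM agents for automated algorithm discovery. Other keywords, such as MoE, SLMs, Scaling Laws, and Pre-training, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes POISE, a closed-loop framework in which LLM agents automatically discover and improve policy optimization algorithms for language models; in mathematical reasoning experiments starting from GRPO, it found new mechanisms that raised AIME25 pass@32 from 26.7% to 43.3%.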

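Abstract (translated)

Finding better policy optimization algorithms for language models remains a costly manual process that requires repeated mechanism-level modification and validation. Unlike simple combinatorial code search, the problem calls for searching a space of algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for the automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive that ties together algorithm proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluated 64 candidate algorithms and discovered improved mechanisms, including analytic-variance scaling and validity masking. The best variant raised the weighted Overall score from 47.8 to 52.5 (+4.6) and AIME25 pass@32 from 26.7% to 43.3%, demonstrating that automated policy optimization discovery is feasible while also supporting interpretable design principles.

Abstract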

Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive linking proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.

Keywords: LLM agents, policy optimization algorithms, automated discovery, reinforcement learning, mathematical reasoning, algorithmic mechanisms, closed-loop framework, GRPO
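
Two quantities in this abstract have widely used standard forms: the group-relative advantage that GRPO starts from, and the pass@k metric. The sketch below shows both as commonly defined. It is not POISE's code, and the abstract does not say which pass@k estimator or which standard-deviation convention the authors used; the population standard deviation and the `eps` constant are assumptions.

```python
import math
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize each reward within its sample group.

    Assumption: population standard deviation; implementations differ on this.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the probability that a random
    size-k subset of the n samples contains no correct sample.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# One prompt, four sampled answers scored 1 (correct) or 0 (incorrect).
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)

# Four samples with one correct answer.
print(pass_at_k(n=4, c=1, k=1))
print(pass_at_k(n=4, c=1, k=4))
```

Mechanisms such as the analytic-variance scaling and validity masking named in the abstract would modify how these advantages are computed or which samples contribute; their definitions are not given in the abstract and are not sketched here.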

138. ❌ Argument Mining as a Text-to-Text Generation Task

Authors: Masayuki Kawarada, Tsutomu Hirao, Wataru Uchida, Masaaki Nagata Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23949v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

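Scoring rationale: The paper proposes a text-to-text generation method for argument mining based on a pretrained encoder-decoder language model. It relates to 'Large Language Models' because it uses a pretrained language model; to 'Pre-training' because it builds on a pretrained model; and to 'Post-training' because it adapts the pretrained model to a specific task. Other keywords, such as MoE, SLMs, Scaling Laws, RLHF, RAG, CoT, Agents, and Quantization, are not addressed in the paper and therefore score 0. The paper belongs to natural language processing rather than to AI-for-science application areas such as bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The paper proposes a text-to-text generation method for argument mining based on a pretrained language model; it replaces the traditional multi-subtask pipeline and achieves state-of-the-art performance on three benchmark datasets (AAEC, AbstRCT, and CDCP).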

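Abstract (translated)

Argument mining aims to uncover the argumentative structure within a text. Traditional methods require several subtasks, such as span identification, component classification, and relation classification, and therefore need rule-based postprocessing to derive the argumentative structure from the outputs of the subtasks. This increases model complexity and enlarges the hyperparameter search space. To address this difficulty, we propose a simple yet effective text-to-text generation method based on a pretrained encoder-decoder language model. Our method generates text annotated with spans, components, and relations all at once, with no task-specific postprocessing or hyperparameter tuning. Because it is a direct text-to-text generation approach, it can also be adapted easily to many types of argumentative structure. Experiments show that the method achieves state-of-the-art performance on three different types of benchmark datasets: the Argument-annotated Essays Corpus (AAEC), AbstRCT, and the Cornell eRulemaking Corpus (CDCP).

Abstract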

Argument Mining (AM) aims to uncover the argumentative structures within a text. Previous methods require several subtasks, such as span identification, component classification, and relation classification. Consequently, these methods need rule-based postprocessing to derive argumentative structures from the output of each subtask. This approach adds to the complexity of the model and expands the search space of the hyperparameters. To address this difficulty, we propose a simple yet strong method based on a text-to-text generation approach using a pretrained encoder-decoder language model. Our method simultaneously generates argumentatively annotated text for spans, components, and relations, eliminating the need for task-specific postprocessing and hyperparameter tuning. Furthermore, because it is a straightforward text-to-text generation method, we can easily adapt our approach to various types of argumentative structures. Experimental results demonstrate the effectiveness of our method, as it achieves state-of-the-art performance on three different types of benchmark datasets: the Argument-annotated Essays Corpus (AAEC), AbstRCT, and the Cornell eRulemaking Corpus (CDCP).

Keywords: Argument Mining, Text-to-Text Generation, Pretrained Language Model, Encoder-Decoder, Argumentative Structures, State-of-the-Art, Benchmark Datasets
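
The abstract describes generating annotated text that carries spans, components, and relations in a single output string. The sketch below shows what reading such a string back into a structure could look like. The bracket format (`[Type#id text | relation target ]`), the `TAG` pattern, and the example sentence are invented for illustration; the abstract does not specify the paper's actual output format.

```python
import re

# Hypothetical linearized format: "[Type#id span text | relation target_id ]",
# where the "| relation target_id" part is optional.
TAG = re.compile(
    r"\[(?P<type>\w+)#(?P<id>\d+) (?P<text>[^\]|]+?)"
    r"(?: \| (?P<rel>\w+) (?P<target>\d+))? \]"
)


def parse(generated):
    """Recover (components, relations) from one annotated output string."""
    components, relations = [], []
    for m in TAG.finditer(generated):
        cid = int(m["id"])
        components.append((cid, m["type"], m["text"].strip()))
        if m["rel"]:
            relations.append((cid, m["rel"], int(m["target"])))
    return components, relations


# Stand-in for a model's generated output.
output = (
    "[Claim#1 Cities should expand bike lanes ] "
    "[Premise#2 cycling reduces traffic | supports 1 ]"
)

components, relations = parse(output)
print(components)
print(relations)
```

Because spans, component types, and relations all arrive in one string, a single parse replaces the per-subtask postprocessing rules that the abstract says earlier pipelines required.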

Authors: Seunghee Kim, Bumkyu Park, Kyudan Jung, Joosung Lee, Soyoon Kim, Jeonghoon Kim, Taeuk Kim, Hwiyeol Jo Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23938v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

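Scoring rationale: The paper focuses on evaluating the speech generation ability of multimodal (specifically omni-modal) models and proposes a benchmark, OmniACBench, that assesses whether a model can produce appropriate speech given a text script, an image, and a spoken instruction. Its core is the evaluation of acoustic control, multimodal integration, and speech generation quality, not innovation in large language models (LLMs) or in the technical principles of deep learning. All of the scoring keywords concern large language models, deep learning principles, model optimization methods, or specific AI application areas (such as bioinformatics), and the paper's content has no direct connection to them, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the OmniACBench benchmark for evaluating whether omni-modal models can produce appropriate speech after integrating a text script, an image, and a spoken instruction. Experiments on eight models show a bottleneck in multimodal context integration and identify three common failure modes.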

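Abstract (translated)

Most current testbeds for omni-modal models evaluate multimodal understanding through text output, which makes it hard to tell whether these models can express their answers in appropriate spoken form. To investigate this, we propose OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, the model must read the script aloud in a suitable tone and manner. OmniACBench contains 3,559 verified test instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models show that, despite their strong results in earlier text-output evaluations, they have clear limitations in this setting. Analysis shows that the main bottleneck lies not in processing any single modality but in integrating multimodal context to generate faithful speech. We also identify three common failure modes (weak direct control, failed implicit inference, and failed multimodal grounding), which offer key insights for developing models that can respond effectively in speech.

Abstract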

Most testbeds for omni-modal models assess multimodal understanding via textual outputs, leaving it unclear whether these models can properly speak their answers. To study this, we introduce OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. OmniACBench comprises 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models reveal their limitations in the proposed setting, despite their strong performance on prior textual-output evaluations. Our analyses show that the main bottleneck lies not in processing individual modalities, but in integrating multimodal context for faithful speech generation. Moreover, we identify three common failure modes (weak direct control, failed implicit inference, and failed multimodal grounding), providing insights for developing models that can verbalize responses effectively.

Keywords: Omni-modal models, Acoustic control, Multimodal integration, Speech generation, Benchmark evaluation, Context-grounded, Failure modes, Verbalize responses

140. ❌ ORACLE: Orchestrate NPC Daily Activities using Contrastive Learning with Transformer-CVAE

Authors: Seong-Eun Hong, JuYeong Hwang, RyunHa Lee, HyeongYeop Kang Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23933v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

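Scoring rationale: The ORACLE paper focuses on generating daily activity plans for NPCs with a Transformer-CVAE and contrastive learning, an application of generative models in a specific domain. All of the scoring keywords revolve around large-model (LLM) topics such as technical principles, training methods, inference optimization, alignment, and applications, whereas this paper involves no large-model techniques and is not applied to a scientific domain (such as bioinformatics), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes ORACLE, a model that combines a Transformer, a conditional variational autoencoder, and contrastive learning to generate realistic indoor daily activity plans for non-player characters, and reports experiments in which it outperforms existing methods.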

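Abstract (translated)

The integration of non-player characters (NPCs) into digital environments is receiving growing attention for its potential to enhance user immersion and cognitive engagement. Orchestrating their daily activities in fine detail, so that they reflect the nuances of human routines, contributes substantially to the realism of digital environments. Conventional methods, however, often produce monotonous, repetitive results and fail to capture the complexity of real human activity plans. We therefore propose ORACLE, a new generative model for producing realistic indoor daily activity plans, intended to give NPCs an authentic presence in digital habitats. Using the 24-hour indoor activity sequences of the CASAS smart home dataset, ORACLE addresses the challenges this dataset poses, including imbalanced sequential data, scarce training samples, and the lack of pretrained models that capture human daily activity patterns. ORACLE's training combines the strength of Transformers in sequential data processing, the generative controllability of conditional variational autoencoders (CVAE), and the discriminative refinement of contrastive learning. Our experimental results confirm the quality of the generated NPC activity plans and show that our design strategies are more effective than existing methods.

Abstract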

The integration of Non-player characters (NPCs) within digital environments has been increasingly recognized for its potential to augment user immersion and cognitive engagement. The sophisticated orchestration of their daily activities, reflecting the nuances of human daily routines, contributes significantly to the realism of digital environments. Nevertheless, conventional approaches often produce monotonous repetition, falling short of capturing the intricacies of real human activity plans. In response to this, we introduce ORACLE, a novel generative model for the synthesis of realistic indoor daily activity plans, ensuring NPCs’ authentic presence in digital habitats. Exploiting the CASAS smart home dataset’s 24-hour indoor activity sequences, ORACLE addresses challenges in the dataset, including its imbalanced sequential data, the scarcity of training samples, and the absence of pre-trained models encapsulating human daily activity patterns. ORACLE’s training leverages the sequential data processing prowess of Transformers, the generative controllability of Conditional Variational Autoencoders (CVAE), and the discriminative refinement of contrastive learning. Our experimental results validate the superiority of generating NPC activity plans and the efficacy of our design strategies over existing methods.

Keywords: NPC daily activities, generative model, Transformer, Conditional Variational Autoencoder, contrastive learning, activity plans, smart home dataset, sequential data
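
The abstract names a CVAE and contrastive learning as training components but gives no loss functions. For orientation, the standard textbook forms of the two objectives are written out below. Here $x$ is an activity sequence, $c$ the conditioning signal, $z$ the latent variable, $h$ an embedding with positive pair $h^{+}$, $h_j$ ranges over the positive and all negatives, and $\tau$ is a temperature. The weighted sum with $\lambda$ is an assumption about how such terms are typically combined, not ORACLE's stated objective.

```latex
% Standard conditional VAE objective (negative ELBO)
\mathcal{L}_{\text{CVAE}}
  = -\,\mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big]
    + \mathrm{KL}\big(q_\phi(z \mid x, c)\,\|\,p_\theta(z \mid c)\big)

% Standard InfoNCE-style contrastive loss
\mathcal{L}_{\text{con}}
  = -\log \frac{\exp\big(\mathrm{sim}(h, h^{+})/\tau\big)}
               {\sum_{j} \exp\big(\mathrm{sim}(h, h_{j})/\tau\big)}

% Assumed combination with a hypothetical weight \lambda
\mathcal{L} = \mathcal{L}_{\text{CVAE}} + \lambda\,\mathcal{L}_{\text{con}}
```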

141. ❌ Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development

Authors: Zongliang Ji, Ziyang Zhang, Xincheng Tan, Matthew Thompson, Anna Goldenberg, Carl Yang, Rahul G. Krishnan, Fan Zhang Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23937v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

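Scoring rationale: The paper explicitly uses an LLM (Gemini 2.5) as the core technology for developing a medical guideline agent, an application of large models in biomedicine, so it is highly relevant to 'Large Language Models' and 'AI for Science' (10 points each). The study generates questions to support physician reasoning, which gives it some connection to 'Chain of Thought' and 'System 2 Thinking' (5 points each). The paper aims to develop a medical agent, so it is relevant to 'LLM Agents' (8 points). Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The study examines using large language models (LLMs) to generate evidence-based, guideline-related questions during physician-patient encounters to support clinical decisions. It finds that LLMs can produce clinically meaningful questions, with potential to reduce cognitive burden and make evidence-based medicine more practicable.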

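Abstract (translated)

Evidence-based medicine (EBM) is central to high-quality care but remains hard to implement effectively in fast-paced primary care settings. Physicians face short consultations, growing patient numbers, and lengthy clinical guideline documents that are impractical to consult in real time during a visit. To close this gap, this study examines the feasibility of using large language models (LLMs) as ambient assistants that generate targeted, evidence-based questions on the spot during physician-patient encounters. The study focuses on question generation rather than question answering, aiming to support physicians' clinical reasoning and to integrate guideline-based practice into brief consultations. Using Gemini 2.5 as the backbone model, we implemented two prompting strategies: a zero-shot baseline and a multi-stage reasoning variant. The evaluation is based on a benchmark of 80 de-identified transcripts from real clinical encounters, with six experienced physicians carrying out more than 90 hours of structured review. The results show that although general-purpose LLMs are not yet fully reliable, they can generate clinically meaningful, guideline-relevant questions, which points to significant potential for reducing cognitive load and making EBM more actionable at the point of care.

Abstract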

Evidence-based medicine (EBM) is central to high-quality care, but remains difficult to implement in fast-paced primary care settings. Physicians face short consultations, increasing patient loads, and lengthy guideline documents that are impractical to consult in real time. To address this gap, we investigate the feasibility of using large language models (LLMs) as ambient assistants that surface targeted, evidence-based questions during physician-patient encounters. Our study focuses on question generation rather than question answering, with the aim of scaffolding physician reasoning and integrating guideline-based practice into brief consultations. We implemented two prompting strategies, a zero-shot baseline and a multi-stage reasoning variant, using Gemini 2.5 as the backbone model. We evaluated on a benchmark of 80 de-identified transcripts from real clinical encounters, with six experienced physicians contributing over 90 hours of structured review. Results indicate that while general-purpose LLMs are not yet fully reliable, they can produce clinically meaningful and guideline-relevant questions, suggesting significant potential to reduce cognitive burden and make EBM more actionable at the point of care.

Keywords: large language models, evidence-based medicine, question generation, clinical encounters, medical guideline agent, physician reasoning, Gemini 2.5, ambient assistants

142. ❌ BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

Authors: Praveen Kumar Myakala, Manan Agrawal, Rahul Manche Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23848v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
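
The arithmetic behind the verdict can be sketched as follows. This is a minimal sketch, assuming each keyword contributes weight × relevance (capped at the configured per-keyword maximum of 15) and that the pass mark is 5.0 + 0.8 × total configured weight, as given in the report configuration. The per-keyword formula and the function names `pass_mark` and `paper_score` are assumptions, not the report generator's actual code. With every weight in the table above listed as 0.0, the total stays at 0.0 regardless of relevance.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword maximum from the report configuration


def pass_mark(total_weight: float) -> float:
    """Pass mark formula from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows):
    """Sum per-keyword scores for (weight, relevance) pairs, relevance on a 0-10 scale.

    Assumption: per-keyword score = weight * relevance, capped at MAX_PER_KEYWORD.
    """
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)


# The six non-zero relevance rows from the table above; each listed weight is 0.0.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 10.0), (0.0, 5.0), (0.0, 10.0), (0.0, 5.0)]
score = paper_score(rows)
threshold = pass_mark(27.0)  # 27 keywords, total configured weight 27.0
print(f"{score:.1f} / {threshold:.1f}", "pass" if score >= threshold else "fail")
```

Under these assumptions a paper can only reach the 26.6 pass mark when the weights applied at scoring time are non-zero.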

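Scoring rationale: The paper studies the belief dynamics of LLM agents across long-running conversations (opinion drift, consistency, revision), so it bears directly on LLM Agents, RAG (one of the two evaluation settings), and LLMs (several models are evaluated). It is partly related to Alignment (over-alignment is discussed), Self-Correction/Reflection (belief revision), and Factuality (the trade-off between factual grounding and belief updates). The remaining keywords, such as MoE, SLMs, training techniques, inference acceleration, and scientific applications, are not addressed.

!!! tip deepseek-chat TL;DR

The paper introduces BeliefShift, a benchmark that evaluates temporal belief consistency, contradiction detection, and evidence-driven belief revision in LLM agents across multi-session interactions, and finds a trade-off between personalization and factual grounding.

Abstract translation (Chinese)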

大型语言模型正日益被用作长期运行的对话代理，然而现有评估其记忆能力的主要基准测试均将用户信息视为静态事实进行存储和检索。这是一种错误的模型。人们会改变想法，在长期互动中，观点漂移、过度对齐和确认偏误等现象开始产生重要影响。
BeliefShift引入了一个专门为评估多轮会话中信念动态而设计的纵向基准测试。它涵盖三个维度：时序信念一致性、矛盾检测和证据驱动的信念修正。该数据集包含2,400条人工标注的多轮会话轨迹，涉及健康、政治、个人价值观和产品偏好等领域。
我们在零样本和检索增强生成两种设置下评估了包括GPT-4o、Claude 3.5 Sonnet、Gemini 1.5 Pro、LLaMA-3和Mistral-Large在内的七个模型。结果揭示了明显的权衡关系：积极个性化的模型难以抵抗观点漂移，而基于事实的模型则容易忽略合理的信念更新。
我们进一步提出了四个新颖的评估指标：信念修正准确度、漂移连贯性分数、矛盾解决率和证据敏感度指数。

Abstract

LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved. That’s the wrong model. People change their minds, and over extended interactions, phenomena like opinion drift, over-alignment, and confirmation bias start to matter a lot. BeliefShift introduces a longitudinal benchmark designed specifically to evaluate belief dynamics in multi-session LLM interactions. It covers three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence-Driven Revision. The dataset includes 2,400 human-annotated multi-session interaction trajectories spanning health, politics, personal values, and product preferences. We evaluate seven models including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA-3, and Mistral-Large under zero-shot and retrieval-augmented generation (RAG) settings. Results reveal a clear trade-off: models that personalize aggressively resist drift poorly, while factually grounded models miss legitimate belief updates. We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).

Keywords: LLM Agents, belief dynamics, temporal consistency, opinion drift, retrieval-augmented generation, benchmark, multi-session interactions, evaluation metrics

143. ❌ How Vulnerable Are Edge LLMs?

Authors: Ao Ding, Hongzong Li, Zi Liang, Zhanpeng Shi, Shuxin Zhuang, Shiqin Tang, Rong Feng, Ping Lu Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23822v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

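Scoring rationale: The paper studies security vulnerabilities of quantized LLMs deployed at the edge. It is highly relevant to "Large Language Models" and "Small Language Models/On-device AI" (10 points each) because it directly examines LLM deployment on edge devices, and to "Quantization/Model Compression" (10 points) because it focuses on the security risks of INT8/INT4 quantized models. Other keywords, such as MoE, Scaling Laws, fine-tuning methods, reasoning techniques, alignment methods, and AI-for-science applications, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper finds that, although quantization introduces noise, quantized LLMs deployed at the edge remain vulnerable to query-based knowledge extraction. The proposed CLIQ framework recovers model behavior more effectively than the original queries, indicating that quantization alone does not provide effective protection.

Abstract translation (Chinese)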

大型语言模型（LLM）在严格的计算和量化约束下正日益部署于边缘设备，但其安全影响尚不明确。本研究在现实查询预算下，对量化边缘部署LLM进行基于查询的知识提取分析，结果表明：尽管量化引入了噪声，但并未消除底层的语义知识，通过精心设计的查询仍可实现显著的行为恢复。为系统分析此风险，我们提出CLIQ（Clustered Instruction Querying，聚类指令查询）——一种结构化查询构建框架，该框架在减少冗余的同时提升了语义覆盖度。在量化Qwen模型（INT8/INT4）上的实验表明，CLIQ在BERTScore、BLEU和ROUGE指标上均持续优于原始查询，能够在有限预算下实现更高效的提取。这些结果表明，仅靠量化无法有效防御基于查询的知识提取，这揭示了边缘部署LLM中一个先前未被充分探索的安全风险。

Abstract

Large language models (LLMs) are increasingly deployed on edge devices under strict computation and quantization constraints, yet their security implications remain unclear. We study query-based knowledge extraction from quantized edge-deployed LLMs under realistic query budgets and show that, although quantization introduces noise, it does not remove the underlying semantic knowledge, allowing substantial behavioral recovery through carefully designed queries. To systematically analyze this risk, we propose CLIQ (Clustered Instruction Querying), a structured query construction framework that improves semantic coverage while reducing redundancy. Experiments on quantized Qwen models (INT8/INT4) demonstrate that CLIQ consistently outperforms original queries across BERTScore, BLEU, and ROUGE, enabling more efficient extraction under limited budgets. These results indicate that quantization alone does not provide effective protection against query-based extraction, highlighting a previously underexplored security risk in edge-deployed LLMs.
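
The idea of improving semantic coverage while reducing redundancy under a query budget can be illustrated with a toy selection routine. The abstract does not describe how CLIQ clusters or selects instructions, so this is not the CLIQ algorithm. It is a minimal sketch that assumes a simple leader clustering on word-set Jaccard similarity followed by round-robin selection, and every function name, the threshold, and the instruction pool are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two word sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leader_clusters(instructions, threshold=0.5):
    """Greedy leader clustering: join the first cluster whose leader is similar enough."""
    clusters = []  # each cluster is a list whose first element is its leader
    for text in instructions:
        words = set(text.lower().split())
        for cluster in clusters:
            if jaccard(words, set(cluster[0].lower().split())) >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters


def select_queries(instructions, budget, threshold=0.5):
    """Round-robin over clusters so each cluster is covered before any is revisited."""
    clusters = leader_clusters(instructions, threshold)
    picked, depth = [], 0
    while len(picked) < budget and any(depth < len(c) for c in clusters):
        for cluster in clusters:
            if depth < len(cluster) and len(picked) < budget:
                picked.append(cluster[depth])
        depth += 1
    return picked


# Hypothetical instruction pool with near-duplicates.
pool = [
    "translate this sentence to french",
    "translate this sentence to german",
    "summarize the following article",
    "summarize the following report",
    "write a python function to sort a list",
]
print(select_queries(pool, budget=3))
```

With a budget of three, the routine spends one query per cluster instead of two on near-identical translation prompts, which is the coverage-versus-redundancy trade-off the abstract describes.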

Keywords: edge LLMs, quantization, knowledge extraction, security risk, CLIQ, query-based attack, model compression, edge deployment

144. ❌ Language Model Planners do not Scale, but do Formalizers?

Authors: Owen Jiang, Cassie Huang, Ashish Sabharwal, Li Zhang Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23844v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

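Scoring rationale: The paper studies LLM performance on planning problems and compares how LLM planners and LLM formalizers scale, so it is highly relevant to "Large Language Models" (10 points). Because it concerns reasoning and planning, it is somewhat related to "Chain of Thought" and "System 2 Thinking" (5 points each): generating a formal program can be viewed as a form of reasoning. Using an LLM as a formalizer can be viewed as agentic behavior, hence the link to "LLM Agents" (5 points). Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper finds that, on complex planning problems, LLM formalizers (which generate solver-oriented programs) scale better than LLM planners. It proposes a divide-and-conquer formalizing technique and a new LLM-as-higher-order-formalizer paradigm to cope with combinatorial explosion.

Abstract translation (Chinese)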

近期研究表明，即使那些经过训练以扩展推理轨迹的大语言模型，在解决过于复杂的规划问题时仍表现不佳，这一结论已获得压倒性证据支持。然而，对于面向求解器生成程序的大语言模型形式化工具是否同样存在此局限，目前尚不明确。我们系统性地证明，大语言模型形式化工具的性能远超大语言模型规划器，其中部分在经典的BlocksWorld领域中保持完美准确率，而该领域的状态空间规模高达$10^{165}$。尽管较小规模的大语言模型形式化工具的性能会随问题复杂度增加而下降，但我们证明一种分治形式化技术能显著提升其鲁棒性。最后，我们提出了一类“解构性问题”，其中一行问题描述实际对应指数级数量的形式语言（如规划领域定义语言PDDL）代码行，这对大语言模型形式化工具构成了巨大挑战。为应对此挑战，我们引入了一种新范式，即“大语言模型作为高阶形式化工具”，通过大语言模型生成程序生成器。这种方法将令牌输出与底层形式化及搜索空间的组合爆炸解耦，从而有效应对复杂性挑战。

Abstract

Recent work shows overwhelming evidence that LLMs, even those trained to scale their reasoning trace, perform unsatisfactorily on planning problems that are too complex. Whether the same conclusion holds for LLM formalizers that generate solver-oriented programs remains unknown. We systematically show that LLM formalizers greatly out-scale LLM planners, some retaining perfect accuracy in the classic BlocksWorld domain with a huge state space of size up to $10^{165}$. While performance of smaller LLM formalizers degrades with problem complexity, we show that a divide-and-conquer formalizing technique can greatly improve their robustness. Finally, we introduce unraveling problems where one line of problem description realistically corresponds to exponentially many lines of formal language such as the Planning Domain Definition Language (PDDL), greatly challenging LLM formalizers. We tackle this challenge by introducing a new paradigm, namely LLM-as-higher-order-formalizer, where an LLM generates a program generator. This decouples token output from the combinatorial explosion of the underlying formalization and search space.
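
To make the higher-order idea concrete, here is a minimal sketch of a program generator: a short, fixed-size Python function that emits a BlocksWorld PDDL problem whose length grows with the number of blocks. It illustrates the decoupling of output size from generator size and is not the authors' code. The function name `blocksworld_problem`, the domain name `blocksworld`, and the predicates `ontable`, `clear`, `handempty`, and `on` follow the common four-operator BlocksWorld convention and are assumptions here. Growth in this toy case is only linear, whereas the paper's unraveling problems are described as exponential.

```python
def blocksworld_problem(n: int, name: str = "tower") -> str:
    """Emit a PDDL problem: n blocks start on the table; the goal is one tower."""
    if n < 2:
        raise ValueError("need at least two blocks")
    blocks = [f"b{i}" for i in range(1, n + 1)]
    # Initial state: empty hand, every block on the table and clear.
    init = ["(handempty)"]
    init += [f"(ontable {b})" for b in blocks]
    init += [f"(clear {b})" for b in blocks]
    # Goal: b1 on b2, b2 on b3, ..., b(n-1) on bn.
    goal = [f"(on {top} {bottom})" for top, bottom in zip(blocks, blocks[1:])]
    lines = [
        f"(define (problem {name}-{n})",
        "  (:domain blocksworld)",
        f"  (:objects {' '.join(blocks)})",
        "  (:init",
        *[f"    {fact}" for fact in init],
        "  )",
        "  (:goal (and",
        *[f"    {fact}" for fact in goal],
        "  ))",
        ")",
    ]
    return "\n".join(lines)


print(blocksworld_problem(3))
print(len(blocksworld_problem(200).splitlines()), "lines generated for 200 blocks")
```

An LLM that writes a function like this produces the same number of tokens whether the instance has 3 blocks or 200; the formal text is expanded by running the generator, and the resulting file is handed to a planner.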

Keywords: LLM planners, LLM formalizers, planning problems, BlocksWorld, PDDL, divide-and-conquer, scalability, combinatorial explosion

145. ❌ Infrequent Child-Directed Speech Is Bursty and May Draw Infant Vocalizations

Authors: Margaret Cychosz, Adriana Weisleder Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23797v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

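Scoring rationale: The paper studies child language development. It analyses the temporal patterns of the directed speech that infants receive in two cultural settings (rural Bolivia and the urban U.S.) and how those patterns relate to infants' vocal behavior. All scoring keywords concern large models, deep-learning techniques, or AI applications in science, whereas this paper sits at the intersection of developmental psychology, linguistics, and anthropology and involves no AI, machine learning, or large-model technology. Every keyword therefore has relevance 0.

!!! tip deepseek-chat TL;DR

The study examines the speech input available to infants in communities where adults speak to infants relatively little. It finds that child-directed speech, though infrequent, arrives in temporally concentrated bursts, and that periods of such speech, including speech from older children, are associated with more infant vocalizations.

Abstract translation (Chinese)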

在世界许多地区，儿童接收到的指向性言语输入相对较少，却仍能达成语言发展的关键里程碑。当指向性输入稀缺时，婴儿所学习的言语输入有何不同？通过使用玻利维亚农村和美国城市采集的长时段、以婴儿为中心的录音数据，我们分析了婴儿言语输入的时间模式及其前语言发声行为。研究发现，玻利维亚的儿童指向性言语虽出现频率较低，但其时间聚集程度与美国相当，均以密集爆发形式出现而非均匀分布于全天。在两个社群中，婴儿最可能在接收指向性言语的时段发出类言语发声，且其在目标性儿童指向性言语期间产生类言语发声的概率几乎是静默期间的两倍。在玻利维亚，婴儿的类言语发声也更可能出现在年长儿童（而非成人）进行指向性言语的时段。这些发现共同表明，儿童指向性言语对发展的影响可能不仅取决于数量，还与其时间集中度和来源有关——在某些成人对婴儿言语输入较少的社群中，年长儿童构成了重要的输入来源。

Abstract

Children in many parts of the world hear relatively little speech directed to them, yet still reach major language development milestones. What differs about the speech input that infants learn from when directed input is rare? Using longform, infant-centered audio recordings taken in rural Bolivia and the urban U.S., we examined temporal patterns of infants’ speech input and their pre-linguistic vocal behavior. We find that child-directed speech in Bolivia, though less frequent, was just as temporally clustered as speech input in the U.S., arriving in concentrated bursts rather than spread across the day. In both communities, infants were most likely to produce speech-like vocalizations during periods of speech directed to them, with the probability of infants’ speech-like vocalizations during target child-directed speech nearly double that during silence. In Bolivia, infants’ speech-like vocalizations were also more likely to occur during bouts of directed speech from older children than from adults. Together, these findings suggest that the developmental impact of child-directed speech may depend not only on quantity, but on temporal concentration and source, with older children serving as an important source of input in some communities, including where adult speech to infants is less frequent.
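
One common way to quantify speech "arriving in concentrated bursts" is the Goh-Barabási burstiness parameter B = (σ − μ) / (σ + μ), computed over the gaps between consecutive events. B is −1 for perfectly regular timing, near 0 for a Poisson-like process, and approaches 1 for strongly clustered events. The abstract does not say which statistic the authors used, so this is an assumption, and the onset times below are made-up toy values, not data from the study.

```python
from statistics import mean, pstdev


def burstiness(event_times):
    """Burstiness B = (sigma - mu) / (sigma + mu) of gaps between sorted event times."""
    times = sorted(event_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events")
    mu, sigma = mean(gaps), pstdev(gaps)
    return (sigma - mu) / (sigma + mu)


# Toy utterance onset times in minutes (illustrative only).
regular = [0, 10, 20, 30, 40, 50, 60]
clustered = [0, 1, 2, 3, 57, 58, 59, 60]
print(f"regular:   {burstiness(regular):+.2f}")
print(f"clustered: {burstiness(clustered):+.2f}")
```

Both toy sequences span the same hour, yet only the clustered one yields a positive B, which is why a measure of this kind can show similar temporal clustering in two communities whose total amounts of directed speech differ.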

Keywords: child-directed speech, infant vocalizations, language development, temporal patterns, cross-cultural, speech input, infant-centered recordings, Bolivia

146. ❌ IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

Authors: Ali Abdelaal, Mohammed Nader Al Haffar, Mahmoud Fawzi, Walid Magdy Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23750v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

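Scoring rationale: The paper's core contribution is building and applying a benchmark, IslamicMMLU, to evaluate large language models (LLMs) on Islamic knowledge, so it is highly relevant to the keyword "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not address the model architectures, training techniques, reasoning methods, application areas, or performance optimizations named by the other keywords, which therefore all have relevance 0.

!!! tip deepseek-chat TL;DR

To address the lack of a comprehensive benchmark for LLMs on Islamic knowledge, the paper builds IslamicMMLU and evaluates 26 models. It finds large differences in performance (average accuracy from 39.8% to 93.8%) and model-dependent preferences among schools of jurisprudence (madhabs).

Abstract translation (Chinese)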

大型语言模型日益被用于获取伊斯兰知识，但目前缺乏全面评估其在核心伊斯兰学科表现的综合基准。我们推出IslamicMMLU基准测试，包含10,013道选择题，涵盖三个专项：《古兰经》（2,013题）、圣训（4,000题）与教法学（Fiqh，4,000题）。每个专项均设置多种题型，以检验大型语言模型处理伊斯兰知识不同维度的能力。该基准用于建立公开的IslamicMMLU评估排行榜，我们初步评估了26个大型语言模型，其三大专项平均准确率介于39.8%至93.8%之间（最高为Gemini 3 Flash模型）。《古兰经》专项表现出最大跨度（99.3%至32.4%），而教法学专项创新性地设置了学派（madhab）偏见检测任务，揭示了不同模型对伊斯兰法学流派的可变偏好。阿拉伯语专用模型表现参差不齐，但均未超越前沿模型性能。评估代码与排行榜已向公众开放。

Abstract

Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track is formed of multiple types of questions to examine LLMs' capabilities in handling different aspects of Islamic knowledge. The benchmark is used to create the IslamicMMLU public leaderboard for evaluating LLMs, and we initially evaluate 26 LLMs, whose averaged accuracy across the three tracks varied between 39.8% and 93.8% (the latter by Gemini 3 Flash). The Quran track shows the widest span (99.3% to 32.4%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task revealing variable school-of-thought preferences across models. Arabic-specific models show mixed results, but they all underperform compared to frontier models. The evaluation code and leaderboard are made publicly available.
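
Because the Quran track is about half the size of the other two, "averaged accuracy across the three tracks" can mean two different numbers. The sketch below contrasts a per-track (macro) average with a per-question (micro) average. The track sizes come from the abstract; the per-track accuracies are hypothetical, and the abstract does not state which average the leaderboard reports.

```python
TRACK_SIZES = {"quran": 2013, "hadith": 4000, "fiqh": 4000}  # from the abstract


def macro_average(accuracy_by_track):
    """Unweighted mean over tracks: each track counts equally."""
    return sum(accuracy_by_track[t] for t in TRACK_SIZES) / len(TRACK_SIZES)


def micro_average(accuracy_by_track):
    """Question-weighted mean: each of the 10,013 questions counts equally."""
    total = sum(TRACK_SIZES.values())
    return sum(accuracy_by_track[t] * n for t, n in TRACK_SIZES.items()) / total


# Hypothetical per-track accuracies for one model (not results from the paper).
example = {"quran": 0.90, "hadith": 0.60, "fiqh": 0.75}
print(f"macro: {macro_average(example):.4f}")
print(f"micro: {micro_average(example):.4f}")
```

A model that is strongest on the smallest track looks better under the macro average than under the micro average, so the choice of average affects leaderboard comparisons.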

Keywords: Large Language Models, Benchmark, Islamic Knowledge, Evaluation, Quran, Hadith, Fiqh, Leaderboard

147. ❌ LLMs Do Not Grade Essays Like Humans

Authors: Jerin George Mathew, Sumayya Taher, Anindita Kundu, Denilson Barbosa Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23714v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

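Scoring rationale: The paper studies LLMs applied to automated essay scoring and analyses their agreement with human grading, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not involve the techniques or applications named by the other keywords, such as MoE, SLMs, training methods, reasoning techniques, agent systems, or compression, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study evaluates how closely LLM scores agree with human grades in automated essay scoring. Agreement is relatively weak and the scoring patterns differ from those of human raters, but LLM-generated feedback is consistent with LLM scores, and the authors conclude that LLMs can be used to support essay scoring.

Abstract translation (Chinese)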

大型语言模型近期被提出作为自动化作文评分的工具，但其与人工评分的一致性仍不明确。本研究评估了LLM生成的分数与人工评分的对比情况，并在未经任务特定训练的开箱即用环境下，分析了GPT和Llama系列多个模型的评分行为。结果表明，LLM与人工评分的一致性相对较弱，且随文章特征变化。具体而言，与人工评分者相比，LLM倾向于给篇幅较短或未充分展开的文章打出更高分数，而对包含轻微语法或拼写错误的长篇文章则倾向于给出较低分数。我们还发现，LLM生成的分数与其生成的反馈总体一致：获得更多赞扬的文章倾向于得到更高分数，而受到更多批评的文章则倾向于得到更低分数。这些结果表明，LLM生成的分数和反馈遵循内在一致的逻辑，但其依赖的评判标准与人工评分者不同，导致与人类评分实践的契合度有限。尽管如此，我们的研究表明LLM生成的反馈与其评分具有一致性，且能可靠地用于辅助作文评分。

Abstract

Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear. In this work, we evaluate how LLM-generated scores compare with human grades and analyze the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training. Our results show that agreement between LLM and human scores remains relatively weak and varies with essay characteristics. In particular, compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays, while assigning lower scores to longer essays that contain minor grammatical or spelling errors. We also find that the scores generated by LLMs are generally consistent with the feedback they generate: essays receiving more praise tend to receive higher scores, while essays receiving more criticism tend to receive lower scores. These results suggest that LLM-generated scores and feedback follow coherent patterns but rely on signals that differ from those used by human raters, resulting in limited alignment with human grading practices. Nevertheless, our work shows that LLMs produce feedback that is consistent with their grading and that they can be reliably used in supporting essay scoring.
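
Agreement between two raters in automated essay scoring is often summarized with quadratic weighted kappa (QWK), which penalizes large score gaps more than small ones. The abstract does not name the agreement statistic the authors used, so the following is a minimal sketch under that assumption, with toy scores rather than data from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """QWK between two integer score lists on the scale [min_score, max_score]."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("the scale needs at least two score levels")
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    numerator = denominator = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement penalty
            numerator += weight * observed[i][j]
            denominator += weight * hist_a[i] * hist_b[j] / n  # chance expectation
    return 1.0 - numerator / denominator


human = [1, 2, 3, 4, 2, 3]  # toy human grades
llm = [2, 2, 3, 3, 3, 3]    # toy LLM grades, compressed toward the middle
print(round(quadratic_weighted_kappa(human, llm, 1, 4), 3))
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement. The sketch does not handle the degenerate case where both raters give every essay the same single score.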

Keywords: Large Language Models, Automated Essay Scoring, Human Grading, GPT, Llama, Feedback Consistency, Grading Behavior, Alignment Analysis

148. ❌ PLACID: Privacy-preserving Large language models for Acronym Clinical Inference and Disambiguation

Authors: Manjushree B. Aithal, Alexander Kotz, James Mitchell Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23678v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

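Scoring rationale: The paper studies LLMs applied to healthcare, specifically privacy-preserving disambiguation of clinical acronyms. Highly relevant keywords: (1) "Large Language Models OR LLMs OR Foundation Models" (the paper explicitly uses LLMs to solve a clinical problem, 10 points); (2) "Small Language Models OR SLMs OR On-device AI" (the paper focuses on 2B-10B-parameter models deployed on-device to preserve privacy, 10 points); (3) "AI for Science OR Bioinformatics OR Cheminformatics" (the paper is a bioinformatics/medical AI application, 10 points). Other keywords, such as MoE, Scaling Laws, and RLHF, are not mentioned in the abstract, are unrelated to the paper, and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how small language models (2B-10B parameters) deployed on-device can disambiguate acronyms in clinical narratives while preserving privacy. A cascaded pipeline that combines a general-purpose model with domain-specific biomedical models raises expansion accuracy from about 0.655 to about 0.81.

Abstract translation (Chinese)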

大型语言模型（LLM）为众多领域提供了变革性解决方案，但其在医疗健康领域的整合受到严格数据隐私约束的阻碍。临床叙述文本中充斥着大量含义模糊的缩写词，对这些缩写的误译可能导致严重后果，例如危及生命的用药错误。虽然依赖云服务的LLM在缩写消歧方面表现出色，但将受保护的健康信息传输至外部服务器违反了隐私保护框架。为弥合这一差距，本研究率先评估了完全部署于设备端的小参数模型，以确保隐私保护。我们提出了一种隐私保护的级联流程，利用通用的本地模型检测临床缩写，并将其路由至特定领域的生物医学模型以进行上下文相关的扩展。结果显示，尽管通用的指令遵循模型在检测准确率上表现优异（约0.988），但其扩展能力大幅下降（约0.655）。我们的级联方法利用特定领域的医疗模型，将扩展准确率提升至约0.81。这项创新性工作表明，保护隐私的设备端（2B-10B参数）模型能够提供高保真度的临床缩写消歧支持。

Abstract

Large Language Models (LLMs) offer transformative solutions across many domains, but healthcare integration is hindered by strict data privacy constraints. Clinical narratives are dense with ambiguous acronyms, and misinterpreting these abbreviations can precipitate severe outcomes such as life-threatening medication errors. While cloud-dependent LLMs excel at Acronym Disambiguation, transmitting Protected Health Information to external servers violates privacy frameworks. To bridge this gap, this study pioneers the evaluation of small-parameter models deployed entirely on-device to ensure privacy preservation. We introduce a privacy-preserving cascaded pipeline leveraging general-purpose local models to detect clinical acronyms, routing them to domain-specific biomedical models for context-relevant expansions. Results reveal that while general instruction-following models achieve high detection accuracy (~0.988), their expansion capabilities plummet (~0.655). Our cascaded approach utilizes domain-specific medical models to increase expansion accuracy to ~0.81. This novel work demonstrates that privacy-preserving, on-device (2B-10B) models deliver high-fidelity clinical acronym disambiguation support.
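
The two-stage routing described in the abstract can be sketched as a detector followed by a context-aware expander. This is a structural sketch only: the function names are hypothetical, and the regex detector and lookup-table expander below are stand-ins for the local general-purpose and biomedical models, whose prompts and interfaces the abstract does not specify.

```python
import re
from typing import Callable, Dict, List

Detector = Callable[[str], List[str]]   # note -> acronyms found in it
Expander = Callable[[str, str], str]    # (acronym, note) -> expansion


def cascade(note: str, detect: Detector, expand: Expander) -> Dict[str, str]:
    """Stage 1 finds acronyms; stage 2 expands each one using the whole note as context."""
    return {acronym: expand(acronym, note) for acronym in detect(note)}


def toy_detect(note: str) -> List[str]:
    """Stand-in detector: runs of 2-5 capital letters, de-duplicated in first-seen order."""
    return list(dict.fromkeys(re.findall(r"\b[A-Z]{2,5}\b", note)))


def toy_expand(acronym: str, note: str) -> str:
    """Stand-in expander: picks a sense of "MS" from context, otherwise uses a small table."""
    if acronym == "MS":
        return "mitral stenosis" if "murmur" in note.lower() else "multiple sclerosis"
    return {"SOB": "shortness of breath"}.get(acronym, acronym)


note = "Pt reports SOB on exertion; diastolic murmur consistent with MS."
print(cascade(note, toy_detect, toy_expand))
```

Keeping detection and expansion behind separate callables mirrors the reported finding that a general model detects well but expands poorly: the expander can be swapped for a domain-specific model without touching the detector, and nothing leaves the device.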

Keywords: Large Language Models, clinical acronym disambiguation, privacy-preserving, on-device models, small-parameter models, healthcare integration, biomedical models, cascaded pipeline

149. ❌ Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges

Authors: Weilun Xu, Alexander Rusnak, Frederic Kaplan Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23659v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

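Scoring rationale: The paper studies how the internal representations of LLMs encode different ethical frameworks (deontology, utilitarianism, and others), which is basic research on LLMs (weight-1.0 keyword, 10 points). Because it probes hidden representations to analyse ethical judgments, it is highly relevant to "Value Alignment": ethical frameworks are a core component of value alignment (weight-1.0 keyword, 10 points). Its use of probing to analyse internal representations also falls under "Mechanistic Interpretability" (weight-1.0 keyword, 10 points). Other keywords, such as MoE, SFT, RAG, and inference acceleration, do not appear in the title or abstract and score 0.

!!! tip deepseek-chat TL;DR

By probing the hidden representations of LLMs, the study asks whether models internally distinguish ethical frameworks such as deontology and utilitarianism. It finds differentiated ethical subspaces, but the probes depend partly on surface features of the benchmark templates, which exposes a methodological limitation.

Abstract translation (Chinese)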

当大型语言模型进行伦理判断时，其内部表征是否能够区分不同的规范框架，抑或将伦理压缩为单一的接受度维度？我们针对六种参数量在4B至72B之间的语言模型，探测了其隐藏表征在五种伦理框架（道义论、功利主义、德性伦理、正义伦理、常识伦理）中的表现。分析显示，模型内部存在分化的伦理子空间，并呈现非对称的迁移模式——例如，道义论探针可部分泛化至德性伦理场景，而常识伦理探针在正义伦理场景中则完全失效。尽管这种关联可能部分源于对场景难度的共同敏感性，但道义论与功利主义探针之间的分歧程度与不同模型架构中行为熵的升高具有相关性。事后验证表明，探针部分依赖于基准模板的表层特征，这提示我们需要谨慎解读结果。我们既讨论了这些方法所提供的结构性洞见，也反思了其认识论层面的局限性。

Abstract

When large language models make ethical judgments, do their internal representations distinguish between normative frameworks, or collapse ethics into a single acceptability dimension? We probe hidden representations across five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six LLMs spanning 4B–72B parameters. Our analysis reveals differentiated ethical subspaces with asymmetric transfer patterns – e.g., deontology probes partially generalize to virtue scenarios while commonsense probes fail catastrophically on justice. Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy across architectures, though this relationship may partly reflect shared sensitivity to scenario difficulty. Post-hoc validation reveals that probes partially depend on surface features of benchmark templates, motivating cautious interpretation. We discuss both the structural insights these methods provide and their epistemological limitations.

Keywords: Large Language Models, Ethical Frameworks, Representation Probing, Deontology, Utilitarianism, Interpretability, Value Alignment, Methodological Challenges

150. ❌ Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages

Authors: Badr M. Abdullah, Israel Abebe Azime, Atnafu Lambebo Tonja, Jesujoba O. Alabi, Abel Mulat Alemu, Eyob G. Hagos, Bontu Fufa Balcha, Mulubrhan A. Nerea, Debela Desalegn Yadeta, Dagnachew Mekonnen Marilign, Amanuel Temesgen Fentahun, Tadesse Kebede, Israel D. Gebru, Michael Melese Woldeyohannis, Walelign Tewabe Sewunetie, Bernd Möbius, Dietrich Klakow Source: arXiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23654v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

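Scoring rationale: The paper focuses on multilingual automatic speech recognition (ASR), specifically the training and evaluation of CTC models for Ethiopian languages. All scoring keywords relate directly to large language models (LLMs), innovations in deep learning techniques, or applications of AI in science, whereas this paper studies speech recognition, a different AI subfield (speech processing rather than text or language models). It is therefore unrelated to every keyword, and each scores 0.

!!! tip deepseek-chat TL;DR

The paper presents Ethio-ASR, a suite of multilingual CTC speech recognition models jointly trained on five Ethiopian languages, which achieves an average word error rate of 30.48% on the WAXAL test set and outperforms the OmniASR baseline despite having fewer parameters.

Abstract (Chinese translation)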

我们推出Ethio-ASR——一套基于连接时序分类（CTC）的多语言自动语音识别（ASR）模型，该模型联合训练了五种埃塞俄比亚语言：阿姆哈拉语、提格雷尼亚语、奥罗莫语、西达玛语和沃莱塔语。这些语言分属亚非语系的闪米特语族、库希特语族和奥摩特语族，尽管埃塞俄比亚绝大多数人口使用这些语言，它们在语音技术领域仍处于严重代表性不足的状态。我们使用多种预训练语音编码器，在最新发布的WAXAL语料库上训练模型，并与包括OmniASR在内的强大多语言基线进行性能对比。我们的最佳模型在WAXAL测试集上实现了30.48%的平均词错误率（WER），以明显更少的参数量超越了最优的OmniASR模型。我们进一步提供了关于性别偏见、元音长度与辅音重叠现象对ASR错误的影响，以及多语言CTC模型训练动态的综合分析。本研究的模型与代码库已向研究社区公开。

Abstract

We present Ethio-ASR, a suite of multilingual CTC-based automatic speech recognition (ASR) models jointly trained on five Ethiopian languages: Amharic, Tigrinya, Oromo, Sidaama, and Wolaytta. These languages belong to the Semitic, Cushitic, and Omotic branches of the Afroasiatic family, and remain severely underrepresented in speech technology despite being spoken by the vast majority of Ethiopia’s population. We train our models on the recently released WAXAL corpus using several pre-trained speech encoders and evaluate against strong multilingual baselines, including OmniASR. Our best model achieves an average WER of 30.48% on the WAXAL test set, outperforming the best OmniASR model with substantially fewer parameters. We further provide a comprehensive analysis of gender bias, the contribution of vowel length and consonant gemination to ASR errors, and the training dynamics of multilingual CTC models. Our models and codebase are publicly available to the research community.

Keywords: multilingual speech recognition, CTC models, Ethiopian languages, WAXAL corpus, automatic speech recognition, language identification, gender bias analysis, consonant gemination
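
The word error rate (WER) quoted in the abstract is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition for a single utterance. The function name `word_error_rate` and the whitespace tokenization are assumptions made here, and the abstract does not say how the paper normalizes text or averages across languages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words.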

151. ❌ A Theory of LLM Information Susceptibility

Authors: Zhuo-Yang Song, Hua Xing Zhu Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23626v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

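Scoring rationale: The paper's core is a theory of the performance limits of LLMs used as optimization modules in agentic systems (the theory of LLM information susceptibility). It is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each), because it explicitly studies the deployment and performance limits of LLMs in agentic systems. It is moderately relevant to 'Self-Correction' (5 points), because it discusses the possibility of agentic self-improvement. Other keywords such as MoE, SFT, and RAG are neither mentioned in nor relevant to the abstract, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a theory of LLM information susceptibility, which examines whether the intervention of a fixed LLM increases the performance susceptibility of a strategy set with respect to budget when computational resources are sufficiently large. Empirical validation shows that nested, co-scaling architectures open response channels unavailable to fixed configurations, providing predictive constraints for AI system design.

Abstract (Chinese translation)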

大型语言模型（LLM）越来越多地被部署为智能体系统中的优化模块，然而人们对这种基于LLM的改进的根本限制仍知之甚少。本文提出了一种关于LLM信息敏感性的理论，其核心假设是：当计算资源足够大时，固定LLM的介入不会增加策略集在预算方面的性能敏感性。我们开发了一个多变量效用函数框架，将该假设推广到具有多个共变预算通道的架构中，并讨论了共缩放能够超越敏感性界限的条件。我们在结构多样的领域和跨越一个数量级的模型规模上对理论进行了实证验证，结果表明嵌套式共缩放架构能够开启固定配置所不具备的响应通道。这些结果明确了LLM介入何时有效、何时无效，证明了统计物理学的工具可以为人工智能系统的设计提供预测性约束。如果敏感性假设普遍成立，该理论表明嵌套架构可能是实现开放式智能体自我改进的必要结构条件。

Abstract

Large language models (LLMs) are increasingly deployed as optimization modules in agentic systems, yet the fundamental limits of such LLM-mediated improvement remain poorly understood. Here we propose a theory of LLM information susceptibility, centred on the hypothesis that when computational resources are sufficiently large, the intervention of a fixed LLM does not increase the performance susceptibility of a strategy set with respect to budget. We develop a multi-variable utility-function framework that generalizes this hypothesis to architectures with multiple co-varying budget channels, and discuss the conditions under which co-scaling can exceed the susceptibility bound. We validate the theory empirically across structurally diverse domains and model scales spanning an order of magnitude, and show that nested, co-scaling architectures open response channels unavailable to fixed configurations. These results clarify when LLM intervention helps and when it does not, demonstrating that tools from statistical physics can provide predictive constraints for the design of AI systems. If the susceptibility hypothesis holds generally, the theory suggests that nested architectures may be a necessary structural condition for open-ended agentic self-improvement.

Keywords: LLM information susceptibility, agentic systems, multi-variable utility-function, co-scaling architectures, performance susceptibility, nested architectures, agentic self-improvement, statistical physics

152. ❌ Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks

Authors: Fatih Uenal Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23646v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

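Scoring rationale: The paper's core is an evaluation of frontier large language models on Swiss legal and regulatory tasks, so it is highly relevant to 'Large Language Models' (10 points). It includes a 'hallucination detection' task, which gives it some relevance to 'Hallucination Mitigation' (5 points). The other keywords concern model architectures, training methods, inference techniques, and application domains that the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

The paper introduces Swiss-Bench SBP-002, a trilingual benchmark for Swiss legal and regulatory tasks, and evaluates ten frontier large language models under zero-retrieval conditions. Even the best-performing model reaches only 38.2% correct, and an open-weight model leads the ranking, with several open-weight models matching or outperforming closed-source ones.

Abstract (Chinese translation)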

尽管近期研究已对大语言模型在瑞士法律翻译（Niklaus等人，2025）和基于大学考试的法律推理（Fan等人，2025）方面进行了基准测试，但目前尚无评估前沿模型在瑞士实际监管合规任务表现的基准。本文提出Swiss-Bench SBP-002——一个包含395项专家构建题目的三语基准测试，涵盖瑞士三大监管领域（FINMA、Legal-CH、EFK）、七种任务类型和三种语言（德语、法语、意大利语），并采用结构化三维评分框架评估了2026年3月的十款前沿模型。评分由三位匿名大语言模型评委（GPT-4o、Claude Sonnet 4、Qwen3-235B）通过多数投票机制完成（加权卡帕系数=0.605），其中100题子集的参考答案经独立人类法律专家验证（73%被评为正确，0%错误，法律准确性达完美标准）。结果显示三个描述性性能集群：A级（正确率35-38%）、B级（26-29%）和C级（13-21%）。该基准证明具有挑战性：即使排名最高的模型（Qwen 3.5 Plus）正确率仅达38.2%，错误率为47.3%，部分正确率为14.4%。任务类型难度差异显著：法律翻译和案例分析正确率达69-72%，而监管问答、幻觉检测和差距分析正确率均低于9%。在当前模型阵容（七个开源权重模型，三个闭源模型）中，开源权重模型位居榜首，且多个开源模型表现达到或超越闭源对标模型。这些发现为零检索条件下评估前沿模型处理瑞士监管任务的能力提供了首个实证参考基准。

Abstract

While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. I introduce Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian), and evaluate ten frontier models from March 2026 using a structured three-dimension scoring framework assessed via a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation and weighted kappa = 0.605, with reference answers validated by an independent human legal expert on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy). Results reveal three descriptive performance clusters: Tier A (35-38% correct), Tier B (26-29%), and Tier C (13-21%). The benchmark proves difficult: even the top-ranked model (Qwen 3.5 Plus) achieves only 38.2% correct, with 47.3% incorrect and 14.4% partially correct. Task type difficulty varies widely: legal translation and case analysis yield 69-72% correct rates, while regulatory Q&A, hallucination detection, and gap analysis remain below 9%. Within this roster (seven open-weight, three closed-source), an open-weight model leads the ranking, and several open-weight models match or outperform their closed-source counterparts. These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions.

Keywords: large language models, benchmark, Swiss legal tasks, regulatory compliance, frontier models, zero-retrieval, open-weight models, hallucination detection
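
The abstract describes a three-judge LLM panel whose verdicts are combined by majority vote. A minimal sketch of such an aggregation step might look like the following. The label strings, the function name `majority_vote`, and the `no_majority` fallback for a three-way split are all assumptions made here for illustration, since the abstract does not say how the paper encodes verdicts or resolves ties.

```python
from collections import Counter
from typing import Sequence


def majority_vote(labels: Sequence[str], fallback: str = "no_majority") -> str:
    """Return the label chosen by a strict majority of judges.

    The fallback value for the no-majority case is a hypothetical choice,
    not something stated in the abstract.
    """
    if not labels:
        raise ValueError("at least one judge label is required")
    label, count = Counter(labels).most_common(1)[0]
    # A strict majority needs more than half of the votes.
    return label if count > len(labels) / 2 else fallback


# Hypothetical verdicts from three judges for one benchmark item.
verdict = majority_vote(["correct", "correct", "partially_correct"])
```

With three judges, a fallback is only reached when all three disagree.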

153. ❌ Revisiting Real-Time Digging-In Effects: No Evidence from NP/Z Garden-Paths

Authors: Amani Maina-Kilaas, Roger Levy Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23624v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

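Scoring rationale: The paper mainly studies digging-in effects in human sentence processing, comparing human behavior with large language model predictions in experiments. It explicitly uses large language models (LLMs) as a point of comparison, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The other keywords concern technical principles of large models, application domains, or specific methods (such as MoE, quantization, inference acceleration, and AI for science) that the paper does not address, so they all score 0.

!!! tip deepseek-chat TL;DR

The study tests experimentally whether real-time digging-in effects exist in human sentence processing and compares human behavior with large language model predictions. It finds no evidence for real-time digging-in effects, and items with nonfinal disambiguation show reverse trends consistent with neural model predictions.

Abstract (Chinese translation)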

挖掘效应——即歧义区域越长，消解难度随之增加的现象——常被引证为支持句子处理自组织理论的依据，该理论认为结构承诺会随时间推移而增强。相比之下，惊奇理论预测除非延长操作确实改变了统计预期，否则不会出现此类效应，而神经语言模型似乎呈现出相反的模式。挖掘效应究竟是人类句子实时处理中稳健存在的现象，抑或是收尾处理过程或方法混淆的产物，目前尚不明确。我们通过迷宫任务与自定步速阅读两项实验，对英语NP/Z花园路径句进行研究，并将人类行为与一组大型语言模型的预测进行比较。结果未发现实时挖掘效应的证据。关键在于，句末消解与非句末消解的实验材料呈现出性质不同的模式：积极的挖掘效应趋势仅出现在句末位置，而该处的收尾效应会干扰解读。非句末项目——作为更纯粹的实时处理测试——则显示出与神经模型预测一致的反向趋势。

Abstract

Digging-in effects, where disambiguation difficulty increases with longer ambiguous regions, have been cited as evidence for self-organized sentence processing, in which structural commitments strengthen over time. In contrast, surprisal theory predicts no such effect unless lengthening genuinely shifts statistical expectations, and neural language models appear to show the opposite pattern. Whether digging-in is a robust real-time phenomenon in human sentence processing – or an artifact of wrap-up processes or methodological confounds – remains unclear. We report two experiments on English NP/Z garden-path sentences using Maze and self-paced reading, comparing human behavior with predictions from an ensemble of large language models. We find no evidence for real-time digging-in effects. Critically, items with sentence-final versus nonfinal disambiguation show qualitatively different patterns: positive digging-in trends appear only sentence-finally, where wrap-up effects confound interpretation. Nonfinal items – the cleaner test of real-time processing – show reverse trends consistent with neural model predictions.

Keywords: digging-in effects, sentence processing, garden-path sentences, large language models, human behavior, real-time processing, surprisal theory, neural language models

154. ❌ LLMORPH: Automated Metamorphic Testing of Large Language Models

Authors: Steven Cho, Stefano Ruberto, Valerio Terragni Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23611v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

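Scoring rationale: The paper's core is the development of LLMORPH, a tool built specifically for automated testing of large language models (LLMs), so it is highly relevant to 'Large Language Models' (10 points). It focuses on testing the reliability and consistency of LLMs, which touches indirectly on 'Hallucination Mitigation' (5 points) and 'Mechanistic Interpretability' (5 points), since a testing tool can help identify model errors and improve explainability. Other keywords such as MoE, SFT, and RAG concern specific model architectures, training methods, or application techniques that the paper does not study directly, so they score 0.

!!! tip deepseek-chat TL;DR

The paper presents LLMORPH, an automated tool based on metamorphic testing that detects inconsistencies in large language models on NLP tasks and effectively exposes potentially faulty behavior in models such as GPT-4 and LLAMA3.

Abstract (Chinese translation)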

自动化测试对于评估和提升大语言模型（LLM）的可靠性至关重要，然而缺乏用于验证输出正确性的自动化预言机仍是关键挑战。本文提出LLMORPH，一款专为执行自然语言处理（NLP）任务的大语言模型设计的自动化测试工具，该工具利用蜕变测试（Metamorphic Testing, MT）技术，在不依赖人工标注数据的情况下揭示模型缺陷行为。蜕变测试通过蜕变关系（Metamorphic Relations, MRs）从源测试输入生成衍生输入，从而无需昂贵标注数据即可检测模型输出的不一致性。LLMORPH面向需要评估基于LLM的NLP系统鲁棒性的研究人员和开发者。本文详细阐述了LLMORPH的设计、实现与实际应用，展示其如何轻松扩展到任意LLM、NLP任务及蜕变关系集合。在评估中，我们在四个NLP基准测试上应用了36个蜕变关系，测试了三种前沿大语言模型：GPT-4、LLAMA3和HERMES 2，累计执行超过56.1万次测试。结果表明LLMORPH能有效自动暴露模型的不一致行为。

Abstract

Automated testing is essential for evaluating and improving the reliability of Large Language Models (LLMs), yet the lack of automated oracles for verifying output correctness remains a key challenge. We present LLMORPH, an automated testing tool specifically designed for LLMs performing NLP tasks, which leverages Metamorphic Testing (MT) to uncover faulty behaviors without relying on human-labeled data. MT uses Metamorphic Relations (MRs) to generate follow-up inputs from source test input, enabling detection of inconsistencies in model outputs without the need of expensive labelled data. LLMORPH is aimed at researchers and developers who want to evaluate the robustness of LLM-based NLP systems. In this paper, we detail the design, implementation, and practical usage of LLMORPH, demonstrating how it can be easily extended to any LLM, NLP task, and set of MRs. In our evaluation, we applied 36 MRs across four NLP benchmarks, testing three state-of-the-art LLMs: GPT-4, LLAMA3, and HERMES 2. This produced over 561,000 test executions. Results demonstrate LLMORPH’s effectiveness in automatically exposing inconsistencies.

Keywords: Large Language Models, Automated Testing, Metamorphic Testing, NLP Tasks, Reliability Evaluation, Inconsistency Detection, GPT-4, LLAMA3
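
The general idea of metamorphic testing described in the abstract (derive a follow-up input from a source input through a metamorphic relation, then check that the two outputs stay consistent, with no labelled data needed) can be sketched as below. This is not LLMORPH's API. The toy model, the upper-casing relation, and every function name are hypothetical stand-ins chosen so the example runs without an LLM.

```python
from typing import Callable, List, Tuple


def toy_sentiment_model(text: str) -> str:
    """Hypothetical stand-in for an LLM call; deliberately case-sensitive."""
    return "positive" if "good" in text else "negative"


def uppercase_relation(text: str) -> str:
    """Hypothetical MR: upper-casing the input should not change the label."""
    return text.upper()


def run_metamorphic_test(
    model: Callable[[str], str],
    relation: Callable[[str], str],
    inputs: List[str],
) -> List[Tuple[str, str, str, str]]:
    """Return (source, follow_up, source_output, follow_up_output) for each violation."""
    violations = []
    for source in inputs:
        follow_up = relation(source)
        source_out = model(source)
        follow_out = model(follow_up)
        # No ground-truth label is needed: only consistency is checked.
        if source_out != follow_out:
            violations.append((source, follow_up, source_out, follow_out))
    return violations


violations = run_metamorphic_test(
    toy_sentiment_model, uppercase_relation, ["a good film", "a dull film"]
)
```

Here the first input triggers a violation because the toy model only matches lower-case "good", which is the kind of inconsistency a metamorphic relation is meant to surface.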

155. ❌ The Geometric Price of Discrete Logic: Context-driven Manifold Dynamics of Number Representations

Authors: Long Zhang, Dai-jun Lin, Wei-neng Chen Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23577v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

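Scoring rationale: The paper's core is the geometric mechanism of discrete logical reasoning in large language models (LLMs), which falls under innovation in LLM technical principles. Highly relevant keywords: 1) 'Large Language Models' (10 points): the paper explicitly studies LLMs, which are its central object of study; 2) 'Mechanistic Interpretability' (10 points): the paper probes the internal workings of LLMs through Gram-Schmidt decomposition, vector ablation, and related methods, which is mechanistic interpretability research. Moderately relevant keywords: 1) 'Chain of Thought' (5 points): the paper involves logical reasoning, which relates to multi-step reasoning; 2) 'System 2 Thinking' (5 points): it studies the formation of the discrete decision boundaries that in-depth reasoning requires; 3) 'Hallucination Mitigation' (5 points): it offers 'manifold entanglement' as a geometric explanation of hallucination and sycophancy. The other keywords have no direct connection to the paper's themes of geometric mechanisms and topological deformation, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper studies the geometric mechanism by which discrete logical reasoning forms in large language models (LLMs). It finds that task context acts as a non-isometric dynamical operator that drives topological deformation, forming logical boundaries through class-agnostic topological preservation and specific algebraic divergence, and it reveals a causal relationship between this geometric dynamic and model function (such as hallucination).

Abstract (Chinese translation)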

大语言模型（LLM）能够在连续的语义空间中实现平滑泛化，但严格的逻辑推理要求形成离散的决策边界。依赖线性等距投影的主流理论无法解决这一根本矛盾。本文认为，任务语境作为一种非等距的动态算子，施加了必要的“拓扑扭曲”。通过对残差流激活进行格拉姆-施密特分解，我们揭示了驱动此过程的双重调制机制：一种是类别无关的拓扑保持，用于锚定全局结构以防止语义坍缩；另一种是特定的代数发散，它定向地撕裂跨类别概念以锻造逻辑边界。我们在从简单映射到复杂素数测试的一系列任务梯度上验证了这一几何演化过程。关键的是，针对特定向量的消融实验在此拓扑结构与模型功能之间建立了严格的因果绑定：代数擦除发散组件会使奇偶分类准确率从100%坍缩至随机水平（38.57%）。此外，我们发现了一个三阶段的逐层几何动态，并证明在社会压力提示下，模型无法产生足够的发散。这导致了“流形纠缠”，从而从几何角度解释了谄媚与幻觉现象。最终，我们的研究修正了线性等距假设，证明LLM中离散逻辑的出现是以不可避免的拓扑形变为代价换取的。

Abstract

Large language models (LLMs) generalize smoothly across continuous semantic spaces, yet strict logical reasoning demands the formation of discrete decision boundaries. Prevailing theories relying on linear isometric projections fail to resolve this fundamental tension. In this work, we argue that task context operates as a non-isometric dynamical operator that enforces a necessary “topological distortion.” By applying Gram-Schmidt decomposition to residual-stream activations, we reveal a dual-modulation mechanism driving this process: a class-agnostic topological preservation that anchors global structure to prevent semantic collapse, and a specific algebraic divergence that directionally tears apart cross-class concepts to forge logical boundaries. We validate this geometric evolution across a gradient of tasks, from simple mapping to complex primality testing. Crucially, targeted specific vector ablation establishes a strict causal binding between this topology and model function: algebraically erasing the divergence component collapses parity classification accuracy from 100% to chance levels (38.57%). Furthermore, we uncover a three-phase layer-wise geometric dynamic and demonstrate that under social pressure prompts, models fail to generate sufficient divergence. This results in a “manifold entanglement” that geometrically explains sycophancy and hallucination. Ultimately, our findings revise the linear-isometric presumption, demonstrating that the emergence of discrete logic in LLMs is purchased at an irreducible cost of topological deformation.

Keywords: Large Language Models, Logical Reasoning, Geometric Mechanism, Topological Distortion, Manifold Dynamics, Mechanistic Interpretability, Gram-Schmidt Decomposition, Hallucination Explanation
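
The abstract mentions Gram-Schmidt decomposition and "algebraically erasing" a component of the activations. The basic building block of both is the orthogonal projection step: subtract from a vector its component along a chosen direction. The sketch below shows only that generic step on plain Python lists. The names `ablate_direction` and `dot` are assumptions made here, and the paper's actual procedure for choosing the direction and applying the ablation inside a model is not given in the abstract.

```python
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    """Euclidean inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def ablate_direction(activation: List[float], direction: List[float]) -> List[float]:
    """Remove the component of `activation` that lies along `direction`.

    This is the single projection step used in Gram-Schmidt
    orthogonalization; the result is orthogonal to `direction`.
    """
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        raise ValueError("direction must be a non-zero vector")
    scale = dot(activation, direction) / norm_sq
    return [a - scale * d for a, d in zip(activation, direction)]


# Toy 2-D example: erase the first coordinate axis from an activation.
ablated = ablate_direction([3.0, 4.0], [1.0, 0.0])
```

After ablation, the inner product between the result and the removed direction is zero, so no linear readout along that direction can recover the erased component.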

156. ❌ TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models

Authors: Jiaying Zhou, Zhihao Zhan, Ruifeng Zhai, Qinhan Lyu, Hao Liu, Keze Wang, Liang Lin, Guangrun Wang Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24584v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

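Scoring rationale: The paper studies instance-level grounding failures of vision-language-action (VLA) policies in robotic manipulation and proposes TAG, an inference-time guidance mechanism. Its core is improving the robustness of robotic VLA policies, which places it at the intersection of robotics, computer vision, and embodied AI. All of the given keywords explicitly target large language models (LLMs) and related techniques (training methods, inference techniques, application paradigms, and so on), whereas the object of study here is VLA policies (typically built on vision-language models and robot control policies), and the paper does not involve or discuss any LLM technique, principle, training method, or application. All keywords are therefore unrelated to the paper's content and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the loss of reliability that vision-language-action policies suffer in cluttered scenes because of instance-level grounding failures. It proposes TAG, an inference-time guidance mechanism that requires no change to the policy architecture and strengthens the influence of object evidence by contrasting policy predictions under the original observation and an object-erased observation, which consistently improves robustness and reduces erroneous executions across several standard manipulation benchmarks.

Abstract (Chinese translation)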

视觉-语言-动作（Vision–Language–Action，VLA）策略在将语言指令和视觉观测映射为机器人动作方面取得了显著进展，但在存在干扰物的杂乱场景中其可靠性会下降。通过分析失败案例，我们发现许多错误并非源于不可行的运动轨迹，而是由实例级定位失败导致：策略常生成看似合理的抓取轨迹，却略微偏离目标甚至作用于错误的对象实例。为解决这一问题，我们提出目标无关引导（Target-Agnostic Guidance，TAG），一种简单的推理时引导机制，旨在显式降低VLA策略中由干扰物和外观特征引起的偏差。受无分类器引导（Classifier-Free Guidance，CFG）启发，TAG通过对比原始观测与物体擦除观测下的策略预测，将二者的差异作为残差引导信号，以增强决策过程中物体证据的影响力。TAG无需修改策略架构，仅需极少的训练和推理调整即可与现有VLA策略集成。我们在标准操作基准测试（包括LIBERO、LIBERO-Plus和VLABench）中评估TAG，结果表明该方法能持续提升杂乱环境下的策略鲁棒性，并减少近失误和错误对象执行的情况。

Abstract

Vision–Language–Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions, but from instance-level grounding failures: the policy often produces a plausible grasp trajectory that lands slightly off-target or even on the wrong object instance. To address this issue, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts policy predictions under the original observation and an object-erased observation, and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process. TAG does not require modifying the policy architecture and can be integrated with existing VLA policies with minimal training and inference changes. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.

Keywords: Vision-Language-Action (VLA) policies, instance-level grounding failures, Target-Agnostic Guidance (TAG), inference-time guidance, robustness under clutter, object-erased observation, classifier-free guidance (CFG) inspired, robotic manipulation benchmarks
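
The abstract says TAG contrasts the policy's prediction under the original observation with its prediction under an object-erased observation and uses the difference as a residual steering signal, in the spirit of classifier-free guidance. One plausible reading, written in the usual CFG form, is sketched below. The exact formula, the name `guided_action`, and the `guidance_scale` parameter are assumptions made here, as the abstract does not spell them out.

```python
from typing import List


def guided_action(
    pred_original: List[float],
    pred_erased: List[float],
    guidance_scale: float = 1.0,  # hypothetical knob, not taken from the paper
) -> List[float]:
    """Push the prediction away from the object-erased one.

    Computes original + scale * (original - erased) per action dimension,
    which is an assumed CFG-style combination rather than the paper's
    confirmed rule.
    """
    if len(pred_original) != len(pred_erased):
        raise ValueError("predictions must have the same dimensionality")
    return [
        o + guidance_scale * (o - e)
        for o, e in zip(pred_original, pred_erased)
    ]


# Toy 2-D action: only the first dimension depends on the erased object.
steered = guided_action([1.0, 2.0], [0.5, 2.0], guidance_scale=2.0)
```

Dimensions where the two predictions agree are left unchanged, while dimensions that depend on the erased object are amplified. A scale of 0 recovers the unguided prediction.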

157. ❌ Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving

Authors: Linbo Wang, Yupeng Zheng, Qiang Chen, Shiwei Li, Yichen Zhang, Zebin Xing, Qichao Zhang, Xiang Li, Deheng Qian, Pengxuan Yang, Yihang Dong, Ce Hao, Xiaoqing Ye, Junyu han, Yifeng Pan, Dongbin Zhao Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24581v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

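Scoring rationale: The paper proposes the Latent-WAM framework, whose core is a Dynamic Latent World Model (DLWM) for autonomous driving planning, so it is highly relevant to 'World Models AND General World Models' (10 points). It uses a foundation model to extract geometric knowledge, which gives it some relevance to 'Large Language Models OR LLMs OR Foundation Models' (5 points). The framework involves end-to-end training, which gives it some relevance to 'Pre-training OR Continual Pre-training OR Domain Adaptation' and 'Post-training OR Supervised Fine-tuning OR SFT' (5 points each). An autonomous driving system can be viewed as an autonomous agent, which gives it some relevance to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5 points). Other keywords such as MoE, SLMs, RAG, and CoT are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Latent-WAM framework, which addresses inadequate world representations in autonomous driving through a Spatial-Aware Compressive World Encoder and a Dynamic Latent World Model, and achieves state-of-the-art trajectory planning performance on NAVSIM v2 and HUGSIM.

Abstract (Chinese translation)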

本文提出Latent-WAM，一种高效的端到端自动驾驶框架，通过空间感知与动态信息融合的潜在世界表征实现强大的轨迹规划能力。现有基于世界模型的规划器因表征压缩不足、空间理解有限且未能充分利用时序动态信息，在有限数据和计算资源下常产生次优规划结果。Latent-WAM通过两个核心模块解决这些局限：空间感知压缩世界编码器（Spatial-Aware Compressive World Encoder, SCWE）从基础模型中提取几何知识，并通过可学习查询将多视角图像压缩为紧凑场景标记；动态潜在世界模型（Dynamic Latent World Model, DLWM）采用因果Transformer架构，以前序视觉与运动表征为条件自回归预测未来世界状态。在NAVSIM v2和HUGSIM数据集上的大量实验取得了新的最优性能：NAVSIM v2上达到89.3 EPDMS，HUGSIM上获得28.9 HD-Score，以显著更少的训练数据和仅1.04亿参数的紧凑模型，超越此前最佳无感知方法3.2 EPDMS。

Abstract

We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub-optimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.

Keywords: autonomous driving, world model, latent representation, trajectory planning, end-to-end framework, spatial-aware compression, dynamic prediction, Transformer

158. ❌ Vision-Language Models vs Human: Perceptual Image Quality Assessment

Authors: Imran Mehmood, Imad Ali Shah, Ming Ronnier Luo, Brian Deegan Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24578v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

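Scoring rationale: This paper studies the use of vision-language models (VLMs) for perceptual image quality assessment and compares them against human psychophysical data. Its topic sits at the intersection of computer vision and human perception and mainly evaluates an application of VLMs, rather than innovating on large language models (LLMs) or deep learning techniques. All keywords target LLM techniques, training methods, inference optimization, alignment, and application frameworks, whereas this paper studies VLMs, a different model class and application area. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to a scientific assessment task (perceived image quality). This is not a core match, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies whether vision-language models (VLMs) can approximate human judgments in perceptual image quality assessment (IQA). A systematic benchmark finds that VLMs align closely with humans on colorfulness (ρ up to 0.93) but perform poorly on contrast, and that high model self-consistency does not necessarily imply high human alignment.

Abstract (Chinese translation)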

心理物理学实验仍是感知图像质量评估（IQA）最可靠的方法，但其成本高昂且可扩展性有限，推动了自动化方法的发展。本研究探讨了视觉语言模型（Vision Language Models, VLMs）能否在三种图像质量维度（对比度、色彩丰富度和整体偏好）上近似人类感知判断。我们以心理物理学数据为基准，对六种VLM（四种专有模型和两种开放权重模型）进行了评估。本研究通过对比人类心理物理学数据，首次系统性地建立了VLM在感知IQA任务上的基准。结果显示，模型表现存在强烈的属性依赖性差异：在色彩丰富度上与人类高度一致（ρ最高达0.93）的模型在对比度上表现不佳，反之亦然。属性权重分析进一步表明，在评估整体偏好时，大多数VLM与心理物理学数据类似，会赋予色彩丰富度比对比度更高的权重。模型内部一致性分析揭示了一个反直觉的权衡关系：自我一致性最高的模型未必与人类判断最一致，这表明响应变异性反映了模型对场景依赖性感知线索的敏感性。此外，人类与VLM的一致性会随感知可分离度的提高而增强，说明当刺激差异被清晰表达时，VLM的评估更为可靠。

Abstract

Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness, and overall preference. Six VLMs (four proprietary and two open-weight models) are benchmarked against psychophysical data. This work presents a systematic benchmark of VLMs for perceptual IQA through comparison with human psychophysical data. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness (ρ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness than to contrast when evaluating overall preference, similar to the psychophysical data. Intra-model consistency analysis reveals a counterintuitive tradeoff: the most self-consistent models are not necessarily the most human-aligned, suggesting that response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement increases with perceptual separability, indicating that VLMs are more reliable when stimulus differences are clearly expressed.

Keywords: Vision-Language Models, Perceptual Image Quality Assessment, Psychophysical Experiments, Human Alignment, Contrast, Colorfulness, Benchmark, Self-Consistency
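
The ρ quoted in the abstract is a correlation between model ratings and human ratings. As a minimal sketch of how such a value can be computed, the block below assumes ρ denotes Spearman's rank correlation (the abstract does not name the coefficient). The scores in `human` and `model` are made-up illustrative numbers, not data from the paper, and the function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(human_scores, model_scores):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(human_scores), average_ranks(model_scores))


# Hypothetical per-image colorfulness ratings (illustrative only).
human = [1.2, 2.5, 3.1, 4.0, 4.8]
model = [1, 3, 2, 4, 5]
rho = spearman_rho(human, model)
```

Because only ranks enter the computation, a model can reach a high ρ without matching the human scale, as long as it orders the images the same way.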

159. ❌ Towards Training-Free Scene Text Editing

Authors: Yubo Li, Xugong Qin, Peng Zhang, Hailun Lin, Gangyan Zeng, Kexin Zhang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24571v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

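Scoring rationale: The paper "Towards Training-Free Scene Text Editing" focuses on scene text editing in computer vision and proposes TextFlow, a training-free framework that combines Attention Boost with Flow Manifold Steering. Although the work falls under AI applications, all scoring keywords target large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on). The paper does not involve LLMs, deep learning model training, or the technical principles of large models, nor is it applied to a scientific domain such as bioinformatics. All keywords therefore score 0, indicating no relevance.

!!! tip deepseek-chat TL;DR

The paper proposes TextFlow, a training-free framework that combines Attention Boost and Flow Manifold Steering to edit scene text in natural images efficiently and with high fidelity, with performance comparable to or better than training-based methods.

Abstract (Chinese translation)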

场景文本编辑旨在修改自然图像中的文本内容，同时保持视觉真实性与语义一致性。现有方法通常需要针对特定任务进行训练或依赖配对数据，这限制了其可扩展性与适应性。本文提出TextFlow，一种无需训练的端到端场景文本编辑框架，它融合了注意力增强（AttnBoost）与流形引导调控（Flow Manifold Steering, FMS）的优势，无需额外训练即可实现灵活、高保真的文本操控。具体而言，FMS通过建模字符与背景区域的视觉流来保持结构与风格一致性，而AttnBoost则通过基于注意力的引导机制增强文本内容的渲染质量。通过协同利用这两个互补模块，我们的方法以即插即用方式，通过语义对齐与空间优化实现端到端的文本编辑。大量实验表明，本框架在视觉质量与文本准确性上达到甚至超越了基于训练的同类方法，并能良好泛化至多样化的场景与语言环境。本研究将场景文本编辑推向更高效、可泛化且无需训练的新范式。代码发布于https://github.com/lyb18758/TextFlow

Abstract

Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves the structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm. Code is available at https://github.com/lyb18758/TextFlow

Keywords: Scene Text Editing, Training-Free, Attention Boost, Flow Manifold Steering, Visual Realism, Semantic Consistency, Plug-and-Play, TextFlow

160. ❌ POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan

Authors: Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kumar Das, Monorama Swain, Yufang Hou, Elisabeth Andre, Khalid Mahmood Malik, Markus Schedl, Shah Nawaz Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24569v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

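Scoring rationale: This paper focuses on multimodal speaker identification, in particular the challenges of missing modalities and cross-lingual conditions. It covers audio-visual multimodal systems, speaker identification, missing-modality handling, and cross-lingual robustness, but does not involve large language models (LLMs), innovations in deep learning techniques, or any technique among the scoring keywords (such as MoE, SFT, RAG, or quantization). The paper belongs to computer vision and speech processing and has no direct connection to the large-model and deep-learning topics in the keyword list.

!!! tip deepseek-chat TL;DR

The paper presents the POLY-SIM Grand Challenge 2026, which aims to advance research on multimodal speaker identification under missing-modality and cross-lingual conditions by defining a dataset, tasks, and an evaluation framework that foster more robust and practical systems.

Abstract (Chinese translation)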

多模态说话人识别系统通常假设在训练和测试阶段均能获得完整且同质的音频-视觉模态数据。然而，在实际应用中，此类假设往往难以成立。视觉信息可能因遮挡、摄像头故障或隐私限制而缺失，同时多语言说话人由于跨语言的语言差异性引入了额外的复杂性。这些挑战显著影响了多模态说话人识别系统的鲁棒性与泛化能力。POLY-SIM 2026 国际挑战赛旨在推动缺失模态与跨语言条件下的多模态说话人识别研究。具体而言，本次挑战赛鼓励开发能够有效利用不完整多模态输入，并在不同语言间保持强劲性能的鲁棒方法。本报告介绍了 POLY-SIM 2026 国际挑战赛的设计与组织安排，包括数据集、任务定义、评估协议及基线模型。通过提供标准化的基准与评估框架，本挑战赛旨在推动构建更鲁棒、更实用的多模态说话人识别系统。

Abstract

Multimodal speaker identification systems typically assume the availability of complete and homogeneous audio-visual modalities during both training and testing. However, in real-world applications, such assumptions often do not hold. Visual information may be missing due to occlusions, camera failures, or privacy constraints, while multilingual speakers introduce additional complexity due to linguistic variability across languages. These challenges significantly affect the robustness and generalization of multimodal speaker identification systems. The POLY-SIM Grand Challenge 2026 aims to advance research in multimodal speaker identification under missing-modality and cross-lingual conditions. Specifically, the Grand Challenge encourages the development of robust methods that can effectively leverage incomplete multimodal inputs while maintaining strong performance across different languages. This report presents the design and organization of the POLY-SIM Grand Challenge 2026, including the dataset, task formulation, evaluation protocol, and baseline model. By providing a standardized benchmark and evaluation framework, the challenge aims to foster progress toward more robust and practical multimodal speaker identification systems.

Keywords: multimodal speaker identification, missing modality, cross-lingual, audio-visual, robustness, generalization, evaluation framework, baseline model

161. ❌ The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series

Authors: Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mirela Tulbure, Patrick Hostert, Stefan Erasmi Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24552v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

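Scoring rationale: This paper uses a Vision Transformer model and Sentinel-2 time series to distinguish organic from conventional farming systems, an agricultural remote-sensing application. It does not involve large language models (LLMs), innovations in deep learning techniques, or applications of large models across domains. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the study applies AI to agricultural science (sustainable agriculture). It is not core bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper examines the feasibility of distinguishing organic from conventional farming systems with a Vision Transformer model and Sentinel-2 time series. It finds that classification performance varies by crop type, that spatial context improves classification accuracy, and that multitask learning offers only limited benefit.

Abstract (Chinese translation)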

有机农业是实现更可持续农业的关键要素。为更好地理解有机农业的发展与影响，需要全面且具有空间明确性的信息。本研究提出了一种利用年内哨兵-2号时间序列数据区分有机与常规耕作体系的方法，并探讨了影响该区分效果的两个因素：在并行任务中联合学习作物类型信息的作用，以及空间上下文的影响。基于时空视觉变换器（TSViT）架构的视觉变换器模型被用于构建两种耕作体系的分类模型。该模型通过扩展实现了对作物类型的同步学习，形成了多任务学习框架。通过改变输入模型的图像块尺寸，我们测试了空间上下文对两项任务分类精度的影响。研究表明，利用多光谱遥感数据区分有机与常规耕作体系是可行的，但分类性能在不同作物类型间存在显著差异。对于冬黑麦、冬小麦和冬燕麦等作物，可获得0.8或更高的F1分数；而其他农业用地类型（如永久草地、果园、葡萄园和啤酒花）则无法被可靠区分，其有机管理类别的F1分数仅为0.4或更低。耕作体系与作物类型的联合学习相较于单任务学习仅能提供有限的额外优势。相比之下，融入更广泛的空间上下文信息能同时提升耕作体系与作物类型分类的性能。总体而言，我们证明了在多类型农业区域利用多光谱遥感数据对农业耕作体系进行分类是可行的。

Abstract

Organic farming is a key element in achieving more sustainable agriculture. For a better understanding of the development and impact of organic farming, comprehensive, spatially explicit information is needed. This study presents an approach for the discrimination of organic and conventional farming systems using intra-annual Sentinel-2 time series. In addition, it examines two factors influencing this discrimination: the joint learning of crop type information in a concurrent task and the role of spatial context. A Vision Transformer model based on the Temporo-Spatial Vision Transformer (TSViT) architecture was used to construct a classification model for the two farming systems. The model was extended for simultaneous learning of the crop type, creating a multitask learning setting. By varying the patch size presented to the model, we tested the influence of spatial context on the classification accuracy of both tasks. We show that discrimination between organic and conventional farming systems using multispectral remote sensing data is feasible. However, classification performance varies substantially across crop types. For several crops, such as winter rye, winter wheat, and winter oat, F1 scores of 0.8 or higher can be achieved. In contrast, other agricultural land use classes, such as permanent grassland, orchards, grapevines, and hops, cannot be reliably distinguished, with F1 scores for the organic management class of 0.4 or lower. Joint learning of farming system and crop type provides only limited additional benefits over single-task learning. In contrast, incorporating wider spatial context improves the performance of both farming system and crop type classification. Overall, we demonstrate that a classification of agricultural farming systems is possible in a diverse agricultural region using multispectral remote sensing data.

Keywords: organic farming, Sentinel-2 time series, Vision Transformer, multitask learning, spatial context, agricultural classification, remote sensing, crop type discrimination
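
The abstract reports per-class F1 scores (0.8 or higher for some crops, 0.4 or lower for the organic class of others). As a small illustration of what that metric measures, the sketch below computes F1 for one class from true and predicted labels. The label lists are invented for illustration and the function name is hypothetical; nothing here reproduces the paper's data or evaluation code.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 = harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: precision and recall are both 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical field-level labels (illustrative only).
y_true = ["organic", "organic", "conventional", "conventional", "organic"]
y_pred = ["organic", "conventional", "conventional", "organic", "organic"]
organic_f1 = f1_for_class(y_true, y_pred, positive="organic")
```

Reporting F1 for the organic class separately matters when organic fields are the minority, since overall accuracy can stay high even when that class is poorly detected.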

Authors: Dipam Goswami, Simone Magistri, Gido M. van de Ven, Bartłomiej Twardowski, Andrew D. Bagdanov, Tinne Tuytelaars, Joost van de Weijer Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24528v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

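Scoring rationale: The paper studies vision-language models (VLMs) such as CLIP for few-shot image classification, improving performance by mixing image and text prototypes and analyzing their bias-variance behavior. All given keywords relate to large language models (LLMs) and associated techniques (training methods, inference optimization, alignment, agents, and so on) or to AI applications in specific scientific fields, whereas this paper focuses on few-shot classification with VLMs and does not involve LLM techniques, large-model innovation, or AI applications in science. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper studies how mixing and semantically aligning image and text prototypes can improve CLIP-based few-shot image classification. It proposes combining a text-aligned mixed prototype classifier with an image-specific LDA classifier, which outperforms existing methods on few-shot classification benchmarks.

Abstract (Chinese translation)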

以CLIP为代表的视觉语言模型（VLMs）的训练目标是对齐文本与图像对。为提升基于CLIP的小样本图像分类性能，近期研究发现，除文本嵌入外，来自训练集的图像嵌入同样是重要的信息来源。本研究探讨了直接融合图像与文本原型对小样本分类的影响，并从偏差-方差角度进行了分析。我们证明，混合原型的作用类似于收缩估计量。尽管混合原型提升了分类性能，但图像原型仍会以实例特定的背景或上下文信息形式引入噪声。为仅捕获图像空间中与给定分类任务相关的信息，我们提出将图像原型投影到语义文本嵌入空间的主方向上，从而获得文本对齐的语义图像子空间。这些文本对齐的图像原型与文本嵌入混合后，能进一步提升分类效果。然而，对于CLIP中跨模态对齐较弱的下游数据集，语义对齐可能并非最优解。我们证明，通过使用类别协方差建模各向异性，仍可有效利用图像子空间。实验表明，结合文本对齐的混合原型分类器与图像特定的LDA分类器，在多个小样本分类基准测试中均优于现有方法。

Abstract

Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.

Keywords: Vision-language models, CLIP, few-shot classification, prototype mixing, cross-modal alignment, text-aligned image subspace, bias-variance perspective, LDA classifier
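
The abstract describes mixing image and text prototypes and notes that the mix behaves like a shrinkage estimator. A rough sketch of that idea follows, assuming a simple convex combination with a hypothetical weight `alpha` and cosine-similarity classification. The abstract does not give the exact formula, so this is one plausible form rather than the authors' method, and the 2-D vectors are toy stand-ins for CLIP embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def mix_prototype(text_emb, image_embs, alpha):
    """Pull the few-shot image mean toward the text embedding.

    alpha = 1.0 uses only the text prototype, alpha = 0.0 only the image
    prototype. `alpha` is an assumed hyperparameter, not taken from the paper.
    """
    text_proto = normalize(text_emb)
    image_proto = normalize(mean_vector([normalize(e) for e in image_embs]))
    mixed = [alpha * t + (1 - alpha) * i for t, i in zip(text_proto, image_proto)]
    return normalize(mixed)


def classify(query_emb, prototypes):
    """Return the class whose prototype has the highest cosine similarity."""
    query = normalize(query_emb)
    return max(
        prototypes,
        key=lambda name: sum(q * p for q, p in zip(query, prototypes[name])),
    )


# Toy embeddings (illustrative only).
text_embs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
shots = {
    "cat": [[0.9, 0.3], [0.8, 0.1]],
    "dog": [[0.2, 0.9], [0.4, 1.0]],
}
prototypes = {
    name: mix_prototype(text_embs[name], shots[name], alpha=0.5)
    for name in text_embs
}
predicted = classify([0.7, 0.2], prototypes)
```

With few shots the image mean is a noisy estimate, so moving it toward the text embedding trades a little bias for lower variance, which is the shrinkage reading the abstract refers to.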

163. ❌ Toward Physically Consistent Driving Video World Models under Challenging Trajectories

Authors: Jiawei Zhou, Zhenxin Zhu, Lingyi Du, Linye Lyu, Lijun Zhou, Zhanqian Wu, Hongcheng Luo, Zhuotao Tian, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Yu Li Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24506v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

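Scoring rationale: The paper focuses on video-generation world models for autonomous driving simulation, centered on physically consistent video generation. It is highly relevant only to "World Models AND General World Models" (10 points), because it explicitly studies "world models for autonomous driving simulation" and proposes the PhyGenesis world model. The other keywords concern large language models, training techniques, reasoning methods, agent systems, and similar topics that are unrelated to the paper's focus on video generation and physical simulation, so they score 0.

!!! tip deepseek-chat TL;DR

To address the physical inconsistencies that existing driving video world models produce under challenging trajectories, the paper proposes the PhyGenesis framework. It combines a physical condition generator with a physics-enhanced video generator, trained on a large-scale physics-rich dataset, to generate driving videos with high visual fidelity and strong physical consistency.

Abstract (Chinese translation)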

视频生成模型已展现出作为自动驾驶仿真世界模型的强大潜力。然而，现有方法主要在真实世界驾驶数据集上进行训练，这些数据大多包含自然且安全的驾驶场景。因此，当前模型在处理具有挑战性或反事实的轨迹条件时——例如由模拟器或规划系统生成的不完美轨迹——常常失败，产生的视频存在严重的物理不一致性和伪影。为应对这一局限，我们提出了PhyGenesis，这是一个旨在生成具有高视觉保真度和强物理一致性的驾驶视频的世界模型。我们的框架包含两个关键组件：(1) 一个物理条件生成器，将可能无效的轨迹输入转化为物理上合理的条件；(2) 一个物理增强的视频生成器，在这些条件下生成高保真度的多视角驾驶视频。为有效训练这些组件，我们构建了一个大规模、富含物理信息的异构数据集。具体而言，除了真实世界驾驶视频，我们使用CARLA模拟器生成了多样化的挑战性驾驶场景，并从中提取监督信号，以指导模型学习极端条件下的物理基础动力学。这种挑战性轨迹学习策略实现了轨迹校正，并促进了物理一致的视频生成。大量实验表明，PhyGenesis在各项指标上持续优于现有先进方法，尤其在挑战性轨迹上表现突出。我们的项目页面位于：https://wm-research.github.io/PhyGenesis/。

Abstract

Video generation models have shown strong potential as world models for autonomous driving simulation. However, existing approaches are primarily trained on real-world driving datasets, which mostly contain natural and safe driving scenarios. As a result, current models often fail when conditioned on challenging or counterfactual trajectories (such as imperfect trajectories generated by simulators or planning systems), producing videos with severe physical inconsistencies and artifacts. To address this limitation, we propose PhyGenesis, a world model designed to generate driving videos with high visual fidelity and strong physical consistency. Our framework consists of two key components: (1) a physical condition generator that transforms potentially invalid trajectory inputs into physically plausible conditions, and (2) a physics-enhanced video generator that produces high-fidelity multi-view driving videos under these conditions. To effectively train these components, we construct a large-scale, physics-rich heterogeneous dataset. Specifically, in addition to real-world driving videos, we generate diverse challenging driving scenarios using the CARLA simulator, from which we derive supervision signals that guide the model to learn physically grounded dynamics under extreme conditions. This challenging-trajectory learning strategy enables trajectory correction and promotes physically consistent video generation. Extensive experiments demonstrate that PhyGenesis consistently outperforms state-of-the-art methods, especially on challenging trajectories. Our project page is available at: https://wm-research.github.io/PhyGenesis/.

Keywords: world models, autonomous driving simulation, video generation, physical consistency, challenging trajectories, PhyGenesis, physics-enhanced video generator, CARLA simulator

164. ❌ OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

Authors: Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, Yue Wu, Liefeng Bo, Siliang Tang, Zhao Zhong Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24458v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

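Scoring rationale: The paper focuses on video generation and proposes the OmniWeaving model for unified video generation. It is unrelated to most keywords but touches three: 1) "Pre-training" (8 points): the paper explicitly mentions a massive-scale pretraining dataset; 2) "Chain of Thought" (8 points): the model has reasoning capabilities used to infer complex user intentions; 3) "LLM Agents" (8 points): the model is described as an intelligent agent for video creation. No other keywords are covered.

!!! tip deepseek-chat TL;DR

To address the gap between open-source unified video generation models and proprietary systems, the paper proposes OmniWeaving, which uses large-scale pretraining and reasoning capabilities to achieve unified video generation with multimodal composition, reaching state-of-the-art performance among open-source models.

Abstract (Chinese translation)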

尽管Seedance-2.0等专有系统在全能视频生成领域取得了显著成功，开源替代方案仍明显落后。大多数学术模型仍高度碎片化，现有少数构建统一视频生成模型的尝试，仍难以在单一框架内无缝整合多样化任务。为弥合这一差距，我们提出了OmniWeaving——一个具备强大多模态组合与推理感知能力的全层级视频生成模型。通过利用涵盖多样化组合与推理增强场景的大规模预训练数据集，OmniWeaving能够学习对交错出现的文本、多图像及视频输入进行时序绑定，同时作为智能代理推断复杂用户意图以进行精细视频创作。此外，我们推出了IntelligentVBench——首个为严格评估下一代智能统一视频生成而设计的综合性基准测试。大量实验表明，OmniWeaving在开源统一模型中实现了最先进的性能表现。代码与模型即将公开。项目页面：https://omniweaving.github.io。

Abstract

While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives significantly lag behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models. The codes and model will be made publicly available soon. Project Page: https://omniweaving.github.io.

Keywords: video generation, unified model, multimodal composition, reasoning, pretraining, intelligent agent, benchmark, open-source

165. ❌ Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models

Authors: Siqi Liu, Xinyang Li, Bochao Zou, Junbao Zhuo, Huimin Ma, Jiansheng Chen Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24484v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

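Scoring rationale: The paper studies the Theory of Mind (ToM) abilities of multimodal large language models (MLLMs) on video input and proposes the VisionToM intervention framework to strengthen task-aware reasoning. Relevant keywords: 1) 'Large Language Models' (10 points): the paper explicitly studies LLMs/MLLMs; 2) 'Hallucination Mitigation' (8 points): the paper examines the impact of LLM hallucinations on the task and tries to reduce them; 3) 'Mechanistic Interpretability' (8 points): the paper analyzes the model's internal attention behavior from an interpretability perspective; 4) 'Alignment' (5 points): the paper mentions pushing machine-human collaboration toward greater alignment; 5) 'Chain of Thought' and 'System 2 Thinking' (5 points each): the paper involves task-aware and in-depth reasoning. Other keywords, such as MoE, SLMs, Scaling Laws, and Pre-training, are unrelated to the paper and receive 0 points.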
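
The pass/fail arithmetic behind this entry can be sketched as follows. The threshold formula (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report's configuration section. The per-row rule (weight × relevance, capped) and the `>=` comparison are assumptions made for illustration, since the report does not state them. Weights are copied as displayed in the table (0.0).

```python
# Sketch of the report's pass/fail arithmetic. The threshold formula comes from
# the report configuration; the per-row rule and the >= comparison are ASSUMPTIONS.
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the configuration
TOTAL_WEIGHT = 27.0     # 27 keywords, total weight 27.0


def passing_score(total_weight: float) -> float:
    """Passing score = 5.0 + 0.8 * total weight (formula stated in the report)."""
    return 5.0 + 0.8 * total_weight


def row_score(weight: float, relevance: float) -> float:
    """ASSUMED row rule: weight times relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_PER_KEYWORD)


# (weight, relevance) pairs for the non-zero-relevance rows of the table above.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 5.0), (0.0, 8.0), (0.0, 8.0)]

total = sum(row_score(w, r) for w, r in rows)
threshold = passing_score(TOTAL_WEIGHT)
passed = total >= threshold  # ASSUMED: a paper passes when it meets the threshold

print(f"total={total:.1f} threshold={threshold:.1f} passed={passed}")
```

With every displayed weight at 0.0, each row contributes nothing regardless of its relevance, which is consistent with the 0.0 / 26.6 score shown for this paper.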

!!! tip deepseek-chat TL;DR

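To address the weak Theory of Mind abilities of multimodal large language models on video-only input, the paper proposes the VisionToM intervention framework, which steers model attention by aligning visual representations with semantic targets, substantially improving performance and explanation accuracy on video Theory of Mind tasks.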

Abstract

As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model’s attention through different layers of visual features. This guidance reduces the model’s reliance on spurious linguistic priors, leading to more reliable multimodal language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark-an egocentric, real-world video dataset for ToM with three multiple-choice QA settings-demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents’ mental states, pushing machine-human collaboration toward greater alignment.

Keywords: Multimodal Large Language Models, Theory of Mind, Video Understanding, Attention Intervention, Hallucination Mitigation, Interpretability, Task-aware Reasoning, VisionToM
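
The abstract above describes computing intervention vectors that move visual representations toward the correct semantic targets. The abstract does not give the formula, so the following is only a generic difference-of-means steering sketch, not VisionToM's actual method. The function names, the toy two-dimensional features, and the `alpha` strength are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def intervention_vector(correct: List[Vector], incorrect: List[Vector]) -> Vector:
    """ASSUMED rule: direction from the mean 'incorrect' feature to the mean 'correct' one."""
    mu_correct = mean_vector(correct)
    mu_incorrect = mean_vector(incorrect)
    return [c - i for c, i in zip(mu_correct, mu_incorrect)]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden representation along the intervention direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy features (hypothetical): rows are visual representations for examples
# the model answered correctly vs. incorrectly.
correct_feats = [[1.0, 0.0], [0.8, 0.2]]
incorrect_feats = [[0.0, 1.0], [0.2, 0.8]]

v = intervention_vector(correct_feats, incorrect_feats)
steered = steer([0.5, 0.5], v, alpha=0.5)
```

In a real model the shift would be applied to activations at chosen layers during the forward pass; which layers and how strongly are design choices the abstract leaves open.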

166. ❌ Positive-First Most Ambiguous: A Simple Active Learning Criterion for Interactive Retrieval of Rare Categories

Authors: Kawtar Zaher, Olivier Buisson, Alexis Joly Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24480v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

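Scoring rationale: The paper studies an active learning (AL) method in computer vision, focusing on class imbalance in fine-grained visual retrieval, and proposes a new criterion called PF-MA. Its content is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), because those keywords refer specifically to natural language processing or general large-model techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper explicitly mentions applications to biodiversity monitoring, ecological studies, and botanical data, which count as AI applied to science, so it receives 5 points (some relevance). The paper's core contribution is an algorithmic improvement for visual retrieval, not an innovation in large models or deep learning principles.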

!!! tip deepseek-chat TL;DR

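To address severe class imbalance and limited annotation budgets in fine-grained visual retrieval, the paper proposes an active learning criterion called PF-MA, which prioritizes near-boundary samples that are likely positive and effectively improves retrieval of rare and visually subtle categories in realistic human-in-the-loop settings.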

Abstract

Real-world fine-grained visual retrieval often requires discovering a rare concept from large unlabeled collections with minimal supervision. This is especially critical in biodiversity monitoring, ecological studies, and long-tailed visual domains, where the target may represent only a tiny fraction of the data, creating highly imbalanced binary problems. Interactive retrieval with relevance feedback offers a practical solution: starting from a small query, the system selects candidates for binary user annotation and iteratively refines a lightweight classifier. While Active Learning (AL) is commonly used to guide selection, conventional AL assumes symmetric class priors and large annotation budgets, limiting effectiveness in imbalanced, low-budget, low-latency settings. We introduce Positive-First Most Ambiguous (PF-MA), a simple yet effective AL criterion that explicitly addresses the class imbalance asymmetry: it prioritizes near-boundary samples while favoring likely positives, enabling rapid discovery of subtle visual categories while maintaining informativeness. Unlike standard methods that oversample negatives, PF-MA consistently returns small batches with a high proportion of relevant samples, improving early retrieval and user satisfaction. To capture retrieval diversity, we also propose a class coverage metric that measures how well selected positives span the visual variability of the target class. Experiments on long-tailed datasets, including fine-grained botanical data, demonstrate that PF-MA consistently outperforms strong baselines in both coverage and classifier performance, across varying class sizes and descriptors. Our results highlight that aligning AL with the asymmetric and user-centric objectives of interactive fine-grained retrieval enables simple yet powerful solutions for retrieving rare and visually subtle categories in realistic human-in-the-loop settings.

Keywords: Active Learning, Fine-grained Visual Retrieval, Class Imbalance, Relevance Feedback, Rare Category Discovery, Interactive Retrieval, PF-MA, Biodiversity Monitoring
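
The abstract above says PF-MA prioritizes near-boundary samples while favoring likely positives. The exact criterion is not given there, so the sketch below is one plausible reading rather than the authors' definition: rank likely positives (score at or above 0.5) from most to least ambiguous, then fall back to the most ambiguous negatives. The 0.5 boundary, the function names, and the toy scores are assumptions.

```python
from typing import List, Sequence


def most_ambiguous(probs: Sequence[float], batch_size: int) -> List[int]:
    """Baseline uncertainty sampling: scores closest to the 0.5 boundary first."""
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return order[:batch_size]


def positive_first_most_ambiguous(probs: Sequence[float], batch_size: int) -> List[int]:
    """ASSUMED reading of PF-MA: ambiguous likely positives first, then ambiguous negatives."""
    # Likely positives, nearest to the boundary first (ascending score).
    likely_pos = sorted(
        (i for i in range(len(probs)) if probs[i] >= 0.5), key=lambda i: probs[i]
    )
    # Likely negatives, nearest to the boundary first (descending score).
    likely_neg = sorted(
        (i for i in range(len(probs)) if probs[i] < 0.5), key=lambda i: -probs[i]
    )
    return (likely_pos + likely_neg)[:batch_size]


# Hypothetical classifier scores for six unlabeled images.
scores = [0.05, 0.48, 0.55, 0.95, 0.62, 0.30]

baseline_batch = most_ambiguous(scores, 3)
pf_ma_batch = positive_first_most_ambiguous(scores, 3)
print(baseline_batch, pf_ma_batch)
```

On these toy scores the baseline batch includes a sample from the negative side of the boundary, while the positive-first batch contains only samples scored at or above 0.5, which mirrors the abstract's point about returning batches with a high proportion of relevant samples.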

167. ❌ Unleashing Vision-Language Semantics for Deepfake Video Detection

Authors: Jiawen Zhu, Yunqi Miao, Xueyi Zhang, Jiankang Deng, Guansong Pang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24454v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

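Scoring rationale: The paper studies deepfake video detection based on pre-trained vision-language models (VLMs). It belongs to computer vision and multimodal learning rather than to innovation in large language models (LLMs) or deep learning principles. Among the keywords, only 'Pre-training' and 'Post-training' relate to the pre-trained VLM and the fine-tuning the paper uses, but these are not its core contribution, so each receives 5 points; the remaining keywords (LLMs, reasoning, alignment, compression, and so on) are unrelated and receive 0 points.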

!!! tip deepseek-chat TL;DR

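The paper proposes VLAForge, a new framework that uses the cross-modal semantics of vision-language models to strengthen discriminability in deepfake video detection, substantially outperforming existing methods on multiple benchmarks.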

Abstract

Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength – the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance model’s discriminability in deepfake detection. This work i) enhances the visual perception of VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pretrained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue – Identity-Aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver. Notably, the VLA score is augmented by an identity prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels. Code is available at https://github.com/mala-lab/VLAForge.

Keywords: Deepfake Video Detection, Vision-Language Models, Cross-modal Semantics, VLAForge, ForgePerceiver, Identity-Aware VLA Score, Vision-Language Alignment, Video DFD Benchmarks

168. ❌ Teacher-Student Diffusion Model for Text-Driven 3D Hand Motion Generation

Authors: Ching-Lam Cheng, Bin Zhu, Shengfeng He Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24407v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

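Scoring rationale: The paper studies text-driven 3D hand motion generation using a teacher-student diffusion framework. All scoring keywords focus on large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on) or on specific scientific AI applications (such as bioinformatics). The paper covers computer vision, 3D motion generation, and diffusion models, but not LLM techniques, LLM applications, or the specified scientific AI fields, so every keyword has a relevance of 0.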

!!! tip deepseek-chat TL;DR

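The paper proposes TSHaMo, a model-agnostic teacher-student diffusion framework that generates high-quality 3D hand motion from natural-language text without requiring 3D object meshes at test time, and validates its effectiveness and robustness on the GRAB and H2O datasets.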

Abstract

Generating realistic 3D hand motion from natural language is vital for VR, robotics, and human-computer interaction. Existing methods either focus on full-body motion, overlooking detailed hand gestures, or require explicit 3D object meshes, limiting generality. We propose TSHaMo, a model-agnostic teacher-student diffusion framework for text-driven hand motion generation. The student model learns to synthesize motions from text alone, while the teacher leverages auxiliary signals (e.g., MANO parameters) to provide structured guidance during training. A co-training strategy enables the student to benefit from the teacher’s intermediate predictions while remaining text-only at inference. Evaluated using two diffusion backbones on GRAB and H2O, TSHaMo consistently improves motion quality and diversity. Ablations confirm its robustness and flexibility in using diverse auxiliary inputs without requiring 3D objects at test time.

Keywords: 3D hand motion generation, text-driven, diffusion model, teacher-student framework, model-agnostic, MANO parameters, co-training strategy, motion quality and diversity

169. ❌ The Gait Signature of Frailty: Transfer Learning based Deep Gait Models for Scalable Frailty Assessment

Authors: Laura McDaniel, Basudha Pal, Crystal Szczesny, Yuxiang Guo, Ryan Roemmich, Peter Abadir, Rama Chellappa Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24434v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

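Scoring rationale: The paper studies gait-based frailty assessment, using deep learning models (convolutional and attention-based architectures) for a computer vision task. Its content is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on). It relates only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because the study applies AI to biomedicine and geriatric medicine; however, the paper does not mention specific bioinformatics or cheminformatics methods, so it receives 5 points (some relevance).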

!!! tip deepseek-chat TL;DR

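The paper studies how to adapt pretrained gait recognition models to frailty classification through transfer learning under limited data. It finds that selectively freezing low-level features and combining complementary learning objectives improves stability and generalization, providing a scalable, non-invasive framework for clinical frailty assessment.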

Abstract

Frailty is a condition in aging medicine characterized by diminished physiological reserve and increased vulnerability to stressors. However, frailty assessment remains subjective, heterogeneous, and difficult to scale in clinical practice. Gait is a sensitive marker of biological aging, capturing multisystem decline before overt disability. Yet the application of modern computer vision to gait-based frailty assessment has been limited by small, imbalanced datasets and a lack of clinically representative benchmarks. In this work, we introduce a publicly available silhouette-based frailty gait dataset collected in a clinically realistic setting, spanning the full frailty spectrum and including older adults who use walking aids. Using this dataset, we evaluate how pretrained gait recognition models can be adapted for frailty classification under limited data conditions. We study both convolutional and hybrid attention-based architectures and show that predictive performance depends primarily on how pretrained representations are transferred rather than architectural complexity alone. Across models, selectively freezing low-level gait representations while allowing higher-level features to adapt yields more stable and generalizable performance than either full fine-tuning or rigid freezing. Conservative handling of class imbalance further improves training stability, and combining complementary learning objectives enhances discrimination between clinically adjacent frailty states. Interpretability analyses reveal consistent model attention to lower-limb and pelvic regions, aligning with established biomechanical correlates of frailty. Together, these findings establish gait-based representation learning as a scalable, non-invasive, and interpretable framework for frailty assessment and support the integration of modern biometric modeling approaches into aging research and clinical practice.

Keywords: frailty assessment, gait analysis, transfer learning, deep learning, computer vision, clinical application, aging medicine, interpretability

170. ❌ ViHOI: Human-Object Interaction Synthesis with Visual Priors

Authors: Songjin Cai, Linjie Zhong, Ling Guo, Changxing Ding Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24383v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

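Scoring rationale: ViHOI focuses on 3D human-object interaction synthesis, using a vision-language model (VLM) to extract priors and a diffusion model for generation. Although it involves a large model (the VLM), the research centers on computer vision and motion generation rather than on innovation in large-model principles or their application to science. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', on the grounds that the work applies AI to a scientific field (computer vision/graphics); since it is not core bioinformatics or cheminformatics, it receives 5 points (some relevance). All other keywords are unrelated to the paper and receive 0 points.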

!!! tip deepseek-chat TL;DR

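The paper proposes the ViHOI framework, which enhances diffusion models with visual priors extracted from 2D images to address the difficulty of capturing physical constraints in 3D human-object interaction synthesis, and achieves state-of-the-art performance on multiple benchmarks.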

Abstract

Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM’s high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization.

Keywords: Human-Object Interaction, 3D motion generation, visual priors, diffusion models, Vision-Language Model, Q-Former adapter, generalization, state-of-the-art

171. ❌ GeoRouter: Dynamic Paradigm Routing for Worldwide Image Geolocalization

Authors: Pengyue Jia, Derong Xu, Yingyi Zhang, Xiaopeng Li, Wenlin Zhang, Yi Wen, Yuanshao Zhu, Xiangyu Zhao Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24376v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

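Scoring rationale: The paper studies image geolocalization and proposes GeoRouter, a dynamic routing framework that combines the retrieval and generation paradigms. Keyword relevance: 1) the paper uses Large Vision-Language Models (LVLMs) as the backbone, which has some relation to 'Large Language Models' (5 points); 2) the paper involves the retrieval paradigm, which is conceptually related to 'Retrieval-Augmented Generation' (8 points). Other keywords, such as MoE, SLMs, Scaling Laws, training techniques, inference optimization, and AI for Science, are not involved or mentioned in the paper and receive 0 points.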

!!! tip deepseek-chat TL;DR

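For worldwide image geolocalization, the paper proposes GeoRouter, a dynamic routing framework that adaptively selects the retrieval or generation paradigm and significantly improves localization accuracy.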

Abstract

Worldwide image geolocalization aims to predict precise GPS coordinates for images captured anywhere on Earth, which is challenging due to the large visual and geographic diversity. Recent methods mainly follow two paradigms: retrieval-based approaches that match queries against a reference database, and generation-based approaches that directly predict coordinates using Large Vision-Language Models (LVLMs). However, we observe distinct error profiles between them: retrieval excels at fine-grained instance matching, while generation offers robust semantic reasoning. This complementary heterogeneity suggests that no single paradigm is universally superior. To harness this potential, we propose GeoRouter, a dynamic routing framework that adaptively assigns each query to the optimal paradigm. GeoRouter leverages an LVLM backbone to analyze visual content and provide routing decisions. To optimize GeoRouter, we introduce a distance-aware preference objective that converts the distance gap between paradigms into a continuous supervision signal, explicitly reflecting relative performance differences. Furthermore, we construct GeoRouting, the first large-scale dataset tailored for training routing policies with independent paradigm predictions. Extensive experiments on IM2GPS3k and YFCC4k demonstrate that GeoRouter significantly outperforms state-of-the-art baselines.

Keywords: Worldwide image geolocalization, Dynamic routing framework, Retrieval-based approaches, Generation-based approaches, Large Vision-Language Models, Distance-aware preference objective, GeoRouting dataset, State-of-the-art performance

172. ❌ Causal Transfer in Medical Image Analysis

Authors: Mohammed M. Abdelsamea, Daniel Tweneboah Anyimadu, Tasneem Selim, Saif Alzubi, Lei Zhang, Ahmed Karam Eldaly, Xujiong Ye Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24388v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

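Scoring rationale: 'Causal Transfer in Medical Image Analysis' is a survey of medical image analysis that studies the combination of causal reasoning and cross-domain representation learning (Causal Transfer Learning, CTL) to address the failure of medical imaging models under domain shift at deployment. The paper is unrelated to most keywords, which concern LLM technical details, training methods, inference optimization, agent systems, and so on, whereas this paper focuses on traditional deep learning for medical image analysis and does not involve LLMs. Only two keywords are relevant: 1. 'Pre-training OR Continual Pre-training OR Domain Adaptation': the paper explicitly discusses domain adaptation as one way to handle domain shift and compares causal transfer learning with correlation-based domain adaptation, so it is highly relevant (10 points). 2. 'AI for Science OR Bioinformatics OR Cheminformatics': the paper is an application of AI in science, specifically medical image analysis (related to bioinformatics), so it is highly relevant (10 points). The other keywords, such as LLMs, MoE, SFT, RAG, and CoT, are neither mentioned in nor relevant to the paper.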
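As a rough check of the score table above, the pass threshold can be recomputed from the formula in the report header (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The sketch below also assumes that each row's score is weight × relevance; that matches the zero-weight rows shown here but is not stated in the report, and the function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Pass threshold from the report header: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows):
    """Sum row scores for (weight, relevance) pairs.

    Assumption: row score = weight * relevance (relevance is on a 0-10 scale).
    """
    return sum(weight * relevance for weight, relevance in rows)


# Rows as they appear in the table: weight 0.0 even where relevance is 10.0/10.
rows = [(0.0, 10.0), (0.0, 0.0), (0.0, 10.0)]
threshold = round(passing_score(27.0), 1)
paper_total = total_score(rows)
```

Under this assumption, a weight of 0.0 in every row forces the total to 0.0 regardless of relevance, which is consistent with the 0.0 / 26.6 score reported for this paper.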

!!! tip deepseek-chat TL;DR

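This paper systematically surveys Causal Transfer Learning (CTL) for medical image analysis, which integrates causal reasoning with cross-domain representation learning to address model failure under domain shift, and shows how CTL can go beyond correlation-based domain adaptation to improve the robustness and generalization of clinical AI.

Abstract (Chinese translation)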

医学影像模型在跨医院、扫描设备、人群或成像协议部署时，常因域偏移而失效，限制了其临床可靠性。尽管迁移学习和域适应方法从统计学角度应对此类偏移，但它们往往依赖于在条件变化时失效的虚假相关性。另一方面，因果推断为识别跨环境保持稳定的不变机制提供了原则性方法。本综述系统性地介绍了面向医学影像分析的因果迁移学习范式。该范式将因果推理与跨域表征学习相结合，旨在实现鲁棒且可泛化的临床人工智能。我们将域偏移构建为因果问题，分析了如何将结构因果模型、不变风险最小化及反事实推理嵌入迁移学习流程中。研究覆盖分类、分割、重建、异常检测及多模态成像等任务，并按任务类型、偏移类别与因果假设进行梳理。本文提出一个统一分类体系，以连接因果框架与迁移机制。我们进一步总结了相关数据集、基准测试及实证效果，阐明因果迁移方法在何时及为何优于基于相关性的域适应方法。最后，我们探讨了因果迁移学习如何支持多机构与联邦学习场景下的公平性、鲁棒性及可信部署，并展望了实现临床可靠医学影像人工智能所面临的开放挑战与研究方向。

Abstract

Medical imaging models frequently fail when deployed across hospitals, scanners, populations, or imaging protocols due to domain shift, limiting their clinical reliability. While transfer learning and domain adaptation address such shifts statistically, they often rely on spurious correlations that break under changing conditions. On the other hand, causal inference provides a principled way to identify invariant mechanisms that remain stable across environments. This survey introduces and systematises Causal Transfer Learning (CTL) for medical image analysis. This paradigm integrates causal reasoning with cross-domain representation learning to enable robust and generalisable clinical AI. We frame domain shift as a causal problem and analyse how structural causal models, invariant risk minimisation, and counterfactual reasoning can be embedded within transfer learning pipelines. We studied spanning classification, segmentation, reconstruction, anomaly detection, and multimodal imaging, and organised them by task, shift type, and causal assumption. A unified taxonomy is proposed that connects causal frameworks and transfer mechanisms. We further summarise datasets, benchmarks, and empirical gains, highlighting when and why causal transfer outperforms correlation-based domain adaptation. Finally, we discuss how CTL supports fairness, robustness, and trustworthy deployment in multi-institutional and federated settings, and outline open challenges and research directions for clinically reliable medical imaging AI.
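The abstract names invariant risk minimisation as one of the causal tools that can be embedded in transfer pipelines. As a minimal sketch of the commonly used IRMv1 penalty, assuming a scalar predictor, squared loss, and one environment per hospital or scanner, the objective adds to each environment's risk the squared gradient of that risk with respect to a dummy scale fixed at 1. The function names and the toy setup are illustrative assumptions, not taken from the survey.

```python
def irm_penalty(preds, targets):
    """IRMv1-style penalty for one environment under squared loss.

    The risk as a function of a dummy scale w is mean((w * pred - y)^2).
    Its derivative at w = 1 is mean(2 * (pred - y) * pred); the penalty
    is that derivative squared.
    """
    n = len(preds)
    grad = sum(2.0 * (p - y) * p for p, y in zip(preds, targets)) / n
    return grad ** 2


def irm_objective(envs, lam=1.0):
    """Sum of per-environment risk plus lam times the invariance penalty.

    envs: list of (preds, targets) pairs, one per environment.
    """
    total = 0.0
    for preds, targets in envs:
        risk = sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(preds)
        total += risk + lam * irm_penalty(preds, targets)
    return total
```

The penalty is zero only when rescaling the predictor cannot lower the risk in that environment, which is how the objective discourages features whose relationship to the label changes between sites.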

Keywords: Causal Transfer Learning, Medical Image Analysis, Domain Shift, Causal Inference, Domain Adaptation, Robust AI, Generalizable Models, Clinical AI

173. ❌ Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions

Authors: Shiqin Wang, Haoyang Chen, Huaizhou Huang, Yinkan He, Dongfang Sun, Xiaoqing Chen, Xingyu Liu, Zheng Wang, Kaiyan Zhao Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24322v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

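Scoring rationale: The paper focuses on unsupervised domain adaptation for semantic segmentation, particularly under adverse weather conditions. Its core contribution is a heuristic self-paced learning framework that casts curriculum learning as a sequential decision problem and uses ideas from reinforcement learning to design an autonomous class scheduler. The work is highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points), because domain adaptation is its core technique. However, the paper does not involve large language models (LLMs), innovations in the technical principles of deep learning, or AI for Science, so all other keywords score 0.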

!!! tip deepseek-chat TL;DR

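The paper proposes a reinforcement-learning-inspired heuristic self-paced learning method for unsupervised domain adaptation of semantic segmentation under adverse weather conditions. An autonomous class scheduler enables dynamic learning, and the method achieves state-of-the-art performance on several benchmarks.

Abstract (Chinese translation)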

语义类别的学习顺序对语义分割的无监督域适应具有显著影响，尤其在恶劣天气条件下。现有课程大多依赖人工设计的启发式规则（如固定的不确定性度量）并遵循静态调度策略，这无法适应模型不断演变的高维训练动态，从而导致类别偏差。受强化学习启发，我们将课程学习建模为序列决策问题，并提出一种自主类别调度器。该调度器包含两个组件：（i）一个高维状态编码器，将模型的训练状态映射到潜在空间，并提取反映学习进度的关键特征；（ii）一个确保各类别均衡提升的类别公平策略梯度目标。结合源域与目标域的混合监督，学习得到的类别排序能引导网络在每一阶段聚焦于信息量最丰富的类别，从而实现更具适应性和动态性的学习。值得注意的是，我们的方法在三个广泛使用的基准数据集（如ACDC、Dark Zurich和Nighttime Driving）上取得了最先进的性能，并在合成到真实的语义分割任务中展现出泛化能力。

Abstract

The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model’s evolving, high-dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an autonomous class scheduler. This scheduler consists of two components: (i) a high-dimensional state encoder that maps the model’s training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Coupled with mixed source-target supervision, the learned class rankings direct the network’s focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. It is worth noting that our method achieves state-of-the-art performance on three widely used benchmarks (e.g., ACDC, Dark Zurich, and Nighttime Driving) and shows generalization ability in synthetic-to-real semantic segmentation.
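The abstract casts class scheduling as a sequential decision problem trained with a policy-gradient objective, but it does not specify the category-fair term or the state encoder. The sketch below therefore shows only a plain REINFORCE update for a softmax policy over classes, with the reward standing in for some measure of progress on the chosen class. All names, the reward, and the learning rate are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, reward, baseline=0.0, lr=0.1):
    """One REINFORCE step for a softmax policy over classes.

    The gradient of log pi(action) with respect to the logits is
    onehot(action) - pi, scaled here by the advantage (reward - baseline).
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        z + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]
```

A positive advantage raises the probability of scheduling the chosen class again, and a negative one lowers it; a fairness-aware objective would additionally shape the reward or gradient so that no class is starved.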

Keywords: Semantic Segmentation, Unsupervised Domain Adaptation, Adverse Weather Conditions, Curriculum Learning, Reinforcement Learning, Class Scheduler, State Encoder, Policy-gradient Objective

Authors: Ciem Cornelissen, Sam Leroux, Pieter Simoens Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24327v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

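Scoring rationale: The paper focuses on multimodal self-supervised representation learning, specifically the fusion of visual modalities (RGB images, LiDAR depth, thermal imaging). It does not involve large language models (LLMs), innovations in the technical principles of deep learning, or applications in science. All keywords center on techniques specific to large models, such as LLMs, alignment, reasoning, agents, and compression, and are unrelated to the paper's computer-vision and multimodal-learning topic.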

!!! tip deepseek-chat TL;DR

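The paper proposes Le MuMo JEPA, a self-supervised framework that learns unified multimodal representations through learnable fusion tokens in a shared Transformer, and reports the best performance-efficiency trade-off on the Waymo, nuScenes, and FLIR benchmarks.

Abstract (Chinese translation)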

自监督学习已成为无需人工标注即可学习视觉表征的强大范式，然而大多数方法仍局限于单一模态，因而无法利用异构传感器提供的互补结构。我们提出Le MuMo JEPA，一种能够从RGB图像及对齐的伴随模态中学习统一表征的自监督框架。在驾驶实验中，第二模态为相机对齐的LiDAR深度数据；我们还在Teledyne FLIR ADAS基准上评估了RGB-热成像训练及迁移性能。该方法通过在学习共享Transformer内模态特定图像块主干之间充当潜在瓶颈的融合令牌，将LeJEPA扩展至多模态场景。我们的默认模型采用剪枝融合策略：在初始跨模态注意力层后，模态特定令牌被丢弃，迫使跨模态信息在应用草图化各向同性高斯正则化（Sketched Isotropic Gaussian Regularization, SIGReg）于联合多模态CLS嵌入之前，以共享融合令牌网格作为高效的潜在瓶颈。在Waymo数据集上，Le MuMo JEPA在从头训练的多模态基线中实现了下游图像块探测任务的最佳性能-效率平衡，在提升CenterNet检测与稠密深度估计性能的同时，在分割任务上保持竞争力。在nuScenes数据集上从头训练时，Le MuMo JEPA仍是最强模型，并在FLIR基准上取得最优结果，尤其在经过Waymo初始化微调后表现突出。该模型还在显著降低计算量、内存占用及预估训练时间的前提下，在我们的研究中保持了最佳的整体精度-效率平衡。

Abstract

Self-supervised learning has emerged as a powerful paradigm for learning visual representations without manual annotations, yet most methods still operate on a single modality and therefore miss the complementary structure available from heterogeneous sensors. We present Le MuMo JEPA, a self-supervised framework that learns unified representations from RGB images and aligned companion modalities. In our driving experiments, the second modality is camera-aligned LiDAR depth; we also evaluate RGB-thermal training and transfer on the Teledyne FLIR ADAS benchmark. Our approach extends LeJEPA to the multi-modal setting by learning fusion tokens that act as a latent bottleneck between modality-specific patch stems inside a shared transformer. Our default model employs a pruned fusion strategy: after an initial cross-modal attention layer, modality-specific tokens are dropped, forcing cross-modal information into the shared fusion-token grid as an efficient latent bottleneck before Sketched Isotropic Gaussian Regularization (SIGReg) is applied to the joint multimodal CLS embedding. On Waymo, Le MuMo JEPA gives the strongest performance-efficiency trade-off on downstream patch probes among the from-scratch multimodal baselines, improving CenterNet detection and dense depth while remaining competitive on segmentation. Under from-scratch training on nuScenes, Le MuMo JEPA remains the strongest model, and it also gives the best FLIR results, especially after Waymo-initialized fine-tuning. It also retains the best overall accuracy-efficiency balance in our study at substantially lower compute, memory, and estimated training time.

Keywords: self-supervised learning, multi-modal representation learning, fusion tokens, transformer, RGB-LiDAR fusion, Waymo, nuScenes, FLIR ADAS

175. ❌ PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

Authors: Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu, Manhui Lin, Yue Zhang, Tingquan Gao, Changda Zhou, Jiaxuan Liu, Zelun Zhang, Jing Zhang, Jun Zhang, Yi Liu Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24373v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	2.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	7.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	3.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	3.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	7.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

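Scoring rationale: The paper focuses on OCR and proposes PP-OCRv5, a lightweight 5M-parameter model that challenges large vision-language models (VLMs). Its core contribution is a study of data quality rather than large-model techniques as such. Relevant keywords: (1) 'Small Language Models' (8 points): the paper emphasizes lightweight, efficient models, in line with the SLM philosophy; (2) 'Scaling Laws AND Data Quality' (7 points): the paper challenges the view that model scale determines performance and systematically studies how data difficulty, accuracy, and diversity affect performance; (3) 'Quantization OR Model Compression' (8 points): the 5M-parameter model reflects model compression and efficient design; (4) 'Hallucination Mitigation' (7 points): the paper explicitly reduces textual hallucinations; (5) 'Large Language Models' (2 points): the paper compares against VLMs, but this is not its core; (6) 'Pre-training/Post-training' (3 points): training is involved but is not the focus. The other keywords (such as MoE, RLHF, and RAG) are unrelated to OCR.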

!!! tip deepseek-chat TL;DR

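The paper challenges the common view that model scale determines OCR performance. Through a study of data quality, it develops PP-OCRv5, a lightweight OCR model with only 5M parameters that performs on par with billion-parameter vision-language models on standard benchmarks while offering more precise text localization and fewer hallucinations.

Abstract (Chinese translation)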

“OCR 2.0”与大规模视觉-语言模型（Vision-Language Models, VLMs）的出现为文本识别领域设立了新的性能基准。然而，这些统一架构通常伴随着显著的计算需求、在复杂版面中精确定位文本的挑战，以及易于产生文本幻觉的倾向。本文重新审视了“模型规模是达成高精度的唯一路径”这一主流观点，提出了PP-OCRv5——一个经过精心优化、仅含500万参数的轻量级OCR系统。我们证明，在标准OCR基准测试中，PP-OCRv5的性能可与许多数十亿参数的VLMs相竞争，同时提供更优的定位精度和更少的文本幻觉。我们成功的关键不在于架构扩张，而在于一项以数据为中心的研究。我们通过量化三个关键维度——数据难度、数据准确性与数据多样性——系统地剖析了训练数据的作用。大量实验表明，当拥有足量高质量、标注准确且多样化的数据时，传统高效的两阶段OCR流程的性能上限远高于通常的预期。这项工作为在大模型时代轻量级专用模型的可行性提供了有力证据，并为OCR数据构建提供了实用见解。源代码与模型已公开于https://github.com/PaddlePaddle/PaddleOCR。

Abstract

The advent of “OCR 2.0” and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.

Keywords: OCR, lightweight model, vision-language models, data quality, model efficiency, text recognition, hallucination mitigation, parameter-efficient

176. ❌ Refining time-space traffic diagrams: A neighborhood-adaptive linear regression method

Authors: Zhihong Yao, Yi Yu, Yunxia Wu, Hao Li, Yangsheng Jiang, Zhengbing He Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24312v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

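Scoring rationale: The paper is in traffic engineering. It proposes a refinement method for time-space traffic diagrams based on neighborhood-adaptive linear regression, which raises the resolution of low-sampling-rate traffic data. The content revolves entirely around traffic flow analysis, data interpolation, and image processing, and does not involve any technique, method, or application related to large models, deep learning, AI for Science, or the other scoring keywords. All keywords are unrelated to the paper's topic, so every relevance score is 0.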

!!! tip deepseek-chat TL;DR

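The paper proposes a refinement method for time-space traffic diagrams based on neighborhood-adaptive linear regression. It exploits local pattern similarity to fit a low-to-high resolution mapping, improving the accuracy of traffic data and outperforming benchmark methods on multiple metrics.

Abstract (Chinese translation)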

时空交通流图是表征交通流动态演化的关键工具，其分辨率直接影响交通理论研究与工程应用的效果。然而受监测精度与采样频率限制，现有时空交通流图普遍存在分辨率不足的问题。为此，本文提出一种基于邻域自适应线性回归的时空交通流图精细化方法。该方法将邻域嵌入思想引入时空图精细化中，利用时空图的局部模式相似性，自适应识别与目标单元相似的邻域，并在邻域内拟合低分辨率至高分辨率的映射关系以实现精细化。该方法避免了传统全局线性模型的过度平滑倾向，能够捕捉独特的交通波传播与拥堵演化特征，且在局部信息利用上优于传统邻域嵌入方法，从而实现目标单元的精细化。在两种真实数据集上、多尺度与多上采样因子下的验证结果表明，相较于基准方法，所提方法在平均绝对误差（MAE）、平均绝对百分比误差（MAPE）、余弦相似性（CMJS）、结构相似性指数（SSIM）和梯度幅度相似性偏差（GMSD）等指标上分别提升了9.16%、8.16%、1.86%、3.89%和5.83%。此外，所提方法在跨日与跨场景验证中表现出良好的泛化性与鲁棒性。综上所述，所提方法仅需少量成对的高低分辨率训练数据，模型形式简洁，为低采样率交通数据的低成本、细粒度精细化提供了基础。

Abstract

The time-space (TS) traffic diagram serves as a crucial tool for characterizing the dynamic evolution of traffic flow, with its resolution directly influencing the effectiveness of traffic theory research and engineering applications. However, constrained by monitoring precision and sampling frequency, existing TS traffic diagrams commonly suffer from low resolution. To address this issue, this paper proposes a refinement method for TS traffic diagrams based on neighborhood-adaptive linear regression. Introducing the concept of neighborhood embedding into TS diagram refinement, the method leverages local pattern similarity in TS diagrams, adaptively identifies neighborhoods similar to target cells, and fits the low-to-high resolution mapping within these neighborhoods for refinement. It avoids the over-smoothing tendency of the traditional global linear model, allows the capture of unique traffic wave propagation and congestion evolution characteristics, and outperforms the traditional neighborhood embedding method in terms of local information utilization to achieve target cell refinement. Validation on two real datasets across multiple scales and upscaling factors shows that, compared to benchmark methods, the proposed method achieves improvements of 9.16%, 8.16%, 1.86%, 3.89%, and 5.83% in metrics including MAE, MAPE, CMJS, SSIM, and GMSD, respectively. Furthermore, the proposed method exhibits strong generalization and robustness in cross-day and cross-scenario validations. In summary, requiring only a minimal amount of paired high- and low-resolution training data, the proposed method features a concise formulation, providing a foundation for the low-cost, fine-grained refinement of low-sampling-rate traffic data.
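The abstract describes the method as finding neighborhoods similar to a target cell and fitting a low-to-high resolution mapping within them. As a minimal sketch of that idea, assuming similarity is Euclidean distance between low-resolution patches and the local mapping is a one-variable least-squares line, a single cell could be refined as follows. The data layout, the choice of `k`, and the function names are illustrative assumptions, not the paper's formulation.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        # Degenerate neighborhood: fall back to the mean high-resolution value.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def refine_cell(train, target_patch, target_low, k=3):
    """Predict a high-resolution value for one target cell.

    train: list of (patch, low_value, high_value) samples, where patch is the
    low-resolution context used for similarity search (hypothetical layout).
    """
    # Neighborhood selection: the k training samples whose patches are closest.
    nearest = sorted(train, key=lambda s: math.dist(s[0], target_patch))[:k]
    # Local regression fitted only on that neighborhood.
    slope, intercept = fit_line([s[1] for s in nearest], [s[2] for s in nearest])
    return slope * target_low + intercept
```

Because the line is refitted per neighborhood, cells in free-flow and congested regimes can follow different mappings, which is the property the abstract contrasts with a single global linear model.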

Keywords: time-space traffic diagram, neighborhood-adaptive linear regression, traffic flow refinement, low-resolution data, local pattern similarity, traffic wave propagation, congestion evolution, generalization and robustness

177. ❌ RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation

Authors: Kai Zhu, Zhenyu Cui, Zehua Zang, Jiahuan Zhou Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24295v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

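Scoring rationale: The paper focuses on Video Semantic Segmentation (VSS) and proposes RS-SSM, a method that improves the State Space Model (SSM) to address the forgetting of specific information during state space compression. Its core is video understanding in computer vision, involving techniques such as state space models, channel-wise perception, and forgetting gates. Although the paper applies deep learning to video analysis, all the given keywords relate directly to large language models (LLMs) and associated techniques (training methods, inference optimization, alignment, agents, and so on), and the paper does not involve LLMs or natural language processing at all. It studies state space models for video segmentation and has no direct connection to the LLM techniques or scientific AI applications in the keyword list.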

!!! tip deepseek-chat TL;DR

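The paper addresses the problem that state space models forget specific information in video semantic segmentation, which limits pixel-level segmentation. It proposes RS-SSM, which uses channel-wise perception and forgetting-gate refinement to recover forgotten spatiotemporal details, achieving state-of-the-art performance on multiple benchmarks.

Abstract (Chinese translation)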

近年来，状态空间模型通过线性复杂度的状态空间压缩实现了高效的视频分割。然而，视频语义分割任务需要像素级的时空建模能力，以保持语义对象分割的时间一致性。虽然状态空间模型在状态空间压缩过程中能够保留常见的语义信息，但固定大小的状态空间不可避免地会遗忘特定信息，这限制了模型进行像素级分割的能力。为解决上述问题，我们提出了一种用于视频语义分割的细化特定信息状态空间模型方法，该方法对遗忘的时空特定信息进行互补性细化。具体而言，我们设计了一个通道幅度感知器，用于提取并对齐状态空间中特定信息的分布特征。此外，我们提出了遗忘门信息细化器，基于特定信息分布自适应地反转并细化状态空间模型中的遗忘门矩阵。因此，我们的RS-SSM利用反转后的遗忘门对状态空间压缩过程中遗忘的特定信息进行互补性细化，从而增强模型在时空像素级分割方面的能力。在四个视频语义分割基准数据集上的大量实验表明，我们的RS-SSM在保持高计算效率的同时，取得了最先进的性能。代码公开于https://github.com/zhoujiahuan1991/CVPR2026-RS-SSM。

Abstract

Recently, state space models have demonstrated efficient video segmentation through linear-complexity state space compression. However, Video Semantic Segmentation (VSS) requires pixel-level spatiotemporal modeling capabilities to maintain temporal consistency in segmentation of semantic objects. While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits the models’ capability for pixel-level segmentation. To tackle the above issue, we proposed a Refining Specifics State Space Model approach (RS-SSM) for video semantic segmentation, which performs complementary refining of forgotten spatiotemporal specifics. Specifically, a Channel-wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. Besides, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting gate matrix in the state space model based on the specific information distribution. Consequently, our RS-SSM leverages the inverted forgetting gate to complementarily refine the specific information forgotten during state space compression, thereby enhancing the model’s capability for spatiotemporal pixel-level segmentation. Extensive experiments on four VSS benchmarks demonstrate that our RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency. The code is available at https://github.com/zhoujiahuan1991/CVPR2026-RS-SSM.

Keywords: Video Semantic Segmentation, State Space Model, Spatiotemporal Modeling, Forgetting Gate, Channel-wise Amplitude Perceptron, Pixel-level Segmentation, Computational Efficiency, Benchmark Performance

178. ❌ VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection

Authors: Jumin Lee, Siyeong Lee, Namil Kim, Sung-Eui Yoon Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24294v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

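Scoring rationale: The paper studies the long-tail distribution problem in 3D object detection and proposes VERIA, a multimodal instance augmentation framework built on off-the-shelf foundation models. Although the paper mentions 'off-the-shelf foundation models', this refers to general-purpose foundation models used for image synthesis (such as diffusion models), not large language models (LLMs). The paper's core is computer vision and autonomous-driving perception, which is unrelated to the topics in the keyword list, such as the technical principles of large models, training methods, inference optimization, alignment techniques, and agent systems. Apart from that mention of foundation models, none of the keywords appears in the paper's title or abstract, and no related concepts are described.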

!!! tip deepseek-chat TL;DR

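The paper addresses long-tailed 3D object detection in autonomous-driving datasets and proposes the VERIA framework, which synthesizes multimodal instances with off-the-shelf foundation models and verifies them, improving detection performance on rare classes.

Abstract (Chinese translation)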

驾驶数据集中的长尾分布对三维感知构成了根本性挑战，因为稀有类别虽表现出显著的类内多样性，但可用样本仅稀疏地覆盖其变化空间。现有基于复制粘贴或资源库的实例增强方法虽能提升稀有类别的曝光度，但其细粒度多样性和场景上下文布局常受局限。我们提出VERIA——一种图像优先的多模态增强框架，该框架利用现成的基础模型合成同步的RGB-LiDAR实例，并通过序列化语义与几何验证对其进行筛选。这种以验证为核心的设计倾向于选择更符合真实激光雷达统计特性、同时覆盖更广类内变化范围的实例。分阶段产出分解机制提供了基于日志的管道可靠性诊断。在nuScenes和Lyft数据集上，VERIA在纯激光雷达与多模态设置下均提升了稀有类别的三维目标检测性能。代码发布于https://sgvr.kaist.ac.kr/VERIA/。

Abstract

Long-tail distributions in driving datasets pose a fundamental challenge for 3D perception, as rare classes exhibit substantial intra-class diversity yet available samples cover this variation space only sparsely. Existing instance augmentation methods based on copy-paste or asset libraries improve rare-class exposure but are often limited in fine-grained diversity and scene-context placement. We propose VERIA, an image-first multimodal augmentation framework that synthesizes synchronized RGB–LiDAR instances using off-the-shelf foundation models and curates them with sequential semantic and geometric verification. This verification-centric design tends to select instances that better match real LiDAR statistics while spanning a wider range of intra-class variation. Stage-wise yield decomposition provides a log-based diagnostic of pipeline reliability. On nuScenes and Lyft, VERIA improves rare-class 3D object detection in both LiDAR-only and multimodal settings. Our code is available at https://sgvr.kaist.ac.kr/VERIA/.
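The abstract mentions sequential semantic and geometric verification together with a stage-wise yield decomposition that serves as a log-based diagnostic. As a minimal sketch, assuming each stage is a pass/fail predicate applied in order and yield is the surviving fraction at that stage, the per-stage log-yields add up to the log of the overall yield. The candidate fields (`semantic_score`, `num_points`), the thresholds, and the function names are hypothetical.

```python
import math


def stage_yields(candidates, stages):
    """Apply verification stages in order and record each stage's yield.

    stages: list of (name, predicate) pairs.
    Returns (survivors, report), where report holds
    (name, number in, number out, yield) for every stage.
    """
    report = []
    current = list(candidates)
    for name, keep in stages:
        survivors = [c for c in current if keep(c)]
        stage_yield = len(survivors) / len(current) if current else 0.0
        report.append((name, len(current), len(survivors), stage_yield))
        current = survivors
    return current, report


def log_yield_decomposition(report):
    """Per-stage log-yields; they sum to the log of the overall yield
    whenever every stage yield is positive."""
    return {name: math.log(y) for name, _, _, y in report if y > 0}


# Hypothetical verification stages with illustrative thresholds.
stages = [
    ("semantic", lambda c: c["semantic_score"] >= 0.6),
    ("geometric", lambda c: c["num_points"] >= 20),
]
```

Reading the log-yields side by side shows which stage discards the most candidates, which is one way such a decomposition can help diagnose where a synthesis pipeline is unreliable.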

Keywords: 3D object detection, long-tail distribution, multimodal augmentation, instance augmentation, LiDAR, autonomous driving, rare-class detection, verification-centric design

179. ❌ ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors

Authors: Haodong Yu, Yabo Zhang, Donglin Di, Ruyi Zhang, Wangmeng Zuo Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24270v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

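Scoring rationale: The paper focuses on diffusion models for image generation, specifically the structural failures that arise when generating images at extreme aspect ratios, which it addresses by introducing video diffusion priors and a continuous video generation framework. All scoring keywords concern large language models (LLMs), covering technical principles, training methods, inference optimization, alignment, and applications, whereas this paper studies diffusion models applied to computer vision, a different technical approach and research area. All keyword relevance scores are therefore 0.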

!!! tip deepseek-chat TL;DR

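The paper proposes the ScrollScape framework, which recasts extreme-aspect-ratio image generation as a continuous video generation process and uses the temporal consistency of video diffusion models as a global constraint. This addresses the structural failures of conventional diffusion models at ultra-high resolution, enabling 32K-resolution image generation with significantly improved global coherence and visual fidelity.

Abstract translation (Chinese)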

尽管扩散模型在生成常规尺寸图像方面表现出色，但当推动其合成极端长宽比下的超高分辨率图像时，常会引发灾难性的结构故障，例如物体重复和空间破碎。这一局限从根本上源于缺乏稳健的空间先验，因为静态文生图模型主要基于常规尺寸的图像分布进行训练。为突破这一瓶颈，我们提出了ScrollScape——一个通过两项核心创新将极端长宽比图像合成重新定义为连续视频生成过程的新框架。通过将巨幅画布的空间扩展映射为视频帧的时间演进，ScrollScape利用视频模型固有的时序一致性作为强大的全局约束，从而确保长距离结构完整性。具体而言，扫描位置编码将全局坐标分布至各帧，充当灵活的移动视角；而滚动超分辨率则借助视频超分辨率先验规避内存瓶颈，高效地将输出缩放至前所未有的32K分辨率。通过在精心构建的3K多比例图像数据集上微调，ScrollScape有效将预训练视频先验与极端长宽比生成任务对齐。大量评估表明，该方法通过消除严重的局部伪影，显著优于现有图像扩散基线。因此，我们的方法克服了固有的结构瓶颈，在极端尺度下确保了跨多领域的卓越全局连贯性与视觉保真度。

摘要 (Abstract)

While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.

Keywords: diffusion models, image generation, extreme aspect ratios, video diffusion priors, 32K resolution, structural failures, temporal consistency, Scanning Positional Encoding

180. ❌ TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification

Authors: Guan Luo, Xiu Li, Rui Chen, Xuanyu Yi, Jing Lin, Chia-Hao Chen, Jiahang Liu, Song-Hai Zhang, Jianfeng Zhang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24278v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

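Scoring rationale: TopoMesh focuses on 3D geometry processing and generation, proposing a sparse voxel-based variational autoencoder (VAE) for high-fidelity mesh reconstruction. Its core contribution is resolving the mesh representation mismatch through topological unification (a Dual Marching Cubes framework), trained with teacher forcing and progressive resolution training. The scoring keywords all relate directly to large language models (LLMs), the deep learning techniques around them (MoE, scaling laws, training and alignment methods, inference optimization, agents, and so on), or specific AI for Science applications (such as bioinformatics). The paper's subject matter (3D computer vision, geometric deep learning, mesh processing) lies outside those areas: it involves no large models or language models and is not applied to biology, chemistry, or other scientific domains. All keyword relevance scores are therefore 0.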

!!! tip deepseek-chat TL;DR

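The paper addresses the loss of reconstruction detail in 3D generation caused by the topological mismatch between ground-truth meshes and network predictions. It proposes TopoMesh, a sparse voxel VAE under a unified Dual Marching Cubes topological framework that enables explicit mesh-level supervision, significantly improving reconstruction fidelity and better preserving sharp features.

Abstract translation (Chinese)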

当前高保真三维生成的主流范式依赖于VAE-扩散模型流程，其中VAE的重建能力为生成质量设定了严格的上限。限制现有VAE的一个根本性挑战在于真实网格与网络预测之间的表示失配：真实网格具有任意且可变的拓扑结构，而VAE通常预测固定结构的隐式场（例如规则网格上的符号距离场SDF）。这种固有的不对齐性阻碍了显式网格级对应关系的建立，迫使先前的研究依赖间接监督信号（如SDF损失或渲染损失）。因此，重建过程中难以保留精细的几何细节，尤其是尖锐特征。为解决这一问题，我们提出了TopoMesh——一种基于稀疏体素的VAE，它在共享的对偶行进立方体（Dual Marching Cubes, DMC）拓扑框架下统一了真实网格与预测网格。具体而言，我们通过一种重网格算法将任意输入网格转换为符合DMC规范的表示，该算法利用L$\infty$距离度量来保留尖锐边缘。我们的解码器以相同的DMC格式输出网格，确保预测网格和目标网格具有完全一致的拓扑结构。这建立了顶点和面层级的显式对应关系，使我们能够为拓扑结构、顶点位置和面朝向推导出具有清晰梯度的显式网格级监督信号。我们的稀疏VAE架构采用这一统一框架，并通过教师强制训练和渐进式分辨率训练实现稳定高效的收敛。大量实验表明，TopoMesh在重建保真度上显著优于现有VAE，能够更优地保留尖锐特征和几何细节。

摘要 (Abstract)

The dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE’s reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the representation mismatch between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (e.g., SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topological framework. Specifically, we convert arbitrary input meshes into DMC-compliant representations via a remeshing algorithm that preserves sharp edges using an $L_\infty$ distance metric. Our decoder outputs meshes in the same DMC format, ensuring that both predicted and target meshes share identical topological structures. This establishes explicit correspondences at the vertex and face level, allowing us to derive explicit mesh-level supervision signals for topology, vertex positions, and face orientations with clear gradients. Our sparse VAE architecture employs this unified framework and is trained with Teacher Forcing and progressive resolution training for stable and efficient convergence. Extensive experiments demonstrate that TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, achieving superior preservation of sharp features and geometric details.

Keywords: 3D reconstruction, mesh autoencoding, variational autoencoder, topological unification, Dual Marching Cubes, sharp feature preservation, sparse voxel representation, high-fidelity generation
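
The TopoMesh abstract mentions an L∞ distance metric for preserving sharp edges during remeshing. As a rough illustration of the metric only (not the paper's remeshing algorithm, which the abstract does not detail), the sketch below compares the L∞ (Chebyshev) distance with the Euclidean distance. The function names and sample points are hypothetical.

```python
def l2_distance(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def linf_distance(a, b):
    """L-infinity (Chebyshev) distance: the largest per-axis difference."""
    return max(abs(x - y) for x, y in zip(a, b))

origin = (0.0, 0.0, 0.0)
face_center = (1.0, 0.0, 0.0)   # center of a face of the cube [-1, 1]^3
corner = (1.0, 1.0, 1.0)        # corner of the same cube

# Under L-infinity both points are equally far from the origin, so the
# level set "distance == 1" is a cube with sharp edges and corners.
print(linf_distance(origin, face_center), linf_distance(origin, corner))
# Under L2 the corner is farther away, so the level set is a sphere.
print(l2_distance(origin, face_center), l2_distance(origin, corner))
```

The point of the comparison is that L∞ level sets are axis-aligned boxes, which is one intuition for why such a metric can keep corners crisp where a Euclidean metric would round them.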

181. ❌ Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning

Authors: Tommaso Galliena, Stefano Rosa, Tommaso Apicella, Pietro Morerio, Alessio Del Bue, Lorenzo Natale Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24257v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

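Scoring rationale: The paper proposes a memory-augmented vision-language agent whose core aim is to resolve the inconsistency of vision-language models when describing objects across viewpoints. Relevance to the keywords: (1) Highly relevant (10 points): "LLM Agents OR Autonomous Agents OR Agentic Workflow". The paper's core is a unified, memory-augmented vision-language agent that handles data association, object captioning, and exploration policy, which falls under autonomous-agent research. (2) Moderately relevant (8 points): "Large Language Models OR LLMs OR Foundation Models". The paper builds on vision-language models (VLMs), an application of foundation models to a specific task. (3) Weakly relevant (5 points): "Pre-training OR Continual Pre-training OR Domain Adaptation" and "Post-training OR Supervised Fine-tuning OR SFT". The paper trains the model in a self-supervised manner and collects a dataset, which touches on model training and adaptation but is not its core contribution. (4) Unrelated (0 points): the remaining keywords have no direct connection to the paper, which does not cover MoE, small models, scaling laws, alignment, RAG, inference acceleration, AI for Science, or similar techniques.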
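
The table for this paper lists nonzero relevance for four keywords yet a total of 0.0. A minimal sketch of why, assuming each row's score is simply weight × relevance (the report does not state the formula in this section, so this is an assumption), with the pass mark taken from the score line:

```python
def total_score(rows):
    """Sum per-keyword scores, ASSUMING score = weight * relevance."""
    return sum(weight * relevance for weight, relevance in rows)

# (weight, relevance) for the four rows with nonzero relevance in this table
rows = [(0.0, 8.0), (0.0, 5.0), (0.0, 5.0), (0.0, 10.0)]
PASS_MARK = 26.6  # from "0.0 / 26.6" in the score line

score = total_score(rows)
passed = score >= PASS_MARK
print(score, passed)
```

Under that assumption, a weight of 0.0 in every row forces the total to 0.0 regardless of how relevant the individual keywords are judged to be.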

!!! tip deepseek-chat TL;DR

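The paper addresses the inconsistency of vision-language models when describing objects across viewpoints. It proposes a unified memory-augmented vision-language agent that, trained in a self-supervised manner, maintains persistent object identity and semantic consistency, with improvements of up to 11.86% in standard captioning scores and 7.39% in caption self-similarity over baseline models.

Abstract translation (Chinese)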

视觉语言模型（Vision-Language Models, VLMs）在描述同一物体时，常因视角不同而产生不一致的描述，这阻碍了具身智能体随时间构建一致语义表征的能力。先前的方法通过离线多视图聚合或多阶段流程来解决不一致性问题，这些流程将探索、数据关联和描述学习解耦，但其对先前观察到的物体进行推理的能力有限。本文提出了一种统一的、记忆增强的视觉语言智能体，它在一个单一的自回归框架内同时处理数据关联、物体描述和探索策略。该模型处理当前的RGB观测、自上而下的探索地图以及序列化为物体级令牌的物体级情景记忆，从而确保在长序列中保持持久的物体身份和语义一致性。为了以自监督方式训练模型，我们使用一种基于分歧的策略和一个在多视图描述历史中强制一致性的伪描述模型，在逼真的3D环境中收集了一个数据集。在人工标注的物体级测试集上进行广泛评估，结果表明，与基线模型相比，该模型在标准描述评分上提升了高达+11.86%，在描述自相似性上提升了+7.39%，同时通过紧凑的场景表征实现了可扩展的性能。代码、模型权重和数据可在 https://github.com/hsp-iit/epos-vlm 获取。

摘要 (Abstract)

Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolved inconsistencies using offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, with limited capacity to reason over previously observed objects. In this paper, we introduce a unified, memory-augmented Vision-Language agent that simultaneously handles data association, object captioning, and exploration policy within a single autoregressive framework. The model processes the current RGB observation, a top-down explored map, and an object-level episodic memory serialized into object-level tokens, ensuring persistent object identity and semantic consistency across extended sequences. To train the model in a self-supervised manner, we collect a dataset in photorealistic 3D environments using a disagreement-based policy and a pseudo-captioning model that enforces consistency across multi-view caption histories. Extensive evaluation on a manually annotated object-level test set demonstrates improvements of up to +11.86% in standard captioning scores and +7.39% in caption self-similarity over baseline models, while enabling scalable performance through a compact scene representation. Code, model weights, and data are available at https://github.com/hsp-iit/epos-vlm.

Keywords: Vision-Language Models, Memory-Augmented Agents, Object Captioning, Semantic Consistency, Autoregressive Framework, Self-Supervised Training, Embodied Agents, Episodic Memory

182. ❌ InstanceRSR: Real-World Super-Resolution via Instance-Aware Representation Alignment

Authors: Zixin Guo, Kai Zhao, Luyan Zhang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24240v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

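Scoring rationale: InstanceRSR targets real-world super-resolution in computer vision, using diffusion-based generative priors and instance-aware representation alignment to improve detail recovery. Although it is an AI application, all keywords target large language models (LLMs) and related techniques (MoE, RLHF, RAG, agents, and so on), and the paper does not involve language models, text processing, or the LLM technology stack. All keyword relevance scores are therefore 0.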

!!! tip deepseek-chat TL;DR

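The paper proposes the InstanceRSR framework, which jointly models semantic information and instance-level feature alignment to address the difficulty existing real-world super-resolution methods have in recovering details of diverse object instances in complex scenes. It reports new state-of-the-art performance on multiple benchmarks.

Abstract translation (Chinese)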

现有基于生成先验的真实世界超分辨率方法在生成高质量且全局一致的复原结果方面取得了显著进展。然而，这些方法往往难以恢复复杂真实场景中多样化物体实例的细粒度细节。这一局限主要源于常用的去噪损失函数（如均方误差）本质上倾向于保持全局一致性，却忽视了实例级别的感知与复原。为解决此问题，我们提出了InstanceRSR，一种新颖的真实世界超分辨率框架，该框架联合建模语义信息并引入实例级别的特征对齐。具体而言，我们采用低分辨率图像作为全局一致性引导，同时联合建模图像数据与语义分割图，以在采样过程中强化语义相关性。此外，我们设计了一个实例表示学习模块，将扩散潜空间与实例潜空间对齐，从而实现实例感知的特征对齐，并进一步结合尺度对齐机制以增强细粒度感知与细节恢复能力。得益于这些设计，我们的方法不仅能生成逼真的细节，还能在实例级别保持语义一致性。在多个真实世界基准数据集上的大量实验表明，InstanceRSR在定量指标和视觉质量上均显著优于现有方法，达到了新的最优性能水平。

摘要 (Abstract)

Existing real-world super-resolution (RSR) methods based on generative priors have achieved remarkable progress in producing high-quality and globally consistent reconstructions. However, they often struggle to recover fine-grained details of diverse object instances in complex real-world scenes. This limitation primarily arises because commonly adopted denoising losses (e.g., MSE) inherently favor global consistency while neglecting instance-level perception and restoration. To address this issue, we propose InstanceRSR, a novel RSR framework that jointly models semantic information and introduces instance-level feature alignment. Specifically, we employ low-resolution (LR) images as global consistency guidance while jointly modeling image data and semantic segmentation maps to enforce semantic relevance during sampling. Moreover, we design an instance representation learning module to align the diffusion latent space with the instance latent space, enabling instance-aware feature alignment, and further incorporate a scale alignment mechanism to enhance fine-grained perception and detail recovery. Benefiting from these designs, our approach not only generates photorealistic details but also preserves semantic consistency at the instance level. Extensive experiments on multiple real-world benchmarks demonstrate that InstanceRSR significantly outperforms existing methods in both quantitative metrics and visual quality, achieving new state-of-the-art (SOTA) performance.

Keywords: real-world super-resolution, instance-aware representation alignment, diffusion models, semantic segmentation, feature alignment, fine-grained detail recovery, generative priors, photorealistic reconstruction

183. ❌ B-MoE: A Body-Part-Aware Mixture-of-Experts “All Parts Matter” Approach to Micro-Action Recognition

Authors: Nishit Poddar, Aglind Reka, Diana-Laura Borza, Snehashis Majhi, Michal Balazia, Abhijit Das, Francois Bremond Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24245v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

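Scoring rationale: B-MoE targets micro-action recognition in computer vision and proposes a body-part-aware Mixture of Experts (MoE) framework. Its core contribution is an MoE structure in which different experts specialize in different body regions (head, body, upper limbs, lower limbs), with a cross-attention routing mechanism that dynamically selects the most informative regions. It is therefore highly relevant (10 points) to "Mixture of Experts OR MoE OR Sparse Models", since MoE is the paper's core method. The paper is treated as an application of AI to science (specifically computer vision and action analysis), so it has some relevance (5 points) to "AI for Science OR Bioinformatics OR Cheminformatics", although it leans toward computer vision rather than bioinformatics or cheminformatics. The other keywords mainly concern the technical principles, training methods, inference optimization, and agent systems of large language models (LLMs), whereas this paper studies vision-based action recognition and does not involve language models or related techniques, so those keywords score 0.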

!!! tip deepseek-chat TL;DR

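The paper addresses the difficulty of recognizing micro-actions, which stems from their subtlety, short duration, and inter-class ambiguity. It proposes B-MoE, a body-part-aware Mixture-of-Experts framework, which achieves state-of-the-art gains on three benchmarks and improves recognition of ambiguous, underrepresented, and low-amplitude classes.

Abstract translation (Chinese)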

微动作（Micro-actions）——如瞥视、点头或细微姿态调整等短暂且低幅度的运动——承载着丰富的社会意义，但由于其动作细微、持续时间短以及类间高度模糊性，当前的动作识别模型仍难以准确识别。本文提出B-MoE，一种基于身体部位感知的混合专家框架，旨在显式建模人体运动的结构化特性。在B-MoE中，每个专家专注于特定身体区域（头部、躯干、上肢、下肢），并基于轻量级的宏观-微观运动编码器（Macro-Micro Motion Encoder, M3E）构建，该编码器能够捕捉长程上下文结构与细粒度局部运动。通过交叉注意力路由机制学习区域间关联，并动态选择对每个微动作最具信息量的区域。B-MoE采用双流编码器，将区域特定的语义线索与全局运动特征融合，共同捕捉微动作特有的空间局部线索与时间上的细微变化。在三个具有挑战性的基准数据集（MA-52、SocialGesture和MPII-GroupInteraction）上的实验表明，该方法取得了持续性的最优性能提升，尤其在模糊类别、样本不足类别及低幅度动作类别上表现显著改善。

摘要 (Abstract)

Micro-actions, fleeting and low-amplitude motions, such as glances, nods, or minor posture shifts, carry rich social meaning but remain difficult for current action recognition models to recognize due to their subtlety, short duration, and high inter-class ambiguity. In this paper, we introduce B-MoE, a Body-part-aware Mixture-of-Experts framework designed to explicitly model the structured nature of human motion. In B-MoE, each expert specializes in a distinct body region (head, body, upper limbs, lower limbs), and is based on the lightweight Macro-Micro Motion Encoder (M3E) that captures long-range contextual structure and fine-grained local motion. A cross-attention routing mechanism learns inter-region relationships and dynamically selects the most informative regions for each micro-action. B-MoE uses a dual-stream encoder that fuses these region-specific semantic cues with global motion features to jointly capture spatially localized cues and temporally subtle variations that characterize micro-actions. Experiments on three challenging benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction) show consistent state-of-the-art gains, with improvements in ambiguous, underrepresented, and low-amplitude classes.

Keywords: Micro-action recognition, Mixture-of-Experts, Body-part-aware, Cross-attention routing, Dual-stream encoder, Macro-Micro Motion Encoder, Human motion analysis, Action recognition benchmarks
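
The B-MoE abstract describes a routing mechanism that scores body-region experts and selects the most informative ones. The sketch below shows one generic way such attention-style gating can work: a query is scored against one key per region with a scaled dot product, and the top-k regions are kept. It is not the paper's architecture; the feature vectors, the `top_k` parameter, and all function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route(query, region_keys, top_k=2):
    """Score each region key against the query and keep the top_k regions.

    Returns a dict of region name -> gate weight, renormalized to sum to 1.
    """
    scale = math.sqrt(len(query))
    names = list(region_keys)
    weights = softmax([dot(query, region_keys[n]) / scale for n in names])
    ranked = sorted(zip(names, weights), key=lambda nw: nw[1], reverse=True)
    kept = ranked[:top_k]
    norm = sum(w for _, w in kept)
    return {n: w / norm for n, w in kept}

# Hypothetical 2-D keys for the four body regions named in the abstract
region_keys = {
    "head": [1.0, 0.0],
    "body": [0.0, 1.0],
    "upper_limbs": [0.5, 0.5],
    "lower_limbs": [-1.0, 0.0],
}
query = [2.0, 0.0]  # hypothetical clip-level motion feature
gates = route(query, region_keys, top_k=2)
print(gates)  # the head region receives the largest gate for this query
```

A fixed `top_k` is the simplest choice for a sketch; the abstract only says regions are selected dynamically, so the actual selection rule may differ.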

184. ❌ RVLM: Recursive Vision-Language Models with Adaptive Depth

Authors: Nicanor Mayumu, Zeenath Khan, Melodena Stephens, Patrick Mukala, Farhad Oroumchian Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24224v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

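Scoring rationale: The paper proposes the RVLM framework. Its core is an iterative generate-execute loop (highly relevant to Chain of Thought and System 2 Thinking); it uses Python code to invoke vision sub-agents (highly relevant to Tool Use and LLM Agents); it satisfies auditability requirements for clinical AI (highly relevant to Explainable AI); and it is applied to medical image analysis (highly relevant to AI for Science). The paper does not cover technical details such as MoE, quantization, or RLHF, and is unrelated to most of the purely technical keywords.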

!!! tip deepseek-chat TL;DR

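The paper addresses two problems in medical AI systems: the lack of interpretability and the inefficiency of fixed iteration budgets. It proposes the RVLM framework, a recursive vision-language model with adaptive depth that performs auditable iterative reasoning, and evaluates its consistency and cross-modal analysis capability on brain MRI and chest X-ray datasets.

Abstract translation (Chinese)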

医学人工智能系统面临两个根本性局限。首先，传统的视觉-语言模型（Vision-Language Models, VLMs）执行单次推理，产生黑盒预测，无法以临床术语进行审计或解释。其次，暴露中间步骤的迭代推理系统依赖于固定的迭代预算，在简单病例上浪费算力，同时对复杂病例又缺乏足够的推理深度。我们通过一个统一框架解决了这两个局限。RVLM 以迭代的生成-执行循环取代单次推理：在每一步，模型编写 Python 代码、调用视觉子代理、操作图像并积累证据。每一个诊断主张都基于可执行代码，满足了临床人工智能治理框架的可审计性要求。RRouter 使迭代深度自适应：一个轻量级控制器根据任务复杂性特征预测最优预算，随后监控进展并在推理停滞时提前终止。我们在 BraTS 2023 脑膜瘤（脑部 MRI）和 MIMIC-CXR（胸部 X 光）数据集上使用未经微调的 Gemini 2.5 Flash 进行评估。在多次重复运行中，RVLM 在关键发现（如肿块存在和强化）上表现出高度一致性，并能检测液体衰减反转恢复（Fluid-Attenuated Inversion Recovery, FLAIR）信号特征与分割边界之间的跨模态差异。在 MIMIC-CXR 上，它能生成结构化报告并正确识别视图特异性伪影。代码：https://github.com/nican2018/rvlm。

摘要 (Abstract)

Medical AI systems face two fundamental limitations. First, conventional vision-language models (VLMs) perform single-pass inference, yielding black-box predictions that cannot be audited or explained in clinical terms. Second, iterative reasoning systems that expose intermediate steps rely on fixed iteration budgets, wasting compute on simple cases while providing insufficient depth for complex ones. We address both limitations with a unified framework. RVLM replaces single-pass inference with an iterative generate-execute loop: at each step, the model writes Python code, invokes vision sub-agents, manipulates images, and accumulates evidence. Every diagnostic claim is grounded in executable code, satisfying auditability requirements of clinical AI governance frameworks. RRouter makes iteration depth adaptive: a lightweight controller predicts the optimal budget from task-complexity features, then monitors progress and terminates early when reasoning stalls. We evaluate on BraTS 2023 Meningioma (brain MRI) and MIMIC-CXR (chest X-ray) using Gemini 2.5 Flash without fine-tuning. Across repeated runs, RVLM shows high consistency on salient findings (e.g., mass presence and enhancement) and can detect cross-modal discrepancies between Fluid-Attenuated Inversion Recovery (FLAIR) signal characteristics and segmentation boundaries. On MIMIC-CXR, it generates structured reports and correctly recognises view-specific artefacts. Code: https://github.com/nican2018/rvlm.

Keywords: Vision-Language Models, Iterative Reasoning, Clinical AI, Explainable AI, Medical Imaging, Adaptive Depth, Python Code Execution, Multi-agent Systems
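
The RVLM abstract describes an iterative loop whose depth is predicted up front and cut short when reasoning stalls. The sketch below illustrates that control flow only, under loud assumptions: the linear budget rule, the `patience` parameter, and the stand-in `toy_step` function are all hypothetical, and in the paper each step would involve the model writing and executing Python rather than reading from a fixed list.

```python
def predict_budget(complexity, min_steps=2, max_steps=8):
    """Map a complexity score in [0, 1] to an iteration budget (hypothetical rule)."""
    complexity = min(max(complexity, 0.0), 1.0)
    return min_steps + round(complexity * (max_steps - min_steps))

def run_loop(step_fn, complexity, patience=2):
    """Run step_fn up to the predicted budget; stop early after
    `patience` consecutive steps that add no new evidence."""
    budget = predict_budget(complexity)
    evidence, stalled, steps = [], 0, 0
    for i in range(budget):
        steps += 1
        new_items = [e for e in step_fn(i, evidence) if e not in evidence]
        if new_items:
            evidence.extend(new_items)
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return evidence, steps, budget

def toy_step(i, evidence):
    """Stand-in for one generate-execute step; returns findings for step i."""
    findings = [["mass present"], ["enhancement"], [], [], ["never reached"]]
    return findings[i] if i < len(findings) else []

evidence, steps, budget = run_loop(toy_step, complexity=1.0)
print(evidence, steps, budget)
```

In this toy run the budget allows eight steps, but the loop stops after the fourth because two consecutive steps produced nothing new.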

185. ❌ Attack Assessment and Augmented Identity Recognition for Human Skeleton Data

Authors: Joseph G. Zalameda, Megan A. Witherow, Alexander M. Glandon, Jose Aguilera, Khan M. Iftekharuddin Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24232v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

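Scoring rationale: The paper studies adversarial attack assessment and defensive augmentation (the Attack-AAIRS framework) for a person identification model (HCN-ID) based on LiDAR skeleton data, using a GAN to generate adversarial samples for model inoculation. Its topic is adversarial machine learning in computer vision and security, focused on a specific sensor modality (LiDAR skeletons) and a specific model architecture (HCN-ID). The scoring keywords all relate directly to large language models (LLMs) and associated techniques (MoE, scaling laws, alignment, RAG, reasoning, agents, quantization, and so on) or to AI for Science (bioinformatics, cheminformatics). The paper involves no LLM techniques, principles, or applications, nor AI applications in scientific domains such as bioinformatics or cheminformatics, so all keyword relevance scores are 0.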

!!! tip deepseek-chat TL;DR

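The paper addresses the vulnerability of a small-sample person identification model (HCN-ID) based on LiDAR skeleton data to adversarial attacks. It proposes the Attack-AAIRS framework, which uses GAN-generated adversarial samples to assess and improve the model's robustness to unseen attacks without reducing test performance on real data.

Abstract translation (Chinese)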

在安全应用场景中，基于小数据集训练的机器学习模型尤其容易受到对抗性攻击的威胁。基于激光雷达（LiDAR）骨架数据进行人员身份识别时，需要为每个身份主体耗费大量时间与高昂成本进行数据采集。近期，骨架评估与增强身份识别框架（AAIRS, Assessment and Augmented Identity Recognition for Skeletons）已被用于训练基于小规模LiDAR骨架数据集的层次共现网络人员识别模型（HCN-ID, Hierarchical Co-occurrence Networks for Person Identification）。然而，AAIRS既未评估HCN-ID模型对抗对抗性攻击的鲁棒性，也未对模型进行免疫训练以防御此类攻击。当前主流的基于扰动的对抗攻击生成方法受限于仅能在真实训练样本上添加定向扰动，这对于利用小训练集进行模型免疫并非理想方案。为此，我们提出Attack-AAIRS，作为AAIRS框架的创新扩展。该方法利用小规模真实数据集与生成对抗网络（GAN）生成的合成数据集，评估并提升模型对未知对抗攻击的鲁棒性。该方法不局限于对有限真实训练样本的扰动，而是通过GAN学习针对HCN-ID模型弱点的对抗攻击样本分布。从该分布中采样的攻击样本将用于增强训练，从而对HCN-ID进行免疫以提升其鲁棒性。十折交叉验证表明，Attack-AAIRS对包括快速梯度符号法（FGSM）、投影梯度下降法（PGD）、加性高斯噪声、动量迭代快速梯度符号法（MI-FGSM）和基本迭代法（BIM）在内的未知攻击均表现出更强的鲁棒性。Attack-AAIRS的HCN-ID合成数据质量评分显示，生成的攻击样本与AAIRS原始生成的良性合成样本质量相近。此外，经过免疫训练的模型在真实数据上的最终测试精度与原始模型保持一致，这表明我们的方法在提升对抗攻击鲁棒性的同时，未降低模型在真实数据上的测试性能。

摘要 (Abstract)

Machine learning models trained on small data sets for security applications are especially vulnerable to adversarial attacks. Person identification from LiDAR-based skeleton data requires time-consuming and expensive data acquisition for each subject identity. Recently, Assessment and Augmented Identity Recognition for Skeletons (AAIRS) has been used to train Hierarchical Co-occurrence Networks for Person Identification (HCN-ID) with small LiDAR-based skeleton data sets. However, AAIRS does not evaluate robustness of HCN-ID to adversarial attacks or inoculate the model to defend against such attacks. Popular perturbation-based approaches to generating adversarial attacks are constrained to targeted perturbations added to real training samples, which is not ideal for inoculating models with small training sets. Thus, we propose Attack-AAIRS, a novel addition to the AAIRS framework. Attack-AAIRS leverages a small real data set and a GAN-generated synthetic data set to assess and improve model robustness against unseen adversarial attacks. Rather than being constrained to perturbations of limited real training samples, the GAN learns the distribution of adversarial attack samples that exploit weaknesses in HCN-ID. Attack samples drawn from this distribution augment training for inoculation of the HCN-ID to improve robustness. Ten-fold cross-validation of Attack-AAIRS yields increased robustness to unseen attacks, including FGSM, PGD, Additive Gaussian Noise, MI-FGSM, and BIM. The HCN-ID Synthetic Data Quality Score for Attack-AAIRS indicates that generated attack samples are of similar quality to the original benign synthetic samples generated by AAIRS. Furthermore, inoculated models show consistent final test accuracy with the original model trained on real data, demonstrating that our method improves robustness to adversarial attacks without reducing test performance on real data.

Keywords: adversarial attacks, skeleton data, LiDAR, person identification, GAN, model robustness, small data sets, HCN-ID

186. ❌ HEART-PFL: Stable Personalized Federated Learning under Heterogeneity with Hierarchical Directional Alignment and Adversarial Knowledge Transfer

Authors: Minjun Kim, Minje Kim Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24209v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

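Scoring rationale: The paper focuses on personalized federated learning (PFL) and proposes HEART-PFL, a dual-sided framework that uses Hierarchical Directional Alignment (HDA) and Adversarial Knowledge Transfer (AKT) to address model personalization under heterogeneous data. It covers federated learning, model personalization, heterogeneous data, adversarial training, and knowledge distillation, but does not address large language models (LLMs), innovations in the principles of deep learning techniques, or applications of AI in science. All scoring keywords concern large models, deep learning techniques, or AI-for-science applications, whereas this paper studies a federated learning framework, which belongs to distributed machine learning and has no direct connection to the keywords.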

!!! tip deepseek-chat TL;DR

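The paper proposes HEART-PFL, a personalized federated learning framework that uses hierarchical directional alignment and adversarial knowledge transfer to address client model personalization under heterogeneous data, and reports state-of-the-art personalized accuracy on several datasets.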

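Translated abstract

Personalized Federated Learning (PFL) aims to provide effective client-specific models under heterogeneous data distributions, but existing methods suffer from shallow prototype alignment and brittle server-side distillation. We propose HEART-PFL, a dual-sided framework that (1) applies depth-aware Hierarchical Directional Alignment (HDA), using cosine-similarity alignment in the early stage and mean-squared-error (MSE) matching in the deep stage, to preserve client specificity; and (2) introduces Adversarial Knowledge Transfer (AKT), which stabilizes global model updates through symmetric KL-divergence distillation on clean and adversarial proxy data. Using lightweight adapters with only 1.46M trainable parameters, the framework achieves state-of-the-art personalized accuracy of 63.42%, 84.23%, and 95.67% on CIFAR-100, Flowers-102, and Caltech-101, respectively, under Dirichlet non-IID partitions, and remains robust to out-of-domain proxy data. Ablation studies further confirm that HDA and AKT bring complementary gains in alignment, robustness, and optimization stability, showing how the two components reinforce each other for effective personalization. Overall, the results indicate that HEART-PFL improves personalization and global stability at the same time, highlighting its potential as a strong and scalable PFL solution (code: https://github.com/danny0628/HEART-PFL).

Abstract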

Personalized Federated Learning (PFL) aims to deliver effective client-specific models under heterogeneous distributions, yet existing methods suffer from shallow prototype alignment and brittle server-side distillation. We propose HEART-PFL, a dual-sided framework that (i) performs depth-aware Hierarchical Directional Alignment (HDA) using cosine similarity in the early stage and MSE matching in the deep stage to preserve client specificity, and (ii) stabilizes global updates through Adversarial Knowledge Transfer (AKT) with symmetric KL distillation on clean and adversarial proxy data. Using lightweight adapters with only 1.46M trainable parameters, HEART-PFL achieves state-of-the-art personalized accuracy on CIFAR-100, Flowers-102, and Caltech-101 (63.42%, 84.23%, and 95.67%, respectively) under Dirichlet non-IID partitions, and remains robust to out-of-domain proxy data. Ablation studies further confirm that HDA and AKT provide complementary gains in alignment, robustness, and optimization stability, offering insights into how the two components mutually reinforce effective personalization. Overall, these results demonstrate that HEART-PFL simultaneously enhances personalization and global stability, highlighting its potential as a strong and scalable solution for PFL (code available at https://github.com/danny0628/HEART-PFL).
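The three loss ingredients named in the abstract (cosine alignment, MSE matching, symmetric KL distillation) can be sketched in plain Python. This is only an illustration of the generic formulas; the abstract does not give HEART-PFL's exact loss definitions or weightings, so the function names and the 0.5 averaging factor are assumptions.

```python
import math

def cosine_alignment_loss(u, v):
    """1 - cos(u, v): penalizes direction mismatch, ignores magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def mse_loss(u, v):
    """Mean squared error: penalizes both direction and magnitude mismatch."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(logits_a, logits_b):
    """Average of KL(p || q) and KL(q || p); the 0.5 factor is an assumption."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))
```

Unlike plain KL distillation, the symmetric form gives the same value regardless of which model is treated as the teacher.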

Keywords: Personalized Federated Learning, Heterogeneous Distributions, Hierarchical Directional Alignment, Adversarial Knowledge Transfer, Non-IID Data, Model Personalization, Lightweight Adapters, Optimization Stability

187. ❌ RefReward-SR: LR-Conditioned Reward Modeling for Preference-Aligned Super-Resolution

Authors: Yushuai Song, Weize Quan, Weining Wang, Jiahui Sun, Jing Liu, Meng Li, Pengbin Yu, Zhentao Chen, Wei Shen, Lunxi Yuan, Dong-ming Yan Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24198v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

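Scoring rationale: The paper proposes RefReward-SR, a reward model built on a multimodal large language model (MLLM) for aligning super-resolution with human preferences. Its core involves applying large language models (LLMs) to a vision task and using Group Relative Policy Optimization (GRPO) for preference alignment and reward modeling, which is closely related to alignment techniques such as RLHF and DPO. The paper does not touch the other keywords, such as MoE, quantization, inference acceleration, or AI-for-science applications.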

!!! tip deepseek-chat TL;DR

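To address the mismatch between existing evaluation methods and human perceptual preference in super-resolution, the paper proposes RefReward-SR, an LR-conditioned reward model based on a multimodal large language model; with GRPO optimization it achieves better alignment with human preferences and produces super-resolved images that are semantically consistent and perceptually natural.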

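Translated abstract

Generative super-resolution has made notable progress recently and has greatly improved visual realism, yet existing evaluation and optimization frameworks remain misaligned with human perception. Full-reference and no-reference metrics often fail to reflect perceptual preference: they either penalize semantically plausible details because of pixel misalignment or favor artifacts that look sharp but are inconsistent in content. Moreover, most super-resolution methods rely on ground-truth-dependent distribution matching, which does not necessarily correspond to human judgment. This work proposes RefReward-SR, a reward model aware of the low-resolution (LR) reference, for preference-aligned super-resolution. Instead of relying on ground-truth supervision or no-reference evaluation, it treats the LR image as a semantic anchor and assesses high-resolution (HR) reconstructions conditioned on their LR inputs. By drawing on the visual-linguistic priors of a multimodal large language model (MLLM), the model evaluates semantic consistency and plausibility in a reasoning-aware manner. To support this paradigm, we construct RefSR-18K, the first large-scale LR-conditioned preference dataset for super-resolution, which provides pairwise rankings based on LR-HR consistency and HR naturalness. We fine-tune the MLLM with Group Relative Policy Optimization (GRPO) using LR-conditioned ranking rewards, and integrate this optimization framework into super-resolution model training, with RefReward-SR as the core reward signal for preference-aligned generation. Extensive experiments show that the framework substantially improves alignment with human judgment and produces reconstructions that preserve semantic consistency while improving perceptual plausibility and visual naturalness. Code, models, and datasets will be released upon paper acceptance.

Abstract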

Recent advances in generative super-resolution (SR) have greatly improved visual realism, yet existing evaluation and optimization frameworks remain misaligned with human perception. Full-Reference and No-Reference metrics often fail to reflect perceptual preference, either penalizing semantically plausible details due to pixel misalignment or favoring visually sharp but inconsistent artifacts. Moreover, most SR methods rely on ground-truth (GT)-dependent distribution matching, which does not necessarily correspond to human judgments. In this work, we propose RefReward-SR, a low-resolution (LR) reference-aware reward model for preference-aligned SR. Instead of relying on GT supervision or NR evaluation, RefReward-SR assesses high-resolution (HR) reconstructions conditioned on their LR inputs, treating the LR image as a semantic anchor. Leveraging the visual-linguistic priors of a Multimodal Large Language Model (MLLM), it evaluates semantic consistency and plausibility in a reasoning-aware manner. To support this paradigm, we construct RefSR-18K, the first large-scale LR-conditioned preference dataset for SR, providing pairwise rankings based on LR-HR consistency and HR naturalness. We fine-tune the MLLM with Group Relative Policy Optimization (GRPO) using LR-conditioned ranking rewards, and further integrate GRPO into SR model training with RefReward-SR as the core reward signal for preference-aligned generation. Extensive experiments show that our framework achieves substantially better alignment with human judgments, producing reconstructions that preserve semantic consistency while enhancing perceptual plausibility and visual naturalness. Code, models, and datasets will be released upon paper acceptance.
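The "group relative" part of GRPO can be sketched briefly: several outputs are sampled for the same input, each is scored by a reward model, and each reward is normalized against its own group to form an advantage. The sketch below assumes the population standard deviation and a small `eps`; implementations differ on both, and the abstract does not say how RefReward-SR scores are fed in, so the rewards here are placeholder numbers.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical reward-model scores for three candidates of one LR input.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because the baseline is the group mean, no separate value network is needed; a group whose candidates all receive the same reward yields zero advantage for every candidate.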

Keywords: super-resolution, reward modeling, preference alignment, multimodal large language models, LR-conditioned evaluation, Group Relative Policy Optimization, semantic consistency, perceptual plausibility

188. ❌ Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection

Authors: Adhemar de Senneville, Xavier Bou, Jérémy Anger, Rafael Grompone, Gabriele Facciolo Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24181v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

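Scoring rationale: The paper studies how to improve large vision-language models (LVLMs) on few-shot image classification, using prompt conditioning and attention-head selection to improve feature separability. Relevance to the keywords: (1) Highly relevant (8 points): the paper explicitly studies LVLMs (which fall under large models) and involves few-shot learning (related to In-context Learning). (2) Moderately relevant (5 points): the paper analyzes internal model representations (attention heads), which touches on interpretability, and mentions CLIP pre-training, which relates to pre-training techniques. (3) Not relevant (0 points): the remaining keywords, such as MoE, SFT, RAG, and quantization, are not covered.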

!!! tip deepseek-chat TL;DR

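The paper addresses the weak performance of large vision-language models on few-shot image classification by proposing Head Ensemble Classifiers, a method based on prompt conditioning and attention-head selection, and reports state-of-the-art few-shot and zero-shot classification performance across 12 datasets.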

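Translated abstract

Current large vision-language models (LVLMs) perform well on many zero-shot tasks such as image captioning, visual question answering, and optical character recognition. However, these models perform poorly on image classification and lag behind CLIP-based methods. This gap is surprising, because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP's architecture of separate vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. This paper shows that, despite their poor raw classification performance, LVLMs can improve the class separability of visual features at inference time through prompt conditioning, and that LVLMs' internal representations, especially attention heads, can outperform the model itself on zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to close the performance gap between CLIP-based and LVLM-based classification. Inspired by Gaussian Discriminant Analysis, HEC selects the most discriminative vision and text attention heads and combines them into a training-free classifier. Experiments show that HEC reaches state-of-the-art performance on few-shot and zero-shot classification across 12 datasets.

Abstract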

Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks like image captioning, visual question answering and OCR. However, these same models suffer from poor performance at image classification tasks, underperforming against CLIP-based methods. Notably, this gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP’s architecture with independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper we show that, despite their poor raw performance, LVLMs can improve visual feature class separability at inference using prompt conditioning, and LVLMs’ internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods. Inspired by Gaussian Discriminant Analysis, HEC ranks the most discriminative vision and text heads and combines them into a training-free classifier. We show that HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets.
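One way to picture "ranking the most discriminative heads" is to score each head's features by how well they separate the few-shot classes and keep the top scorers. The abstract does not specify HEC's scoring rule, so the sketch below substitutes a Fisher-style ratio (between-class scatter over within-class scatter) as a stand-in; the head names and features are hypothetical.

```python
def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def separability(features, labels):
    """Between-class scatter divided by within-class scatter."""
    overall = mean_vec(features)
    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)
    between = within = 0.0
    for members in by_class.values():
        centroid = mean_vec(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(f, centroid) for f in members)
    return between / (within + 1e-12)

def rank_heads(head_features, labels, top_k):
    """Return the top_k head names, most separable first."""
    scores = {h: separability(f, labels) for h, f in head_features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

labels = [0, 0, 1, 1]
head_features = {  # hypothetical per-head features of four support images
    "head_a": [[0.0], [0.1], [1.0], [1.1]],  # classes well separated
    "head_b": [[0.0], [1.0], [0.1], [1.1]],  # classes overlap
}
best = rank_heads(head_features, labels, top_k=1)
```

No gradient step is involved: the ranking only needs class statistics from the support set, which is consistent with the training-free framing in the abstract.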

Keywords: Large Vision Language Models, Few-shot Classification, Prompt Conditioning, Attention Heads, Head Ensemble Classifiers, Zero-shot Classification, Visual Feature Separability, Training-free Classifier

189. ❌ Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic

Authors: Wanying Qu, Jianxiong Gao, Wei Wang, Yanwei Fu Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24176v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

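Scoring rationale: The paper belongs to neuroimaging and proposes a framework for reconstructing high-resolution dynamic fMRI sequences from EEG, an application of AI in biomedicine and neuroscience. Its content is unrelated to the vast majority of the keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on), because those keywords refer specifically to large-model techniques in natural language processing or general AI. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the study applies AI in science (in particular bioinformatics and neuroscience); however, the paper neither uses nor innovates on large-model techniques and only applies a deep learning framework, so it receives 5 points (some relevance).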

!!! tip deepseek-chat TL;DR

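The study proposes a framework for reconstructing high-resolution dynamic fMRI sequences conditioned on EEG, addressing the high acquisition cost and irregular sampling of fMRI; experiments show superior voxel-wise reconstruction quality and temporal consistency while preserving the functional information needed for downstream visual decoding.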

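Translated abstract

Capturing dynamic spatiotemporal neural activity is essential for understanding large-scale brain mechanisms. Functional magnetic resonance imaging (fMRI) provides high-resolution cortical representations that form a solid basis for characterizing fine-grained patterns of brain activity. However, the high acquisition cost of fMRI limits large-scale use, which makes high-quality fMRI reconstruction a key task. Electroencephalography (EEG) provides millisecond-level temporal cues that complement fMRI. Building on this complementarity, we propose an EEG-conditioned framework that reconstructs dynamic fMRI at the cortical-vertex level as continuous neural sequences with high spatial fidelity and strong temporal coherence. To handle the sampling irregularities common in real fMRI acquisitions, we introduce a null-space intermediate-frame reconstruction mechanism that completes arbitrary intermediate frames in a measurement-consistent way, improving sequence continuity and practical applicability. Experiments on the CineBrain dataset show superior voxel-wise reconstruction quality and robust temporal consistency across the whole brain and functionally specific regions. The reconstructed fMRI also preserves essential functional information and supports downstream visual decoding tasks. This work offers a new route to estimating high-resolution fMRI dynamics from EEG and moves multimodal neuroimaging toward more dynamic modeling of brain activity.

Abstract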

Capturing dynamic spatiotemporal neural activity is essential for understanding large-scale brain mechanisms. Functional magnetic resonance imaging (fMRI) provides high-resolution cortical representations that form a strong basis for characterizing fine-grained brain activity patterns. The high acquisition cost of fMRI limits large-scale applications, therefore making high-quality fMRI reconstruction a crucial task. Electroencephalography (EEG) offers millisecond-level temporal cues that complement fMRI. Leveraging this complementarity, we present an EEG-conditioned framework for reconstructing dynamic fMRI as continuous neural sequences with high spatial fidelity and strong temporal coherence at the cortical-vertex level. To address sampling irregularities common in real fMRI acquisitions, we incorporate a null-space intermediate-frame reconstruction, enabling measurement-consistent completion of arbitrary intermediate frames and improving sequence continuity and practical applicability. Experiments on the CineBrain dataset demonstrate superior voxel-wise reconstruction quality and robust temporal consistency across whole-brain and functionally specific regions. The reconstructed fMRI also preserves essential functional information, supporting downstream visual decoding tasks. This work provides a new pathway for estimating high-resolution fMRI dynamics from EEG and advances multimodal neuroimaging toward more dynamic brain activity modeling.
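"Measurement-consistent completion" can be illustrated with the simplest possible case. If the measurement operator just selects a subset of frames, a range/null-space style combination keeps every measured frame exactly and takes only the unmeasured frames from the model's prediction. The abstract does not describe the paper's actual operator or formulation, so this frame-selection setting and the names below are assumptions.

```python
def null_space_complete(predicted_frames, observed):
    """Keep measured frames as-is; fill the remaining frames from the prediction.

    predicted_frames: list of frames produced by a model (one per time step).
    observed: dict mapping time index -> measured frame.
    """
    return [observed.get(t, frame) for t, frame in enumerate(predicted_frames)]

# Hypothetical 1-vertex "frames": time steps 0 and 2 were actually measured.
predicted = [[1.0], [2.0], [3.0]]
observed = {0: [10.0], 2: [30.0]}
completed = null_space_complete(predicted, observed)
```

The completed sequence agrees with every measurement by construction, so the model only influences the frames that were never acquired.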

Keywords: fMRI reconstruction, EEG-conditioned framework, spatiotemporal neural activity, multimodal neuroimaging, brain dynamics, neural sequences, cortical-vertex level, visual decoding

190. ❌ CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

Authors: Akash Ghosh, Tajamul Ashraf, Rishu Kumar Singh, Numan Saeed, Sriparna Saha, Xiuying Chen, Salman Khan Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24157v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

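Scoring rationale: The paper studies a multi-agent framework for healthcare, which is highly relevant to AI for Science (10 points), and involves LLM Agents, Multi-agent Systems, and Tool Use (10 points each). It notes that the agents need long-horizon reasoning and self-correction, which relates to some degree to Chain of Thought, System 2 Thinking, and Self-Correction (5 points each). The other keywords, such as large-model techniques, training methods, and inference optimization, are not covered (0 points).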

!!! tip deepseek-chat TL;DR

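To address the lack of automation for complex, long-horizon software workflows in healthcare, the paper proposes CarePilot, a multi-agent framework based on the actor-critic paradigm, which clearly outperforms existing multimodal baselines on a healthcare benchmark.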

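Translated abstract

Multimodal agentic pipelines are reshaping human-computer interaction by automating complex real-world tasks efficiently and accessibly. However, existing work has mostly targeted short-horizon or general-purpose settings (such as mobile or desktop interfaces), while long-horizon automation for domain-specific systems, especially in healthcare, remains largely unexplored. To this end, we introduce CareFlow, a high-quality human-annotated benchmark covering complex long-horizon software workflows in medical annotation tools, DICOM (Digital Imaging and Communications in Medicine) viewers, electronic health record (EHR) systems, and laboratory information systems. Experiments show that existing vision-language models (VLMs) perform poorly on this benchmark and struggle with long-horizon reasoning and multi-step interaction in medical settings.
To overcome this limitation, we design CarePilot, a multi-agent framework built on the actor-critic paradigm. The Actor combines tool grounding with a dual-memory mechanism (long-term and short-term experience) and predicts the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on the observed effects, and either executes the action or returns corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor gradually learns to make more robust, reasoning-aware predictions during inference. Experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38% on the benchmark and the out-of-distribution dataset, respectively.

Abstract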

Multimodal agentic pipelines are transforming human-computer interaction by enabling efficient and accessible automation of complex, real-world tasks. However, recent efforts have focused on short-horizon or general-purpose applications (e.g., mobile or desktop interfaces), leaving long-horizon automation for domain-specific systems, particularly in healthcare, largely unexplored. To address this, we introduce CareFlow, a high-quality human-annotated benchmark comprising complex, long-horizon software workflows across medical annotation tools, DICOM viewers, EHR systems, and laboratory information systems. On this benchmark, existing vision-language models (VLMs) perform poorly, struggling with long-horizon reasoning and multi-step interactions in medical contexts. To overcome this, we propose CarePilot, a multi-agent framework based on the actor-critic paradigm. The Actor integrates tool grounding with dual-memory mechanisms (long-term and short-term experience) to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes or provides corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor learns to perform more robust and reasoning-aware predictions during inference. Our experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38%, respectively, on our benchmark and out-of-distribution dataset.
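The Actor/Critic interplay described in the abstract can be sketched as a control loop: the Actor proposes an action, the Critic either approves it for execution or returns corrective feedback, and the outcome is written to memory. Everything below is a hypothetical stand-in (the `Memory` class, the callables, and the toy counter task are assumptions), not CarePilot's implementation; long-term memory is declared but not exercised in this toy run.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    long_term: list = field(default_factory=list)   # lessons kept across tasks
    short_term: list = field(default_factory=list)  # trace of the current task

def run_episode(actor, critic, execute, state, memory, max_steps=10):
    feedback = None
    for _ in range(max_steps):
        action = actor(state, memory, feedback)
        if action == "DONE":
            break
        approved, feedback = critic(state, action)
        memory.short_term.append((action, approved, feedback))
        if approved:
            state = execute(state, action)  # only approved actions run
    return state

# Toy stand-ins: the "workflow" is a counter that must reach 3.
def toy_actor(state, memory, feedback):
    if state >= 3:
        return "DONE"
    rejected = {a for a, ok, _ in memory.short_term if not ok}
    return "inc" if "dec" in rejected else "dec"

def toy_critic(state, action):
    if action == "dec" and state == 0:
        return False, "counter must not go below zero"
    return True, None

def toy_execute(state, action):
    return state + 1 if action == "inc" else state - 1

memory = Memory()
final_state = run_episode(toy_actor, toy_critic, toy_execute, 0, memory)
```

In the toy run the first proposal is rejected, the rejection is recorded, and the Actor changes its next proposal accordingly, which is the corrective-feedback pattern the abstract describes.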

Keywords: multi-agent framework, long-horizon automation, healthcare, actor-critic paradigm, tool grounding, dual-memory mechanisms, medical software workflows, vision-language models

191. ❌ A convergent Plug-and-Play Majorization-Minimization algorithm for Poisson inverse problems

Authors: Thibaut Modrzyk, Ane Etxebeste, Élie Bretin, Voichita Maxim Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24156v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

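Scoring rationale: The paper studies a variational plug-and-play algorithm for Poisson inverse problems, which belongs to computational imaging and medical image processing. It uses a pre-trained neural network as the regularizer, but its core contribution is not an innovation in large models or in the principles of deep learning; it combines classical maximum-likelihood estimation methods with gradient-based denoisers. Among all the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' has some relevance (5 points), because the paper mentions applications in nuclear medicine, an application of AI in science. The other keywords are unrelated, as they concern large models, deep learning principles, training methods, inference optimization, agents, and so on, and therefore receive 0 points.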

!!! tip deepseek-chat TL;DR

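The paper proposes a convergent plug-and-play variational algorithm for Poisson inverse problems that combines a Kullback-Leibler data-fidelity term with a regularizer based on a pre-trained neural network, achieving state-of-the-art performance in deconvolution and tomography, with particularly strong results under high noise.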

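Translated abstract

This paper presents a novel variational plug-and-play algorithm for Poisson inverse problems. The method minimizes an explicit functional composed of a Kullback-Leibler data-fidelity term and a regularization term based on a pre-trained neural network. By combining classical likelihood-maximization methods with recent advances in gradient-based denoisers, the algorithm can use pre-trained Gaussian denoisers directly while retaining convergence guarantees. It is built on the majorization-minimization framework, which ensures convergence to a stationary point. Numerical experiments show state-of-the-art performance in deconvolution and tomography under moderate noise and a clear advantage under high noise, which makes the method especially valuable for nuclear medicine applications.

Abstract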

In this paper, we present a novel variational plug-and-play algorithm for Poisson inverse problems. Our approach minimizes an explicit functional which is the sum of a Kullback-Leibler data fidelity term and a regularization term based on a pre-trained neural network. By combining classical likelihood maximization methods with recent advances in gradient-based denoisers, we allow the use of pre-trained Gaussian denoisers without sacrificing convergence guarantees. The algorithm is formulated in the majorization-minimization framework, which guarantees convergence to a stationary point. Numerical experiments confirm state-of-the-art performance in deconvolution and tomography under moderate noise, and demonstrate clear superiority in high-noise conditions, making this method particularly valuable for nuclear medicine applications.
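For intuition about the Kullback-Leibler data-fidelity term and the majorization-minimization idea, the sketch below uses the classical Richardson-Lucy (EM) update, a well-known unregularized instance in which each step does not increase the Poisson KL divergence. It is not the paper's plug-and-play algorithm: there is no learned regularizer here, and the tiny matrix `A`, data `y`, and starting point are hypothetical. All entries are assumed strictly positive.

```python
import math

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def rmatvec(A, v):
    """Transpose product A^T v."""
    return [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(len(A[0]))]

def poisson_kl(y, ax):
    """KL(y, Ax) = sum_i [ y_i log(y_i / (Ax)_i) - y_i + (Ax)_i ]."""
    total = 0.0
    for yi, ai in zip(y, ax):
        total += ai - yi
        if yi > 0:
            total += yi * math.log(yi / ai)
    return total

def richardson_lucy_step(A, y, x):
    """Multiplicative update x <- x * A^T(y / Ax) / A^T 1."""
    ratio = [yi / ai for yi, ai in zip(y, matvec(A, x))]
    back = rmatvec(A, ratio)
    sens = rmatvec(A, [1.0] * len(A))
    return [xj * bj / sj for xj, bj, sj in zip(x, back, sens)]

A = [[0.8, 0.2], [0.2, 0.8]]   # hypothetical blur operator
y = [5.0, 2.0]                 # hypothetical photon counts
x = [1.0, 1.0]
history = [poisson_kl(y, matvec(A, x))]
for _ in range(5):
    x = richardson_lucy_step(A, y, x)
    history.append(poisson_kl(y, matvec(A, x)))
```

The multiplicative form keeps the iterate positive, which is why this family of updates suits Poisson data; the paper's contribution, per the abstract, is keeping a convergence guarantee of this kind once a pre-trained denoiser enters the objective.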

Keywords: Poisson inverse problems, plug-and-play algorithm, majorization-minimization, Kullback-Leibler data fidelity, pre-trained neural network, deconvolution, tomography, nuclear medicine

192. ❌ Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection

Authors: Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24166v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

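Scoring rationale: The paper studies a vision-language understanding task under data scarcity (Referring Object Detection) and proposes HeROD, a framework of heuristic-inspired reasoning priors. It is unrelated to most large-model keywords (such as LLMs, MoE, and RLHF), because those are natural language processing techniques, whereas this paper addresses a computer vision / vision-language task. Its core is reasoning priors and interpretable AI, so it has some relevance to 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' (5 points), 'System 2 Thinking/Slow Thinking/In-depth Reasoning' (5 points), and 'Mechanistic Interpretability/Explainable AI' (5 points): the paper mentions 'interpretable reasoning priors' and 'interpretable signals', although these are not typical uses of those keywords in the large-model context. The paper is unrelated to 'AI for Science/Bioinformatics/Cheminformatics', because it studies a general vision-language task rather than an application in a specific scientific field.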

!!! tip deepseek-chat TL;DR

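The paper studies how injecting heuristic reasoning priors can improve Referring Object Detection when data is scarce; the proposed HeROD framework clearly outperforms existing baselines on several benchmark datasets.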

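Translated abstract

Most referring object detection models, especially modern grounding detectors, are designed for data-rich settings, yet many practical deployments (such as robotics, augmented reality, and other specialized domains) face severe label scarcity. In such cases, end-to-end grounding detectors must learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: when data is scarce, can explicit reasoning priors help models learn more efficiently? To explore this, we first introduce the data-efficient referring object detection task, a benchmark protocol for measuring referring object detection performance in low-data and few-shot settings. We then propose HeROD (Heuristic-inspired Referring Object Detection), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived from the referring phrase, into three stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors are expected to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms mainstream grounding baselines in label-scarce settings. More broadly, our results suggest that integrating simple, interpretable reasoning priors offers a practical and extensible path to better data efficiency in vision-language understanding.

Abstract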

Most referring object detection (ROD) models, especially the modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, would face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce a Data-efficient Referring Object Detection (De-ROD) task, which is a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose the HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived based on the referring phrase, into 3 stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision-language understanding.
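To illustrate what a heuristic spatial prior derived from the referring phrase might look like at the proposal-ranking stage, here is a minimal sketch. The keyword rule, the linear fusion with weight `alpha`, and the example boxes are all assumptions made for illustration; the abstract does not specify HeROD's actual priors or fusion rule.

```python
def spatial_prior(phrase, box, image_width):
    """Score in [0, 1] from a left/right cue; box = (x_min, y_min, x_max, y_max)."""
    center_x = (box[0] + box[2]) / 2 / image_width
    words = phrase.lower().split()
    if "left" in words:
        return 1.0 - center_x
    if "right" in words:
        return center_x
    return 0.5  # no spatial cue: neutral prior

def rerank(phrase, proposals, image_width, alpha=0.5):
    """Blend detector confidence with the prior and sort, best first."""
    scored = [
        ((1 - alpha) * conf + alpha * spatial_prior(phrase, box, image_width), box)
        for box, conf in proposals
    ]
    return [box for _, box in sorted(scored, key=lambda s: s[0], reverse=True)]

# Hypothetical proposals: (box, detector confidence) in a 100-pixel-wide image.
proposals = [((0, 0, 10, 10), 0.6), ((90, 0, 100, 10), 0.5)]
ranked_right = rerank("the dog on the right", proposals, image_width=100)
ranked_left = rerank("the left cup", proposals, image_width=100)
```

With few labels, a detector's confidence for the two boxes may be nearly tied; a cheap, interpretable cue like this can break the tie without any extra training data.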

Keywords: Referring Object Detection, Data-efficient Learning, Heuristic-inspired Reasoning, Vision-Language Understanding, DETR-style Pipeline, Interpretable Priors, Few-shot Learning, Label Scarcity

193. ❌ LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds

Authors: Jaehun Bang, Jinhyeok Kim, Minji Kim, Seungheon Jeong, Kyungdon Joo Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24146v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

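Rationale: The paper addresses open-vocabulary 3D scene understanding in computer vision, focusing on 3D representations, semantic segmentation, and efficient inference. It does not involve large language models, deep learning technical principles, or scientific applications, so it has no direct connection to any scoring keyword.

!!! tip deepseek-chat TL;DR

The paper proposes LightSplat, a framework that injects compact 2-byte semantic indices into 3D representations to address the slow speed and high memory use of open-vocabulary 3D scene understanding, achieving up to 50-400x speedup and 64x lower memory.

Abstract translation

Open-vocabulary 3D scene understanding lets users segment novel objects in complex 3D environments through natural language. However, existing methods remain slow, memory-intensive, and overly complex because they rely on iterative optimization and dense per-Gaussian feature assignment. To address this, we propose LightSplat, a fast, memory-efficient, training-free framework that injects compact 2-byte semantic indices from multi-view images into the 3D representation. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. We further ensure semantic consistency and efficient inference through single-step clustering that links geometrically and semantically related masks in 3D space. We evaluate the method on the LERF-OVS, ScanNet, and DL3DV-OVS datasets across complex indoor and outdoor scenes. The results show that LightSplat achieves state-of-the-art performance with up to 50-400x speedup and 64x lower memory use, enabling scalable language-driven 3D understanding. For more details, see the project page: https://vision3d-lab.github.io/lightsplat/.

Abstract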

Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments. To address this, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. We further ensure semantic consistency and efficient inference via single-step clustering that links geometrically and semantically related masks in 3D. We evaluate our method on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor-outdoor scenes. As a result, LightSplat achieves state-of-the-art performance with up to 50-400x speedup and 64x lower memory, enabling scalable language-driven 3D understanding. For more details, visit our project page https://vision3d-lab.github.io/lightsplat/.

Keywords: open-vocabulary 3D scene understanding, semantic segmentation, 3D representations, memory-efficient, fast inference, training-free framework, multi-view images, semantic indices
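
The storage idea behind an index-feature mapping can be sketched roughly as follows. This is not LightSplat's implementation: `FEATURE_DIM`, the float32 assumption, and the helper names are hypothetical, and the arithmetic is not meant to reproduce the paper's 64x figure. It only shows why one small index per Gaussian plus a shared table is cheaper than a dense feature per Gaussian.

```python
from array import array

FEATURE_DIM = 512      # hypothetical embedding size
BYTES_PER_FLOAT = 4    # assumes float32 features


def dense_bytes(num_gaussians, dim=FEATURE_DIM):
    """Storage when every Gaussian carries its own feature vector."""
    return num_gaussians * dim * BYTES_PER_FLOAT


def indexed_bytes(num_gaussians, num_objects, dim=FEATURE_DIM):
    """Storage with one 2-byte index per Gaussian plus a shared feature table."""
    index_store = num_gaussians * 2
    table = num_objects * dim * BYTES_PER_FLOAT
    return index_store + table


# Type code 'H' is an unsigned short (typically 2 bytes, values 0..65535).
indices = array("H", [0, 0, 1, 2, 1])
feature_table = {
    0: [0.1] * FEATURE_DIM,
    1: [0.2] * FEATURE_DIM,
    2: [0.3] * FEATURE_DIM,
}


def feature_of(gaussian_id):
    """Look up a Gaussian's semantic feature through its index."""
    return feature_table[indices[gaussian_id]]


print(dense_bytes(1000), indexed_bytes(1000, 10))
```

Gaussians without a salient semantic label could simply be left out of the index store, which matches the abstract's point about assigning indices only to salient regions.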

194. ❌ Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

Authors: Zhanhe Lei, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Yuhong Yang, Zhen Han, Chao Liang, Dengpan Ye Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24139v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

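Rationale: The paper focuses on deepfake detection and proposes a reinforcement-learning-based dynamic curriculum learning framework (Tutor-Student Reinforcement Learning) that uses the PPO algorithm to optimize how weights are assigned to training samples. Although the paper involves deep learning (specifically deepfake detection in computer vision) and reinforcement learning (PPO), every listed keyword relates directly to large language models (LLMs) and their associated techniques (such as MoE, SFT, RLHF, RAG, CoT, and agents) or to AI applications in specific scientific fields (such as bioinformatics). The paper does not mention large language models, their training methods, inference optimization, alignment techniques, or scientific applications, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

Standard supervised training for deepfake detection treats all samples equally. To address this shortcoming, the paper proposes a reinforcement-learning-based Tutor-Student framework that optimizes the training curriculum by dynamically adjusting sample weights, improving the detector's generalization to unseen manipulation techniques.

Abstract translation

Standard supervised training for deepfake detection gives all samples equal importance, which may hinder learning robust and generalizable features. This work proposes a novel Tutor-Student reinforcement learning framework to dynamically optimize the training curriculum. The method models training as a Markov decision process in which a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a proximal policy optimization agent, observes a rich state representation of each training sample that covers not only its visual features but also its historical learning dynamics, such as exponential moving average loss and forgetting counts. Based on this state, it acts by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor's reward is based on the immediate change in the Student's performance, specifically rewarding transitions from incorrect to correct predictions. This strategy leads the Tutor to learn a curriculum that prioritizes high-value samples (such as hard-but-learnable examples), yielding a more efficient and effective training process. Experiments show that, compared with traditional training methods, this adaptive curriculum improves the Student's generalization to unseen manipulation techniques. Code is available at https://github.com/wannac1/TSRL.

Abstract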

Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample’s loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student’s immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student’s generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.

Keywords: deepfake detection, reinforcement learning, curriculum learning, Tutor-Student framework, PPO, dynamic sample weighting, generalization, training optimization
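
The per-sample bookkeeping the abstract mentions (EMA loss, forgetting counts, a reward on incorrect-to-correct transitions, and weighted losses) might look roughly like the sketch below. It is not taken from the TSRL code: the class and field names, the decay of 0.9, the binary reward, and treating a never-seen sample as "incorrect" are all assumptions, and the PPO policy that would choose the weights is omitted.

```python
class SampleState:
    """Per-sample learning history (names and defaults are hypothetical)."""

    def __init__(self, ema_decay=0.9):
        self.ema_decay = ema_decay
        self.ema_loss = None
        self.forget_count = 0
        self.was_correct = False  # assumption: unseen samples start as incorrect

    def update(self, loss, correct):
        """Record one training step; return 1.0 on a wrong-to-right flip."""
        if self.ema_loss is None:
            self.ema_loss = loss
        else:
            self.ema_loss = (self.ema_decay * self.ema_loss
                             + (1 - self.ema_decay) * loss)
        if self.was_correct and not correct:
            self.forget_count += 1  # the sample was learned, then forgotten
        reward = 1.0 if (not self.was_correct and correct) else 0.0
        self.was_correct = correct
        return reward


def weighted_batch_loss(losses, weights):
    """Average of per-sample losses scaled by tutor weights clipped to [0, 1]."""
    clipped = [min(1.0, max(0.0, w)) for w in weights]
    return sum(w * l for w, l in zip(clipped, losses)) / len(losses)


state = SampleState()
rewards = [state.update(1.0, False), state.update(0.5, True), state.update(0.8, False)]
print(rewards, state.forget_count, round(state.ema_loss, 3))
```

In a full system the tuple (visual features, `ema_loss`, `forget_count`) would form the state the Tutor observes, and the returned reward would feed its policy update.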

195. ❌ Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation

Authors: Haoyu Ji, Bowen Chen, Zhihao Yang, Wenze Huang, Yu Gao, Xueting Liu, Weihong Ren, Zhiyong Wang, Honghai Liu Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24134v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

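Rationale: The paper focuses on the computer vision task of skeleton-based action segmentation and proposes a frequency-selective filtering framework based on frequency-domain analysis. All keywords concern large language models, deep learning technical principles, or AI applications in science, and the paper's content has no direct connection to them. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because action segmentation can be viewed as an application of AI to scientific or biological motion analysis; however, the paper does not explicitly mention scientific applications, so it receives 5 points (some relevance). All other keywords, which cover large-model techniques, training methods, inference optimization, agent systems, and so on, are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper targets insufficient discrimination between adjacent actions and blurred boundaries in skeleton-based temporal action segmentation. It proposes the Spectral Scalpel framework, which enhances inter-action differences through frequency-selective filtering and achieves state-of-the-art performance on five public datasets.

Abstract translation

Skeleton-based Temporal Action Segmentation (STAS) aims to densely segment and classify actions in long, untrimmed skeletal motion sequences. However, existing STAS methods face limited inter-class discriminability and blurred segmentation boundaries, mainly because the spatio-temporal patterns of adjacent actions are not sufficiently distinguished. To address these limitations, we propose Spectral Scalpel, a frequency-selective filtering framework that suppresses the frequency components shared by adjacent distinct actions while amplifying their action-specific frequencies, thereby enhancing inter-action differences and sharpening transition boundaries. Specifically, Spectral Scalpel uses adaptive multi-scale spectral filters as the "scalpel" to edit the frequency spectrum, combined with a discrepancy loss between adjacent actions as the "surgical objective". This design amplifies the representational differences between adjacent actions and effectively mitigates boundary localization ambiguity and inter-class confusion. In addition, to complement long-term temporal modeling, we introduce a frequency-aware channel mixer that strengthens channel evolution by aggregating spectra across channels. This work proposes a novel paradigm for STAS that extends conventional spatio-temporal modeling with frequency-domain analysis. Extensive experiments on five public datasets show that Spectral Scalpel achieves state-of-the-art performance. Code is available at https://github.com/HaoyuJi/SpecScalpel.

Abstract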

Skeleton-based Temporal Action Segmentation (STAS) seeks to densely segment and classify diverse actions within long, untrimmed skeletal motion sequences. However, existing STAS methodologies face challenges of limited inter-class discriminability and blurred segmentation boundaries, primarily due to insufficient distinction of spatio-temporal patterns between adjacent actions. To address these limitations, we propose Spectral Scalpel, a frequency-selective filtering framework aimed at suppressing shared frequency components between adjacent distinct actions while amplifying their action-specific frequencies, thereby enhancing inter-action discrepancies and sharpening transition boundaries. Specifically, Spectral Scalpel employs adaptive multi-scale spectral filters as scalpels to edit frequency spectra, coupled with a discrepancy loss between adjacent actions serving as the surgical objective. This design amplifies representational disparities between neighboring actions, effectively mitigating boundary localization ambiguities and inter-class confusion. Furthermore, complementing long-term temporal modeling, we introduce a frequency-aware channel mixer to strengthen channel evolution by aggregating spectra across channels. This work presents a novel paradigm for STAS that extends conventional spatio-temporal modeling by incorporating frequency-domain analysis. Extensive experiments on five public datasets demonstrate that Spectral Scalpel achieves state-of-the-art performance. Code is available at https://github.com/HaoyuJi/SpecScalpel.

Keywords: Skeleton-based Action Segmentation, Frequency-selective Filtering, Spectral Scalpel, Adjacent Action Discrepancy, Temporal Action Segmentation, Frequency-domain Analysis, Multi-scale Spectral Filters, State-of-the-art Performance
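
Frequency-selective filtering in its simplest form can be sketched as below: transform a 1D signal, scale chosen frequency bins, and transform back. This is a toy under stated assumptions, not the Spectral Scalpel model. The gains here are fixed by hand, whereas the abstract describes adaptive multi-scale filters trained with a discrepancy loss, and the function names and the example signal are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def frequency_filter(x, gains):
    """Scale bin k and its mirror bin n-k by gains[k]; other bins are kept."""
    n = len(x)
    spectrum = dft(x)
    for k, g in gains.items():
        spectrum[k] *= g
        if k % n != (n - k) % n:  # skip the mirror for bins that are their own mirror
            spectrum[(n - k) % n] *= g
    return idft(spectrum)


N = 32
shared = [math.sin(2 * math.pi * 2 * t / N) for t in range(N)]    # common to both actions
specific = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]  # action-specific
signal = [a + b for a, b in zip(shared, specific)]

# Suppress the shared component (bin 2) and amplify the specific one (bin 5).
filtered = frequency_filter(signal, {2: 0.0, 5: 2.0})
print(max(abs(f - 2 * s) for f, s in zip(filtered, specific)) < 1e-9)
```

After filtering, only the amplified action-specific sinusoid remains, which is the intuition behind suppressing shared frequencies while boosting distinctive ones.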

196. ❌ Combi-CAM: A Novel Multi-Layer Approach for Explainable Image Geolocalization

Authors: David Faget, José Luis Lisani, Miguel Colom Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24117v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

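Rationale: The paper focuses on image geolocalization in computer vision, using convolutional neural networks (CNNs) and class activation map (CAM) techniques to improve model explainability. All keywords concern large language models (LLMs), innovations in deep learning technical principles, or scientific applications, whereas this paper studies a conventional CNN applied to a specific vision task and does not touch the core of any keyword, such as large models, MoE, scaling laws, training techniques, inference optimization, agents, or quantization and compression. The only weakly related keyword is 'Mechanistic Interpretability OR Explainable AI', because the paper involves an explainability method; however, it targets CNNs rather than LLMs, so it receives 5 points (some relevance). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Combi-CAM, a new method that enhances the explainability of image geolocalization models by combining gradient-weighted class activation maps from multiple CNN layers, giving a more detailed understanding of how different image features affect the model's decisions.

Abstract translation

Planet-scale image geolocalization is a complex task: estimating the geographic location depicted in an image solely from its visual features. Although deep learning models, especially convolutional neural networks (CNNs), have significantly advanced the field, understanding the reasoning behind their predictions remains challenging. This paper proposes Combi-CAM, a new method that improves the explainability of CNN-based geolocalization models by combining gradient-weighted class activation maps obtained from multiple layers of the network architecture, rather than using only information from the deepest layer as conventional methods do. The method gives a more fine-grained understanding of how different image features affect the model's decisions and provides deeper insight than conventional methods.

Abstract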

Planet-scale photo geolocalization involves the intricate task of estimating the geographic location depicted in an image purely based on its visual features. While deep learning models, particularly convolutional neural networks (CNNs), have significantly advanced this field, understanding the reasoning behind their predictions remains challenging. In this paper, we present Combi-CAM, a novel method that enhances the explainability of CNN-based geolocalization models by combining gradient-weighted class activation maps obtained from several layers of the network architecture, rather than using only information from the deepest layer as is typically done. This approach provides a more detailed understanding of how different image features contribute to the model’s decisions, offering deeper insights than the traditional approaches.

Keywords: image geolocalization, explainable AI, convolutional neural networks, class activation maps, gradient-weighted CAM, model interpretability, visual features, deep learning
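
Combining activation maps from several layers could be sketched as follows. The abstract does not state how Combi-CAM merges the maps, so the choices here (nearest-neighbour upsampling, min-max normalization, and a plain average) are assumptions, as are the function names and the toy maps. Computing the Grad-CAM maps themselves is left out.

```python
def upsample_nearest(cam, size):
    """Resize a square 2D map to size x size by nearest-neighbour lookup."""
    h, w = len(cam), len(cam[0])
    return [[cam[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def normalize(cam):
    """Min-max scale a map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


def combine_cams(cams, size):
    """Average the normalized, upsampled maps (an assumed combination rule)."""
    maps = [normalize(upsample_nearest(c, size)) for c in cams]
    return [[sum(m[r][c] for m in maps) / len(maps) for c in range(size)]
            for r in range(size)]


deep = [[0.0, 4.0],
        [0.0, 0.0]]                 # coarse 2x2 map from a deep layer
shallow = [[0.0, 0.0, 0.0, 2.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]    # finer 4x4 map from an earlier layer
combined = combine_cams([deep, shallow], size=4)
print(combined[0][3], combined[0][2], combined[3][2])  # -> 1.0 0.5 0.25
```

The combined map keeps the coarse region flagged by the deep layer while adding fine detail that only the shallower layer resolves.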

197. ❌ Retinal Layer Segmentation in OCT Images With 2.5D Cross-slice Feature Fusion Module for Glaucoma Assessment

Authors: Hyunwoo Kim, Heesuk Kim, Wungrak Choi, Jae-Sang Hyun Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24115v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

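Rationale: The paper focuses on medical image segmentation (retinal layer segmentation in OCT), which belongs to computer vision and medical image analysis. It is entirely unrelated to all large-model and deep learning technique keywords (such as LLM, MoE, SFT, RLHF, and RAG), so every keyword except 'AI for Science' scores 0. 'AI for Science' receives 5 points because the paper applies AI to medicine (glaucoma assessment), which is a scientific application but not an application of large models to science, so the relevance is moderate.

!!! tip deepseek-chat TL;DR

The paper proposes a 2.5D framework for retinal layer segmentation in OCT images that improves segmentation accuracy and robustness through a cross-slice feature fusion module, in support of glaucoma assessment.

Abstract translation

Reliable retinal layer segmentation of retinal optical coherence tomography (OCT) images is essential for accurate glaucoma diagnosis and monitoring. However, existing 2D segmentation methods often produce slice-to-slice inconsistencies because they lack contextual information across adjacent B-scans. 3D segmentation methods capture inter-slice context better but require expensive computational resources. To address these limitations, we propose a 2.5D segmentation framework that incorporates a novel cross-slice feature fusion (CFF) module into a U-Net-like architecture. By fusing inter-slice features, the CFF module effectively captures contextual information, enabling consistent boundary detection across slices and improved robustness in noisy regions. The framework was validated on both a clinical dataset and the public DUKE DME dataset. Compared with other segmentation methods without the CFF module, the proposed method reduces the mean absolute distance by 8.56% and the root mean square error by 13.92%, showing higher segmentation accuracy and robustness. Overall, the proposed 2.5D framework balances contextual awareness and computational efficiency, enables anatomically reliable retinal layer delineation, and supports automated glaucoma assessment and potential clinical applications.

Abstract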

For accurate glaucoma diagnosis and monitoring, reliable retinal layer segmentation in OCT images is essential. However, existing 2D segmentation methods often suffer from slice-to-slice inconsistencies due to the lack of contextual information across adjacent B-scans. 3D segmentation methods are better for capturing slice-to-slice context, but they require expensive computational resources. To address these limitations, we propose a 2.5D segmentation framework that incorporates a novel cross-slice feature fusion (CFF) module into a U-Net-like architecture. The CFF module fuses inter-slice features to effectively capture contextual information, enabling consistent boundary detection across slices and improved robustness in noisy regions. The framework was validated on both a clinical dataset and the publicly available DUKE DME dataset. Compared to other segmentation methods without the CFF module, the proposed method achieved an 8.56% reduction in mean absolute distance and a 13.92% reduction in root mean square error, demonstrating improved segmentation accuracy and robustness. Overall, the proposed 2.5D framework balances contextual awareness and computational efficiency, enabling anatomically reliable retinal layer delineation for automated glaucoma evaluation and potential clinical applications.

Keywords: retinal layer segmentation, OCT images, 2.5D framework, cross-slice feature fusion, glaucoma assessment, U-Net, computational efficiency, clinical applications
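
The "2.5D" idea of giving each slice its neighbours as context can be illustrated with the small sketch below. It only shows how adjacent B-scans might be gathered for one target slice; the CFF module in the abstract fuses learned features rather than raw slices, and the function name, the `radius` parameter, and the edge-clamping rule are assumptions.

```python
def neighbor_stack(volume, index, radius=1):
    """Return slices index-radius..index+radius, clamping at the volume edges."""
    last = len(volume) - 1
    return [volume[min(last, max(0, index + d))]
            for d in range(-radius, radius + 1)]


volume = ["scan0", "scan1", "scan2", "scan3"]  # stand-ins for 2D B-scans
print(neighbor_stack(volume, 0))  # -> ['scan0', 'scan0', 'scan1']
print(neighbor_stack(volume, 2))  # -> ['scan1', 'scan2', 'scan3']
```

Each stack costs only a few 2D slices rather than the whole volume, which is the trade-off between context and computation that the abstract attributes to the 2.5D setting.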

198. ❌ Reservoir-Based Graph Convolutional Networks

Authors: Mayssa Soussia, Gita Ayu Salsabila, Mohamed Ali Mahjoub, Islem Rekik Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24131v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

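Rationale: The paper focuses on improving graph neural networks (GNNs) and graph convolutional networks (GCNs), in particular by integrating reservoir computing to address long-range dependencies and over-smoothing in GCNs. Its core is the processing of graph-structured data; it does not involve large language models (LLMs), innovations in deep learning technical principles (such as MoE, scaling laws, training methods, alignment, or inference optimization), or applications of large models in different domains. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies RGC-Net to dynamic brain connectivity graph generation, which falls within bioinformatics or AI for science, but this is not the core focus (the main contribution is the model itself; the application is an example). That keyword therefore receives 5 points (some relevance), and all other keywords score 0 (entirely unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes RGC-Net, a reservoir-based graph convolutional network that integrates reservoir dynamics with structured graph convolution to address long-range dependencies and over-smoothing in conventional graph convolutional networks, and achieves state-of-the-art performance on graph classification and generation tasks (such as brain graph evolution).

Abstract translation

Message passing is the core mechanism of graph neural networks (GNNs), iteratively updating node embeddings by aggregating information from neighboring nodes. Graph convolutional networks (GCNs) are a typical example of this approach: they adapt convolution to graph structures so that features from adjacent nodes are combined effectively. However, GCNs face challenges with complex or dynamic data. Capturing long-range dependencies usually requires deeper layers, which not only increases computational cost but also causes over-smoothing, where node embeddings become indistinguishable. To overcome these challenges, reservoir computing has been integrated into GNNs, exploiting the dynamics of iterative message passing to achieve stable information propagation without extensive parameter tuning. Despite their promise, existing reservoir-based models lack a structured convolutional mechanism, which limits their ability to aggregate multi-hop neighborhood information accurately. To address these limitations, we propose RGC-Net (Reservoir-based Graph Convolutional Network), which combines reservoir dynamics with structured graph convolution. The main contributions are: (i) a redesigned convolutional framework that uses fixed random reservoir weights and a leaky integrator to enhance feature retention; (ii) a robust, adaptable model for graph classification; and (iii) an RGC-Net-based transformer for graph generation, applied to dynamic brain connectivity analysis. Extensive experiments show that RGC-Net achieves state-of-the-art performance on classification and generation tasks (including brain graph evolution), with faster convergence and less over-smoothing. Source code is available at https://github.com/basiralab/RGC-Net.

Abstract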

Message passing is a core mechanism in Graph Neural Networks (GNNs), enabling the iterative update of node embeddings by aggregating information from neighboring nodes. Graph Convolutional Networks (GCNs) exemplify this approach by adapting convolutional operations for graph structures, allowing features from adjacent nodes to be combined effectively. However, GCNs encounter challenges with complex or dynamic data. Capturing long-range dependencies often requires deeper layers, which not only increase computational costs but also lead to over-smoothing, where node embeddings become indistinguishable. To overcome these challenges, reservoir computing has been integrated into GNNs, leveraging iterative message-passing dynamics for stable information propagation without extensive parameter tuning. Despite its promise, existing reservoir-based models lack structured convolutional mechanisms, limiting their ability to accurately aggregate multi-hop neighborhood information. To address these limitations, we propose RGC-Net (Reservoir-based Graph Convolutional Network), which integrates reservoir dynamics with structured graph convolution. Key contributions include: (i) a reimagined convolutional framework with fixed random reservoir weights and a leaky integrator to enhance feature retention; (ii) a robust, adaptable model for graph classification; and (iii) an RGC-Net-powered transformer for graph generation with application to dynamic brain connectivity. Extensive experiments show that RGC-Net achieves state-of-the-art performance in classification and generative tasks, including brain graph evolution, with faster convergence and reduced over-smoothing. Source code is available at https://github.com/basiralab/RGC-Net.

Keywords: Graph Neural Networks, Graph Convolutional Networks, Reservoir Computing, Long-range Dependencies, Over-smoothing, Graph Classification, Graph Generation, Brain Connectivity
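
A leaky-integrator reservoir update combined with neighbourhood aggregation might be sketched as below. The update rule `h = (1 - leak) * h + leak * tanh(A (X W_in + h W_res))` is an assumption in the spirit of echo state networks, not the equation from the RGC-Net paper, and the function names, weight scales, and toy graph are hypothetical. The point it illustrates is that the random weights are drawn once and never trained.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def random_matrix(rows, cols, scale, rng):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def reservoir_graph_conv(adj_norm, x, hidden=4, steps=3, leak=0.5, seed=0):
    """Run a few leaky reservoir updates over a graph (assumed update rule)."""
    rng = random.Random(seed)
    w_in = random_matrix(len(x[0]), hidden, 1.0, rng)  # fixed, never trained
    w_res = random_matrix(hidden, hidden, 0.5, rng)    # fixed, never trained
    h = [[0.0] * hidden for _ in x]
    x_proj = matmul(x, w_in)
    for _ in range(steps):
        recurrent = matmul(h, w_res)
        mixed = [[xp + hr for xp, hr in zip(x_row, h_row)]
                 for x_row, h_row in zip(x_proj, recurrent)]
        pre = matmul(adj_norm, mixed)  # aggregate over neighbours
        # Leaky integration keeps part of the previous state at every step.
        h = [[(1 - leak) * hv + leak * math.tanh(pv) for hv, pv in zip(h_row, p_row)]
             for h_row, p_row in zip(h, pre)]
    return h


# Row-normalized adjacency (with self-loops) of a 3-node path graph.
adj_norm = [[0.5, 0.5, 0.0],
            [1 / 3, 1 / 3, 1 / 3],
            [0.0, 0.5, 0.5]]
features = [[1.0], [0.0], [-1.0]]
embeddings = reservoir_graph_conv(adj_norm, features)
print(len(embeddings), len(embeddings[0]))  # -> 3 4
```

Because only a readout on top of `embeddings` would need training, this style of model avoids backpropagating through many stacked convolution layers.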

199. ❌ Granular Ball Guided Stable Latent Domain Discovery for Domain-General Crowd Counting

Authors: Fan Chen, Shuyin Xia, Yi Wang, Xinbo Gao Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24106v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

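Rationale: The paper studies domain generalization for crowd counting and proposes a granular-ball-guided framework for stable latent domain discovery. Its core technique is a domain generalization method in computer vision, involving granular ball clustering and semantic-style disentanglement. It is essentially unrelated to the large-model and deep learning technique themes in the keyword list. The only weakly related keyword is the 'Domain Adaptation' part of 'Pre-training OR Continual Pre-training OR Domain Adaptation', because the paper deals with domain generalization, a concept related to domain adaptation; however, the paper's focus is not pre-training for large models or deep learning, so it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

For single-source domain-generalized crowd counting, the paper proposes a granular-ball-guided framework for stable latent domain discovery. Through hierarchical clustering and semantic-style disentangled learning, it significantly improves generalization under domain shift on multiple datasets.

Abstract translation

Single-source domain generalization for crowd counting remains highly challenging, because a single labeled source domain often contains heterogeneous latent domains while test data may exhibit severe distribution shift. A core difficulty is stable latent domain discovery: flat clustering applied directly to evolving sample-level latent features is easily affected by feature noise, outliers, and representation drift, which leads to unreliable pseudo-domain assignments and weakens domain-structured learning. To address this, we propose a granular-ball-guided stable latent domain discovery framework for domain-general crowd counting. Specifically, the method first organizes samples into compact local granular balls and then clusters the granular ball centers as representatives to obtain pseudo-domains, turning direct sample-level clustering into a hierarchical, representative-based clustering process. This design produces more stable and semantically consistent pseudo-domain assignments. Building on the discovered latent domains, we further construct a two-branch learning framework that enhances transferable semantic representations through semantic codebook re-encoding while modeling domain-specific appearance variation through a style branch, thereby reducing the entanglement of semantics and style and improving generalization under domain shift. Extensive experiments on the ShanghaiTech A/B, UCF_QNRF, and NWPU-Crowd datasets under a strict no-adaptation protocol show that the method consistently outperforms strong baselines, especially when the domain gap is large.

Abstract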

Single-source domain generalization for crowd counting remains highly challenging because a single labeled source domain often contains heterogeneous latent domains, while test data may exhibit severe distribution shifts. A fundamental difficulty lies in stable latent domain discovery: directly performing flat clustering on evolving sample-level latent features is easily affected by feature noise, outliers, and representation drift, leading to unreliable pseudo-domain assignments and weakened domain-structured learning. To address this issue, we propose a granular ball guided stable latent domain discovery framework for domain-general crowd counting. Specifically, the proposed method first organizes samples into compact local granular balls and then clusters granular ball centers as representatives to obtain pseudo-domains, transforming direct sample-level clustering into a hierarchical representative-based clustering process. This design yields more stable and semantically consistent pseudo-domain assignments. Built upon the discovered latent domains, we further develop a two-branch learning framework that enhances transferable semantic representations via semantic codebook re-encoding while modeling domain-specific appearance variations through a style branch, thereby reducing semantic–style entanglement and improving generalization under domain shifts. Extensive experiments on ShanghaiTech A/B, UCF_QNRF, and NWPU-Crowd under a strict no-adaptation protocol demonstrate that the proposed method consistently outperforms strong baselines, especially under large domain gaps.

Keywords: crowd counting, domain generalization, latent domain discovery, granular ball clustering, semantic-style disentanglement, single-source domain adaptation, hierarchical clustering, transferable representations
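
The two-level clustering described in the abstract (group samples into local balls, then cluster the ball centres) can be sketched as follows. The greedy radius rule and the plain k-means step are stand-ins chosen for brevity, not the paper's granular-ball construction, and every function name and parameter here is hypothetical.

```python
import math


def build_balls(points, radius):
    """Greedily group points: join the first ball whose centre is within radius."""
    balls, centers = [], []
    for i, p in enumerate(points):
        for b, c in enumerate(centers):
            if math.dist(p, c) <= radius:
                balls[b].append(i)
                members = [points[j] for j in balls[b]]
                centers[b] = tuple(sum(v) / len(members) for v in zip(*members))
                break
        else:
            balls.append([i])
            centers.append(tuple(p))
    return balls, centers


def cluster_centers(centers, k, iters=10):
    """Plain k-means over ball centres, seeded with the first k centres."""
    means = list(centers[:k])
    labels = [0] * len(centers)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(c, means[j]))
                  for c in centers]
        for j in range(k):
            group = [c for c, lab in zip(centers, labels) if lab == j]
            if group:
                means[j] = tuple(sum(v) / len(group) for v in zip(*group))
    return labels


def pseudo_domains(points, radius, k):
    """Label each point with the cluster of the ball that contains it."""
    balls, centers = build_balls(points, radius)
    ball_labels = cluster_centers(centers, k)
    labels = [None] * len(points)
    for members, label in zip(balls, ball_labels):
        for i in members:
            labels[i] = label
    return labels


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (10.0, 0.0), (10.1, 0.0)]
print(pseudo_domains(points, radius=0.3, k=2))  # -> [0, 0, 0, 0, 1, 1]
```

Clustering three ball centres instead of six raw points is what makes the assignment less sensitive to individual noisy samples, which is the stability argument made in the abstract.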

200. ❌ LGTM: Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation

Authors: Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser, Ko Watanabe, Riku Takahashi, Andreas Dengel Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24086v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

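Scoring rationale: The paper addresses lighting control in text-to-image generation with diffusion models and proposes a training-free method, LGTM. Every scoring keyword targets large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on), whereas this paper centers on diffusion models, which are generative AI but outside the LLM field. It does not touch any LLM technique, architecture, training method, or application scenario, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper proposes LGTM, a training-free diffusion method that manipulates the initial latent noise to achieve fine-grained lighting control in text-to-image generation. It avoids the two-stage pipelines and fine-tuning that existing methods rely on, and its experiments show better lighting consistency than prompt-based baselines while preserving image quality.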

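Abstract translation

Diffusion models have shown high-quality performance in conditional text-to-image generation, especially when handling structural cues such as edges, layouts, and depth. Lighting conditions, however, have received limited attention in the generation process and remain hard to control precisely. Existing methods usually handle lighting with a two-stage pipeline that relights the image after generation, which is inefficient. They also depend on large datasets and heavy computation for fine-tuning, which limits their adaptability to new models and tasks. To address this, we propose a Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation (LGTM), which manipulates the initial latent noise of the diffusion process and guides image generation with a text prompt and a user-specified light direction. Through a channel-wise analysis of the latent space, we find that selectively manipulating latent channels enables fine-grained lighting control without fine-tuning or modifying the pre-trained model. Extensive experiments show that our method surpasses prompt-based baselines in lighting consistency while maintaining image quality and text alignment. The method opens new possibilities for dynamic, user-guided lighting control. It can also be integrated seamlessly with models such as ControlNet, showing broad adaptability across scenarios.

Abstract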

Diffusion models have demonstrated high-quality performance in conditional text-to-image generation, particularly with structural cues such as edges, layouts, and depth. However, lighting conditions have received limited attention and remain difficult to control within the generative process. Existing methods handle lighting through a two-stage pipeline that relights images after generation, which is inefficient. Moreover, they rely on fine-tuning with large datasets and heavy computation, limiting their adaptability to new models and tasks. To address this, we propose a novel Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation (LGTM), which manipulates the initial latent noise of the diffusion process to guide image generation with text prompts and user-specified light directions. Through a channel-wise analysis of the latent space, we find that selectively manipulating latent channels enables fine-grained lighting control without fine-tuning or modifying the pre-trained model. Extensive experiments show that our method surpasses prompt-based baselines in lighting consistency, while preserving image quality and text alignment. This approach introduces new possibilities for dynamic, user-guided light control. Furthermore, it integrates seamlessly with models like ControlNet, demonstrating adaptability across diverse scenarios.

Keywords: Diffusion Models, Text-to-Image Generation, Lighting Control, Training-Free Method, Initial Noise Manipulation, Latent Space Analysis, ControlNet Integration, Image Quality Preservation

201. ❌ LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation

Authors: Haoyu Ji, Xueting Liu, Yu Gao, Wenze Huang, Zhihao Yang, Weihong Ren, Zhiyong Wang, Honghai Liu Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24097v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

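Scoring rationale: The paper focuses on the computer vision task of skeleton-based action segmentation and proposes LaDy, a deep network that incorporates Lagrangian dynamics. Although it involves deep learning, its content is unrelated to all scoring keywords, which revolve around large language models and their related techniques and applications. It does not mention large models, language models, prompt engineering, alignment, reasoning, agents, or compression, and it does not involve AI-for-science applications such as bioinformatics.

!!! tip deepseek-chat TL;DR

The paper targets skeleton-based temporal action segmentation, where existing methods ignore physical dynamics. It proposes the Lagrangian-Dynamic Informed Network (LaDy), which integrates physical dynamics to improve action discrimination and boundary localization, and reports state-of-the-art performance on multiple datasets.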

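Abstract translation

Skeleton-based Temporal Action Segmentation (STAS) aims to densely parse untrimmed skeleton sequences into frame-level action categories. Existing methods are good at capturing spatio-temporal kinematic features but overlook the underlying physical dynamics that govern human motion. This omission limits inter-class discrimination between actions with similar kinematics but different dynamic intents, and hinders precise boundary localization in regions where the dynamic force distribution changes. To address these problems, we propose the Lagrangian-Dynamic Informed Network (LaDy), a framework that integrates the principles of Lagrangian dynamics into the segmentation process. Specifically, LaDy first computes generalized coordinates from joint positions and then estimates the Lagrangian terms under physical constraints to explicitly synthesize generalized forces. To further ensure physical consistency, the proposed Energy Consistency Loss enforces the work-energy theorem, so that the change in kinetic energy agrees with the work done by the net force. The learned dynamic features then drive a Spatio-Temporal Modulation module: spatially, generalized forces are fused with spatial representations to provide more discriminative semantics; temporally, salient dynamic signals are constructed for temporal gating, which improves boundary awareness. Experiments on several challenging datasets show that LaDy achieves state-of-the-art performance, validating the integration of physical dynamics for action segmentation. Code is available at https://github.com/HaoyuJi/LaDy.

Abstract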

Skeleton-based Temporal Action Segmentation (STAS) aims to densely parse untrimmed skeletal sequences into frame-level action categories. However, existing methods, while proficient at capturing spatio-temporal kinematics, neglect the underlying physical dynamics that govern human motion. This oversight limits inter-class discriminability between actions with similar kinematics but distinct dynamic intents, and hinders precise boundary localization where dynamic force profiles shift. To address these, we propose the Lagrangian-Dynamic Informed Network (LaDy), a framework integrating principles of Lagrangian dynamics into the segmentation process. Specifically, LaDy first computes generalized coordinates from joint positions and then estimates Lagrangian terms under physical constraints to explicitly synthesize the generalized forces. To further ensure physical coherence, our Energy Consistency Loss enforces the work-energy theorem, aligning kinetic energy change with the work done by the net force. The learned dynamics then drive a Spatio-Temporal Modulation module: Spatially, generalized forces are fused with spatial representations to provide more discriminative semantics. Temporally, salient dynamic signals are constructed for temporal gating, thereby significantly enhancing boundary awareness. Experiments on challenging datasets show that LaDy achieves state-of-the-art performance, validating the integration of physical dynamics for action segmentation. Code is available at https://github.com/HaoyuJi/LaDy.

Keywords: Skeleton-based Action Segmentation, Lagrangian Dynamics, Spatio-Temporal Modulation, Generalized Forces, Energy Consistency Loss, Temporal Action Segmentation, Physical Dynamics, Boundary Localization
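The work-energy theorem that the Energy Consistency Loss is said to enforce states that the change in kinetic energy equals the work done by the net force. The sketch below shows one way such a residual could be computed on a discretized trajectory. It assumes a single constant scalar mass and finite-difference velocities, and the function names are hypothetical; it is not the loss as implemented in LaDy.

```python
import numpy as np

def energy_consistency_residual(q, force, mass, dt):
    """Per-step gap between kinetic-energy change and work done.

    q:     (T, D) generalized coordinates over T frames
    force: (T, D) net generalized force at each frame
    """
    v = np.gradient(q, dt, axis=0, edge_order=2)   # finite-difference velocities
    kinetic = 0.5 * mass * np.sum(v**2, axis=1)    # T_t = 1/2 * m * |v_t|^2
    delta_kinetic = kinetic[1:] - kinetic[:-1]
    # Work over each step via the trapezoidal rule: W = integral of F . dq
    dq = q[1:] - q[:-1]
    f_mid = 0.5 * (force[1:] + force[:-1])
    work = np.sum(f_mid * dq, axis=1)
    return delta_kinetic - work

def energy_consistency_loss(q, force, mass, dt):
    """Mean squared violation of the work-energy theorem."""
    residual = energy_consistency_residual(q, force, mass, dt)
    return float(np.mean(residual**2))
```

For a trajectory generated by a constant force, the loss is numerically zero when the supplied force is the true one and grows when the force estimate is wrong, which is the signal such a loss would feed back to the force estimator.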

202. ❌ AD-Reasoning: Multimodal Guideline-Guided Reasoning for Alzheimer’s Disease Diagnosis

Authors: Qiuhui Chen, Yushan Deng, Xuancheng Yao, Yi Hong Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24059v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

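Scoring rationale: The paper focuses on a multimodal AI system for Alzheimer's disease diagnosis and is unrelated to most large-model keywords. The main points of relevance are: 1) "AI for Science" is highly relevant (10), because the paper is a biomedical AI application; 2) "Chain of Thought" and "System 2 Thinking" are fairly relevant (8), because the paper emphasizes structured reasoning and reasoning-decision consistency; 3) "Explainable AI" is partially relevant (5), because the paper reports improved transparency. Other keywords such as LLMs, MoE, and RLHF are not addressed. Because every keyword weight in the table above is 0.0, these relevance ratings do not contribute to the total score.

!!! tip deepseek-chat TL;DR

The study proposes AD-Reasoning, a multimodal framework that combines neuroimaging and clinical data with a rule-based verifier to produce structured, guideline-consistent diagnoses of Alzheimer's disease. On the newly released AD-MultiSense dataset it reports state-of-the-art diagnostic accuracy and improved transparency.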

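Abstract translation

Diagnosing Alzheimer’s disease (AD) requires integrating neuroimaging with heterogeneous clinical evidence and reasoning under established criteria, yet most current multimodal models still lack transparency and align poorly with clinical guidelines. This paper presents AD-Reasoning, a multimodal framework that combines structural MRI with six clinical modalities and introduces a rule-based verifier to produce structured diagnoses consistent with the National Institute on Aging and Alzheimer’s Association (NIA-AA) criteria. AD-Reasoning combines modality-specific encoders, bidirectional cross-attention fusion, and reinforcement fine-tuning with verifiable rewards; the rewards enforce a well-formed output format, coverage of the evidence required by the guidelines, and consistency between reasoning and decision. We also release AD-MultiSense, a multimodal question-answering dataset of 10,378 visits built from ADNI/AIBL data, in which all diagnostic rationales are guideline-validated. On AD-MultiSense, AD-Reasoning achieves state-of-the-art diagnostic accuracy and produces structured rationales that improve transparency over recent baselines while providing an interpretable reasoning process.

Abstract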

Alzheimer’s disease (AD) diagnosis requires integrating neuroimaging with heterogeneous clinical evidence and reasoning under established criteria, yet most multimodal models remain opaque and weakly guideline-aligned. We present AD-Reasoning, a multimodal framework that couples structural MRI with six clinical modalities and a rule-based verifier to generate structured, NIA-AA-consistent diagnoses. AD-Reasoning combines modality-specific encoders, bidirectional cross-attention fusion, and reinforcement fine-tuning with verifiable rewards that enforce output format, guideline evidence coverage, and reasoning–decision consistency. We also release AD-MultiSense, a 10,378-visit multimodal QA dataset with guideline-validated rationales built from ADNI/AIBL. On AD-MultiSense, AD-Reasoning achieves state-of-the-art diagnostic accuracy and produces structured rationales that improve transparency over recent baselines, while providing transparent rationales.

Keywords: Alzheimer’s disease diagnosis, multimodal framework, guideline-guided reasoning, structured rationales, reinforcement fine-tuning, clinical evidence integration, transparency improvement, AD-MultiSense dataset
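The abstract names three properties that the verifiable rewards enforce: output format, guideline evidence coverage, and reasoning-decision consistency. A rule-based reward of that general shape might look like the sketch below. Everything specific in it is an assumption made for illustration: the `<reasoning>`/`<diagnosis>` tags, the label set, the evidence keywords, the "Conclusion:" convention, and the equal weighting are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical evidence categories a rationale is expected to mention.
REQUIRED_EVIDENCE = ("mri", "cognitive", "biomarker")
# Hypothetical diagnosis labels.
LABELS = ("CN", "MCI", "AD")

def verifiable_reward(output: str) -> float:
    """Average of three rule-based checks: format, coverage, consistency."""
    match = re.fullmatch(
        r"\s*<reasoning>(.*?)</reasoning>\s*<diagnosis>(.*?)</diagnosis>\s*",
        output,
        flags=re.DOTALL,
    )
    if match is None:
        return 0.0  # malformed output earns nothing
    reasoning = match.group(1).lower()
    diagnosis = match.group(2).strip()
    # Format: the final answer must be one of the allowed labels.
    format_ok = diagnosis in LABELS
    # Coverage: fraction of required evidence types mentioned in the reasoning.
    coverage = sum(key in reasoning for key in REQUIRED_EVIDENCE) / len(REQUIRED_EVIDENCE)
    # Consistency: the conclusion stated in the reasoning must match the label.
    concluded = re.search(r"conclusion:\s*(\w+)", reasoning)
    consistent = concluded is not None and concluded.group(1).upper() == diagnosis
    return (float(format_ok) + coverage + float(consistent)) / 3.0
```

Because every check is a deterministic rule, the reward can be recomputed and audited for any output, which is what makes it "verifiable" in contrast to a learned reward model.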

203. ❌ Beyond Semantic Priors: Mitigating Optimization Collapse for Generalizable Visual Forensics

Authors: Jipeng Liu, Haichao Shi, Siyu Xing, Rong Yin, Xiao-Yu Zhang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24057v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

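Scoring rationale: The paper studies vision-language models (VLMs) for deepfake detection, in particular the optimization collapse that occurs on non-semantic forgeries. Although it involves deep learning and large models (VLMs), all keywords target language models (LLMs) or specific LLM techniques (MoE, RLHF, RAG, and so on), whereas the paper concentrates on vision-language models such as CLIP and on a computer vision task (deepfake detection). It does not involve any language-model technique, LLM application, or specific AI-for-science field such as bioinformatics. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

The paper studies why vision-language models fail to generalize in deepfake detection because of optimization collapse. It offers a theoretical analysis based on the Critical Optimization Radius and proposes the Contrastive Regional Injection Transformer to improve detection of non-semantic forgeries.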

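Abstract translation

Although vision-language models such as CLIP have become the dominant paradigm for generalizable deepfake detection, a disconnect remains at the representation level: their semantics-centric training struggles to capture the non-semantic artifacts inherent in hyper-realistic synthetic content. This work identifies a failure mode called “optimization collapse”: once the perturbation radius exceeds a narrow threshold, detectors trained with sharpness-aware minimization degenerate to random guessing on non-semantic forgeries. To formalize this collapse theoretically, we propose the “critical optimization radius” to quantify the geometric stability of the optimization landscape, and use the “gradient signal-to-noise ratio” to measure generalization potential. We establish a theorem proving that the critical optimization radius increases monotonically with the gradient signal-to-noise ratio, which reveals that the geometric instability of sharpness-aware minimization stems from degraded intrinsic generalization potential. This result shows that the layer-wise attenuation of the gradient signal-to-noise ratio is the root cause of optimization collapse when detecting non-semantic forgeries. Simply reducing the perturbation radius yields stable convergence under sharpness-aware minimization, but it only relieves the surface symptom without improving the intrinsic generalization degradation, so gradient fidelity must be enhanced. Building on this insight, we propose the Contrastive Regional Injection Transformer, which combines a computationally efficient contrastive gradient proxy with three training-free strategies: a region refinement mask that suppresses the variance of the contrastive gradient proxy, regional signal injection that preserves its magnitude, and hierarchical representation integration that yields more generalizable representations. Extensive experiments show that the model mitigates optimization collapse and achieves state-of-the-art generalization on cross-domain and universal forgery benchmarks.

Abstract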

While Vision-Language Models (VLMs) like CLIP have emerged as a dominant paradigm for generalizable deepfake detection, a representational disconnect remains: their semantic-centric pre-training is ill-suited for capturing non-semantic artifacts inherent to hyper-realistic synthesis. In this work, we identify a failure mode termed Optimization Collapse, where detectors trained with Sharpness-Aware Minimization (SAM) degenerate to random guessing on non-semantic forgeries once the perturbation radius exceeds a narrow threshold. To theoretically formalize this collapse, we propose the Critical Optimization Radius (COR) to quantify the geometric stability of the optimization landscape, and leverage the Gradient Signal-to-Noise Ratio (GSNR) to measure generalization potential. We establish a theorem proving that COR increases monotonically with GSNR, thereby revealing that the geometric instability of SAM optimization originates from degraded intrinsic generalization potential. This result identifies the layer-wise attenuation of GSNR as the root cause of Optimization Collapse in detecting non-semantic forgeries. Although naively reducing perturbation radius yields stable convergence under SAM, it merely treats the symptom without mitigating the intrinsic generalization degradation, necessitating enhanced gradient fidelity. Building on this insight, we propose the Contrastive Regional Injection Transformer (CoRIT), which integrates a computationally efficient Contrastive Gradient Proxy (CGP) with three training-free strategies: Region Refinement Mask to suppress CGP variance, Regional Signal Injection to preserve CGP magnitude, and Hierarchical Representation Integration to attain more generalizable representations. Extensive experiments demonstrate that CoRIT mitigates optimization collapse and achieves state-of-the-art generalization across cross-domain and universal forgery benchmarks.

Keywords: Vision-Language Models, Deepfake Detection, Optimization Collapse, Sharpness-Aware Minimization, Generalization, Contrastive Regional Injection Transformer, Non-semantic Artifacts
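Two quantities in the abstract have standard textbook forms that a short sketch can make concrete: the SAM update, which evaluates the gradient at a point perturbed by radius rho along the normalized gradient, and the gradient signal-to-noise ratio, taken here as the squared mean over the variance of per-sample gradients. The toy least-squares model, the function names, and the hyperparameter values below are assumptions for illustration. The sketch does not implement the paper's Critical Optimization Radius or CoRIT.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Gradients of 0.5 * (x . w - y)^2 for each sample, shape (N, D)."""
    residual = X @ w - y
    return residual[:, None] * X

def sam_step(w, X, y, lr=0.1, rho=0.05):
    """One SAM update: move to w + eps, then descend with the gradient found there."""
    g = per_sample_grads(w, X, y).mean(axis=0)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # perturbation of radius rho
    g_sharp = per_sample_grads(w + eps, X, y).mean(axis=0)
    return w - lr * g_sharp

def gsnr(w, X, y):
    """Per-parameter gradient signal-to-noise ratio: mean^2 / variance."""
    g = per_sample_grads(w, X, y)
    return g.mean(axis=0) ** 2 / (g.var(axis=0) + 1e-12)
```

In this form `rho` is the perturbation radius whose threshold behavior the paper analyzes: a larger `rho` probes a wider neighborhood of the loss surface before each descent step.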

204. ❌ LGEST: Dynamic Spatial-Spectral Expert Routing for Hyperspectral Image Classification

Authors: Jiawen Wen, Suixuan Qiu, Zihang Luo, Xiaofei Yang, Haotian Shi Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24045v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

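Scoring rationale: LGEST focuses on hyperspectral image classification, which falls under AI for Science (remote sensing and earth-science applications), so it is fairly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (8). One of its core contributions is a Mixture of Experts (MoE) architecture: CIEM-FPN and LGES explicitly use "residual mixture-of-experts layers" and "sparsely activated expert pairs", which is highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (10). The paper does not involve large language models (LLMs), training techniques (pre-training, fine-tuning, alignment), inference optimization, agents, or other large-model topics, so all other keywords are rated 0. Because every keyword weight in the table above is 0.0, these relevance ratings do not contribute to the total score.

!!! tip deepseek-chat TL;DR

The paper addresses three problems in hyperspectral image classification: inflexible fusion of local and global representations, inadequate handling of spectral-spatial scale disparities, and the Hughes phenomenon under high-dimensional sample heterogeneity. It proposes the LGEST framework, which combines a deep spatial-spectral autoencoder, a cross-interactive mixed-expert feature pyramid, and a local-global expert system, and reports state-of-the-art performance on several benchmark datasets.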

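Abstract translation

Deep learning methods, including convolutional neural networks, Transformers, and Mamba, have achieved notable success in hyperspectral image classification. However, existing methods integrate local and global representations rigidly, handle spectral-spatial scale disparities across heterogeneous bands inadequately, and are susceptible to the Hughes phenomenon under high-dimensional sample heterogeneity. To address these challenges, we propose the Local-Global Expert Spatial-Spectral Transformer (LGEST), a new framework that combines three key innovations. LGEST first uses a deep spatial-spectral autoencoder to produce compact yet discriminative embeddings through hierarchical nonlinear compression, preserving 3D neighborhood coherence while mitigating information loss in high-dimensional spaces. Second, a cross-interactive mixed-expert feature pyramid dynamically fuses multi-scale features through cross-attention and residual mixture-of-experts layers, using learnable gating functions to adaptively weigh spectral discriminability against spatial saliency. Finally, a local-global expert system processes the decomposed features with sparsely activated expert pairs: convolutional sub-experts capture fine-grained textures, Transformer sub-experts model long-range contextual dependencies, and a routing controller dynamically selects experts according to real-time feature saliency. Extensive experiments on four benchmark datasets show that LGEST consistently outperforms state-of-the-art methods.

Abstract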

Deep learning methods, including Convolutional Neural Networks, Transformers and Mamba, have achieved remarkable success in hyperspectral image (HSI) classification. Nevertheless, existing methods exhibit inflexible integration of local-global representations, inadequate handling of spectral-spatial scale disparities across heterogeneous bands, and susceptibility to the Hughes phenomenon under high-dimensional sample heterogeneity. To address these challenges, we propose Local-Global Expert Spatial-Spectral Transformer (LGEST), a novel framework that synergistically combines three key innovations. The LGEST first employs a Deep Spatial-Spectral Autoencoder (DSAE) to generate compact yet discriminative embeddings through hierarchical nonlinear compression, preserving 3D neighborhood coherence while mitigating information loss in high-dimensional spaces. Secondly, a Cross-Interactive Mixed Expert Feature Pyramid (CIEM-FPN) leverages cross-attention mechanisms and residual mixture-of-experts layers to dynamically fuse multi-scale features, adaptively weighting spectral discriminability and spatial saliency through learnable gating functions. Finally, a Local-Global Expert System (LGES) processes decomposed features via sparsely activated expert pairs: convolutional sub-experts capture fine-grained textures, while transformer sub-experts model long-range contextual dependencies, with a routing controller dynamically selecting experts based on real-time feature saliency. Extensive experiments on four benchmark datasets demonstrate that LGEST consistently outperforms state-of-the-art methods.

Keywords: Hyperspectral Image Classification, Mixture of Experts, Spatial-Spectral Transformer, Dynamic Expert Routing, Local-Global Representation, Cross-Interactive Feature Pyramid, Deep Spatial-Spectral Autoencoder, Sparse Activation
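The "residual mixture-of-experts layers" with "learnable gating functions" and "sparsely activated" experts mentioned in the abstract follow a general pattern that can be sketched generically: a router scores the experts, only the top-k are evaluated per token, and their outputs are mixed and added back to the input. The sketch below uses linear experts and top-k softmax gating as assumptions; the function and parameter names are hypothetical, and LGEST's actual experts (convolutional and Transformer sub-experts) are not reproduced.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_moe(x, gate_w, expert_ws, k=2):
    """Route each token to its top-k experts, mix their outputs, add a residual.

    x: (N, D) tokens; gate_w: (D, E) router weights; expert_ws: (E, D, D) linear experts.
    """
    logits = x @ gate_w                               # (N, E) router scores
    top = np.argsort(logits, axis=1)[:, -k:]          # indices of the k largest scores
    top_logits = np.take_along_axis(logits, top, axis=1)
    weights = softmax(top_logits, axis=1)             # renormalize over selected experts
    out = np.zeros_like(x)
    for slot in range(k):
        idx = top[:, slot]                            # chosen expert per token
        expert_out = np.einsum("nd,nde->ne", x, expert_ws[idx])
        out += weights[:, slot:slot + 1] * expert_out
    return x + out, top                               # residual connection
```

Only k of the E experts run for each token, so compute grows with k while capacity grows with E.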

205. ❌ HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models

Authors: Yeqi He, Liang Li, Zhiwen Yang, Xichun Sheng, Zhidong Zhao, Chenggang Yan Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24043v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

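Scoring rationale: The paper focuses on technical improvements to diffusion models for image style transfer and proposes the HAM method. All scoring keywords relate to large language models (LLMs), whereas the paper studies diffusion models, a different family of generative models. It does not involve LLMs, MoE, SLMs, alignment, reasoning, agents, compression, or any other keyword-related technique, and it does not involve AI-for-science applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes HAM, a training-free style transfer method for diffusion models that uses heterogeneous attention modulation to address the style-content balance problem, and reports state-of-the-art performance on multiple metrics.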

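Abstract translation

Diffusion models have shown excellent performance in image generation, especially in style transfer. Current mainstream style transfer methods typically exploit the strong feature extraction capability of pre-trained diffusion models, together with external modular control pathways, to impose style guidance signals explicitly. However, these methods often struggle to capture complex style references or to preserve the identity of the user-provided content image, and so fall into the style-content balance dilemma. We therefore propose a training-free style transfer method based on Heterogeneous Attention Modulation (HAM) that protects identity information during image- or text-guided style reference transfer, addressing the style-content trade-off. Specifically, we first introduce style noise initialization to initialize the latent noise for the diffusion process. Then, during diffusion, the method applies HAM to different attention modules, including Global Attention Regulation (GAR) and Local Attention Transplantation (LAT), which better preserves the details of the content image while capturing complex style references. Validated through a series of qualitative and quantitative experiments, the method achieves state-of-the-art performance on multiple quantitative metrics.

Abstract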

Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain of style transfer. Prevailing style transfer approaches typically leverage pre-trained diffusion models’ robust feature extraction capabilities alongside external modular control pathways to explicitly impose style guidance signals. However, these methods often fail to capture complex style references or retain the identity of user-provided content images, thus falling into the trap of style-content balance. We therefore propose a training-free style transfer approach via heterogeneous attention modulation (HAM) to protect identity information during image/text-guided style reference transfer, thereby addressing the style-content trade-off challenge. Specifically, we first introduce style noise initialization to initialize latent noise for diffusion. Then, during the diffusion process, the approach employs HAM for different attention mechanisms, including Global Attention Regulation (GAR) and Local Attention Transplantation (LAT), which better preserves the details of the content image while capturing complex style references. Our approach is validated through a series of qualitative and quantitative experiments, achieving state-of-the-art performance on multiple quantitative metrics.

Keywords: Diffusion Models, Style Transfer, Training-Free, Attention Modulation, Heterogeneous Attention, Global Attention Regulation, Local Attention Transplantation, Style-Content Balance

206. ❌ SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons

Authors: Haiyang Xu, Ronghuan Wu, Li-Yi Wei, Nanxuan Zhao, Chenxi Liu, Cuong Nguyen, Zhuowen Tu, Zhaowen Wang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24039v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

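Scoring rationale: SemLayer belongs to computer vision and graphics. It studies semantic-aware generative segmentation and layer reconstruction for abstract icons, and its core is visual generation and geometric reconstruction. All scoring keywords concern large language models, deep learning principles, AI for Science, and similar specific areas, none of which this paper addresses: it does not use or discuss any large model, deep learning principle, or AI-for-science application. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper addresses the loss of the original semantic layering in flattened vector icons. It proposes SemLayer, a visual-generation pipeline that restores an editable semantic layer structure through chromatic differentiation, semantic completion, and layer assembly.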

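Abstract translation

Graphic icons are a cornerstone of modern design workflows, but they are usually distributed as flattened single-path or compound-path graphics, so the original semantic layering is lost. The absence of this semantic decomposition hinders downstream tasks such as icon editing, restyling, and animation. We formalize the problem as semantic layer construction for flattened vector art and propose SemLayer, a pipeline driven by visual generation that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation according to the inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, which enables editing workflows that flattened vector graphics could not previously support and establishes semantic layer reconstruction as a practical and valuable task. Project page: https://xxuhaiyang.github.io/SemLayer/

Abstract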

Graphic icons are a cornerstone of modern design workflows, yet they are often distributed as flattened single-path or compound-path graphics, where the original semantic layering is lost. This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a visual generation empowered pipeline that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation with inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, enabling editing workflows previously inapplicable to flattened vector graphics and establishing semantic layer reconstruction as a practical and valuable task. Project page: https://xxuhaiyang.github.io/SemLayer/

Keywords: Semantic layer construction, Vector art, Generative segmentation, Semantic completion, Layered representation, Visual generation, Abstract icons, Occlusion relationships

207. ❌ SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision

Authors: Avigail Cohen Rimon, Amir Mann, Mirela Ben Chen, Or Litany Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24036v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

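Rationale: The paper "SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision" belongs to computer vision and graphics, specifically the use of 3D Gaussian Splatting (3DGS) for video tracking. It proposes a frequency-domain supervision method that addresses vanishing gradients, and it touches on optimization algorithms, rendering, and tracking frameworks. All scoring keywords, however, center on large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on), and the paper does not involve language models, applications of deep learning to science, or innovations in the principles of large models. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper addresses the vanishing gradients that arise in 3D Gaussian Splatting video tracking when spatial overlap is insufficient. It proposes SpectralSplats, a framework based on frequency-domain supervision that uses spectral moment supervision and a frequency annealing schedule to track robustly and recover complex deformations from severely misaligned initializations.

Abstract translation (Chinese)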

三维高斯泼溅（3D Gaussian Splatting，简称3DGS）能够实现实时、逼真的新视角合成，使其成为基于模型的视频跟踪中极具吸引力的表示方法。然而，在非受控环境中利用3DGS渲染器的可微分性仍然存在众所周知的脆弱性问题。一个根本瓶颈在于高斯图元的紧凑局部支撑特性。标准的光度目标函数隐式依赖于空间重叠；若严重的相机位姿偏差导致渲染物体完全脱离目标物体的局部覆盖区域，梯度将严格消失，使优化器陷入停滞。我们提出SpectralSplats，一种鲁棒的跟踪框架，通过将优化目标从空间域转换到频域，解决了这一“梯度消失”问题。通过使用一组全局复正弦特征（谱矩）对渲染图像进行监督，我们构建了一个全局吸引域，确保即使像素重叠完全不存在时，整个图像域内仍存在指向目标的有效方向性梯度。为了利用这一全局吸引域，同时避免引入高频分量相关的周期性局部极小值，我们从基本原理出发推导出一种理论完备的频率退火策略，使优化器能够平滑地从全局凸性过渡到精确的空间对齐。我们证明，SpectralSplats可作为空间损失函数的无缝即插即用替代方案，适用于多种变形参数化方法（从多层感知器到稀疏控制点），即使从严重错位的初始状态出发，也能成功恢复复杂变形，而基于标准外观的跟踪方法在此类情况下会完全失效。

Abstract

3D Gaussian Splatting (3DGS) enables real-time, photorealistic novel view synthesis, making it a highly attractive representation for model-based video tracking. However, leveraging the differentiability of the 3DGS renderer “in the wild” remains notoriously fragile. A fundamental bottleneck lies in the compact, local support of the Gaussian primitives. Standard photometric objectives implicitly rely on spatial overlap; if severe camera misalignment places the rendered object outside the target’s local footprint, gradients strictly vanish, leaving the optimizer stranded. We introduce SpectralSplats, a robust tracking framework that resolves this “vanishing gradient” problem by shifting the optimization objective from the spatial to the frequency domain. By supervising the rendered image via a set of global complex sinusoidal features (Spectral Moments), we construct a global basin of attraction, ensuring that a valid, directional gradient toward the target exists across the entire image domain, even when pixel overlap is completely nonexistent. To harness this global basin without introducing periodic local minima associated with high frequencies, we derive a principled Frequency Annealing schedule from first principles, gracefully transitioning the optimizer from global convexity to precise spatial alignment. We demonstrate that SpectralSplats acts as a seamless, drop-in replacement for spatial losses across diverse deformation parameterizations (from MLPs to sparse control points), successfully recovering complex deformations even from severely misaligned initializations where standard appearance-based tracking catastrophically fails.

Keywords: 3D Gaussian Splatting, differentiable tracking, spectral moment supervision, vanishing gradient problem, frequency domain optimization, robust deformation recovery, video tracking, rendering
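The vanishing-gradient argument in the SpectralSplats abstract can be checked on a toy problem. The sketch below is not the authors' implementation: it assumes a 1D "image" with a single Gaussian primitive, uses only the lowest non-zero frequency as a spectral moment, and estimates gradients by finite differences. The names `render`, `photometric_loss`, `spectral_loss`, and `grad`, and the constants `N` and `SIGMA`, are hypothetical. The frequency annealing schedule from the paper is not reproduced.

```python
import cmath
import math

N = 256        # pixels in the toy 1D image (assumption)
SIGMA = 0.02   # width of the Gaussian primitive, as a fraction of image width (assumption)

def render(center):
    """Render one 1D Gaussian primitive centred at `center` in [0, 1)."""
    return [math.exp(-0.5 * ((u / N - center) / SIGMA) ** 2) for u in range(N)]

def photometric_loss(center, target):
    """Pixel-wise squared error between the rendered and target images."""
    return sum((a - b) ** 2 for a, b in zip(render(center), target))

def spectral_moments(image, freqs):
    """Project the image onto global complex sinusoids exp(-2*pi*i*k*u/N)."""
    return [
        sum(v * cmath.exp(-2j * math.pi * k * u / N) for u, v in enumerate(image))
        for k in freqs
    ]

def spectral_loss(center, target, freqs=(1,)):
    """Squared distance between the low-frequency moments of both images."""
    rendered = spectral_moments(render(center), freqs)
    wanted = spectral_moments(target, freqs)
    return sum(abs(a - b) ** 2 for a, b in zip(rendered, wanted))

def grad(loss, center, target, eps=1e-5):
    """Central finite-difference derivative of `loss` with respect to the centre."""
    return (loss(center + eps, target) - loss(center - eps, target)) / (2 * eps)

target_image = render(0.6)  # where the object really is
start = 0.2                 # misaligned guess: the two bumps share no pixels

photo_grad = grad(photometric_loss, start, target_image)
spec_grad = grad(spectral_loss, start, target_image)
print(f"photometric gradient: {photo_grad:.3e}")
print(f"spectral gradient:    {spec_grad:.3e}")
```

With the two bumps 20 standard deviations apart, the photometric gradient is numerically zero, while the spectral gradient is clearly negative, meaning gradient descent would move the centre from 0.2 toward 0.6. Using only low frequencies keeps the loss free of the periodic local minima that the abstract attributes to high frequencies.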

208. ❌ Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection

Authors: Sa Zhu, Wanqian Zhang, Lin Wang, Xiaohua Chen, Chenxu Cui, Jinchao Zhang, Bo Li Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24030v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

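Rationale: The paper's core idea is to use the chain-of-thought (CoT) reasoning ability of large language models (LLMs) to decompose action labels for open-vocabulary temporal action detection (OV-TAD). It is therefore highly relevant to "Large Language Models" and "Chain of Thought" (relevance 10 each). "Alignment" receives a relevance of 5 because the paper deals with visual-textual alignment rather than value alignment. "System 2 Thinking" receives a relevance of 5 because CoT imitates deliberate human reasoning. Other keywords such as MoE, SFT, and RAG are not covered and receive 0. The paper is a computer vision application rather than scientific AI such as bioinformatics, so "AI for Science" receives 0.

!!! tip deepseek-chat TL;DR

To address the inadequacy of global alignment in existing open-vocabulary temporal action detection methods, the paper proposes a framework that uses the chain-of-thought reasoning of large language models for phase-wise decomposition and alignment, significantly improving generalization to unseen action categories.

Abstract translation (Chinese)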

开放词汇时序动作检测（Open-Vocabulary Temporal Action Detection, OV-TAD）旨在对未修剪视频中的未见类别动作片段进行分类与定位。现有方法仅依赖于标签级语义与视觉特征之间的全局对齐，这不足以将时序一致的视觉知识从已见类别迁移到未见类别。为解决此问题，我们提出了一种分阶段分解与对齐（Phase-wise Decomposition and Alignment, PDA）框架，通过细粒度的动作模式学习实现有效的先验知识迁移。具体而言，我们首先引入思维链提示语义分解（CoT-Prompting Semantic Decomposition, CSD）模块，利用大语言模型的思维链（chain-of-thought, CoT）推理能力，自动将动作标签分解为连贯的阶段级描述，以模拟人类认知过程。随后，我们提出文本注入前景过滤（Text-infused Foreground Filtering, TIF）模块，利用阶段语义线索自适应地过滤每个阶段中与动作相关的片段，生成语义对齐的视觉表征。此外，我们设计了自适应阶段对齐（Adaptive Phase-wise Alignment, APA）模块，执行阶段级的视觉-文本匹配，并自适应地聚合各阶段的对齐结果以进行最终预测。这种自适应的分阶段对齐有助于捕捉可迁移的动作模式，并显著增强对未见动作的泛化能力。在两个OV-TAD基准数据集上的大量实验验证了所提方法的优越性。

Abstract

Open-Vocabulary Temporal Action Detection (OV-TAD) aims to classify and localize action segments in untrimmed videos for unseen categories. Previous methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporally consistent visual knowledge from seen to unseen classes. To address this, we propose a Phase-wise Decomposition and Alignment (PDA) framework, which enables fine-grained action pattern learning for effective prior knowledge transfer. Specifically, we first introduce the CoT-Prompting Semantic Decomposition (CSD) module, which leverages the chain-of-thought (CoT) reasoning ability of large language models to automatically decompose action labels into coherent phase-level descriptions, emulating human cognitive processes. Then, a Text-infused Foreground Filtering (TIF) module is introduced to adaptively filter action-relevant segments for each phase by leveraging phase-wise semantic cues, producing semantically aligned visual representations. Furthermore, we propose the Adaptive Phase-wise Alignment (APA) module, which performs phase-level visual-textual matching and adaptively aggregates alignment results across phases for final prediction. This adaptive phase-wise alignment facilitates the capture of transferable action patterns and significantly enhances generalization to unseen actions. Extensive experiments on two OV-TAD benchmarks demonstrate the superiority of the proposed method.

Keywords: Open-Vocabulary Temporal Action Detection, Chain-of-Thought, Large Language Models, Phase-wise Decomposition, Visual-Textual Alignment, Action Pattern Transfer, Video Understanding
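The phase-level matching and adaptive aggregation described in the abstract can be pictured with a small sketch. The abstract does not say how phase-level matches are scored or combined, so everything here is an assumption for illustration: cosine similarity as the matching score, a softmax over per-phase scores as the adaptive weights, the toy 2-D embeddings, and the function names `cosine` and `aggregate_phases`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def aggregate_phases(visual_feats, text_feats, temperature=1.0):
    """Score each phase, then combine the scores with softmax weights.

    Hypothetical rule: phases whose visual and textual embeddings agree
    more strongly receive a larger share of the final score.
    """
    sims = [cosine(v, t) for v, t in zip(visual_feats, text_feats)]
    exps = [math.exp(s / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sims, sum(w * s for w, s in zip(weights, sims))

# Toy embeddings for three phases of one action (made-up numbers).
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

phase_sims, action_score = aggregate_phases(visual, text)
print(phase_sims, action_score)
```

Because the weights grow with the per-phase similarity, the aggregated score sits above the plain average of the phase scores; a lower `temperature` would push it further toward the best-matching phase.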

209. ❌ UW-VOS: A Large-Scale Dataset for Underwater Video Object Segmentation

Authors: Hongshen Zhao, Jingkang Tai, Yuhang Wu, Wenkang Zhang, Xi Lan, Shangyan Wang, Tianyu Zhang, Wankou Yang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24006v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

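Rationale: The paper focuses on underwater video object segmentation (VOS), a computer vision topic, rather than core research on large language models (LLMs) or the principles of deep learning. The proposed SAM-U framework uses parameter-efficient fine-tuning (PEFT) through lightweight adapters, so it is partly related to the keyword "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (relevance 5). In addition, underwater VOS serves marine exploration, which falls under "AI for Science" (relevance 5). All other keywords are unrelated to the paper (relevance 0).

!!! tip deepseek-chat TL;DR

To address the shortage of high-quality training data and the performance degradation of existing methods in underwater video object segmentation (VOS), the paper introduces UW-VOS, the first large-scale underwater VOS dataset, and SAM-U, a parameter-efficient adaptation framework, which together narrow the domain gap and achieve state-of-the-art performance.

Abstract translation (Chinese)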

水下视频目标分割（Underwater Video Object Segmentation，简称VOS）对于海洋探索至关重要，然而现有通用方法因水下颜色失真、低对比度及普遍存在的伪装现象而性能显著下降。一个主要障碍是缺乏高质量的训练数据。为弥补这一空白，我们提出了首个大规模水下VOS基准数据集$\textbf{UW-VOS}$，该数据集包含409个类别的1,431个视频序列，并提供了309,295个掩码标注，其构建过程采用了经过严格人工验证的半自动数据引擎。我们进一步提出$\textbf{SAM-U}$，一个参数高效的框架，它将SAM2适配至水下领域。通过在图像编码器中插入轻量级适配器，SAM-U仅以约$2\%$的可训练参数实现了最先进的性能。大量实验表明，现有方法在UW-VOS上的平均$\mathcal{J}\&\mathcal{F}$指标下降了13个百分点，而SAM-U则有效弥合了这一领域差距。基于属性的详细分析进一步指出，小目标、伪装以及目标出镜再入镜是当前的关键瓶颈，这为未来鲁棒水下感知研究提供了路线图。

Abstract

Underwater Video Object Segmentation (VOS) is essential for marine exploration, yet open-air methods suffer significant degradation due to color distortion, low contrast, and prevalent camouflage. A primary hurdle is the lack of high-quality training data. To bridge this gap, we introduce $\textbf{UW-VOS}$, the first large-scale underwater VOS benchmark comprising 1,431 video sequences across 409 categories with 309,295 mask annotations, constructed via a semi-automatic data engine with rigorous human verification. We further propose $\textbf{SAM-U}$, a parameter-efficient framework that adapts SAM2 to the underwater domain. By inserting lightweight adapters into the image encoder, SAM-U achieves state-of-the-art performance with only $\sim 2\%$ trainable parameters. Extensive experiments reveal that existing methods experience an average 13-point $\mathcal{J}\&\mathcal{F}$ drop on UW-VOS, while SAM-U effectively bridges this domain gap. Detailed attribute-based analysis further identifies small targets, camouflage, and exit-re-entry as critical bottlenecks, providing a roadmap for future research in robust underwater perception.

Keywords: Underwater Video Object Segmentation, Large-scale Dataset, Domain Adaptation, Parameter-efficient Fine-tuning, SAM2, Marine Exploration, Benchmark, Adapters

210. ❌ COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm

Authors: Zekun Qian, Wei Feng, Ruize Han, Junhui Hou Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24016v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

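Rationale: The paper focuses on multi-object tracking (MOT) in computer vision, specifically open-vocabulary multi-object tracking (OVMOT). Its core contributions are the continuously annotated training set C-TAO and the COVTrack++ framework, which involve a synergy between detection and association, multi-cue fusion, hierarchical aggregation, and temporal confidence propagation. All scoring keywords, however, center on large language models (LLMs) and related techniques (MoE, scaling laws, fine-tuning methods, inference optimization, agents, and so on) or on AI applications in specific scientific fields (such as bioinformatics). The paper does not involve language models, the principles of large models, or concrete AI for Science applications, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper tackles the lack of continuously annotated training data and of a tailored framework in open-vocabulary multi-object tracking. By constructing the C-TAO dataset and proposing the synergistic COVTrack++ framework, it achieves state-of-the-art performance on the TAO benchmark and shows strong zero-shot generalization on BDD100K.

Abstract translation (Chinese)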

传统多目标跟踪（MOT）通常局限于少数特定类别，这限制了其在涉及多样化物体的真实场景中的适用性。开放词汇多目标跟踪（OVMOT）通过实现对任意类别（包括训练期间未见的新颖物体）的跟踪来解决此问题。然而，当前进展受到两大挑战的制约：一是缺乏用于训练的连续标注视频数据，二是缺乏定制的OVMOT框架以协同处理检测与关联。我们通过构建C-TAO来解决数据瓶颈，这是首个为OVMOT设计的连续标注训练集，其标注密度相比原始TAO数据集提高了26倍，并捕捉了平滑的运动动态与中间物体状态。针对框架瓶颈，我们提出了COVTrack++，这是一个协同框架，通过三个模块实现了检测与关联之间的双向互惠机制：（1）多线索自适应融合（MCF）动态平衡外观、运动和语义线索，以进行关联特征学习；（2）多粒度层次聚合（MGA）利用密集检测中的层次空间关系，其中可见的子节点（如物体部件）协助被遮挡的父对象（如整体）以增强关联特征；（3）时序置信度传播（TCP）通过高置信度的已跟踪物体跨帧提升低置信度候选检测，从而恢复闪烁的检测结果并稳定轨迹。在TAO数据集上的大量实验证明了最先进的性能，新颖的TETA指标在验证集和测试集上分别达到35.4%和30.5%，相比先前方法将新颖AssocA提升了4.8%，新颖LocA提升了5.8%，并在BDD100K上展现出强大的零样本泛化能力。代码与数据集将公开提供。

Abstract

Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling tracking of arbitrary categories, including novel objects unseen during training. However, current progress is constrained by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework to synergistically handle detection and association. We address the data bottleneck by constructing C-TAO, the first continuously annotated training set for OVMOT, which increases annotation density by 26x over the original TAO and captures smooth motion dynamics and intermediate object states. For the framework bottleneck, we propose COVTrack++, a synergistic framework that achieves a bidirectional reciprocal mechanism between detection and association through three modules: (1) Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues for association feature learning; (2) Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships in dense detections, where visible child nodes (e.g., object parts) assist occluded parent objects (e.g., whole body) for association feature enhancement; (3) Temporal Confidence Propagation (TCP) recovers flickering detections through high-confidence tracked objects boosting low-confidence candidates across frames, stabilizing trajectories. Extensive experiments on TAO demonstrate state-of-the-art performance, with novel TETA reaching 35.4% and 30.5% on validation and test sets, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods, and show strong zero-shot generalization on BDD100K. The code and dataset will be publicly available.

Keywords: Open-Vocabulary Multi-Object Tracking, Continuously Annotated Dataset, Synergistic Framework, Multi-Cue Fusion, Hierarchical Aggregation, Temporal Confidence Propagation, Zero-shot Generalization, State-of-the-art Performance
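The idea behind Temporal Confidence Propagation (TCP), where high-confidence tracked objects boost low-confidence candidates across frames, can be sketched as follows. The abstract does not give the update rule, so this is one hypothetical rule and not the paper's module: a candidate that overlaps a confident track from the previous frame inherits the track confidence scaled by their IoU. The function names and both thresholds are assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def propagate_confidence(tracks, candidates, iou_thresh=0.5, track_conf_thresh=0.7):
    """Raise candidate scores that are supported by confident previous tracks.

    tracks:     list of (box, confidence) from the previous frame
    candidates: list of (box, confidence) detected in the current frame
    """
    boosted = []
    for box, score in candidates:
        support = 0.0
        for track_box, track_score in tracks:
            if track_score < track_conf_thresh:
                continue  # only confident tracks may vouch for a candidate
            overlap = box_iou(box, track_box)
            if overlap >= iou_thresh:
                support = max(support, overlap * track_score)
        boosted.append((box, max(score, support)))
    return boosted

previous_tracks = [((10, 10, 50, 50), 0.9)]
current_candidates = [
    ((12, 11, 52, 49), 0.20),      # flickering detection of the tracked object
    ((200, 200, 240, 240), 0.25),  # unrelated low-confidence detection
]
result = propagate_confidence(previous_tracks, current_candidates)
print(result)
```

In this toy case the first candidate is lifted well above its raw score because it overlaps a confident track, while the second keeps its original score, which matches the "recover flickering detections" behaviour the abstract describes.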

211. ❌ DB SwinT: A Dual-Branch Swin Transformer Network for Road Extraction in Optical Remote Sensing Imagery

Authors: Zongyang He, Xiangli Yang, Xian Gao, Zhiguo Wang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24005v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

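Rationale: The paper focuses on road extraction from remote sensing imagery, a computer vision task, and proposes a dual-branch network architecture based on the Swin Transformer and U-Net. All keywords relate directly to large language models (LLMs) and associated techniques (training, alignment, inference optimization, agents, and so on) or to specific scientific AI applications (such as bioinformatics). The paper does not involve LLMs, the principles of deep learning foundation models, or bioinformatics and cheminformatics; it has only a weak connection to "AI for Science" (remote sensing is an earth science application in the broad sense), so that keyword receives a relevance of 5 and all others receive 0.

!!! tip deepseek-chat TL;DR

To address inaccurate road extraction caused by occlusion in optical remote sensing imagery, the paper proposes a Dual-Branch Swin Transformer network (DB SwinT) that combines local detail recovery with global semantic context, reaching IoU scores of 79.35% and 74.84% on the Massachusetts and DeepGlobe datasets, respectively.

Abstract translation (Chinese)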

随着光学遥感影像空间分辨率的持续提升，精确的道路提取对于城市规划、交通监测和灾害管理等应用日益重要。然而，在复杂的城乡环境中，道路提取仍面临挑战，因为道路常被树木、建筑物等物体遮挡，导致结构断裂和提取精度下降。为解决这一问题，本文提出了一种用于道路提取的双分支Swin Transformer网络（DB SwinT）。该框架结合了Swin Transformer的长程依赖建模能力与U-Net的多尺度特征融合策略，并采用双分支编码器学习互补的局部与全局表征。具体而言，局部分支专注于恢复遮挡区域的精细结构细节，而全局分支则捕获更广泛的语义上下文以保持道路网络的整体连续性。此外，本文引入了注意力特征融合（Attentional Feature Fusion, AFF）模块，以自适应地融合两个分支的特征，进一步增强遮挡路段的表征能力。在Massachusetts和DeepGlobe数据集上的实验结果表明，DB SwinT的交并比（Intersection over Union, IoU）分别达到79.35%和74.84%，验证了其在光学遥感影像道路提取中的有效性。

Abstract

With the continuous improvement in the spatial resolution of optical remote sensing imagery, accurate road extraction has become increasingly important for applications such as urban planning, traffic monitoring, and disaster management. However, road extraction in complex urban and rural environments remains challenging, as roads are often occluded by trees, buildings, and other objects, leading to fragmented structures and reduced extraction accuracy. To address this problem, this paper proposes a Dual-Branch Swin Transformer network (DB SwinT) for road extraction. The proposed framework combines the long-range dependency modeling capability of the Swin Transformer with the multi-scale feature fusion strategy of U-Net, and employs a dual-branch encoder to learn complementary local and global representations. Specifically, the local branch focuses on recovering fine structural details in occluded areas, while the global branch captures broader semantic context to preserve the overall continuity of road networks. In addition, an Attentional Feature Fusion (AFF) module is introduced to adaptively fuse features from the two branches, further enhancing the representation of occluded road segments. Experimental results on the Massachusetts and DeepGlobe datasets show that DB SwinT achieves Intersection over Union (IoU) scores of 79.35% and 74.84%, respectively, demonstrating its effectiveness for road extraction from optical remote sensing imagery.

Keywords: Road Extraction, Optical Remote Sensing Imagery, Swin Transformer, Dual-Branch Network, Feature Fusion, Occlusion Handling, Semantic Segmentation, Deep Learning
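The results above are reported as Intersection over Union (IoU). As a reminder of what that number measures for a binary road mask, here is a minimal sketch; the function name `mask_iou` and the tiny 3x4 masks are made up for illustration and have nothing to do with the paper's data or evaluation code.

```python
def mask_iou(pred, truth):
    """IoU between two binary masks given as lists of rows of 0/1 values."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0   # pixel is road in both masks
            union += 1 if (p or t) else 0    # pixel is road in at least one mask
    # Two empty masks agree perfectly, so define their IoU as 1.0.
    return inter / union if union else 1.0

predicted = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
ground_truth = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
score = mask_iou(predicted, ground_truth)
print(f"IoU = {score:.4f}")
```

Here four pixels are marked as road in both masks and six in at least one, so the IoU is 4/6. A prediction that breaks a road where it is occluded loses intersection pixels, which is why occlusion handling shows up directly in this metric.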

212. ❌ HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images

Authors: Yumeng Liu, Xiao-Xiao Long, Marc Habermann, Xuanze Yang, Cheng Lin, Yuan Liu, Yuexin Ma, Wenping Wang, Ligang Liu Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23997v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

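Rationale: The paper focuses on 3D hand mesh reconstruction in computer vision and uses an approach based on visual-geometry foundation models. The scoring keywords, however, all relate directly to large language models, the principles of deep learning, AI applications in science, and similar topics, while the paper does not involve language models, model training, inference optimization, alignment, agent systems, or related subjects. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes HGGT, a new method that is the first to jointly infer 3D hand meshes and camera poses from uncalibrated images, outperforming existing methods in accuracy and deployment flexibility.

Abstract translation (Chinese)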

从图像中恢复高保真度的三维手部几何是计算机视觉领域的一项关键任务，对机器人学、动画制作以及虚拟现实/增强现实（VR/AR）等领域具有重要价值。至关重要的是，可扩展的应用既要求精度，也要求部署的灵活性，即需要能够利用来自互联网的大量非结构化图像数据，或实现在无需复杂校准的消费级RGB相机上的部署。然而，现有方法面临一个两难困境：单视图方法易于部署，但受限于深度模糊性和遮挡问题；反之，多视图系统虽能解决这些不确定性，但通常需要固定且经过校准的配置，这限制了其在实际场景中的应用。为弥合这一差距，我们从直接从视觉数据中学习显式几何的三维基础模型中汲取灵感。通过将任意视角下的手部重构重新定义为一个视觉-几何基础任务，我们提出了一种前馈式架构，该架构在文献中首次实现了从未校准的视角中联合推断三维手部网格和相机姿态。大量评估表明，我们的方法超越了当前最先进的基准，并在未校准的、真实世界场景中展现出强大的泛化能力。项目页面链接如下：https://lym29.github.io/HGGT/。

Abstract

Recovering high-fidelity 3D hand geometry from images is a critical task in computer vision, holding significant value for domains such as robotics, animation and VR/AR. Crucially, scalable applications demand both accuracy and deployment flexibility, requiring the ability to leverage massive amounts of unstructured image data from the internet or enable deployment on consumer-grade RGB cameras without complex calibration. However, current methods face a dilemma. While single-view approaches are easy to deploy, they suffer from depth ambiguity and occlusion. Conversely, multi-view systems resolve these uncertainties but typically demand fixed, calibrated setups, limiting their real-world utility. To bridge this gap, we draw inspiration from 3D foundation models that learn explicit geometry directly from visual data. By reformulating hand reconstruction from arbitrary views as a visual-geometry grounded task, we propose a feed-forward architecture that, for the first time in the literature, jointly infers 3D hand meshes and camera poses from uncalibrated views. Extensive evaluations show that our approach outperforms state-of-the-art benchmarks and demonstrates strong generalization to uncalibrated, in-the-wild scenarios. Project page: https://lym29.github.io/HGGT/.

Keywords: 3D hand mesh reconstruction, uncalibrated images, camera pose estimation, visual-geometry grounded task, feed-forward architecture, robust generalization, computer vision, VR/AR applications

213. ❌ CAKE: Real-time Action Detection via Motion Distillation and Background-aware Contrastive Learning

Authors: Hieu Hoang, Dung Trung Tran, Hong Nguyen, Nam-Phong Nguyen Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23988v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

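Rationale: The paper focuses on online action detection (OAD) in computer vision and proposes CAKE, a real-time action detection framework based on optical-flow distillation and background-aware contrastive learning. The work covers action recognition, motion modeling, model distillation, and computational efficiency, and belongs to computer vision and video understanding. All scoring keywords relate directly to large language models (LLMs), innovations in the principles of deep learning, or AI applications in science, whereas this paper studies a pure computer vision problem and involves no large-model techniques, language models, AI for Science applications, or innovations in deep learning principles. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper addresses the high computational cost and background-motion interference in online action detection. It proposes the CAKE framework, which achieves efficient real-time action detection through motion distillation and background-aware contrastive learning, reaches SOTA performance on multiple datasets, and runs at over 72 FPS on a single CPU.

Abstract translation (Chinese)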

在线动作检测系统面临两大核心挑战：高计算成本以及对背景运动具有区分性的时序动态建模不足。引入光流虽能提供强运动线索，但会带来显著计算开销。我们提出CAKE——一种基于光流的在线动作检测蒸馏框架，将运动知识迁移至RGB模型中。我们设计了动态运动适配器，用于抑制静态背景噪声并增强像素变化，无需显式计算即可有效近似光流。该框架还集成了浮动对比学习策略，以区分信息性运动动态与时序背景。在TVSeries、THUMOS'14和Kinetics-400数据集上的多项实验验证了模型的有效性。在使用相同骨干网络的情况下，CAKE相比当前最优方法实现了突出的平均精度均值。我们的模型在单CPU上运行速度超过72帧/秒，非常适合资源受限的系统。

Abstract

Online Action Detection (OAD) systems face two primary challenges: high computational cost and insufficient modeling of discriminative temporal dynamics against background motion. Adding optical flow can provide strong motion cues, but it incurs significant computational overhead. We propose CAKE, a flow-based distillation framework for OAD that transfers motion knowledge into RGB models. We propose a Dynamic Motion Adapter (DMA) to suppress static background noise and emphasize pixel changes, effectively approximating optical flow without explicit computation. The framework also integrates a Floating Contrastive Learning strategy to distinguish informative motion dynamics from temporal background. Various experiments conducted on the TVSeries, THUMOS'14, and Kinetics-400 datasets show the effectiveness of our model. CAKE achieves a standout mAP compared with SOTA while using the same backbone. Our model operates at over 72 FPS on a single CPU, making it highly suitable for resource-constrained systems.

Keywords: Online Action Detection, Motion Distillation, Background-aware Contrastive Learning, Real-time Action Detection, Optical Flow, Computational Efficiency, Dynamic Motion Adapter, Floating Contrastive Learning
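The abstract says the Dynamic Motion Adapter suppresses static background noise and emphasizes pixel changes, approximating optical flow without computing it. The learned adapter itself is not described, so the sketch below only illustrates the underlying intuition with plain frame differencing. The function name `motion_map`, the `noise_floor` threshold, and the toy grayscale frames are assumptions, not the paper's module.

```python
def motion_map(prev_frame, curr_frame, noise_floor=0.05):
    """Absolute per-pixel change between two grayscale frames.

    Changes at or below `noise_floor` are treated as static background
    (sensor noise, slight flicker) and set to zero.
    """
    result = []
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        row = []
        for p, c in zip(prev_row, curr_row):
            change = abs(c - p)
            row.append(change if change > noise_floor else 0.0)
        result.append(row)
    return result

# Two tiny frames: a bright object moves one pixel to the right on the
# second row, and one background pixel flickers slightly on the first row.
frame_t0 = [
    [0.2, 0.20, 0.2],
    [0.2, 0.90, 0.2],
]
frame_t1 = [
    [0.2, 0.21, 0.2],
    [0.2, 0.20, 0.9],
]
motion = motion_map(frame_t0, frame_t1)
print(motion)
```

The flicker on the first row is removed, while the two pixels the object left and entered stand out. Such a map costs one subtraction per pixel, which hints at why an RGB-only approximation of motion is attractive for CPU-bound systems compared with computing optical flow.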

214. ❌ SilLang: Improving Gait Recognition with Silhouette Language Encoding

Authors: Ruiyi Zhan, Guozhen Peng, Canyu Chen, Jian Lei, Annan Li Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23976v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

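Rationale: The paper's core contribution is the SilLang framework, which aligns binary gait silhouettes with natural language in a discrete encoding space and uses LLMs to extract discriminative features that enhance visual gait recognition. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (relevance 10), since LLMs are a core component of the method. It is partly related to "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 5), since gait recognition is a biometrics and computer vision application that can be viewed as AI applied to a field adjacent to science and bioinformatics. Other keywords (such as MoE, SFT, RAG, and quantization) are not covered in the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes the SilLang framework, which aligns binary gait silhouettes with natural language in a discrete encoding space and uses large language models (LLMs) to extract features, significantly improving gait recognition performance on multiple datasets.

Abstract translation (Chinese)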

步态轮廓可编码为二进制步态码，广泛应用于表征行人运动模式。现有方法通常利用视觉主干网络对步态轮廓进行编码，取得了显著成效。然而，这些方法主要关注连续视觉特征，忽视了二进制轮廓的离散本质——其本质上与自然语言共享离散编码空间。大语言模型（LLMs）已证明在从离散序列中提取区分性特征和建模长程依赖方面具有卓越能力，凸显了其通过识别细微变化捕捉时序运动模式的潜力。基于此，我们探索在二进制编码空间内建立二进制步态轮廓与自然语言的桥梁。然而，文本标记（text tokens）与二进制步态轮廓的编码空间仍存在错位，主要源于标记频率与分布密度的差异。为解决此问题，我们提出轮廓-速度标记器（Contour-Velocity Tokenizer），在编码二进制步态轮廓的同时重塑其分布，以更好地对齐文本标记空间。进而，我们构建了一个双分支框架——步态轮廓语言模型（Silhouette Language Model, SilLang），通过融合源自大语言模型的离散语言嵌入来增强视觉轮廓特征。在主流步态骨干网络上实施后，SilLang 在 SUSTech1K、GREW 和 Gait3D 数据集上均持续提升了现有最优方法的性能。

Abstract

Gait silhouettes, which can be encoded into binary gait codes, are widely adopted to representing motion patterns of pedestrian. Recent approaches commonly leverage visual backbones to encode gait silhouettes, achieving successful performance. However, they primarily focus on continuous visual features, overlooking the discrete nature of binary silhouettes that inherently share a discrete encoding space with natural language. Large Language Models (LLMs) have demonstrated exceptional capability in extracting discriminative features from discrete sequences and modeling long-range dependencies, highlighting their potential to capture temporal motion patterns by identifying subtle variations. Motivated by these observations, we explore bridging binary gait silhouettes and natural language within a binary encoding space. However, the encoding spaces of text tokens and binary gait silhouettes remain misaligned, primarily due to differences in token frequency and density. To address this issue, we propose the Contour-Velocity Tokenizer, which encodes binary gait silhouettes while reshaping their distribution to better align with the text token space. We then establish a dual-branch framework termed Silhouette Language Model, which enhances visual silhouettes by integrating discrete linguistic embeddings derived from LLMs. Implemented on mainstream gait backbones, SilLang consistently improves state-of-the-art methods across SUSTech1K, GREW, and Gait3D.

Keywords: Gait Recognition, Silhouette Language Encoding, Large Language Models, Binary Gait Silhouettes, Contour-Velocity Tokenizer, Temporal Motion Patterns, Discrete Encoding Space, Visual-Linguistic Integration

215. ❌ HyDRA: Hybrid Domain-Aware Robust Architecture for Heterogeneous Collaborative Perception

Authors: Minwoo Song, Minhee Kang, Heejin Ahn Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23975v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

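Scoring rationale: The paper studies heterogeneity in collaborative perception and proposes the HyDRA architecture, which integrates intermediate and late fusion and introduces a lightweight domain classifier and anchor-guided pose graph optimization. All keywords concern large models, deep learning techniques, or AI-for-science applications, whereas the paper focuses on collaborative perception in computer vision/robotics and does not involve large models, language models, training techniques, reasoning methods, alignment, compression, hallucination mitigation, interpretability, world models, or AI-for-science applications. It has some relevance only to "Multi-agent Systems OR Agent Coordination" (5 points), because it involves collaborative perception among multiple agents (vehicles/robots), though these are not LLM-based agent systems. All other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses heterogeneity in collaborative perception caused by differences in model architecture or data distribution. It proposes the HyDRA architecture, which uses dynamic domain classification and anchor-guided optimization to reach performance comparable to state-of-the-art methods without additional training, and scales to more agents at zero cost.

Abstract (Chinese translation)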

在协同感知中，智能体的性能可能因模型架构或训练数据分布的差异而产生的异构性而下降。为解决这一挑战，我们提出HyDRA（混合域感知鲁棒架构），这是一个在域感知框架内融合中间层融合与后期融合的统一流程。我们引入了一个轻量级域分类器，可动态识别异构智能体并将其分配至后期融合分支。此外，我们提出锚点引导的位姿图优化方法，利用中间层融合产生的可靠检测结果作为固定空间锚点，以减轻后期融合固有的定位误差。大量实验表明，尽管无需额外训练，HyDRA仍能达到与当前最先进的异构感知协同感知方法相当的性能。重要的是，该性能在协同智能体数量增加时仍能保持，实现了无需重新训练的零成本扩展。

Abstract

In collaborative perception, an agent’s performance can be degraded by heterogeneity arising from differences in model architecture or training data distributions. To address this challenge, we propose HyDRA (Hybrid Domain-Aware Robust Architecture), a unified pipeline that integrates intermediate and late fusion within a domain-aware framework. We introduce a lightweight domain classifier that dynamically identifies heterogeneous agents and assigns them to the late-fusion branch. Furthermore, we propose anchor-guided pose graph optimization to mitigate localization errors inherent in late fusion, leveraging reliable detections from intermediate fusion as fixed spatial anchors. Extensive experiments demonstrate that, despite requiring no additional training, HyDRA achieves performance comparable to state-of-the-art heterogeneity-aware CP methods. Importantly, this performance is maintained as the number of collaborating agents increases, enabling zero-cost scaling without retraining.

Keywords: collaborative perception, heterogeneity, domain-aware, intermediate fusion, late fusion, pose graph optimization, zero-cost scaling, robust architecture

216. ❌ Machine vision with small numbers of detected photons per inference

Authors: Shi-Yuan Ma, Jérémie Laydevant, Mandar M. Sohoni, Logan G. Wright, Tianyu Wang, Peter L. McMahon Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23974v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

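Scoring rationale: The paper focuses on an optimization method (PANS) for machine vision under extremely low photon counts, an application of AI to scientific instruments, so it has some relevance to "AI for Science" (5 points). It does not touch large language models, deep learning technique fundamentals, model training/optimization methods, inference techniques, agent systems, or the other keyword areas, so the remaining 26 keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes photon-aware neuromorphic sensing (PANS), an end-to-end optimized approach to machine vision under extremely low photon counts. Experiments reach 73% classification accuracy on FashionMNIST and 86% on MNIST with an average of only 4.9 and 8.6 detected photons per inference, orders of magnitude more photon-efficient than conventional approaches.

Abstract (Chinese translation)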

机器视觉（包括目标识别与图像重建）是众多消费电子设备和科学仪器的核心技术。端到端优化方法的采用彻底改变了机器视觉系统的设计范式，该方法将光学前端与后处理后端进行联合优化。然而，尽管当前机器视觉在中等或明亮光照条件下表现卓越——此时相机每像素可探测数千光子、每帧可探测数十亿光子——但在极弱光环境下仍面临巨大挑战。本文提出光子感知神经形态传感（PANS, photon-aware neuromorphic sensing），一种专为高度光子匮乏场景设计的端到端优化方法。该方法的训练过程融合了低光子预算的约束条件以及当每像素平均光子数接近或小于1时的光探测随机特性。我们报告了一项原理验证实验：使用PANS进行弱光图像分类，在FashionMNIST数据集上以单次推理平均仅探测4.9（17）个光子的条件下达到73%（82%）的准确率，在MNIST数据集上以8.6（29）个光子达到86%（97%）的准确率——其光子效率比传统方法高出数个数量级。我们还通过仿真研究展示了PANS如何应用于其他分类、事件检测及图像重建任务。通过考虑非经典态或替代传感硬件的测量结果统计特性，PANS原则上可适配于量子系统及其他光子匮乏场景，从而实现高精度探测。

Abstract

Machine vision, including object recognition and image reconstruction, is a central technology in many consumer devices and scientific instruments. The design of machine-vision systems has been revolutionized by the adoption of end-to-end optimization, in which the optical front end and the post-processing back end are jointly optimized. However, while machine vision currently works extremely well in moderate-light or bright-light situations – where a camera may detect thousands of photons per pixel and billions of photons per frame – it is far more challenging in very low-light situations. We introduce photon-aware neuromorphic sensing (PANS), an approach for end-to-end optimization in highly photon-starved scenarios. The training incorporates knowledge of the low photon budget and the stochastic nature of light detection when the average number of photons per pixel is near or less than 1. We report a proof-of-principle experimental demonstration in which we performed low-light image classification using PANS, achieving 73% (82%) accuracy on FashionMNIST with an average of only 4.9 (17) detected photons in total per inference, and 86% (97%) on MNIST with 8.6 (29) detected photons – orders of magnitude more photon-efficient than conventional approaches. We also report simulation studies showing how PANS could be applied to other classification, event-detection, and image-reconstruction tasks. By taking into account the statistics of measurement results for non-classical states or alternative sensing hardware, PANS could in principle be adapted to enable high-accuracy results in quantum and other photon-starved setups.

Keywords: machine vision, photon-aware neuromorphic sensing, low-light imaging, end-to-end optimization, photon-starved scenarios, image classification, FashionMNIST, MNIST
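
To make the photon budget concrete, here is a minimal sketch of the detection statistics the abstract describes: a normalized image is scaled so that its expected total photon count equals a budget (4.9 here, the FashionMNIST figure quoted above), and each pixel's count is drawn from a Poisson distribution. It only illustrates shot noise at well under one photon per pixel; it is not the PANS training method, and `expected_rates`, `poisson_sample`, and `detect` are hypothetical names.

```python
import math
import random


def expected_rates(image, photon_budget):
    """Scale non-negative pixel intensities so they sum to photon_budget."""
    total = sum(image)
    return [photon_budget * pixel / total for pixel in image]


def poisson_sample(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method (fine for small rates)."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def detect(image, photon_budget, rng):
    """Simulate one noisy exposure: a photon count for every pixel."""
    return [poisson_sample(rate, rng) for rate in expected_rates(image, photon_budget)]


rng = random.Random(0)
image = [0.0, 0.2, 0.9, 0.4, 0.1, 0.0, 0.7, 0.3]  # toy 8-pixel "image" (made-up values)
counts = detect(image, photon_budget=4.9, rng=rng)
print(counts, "total detected:", sum(counts))
```

Most pixels in such an exposure register zero photons, which is why a classifier trained only on clean images would struggle and why the abstract stresses training with the detection statistics in the loop.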

217. ❌ SLAT-Phys: Fast Material Property Field Prediction from Structured 3D Latents

Authors: Rocktim Jyoti Das, Dinesh Manocha Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23973v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

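Scoring rationale: SLAT-Phys focuses on quickly predicting the material property field of a 3D asset from a single RGB image, an AI application at the intersection of computer vision and physical simulation. Its core innovation is to reuse latent features from a pretrained 3D asset generation model and train a lightweight decoder for material parameter estimation. Most keywords in the scoring list (LLMs, MoE, alignment, reasoning, agents, and so on) concern core LLM techniques, training methods, or application paradigms, whereas this paper neither uses nor studies any form of large language model and does not involve those model architectures, training techniques, or application scenarios. All keywords therefore score 0 (unrelated) except two: 1. "Pre-training OR Continual Pre-training OR Domain Adaptation": the paper uses latent features from a "pretrained 3D asset generation model", which is feature transfer from a pretrained model and has some relevance, but it is not the paper's core innovation (that is the decoder design and its application), hence 5 points. 2. "AI for Science OR Bioinformatics OR Cheminformatics": the paper addresses material property prediction for physical simulation, robotics, and digital twins, an application of AI to scientific computing and engineering that is highly relevant to "AI for Science" but is not bioinformatics or cheminformatics, hence 8 points. The weighted total is (5.0 × 1.0) + (8.0 × 1.0) = 13.0.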
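
The weighted total quoted in the rationale can be reproduced with a short sketch. It assumes each keyword's score is weight × relevance and uses the pass-mark formula from the report configuration (5.0 + 0.8 × total weight). The function names `weighted_total` and `pass_mark` are hypothetical, and the per-keyword cap of 15 from the configuration is not applied because the report does not state how it is enforced.

```python
# Hypothetical helpers; the report generator's real code is not shown in this report.

def weighted_total(scores):
    """Sum weight * relevance over (weight, relevance) pairs."""
    return sum(weight * relevance for weight, relevance in scores)


def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark from the report configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight


# The two non-zero keywords from the rationale, each with an assumed weight of 1.0.
slat_phys = [(1.0, 5.0), (1.0, 8.0)]
total = weighted_total(slat_phys)
threshold = pass_mark(27.0)  # 27 keywords, total weight 27.0

print(f"total={total:.1f}, pass mark={threshold:.1f}, passed={total >= threshold}")
```

Under these assumptions the total of 13.0 falls below the 26.6 pass mark.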

!!! tip deepseek-chat TL;DR

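The paper proposes SLAT-Phys, an end-to-end method that predicts continuous material property fields of 3D assets (Young's modulus, density, and Poisson's ratio) directly from a single RGB image without explicit 3D reconstruction, achieving a 120x speedup while maintaining competitive accuracy.

Abstract (Chinese translation)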

估算三维资产的材质属性场对于基于物理的仿真、机器人学和数字孪生生成至关重要。现有的基于视觉的方法要么成本高昂且速度缓慢，要么依赖于三维信息。本文提出SLAT-Phys，这是一种端到端方法，可直接从单张RGB图像预测三维资产的空间变化材质属性场，无需显式的三维重建。我们的方法利用预训练三维资产生成模型中的空间组织化潜在特征，该模型编码了丰富的几何与语义先验，并通过训练一个轻量级神经解码器来估算杨氏模量、密度和泊松比。潜在表征中关于物体几何与外观的粗略体素布局和语义线索，能够实现精确的材质估算。实验表明，与现有方法相比，我们的方法在预测连续材质参数方面具有相当的准确性，同时显著减少了计算时间。具体而言，SLAT-Phys在NVIDIA RTXA5000 GPU上每个物体仅需9.9秒，且避免了重建和体素化预处理。这相比现有方法实现了120倍的加速，从而能够从单张图像更快地估算材质属性。

Abstract

Estimating the material property field of 3D assets is critical for physics-based simulation, robotics, and digital twin generation. Existing vision-based approaches are either too expensive and slow or rely on 3D information. We present SLAT-Phys, an end-to-end method that predicts spatially varying material property fields of 3D assets directly from a single RGB image without explicit 3D reconstruction. Our approach leverages spatially organised latent features from a pretrained 3D asset generation model that encodes rich geometry and semantic prior, and trains a lightweight neural decoder to estimate Young’s modulus, density, and Poisson’s ratio. The coarse volumetric layout and semantic cues of the latent representation about object geometry and appearance enable accurate material estimation. Our experiments demonstrate that our method provides competitive accuracy in predicting continuous material parameters when compared against prior approaches, while significantly reducing computation time. In particular, SLAT-Phys requires only 9.9 seconds per object on an NVIDIA RTXA5000 GPU and avoids reconstruction and voxelization preprocessing. This results in 120x speedup compared to prior methods and enables faster material property estimation from a single image.

Keywords: material property field prediction, 3D assets, single RGB image, pretrained 3D asset generation model, lightweight neural decoder, Young’s modulus, density, Poisson’s ratio

218. ❌ GRMLR: Knowledge-Enhanced Small-Data Learning for Deep-Sea Cold Seep Stage Inference

Authors: Chenxu Zhou, Zelin Liu, Rui Cai, Houlin Gong, Yikang Yu, Jia Zeng, Yanru Pei, Liang Zhang, Weishu Zhao, Xiaofeng Gao Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23961v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

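Scoring rationale: The paper proposes a knowledge-enhanced classification framework (GRMLR) for deep-sea cold seep stage inference, an application of AI to science (specifically marine ecology), so it has some relevance only to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). It does not involve large models, deep learning technique fundamentals, LLM-related methods (fine-tuning, alignment, inference optimization, and so on), or any other keyword, so the remaining keywords score 0.

!!! tip deepseek-chat TL;DR

The paper targets the overfitting caused by small sample sizes and high feature dimensionality in microbial data for deep-sea cold seep stage assessment. It proposes a graph-regularized multinomial logistic regression (GRMLR) framework that incorporates an ecological knowledge graph as a structural prior; experiments show that it significantly outperforms baseline models, offering a robust and scalable approach to deep-sea ecological assessment.

Abstract (Chinese translation)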

深海冷泉活动阶段的传统评估依赖于成本高昂、高风险载人潜水器作业及宏体动物的目视调查。尽管微生物群落为评估提供了更具成本效益的替代方案，但由于现有深海数据集规模极小（$n = 13$）而微生物特征维度较高（$p = 26$），使得纯数据驱动模型极易过拟合，可靠的推断仍面临挑战。为此，我们提出一种知识增强分类框架，将生态知识图谱作为结构先验融入模型。通过融合宏体-微生物耦合关系与微生物共现模式，该框架将既有的生态逻辑内化至一个图正则化多项逻辑回归（Graph-Regularized Multinomial Logistic Regression, GRMLR）模型中，通过流形惩罚有效约束特征空间，确保分类结果符合生物学一致性。值得注意的是，该框架在推断阶段无需宏体动物观测数据：宏体-微生物关联仅用于指导训练，而预测完全依赖微生物丰度谱。实验结果表明，该方法显著优于标准基线模型，凸显了其作为稳健且可扩展的深海生态评估框架的潜力。

Abstract

Deep-sea cold seep stage assessment has traditionally relied on costly, high-risk manned submersible operations and visual surveys of macrofauna. Although microbial communities provide a promising and more cost-effective alternative, reliable inference remains challenging because the available deep-sea dataset is extremely small ($n = 13$) relative to the microbial feature dimension ($p = 26$), making purely data-driven models highly prone to overfitting. To address this, we propose a knowledge-enhanced classification framework that incorporates an ecological knowledge graph as a structural prior. By fusing macro-microbe coupling and microbial co-occurrence patterns, the framework internalizes established ecological logic into a Graph-Regularized Multinomial Logistic Regression (GRMLR) model, effectively constraining the feature space through a manifold penalty to ensure biologically consistent classification. Importantly, the framework removes the need for macrofauna observations at inference time: macro-microbe associations are used only to guide training, whereas prediction relies solely on microbial abundance profiles. Experimental results demonstrate that our approach significantly outperforms standard baselines, highlighting its potential as a robust and scalable framework for deep-sea ecological assessment.

Keywords: deep-sea cold seep, microbial communities, knowledge graph, graph-regularized multinomial logistic regression, small-data learning, ecological assessment, overfitting mitigation, macro-microbe coupling
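
The abstract does not give the training objective, so the following is only a sketch of one common form of graph-regularized multinomial logistic regression: mean cross-entropy plus a Laplacian-style penalty that pulls the coefficient vectors of linked features toward each other. The penalty form, the edge list, the toy data, and the names `graph_penalty` and `grmlr_loss` are assumptions for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def graph_penalty(weights, edges):
    """Sum of a_ij * ||w_i - w_j||^2 over knowledge-graph edges (i, j, a_ij).

    weights[i] holds feature i's coefficients, one per class.
    """
    return sum(
        a * sum((wi - wj) ** 2 for wi, wj in zip(weights[i], weights[j]))
        for i, j, a in edges
    )


def grmlr_loss(weights, samples, labels, edges, lam):
    """Mean cross-entropy plus lam times the graph penalty."""
    n_classes = len(weights[0])
    nll = 0.0
    for x, label in zip(samples, labels):
        logits = [sum(x[j] * weights[j][k] for j in range(len(x))) for k in range(n_classes)]
        nll -= math.log(softmax(logits)[label])
    return nll / len(samples) + lam * graph_penalty(weights, edges)


# Toy problem: 3 microbial features, 2 stages, features 0 and 1 linked in the graph.
weights = [[0.5, -0.5], [0.4, -0.4], [-1.0, 1.0]]
samples = [[1.0, 0.8, 0.1], [0.1, 0.2, 1.2]]
labels = [0, 1]
edges = [(0, 1, 1.0)]
print(grmlr_loss(weights, samples, labels, edges, lam=0.1))
```

With only 13 samples for 26 features, a penalty of this kind is what lets prior ecological structure, rather than the data alone, decide which coefficients may differ.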

219. ❌ Leave No Stone Unturned: Uncovering Holistic Audio-Visual Intrinsic Coherence for Deepfake Detection

Authors: Jielun Peng, Yabin Wang, Yaqi Li, Long Kong, Xiaopeng Hong Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23960v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

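Scoring rationale: The paper studies audio-visual deepfake detection, an AI security application. It is unrelated to most large-model keywords and has moderate relevance only to "Pre-training" (the paper pre-trains on authentic videos) and "AI for Science" (it can be viewed as AI applied to security), each scored 5 points. No other keywords are involved.

!!! tip deepseek-chat TL;DR

The paper tackles audio-visual deepfake detection by proposing HAVIC, a detector based on holistic audio-visual intrinsic coherence, which significantly outperforms existing methods on multiple benchmarks.

Abstract (Chinese translation)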

生成式人工智能的快速发展催生了超逼真的视听深度伪造内容，加剧了对个人安全与社会信任的威胁。现有的大多数深度伪造检测器仅依赖单模态伪影或视听不一致性，未能协同利用这两类信息源。此外，依赖生成器特定伪影的检测器在面对未知伪造技术时往往表现出泛化性能下降。我们认为，稳健且可泛化的检测应基于模态内与跨模态的固有视听一致性。为此，我们提出了HAVIC——一种基于整体视听内在一致性的深度伪造检测器。HAVIC首先通过在真实视频上进行预训练，学习模态特定结构一致性、跨模态微观与宏观一致性的先验知识。基于习得的先验，HAVIC进一步执行整体自适应聚合，动态融合视听特征以进行深度伪造检测。此外，我们构建了HiFi-AVDF数据集，这是一个高保真视听深度伪造数据集，包含来自前沿商业生成器的文本到视频与图像到视频伪造样本。在多个基准测试上的广泛实验表明，HAVIC显著优于现有最优方法，在最具挑战性的跨数据集场景中实现了9.39%平均精度（AP）与9.37%曲线下面积（AUC）的性能提升。我们的代码与数据集已公开于https://github.com/tuffy-studio/HAVIC。

Abstract

The rapid progress of generative AI has enabled hyper-realistic audio-visual deepfakes, intensifying threats to personal security and social trust. Most existing deepfake detectors rely either on uni-modal artifacts or audio-visual discrepancies, failing to jointly leverage both sources of information. Moreover, detectors that rely on generator-specific artifacts tend to exhibit degraded generalization when confronted with unseen forgeries. We argue that robust and generalizable detection should be grounded in intrinsic audio-visual coherence within and across modalities. Accordingly, we propose HAVIC, a Holistic Audio-Visual Intrinsic Coherence-based deepfake detector. HAVIC first learns priors of modality-specific structural coherence, inter-modal micro- and macro-coherence by pre-training on authentic videos. Based on the learned priors, HAVIC further performs holistic adaptive aggregation to dynamically fuse audio-visual features for deepfake detection. Additionally, we introduce HiFi-AVDF, a high-fidelity audio-visual deepfake dataset featuring both text-to-video and image-to-video forgeries from state-of-the-art commercial generators. Extensive experiments across several benchmarks demonstrate that HAVIC significantly outperforms existing state-of-the-art methods, achieving improvements of 9.39% AP and 9.37% AUC on the most challenging cross-dataset scenario. Our code and dataset are available at https://github.com/tuffy-studio/HAVIC.

Keywords: deepfake detection, audio-visual coherence, HAVIC, HiFi-AVDF dataset, cross-dataset generalization, holistic adaptive aggregation, generative AI threats, multimodal deepfake

220. ❌ PointRFT: Explicit Reinforcement Fine-tuning for Point Cloud Few-shot Learning

Authors: Yankai Wang, Yiding Sun, Qirui Wang, Pengbo Li, Chaoyi Lu, Dongxu Zhang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23957v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

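Scoring rationale: The paper studies reinforcement fine-tuning for 3D point clouds and relates to applications of deep learning in science. Relevant keywords: 1) "Post-training OR Supervised Fine-tuning OR SFT" (10 points): the paper directly compares against and improves on supervised fine-tuning; 2) "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points): it involves pretrained foundation models; 3) "Large Language Models OR LLMs OR Foundation Models" (5 points): it borrows reinforcement learning methods from LLMs; 4) "AI for Science OR Bioinformatics OR Cheminformatics" (5 points): it is a scientific application of 3D perception. The other keywords are unrelated to the paper's topic of reinforcement fine-tuning for 3D point clouds.

!!! tip deepseek-chat TL;DR

The paper proposes PointRFT, the first reinforcement fine-tuning paradigm for point cloud representation learning. With purpose-built reward functions, it consistently outperforms conventional supervised fine-tuning on few-shot classification and, when combined into a Pretraining-SFT-RFT pipeline, achieves state-of-the-art performance in data-scarce scenarios.

Abstract (Chinese translation)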

理解点云中的空间动态与语义关系是全面三维认知的基础。尽管强化学习算法如群组相对策略优化（GRPO）近期通过策略性奖励设计激发推理能力，在大型语言模型中取得了显著突破，但其在三维感知领域的潜力仍很大程度上未被探索。这自然引出了一个关键问题：基于强化学习的方法能否有效赋能三维点云微调？本文提出PointRFT——首个专为点云表征学习设计的强化微调范式。我们选取三种主流三维基础模型，并设计了专用的精度奖励函数与离散度奖励函数以稳定训练、缓解分布偏移。通过对比不同训练范式的系统性小样本分类实验，我们证明PointRFT在多种基准测试中均稳定优于传统监督微调（SFT）。进一步地，当将其有机整合至“预训练-SFT-RFT”混合范式时，点云基础模型的表征能力得到充分释放，尤其在数据稀缺场景下实现了最先进的性能表现。

Abstract

Understanding spatial dynamics and semantics in point cloud is fundamental for comprehensive 3D comprehension. While reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have recently achieved remarkable breakthroughs in large language models by incentivizing reasoning capabilities through strategic reward design, their potential remains largely unexplored in the 3D perception domain. This naturally raises a pivotal question: Can RL-based methods effectively empower 3D point cloud fine-tuning? In this paper, we propose PointRFT, the first reinforcement fine-tuning paradigm tailored specifically for point cloud representation learning. We select three prevalent 3D foundation models and devise specialized accuracy reward and dispersion reward functions to stabilize training and mitigate distribution shifts. Through comprehensive few-shot classification experiments comparing distinct training paradigms, we demonstrate that PointRFT consistently outperforms vanilla supervised fine-tuning (SFT) across diverse benchmarks. Furthermore, when organically integrated into a hybrid Pretraining-SFT-RFT paradigm, the representational capacity of point cloud foundation models is substantially unleashed, achieving state-of-the-art performance particularly under data-scarce scenarios.

Keywords: Point Cloud, Reinforcement Fine-tuning, Few-shot Learning, 3D Foundation Models, Supervised Fine-tuning, Representation Learning, Accuracy Reward, Dispersion Reward
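
For readers unfamiliar with GRPO, its central step is to score each sampled output relative to the other outputs in its group rather than against a learned value function. The sketch below shows that normalization. The reward values are made up, and because the abstract does not say how PointRFT combines its accuracy and dispersion rewards, a single scalar reward per sample is assumed; `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    center = mean(rewards)
    spread = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - center) / (spread + eps) for r in rewards]


# Hypothetical rewards for four predictions sampled for the same point cloud:
# 1.0 for a correct class, 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
```

Outputs that beat the group average receive positive advantages and the rest negative ones, so the policy update needs no separate critic network.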

221. ❌ SynMVCrowd: A Large Synthetic Benchmark for Multi-view Crowd Counting and Localization

Authors: Qi Zhang, Daijie Chen, Yunfei Gong, Hui Huang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23956v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

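Scoring rationale: The paper focuses on multi-view crowd counting and localization in computer vision, proposing the synthetic benchmark dataset SynMVCrowd and establishing baseline methods. Its content is unrelated to all scoring keywords (which concern large models, deep learning techniques, AI for Science, and so on); it does not involve large-model techniques, training methods, inference optimization, AI agents, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper proposes SynMVCrowd, a large-scale synthetic benchmark for more practical evaluation of multi-view crowd counting and localization methods, and establishes baseline models that outperform existing methods on the benchmark.

Abstract (Chinese translation)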

现有多视角人群计数与定位方法通常在场景规模较小、人群数量有限、相机视角及帧数较少的条件下进行评估。这使得现有方法的评估与对比缺乏实际意义，因为小规模数据集极易被这些方法过拟合。为避免此类问题，3DROM提出了一种数据增强方法。而本文则提出了一个大规模合成基准数据集SynMVCrowd，旨在为多视角人群计数与定位任务提供更贴近实际的评估与比较平台。SynMVCrowd基准包含50个合成场景，具有海量的多视角视频帧与相机视角，以及更大规模的人群数量（最高达1000人），更适用于大场景多视角人群视觉任务。此外，我们提出了强大的多视角人群定位与计数基线模型，其在SynMVCrowd新基准上的表现优于所有对比方法。进一步地，我们证明借助该基准数据集，可在全新真实场景中实现更优的跨领域迁移多视角及单图像计数性能。因此，所提出的基准数据集能够推动多视角与单图像人群计数与定位研究向更实际的应用场景迈进。代码与数据集地址：https://github.com/zqyq/SynMVCrowd。

Abstract

Existing multi-view crowd counting and localization methods are evaluated under relatively small scenes with limited crowd numbers, camera views, and frames. This makes the evaluation and comparison of existing methods impractical, as small datasets are easily overfit by these methods. To avoid these issues, 3DROM proposes a data augmentation method. Instead, in this paper, we propose a large synthetic benchmark, SynMVCrowd, for more practical evaluation and comparison of multi-view crowd counting and localization tasks. The SynMVCrowd benchmark consists of 50 synthetic scenes with a large number of multi-view frames and camera views and a much larger crowd number (up to 1000), which is more suitable for large-scene multi-view crowd vision tasks. Besides, we propose strong multi-view crowd localization and counting baselines that outperform all comparison methods on the new SynMVCrowd benchmark. Moreover, we prove that better domain transferring multi-view and single-image counting performance could be achieved with the aid of the benchmark on novel new real scenes. As a result, the proposed benchmark could advance the research for multi-view and single-image crowd counting and localization to more practical applications. The codes and datasets are here: https://github.com/zqyq/SynMVCrowd.

Keywords: multi-view crowd counting, crowd localization, synthetic benchmark, large-scale scenes, domain transfer, computer vision, baseline methods, SynMVCrowd

222. ❌ VOLMO: Versatile and Open Large Models for Ophthalmology

Authors: Zhenyue Qin, Younjoon Chung, Elijah Lee, Wanyue Feng, Xuguang Ai, Serina Applebaum, Minjie Zou, Yang Liu, Pan Xiao, Mac Singer, Amisha Dave, Aidan Gilson, Tiarnan D. L. Keenan, Emily Y. Chew, Zhiyong Lu, Yih-Chung Tham, Ron Adelman, Luciano V. Del Priore, Qingyu Chen Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23953v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

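Scoring rationale: The paper proposes the VOLMO framework for developing ophthalmology-specific multimodal large language models (MLLMs); its core is the application of large-model (LLM) technology to medicine. The highly relevant keywords are 'Large Language Models' (the paper develops an MLLM), 'Pre-training' (the ophthalmology knowledge pretraining stage), 'Post-training' (the domain task fine-tuning stage), 'Chain of Thought' (the multi-step clinical reasoning stage), and 'AI for Science' (an AI application in ophthalmic medicine). 'Small Language Models' receives 5 points because the paper trains a compact 2B-parameter model, which is relatively small, although this is not the core focus. The remaining keywords have no direct connection to the paper and receive 0 points.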
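The 26.6 threshold in the score line follows from the report's configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The sketch below is a minimal illustration of that check, assuming that a paper's total is the sum of the Score column and that a paper passes when its total is at least the threshold; the report states only the threshold formula. The function names are hypothetical, and the three sample rows are copied from the VOLMO table.

```python
def passing_score(total_weight, base=5.0, per_weight=0.8):
    """Threshold formula from the report configuration: base + per_weight * total_weight."""
    return base + per_weight * total_weight


def total_score(rows):
    """Assumption: a paper's total is the sum of the per-keyword Score column."""
    return sum(score for _keyword, _weight, _relevance, score in rows)


# Three rows copied from the VOLMO table: (keyword, weight, relevance, score).
volmo_rows = [
    ("Large Language Models OR LLMs OR Foundation Models", 0.0, 10.0, 0.0),
    ("Small Language Models OR SLMs OR On-device AI", 0.0, 5.0, 0.0),
    ("AI for Science OR Bioinformatics OR Cheminformatics", 0.0, 10.0, 0.0),
]

threshold = passing_score(27.0)
total = total_score(volmo_rows)
passed = total >= threshold  # assumption: ">=" rather than ">"
print(f"{total:.1f} / {threshold:.1f} ->", "pass" if passed else "fail")
```

Because every weight in the table is 0.0, every per-keyword score is 0.0 regardless of relevance, which is why a paper with several 10.0/10 relevance ratings still totals 0.0.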

!!! tip deepseek-chat TL;DR

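To address the challenge of integrating multimodal data in ophthalmology clinical workflows, the paper proposes the VOLMO framework, which develops ophthalmology-specific multimodal large language models in three stages (pretraining, fine-tuning, and multi-step reasoning). Its 2B-parameter model outperforms existing baselines on multiple tasks.

Abstract Translation

Vision impairment affects millions of people worldwide, and early detection is critical to preventing irreversible vision loss. Ophthalmology clinical workflows require clinicians to integrate medical images, structured clinical data, and free-text notes to determine disease severity and management, a process that is time-consuming and burdensome. Recent multimodal large language models show promise, but existing general and medical multimodal large language models perform poorly in ophthalmology, and very few ophthalmology-specific ones are openly available. We present VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, data-open framework for developing ophthalmology-specific multimodal large language models. VOLMO has three stages: first, ophthalmology knowledge pretraining on 86,965 image-text pairs from 26,569 articles across 82 journals; second, domain task fine-tuning on 26,929 annotated instances covering 12 eye conditions, for disease screening and severity grading; and finally, multi-step clinical reasoning training on 913 patient case reports, covering assessment, planning, and follow-up care. Using this framework, we trained a compact 2-billion-parameter multimodal large language model and compared it with several strong baselines, including InternVL-2B, LLaVA-Med-7B, MedGemma-4B, MedGemma-27B, and RETFound. We evaluated these models on image description generation, disease screening and staging classification, and assessment-and-management generation, with additional manual review by two healthcare professionals and external validation on three independent cohorts for age-related macular degeneration and diabetic retinopathy. Across all evaluation settings, VOLMO-2B consistently outperformed the baselines, achieving stronger image description performance, an average F1 score of 87.4% across the 12 eye conditions, and higher scores in external validation.

Abstract (Original)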

Vision impairment affects millions globally, and early detection is critical to preventing irreversible vision loss. Ophthalmology workflows require clinicians to integrate medical images, structured clinical data, and free-text notes to determine disease severity and management, which is time-consuming and burdensome. Recent multimodal large language models (MLLMs) show promise, but existing general and medical MLLMs perform poorly in ophthalmology, and few ophthalmology-specific MLLMs are openly available. We present VOLMO (Versatile and Open Large Models for Ophthalmology), a model-agnostic, data-open framework for developing ophthalmology-specific MLLMs. VOLMO includes three stages: ophthalmology knowledge pretraining on 86,965 image-text pairs from 26,569 articles across 82 journals; domain task fine-tuning on 26,929 annotated instances spanning 12 eye conditions for disease screening and severity classification; and multi-step clinical reasoning on 913 patient case reports for assessment, planning, and follow-up care. Using this framework, we trained a compact 2B-parameter MLLM and compared it with strong baselines, including InternVL-2B, LLaVA-Med-7B, MedGemma-4B, MedGemma-27B, and RETFound. We evaluated these models on image description generation, disease screening and staging classification, and assessment-and-management generation, with additional manual review by two healthcare professionals and external validation on three independent cohorts for age-related macular degeneration and diabetic retinopathy. Across settings, VOLMO-2B consistently outperformed baselines, achieving stronger image description performance, an average F1 of 87.4% across 12 eye conditions, and higher scores in external validation.

Keywords: multimodal large language models, ophthalmology, medical AI, domain adaptation, clinical reasoning, disease screening, model-agnostic framework, external validation

223. ❌ DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models

Authors: Hongyi Miao, Jun Jia, Xincheng Wang, Qianli Ma, Wei Sun, Wangqiu Zhou, Dandan Zhu, Yewen Cao, Zhi Liu, Guangtao Zhai Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23925v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

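Scoring rationale: The paper studies privacy protection for vision-language models (VLMs) and proposes a data-poisoning protection framework for private photo datasets. Its relevance to the scoring keywords is as follows: (1) the paper involves fine-tuning of VLMs (SFT in particular), so it has some connection to "Post-training OR Supervised Fine-tuning OR SFT" (5 points), although this is not its core contribution; (2) the other keywords target the technical principles or applications of large language models (LLMs), or specific domains such as AI for Science, whereas this paper focuses on privacy risks and defenses for VLMs and does not address those topics, so they receive 0 points.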

!!! tip deepseek-chat TL;DR

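The paper addresses the risk of identity-affiliation privacy leakage when vision-language models are fine-tuned on private photos. It proposes DP2-VL, the first data-poisoning-based privacy protection framework, which optimizes imperceptible perturbations so that models fine-tuned on the protected dataset overfit, effectively preventing leakage of private information.

Abstract Translation

Recent advances in vision-language alignment have given vision-language models (VLMs) fine-grained image understanding capabilities. However, this progress also brings new privacy risks. This paper is the first to propose a new privacy threat model called identity-affiliation learning: an attacker fine-tunes a VLM using only a few private photos of a target individual, thereby embedding associations between the target's facial identity and their private property and social relationships into the model's internal representations. Once deployed through a public API, the model can expose the target user's private information without authorization whenever a photo of the target is provided as input. To assess how susceptible VLMs are to this kind of identity-affiliation leakage, we introduce the first identity-affiliation dataset, covering seven typical scenarios that appear in private photos. Each scenario is instantiated with multiple identity-centered photo-description pairs. Experimental results show that mainstream VLMs (such as LLaVA, Qwen-VL, and MiniGPT-v2) can recognize facial identities and infer identity-affiliation relationships after fine-tuning on small private photo datasets, and even on synthetically generated datasets. To mitigate this privacy risk, we propose DP2-VL, the first private photo dataset protection framework based on data poisoning. By optimizing imperceptible perturbations that push the original representations toward an antithetical region, DP2-VL induces a dataset-level shift in the embedding space of the VLM encoders. This shift separates protected images from clean inference images, so that fine-tuning on the protected set overfits. Extensive experiments show that DP2-VL generalizes well across models, is robust to a variety of post-processing operations, and remains consistently effective across different protection ratios.

Abstract (Original)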

Recent advances in visual-language alignment have endowed vision-language models (VLMs) with fine-grained image understanding capabilities. However, this progress also introduces new privacy risks. This paper first proposes a novel privacy threat model named identity-affiliation learning: an attacker fine-tunes a VLM using only a few private photos of a target individual, thereby embedding associations between the target facial identity and their private property and social relationships into the model’s internal representations. Once deployed via public APIs, this model enables unauthorized exposure of the target user’s private information upon input of their photos. To benchmark VLMs’ susceptibility to such identity-affiliation leakage, we introduce the first identity-affiliation dataset comprising seven typical scenarios appearing in private photos. Each scenario is instantiated with multiple identity-centered photo-description pairs. Experimental results demonstrate that mainstream VLMs like LLaVA, Qwen-VL, and MiniGPT-v2, can recognize facial identities and infer identity-affiliation relationships by fine-tuning on small-scale private photographic dataset, and even on synthetically generated datasets. To mitigate this privacy risk, we propose DP2-VL, the first Dataset Protection framework for private photos that leverages Data Poisoning. Though optimizing imperceptible perturbations by pushing the original representations toward an antithetical region, DP2-VL induces a dataset-level shift in the embedding space of VLMs’encoders. This shift separates protected images from clean inference images, causing fine-tuning on the protected set to overfit. Extensive experiments demonstrate that DP2-VL achieves strong generalization across models, robustness to diverse post-processing operations, and consistent effectiveness across varying protection ratios.

Keywords: vision-language models, privacy protection, data poisoning, identity-affiliation learning, fine-tuning, dataset protection, private photos, VLMs

224. ❌ DepthArb: Training-Free Depth-Arbitrated Generation for Occlusion-Robust Image Synthesis

Authors: Hongjin Niu, Jiahao Wang, Xirui Hu, Weizhan Zhang, Lan Ma, Yuan Gao Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23924v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

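Scoring rationale: DepthArb focuses on generating occlusion relationships in text-to-image diffusion models and proposes a training-free attention arbitration framework to resolve multi-object occlusion ambiguity. All scoring keywords relate directly to large language models (LLMs), innovations in deep learning techniques, or applications of AI in science, whereas this paper studies diffusion-based image synthesis in computer vision. It has no direct connection to the large-model techniques, training methods, inference optimization, or AI for Science topics in the keyword list.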

!!! tip deepseek-chat TL;DR

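The paper addresses inaccurate multi-object occlusion relationships in text-to-image diffusion models. It proposes the training-free DepthArb framework, which combines Attention Arbitration Modulation and Spatial Compactness Control to arbitrate attention competition between objects, and achieves better occlusion accuracy and visual fidelity than existing methods on the OcclBench benchmark.

Abstract Translation

Text-to-image diffusion models often fall short when synthesizing accurate occlusion relationships among multiple objects, especially in densely overlapping regions. Existing training-free layout-guided methods rely mainly on rigid spatial priors that are unaware of depth order, which often leads to concept mixing or illogical occlusion. To address these limitations, we propose DepthArb, a training-free framework that resolves occlusion ambiguity by arbitrating attention competition between interacting objects. Specifically, DepthArb uses two core mechanisms: Attention Arbitration Modulation (AAM), which enforces depth-ordered visibility by suppressing background activations in overlapping regions, and Spatial Compactness Control (SCC), which preserves structural integrity by constraining attention divergence. These mechanisms enable robust occlusion generation without retraining the model. To evaluate this capability systematically, we propose OcclBench, a comprehensive benchmark designed to assess diverse occlusion scenarios. Extensive evaluations show that DepthArb consistently outperforms state-of-the-art baselines in both occlusion accuracy and visual fidelity. As a plug-and-play method, DepthArb seamlessly enhances the compositional capabilities of diffusion backbones and offers a new perspective on spatial layering within generative models.

Abstract (Original)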

Text-to-image diffusion models frequently exhibit deficiencies in synthesizing accurate occlusion relationships of multiple objects, particularly within dense overlapping regions. Existing training-free layout-guided methods predominantly rely on rigid spatial priors that remain agnostic to depth order, often resulting in concept mixing or illogical occlusion. To address these limitations, we propose DepthArb, a training-free framework that resolves occlusion ambiguities by arbitrating attention competition between interacting objects. Specifically, DepthArb employs two core mechanisms: Attention Arbitration Modulation (AAM), which enforces depth-ordered visibility by suppressing background activations in overlapping regions, and Spatial Compactness Control (SCC), which preserves structural integrity by curbing attention divergence. These mechanisms enable robust occlusion generation without model retraining. To systematically evaluate this capability, we propose OcclBench, a comprehensive benchmark designed to evaluate diverse occlusion scenarios. Extensive evaluations demonstrate that DepthArb consistently outperforms state-of-the-art baselines in both occlusion accuracy and visual fidelity. As a plug-and-play method, DepthArb seamlessly enhances the compositional capabilities of diffusion backbones, offering a novel perspective on spatial layering within generative models.

Keywords: text-to-image diffusion models, occlusion relationships, training-free framework, attention arbitration, depth-ordered visibility, spatial layering, OcclBench benchmark, compositional capabilities

225. ❌ Uncertainty-Aware Vision-based Risk Object Identification via Conformal Risk Tube Prediction

Authors: Kai-Yu Fu, Yi-Ting Chen Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23919v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

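Scoring rationale: This paper belongs to computer vision. It studies vision-based risk object identification in intelligent driving systems and proposes an uncertainty modeling method based on conformal prediction. The paper does not involve large language models, innovations in deep learning techniques, or applications of large models in other domains, so none of the keywords applies and every keyword receives a relevance of 0.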

!!! tip deepseek-chat TL;DR

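The paper addresses the lack of uncertainty modeling in vision-based risk object identification for intelligent driving systems. It proposes Conformal Risk Tube Prediction, which models spatiotemporal uncertainty to provide coverage guarantees and calibrated risk scores, substantially improving the robustness of risk identification and downstream task performance.

Abstract Translation

This study focuses on object importance-based vision risk object identification (Vision-ROI), a key capability for hazard detection in intelligent driving systems. Existing methods typically make deterministic decisions and ignore uncertainty, which can lead to serious safety failures. Specifically, in ambiguous scenarios, fixed decision thresholds may cause premature or delayed risk detection as well as temporally unstable predictions, especially in complex scenes with multiple interacting risks. Despite these challenges, current methods lack a principled framework for jointly modeling risk uncertainty across space and time. We propose Conformal Risk Tube Prediction, a unified framework that captures spatiotemporal risk uncertainty, provides coverage guarantees for true risks, and produces calibrated risk scores with uncertainty estimates. For systematic evaluation, we present a new dataset and metrics that probe diverse scenario configurations with multi-risk coupling effects, which existing datasets do not support. We systematically analyze factors that affect uncertainty estimation, including scenario variations, per-risk-category behavior, and perception error propagation. Compared with prior methods, our approach delivers substantial improvements, enhancing Vision-ROI robustness and downstream performance, for example by reducing nuisance braking alerts. More qualitative results are available on the project webpage: https://hcis-lab.github.io/CRTP/

Abstract (Original)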

We study object importance-based vision risk object identification (Vision-ROI), a key capability for hazard detection in intelligent driving systems. Existing approaches make deterministic decisions and ignore uncertainty, which could lead to safety-critical failures. Specifically, in ambiguous scenarios, fixed decision thresholds may cause premature or delayed risk detection and temporally unstable predictions, especially in complex scenes with multiple interacting risks. Despite these challenges, current methods lack a principled framework to model risk uncertainty jointly across space and time. We propose Conformal Risk Tube Prediction, a unified formulation that captures spatiotemporal risk uncertainty, provides coverage guarantees for true risks, and produces calibrated risk scores with uncertainty estimates. To conduct a systematic evaluation, we present a new dataset and metrics probing diverse scenario configurations with multi-risk coupling effects, which are not supported by existing datasets. We systematically analyze factors affecting uncertainty estimation, including scenario variations, per-risk category behavior, and perception error propagation. Our method delivers substantial improvements over prior approaches, enhancing vision-ROI robustness and downstream performance, such as reducing nuisance braking alerts. For more qualitative results, please visit our project webpage: https://hcis-lab.github.io/CRTP/
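The coverage guarantee mentioned in the abstract is the defining property of conformal prediction. The sketch below shows generic split conformal prediction only; it is not the paper's risk tube formulation, and the calibration scores, object names, and function names are hypothetical. A threshold is taken from held-out nonconformity scores, and every candidate whose score falls within that threshold is kept in the prediction set.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha) - 1e-12)  # small guard against float round-off
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]


# Hypothetical calibration scores, e.g. 1 - predicted risk of the true risk object.
calibration = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80]
q = conformal_threshold(calibration, alpha=0.2)

# Hypothetical per-object nonconformity scores for one new frame.
frame = {"pedestrian": 0.15, "parked_car": 0.90, "cyclist": 0.50}
print(q, prediction_set(frame, q))
```

If the calibration and test scores are exchangeable, the set built this way contains the true label with probability at least 1 − alpha. Extending that guarantee jointly across space and time is the part the paper contributes and is not shown here.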

Keywords: vision-based risk object identification, uncertainty modeling, conformal prediction, spatiotemporal risk, intelligent driving systems, risk tube prediction, safety-critical systems, calibrated risk scores

226. ❌ Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

Authors: Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23914v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	15.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

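Scoring rationale: The paper focuses on memory efficiency during decoding for large vision-language models (VLMs), which falls under inference acceleration for large models. Its core contribution is the AttentionPack framework, which reduces KV cache memory usage through attention compaction and latency optimization; this is highly relevant to 'KV Cache Compression' (15 points). The paper explicitly targets long-context tasks, which relates to 'Context Window Extension' (10 points). Its optimization goals include faster inference, which relates to 'Speculative Decoding OR Inference Acceleration' (10 points). The paper mentions combining the method with quantization, giving some connection to 'Quantization OR Model Compression' (5 points). The paper involves large models and is therefore related to 'Large Language Models OR LLMs' (8 points). Other keywords, such as MoE, training methods, alignment, and agents, are not addressed and receive 0 points.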

!!! tip deepseek-chat TL;DR

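The paper proposes the AttentionPack framework, which uses attention compaction and latency optimization to address memory efficiency in long-context decoding for large vision-language models, achieving up to 8x better memory efficiency and faster batch inference.

Abstract Translation

Large vision-language models (VLMs) have achieved remarkable success in multimodal reasoning, but their inference-time efficiency remains a major challenge because of the memory overhead during decoding, especially when VLM queries and answers consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive, attention-aware optimization framework designed for large vision-language models. It aims to improve memory efficiency during decoding and focuses on the challenges caused by the growing number of visual inputs and interactions, particularly in long-context tasks involving multiple high-resolution images or videos. AttentionPack is novel in two respects: (i) we introduce a multi-head attention compaction method that stores key and value matrices economically by exploiting their implicit low-rank structure; and (ii) we develop a token-specific, attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks show that AttentionPack improves memory efficiency by up to 8x, supporting larger batch sizes and faster batch inference while preserving model output quality, or enabling longer context lengths for better retrieval performance. We also report the effectiveness of combining AttentionPack with eviction, quantization, and kernel fusion, showing further efficiency gains in resource-constrained environments.

Abstract (Original)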

Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and answer of VLMs consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework tailored for large vision-language models with improving memory-efficiency during decoding, focusing on addressing the challenges due to the increased high number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) We introduce a multi-head attention compaction method for economically storing key and value matrices by exploiting the implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x, enabling higher batch sizes and faster batch inference while preserving the model output quality or longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack combined with eviction, quantization and kernel fusion, showing further efficiency gains for resource-limited environments.
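To make the low-rank storage idea concrete, the toy sketch below keeps two small factors instead of a full key matrix and rebuilds a single token's key vector on demand. It is an illustration under stated assumptions, not AttentionPack itself: the matrix is constructed to be exactly rank 2, so reconstruction is exact, whereas real key and value matrices would need an approximate factorization such as a truncated SVD. The sizes (4096 tokens, 128 dimensions, rank 16) and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy key matrix (4 tokens x 4 dims) built to have rank exactly 2.
left = [[1, 0], [0, 1], [1, 1], [2, 1]]   # 4 x 2 per-token coefficients
right = [[1, 2, 0, 1], [0, 1, 3, 1]]      # 2 x 4 shared basis
keys = matmul(left, right)


def stored_values(n_tokens, dim, rank):
    """Number of values kept: full matrix versus the two low-rank factors."""
    return n_tokens * dim, rank * (n_tokens + dim)


def decompress_token(left, right, token_index):
    """Rebuild the key vector of a single token from the stored factors."""
    return matmul([left[token_index]], right)[0]


full, packed = stored_values(n_tokens=4096, dim=128, rank=16)
print(f"storage ratio: {full / packed:.2f}x")
print(decompress_token(left, right, 3) == keys[3])
```

Rebuilding only the rows that are needed is one way to read the abstract's "token-specific" decompression; how AttentionPack chooses those tokens from attention scores is not described in the abstract and is not modeled here.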

Keywords: Large Vision-Language Models, Memory-efficient Decoding, Attention-aware Optimization, KV Cache Compression, Long-context Tasks, Inference Acceleration, Batch Inference, Multi-modal Reasoning

227. ❌ GenMask: Adapting DiT for Segmentation via Direct Mask

Authors: Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai, Jiangchao Yao, Ya Zhang, Junyang Lin, Yanfeng Wang Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23906v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

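Scoring rationale: GenMask focuses on applying the Diffusion Transformer (DiT) directly to segmentation and proposes a unified generative training method. Its core contribution is a timestep sampling strategy that resolves the distribution gap between binary masks and natural images in the VAE latent space, enabling joint generation of segmentation masks and RGB images. The paper mainly concerns the use of a generative model (DiT) for segmentation in computer vision and is unrelated to most keywords (such as LLMs, MoE, RLHF, and RAG). It has some connection only to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points), because it uses a pretrained DiT model and involves domain adaptation (from image generation to segmentation), although this is not its core contribution. The other keywords are not addressed (0 points).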

!!! tip deepseek-chat TL;DR

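The paper addresses the representation misalignment that arises when pretrained generative models are used for segmentation. By introducing a timestep sampling strategy, it lets a DiT generate both segmentation masks and color images in a unified way, achieving state-of-the-art performance on several segmentation benchmarks.

Abstract Translation

Recent segmentation methods commonly use pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task carried out through indirect feature retrieval. This implicit use suffers from a fundamental misalignment at the representation level and depends heavily on indirect feature extraction pipelines, which complicates the workflow and limits adaptation. This paper argues that segmentation should be trained directly in a generative manner rather than through indirect adaptation. We identify the key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, robust to noise, and linearly separable, which makes them markedly different from the latent representations of natural images. To bridge this gap, we propose a timestep sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise for image generation, enabling harmonious joint training. We present GenMask, a model based on the original generative objective that uses the DiT architecture to generate black-and-white segmentation masks as well as color images in RGB space. GenMask fully preserves the original DiT architecture and removes the need for feature extraction pipelines tailored to segmentation. Experiments show that GenMask achieves state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablation studies quantify the contribution of each component.

Abstract (Original)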

Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that instead of indirect adaptation, segmentation tasks should be trained directly in a generative manner. We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise robust, and linearly separable, distinct from natural image latents. To bridge this gap, we introduce timesteps sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise for image generation, enabling harmonious joint training. We present GenMask, a DiT trains to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need of feature extraction pipelines tailored for segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks and ablations quantify the contribution of each component.

Keywords: segmentation, generative models, Diffusion Transformer, mask generation, joint training, VAE latents, timesteps sampling, state-of-the-art

228. ❌ Polynomial Speedup in Diffusion Models with the Multilevel Euler-Maruyama Method

Authors: Arthur Jacot Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24594v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

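Scoring rationale: The paper focuses on accelerating sampling in diffusion models with the Multilevel Euler-Maruyama method, an application of deep learning to generative modeling. All of the keywords, however, target large language models (LLMs) and related techniques (such as MoE, alignment, reasoning, and agents). This paper studies diffusion models (UNet-based image generation models), whose technology stack is entirely different from that of LLMs, so there is no direct connection.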

!!! tip deepseek-chat TL;DR

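The paper proposes a multilevel Euler-Maruyama sampling method for diffusion models. By combining UNet approximators of different accuracy, it achieves up to a 4x speedup in image generation on the CelebA dataset, at a compute cost equivalent to a single evaluation of the largest network.

Abstract Translation

This paper proposes the Multilevel Euler-Maruyama (ML-EM) method for solving stochastic differential equations (SDEs) and ordinary differential equations (ODEs). The method approximates the original drift $f$ with a sequence of drift approximators $f^1,\dots,f^k$ of increasing accuracy and computational cost; it needs only a few evaluations of the most accurate $f^k$ and many evaluations of the cheaper $f^1,\dots,f^{k-1}$. If the drift lies in the so-called Harder than Monte Carlo (HTMC) regime, meaning that $ε^{-γ}$ compute is required to approximate it to accuracy $ε$ for some $γ>2$, then ML-EM achieves an $ε$-accurate approximation of the SDE solution with $ε^{-γ}$ compute, improving on the $ε^{-γ-1}$ rate of the traditional Euler-Maruyama (EM) method. In other words, the method makes the cost of solving the SDE equivalent to a single evaluation of the drift. In the context of diffusion models, the levels $f^{1},\dots,f^{k}$ are obtained by training UNets of increasing size, and ML-EM lets us complete sampling at a compute cost equivalent to a single forward pass of the largest UNet. Numerical experiments confirm the theory: on image generation for the CelebA dataset at 64×64 resolution, we observe $γ\approx2.5$ and obtain speedups of up to fourfold. Because this is a polynomial speedup, we expect even larger gains in practical applications that involve much larger networks.

Abstract (Original)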

We introduce the Multilevel Euler-Maruyama (ML-EM) method compute solutions of SDEs and ODEs using a range of approximators $f^1,\dots,f^k$ to the drift $f$ with increasing accuracy and computational cost, only requiring a few evaluations of the most accurate $f^k$ and many evaluations of the less costly $f^1,\dots,f^{k-1}$. If the drift lies in the so-called Harder than Monte Carlo (HTMC) regime, i.e. it requires $ε^{-γ}$ compute to be $ε$-approximated for some $γ>2$, then ML-EM $ε$-approximates the solution of the SDE with $ε^{-γ}$ compute, improving over the traditional EM rate of $ε^{-γ-1}$. In other terms it allows us to solve the SDE at the same cost as a single evaluation of the drift. In the context of diffusion models, the different levels $f^{1},\dots,f^{k}$ are obtained by training UNets of increasing sizes, and ML-EM allows us to perform sampling with the equivalent of a single evaluation of the largest UNet. Our numerical experiments confirm our theory: we obtain up to fourfold speedups for image generation on the CelebA dataset downscaled to 64x64, where we measure a $γ\approx2.5$. Given that this is a polynomial speedup, we expect even stronger speedups in practical applications which involve orders of magnitude larger networks.
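A rough way to see how cheap and expensive drift approximators can be mixed is a randomized telescoping sum inside an ordinary Euler-Maruyama loop. The sketch below is an assumption-laden illustration, not the paper's ML-EM algorithm: it writes $f^k = f^1 + \sum_{j=2}^{k} (f^j - f^{j-1})$ and evaluates each correction only with a chosen probability, reweighted by the inverse of that probability so the estimate stays unbiased. The toy scalar drifts, the probabilities, and all function names are hypothetical.

```python
import math
import random


def ml_drift(x, t, drifts, probs, rng):
    """Unbiased randomized estimate of drifts[-1](x, t).

    drifts[0] is always evaluated; the correction drifts[j] - drifts[j - 1]
    is evaluated with probability probs[j - 1] and divided by it.
    """
    estimate = drifts[0](x, t)
    for j in range(1, len(drifts)):
        p = probs[j - 1]
        if rng.random() < p:
            estimate += (drifts[j](x, t) - drifts[j - 1](x, t)) / p
    return estimate


def euler_maruyama(x0, drift_fn, sigma, n_steps, t_end, rng):
    """Standard Euler-Maruyama for dX = drift(X, t) dt + sigma dW (scalar state)."""
    dt = t_end / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        x += drift_fn(x, t) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    return x


# Toy levels: a cheap, biased drift and the "accurate" drift -x.
drifts = [lambda x, t: -0.9 * x, lambda x, t: -x]
rng = random.Random(0)
sample = euler_maruyama(
    1.0, lambda x, t: ml_drift(x, t, drifts, [0.25], rng),
    sigma=0.1, n_steps=100, t_end=1.0, rng=rng,
)
print(sample)
```

With `probs = [0.25]`, the accurate drift is evaluated on roughly a quarter of the steps. The price is extra variance from the reweighted correction, which stays small only when successive levels are close to each other.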

Keywords: Diffusion Models, Multilevel Euler-Maruyama, SDEs, Sampling Acceleration, UNet, Computational Cost, Polynomial Speedup, Image Generation
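
To make the rates quoted in the abstract above concrete, here is a minimal sketch comparing the asymptotic compute of plain EM ($ε^{-γ-1}$) with that of ML-EM ($ε^{-γ}$). All constants are dropped and the function names are illustrative assumptions, so the printed ratios show only how the gap scales with $ε$; they are not the speedups measured in the paper.

```python
def em_cost(eps: float, gamma: float) -> float:
    """Asymptotic compute of plain Euler-Maruyama, eps^-(gamma+1), constants dropped."""
    return eps ** -(gamma + 1.0)


def ml_em_cost(eps: float, gamma: float) -> float:
    """Asymptotic compute of ML-EM as stated in the abstract, eps^-gamma, constants dropped."""
    return eps ** -gamma


def speedup(eps: float, gamma: float) -> float:
    """Ratio of the two rates; algebraically this is 1/eps for any gamma."""
    return em_cost(eps, gamma) / ml_em_cost(eps, gamma)


gamma = 2.5  # the value the abstract reports measuring on CelebA 64x64
for eps in (0.5, 0.25, 0.1):
    print(f"eps={eps}: EM/ML-EM compute ratio ~ {speedup(eps, gamma):.1f}")
```

Because the ratio grows like $1/ε$, the gap widens as the target accuracy tightens, which matches the abstract's remark that the speedup is polynomial.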

229. ❌ DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving

Authors: Pengxuan Yang, Yupeng Zheng, Deheng Qian, Zebin Xing, Qichao Zhang, Linbo Wang, Yichen Zhang, Shaoyu Guo, Zhongpu Xia, Qiang Chen, Junyu Han, Lingyun Xu, Yifeng Pan, Dongbin Zhao Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24587v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

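Scoring rationale: The paper addresses reinforcement learning for autonomous driving and proposes DreamerAD, a latent world model framework that makes training efficient by compressing the diffusion sampling steps. Its core is the application of reinforcement learning and world models to autonomous driving, which is unrelated to most keywords (LLM, MoE, SFT, RAG, etc.). The only relevant keyword is 'World Models AND General World Models': the paper explicitly proposes a latent world model framework and uses it for RL training, hence a relevance of 10 (highly relevant). No other keyword is covered, so each scores 0.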

!!! tip deepseek-chat TL;DR

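The paper proposes DreamerAD, a latent world model framework for autonomous driving that compresses diffusion sampling from 100 steps to 1 for an 80x speedup and reaches a state-of-the-art 87.7 EPDMS on NavSim v2.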

Abstract

We introduce DreamerAD, the first latent world model framework that enables efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1 - achieving 80x speedup while maintaining visual interpretability. Training RL policies on real-world driving data incurs prohibitive costs and safety risks. While existing pixel-level diffusion world models enable safe imagination-based training, they suffer from multi-step diffusion inference latency (2s/frame) that prevents high-frequency RL interaction. Our approach leverages denoised latent features from video generation models through three key mechanisms: (1) shortcut forcing that reduces sampling complexity via recursive multi-resolution step compression, (2) an autoregressive dense reward model operating directly on latent representations for fine-grained credit assignment, and (3) Gaussian vocabulary sampling for GRPO that constrains exploration to physically plausible trajectories. DreamerAD achieves 87.7 EPDMS on NavSim v2, establishing state-of-the-art performance and demonstrating that latent-space RL is effective for autonomous driving.

Keywords: DreamerAD, latent world model, reinforcement learning, autonomous driving, diffusion sampling compression, efficient training, NavSim v2, GRPO

230. ❌ Trust Region Constrained Bayesian Optimization with Penalized Constraint Handling

Authors: Raju Chowdhury, Tanmay Sen, Prajamitra Bhuyan, Biswabrata Pradhan Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24567v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

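Scoring rationale: The paper focuses on Bayesian optimization for high-dimensional black-box constrained optimization and proposes an algorithm combining a penalty function, a surrogate model, and a trust region strategy. Its content is entirely about optimization algorithm design and has no direct connection to any scoring keyword (all of which concern large models, deep learning techniques, or AI-for-science applications). It mentions no language models, training techniques, reasoning methods, agent systems, or AI-for-science applications, so every keyword's relevance is 0.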

!!! tip deepseek-chat TL;DR

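The paper proposes a Bayesian optimization method that combines a penalty function with a trust region strategy to solve high-dimensional black-box constrained optimization problems; experiments show that it finds high-quality feasible solutions with fewer evaluations while maintaining stable performance.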

Abstract

Constrained optimization in high-dimensional black-box settings is difficult due to expensive evaluations, the lack of gradient information, and complex feasibility regions. In this work, we propose a Bayesian optimization method that combines a penalty formulation, a surrogate model, and a trust region strategy. The constrained problem is converted to an unconstrained form by penalizing constraint violations, which provides a unified modeling framework. A trust region restricts the search to a local region around the current best solution, which improves stability and efficiency in high dimensions. Within this region, we use the Expected Improvement acquisition function to select evaluation points by balancing improvement and uncertainty. The proposed Trust Region method integrates penalty-based constraint handling with local surrogate modeling. This combination enables efficient exploration of feasible regions while maintaining sample efficiency. We compare the proposed method with state-of-the-art methods on synthetic and real-world high-dimensional constrained optimization problems. The results show that the method identifies high-quality feasible solutions with fewer evaluations and maintains stable performance across different settings.

Keywords: Bayesian optimization, constrained optimization, trust region, penalty method, high-dimensional, black-box, surrogate model, Expected Improvement
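
The three ingredients named in the abstract above (penalty formulation, Expected Improvement, trust region) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' algorithm: the linear penalty form, the weight `rho`, the box-shaped trust region, and all function names are hypothetical choices, and the surrogate model itself is omitted (only its posterior mean `mu` and standard deviation `sigma` at a candidate point are used).

```python
import math


def penalized_objective(f_value, constraint_values, rho=10.0):
    """Assumed penalty form: f(x) + rho * sum(max(0, g_i(x))), feasible when g_i(x) <= 0."""
    violation = sum(max(0.0, g) for g in constraint_values)
    return f_value + rho * violation


def expected_improvement(mu, sigma, best):
    """Closed-form EI for minimization under a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(0.0, best - mu)  # no uncertainty: improvement is deterministic
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf


def trust_region_bounds(center, length, lower, upper):
    """Axis-aligned box of side `length` around `center`, clipped to the global bounds."""
    half = length / 2.0
    return [(max(lo, c - half), min(hi, c + half))
            for c, lo, hi in zip(center, lower, upper)]


# One infeasible evaluation: the second constraint is violated by 0.2.
print(penalized_objective(1.0, [-0.5, 0.2]))
# EI of a candidate whose posterior mean equals the incumbent: driven by sigma alone.
print(expected_improvement(mu=0.0, sigma=1.0, best=0.0))
# Search box around the incumbent in the unit square.
print(trust_region_bounds([0.5, 0.1], 0.4, [0.0, 0.0], [1.0, 1.0]))
```

In a full loop, candidates would be drawn only inside the trust-region box and ranked by EI computed on the penalized objective, which is how the abstract describes the combination.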

231. ❌ Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction

Authors: Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu, Shih-Lun Huang, Long Chen, Gabe Schulman, Huizhen Jin, Shengduo Li, Yixuan Wang, Huidi Yang, Kyunghyun Cho, Cem M. Deniz, Narges Razavian Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24562v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

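Scoring rationale: The paper proposes RAVEN, a generative pretrained foundation model for structured electronic health records (EHRs). Its core content: (1) an application of Foundation Models to healthcare (highly relevant, 10); (2) a new pretraining strategy (Pre-training, 10); (3) an empirical study of scaling behavior in a data-constrained, compute-saturated regime (Scaling Laws AND Data Quality, 10); (4) an AI for Science application in bioinformatics/healthcare (10). Other keywords such as MoE, SFT, RLHF, RAG, and quantization are not covered and score 0.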

!!! tip deepseek-chat TL;DR

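The paper proposes RAVEN, a new generative pretrained foundation model for sequence prediction on electronic health records. By regularizing repeated events and empirically studying scaling behavior, it matches fully fine-tuned models on zero-shot disease prediction and generalizes to an external patient cohort.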

Abstract

While large-scale pretraining has revolutionized language modeling, its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present RAVEN, a novel generative pretraining strategy for sequential EHR data based on Recurrence-Aware next-Visit EveNt prediction. Leveraging a dataset of over one million unique individuals, our model learns to autoregressively generate tokenized clinical events for the next visit conditioned on patient history. We introduce regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Furthermore, we empirically investigate the scaling behaviors in a data-constrained, compute-saturated regime, showing that simply increasing model size is suboptimal without commensurate increases in data volume. We evaluate our model via zero-shot prediction for forecasting the incidence of a diverse set of diseases, where it rivals fully fine-tuned representation-based Transformer models and outperforms widely used simulation-based next-token approaches. Finally, without additional parameter updates, we show that RAVEN can generalize to an external patient cohort under lossy clinical code mappings and feature coverage gaps.

Keywords: Foundation Models, Electronic Health Records, Generative Pretraining, Next-Visit Prediction, Scaling Laws, Zero-shot Prediction, Clinical Data, RAVEN
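
The evaluation pitfall described in the abstract above (repeated event tokens inflating metrics) can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not RAVEN's evaluation code: the event names, the set-based visit representation, and the "copy the history" baseline are all hypothetical.

```python
def split_new_and_repeated(history_visits, next_visit):
    """Split next-visit events into new onsets and repeats of earlier events."""
    seen = set().union(*history_visits)  # every event observed in any past visit
    new_onsets = {e for e in next_visit if e not in seen}
    repeated = {e for e in next_visit if e in seen}
    return new_onsets, repeated


def recall(predicted, actual):
    """Fraction of actual events that were predicted (NaN if there are none)."""
    return len(predicted & actual) / len(actual) if actual else float("nan")


# Hypothetical patient: two past visits, then a visit with one genuinely new diagnosis.
history = [{"hypertension", "diabetes"}, {"hypertension"}]
next_visit = {"hypertension", "diabetes", "ckd"}

new_onsets, repeated = split_new_and_repeated(history, next_visit)

# A trivial baseline that simply repeats everything seen so far.
copy_baseline = set().union(*history)

print("recall on all events:", recall(copy_baseline, next_visit))  # looks decent
print("recall on new onsets:", recall(copy_baseline, new_onsets))  # reveals no real skill
```

The baseline recovers two of three events overall but none of the new onsets, which is why the abstract argues that new onsets should be scored separately from subsequent occurrences.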

232. ❌ TuneShift-KD: Knowledge Distillation and Transfer for Fine-tuned Models

Authors: Yushi Guan, Jeanine Ohene-Agyei, Daniel Kwan, Jean Sebastien Dandurand, Yifei Zhang, Nandita Vijaykumar Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24518v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

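Scoring rationale: The paper's core is fine-tuning and knowledge distillation for large models (LLMs/Foundation Models), specifically transferring the domain knowledge learned by models fine-tuned with parameter-efficient methods (PEFT/LoRA) to new pre-trained models. It is therefore highly relevant (10) to 'Large Language Models', 'Post-training/SFT', and 'PEFT/LoRA'. It is partially relevant (5) to 'Pre-training/Domain Adaptation' because the knowledge transfer starts from pre-trained models. The paper does not address the specific techniques or application areas of the other keywords, which score 0.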

!!! tip deepseek-chat TL;DR

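The paper proposes TuneShift-KD, a new method that automatically distills domain expertise from a fine-tuned large model and transfers it efficiently to a new pre-trained model without the original training data; experiments show higher knowledge-transfer accuracy than existing methods.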

Abstract

To embed domain-specific or specialized knowledge into pre-trained foundation models, fine-tuning using techniques such as parameter efficient fine-tuning (e.g. LoRA) is a common practice. However, as new LLM architectures and pre-trained models emerge, transferring this specialized knowledge to newer models becomes an important task. In many scenarios, the original specialized data may be unavailable due to privacy or commercial restrictions, necessitating distillation and transfer of this specialized knowledge from the fine-tuned base model to a different pre-trained model. We present TuneShift-KD, a novel approach that automatically distills specialized knowledge from a fine-tuned model to a target model using only a few examples representative of the specialized information. Our key insight is that specialized knowledge can be identified through perplexity differences between base and fine-tuned models: prompts where the fine-tuned model responds confidently (low perplexity), but the base model struggles (high perplexity), indicate queries corresponding to the specialized knowledge learned by the fine-tuned model. TuneShift-KD leverages this insight to create a synthetic training dataset to transfer the specialized knowledge. Using an iterative process, TuneShift-KD generates more prompts similar to those that generated responses with specialized knowledge. TuneShift-KD does not require training discriminators or access to training datasets. It is an automated approach that only requires the initial fine-tuned and base models and a few representative prompts. Our experiments demonstrate that models fine-tuned using TuneShift-KD achieve higher accuracy than prior approaches, enabling ease of deployment and more effective transfer of the specialized knowledge.

Keywords: Knowledge Distillation, Fine-tuning, Parameter Efficient Fine-tuning, LoRA, Foundation Models, Domain Adaptation, Model Transfer, Perplexity Analysis
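
The perplexity-gap idea in the abstract above can be sketched as a simple filter. This is a hedged illustration only: the abstract does not say how the gap is thresholded, so the ratio test, the `min_ratio` value, and the function names are assumptions, and the per-token log-probabilities are made-up numbers standing in for real model outputs.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_specialized(prompts, base_logprobs, tuned_logprobs, min_ratio=2.0):
    """Keep prompts where the base model is much more perplexed than the fine-tuned one.

    `min_ratio` is a hypothetical threshold on base_ppl / tuned_ppl.
    """
    selected = []
    for prompt, base_lp, tuned_lp in zip(prompts, base_logprobs, tuned_logprobs):
        ratio = perplexity(base_lp) / perplexity(tuned_lp)
        if ratio >= min_ratio:
            selected.append((prompt, ratio))
    # Largest gap first: these are the likeliest carriers of specialized knowledge.
    return sorted(selected, key=lambda item: item[1], reverse=True)


prompts = ["domain question", "generic question"]
base = [[-2.0, -2.0], [-0.5, -0.5]]   # base model: unsure on the domain question
tuned = [[-0.5, -0.5], [-0.5, -0.5]]  # fine-tuned model: confident on both

for prompt, ratio in select_specialized(prompts, base, tuned):
    print(f"{prompt}: base/tuned perplexity ratio = {ratio:.2f}")
```

Only the domain question survives the filter; per the abstract, prompts like it would then seed the iterative generation of further synthetic training prompts.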

233. ❌ AVO: Agentic Variation Operators for Autonomous Evolutionary Search

Authors: Terry Chen, Zhifan Ye, Bing Xu, Zihao Ye, Timmy Liu, Ali Hassani, Tianqi Chen, Andrew Kerr, Haicheng Wu, Yang Xu, Yu-Jung Chen, Hanfeng Chen, Aditya Kane, Ronny Krashinsky, Ming-Yu Liu, Vinod Grover, Luis Ceze, Roger Bringmann, John Tran, Wei Liu, Fung Xie, Michael Lightstone, Humphrey Shi Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24517v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

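Scoring rationale: The paper proposes AVO (Agentic Variation Operators), which replaces the fixed mutation and crossover operations of traditional evolutionary search with autonomous coding agents, an innovative application of large models to optimization. Core relevant keywords: (1) 'LLM Agents/Autonomous Agents/Agentic Workflow' (10): the paper's core is an autonomous agent system; (2) 'KV Cache Compression/Linear Attention/FlashAttention' (10): the paper compares directly against FlashAttention-4 on attention kernel optimization; (3) 'Large Language Models/LLMs/Foundation Models' (8): a language model serves as the autonomous agent; (4) 'Self-Correction/Self-Improvement/Self-Reflection' (8): the agent can repair, critique, and verify its own edits; (5) 'Speculative Decoding/Inference Acceleration' (8): the optimized attention kernels accelerate inference; (6) 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' and 'System 2 Thinking/Slow Thinking/In-depth Reasoning' (5 each): the agent performs multi-step reasoning; (7) 'Tool Use/Function Calling/API Tool Use' (5): the agent uses execution feedback as a tool. The remaining keywords are unrelated to the paper or not covered by it.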

!!! tip deepseek-chat TL;DR

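The paper proposes AVO, an autonomous agentic evolutionary search method that replaces traditional evolutionary operators with language-model agents and optimizes attention kernels on NVIDIA Blackwell GPUs, outperforming cuDNN and FlashAttention-4.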

Abstract

Agentic Variation Operators (AVO) are a new family of evolutionary variation operators that replace the fixed mutation, crossover, and hand-designed heuristics of classical evolutionary search with autonomous coding agents. Rather than confining a language model to candidate generation within a prescribed pipeline, AVO instantiates variation as a self-directed agent loop that can consult the current lineage, a domain-specific knowledge base, and execution feedback to propose, repair, critique, and verify implementation edits. We evaluate AVO on attention, among the most aggressively optimized kernel targets in AI, on NVIDIA Blackwell (B200) GPUs. Over 7 days of continuous autonomous evolution on multi-head attention, AVO discovers kernels that outperform cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5% across the evaluated configurations. The discovered optimizations transfer readily to grouped-query attention, requiring only 30 minutes of additional autonomous adaptation and yielding gains of up to 7.0% over cuDNN and 9.3% over FlashAttention-4. Together, these results show that agentic variation operators move beyond prior LLM-in-the-loop evolutionary pipelines by elevating the agent from candidate generator to variation operator, and can discover performance-critical micro-architectural optimizations that produce kernels surpassing state-of-the-art expert-engineered attention implementations on today’s most advanced GPU hardware.

Keywords: Agentic Variation Operators, autonomous evolutionary search, attention kernel optimization, LLM agents, FlashAttention, GPU performance, autonomous coding agents, evolutionary algorithms

234. ❌ Towards Safe Learning-Based Non-Linear Model Predictive Control through Recurrent Neural Network Modeling

Authors: Mihaela-Larisa Clement, Mónika Farsang, Agnes Poks, Johannes Edelmann, Manfred Plöchl, Radu Grosu, Ezio Bartocci Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24503v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

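Scoring rationale: The paper studies neural network approximation of nonlinear model predictive control (NMPC), within control theory and reinforcement learning, and uses a recurrent neural network (RNN) to build a sequential policy. Its content is entirely unrelated to the scoring keywords (all centered on large language models, deep learning techniques, and their applications); it involves no large models, language models, alignment, reasoning, agents, compression, or AI-for-science applications.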

!!! tip deepseek-chat TL;DR

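The paper proposes a recurrent-neural-network-based sequential policy (Sequential-AMPC) to approximate nonlinear model predictive control, reducing expert-data requirements and online computation cost, and improves closed-loop safety and feasibility through a safety-augmented mechanism.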

Abstract

The practical deployment of nonlinear model predictive control (NMPC) is often limited by online computation: solving a nonlinear program at high control rates can be expensive on embedded hardware, especially when models are complex or horizons are long. Learning-based NMPC approximations shift this computation offline but typically demand large expert datasets and costly training. We propose Sequential-AMPC, a sequential neural policy that generates MPC candidate control sequences by sharing parameters across the prediction horizon. For deployment, we wrap the policy in a safety-augmented online evaluation and fallback mechanism, yielding Safe Sequential-AMPC. Compared to a naive feedforward policy baseline across several benchmarks, Sequential-AMPC requires substantially fewer expert MPC rollouts and yields candidate sequences with higher feasibility rates and improved closed-loop safety. On high-dimensional systems, it also exhibits better learning dynamics and performance in fewer epochs while maintaining stable validation improvement where the feedforward baseline can stagnate.

Keywords: Nonlinear Model Predictive Control, Neural Policy, Recurrent Neural Network, Sequential-AMPC, Safety-Augmented Control, Feasibility Rates, Closed-loop Safety, Learning-based NMPC

235. ❌ Uniform Laws of Large Numbers in Product Spaces

Authors: Ron Holzman, Shay Moran, Alexander Shlimovich Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24493v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

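Scoring rationale: The paper studies uniform laws of large numbers in probability theory and statistical learning theory. It is pure mathematical theory and entirely unrelated to the given keywords (all of which concern large models, deep learning techniques, and their applications). It focuses on VC dimension theory, distributional assumptions in product spaces, and the linear VC dimension, and involves no artificial intelligence, machine learning models, or related techniques.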

!!! tip deepseek-chat TL;DR

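The paper studies when uniform laws of large numbers hold in product spaces, proving that, when the distribution is absolutely continuous with respect to the product of its marginals, a family of events satisfies a uniform law of large numbers if and only if its linear VC dimension is finite.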

Abstract

Uniform laws of large numbers form a cornerstone of Vapnik–Chervonenkis theory, where they are characterized by the finiteness of the VC dimension. In this work, we study uniform convergence phenomena in Cartesian product spaces, under assumptions on the underlying distribution that are compatible with the product structure. Specifically, we assume that the distribution is absolutely continuous with respect to the product of its marginals, a condition that captures many natural settings, including product distributions, sparse mixtures of product distributions, distributions with low mutual information, and more. We show that, under this assumption, a uniform law of large numbers holds for a family of events if and only if the linear VC dimension of the family is finite. The linear VC dimension is defined as the maximum size of a shattered set that lies on an axis-parallel line, namely, a set of vectors that agree on all but at most one coordinate. This dimension is always at most the classical VC dimension, yet it can be arbitrarily smaller. For instance, the family of convex sets in $\mathbb{R}^d$ has linear VC dimension $2$, while its VC dimension is infinite already for $d\ge 2$. Our proofs rely on an estimator that departs substantially from the standard empirical mean estimator and exhibits more intricate structure. We show that such deviations from the standard empirical mean estimator are unavoidable in this setting. Throughout the paper, we propose several open questions, with a particular focus on quantitative sample complexity bounds.

Keywords: Uniform laws of large numbers, VC dimension, Product spaces, Linear VC dimension, Empirical mean estimator, Sample complexity, Distribution assumptions, Cartesian product
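
The abstract's example (convex sets in $\mathbb{R}^d$ have linear VC dimension $2$) can be checked by brute force on a small case. A convex set meets any line in an interval, so points on an axis-parallel line can be represented by their single free coordinate and the question reduces to shattering by intervals. The sketch below is an illustrative check under that reduction; the function names are hypothetical and the points are assumed to be distinct.

```python
from itertools import combinations


def interval_can_pick(points, subset):
    """True if some interval [a, b] contains exactly `subset` among `points`."""
    if not subset:
        return True  # an empty interval picks out no points
    lo, hi = min(subset), max(subset)
    # The tightest interval covering `subset` must not swallow any other point.
    return all((lo <= p <= hi) == (p in subset) for p in points)


def shattered_by_intervals(points):
    """True if every subset of `points` can be cut out by an interval."""
    pts = list(points)
    return all(
        interval_can_pick(pts, set(sub))
        for r in range(len(pts) + 1)
        for sub in combinations(pts, r)
    )


print(shattered_by_intervals([0.0, 1.0]))       # two collinear points can be shattered
print(shattered_by_intervals([0.0, 1.0, 2.0]))  # three cannot: {0, 2} always drags in 1
```

Two collinear points are shattered while three are not, which is consistent with the linear VC dimension of $2$ stated in the abstract, even though convex sets shatter arbitrarily large point sets that are not collinear.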

236. ❌ Composer 2 Technical Report

作者: Cursor Reseach, :, Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger, Andrew Zhai, Anurag Ajay, Ashvin Nair, Charlie Snell, Chen Lu, Chen Shen, Emily Jia, Federico Cassano, Hanpeng Liu, Haoyu Chen, Henry Wildermuth, Jacob Jackson, Janet Li, Jediah Katz, Jiajun Yao, Joey Hejna, Josh Warner, Julius Vering, Kevin Frans, Lee Danilek, Less Wright, Lujing Cen, Luke Melas-Kyriazi, Michael Truell, Michiel de Jong, Naman Jain, Nate Schmidt, Nathan Wang, Niklas Muennighoff, Oleg Rybkin, Paul Loh, Phillip Kravtsov, Rishabh Yadav, Sahil Shah, Sam Kottler, Alexander M Rush, Shengtong Zhang, Shomil Jain, Sriram Sankar, Stefan Heule, Stuart H. Sul, Sualeh Asif, Victor Rong, Wanqi Zhu, William Lin, Yuchen Wu, Yuri Volkov, Yury Zemlyanskiy, Zack Holbrook, Zhiyuan Zhang 期刊/来源: arxiv 发布日期: 2026-03-25 arXiv链接: http://arxiv.org/abs/2603.24477v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

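Scoring rationale: Composer 2 focuses on agentic software engineering and is an LLM specialized for coding. The core relevant keywords are LLMs (foundation models), Pre-training (continued pretraining), RLHF (large-scale reinforcement learning), Chain of Thought (multi-step reasoning), System 2 Thinking (in-depth reasoning), and LLM Agents (autonomous agents); these are the paper's core techniques and methods. Other keywords such as MoE, SLMs, RAG, and quantization are not mentioned in the abstract and therefore receive a relevance of 0.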
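The arithmetic behind the score line can be sketched as follows. The passing threshold (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration header. The per-keyword formula, weight × (relevance / 10) × cap, is an assumption made here for illustration; the report does not state it. Under that assumption, the weight of 0.0 listed in every table row yields a total of 0.0 even where relevance is 10.0/10.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # "max score per keyword" from the report header

def passing_score(total_weight, base=5.0, slope=0.8):
    """Passing threshold from the report header: 5.0 + 0.8 * total weight."""
    return base + slope * total_weight

def keyword_score(weight, relevance):
    """ASSUMED per-keyword score: weight * (relevance / 10) * cap."""
    return weight * (relevance / 10.0) * MAX_SCORE_PER_KEYWORD

# 27 keywords with weight 1.0 each, as listed in the configuration.
threshold = passing_score(27.0)

# Rows as they appear in the table for this paper: (weight, relevance).
rows = [(0.0, 10.0), (0.0, 0.0), (0.0, 10.0)]
total = sum(keyword_score(w, r) for w, r in rows)

print(round(threshold, 1), total, total >= threshold)
```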

!!! tip deepseek-chat TL;DR

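Composer 2 is a large language model designed for agentic software engineering. Trained with continued pretraining followed by large-scale reinforcement learning, it shows strong long-term planning and coding intelligence and achieves a major accuracy improvement on benchmarks such as CursorBench.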

Abstract translation (Chinese)

Composer 2 是一款专为智能体软件工程设计的专业模型。该模型展现出强大的长期规划与编码智能，同时保持高效解决交互式使用问题的能力。模型训练分为两个阶段：首先通过持续预训练提升模型的知识储备与潜在编码能力，随后进行大规模强化学习，以增强推理能力、精确的多步骤执行能力以及在长周期现实编码问题上的连贯性，从而提升端到端的编码性能。我们开发了与部署模型所用 Cursor 框架相匹配的基础设施，采用等效的工具和结构，并使用高度贴近真实问题的环境进行训练。为衡量模型在日益复杂任务上的能力，我们引入了一个基于大型代码库（包括我们自身的代码库）中真实软件工程问题衍生的基准测试。Composer 2 是一款前沿水平的编码模型，展示了训练强大领域专用模型的方法流程。在我们的 CursorBench 评估中，该模型相较于前代 Composer 模型（61.3分）实现了准确率的显著提升。在公开基准测试中，模型在我们的测试框架下于 Terminal-Bench 获得 61.7 分，在 SWE-bench Multilingual 获得 73.7 分，性能与当前最先进的系统相当。

Abstract

Composer 2 is a specialized model designed for agentic software engineering. The model demonstrates strong long-term planning and coding intelligence while maintaining the ability to efficiently solve problems for interactive use. The model is trained in two phases: first, continued pretraining to improve the model’s knowledge and latent coding ability, followed by large-scale reinforcement learning to improve end-to-end coding performance through stronger reasoning, accurate multi-step execution, and coherence on long-horizon realistic coding problems. We develop infrastructure to support training in the same Cursor harness that is used by the deployed model, with equivalent tools and structure, and use environments that match real problems closely. To measure the ability of the model on increasingly difficult tasks, we introduce a benchmark derived from real software engineering problems in large codebases including our own. Composer 2 is a frontier-level coding model and demonstrates a process for training strong domain-specialized models. On our CursorBench evaluations the model achieves a major improvement in accuracy compared to previous Composer models (61.3). On public benchmarks the model scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual in our harness, comparable to state-of-the-art systems.

Keywords: agentic software engineering, large language model, continued pretraining, reinforcement learning, long-term planning, coding intelligence, multi-step execution, domain-specialized model

237. ❌ Conformalized Transfer Learning for Li-ion Battery State of Health Forecasting under Manufacturing and Usage Variability

Authors: Samuel Filgueira da Silva, Mehmet Fatih Ozkan, Faissal El Idrissi, Marcello Canova Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24475v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

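Scoring rationale: The paper studies lithium-ion battery state-of-health forecasting, using an LSTM model combined with domain adaptation and uncertainty quantification. It is unrelated to most of the large-model keywords but highly relevant to "AI for Science" (8 points), since it applies AI to a scientific domain. It is moderately relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points) because it uses a domain adaptation technique (MMD) to mitigate domain shift. No other keyword is addressed, so the rest receive 0.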

!!! tip deepseek-chat TL;DR

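The paper proposes a transfer learning framework that combines an LSTM, domain adaptation, and uncertainty quantification to improve the generalization and trustworthiness of lithium-ion battery state-of-health forecasts across different manufacturing and usage conditions.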

Abstract translation (Chinese)

锂离子电池健康状态（SOH）的精确预测对于确保其安全可靠运行至关重要。然而，基于特定条件下实验室测试校准的现有模型，往往难以推广到因微小制造差异或不同运行条件而存在差异的新电池。为解决这一挑战，本文提出一种不确定性感知的迁移学习框架，该框架将长短期记忆网络（LSTM）模型与基于最大均值差异（MMD）的域适应方法以及通过保形预测（CP）进行的不确定性量化相结合。LSTM模型在一个旨在捕捉电极制造和运行条件真实世界变异性的虚拟电池数据集上进行训练。MMD通过对齐模拟域与目标域的潜在特征分布来缓解域偏移，而CP则提供经过校准的、无分布依赖的预测区间。该框架提升了SOH预测在不同电池间的泛化能力与可信度。

Abstract

Accurate forecasting of state-of-health (SOH) is essential for ensuring safe and reliable operation of lithium-ion cells. However, existing models calibrated on laboratory tests at specific conditions often fail to generalize to new cells that differ due to small manufacturing variations or operate under different conditions. To address this challenge, an uncertainty-aware transfer learning framework is proposed, combining a Long Short-Term Memory (LSTM) model with domain adaptation via Maximum Mean Discrepancy (MMD) and uncertainty quantification through Conformal Prediction (CP). The LSTM model is trained on a virtual battery dataset designed to capture real-world variability in electrode manufacturing and operating conditions. MMD aligns latent feature distributions between simulated and target domains to mitigate domain shift, while CP provides calibrated, distribution-free prediction intervals. This framework improves both the generalization and trustworthiness of SOH forecasts across heterogeneous cells.

Keywords: Lithium-ion battery, State of health forecasting, Transfer learning, Domain adaptation, Conformal prediction, LSTM, Uncertainty quantification, Manufacturing variability
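The abstract's "calibrated, distribution-free prediction intervals" can be illustrated with split conformal prediction. The sketch below is a generic version, not the authors' implementation: it assumes absolute residuals as the nonconformity score and a held-out calibration set, neither of which the abstract specifies. The function name and the synthetic SOH values are hypothetical.

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split conformal intervals built from absolute calibration residuals."""
    scores = np.sort(np.abs(np.asarray(cal_true, float) - np.asarray(cal_pred, float)))
    n = scores.size
    # Rank of the finite-sample conformal quantile: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # With too few calibration points the interval is unbounded.
    q = scores[k - 1] if k <= n else np.inf
    test_pred = np.asarray(test_pred, float)
    return test_pred - q, test_pred + q

# Synthetic stand-in for model forecasts and measured SOH (fractions of nominal capacity).
rng = np.random.default_rng(0)
cal_true = rng.uniform(0.8, 1.0, size=200)
cal_pred = cal_true + rng.normal(0.0, 0.01, size=200)
lower, upper = split_conformal_interval(cal_pred, cal_true, [0.92, 0.85], alpha=0.1)
print(lower, upper)
```

The interval half-width is a single quantile of the calibration residuals, so the coverage guarantee relies on exchangeability between calibration and test cells rather than on any distributional form.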

238. ❌ Learning Response-Statistic Shifts and Parametric Roll Episodes from Wave–Vessel Time Series via LSTM Functional Models

Authors: Jose del Aguila Ferrandis Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24431v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

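Scoring rationale: The paper uses LSTM models for ship-motion prediction and parametric roll analysis, an application of AI to engineering science. It does not involve large language models (LLMs), innovations in deep learning principles, or large-model techniques (MoE, scaling laws, fine-tuning, alignment, inference optimization, etc.). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI (an LSTM) to a ship-engineering problem; since this is not core bioinformatics or cheminformatics, it receives 5 points (some relevance). All other keywords are unrelated to the paper and receive 0.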

!!! tip deepseek-chat TL;DR

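The paper develops an LSTM-based data-driven surrogate that learns the nonlinear functional mapping from wave–vessel time series in order to predict parametric roll episodes and the associated shifts in response statistics. Using data generated with a numerical wave tank, it evaluates the model's time-domain accuracy and distributional fidelity.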

Abstract translation (Chinese)

参数横摇是一种罕见但后果严重的失稳现象，可能引发船舶响应的急剧状态变化，包括横摇统计特征与尾部风险的显著偏移。本文构建了一种数据驱动的代理模型，该模型能够学习从入射波浪-运动时序到船舶运动的非线性因果函数映射，并证明该代理模型可同时复现（i）参数横摇事件及（ii）响应中相关的统计特征偏移。关键在于，该学习框架对数据来源具有普适性：配对的波浪-运动时序可通过受控实验（例如，在船体存在时通过带有浪高仪和运动追踪系统的拖曳水池或试验池测试）获取，亦可在设计阶段通过高保真数值模拟（当实验条件尚不具备时）获得。为提供一种受控的恶劣海况验证，我们采用非定常雷诺平均纳维-斯托克斯（URANS）数值波浪水池生成训练数据，所用长峰不规则波由修正的皮尔逊-莫斯科维茨（Pierson–Moskowitz）谱合成。验证数据集包含三种海况各49组随机相位实现，均在固定航速下模拟，该航速被选定为可能引发参数横摇事件的遭遇条件。我们采用堆叠长短期记忆（LSTM）网络构建代理模型，以波面高程时间序列进行训练，并通过时域精度与分布保真度指标在预留数据集上进行评估。在最恶劣海况下，模型准确追踪了与参数激励一致的大幅值横摇的起始与发展过程，并捕捉到横摇概率密度函数（PDFs）的相应变化。我们进一步比较了损失函数的选择（均方误差、基于相对熵的目标函数及幅值加权变体），并展示了这些函数如何在平均误差与尾部保真度之间进行权衡，后者对船舶操作性与风险评估至关重要。

Abstract

Parametric roll is a rare but high-consequence instability that can trigger abrupt regime changes in ship response, including pronounced shifts in roll statistics and tail risk. This paper develops a data-driven surrogate that learns the nonlinear, causal functional mapping from incident wave–motion time series to vessel motions, and demonstrates that the surrogate reproduces both (i) parametric roll episodes and (ii) the associated statistical shifts in the response. Crucially, the learning framework is data-source agnostic: the paired wave–motion time series can be obtained from controlled experiments (e.g., towing-tank or basin tests with wave probes and motion tracking) when a hull exists, or from high-fidelity simulations during design when experiments are not yet available. To provide a controlled severe-sea demonstration, we generate training data with a URANS numerical wave tank, using long-crested irregular seas synthesized from a modified Pierson–Moskowitz spectrum. The demonstration dataset comprises 49 random-phase realizations for each of three sea states, simulated at a fixed forward speed selected to yield encounter conditions under which parametric-roll episodes can occur. A stacked LSTM surrogate is trained on wave-elevation time series and evaluated on held-out realizations using time-domain accuracy and distributional fidelity metrics. In the most severe case, the model tracks the onset and growth of large-amplitude roll consistent with parametric excitation, and captures the corresponding changes in roll probability density functions (PDFs). We further compare loss-function choices (MSE, relative-entropy-based objectives, and amplitude-weighted variants) and show how they trade average error for improved tail fidelity relevant to operability and risk assessment.

Keywords: parametric roll, LSTM, wave-vessel time series, data-driven surrogate, statistical shifts, URANS numerical wave tank, loss-function comparison, risk assessment
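The trade-off between average error and tail fidelity mentioned in the abstract can be illustrated by comparing plain MSE with an amplitude-weighted variant. The weighting below (weights proportional to |y|^power, normalized to mean 1) is an assumed form chosen for illustration; the abstract does not give the paper's exact definition.

```python
import numpy as np

def mse(y_true, y_pred):
    """Plain mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def amplitude_weighted_mse(y_true, y_pred, power=1.0, eps=1e-8):
    """MSE with weights that grow with |y_true| (assumed form, mean weight ~1)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    w = np.abs(y_true) ** power
    w = w / (w.mean() + eps)
    return float(np.mean(w * (y_true - y_pred) ** 2))

# Toy roll signal: small oscillations plus one large-amplitude episode.
t = np.linspace(0.0, 60.0, 600)
roll = 2.0 * np.sin(0.8 * t) + 20.0 * np.exp(-((t - 40.0) / 4.0) ** 2) * np.sin(0.4 * t)
missed_episode = 2.0 * np.sin(0.8 * t)  # hypothetical surrogate that misses the episode

print(mse(roll, missed_episode), amplitude_weighted_mse(roll, missed_episode))
```

A surrogate that misses the large-amplitude episode is penalized more heavily by the weighted loss, which is the kind of behavior that favors tail fidelity over average error.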

239. ❌ Marchuk: Efficient Global Weather Forecasting from Mid-Range to Sub-Seasonal Scales via Flow Matching

Authors: Arsen Kuzhamuratov, Mikhail Zhirnov, Andrey Kuznetsov, Ivan Oseledets, Konstantin Sobolev Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24428v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

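Scoring rationale: Marchuk uses a generative flow-matching model for global weather forecasting, which falls under AI for Science, so it has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). However, its core techniques are flow matching and autoregressive prediction. It does not involve large language models (LLMs), mixture of experts (MoE), small language models (SLMs), scaling laws, pre-training, post-training, instruction tuning, RLHF, parameter-efficient fine-tuning, retrieval-augmented generation, context window extension, KV cache compression, chain of thought, System 2 thinking, Monte Carlo tree search, self-correction, agents, tool use, multi-agent systems, quantization, speculative decoding, hallucination mitigation, mechanistic interpretability, world models, model merging, or in-context learning, so all other keywords receive 0.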

!!! tip deepseek-chat TL;DR

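The paper proposes Marchuk, a generative flow-matching model for global weather forecasting at mid-range to subseasonal horizons (roughly 15 to 30 days). With only 276 million parameters it achieves forecasting performance comparable to a much larger model while running significantly faster at inference.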

Abstract translation (Chinese)

准确的次季节天气预测仍然是一个重大挑战，这源于大气固有的混沌特性，该特性限制了传统模型在中长期（约15天以上）的预测能力。本文提出 Marchuk，一种用于全球天气预测的生成式潜空间流匹配模型，其预测范围覆盖中期至次季节尺度，预测时效可达30天。Marchuk 以当前天气图为条件，在已学习的潜空间内自回归地预测后续日期的天气图。我们用可训练的位置嵌入替代了旋转位置编码（RoPE），并扩展了时间上下文窗口，这些改进共同增强了模型在潜空间预测中表征和传递长程时间依赖关系的能力。Marchuk 具有两大关键优势：高计算效率和强大的预测性能。尽管其架构紧凑，仅包含2.76亿参数，该模型的性能却可与参数规模大得多（16亿参数）的 LaDCast 模型相媲美，同时推理速度显著更快。我们在以下地址开源了推理代码和模型：https://v-gen-ai.github.io/Marchuk/

Abstract

Accurate subseasonal weather forecasting remains a major challenge due to the inherently chaotic nature of the atmosphere, which limits the predictive skill of conventional models beyond the mid-range horizon (approximately 15 days). In this work, we present Marchuk, a generative latent flow-matching model for global weather forecasting spanning mid-range to subseasonal timescales, with prediction horizons of up to 30 days. Marchuk conditions on current-day weather maps and autoregressively predicts subsequent days’ weather maps within the learned latent space. We replace rotary positional encodings (RoPE) with trainable positional embeddings and extend the temporal context window, which together enhance the model’s ability to represent and propagate long-range temporal dependencies during latent forecasting. Marchuk offers two key advantages: high computational efficiency and strong predictive performance. Despite its compact architecture of only 276 million parameters, the model achieves performance comparable to LaDCast, a substantially larger model with 1.6 billion parameters, while operating at significantly higher inference speeds. We open-source our inference code and model at: https://v-gen-ai.github.io/Marchuk/

Keywords: weather forecasting, subseasonal forecasting, flow matching, generative model, latent space, autoregressive prediction, computational efficiency, global weather maps
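For readers unfamiliar with flow matching, the sketch below shows the training target and sampling loop for the commonly used linear interpolation path. It is a generic illustration in NumPy and makes no claim about Marchuk's actual path, conditioning, or network, which the abstract does not describe. All function names are hypothetical.

```python
import numpy as np

def flow_matching_batch(x0, x1, t):
    """Points on the linear path x_t = (1 - t) x0 + t x1 and target velocities."""
    t = t[:, None]
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # velocity of the linear path is constant in t
    return x_t, v_target

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted and target velocities."""
    x_t, v_target = flow_matching_batch(x0, x1, t)
    return float(np.mean((velocity_fn(x_t, t) - v_target) ** 2))

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full(x.shape[0], i * dt)
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check: the target is the source shifted by a constant vector,
# so the exact velocity field is that constant.
rng = np.random.default_rng(0)
shift = np.array([2.0, -1.0])
x0 = rng.normal(size=(8, 2))
x1 = x0 + shift
exact_velocity = lambda x, t: np.broadcast_to(shift, x.shape)

print(flow_matching_loss(exact_velocity, x0, x1, rng.uniform(size=8)))
print(np.allclose(euler_sample(exact_velocity, x0), x1))
```

In a real model `velocity_fn` would be a trained network operating on latent weather states, and the loss would be minimized over its parameters.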

240. ❌ Continuous-Time Learning of Probability Distributions: A Case Study in a Digital Trial of Young Children with Type 1 Diabetes

Authors: Antonio Álvarez-López, Marcos Matabuena Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24427v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

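Scoring rationale: The paper models continuous glucose monitoring data with a neural ODE and a probabilistic framework to analyze how glucose distributions in children with type 1 diabetes evolve over time. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agent systems, etc.), which refer specifically to large language models (LLMs) and related technology. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to biomedicine (diabetes monitoring), which falls under AI for Science, but it does not innovate with large models or deep learning techniques, relying instead on a conventional probabilistic model and a neural ODE. The relevance is therefore weak, and it receives 5 points.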

!!! tip deepseek-chat TL;DR

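The study proposes a neural-ODE-based probabilistic framework for modeling how the distribution of continuous glucose monitoring data from children with type 1 diabetes evolves over time. Applied to clinical trial data, it detects treatment-related improvements in glucose dynamics that are difficult to capture with traditional analyses.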

Abstract translation (Chinese)

理解生物标志物分布如何随时间演变是数字健康与慢性病监测领域的核心挑战。在糖尿病研究中，葡萄糖测量值分布的变化能够揭示传统统计指标所忽略的疾病进展模式与治疗反应规律。基于一项为期26周、比较闭环胰岛素输送系统t:slim X2与标准疗法在1型糖尿病儿童中疗效的临床试验，我们提出了一种概率框架，利用每五分钟采集一次的连续血糖监测数据（Continuous Glucose Monitoring, CGM）对时间索引分布的连续时间演化进行建模。我们将葡萄糖分布表示为高斯混合模型，其时变混合权重由神经常微分方程控制。我们采用基于最大平均差异的分布匹配准则来估计模型参数。该框架兼具可解释性、计算高效性以及对细微时序分布变化的敏感性。应用于CGM试验数据后，本方法检测到了传统分析方法难以捕捉的、与治疗相关的葡萄糖动态改善。

Abstract

Understanding how biomarker distributions evolve over time is a central challenge in digital health and chronic disease monitoring. In diabetes, changes in the distribution of glucose measurements can reveal patterns of disease progression and treatment response that conventional summary measures miss. Motivated by a 26-week clinical trial comparing the closed-loop insulin delivery system t:slim X2 with standard therapy in children with type 1 diabetes, we propose a probabilistic framework to model the continuous-time evolution of time-indexed distributions using continuous glucose monitoring (CGM) data collected every five minutes. We represent the glucose distribution as a Gaussian mixture, with time-varying mixture weights governed by a neural ODE. We estimate the model parameters using a distribution-matching criterion based on the maximum mean discrepancy. The resulting framework is interpretable, computationally efficient, and sensitive to subtle temporal distributional changes. Applied to CGM trial data, the method detects treatment-related improvements in glucose dynamics that are difficult to capture with traditional analytical approaches.

Keywords: continuous-time learning, probability distributions, type 1 diabetes, continuous glucose monitoring, neural ODE, Gaussian mixture, maximum mean discrepancy, digital health
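The two ingredients named in the abstract, mixture weights driven by an ODE and a maximum mean discrepancy (MMD) criterion, can be sketched in a few lines. Everything specific here is an assumption for illustration: the softmax parametrization of the weights, the Euler integrator, the fixed `drift` function standing in for a neural network, and the RBF kernel with unit bandwidth.

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between 1-D samples x and y (RBF kernel)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    def k(a, b):
        return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * bandwidth ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())

def mixture_weights(z0, drift, t_end, steps=100):
    """Euler-integrate logits dz/dt = drift(z, t), then apply a softmax."""
    z = np.asarray(z0, float).copy()
    dt = t_end / steps
    for i in range(steps):
        z = z + dt * drift(z, i * dt)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical drift that steadily favors the first mixture component.
drift = lambda z, t: np.array([1.0, 0.0])
print(mixture_weights([0.0, 0.0], drift, t_end=1.0))

rng = np.random.default_rng(0)
sample_a = rng.normal(0.0, 1.0, size=300)
sample_b = rng.normal(3.0, 1.0, size=300)
print(rbf_mmd2(sample_a, sample_a), rbf_mmd2(sample_a, sample_b))
```

In a fitting loop, samples drawn from the model's mixture at each time point would be compared with the observed glucose readings through `rbf_mmd2`, and the drift parameters adjusted to reduce it.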

241. ❌ Neural Network Models for Contextual Regression

Authors: Seksan Kiatsupaibul, Pakawan Chansiripas Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24400v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

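Scoring rationale: The paper proposes a neural network model for contextual regression (SCtxtNN) that separates context identification from context-specific regression to improve efficiency and interpretability. However, all scoring keywords concern large models, deep learning principles, or AI-for-science applications, whereas this paper studies an architectural improvement of conventional neural networks for regression. It does not touch on large models, LLMs, MoE, fine-tuning, alignment, reasoning, agents, compression, hallucination mitigation, interpretability, AI for science, or any other listed area. All keywords therefore receive a relevance of 0.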

!!! tip deepseek-chat TL;DR

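The paper proposes a simple contextual neural network (SCtxtNN) for contextual regression that separates context identification from regression to reduce the parameter count and improve interpretability. Experiments show lower error and more stable performance than feed-forward networks with a comparable number of parameters, while interpretability is preserved.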

Abstract translation (Chinese)

我们提出一种用于上下文回归的神经网络模型，其中回归模型依赖于决定活跃子模型的上下文特征，并给出相应的模型拟合算法。所提出的简单上下文神经网络（SCtxtNN）将上下文识别与上下文特定回归分离，形成了一种结构清晰、可解释的架构，其参数量少于全连接前馈网络。我们从数学上证明了该架构仅使用标准神经网络组件即可充分表示上下文线性回归模型。数值实验支持了理论结果，表明在参数量相当的情况下，所提模型相比前馈神经网络实现了更低的超额均方误差和更稳定的性能，而更大规模的网络仅能以增加复杂度为代价提升精度。这些结果表明，融入上下文结构能够在保持可解释性的同时提升模型效率。

Abstract

We propose a neural network model for contextual regression in which the regression model depends on contextual features that determine the active submodel and an algorithm to fit the model. The proposed simple contextual neural network (SCtxtNN) separates context identification from context-specific regression, resulting in a structured and interpretable architecture with fewer parameters than a fully connected feed-forward network. We show mathematically that the proposed architecture is sufficient to represent contextual linear regression models using only standard neural network components. Numerical experiments are provided to support the theoretical result, showing that the proposed model achieves lower excess mean squared error and more stable performance than feed-forward neural networks with comparable numbers of parameters, while larger networks improve accuracy only at the cost of increased complexity. The results suggest that incorporating contextual structure can improve model efficiency while preserving interpretability.

Keywords: contextual regression, neural network model, context identification, interpretable architecture, excess mean squared error, parameter efficiency, structured architecture, feed-forward network
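The separation of context identification from context-specific regression can be pictured as a gate that selects one linear submodel per sample. The sketch below is a hand-built toy, not the SCtxtNN architecture or its fitting algorithm: the threshold-based gate, the parameter shapes, and the function names are all assumptions.

```python
import numpy as np

def context_gate(c, thresholds):
    """One-hot context indicator from a scalar context feature (hypothetical rule)."""
    idx = np.searchsorted(thresholds, np.asarray(c, float), side="right")
    return np.eye(len(thresholds) + 1)[idx]

def contextual_linear_predict(x, c, weights, biases, thresholds):
    """Evaluate every linear submodel, then keep the one the gate selects."""
    gate = context_gate(c, thresholds)          # shape (n, K)
    per_context = x @ weights.T + biases        # shape (n, K)
    return np.sum(gate * per_context, axis=1)   # shape (n,)

# Two contexts split at c = 0, each with its own linear model.
thresholds = np.array([0.0])
weights = np.array([[1.0, 0.0],    # context 0: y = x_1
                    [0.0, 2.0]])   # context 1: y = 2 * x_2 + 1
biases = np.array([0.0, 1.0])
x = np.array([[3.0, 4.0], [3.0, 4.0]])
c = np.array([-1.0, 2.0])

print(contextual_linear_predict(x, c, weights, biases, thresholds))
```

The parameter count is K × (d + 1) for the submodels plus the gate's thresholds, which is what makes a structured model of this kind smaller than a fully connected network of similar expressiveness on such data.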

242. ❌ Federated fairness-aware classification under differential privacy

Authors: Gengyu Xue, Yi Yu Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24392v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

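Scoring rationale: The paper studies the joint problem of differential privacy and algorithmic fairness in federated learning, which belongs to conventional machine learning and does not involve large language models, deep learning principles, or AI-for-science applications. All keywords concern large models, deep learning techniques, or AI-for-science applications, whereas this paper focuses on classification under federated learning, privacy protection, and fairness constraints, so it is unrelated to all of them.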

!!! tip deepseek-chat TL;DR

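The paper studies the joint effect of differential privacy and algorithmic fairness on classification in a federated setting, proposes the FDP-Fair and CDP-Fair algorithms, and provides theoretical guarantees on privacy, fairness, and excess risk control.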

Abstract translation (Chinese)

隐私与算法公平性已成为现代机器学习中的两大核心议题。尽管二者各自作为快速发展的研究领域已受到广泛关注，但它们的联合影响仍相对缺乏深入探索。本文系统研究了联邦学习场景下差分隐私与公平性对分类问题的共同影响，其中数据分布于多个服务器之间。针对联邦差分隐私约束下的人口统计差异受限分类问题，我们提出了一种两步算法FDP-Fair。在仅存在单服务器的特殊场景中，我们进一步提出了一种简洁而高效的算法CDP-Fair，作为计算轻量化的替代方案。在温和的结构假设下，我们建立了隐私性、公平性及超额风险控制的理论保证。特别地，我们将隐私公平感知超额风险的来源解构为：a) 分类的固有成本，b) 隐私分类成本，c) 非隐私的公平性成本，以及d) 隐私的公平性成本。我们通过合成数据与真实数据集上的大量数值实验验证了理论发现，凸显了所设计算法的实用性。

Abstract

Privacy and algorithmic fairness have become two central issues in modern machine learning. Although each has separately emerged as a rapidly growing research area, their joint effect remains comparatively under-explored. In this paper, we systematically study the joint impact of differential privacy and fairness on classification in a federated setting, where data are distributed across multiple servers. Targeting demographic disparity constrained classification under federated differential privacy, we propose a two-step algorithm, namely FDP-Fair. In the special case where there is only one server, we further propose a simple yet powerful algorithm, namely CDP-Fair, serving as a computationally-lightweight alternative. Under mild structural assumptions, theoretical guarantees on privacy, fairness and excess risk control are established. In particular, we disentangle the source of the private fairness-aware excess risk into a) intrinsic cost of classification, b) cost of private classification, c) non-private cost of fairness and d) private cost of fairness. Our theoretical findings are complemented by extensive numerical experiments on both synthetic and real datasets, highlighting the practicality of our designed algorithms.

Keywords: federated learning, differential privacy, algorithmic fairness, classification, demographic disparity, excess risk, privacy-fairness trade-off, distributed data
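Two building blocks behind the abstract's terminology can be shown concretely: a demographic disparity measure and the Laplace mechanism for a counting query. This is not FDP-Fair or CDP-Fair, whose steps the abstract does not spell out. The disparity definition used here (absolute difference in positive-prediction rates between two groups) is a common convention and is assumed rather than taken from the paper.

```python
import numpy as np

def demographic_disparity(y_pred, group):
    """|P(y_hat = 1 | group = 1) - P(y_hat = 1 | group = 0)| for binary predictions."""
    y_pred, group = np.asarray(y_pred, float), np.asarray(group)
    rate1 = y_pred[group == 1].mean()
    rate0 = y_pred[group == 0].mean()
    return float(abs(rate1 - rate0))

def laplace_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query, which has sensitivity 1."""
    return float(true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon))

y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 1])
group = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(demographic_disparity(y_pred, group))

# A server could release a noisy count of positive predictions in group 1.
rng = np.random.default_rng(0)
print(laplace_count(int(y_pred[group == 1].sum()), epsilon=1.0, rng=rng))
```

Smaller `epsilon` means a larger noise scale, which is the source of the privacy cost terms that the paper separates from the fairness cost terms.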

243. ❌ On the Use of Bagging for Local Intrinsic Dimensionality Estimation

Authors: Kristóf Péter, Ricardo J. G. B. Campello, James Bailey, Michael E. Houle Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24384v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

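Scoring rationale: The paper improves statistical methods for local intrinsic dimensionality (LID) estimation, proposing a bagging-based ensemble method to reduce estimation variance. Its content focuses entirely on statistical estimation in data mining and machine learning and has no direct connection to any scoring keyword (all of which concern large models, deep learning principles, AI applications, etc.). The paper mentions no large models, deep learning, AI for Science, or related concepts, so every keyword receives 0.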

!!! tip deepseek-chat TL;DR

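The paper proposes a bagging-based ensemble method to improve local intrinsic dimensionality (LID) estimation. Theoretical analysis and experiments show that it can significantly reduce estimation variance and mean squared error, and that it can be combined with neighborhood smoothing for further gains.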

Abstract translation (Chinese)

局部本征维度（Local Intrinsic Dimensionality，LID）理论已成为刻画数据流形内部及流形间局部复杂性的重要工具，为一系列数据挖掘与机器学习任务提供支持。准确的LID估计需要从每个查询点周围的小邻域内抽取样本，以避免非局部效应和潜在流形混合带来的偏差，但此类邻域内有限的数据往往导致较高的估计方差。作为一种方差缩减策略，我们提出一种集成方法，利用子袋装（subbagging）技术来保持最近邻（Nearest Neighbor，NN）距离的局部分布。主要挑战在于，每个子样本中总样本量的均匀缩减会提高在查询点周围寻找固定数量k个最近邻的邻近阈值。因此，在LID估计的具体背景下，采样率与邻域大小之间存在一种额外的复杂相互作用：二者共同决定了样本量以及估计时所考虑的局部性与分辨率。我们从理论与实验两方面分析了采样率、用于LID估计的k-NN大小以及集成规模的选择如何影响性能，从而能够根据应用偏好对这些超参数进行有依据的先验选择。研究结果表明，在超参数空间内广泛且特征明确的区域中，相较于对应的非袋装基线方法，使用袋装估计器通常能显著降低方差以及均方误差，同时对偏差的影响可控。此外，我们提出并评估了将袋装法与邻域平滑技术相结合的不同方式，从而在LID估计性能上实现了进一步的显著提升。

Abstract

The theory of Local Intrinsic Dimensionality (LID) has become a valuable tool for characterizing local complexity within and across data manifolds, supporting a range of data mining and machine learning tasks. Accurate LID estimation requires samples drawn from small neighborhoods around each query to avoid biases from nonlocal effects and potential manifold mixing, yet limited data within such neighborhoods tends to cause high estimation variance. As a variance reduction strategy, we propose an ensemble approach that uses subbagging to preserve the local distribution of nearest neighbor (NN) distances. The main challenge is that the uniform reduction in total sample size within each subsample increases the proximity threshold for finding a fixed number k of NNs around the query. As a result, in the specific context of LID estimation, the sampling rate has an additional, complex interplay with the neighborhood size, where both combined determine the sample size as well as the locality and resolution considered for estimation. We analyze both theoretically and experimentally how the choice of the sampling rate and the k-NN size used for LID estimation, alongside the ensemble size, affects performance, enabling informed prior selection of these hyper-parameters depending on application-based preferences. Our results indicate that within broad and well-characterized regions of the hyper-parameters space, using a bagged estimator will most often significantly reduce variance as well as the mean squared error when compared to the corresponding non-bagged baseline, with controllable impact on bias. We additionally propose and evaluate different ways of combining bagging with neighborhood smoothing for substantial further improvements on LID estimation performance.

Keywords: Local Intrinsic Dimensionality, LID estimation, bagging, ensemble methods, variance reduction, nearest neighbor distances, hyper-parameter selection, neighborhood smoothing
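
The subbagging idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard maximum-likelihood (Hill-type) LID estimator as the base estimator, and the function names, the 2-D toy data, and the hyper-parameter values are all hypothetical.

```python
import math
import random


def mle_lid(knn_dists):
    """Maximum-likelihood LID estimate from the distances to the k nearest neighbors."""
    w = max(knn_dists)  # distance to the k-th (farthest) neighbor
    log_ratios = [math.log(d / w) for d in knn_dists if d > 0.0]
    return -len(log_ratios) / sum(log_ratios)


def knn_distances(query, points, k):
    """Distances from `query` to its k nearest neighbors in `points`."""
    return sorted(math.dist(query, p) for p in points)[:k]


def bagged_lid(query, points, k, sampling_rate, n_bags, rng):
    """Average the base estimate over `n_bags` subsamples drawn without replacement."""
    m = max(k, int(sampling_rate * len(points)))
    estimates = []
    for _ in range(n_bags):
        subsample = rng.sample(points, m)
        estimates.append(mle_lid(knn_distances(query, subsample, k)))
    return sum(estimates) / len(estimates)


# Toy data: uniform points in the unit square, so the intrinsic dimension is 2.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(2000)]
estimate = bagged_lid((0.5, 0.5), points, k=20, sampling_rate=0.5, n_bags=10, rng=rng)
print(f"bagged LID estimate: {estimate:.2f}")
```

Note how a lower `sampling_rate` widens the radius needed to collect the same `k` neighbors, which is the interplay between sampling rate and neighborhood size that the abstract highlights.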

244. ❌ Adaptive decision-making for stochastic service network design

Authors: Javier Duran Micco, Bilge Atasoy Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24369v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

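Scoring rationale: The paper studies an optimization problem in logistics service network design (SND), using simulated annealing, discrete-event simulation, and machine learning (an adaptive surrogate model) to solve a two-stage decision problem under uncertain travel times and limited truck resources. All scoring keywords target large language models (LLMs) and related techniques (such as MoE, SFT, RLHF, RAG, reasoning, agents, and quantization) or specific AI-for-science applications (such as bioinformatics). The paper belongs to operations research, logistics optimization, and conventional machine learning (surrogate models); it does not involve large language models, deep learning techniques, or a concrete AI for Science application, so the relevance of every keyword is 0.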

!!! tip deepseek-chat TL;DR

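For logistics service network design under uncertain travel times and limited truck resources, the paper proposes a two-stage optimization method that combines simulated annealing, discrete-event simulation, and a machine-learning surrogate model. It stays within about 5% of the benchmark algorithm's objective value while cutting computation time by up to 20 times.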

Abstract

This paper addresses the Service Network Design (SND) problem for a logistics service provider (LSP) operating in a multimodal freight transport network, considering uncertain travel times and limited truck fleet availability. A two-stage optimization approach is proposed, which combines metaheuristics, simulation and machine learning components. This solution framework integrates tactical decisions, such as transport request acceptance and capacity booking for scheduled services, with operational decisions, including dynamic truck allocation, routing, and re-planning in response to disruptions. A simulated annealing (SA) metaheuristic is employed to solve the tactical problem, supported by an adaptive surrogate model trained using a discrete-event simulation model that captures operational complexities and cascading effects of uncertain travel times. The performance of the proposed method is evaluated using benchmark instances. First, the SA is tested on a deterministic version of the problem and compared to state-of-the-art results, demonstrating it can improve the solution quality and significantly reduce the computational time. Then, the proposed SA is applied to the more complex stochastic problem. Compared to a benchmark algorithm that executes a full simulation for each solution evaluation, the learning-based SA generates high quality solutions while significantly reducing computational effort, achieving only a 5% difference in objective function value while cutting computation time by up to 20 times. These results demonstrate the strong performance of the proposed algorithm in solving complex versions of the SND. Moreover, they highlight the effectiveness of integrating diverse modeling and optimization techniques, and the potential of such approaches to efficiently address freight transport planning challenges.

Keywords: Service Network Design, stochastic optimization, simulated annealing, discrete-event simulation, adaptive surrogate model, freight transport, two-stage decision-making, computational efficiency
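
The surrogate-assisted search described in the abstract can be sketched roughly as below: simulated annealing sends most candidate evaluations to a cheap surrogate and pays for a full simulation only every few iterations, adding each simulated result to the surrogate's archive. Everything concrete here is an assumption for illustration: the toy `simulate` cost, the nearest-neighbor surrogate, and the fixed refresh schedule are hypothetical stand-ins, not the paper's models.

```python
import math
import random

TARGET = (1, 0, 1, 1, 0, 0, 1, 0)  # hypothetical best plan, used only by the toy cost


def simulate(plan):
    """Stand-in for an expensive discrete-event simulation (toy cost: Hamming distance)."""
    return sum(a != b for a, b in zip(plan, TARGET))


class NearestNeighborSurrogate:
    """Predicts the cost of a plan from the closest plan that was actually simulated."""

    def __init__(self):
        self.archive = []  # list of (plan, simulated cost)

    def add(self, plan, cost):
        self.archive.append((tuple(plan), cost))

    def predict(self, plan):
        nearest = min(self.archive, key=lambda e: sum(a != b for a, b in zip(e[0], plan)))
        return nearest[1]


def surrogate_sa(iters, refresh, rng, t0=2.0, cooling=0.99):
    """Simulated annealing that runs the full simulation only every `refresh` iterations."""
    plan = [rng.randint(0, 1) for _ in TARGET]
    cost = simulate(plan)
    model = NearestNeighborSurrogate()
    model.add(plan, cost)
    best, best_cost, n_sims = tuple(plan), cost, 1  # best is tracked over simulated plans only
    temp = t0
    for i in range(1, iters + 1):
        cand = plan[:]
        cand[rng.randrange(len(cand))] ^= 1  # flip one accept/reject decision
        if i % refresh == 0:  # periodically pay for a full simulation and grow the archive
            cand_cost = simulate(cand)
            model.add(cand, cand_cost)
            n_sims += 1
            if cand_cost < best_cost:
                best, best_cost = tuple(cand), cand_cost
        else:  # otherwise rely on the cheap surrogate prediction
            cand_cost = model.predict(cand)
        # Metropolis acceptance rule
        if cand_cost <= cost or rng.random() < math.exp((cost - cand_cost) / temp):
            plan, cost = cand, cand_cost
        temp *= cooling
    return best, best_cost, n_sims


best, best_cost, n_sims = surrogate_sa(iters=200, refresh=5, rng=random.Random(1))
print(best, best_cost, n_sims)
```

With `refresh=5`, the search performs 41 full simulations instead of 201, which is the kind of trade-off between evaluation cost and solution quality that the abstract reports.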

245. ❌ CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control

Authors: Yifeng Zhang, Harsh Goel, Peizhuo Li, Mehul Damani, Sandeep Chinchali, Guillaume Sartoretti Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24366v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

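Scoring rationale: The paper "CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control" applies multi-agent reinforcement learning (MARL) to traffic signal control. It proposes the CoordLight framework, comprising the Queue Dynamic State Encoding (QDSE) state representation and the Neighbor-aware Policy Optimization (NAPO) algorithm, to address partial observability and coordination challenges in decentralized settings. The work is unrelated to most keywords (large language models, training techniques, reasoning methods, and so on), which mainly concern natural language processing and large-model technology, whereas this paper is a MARL application in traffic control. The only relevant keyword is "Multi-agent Systems OR Agent Coordination", scored 10, because the paper's core is coordination in multi-agent systems, using an attention mechanism to promote coordinated decision-making among neighboring agents. No other keyword is addressed, so each is scored 0.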

!!! tip deepseek-chat TL;DR

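The paper proposes CoordLight, a multi-agent reinforcement learning framework that addresses coordination challenges in decentralized traffic signal control through a novel state encoding (QDSE) and a neighbor-aware policy optimization algorithm (NAPO), and reports superior performance over existing methods on three real-world traffic datasets.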

Abstract

Adaptive traffic signal control (ATSC) is crucial in alleviating congestion, maximizing throughput and promoting sustainable mobility in ever-expanding cities. Multi-Agent Reinforcement Learning (MARL) has recently shown significant potential in addressing complex traffic dynamics, but the intricacies of partial observability and coordination in decentralized environments still remain key challenges in formulating scalable and efficient control strategies. To address these challenges, we present CoordLight, a MARL-based framework designed to improve intra-neighborhood traffic by enhancing decision-making at individual junctions (agents), as well as coordination with neighboring agents, thereby scaling up to network-level traffic optimization. Specifically, we introduce the Queue Dynamic State Encoding (QDSE), a novel state representation based on vehicle queuing models, which strengthens the agents’ capability to analyze, predict, and respond to local traffic dynamics. We further propose an advanced MARL algorithm, named Neighbor-aware Policy Optimization (NAPO). It integrates an attention mechanism that discerns the state and action dependencies among adjacent agents, aiming to facilitate more coordinated decision-making, and to improve policy learning updates through robust advantage calculation. This enables agents to identify and prioritize crucial interactions with influential neighbors, thus enhancing the targeted coordination and collaboration among agents. Through comprehensive evaluations against state-of-the-art traffic signal control methods over three real-world traffic datasets composed of up to 196 intersections, we empirically show that CoordLight consistently exhibits superior performance across diverse traffic networks with varying traffic flows. The code is available at https://github.com/marmotlab/CoordLight

Keywords: Multi-Agent Reinforcement Learning, Traffic Signal Control, Decentralized Coordination, Queue Dynamic State Encoding, Neighbor-aware Policy Optimization, Attention Mechanism, Network-wide Optimization, Adaptive Traffic Control
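
The abstract does not specify NAPO's architecture, so the following is only a generic sketch of how an attention mechanism can weight neighboring agents: plain scaled dot-product attention, with hypothetical names and toy vectors. It should not be read as the CoordLight implementation, which is available at the repository linked in the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def neighbor_attention(query, neighbor_keys, neighbor_values):
    """Weight each neighbor by scaled dot-product similarity to the agent's own query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in neighbor_keys]
    weights = softmax(scores)
    dim = len(neighbor_values[0])
    context = [sum(w * v[j] for w, v in zip(weights, neighbor_values)) for j in range(dim)]
    return weights, context


# Toy example: the agent's query aligns with the first neighbor's key.
weights, context = neighbor_attention(
    query=[1.0, 0.0],
    neighbor_keys=[[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]],
    neighbor_values=[[10.0], [20.0], [30.0]],
)
print(weights, context)
```

The neighbor whose key is most similar to the query receives the largest weight, which is one way an agent could prioritize its most influential neighbors.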

246. ❌ A Neuro-Symbolic System for Interpretable Multimodal Physiological Signals Integration in Human Fatigue Detection

Authors: Mohammadreza Jamalifard, Yaxiong Lei, Parasto Azizinezhad, Javier Fumanal-Idocin, Javier Andreu-Perez Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24358v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

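Scoring rationale: The paper proposes a neuro-symbolic architecture that learns interpretable physiological concepts from eye-tracking and neural hemodynamic (fNIRS) signals and applies it to fatigue detection. Its core is an accurate, interpretable model with an emphasis on explainability (Explainable AI), which is highly relevant to "Mechanistic Interpretability OR Explainable AI" (10 points). The study is also an application of AI to science, specifically bioinformatics and physiological signal analysis, so it is related to "AI for Science OR Bioinformatics OR Cheminformatics" (8 points). However, the paper does not involve large language models (LLMs), innovations in deep learning techniques (such as MoE, scaling laws, training methods, inference optimization, or agents), or applications of large models in other domains, so all other keywords are scored 0.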

!!! tip deepseek-chat TL;DR

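The study proposes a neuro-symbolic system for human fatigue detection that learns interpretable concepts from multimodal physiological signals and combines them with differentiable reasoning rules. In leave-one-subject-out evaluation on 18 participants it reaches 72.1% ± 12.3% accuracy while exposing concept activations and rule firing strengths for inspection.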

Abstract

We propose a neuro-symbolic architecture that learns four interpretable physiological concepts, oculomotor dynamics, gaze stability, prefrontal hemodynamics, and multimodal, from eye-tracking and neural hemodynamics, functional near-infrared spectroscopy, (fNIRS) windows using attention-based encoders, and combines them with differentiable approximate reasoning rules using learned weights and soft thresholds, to address both rigid hand-crafted rules and the lack of subject-level alignment diagnostics. We apply this system to fatigue classification from multimodal physiological signals, a domain that requires models that are accurate and interpretable, with internal reasoning that can be inspected for safety-critical use. In leave-one-subject-out evaluation on 18 participants (560 samples), the method achieves 72.1% +/- 12.3% accuracy, comparable to tuned baselines while exposing concept activations and rule firing strengths. Ablations indicate gains from participant-specific calibration (+5.2 pp), a modest drop without the fNIRS concept (-1.2 pp), and slightly better performance with Lukasiewicz operators than product (+0.9 pp). We also introduce concept fidelity, an offline per-subject audit metric from held-out labels, which correlates strongly with per-subject accuracy (r=0.843, p < 0.0001).

Keywords: neuro-symbolic architecture, interpretable physiological concepts, multimodal physiological signals, fatigue detection, attention-based encoders, differentiable reasoning rules, concept fidelity, human fatigue classification
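
The ablation in the abstract compares Lukasiewicz and product operators for combining concepts in a rule. The sketch below shows the standard definitions of these two t-norms together with a sigmoid soft threshold. The rule shape, the concept names, and the numeric values are assumptions for illustration, and plain floats stand in for the learned, differentiable quantities of the actual system.

```python
import math


def soft_threshold(value, theta, sharpness=10.0):
    """Smooth stand-in for the crisp test `value > theta`, returning a degree in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-sharpness * (value - theta)))


def and_product(a, b):
    """Product t-norm: fuzzy AND as multiplication."""
    return a * b


def and_lukasiewicz(a, b):
    """Lukasiewicz t-norm: fuzzy AND as max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


def rule_strength(concepts, thresholds, t_norm):
    """Firing strength of the rule 'every listed concept exceeds its threshold'."""
    strength = 1.0  # 1.0 is the neutral element of both t-norms
    for name, theta in thresholds.items():
        strength = t_norm(strength, soft_threshold(concepts[name], theta))
    return strength


# Hypothetical concept activations and thresholds, for illustration only.
concepts = {"gaze_stability": 0.8, "prefrontal_hemodynamics": 0.55}
thresholds = {"gaze_stability": 0.5, "prefrontal_hemodynamics": 0.5}
print(rule_strength(concepts, thresholds, and_product))
print(rule_strength(concepts, thresholds, and_lukasiewicz))
```

For the same inputs the Lukasiewicz conjunction never exceeds the product conjunction, so the two operators yield different firing strengths for the same concept activations.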

247. ❌ CGRL: Causal-Guided Representation Learning for Graph Out-of-Distribution Generalization

Authors: Bowen Lu, Liangqiang Yang, Teng Li Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24304v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

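Scoring rationale: The paper focuses on causal representation learning and out-of-distribution generalization for graph neural networks (GNNs). It belongs to graph machine learning and does not involve large language models (LLMs), innovations in deep learning techniques, or applications of large models in other domains. All scoring keywords relate directly to large models, deep learning techniques, or AI applications in science, whereas this paper studies causal inference and generalization in GNNs, so it has no direct connection to them.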

!!! tip deepseek-chat TL;DR

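To address the poor generalization of graph neural networks on out-of-distribution data, the paper proposes a causal-guided representation learning method that blocks non-causal paths via backdoor adjustment and applies a loss replacement strategy, and it reports improved OOD generalization.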

Abstract

Graph Neural Networks (GNNs) have achieved impressive performance in graph-related tasks. However, they suffer from poor generalization on out-of-distribution (OOD) data, as they tend to learn spurious correlations. Such correlations present a phenomenon that GNNs fail to stably learn the mutual information between prediction representations and ground-truth labels under OOD settings. To address these challenges, we formulate a causal graph starting from the essence of node classification, adopt backdoor adjustment to block non-causal paths, and theoretically derive a lower bound for improving OOD generalization of GNNs. To materialize these insights, we further propose a novel approach integrating causal representation learning and a loss replacement strategy. The former captures node-level causal invariance and reconstructs graph posterior distribution. The latter introduces asymptotic losses of the same order to replace the original losses. Extensive experiments demonstrate the superiority of our method in OOD generalization and effectively alleviating the phenomenon of unstable mutual information learning.

Keywords: Graph Neural Networks, Out-of-Distribution Generalization, Causal Representation Learning, Backdoor Adjustment, Mutual Information, Node Classification, Causal Graph, Loss Replacement
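
For reference, the generic backdoor adjustment formula that the abstract alludes to is given below. The symbols are placeholders: $X$ is the treatment, $Y$ the outcome, and $Z$ a set of variables satisfying the backdoor criterion. The abstract does not state the paper's specific causal graph, so how these map onto node features and labels is left open here.

$$
P\big(Y \mid do(X = x)\big) = \sum_{z} P\big(Y \mid X = x, Z = z\big)\, P(Z = z)
$$

Averaging over $Z$ with its marginal distribution $P(Z = z)$, rather than with $P(Z = z \mid X = x)$, is what blocks the non-causal paths from $X$ to $Y$ that pass through $Z$.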

248. ❌ Connecting Meteorite Spectra to Lunar Surface Composition Using Hyperspectral Imaging and Machine Learning

Authors: Fatemeh Fazel Hesar, Mojtaba Raouf, Amirmohammad Chegeni, Peyman Soltani, Bernard Foing, Elias Chatzitheodoridis, Michiel J. A. de Dood, Fons J. Verbeek Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24323v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

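Scoring rationale: The paper uses conventional machine learning methods (SVM, K-means) and spectral imaging for lunar surface mineral classification, and does not involve large language models, deep learning architectures, training methods, inference optimization, alignment techniques, or agent systems. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies AI to planetary science; since it uses basic machine learning rather than large-model technology, it receives 5 points (some relevance). All other keywords are unrelated to the paper's content and are scored 0.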

!!! tip deepseek-chat TL;DR

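The study proposes a framework that combines laboratory hyperspectral imaging of a lunar meteorite, ground-based lunar hyperspectral imaging, and supervised machine learning (SVM) to generate high-fidelity lunar mineralogical maps. It achieves 93.7% classification accuracy on the Bechar010 meteorite and identifies olivine-rich (Highland-like) and pyroxene-rich (Mare-like) compositions across 10 pre-selected regions.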

Abstract

We present an innovative, cost-effective framework integrating laboratory Hyperspectral Imaging (HSI) of the Bechar010 Lunar meteorite with ground-based lunar HSI and supervised Machine Learning (ML) to generate high-fidelity mineralogical maps. A 3mm thin section of Bechar010 was imaged under a microscope with a 30mm focal length lens at 150mm working distance, using 6x binning to increase the signal-to-noise ratio, producing a data cube (X × Y × λ = 791 × 1024 × 224, 0.24mm × 0.2mm resolution) across 400-1000nm (224 bands, 2.7nm spectral sampling, 5.5nm full width at half maximum spectral resolution) using a Specim FX10 camera. Ground-based lunar HSI was captured with a Celestron 8SE telescope (3km/pixel), yielded a data cube (371 × 1024 × 224). Solar calibration was performed using a Spectralon reference (99% reflectance, <2% error) ensured accurate reflectance spectra. A Support Vector Machine (SVM) with a radial basis function kernel, trained on expert-labeled spectra, achieved 93.7% classification accuracy (5-fold cross-validation) for olivine (92% precision, 90% recall) and pyroxene (88% precision, 86% recall) in Bechar 010. LIME analysis identified key wavelengths (e.g., 485nm, 22.4% for M3; 715nm, 20.6% for M6) across 10 pre-selected regions (M1 to M10), indicating olivine-rich (Highland-like) and pyroxene-rich (Mare-like) compositions. SAM analysis revealed angles from 0.26 radian to 0.66 radian, linking M3 and M9 to Highlands and M6 and M10 to Mares. K-means clustering of Lunar data identified 10 mineralogical clusters (88% accuracy), validated against Chandrayaan-1 Moon mineralogy Mapper (M³) data (140m/pixel, 10nm spectral resolution). A novel push-broom HSI approach with a telescope achieves 0.8 arcsec resolution for lunar spectroscopy, inspiring full-sky multi-object spectral mapping.

Keywords: Hyperspectral Imaging, Machine Learning, Lunar Meteorite, Mineralogical Mapping, Support Vector Machine, Spectral Analysis, Planetary Science, Remote Sensing
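
The SAM (spectral angle mapping) analysis mentioned in the abstract reports angles in radians between spectra. The sketch below shows the standard spectral-angle computation and a nearest-reference assignment. The four-band spectra and the reference values are made-up numbers for illustration, not data from the paper.

```python
import math


def spectral_angle(spectrum, reference):
    """Angle in radians between two spectra treated as vectors (insensitive to brightness)."""
    dot = sum(s * r for s, r in zip(spectrum, reference))
    norm_s = math.sqrt(sum(s * s for s in spectrum))
    norm_r = math.sqrt(sum(r * r for r in reference))
    cosine = max(-1.0, min(1.0, dot / (norm_s * norm_r)))  # clamp rounding error
    return math.acos(cosine)


def classify_by_angle(spectrum, references):
    """Return the name of the reference spectrum with the smallest spectral angle."""
    return min(references, key=lambda name: spectral_angle(spectrum, references[name]))


# Hypothetical four-band reference spectra and one observed pixel.
references = {
    "olivine": [0.30, 0.45, 0.20, 0.35],
    "pyroxene": [0.25, 0.15, 0.40, 0.30],
}
pixel = [0.62, 0.88, 0.41, 0.69]
print(classify_by_angle(pixel, references))
```

Because the angle depends only on the direction of the spectrum vector, a pixel that is uniformly brighter or darker than a reference still matches it closely.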

249. ❌ Language-Assisted Image Clustering Guided by Discriminative Relational Signals and Adaptive Semantic Centers

Authors: Jun Ma, Xu Zhang, Zhengxing Jiao, Yaxin Hou, Hui Liu, Junhui Hou, Yuheng Jia Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24275v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

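Scoring rationale: The paper studies Language-Assisted Image Clustering (LAIC), using vision-language models (VLMs) to enrich image features, but does not involve large language models (LLMs) or innovations in deep learning techniques. The keywords all target LLM technology, training methods, inference optimization, alignment, agent systems, and similar topics, whereas this paper applies VLMs to image clustering and belongs to computer vision and multimodal learning, so it has no direct connection to the given large-model keywords.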

!!! tip deepseek-chat TL;DR

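The paper proposes a new language-assisted image clustering framework that exploits cross-modal relations to generate more discriminative self-supervision signals and learns category-wise continuous semantic centers via prompt learning, achieving an average improvement of 2.6% over state-of-the-art methods on eight benchmark datasets.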

Abstract

Language-Assisted Image Clustering (LAIC) augments the input images with additional texts with the help of vision-language models (VLMs) to promote clustering performance. Despite recent progress, existing LAIC methods often overlook two issues: (i) textual features constructed for each image are highly similar, leading to weak inter-class discriminability; (ii) the clustering step is restricted to pre-built image-text alignments, limiting the potential for better utilization of the text modality. To address these issues, we propose a new LAIC framework with two complementary components. First, we exploit cross-modal relations to produce more discriminative self-supervision signals for clustering, as it compatible with most VLMs training mechanisms. Second, we learn category-wise continuous semantic centers via prompt learning to produce the final clustering assignments. Extensive experiments on eight benchmark datasets demonstrate that our method achieves an average improvement of 2.6% over state-of-the-art methods, and the learned semantic centers exhibit strong interpretability. Code is available in the supplementary material.

Keywords: Language-Assisted Image Clustering, Vision-Language Models, Cross-modal Relations, Discriminative Self-supervision, Semantic Centers, Prompt Learning, Image Clustering, Multi-modal Learning

250. ❌ DeepDTF: Dual-Branch Transformer Fusion for Multi-Omics Anticancer Drug Response Prediction

Authors: Yuhan Zhao, Jacob Tennant, James Yang, Zhishan Guo, Young Whang, Ning Sui Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24265v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

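Scoring rationale: DeepDTF uses Transformer and GNN-Transformer architectures to predict anticancer drug response, which places it in AI for Science (bioinformatics in particular), so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). The paper uses SHAP for explanation, which gives it some connection to 'Mechanistic Interpretability OR Explainable AI' (5 points). The remaining keywords concern the principles, training methods, inference optimization, and agent systems of large language models (LLMs), whereas this paper studies a domain-specific deep learning model (Transformers over multi-omics and drug representations) rather than LLMs or related techniques, so they are rated completely unrelated (0 points).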

!!! tip deepseek-chat TL;DR

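The paper proposes DeepDTF, a dual-branch Transformer fusion framework that addresses the cross-modal alignment challenge in multi-omics anticancer drug response prediction and consistently outperforms strong baselines on public benchmarks in both regression and classification.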

Abstract (Chinese translation)

由于肿瘤的多层分子异质性，癌症药物反应在不同肿瘤间差异显著，这推动了对精准肿瘤学计算决策支持的需求。尽管深度癌症药物反应模型近期取得进展，但由于跨模态错位和有限的归纳偏置，高维多组学数据与化学结构药物之间的稳健对齐仍具挑战。本文提出DeepDTF，一种端到端双分支Transformer融合框架，用于联合log(IC50)回归和药物敏感性分类。细胞系分支采用模态特异性编码器处理多组学图谱，并通过Transformer模块捕获长程依赖关系；药物分支将化合物表示为分子图，利用GNN-Transformer编码器整合局部拓扑与全局上下文。组学与药物表征通过基于Transformer的融合模块进行交互建模，以缓解特征错位问题。在五折冷启动细胞系评估的公共药物基因组学基准测试中，DeepDTF在多种组学设置下均稳定优于强基线模型，使用完整多组学输入时达到RMSE=1.248、R^2=0.875和AUC=0.987的最佳性能，同时将分类错误率（1-ACC）降低9.5%。除预测精度外，DeepDTF通过基于SHAP的基因归因分析和预排序GSEA通路富集，提供了具有生物学依据的解释机制。

Abstract

Cancer drug response varies widely across tumors due to multi-layer molecular heterogeneity, motivating computational decision support for precision oncology. Despite recent progress in deep CDR models, robust alignment between high-dimensional multi-omics and chemically structured drugs remains challenging due to cross-modal misalignment and limited inductive bias. We present DeepDTF, an end-to-end dual-branch Transformer fusion framework for joint log(IC50) regression and drug sensitivity classification. The cell-line branch uses modality-specific encoders for multi-omics profiles with Transformer blocks to capture long-range dependencies, while the drug branch represents compounds as molecular graphs and encodes them with a GNN-Transformer to integrate local topology with global context. Omics and drug representations are fused by a Transformer-based module that models cross-modal interactions and mitigates feature misalignment. On public pharmacogenomic benchmarks under 5-fold cold-start cell-line evaluation, DeepDTF consistently outperforms strong baselines across omics settings, achieving up to RMSE=1.248, R^2=0.875, and AUC=0.987 with full multi-omics inputs, while reducing classification error (1-ACC) by 9.5%. Beyond accuracy, DeepDTF provides biologically grounded explanations via SHAP-based gene attributions and pathway enrichment with pre-ranked GSEA.

Keywords: multi-omics, drug response prediction, Transformer, GNN-Transformer, cross-modal fusion, precision oncology, SHAP, cold-start evaluation
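
The DeepDTF abstract reports RMSE, R^2, and classification error (1-ACC). As a reminder of how these metrics are usually defined, here is a minimal sketch in plain Python. The function names and toy values are illustrative assumptions and are not taken from the paper or its code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, e.g. for log(IC50) regression targets."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def error_rate(labels, predictions):
    """Classification error, i.e. 1 - accuracy."""
    wrong = sum(1 for a, b in zip(labels, predictions) if a != b)
    return wrong / len(labels)


# Toy values (hypothetical, not results from the paper).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
print(error_rate([1, 0, 1, 1], [1, 0, 0, 1]))
```

Lower RMSE and error rate are better, while R^2 closer to 1 is better.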

251. ❌ Forecasting with Guidance: Representation-Level Supervision for Time Series Forecasting

Authors: Jiacheng Wang, Liang Fan, Baihua Li, Luyan Zhang Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24262v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

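Scoring rationale: The paper proposes ReGuider, which uses a pretrained time series foundation model as a semantic teacher and improves time series forecasting through representation-level supervision. This is relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points) because the method relies explicitly on a pretrained model, used here as a teacher for embedding alignment. The other keywords mainly concern specific techniques such as large language models, reasoning, alignment, and compression, which are unrelated to the paper's core content of time series forecasting and representation learning, so they receive 0 points.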

!!! tip deepseek-chat TL;DR

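The paper targets the over-smoothed forecasts that arise when a time series encoder discards informative but extreme patterns. It proposes ReGuider, which applies representation-level supervision using the intermediate embeddings of a pretrained foundation model, so that the encoder learns more expressive temporal representations and forecasting accuracy improves.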

Abstract (Chinese translation)

当前，时间序列预测主要通过基于误差目标对深度学习架构进行端到端训练来实现。虽然这种方法能有效最小化平均损失，但它会导致编码器丢弃信息丰富但极端的模式，从而产生过于平滑的预测和难以捕捉显著动态的时间表征。为解决这一问题，我们提出了ReGuider——一种可无缝集成到任何预测架构中的插件式方法。ReGuider利用预训练的时间序列基础模型作为语义教师。在训练过程中，输入序列同时由目标预测模型和预训练模型处理。我们并非直接使用预训练模型的输出，而是提取其富含时间和语义信息的中间嵌入表示，并通过表征层面的监督将其与目标模型编码器的嵌入表示进行对齐。这一对齐过程使编码器能够学习更具表现力的时间表征，从而提升下游预测的准确性。在多种数据集和架构上进行的大量实验表明，我们的ReGuider能持续提升预测性能，验证了其有效性和普适性。

Abstract

Nowadays, time series forecasting is predominantly approached through the end-to-end training of deep learning architectures using error-based objectives. While this is effective at minimizing average loss, it encourages the encoder to discard informative yet extreme patterns. This results in smooth predictions and temporal representations that poorly capture salient dynamics. To address this issue, we propose ReGuider, a plug-in method that can be seamlessly integrated into any forecasting architecture. ReGuider leverages pretrained time series foundation models as semantic teachers. During training, the input sequence is processed together by the target forecasting model and the pretrained model. Rather than using the pretrained model’s outputs directly, we extract its intermediate embeddings, which are rich in temporal and semantic information, and align them with the target model’s encoder embeddings through representation-level supervision. This alignment process enables the encoder to learn more expressive temporal representations, thereby improving the accuracy of downstream forecasting. Extensive experimentation across diverse datasets and architectures demonstrates that our ReGuider consistently improves forecasting performance, confirming its effectiveness and versatility.

Keywords: time series forecasting, representation-level supervision, pretrained foundation models, encoder embeddings, temporal representations, ReGuider, semantic teachers, plug-in method
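
The ReGuider abstract describes aligning the target encoder's embeddings with a pretrained teacher's intermediate embeddings during training. One way to picture this is a forecasting loss plus an alignment term. The sketch below is an assumption-laden illustration: the cosine-distance form of the alignment term, the `align_weight` parameter, and all function names are hypothetical and are not specified in the abstract.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_alignment_loss(student_emb, teacher_emb):
    """1 - cosine similarity; assumes both embeddings are non-zero vectors."""
    dot = sum(s * t for s, t in zip(student_emb, teacher_emb))
    norm_s = math.sqrt(sum(s * s for s in student_emb))
    norm_t = math.sqrt(sum(t * t for t in teacher_emb))
    return 1.0 - dot / (norm_s * norm_t)


def guided_loss(pred, target, student_emb, teacher_emb, align_weight=0.5):
    """Forecast error plus a weighted representation-level alignment term.

    The teacher embedding is treated as a fixed target (assumption: the
    pretrained model is frozen and only the student encoder is updated).
    """
    forecast_term = mse(pred, target)
    align_term = cosine_alignment_loss(student_emb, teacher_emb)
    return forecast_term + align_weight * align_term


# Perfect forecast, but the student embedding is orthogonal to the teacher's,
# so only the alignment term contributes to the loss.
print(guided_loss([1.0, 2.0], [1.0, 2.0], [1.0, 0.0], [0.0, 1.0]))
```

In a real setup the two embeddings would typically have different dimensions, so a learned projection would be needed before comparing them; that step is omitted here.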

252. ❌ C-STEP: Continuous Space-Time Empowerment for Physics-informed Safe Reinforcement Learning of Mobile Agents

Authors: Guihlerme Daubt, Adrian Redder Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24241v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

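Scoring rationale: The paper addresses safe navigation in reinforcement learning (RL) and proposes a physics-informed Continuous Space-Time Empowerment (C-STEP) method for safe RL of mobile robots. The keywords all target large models, the principles of deep learning techniques, or AI applications in scientific domains, whereas this paper studies a conventional RL method and does not involve large models, deep learning, or AI applications in scientific fields such as bioinformatics or cheminformatics, so every keyword receives a relevance of 0.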

!!! tip deepseek-chat TL;DR

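The paper proposes physics-informed Continuous Space-Time Empowerment (C-STEP) for safe navigation of mobile robots in reinforcement learning. It designs an intrinsic reward function that jointly optimizes task completion and collision avoidance, and experiments show fewer collisions and reduced proximity to obstacles with only a marginal increase in travel time.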

Abstract (Chinese translation)

在复杂环境中实现安全导航仍是机器人强化学习领域的核心挑战。本文提出面向物理信息连续时空赋能（C-STEP）的安全强化学习方法，这是一种针对确定性连续域设计的、以智能体为中心的新型安全度量标准。该度量可通过增强正向导航奖励函数来构建物理信息内在奖励。该奖励机制融合智能体内部状态（如初始速度）与前向动力学特征，从而区分安全行为与风险行为。通过将C-STEP与导航奖励相结合，我们获得了一种能同步优化任务完成与碰撞规避的内在奖励函数。数值实验表明，该方法能显著减少碰撞次数、降低障碍物接近概率，且行程时间仅边际增加。总体而言，C-STEP为强化学习奖励塑造提供了一种可解释的、融合物理信息的创新路径，有助于提升自主移动机器人系统的安全性。

Abstract

Safe navigation in complex environments remains a central challenge for reinforcement learning (RL) in robotics. This paper introduces Continuous Space-Time Empowerment for Physics-informed (C-STEP) safe RL, a novel measure of agent-centric safety tailored to deterministic, continuous domains. This measure can be used to design physics-informed intrinsic rewards by augmenting positive navigation reward functions. The reward incorporates the agent's internal states (e.g., initial velocity) and forward dynamics to differentiate safe from risky behavior. By integrating C-STEP with navigation rewards, we obtain an intrinsic reward function that jointly optimizes task completion and collision avoidance. Numerical results demonstrate fewer collisions, reduced proximity to obstacles, and only marginal increases in travel time. Overall, C-STEP offers an interpretable, physics-informed approach to reward shaping in RL, contributing to safety for agentic mobile robotic systems.

Keywords: Safe Reinforcement Learning, Mobile Agents, Physics-informed, Intrinsic Reward, Collision Avoidance, Continuous Space-Time Empowerment, Navigation, Robotics

253. ❌ Identification of NMF by choosing maximum-volume basis vectors

Authors: Qianqian Qi, Zhongming Chen, Peter G. M. van der Heijden Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24227v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

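Scoring rationale: The paper studies mathematical methods for nonnegative matrix factorization (NMF) and proposes a new maximum-volume-constrained NMF framework, which belongs to classical machine learning and matrix factorization. It does not involve large language models, deep learning, AI for Science, or any technique named in the scoring keywords (such as MoE, RLHF, RAG, or quantization). The paper is therefore unrelated to every keyword, and all relevance scores are 0.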

!!! tip deepseek-chat TL;DR

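The paper observes that conventional minimum-volume-constrained NMF can fail on highly mixed data and can yield basis vectors that are hard to interpret. It proposes a new maximum-volume-constrained NMF framework, establishes an identifiability theorem, provides an estimation algorithm, and reports experiments that demonstrate its effectiveness.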

Abstract (Chinese translation)

在非负矩阵分解（NMF）中，最小体积约束非负矩阵分解是一种广泛使用的框架，其通过使基向量尽可能相似来识别NMF的解。这通常会导致系数矩阵具有稀疏性，即每一行都包含零元素。因此，对于高度混合的数据——此类稀疏性不成立的情况——最小体积约束NMF可能会失效。此外，最小体积约束NMF中估计的基向量可能难以解释，因为它们可能是真实基向量的混合。为解决这些局限性，本文提出了一种新的NMF框架，称为最大体积约束非负矩阵分解（maximum-volume-constrained NMF），其目标是使基向量尽可能区分开来。我们进一步建立了最大体积约束NMF的可识别性定理，并提供了一种估计算法。实验结果证明了所提方法的有效性。

Abstract

In nonnegative matrix factorization (NMF), minimum-volume-constrained NMF is a widely used framework for identifying the solution of NMF by making basis vectors as similar as possible. This typically induces sparsity in the coefficient matrix, with each row containing zero entries. Consequently, minimum-volume-constrained NMF may fail for highly mixed data, where such sparsity does not hold. Moreover, the estimated basis vectors in minimum-volume-constrained NMF may be difficult to interpret as they may be mixtures of the ground truth basis vectors. To address these limitations, in this paper we propose a new NMF framework, called maximum-volume-constrained NMF, which makes the basis vectors as distinct as possible. We further establish an identifiability theorem for maximum-volume-constrained NMF and provide an algorithm to estimate it. Experimental results demonstrate the effectiveness of the proposed method.

Keywords: Nonnegative Matrix Factorization, NMF, Maximum-volume-constrained NMF, Identifiability, Basis vectors, Sparsity, Highly mixed data, Algorithm
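
The NMF abstract contrasts making basis vectors "as similar as possible" (minimum volume) with making them "as distinct as possible" (maximum volume). A common way to quantify the volume spanned by a set of basis vectors is the determinant of their Gram matrix. The sketch below illustrates that quantity in plain Python. Whether the paper uses exactly this volume measure is an assumption, and the function names are hypothetical.

```python
def gram_matrix(basis):
    """Matrix of pairwise dot products between basis vectors."""
    return [[sum(x * y for x, y in zip(u, v)) for v in basis] for u in basis]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular: the vectors are linearly dependent
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def squared_volume(basis):
    """Squared volume of the parallelepiped spanned by the basis vectors."""
    return determinant(gram_matrix(basis))


similar = [[1.0, 0.9], [0.9, 1.0]]   # nearly parallel basis vectors
distinct = [[1.0, 0.0], [0.0, 1.0]]  # orthogonal basis vectors
print(squared_volume(similar), squared_volume(distinct))
```

Nearly parallel vectors span a small volume and orthogonal ones a large volume, which is the intuition behind the two opposite constraints.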

254. ❌ UniScale: Synergistic Entire Space Data and Model Scaling for Search Ranking

Authors: Liren Yu, Caiyuan Li, Feiyi Dong, Tao Zhang, Zhixuan Zhang, Dan Ou, Haihong Tang, Bo Zheng Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24226v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

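Scoring rationale: The paper explicitly mentions LLMs and scaling laws and proposes the UniScale framework, which co-optimizes data and architecture to break the performance bottleneck of model scaling, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'Scaling Laws AND Data Quality' (10 points each). Other keywords such as MoE, SLMs, SFT, RAG, and CoT are either not mentioned in the abstract or unrelated to the paper's core content, so they receive 0 points.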

!!! tip deepseek-chat TL;DR

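The paper addresses diminishing marginal returns from model scaling and complex data distributions in industrial search. It proposes the UniScale framework, which co-optimizes data (the ES³ system) and architecture (the HHSFT model) to break the performance bottleneck, and validates substantial gains in business metrics on a large-scale e-commerce search platform.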

Abstract (Chinese translation)

近期，大语言模型（LLM）的进展推动了工业搜索、广告和推荐系统中缩放定律研究的热潮。然而，现有方法主要聚焦于架构改进，忽视了数据与架构设计之间的关键协同作用。我们观察到，仅扩大模型参数会带来收益递减，即随着模型规模增加，性能的边际提升持续下降，并且由复杂异构数据分布引起的性能下降往往无法仅通过模型设计来弥补。本文提出UniScale以应对这些局限，这是一种新颖的协同设计框架，通过联合优化数据与架构以充分释放模型缩放的潜力。该框架包含两个核心部分：（1）ES$^3$（全空间样本系统），一个高质量的数据缩放系统，它通过基于层级标签归因构建的全局监督信号从域内请求上下文中，以及在搜索领域中与相似内容曝光环境下用户决策本质对齐的跨域样本中，扩展训练信号，超越了传统采样策略；（2）HHSFT（异构层级样本融合Transformer），一种新颖的架构，旨在通过异构层级特征交互和全空间用户兴趣融合，有效建模缩放数据的复杂异构分布并充分利用全空间用户行为数据，从而突破仅依赖结构调优模型的性能上限。在大规模真实世界电子商务搜索平台上的大量实验表明，UniScale通过数据与架构的协同设计实现了显著提升，并展现出清晰的缩放趋势，在关键业务指标上带来了实质性增益。

Abstract

Recent advances in Large Language Models (LLMs) have inspired a surge of scaling law research in industrial search, advertising, and recommendation systems. However, existing approaches focus mainly on architectural improvements, overlooking the critical synergy between data and architecture design. We observe that scaling model parameters alone exhibits diminishing returns, i.e., the marginal gain in performance steadily declines as model size increases, and that the performance degradation caused by complex heterogeneous data distributions is often irrecoverable through model design alone. In this paper, we propose UniScale to address these limitations. It is a novel co-design framework that jointly optimizes data and architecture to unlock the full potential of model scaling, and it includes two core parts: (1) ES$^3$ (Entire-Space Sample System), a high-quality data scaling system that expands the training signal beyond conventional sampling strategies from both intra-domain request contexts with a global supervised signal constructed by hierarchical label attribution and cross-domain samples aligning with the essence of user decision under similar content exposure environments in the search domain; and (2) HHSFT (Heterogeneous Hierarchical Sample Fusion Transformer), a novel architecture designed to effectively model the complex heterogeneous distribution of scaled data and to harness the entire space user behavior data with Heterogeneous Hierarchical Feature Interaction and Entire Space User Interest Fusion, thereby surpassing the performance ceiling of structure-only model tuning. Extensive experiments on a large-scale real-world e-commerce search platform demonstrate that UniScale achieves significant improvements through the synergistic co-design of data and architecture and exhibits clear scaling trends, delivering substantial gains in key business metrics.

Keywords: Large Language Models, scaling laws, data scaling, architecture design, search ranking, model scaling, heterogeneous data, synergistic co-design

255. ❌ IPatch: A Multi-Resolution Transformer Architecture for Robust Time-Series Forecasting

Authors: Aymane Harkati, Moncef Garouani, Olivier Teste, Julien Aligon, Mohamed Hamlich Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24207v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

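Scoring rationale: IPatch focuses on time series forecasting and proposes a multi-resolution Transformer architecture that combines point-wise and patch-wise representations. Although it uses a Transformer, its subject matter is unrelated to all scoring keywords, which center on the principles, training methods, inference optimization, alignment, and applications of large models. The paper is an architecture improvement specific to time series forecasting and does not involve large models or scientific-domain applications, so every keyword receives a relevance of 0.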

!!! tip deepseek-chat TL;DR

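The paper proposes IPatch, a multi-resolution Transformer architecture that integrates point-wise and patch-wise representations to improve the accuracy and robustness of multivariate time series forecasting.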

Abstract (Chinese translation)

多元时间序列的精确预测仍面临挑战，这需要同时捕捉短期波动与长期时间依赖性。基于Transformer的模型已成为一种强大方法，但其性能关键取决于时间数据的表示方式。传统的逐点表示保留了单个时间步信息，支持细粒度建模，但其计算成本较高，且在建模更广泛的上下文依赖关系时效果欠佳，限制了其处理长序列的可扩展性。分块表示将连续时间步聚合为紧凑的标记，以提高效率并建模局部时间动态，但通常会丢失细粒度时间细节，而这些细节对于波动或复杂时间序列的精确预测至关重要。我们提出IPatch，一种多分辨率Transformer架构，它整合了逐点标记与分块标记，在多个分辨率上建模时间信息。在7个基准数据集上的实验表明，与单一表示基线相比，IPatch持续提升了预测精度、对噪声的鲁棒性以及在不同预测时间跨度上的泛化能力。

Abstract

Accurate forecasting of multivariate time series remains challenging due to the need to capture both short-term fluctuations and long-range temporal dependencies. Transformer-based models have emerged as a powerful approach, but their performance depends critically on the representation of temporal data. Traditional point-wise representations preserve individual time-step information, enabling fine-grained modeling, yet they tend to be computationally expensive and less effective at modeling broader contextual dependencies, limiting their scalability to long sequences. Patch-wise representations aggregate consecutive steps into compact tokens to improve efficiency and model local temporal dynamics, but they often discard fine-grained temporal details that are critical for accurate predictions in volatile or complex time series. We propose IPatch, a multi-resolution Transformer architecture that integrates both point-wise and patch-wise tokens, modeling temporal information at multiple resolutions. Experiments on 7 benchmark datasets demonstrate that IPatch consistently improves forecasting accuracy, robustness to noise, and generalization across various prediction horizons compared to single-representation baselines.

Keywords: time-series forecasting, transformer architecture, multi-resolution, point-wise representation, patch-wise representation, multivariate time series, temporal dependencies, forecasting accuracy
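
The IPatch abstract distinguishes point-wise tokens (one per time step) from patch-wise tokens (several consecutive steps per token). The sketch below shows the two tokenizations on a toy univariate series. The non-overlapping patching, the dropped remainder, and the function names are assumptions for illustration; the abstract does not say how IPatch builds or combines its tokens.

```python
def point_tokens(series):
    """One token per time step: preserves fine-grained detail."""
    return [[x] for x in series]


def patch_tokens(series, patch_len):
    """One token per block of consecutive steps (non-overlapping).

    Trailing steps that do not fill a whole patch are dropped here;
    padding them instead would be an equally reasonable choice.
    """
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, patch_len)]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
points = point_tokens(series)      # 8 tokens of length 1
patches = patch_tokens(series, 4)  # 2 tokens of length 4

# A multi-resolution model would embed both token sets and attend over them;
# here they are only collected side by side.
multi_resolution = {"point": points, "patch": patches}
print(len(multi_resolution["point"]), len(multi_resolution["patch"]))
```

Since self-attention cost grows quadratically with the number of tokens, the patch view is much cheaper than the point view, which is the efficiency trade-off the abstract describes.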

256. ❌ Quantum Neural Physics: Solving Partial Differential Equations on Quantum Simulators using Quantum Convolutional Neural Networks

Authors: Jucai Zhai, Muhammad Abdullah, Boyang Chen, Fazal Chaudry, Paul N. Smith, Claire E. Heaney, Yanghua Wang, Jiansheng Xiang, Christopher C. Pain Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24196v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

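Scoring rationale: The paper focuses on a hybrid of quantum computing and classical convolutional neural networks (CNNs) for solving partial differential equations (PDEs), which is scientific computing. It is unrelated to the vast majority of keywords (large language models, training techniques, inference optimization, agents, and so on), since those all revolve around LLMs and their associated techniques. The only partially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI (specifically CNNs) to scientific computing (PDE solving), which falls under AI for Science in a broad sense, but its core is the combination of quantum simulation with classical CNNs rather than a typical bioinformatics or cheminformatics application, so it receives 5 points (some connection). All other keywords receive 0 points.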

!!! tip deepseek-chat TL;DR

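The paper proposes a new framework called 'Quantum Neural Physics', which maps the discretised operators of partial differential equations to quantum convolutional kernels and develops a hybrid quantum-classical CNN multigrid solver. The solver is validated on a quantum simulator for problems including the Poisson and diffusion equations, offering an exploratory path toward exponential memory compression and computational acceleration on future fault-tolerant quantum computers.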

Abstract (Chinese translation)

在科学计算中，将偏微分方程（PDEs）的数值离散格式构建为卷积神经网络（CNNs）中的未训练卷积层（部分研究称之为“神经物理”），已证明在GPU上执行基于物理的求解器具有良好效率。然而，经典的基于网格的方法在求解涉及数十亿自由度的问题时，仍面临计算瓶颈。为应对这一挑战，本文提出了一种名为“量子神经物理”的新颖框架，并开发了一种混合量子-经典CNN多重网格求解器（HQC-CNNMG）。该方法将解析确定的离散微分算子模板映射为无参数或未训练的量子卷积核。通过利用振幅编码、单位算符线性组合技术以及量子傅里叶变换，所得到的量子卷积算子可通过量子电路实现，其电路深度按O(log K)缩放，其中K表示编码输入块的大小。这些量子算子通过U-Net结构嵌入到经典的W循环多重网格中。该设计使得量子算子能够在分层求解器中无缝集成，同时保持经典多重网格方法的鲁棒性和收敛特性。
所提出的量子神经物理求解器在量子模拟器上针对泊松方程、扩散方程、对流-扩散方程以及不可压缩纳维-斯托克斯方程进行了验证。HQC-CNNMG的求解结果与传统求解方法高度吻合。这项工作建立了从离散物理方程到对数规模量子电路的映射，为未来容错量子计算机上的PDE求解器实现指数级内存压缩与计算加速提供了一条全新的探索路径。

Abstract

In scientific computing, the formulation of numerical discretisations of partial differential equations (PDEs) as untrained convolutional layers within Convolutional Neural Networks (CNNs), referred to by some as Neural Physics, has demonstrated good efficiency for executing physics-based solvers on GPUs. However, classical grid-based methods still face computational bottlenecks when solving problems involving billions of degrees of freedom. To address this challenge, this paper proposes a novel framework called ‘Quantum Neural Physics’ and develops a Hybrid Quantum-Classical CNN Multigrid Solver (HQC-CNNMG). This approach maps analytically-determined stencils of discretised differential operators into parameter-free or untrained quantum convolutional kernels. By leveraging amplitude encoding, the Linear Combination of Unitaries technique and the Quantum Fourier Transform, the resulting quantum convolutional operators can be implemented using quantum circuits with a circuit depth that scales as O(log K), where K denotes the size of the encoded input block. These quantum operators are embedded into a classical W-Cycle multigrid using a U-Net. This design enables seamless integration of quantum operators within a hierarchical solver whilst retaining the robustness and convergence properties of classical multigrid methods. The proposed Quantum Neural Physics solver is validated on a quantum simulator for the Poisson equation, diffusion equation, convection-diffusion equation and incompressible Navier-Stokes equations. The solutions of the HQC-CNNMG are in close agreement with those from traditional solution methods. This work establishes a mapping from discretised physical equations to logarithmic-scale quantum circuits, providing a new and exploratory path to exponential memory compression and computational acceleration for PDE solvers on future fault-tolerant quantum computers.

Keywords: Quantum Neural Physics, Partial Differential Equations, Quantum Convolutional Neural Networks, Hybrid Quantum-Classical Solver, Multigrid Solver, Quantum Simulator, Amplitude Encoding, Computational Acceleration
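
The Quantum Neural Physics abstract builds on the idea that a discretised differential operator can be written as an untrained convolution kernel. The sketch below shows only the classical version of that idea, using the standard 5-point Laplacian stencil on a unit-spaced grid. It does not attempt the quantum circuit construction, and the function name and test field are illustrative assumptions.

```python
# Standard 5-point finite-difference stencil for the 2D Laplacian (grid spacing 1).
LAPLACIAN_STENCIL = [
    [0.0, 1.0, 0.0],
    [1.0, -4.0, 1.0],
    [0.0, 1.0, 0.0],
]


def apply_stencil(grid, stencil):
    """Slide a fixed 3x3 stencil over the interior points of a 2D grid.

    The weights are set analytically and never trained. The kernel is not
    flipped, which makes no difference for a symmetric stencil like this one.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * (cols - 2) for _ in range(rows - 2)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i - 1][j - 1] = sum(
                stencil[di + 1][dj + 1] * grid[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
            )
    return out


# Test field u(x, y) = x^2 + y^2, whose Laplacian is 4 everywhere.
field = [[float(i * i + j * j) for j in range(5)] for i in range(5)]
print(apply_stencil(field, LAPLACIAN_STENCIL))
```

For a quadratic field the second-order stencil is exact, so every interior value equals 4.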

257. ❌ TsetlinWiSARD: On-Chip Training of Weightless Neural Networks using Tsetlin Automata on FPGAs

Authors: Shengyu Duan, Marcos L. L. Sartori, Rishad Shafik, Alex Yakovlev Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24186v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

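Scoring rationale: The paper studies on-chip training of weightless neural networks (WNNs), implemented with Tsetlin Automata on FPGAs. Its core is hardware-efficient machine learning algorithm and architecture design aimed at edge computing. The scoring keywords all revolve around large language models (LLMs) and their associated techniques (training methods, inference optimization, alignment, applications, and so on), while this paper involves neither LLMs nor deep learning techniques. It discusses a weightless neural network architecture that differs from conventional deep neural networks and belongs to ML hardware acceleration and edge AI, with no direct connection to the LLM technology stack covered by the keywords.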

!!! tip deepseek-chat TL;DR

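The paper proposes TsetlinWiSARD, an on-chip training method for weightless neural networks based on Tsetlin Automata. Its FPGA implementation trains over 1000x faster than the traditional WiSARD implementation and, compared with FPGA-based training accelerators for other ML algorithms, uses fewer resources with lower latency and power consumption.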

Abstract (Chinese translation)

对边缘计算适应性、隐私性和安全性日益增长的需求，持续推动着新一代具备片上训练与推理能力的机器学习算法的发展。无权重神经网络正是一种基于查找表的简单神经元结构原理的算法。与严重依赖乘积累加运算的深度神经网络相比，它具有架构优势，例如低延迟、低复杂度的推理。然而，传统的无权重神经网络依赖基于记忆的单次训练，这容易导致过拟合和精度下降，或需要繁琐的训练后调整，限制了其在高效片上训练中的有效性。
本研究提出TsetlinWiSARD，一种利用Tsetlin自动机实现概率化、反馈驱动学习的无权重神经网络训练方法。该方法通过迭代优化克服了WiSARD单次训练的过拟合问题，同时保持简单、连续的二元反馈以实现高效片上训练。我们方法的核心是基于现场可编程门阵列的训练架构，该架构在实现最先进精度的同时显著提升了硬件效率。与传统无权重神经网络的WiSARD实现相比，我们的方法训练速度提升超过1000倍。此外，与实现其他机器学习算法的基于现场可编程门阵列的训练加速器相比，我们展示了资源使用量减少22%、延迟降低93.3%、功耗降低64.2%的显著优势。

Abstract

Increasing demands for adaptability, privacy, and security at the edge have persistently pushed the frontiers for a new generation of machine learning (ML) algorithms with training and inference capabilities on-chip. The Weightless Neural Network (WNN) is one such algorithm, built on simple lookup-table-based neuron structures. As a result, it offers architectural benefits, such as low-latency, low-complexity inference, compared to deep neural networks that depend heavily on multiply-accumulate operations. However, traditional WNNs rely on memorization-based one-shot training, which either leads to overfitting and reduced accuracy or requires tedious post-training adjustments, limiting their effectiveness for efficient on-chip training. In this work, we propose TsetlinWiSARD, a training approach for WNNs that leverages Tsetlin Automata (TAs) to enable probabilistic, feedback-driven learning. It overcomes the overfitting of WiSARD’s one-shot training with iterative optimization, while maintaining simple, continuous binary feedback for efficient on-chip training. Central to our approach is a field programmable gate array (FPGA)-based training architecture that delivers state-of-the-art accuracy while significantly improving hardware efficiency. Our approach provides over 1000x faster training when compared with the traditional WiSARD implementation of WNNs. Further, we demonstrate 22% reduced resource usage, 93.3% lower latency, and 64.2% lower power consumption compared to FPGA-based training accelerators implementing other ML algorithms.

Keywords: Weightless Neural Network, Tsetlin Automata, on-chip training, FPGA, hardware efficiency, low-latency, low-power, edge computing
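
The abstract above credits Tsetlin Automata with providing probabilistic, feedback-driven learning. As a rough illustration only, the sketch below shows a generic two-action Tsetlin automaton driven by binary reward/penalty feedback. It is the textbook automaton, not the TsetlinWiSARD training rule or its FPGA architecture, neither of which is specified in the abstract. The class name, the number of states, the `train` helper, and the reward probabilities are all assumptions made for this example.

```python
import random


class TsetlinAutomaton:
    """Generic two-action Tsetlin automaton (illustrative, not TsetlinWiSARD).

    States 1..n select action 0; states n+1..2n select action 1.
    """

    def __init__(self, n_states_per_action: int = 3) -> None:
        self.n = n_states_per_action
        self.state = self.n  # start at the boundary, on the action-0 side

    def action(self) -> int:
        return 0 if self.state <= self.n else 1

    def reward(self) -> None:
        # Reinforce the current action: move away from the boundary.
        if self.action() == 0:
            self.state = max(1, self.state - 1)
        else:
            self.state = min(2 * self.n, self.state + 1)

    def penalize(self) -> None:
        # Weaken the current action: move toward (and possibly across) the boundary.
        if self.action() == 0:
            self.state += 1
        else:
            self.state -= 1


def train(automaton, reward_prob, steps, seed=0):
    """Drive the automaton with binary feedback from a hypothetical environment.

    reward_prob[a] is the assumed probability that action a is rewarded.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        if rng.random() < reward_prob[automaton.action()]:
            automaton.reward()
        else:
            automaton.penalize()
    return automaton


if __name__ == "__main__":
    ta = train(TsetlinAutomaton(), reward_prob={0: 0.1, 1: 0.9}, steps=1000)
    print("state:", ta.state, "action:", ta.action())
```

Each update needs only a one-bit feedback signal and an increment or decrement of a small counter, which is the kind of operation that maps naturally onto hardware.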

258. ❌ Walma: Learning to See Memory Corruption in WebAssembly

Authors: Oussama Draissi, Mark Günzel, Ahmad-Reza Sadeghi, Lucas Davi Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24167v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

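Scoring rationale: The paper studies WebAssembly memory safety, using a CNN to classify memory snapshots and detect memory corruption; it is an application of machine learning to systems security. All scoring keywords focus on the technical principles of large models and deep learning (such as LLM architecture, training methods, and inference optimization) or on applications in specific scientific domains (such as bioinformatics). This paper only applies a basic CNN to a specific security-engineering problem and involves no large-model techniques, no innovation in deep learning principles, and no scientific-domain application, so it is unrelated to every keyword.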

!!! tip deepseek-chat TL;DR

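The paper proposes Walma, a framework that uses a CNN to classify WebAssembly memory snapshots in order to detect memory corruption and external tampering. Experiments show that the method effectively detects memory corruption in applications with structured memory layouts, with coarse-grained boundary checks incurring as little as 1.07x overhead.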

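Abstract (translated)

WebAssembly's (Wasm) monolithic linear memory model invites memory corruption attacks, which can escalate to cross-site scripting in browsers or go unnoticed when a malicious host tampers with a module's state. Existing defenses rely on invasive binary instrumentation or custom runtimes and do not address runtime integrity verification under an adversarial host model. This paper presents Walma, a machine-learning-based framework for WebAssembly linear memory attestation that detects memory corruption and external tampering by classifying memory snapshots. Walma is evaluated on six applications affected by real-world CVEs, across three verification backends (cpu-wasm, cpu-tch, gpu) and three instrumentation policies. The results show that CNN-based classification effectively detects memory corruption in applications with structured memory layouts: coarse-grained boundary checks incur as little as 1.07x overhead, while fine-grained monitoring brings higher (1.5x to 1.8x) but predictable costs. By quantifying the accuracy and overhead trade-offs across deployment configurations, the study demonstrates the practical feasibility of ML-based memory attestation for WebAssembly.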

Abstract

WebAssembly’s (Wasm) monolithic linear memory model facilitates memory corruption attacks that can escalate to cross-site scripting in browsers or go undetected when a malicious host tampers with a module’s state. Existing defenses rely on invasive binary instrumentation or custom runtimes, and do not address runtime integrity verification under an adversarial host model. We present Walma, a framework for WebAssembly Linear Memory Attestation that leverages machine learning to detect memory corruption and external tampering by classifying memory snapshots. We evaluate Walma on six real-world CVE-affected applications across three verification backends (cpu-wasm, cpu-tch, gpu) and three instrumentation policies. Our results demonstrate that CNN-based classification can effectively detect memory corruption in applications with structured memory layouts, with coarse-grained boundary checks incurring as low as 1.07x overhead, while fine-grained monitoring introduces higher (1.5x–1.8x) but predictable costs. Our evaluation quantifies the accuracy and overhead trade-offs across deployment configurations, demonstrating the practical feasibility of ML-based memory attestation for WebAssembly.

Keywords: WebAssembly, memory corruption, machine learning, CNN classification, memory attestation, security framework, runtime integrity, tampering detection

Authors: Lukas Theiner, Maik Pfefferkorn, Yongpeng Zhao, Sebastian Hirt, Rolf Findeisen Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24138v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

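Scoring rationale: The paper studies a multi-modal Bayesian optimization framework that combines numerical data with human preferences to tune control policies (for example, trajectory planning for autonomous driving). Although it involves AI and optimization, its core is Bayesian optimization, Gaussian processes, and control systems; it does not touch large models, deep learning, language models, alignment, reasoning, agents, or the other keywords. All keywords relate directly to large models or deep learning techniques, whereas this paper belongs to traditional machine learning optimization, so the relevance of every keyword is 0.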

!!! tip deepseek-chat TL;DR

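The paper proposes a multi-fidelity, multi-modal Bayesian optimization framework that combines low-fidelity numerical data with high-fidelity human preferences to tune control policies efficiently. Applied to autonomous-vehicle trajectory planning, the method significantly reduces the need for experiments involving humans while effectively adapting to individual preferences.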

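Abstract (translated)

Manually tuning control policies to meet high-level objectives is often time-consuming. Bayesian optimization offers a data-efficient framework for automating this process through numerical evaluations of an objective function. However, many systems, especially those involving humans, must be optimized against subjective criteria. Preferential Bayesian optimization addresses this by learning from pairwise comparisons rather than quantitative measurements, but relying on preference data alone can be inefficient. The paper proposes a multi-fidelity, multi-modal Bayesian optimization framework that combines low-fidelity numerical data with high-fidelity human preferences. The approach uses Gaussian process surrogate models with both hierarchical autoregressive and non-hierarchical coregionalization-based structures, enabling efficient learning from mixed-modality data. The framework is demonstrated by tuning an autonomous vehicle's trajectory planner; the results show that combining numerical and preference data significantly reduces the need for experiments involving the human decision maker while effectively adapting driving style to individual preferences.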

Abstract

Tuning control policies manually to meet high-level objectives is often time-consuming. Bayesian optimization provides a data-efficient framework for automating this process using numerical evaluations of an objective function. However, many systems, particularly those involving humans, require optimization based on subjective criteria. Preferential Bayesian optimization addresses this by learning from pairwise comparisons instead of quantitative measurements, but relying solely on preference data can be inefficient. We propose a multi-fidelity, multi-modal Bayesian optimization framework that integrates low-fidelity numerical data with high-fidelity human preferences. Our approach employs Gaussian process surrogate models with both hierarchical, autoregressive and non-hierarchical, coregionalization-based structures, enabling efficient learning from mixed-modality data. We illustrate the framework by tuning an autonomous vehicle’s trajectory planner, showing that combining numerical and preference data significantly reduces the need for experiments involving the human decision maker while effectively adapting driving style to individual preferences.

Keywords: Bayesian optimization, human preferences, multi-modal learning, control policy tuning, autonomous vehicles, Gaussian process, trajectory planning, multi-fidelity optimization

260. ❌ On Gossip Algorithms for Machine Learning with Pairwise Objectives

Authors: Igor Colin, Aurélien Bellet, Stephan Clémençon, Joseph Salmon Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24128v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

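Scoring rationale: The paper studies gossip algorithms in distributed machine learning, focusing on pairwise objective functions (U-statistics). It belongs to traditional distributed optimization and statistical learning and is unrelated to all of the keywords, which center on large models, deep learning techniques, and their applications, so every keyword scores 0.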

!!! tip deepseek-chat TL;DR

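The paper studies gossip algorithms for pairwise objective functions (such as U-statistics), provides a theoretical convergence analysis in distributed networks, and identifies the graph properties that affect efficiency.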

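Abstract (translated)

In the IoT era, information is increasingly collected by connected smart sensors whose storage, communication, and computation capabilities keep growing yet remain limited. Whether because of privacy constraints or the structure of the distributed system, developing statistical learning methods for data shared over a network has become a central issue. Gossip-based algorithms have been developed for a wide range of statistical learning tasks, from data aggregation over sensor networks to decentralized multi-agent optimization. While the vast majority of work considers settings where the function to be estimated or optimized is a basic average of individual observations, this article investigates the case where the objective is pairwise in nature, taking the form of a U-statistic of degree two. Motivated by problems such as similarity learning, ranking, and clustering, the authors revisit gossip algorithms designed specifically for pairwise objective functions and build a comprehensive theoretical framework for their convergence. The analysis fills a gap in the literature by establishing the conditions under which these methods succeed and by identifying the graph properties that critically affect their efficiency. In particular, it provides a refined analysis of the convergence upper and lower bounds.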

Abstract

In the IoT era, information is more and more frequently picked up by connected smart sensors with increasing, though limited, storage, communication and computation abilities. Whether due to privacy constraints or to the structure of the distributed system, the development of statistical learning methods dedicated to data that are shared over a network is now a major issue. Gossip-based algorithms have been developed for the purpose of solving a wide variety of statistical learning tasks, ranging from data aggregation over sensor networks to decentralized multi-agent optimization. Whereas the vast majority of contributions consider situations where the function to be estimated or optimized is a basic average of individual observations, it is the goal of this article to investigate the case where the latter is of pairwise nature, taking the form of a U-statistic of degree two. Motivated by various problems such as similarity learning, ranking or clustering, we revisit gossip algorithms specifically designed for pairwise objective functions and provide a comprehensive theoretical framework for their convergence. This analysis fills a gap in the literature by establishing conditions under which these methods succeed, and by identifying the graph properties that critically affect their efficiency. In particular, a refined analysis of the convergence upper and lower bounds is performed.

Keywords: gossip algorithms, pairwise objectives, U-statistics, distributed learning, sensor networks, convergence analysis, decentralized optimization, statistical learning
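
For readers unfamiliar with the "U-statistic of degree two" mentioned in the abstract, the sketch below computes one centrally: the average of a symmetric kernel over all unordered pairs of observations. This is only the target quantity, not a gossip algorithm; the paper's decentralized estimators are not reproduced here. The function names and the sample data are assumptions for illustration. With the kernel h(a, b) = (a - b)^2 / 2, the U-statistic coincides with the unbiased sample variance, which gives a convenient check.

```python
from itertools import combinations
from statistics import variance


def u_statistic(sample, kernel):
    """Degree-two U-statistic: mean of kernel(a, b) over all unordered pairs."""
    pairs = list(combinations(sample, 2))
    if not pairs:
        raise ValueError("need at least two observations")
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def half_squared_difference(a, b):
    # Symmetric kernel whose U-statistic is the unbiased sample variance.
    return 0.5 * (a - b) ** 2


if __name__ == "__main__":
    data = [1.0, 2.0, 4.0, 7.0]  # hypothetical observations, one per sensor
    print(u_statistic(data, half_squared_difference))
    print(variance(data))  # same value, computed the usual way
```

The pairwise structure is what makes the decentralized setting harder than plain averaging: every node's observation must effectively be compared with every other node's, not just summed.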

261. ❌ Linear-Nonlinear Fusion Neural Operator for Partial Differential Equations

Authors: Heng Wu, Junjie Wang, Benzhuo Lu Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24143v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

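Scoring rationale: The paper studies neural operators for partial differential equations (PDEs), an application of deep learning to scientific computing (AI for Science). Its core contribution is a new network architecture (LNF-NO) that improves operator-learning efficiency by explicitly decoupling linear and nonlinear effects, applied to scientific problems such as the nonlinear Poisson-Boltzmann equation and multi-physics coupled systems. This gives it some connection to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), since the work applies AI to science, specifically computational mathematics and physical modeling. However, the paper does not involve large language models (LLMs), model training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or any other large-model technique. All other keywords are unrelated to the paper and therefore score 0.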

!!! tip deepseek-chat TL;DR

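The paper proposes a new network architecture, the Linear-Nonlinear Fusion Neural Operator (LNF-NO), which learns PDE operator mappings efficiently by explicitly decoupling linear and nonlinear effects. Across several benchmarks it trains faster than existing methods while achieving comparable or better accuracy.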

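Abstract (translated)

Neural operator learning directly constructs the mapping from the equation parameter space to the solution space, enabling efficient direct inference in practical applications without repeatedly solving partial differential equations (PDEs), an advantage that traditional numerical methods struggle to match. This work finds that explicitly decoupling linear and nonlinear effects within such operator mappings markedly improves learning efficiency. This leads to a novel network structure, the Linear-Nonlinear Fusion Neural Operator (LNF-NO), which models operator mappings through the multiplicative fusion of a linear component and a nonlinear component, yielding a lightweight and interpretable representation. The linear-nonlinear decoupling captures complex solution features efficiently at the operator level while maintaining stability and generality. LNF-NO naturally supports multiple functional inputs and applies to both regular grids and irregular geometries. Across a suite of PDE operator-learning benchmarks, including nonlinear Poisson-Boltzmann equations and multi-physics coupled systems, LNF-NO is typically substantially faster to train than Deep Operator Networks (DeepONet) and Fourier Neural Operators (FNO), while achieving comparable or better accuracy in most cases. On the tested 3D Poisson-Boltzmann case, LNF-NO attains the best accuracy among the compared models and trains approximately 2.7x faster than the 3D FNO baseline.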

Abstract

Neural operator learning directly constructs the mapping relationship from the equation parameter space to the solution space, enabling efficient direct inference in practical applications without the need for repeated solution of partial differential equations (PDEs) - an advantage that is difficult to achieve with traditional numerical methods. In this work, we find that explicitly decoupling linear and nonlinear effects within such operator mappings leads to markedly improved learning efficiency. This yields a novel network structure, namely the Linear-Nonlinear Fusion Neural Operator (LNF-NO), which models operator mappings via the multiplicative fusion of a linear component and a nonlinear component, thus achieving a lightweight and interpretable representation. This linear-nonlinear decoupling enables efficient capture of complex solution features at the operator level while maintaining stability and generality. LNF-NO naturally supports multiple functional inputs and is applicable to both regular grids and irregular geometries. Across a diverse suite of PDE operator-learning benchmarks, including nonlinear Poisson-Boltzmann equations and multi-physics coupled systems, LNF-NO is typically substantially faster to train than Deep Operator Networks (DeepONet) and Fourier Neural Operators (FNO), while achieving comparable or better accuracy in most cases. On the tested 3D Poisson-Boltzmann case, LNF-NO attains the best accuracy among the compared models and trains approximately 2.7x faster than a 3D FNO baseline.

Keywords: Neural Operator, Partial Differential Equations, Linear-Nonlinear Fusion, Operator Learning, Poisson-Boltzmann Equation, Multi-physics Systems, DeepONet, Fourier Neural Operator
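
The abstract describes LNF-NO only at a high level, as a "multiplicative fusion of a linear component and a nonlinear component". The toy layer below shows one possible reading of that phrase: an elementwise product of a linear map and a tanh-activated map of the same input. It is an assumption-laden sketch, not the LNF-NO architecture; the function names, the choice of tanh, and the weight shapes are all hypothetical.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fused_layer(x, linear_w, nonlinear_w, nonlinear_b):
    """Elementwise product of a linear branch and a nonlinear branch.

    linear branch:    L(x) = linear_w @ x
    nonlinear branch: N(x) = tanh(nonlinear_w @ x + nonlinear_b)
    output:           L(x) * N(x), elementwise
    """
    linear_part = matvec(linear_w, x)
    pre_activation = matvec(nonlinear_w, x)
    nonlinear_part = [math.tanh(z + b) for z, b in zip(pre_activation, nonlinear_b)]
    return [lin * non for lin, non in zip(linear_part, nonlinear_part)]


if __name__ == "__main__":
    x = [1.0, 2.0]  # hypothetical discretized input function
    out = fused_layer(
        x,
        linear_w=[[1.0, 0.0], [0.0, 1.0]],
        nonlinear_w=[[0.5, 0.0], [0.0, 0.5]],
        nonlinear_b=[0.0, 0.0],
    )
    print(out)
```

In this reading, the linear branch carries the part of the response that scales with the input, while the nonlinear branch acts as a bounded, input-dependent gate on it.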

262. ❌ Likelihood hacking in probabilistic program synthesis

Authors: Jacek Karwowski, Younesse Kaddar, Zihuiwen Ye, Nikolay Malkin, Sam Staton Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24126v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

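Scoring rationale: The paper studies the "likelihood hacking" problem that arises when language models are trained with reinforcement learning to generate probabilistic programs. It is highly related to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (10 points), because its core concerns RL training of language models; related to "Large Language Models OR LLMs OR Foundation Models" (8 points), because the study is built on language models; and somewhat related to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because it involves probabilistic programming for scientific modeling. Other keywords, such as MoE, SFT, and RAG, are unrelated to the paper (0 points).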

!!! tip deepseek-chat TL;DR

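The paper studies "likelihood hacking", a failure that arises when language models are trained with reinforcement learning to generate probabilistic programs: the model inflates its reward by producing unnormalised programs instead of fitting the data better. It proposes theoretical safety conditions and validates them in practice.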

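Abstract (translated)

When language models are trained with reinforcement learning (RL) to write probabilistic programs, they can artificially inflate their marginal-likelihood reward by producing programs whose data distribution fails to normalise, rather than by fitting the data better. The authors call this failure likelihood hacking (LH). They formalise LH in a core probabilistic programming language (PPL) and give sufficient syntactic conditions for preventing it, proving that a safe language fragment $\mathcal{L}_{\text{safe}}$ satisfying these conditions cannot produce likelihood-hacking programs. Empirically, GRPO-trained models generating PyMC code discover LH exploits within the first few training steps, driving violation rates well above the untrained-model baseline. The conditions of $\mathcal{L}_{\text{safe}}$ are implemented as $\texttt{SafeStan}$, an LH-resistant modification of Stan, which is shown empirically to prevent LH under optimisation pressure. These results show that language-level safety constraints are both theoretically grounded and effective in practice for automated Bayesian model discovery.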

Abstract

When language models are trained by reinforcement learning (RL) to write probabilistic programs, they can artificially inflate their marginal-likelihood reward by producing programs whose data distribution fails to normalise instead of fitting the data better. We call this failure likelihood hacking (LH). We formalise LH in a core probabilistic programming language (PPL) and give sufficient syntactic conditions for its prevention, proving that a safe language fragment $\mathcal{L}_{\text{safe}}$ satisfying these conditions cannot produce likelihood-hacking programs. Empirically, we show that GRPO-trained models generating PyMC code discover LH exploits within the first few training steps, driving violation rates well above the untrained-model baseline. We implement $\mathcal{L}_{\text{safe}}$’s conditions as $\texttt{SafeStan}$, an LH-resistant modification of Stan, and show empirically that it prevents LH under optimisation pressure. These results show that language-level safety constraints are both theoretically grounded and effective in practice for automated Bayesian model discovery.

Keywords: language models, reinforcement learning, probabilistic programming, likelihood hacking, marginal-likelihood reward, Bayesian model discovery, SafeStan
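
To make the failure mode in the abstract concrete, the toy example below contrasts a properly normalised Gaussian density with a "hacked" one that adds a constant bonus to the log-density. The hacked version scores a higher log-likelihood on the same data without fitting it any better, because its total mass is no longer 1. This is a simplified illustration in plain Python, not PyMC or Stan code and not one of the exploits reported in the paper; the function names, the bonus value, and the data are assumptions.

```python
import math


def normal_logpdf(x, mu=0.0, sigma=1.0):
    """Log-density of a normalised Gaussian."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def hacked_logdensity(x, bonus=math.log(10.0)):
    """Same shape as the Gaussian plus a constant bonus: no longer normalised."""
    return normal_logpdf(x) + bonus


def total_mass(logdensity, lo=-10.0, hi=10.0, steps=20000):
    """Trapezoid-rule estimate of the integral of exp(logdensity) over [lo, hi]."""
    h = (hi - lo) / steps
    ys = [math.exp(logdensity(lo + i * h)) for i in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


def log_likelihood(logdensity, data):
    return sum(logdensity(x) for x in data)


if __name__ == "__main__":
    data = [-0.3, 0.1, 0.8, -1.2]  # hypothetical observations
    print("honest mass:", total_mass(normal_logpdf))
    print("hacked mass:", total_mass(hacked_logdensity))
    print("honest log-lik:", log_likelihood(normal_logpdf, data))
    print("hacked log-lik:", log_likelihood(hacked_logdensity, data))
```

A reward based purely on the likelihood value cannot tell these two programs apart in the right direction, which is why the paper argues for constraints at the language level instead.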

263. ❌ Mixed-signal implementation of feedback-control optimizer for single-layer Spiking Neural Networks

Authors: Jonathan Haag, Christian Metzner, Dmitrii Zendrikov, Giacomo Indiveri, Benjamin Grewe, Chiara De Luca, Matteo Saponati Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24113v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

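Scoring rationale: The paper studies a hardware implementation of on-chip learning for spiking neural networks (SNNs), specifically a feedback-control optimizer on a mixed-signal neuromorphic processor. All scoring keywords relate to large language models (LLMs), the technical principles of deep learning, or applications of AI in science, whereas this paper focuses on neuromorphic computing and the hardware implementation of spiking neural networks and has no direct connection to LLMs, deep learning techniques, or AI for Science applications. The relevance of every keyword is therefore 0.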

!!! tip deepseek-chat TL;DR

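The paper implements a feedback-control optimizer on a mixed-signal neuromorphic processor for on-chip learning in single-layer spiking neural networks, and shows on binary classification and the Yin-Yang problem that its performance matches numerical simulations and gradient-based baselines.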

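Abstract (translated)

On-chip learning is key to scalable and adaptive neuromorphic systems, yet existing training methods are either difficult to implement in hardware or overly restrictive. Recent studies, however, show that feedback-control optimizers can enable expressive on-chip training of neuromorphic devices. This work presents a proof-of-concept implementation of such a feedback-control optimizer on a mixed-signal neuromorphic processor. The approach is evaluated in an In-The-Loop (ITL) training setup on a binary classification task and on the nonlinear Yin-Yang problem, demonstrating on-chip training performance that matches numerical simulations and gradient-based baselines. The results highlight the feasibility of feedback-driven online learning under realistic mixed-signal constraints and represent a co-design path toward embedding such rules directly in silicon for autonomous and adaptive neuromorphic computing.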

Abstract

On-chip learning is key to scalable and adaptive neuromorphic systems, yet existing training methods are either difficult to implement in hardware or overly restrictive. However, recent studies show that feedback-control optimizers can enable expressive, on-chip training of neuromorphic devices. In this work, we present a proof-of-concept implementation of such feedback-control optimizers on a mixed-signal neuromorphic processor. We assess the proposed approach in an In-The-Loop (ITL) training setup on both a binary classification task and the nonlinear Yin-Yang problem, demonstrating on-chip training that matches the performance of numerical simulations and gradient-based baselines. Our results highlight the feasibility of feedback-driven, online learning under realistic mixed-signal constraints, and represent a co-design approach toward embedding such rules directly in silicon for autonomous and adaptive neuromorphic computing.

Keywords: on-chip learning, neuromorphic systems, feedback-control optimizer, mixed-signal neuromorphic processor, spiking neural networks, In-The-Loop training, binary classification, Yin-Yang problem

264. ❌ Toward a Multi-Layer ML-Based Security Framework for Industrial IoT

Authors: Aymen Bouferroum, Valeria Loscri, Abderrahim Benslimane Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24111v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

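Scoring rationale: The paper focuses on a machine-learning-based security framework for the Industrial Internet of Things (IIoT), covering trust models, network condition prediction, and defense against adversarial attacks. All scoring keywords relate directly to large language models (LLMs), the technical principles of deep learning, or scientific AI applications, whereas this paper involves no LLMs, deep learning model techniques, or bioinformatics or cheminformatics applications and only uses traditional machine learning methods to address IIoT security, so the relevance of every keyword is 0.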

!!! tip deepseek-chat TL;DR

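Addressing the security challenges of the Industrial Internet of Things (IIoT), the paper proposes a lightweight machine-learning-based security framework. Its Trust Convergence Acceleration (TCA) approach integrates network condition prediction into the trust model, reducing convergence time by up to 28.6% while staying robust to adversarial behaviors, and the paper also designs a real-world deployment architecture based on open-source hardware.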

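Abstract (translated)

The Industrial Internet of Things (IIoT) increasingly integrates resource-constrained devices into critical industrial processes, which introduces serious security challenges. Existing security approaches typically address threats at a single network layer, often rely on expensive hardware, and remain confined to simulation environments. This paper presents the research framework and contributions of a doctoral thesis that aims to develop a lightweight, Machine Learning (ML)-based security framework for IIoT environments. It first describes the adoption of the Tm-IIoT trust model and the Hybrid IIoT (H-IIoT) architecture as foundational baselines, then introduces the Trust Convergence Acceleration (TCA) approach, the primary contribution, which integrates ML to predict and mitigate the impact of degraded network conditions on trust convergence, reducing convergence time by up to 28.6% while maintaining robustness against adversarial behaviors. It then proposes a real-world deployment architecture based on affordable, open-source hardware, designed to implement and extend the security framework. Finally, it outlines ongoing research toward multi-layer attack detection, including physical-layer threat identification and considerations for robustness against adversarial ML attacks.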

Abstract

The Industrial Internet of Things (IIoT) introduces significant security challenges as resource-constrained devices become increasingly integrated into critical industrial processes. Existing security approaches typically address threats at a single network layer, often relying on expensive hardware and remaining confined to simulation environments. In this paper, we present the research framework and contributions of our doctoral thesis, which aims to develop a lightweight, Machine Learning (ML)-based security framework for IIoT environments. We first describe our adoption of the Tm-IIoT trust model and the Hybrid IIoT (H-IIoT) architecture as foundational baselines, then introduce the Trust Convergence Acceleration (TCA) approach, our primary contribution that integrates ML to predict and mitigate the impact of degraded network conditions on trust convergence, achieving up to a 28.6% reduction in convergence time while maintaining robustness against adversarial behaviors. We then propose a real-world deployment architecture based on affordable, open-source hardware, designed to implement and extend the security framework. Finally, we outline our ongoing research toward multi-layer attack detection, including physical-layer threat identification and considerations for robustness against adversarial ML attacks.

Keywords: Industrial IoT, security framework, machine learning, trust convergence, adversarial robustness, lightweight, multi-layer attack detection, real-world deployment

265. ❌ The impact of sensor placement on graph-neural-network-based leakage detection

Authors: J. J. H. van Gemert, V. Breschi, D. R. Yntema, K. J. Keesman, M. Lazar Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24076v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

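Scoring rationale: The paper studies how sensor placement affects graph-neural-network (GNN)-based leakage detection and proposes a sensor placement method based on PageRank centrality. The keywords relate directly to large models (LLMs) or the technical principles of deep learning, whereas this paper focuses on GNNs in a specific engineering application (water distribution networks) and does not touch core topics such as large-model techniques, training methods, inference optimization, alignment, or agent systems. The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI (a GNN) to a science and engineering domain (water network management), which counts as AI applied to science but is neither bioinformatics nor cheminformatics, so it receives 5 points (some connection). All other keywords are unrelated and score 0.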

!!! tip deepseek-chat TL;DR

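The paper studies how sensor placement affects the performance of GNN-based leakage detection in water distribution networks, proposes a new sensor placement method based on PageRank centrality, and shows that the method substantially affects reconstruction, prediction, and leakage detection.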

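Abstract (translated)

Sensor placement for leakage detection in water distribution networks is an important and practical challenge for water utilities. Recent work has shown that graph neural networks can estimate and predict pressures and detect leaks, but their performance depends strongly on the available sensor measurements and configurations. This paper investigates how sensor placement influences the performance of GNN-based leakage detection. It proposes a novel sensor placement method based on PageRank centrality and demonstrates that the method substantially impacts reconstruction, prediction, and leakage detection on the EPANET Net1 model.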

Abstract

Sensor placement for leakage detection in water distribution networks is an important and practical challenge for water utilities. Recent work has shown that graph neural networks can estimate and predict pressures and detect leaks, but their performance strongly depends on the available sensor measurements and configurations. In this paper, we investigate how sensor placement influences the performance of GNN-based leakage detection. We propose a novel PageRank-Centrality-based sensor placement method and demonstrate that it substantially impacts reconstruction, prediction, and leakage detection on the EPANET Net1.

Keywords: sensor placement, graph neural networks, leakage detection, water distribution networks, PageRank centrality, EPANET, reconstruction, prediction
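The abstract does not say how PageRank centrality is turned into a sensor layout, so the following is only a minimal sketch under one plausible assumption: score every junction with PageRank and place sensors at the `k` highest-scoring junctions. The toy graph and the function names `pagerank` and `place_sensors` are hypothetical; this is neither EPANET Net1 nor the authors' code.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on an undirected graph {node: [neighbors]}."""
    nodes = sorted(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new_rank = {}
        for v in nodes:
            # Each neighbor u splits its current rank evenly among its neighbors.
            incoming = sum(rank[u] / len(adj[u]) for u in adj[v])
            new_rank[v] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


def place_sensors(adj, k):
    """Assumed placement rule: pick the k nodes with the highest PageRank."""
    rank = pagerank(adj)
    return sorted(rank, key=lambda v: (-rank[v], v))[:k]


# Hypothetical 6-junction toy network (an assumption, not EPANET Net1).
toy_network = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "D"],
    "D": ["B", "C", "E"],
    "E": ["D", "F"],
    "F": ["E"],
}

ranks = pagerank(toy_network)
sensors = place_sensors(toy_network, 2)
print(sensors)
```

A real study would build `adj` from the pipe topology of the network model and compare the resulting layout against other placements on the downstream GNN metrics.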

266. ❌ Causality-Driven Disentangled Representation Learning in Multiplex Graphs

Authors: Saba Nasiri, Selin Aviyente, Dorina Thanou Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24105v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

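Scoring rationale: The paper focuses on graph representation learning, specifically causal inference and representation disentanglement in multiplex graphs. Although it involves self-supervised learning and interpretability (giving it partial relevance to "Mechanistic Interpretability OR Explainable AI"), its core content has no direct connection to the principles, applications, or innovations of the large-model and deep learning techniques the keywords target. It does not mention language models, training methods, reasoning techniques, agent systems, or AI-for-science applications, so every keyword except the interpretability one scores 0.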

!!! tip deepseek-chat TL;DR

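This paper proposes CaDeM, a causal-inference-based framework that disentangles shared and layer-specific information in multiplex graphs in a self-supervised manner, improving the robustness and interpretability of graph representation learning.

Abstract (Chinese translation)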

从多重图（即节点通过多种关系类型交互的多层网络）中学习表征具有挑战性，因为共享（公共）信息和层特定（私有）信息相互纠缠，这限制了模型的泛化能力和可解释性。本研究提出了一种基于因果推断的框架，以自监督的方式解耦公共与私有成分。该框架CaDeM同时（i）对齐各层间的共享嵌入，（ii）强制私有嵌入捕获层特定信号，并（iii）应用后门调整确保公共嵌入仅捕捉全局信息，且与私有表征分离。在合成数据集和真实数据集上的实验表明，该方法相较于现有基线模型取得了持续改进，突显了其在实现鲁棒且可解释的多重图表征学习方面的有效性。

Abstract

Learning representations from multiplex graphs, i.e., multi-layer networks where nodes interact through multiple relation types, is challenging due to the entanglement of shared (common) and layer-specific (private) information, which limits generalization and interpretability. In this work, we introduce a causal inference-based framework that disentangles common and private components in a self-supervised manner. CaDeM jointly (i) aligns shared embeddings across layers, (ii) enforces private embeddings to capture layer-specific signals, and (iii) applies backdoor adjustment to ensure that the common embeddings capture only global information while being separated from the private representations. Experiments on synthetic and real-world datasets demonstrate consistent improvements over existing baselines, highlighting the effectiveness of our approach for robust and interpretable multiplex graph representation learning.

Keywords: multiplex graphs, disentangled representation learning, causal inference, self-supervised learning, backdoor adjustment, graph representation learning, interpretability
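The abstract names backdoor adjustment without describing how CaDeM applies it to embeddings. As background only, here is a minimal sketch of the standard backdoor identity, P(Y | do(X=x)) = Σ_z P(Y | X=x, Z=z) P(Z=z), on a discrete toy problem. The probability tables and the function name `backdoor_adjust` are hypothetical and are not taken from the paper.

```python
def backdoor_adjust(p_y_given_xz, p_z, x):
    """Return P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y_given_xz[(x, z)] * prob for z, prob in p_z.items())


# Hypothetical toy numbers: Z is a binary confounder of X and Y.
p_z = {0: 0.7, 1: 0.3}
p_y_given_xz = {
    (0, 0): 0.1,  # P(Y=1 | X=0, Z=0)
    (0, 1): 0.5,  # P(Y=1 | X=0, Z=1)
    (1, 0): 0.2,  # P(Y=1 | X=1, Z=0)
    (1, 1): 0.6,  # P(Y=1 | X=1, Z=1)
}

do_x1 = backdoor_adjust(p_y_given_xz, p_z, 1)
do_x0 = backdoor_adjust(p_y_given_xz, p_z, 0)
causal_effect = do_x1 - do_x0  # effect of X on Y with Z averaged out
print(do_x1, do_x0, causal_effect)
```

Averaging over the marginal P(Z) rather than P(Z | X) is what removes the confounder's influence. In the paper's setting the adjustment is presumably applied at the representation level, which this toy does not attempt to reproduce.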

267. ❌ Hierarchical Spatial-Temporal Graph-Enhanced Model for Map-Matching

Authors: Anjun Gao, Zhenglin Wan, Pingfu Chao, Shunyu Yao Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24054v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

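Scoring rationale: The paper "Hierarchical Spatial-Temporal Graph-Enhanced Model for Map-Matching" focuses on trajectory data processing and map-matching, proposing a deep learning model (HSTGMatch) that combines hierarchical self-supervised learning with spatial-temporal supervised learning. Although it is deep learning research for a specific application (trajectory analysis), all of the given keywords relate directly to large language model (LLM) techniques, LLM training and alignment methods, inference optimization, agent systems, or AI-for-science applications. The paper involves no LLM technique, principle, or application, and neither uses nor improves any of the listed methods (such as MoE, Scaling Laws, RLHF, RAG, or CoT). Every keyword therefore receives a relevance of 0.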

!!! tip deepseek-chat TL;DR

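To address the difficulty of large-scale data labeling, ineffective modeling of spatial-temporal relationships, and discrepancies between training and test data distributions in map-matching, this paper proposes a hierarchical spatial-temporal graph-enhanced model (HSTGMatch) that uses a hierarchical trajectory representation and an adaptive trajectory adjacency graph. Experiments show superior performance and robustness.

Abstract (Chinese translation)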

全球导航卫星系统（GNSS）数据与便携设备的融合催生了海量轨迹数据的生成，这对地图匹配等应用至关重要。为克服基于规则方法的局限性，近期出现了针对轨迹相关任务的深度学习研究。然而，现有模型仍面临诸多挑战，包括大规模数据标注困难、时空关系建模效率不足，以及训练与测试数据分布不一致等问题。为解决这些挑战，我们提出了HSTGMatch这一新型模型，旨在提升地图匹配性能。该方法采用两阶段流程：分层自监督学习与时空监督学习。我们引入了一种分层轨迹表示方法，同时利用网格单元与地理元组来有效捕捉移动模式。该模型构建了自适应轨迹邻接图以动态捕获空间关系，并优化图注意力网络（GATs）以提升效率。此外，我们整合了时空因子以提取相关特征，并采用衰减系数来处理轨迹长度的变化。大量实验证明，该模型具有优越的性能、模块有效性及鲁棒性，为克服当前地图匹配应用中的局限性提供了可行方案。HSTGMatch的源代码已在GitHub上公开：https://github.com/Nerooo-g/HSTGMatch。

Abstract

The integration of GNSS data into portable devices has led to the generation of vast amounts of trajectory data, which is crucial for applications such as map-matching. To tackle the limitations of rule-based methods, recent works in deep learning for trajectory-related tasks occur. However, existing models remain challenging due to issues such as the difficulty of large-scale data labeling, ineffective modeling of spatial-temporal relationships, and discrepancies between training and test data distributions. To tackle these challenges, we propose HSTGMatch, a novel model designed to enhance map-matching performance. Our approach involves a two-stage process: hierarchical self-supervised learning and spatial-temporal supervised learning. We introduce a hierarchical trajectory representation, leveraging both grid cells and geographic tuples to capture moving patterns effectively. The model constructs an Adaptive Trajectory Adjacency Graph to dynamically capture spatial relationships, optimizing GATs for improved efficiency. Furthermore, we incorporate a Spatial-Temporal Factor to extract relevant features and employ a decay coefficient to address variations in trajectory length. Our extensive experiments demonstrate the model’s superior performance, module effectiveness, and robustness, providing a promising solution for overcoming the existing limitations in map-matching applications. The source code of HSTGMatch is publicly available on GitHub at https://github.com/Nerooo-g/HSTGMatch.

Keywords: map-matching, trajectory data, hierarchical self-supervised learning, spatial-temporal supervised learning, Adaptive Trajectory Adjacency Graph, GATs, Spatial-Temporal Factor, decay coefficient

268. ❌ Minimal Sufficient Representations for Self-interpretable Deep Neural Networks

Authors: Zhiyao Tan, Liu Li, Huazhen Lin Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24041v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

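Scoring rationale: The paper focuses on interpretability methods for deep neural networks (the DeepIn framework) and is unrelated to most of the large-model keywords (LLMs, MoE, training methods, inference optimization, and so on). The only highly relevant keyword is "Mechanistic Interpretability OR Explainable AI" (10 points), because the paper's core contribution is a self-interpretable deep neural network framework. "AI for Science OR Bioinformatics OR Cheminformatics" receives 5 points because the paper is evaluated on biomedical benchmarks, although this is not its core innovation. All other keywords score 0 because the paper does not cover large models, training techniques, reasoning methods, or agent systems.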

!!! tip deepseek-chat TL;DR

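This paper proposes DeepIn, a framework that builds self-interpretable deep neural networks by identifying and learning minimal sufficient representations. It preserves the expressive capacity of standard DNNs while improving interpretability and supporting statistical inference, and it reduces error by up to 30% on real-world datasets across biomedical and vision benchmarks.

Abstract (Chinese translation)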

深度神经网络（DNNs）取得了卓越的预测性能，但其可解释性仍然较差，这主要归因于过度参数化掩盖了可解释所需的最小结构。本文我们提出DeepIn，一种自解释神经网络框架，它能自适应地识别并学习保持标准DNNs完整表达能力所必需的最小表示。我们证明，DeepIn能够正确识别最小表示维度、选择相关变量，并恢复预测所需的最小充分网络架构。所得估计器达到了适应于所学最小维度的最优非渐近误差率，这表明恢复最小充分结构从根本上改善了泛化误差。基于这些理论保证，我们进一步针对所选变量和所学表示开发了假设检验程序，从而将深度表示学习与形式化统计推断联系起来。在生物医学和视觉基准测试中，DeepIn同时提升了预测准确性和可解释性，在真实世界数据集上将误差降低了高达30%，同时自动揭示了人类可解释的判别性模式。我们的结果表明，可解释性和统计严谨性可以直接嵌入深度架构中，而无需牺牲性能。

Abstract

Deep neural networks (DNNs) achieve remarkable predictive performance but remain difficult to interpret, largely due to overparameterization that obscures the minimal structure required for interpretation. Here we introduce DeepIn, a self-interpretable neural network framework that adaptively identifies and learns the minimal representation necessary for preserving the full expressive capacity of standard DNNs. We show that DeepIn can correctly identify the minimal representation dimension, select relevant variables, and recover the minimal sufficient network architecture for prediction. The resulting estimator achieves optimal non-asymptotic error rates that adapt to the learned minimal dimension, demonstrating that recovering minimal sufficient structure fundamentally improves generalization error. Building on these guarantees, we further develop hypothesis testing procedures for both selected variables and learned representations, bridging deep representation learning with formal statistical inference. Across biomedical and vision benchmarks, DeepIn improves both predictive accuracy and interpretability, reducing error by up to 30% on real-world datasets while automatically uncovering human-interpretable discriminative patterns. Our results suggest that interpretability and statistical rigor can be embedded directly into deep architectures without sacrificing performance.

Keywords: self-interpretable neural networks, minimal sufficient representations, deep representation learning, statistical inference, interpretability, generalization error, biomedical benchmarks, variable selection

269. ❌ Lagrangian Relaxation Score-based Generation for Mixed Integer linear Programming

Authors: Ruobing Wang, Xin Li, Yujie Fang, Mingzhong Wang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24033v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

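Scoring rationale: The paper proposes SRG, a generative framework based on Lagrangian relaxation and stochastic differential equations, to accelerate mixed-integer linear programming (MILP) solving. Its core is the application of optimization algorithms and generative models to operations research, not large language models or the deep learning techniques the keywords target. The keywords all relate to large language models, core deep learning techniques, or specific AI application areas (such as bioinformatics), none of which the paper addresses. Only "AI for Science OR Bioinformatics OR Cheminformatics" receives 5 points, because MILP solving can be viewed as a scientific-computing application within operations research and so falls under "AI for Science" in a broad sense, though this is not the paper's core. All other keywords are unrelated and score 0.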

!!! tip deepseek-chat TL;DR

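This paper proposes SRG, a generative framework based on Lagrangian relaxation and stochastic differential equations that generates diverse, high-quality solution candidates for mixed-integer linear programs, accelerating solving and outperforming existing machine learning baselines on multiple benchmarks.

Abstract (Chinese translation)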

预测-搜索方法在加速混合整数线性规划求解方面展现出潜力。然而，现有方法通常假设变量独立性并依赖确定性单点预测，这限制了解的多样性，且往往需要大量下游搜索才能获得高质量解。本文提出\textbf{SRG}——一种基于拉格朗日松弛引导的随机微分方程生成式框架，其解质量具有理论保证。SRG利用卷积核捕捉变量间依赖关系，同时整合拉格朗日松弛引导采样过程朝向可行且接近最优的区域。与生成单一估计不同，SRG能产生多样化、高质量的解候选集，这些候选解共同为传统MILP求解器定义了紧凑而有效的信赖域子问题。在多个公开基准测试中，SRG在解质量上持续优于现有机器学习基线方法。此外，SRG展现出强大的零样本迁移能力：在未见过的跨尺度/问题实例上，它能达到与先进精确求解器相当的优化效果，同时通过更快的搜索速度和更优的解质量显著降低计算开销。

Abstract

Predict-and-search (PaS) methods have shown promise for accelerating mixed-integer linear programming (MILP) solving. However, existing approaches typically assume variable independence and rely on deterministic single-point predictions, which limits solution diversity and often necessitates extensive downstream search for high-quality solutions. In this paper, we propose SRG, a generative framework based on Lagrangian relaxation-guided stochastic differential equations (SDEs), with theoretical guarantees on solution quality. SRG leverages convolutional kernels to capture inter-variable dependencies while integrating Lagrangian relaxation to guide the sampling process toward feasible and near-optimal regions. Rather than producing a single estimate, SRG generates diverse, high-quality solution candidates that collectively define compact and effective trust-region subproblems for standard MILP solvers. Across multiple public benchmarks, SRG consistently outperforms existing machine learning baselines in solution quality. Moreover, SRG demonstrates strong zero-shot transferability: on unseen cross-scale/problem instances, it achieves competitive optimality with state-of-the-art exact solvers while significantly reducing computational overhead through faster search and superior solution quality.

Keywords: Mixed Integer Linear Programming, Lagrangian Relaxation, Stochastic Differential Equations, Generative Framework, Predict-and-Search, Solution Diversity, Trust-region Subproblems, Zero-shot Transferability
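The abstract relies on Lagrangian relaxation but does not spell it out. As a generic illustration only (not SRG's SDE-guided sampler), the sketch below relaxes the single capacity constraint of a tiny 0/1 knapsack problem into the objective and tightens the resulting upper bound with subgradient steps. The instance data and the names `lagrangian_bound` and `tighten_bound` are hypothetical.

```python
from itertools import product


def lagrangian_bound(values, weights, capacity, lam):
    """Upper bound L(lam) = lam*capacity + sum_i max(0, v_i - lam*w_i)."""
    # With the constraint moved into the objective, each x_i is chosen independently.
    x = [1 if v - lam * w > 0 else 0 for v, w in zip(values, weights)]
    bound = lam * capacity + sum((v - lam * w) * xi
                                 for v, w, xi in zip(values, weights, x))
    return bound, x


def tighten_bound(values, weights, capacity, steps=200):
    """Minimise L(lam) over lam >= 0 with a diminishing-step subgradient method."""
    lam, best = 0.0, float("inf")
    for t in range(1, steps + 1):
        bound, x = lagrangian_bound(values, weights, capacity, lam)
        best = min(best, bound)
        # Subgradient of L at lam: the slack of the relaxed constraint.
        slack = capacity - sum(w * xi for w, xi in zip(weights, x))
        lam = max(0.0, lam - slack / t)  # a violated constraint raises the multiplier
    return best, lam


# Hypothetical instance: maximise total value subject to total weight <= capacity.
values, weights, capacity = [10, 7, 4], [5, 4, 3], 8
best_bound, multiplier = tighten_bound(values, weights, capacity)

# Brute-force optimum for comparison (only feasible for tiny instances).
optimum = max(
    sum(v * xi for v, xi in zip(values, x))
    for x in product([0, 1], repeat=len(values))
    if sum(w * xi for w, xi in zip(weights, x)) <= capacity
)
print(best_bound, optimum)
```

By weak duality the relaxed bound can never fall below the true optimum, which is why such multipliers are a natural signal for steering candidates toward feasible, near-optimal regions.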

270. ❌ i-IF-Learn: Iterative Feature Selection and Unsupervised Learning for High-Dimensional Complex Data

Authors: Chen Ma, Wanjie Wang, Shuhao Fan Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24025v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

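Scoring rationale: The paper focuses on unsupervised feature selection and clustering for high-dimensional complex data (i-IF-Learn). Its core is an iterative framework, an adaptive feature selection statistic, and low-dimensional embeddings (PCA or Laplacian eigenmaps) combined with k-means. The keywords all relate directly to large models, core deep learning techniques, or specific applications (LLMs, MoE, alignment, inference acceleration, and so on), whereas this paper involves no large-model or deep learning technique and belongs to traditional machine learning and statistics. The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the experiments use gene microarray and single-cell RNA-seq datasets, which is a bioinformatics application, but it is not driven by large models, so it receives 5 points (partial relevance). All other keywords are unrelated and score 0.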

!!! tip deepseek-chat TL;DR

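This paper proposes i-IF-Learn, an iterative unsupervised framework for feature selection and clustering in high-dimensional complex data. An adaptive feature selection statistic combines pseudo-label supervision with unsupervised signals to identify influential features and improve clustering, significantly outperforming baselines on bioinformatics datasets.

Abstract (Chinese translation)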

高维数据的无监督学习面临挑战，因为不相关或噪声特征会掩盖底层结构。通常只有少数被称为“影响性特征”（influential features）的维度能有效定义聚类簇。恢复这些影响性特征有助于数据解释与聚类分析。本文提出i-IF-Learn——一种迭代式无监督学习框架，可同步执行特征选择与聚类任务。其核心创新在于自适应特征选择统计量，该统计量将伪标签监督信号与无监督信号有效结合，并依据中间标签的可靠性动态调整，从而缓解迭代框架中常见的误差传播问题。通过低维嵌入（如PCA或拉普拉斯特征映射）结合k均值聚类，i-IF-Learn能同步输出影响性特征子集与聚类标签。在基因微阵列和单细胞RNA-seq数据集上的数值实验表明，i-IF-Learn显著优于经典方法与深度聚类基线模型。此外，将我们选取的影响性特征作为预处理步骤，能显著提升DeepCluster、UMAP、VAE等下游深度模型的性能，这凸显了针对性特征选择的重要性和有效性。

Abstract

Unsupervised learning of high-dimensional data is challenging due to irrelevant or noisy features obscuring underlying structures. It’s common that only a few features, called the influential features, meaningfully define the clusters. Recovering these influential features is helpful in data interpretation and clustering. We propose i-IF-Learn, an iterative unsupervised framework that jointly performs feature selection and clustering. Our core innovation is an adaptive feature selection statistic that effectively combines pseudo-label supervision with unsupervised signals, dynamically adjusting based on intermediate label reliability to mitigate error propagation common in iterative frameworks. Leveraging low-dimensional embeddings (PCA or Laplacian eigenmaps) followed by $k$-means, i-IF-Learn simultaneously outputs influential feature subset and clustering labels. Numerical experiments on gene microarray and single-cell RNA-seq datasets show that i-IF-Learn significantly surpasses classical and deep clustering baselines. Furthermore, using our selected influential features as preprocessing substantially enhances downstream deep models such as DeepCluster, UMAP, and VAE, highlighting the importance and effectiveness of targeted feature selection.

Keywords: unsupervised learning, feature selection, high-dimensional data, clustering, iterative framework, bioinformatics, gene expression, single-cell RNA-seq

271. ❌ Stochastic Dimension-Free Zeroth-Order Estimator for High-Dimensional and High-Order PINNs

Authors: Zhangyong Liang, Ji Zhang Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24002v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

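Scoring rationale: The paper focuses on efficient optimization for physics-informed neural networks (PINNs), specifically zeroth-order optimization for high-dimensional, high-order partial differential equations. Its core contribution is the SDZE framework, which uses a stochastic dimension-free estimator to address memory and computational complexity. Among the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has partial relevance (5 points), because PINNs are an application of AI in scientific computing. The paper does not involve large language models, the deep learning innovations the keywords target, or specific bioinformatics or cheminformatics applications. All other keywords are unrelated (0 points), since the paper does not cover large-model techniques, training methods, inference optimization, alignment, agent systems, or model compression.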

!!! tip deepseek-chat TL;DR

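This paper proposes the SDZE framework, whose stochastic dimension-free zeroth-order estimator addresses the memory blow-up and variance problems in training high-dimensional, high-order physics-informed neural networks, enabling 10-million-dimensional PINNs to be trained on a single GPU.

Abstract (Chinese translation)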

针对高维高阶偏微分方程（PDEs）的物理信息神经网络（Physics-Informed Neural Networks, PINNs）主要受限于空间导数计算复杂度 $\mathcal{O}(d^k)$ 以及反向传播（backpropagation, BP）带来的 $\mathcal{O}(P)$ 内存开销。虽然随机化空间估计器成功将空间复杂度降至 $\mathcal{O}(1)$，但其对一阶优化的依赖仍导致大规模计算时内存消耗过高。零阶（Zeroth-order, ZO）优化提供了一种无需反向传播的替代方案；然而，将随机化空间算子与零阶扰动简单结合会引发 $\mathcal{O}(1/\varepsilon^2)$ 的方差爆炸，导致数值发散。为解决这些挑战，本文提出随机维度无关零阶估计器（Stochastic Dimension-free Zeroth-order Estimator, SDZE），这是一个在空间和内存上均实现维度无关复杂度的统一框架。具体而言，SDZE 利用公共随机数同步（Common Random Numbers Synchronization, CRNS），通过在不同扰动间锁定空间随机种子，以代数方式消除 $\mathcal{O}(1/\varepsilon^2)$ 方差。此外，本文引入一种隐式无矩阵子空间投影方法，将参数探索方差从 $\mathcal{O}(P)$ 降至 $\mathcal{O}(r)$，同时保持优化器内存占用为 $\mathcal{O}(1)$。实验结果表明，SDZE 能够在单张 NVIDIA A100 GPU 上训练千万维度的物理信息神经网络，在速度和内存效率上相比现有先进基线方法均有显著提升。

Abstract

Physics-Informed Neural Networks (PINNs) for high-dimensional and high-order partial differential equations (PDEs) are primarily constrained by the $\mathcal{O}(d^k)$ spatial derivative complexity and the $\mathcal{O}(P)$ memory overhead of backpropagation (BP). While randomized spatial estimators successfully reduce the spatial complexity to $\mathcal{O}(1)$, their reliance on first-order optimization still leads to prohibitive memory consumption at scale. Zeroth-order (ZO) optimization offers a BP-free alternative; however, naively combining randomized spatial operators with ZO perturbations triggers a variance explosion of $\mathcal{O}(1/\varepsilon^2)$, leading to numerical divergence. To address these challenges, we propose the Stochastic Dimension-free Zeroth-order Estimator (SDZE), a unified framework that achieves dimension-independent complexity in both space and memory. Specifically, SDZE leverages Common Random Numbers Synchronization (CRNS) to algebraically cancel the $\mathcal{O}(1/\varepsilon^2)$ variance by locking spatial random seeds across perturbations. Furthermore, an implicit matrix-free subspace projection is introduced to reduce parameter exploration variance from $\mathcal{O}(P)$ to $\mathcal{O}(r)$ while maintaining an $\mathcal{O}(1)$ optimizer memory footprint. Empirical results demonstrate that SDZE enables the training of 10-million-dimensional PINNs on a single NVIDIA A100 GPU, delivering significant improvements in speed and memory efficiency over state-of-the-art baselines.

Keywords: Physics-Informed Neural Networks, High-dimensional PDEs, Zeroth-order optimization, Memory efficiency, Variance reduction, Stochastic estimators, Backpropagation-free training, GPU acceleration
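To make the variance argument in the abstract concrete, here is a minimal sketch, under simplifying assumptions, of a two-point zeroth-order directional-derivative estimate with and without a shared random seed. The loss `noisy_loss` is a hypothetical stand-in whose sampling noise is additive and independent of the parameters. A real randomized PINN residual is more complicated, and this is not the SDZE implementation.

```python
import random
import statistics


def noisy_loss(theta, seed):
    """Hypothetical stochastic loss: sum(theta^2) plus seed-dependent sampling noise."""
    noise = random.Random(seed).gauss(0.0, 1.0)
    return sum(t * t for t in theta) + noise


def zo_directional(theta, direction, eps, seed_plus, seed_minus):
    """Two-point zeroth-order estimate of the directional derivative along `direction`."""
    plus = [t + eps * d for t, d in zip(theta, direction)]
    minus = [t - eps * d for t, d in zip(theta, direction)]
    return (noisy_loss(plus, seed_plus) - noisy_loss(minus, seed_minus)) / (2.0 * eps)


theta, direction, eps = [1.0, -2.0], [1.0, 0.0], 1e-3
seed_source = random.Random(0)
independent, shared = [], []
for _ in range(200):
    s1, s2 = seed_source.randrange(10**9), seed_source.randrange(10**9)
    # Independent seeds: the noise difference is divided by 2*eps, so variance ~ 1/eps^2.
    independent.append(zo_directional(theta, direction, eps, s1, s2))
    # Common random numbers: the same seed on both sides cancels the additive noise.
    shared.append(zo_directional(theta, direction, eps, s1, s1))

var_independent = statistics.pvariance(independent)
var_shared = statistics.pvariance(shared)
print(var_independent, var_shared)
```

With the seed locked, every estimate collapses to the true directional derivative of `sum(theta^2)` along `direction`, which is 2.0 at this point. With independent seeds the estimates scatter with variance on the order of `1 / eps**2`.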

272. ❌ Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score

Authors: Jimyung Hong, Jaehyung Kim Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23985v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

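Scoring rationale: The paper's core is DIET, a structured pruning method for LLMs. The directly relevant keywords are: 1) "Large Language Models" (the core object of study, 10 points); 2) "Quantization/Model Compression" (structured pruning is a model compression technique, 10 points); 3) "Mixture of Experts/Sparse Models" (pruning produces sparse models, 8 points); 4) "Small Language Models/On-device AI" (pruning helps lightweight deployment, 5 points); 5) "Speculative Decoding/Inference Acceleration" (pruning can speed up inference, 5 points). Other keywords, such as training methods, alignment, reasoning techniques, and scientific applications, are not covered and score 0.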

!!! tip deepseek-chat TL;DR

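This paper proposes DIET, a training-free structured pruning method that merges task-specific importance scores to prune LLM dimensions globally, achieving higher accuracy than prior structured pruning methods on Gemma-2 models.

Abstract (Chinese translation)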

大语言模型（LLM）已展现出卓越的能力，但其庞大的规模给实际部署带来了重大挑战。结构化剪枝通过移除整个维度或层提供了一种有前景的解决方案，然而现有方法面临关键权衡：任务无关方法无法适应特定任务需求，而任务感知方法需要昂贵的训练来学习任务适应性。我们提出了DIET（通过合并任务重要性分数进行大语言模型的维度级全局剪枝），这是一种无需训练的结构化剪枝方法，它将维度级粒度与任务感知选择相结合。DIET仅需每个任务100个样本即可分析跨任务的激活幅度，然后应用多数投票机制构建一个单一的全局掩码。DIET无需高昂的预计算或训练成本。在Gemma-2 2B和9B模型上使用七个零样本基准进行的实验证明了DIET的有效性；例如，在Gemma-2 2B模型上实现20%稀疏度时，与先前最先进的结构化剪枝方法相比，DIET实现了近10%的平均准确率提升。这一优势在不同稀疏度级别和模型规模上均得以保持，使DIET成为结构化大语言模型剪枝的一个实用且稳健的选择。

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions or layers, yet existing methods face critical trade-offs: task-agnostic approaches cannot adapt to task-specific requirements, while task-aware methods require costly training to learn task adaptability. We propose DIET (Dimension-wise global pruning of LLMs via merging Task-wise importance scores), a training-free structured pruning method that combines dimension-level granularity with task-aware selection. DIET profiles activation magnitudes across tasks using only 100 samples per task, then applies majority voting to construct a single global mask. DIET does not require large costs from pre-computation or training. Experiments on seven zero-shot benchmarks using Gemma-2 2B and 9B models demonstrate the effectiveness of DIET; for example, at 20% sparsity on Gemma-2 2B, DIET achieves near 10% average accuracy improvement, compared to previous state-of-the-art structured pruning methods. This advantage persists across various sparsity levels and model scales, positioning DIET as a practical and robust choice for structured LLM pruning.

Keywords: Large Language Models, structured pruning, model compression, training-free method, dimension-wise pruning, task-aware selection, sparsity, inference efficiency
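The pipeline described in the abstract (profile activation magnitudes per task, then merge by majority voting into one global mask) can be sketched roughly as follows. The abstract gives no exact scoring or voting rule, so this assumes mean absolute activation as the per-task importance and keeps the dimensions with the most task votes. The function names and toy activations are hypothetical.

```python
def task_keep_set(activations, keep):
    """Per-task vote: the `keep` dimensions with the largest mean |activation|."""
    dims = len(activations[0])
    importance = [
        sum(abs(sample[d]) for sample in activations) / len(activations)
        for d in range(dims)
    ]
    ranked = sorted(range(dims), key=lambda d: (-importance[d], d))
    return set(ranked[:keep])


def global_mask(per_task_activations, keep):
    """Merge per-task votes and keep the `keep` dimensions with the most votes."""
    dims = len(per_task_activations[0][0])
    votes = [0] * dims
    for activations in per_task_activations:
        for d in task_keep_set(activations, keep):
            votes[d] += 1
    ranked = sorted(range(dims), key=lambda d: (-votes[d], d))
    return sorted(ranked[:keep])


# Hypothetical activations: 3 tasks x 2 samples x 5 hidden dimensions.
task_a = [[0.9, 0.1, 0.8, 0.7, 0.05], [1.1, 0.2, 0.6, 0.9, 0.1]]
task_b = [[0.5, 0.02, 0.9, 0.4, 0.3], [0.7, 0.04, 1.0, 0.6, 0.2]]
task_c = [[-0.8, 0.3, 0.6, -0.9, 0.1], [-0.6, 0.5, 0.4, -0.7, 0.2]]

# Keeping 4 of 5 dimensions corresponds to 20% sparsity.
kept = global_mask([task_a, task_b, task_c], keep=4)
pruned = [d for d in range(5) if d not in kept]
print(kept, pruned)
```

In an actual model the same mask would be applied to the matching rows and columns of every weight matrix touching the hidden dimension, which is what makes the pruning structured and global.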

273. ❌ Transcending Classical Neural Network Boundaries: A Quantum-Classical Synergistic Paradigm for Seismic Data Processing

Authors: Zhengyi Yuan, Xintong Dong, Xinyang Wang, Zheng Cong, Shiqi Dong Journal/Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23984v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
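The score column in the table above can be reproduced with a few lines of arithmetic. The pass-mark formula (5.0 + 0.8 × total weight) comes from the report configuration. The rule that each keyword contributes weight × relevance is an assumption made here for illustration, as are the function names.

```python
from typing import List, Tuple


def pass_mark(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Pass mark from the report configuration: 5.0 + 0.8 x total weight."""
    return base + slope * total_weight


def weighted_total(rows: List[Tuple[float, float]]) -> float:
    """Sum of weight x relevance over (weight, relevance) rows (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# 26 rows with relevance 0.0 plus the AI for Science row (relevance 5.0),
# all with the weight 0.0 shown in the table above.
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
total = weighted_total(rows)
threshold = pass_mark(total_weight=27.0)
passed = total >= threshold
```

With the weights taken at face value from the table, the non-zero relevance contributes nothing, which matches the reported 0.0 against a pass mark of 26.6.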

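Scoring rationale: The paper focuses on a seismic data processing method that combines quantum neural networks with classical convolutional networks. It falls under AI for Science and is moderately related to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (relevance 5/10), since seismic data processing is an application of AI in the earth sciences. However, the paper does not address large language models (LLMs), innovations in deep learning fundamentals (e.g., MoE, Scaling Laws, PEFT), reasoning methods (CoT, System 2), agent systems, model optimization (Quantization, Speculative Decoding), or any other listed keyword, so all other keywords are rated 0. The only non-zero relevance is the 5.0 for the AI for Science keyword; with the weight of 0.0 shown in the table, the weighted total is 0.0.

!!! tip deepseek-chat TL;DR

The paper proposes a quantum-classical synergistic generative adversarial network (QC-GAN) that combines the high-dimensional feature mapping of quantum neural networks with the waveform-structure extraction of convolutional networks to address the limited representational capacity of classical neural networks in seismic data processing, and validates it on denoising and interpolation tasks.

Abstract (Chinese translation)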

近年来，一系列神经网络方法在地震数据处理中展现出良好性能，如去噪、插值和频带拓展。然而，这些方法依赖于堆叠的感知器和标准激活函数，这限制了深度学习模型的表征能力，使其难以捕捉地震波场复杂且非平稳的动态特性。与本质上局限于实值欧几里得空间的经典感知器堆叠神经网络不同，量子神经网络利用量子力学的指数级状态空间，将特征映射到高维希尔伯特空间，从而超越了经典神经网络的表征边界。基于这一认识，我们提出了一种用于地震数据处理的量子-经典协同生成对抗网络，这是量子神经网络在地震勘探中的首次应用。在该网络中，量子通路用于挖掘高阶特征相关性，而卷积通路专门提取地震波场的波形结构。此外，我们设计了一种量子-经典特征互补损失函数，以增强所提网络中特征的正交性。这一新颖的损失函数可确保两条通路编码非重叠信息，从而丰富特征表征能力。总体而言，通过协同整合量子与卷积通路，所提出的量子-经典协同生成对抗网络突破了经典生成对抗网络固有的表征瓶颈。在去噪和插值任务上的实验结果表明，该网络能在复杂噪声条件下有效保持波场连续性与振幅-相位信息。

Abstract

In recent years, a number of neural-network (NN) methods have exhibited good performance in seismic data processing, such as denoising, interpolation, and frequency-band extension. However, these methods rely on stacked perceptrons and standard activation functions, which imposes a bottleneck on the representational capacity of deep-learning models, making it difficult to capture the complex and non-stationary dynamics of seismic wavefields. Different from the classical perceptron-stacked NNs which are fundamentally confined to real-valued Euclidean spaces, the quantum NNs leverage the exponential state space of quantum mechanics to map the features into high-dimensional Hilbert spaces, transcending the representational boundary of classical NNs. Based on this insight, we propose a quantum-classical synergistic generative adversarial network (QC-GAN) for seismic data processing, serving as the first application of quantum NNs in seismic exploration. In QC-GAN, a quantum pathway is used to exploit the high-order feature correlations, while the convolutional pathway specializes in extracting the waveform structures of seismic wavefields. Furthermore, we design a QC feature complementarity loss to enforce the feature orthogonality in the proposed QC-GAN. This novel loss function can ensure that the two pathways encode non-overlapping information to enrich the capacity of feature representation. On the whole, by synergistically integrating the quantum and convolutional pathways, the proposed QC-GAN breaks the representational bottleneck inherent in classical GAN. Experimental results on denoising and interpolation tasks demonstrate that QC-GAN preserves wavefield continuity and amplitude-phase information under complex noise conditions.

Keywords: Quantum Neural Networks, Seismic Data Processing, Generative Adversarial Network, Feature Representation, Denoising, Interpolation, Quantum-Classical Synergy, Hilbert Spaces
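The abstract above says the QC feature complementarity loss enforces orthogonality between the two pathways but does not give its form. One common way to penalize overlap between two feature vectors is the squared cosine similarity, sketched below. The choice of squared cosine similarity and the function name `complementarity_loss` are assumptions, not the paper's definition.

```python
import math
from typing import Sequence


def complementarity_loss(
    quantum_feat: Sequence[float],
    conv_feat: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Squared cosine similarity: 0 for orthogonal features, close to 1 for parallel ones."""
    dot = sum(q * c for q, c in zip(quantum_feat, conv_feat))
    norm_q = math.sqrt(sum(q * q for q in quantum_feat))
    norm_c = math.sqrt(sum(c * c for c in conv_feat))
    cosine = dot / (norm_q * norm_c + eps)  # eps guards against zero-norm features
    return cosine * cosine


orthogonal = complementarity_loss([1.0, 0.0], [0.0, 1.0])
parallel = complementarity_loss([1.0, 2.0], [2.0, 4.0])
```

Minimizing such a term pushes the two pathways toward encoding non-overlapping directions, which is the behavior the abstract attributes to the loss.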

274. ❌ Wireless communication empowers online scheduling of partially-observable transportation multi-robot systems in a smart factory

Authors: Yaxin Liao, Qimei Cui, Kwang-Cheng Chen, Xiong Li, Jinlian Chen, Xiyu Zhao, Xiaofeng Tao, Ping Zhang Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23967v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

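Scoring rationale: The paper studies wireless-communication-based multi-robot task assignment and route scheduling in smart factories; its core contribution is an integrated communication and scheduling scheme. All keywords relate to large language models, deep learning fundamentals, or AI for Science applications, none of which the paper addresses. It focuses on traditional control and communication topics such as robotic systems, wireless communication, and scheduling algorithms (simulated annealing, A*), and neither uses nor mentions any large-model, deep learning, or AI for Science technique. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a communication-enabled online scheduling framework that integrates wireless machine-to-machine communication with route scheduling algorithms to solve real-time task assignment and path planning for transportation multi-robot systems in partially observable smart-factory environments, significantly improving scheduling efficiency.

Abstract (Chinese translation)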

在智能工厂中实现灵活可重构的生产流程依赖于在线多机器人任务分配（MRTA），这需要对运输多机器人系统（T-MRS）（例如协作式自动导引车AGV）进行在线无碰撞且无拥堵的路径调度。由于实时操作需求以及T-MRS与生产MRS之间的动态交互，在动态工厂环境中部分可观测条件下的在线调度仍是一个重要且尚未充分探索的挑战。本文提出了一种新颖的通信赋能在线调度框架，该框架将无线机器对机器（M2M）网络与路径调度显式耦合，使AGV能够交换意图信息（例如规划路径），以克服部分可观测性并辅助在线调度的复杂计算。具体而言，我们将智能AGV的意图和传感器数据定义为新型M2M流量，并定制了免重传的多链路传输网络以满足实时操作需求。随后，这种面向调度的网络与基于模拟退火的MRTA方案以及基于拥堵感知A*的路径调度方法相集成。该集成通信与调度方案使AGV能够以较低的计算开销动态调整无碰撞且无拥堵的路径。数值实验揭示了无线通信对T-MRS性能的影响，并表明即使在高AGV负载和有限信道资源条件下，所提出的集成方案相比其他基线方法仍能显著提升调度效率。此外，结果还表明，面向调度的无线M2M通信设计与人对人通信存在根本差异，这为无线网络化智能工厂带来了新的技术机遇。

Abstract

Achieving agile and reconfigurable production flows in smart factories depends on online multi-robot task assignment (MRTA), which requires online collision-free and congestion-free route scheduling of transportation multi-robot systems (T-MRS), e.g., collaborative automatic guided vehicles (AGVs). Due to the real-time operational requirements and dynamic interactions between T-MRS and production MRS, online scheduling under partial observability in dynamic factory environments remains a significant and under-explored challenge. This paper proposes a novel communication-enabled online scheduling framework that explicitly couples wireless machine-to-machine (M2M) networking with route scheduling, enabling AGVs to exchange intention information, e.g., planned routes, to overcome partial observations and assist complex computation of online scheduling. Specifically, we determine intelligent AGVs’ intention and sensor data as new M2M traffic and tailor the retransmission-free multi-link transmission networking to meet real-time operation demands. This scheduling-oriented networking is then integrated with a simulated annealing-based MRTA scheme and a congestion-aware A*-based route scheduling method. The integrated communication and scheduling scheme allows AGVs to dynamically adjust collision-free and congestion-free routes with reduced computational overhead. Numerical experiments show the impact of wireless communication on the performance of T-MRS and suggest that the proposed integrated scheme significantly enhances scheduling efficiency compared to other baselines, even under high AGV load conditions and limited channel resources. Moreover, the results reveal that the scheduling-oriented wireless M2M communication design fundamentally differs from human-to-human communications, implying new technological opportunities in a wireless networked smart factory.

Keywords: multi-robot task assignment, wireless communication, online scheduling, smart factory, transportation multi-robot systems, machine-to-machine networking, collision-free routing, congestion-aware scheduling
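To make the idea of congestion-aware A* routing concrete, here is a minimal grid-based sketch. It is not the paper's method: the 4-connected grid, the per-cell congestion counts (standing in for routes that other AGVs have announced), the linear `penalty`, and the function name are all assumptions chosen for illustration.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def congestion_aware_astar(
    grid: List[List[int]],
    congestion: Dict[Cell, int],
    start: Cell,
    goal: Cell,
    penalty: float = 2.0,
) -> Optional[List[Cell]]:
    """A* on a 4-connected grid (1 = obstacle); entering a cell costs 1 + penalty * congestion."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell: Cell) -> int:
        # Manhattan distance stays admissible because every step costs at least 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best_cost: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    frontier = [(float(heuristic(start)), 0.0, start)]

    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale queue entry, a cheaper route to this cell was found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue
            new_cost = cost + 1.0 + penalty * congestion.get(nxt, 0)
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


free_grid = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
direct = congestion_aware_astar(free_grid, {}, (0, 0), (0, 2))
detour = congestion_aware_astar(free_grid, {(0, 1): 3}, (0, 0), (0, 2))
```

With no congestion the route goes straight through (0, 1). Once three other vehicles are reported at that cell, the detour through the second row becomes cheaper, which is the kind of adjustment that exchanged intention information would enable.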

Authors: Tri Minh Nguyen, Sherif Abdulkader Tawfik, Truyen Tran, Svetha Venkatesh Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23943v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

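Scoring rationale: The ChargeFlow paper focuses on using deep learning (a 3D U-Net and flow matching) to refine charge-density predictions in materials science, an AI for Science application in computational chemistry and materials science. It does not involve any large language model (LLM) technique, training method, inference optimization, alignment, agent system, or general-purpose AI technique. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper explicitly applies AI (deep learning) to a scientific problem (computing charge densities of materials), which places it within AI for Science, and it is conceptually related to bioinformatics and cheminformatics in applying AI to scientific computing, so it receives 10 (highly relevant). All other keywords are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The study addresses the prohibitive cost of computing charge-state-dependent electron densities with density functional theory. It proposes ChargeFlow, a model based on flow matching and a 3D U-Net that refines a superposition of atomic densities into an accurate DFT electron density, validates it across several material classes, and reduces deformation-density error from 3.62% to 3.21% relative to a ResNet baseline.

Abstract (Chinese translation)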

精确的电荷密度是电子结构理论的核心，但使用密度泛函理论计算电荷态依赖的密度对于大规模筛选和缺陷工作流程而言仍然成本过高。我们提出了ChargeFlow，一种流匹配优化模型，该模型利用三维U-Net速度场，将电荷条件化的原子密度叠加转换为原生周期性实空间网格上对应的DFT电子密度。该模型基于9,502个源自Materials Project的带电结构计算数据进行训练，并在一个包含1,671个外部结构（涵盖钙钛矿、带电缺陷、金刚石缺陷、金属有机框架和有机晶体）的基准集上评估。ChargeFlow并非在每一个分布内类别上都表现最佳，但在以非局域电荷重分布和电荷态外推为主导的问题上表现最强，相较于ResNet基线，其形变密度误差从3.62%降至3.21%，电荷响应余弦相似度从0.571提升至0.655。预测的密度在下游分析中仍保持化学实用性，在全部1,671个基准结构上成功实现了Bader划分，并生成了高保真静电势，这确立了流匹配作为一种适用于带电材料的实用密度优化策略的地位。

Abstract

Accurate charge densities are central to electronic-structure theory, but computing charge-state-dependent densities with density functional theory remains too expensive for large-scale screening and defect workflows. We present ChargeFlow, a flow-matching refinement model that transforms a charge-conditioned superposition of atomic densities into the corresponding DFT electron density on the native periodic real-space grid using a 3D U-Net velocity field. Trained on 9,502 charged Materials Project-derived calculations and evaluated on an external 1,671-structure benchmark spanning perovskites, charged defects, diamond defects, metal-organic frameworks, and organic crystals, ChargeFlow is not uniformly best on every in-distribution class but is strongest on problems dominated by nonlocal charge redistribution and charge-state extrapolation, improving deformation-density error from 3.62% to 3.21% and charge-response cosine similarity from 0.571 to 0.655 relative to a ResNet baseline. The predicted densities remain chemically useful under downstream analysis, yielding successful Bader partitioning on all 1,671 benchmark structures and high-fidelity electrostatic potentials, which positions flow matching as a practical density-refinement strategy for charged materials.

Keywords: ChargeFlow, flow-matching, electron density, density functional theory, 3D U-Net, charged materials, materials science, AI for science
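For readers unfamiliar with flow matching, a generic form of the training objective is written out below. It is a sketch under stated assumptions, not the paper's exact formulation. Here $x_0$ denotes the charge-conditioned superposition of atomic densities, $x_1$ the DFT density, $q$ the charge state, and $v_\theta$ the 3D U-Net velocity field. The linear interpolation path and the symbol names are assumptions, since the abstract does not specify them.

$$
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\,(x_0, x_1)} \left\Vert v_\theta(x_t, t, q) - (x_1 - x_0) \right\Vert^2
$$

At inference time, a refined density would be obtained by integrating $\dot{x}_t = v_\theta(x_t, t, q)$ from $t = 0$ (the atomic superposition) to $t = 1$.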

276. ❌ Praxium: Diagnosing Cloud Anomalies with AI-based Telemetry and Dependency Analysis

Authors: Rohan Kumar, Jason Li, Zongshun Zhang, Syed Mohammad Qasim, Gianluca Stringhini, Ayse Kivilcim Coskun Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23890v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

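Scoring rationale: The paper 'Praxium: Diagnosing Cloud Anomalies with AI-based Telemetry and Dependency Analysis' focuses on anomaly detection and root cause analysis for cloud services, using AI methods to process telemetry data and dependency relationships. All keywords relate directly to large models, deep learning fundamentals, or AI for Science applications, whereas this paper studies conventional AI/machine learning applied to operations. It does not involve large models, deep learning innovations, or scientific applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the difficulty of diagnosing anomalies in cloud microservice architectures, the paper proposes the Praxium framework, which analyzes telemetry data and software dependency information for anomaly detection and root cause inference, reaching a macro-F1 above 0.97 for anomaly detection across 75 trials.

Abstract (Chinese translation)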

随着现代云应用的微服务架构日益普及，云服务正变得愈发复杂，更易受到配置错误和软件缺陷的影响。传统方法依赖专家输入来诊断和修复微服务异常，这在持续集成与持续部署（CI/CD）范式下缺乏可扩展性。包含新软件安装的微服务发布过程，会与应用程序组件产生复杂的交互作用。因此，将异常行为归因于特定安装或发布的难度增加，可能导致问题解决时间延长。为弥补当前诊断方法的不足，本文提出Praxium框架，用于异常检测与根因推断。该框架借助软件发现工具PraxiPaaS提供的依赖安装信息，帮助管理员评估目标指标性能。Praxium持续监控遥测数据以识别异常，随后通过分析近期软件安装的因果影响进行根因分析，从而为站点可靠性工程师（SRE）提供观测异常的相关信息。本文通过实验证明，Praxium能够有效实现异常检测与根因推断，并针对实际场景中所需的异常检测超参数调优进行了分析。在使用四种合成异常进行的75组试验中，异常检测的宏观F1分数持续高于0.97。此外，研究还表明即使软件包安装间隔不断缩短，因果影响分析仍能可靠推断出异常的正确根因。

Abstract

As the modern microservice architecture for cloud applications grows in popularity, cloud services are becoming increasingly complex and more vulnerable to misconfiguration and software bugs. Traditional approaches rely on expert input to diagnose and fix microservice anomalies, which lacks scalability in the face of the continuous integration and continuous deployment (CI/CD) paradigm. Microservice rollouts, containing new software installations, have complex interactions with the components of an application. Consequently, this added difficulty in attributing anomalous behavior to any specific installation or rollout results in potentially slower resolution times. To address the gaps in current diagnostic methods, this paper introduces Praxium, a framework for anomaly detection and root cause inference. Praxium aids administrators in evaluating target metric performance in the context of dependency installation information provided by a software discovery tool, PraxiPaaS. Praxium continuously monitors telemetry data to identify anomalies, then conducts root cause analysis via causal impact on recent software installations, in order to provide site reliability engineers (SRE) relevant information about an observed anomaly. In this paper, we demonstrate that Praxium is capable of effective anomaly detection and root cause inference, and we provide an analysis on effective anomaly detection hyperparameter tuning as needed in a practical setting. Across 75 total trials using four synthetic anomalies, anomaly detection consistently performs at >0.97 macro-F1. In addition, we show that causal impact analysis reliably infers the correct root cause of anomalies, even as package installations occur at increasingly shorter intervals.

Keywords: cloud anomaly detection, root cause analysis, microservice architecture, telemetry data, dependency analysis, AI-based framework, causal impact analysis, site reliability engineering
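The abstract above reports anomaly detection quality as macro-F1. As a reminder of what that metric measures, the sketch below computes it from scratch: F1 is calculated per class and then averaged without class weighting. The labels and the toy predictions are made up for illustration and are not data from the paper.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1, where F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical labels: one anomaly is missed by the detector.
truth = ["anomaly", "anomaly", "normal", "normal"]
predicted = ["anomaly", "normal", "normal", "normal"]
score = macro_f1(truth, predicted)
```

Because every class counts equally, a detector that ignores the rare anomaly class is penalized heavily, which is why macro-F1 is a common choice for imbalanced anomaly detection.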

277. ❌ Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration

Authors: Guopeng Li, Matthijs T. J. Spaan, Julian F. P. Kooij Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23889v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

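Scoring rationale: The paper focuses on safe reinforcement learning (Safe RL) algorithms and proposes COX-Q, an off-policy safe RL algorithm that addresses constraint violations through cost-constrained optimistic exploration and conservative offline distributional value learning. Its content is entirely within reinforcement learning, in particular RL algorithm design under safety constraints, exploration strategies, and value function learning. It does not involve large language models (LLMs), deep learning fundamentals, AI for Science applications, or any other large-model technique in the keyword list. All keywords relate to large models, deep learning techniques, or AI for Science applications, whereas this paper is purely reinforcement learning research, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes COX-Q, an off-policy safe reinforcement learning algorithm that uses a cost-constrained optimistic exploration strategy and truncated quantile critics to address constraint violations in safe RL, achieving high sample efficiency and controlled data collection cost on safe velocity, safe navigation, and autonomous driving tasks.

Abstract (Chinese translation)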

当安全性被表述为累积成本的限制时，安全强化学习（RL）的目标是在数据收集和部署过程中，学习在成本约束下最大化回报的策略。离策略安全RL方法虽然提供了较高的样本效率，但由于其成本无关的探索方式以及累积成本估计偏差，常导致约束违反。为解决这一问题，我们提出了约束乐观探索Q学习（COX-Q），这是一种离策略安全RL算法，它结合了成本有界的在线探索与保守的离线分布式价值学习。首先，我们引入了一种新颖的成本约束乐观探索策略，该策略解决了动作空间中奖励与成本之间的梯度冲突，并自适应调整信任区域以控制训练成本。其次，我们采用截断分位数评论器来稳定成本价值学习。分位数评论器还能量化认知不确定性以指导探索。在安全速度控制、安全导航和自动驾驶任务上的实验表明，COX-Q实现了高样本效率、具有竞争力的测试安全性能以及可控的数据收集成本。这些结果凸显了COX-Q作为一种适用于安全关键应用的有前景的RL方法。

Abstract

When safety is formulated as a limit of cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint in data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to cost-agnostic exploration and estimation bias in cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration and conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to control the training cost. Second, we adopt truncated quantile critics to stabilize the cost value learning. Quantile critics also quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate that COX-Q achieves high sample efficiency, competitive test safety performance, and controlled data collection cost. The results highlight COX-Q as a promising RL method for safety-critical applications.

Keywords: Safe Reinforcement Learning, Off-policy Learning, Constrained Optimization, Optimistic Exploration, Cost Constraint, Distributional Value Learning, Autonomous Driving, Sample Efficiency
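The truncated quantile critics mentioned in the abstract above build on a simple idea: pool the quantile estimates of several critics, sort them, and discard one tail before averaging. The sketch below shows only that truncation step. Which tail COX-Q discards for the cost critic is not stated in the abstract, so the `drop_largest` switch and the function name are assumptions.

```python
from typing import List


def truncated_quantile_estimate(
    critic_quantiles: List[List[float]],
    drop_per_critic: int,
    drop_largest: bool = True,
) -> float:
    """Pool all critics' quantiles, drop one tail, and average the remainder."""
    pooled = sorted(q for critic in critic_quantiles for q in critic)
    n_drop = drop_per_critic * len(critic_quantiles)
    if n_drop >= len(pooled):
        raise ValueError("cannot drop every quantile")
    if drop_largest:
        kept = pooled[: len(pooled) - n_drop]  # curbs over-estimation
    else:
        kept = pooled[n_drop:]  # curbs under-estimation
    return sum(kept) / len(kept)


# Two hypothetical critics with four quantiles each.
quantiles = [[1.0, 2.0, 3.0, 10.0], [0.0, 2.0, 4.0, 12.0]]
low_biased = truncated_quantile_estimate(quantiles, drop_per_critic=1)
high_biased = truncated_quantile_estimate(quantiles, drop_per_critic=1, drop_largest=False)
```

Dropping the largest quantiles gives a deliberately lower estimate, while dropping the smallest gives a higher one. For a cost value, the higher estimate would be the conservative choice.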

278. ❌ Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Authors: Guy Zamir, Matthew Zurek, Yudong Chen Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23926v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

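Scoring rationale: The paper focuses on the theory of online reinforcement learning in infinite-horizon Markov decision processes (MDPs), studying optimal variance-dependent regret bounds for average-reward regret and γ-regret. Its content is entirely theoretical analysis of reinforcement learning algorithms, covering core concepts such as UCB algorithms, variance-dependent regret bounds, and the optimal bias span. All scoring keywords concern large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on) or specific AI for Science applications (such as bioinformatics), while this paper involves no large language models, deep learning, or related techniques, nor any application of AI to a specific scientific field. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

For infinite-horizon Markov decision processes, the paper develops a UCB-style algorithm that achieves the first optimal variance-dependent regret bounds for both average-reward regret and γ-regret, and reveals the fundamental effect of prior knowledge on achievable performance.

Abstract (Chinese translation)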

与分幕式强化学习相比，无限时域马尔可夫决策过程（MDPs）中的在线强化学习在理论和算法上仍不够成熟，许多算法存在较高的“启动”成本，且难以适应特定问题实例的良性复杂度。本文针对两种无限时域目标（经典的平均奖励悔恨和γ-悔恨）解决了这些缺陷。我们提出了一种适用于两种场景的、易于处理的UCB风格算法，首次实现了最优的方差依赖性悔恨保证。两种场景下的悔恨界均具有形式 $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{低阶项})$，其中 $S$ 和 $A$ 分别为状态和动作空间大小，$\text{Var}$ 表示累积转移方差。这意味着在最坏情况下可达到极小极大最优的平均奖励悔恨与γ-悔恨界，同时算法也能适应更简单的问题实例，例如在确定性MDP中实现近乎常数的悔恨。此外，我们的算法在平均奖励设定下显著改善了低阶项。在已知最优偏差跨度 $\Vert h^\star\Vert_{\text{sp}}$ 的先验知识时，我们的算法获得的低阶项以 $\Vert h^\star\Vert_{\text{sp}} S^2 A$ 为尺度，我们证明这在 $\Vert h^\star\Vert_{\text{sp}}$ 和 $A$ 上均是最优的。若无先验知识，我们证明任何算法的低阶项不可能小于 $\Vert h^\star \Vert_{\text{sp}}^2 S A$，并提出一种无先验知识的算法，其低阶项尺度为 $\Vert h^\star\Vert_{\text{sp}}^2 S^3 A$，几乎匹配该下界。综上所述，这些结果完整刻画了在主导项和低阶项上对 $\Vert h^\star\Vert_{\text{sp}}$ 的最优依赖性，并揭示了在有先验知识与无先验知识条件下可实现性能的根本差距。

Abstract

Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high “burn-in” costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_{\text{sp}}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_{\text{sp}} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_{\text{sp}}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_{\text{sp}}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_{\text{sp}}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_{\text{sp}}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.

Keywords: infinite-horizon MDPs, online reinforcement learning, average-reward regret, γ-regret, variance-dependent regret, UCB algorithm, optimal bias span, Markov decision processes

279. ❌ An Invariant Compiler for Neural ODEs in AI-Accelerated Scientific Simulation

Authors: Fangzhou Yu, Yiqi Su, Ray Lee, Shenfeng Cheng, Naren Ramakrishnan Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23861v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

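Scoring rationale: The core of the paper is an LLM-driven compilation workflow for building neural ODE models that preserve physical invariants, an innovative application of large models to scientific computing. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8/10) and to 'AI for Science OR Bioinformatics OR Cheminformatics' (10/10). The paper does not involve the specific techniques behind the other keywords (such as MoE, SFT, or RAG), so the remaining keywords are rated 0.

!!! tip deepseek-chat TL;DR

To address the problem that neural ODEs in scientific simulation can violate physical invariants such as conservation laws, the paper proposes an LLM-driven compilation framework that treats invariants as first-class types and automatically generates a structure-preserving architecture whose trajectories remain on the admissible manifold in continuous time.

Abstract (Chinese translation)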

神经常微分方程日益被用作科学和传感器数据的连续时间模型，但无约束的神经常微分方程可能出现漂移并违反领域不变量（例如守恒定律），从而产生物理上不可信的解决方案。这反过来会加剧长期预测和代理模拟中的误差累积。现有解决方案通常通过软惩罚或其他形式的正则化来强制保持不变量，这虽能降低整体误差，但无法保证轨迹不会离开约束流形。我们引入不变量编译器这一框架，该框架通过构造过程强制保持不变量：它将不变量视为一等类型，并利用基于大语言模型的编译工作流，将通用的神经常微分方程规范转换为一种结构保持架构，其轨迹在连续时间内始终保持在容许流形上（实践中受数值积分误差影响）。这种编译器视角清晰地将必须保持的内容（科学结构）与从数据中学习的内容（该结构内的动力学）分离开来。它为跨科学领域构建尊重不变量的神经代理模型提供了一种系统化的设计模式。

Abstract

Neural ODEs are increasingly used as continuous-time models for scientific and sensor data, but unconstrained neural ODEs can drift and violate domain invariants (e.g., conservation laws), yielding physically implausible solutions. In turn, this can compound error in long-horizon prediction and surrogate simulation. Existing solutions typically aim to enforce invariance by soft penalties or other forms of regularization, which can reduce overall error but do not guarantee that trajectories will not leave the constraint manifold. We introduce the invariant compiler, a framework that enforces invariants by construction: it treats invariants as first-class types and uses an LLM-driven compilation workflow to translate a generic neural ODE specification into a structure-preserving architecture whose trajectories remain on the admissible manifold in continuous time (and up to numerical integration error in practice). This compiler view cleanly separates what must be preserved (scientific structure) from what is learned from data (dynamics within that structure). It provides a systematic design pattern for invariant-respecting neural surrogates across scientific domains.

Keywords: Neural ODEs, invariant compiler, LLM-driven compilation, scientific simulation, structure-preserving architecture, domain invariants, continuous-time models, surrogate simulation
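The difference between penalizing invariant violations and enforcing an invariant by construction can be shown with a small example. The sketch below wraps an arbitrary vector field so that the squared norm of the state is conserved exactly in continuous time, by removing the radial component of the field. It illustrates the general principle only. The chosen invariant, the projection, and all names are assumptions, and the paper's LLM-driven compiler is not reproduced here.

```python
from typing import Callable, List

Vector = List[float]


def project_to_sphere_tangent(field: Callable[[Vector], Vector]) -> Callable[[Vector], Vector]:
    """Wrap a vector field f so that d/dt ||x||^2 = 2 * x . f_proj(x) = 0."""

    def projected(x: Vector) -> Vector:
        fx = field(x)
        xx = sum(xi * xi for xi in x)
        if xx == 0.0:
            return fx  # the radial direction is undefined at the origin
        coef = sum(xi * fi for xi, fi in zip(x, fx)) / xx
        # Subtract the component of f(x) that points along x.
        return [fi - coef * xi for xi, fi in zip(x, fx)]

    return projected


def raw_field(x: Vector) -> Vector:
    """Stand-in for unconstrained learned dynamics (hypothetical)."""
    return [x[1] + 0.3 * x[0], -x[0] + 0.1]


safe_field = project_to_sphere_tangent(raw_field)
state = [0.6, 0.8]
raw_radial_rate = sum(xi * vi for xi, vi in zip(state, raw_field(state)))
safe_radial_rate = sum(xi * vi for xi, vi in zip(state, safe_field(state)))
```

The raw field changes the norm of the state, while the projected field has zero radial rate for any parameters of the underlying model. A numerical integrator can still introduce small drift, which matches the caveat about numerical integration error in the abstract.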

280. ❌ Symbolic–KAN: Kolmogorov-Arnold Networks with Discrete Symbolic Structure for Interpretable Learning

Authors: Salah A Faroughi, Farinaz Mostajeran, Amirhossein Arzani, Shirko Faroughi Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23854v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

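Scoring rationale: The paper focuses on symbolic discovery and interpretable learning in scientific machine learning and proposes the Symbolic-KAN architecture. It has no direct connection to most of the large-model keywords (LLMs, MoE, training methods, inference optimization, agents, and so on). Two keywords are highly relevant. 'Mechanistic Interpretability OR Explainable AI' receives 10 points because the paper's core goal is an interpretable neural-network representation. 'AI for Science OR Bioinformatics OR Cheminformatics' also receives 10 points because the paper explicitly targets scientific discovery, data-driven regression, inverse dynamical systems, and PDE learning, all of which fall within AI for Science. No other keyword is addressed, so the rest receive 0 points.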
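
How the score table relates to the pass threshold can be sketched in a few lines. The threshold formula (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report's configuration. The per-keyword rule used here (weight × relevance, capped at the maximum) is an assumption, since the report does not state how the Score column is computed, and the function names are hypothetical.

```python
def pass_threshold(total_weight: float) -> float:
    # Threshold formula as stated in the report configuration.
    return 5.0 + 0.8 * total_weight

def keyword_score(weight: float, relevance: float, max_score: float = 15.0) -> float:
    # ASSUMPTION: score = weight * relevance, capped at the per-keyword maximum.
    return min(weight * relevance, max_score)

# The two rows with non-zero relevance in the table above: (keyword, weight, relevance).
rows = [
    ("Mechanistic Interpretability OR Explainable AI", 0.0, 10.0),
    ("AI for Science OR Bioinformatics OR Cheminformatics", 0.0, 10.0),
]

total = sum(keyword_score(weight, relevance) for _, weight, relevance in rows)
threshold = pass_threshold(27.0)  # 27 keywords, total weight 27.0 in the configuration
passed = total >= threshold
print(f"total={total:.1f} threshold={threshold:.1f} passed={passed}")
```

Under this assumed rule, a weight of 0.0 zeroes out a row regardless of its relevance, which matches the Score column shown in the table.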

!!! tip deepseek-chat TL;DR

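The paper proposes Symbolic-KAN, a neural architecture that embeds discrete symbolic structure in a trainable deep network to address the trade-off between interpretability and scalable learning in scientific machine learning. It recovers the correct primitive terms and governing structure from data and produces compact symbolic representations.

Abstract (Chinese translation)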

控制方程的符号发现是科学机器学习领域的长期目标，然而可解释性与可扩展学习之间始终存在根本性的权衡。经典符号回归方法能够生成显式解析表达式，但依赖于组合搜索；而神经网络虽能随数据和维度高效扩展，却产生不透明的表示。本研究引入符号柯尔莫哥洛夫-阿诺德网络（Symbolic-KANs），该神经架构通过将离散符号结构直接嵌入可训练深度网络来弥合这一鸿沟。符号KAN将多元函数表示为学习到的单变量基元作用于学习到的标量投影的组合，其构建受解析基元库、分层门控机制以及符号正则化的引导，该正则化能将连续混合逐步锐化为独热选择。经过门控训练与离散化后，每个活跃单元会选择单一基元与投影方向，从而无需事后符号拟合即可生成紧凑的闭式表达式。符号KAN进一步作为可扩展的基元发现机制，识别出最相关的解析成分，这些成分可为稀疏方程学习方法提供候选基元库。我们证明，在数据驱动的回归和逆动力系统问题中，符号KAN能可靠地恢复正确的基元项与控制结构。此外，该框架可扩展至偏微分方程的正向与逆向物理信息学习，直接从控制约束生成精确解，同时构建紧凑的符号表示——其选择的基元反映了底层方程的真实解析结构。这些成果使符号KAN朝着可扩展、可解释且基于机理的控制规律学习迈出了重要一步。

Abstract

Symbolic discovery of governing equations is a long-standing goal in scientific machine learning, yet a fundamental trade-off persists between interpretability and scalable learning. Classical symbolic regression methods yield explicit analytic expressions but rely on combinatorial search, whereas neural networks scale efficiently with data and dimensionality but produce opaque representations. In this work, we introduce Symbolic Kolmogorov-Arnold Networks (Symbolic-KANs), a neural architecture that bridges this gap by embedding discrete symbolic structure directly within a trainable deep network. Symbolic-KANs represent multivariate functions as compositions of learned univariate primitives applied to learned scalar projections, guided by a library of analytic primitives, hierarchical gating, and symbolic regularization that progressively sharpens continuous mixtures into one-hot selections. After gated training and discretization, each active unit selects a single primitive and projection direction, yielding compact closed-form expressions without post-hoc symbolic fitting. Symbolic-KANs further act as scalable primitive discovery mechanisms, identifying the most relevant analytic components that can subsequently inform candidate libraries for sparse equation-learning methods. We demonstrate that Symbolic-KAN reliably recovers correct primitive terms and governing structures in data-driven regression and inverse dynamical systems. Moreover, the framework extends to forward and inverse physics-informed learning of partial differential equations, producing accurate solutions directly from governing constraints while constructing compact symbolic representations whose selected primitives reflect the true analytical structure of the underlying equations. These results position Symbolic-KAN as a step toward scalable, interpretable, and mechanistically grounded learning of governing laws.

Keywords: Symbolic Kolmogorov-Arnold Networks, interpretable learning, scientific machine learning, symbolic discovery, governing equations, neural architecture, physics-informed learning, partial differential equations
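
The abstract's idea of sharpening a continuous mixture into a one-hot selection can be illustrated with a temperature-controlled softmax gate. This is a minimal sketch of that general technique, not the Symbolic-KAN implementation: the primitive library, the logits, and all function names are hypothetical, and the paper's hierarchical gating and symbolic regularization are not reproduced.

```python
import math

# Hypothetical primitive library; the paper's actual library is not given in the abstract.
PRIMITIVES = {"sin": math.sin, "exp": math.exp, "square": lambda v: v * v}

def softmax(logits, temperature):
    """Softmax with temperature; a lower temperature gives a sharper distribution."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def gated_unit(x, logits, temperature):
    """Mix all primitives applied to x, weighted by the softmax gate."""
    weights = softmax(logits, temperature)
    value = sum(w * f(x) for w, f in zip(weights, PRIMITIVES.values()))
    return value, weights

def discretize(logits):
    """Select the single primitive with the largest gate logit."""
    names = list(PRIMITIVES)
    return names[max(range(len(logits)), key=lambda i: logits[i])]

logits = [2.0, 0.5, 0.1]  # hypothetical learned gate logits
_, soft = gated_unit(0.3, logits, temperature=1.0)
_, sharp = gated_unit(0.3, logits, temperature=0.05)
print(max(soft), max(sharp), discretize(logits))
```

Lowering the temperature pushes the gate weights toward a single primitive, after which the unit can be replaced by that primitive alone.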

281. ❌ Beyond Consistency: Inference for the Relative risk functional in Deep Nonparametric Cox Models

Authors: Sattwik Ghosal, Xuran Meng, Yi Li Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23835v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

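Scoring rationale: The paper addresses theoretical questions about deep neural networks in the nonparametric Cox proportional hazards model, including the propagation of optimization error, control of pointwise bias, and ensemble-based uncertainty quantification. It belongs to statistics and survival analysis. The keywords all concern large models, deep-learning techniques, or specific applications (LLMs, MoE, alignment, reasoning, agents, and so on), whereas this paper is a theoretical analysis of conventional deep neural networks and does not involve large-model techniques or novel applications of them in science. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies deep learning to biostatistics (survival analysis), but this is not its core contribution, so it receives 5 points (some relevance). All other keywords are unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper closes theoretical gaps for deep neural networks in the nonparametric Cox proportional hazards model. By establishing an asymptotic distribution theory, controlling pointwise bias, and developing a valid covariance estimator, it enables valid statistical inference for relative risk functionals such as log-hazard ratios.

Abstract (Chinese translation)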

针对非参数Cox比例风险模型的深度神经网络估计器仍存在理论空白。具体而言，现有研究尚未阐明：在偏似然框架下基于梯度的优化误差如何传播至总体风险；如何控制逐点偏差以允许有效推断；以及基于集成的不确定性量化在实际方差衰减机制下的表现。本文建立了深度Cox估计器的渐近分布理论以解决这些问题。首先，我们为一般训练网络建立了非渐近的oracle不等式，该不等式将样本内优化误差与总体风险相关联，且不要求精确的经验风险优化器。随后，我们构建了一种结构化的神经参数化方法，其可实现与oracle界相容的无穷范数逼近速率，从而实现对逐点偏差的控制。在这些条件下，通过使用Hajek–Hoeffding投影，我们证明了子采样集成估计器的逐点与多元渐近正态性。我们推导了能够平衡偏差校正与保持Hajek–Hoeffding投影主导性要求的子采样规模范围。该范围适应了单重叠协方差（single-overlap covariance）的衰减条件——该协方差度量单个共享观测对估计器的影响强度——且此条件弱于现有子采样文献中的假设。通过无穷小刀切法（infinitesimal jackknife）表示，我们为相对风险对比（如对数风险比）提供了解析协方差估计及有效的Wald型推断。最后，我们通过模拟实验与真实数据应用阐明了该理论的有限样本意义。

Abstract

There remain theoretical gaps in deep neural network estimators for the nonparametric Cox proportional hazards model. In particular, it is unclear how gradient-based optimization error propagates to population risk under partial likelihood, how pointwise bias can be controlled to permit valid inference, and how ensemble-based uncertainty quantification behaves under realistic variance decay regimes. We develop an asymptotic distribution theory for deep Cox estimators that addresses these issues. First, we establish nonasymptotic oracle inequalities for general trained networks that link in-sample optimization error to population risk without requiring the exact empirical risk optimizer. We then construct a structured neural parameterization that achieves infinity-norm approximation rates compatible with the oracle bound, yielding control of the pointwise bias. Under these conditions and using the Hajek–Hoeffding projection, we prove pointwise and multivariate asymptotic normality for subsampled ensemble estimators. We derive a range of subsample sizes that balances bias correction with the requirement that the Hajek–Hoeffding projection remain dominant. This range accommodates decay conditions on the single-overlap covariance, which measures how strongly a single shared observation influences the estimator, and is weaker than those imposed in the subsampling literature. An infinitesimal jackknife representation provides analytic covariance estimation and valid Wald-type inference for relative risk contrasts such as log-hazard ratios. Finally, we illustrate the finite-sample implications of the theory through simulations and a real data application.

Keywords: deep neural networks, nonparametric Cox model, asymptotic distribution theory, oracle inequalities, ensemble estimators, inference, relative risk, hazard ratios

282. ❌ Analyzing animal movement using deep learning

Authors: Thibault Fronville, Maximilian Pichler, Johannes Signer, Marius Grabow, Stephanie Kramer-Schadt, Viktoriia Radchuk, Florian Hartig Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24009v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

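Scoring rationale: The paper studies the analysis of animal movement data with deep neural networks (DNNs), an application of deep learning in ecology. It does not involve large language models (LLMs), architectural innovations (such as MoE or quantization), training methods (such as pre-training, fine-tuning, or alignment), inference optimization, or agent systems. Only two keywords are relevant: (1) 'Mechanistic Interpretability OR Explainable AI': the paper uses explainable AI to extract selection coefficients, so it receives 5 points (some relevance). (2) 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to a scientific field (ecology) and is related to bioinformatics, so it receives 8 points (highly relevant, but not core). All other keywords are unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes replacing conventional generalized linear models (GLMs) with deep neural networks (DNNs) for analyzing animal movement data. Selection coefficients are extracted with explainable-AI methods, and the models automatically detect complex interaction effects, nonlinear responses, and inter-individual variability, giving ecologists a more flexible tool.

Abstract (Chinese translation)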

理解动物如何在异质性景观中移动是生态学与保护生物学的核心议题。在此背景下，步选择函数已成为分析生物与非生物预测因子如何影响通过无线电追踪、GPS标签或类似传感器观测到的运动路径的主要统计框架。传统的步选择函数采用广义线性模型，通过比较每个观测到的移动步长与随机步长来推断动物的生境偏好。然而，除非预先设定，此类基于广义线性模型的步选择函数无法灵活考虑非线性效应或交互作用。为解决这一问题，广义可加模型已被整合进步选择函数框架，但这些基于广义可加模型的步选择函数在表征复杂生境偏好和个体间差异方面仍存在局限。本文探讨了深度神经网络在克服这些限制方面的应用潜力。我们发现，结合可解释人工智能技术以提取选择系数的深度神经网络步选择函数，为分析运动数据提供了诸多优势。在线性效应情况下，它们能有效获取与传统广义线性模型相同的效应量和p值；同时，若数据中存在复杂交互效应、非线性响应或个体间差异，模型可自动识别这些特征。我们得出结论：深度神经网络步选择函数是传统步选择函数的重要拓展。本研究通过更深入地比较基于广义线性模型、广义可加模型和深度神经网络的步选择函数模型之间的差异与共性，特别是关于从深度神经网络导出的统计指标的有效性问题，延伸了先前对深度神经网络步选择函数的研究。我们还提出了新的深度神经网络结构来捕捉个体间效应，这些效应可视为非线性随机效应。本文使用的所有方法均通过“citoMove”R软件包提供。

Abstract

Understanding how animals move through heterogeneous landscapes is central to ecology and conservation. In this context, step selection functions (SSFs) have emerged as the main statistical framework to analyze how biotic and abiotic predictors influence movement paths observed by radio tracking, GPS tags, or similar sensors. A traditional SSF consists of a generalized linear model (GLM) that infers the animal’s habitat preferences (selection coefficients) by comparing each observed movement step to random steps. Such GLM-SSFs, however, cannot flexibly consider non-linear or interacting effects, unless those have been specified a priori. To address this problem, generalized additive models have been integrated in the SSF framework, but those GAM-SSFs are still limited in their ability to represent complex habitat preferences and inter-individual variability. Here we explore the utility of deep neural networks (DNNs) to overcome these limitations. We find that DNN-SSFs, coupled with explainable AI to extract selection coefficients, offer many advantages for analyzing movement data. In the case of linear effects, they effectively retrieve the same effect sizes and p-values as conventional GLMs. At the same time, however, they can automatically detect complex interaction effects, nonlinear responses, and inter-individual variability if those are present in the data. We conclude that DNN-SSFs are a promising extension of traditional SSF. Our analysis extends previous research on DNN-SSF by exploring differences and similarities of GLM, GAM and DNN-based SSF models in more depth, in particular regarding the validity of statistical indicators that are derived from the DNN. We also propose new DNN structures to capture inter-individual effects that can be viewed as a nonlinear random effect. All methods used in this paper are available via the ‘citoMove’ R package.

Keywords: animal movement analysis, deep neural networks, step selection functions, explainable AI, habitat preferences, inter-individual variability, ecology, citoMove R package

283. ❌ Electronic properties of the Radium-monochalcogenides RaX (X = O,S,Se) and RaO+/- ions

Authors: Mateo Londoño, Jesús Pérez-Ríos Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24590v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

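Scoring rationale: The paper is a theoretical study of the electronic structure and properties of the radium monochalcogenides (RaX, X = O, S, Se) and the RaO+/- ions, using fully and partially relativistic quantum-chemistry methods (such as CCSD(T)+X2C and MRCI+Q+ECP+SO). It belongs to computational chemistry and molecular physics and focuses on the electronic properties, dipole moments, polarizabilities, and Franck-Condon factors of specific compounds. All scoring keywords concern large models, deep learning, and AI techniques and their applications (LLMs, MoE, training methods, inference optimization, AI agents, and so on), whereas this paper involves no artificial intelligence, machine learning, or large-model techniques and is not an application of AI to science (such as bioinformatics or cheminformatics). All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

Using relativistic quantum-chemistry methods, the paper studies the electronic structure of the radium monochalcogenides (RaO, RaS, RaSe) and the RaO+/- ions. It finds that these dimers have very large permanent dipole moments, sizable dipolar polarizabilities, and highly non-diagonal Franck-Condon factors among the lowest electronic states.

Abstract (Chinese translation)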

本文对镭的一硫属化物（硫属元素为O、S、Se）以及离子物种RaO+/-的电子结构与性质进行了理论研究。我们采用的方法结合了完全相对论与部分相对论的量子化学方法。电子性质通过基于精确双分量哈密顿量的耦合簇方法获得，该方法包含单激发、双激发及微扰三激发[CCSD(T)+X2C]；势能曲线则通过内收缩多参考组态相互作用方法计算，其中通过小核赝势与泡利-布雷特算符对角化（MRCI+Q+ECP+SO）包含了相对论效应。这些二聚体表现出极大的永久偶极矩和可观的偶极极化率，而最低电子态间的弗兰克-康登因子呈现高度非对角性。这些特征从中性物种中化学键的二价特性角度进行了讨论。

Abstract

We present a theoretical investigation on the electronic structure and properties of radium monochalcogenides, with chalcogens O, S, and Se, as well as the ionic species RaO +/-. Our approach combines fully relativistic and partially relativistic quantum-chemistry methods. Electronic properties are obtained using the exact two-component Hamiltonian-based coupled-cluster approach with single, double, and perturbative triple excitations [CCSD(T)+ X2C], while potential energy curves are computed using an internally contracted multireference configuration interaction method, including relativistic effects through small-core pseudopotentials and Pauli-Breit operator diagonalization (MRCI+Q+ECP+SO). The dimers exhibit very large permanent dipole moments and sizable dipolar polarizabilities, while the Franck-Condon factors among the lowest electronic states are highly non-diagonal. These features are discussed in terms of the divalent character of the chemical bonding in the neutral species.

Keywords: radium monochalcogenides, electronic structure, relativistic quantum chemistry, coupled-cluster method, dipole moments, Franck-Condon factors, RaO ions, potential energy curves

284. ❌ Orientation Reconstruction of Proteins using Coulomb Explosions

Authors: Tomas André, Alfredo Bellisario, Nicusor Timneanu, Carl Caleman Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24553v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

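Scoring rationale: The paper measures ion positions after an X-ray-laser-induced explosion of a protein and uses them to reconstruct the protein's orientation. It belongs to computational biophysics and structural biology. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agent systems, and so on). It has some relevance (5 points) only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', because the work is a scientific-computing and bioinformatics application. The paper itself, however, does not use AI or deep learning and relies on physical simulation and conventional algorithms.

!!! tip deepseek-chat TL;DR

The paper proposes a new method that reconstructs the orientation of tumbling proteins in the gas phase from ion positions recorded after an X-ray-laser-induced explosion. On 56 proteins it achieves an angular error of about 5°, with reconstruction quality on par with or better than existing diffraction-only methods.

Abstract (Chinese translation)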

我们通过X射线激光诱导爆炸后单次测量蛋白质离子空间位置的方法，解决了气相中翻滚蛋白质的取向恢复问题。我们模拟了实验条件下的衍射X射线信号与离子动力学，并将本方法与仅使用衍射数据的传统X射线自由电子激光单颗粒成像取向恢复技术进行比较。利用从离子特征中恢复的取向，我们重建了三维衍射强度，并通过成熟的相位恢复算法反演出电子密度。我们在56种分子量从14到52 kDa（1800至6500个原子）的蛋白质上测试了取向恢复流程，实现了约5°的角误差。将所得三维电子密度重建结果与相同标称分辨率下模拟的真实体积进行比较，在类似当前单颗粒成像实验设置的条件下，达到了探测器边缘对应的分辨率。我们系统研究了重建质量，证明离子数据可用于单颗粒成像中颗粒的可靠取向恢复，其效果与当前主流恢复技术相当或更优。这项工作展示了离子检测技术在从样品碎裂过程中获取额外信息的潜力，在衍射信号成为限制因素的情况下，能够推动X射线激光单颗粒成像技术的发展。

Abstract

We solve the orientation recovery of a tumbling protein in the gas phase from single-event measurements of the spatial positions of its ions after an X-ray laser induced explosion. We simulate diffracted X-ray signal and ion dynamics under experimental conditions and compare our method to conventional orientation recovery in single-particle imaging with X-ray free-electron lasers using only diffraction data. We reconstruct 3D diffraction intensities using orientations recovered from the ion signatures and retrieve the electron density with established phase-retrieval algorithms. We test our orientation recovery procedure on 56 proteins ranging from 14 to 52 kDa (1800 to 6500 atoms), achieving roughly an angular error of around 5°. The resulting 3D electron-density reconstructions are compared to ground-truth volumes simulated at the same nominal resolution, and achieve the resolution at the edge of the detector in conditions similar to current single-particle imaging setups. We investigate the reconstruction quality and demonstrate that ion data can be used for reliable orientation recovery of particles in single-particle imaging, achieving orientation on par or better than currently used recovery techniques. This work shows the potential of ion detection for retrieving additional information from the sample fragmentation, and boost single particle imaging with X-ray lasers in the cases where the diffraction signal is a limiting factor.

Keywords: protein orientation reconstruction, Coulomb explosion, X-ray free-electron laser, single-particle imaging, ion detection, 3D electron-density, phase-retrieval algorithms, angular error

285. ❌ Capturing thermal effects beyond the zero-temperature approximation using the uniform electron gas

Authors: Brianna Aguilar-Solis, Brittany P. Harding, Aurora Pribram-Jones Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24544v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

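Scoring rationale: The paper studies finite-temperature effects in density functional theory and belongs to computational physics and chemistry. It is unrelated to all of the large-model and deep-learning keywords. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work falls within scientific computing, but the paper uses theoretical-physics methods rather than AI or machine learning, so it receives only 5 points (some relevance).

!!! tip deepseek-chat TL;DR

To address the limitations of the zero-temperature approximation in finite-temperature density functional theory, the paper proposes an entropy-corrected zero-temperature approach. Using a uniform electron gas parametrization, it shows that the approach performs best at low densities, making it a useful complement to conventional approximations.

Abstract (Chinese translation)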

有限温度下的密度泛函理论通常依赖于零温近似，即采用基态交换关联泛函配合热化电子密度进行计算。然而，该方法忽略了交换关联自由能显式的温度依赖性——这一因素在诸如温稠密物质等电子效应与热效应均至关重要的体系中尤为关键。本研究提出了熵修正的零温近似方法：通过广义热绝热连接公式提取交换关联熵，从而对标准零温近似构建热修正项。利用均匀电子气体参数化模型，我们将该方法与有限温度绝热连接方法进行比较，证明其在较低密度条件下表现最优。这为零温密度泛函近似提供了有价值的补充，因为后者通常在中高密度区间表现更佳。我们进一步发现了绝热连接曲线间存在密度依赖的交叉点，揭示了其与基态关联能及关联势的依存关系。此外，本文还探讨了将熵修正方法扩展为类局域密度近似形式，作为对零温近似的温度修正方案。

Abstract

Density functional theory at finite temperatures often relies on the zero-temperature approximation, which uses a ground-state exchange-correlation functional with thermalized densities. This approach, however, neglects the explicit temperature dependence of the exchange-correlation free energy – a key factor in regimes such as warm dense matter, where both electronic and thermal effects are significant. In this work, we introduce the entropy-corrected zero-temperature approach, in which the exchange-correlation entropy is extracted using the generalized thermal adiabatic connection formula to construct a thermal correction to the standard zero-temperature approximation. Using a uniform electron gas parametrization, we compare this approach to the finite-temperature adiabatic connection and demonstrate that it performs best at lower densities. This provides a useful complement to zero-temperature density functional approximations, which generally perform better at moderate-to-large densities. We further identify a density-dependent intersection between the adiabatic connection curves, revealing a dependence on the ground state correlation energy and correlation potential. Additionally, extension of the entropy corrected approach applied as a local density approximation–like temperature correction to the zero temperature approximation is discussed.

Keywords: density functional theory, finite temperature, zero-temperature approximation, exchange-correlation entropy, uniform electron gas, thermal correction, adiabatic connection, warm dense matter

286. ❌ Restoring missing low scattering angle data in two-dimensional diffraction patterns of isolated molecules

Authors: Yanwei Xiong, Martin Centurion Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24334v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

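Scoring rationale: The paper presents an algorithm for restoring missing low-scattering-angle data in two-dimensional diffraction patterns from molecular diffraction experiments. It belongs to experimental data processing in physical chemistry. It does not involve large models, deep learning, AI techniques, or any machine-learning method, and none of the keywords (large-model principles, training methods, inference optimization, AI applications, and so on) apply, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an iterative algorithm that restores missing low-scattering-angle data in two-dimensional molecular diffraction patterns so that a real-space representation can be obtained, and validates it on both simulated and experimental data.

Abstract (Chinese translation)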

各向异性二维衍射信号比传统各向同性信号包含更多信息，这一特性在气相超快电子与X射线衍射实验中均存在，并且由于实验中通常使用线偏振激光激发样品——这会在分子上留下空间各向异性印记——此类信号在典型的时间分辨衍射实验中十分常见。我们提出一种迭代算法，用于恢复二维衍射信号在低散射角区域的缺失数据，这对于获得实空间表征至关重要。该算法通过傅里叶变换和阿贝尔变换，将二维信号在动量转移域与实空间域之间反复转换，并应用实空间约束来重建低散射角区域的缺失信号。此方法仅需预先大致了解分子中最短与最长的核间距。我们通过模拟图案以及对三氟碘甲烷分子激光诱导排列实验测得的衍射图案进行处理，成功实现了对缺失信号的重建。

Abstract

Anisotropic two-dimensional diffraction signals contain more information than the conventional isotropic signals for both gas phase ultrafast electron and X-ray diffraction experiments and are common in typical time-resolved diffraction experiments due to the use of linearly polarized lasers to excite the sample that imprints spatial anisotropy on the molecules. We report an iterative algorithm to restore the missing data at low scattering angles in a two-dimensional diffraction signal, which is essential to obtain real-space representation. The iterative algorithm transforms two-dimensional signals back and forth between the momentum transfer domain and the real space domain through Fourier and Abel transforms and apply real space constraints to retrieve missing signal at low scattering angles. The algorithm only requires an approximate a-priori knowledge of the shortest and longest internuclear distances in the molecule. We demonstrated successful retrieval of the missing signal in simulated patterns and in experimentally measured diffraction patterns from laser-induced alignment of trifluoroiodomethane molecules.

Keywords: two-dimensional diffraction, missing data restoration, low scattering angles, iterative algorithm, Fourier transform, Abel transform, real-space representation, trifluoroiodomethane molecules
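
The back-and-forth scheme described in the abstract resembles alternating-projection extrapolation in the Gerchberg-Papoulis style. The sketch below is a one-dimensional toy analogue under strong assumptions. It uses only a discrete Fourier transform (no Abel transform), and it treats the real-space constraint as a known support window standing in for the shortest and longest internuclear distances. All names and numbers are hypothetical. It is not the authors' algorithm.

```python
import cmath

def dft(values, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), adequate for a small demo."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for m, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * cmath.pi * m * k / n)
        out.append(acc / n if inverse else acc)
    return out

def restore_low_bins(measured, missing, support, iterations=40):
    """Alternate between the real-space support constraint and the measured data."""
    n = len(measured)
    spectrum = [0j if k in missing else measured[k] for k in range(n)]
    for _ in range(iterations):
        signal = dft(spectrum, inverse=True)
        # Real-space constraint: zero everything outside the assumed support window.
        signal = [signal[m] if m in support else 0j for m in range(n)]
        estimate = dft(signal)
        # Keep measured bins as they are; update only the missing ones.
        spectrum = [estimate[k] if k in missing else measured[k] for k in range(n)]
    return spectrum

# Toy "real-space" signal confined to indices 20..25 (hypothetical data).
n = 64
true_signal = [0j] * n
for m, v in zip(range(20, 26), [1.0, 2.0, 1.5, 0.5, 1.0, 0.8]):
    true_signal[m] = complex(v)
true_spectrum = dft(true_signal)

missing = {0, 1, n - 1}        # lowest-frequency bins are unmeasured
support = set(range(18, 28))   # approximate prior knowledge of the extent
measured = [0j if k in missing else true_spectrum[k] for k in range(n)]

restored = restore_low_bins(measured, missing, support)
worst = max(abs(restored[k] - true_spectrum[k]) for k in missing)
print(f"largest error on restored bins: {worst:.2e}")
```

In this toy case the support window only needs to contain the true signal, which mirrors the abstract's point that approximate prior knowledge of the extent is sufficient.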

287. ❌ Two-dimensional IR-Raman spectroscopy of vibrational polaritons: Role of dipole surfaces

Authors: Xinwei Ji, Tomislav Begusic, Tao E. Li Source: arXiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24521v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

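Scoring rationale: This paper computes two-dimensional infrared-Raman spectra of liquid water under vibrational strong coupling and belongs to molecular spectroscopy and computational chemistry. Its content is unrelated to nearly all of the keywords (which concern large models, deep learning, and the principles of AI techniques). Only the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', is somewhat related: the work applies computational chemistry to the simulation of molecular spectra and can be viewed as an application of AI/computational science in a scientific domain. However, the paper does not explicitly mention AI or machine learning methods and relies mainly on molecular dynamics simulation and spectral calculation, so it is given 5 points (somewhat related).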
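
The relationship between the per-keyword rows and the pass mark can be sketched as follows. The passing-score formula (5.0 + 0.8 × total weight, giving 26.6 for a total weight of 27.0) comes from the configuration section of this report. The per-row rule used here, score = weight × relevance, is an assumption: the report does not state it, though it is consistent with the table above, where a weight of 0.0 and a relevance of 5.0/10 yield a score of 0.0. The function names are hypothetical.

```python
# Minimal sketch of the scoring arithmetic; not the report generator's code.
# Assumed rule: row score = effective weight * relevance. A paper passes
# when the summed score reaches the threshold from the configuration.

BASE = 5.0            # constant term of the passing-score formula
PER_WEIGHT = 0.8      # coefficient applied to the total configured weight
TOTAL_WEIGHT = 27.0   # 27 keywords, each configured with weight 1.0


def passing_score(total_weight: float) -> float:
    """Passing score from the report configuration: 5.0 + 0.8 * total weight."""
    return round(BASE + PER_WEIGHT * total_weight, 1)


def paper_score(rows) -> float:
    """Sum of weight * relevance over (weight, relevance) pairs (assumed rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# Rows for this paper: 26 keywords with relevance 0.0, and the last keyword
# with relevance 5.0; every row lists an effective weight of 0.0.
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]

total = paper_score(rows)
threshold = passing_score(TOTAL_WEIGHT)
print(total, threshold, total >= threshold)
```

Under this assumed rule, the nonzero relevance on the last keyword has no effect on the total, because the row's effective weight is 0.0.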

!!! tip deepseek-chat TL;DR

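Using cavity molecular dynamics simulations, the paper studies the two-dimensional infrared-Raman spectra of liquid water under vibrational strong coupling. It finds that a consistent dipole surface model is essential for computing accurate 2D spectra and shows how the cavity environment affects the spectral features.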

Abstract

Nonlinear spectroscopy provides a unique perspective to understand time-resolved molecular dynamics under vibrational strong coupling (VSC). Herein, equilibrium-nonequilibrium cavity molecular dynamics simulations are performed to compute the two-dimensional (2D) infrared-infrared-Raman (IIR) spectroscopy of liquid water under VSC. In conventional computational chemistry practices, accurate molecular spectra are often constructed by using an advanced molecular dipole or polarizability model to post-process molecular dynamics trajectories evolved under a computationally efficient potential. By contrast, this work highlights the necessity of employing a consistent dipole surface model in both CavMD simulations and spectroscopic post-processing. While utilizing inconsistent dipole models only mildly influences the linear polariton spectrum, it severely distorts 2D spectra in wide frequency regions. With a consistent dipole-induced-dipole model, compared to the outside-cavity molecular 2D-IIR spectrum, the cavity 2D-IIR spectrum splits the OH stretch band to a pair of polariton branches along only the IR (not Raman) axis, while fading molecular signals at other frequency regions. This work provides the foundation for employing direct CavMD simulations to construct 2D spectra of realistic molecules under VSC.

Keywords: vibrational polaritons, two-dimensional spectroscopy, cavity molecular dynamics, vibrational strong coupling, dipole surface model, infrared-Raman spectroscopy, liquid water, spectroscopic simulation

288. ❌ Collective Electronic Polarization Drives Charge Asymmetry at Oil-Water Interfaces

Authors: Gabriele Amante, Klaudia Mrazikova, Gabriele Centi, Sylvie Roke, Ali Hassanali, Giuseppe Cassone Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.24142v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

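Scoring rationale: This paper studies charge asymmetry at oil-water interfaces, analyzing the electron density with neural-network deep potential molecular dynamics and data-driven methods; it belongs to physical chemistry and interfacial science. All of the keywords relate directly to large models, the principles of deep learning techniques, or AI applications, but the paper uses neural networks only as a computational tool (deep potential molecular dynamics) and does not address any core topic such as large-model architecture, training, inference, alignment, or applications. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', but the paper is physical chemistry rather than bioinformatics or cheminformatics, and AI serves as an auxiliary computational tool rather than the focus of the research, so it is given 5 points (somewhat related). The other keywords are entirely unrelated and receive 0 points.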

!!! tip deepseek-chat TL;DR

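Using neural-network deep potential molecular dynamics, the study reveals the microscopic mechanism of charge asymmetry at oil-water interfaces. It finds that collective electronic polarization dominates the interfacial electronic response, leading to net charge transfer from the water phase to the oil phase.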

Abstract

Why kinetically stable oil droplets in water spontaneously acquire a negative charge remains one of the most vigorously debated questions in interfacial science. Here, we combine neural-network based deep potential molecular dynamics with a data-driven and information theory approach to probe the real-space electron density at an extended decane-water interface. While decane-water clusters show nearly symmetric forward and backward charge transfer (CT) and thus negligible net CT, the extended interface displays a systematic electronic asymmetry, yielding a net CT from water to the hydrocarbon phase producing an average surface charge density of $\sim0.006~e^{-}\,\mathrm{nm}^{-2}$ on the oil phase. This imbalance is accompanied by much larger intra-phase self-polarization, particularly within the hydrocarbon phase, demonstrating that collective many-body polarization dominates the interfacial electronic response. Structural analysis reveals an asymmetry between forward C–H$\cdots$O and backward O–H$\cdots$C motifs, providing a microscopic origin for a net CT from one phase to the other. Curiously, both the water O–H and decane C–H covalent bonds incur subtle contractions which originate from a response to the charge-separation layers at the interface. These features are fully consistent with the weak improper hydrogen-bonds forming at the oil-water interface that results in blue-shifts of the C-H modes.

Keywords: oil-water interfaces, charge asymmetry, neural-network deep potential molecular dynamics, electron density, collective polarization, charge transfer, hydrogen bonds, interfacial science

289. ❌ Spectral convergence of sum-of-Gaussians tensor neural networks for many-electron Schrödinger equation

Authors: Teng Wu, Qi Zhou, Huangjie Zheng, Hehu Xie, Zhenli Xu Journal/Source: arxiv Published: 2026-03-25 arXiv link: http://arxiv.org/abs/2603.23897v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

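Scoring rationale: This paper focuses on the numerical solution of the many-electron Schrödinger equation in quantum physics and proposes an improved neural network architecture (SOG-TNN). Its content is unrelated to nearly all of the keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), since those keywords target natural language processing or general artificial intelligence. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work applies AI to scientific computing (specifically quantum physics). However, the paper does not explicitly mention bioinformatics or cheminformatics, and its core is a specific numerical method rather than the broader 'AI for Science' paradigm, so it is given 5 points (somewhat related).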

!!! tip deepseek-chat TL;DR

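The study proposes an improved neural network architecture based on sum-of-Gaussians and tensor factorization (SOG-TNN) for efficiently and accurately solving the many-electron Schrödinger equation with a one-dimensional soft-Coulomb potential, and validates that it gives an ultra-efficient, low-rank representation of the wave function.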

Abstract

We present an improved version of the sum-of-Gaussians tensor neural network (SOG-TNN) architecture for solving many-electron Schrödinger equation for one-dimensional soft-Coulomb systems. Model reduction techniques are introduced to reduce the number of tensor-factorized bases under the SOG approximation of the kernel. The Slater determinant ansatz is employed so that the anti-symmetric property of the wave function can be strictly preserved. Numerical results show that the SOG-TNN achieves high accuracy with remarkably small basis sizes. Robust spectral convergence with respect to the basis size is also observed, consistently characterized by a mixed algebraic-exponential model for the error decay. These findings validate that the SOG-TNN architecture provides an ultra-efficient and low-rank representation of complex multi-electron wave functions, shedding light on high-fidelity quantum calculations in larger-scale many-electron systems.

Keywords: sum-of-Gaussians tensor neural network, many-electron Schrödinger equation, soft-Coulomb systems, model reduction, Slater determinant, spectral convergence, low-rank representation, quantum calculations

290. ❌ Application of the aperiodic defect model to a negatively charged monovacancy in phosphorene

Authors: Charlotte Rickert, Lily Barta, Ernst-Christian Flach, Daniel Kats, Denis Usvyat Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23761v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

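Scoring rationale: This paper applies a defect model to a negatively charged monovacancy in phosphorene. It belongs to computational materials science and quantum chemistry and uses conventional quantum chemistry methods such as CCSD(T) and EOM-CCSD. All scoring keywords concern large models, deep learning, and related techniques (training methods, inference optimization, AI applications, and so on), whereas this paper involves no artificial intelligence, machine learning, or large-model techniques at all, so the relevance of every keyword is 0.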

!!! tip deepseek-chat TL;DR

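The paper applies the aperiodic defect model (ADM) to compute the formation energy and excitation energy of a negatively charged monovacancy in phosphorene, providing an accurate quantum-chemical description of defects in solids.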

Abstract

We apply the recently introduced aperiodic defect model (ADM) to a negatively charged monovacancy in a phosphorene monolayer. In contrast to conventional supercell approaches, the ADM treats a single defect embedded in the true non-defective crystalline mean field thereby avoiding spurious defect-defect interactions and the need for charge corrections. At the same time, it effectively reduces the calculation to a fragment, enabling the use of high-level molecular electronic-structure methods. Converging the Hartree-Fock and correlation contributions to the thermodynamic limit yields a benchmark CCSD(T)/POB-TZVP-rev2 formation energy of 0.91 eV for the negatively charged monovacancy in the (5|9) configuration. The excitation energy to the lowest singlet excited state of this defect at the EOM-CCSD/POB-TZVP-rev2 level is found to be 1.95 eV. Overall, the ADM provides a highly promising route towards quantitatively accurate and systematically improvable descriptions of defects in solids and on surfaces, bridging the gap between solid-state physics and molecular quantum chemistry.

Keywords: aperiodic defect model, phosphorene, monovacancy, CCSD(T), formation energy, excitation energy, quantum chemistry, solid-state physics

291. ❌ Quantum-classical dynamics of Rashba spin-orbit coupling

Authors: Paul Bergold, Giovanni Manfredi, Cesare Tronci Journal/Source: arxiv Published: 2026-03-24 arXiv link: http://arxiv.org/abs/2603.23758v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

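Scoring rationale: This paper studies the application of a mixed quantum-classical dynamics model to systems with Rashba spin-orbit coupling and belongs to computational physics and quantum simulation. Its content does not touch on large language models, deep learning, AI model training, optimization, inference, alignment, agent systems, or AI for Science. All scoring keywords relate to large models and deep learning techniques, whereas this paper is a pure physics simulation study, so the relevance of every keyword is 0.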

!!! tip deepseek-chat TL;DR

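The paper applies a new mixed quantum-classical dynamics method (the koopmon method), extended here to treat spin-orbit coupling, to Rashba nanowire models. Without an external potential, the method qualitatively reproduces the features of the fully quantum evolution; with a harmonic potential, it reproduces the fully quantum results more accurately than the Ehrenfest model.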

Abstract

Mixed quantum-classical models are widely used to reduce the computational cost of fully quantum simulations. However, their general applicability across different classes of problems remains an open question. Here, we address this issue for systems featuring spin-orbit coupling. In particular, we study the interaction dynamics of quantum spin-1/2 and classical orbital momentum in one-dimensional models of Rashba nanowires. We tackle this problem by resorting to a new quantum-classical Hamiltonian model that, unlike conventional approaches, retains the Heisenberg principle and captures correlation effects beyond the common Ehrenfest approach. Based on Koopman wavefunctions in classical mechanics, the new model was recently implemented numerically via a particle scheme – the koopmon method – which is extended here to treat spin-orbit coupling. We apply the koopmon method to study the quantum-classical dynamics of nanowire models, with and without the presence of a harmonic potential and in both Rashba-dominated (strong coupling) and Zeeman-dominated (weak coupling) regimes. Considering realistic semiconductor parameters, the results are contrasted with both fully quantum and quantum-classical Ehrenfest dynamics. In the absence of external potential, the koopmon method qualitatively reproduces the features of the fully quantum evolution for all coupling regimes. While it exhibits a slight loss in spin accuracy compared to Ehrenfest simulations, the latter fail to capture the orbital dynamics. In the presence of a harmonic potential, the koopmon scheme reproduces the full quantum results with accuracy levels that are unachievable by the Ehrenfest model in both quantum and classical sectors. We conclude by presenting a test case that exhibits the formation of cat-like states.

Keywords: quantum-classical dynamics, Rashba spin-orbit coupling, koopmon method, nanowire models, Ehrenfest dynamics, mixed quantum-classical models, spin-1/2 systems, cat-like states

Token usage statistics

Total: 895,207 tokens (input 568,153 / output 327,054)

Model	Input	Output	Total
deepseek-chat	521,517	290,987	812,504
glm-4.7	46,636	36,067	82,703
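
As a quick consistency check, the per-model rows can be summed and compared with the reported totals. The snippet below is only an illustrative sketch: the figures are copied from the table above, and the variable names are hypothetical.

```python
# Per-model token counts copied from the table: (input, output).
usage = {
    "deepseek-chat": (521_517, 290_987),
    "glm-4.7": (46_636, 36_067),
}

# Column totals across models.
total_in = sum(inp for inp, _ in usage.values())
total_out = sum(out for _, out in usage.values())

# Row totals per model, and the grand total.
per_model = {model: inp + out for model, (inp, out) in usage.items()}
grand_total = total_in + total_out

print(f"input={total_in:,} output={total_out:,} total={grand_total:,}")
for model, tokens in per_model.items():
    print(f"{model}: {tokens:,}")
```

The sums agree with the reported figures: 568,153 input, 327,054 output, and 895,207 tokens overall.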